
.. _s3:

Walk-through: Amazon S3 as a special remote
-------------------------------------------

`Amazon S3 <https://aws.amazon.com/s3>`_ (or Amazon Simple Storage Service) is a
popular service by `Amazon Web Services <https://aws.amazon.com>`_ (AWS) that
provides object storage through a web service interface. An S3 bucket can be
configured as a :term:`git-annex` :term:`special remote`, allowing it to be used
as a DataLad publication target. This means that you can use Amazon S3 to store your
annexed data contents and allow users to install your full dataset with DataLad
from a publicly available repository service such as GitHub.

In this section, we provide a walk-through on how to set up Amazon S3 for hosting
your DataLad dataset, and how to access the data locally via GitHub.

Prerequisites
^^^^^^^^^^^^^

In order to use Amazon S3 for hosting your datasets, and to follow the steps below, you need to:

- sign up for an `AWS account <https://aws.amazon.com>`_
- verify your account
- find your AWS access key
- sign up for a `GitHub account <https://github.com/join>`_
- install `wget <https://www.gnu.org/software/wget>`_ in order to download sample data
- optional: install the `AWS Command Line Interface <https://aws.amazon.com/cli>`_

The `AWS signup <https://aws.amazon.com>`_ procedure requires you to provide your
e-mail address, physical address, and credit card details before verification is possible.

.. importantnote:: AWS account usage can incur costs

   While Amazon provides `Free Tier <https://aws.amazon.com/free>`_ access to its services,
   it can still result in costs if usage exceeds the `Free Tier limits <https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/free-tier-limits.html>`_.
   Be sure to take note of these limits, or set up `automatic tracking alerts <https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/tracking-free-tier-usage.html>`_
   to be notified before incurring unnecessary costs.

To find your AWS access key, log in to the `AWS Console <https://console.aws.amazon.com>`_,
open the drop-down menu at your username (top right), and select "My Security
Credentials". A new page will open with several options, including "Access keys
(access key ID and secret access key)", from where you can select "Create New Access
Key" or access existing credentials. Be sure to copy both the "Access Key ID" and
the "Secret Access Key".

.. figure:: ../artwork/src/aws_s3_create_access_key.png

   Create a new AWS access key from "My Security Credentials"

To ensure that your access key details are known when initializing the special
remote, export them as :term:`environment variable`\s in your shell:

.. code-block:: console

   $ export AWS_ACCESS_KEY_ID="your-access-key-ID"
   $ export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

In order to work directly with AWS via your command-line shell, you can
`install the AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_.
However, it is not required for this walk-through.
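
If you did install the AWS CLI, a quick sanity check of your credentials could look
like the sketch below; the CLI picks up the same ``AWS_*`` environment variables
exported above, and the exact output depends on your account:

.. code-block:: console

   $ aws sts get-caller-identity
   $ aws s3 ls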

Lastly, to publish your data repository to GitHub, from which users will be able to install
the complete dataset, you will need a `GitHub account <https://github.com/join>`_.

Your DataLad dataset
^^^^^^^^^^^^^^^^^^^^

If you already have a small DataLad dataset to practice with, feel free to use it
during the rest of the walk-through. If you do not have data, no problem! As a general
introduction, the steps below will download a small public neuroimaging dataset
and transform it into a DataLad dataset. We'll use the `MoAEpilot <https://www.fil.ion.ucl.ac.uk/spm/data/auditory>`_
dataset, which contains anatomical and functional images from a single subject, as well as some metadata.

In the first step, we create a new directory called ``neuro-data-s3``, download and extract the data,
and then move the extracted contents into the new directory:

.. code-block:: console

   $ cd <wherever-you-want-to-create-the-dataset>
   $ mkdir neuro-data-s3 && \
     wget https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip -O neuro-data-s3.zip && \
     unzip neuro-data-s3.zip -d neuro-data-s3 && \
     rm neuro-data-s3.zip && \
     cd neuro-data-s3 && \
     mv MoAEpilot/* . && \
     rm -R MoAEpilot
   --2021-06-01 09:32:25--  https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip
   Resolving www.fil.ion.ucl.ac.uk (www.fil.ion.ucl.ac.uk)... 193.62.66.18
   Connecting to www.fil.ion.ucl.ac.uk (www.fil.ion.ucl.ac.uk)|193.62.66.18|:443... connected.
   HTTP request sent, awaiting response... 200 OK
   Length: 30176409 (29M) [application/zip]
   Saving to: ‘neuro-data-s3.zip’
   neuro-data-s3.zip      100%[==========================>]  28.78M  55.3MB/s    in 0.5s
   2021-06-01 09:32:25 (55.3 MB/s) - ‘neuro-data-s3.zip’ saved [30176409/30176409]
   Archive:  neuro-data-s3.zip
      creating: neuro-data-s3/MoAEpilot/
     inflating: neuro-data-s3/MoAEpilot/task-auditory_bold.json
     inflating: neuro-data-s3/MoAEpilot/README
     inflating: neuro-data-s3/MoAEpilot/dataset_description.json
     inflating: neuro-data-s3/MoAEpilot/CHANGES
      creating: neuro-data-s3/MoAEpilot/sub-01/
      creating: neuro-data-s3/MoAEpilot/sub-01/func/
     inflating: neuro-data-s3/MoAEpilot/sub-01/func/sub-01_task-auditory_events.tsv
     inflating: neuro-data-s3/MoAEpilot/sub-01/func/sub-01_task-auditory_bold.nii
      creating: neuro-data-s3/MoAEpilot/sub-01/anat/
     inflating: neuro-data-s3/MoAEpilot/sub-01/anat/sub-01_T1w.nii

Now we can view the directory tree to see the dataset content:

.. code-block:: console

   $ tree
   .
   ├── CHANGES
   ├── README
   ├── dataset_description.json
   ├── sub-01
   │   ├── anat
   │   │   └── sub-01_T1w.nii
   │   └── func
   │       ├── sub-01_task-auditory_bold.nii
   │       └── sub-01_task-auditory_events.tsv
   └── task-auditory_bold.json

The next step is to ensure that this is a valid DataLad dataset,
with ``main`` as the default branch.
We can turn our ``neuro-data-s3`` directory into a DataLad dataset with the
:dlcmd:`create --force` command. After that, we save the dataset with :dlcmd:`save`:

.. code-block:: console

   $ datalad create --force --description "neuro data to host on s3"
   [INFO   ] Creating a new annex repo at /Users/jsheunis/Documents/neuro-data-s3
   [INFO   ] Scanning for unlocked files (this may take some time)
   create(ok): /Users/jsheunis/Documents/neuro-data-s3 (dataset)

   $ datalad save -m "Add public data"
   add(ok): CHANGES (file)
   add(ok): README (file)
   add(ok): dataset_description.json (file)
   add(ok): sub-01/anat/sub-01_T1w.nii (file)
   add(ok): sub-01/func/sub-01_task-auditory_bold.nii (file)
   add(ok): sub-01/func/sub-01_task-auditory_events.tsv (file)
   add(ok): task-auditory_bold.json (file)
   save(ok): . (dataset)
   action summary:
     add (ok: 7)
     save (ok: 1)
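
To confirm that everything has been saved, a quick check with :dlcmd:`status` should
report a clean dataset; a sketch of the expected outcome:

.. code-block:: console

   $ datalad status
   nothing to save, working tree clean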

Initialize the S3 special remote
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The steps below have been adapted from the instructions in the `git-annex documentation <https://git-annex.branchable.com/tips/public_Amazon_S3_remote>`_.

By initializing the special remote, what actually happens in the background
is that a :term:`sibling` is added to the DataLad dataset. This can be verified
by running :dlcmd:`siblings` before and after initializing the special
remote. Before, the only "sibling" is the DataLad dataset itself:

.. code-block:: console

   $ datalad siblings
   .: here(+) [git]

To initialize a public S3 bucket as a special remote, we run :gitannexcmd:`initremote`
with several options, for which the `git-annex documentation on S3 <https://git-annex.branchable.com/special_remotes/S3>`_
provides detailed information. Be sure to select a unique bucket name
that adheres to Amazon S3's `bucket naming rules <https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html>`_.
You can declare the bucket name (in this example ``sample-neurodata-public``) as a variable, since
it will be used again later.

.. code-block:: console

   $ BUCKET=sample-neurodata-public
   $ git annex initremote public-s3 type=S3 encryption=none \
     bucket=$BUCKET public=yes datacenter=EU autoenable=true
   initremote public-s3 (checking bucket...) (creating bucket in EU...) ok
   (recording state in git...)

The options used in this example include:

- ``public-s3``: the name we select for our special remote, so that git-annex and DataLad can identify it
- ``type=S3``: the type of special remote (git-annex can work with many `special remote types <https://git-annex.branchable.com/special_remotes>`_)
- ``encryption=none``: no encryption (alternatively, enable ``encryption=shared``, meaning files will be encrypted on S3, and anyone with a clone of the git repository will be able to download and decrypt them; see the sketch after this list)
- ``bucket=$BUCKET``: the name of the bucket to be created on S3 (using the declared variable)
- ``public=yes``: allow public read access to files sent to the S3 remote
- ``datacenter=EU``: specify where the data will be located; here we set "EU", which is EU/Ireland, a.k.a. ``eu-west-1`` (defaults to "US" if not specified)
- ``autoenable=true``: git-annex will attempt to enable the special remote when it is run in a new clone, meaning users won't have to run extra steps when installing the dataset with DataLad
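
If you prefer the annexed file content to be encrypted at rest, the same call could be
made with shared encryption instead. This is only a sketch with all other options kept
as above; note that the decryption key is stored in the git repository itself:

.. code-block:: console

   $ git annex initremote public-s3 type=S3 encryption=shared \
     bucket=$BUCKET public=yes datacenter=EU autoenable=true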

After :gitannexcmd:`initremote` has successfully initialized the special remote,
you can run :dlcmd:`siblings` to see that a sibling has been added:

.. code-block:: console

   $ datalad siblings
   .: here(+) [git]
   .: public-s3(+) [git]

You can also visit the `S3 Console <https://console.aws.amazon.com/s3>`_ and navigate
to "Buckets" to see your newly created bucket. It should only have a single
``annex-uuid`` file as content, since no actual file content has been pushed yet.

.. figure:: ../artwork/src/aws_s3_bucket_empty.png

   A newly created public S3 bucket
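
If you installed the optional AWS CLI, you could also list the bucket contents from the
command line; this sketch assumes the ``BUCKET`` variable declared earlier and that your
``AWS_*`` credentials are still exported:

.. code-block:: console

   $ aws s3 ls "s3://$BUCKET"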

Lastly, for git-annex to be able to download files from the bucket without requiring your
AWS credentials, it needs to know where to find the bucket. We do this by setting the bucket
URL, which takes a standard format incorporating the bucket name and location (see the code block below).
Alternatively, this URL can also be copied from your AWS console.

.. code-block:: console

   $ git annex enableremote public-s3 \
     publicurl="https://$BUCKET.s3-eu-west-1.amazonaws.com"
   enableremote public-s3 ok
   (recording state in git...)

Publish the dataset
^^^^^^^^^^^^^^^^^^^

The special remote is ready, and now we want to give people seamless access to the
DataLad dataset. A common way to do this is to create a sibling of the dataset on
GitHub using :dlcmd:`create-sibling-github`. In order to link the contents in the
S3 special remote to the GitHub sibling, we also need to configure a publication
dependency on the ``public-s3`` sibling, which is done with the ``publish-depends <sibling>``
option. For consistency, we'll give the GitHub repository the same name as the dataset.

.. code-block:: console

   $ datalad create-sibling-github -d . neuro-data-s3 \
     --publish-depends public-s3
   [INFO   ] Configure additional publication dependency on "public-s3"
   .: github(-) [https://github.com/jsheunis/neuro-data-s3.git (git)]
   'https://github.com/jsheunis/neuro-data-s3.git' configured as sibling 'github' for Dataset(/Users/jsheunis/Documents/neuro-data-s3)

Notice that by creating this sibling, DataLad created an actual (empty) dataset repository
on GitHub, which required preconfigured GitHub authentication details.
The creation of the sibling (named ``github``) can also be confirmed with :dlcmd:`siblings`:

.. code-block:: console

   $ datalad siblings
   .: here(+) [git]
   .: public-s3(+) [git]
   .: github(-) [https://github.com/jsheunis/neuro-data-s3.git (git)]

The next step is to actually push the file content to where it needs to be in order
to allow others to access the data. We do this with :dlcmd:`push --to github`.
The ``--to github`` option specifies which sibling to push the dataset to, but because of the
publication dependency, DataLad will push the annexed contents to the special remote first.

.. code-block:: console

   $ datalad push --to github
   copy(ok): CHANGES (file) [to public-s3...]
   copy(ok): README (file) [to public-s3...]
   copy(ok): dataset_description.json (file) [to public-s3...]
   copy(ok): sub-01/anat/sub-01_T1w.nii (file) [to public-s3...]
   copy(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [to public-s3...]
   copy(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [to public-s3...]
   copy(ok): task-auditory_bold.json (file) [to public-s3...]
   publish(ok): . (dataset) [refs/heads/main->github:refs/heads/main [new branch]]
   publish(ok): . (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]

You can now view the annexed file content (with MD5 hashes as filenames) in the
`S3 bucket <https://console.aws.amazon.com/s3>`_:

.. figure:: ../artwork/src/aws_s3_bucket_full.png

   The public S3 bucket with annexed file content pushed

Lastly, the GitHub repository will also show the newly pushed dataset (with
the "files" being symbolic links to the annexed content on the S3 remote):

.. figure:: ../artwork/src/aws_s3_github_repo.png

   The public GitHub repository with the DataLad dataset

Test the setup!
^^^^^^^^^^^^^^^

You have now successfully created a DataLad dataset with an AWS S3 special remote for
annexed file content, and with a public GitHub sibling from which the dataset can be accessed.
Users can now :dlcmd:`clone` the dataset using the GitHub repository URL:

.. code-block:: console

   $ cd /tmp
   $ datalad clone https://github.com/<enter-your-organization-or-account-name-here>/neuro-data-s3.git
   [INFO   ] Scanning for unlocked files (this may take some time)
   [INFO   ] Remote origin not usable by git-annex; setting annex-ignore
   install(ok): /tmp/neuro-data-s3 (dataset)

   $ cd neuro-data-s3
   $ datalad get . -r
   [INFO   ] Installing Dataset(/tmp/neuro-data-s3) to get /tmp/neuro-data-s3 recursively
   get(ok): CHANGES (file) [from public-s3...]
   get(ok): README (file) [from public-s3...]
   get(ok): dataset_description.json (file) [from public-s3...]
   get(ok): sub-01/anat/sub-01_T1w.nii (file) [from public-s3...]
   get(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [from public-s3...]
   get(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [from public-s3...]
   get(ok): task-auditory_bold.json (file) [from public-s3...]
   action summary:
     get (ok: 7)

The results of running the code above show that DataLad could :dlcmd:`install` the dataset correctly
and :dlcmd:`get` all annexed file content successfully from the ``public-s3`` sibling.
Congrats!
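
As an additional check in the fresh clone, you could ask git-annex where a file's content
is available from; it should list the ``public-s3`` remote. A sketch, using one of the
files from above:

.. code-block:: console

   $ git annex whereis sub-01/anat/sub-01_T1w.nii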

Advanced examples
^^^^^^^^^^^^^^^^^

When there is a lot to upload, automation is your friend.
One example is the automated upload of dataset hierarchies to S3.
The script below is a quick-and-dirty solution to the task of exporting a hierarchy of datasets to an S3 bucket.
It needs to be invoked with three positional arguments: the path to the :term:`DataLad superdataset`, the S3 bucket name, and a prefix.

.. code-block:: bash

   #!/bin/bash
   set -eu
   export PS4='> '
   set -x

   # positional arguments: superdataset path, bucket name, key prefix
   topds="$1"
   bucket="$2"
   prefix="$3"
   # name of the special remote to be created in each dataset
   srname="${bucket}5"

   topdsfull=$PWD/$topds/

   # this script relies on a reasonably recent git-annex
   if ! git annex version | grep 8.2021 ; then
       echo "E: need recent git annex. check what you have"
       exit 1
   fi

   # loop over the superdataset and all of its subdatasets
   { echo "$topdsfull"; datalad -f '{path}' subdatasets -r -d "$topds"; } | \
   while read ds; do
       # 'relpath' is assumed to be a helper that prints the path of $ds
       # relative to $topdsfull (e.g. GNU "realpath --relative-to")
       relds=$(relpath "$ds" "$topdsfull")
       fileprefix="$prefix/$relds/"
       fileprefix=$(python -c "import os,sys; print(os.path.normpath(sys.argv[1]))" "$fileprefix")
       echo $relds;
       (
           cd "$ds";
           # TODO: make sure that there is no ./ or // in fileprefix
           # set up the export-capable special remote once per dataset
           if ! git remote | grep -q "$srname"; then
               git annex initremote --debug "$srname" \
                   type=S3 \
                   autoenable=true \
                   bucket=$bucket \
                   encryption=none \
                   exporttree=yes \
                   "fileprefix=$fileprefix/" \
                   host=s3.amazonaws.com \
                   partsize=1GiB \
                   port=80 \
                   "publicurl=https://s3.amazonaws.com/$bucket" \
                   public=yes \
                   versioning=yes
           fi
           # export the worktree of the main branch to the bucket
           git annex export --to "$srname" --jobs 6 main
       )
   done
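
Assuming the script is saved as, for example, ``export_to_s3.sh`` (a hypothetical name),
an invocation could look like this:

.. code-block:: console

   $ bash export_to_s3.sh my-superdataset my-bucket-name some/prefix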