.. index:: ! Usecase; Scaling up: 80TB and 15 million files
.. _usecase_HCP_dataset:

Scaling up: Managing 80TB and 15 million files from the HCP release
--------------------------------------------------------------------

This use case outlines how a large data collection can be version controlled
and published in an accessible manner with DataLad in a remote indexed
archive (RIA) data store. Using the
`Human Connectome Project <http://www.humanconnectomeproject.org>`_
(HCP) data as an example, it shows how large-scale datasets can be managed
with the help of modular nesting, and how access to data that is contingent on
usage agreements and external service credentials is possible via DataLad
without circumventing or breaching the data provider's terms:

#. The :dlcmd:`addurls` command is used to automatically aggregate
   files and information about their sources from public
   `AWS S3 <https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html>`_
   bucket storage into small-sized, modular DataLad datasets.
#. Modular datasets are structured into a hierarchy of nested datasets, with a
   single HCP superdataset at the top. This modularizes storage and access,
   and mitigates performance problems that would arise in oversized standalone
   datasets, but maintains access to any subdataset from the top-level dataset.
#. Individual datasets are stored in a remote indexed archive (RIA) store
   at store.datalad.org_ under their :term:`dataset ID`.
   This setup constitutes a flexible, domain-agnostic, and scalable storage
   solution, while dataset configurations enable seamless automatic dataset
   retrieval from the store.
#. The top-level dataset is published to GitHub as a public access point for the
   full HCP dataset. As the RIA store contains datasets with only file source
   information instead of hosting data contents, a :dlcmd:`get` retrieves
   file contents from the original AWS S3 sources.
#. With DataLad's authentication management, users authenticate once -- and
   are thus required to accept the HCP project's terms to obtain valid
   credentials -- but subsequent :dlcmd:`get` commands work swiftly
   without logging in.
#. The :dlcmd:`copy-file` command can be used to subsample special-purpose
   datasets for faster access.

.. index:: ! Human Connectome Project (HCP)

The Challenge
^^^^^^^^^^^^^

The `Human Connectome Project <http://www.humanconnectomeproject.org>`_ aims
to provide an unparalleled compilation of neural data through a customized
database. Its largest open access data collection is the
`WU-Minn HCP1200 Data <https://humanconnectome.org/study/hcp-young-adult/document/1200-subjects-data-release>`_.
It is made available via a public AWS S3 bucket and includes high-resolution 3T
`magnetic resonance <https://en.wikipedia.org/wiki/Magnetic_resonance_imaging>`_
scans from young healthy adult twins and non-twin siblings (ages 22-35)
using four imaging modalities: structural images (T1w and T2w),
`resting-state fMRI (rfMRI) <https://en.wikipedia.org/wiki/Resting_state_fMRI>`_,
task-fMRI (tfMRI), and high angular resolution
`diffusion imaging (dMRI) <https://en.wikipedia.org/wiki/Diffusion_MRI>`_.
It further includes behavioral and other individual subject measure
data for all subjects, and `magnetoencephalography <https://en.wikipedia.org/wiki/Magnetoencephalography>`_
data and 7T MR data for a subset of subjects (twin pairs).
In total, the data release encompasses around 80TB of data in 15 million files,
and is of immense value to the field of neuroscience.

Its large amount of data, however, also constitutes a data management challenge:
Such amounts of data are difficult to store, structure, access, and version
control. Even tools such as DataLad and its foundations, :term:`Git` and
:term:`git-annex`, will struggle or fail with datasets of this size or number
of files. Simply transforming the complete data release into a single DataLad
dataset would at best lead to severe performance issues, and quite likely result
in software errors and crashes.
Moreover, access to the HCP data is contingent on consent to the
`data usage agreement <http://www.humanconnectomeproject.org/wp-content/uploads/2010/01/HCP_Data_Agreement.pdf>`_
of the HCP project and requires valid AWS S3 credentials. Instead of hosting
this data or providing otherwise unrestricted access to it, an HCP
DataLad dataset would need to enable data retrieval from the original sources,
conditional on the user agreeing to the HCP usage terms.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Using the :dlcmd:`addurls` command, the HCP data release is
aggregated into roughly 4500 datasets. A lean top-level dataset
combines all datasets into a nested dataset hierarchy that recreates the original
HCP data release's structure. The topmost dataset contains one subdataset per
subject with the subject's release notes, and within each subject's subdataset,
each additional available subdirectory is another subdataset. This preserves
the original structure of the HCP data release, but builds it up from sensible
components that resemble standalone dataset units. As with any DataLad dataset,
dataset nesting and operations across dataset boundaries are seamless, and
make it easy to retrieve data at the subject, modality, or file level.

The highly modular structure has several advantages. For one, with barely any
data in the superdataset, the top-level dataset is very lean. It mainly consists
of an impressive ``.gitmodules`` file [#f1]_ with almost 1200 registered
(subject-level) subdatasets. The superdataset is published to :term:`GitHub` at
`github.com/datalad-datasets/human-connectome-project-openaccess <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
to expose this superdataset and allow anyone to install it with a single
:dlcmd:`clone` command in a few seconds.
Secondly, the modularity of splitting the data release into
several thousand subdatasets has performance advantages. If :term:`Git` or
:term:`git-annex` repositories exceed a certain size (either in terms of
file sizes or the number of files), performance can drop severely [#f2]_.
Dividing the vast amount of data into many subdatasets prevents this:
Subdatasets are small-sized units that are combined into the
complete HCP dataset structure, and nesting comes with no additional costs or
difficulties, as DataLad works smoothly across hierarchies of subdatasets.

In order to simplify data access without circumventing the HCP license term
agreement, DataLad does not host any HCP data. Instead, thanks to
:dlcmd:`addurls`, each data file knows its source (the public AWS S3 bucket
of the HCP project), and a :dlcmd:`get` will retrieve HCP data from this bucket.
With this setup, anyone who wants to obtain the data will still need to consent
to the data usage terms and retrieve AWS credentials from the HCP project, but can
afterwards obtain the data solely with DataLad commands from the command line
or in scripts. Only the first :dlcmd:`get` requires authentication
with the AWS credentials provided by the HCP project: DataLad will prompt any user at
the time of retrieval of the first file content of the dataset.
Afterwards, no further authentication is needed, unless the credentials become
invalid or need to be updated for other reasons.

Thus, in order to retrieve HCP data down to the level of individual files with
DataLad, users only need to (see the sketch below):

- :dlcmd:`clone` the superdataset from :term:`GitHub`
  (`github.com/datalad-datasets/human-connectome-project-openaccess <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_)
- Create an account at https://db.humanconnectome.org to accept data use terms
  and obtain AWS credentials
- Use :dlcmd:`get [-n] [-r] PATH` to retrieve file, directory, or
  subdataset contents on demand. Authentication is necessary only
  once (at the time of the first :dlcmd:`get`).
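
As a quick orientation, here is a minimal sketch of this workflow from the
command line. The subject ID and file path are only examples taken from this
use case; any other path in the dataset hierarchy works the same way:

.. code-block:: bash

   # clone the lean superdataset from GitHub (fast, no data is downloaded)
   $ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess hcp
   $ cd hcp
   # install a subject's subdataset without retrieving file content (-n)
   $ datalad get -n HCP1200/100206
   # retrieve actual file content; the first get prompts for AWS credentials
   $ datalad get HCP1200/100206/release-notes/ReleaseNotes.txt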

The HCP data release, despite its large size, can thus be version controlled and
easily distributed with DataLad. In order to speed up data retrieval, subdataset
installation can be parallelized, and the full HCP dataset can be subsampled
into special-purpose datasets using DataLad's :dlcmd:`copy-file` command
(introduced with DataLad version 0.13.0).

Step-by-Step
^^^^^^^^^^^^

Building and publishing a DataLad dataset with HCP data consists of several steps:
1) creating all necessary datasets, 2) publishing them to a RIA store, and 3) creating
an access point to all files in the HCP data release. The upcoming subsections
detail each of these.

.. index::
   pair: addurls; DataLad command

Dataset creation with ``datalad addurls``
"""""""""""""""""""""""""""""""""""""""""

The :dlcmd:`addurls` command
allows you to create (and update) potentially nested DataLad datasets from a list
of download URLs that point to the HCP files in the S3 bucket.
By supplying subject-specific ``.csv`` files that contain an S3 download link,
a subject ID, a file name, and a version specification per file in the HCP dataset,
as well as information on where subdataset boundaries are,
:dlcmd:`addurls` can download all subjects' files and create (nested) datasets
to store them in. With the help of a few bash commands, this task can be
automated, and with the help of a `job scheduler <https://en.wikipedia.org/wiki/Job_scheduler>`_,
it can also be parallelized.

As soon as files are downloaded and saved to a dataset, their content can be
dropped with :dlcmd:`drop`: The origin of each file was successfully
recorded, and a :dlcmd:`get` can now retrieve file contents on demand.
Thus, shortly after a complete download of the HCP project data, the datasets in
which it has been aggregated are small in size, and yet provide access to the HCP
data for anyone who has valid AWS S3 credentials.

At the end of this step, there is one nested dataset per subject in the HCP data
release. If you are interested in the details of this process, check out the
:find-out-more:`Findoutmore on the datasets' generation <fom-hcp>`.

.. find-out-more:: How exactly did the datasets come to be?
   :name: fom-hcp

   All code and tables necessary to generate the HCP datasets can be found on
   GitHub at `github.com/TobiasKadelka/build_hcp <https://github.com/TobiasKadelka/build_hcp>`_.
   The :dlcmd:`addurls` command is capable of building all necessary nested
   subject datasets automatically; it only needs an appropriate specification of
   its task. We'll approach the function of :dlcmd:`addurls` and
   how exactly it was invoked to build the HCP dataset by looking at the
   information it needs. Below are excerpts of the ``.csv`` table of one subject
   (``100206``) that illustrate how :dlcmd:`addurls` works:

   .. code-block::
      :caption: Table header and some of the release note files

      "original_url","subject","filename","version"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc.txt","100206","release-notes/Structural_preproc.txt","OeUYjysiX5zR7nRMixCimFa_6yQ3IKqf"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc_extended.txt","100206","release-notes/Structural_preproc_extended.txt","cyP8G5_YX5F30gO9Yrpk8TADhkLltrNV"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_unproc.txt","100206","release-notes/Structural_unproc.txt","AyW6GmavML6I7LfbULVmtGIwRGpFmfPZ"

   .. code-block::
      :caption: Some files in the MNINonLinear directory

      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec","100206","MNINonLinear//100206.164k_fs_LR.wb.spec","JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"
      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","sP4uw8R1oJyqCWeInSd9jmOBjfOCtN4D"
      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","yD88c.HfsFwjyNXHQQv2SymGIsSYHQVZ"
      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii","100206","MNINonLinear

   The ``.csv`` table contains one row per file, and includes the columns
   ``original_url``, ``subject``, ``filename``, and ``version``. ``original_url``
   is an S3 URL pointing to an individual file in the S3 bucket, ``subject`` is
   the subject's ID (here: ``100206``), ``filename`` is the path to the file
   within the dataset that will be built, and ``version`` is an S3-specific
   file version identifier.
   The first table excerpt thus specifies a few files in the directory ``release-notes``
   in the dataset of subject ``100206``. For :dlcmd:`addurls`, the
   column headers serve as placeholders for the fields in each row.
   If this table excerpt is given to a :dlcmd:`addurls` call as shown
   below, it will create a dataset and download and save the precise version of each file
   in it::

      $ datalad addurls -d <Subject-ID> <TABLE> '{original_url}?versionId={version}' '{filename}'

   This command translates to: "create a dataset with the name of the subject ID
   (``-d <Subject-ID>``) and use the provided table (``<TABLE>``) to assemble the
   dataset contents. Iterate through the table rows, and perform one download per
   row. Generate the download URL from the ``original_url`` and ``version``
   fields of the table (``'{original_url}?versionId={version}'``), and save the
   downloaded file under the name specified in the ``filename`` field (``'{filename}'``)."
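
   As a concrete illustration, using the values from the ``ReleaseNotes.txt`` row
   of the first excerpt above, the two placeholders expand to:

   .. code-block:: bash

      # URL template '{original_url}?versionId={version}' expands to:
      s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt?versionId=RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN
      # file name template '{filename}' expands to the path inside the dataset:
      release-notes/ReleaseNotes.txt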

   If a file name contains a double slash (``//``), as seen for example in the second
   table excerpt in ``"MNINonLinear//...``, the file will be created underneath a
   *subdataset* of the name in front of the double slash. The rows in the second
   table excerpt thus translate to: "save these files into the subdataset ``MNINonLinear``,
   and if this subdataset does not exist, create it".
   Thus, with a single subject's table, a nested, subject-specific dataset is built.
   Here is how the directory hierarchy looks for this particular subject once
   :dlcmd:`addurls` has worked through its table:

   .. code-block:: bash

      100206
      ├── MNINonLinear    <- subdataset
      ├── release-notes
      ├── T1w             <- subdataset
      └── unprocessed     <- subdataset

   This is all there is to assembling subject-specific datasets. The interesting
   question is: How can this be done in a manner that is as automated as possible?

   **How to create subject-specific tables**

   One crucial part of the process is the creation of the subject-specific tables
   for :dlcmd:`addurls`. The information on each file's URL, its name, and its
   version can be queried with the :dlcmd:`ls` command.
   It is a DataLad-specific version of the Unix :shcmd:`ls` command and can
   be used to list summary information about S3 URLs and datasets. With this
   command, the public S3 bucket can be queried, and the command will output the
   relevant information.
   The :dlcmd:`ls` command is a rather old command and less user-friendly
   than other commands demonstrated in the handbook. One problem for automation
   is that the command is made for interactive use, and it outputs information in
   a non-structured fashion. In order to retrieve the relevant information,
   a custom Python script was used to split its output and extract it. This
   script can be found in the GitHub repository as
   `code/create_subject_table.py <https://github.com/TobiasKadelka/build_hcp/blob/master/code/create_subject_table.py>`_.
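
   For orientation, here is a minimal, hypothetical sketch of the table-writing
   step only (the actual parsing logic lives in ``code/create_subject_table.py``).
   It assumes that the URL, subject ID, relative file name, and S3 version ID have
   already been extracted from the :dlcmd:`ls` output; the example record is taken
   from the table excerpt above:

   .. code-block:: python

      import csv

      # one (url, subject, filename, version) tuple per file; in the real
      # workflow these come from parsing the 'datalad ls' output
      records = [
          ("s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt",
           "100206",
           "release-notes/ReleaseNotes.txt",
           "RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"),
          # a double slash in 'filename' marks a subdataset boundary
          ("s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec",
           "100206",
           "MNINonLinear//100206.164k_fs_LR.wb.spec",
           "JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"),
      ]

      # write the per-subject table in the format expected by 'datalad addurls'
      with open("100206.csv", "w", newline="") as f:
          writer = csv.writer(f, quoting=csv.QUOTE_ALL)
          writer.writerow(["original_url", "subject", "filename", "version"])
          writer.writerows(records)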

   **How to schedule datalad addurls commands for all tables**

   Once the subject-specific tables exist, :dlcmd:`addurls` can start
   to aggregate the files into datasets. To do this efficiently, the jobs can be
   run in parallel using a job scheduler. On the computer cluster on which the
   datasets were aggregated, this was `HTCondor <https://research.cs.wisc.edu/htcondor>`_.
   The jobs (per subject) performed by HTCondor consisted of the following steps
   (a sketch of such a job script follows below):

   - a :dlcmd:`addurls` command to generate the (nested) dataset
     and retrieve content once [#f3]_::

        datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" '{original_url}?versionId={version}' '{filename}'

   - a subsequent :dlcmd:`drop` command to remove file contents as
     soon as they were saved to the dataset, in order to save disk space (this is possible
     since the S3 source of each file is known, and content can be reobtained using
     :dlcmd:`get`)::

        datalad drop -d "$outds" -r --nocheck

   - a few (Git) commands to clean up afterwards, as the system the HCP dataset
     was downloaded to had a strict 5TB limit on disk usage.
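
   Such a per-subject job can be pictured as a small shell script along the
   following lines. This is only an illustrative sketch: the variable names are
   taken from the commands above, the table location is hypothetical, and the
   exact clean-up steps of the real HTCondor jobs are not reproduced here.

   .. code-block:: bash

      #!/bin/bash
      set -e -u
      # the subject ID is passed in by the scheduler, e.g. 100206
      subj="$1"
      subj_table="tables/${subj}.csv"   # hypothetical location of the subject table
      outds="HCP1200/${subj}"

      # build the nested subject dataset and download each file once
      datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" \
          '{original_url}?versionId={version}' '{filename}'

      # drop the file contents again -- their S3 origin is recorded,
      # so they can be re-obtained with 'datalad get' at any time
      datalad drop -d "$outds" -r --nocheck

      # (followed by a few Git clean-up commands to stay below the 5TB disk limit)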

   **Summary**

   Thus, in order to download the complete HCP project data and aggregate it into
   nested subject-level datasets (on a system with much less disk space than the
   complete HCP project's size!), only two DataLad commands, one custom configuration,
   and some scripts to parse terminal output into ``.csv`` tables and create
   subject-wise HTCondor jobs were necessary. With all tables set up, the jobs
   ran over the Christmas break and finished before everyone went back to work.
   Getting 15 million files into datasets? Check!

.. index:: Remote Indexed Archive (RIA) store

Using a Remote Indexed Archive Store for dataset hosting
""""""""""""""""""""""""""""""""""""""""""""""""""""""""

All datasets were built on a scientific compute cluster. In this location, however,
the datasets would only be accessible to users with an account on this system.
Everything was therefore subsequently published with
:dlcmd:`push` to the publicly available
store.datalad.org_, a remote indexed archive (RIA) store.

A RIA store is a flexible and scalable data storage solution for DataLad datasets.
While its layout may look confusing if one were to take a look at it, a RIA store
is nothing but a clever storage solution, and users never consciously interact
with the store to get the HCP datasets.
On the lowest level, store.datalad.org_
is a directory on a publicly accessible server that holds a great number of datasets
stored as :term:`bare git repositories`. The only important aspect of it for this
use case is that, instead of by their names (e.g., ``100206``), datasets are stored
and identified via their :term:`dataset ID`.
The :dlcmd:`clone` command can understand this layout and install
datasets from a RIA store based on their ID.

.. find-out-more:: How would a 'datalad clone' from a RIA store look?

   In order to get a dataset from a RIA store, :dlcmd:`clone` needs
   a RIA URL. It is built from the following components:

   - a ``ria+`` identifier
   - a path or URL to the store in question. For store.datalad.org, this is
     ``https://store.datalad.org``, but it could also be an SSH URL, such as
     ``ssh://juseless.inm7.de/data/group/psyinf/dataset_store``
   - a pound sign (``#``)
   - the dataset ID
   - and optionally a version or branch specification (appended with a leading ``@``)

   Here is how a valid :dlcmd:`clone` command from the data store
   for one dataset would look:

   .. code-block:: bash

      datalad clone 'ria+https://store.datalad.org#d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561' subj-01

   But worry not! To get the HCP data, no one will ever need to compose
   :dlcmd:`clone` commands to RIA stores apart from DataLad itself.

A RIA store is used because -- among other advantages -- its layout makes the
store flexible and scalable. With datasets of the size of the HCP project,
scalability in particular becomes an important factor. If you are interested in
finding out why, you can find more technical details on RIA stores, their advantages,
and even how to create and use one yourself in the section :ref:`riastore`.

Making the datasets accessible
""""""""""""""""""""""""""""""

At this point, roughly 1200 nested datasets were created and published to a publicly
accessible RIA store. This modularized the HCP dataset and prevented the performance
issues that would arise in oversized datasets. In order to make the complete dataset
available and accessible from one central point, the only thing missing is a
single superdataset.

For this, a new dataset, ``human-connectome-project-openaccess``, was created.
It contains a ``README`` file with short instructions on how to use it,
a text-based copy of the HCP project's data usage agreement, and each subject
dataset as a subdataset. The ``.gitmodules`` file [#f1]_ of this superdataset
thus is impressive. Here is an excerpt::

   [submodule "100206"]
      path = HCP1200/100206
      url = ./HCP1200/100206
      branch = master
      datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
   [submodule "100307"]
      path = HCP1200/100307
      url = ./HCP1200/100307
      branch = master
      datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
   [submodule "100408"]
      path = HCP1200/100408
      url = ./HCP1200/100408
      branch = master
      datalad-id = d3fa72e4-2c2b-11ea-948f-0025904abcb0
   [...]

For each subdataset (named after a subject ID), there is one entry (note that the
individual ``url``\s of the subdatasets are pointless and not needed: As will be
demonstrated shortly, DataLad resolves each subdataset ID from the common store
automatically).
Thus, this superdataset combines all individual datasets into the original HCP dataset
structure. This (and only this) superdataset is published to a public :term:`GitHub`
repository that anyone can :dlcmd:`clone` [#f4]_.

Data retrieval and interacting with the repository
""""""""""""""""""""""""""""""""""""""""""""""""""

Procedurally, getting data from this dataset is almost as simple as from any
other public DataLad dataset: One needs to :dlcmd:`clone` the repository
and use :dlcmd:`get [-n] [-r] PATH` to retrieve any file, directory,
or subdataset (content). But because the data will be downloaded from the HCP's
AWS S3 bucket, users will need to create an account at
`db.humanconnectome.org <https://db.humanconnectome.org>`_ to agree to the project's
data usage terms and obtain credentials. When performing the first
:dlcmd:`get` for file contents, DataLad will prompt for these credentials
interactively in the terminal. Once supplied, all subsequent :dlcmd:`get`
commands will retrieve data right away.

.. find-out-more:: Resetting AWS credentials

   In case one misenters their AWS credentials or needs to reset them,
   this can easily be done using the `Python keyring <https://keyring.readthedocs.io>`_
   package. For more information on ``keyring`` and DataLad's authentication
   process, see the *Basic process* section in :ref:`providers`.
   After launching Python, import the ``keyring`` package and use the
   ``set_password()`` function. This function takes 3 arguments:

   * ``system``: "datalad-hcp-s3" in this case
   * ``username``: "key_id" if modifying the AWS access key ID, or "secret_id" if modifying the secret access key
   * ``password``: the access key itself

   .. code-block:: python

      import keyring
      keyring.set_password("datalad-hcp-s3", "key_id", <password>)
      keyring.set_password("datalad-hcp-s3", "secret_id", <password>)

   Alternatively, one can set their credentials using environment variables.
   For more details on this method, :ref:`see this Findoutmore <fom-envvar>`.

   .. code-block:: bash

      $ export DATALAD_hcp_s3_key_id=<password>
      $ export DATALAD_hcp_s3_secret_id=<password>

Internally, DataLad cleverly manages the crucial aspect of data retrieval:
linking registered subdatasets to the correct datasets in the RIA store. If you
inspect the GitHub repository, you will find that the subdataset links in it
do not resolve if you click on them, because none of the subdatasets were
published to GitHub [#f5]_, but lie in the RIA store instead.
Dataset or file content retrieval will nevertheless work automatically with
:dlcmd:`get`: Each ``.gitmodules`` entry lists the subdataset's
dataset ID. Based on a "subdataset-source-candidate" configuration in
``.datalad/config`` of the superdataset, the subdataset ID is assembled into a
RIA URL from which :dlcmd:`get` retrieves the correct dataset from the store:

.. code-block:: bash
   :emphasize-lines: 4-5

   $ cat .datalad/config
   [datalad "dataset"]
           id = 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
   [datalad "get"]
           subdataset-source-candidate-origin = "ria+https://store.datalad.org#{id}"

This configuration allows :dlcmd:`get` to flexibly generate RIA URLs from the
base URL in the config file and the dataset IDs listed in ``.gitmodules``. In
the superdataset, this configuration needed to be done "by hand" via the
:gitcmd:`config` command.
Because the configuration should be shared together with the dataset, it
needed to be set in ``.datalad/config`` [#f6]_::

   $ git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+https://store.datalad.org#{id}"

With this configuration, :dlcmd:`get` will retrieve all subdatasets from the
RIA store. Any subdataset that is obtained from a RIA store in turn gets the very
same configuration automatically into its ``.git/config``. Thus, the configuration
that makes seamless subdataset retrieval from RIA stores possible is propagated
throughout the dataset hierarchy.
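
Put together, for the subdataset ``100206`` registered in the ``.gitmodules``
excerpt shown earlier, :dlcmd:`get` would assemble its clone candidate URL by
filling that subdataset's ID into the configured template:

.. code-block:: bash

   # template from .datalad/config:   ria+https://store.datalad.org#{id}
   # datalad-id of subdataset 100206: 346a3ae0-2c2e-11ea-a27d-002590496000
   ria+https://store.datalad.org#346a3ae0-2c2e-11ea-a27d-002590496000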

With this in place, anyone can clone the topmost dataset from GitHub and --
given they have valid credentials -- get any file in the HCP dataset hierarchy.

Speeding operations up
""""""""""""""""""""""

At this point, the HCP dataset is a single, published superdataset with
~4500 subdatasets that are hosted in a :term:`remote indexed archive (RIA) store`
at store.datalad.org_.
This makes the HCP data accessible via DataLad and its download easier.
One downside to gigantic nested datasets like this one, though, is the time it
takes to retrieve all of it. Some tricks can help to mitigate this: Contents
can either be retrieved in parallel, as sketched below, or, if only subsets
of the dataset are generally needed, subsampled datasets can be created with
:dlcmd:`copy-file`.

If the complete HCP dataset is required, subdataset installation and data retrieval
can be sped up by parallelization. The gists :ref:`parallelize` and
:ref:`retrieveHCP` shed some light on how to do this.
If you are interested in learning more about :dlcmd:`copy-file`, check out the
section :ref:`copyfile`.
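
As a rough illustration of the parallelization idea (the gists linked above cover
the details), subdataset installation and content retrieval can be distributed
over several parallel jobs, for example with the ``-J``/``--jobs`` option of
:dlcmd:`get`, assuming a sufficiently recent DataLad version:

.. code-block:: bash

   # install all subject subdatasets without content, using several parallel jobs
   $ datalad get -n -r -J 8 HCP1200
   # afterwards, retrieve the content of interest, again in parallel
   $ datalad get -J 8 HCP1200/100206/release-notes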

Summary
"""""""

This use case demonstrated how it is possible to version control and distribute
datasets of sizes that would otherwise be unmanageably large for version control
systems. With the public HCP dataset available as a DataLad dataset, data access
is simplified, data analyses that use the HCP data can link to it (in precise versions)
from their scripts and even share it, and the complete HCP release can be stored
at a fraction of its total size for on-demand retrieval.

.. _store.datalad.org: https://store.datalad.org

.. rubric:: Footnotes

.. [#f1] If you want to read up on how DataLad stores information about
   registered subdatasets in ``.gitmodules``, check out section :ref:`config2`.

.. [#f2] Precise performance will always depend on the details of the
   repository, software setup, and hardware, but to get a feeling for the
   possible performance issues in oversized datasets, imagine a mere
   :gitcmd:`status` or :dlcmd:`status` command taking several
   minutes up to hours in a clean dataset.

.. [#f3] Note that this command is more complex than the previously shown
   :dlcmd:`addurls` command. In particular, it has an additional
   log level configuration for the main command, and creates the datasets
   with an ``hcp_dataset`` configuration. The log level was set (to
   ``warning``) to help with post-execution diagnostics in the HTCondor
   log files. The configuration can be found in
   `code/cfg_hcp_dataset <https://github.com/TobiasKadelka/build_hcp/blob/master/code/cfg_hcp_dataset.sh>`_
   and enables a :term:`special remote` in the resulting dataset.

.. [#f4] To re-read about publishing datasets to hosting services such as
   :term:`GitHub` or :term:`GitLab`, go back to :ref:`publishtogithub`.

.. [#f5] If you coded along in the Basics part of the book and published your
   dataset to :term:`Gin`, you have experienced in :ref:`subdspublishing`
   how links to unpublished subdatasets in a published dataset do not
   resolve in the web interface: The link points to a URL that would lie
   underneath the superdataset, but there is no published subdataset at
   that location on the hosting platform!

.. [#f6] To re-read about configurations of datasets, go back to sections
   :ref:`config` and :ref:`config2`.