.. index:: ! Usecase; Scaling up: 80TB and 15 million files
.. _usecase_HCP_dataset:

Scaling up: Managing 80TB and 15 million files from the HCP release
-------------------------------------------------------------------

This use case outlines how a large data collection can be version controlled
and published in an accessible manner with DataLad in a remote indexed
archive (RIA) data store. Using the
`Human Connectome Project <http://www.humanconnectomeproject.org>`_
(HCP) data as an example, it shows how large-scale datasets can be managed
with the help of modular nesting, and how access to data that is contingent on
usage agreements and external service credentials is possible via DataLad
without circumventing or breaching the data provider's terms:

#. The :dlcmd:`addurls` command is used to automatically aggregate
   files and information about their sources from public
   `AWS S3 <https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html>`_
   bucket storage into small-sized, modular DataLad datasets.
#. Modular datasets are structured into a hierarchy of nested datasets, with a
   single HCP superdataset at the top. This modularizes storage and access,
   and mitigates performance problems that would arise in oversized standalone
   datasets, while maintaining access to any subdataset from the top-level dataset.
#. Individual datasets are stored in a remote indexed archive (RIA) store
   at store.datalad.org_ under their :term:`dataset ID`.
   This setup constitutes a flexible, domain-agnostic, and scalable storage
   solution, while dataset configurations enable seamless automatic dataset
   retrieval from the store.
#. The top-level dataset is published to GitHub as a public access point for the
   full HCP dataset. As the RIA store contains datasets with only file source
   information instead of hosting data contents, a :dlcmd:`get` retrieves
   file contents from the original AWS S3 sources.
#. With DataLad's authentication management, users authenticate only once --
   and are thus required to accept the HCP project's terms to obtain valid
   credentials -- but subsequent :dlcmd:`get` commands work swiftly
   without logging in.
#. The :dlcmd:`copy-file` command can be used to subsample special-purpose
   datasets for faster access.

.. index:: ! Human Connectome Project (HCP)

The Challenge
^^^^^^^^^^^^^

The `Human Connectome Project <http://www.humanconnectomeproject.org>`_ aims
to provide an unparalleled compilation of neural data through a customized
database. Its largest open access data collection is the
`WU-Minn HCP1200 Data <https://humanconnectome.org/study/hcp-young-adult/document/1200-subjects-data-release>`_.
It is made available via a public AWS S3 bucket and includes high-resolution 3T
`magnetic resonance <https://en.wikipedia.org/wiki/Magnetic_resonance_imaging>`_
scans from young healthy adult twins and non-twin siblings (ages 22-35)
using four imaging modalities: structural images (T1w and T2w),
`resting-state fMRI (rfMRI) <https://en.wikipedia.org/wiki/Resting_state_fMRI>`_,
task-fMRI (tfMRI), and high angular resolution
`diffusion imaging (dMRI) <https://en.wikipedia.org/wiki/Diffusion_MRI>`_.
It further includes behavioral and other individual subject measure data
for all subjects, as well as `magnetoencephalography <https://en.wikipedia.org/wiki/Magnetoencephalography>`_
data and 7T MR data for a subset of subjects (twin pairs).
In total, the data release encompasses around 80TB of data in 15 million files,
and is of immense value to the field of neuroscience.

Its sheer amount of data, however, also constitutes a data management challenge:
Such amounts of data are difficult to store, structure, access, and version
control. Even tools such as DataLad and its foundations, :term:`Git` and
:term:`git-annex`, will struggle or fail with datasets of this size or number
of files. Simply transforming the complete data release into a single DataLad
dataset would at best lead to severe performance issues, but quite likely result
in software errors and crashes.

Moreover, access to the HCP data is contingent on consent to the
`data usage agreement <http://www.humanconnectomeproject.org/wp-content/uploads/2010/01/HCP_Data_Agreement.pdf>`_
of the HCP project and requires valid AWS S3 credentials. Instead of hosting
this data or providing otherwise unrestrained access to it, an HCP
DataLad dataset would need to enable data retrieval from the original sources,
conditional on the user agreeing to the HCP usage terms.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Using the :dlcmd:`addurls` command, the HCP data release is
aggregated into a large number (roughly 4500) of datasets. A lean top-level dataset
combines all datasets into a nested dataset hierarchy that recreates the original
HCP data release's structure. The topmost dataset contains one subdataset per
subject with the subject's release notes, and within each subject's subdataset,
each additional available subdirectory is another subdataset. This preserves
the original structure of the HCP data release, but builds it up from sensible
components that resemble standalone dataset units. As with any DataLad dataset,
dataset nesting and operations across dataset boundaries are seamless, and
make it easy to retrieve data at the subject, modality, or file level.

The highly modular structure has several advantages. For one, with barely any
data in the superdataset, the top-level dataset is very lean. It mainly consists
of an impressive ``.gitmodules`` file [#f1]_ with almost 1200 registered
(subject-level) subdatasets. The superdataset is published to :term:`GitHub` at
`github.com/datalad-datasets/human-connectome-project-openaccess <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
to expose this superdataset and allow anyone to install it with a single
:dlcmd:`clone` command in a few seconds.

Secondly, the modularity from splitting the data release into
several thousand subdatasets has performance advantages. If :term:`Git` or
:term:`git-annex` repositories exceed a certain size (either in terms of
file sizes or the number of files), performance can drop severely [#f2]_.
By dividing the vast amount of data into many subdatasets,
this can be prevented: Subdatasets are small-sized units that are combined into the
complete HCP dataset structure, and nesting comes with no additional costs or
difficulties, as DataLad can work smoothly across hierarchies of subdatasets.

In order to simplify access to the data without circumventing the HCP license
agreement, DataLad does not host any HCP data. Instead, thanks to :dlcmd:`addurls`, each
data file knows its source (the public AWS S3 bucket of the HCP project), and a
:dlcmd:`get` will retrieve HCP data from this bucket.
With this setup, anyone who wants to obtain the data will still need to consent
to the data usage terms and retrieve AWS credentials from the HCP project, but can
afterwards obtain the data solely with DataLad commands from the command line
or in scripts. Only the first :dlcmd:`get` requires authentication
with AWS credentials provided by the HCP project: DataLad will prompt the user
at the time of retrieval of the first file content from the dataset.
Afterwards, no further authentication is needed, unless the credentials become
invalid or need to be updated for other reasons.

Thus, in order to retrieve HCP data down to the level of individual files with
DataLad, users only need to:

- :dlcmd:`clone` the superdataset from :term:`GitHub`
  (`github.com/datalad-datasets/human-connectome-project-openaccess <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_)
- Create an account at https://db.humanconnectome.org to accept data use terms
  and obtain AWS credentials
- Use :dlcmd:`get [-n] [-r] PATH` to retrieve file, directory, or
  subdataset contents on demand. Authentication is necessary only
  once (at the time of the first :dlcmd:`get`).

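In practice, this amounts to only a handful of shell commands. Here is a
minimal sketch of such a workflow; the subject ID (``100206``) and the
``release-notes`` path merely serve as examples:

.. code-block:: bash

   # clone the lean superdataset from GitHub (no file contents, takes seconds)
   $ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git
   $ cd human-connectome-project-openaccess
   # install one subject's subdataset without retrieving file contents (-n)
   $ datalad get -n HCP1200/100206
   # retrieve actual file contents; the first get prompts for AWS credentials
   $ datalad get HCP1200/100206/release-notes
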
The HCP data release, despite its large size, can thus be version controlled and
easily distributed with DataLad. In order to speed up data retrieval, subdataset
installation can be parallelized, and the full HCP dataset can be subsampled
into special-purpose datasets using DataLad's :dlcmd:`copy-file` command
(introduced with DataLad version 0.13.0).

Step-by-Step
^^^^^^^^^^^^

Building and publishing a DataLad dataset with HCP data consists of several steps:
1) creating all necessary datasets, 2) publishing them to a RIA store, and 3) creating
an access point to all files in the HCP data release. The upcoming subsections
detail each of these.

.. index::
   pair: addurls; DataLad command

Dataset creation with ``datalad addurls``
"""""""""""""""""""""""""""""""""""""""""

The :dlcmd:`addurls` command
allows you to create (and update) potentially nested DataLad datasets from a list
of download URLs that point to the HCP files in the S3 buckets.
By supplying subject-specific ``.csv`` files that contain S3 download links,
a subject ID, a file name, and a version specification per file in the HCP dataset,
as well as information on where subdataset boundaries are,
:dlcmd:`addurls` can download all subjects' files and create (nested) datasets
to store them in. With the help of a few bash commands, this task can be
automated, and with the help of a `job scheduler <https://en.wikipedia.org/wiki/Job_scheduler>`_,
it can also be parallelized.

As soon as files are downloaded and saved to a dataset, their content can be
dropped with :dlcmd:`drop`: The origin of each file was successfully
recorded, and a :dlcmd:`get` can now retrieve file contents on demand.
Thus, shortly after a complete download of the HCP project data, the datasets in
which it has been aggregated are small in size, and yet provide access to the HCP
data for anyone who has valid AWS S3 credentials.

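This download-drop cycle can be sketched with two commands, run from within one
of the just-created subject datasets; the file path below is just an example
from the tables shown later:

.. code-block:: bash

   # free local storage; the S3 origin of the file stays on record
   $ datalad drop release-notes/ReleaseNotes.txt
   # re-obtain the content on demand from the AWS S3 bucket
   $ datalad get release-notes/ReleaseNotes.txt
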
At the end of this step, there is one nested dataset per subject in the HCP data
release. If you are interested in the details of this process, check out the
:find-out-more:`on the datasets' generation <fom-hcp>`.

.. find-out-more:: How exactly did the datasets come to be?
   :name: fom-hcp

   All code and tables necessary to generate the HCP datasets can be found on
   GitHub at `github.com/TobiasKadelka/build_hcp <https://github.com/TobiasKadelka/build_hcp>`_.

   The :dlcmd:`addurls` command is capable of building all necessary nested
   subject datasets automatically; it only needs an appropriate specification of
   its tasks. We'll approach the function of :dlcmd:`addurls` and
   how exactly it was invoked to build the HCP dataset by looking at the
   information it needs. Below are excerpts of the ``.csv`` table of one subject
   (``100206``) that illustrate how :dlcmd:`addurls` works:

   .. code-block::
      :caption: Table header and some of the release note files

      "original_url","subject","filename","version"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc.txt","100206","release-notes/Structural_preproc.txt","OeUYjysiX5zR7nRMixCimFa_6yQ3IKqf"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc_extended.txt","100206","release-notes/Structural_preproc_extended.txt","cyP8G5_YX5F30gO9Yrpk8TADhkLltrNV"
      "s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_unproc.txt","100206","release-notes/Structural_unproc.txt","AyW6GmavML6I7LfbULVmtGIwRGpFmfPZ"

   .. code-block::
      :caption: Some files in the MNINonLinear directory

      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec","100206","MNINonLinear//100206.164k_fs_LR.wb.spec","JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"
      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","sP4uw8R1oJyqCWeInSd9jmOBjfOCtN4D"
      "s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","yD88c.HfsFwjyNXHQQv2SymGIsSYHQVZ"

   The ``.csv`` table contains one row per file, and includes the columns
   ``original_url``, ``subject``, ``filename``, and ``version``. ``original_url``
   is an S3 URL pointing to an individual file in the S3 bucket, ``subject`` is
   the subject's ID (here: ``100206``), ``filename`` is the path the file will
   have within the dataset that is about to be built, and ``version`` is an
   S3-specific file version identifier.
   The first table excerpt thus specifies a few files in the directory ``release-notes``
   in the dataset of subject ``100206``. For :dlcmd:`addurls`, the
   column headers serve as placeholders for the fields in each row.
   If this table excerpt is given to a :dlcmd:`addurls` call as shown
   below, it will create a dataset and download and save the precise version of each file
   in it::

      $ datalad addurls -d <Subject-ID> <TABLE> '{original_url}?versionId={version}' '{filename}'

   This command translates to "create a dataset with the name of the subject ID
   (``-d <Subject-ID>``) and use the provided table (``<TABLE>``) to assemble the
   dataset contents. Iterate through the table rows, and perform one download per
   row. Generate the download URL from the ``original_url`` and ``version``
   fields of the table (``{original_url}?versionId={version}``), and save the
   downloaded file under the name specified in the ``filename`` field (``{filename}``)".

   If the file name contains a double slash (``//``), as seen for example in the
   second table excerpt in ``"MNINonLinear//...``, the file will be created
   underneath a *subdataset* with the name in front of the double slash. The rows
   in the second table excerpt thus translate to "save these files into the
   subdataset ``MNINonLinear``, and if this subdataset does not exist, create it".

   Thus, with a single subject's table, a nested, subject-specific dataset is built.
   Here is how the directory hierarchy looks for this particular subject once
   :dlcmd:`addurls` has worked through its table:

   .. code-block:: bash

      100206
      ├── MNINonLinear    <- subdataset
      ├── release-notes
      ├── T1w             <- subdataset
      └── unprocessed     <- subdataset

   This is all there is to assembling subject-specific datasets. The interesting
   question is: How can this be automated as much as possible?

   **How to create subject-specific tables**

   One crucial part of the process is creating the subject-specific tables for
   :dlcmd:`addurls`. The information on each file's URL, name, and
   version can be queried with the :dlcmd:`ls` command.
   It is a DataLad-specific version of the Unix :shcmd:`ls` command and can
   be used to list summary information about S3 URLs and datasets. With this
   command, the public S3 bucket can be queried, and the command will output the
   relevant information.

   The :dlcmd:`ls` command is a rather old command and less user-friendly
   than other commands demonstrated in the handbook. One problem for automation
   is that the command is made for interactive use and outputs information in
   a non-structured fashion. In order to retrieve the relevant information,
   a custom Python script was used to split its output and extract it. This
   script can be found in the GitHub repository as
   `code/create_subject_table.py <https://github.com/TobiasKadelka/build_hcp/blob/master/code/create_subject_table.py>`_.

   **How to schedule datalad addurls commands for all tables**

   Once the subject-specific tables exist, :dlcmd:`addurls` can start
   to aggregate the files into datasets. To do this efficiently, the jobs can be
   run in parallel with a job scheduler; on the compute cluster on which the
   datasets were aggregated, this was `HTCondor <https://research.cs.wisc.edu/htcondor>`_.

   The jobs (per subject) performed by HTCondor consisted of

   - an :dlcmd:`addurls` command to generate the (nested) dataset
     and retrieve content once [#f3]_::

        datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" '{original_url}?versionId={version}' '{filename}'

   - a subsequent :dlcmd:`drop` command to remove file contents as
     soon as they were saved to the dataset in order to save disk space (this is
     possible since the S3 source of each file is known, and content can be
     reobtained using :dlcmd:`get`)::

        datalad drop -d "$outds" -r --nocheck

   - a few (Git) commands to clean up afterwards, as the system the HCP dataset
     was downloaded to had a strict 5TB limit on disk usage.

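   Put together, a per-subject job essentially boils down to a small shell
   script along the following lines. This is only a sketch: the variable
   values and file locations are assumptions, not the verbatim job script
   (which lives in the GitHub repository linked above).

   .. code-block:: bash

      #!/bin/bash
      # hypothetical per-subject job, run once per subject by HTCondor
      subj=$1
      subj_table="tables/${subj}.csv"   # assumed location of the subject's table
      outds="${subj}"                   # assumed location of the output dataset

      # build the (nested) subject dataset and download each file once
      datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" \
          '{original_url}?versionId={version}' '{filename}'

      # drop all file contents right away to stay within the disk quota
      datalad drop -d "$outds" -r --nocheck
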
   **Summary**

   Thus, in order to download the complete HCP project and aggregate it into
   nested subject-level datasets (on a system with much less disk space than the
   complete HCP project's size!), only two DataLad commands, one custom configuration,
   and some scripts to parse terminal output into ``.csv`` tables and create
   subject-wise HTCondor jobs were necessary. With all tables set up, the jobs
   ran over the Christmas break and finished before everyone went back to work.
   Getting 15 million files into datasets? Check!

.. index:: Remote Indexed Archive (RIA) store

Using a Remote Indexed Archive Store for dataset hosting
""""""""""""""""""""""""""""""""""""""""""""""""""""""""

All datasets were built on a scientific compute cluster. In this location, however,
the datasets would only be accessible to users with an account on this system.
Subsequently, therefore, everything was published with
:dlcmd:`push` to the publicly available
store.datalad.org_, a remote indexed archive (RIA)
store.

A RIA store is a flexible and scalable data storage solution for DataLad datasets.
While its layout may look confusing at first glance, a RIA store
is nothing but a clever storage solution, and users never consciously interact
with the store to get the HCP datasets.
On the lowest level, store.datalad.org_
is a directory on a publicly accessible server that holds a great number of datasets
stored as :term:`bare git repositories`. The only important aspect of it for this
use case is that datasets are stored and identified not by their names
(e.g., ``100206``), but by their :term:`dataset ID`.
The :dlcmd:`clone` command can understand this layout and install
datasets from a RIA store based on their ID.

.. find-out-more:: What would a 'datalad clone' from a RIA store look like?

   In order to get a dataset from a RIA store, :dlcmd:`clone` needs
   a RIA URL. It is built from the following components:

   - a ``ria+`` identifier
   - a path or URL to the store in question. For store.datalad.org_, this is
     ``https://store.datalad.org``, but it could also be an SSH URL, such as
     ``ssh://juseless.inm7.de/data/group/psyinf/dataset_store``
   - a pound sign (``#``)
   - the dataset ID
   - and optionally a version or branch specification (appended with a leading ``@``)

   Here is how a valid :dlcmd:`clone` command from the data store
   for one dataset would look:

   .. code-block:: bash

      datalad clone 'ria+https://store.datalad.org#d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561' subj-01

   But worry not! To get the HCP data, no one will ever need to compose
   :dlcmd:`clone` commands to RIA stores apart from DataLad itself.

A RIA store is used because -- among other advantages -- its layout makes the
store flexible and scalable. With datasets of sizes like the HCP project,
scalability in particular becomes an important factor. If you are interested in
finding out why, you can find more technical details on RIA stores, their advantages,
and even how to create and use one yourself in the section :ref:`riastore`.

Making the datasets accessible
""""""""""""""""""""""""""""""

At this point, roughly 1200 nested datasets were created and published to a publicly
accessible RIA store. This modularized the HCP dataset and prevented performance
issues that would arise in oversized datasets. In order to make the complete dataset
available and accessible from one central point, the only thing missing is a
single superdataset.

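Conceptually, assembling such a superdataset amounts to creating a new dataset
and registering each subject dataset from the RIA store under its
``HCP1200/<ID>`` path. A minimal sketch of this idea (not the verbatim
procedure), using the dataset ID of subject ``100206`` from the ``.gitmodules``
excerpt shown below:

.. code-block:: bash

   # create the superdataset
   $ datalad create human-connectome-project-openaccess
   $ cd human-connectome-project-openaccess
   # register one subject dataset from the RIA store as a subdataset,
   # identified by its dataset ID
   $ datalad clone -d . \
       'ria+https://store.datalad.org#346a3ae0-2c2e-11ea-a27d-002590496000' \
       HCP1200/100206
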
For this, a new dataset, ``human-connectome-project-openaccess``, was created.
It contains a ``README`` file with short instructions on how to use it,
a text-based copy of the HCP project's data usage agreement, and each subject
dataset as a subdataset. The ``.gitmodules`` file [#f1]_ of this superdataset
is thus impressively long. Here is an excerpt::

   [submodule "100206"]
       path = HCP1200/100206
       url = ./HCP1200/100206
       branch = master
       datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
   [submodule "100307"]
       path = HCP1200/100307
       url = ./HCP1200/100307
       branch = master
       datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
   [submodule "100408"]
       path = HCP1200/100408
       url = ./HCP1200/100408
       branch = master
       datalad-id = d3fa72e4-2c2b-11ea-948f-0025904abcb0
   [...]

For each subdataset (named after a subject ID), there is one entry. Note that the
individual ``url`` values of the subdatasets are not needed: as will be
demonstrated shortly, DataLad resolves each subdataset ID from the common store
automatically.
Thus, this superdataset combines all individual datasets into the original HCP dataset
structure. This (and only this) superdataset is published to a public :term:`GitHub`
repository that anyone can :dlcmd:`clone` [#f4]_.

Data retrieval and interacting with the repository
""""""""""""""""""""""""""""""""""""""""""""""""""

Procedurally, getting data from this dataset is almost as simple as from any
other public DataLad dataset: One needs to :dlcmd:`clone` the repository
and use :dlcmd:`get [-n] [-r] PATH` to retrieve any file, directory,
or subdataset (content). But because the data will be downloaded from the HCP's
AWS S3 bucket, users will need to create an account at
`db.humanconnectome.org <https://db.humanconnectome.org>`_ to agree to the project's
data usage terms and obtain credentials. When performing the first
:dlcmd:`get` for file contents, DataLad will prompt for these credentials
interactively in the terminal. Once supplied, all subsequent :dlcmd:`get`
commands will retrieve data right away.

.. find-out-more:: Resetting AWS credentials

   In case one mistypes their AWS credentials or needs to reset them,
   this can easily be done using the `Python keyring <https://keyring.readthedocs.io>`_
   package. For more information on ``keyring`` and DataLad's authentication
   process, see the *Basic process* section in :ref:`providers`.

   After launching Python, import the ``keyring`` package and use the
   ``set_password()`` function. This function takes 3 arguments:

   * ``system``: "datalad-hcp-s3" in this case
   * ``username``: "key_id" if modifying the AWS access key ID, or "secret_id"
     if modifying the secret access key
   * ``password``: the access key itself

   .. code-block:: python

      import keyring
      keyring.set_password("datalad-hcp-s3", "key_id", <password>)
      keyring.set_password("datalad-hcp-s3", "secret_id", <password>)

   Alternatively, one can set their credentials using environment variables.
   For more details on this method, :ref:`see this Findoutmore <fom-envvar>`.

   .. code-block:: bash

      $ export DATALAD_hcp_s3_key_id=<password>
      $ export DATALAD_hcp_s3_secret_id=<password>

Internally, DataLad cleverly manages the crucial aspect of data retrieval:
linking registered subdatasets to the correct dataset in the RIA store. If you
inspect the GitHub repository, you will find that the subdataset links in it
will not resolve if you click on them, because none of the subdatasets were
published to GitHub [#f5]_, but lie in the RIA store instead.
Dataset or file content retrieval will nevertheless work automatically with
:dlcmd:`get`: Each ``.gitmodules`` entry lists the subdataset's
dataset ID. Based on a "subdataset-source-candidate" configuration in the
``.datalad/config`` of the superdataset, the subdataset ID is assembled into a
RIA URL from which :dlcmd:`get` retrieves the correct dataset from the store:

.. code-block:: bash
   :emphasize-lines: 4-5

   $ cat .datalad/config
   [datalad "dataset"]
       id = 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
   [datalad "get"]
       subdataset-source-candidate-origin = "ria+https://store.datalad.org#{id}"

This configuration allows :dlcmd:`get` to flexibly generate RIA URLs from the
base URL in the config file and the dataset IDs listed in ``.gitmodules``. In
the superdataset, it needed to be set "by hand" via the :gitcmd:`config`
command. Because the configuration should be shared together with the
dataset, it needed to be set in ``.datalad/config`` [#f6]_::

   $ git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+https://store.datalad.org#{id}"

With this configuration, :dlcmd:`get` will retrieve all subdatasets from the
RIA store. Any subdataset that is obtained from a RIA store in turn gets the very
same configuration automatically, into its ``.git/config``. Thus, the configuration
that makes seamless subdataset retrieval from RIA stores possible is propagated
throughout the dataset hierarchy.

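As a concrete illustration of this mechanism: when a user runs
``datalad get -n HCP1200/100206``, DataLad reads this subdataset's ID from
``.gitmodules`` and expands the configured template into a clone URL.

.. code-block:: bash

   # ID of subject 100206 in .gitmodules:
   #   346a3ae0-2c2e-11ea-a27d-002590496000
   # template from .datalad/config:
   #   ria+https://store.datalad.org#{id}
   # resulting source candidate used to clone the subdataset:
   #   ria+https://store.datalad.org#346a3ae0-2c2e-11ea-a27d-002590496000
   $ datalad get -n HCP1200/100206
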
With this in place, anyone can clone the topmost dataset from GitHub, and --
given they have valid credentials -- get any file in the HCP dataset hierarchy.

Speeding operations up
""""""""""""""""""""""

At this point, the HCP dataset is a single published superdataset with
~4500 subdatasets that are hosted in a :term:`remote indexed archive (RIA) store`
at store.datalad.org_.
This makes the HCP data accessible via DataLad and simplifies its download.
One downside to gigantic nested datasets like this one, though, is the time it
takes to retrieve all of it. Some tricks can help to mitigate this: Contents
can either be retrieved in parallel, or, if subsets of the dataset are generally
needed, subsampled datasets can be created with :dlcmd:`copy-file`.

If the complete HCP dataset is required, subdataset installation and data retrieval
can be sped up by parallelization. The gists :ref:`parallelize` and
:ref:`retrieveHCP` shed some light on how to do this.

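For instance, many DataLad commands accept a ``--jobs`` argument. A sketch of
a parallelized recursive installation of all subject subdatasets (without any
file contents) could look like this; the number of jobs is an arbitrary choice:

.. code-block:: bash

   # install all subdatasets recursively (-r) without file content (-n),
   # distributing the work across several parallel jobs
   $ datalad get -n -r --jobs 8 HCP1200
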
If you are interested in learning about :dlcmd:`copy-file`, check out the
section :ref:`copyfile`.

Summary
"""""""

This use case demonstrated how it is possible to version control and distribute
datasets of sizes that would otherwise be unmanageably large for version control
systems. With the public HCP dataset available as a DataLad dataset, data access
is simplified, data analyses that use the HCP data can link it (in precise versions)
to their scripts and even share it, and the complete HCP release can be stored
at a fraction of its total size for on-demand retrieval.

.. _store.datalad.org: https://store.datalad.org

.. rubric:: Footnotes

.. [#f1] If you want to read up on how DataLad stores information about
         registered subdatasets in ``.gitmodules``, check out section :ref:`config2`.

.. [#f2] Precise performance will always depend on the details of the
         repository, software setup, and hardware, but to get a feeling for the
         possible performance issues in oversized datasets, imagine a mere
         :gitcmd:`status` or :dlcmd:`status` command taking several
         minutes up to hours in a clean dataset.

.. [#f3] Note that this command is more complex than the previously shown
         :dlcmd:`addurls` command. In particular, it has an additional
         ``loglevel`` configuration for the main command, and creates the datasets
         with an ``hcp_dataset`` configuration. The logging level was set (to
         ``warning``) to help with post-execution diagnostics in HTCondor's
         log files. The configuration can be found in
         `code/cfg_hcp_dataset <https://github.com/TobiasKadelka/build_hcp/blob/master/code/cfg_hcp_dataset.sh>`_
         and enables a :term:`special remote` in the resulting dataset.

.. [#f4] To re-read about publishing datasets to hosting services such as
         :term:`GitHub` or :term:`GitLab`, go back to :ref:`publishtogithub`.

.. [#f5] If you coded along in the Basics part of the book and published your
         dataset to :term:`Gin`, you have experienced in :ref:`subdspublishing`
         how the links to unpublished subdatasets in a published dataset do not
         resolve in the web interface: Each link points to a URL underneath the
         superdataset, but no subdataset was published to that location on the
         hosting platform.

.. [#f6] To re-read about dataset configurations, go back to sections :ref:`config`
         and :ref:`config2`.
|