@@ -1,10 +1,9 @@
+.. index:: ! Usecase; Scaling up: 80TB and 15 million files
 .. _usecase_HCP_dataset:
 
 Scaling up: Managing 80TB and 15 million files from the HCP release
 -------------------------------------------------------------------
 
-.. index:: ! Usecase; Scaling up: 80TB and 15 million files
-
 This use case outlines how a large data collection can be version controlled
 and published in an accessible manner with DataLad in a remote indexed
 archive (RIA) data store. Using the
@@ -38,11 +37,11 @@ without circumventing or breaching the data providers terms:
 #. The :dlcmd:`copy-file` can be used to subsample special-purpose datasets
    for faster access.
 
+.. index:: ! Human Connectome Project (HCP)
+
 The Challenge
 ^^^^^^^^^^^^^
 
-.. index:: ! Human Connectome Project (HCP)
-
 The `Human Connectome Project <http://www.humanconnectomeproject.org>`_ aims
 to provide an unparalleled compilation of neural data through a customized
 database. Its largest open access data collection is the
@@ -143,12 +142,12 @@ Building and publishing a DataLad dataset with HCP data consists of several step
 an access point to all files in the HCP data release. The upcoming subsections
 detail each of these.
 
-Dataset creation with ``datalad addurls``
-"""""""""""""""""""""""""""""""""""""""""
-
 .. index::
    pair: addurls; DataLad command
 
+Dataset creation with ``datalad addurls``
+"""""""""""""""""""""""""""""""""""""""""
+
 The :dlcmd:`addurls` command
 allows you to create (and update) potentially nested DataLad datasets from a list
 of download URLs that point to the HCP files in the S3 buckets.
@@ -296,11 +295,11 @@ hidden section below.
 ran over the Christmas break and finished before everyone went back to work.
 Getting 15 million files into datasets? Check!
 
+.. index:: Remote Indexed Archive (RIA) store
+
 Using a Remote Indexed Archive Store for dataset hosting
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""
 
-.. index:: Remote Indexed Archive (RIA) store
-
 All datasets were built on a scientific compute cluster. In this location, however,
 datasets would only be accessible to users with an account on this system.
 Subsequently, therefore, everything was published with