
.. index:: ! 3-001
.. index:: ! Usecase; Remote Indexed Archive (RIA) store
.. _3-001:
.. _usecase_datastore:

Building a scalable data storage for scientific computing
---------------------------------------------------------

Research can require enormous amounts of data. Such data needs to be accessed by
multiple people at the same time, and is used across a diverse range of
computations or research questions.
The size of the datasets, the need for simultaneous access and transformation
of the data by multiple people, and the subsequent storing of multiple copies
or derivatives of the data constitute a challenge for computational clusters
and require state-of-the-art data management solutions.

This use case details a model implementation for a scalable data storage
solution, suitable to serve the computational and logistic demands of data
science in big (scientific) institutions, while keeping workflows for users
as simple as possible. It elaborates on

#. how to implement a scalable :term:`Remote Indexed Archive (RIA) store` to flexibly
   store large amounts of DataLad datasets, potentially remotely, to lower the
   storage strain on computing infrastructure,
#. how disk-space aware computing can be eased by DataLad-based workflows and
   enforced by infrastructural incentives and limitations, and
#. how to reduce technical complexities for users and encourage reproducible,
   version-controlled, and scalable scientific workflows.

.. importantnote:: Use case target audience

   This use case is technical in nature and aimed at IT/data management
   personnel seeking insights into the technical implementation and
   configuration of a RIA store or into its workflows. In particular, it
   describes the RIA data storage and workflow implementation as done in INM-7,
   Research Centre Juelich, Germany.

The Challenge
^^^^^^^^^^^^^

The data science institute XYZ consists of dozens of people: Principal
investigators, PhD students, general research staff, system administration,
and IT support. It does research on important global issues, and prides
itself on ground-breaking insights obtained from elaborate and complex
computations run on a large scientific computing cluster.
The datasets used in the institute are big, both in size and number of files,
and expensive to collect. Therefore, datasets are used for a variety of
research questions and by multiple researchers. Every member of the institute
has an account on an expensive and large compute cluster, and all of the data
exists in dedicated directories on this server. However, researchers struggle
with the technical overhead of data management *and* data science.

In order to work on their research questions without modifying the
original data, every user creates their own copies of the full data in their
user account on the cluster -- even if it contains many files that are not
necessary for their analysis. In addition, as version control is not a standard
skill, they add all computed derivatives and outputs, even old versions, out of
fear of losing work that may become relevant again. Thus, an excess of (unorganized)
data copies and derivatives exists in addition to the already substantial
amount of original data. At the same time, the compute cluster is both the
data storage and the analysis playground for the institute. With data
directories of several TB in size *and* computationally heavy analyses, the
compute cluster is quickly brought to its knees: Insufficient memory and
IOPS starvation make computations painstakingly slow and hinder scientific
progress. Despite the elaborate and expensive cluster setup, exciting datasets
cannot be stored or processed, as there just doesn't seem to be enough disk
space.

Therefore, the challenge is two-fold: On an infrastructural level, institute XYZ
needs a scalable, flexible, and maintainable data storage solution for its
growing collection of large datasets.
On the level of human behavior, researchers not formally trained in data
management need to apply and adhere to advanced data management principles.

The DataLad approach
^^^^^^^^^^^^^^^^^^^^

The compute cluster is refurbished into a state-of-the-art data management
system.
For a scalable and flexible dataset storage, the data store is a
:term:`Remote Indexed Archive (RIA) store` -- an extendable, file-system based
storage solution for DataLad datasets that aligns well with the requirements of
scientific computing (infrastructure).
The RIA store is equipped with a git-annex ORA ("optional remote archive")
special remote that provides access to annexed file contents in the store, and
allows complete datasets to be kept as (compressed) 7-zip archives.
The latter is especially useful in case of file system inode
limitations, such as on HPC storage systems: Regardless of a dataset's number of
files and size, a (compressed) 7-zipped dataset uses only a few inodes, but retains
the ability to query available files.
Unlike traditional solutions, the RIA store is set up *remote*: Both because of
the sheer amount of data, and to reserve compute power for calculations instead
of data storage, data is stored on a different machine than the one the scientific
analyses are computed on. While unconventional, it is convenient, and perfectly
possible with DataLad.

The infrastructural changes are accompanied by changes in the mindset and
workflows of the researchers that perform analyses on the cluster.
By using a RIA store, the institute's work routines are adjusted around
DataLad datasets. Simple configurations, distributed system-wide with DataLad's
run-procedures, and basic data management principles improve the efficiency and
reproducibility of research projects:
Analyses are set up inside of DataLad datasets, and for every
analysis, an associated ``project`` is automatically created under the namespace
of the institute on the institute's :term:`GitLab` instance. This
not only leads to vastly simplified version control workflows, but also to
simplified access to projects and research logs for collaborators and supervisors.

Input data gets installed as subdatasets from the RIA store. This automatically
links analysis projects to datasets, and allows for fine-grained data access down
to the level of individual files. With only precisely the needed data, analysis
datasets are already much leaner than the previous complete dataset copies, and
as data can be re-obtained on demand from the store, original input files or files
that are easily recomputed can safely be dropped to save even more disk space.
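Such a disk-space aware get/drop cycle could look as follows -- the store URL,
dataset alias, and file paths are hypothetical, for illustration only:

.. code-block:: bash

   # install input data as a subdataset from the (hypothetical) RIA store
   $ datalad clone -d . ria+ssh://datastore.example.org/store#~studyforrest inputs
   # retrieve only the files the analysis actually needs ...
   $ datalad get inputs/sub-01
   # ... and drop their content again once results are computed and saved
   $ datalad drop inputs/sub-01

As the subdataset stays registered in the analysis dataset, dropped content can
be re-obtained with another :dlcmd:`get` at any time.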

Beyond this, upon creation of an analysis project, the associated GitLab project
is automatically configured as a remote with a publication dependency on the
data store, thus enabling vastly simplified data publication routines and
backups of pristine results: After computing their results, a
:dlcmd:`push` is all it takes to back up and share one's scientific
insights. Thus, even with a complex setup of data store, compute infrastructure,
and repository hosting, configurations adjusted to the compute infrastructure
can be distributed and used to mitigate any potential remaining technical overhead.
Finally, with all datasets stored in a RIA store in a single place, any remaining
maintenance and query tasks in the datasets can be performed by data management
personnel without requiring domain knowledge about dataset contents.

Step-by-step
^^^^^^^^^^^^

The following section elaborates on the details of the technical
implementation of a RIA store, and on the workflow requirements and incentives
for researchers. Both are aimed at making scientific analyses on a
compute cluster scale, and can be viewed as complementary but independent.

.. importantnote:: Note on the generality of the described setup

   Some hardware-specific implementation details are unique to the real-world
   example this use case is based on, and are not a requirement. In this particular
   case of application, for example, a *remote* setup for a RIA store made sense:
   Parts of an old compute cluster and of the supercomputer at the Juelich
   Supercomputing Centre (JSC), instead of the institute's compute cluster, are
   used to host the data store. This may be an unconventional storage location,
   but it is convenient: The data does not strain the compute cluster, and with
   DataLad, it is irrelevant where the RIA store is located. The next subsection
   introduces the general layout of the compute infrastructure and some
   DataLad-unrelated incentives and restrictions.

Incentives and imperatives for disk-space aware computing
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""

On a high level, the layout and relationships of the relevant computational
infrastructure in this use case are as follows:
Every researcher has a workstation from which they can access the compute cluster.
On the compute cluster's head node, every user account has its own
home directory. These are the private spaces of researchers and are referred to
as ``$HOME`` in :numref:`fig_store`.
Analyses should be conducted on the cluster's compute nodes (``$COMPUTE``).
``$HOME`` and ``$COMPUTE`` are not managed or trusted by data management personnel,
and are seen as *ephemeral* (short-lived).
The RIA store (``$DATA``) can be accessed both from ``$HOME`` and ``$COMPUTE``,
in both directions: Researchers can pull datasets from the store, push new
datasets to it, or update (certain) existing datasets. ``$DATA`` is the one location
in which experienced data management personnel ensure back-up and archival, perform
house-keeping, and handle :term:`permissions`, and it is thus where pristine raw
data is stored and where analysis code or results from ``$COMPUTE`` and ``$HOME``
should end up. This aids organization, and allows a central management of back-ups
and archival, potentially by data stewards or similar data management personnel
with no domain knowledge about the data contents.

.. _fig_store:

.. figure:: ../artwork/src/ephemeral_infra.svg
   :alt: The trinity of research data handling: data store, compute cluster, and user workstations.
   :figwidth: 80%

   Trinity of research data handling: The data store (``$DATA``) is managed and
   backed-up. The compute cluster (``$COMPUTE``) has an analysis-appropriate
   structure with adequate resources, but just as users' workstations/laptops
   (``$HOME``), it is not concerned with data hosting.

One aspect of the problem is disk-space-unaware computing workflows. Researchers
make and keep numerous copies of data in their home directory, and perform
computationally expensive analyses on the head node of a compute cluster because
they do not know better, and/or want to do it in the easiest way possible.
A general change for the better can be achieved by imposing sensible limitations
and restrictions on what can be done at which scale:
Data from the RIA store (``$DATA``) is accessible to researchers for exploration
and computation, but the scale of the operations they want to perform can require
different approaches.
In their ``$HOME``, researchers are free to do whatever they want as long as it
is within the limits of their machines or their user accounts (100 GB). Thus,
researchers can explore data, test and develop code, or visualize results,
but they cannot create complete dataset copies or afford to keep an excess of
unused data around.

Only ``$COMPUTE`` has the necessary hardware requirements for expensive computations.
Thus, within ``$HOME``, researchers are free to explore data
as they wish, but scaling requires them to use ``$COMPUTE``. By using a job
scheduler, compute jobs of multiple researchers are distributed fairly across
the available compute infrastructure. Version-controlled (and potentially
reproducible) research logs and the results of the analyses can be pushed from
``$COMPUTE`` to ``$DATA`` for back-up and archival, and hence anything that is
relevant for a research project is tracked, backed-up, and stored, all without
straining the available disk space on the cluster afterwards. While the imposed
limitations are independent of DataLad, DataLad can make sure that the necessary
workflows are simple enough for researchers of any seniority, background, or
skill level.

Remote indexed archive (RIA) stores
"""""""""""""""""""""""""""""""""""

A RIA store is a storage solution for DataLad datasets that can be flexibly
extended with new datasets, independent of static file names or directory
hierarchies, and that can be (automatically) maintained or queried without
requiring expert or domain knowledge about the data. At its core, it is a flat,
file-system based repository representation of any number of datasets, limited
only by the disk-space constraints of the machine it lies on.

.. index::
   pair: create-sibling-ria; DataLad command

Put simply, a RIA store is a dataset storage location that allows for access to
and collaboration on DataLad datasets.
The high-level workflow overview is as follows: Create a dataset,
use the :dlcmd:`create-sibling-ria` command to establish a connection
to an either pre-existing or not-yet-existing RIA store, publish dataset contents
with :dlcmd:`push`, (let others) clone the dataset from the
RIA store, and (let others) publish and pull updates. In the
case of large, institute-wide datasets, a RIA store (or multiple RIA stores)
can serve as a central storage location that enables fine-grained data access to
everyone who needs it, and as a storage and back-up location for all analysis datasets.
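Sketched as commands, with a store location that is made up for this example
(and the ``--new-store-ok`` flag, which recent DataLad versions require to
create a brand-new store), this workflow could look as follows:

.. code-block:: bash

   # create a new analysis dataset
   $ datalad create myanalysis
   $ cd myanalysis
   # register a sibling in a (possibly not yet existing) RIA store
   $ datalad create-sibling-ria -s datastore --new-store-ok \
       ria+ssh://datastore.example.org/store
   # publish the dataset's contents to the store
   $ datalad push --to datastore

Afterwards, others can clone the dataset from the store and push or pull
updates through the same sibling.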

Beyond constituting central storage locations, RIA stores also ease dataset
maintenance and queries:
If all datasets of an institute are kept in a single RIA store, questions such
as "Which projects use this data as their input?", "In which projects was the
student with this Git identity involved?", "Give me a complete research log
of what was done for this publication", or "Which datasets weren't used in the
last 5 years?" can be answered automatically with Git tools, without requiring
expert knowledge about the contents of any of the datasets, or access to the
original creators of the dataset.
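Because the store consists of plain (bare) Git repositories, such queries need
nothing beyond standard Git. A minimal sketch for one of the example questions,
assuming a hypothetical store location at ``/data/store`` and a made-up Git
identity:

.. code-block:: bash

   # list commits by a given Git identity in every dataset of the store
   # (a RIA store keeps each dataset as a bare repository in a
   # <3-character>/<remainder-of-dataset-id> directory)
   $ for repo in /data/store/???/*/; do
         git --git-dir "$repo" log --author="student@example.org" --oneline
     done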

To find out more about RIA stores, check out section :ref:`riastore`.

.. todo::

   Add a paragraph on the setup in INM-7 once it exists (bulk nodes, project-wise
   RIA stores, stores in home directories, etc.).

RIA store workflows
"""""""""""""""""""

.. todo::

   Sketch a RIA store workflow from a user's perspective

**Configurations can hide the technical layers**

Setting up a RIA store and appropriate siblings is fairly easy -- it requires
only the :dlcmd:`create-sibling-ria` command.
However, in the institute this use case describes, in order to spare users from
having to know about RIA stores, custom configurations are distributed via
DataLad's run-procedures to simplify workflows further and hide the technical
layers of the RIA setup:
A `custom procedure <https://jugit.fz-juelich.de/inm7/infrastructure/inm7-datalad/blob/master/inm7_datalad/resources/procedures/cfg_inm7.py>`_
performs the relevant sibling setup with a fully configured link to the RIA store,
and, on top of it, also creates an associated repository on the institute's
GitLab instance with a publication dependency on the RIA store [#f1]_.
With a procedure like this in place system-wide, an individual researcher only
needs to call the procedure right at the time of dataset creation, and has a
fully configured and set-up analysis dataset afterwards:

.. code-block:: bash

   $ datalad create -c inm7 <PATH>

Working in this dataset then requires only the :dlcmd:`save` and
:dlcmd:`push` commands, and the configurations ensure that the project's
history and results are published where they need to be: to the RIA store, for
storing and archiving the project including its data, and to GitLab, for exposing
the project's progress to the outside and easing collaboration or supervision.
Users do not need to know the location of the store, its layout, or how it
works -- they can go about doing their science, while DataLad handles the
publication routines.
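A researcher's day-to-day routine thus boils down to a handful of commands;
the file name and the sibling name ``gitlab`` below are made up for
illustration:

.. code-block:: bash

   # record a change to the analysis
   $ datalad save -m "Add preprocessing script" code/preprocess.py
   # publish history and results; the publication dependency ensures that
   # annexed data lands in the RIA store before GitLab receives the update
   $ datalad push --to gitlab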
  244. In order to get input data from datasets hosted in the datastore without requiring
  245. users to know about dataset IDs or construct ``ria+`` URLs, superdatasets
  246. get a :term:`sibling` on :term:`GitLab` or :term:`GitHub` with a human readable
  247. name. Users can clone the superdatasets from the web hosting service, and obtain data
  248. via :dlcmd:`get`. A concrete example for this is described in
  249. the use case :ref:`usecase_HCP_dataset`. While :dlcmd:`get` will retrieve file
  250. or subdataset contents from the RIA store, users will not need to bother where
  251. the data actually comes from.
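From a user's perspective, data access thus involves no RIA specifics at all;
the URL and file path below are hypothetical:

.. code-block:: bash

   # clone the superdataset via its human-readable GitLab sibling
   $ datalad clone https://gitlab.example.org/data/studyforrest.git
   $ cd studyforrest
   # file content is transparently retrieved from the RIA store
   $ datalad get sub-01/anat/sub-01_T1w.nii.gz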

Summary
"""""""

The infrastructural and workflow changes around DataLad datasets in RIA stores
improve the efficiency of the institute:
With easy local version control workflows and DataLad-based data management
routines, researchers are able to focus on science and face barely any technical
overhead for data management. As file content for analyses is obtained *on demand*
via :dlcmd:`get`, researchers selectively obtain only those data they
need, instead of keeping complete copies of datasets as before, and thus save
disk space. Upon :dlcmd:`push`, computed results and project histories
are pushed to the data store and the institute's GitLab instance, and are thus
backed-up and accessible for collaborators or supervisors. Easy-to-reobtain input
data can safely be dropped to free disk space on the compute cluster. Sensible
incentives for computing and limitations on disk space prevent unmanaged clutter.

As a RIA store consists of bare Git repositories, it is easily maintainable by
data stewards or system administrators: Common compression or cleaning operations
of Git and git-annex can be performed without requiring knowledge about the data
inside of the store, as can queries on interesting aspects of datasets,
potentially across all of the datasets of the institute.
With a remote data store setup, the compute cluster is efficiently used for
computations instead of data storage. Researchers can not only compute their
analyses faster and on larger datasets than before, but with DataLad's version
control capabilities their work also becomes more transparent, open, and
reproducible.
  276. .. rubric:: Footnotes
  277. .. [#f1] To re-read about DataLad's run-procedures, check out section
  278. :ref:`procedures`. You can find the source code of the procedure
  279. `on GitLab <https://jugit.fz-juelich.de/inm7/infrastructure/inm7-datalad/blob/master/inm7_datalad/resources/procedures/cfg_inm7.py>`_.