.. index:: ! Usecase; Collaboration
.. _usecase_collab:

A typical collaborative data management workflow
------------------------------------------------

This use case sketches the basics of a common, collaborative
data management workflow for an analysis:

#. A 3rd party dataset is obtained to serve as input for an analysis.
#. Data processing is collaboratively performed by two colleagues.
#. Upon completion, the results are published alongside the original data
   for further consumption.

The data types and methods mentioned in this use case belong to the scientific
field of neuroimaging, but the basic workflow is domain-agnostic.
The Challenge
^^^^^^^^^^^^^

Bob is a new PhD student and about to work on his first analysis.
He wants to use an open dataset as the input for his analysis, so he asks
a friend who has worked with the same dataset for the data and gets it
on a hard drive.

Later, he's stuck with his analysis. Luckily, Alice, a senior grad
student in the same lab, offers to help him. He sends his script to
her via email and hopes she finds the solution to his problem. She
responds a week later with the fixed script, but in the meantime
Bob has already made some changes to his script as well.
Identifying and integrating her fix into his slightly changed script
takes him half a day. When he finally finishes his analysis, he wants to
publish code and data online, but cannot find a way to share his data
together with his code.
The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Bob creates his analysis project as a DataLad dataset. Complying with
the :ref:`YODA principles <yoda>`,
he creates his scripts in a dedicated
``code/`` directory, and clones the open dataset as a standalone
DataLad subdataset within a dedicated subdirectory.
To collaborate with his senior grad
student Alice, he shares the dataset on the lab's SSH server, and they
can collaborate on the version-controlled dataset almost in real time,
with no need for Bob to spend much time integrating the fix that Alice
provides him with. Afterwards, Bob can execute his scripts in a way that captures
all provenance for his results with a :dlcmd:`run` command.
Bob can share his whole project after completion by creating a sibling
on a web server, and pushing all of his dataset, including the input data,
to this sibling, for everyone to access and recompute.
Step-by-Step
^^^^^^^^^^^^

Bob creates a DataLad dataset for his analysis project to live in.
Because he knows about the YODA principles, he configures the dataset
to be a YODA dataset right at the time of creation:

.. runrecord:: _examples/collab-101
   :workdir: usecases/collab
   :language: console

   $ datalad create -c yoda --description "my 1st phd project on work computer" myanalysis
After creation, there already is a ``code/`` directory, and all of its
contents are version-controlled by :term:`Git` instead of :term:`git-annex`,
thanks to the ``yoda`` procedure:

.. runrecord:: _examples/collab-102
   :workdir: usecases/collab
   :language: console

   $ cd myanalysis
   $ tree
.. index::
   pair: clone; DataLad command

Bob knows that a DataLad dataset can contain other datasets. He also knows that
as any content of a dataset is tracked and its precise state is recorded,
this is a powerful method to specify and later resolve data dependencies,
and that including the dataset as a standalone data component will also
make it easier to keep his analysis organized and share it later.
The dataset that Bob wants to work with is structural brain imaging data from the
`studyforrest project <https://www.studyforrest.org>`_, a public
data resource that the original authors share as a DataLad dataset through
:term:`GitHub`. This means that Bob can simply clone the relevant dataset from this
service into his own dataset. To do that, he clones it as a subdataset
into a directory he calls ``src/``, as he wants to make it obvious which parts
of his analysis steps and code require 3rd party data:
.. runrecord:: _examples/collab-103
   :workdir: usecases/collab/myanalysis
   :language: console

   $ datalad clone -d . https://github.com/psychoinformatics-de/studyforrest-data-structural.git src/forrest_structural
Now that he executed this command, Bob has access to the entire dataset
content, and the precise version of the dataset got linked to his top-level dataset
``myanalysis``. However, no data was actually downloaded (yet). Bob very much
appreciates that DataLad datasets primarily contain information on a dataset’s
content and where to obtain it: Cloning above was done rather
quickly, and will still be relatively lean even for a dataset that contains
several hundred GBs of data. He knows that his script can obtain the
relevant data he needs on demand if he wraps it into a :dlcmd:`run`
command, and therefore does not need to care about getting the data yet. Instead,
he focuses on writing his script ``code/run_analysis.py``.
To save this progress, he runs frequent :dlcmd:`save` commands:

.. runrecord:: _examples/collab-104
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: echo "#! /usr/bin/env python" > code/run_analysis.py && datalad save -m "First steps: start analysis script" code/run_analysis.py

   $ datalad save -m "First steps: start analysis script" code/run_analysis.py
Once Bob's analysis is finished, he can wrap it into :dlcmd:`run`.
To ease execution, he first makes his script executable by adding a :term:`shebang`
that specifies Python as the interpreter at the start of his script, and by giving it
executable :term:`permissions`:

.. runrecord:: _examples/collab-105
   :workdir: usecases/collab/myanalysis
   :language: console

   $ chmod +x code/run_analysis.py
   $ datalad save -m "make script executable"
Importantly, prior to a :dlcmd:`run`, he specifies the necessary
inputs such that DataLad can take care of the data retrieval for him:

.. runrecord:: _examples/collab-106
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: datalad run -m "run first part of analysis workflow" --input "src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz" --output results.txt "code/run_analysis.py"

   $ datalad run -m "run first part of analysis workflow" \
     --input "src/forrest_structural" \
     --output results.txt \
     "code/run_analysis.py"

This will take care of retrieving the data, running Bob's script, and
saving all outputs.
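Besides retrieving inputs and saving outputs, :dlcmd:`run` also embeds a
machine-readable run record in the resulting commit message, which is what
makes the command repeatable later on. Abbreviated, such a record looks
roughly like this (the dataset ID is elided):

.. code-block:: none

   [DATALAD RUNCMD] run first part of analysis workflow

   === Do not change lines below ===
   {
    "cmd": "code/run_analysis.py",
    "dsid": "...",
    "exit": 0,
    "inputs": ["src/forrest_structural"],
    "outputs": ["results.txt"],
    "pwd": "."
   }
   ^^^ Do not change lines above ^^^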
Some time later, Bob needs help with his analysis. He turns to his senior
grad student Alice for help. Alice and Bob both work on the same computing server.
Bob has told Alice in which directory he keeps his analysis dataset, and
the directory is configured to have :term:`permissions` that allow for
read-access for all lab-members, so Alice can obtain Bob’s work directly
from his home directory:

.. runrecord:: _examples/collab-107
   :workdir: usecases/collab
   :language: console
   :realcommand: echo "$ datalad clone "$BOBS_HOME/myanalysis" bobs_analysis" && datalad clone "myanalysis" bobs_analysis
.. runrecord:: _examples/collab-108
   :workdir: usecases/collab
   :language: console
   :realcommand: cd bobs_analysis && echo "some contribution" >> code/run_analysis.py && datalad save

   $ cd bobs_analysis
   # ... make contributions, and save them
   $ [...]
   $ datalad save -m "you're welcome, bob"

Alice can get the studyforrest data Bob used as an input as well as the
result file, but she can also rerun his analysis by using :dlcmd:`rerun`.
She goes ahead and fixes Bob's script, and saves the changes. To integrate her
changes into his dataset, Bob registers Alice's dataset as a sibling:
.. runrecord:: _examples/collab-109
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: echo "$ datalad siblings add -s alice --url '$ALICES_HOME/bobs_analysis'" && datalad siblings add -s alice --url '../bobs_analysis'

   # in Bob's home directory

Afterwards, he can get her changes with a :dlcmd:`update --merge`
command:

.. runrecord:: _examples/collab-110
   :workdir: usecases/collab/myanalysis
   :language: console

   $ datalad update -s alice --merge
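Under the hood, a sibling is a regular Git remote, and :dlcmd:`update --merge`
roughly amounts to a ``git fetch`` followed by a ``git merge``. The following
plain-Git sketch (all names and paths invented for illustration) mirrors what
just happened between Bob and Alice:

.. code-block:: bash

   # Plain-Git sketch of "siblings add" followed by "update --merge".
   # All names and paths are invented for illustration.
   set -e
   work=$(mktemp -d) && cd "$work"

   git init -q -b main bob
   (cd bob && git config user.name Bob && git config user.email bob@example.org &&
    git commit -q --allow-empty -m "initial state of Bob's dataset")

   git clone -q bob alice
   (cd alice && git config user.name Alice && git config user.email alice@example.org &&
    echo "fix" > script.py && git add script.py && git commit -q -m "Alice's fix")

   cd bob
   git remote add alice ../alice   # cf. datalad siblings add -s alice --url ...
   git fetch -q alice              # cf. datalad update -s alice
   git merge -q alice/main         # cf. the --merge part
   cat script.py                   # Bob now has Alice's fix
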
.. index::
   pair: create-sibling; DataLad command

Finally, when Bob is ready to share his results with the world or a remote
collaborator, he makes his dataset available by uploading it to a web server
via SSH. Bob does so by creating a sibling for the dataset on the server, to
which the dataset can be published and later also updated.

.. code-block:: bash

   # this generates a sibling for the dataset and all its subdatasets
   $ datalad create-sibling --recursive -s public "$SERVER_URL"

Once the remote sibling is created and registered under the name “public”,
Bob can publish his version to it.

.. code-block:: bash

   $ datalad push -r --to public .
This workflow allowed Bob to obtain data, collaborate with Alice, and publish
or share his dataset with others easily -- he cannot wait for his next project,
given that this workflow made his life so simple.