- .. index:: ! Usecase; Collaboration
- .. _usecase_collab:
- A typical collaborative data management workflow
- ------------------------------------------------
- This use case sketches the basics of a common, collaborative
- data management workflow for an analysis:
- #. A 3rd party dataset is obtained to serve as input for an analysis.
- #. Data processing is collaboratively performed by two colleagues.
- #. Upon completion, the results are published alongside the original data
- for further consumption.
- The data types and methods mentioned in this use case belong to the scientific
- field of neuroimaging, but the basic workflow is domain-agnostic.
- The Challenge
- ^^^^^^^^^^^^^
- Bob is a new PhD student and about to work on his first analysis.
- He wants to use an open dataset as the input for his analysis, so he asks
- a friend who has worked with the same dataset for the data and gets it
- on a hard drive.
- Later, he's stuck with his analysis. Luckily, Alice, a senior grad
- student in the same lab, offers to help him. He sends his script to
- her via email and hopes she finds the solution to his problem. She
- responds a week later with the fixed script, but in the meantime
- Bob already performed some miscellaneous changes to his script as well.
- Identifying and integrating her fix into his slightly changed script
- takes him half a day. When he finally finishes his analysis, he wants to
- publish code and data online, but cannot find a way to share his data
- together with his code.
- The DataLad Approach
- ^^^^^^^^^^^^^^^^^^^^
- Bob creates his analysis project as a DataLad dataset. Complying with
- the :ref:`YODA principles <yoda>`,
- he creates his scripts in a dedicated
- ``code/`` directory, and clones the open dataset as a standalone
- DataLad subdataset within a dedicated subdirectory.
- To collaborate with Alice, a senior grad
- student in his lab, he shares the dataset on the lab's SSH server, and they
- can collaborate on the version controlled dataset almost in real time
- with no need for Bob to spend much time integrating the fix that Alice
- provides him with. Afterwards, Bob can execute his scripts in a way that captures
- all provenance for his results with a :dlcmd:`run` command.
- Bob can share his whole project after completion by creating a sibling
- on a web server, and pushing all of his dataset, including the input data,
- to this sibling, for everyone to access and recompute.
- Step-by-Step
- ^^^^^^^^^^^^
- Bob creates a DataLad dataset for his analysis project to live in.
- Because he knows about the YODA principles, he configures the dataset
- to be a YODA dataset right at the time of creation:
- .. runrecord:: _examples/collab-101
- :workdir: usecases/collab
- :language: console
- $ datalad create -c yoda --description "my 1st phd project on work computer" myanalysis
- After creation, there already is a ``code/`` directory, and all of its
- contents are version-controlled by :term:`Git` instead of :term:`git-annex`
- thanks to the yoda procedure:
- .. runrecord:: _examples/collab-102
- :workdir: usecases/collab
- :language: console
- $ cd myanalysis
- $ tree
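One way to check the effect of the yoda procedure is to ask Git which ``annex.largefiles`` rule applies to a path. The following is a minimal sketch in a scratch repository, not in Bob's dataset. It assumes that the rule written by the procedure takes the form ``* annex.largefiles=nothing`` in ``code/.gitattributes``, and it writes that rule by hand so that the example needs only Git. The scratch path is hypothetical.

```shell
set -eu
# Scratch repository standing in for the dataset (hypothetical path).
scratch=$(mktemp -d)
git init -q "$scratch/myanalysis"
mkdir -p "$scratch/myanalysis/code"
# Assumed form of the rule that the yoda procedure places in code/:
# "nothing" means no file in code/ is treated as a large file by git-annex.
printf '* annex.largefiles=nothing\n' > "$scratch/myanalysis/code/.gitattributes"
# Ask Git which value applies to the (not yet existing) analysis script.
attr=$(git -C "$scratch/myanalysis" check-attr annex.largefiles -- code/run_analysis.py)
echo "$attr"
```

Running the same ``git check-attr`` call inside the real ``myanalysis`` dataset should report whichever rule is actually in effect there.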
- .. index::
- pair: clone; DataLad command
- Bob knows that a DataLad dataset can contain other datasets. He also knows that
- as any content of a dataset is tracked and its precise state is recorded,
- this is a powerful method to specify and later resolve data dependencies,
- and that including the dataset as a standalone data component will also
- make it easier to keep his analysis organized and share it later.
- The dataset that Bob wants to work with is structural brain imaging data from the
- `studyforrest project <https://www.studyforrest.org>`_, a public
- data resource that the original authors share as a DataLad dataset through
- :term:`GitHub`. This means that Bob can simply clone the relevant dataset from this
- service and into his own dataset. To do that, he clones it as a subdataset
- into a directory he calls ``src/`` as he wants to make it obvious which parts
- of his analysis steps and code require 3rd party data:
- .. runrecord:: _examples/collab-103
- :workdir: usecases/collab/myanalysis
- :language: console
- $ datalad clone -d . https://github.com/psychoinformatics-de/studyforrest-data-structural.git src/forrest_structural
- Now that he has executed this command, Bob has access to the entire dataset
- content, and the precise version of the dataset is linked to his top-level dataset
- ``myanalysis``. However, no data was actually downloaded (yet). Bob very much
- appreciates that DataLad datasets primarily contain information on a dataset’s
- content and where to obtain it: the clone above finished
- quickly, and a clone stays relatively lean even for a dataset that contains
- several hundred GBs of data. He knows that his script can obtain the
- relevant data he needs on demand if he wraps it into a :dlcmd:`run`
- command and therefore does not need to care about getting the data yet. Instead,
- he focuses on writing his script ``code/run_analysis.py``.
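The link between ``myanalysis`` and the cloned dataset can be inspected with plain Git, because DataLad registers a subdataset as a Git submodule. The sketch below recreates this with two throwaway repositories. The ``upstream`` repository, the commit identity, and all paths are hypothetical stand-ins, and ``git submodule add`` serves only as an analogue of ``datalad clone -d .``, which records additional DataLad-specific metadata.

```shell
set -eu
# Throwaway identity so that commits work without any Git configuration.
export GIT_AUTHOR_NAME=Bob GIT_AUTHOR_EMAIL=bob@example.com
export GIT_COMMITTER_NAME=Bob GIT_COMMITTER_EMAIL=bob@example.com
work=$(mktemp -d)
# Stand-in for the public input dataset, with a single commit.
git init -q "$work/upstream"
echo "input data" > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" commit -q -m "Add input data"
# Stand-in for Bob's analysis dataset.
git init -q "$work/myanalysis"
# Register the input repository below src/ (recent Git versions require
# the file protocol to be allowed explicitly for local-path submodules).
git -C "$work/myanalysis" -c protocol.file.allow=always \
    submodule --quiet add "$work/upstream" src/forrest_structural
git -C "$work/myanalysis" commit -q -m "Link input data"
# .gitmodules stores path and URL; the superproject pins one exact commit.
cat "$work/myanalysis/.gitmodules"
pinned=$(git -C "$work/myanalysis" ls-files -s src/forrest_structural | awk '{print $2}')
echo "pinned commit: $pinned"
```

In a real DataLad dataset, ``datalad subdatasets`` reports the same registration.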
- To save this progress, he runs frequent :dlcmd:`save` commands:
- .. runrecord:: _examples/collab-104
- :workdir: usecases/collab/myanalysis
- :language: console
- :realcommand: echo "#! /usr/bin/env python" > code/run_analysis.py && datalad save -m "First steps: start analysis script" code/run_analysis.py
- $ datalad save -m "First steps: start analysis script" code/run_analysis.py
- Once Bob's analysis is finished, he can wrap it into :dlcmd:`run`.
- To ease execution, he first makes his script executable by adding a :term:`shebang`
- that specifies Python as an interpreter at the start of his script, and giving it
- executable :term:`permissions`:
- .. runrecord:: _examples/collab-105
- :workdir: usecases/collab/myanalysis
- :language: console
- $ chmod +x code/run_analysis.py
- $ datalad save -m "make script executable"
- Importantly, prior to a :dlcmd:`run`, he specifies the necessary
- inputs such that DataLad can take care of the data retrieval for him:
- .. runrecord:: _examples/collab-106
- :workdir: usecases/collab/myanalysis
- :language: console
- :realcommand: datalad run -m "run first part of analysis workflow" --input "src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz" --output results.txt "code/run_analysis.py"
- $ datalad run -m "run first part of analysis workflow" \
- --input "src/forrest_structural" \
- --output results.txt \
- "code/run_analysis.py"
- This will take care of retrieving the data, running Bob's script, and
- saving all outputs.
- Some time later, Bob needs help with his analysis and turns to Alice, a senior
- grad student in his lab. Alice and Bob both work on the same computing server.
- Bob has told Alice in which directory he keeps his analysis dataset, and
- the directory is configured to have :term:`permissions` that allow for
- read-access for all lab-members, so Alice can obtain Bob’s work directly
- from his home directory:
- .. runrecord:: _examples/collab-107
- :workdir: usecases/collab
- :language: console
- :realcommand: echo "$ datalad clone "$BOBS_HOME/myanalysis" bobs_analysis" && datalad clone "myanalysis" bobs_analysis
- .. runrecord:: _examples/collab-108
- :workdir: usecases/collab
- :language: console
- :realcommand: cd bobs_analysis && echo "some contribution" >> code/run_analysis.py && datalad save
- $ cd bobs_analysis
- # ... make contributions, and save them
- $ [...]
- $ datalad save -m "you're welcome, bob"
- Alice can get the studyforrest data Bob used as an input as well as the
- result file, but she can also rerun his analysis by using :dlcmd:`rerun`.
- She goes ahead and fixes Bob's script, and saves the changes. To integrate her
- changes into his dataset, Bob registers Alice's dataset as a sibling:
- .. runrecord:: _examples/collab-109
- :workdir: usecases/collab/myanalysis
- :language: console
- :realcommand: echo "$ datalad siblings add -s alice --url '$ALICES_HOME/bobs_analysis'" && datalad siblings add -s alice --url '../bobs_analysis'
- # in Bob's home directory
- Afterwards, he can get her changes with a :dlcmd:`update --merge`
- command:
- .. runrecord:: _examples/collab-110
- :workdir: usecases/collab/myanalysis
- :language: console
- $ datalad update -s alice --merge
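Because a DataLad dataset is a Git repository, the sibling-and-merge step can be approximated with plain Git. The sketch below uses throwaway repositories and a hypothetical commit identity. It is only an analogue of the commands above, under the assumption that the relevant part of ``datalad update --merge`` is a fetch followed by a merge, and it leaves out the git-annex bookkeeping that DataLad handles in addition.

```shell
set -eu
# Throwaway identity so that commits work without any Git configuration.
export GIT_AUTHOR_NAME=Lab GIT_AUTHOR_EMAIL=lab@example.com
export GIT_COMMITTER_NAME=Lab GIT_COMMITTER_EMAIL=lab@example.com
work=$(mktemp -d)
# Bob's repository with a first version of the script.
git init -q "$work/myanalysis"
mkdir "$work/myanalysis/code"
echo "#! /usr/bin/env python" > "$work/myanalysis/code/run_analysis.py"
git -C "$work/myanalysis" add code
git -C "$work/myanalysis" commit -q -m "First steps: start analysis script"
# Alice clones it and saves a fix in her own copy.
git clone -q "$work/myanalysis" "$work/bobs_analysis"
echo "# some contribution" >> "$work/bobs_analysis/code/run_analysis.py"
git -C "$work/bobs_analysis" commit -q -a -m "you're welcome, bob"
# Bob registers Alice's copy under the name "alice", then fetches and merges.
git -C "$work/myanalysis" remote add alice "$work/bobs_analysis"
git -C "$work/myanalysis" fetch -q alice
branch=$(git -C "$work/myanalysis" symbolic-ref --short HEAD)
git -C "$work/myanalysis" merge -q "alice/$branch"
# The last line of Bob's script now carries Alice's change.
tail -n 1 "$work/myanalysis/code/run_analysis.py"
```

Since Bob made no commits of his own in this sketch, the merge is a fast-forward. If both had changed the script, Git would attempt a regular merge and report any conflicts for Bob to resolve.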
- .. index::
- pair: create-sibling; DataLad command
- Finally, when Bob is ready to share his results with the world or a remote
- collaborator, he makes his dataset available by uploading it to a web server
- via SSH. Bob does so by creating a sibling for the dataset on the server, to
- which the dataset can be published and later also updated.
- .. code-block:: bash
- # this generates a sibling for the dataset and all subdatasets
- $ datalad create-sibling --recursive -s public "$SERVER_URL"
- Once the remote sibling is created and registered under the name “public”,
- Bob can publish his version to it.
- .. code-block:: bash
- $ datalad push -r --to public .
- This workflow allowed Bob to obtain data, collaborate with Alice, and publish
- or share his dataset with others easily -- he cannot wait for his next project,
- given that this workflow made his life so simple.