.. index:: ! Usecase; Basic Reproducible Neuroimaging
.. _usecase_reproduce_neuroimg_simple:

A basic automatically and computationally reproducible neuroimaging analysis
----------------------------------------------------------------------------

This use case sketches the basics of a portable analysis of public neuroimaging data
that anyone can reproduce automatically and computationally:

#. Public open data stems from :term:`the DataLad superdataset ///`.
#. Automatic data retrieval can be ensured by using DataLad's commands in the analysis scripts, or the ``--input`` specification of :dlcmd:`run`.
#. Analyses are executed using :dlcmd:`run` and :dlcmd:`rerun` commands to capture everything relevant to reproduce the analysis.
#. The final dataset can be kept as lightweight as possible by dropping input that can be easily reobtained.
#. A complete reproduction of the computation (including input retrieval) is possible with a single :dlcmd:`rerun` command.

This use case is a specialization of :ref:`usecase_reproducible_paper`, and a simpler version of :ref:`usecase_reproduce_neuroimg`:
It is a data analysis that requires and creates large data files, uses specialized analysis software, and is fully automated using solely DataLad commands and tools.
While the exact data types, analysis methods, and software mentioned in this use case belong to the scientific field of neuroimaging, the basic workflow is domain-agnostic.

The Challenge
^^^^^^^^^^^^^

Creating reproducible (scientific) analyses seems to require so much:
One needs to share data, scripts, results, and instructions on how to use data and scripts to obtain the results.
A researcher at any stage of their career can struggle to remember which scripts need to be run in which order, or to write comprehensible instructions for others on where and how to obtain the data and when to run which script.
This leads to failed replications, a loss of confidence in results, and major time requirements for anyone trying to reproduce others' or even their own analyses.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Scientific studies should be reproducible, and with the increasing accessibility of data, there is little excuse left for a lack of reproducibility.
DataLad can help with the technical aspects of reproducible science.

For neuroscientific studies, :term:`the DataLad superdataset ///` provides unified access to a large amount of data.
Using it to install datasets into an analysis superdataset makes it easy to share this data together with the analysis.
By ensuring that all relevant data is downloaded with :dlcmd:`get`, either through DataLad's command line tools in the analysis scripts or through ``--input`` specifications in a :dlcmd:`run`, an analysis can retrieve all required inputs fully automatically during execution.
Recording executed commands with :dlcmd:`run` makes it possible to rerun complete analysis workflows with a single command, even if the input data does not exist locally.
Combining these three steps makes it possible to share fully automatically reproducible analyses as lightweight datasets.

Step-by-Step
^^^^^^^^^^^^

It always starts with a dataset:

.. runrecord:: _examples/repro-101
   :language: console
   :workdir: usecases/repro

   $ datalad create -c yoda demo

For this demo we are using two public brain imaging datasets that were published on `OpenFMRI.org <https://legacy.openfmri.org>`_, and are available from :term:`the DataLad superdataset ///` (datasets.datalad.org).
When installing datasets from this superdataset, we can use its abbreviation ``///``.
The two datasets, `ds000001 <https://legacy.openfmri.org/dataset/ds000001>`_ and `ds000002 <https://legacy.openfmri.org/dataset/ds000002>`_, are installed into the subdirectory ``inputs/``.

.. runrecord:: _examples/repro-102
   :language: console
   :workdir: usecases/repro

   $ cd demo
   $ datalad clone -d . ///openfmri/ds000001 inputs/ds000001

.. runrecord:: _examples/repro-103
   :language: console
   :workdir: usecases/repro

   $ cd demo
   $ datalad clone -d . ///openfmri/ds000002 inputs/ds000002

Both datasets are now registered as subdatasets, and their precise versions (e.g., in the form of the commit shasum of the latest commit) are on record:

.. runrecord:: _examples/repro-104
   :language: console
   :workdir: usecases/repro/demo

   $ datalad --output-format '{path}: {gitshasum}' subdatasets

DataLad datasets are fairly lightweight in size: in their minimal form, they only contain pointers to data and history information.
Thus, very little data has actually been downloaded so far:

.. runrecord:: _examples/repro-105
   :language: console
   :workdir: usecases/repro/demo

   $ du -sh inputs/

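
What such a pointer looks like on disk can be sketched without DataLad. The snippet below assumes the default symlink-based git-annex layout (not an adjusted or unlocked branch), in which a file whose content has not been retrieved is a symlink with a missing target. The file name, the placeholder key, and the ``content_present`` helper are made up for illustration only.

```bash
# Hypothetical stand-in for an annexed file whose content was not fetched:
# a symlink whose target does not exist locally.
demo=$(mktemp -d)
ln -s .git/annex/objects/placeholder-key "$demo/sub-01_T1w.nii.gz"

# Succeeds only if the file content is available locally
# ([ -e ] follows symlinks, so a dangling link counts as "not present").
content_present() {
    [ -e "$1" ]
}

if [ -L "$demo/sub-01_T1w.nii.gz" ] && ! content_present "$demo/sub-01_T1w.nii.gz"; then
    echo "pointer only: content not yet retrieved"
fi
```

Because only the link itself is stored, a dataset with many such files stays small until their content is requested.
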
Once their content is downloaded, both datasets would be several gigabytes in size:

.. runrecord:: _examples/repro-106
   :language: console
   :workdir: usecases/repro/demo

   $ datalad -C inputs/ds000001 status --annex
   $ datalad -C inputs/ds000002 status --annex

Both datasets contain brain imaging data, and are compliant with the `BIDS standard <https://bids.neuroimaging.io>`_.
This makes it really easy to locate particular images and perform analysis across datasets.

Here we will use a small script that performs "brain extraction" using `FSL <https://fsl.fmrib.ox.ac.uk>`__ as a stand-in for a full analysis pipeline.
The script will be stored inside of the ``code/`` directory that the yoda procedure created when the dataset was created.

.. runrecord:: _examples/repro-107
   :language: console
   :workdir: usecases/repro/demo
   :emphasize-lines: 6

   $ cat << EOT > code/brain_extraction.sh
   # enable FSL
   . /etc/fsl/5.0/fsl.sh

   # obtain all inputs
   datalad get \$@
   # perform brain extraction
   count=1
   for nifti in \$@; do
     subdir="sub-\$(printf %03d \$count)"
     mkdir -p \$subdir
     echo "Processing \$nifti"
     bet \$nifti \$subdir/anat -m
     count=\$((count + 1))
   done
   EOT

Note that this script uses the :dlcmd:`get` command, which automatically obtains the required files from their remote source. We will see this in action shortly.

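
The output naming in the script is easy to miss: result directories are numbered by the position of each input on the command line, not by the subject label in the input path. The following dry run is a minimal sketch of that loop. It only prints the ``bet`` calls the script would make, so neither FSL nor DataLad is needed. The function name ``plan_extraction`` is hypothetical and not part of the use case.

```bash
# Dry-run sketch of the naming scheme in code/brain_extraction.sh.
# Prints the bet command for each input instead of executing it.
plan_extraction() {
    local count nifti subdir
    count=1
    for nifti in "$@"; do
        # Same zero-padded counter as in the script: sub-001, sub-002, ...
        subdir="sub-$(printf %03d "$count")"
        echo "bet $nifti $subdir/anat -m"
        count=$((count + 1))
    done
}

plan_extraction \
    inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz \
    inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
# prints:
# bet inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz sub-001/anat -m
# bet inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz sub-002/anat -m
```

Both inputs are called ``sub-01`` in their own dataset, so the counter is what keeps their results apart.
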
We are saving this script in the dataset. This way, we will know exactly which code was used for the analysis.
Everything inside of ``code/`` is tracked with Git thanks to the yoda procedure, so we can see more easily how it was edited over time.
In addition, we will "tag" this state of the dataset with the tag ``setup_done`` to mark the repository state at which the analysis script was completed.
This is optional, but it can help to identify important milestones more easily.

.. runrecord:: _examples/repro-108
   :language: console
   :workdir: usecases/repro/demo

   $ datalad save --version-tag setup_done -m "Brain extraction script" code/brain_extraction.sh

Now we can run our analysis code to produce results. However, instead of running it directly, we will run it with DataLad, which automatically creates a record of exactly how this script was executed.

For this demo we will just run it on the structural images (T1w) of the first subject (sub-01) from each dataset.
The uniform structure of the datasets makes this very easy.
Of course we could run it on all subjects; we are simply saving some time for this demo.
While the command runs, you should notice a few things:

1) We run this command with ``bash -e`` to stop at any failure that may occur.
2) You'll see the required data files being obtained as they are needed, and only those that are actually required will be downloaded. This is due to the appropriate ``--input`` specification of the :dlcmd:`run`; but as a :dlcmd:`get` is also included in the bash script, forgetting an ``--input`` specification would not be a problem.

.. runrecord:: _examples/repro-109
   :language: console
   :workdir: usecases/repro/demo

   $ datalad run -m "run brain extract workflow" \
     --input "inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz" \
     --output "sub-*/anat" \
     bash -e code/brain_extraction.sh inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz

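
Note the difference in quoting in this call: the ``--input`` pattern is quoted, whereas the pattern handed to the script is not. A quoted pattern reaches :dlcmd:`run` as one literal argument, which DataLad can then expand itself, while the shell expands an unquoted pattern before the command starts. The sketch below illustrates only the shell side of this. The directory tree is a throwaway stand-in for the two datasets, filled with empty placeholder files, and ``count_args`` is a hypothetical helper.

```bash
# Build a throwaway tree that mimics the layout of the two datasets
# (empty placeholder files, not real images).
sandbox=$(mktemp -d)
for ds in ds000001 ds000002; do
    for sub in sub-01 sub-02; do
        mkdir -p "$sandbox/inputs/$ds/$sub/anat"
        : > "$sandbox/inputs/$ds/$sub/anat/${sub}_T1w.nii.gz"
    done
done
cd "$sandbox" || exit 1

# Hypothetical helper: print how many arguments were received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the pattern into one path per dataset.
count_args inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz      # prints 2
# Quoted: the pattern is passed on verbatim as a single argument.
count_args "inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz"    # prints 1
```
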
The analysis step is done, and all generated results were saved in the dataset.
All changes, including the command that caused them, are on record:

.. runrecord:: _examples/repro-110
   :language: console
   :workdir: usecases/repro/demo

   $ git show --stat

DataLad has enough information stored to be able to rerun a command.
When the rerun command exits, DataLad inspects the results and saves them again, but only if they differ from the recorded ones.
In our case, the rerun yields bit-identical results, hence nothing new is saved.

.. runrecord:: _examples/repro-111
   :language: console
   :workdir: usecases/repro/demo

   $ datalad rerun

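
The "save only if different" idea can be illustrated with a plain byte-wise comparison. This is a sketch of the concept only, not of how DataLad detects changes (it relies on the content tracking of Git and git-annex). The file names and the ``same_content`` helper are hypothetical.

```bash
# Hypothetical helper: succeed if two files are byte-for-byte identical.
same_content() {
    cmp -s "$1" "$2"
}

workdir=$(mktemp -d)
printf 'mask-v1' > "$workdir/first_run.out"
printf 'mask-v1' > "$workdir/second_run.out"
printf 'mask-v2' > "$workdir/changed_run.out"

if same_content "$workdir/first_run.out" "$workdir/second_run.out"; then
    echo "identical: nothing new to save"
fi
if ! same_content "$workdir/first_run.out" "$workdir/changed_run.out"; then
    echo "different: a new version would be saved"
fi
```
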
Now that we are done, and have checked that we can reproduce the results ourselves, we can clean up.
Using :dlcmd:`diff` and the tag we provided, DataLad can easily verify whether any part of our input datasets was modified since we configured our analysis:

.. runrecord:: _examples/repro-112
   :language: console
   :workdir: usecases/repro/demo

   $ datalad diff setup_done inputs

Nothing was changed.

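
The role of the tag in such a check can be sketched with plain Git in a throwaway repository. The following is an illustration under stated assumptions, not a replacement for :dlcmd:`diff`: it does not involve subdatasets or annexed content, and the repository, the file ``inputs/data.txt``, and the helpers ``git_c`` and ``inputs_changed_since`` are all made up for this example.

```bash
# Throwaway repository with a tagged state, mirroring the setup_done tag.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Pass an identity on the command line so the commit works on any machine.
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

mkdir inputs
echo "v1" > inputs/data.txt
git add inputs
git_c commit -q -m "set up analysis"
git_c tag setup_done

# Succeeds if anything under inputs/ differs from the given tag.
# (git diff --quiet exits non-zero when there are differences.)
inputs_changed_since() {
    ! git diff --quiet "$1" -- inputs
}

inputs_changed_since setup_done || echo "inputs unchanged since setup_done"
echo "v2" > inputs/data.txt
inputs_changed_since setup_done && echo "inputs modified since setup_done"
```
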
With DataLad we don't have to keep those inputs around, and we do not lose the ability to reproduce the analysis.
Let's uninstall them, and check the size on disk before and after.

.. runrecord:: _examples/repro-113
   :language: console
   :workdir: usecases/repro/demo

   $ du -sh

.. runrecord:: _examples/repro-114
   :language: console
   :workdir: usecases/repro/demo

   $ datalad uninstall inputs/*

.. runrecord:: _examples/repro-115
   :language: console
   :workdir: usecases/repro/demo

   $ du -sh

The dataset is substantially smaller as all inputs are gone…

.. runrecord:: _examples/repro-116
   :language: console
   :workdir: usecases/repro/demo

   $ ls inputs/*

But as these inputs were registered in the dataset when we installed them, getting them back is very easy.
Only the remaining data (our code and the results) needs to be kept and backed up for long-term archival.
Everything else can be reobtained as needed, when needed.

As DataLad knows everything needed about the inputs, including where to get the right version, we can rerun the analysis with a single command.
Watch how DataLad reobtains all required data, reruns the code, and checks that none of the results changed and need saving.

.. runrecord:: _examples/repro-117
   :language: console
   :workdir: usecases/repro/demo

   $ datalad rerun

Reproduced!

This dataset could now be published and shared as a lightweight yet fully reproducible resource, enabling anyone to replicate the exact same analysis with a single command.
Public data and reproducible execution for the win!

Note though that reproducibility can and should go further: With more complex software dependencies, it becomes necessary to keep track of the software environment involved in the analysis as well.
If you are curious about how to do this, read on into :ref:`usecase_reproduce_neuroimg`.