
.. index:: ! Usecase; Reproducible Neuroimaging
.. _usecase_reproduce_neuroimg:

An automatically and computationally reproducible neuroimaging analysis from scratch
------------------------------------------------------------------------------------

This use case sketches the basics of a portable analysis that can be
automatically computationally reproduced, starting from the
acquisition of a neuroimaging dataset with a magnetic resonance imaging (MRI)
scanner up to complete data analysis results:
#. Two extension packages, `datalad-container <https://github.com/datalad/datalad-container>`_
   and `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_,
   extend DataLad's functionality with the ability to work with computational
   containers and neuroimaging data workflows.
#. The analysis is conducted in a way that leaves comprehensive provenance
   (including software environments) all the way from the raw data, and
   structures study components in a way that facilitates reuse.
#. It starts with preparing a raw data (DICOM) dataset, and subsequently uses
   the prepared data for a general linear model (GLM) based analysis.
#. After completion, data and results are archived, and disk usage of the
   dataset is maximally reduced.

This use case is adapted from the
`ReproIn/DataLad tutorial <https://www.repronim.org/ohbm2018-training/03-01-reproin>`_
by Michael Hanke and Yaroslav Halchenko, given at the 2018 OHBM training course
run by `ReproNim <https://www.repronim.org>`_.
The Challenge
^^^^^^^^^^^^^

Allan is an exemplary neuroscientist and researcher. He has spent countless
hours diligently learning not only the statistical methods for his research
questions and the software tools for his computations, but has also taught
himself about version control and data standards in neuroimaging, such as
the Brain Imaging Data Structure (`BIDS <https://bids.neuroimaging.io>`_).
For his final PhD project, he patiently acquires functional MRI data of many
participants, and prepares it according to the BIDS standard afterwards.
It takes him a full week and two failed attempts, but he
eventually has a `BIDS-compliant <https://bids-standard.github.io/bids-validator>`_
dataset.

When he writes his analysis scripts, he takes extra care to responsibly
version control every change. He happily notices how much cleaner his
directories are, and how he and others can transparently see how his code
evolved. Once everything is set up, he runs his analysis using large and
complex neuroscientific software packages that he installed on his computer a
few years back. Finally, he writes a paper and publishes his findings in a
prestigious peer-reviewed journal. His data and code can easily be accessed
by others, as he makes them publicly available. Colleagues and
supervisors admire him for his wonderful contribution to open science.
However, a few months after publication, Allan starts to receive
emails that report that his scripts do not produce the same
results as the ones reported in the publication. Startled and confused, he
investigates what the issue may be. After many sleepless nights he realizes:
the software he used was fairly old! More recent versions of the same
software compute slightly different results, have renamed functions, or
have fixed bugs in the underlying source code. Shocked, he realizes that
his scripts are even incompatible with the most recent release of the software
package he used and throw an error. Luckily, he can quickly mitigate this by
adding information about the required software versions to the ``README`` of his
project, and he is grateful for colleagues and other scientists who provide
adjusted versions of his code for more recent software releases. In the end,
his results prove to be robust regardless of software version. But while
Allan shared code and data, not including any information about his *software*
environment prevented his analysis from being *computationally*
reproducible.
The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Even if an analysis workflow is fully captured and version-controlled, and
data and code are linked, an analysis may not reproduce. Comprehensive
*computational* reproducibility requires that the *software* involved
in an analysis, and its precise versions, be known as well.
DataLad can help with this. Using the ``datalad-container`` extension,
complete software environments can be captured in computational containers,
added to (and thus shared together with) datasets, and linked with the commands
and outputs they were used for.
Step-by-Step
^^^^^^^^^^^^

The first part of this Step-by-Step guide details how to perform data
preparation from raw `DICOM <https://www.dicomstandard.org>`_ files to
BIDS-compliant `NIfTI <https://nifti.nimh.nih.gov>`_ images in an
automatically and computationally reproducible way. The actual analysis, a
first-level GLM for a localization task, is performed in the second part. A
final paragraph shows how to prepare the dataset for the afterlife.

For this use case, two DataLad extensions are required:

- `datalad-container <https://github.com/datalad/datalad-container>`_ and
- `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_

You can install them via ``pip`` like this::

   $ pip install datalad-neuroimaging datalad-container
Data Preparation
""""""""""""""""

We start by creating a home for the raw data:

.. runrecord:: _examples/repro2-101
   :language: console
   :workdir: usecases/repro2

   $ datalad create localizer_scans
   $ cd localizer_scans
For this example, we use a number of publicly available DICOM files. Luckily,
at the time of data acquisition, these DICOMs were already equipped with the
relevant metadata: their headers contain all necessary information to
identify the purpose of individual scans and encode essential properties to
create a BIDS-compliant dataset from them. The DICOMs are stored on GitHub
(as a Git repository [#f1]_), so they can be installed as a subdataset. As
they are the raw inputs of the analysis, we store them in a directory we call
``inputs/rawdata``.

.. runrecord:: _examples/repro2-102
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad clone --dataset . \
     https://github.com/datalad/example-dicom-functional.git \
     inputs/rawdata
The :dlcmd:`subdatasets` command reports the installed dataset to be indeed
a subdataset of the superdataset ``localizer_scans``:

.. runrecord:: _examples/repro2-103
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad subdatasets
Given that we have obtained *raw* data, it is not yet ready for data
analysis. Prior to performing actual computations, the data needs to be
transformed into appropriate formats and standardized to an intuitive layout.
For neuroimaging, a useful transformation is the conversion of
DICOMs into the NIfTI format, a format specifically designed for scientific
analyses of brain images. An intuitive layout is the BIDS standard.
Performing these transformations and standardizations, however, requires
software. For the task at hand, `HeudiConv <https://heudiconv.readthedocs.io>`_,
a DICOM converter, is our software of choice. Beyond converting DICOMs, it
also provides assistance in converting a raw dataset to the BIDS standard,
and it integrates with DataLad to place converted and original data under
Git/git-annex version control, while automatically annotating files with
sensitive information (e.g., non-defaced anatomical images).
To know exactly what software is used, both to be able to go back to it
at a later stage should we need to investigate an issue, and to capture
the *full* provenance of the transformation process, we use a software
container that contains the relevant software setup.
A ready-made `Singularity <https://singularity.lbl.gov>`_ container is
available from `singularity-hub <https://singularity-hub.org>`_ at
``shub://ReproNim/ohbm2018-training:heudiconvn``.
Using the :dlcmd:`containers-add` command we can add this container
to the ``localizer_scans`` superdataset. We give it the name
``heudiconv``.
.. runrecord:: _examples/repro2-104
   :language: console
   :workdir: usecases/repro2/localizer_scans
   :realcommand: datalad containers-add heudiconv --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}' --url shub://ReproNim/ohbm2018-training:heudiconvn

   $ datalad containers-add heudiconv --url shub://ReproNim/ohbm2018-training:heudiconvn

The command :dlcmd:`containers-list` can verify that this worked:

.. runrecord:: _examples/repro2-105
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad containers-list
Great. The dataset now tracks all of the input data *and* the computational
environment for the DICOM conversion. Thus far, we have a complete record of
all components. Let's stay transparent, but also automatically reproducible,
in the actual data conversion by wrapping the necessary ``heudiconv`` command
seen below::

   $ heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . \
     -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms

within a :dlcmd:`containers-run` command.
To save time, we will only convert one subject's data (``sub-02``, hence the
subject identifier ``-s 02`` in the command). Note that the output below is
indeed how it should look -- the software we are using in this example
produces very verbose output.
.. runrecord:: _examples/repro2-106
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad containers-run -m "Convert sub-02 DICOMs into BIDS" \
     --container-name heudiconv \
     'heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms'

Find out what changed after this command by comparing the most recent commit
by DataLad (i.e., ``HEAD``) to the previous one (i.e., ``HEAD~1``) with
:dlcmd:`diff`:

.. runrecord:: _examples/repro2-107
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad diff -f HEAD~1

As expected, DICOM files of one subject were converted into NIfTI files,
**and** the outputs follow the BIDS standard's layout and naming conventions!
But what's even better is that this action and the relevant software
environment were fully recorded.
There is only one thing missing before the functional imaging data can be
analyzed: a stimulation protocol, so that we know what stimulation was done
at which point during the scan.
Thankfully, the data was collected using an implementation that exported
this information directly in the BIDS ``events.tsv`` format. The file came with
our DICOM dataset and can be found at ``inputs/rawdata/events.tsv``. All we need
to do is copy it to the right location under the BIDS-mandated name. To keep
track of where this file came from, we will also wrap the copying into a
:dlcmd:`run` command. The ``{inputs}`` and ``{outputs}``
placeholders can help to avoid duplication in the command call:
.. runrecord:: _examples/repro2-108
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad run -m "Import stimulation events" \
     --input inputs/rawdata/events.tsv \
     --output sub-02/func/sub-02_task-oneback_run-01_events.tsv \
     cp {inputs} {outputs}
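To illustrate what the placeholders buy us, here is a plain-shell sketch of the
substitution DataLad performs before executing the command (the ``demo/`` paths
and file contents are made up for the demonstration; DataLad itself additionally
retrieves inputs and unlocks outputs before running):

```shell
# Hypothetical stand-ins for the --input and --output values:
mkdir -p demo/inputs demo/sub-02/func
printf 'onset\tduration\ttrial_type\n' > demo/inputs/events.tsv

inputs="demo/inputs/events.tsv"
outputs="demo/sub-02/func/sub-02_task-oneback_run-01_events.tsv"

# 'cp {inputs} {outputs}' is expanded to roughly this:
cp "$inputs" "$outputs"
```

Because both paths appear only once, in the ``--input``/``--output``
declarations, the command line itself cannot fall out of sync with them.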
``git log`` shows what information DataLad captured about this command's
execution:

.. runrecord:: _examples/repro2-109
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ git log -n 1
Analysis execution
""""""""""""""""""

Since the raw data are reproducibly prepared in the BIDS standard, we can now
go further and conduct an analysis. For this example, we will implement a very
basic first-level GLM analysis using `FSL <https://fsl.fmrib.ox.ac.uk>`__
that takes only a few minutes to run. As before, we will capture all provenance:
inputs, computational environments, code, and outputs.

Following the YODA principles [#f2]_, the analysis is set up in a new
dataset, with the input dataset ``localizer_scans`` as a subdataset:

.. runrecord:: _examples/repro2-110
   :language: console
   :workdir: usecases/repro2/localizer_scans

   # get out of localizer_scans
   $ cd ../
   $ datalad create glm_analysis
   $ cd glm_analysis
We install ``localizer_scans`` by providing its path to :dlcmd:`clone`:

.. runrecord:: _examples/repro2-111
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad clone -d . \
     ../localizer_scans \
     inputs/rawdata

:dlcmd:`subdatasets` again reports the installed subdataset:

.. runrecord:: _examples/repro2-112
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad subdatasets
We almost forgot something really useful: structuring the dataset with
the help of DataLad! Luckily, procedures such as ``yoda`` can not only be
applied upon creation of a dataset (as in :ref:`createDS`), but also with the
:dlcmd:`run-procedure` command (as in :ref:`procedures`):

.. runrecord:: _examples/repro2-113
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad run-procedure cfg_yoda
The analysis obviously needs custom code. For the simple GLM analysis with
FSL we use:

#. A small script to convert BIDS-formatted ``events.tsv`` files into the
   ``EV3`` format FSL understands, available at
   `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh>`_
#. An FSL analysis configuration template script, available at
   `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf>`_

These scripts should be stored and tracked inside the dataset within ``code/``.
The :dlcmd:`download-url` command downloads these scripts *and*
records where they were obtained from:
.. runrecord:: _examples/repro2-114
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad download-url --path code/ \
     https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh \
     https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf

The commit message that DataLad created shows the URL where each script has
been downloaded from:

.. runrecord:: _examples/repro2-115
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ git log -n 1
Prior to the actual analysis, we need to run the ``events2ev3.sh`` script to
transform inputs into the format that FSL expects. The :dlcmd:`run` command
makes this maximally reproducible and easy, as the files given as
``--input`` and ``--output`` are automatically managed by DataLad.
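The conversion itself is conceptually simple. As a rough sketch (this is *not*
the actual ``events2ev3.sh`` script, and the event timings are made up): FSL's
three-column ``EV3`` files list one event per row as onset, duration, and
weight, so each ``trial_type`` in a BIDS ``events.tsv`` is filtered into its own
three-column file:

```shell
# Made-up events.tsv with the minimal BIDS columns (tab-separated):
printf 'onset\tduration\ttrial_type\n0.0\t15.0\tface\n30.0\t15.0\tface\n' > demo_events.tsv

# Pick one condition; emit onset, duration, and a weight of 1 per event:
awk -F'\t' 'NR > 1 && $3 == "face" {printf "%s\t%s\t1\n", $1, $2}' \
    demo_events.tsv > face.txt

cat face.txt
```

The real script does this for every condition and subject, writing one file per
condition into the ``sub-02/onsets`` output directory.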
.. runrecord:: _examples/repro2-116
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad run -m 'Build FSL EV3 design files' \
     --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv \
     --output 'sub-02/onsets' \
     bash code/events2ev3.sh sub-02 {inputs}

The dataset now contains and manages all of the required inputs, and we're
ready for FSL. Since FSL is not a simple program, we make sure to record the
precise software environment for the analysis with
:dlcmd:`containers-run`. First, we get a container with FSL in the
version we require:
.. runrecord:: _examples/repro2-117
   :workdir: usecases/repro2/glm_analysis
   :language: console
   :realcommand: datalad containers-add fsl --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}' --url shub://mih/ohbm2018-training:fsl

   $ datalad containers-add fsl --url shub://mih/ohbm2018-training:fsl

As the analysis setup is now complete, let's label this state of the dataset:

.. runrecord:: _examples/repro2-118
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad save --version-tag ready4analysis
All we have left is to configure the desired first-level GLM analysis with
FSL. At this point, the template contains placeholders for the ``basepath``
and the subject ID, and they need to be replaced.
The following command uses the arcane, yet powerful :term:`sed` editor to do
this. We will again use :dlcmd:`run` to invoke our command so that
we store in the history how this template was generated (so that we may
fearlessly audit, alter, or regenerate this file in the future).

.. runrecord:: _examples/repro2-119
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad run \
     -m "FSL FEAT analysis config script" \
     --output sub-02/1stlvl_design.fsf \
     bash -c 'sed -e "s,##BASEPATH##,{pwd},g" -e "s,##SUB##,sub-02,g" \
     code/ffa_design.fsf > {outputs}'
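A quick aside on the ``sed`` invocation: commas serve as the substitution
delimiter so that replacement values containing slashes (such as the ``{pwd}``
path) need no escaping. A self-contained sketch, with a single made-up template
line rather than the real ``ffa_design.fsf``:

```shell
# Hypothetical template line using the same placeholder style:
printf 'set fmri(outputdir) "##BASEPATH##/##SUB##.feat"\n' > demo_design.fsf

# With ',' as the s-command delimiter, '/' in the replacement needs no escaping:
sed -e "s,##BASEPATH##,/data/project,g" -e "s,##SUB##,sub-02,g" \
    demo_design.fsf > demo_filled.fsf

cat demo_filled.fsf   # set fmri(outputdir) "/data/project/sub-02.feat"
```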
To compute the analysis, a simple ``feat sub-02/1stlvl_design.fsf`` command
is wrapped into a ``datalad containers-run`` command with appropriate
``--input`` and ``--output`` specifications:

.. runrecord:: _examples/repro2-120
   :language: console
   :workdir: usecases/repro2/glm_analysis
   :lines: 1-12, 356-

   $ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
     --input sub-02/1stlvl_design.fsf \
     --input sub-02/onsets \
     --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
     --output sub-02/1stlvl_glm.feat \
     feat {inputs[0]}
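Note the indexed placeholder: with several ``--input`` flags, ``{inputs}``
expands to all declared paths, whereas ``{inputs[0]}`` selects only the first
one, the design file that ``feat`` expects as its sole argument. In plain shell
terms, with the same three paths as stand-ins:

```shell
# Stand-ins for the three --input values, in declaration order:
set -- sub-02/1stlvl_design.fsf \
       sub-02/onsets \
       inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz

# What 'feat {inputs[0]}' expands to:
expanded="feat $1"
echo "$expanded"   # feat sub-02/1stlvl_design.fsf
```

The other two inputs are still declared so that DataLad retrieves them before
execution; they just do not appear on the ``feat`` command line.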
Once this command finishes, DataLad will have captured the entire FSL output,
and the dataset will contain a complete record all the way from the input BIDS
dataset to the GLM results. The BIDS subdataset in turn has a
complete record of all processing down from the raw DICOMs onwards.

.. importantnote:: Many files need more planning

   See how many files were created and added in this computation of a single
   participant? If your study has many participants, analyses like the one above
   could inflate your dataset. Please check out the chapter :ref:`chapter_gobig`,
   in particular the section :ref:`big_analysis`, for tips and tricks on how to
   create analysis datasets that scale.
Archive data and results
""""""""""""""""""""""""

After study completion it is important to properly archive data and results,
for example for future inquiries by reviewers or readers of the associated
publication. Thanks to the modularity of the study units, this task is easy
and avoids needless duplication.

The raw data is tracked in its own dataset (``localizer_scans``) that only
needs to be archived once, regardless of how many analyses use it as
input. This means that we can “throw away” this subdataset copy within this
analysis dataset. DataLad can re-obtain the correct version at any point in
the future, as long as the recorded location remains accessible.
To make sure we're not deleting information we are not aware of,
:dlcmd:`diff` and :gitcmd:`log` can help to verify that the
subdataset is in the same state as when it was initially added:

.. runrecord:: _examples/repro2-121
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad diff -- inputs

The command does not show any output, thus indicating that there is indeed no
difference. ``git log`` confirms that the only action that was performed on
``inputs/`` was its addition as a subdataset:

.. runrecord:: _examples/repro2-122
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ git log -- inputs
Since the state of the subdataset is exactly the state of the original
``localizer_scans`` dataset, it is safe to uninstall it:

.. runrecord:: _examples/repro2-123
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad uninstall --dataset . inputs --recursive
Prior to archiving the results, we can go one step further and verify their
computational reproducibility. DataLad's ``rerun`` command is
capable of “replaying” any recorded command. The following command
re-executes the FSL analysis by re-running everything since the dataset was
tagged as ``ready4analysis``. It will record the recomputed results in a
separate Git branch named ``verify``. Afterwards, we can automatically
compare these new results to the original ones in the ``master`` branch. We
will see that all outputs can be reproduced in bit-identical form. The only
changes are observed in log files that contain volatile information, such as
time stamps.

.. runrecord:: _examples/repro2-124
   :language: console
   :workdir: usecases/repro2/glm_analysis
   :lines: 1-17, 362-

   $ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
.. runrecord:: _examples/repro2-125
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # check that we are now on the new `verify` branch
   $ git branch

.. runrecord:: _examples/repro2-126
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # compare which files have changes with respect to the original results
   $ git diff master --stat

.. runrecord:: _examples/repro2-127
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # switch back to the master branch and remove the `verify` branch
   $ git checkout master
   $ git branch -D verify
The outcome of this use case can be found as a dataset on GitHub
`here <https://github.com/myyoda/demo-dataset-glmanalysis>`_.

.. rubric:: Footnotes

.. [#f1] "Why can such data exist as a Git repository, shouldn't large files
   always be stored outside of Git?" you may ask. The DICOMs exist in a
   Git repository for a number of reasons: First, it makes them easily
   available for demonstrations and tutorials without involving DataLad
   at all. Second, the DICOMs are *comparatively* small: 21K per file.
   Importantly, the repository is not meant to version control those
   files *and* future states or derivatives and results obtained from
   them -- this would bring a Git repository to its knees.

.. [#f2] To re-read everything about the YODA principles, check out the section
   :ref:`yoda`.