123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519 |
- .. index:: ! Usecase; Reproducible Neuroimaging
- .. _usecase_reproduce_neuroimg:
- An automatically and computationally reproducible neuroimaging analysis from scratch
- ------------------------------------------------------------------------------------
- This use case sketches the basics of a portable analysis that can be
- automatically computationally reproduced, starting from the
- acquisition of a neuroimaging dataset with a magnetic resonance imaging (MRI)
- scanner up to complete data analysis results:
- #. Two extension packages, `datalad-container <https://github.com/datalad/datalad-container>`_
- and `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_
- extend DataLad's functionality with the ability to work with computational
- containers and neuroimaging data workflows.
- #. The analysis is conducted in a way that leaves comprehensive provenance
- (including software environments) all the way from the raw data, and
- structures study components in a way that facilitates reuse.
- #. It starts with preparing a raw data (dicom) dataset, and subsequently uses
- the prepared data for a general linear model (GLM) based analysis.
- #. After completion, data and results are archived, and disk usage of the
- dataset is maximally reduced.
- This use case is adapted from the
- `ReproIn/DataLad tutorial <https://www.repronim.org/ohbm2018-training/03-01-reproin>`_
- by Michael Hanke and Yaroslav Halchenko, given at the 2018 OHBM training course
- ran by `ReproNim <https://www.repronim.org>`_.
- The Challenge
- ^^^^^^^^^^^^^
- Allan is an exemplary neuroscientist and researcher. He has spent countless
- hours diligently learning not only the statistical methods for his research
- questions and the software tools for his computations, but also taught
- himself about version control and data standards in neuroimaging, such as
- the brain imaging data structure (`BIDS <https://bids.neuroimaging.io>`_).
- For his final PhD project, he patiently acquires functional MRI data of many
- participants, and prepares it according to the BIDS standard afterwards.
- It takes him a full week of time and two failed attempts, but he
- eventually has a `BIDS-compliant <https://bids-standard.github.io/bids-validator>`_
- dataset.
- When he writes his analysis scripts he takes extra care to responsibly
- version control every change. He happily notices how much cleaner his
- directories are, and how he and others can transparently see how his code
- evolved. Once everything is set up, he runs his analysis using large and
- complex neuroscientific software packages that he installed on his computer a
- few years back. Finally, he writes a paper and publishes his findings in a
- prestigious peer-reviewed journal. His data and code can be accessed by
- others easily, as he makes them publicly available. Colleagues and
- supervisors admire him for his wonderful contribution to open science.
- However, a few months after publication, Allan starts to get
- emails by that report that his scripts do not produce the same
- results as the ones reported in the publication. Startled and confused he
- investigates what may be the issue. After many sleepless nights he realizes:
- The software he used was fairly old! More recent versions of the same
- software compute results slightly different, changed function's names, or
- fixed discovered bugs in the underlying source code. Shocked, he realizes that
- his scripts are even incompatible with the most recent release of the software
- package he used and throw an error. Luckily, he can quickly fix this by
- adding information about the required software versions to the ``README`` of his
- project, and he is grateful for colleagues and other scientists that provide
- adjusted versions of his code for more recent software releases. In the end,
- his results prove to be robust regardless of software version. But while
- Allen shared code and data, not including any information about his *software*
- environment prevented his analysis from becoming *computationally*
- reproducible.
- The DataLad Approach
- ^^^^^^^^^^^^^^^^^^^^
- Even if an analysis workflow is fully captured and version-controlled, and
- data and code are being linked, an analysis may not reproduce. Comprehensive
- *computational* reproducibility requires that also the *software* involved
- in an analysis and its precise versions need to be known.
- DataLad can help with this. Using the ``datalad-containers`` extension,
- complete software environments can be captured in computational containers,
- added to (and thus shared together with) datasets, and linked with commands
- and outputs they were used for.
- Step-by-Step
- ^^^^^^^^^^^^
- The first part of this Step-by-Step guide details how to computationally
- reproducibly and automatically reproducibly perform data preparation from raw
- `DICOM <https://www.dicomstandard.org>`_ files to BIDS-compliant
- `NIfTI <https://nifti.nimh.nih.gov>`_ images. The actual analysis, a
- first-level GLM for a localization task, is performed in the second part. A
- final paragraph shows how to prepare the dataset for the afterlife.
- For this use case, two DataLad extensions are required:
- - `datalad-container <https://github.com/datalad/datalad-container>`_ and
- - `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_
- You can install them via ``pip`` like this::
- $ pip install datalad-neuroimaging datalad-container
- Data Preparation
- """"""""""""""""
- We start by creating a home for the raw data:
- .. runrecord:: _examples/repro2-101
- :language: console
- :workdir: usecases/repro2
- $ datalad create localizer_scans
- $ cd localizer_scans
- For this example, we use a number of publicly available DICOM files. Luckily,
- at the time of data acquisition, these DICOMs were already equipped with the
- relevant metadata: Their headers contain all necessary information to
- identify the purpose of individual scans and encode essential properties to
- create a BIDS compliant dataset from them. The DICOMs are stored on Github
- (as a Git repository [#f1]_), so they can be installed as a subdataset. As
- they are the raw inputs of the analysis, we store them in a directory we call
- ``inputs/raw``.
- .. runrecord:: _examples/repro2-102
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad clone --dataset . \
- https://github.com/datalad/example-dicom-functional.git \
- inputs/rawdata
- The :dlcmd:`subdatasets` reports the installed dataset to be indeed
- a subdataset of the superdataset ``localizer_scans``:
- .. runrecord:: _examples/repro2-103
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad subdatasets
- Given that we have obtained *raw* data, this data is not yet ready for data
- analysis. Prior to performing actual computations, the data needs to be
- transformed into appropriate formats and standardized to an intuitive layout.
- For neuroimaging, a useful transformation is a transformation from
- DICOMs into the NIfTI format, a format specifically designed for scientific
- analyses of brain images. An intuitive layout is the BIDS standard.
- Performing these transformations and standardizations, however, requires
- software. For the task at hand, `HeudiConv <https://heudiconv.readthedocs.io>`_,
- a DICOM converter, is our software of choice. Beyond converting DICOMs, it
- also provides assistance in converting a raw data set to the BIDS standard,
- and it integrates with DataLad to place converted and original data under
- Git/Git-annex version control, while automatically annotating files with
- sensitive information (e.g., non-defaced anatomicals, etc).
- To take extra care to know exactly what software is used both to be
- able to go back to it at a later stage should we have the
- need to investigate an issue, and to capture *full* provenance of the
- transformation process, we are using a software container that contains the
- relevant software setup.
- A ready-made `singularity <https://singularity.lbl.gov>`_ container is
- available from `singularity-hub <https://singularity-hub.org>`_ at
- `shub://ReproNim/ohbm2018-training:heudiconvn <shub://ReproNim/ohbm2018-training:heudiconvn>`_.
- Using the :dlcmd:`containers-add` command we can add this container
- to the ``localizer_scans`` superdataset. We are giving it the name
- ``heudiconv``.
- .. runrecord:: _examples/repro2-104
- :language: console
- :workdir: usecases/repro2/localizer_scans
- :realcommand: datalad containers-add heudiconv --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}' --url shub://ReproNim/ohbm2018-training:heudiconvn
- $ datalad containers-add heudiconv --url shub://ReproNim/ohbm2018-training:heudiconvn
- The command :dlcmd:`containers-list` can verify that this worked:
- .. runrecord:: _examples/repro2-105
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad containers-list
- Great. The dataset now tracks all of the input data *and* the computational
- environment for the DICOM conversion. Thus far, we have a complete record of
- all components. Let's stay transparent, but also automatically reproducible
- in the actual data conversion by wrapping the necessary ``heudiconv`` command
- seen below::
- $ heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . \
- -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms
- within a :dlcmd:`containers-run` command.
- To save time, we will only transfer one subjects data (``sub-02``, hence the
- subject identifier ``-s 02`` in the command). Note that the output below is
- how it indeed should look like -- the software we are using in this example
- produces very wordy output.
- .. runrecord:: _examples/repro2-106
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad containers-run -m "Convert sub-02 DICOMs into BIDS" \
- --container-name heudiconv \
- 'heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms'
- Find out what changed after this command by comparing the most recent commit
- by DataLad (i.e., ``HEAD``) to the previous one (i.e., ``HEAD~1``) with
- :dlcmd:`diff`:
- .. runrecord:: _examples/repro2-107
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad diff -f HEAD~1
- As expected, DICOM files of one subject were converted into NIfTI files,
- **and** the outputs follow the BIDS standard's layout and naming conventions!
- But what's even better is that this action and the relevant software
- environment was fully recorded.
- There is only one thing missing before the functional imaging data can be
- analyzed: A stimulation protocol, so that we know what stimulation was done
- at which point during the scan.
- Thankfully, the data was collected using an implementation that exported
- this information directly in the BIDS events.tsv format. The file came with
- our DICOM dataset and can be found at ``inputs/rawdata/events.tsv``. All we need
- to do is copy it to the right location under the BIDS-mandated name. To keep
- track of where this file came from, we will also wrap the copying into a
- :dlcmd:`run` command. The ``{inputs}`` and ``{outputs}``
- placeholders can help to avoid duplication in the command call:
- .. runrecord:: _examples/repro2-108
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ datalad run -m "Import stimulation events" \
- --input inputs/rawdata/events.tsv \
- --output sub-02/func/sub-02_task-oneback_run-01_events.tsv \
- cp {inputs} {outputs}
- ``git log`` shows what information DataLad captured about this command's
- execution:
- .. runrecord:: _examples/repro2-109
- :language: console
- :workdir: usecases/repro2/localizer_scans
- $ git log -n 1
- Analysis execution
- """"""""""""""""""
- Since the raw data are reproducibly prepared in BIDS standard we can now go
- further and conduct an analysis. For this example, we will implement a very
- basic first-level GLM analysis using `FSL <https://fsl.fmrib.ox.ac.uk>`__
- that takes only a few minutes to run. As before, we will capture all provenance:
- inputs, computational environments, code, and outputs.
- Following the YODA principles [#f2]_, the analysis is set up in a new
- dataset, with the input dataset ``localizer_scans`` as a subdataset:
- .. runrecord:: _examples/repro2-110
- :language: console
- :workdir: usecases/repro2/localizer_scans
- # get out of localizer_scans
- $ cd ../
- $ datalad create glm_analysis
- $ cd glm_analysis
- We install ``localizer_scans`` by providing its path as a ``--source`` to
- :dlcmd:`install`:
- .. runrecord:: _examples/repro2-111
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad clone -d . \
- ../localizer_scans \
- inputs/rawdata
- :dlcmd:`subdatasets` reports the number of installed subdatasets
- again:
- .. runrecord:: _examples/repro2-112
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad subdatasets
- We almost forgot something really useful: Structuring the dataset with
- the help of DataLad! Luckily, procedures such as ``yoda`` can not only be
- applied upon creating of a dataset (as in :ref:`createDS`), but also with the
- :dlcmd:`run-procedure` command (as in :ref:`procedures`)
- .. runrecord:: _examples/repro2-113
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad run-procedure cfg_yoda
- The analysis obviously needs custom code. For the simple GLM analysis with
- FSL we use:
- #. A small script to convert BIDS-formatted ``events.tsv`` files into the
- ``EV3`` format FSL understands, available at
- `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh>`_
- #. An FSL analysis configuration template script, available at
- `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf>`_
- These script should be stored and tracked inside the dataset within ``code/``.
- The :dlcmd:`download-url` command downloads these scripts *and*
- records where they were obtained from:
- .. runrecord:: _examples/repro2-114
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad download-url --path code/ \
- https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh \
- https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf
- The commit message that DataLad created shows the URL where each script has
- been downloaded from:
- .. runrecord:: _examples/repro2-115
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ git log -n 1
- Prior to the actual analysis, we need to run the ``events2ev3.sh`` script to
- transform inputs into the format that FSL expects. The :dlcmd:`run`
- makes this maximally reproducible and easy, as the files given as
- ``--inputs`` and ``--outputs`` are automatically managed by DataLad.
- .. runrecord:: _examples/repro2-116
- :workdir: usecases/repro2/glm_analysis
- :language: console
- $ datalad run -m 'Build FSL EV3 design files' \
- --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv \
- --output 'sub-02/onsets' \
- bash code/events2ev3.sh sub-02 {inputs}
- The dataset now contains and manages all of the required inputs, and we're
- ready for FSL. Since FSL is not a simple program, we make sure to record the
- precise software environment for the analysis with
- :dlcmd:`containers-run`. First, we get a container with FSL in the
- version we require:
- .. runrecord:: _examples/repro2-117
- :workdir: usecases/repro2/glm_analysis
- :language: console
- :realcommand: datalad containers-add fsl --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}' --url shub://mih/ohbm2018-training:fsl
- $ datalad containers-add fsl --url shub://mih/ohbm2018-training:fsl
- As the analysis setup is now complete, let's label this state of the dataset:
- .. runrecord:: _examples/repro2-118
- :workdir: usecases/repro2/glm_analysis
- :language: console
- $ datalad save --version-tag ready4analysis
- All we have left is to configure the desired first-level GLM analysis with
- FSL. At this point, the template contains placeholders for the ``basepath``
- and the subject ID, and they need to be replaced.
- The following command uses the arcane, yet powerful :term:`sed` editor to do
- this. We will again use :dlcmd:`run` to invoke our command so that
- we store in the history how this template was generated (so that we may
- audit, alter, or regenerate this file in the future — fearlessly).
- .. runrecord:: _examples/repro2-119
- :workdir: usecases/repro2/glm_analysis
- :language: console
- $ datalad run \
- -m "FSL FEAT analysis config script" \
- --output sub-02/1stlvl_design.fsf \
- bash -c 'sed -e "s,##BASEPATH##,{pwd},g" -e "s,##SUB##,sub-02,g" \
- code/ffa_design.fsf > {outputs}'
- To compute the analysis, a simple ``feat sub-02/1stlvl_design.fsf`` command
- is wrapped into a ``datalad containers-run`` command with appropriate
- ``--input`` and ``--output`` specification:
- .. runrecord:: _examples/repro2-120
- :language: console
- :workdir: usecases/repro2/glm_analysis
- :lines: 1-12, 356-
- $ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
- --input sub-02/1stlvl_design.fsf \
- --input sub-02/onsets \
- --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
- --output sub-02/1stlvl_glm.feat \
- feat {inputs[0]}
- Once this command finishes, DataLad will have captured the entire FSL output,
- and the dataset will contain a complete record all the way from the input BIDS
- dataset to the GLM results. The BIDS subdataset in turn has a
- complete record of all processing down from the raw DICOMs onwards.
- .. importantnote:: Many files need more planning
- See how many files were created and added in this computation of a single
- participant? If your study has many participants, analyses like the one above
- could inflate your dataset. Please check out the chapter :ref:`chapter_gobig`.
- in particular the section :ref:`big_analysis` for tips and tricks on how to
- create analyses datasets that scale.
- Archive data and results
- """"""""""""""""""""""""
- After study completion it is important to properly archive data and results,
- for example for future inquiries by reviewers or readers of the associated
- publication. Thanks to the modularity of the study units, this tasks is easy
- and avoids needless duplication.
- The raw data is tracked in its own dataset (``localizer_scans``) that only
- needs to be archived once, regardless of how many analysis are using it as
- input. This means that we can “throw away” this subdataset copy within this
- analysis dataset. DataLad can re-obtain the correct version at any point in
- the future, as long as the recorded location remains accessible.
- To make sure we're not deleting information we are not aware of,
- :dlcmd:`diff` and :gitcmd:`log` can help to verify that the
- subdataset is in the same state as when it was initially added:
- .. runrecord:: _examples/repro2-121
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad diff -- inputs
- The command does not show any output, thus indicating that there is indeed no
- difference. ``git log`` confirms that the only action that was performed on
- ``inputs/`` was the addition of it as a subdataset:
- .. runrecord:: _examples/repro2-122
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ git log -- inputs
- Since the state of the subdataset is exactly the state of the original
- ``localizer_scans`` dataset it is safe to uninstall it.
- .. runrecord:: _examples/repro2-123
- :language: console
- :workdir: usecases/repro2/glm_analysis
- $ datalad uninstall --dataset . inputs --recursive
- Prior to archiving the results, we can go one step further and verify their
- computational reproducibility. DataLad's ``rerun`` command is
- capable of “replaying” any recorded command. The following command
- re-executes the FSL analysis by re-running everything since the dataset was
- tagged as ``ready4analysis``). It will record the recomputed results in a
- separate Git branch named ``verify``. Afterwards, we can automatically
- compare these new results to the original ones in the ``master`` branch. We
- will see that all outputs can be reproduced in bit-identical form. The only
- changes are observed in log files that contain volatile information, such as
- time steps.
- .. runrecord:: _examples/repro2-124
- :language: console
- :workdir: usecases/repro2/glm_analysis
- :lines: 1-17, 362-
- $ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
- .. runrecord:: _examples/repro2-125
- :language: console
- :workdir: usecases/repro2/glm_analysis
- # check that we are now on the new `verify` branch
- $ git branch
- .. runrecord:: _examples/repro2-126
- :language: console
- :workdir: usecases/repro2/glm_analysis
- # compare which files have changes with respect to the original results
- $ git diff master --stat
- .. runrecord:: _examples/repro2-127
- :language: console
- :workdir: usecases/repro2/glm_analysis
- # switch back to the master branch and remove the `verify` branch
- $ git checkout master
- $ git branch -D verify
- The outcome of this use case can be found as a dataset on Github
- `here <https://github.com/myyoda/demo-dataset-glmanalysis>`_.
- .. rubric:: Footnotes
- .. [#f1] "Why can such data exist as a Git repository, shouldn't large files
- be always stored outside of Git?" you may ask. The DICOMs exist in a
- Git-repository for a number of reasons: First, it makes them easily
- available for demonstrations and tutorials without involving DataLad
- at all. Second, the DICOMs are *comparatively* small: 21K per file.
- Importantly, the repository is not meant to version control those
- files *and* future states or derivatives and results obtained from
- them -- this would bring a Git repositories to its knees.
- .. [#f2] To re-read everything about the YODA principles, checkout out section
- :ref:`yoda`.
|