.. _hcpenki:

Walkthrough: Parallel ENKI preprocessing with fMRIprep
-------------------------------------------------------

.. importantnote:: This workflow has an update!

   The workflow below is valid and working, but over many months and a few very large-scale projects we have improved it with a more flexible and scalable setup.
   Currently, this work can be found as a comprehensive tutorial and bootstrapping script on GitHub (`github.com/psychoinformatics-de/fairly-big-processing-workflow <https://github.com/psychoinformatics-de/fairly-big-processing-workflow>`_), and a corresponding showcase implementation with fMRIprep (`github.com/psychoinformatics-de/fairly-big-processing-workflow-tutorial <https://github.com/psychoinformatics-de/fairly-big-processing-workflow-tutorial>`_).
   Also, there is an accompanying preprint with more high-level descriptions of the workflow at `www.biorxiv.org/content/10.1101/2021.10.12.464122v1 <https://www.biorxiv.org/content/10.1101/2021.10.12.464122v1>`_.
   Its main advantages over the workflow below lie in a distributed (and thus independent) setup of all involved dataset locations; built-in support for two kinds of job schedulers (HTCondor, SLURM); enhanced scalability (tested on 42k datasets of the `UK Biobank dataset <https://www.ukbiobank.ac.uk>`_); and use of :term:`Remote Indexed Archive (RIA) store`\s that provide support for additional security or technical features.
   It is advised to use the updated workflow over the one below.
   In the future, this chapter will be updated with an implementation of the updated workflow.

The previous section gave an overview of parallel, provenance-tracked computations in DataLad datasets.
While the general workflow entails a complete setup, it is usually easier to understand when it is applied to a concrete use case.
It is even more informative if that use case includes some complexities that do not exist in the "picture-perfect" example but are likely to arise in real life.
Therefore, the walk-through in this section is a write-up of an existing and successfully executed analysis.

The analysis
^^^^^^^^^^^^

The analysis goal was standard data preprocessing using `fMRIprep <https://fmriprep.readthedocs.io>`_ on neuroimaging data of 1300 subjects in the `eNKI <https://fcon_1000.projects.nitrc.org/indi/enhanced>`_ dataset.
This computational task is ideal for parallelization: each subject can be preprocessed individually, preprocessing takes between 6 and 8 hours per subject (roughly 1300 x 7 h of serial computing, but only about 7 hours of computing time when executed completely in parallel), and fMRIprep is a containerized pipeline that can be pointed to a specific subject to preprocess.
eNKI was transformed into a DataLad dataset beforehand, and to set up the analysis, the fMRIprep container was placed -- with a custom configuration to make it generalizable -- into a new dataset called ``pipelines``.
Both of these datasets, the input data and the ``pipelines`` dataset, became subdatasets of a data analysis superdataset.
In order to associate input data, containerized pipeline, and outputs, the analysis was carried out in a toplevel analysis DataLad dataset with the :dlcmd:`containers-run` command.
Finally, as an additional complexity due to the large quantity of results, the output was collected in subdatasets.

.. _pipelineenki:

Starting point: Datasets for software and input data
"""""""""""""""""""""""""""""""""""""""""""""""""""""

At the beginning of this endeavour, two important analysis components already exist as DataLad datasets:

1. The input data
2. The containerized pipeline

Following the :ref:`YODA principles <yoda>`, each of these components is a standalone dataset.
While the input dataset creation is straightforward, some thinking went into the creation of the containerized pipeline dataset to set it up in a way that allows it to be installed as a subdataset and invoked from the superdataset.
If you are interested in this, find the details in the find-out-more below.
Also note that there is a large collection of pre-existing container datasets available at `github.com/ReproNim/containers <https://github.com/ReproNim/containers>`_.
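
If you would rather reuse one of these pre-configured container datasets than build your own, a minimal sketch could look like the following -- the target location ``code/containers`` is an illustrative choice and not part of the original analysis:

.. code-block:: bash

   # from the root of an analysis superdataset, register the ReproNim
   # container collection as a subdataset
   $ datalad clone -d . https://github.com/ReproNim/containers.git code/containers
   # list the containers that come pre-configured in it
   $ cd code/containers
   $ datalad containers-list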

.. find-out-more:: pipeline dataset creation

   We start with a dataset (called ``pipelines`` in this example)::

      $ datalad create pipelines
      [INFO ] Creating a new annex repo at /data/projects/enki/pipelines
      create(ok): /data/projects/enki/pipelines (dataset)
      $ cd pipelines

   As one of the tools used in fMRIprep's pipeline, `freesurfer <https://surfer.nmr.mgh.harvard.edu>`_, requires a license file, this license file needs to be added to the dataset.
   Only then can this dataset be moved around flexibly, including to different machines.
   In order to have the license file available right away, it is saved ``--to-git`` and not annexed [#f1]_::

      $ cp <location/to/fs-license.txt> freesurfer_license.txt
      $ datalad save --to-git -m "add freesurfer license file" freesurfer_license.txt

   Finally, we add a container with the pipeline to the dataset using :dlcmd:`containers-add` [#f2]_.
   The important part is the configuration of the container -- it has to be done in a way that makes the container usable in any superdataset that the pipeline dataset is installed in.
   Depending on how the container/pipeline needs to be called, the configuration differs.
   In the case of an fMRIprep run, we want to be able to invoke the container from a data analysis superdataset.
   The superdataset contains the input data and the ``pipelines`` dataset as subdatasets, and will collect all of the results.
   Thus, these are the arguments we want to supply the invocation with (following `fMRIprep's documentation <https://fmriprep.org/en/stable/usage.html>`_) during a ``containers-run`` command::

      $ datalad containers-run \
        [...]
        <BIDS_dir> <output_dir> <analysis_level> \
        --n_cpus <N> \
        --participant-label <ID> \
        [...]

   Note how this list does not include bind-mounts of the necessary directories or of the freesurfer license -- this makes the container invocation convenient and easy for any user.
   Starting an fMRIprep run requires only a ``datalad containers-run`` call with all of the desired fMRIprep options.
   This convenience for the user requires that all of the bind-mounts are taken care of -- in a generic way -- in the container call specification, though.
   Here is how this is done::

      $ datalad containers-add fmriprep \
        --url /data/project/singularity/fmriprep-20.2.0.simg \
        --call-fmt 'singularity run --cleanenv -B "$PWD" {img} {cmd} --fs-license-file "$PWD/{img_dspath}/freesurfer_license.txt"'

   During a :dlcmd:`containers-run` command, the ``--call-fmt`` specification will be used to call the container.
   The placeholders ``{img}`` and ``{cmd}`` will be replaced with the container image (``{img}``) and the command given to ``datalad containers-run`` (``{cmd}``).
   Thus, the ``--cleanenv`` flag as well as the bind-mounts are handled prior to the container invocation, and the ``--fs-license-file`` option with a path to the license file within the container is appended to the command.
   Bind-mounting the working directory (``-B "$PWD"``) makes sure that the directory from which the container is being called -- which should be the superdataset that contains the input data and the ``pipelines`` subdataset -- is available inside the container.
   With these bind-mounts, the input data and the freesurfer license file within ``pipelines`` are available in the container.
   With such a setup, the ``pipelines`` dataset can be installed in any dataset and will work out of the box.
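
   To make the template concrete: called from the superdataset, a ``containers-run`` command would expand to roughly the following Singularity invocation.
   This is only an illustration of the placeholder substitution -- the image path assumes the default location that ``containers-add`` uses when it copies a local image into the dataset::

      singularity run --cleanenv -B "$PWD" \
        code/pipelines/.datalad/environments/fmriprep/image \
        <fMRIprep arguments given to containers-run> \
        --fs-license-file "$PWD/code/pipelines/freesurfer_license.txt"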

Analysis dataset setup
""""""""""""""""""""""

The size of the input dataset and the nature of preprocessing results with fMRIprep constitute an additional complexity:
Based on the amount of input data and test runs of fMRIprep on single subjects, we estimated that the preprocessing results would encompass several TB in size and about half a million files.
This amount of files is too large to be stored in a single dataset, and the results therefore need to be split into two result datasets.
These will be included as direct subdatasets of the toplevel analysis dataset.
This is inconvenient -- it separates results (in the result subdatasets) from their provenance (the run records in the top-level dataset) -- but inevitable given the dataset size.
The final analysis dataset will consist of the following components:

- input data as a subdataset
- ``pipelines`` container dataset as a subdataset
- subdatasets to hold the results

Following the benchmarks and tips in the chapter :ref:`chapter_gobig`, the amount of files produced by fMRIprep on 1300 subjects requires two datasets to hold them.
In this particular computation, following the naming scheme and structure of fMRIprep's output directories, one subdataset called ``freesurfer`` is created for the `freesurfer <https://surfer.nmr.mgh.harvard.edu>`_ results of fMRIprep, and one called ``fmriprep`` for the minimally preprocessed input data.

Here is an overview of the directory structure in the superdataset::

   superds
   ├── code                  # directory
   │   └── pipelines         # subdataset with fMRIprep
   ├── fmriprep              # subdataset for results
   ├── freesurfer            # subdataset for results
   └── sourcedata            # subdataset with BIDS-formatted data
       ├── sourcedata        # subdataset with raw data
       ├── sub-A00008326     # directory
       ├── sub-...

When running fMRIprep on a smaller set of subjects, or when running a containerized pipeline that produces fewer files, saving results into subdatasets is not necessary.
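
For orientation, this layout could be assembled with a few DataLad commands along the following lines -- a sketch only, with placeholder clone sources standing in for wherever the eNKI input dataset and the ``pipelines`` dataset live:

.. code-block:: bash

   # create the analysis superdataset
   $ datalad create /data/project/enki/super
   $ cd /data/project/enki/super
   # register input data and the pipeline dataset as subdatasets
   $ datalad clone -d . <path/or/url/to/enki-bids-dataset> sourcedata
   $ datalad clone -d . <path/or/url/to/pipelines-dataset> code/pipelines
   # create empty subdatasets to later hold the results
   $ datalad create -d . fmriprep
   $ datalad create -d . freesurfer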

Workflow script
"""""""""""""""

Based on the general principles introduced in the previous section, a sketch of the workflow is given in the :term:`bash` (shell) script below.
It still lacks ``fMRIprep``-specific fine-tuning -- the complete script is shown in the find-out-more afterwards.
This initial sketch serves to highlight key differences and adjustments that are due to the complexity and size of the analysis; they are explained below and highlighted in the script as well:

* **Getting subdatasets**: The empty result subdatasets would not be installed in the clone automatically -- ``datalad get -n -r -R1 .`` installs all first-level subdatasets so that they are available to be populated with results.
* **Recursive throw-away clones**: In the simpler general workflow, we ran ``git annex dead here`` in the topmost dataset.
  This dataset contains the results within subdatasets.
  In order to make them "throw-away" as well, the ``git annex dead here`` configuration needs to be applied recursively for all datasets with ``git submodule foreach --recursive git annex dead here``.
* **Checkout of unique branches in the subdatasets**: Since the results will be pushed from the subdatasets, it is in there that unique branches need to be checked out.
  We are using ``git -C <path>`` to apply a command in the dataset under ``path``.
* **Complex container call**: The ``containers-run`` command is more complex because it supplies all desired ``fMRIprep`` arguments.
* **Push the subdatasets only**: We only need to push the results, i.e., there is one push per result subdataset.

.. code-block:: bash
   :emphasize-lines: 10, 13, 19-20, 24, 43-44

   # everything is running under /tmp inside a compute job,
   # /tmp is job-specific local file system not shared between jobs
   $ cd /tmp

   # clone the superdataset with locking
   $ flock --verbose $DSLOCKFILE datalad clone /data/project/enki/super ds
   $ cd ds

   # get first-level subdatasets (-R1 = --recursion-limit 1)
   $ datalad get -n -r -R1 .

   # make git-annex disregard the clones - they are meant to be thrown away
   $ git submodule foreach --recursive git annex dead here

   # checkout unique branches (names derived from job IDs) in both subdatasets
   # to enable pushing the results without interference from other jobs
   # In a setup with no subdatasets, "-C <subds-name>" would be stripped,
   # and a new branch would be checked out in the superdataset instead.
   $ git -C fmriprep checkout -b "job-$JOBID"
   $ git -C freesurfer checkout -b "job-$JOBID"

   # call fmriprep with datalad containers-run. Use all relevant fMRIprep
   # arguments for your usecase
   $ datalad containers-run \
     -m "fMRIprep $subid" \
     --explicit \
     -o freesurfer -o fmriprep \
     -i "$1" \
     -n code/pipelines/fmriprep \
     sourcedata . participant \
     --n_cpus 1 \
     --skip-bids-validation \
     -w .git/tmp/wdir \
     --participant-label "$subid" \
     --random-seed 12345 \
     --skull-strip-fixed-seed \
     --md-only-boilerplate \
     --output-spaces MNI152NLin6Asym \
     --use-aroma \
     --cifti-output

   # push back the results
   $ flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin
   $ flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin

   # job handler should clean up workspace

Just like the general script from the last section, this script can be submitted to any job scheduler -- here with a subject ID as a ``$subid`` command line variable and a job ID as an environment variable, which serve as identifiers for the fMRIprep run and the branch names.
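
For example, a hypothetical SLURM submission loop -- the original analysis used HTCondor, see the :ref:`job submission section <jobsubmit>` below -- could pass the required identifiers roughly like this:

.. code-block:: bash

   # illustrative sketch only: resource requests mirror the HTCondor setup below,
   # and JOBID/DSLOCKFILE are handed to the script via the environment
   for subdir in sourcedata/sub-*; do
       sbatch --job-name="fmriprep-$(basename "$subdir")" \
              --cpus-per-task=1 --mem=20G \
              --export=ALL,DSLOCKFILE="$PWD/.git/datalad_lock" \
              --wrap="JOBID=\$SLURM_JOB_ID code/fmriprep_participant_job $subdir"
   done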

At this point, the workflow misses a tweak that is necessary for fMRIprep to enable rerunning computations (the complete file is in :ref:`this Findoutmore <fom-enki>`).

.. find-out-more:: Fine-tuning: Enable rerunning

   If you want to make sure that your dataset is set up in a way that gives you the ability to rerun a computation quickly, the following fMRIprep-specific consideration is important:
   If fMRIprep finds preexisting results, it will fail to run.
   Therefore, all outputs of a job need to be removed before the job is started [#f3]_.
   We can simply add an attempt to do this in the script (it would not do any harm if there is nothing to be removed)::

      (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv)
      (cd freesurfer && rm -rf fsaverage "$subid")

With this in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``).
You can find the full script with rich comments in :ref:`this Findoutmore <fom-enki>`.

.. find-out-more:: See the complete bash script
   :name: fom-enki
   :float: p

   This script is placed in ``code/fmriprep_participant_job``.
   For technical reasons (rendering of the handbook), we break it into several blocks of code:

   .. code-block:: bash

      #!/bin/bash

      # fail whenever something is fishy, use -x to get verbose logfiles
      set -e -u -x

      # we pass in "sourcedata/sub-...", extract subject id from it
      subid=$(basename $1)

      # this is all running under /tmp inside a compute job, /tmp is a performant
      # local file system
      cd /tmp
      # get the output dataset, which includes the inputs as well
      # flock makes sure that this does not interfere with another job
      # finishing at the same time, and pushing its results back
      # importantly, we clone from the location that we want to push the
      # results to
      flock --verbose $DSLOCKFILE \
        datalad clone /data/project/enki/super ds

      # all following actions are performed in the context of the superdataset
      cd ds
      # obtain all first-level subdatasets:
      # the dataset with the fmriprep singularity container and pre-configured
      # pipeline call, the output datasets (to prep them for output consumption
      # and tune them for this particular job), and sourcedata
      # important: because we will push additions to the result datasets back
      # at the end of the job, the installation of these result datasets
      # must happen from the location we want to push back to
      datalad get -n -r -R1 .
      # let git-annex know that we do not want to remember any of these clones
      # (we could have used an --ephemeral clone, but that might deposit data
      # of failed jobs at the origin location, if the job runs on a shared
      # file system -- let's stay self-contained)
      git submodule foreach --recursive git annex dead here

   .. code-block:: bash

      # checkout new branches in both subdatasets
      # this enables us to store the results of this job, and push them back
      # without interference from other jobs
      git -C fmriprep checkout -b "job-$JOBID"
      git -C freesurfer checkout -b "job-$JOBID"
      # create workdir for fmriprep inside the dataset to simplify singularity call
      # PWD will be available in the container
      mkdir -p .git/tmp/wdir
      # pybids (inside fmriprep) gets angry when it sees dangling symlinks
      # of .json files -- wipe them out, spare only those that belong to
      # the participant we want to process in this job
      find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"'*' -delete

      # next one is important to get job-reruns correct. We remove all
      # anticipated output, such that fmriprep isn't confused by the presence
      # of stale symlinks. Otherwise we would need to obtain and unlock file
      # content. But that takes some time, for no reason other than being
      # discarded at the end
      (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv)
      (cd freesurfer && rm -rf fsaverage "$subid")

   .. code-block:: bash

      # the meat of the matter, add actual parameterization after --participant-label
      datalad containers-run \
        -m "fMRIprep $subid" \
        --explicit \
        -o freesurfer -o fmriprep \
        -i "$1" \
        -n code/pipelines/fmriprep \
        sourcedata . participant \
        --n_cpus 1 \
        --skip-bids-validation \
        -w .git/tmp/wdir \
        --participant-label "$subid" \
        --random-seed 12345 \
        --skull-strip-fixed-seed \
        --md-only-boilerplate \
        --output-spaces MNI152NLin6Asym \
        --use-aroma \
        --cifti-output

      # selectively push outputs only
      # ignore root dataset, despite recorded changes, needs coordinated
      # merge at receiving end
      flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin
      flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin

      # job handler should clean up workspace

Apart from adjustments to the paths provided as clone locations, the above script and dataset setup are generic enough to be run on different systems and with different job schedulers.

.. _jobsubmit:

Job submission
""""""""""""""

Job submission now boils down to invoking the script for each participant, with a participant identifier that determines which subject the job runs on, and two environment variables: one is the job ID that determines the name of the branch to be created, and one points to a lockfile that was created beforehand, once, in ``.git``.
Job schedulers such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file can thus stay lean even though it queues up more than 1000 jobs.
You can find the submit file used in this analysis in :ref:`this Findoutmore <fom-condor>`.

.. find-out-more:: HTCondor submit file
   :name: fom-condor
   :float:

   The following submit file was created and saved in ``code/fmriprep_all_participants.submit``:

   .. code-block:: bash

      universe = vanilla
      getenv = True
      # resource requirements for each job, determined by
      # investigating the demands of a single test job
      request_cpus = 1
      request_memory = 20G
      request_disk = 210G

      executable = $ENV(PWD)/code/fmriprep_participant_job

      # the job expects two environment variables for labeling and synchronization
      environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.git/datalad_lock"
      log = $ENV(PWD)/../logs/$(Cluster).$(Process).log
      output = $ENV(PWD)/../logs/$(Cluster).$(Process).out
      error = $ENV(PWD)/../logs/$(Cluster).$(Process).err
      arguments = $(subid)
      # find all participants, based on the subdirectory names in the source dataset
      # each relative path to such a subdirectory will become the value of `subid`,
      # and another job will be queued. Will queue a total number of jobs matching the
      # number of matching subdirectories
      queue subid matching dirs sourcedata/sub-*

All it takes to submit is a single ``condor_submit <submit_file>``.
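
In practice, submission and monitoring could look like this:

.. code-block:: bash

   # submit all jobs from the root of the superdataset
   $ condor_submit code/fmriprep_all_participants.submit
   # keep an eye on the queue while jobs are running
   $ condor_q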

Merging results
"""""""""""""""

Once all jobs have finished, the results lie in individual branches of the output datasets.
In this concrete example, the subdatasets ``fmriprep`` and ``freesurfer`` each have more than 1000 branches that hold individual job results.
The only thing left to do now is to merge all of these branches into :term:`main` -- and to potentially solve any merge conflicts that arise.
As explained in the previous section, the necessary merging was done with `Octopus merges <https://git-scm.com/docs/git-merge#Documentation/git-merge.txt-octopus>`_ -- one in each subdataset (``fmriprep`` and ``freesurfer``).
The merge command was assembled with the trick introduced in the previous section, based on the job-ID-named branches.
Importantly, this needs to be carried out inside of the subdatasets, i.e., within ``fmriprep`` and ``freesurfer``.

.. code-block:: bash

   $ git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ')

**Merging with merge conflicts**

When attempting an octopus merge like the one above and a merge conflict arises, the merge is aborted automatically.
This is what it looks like in ``fmriprep/``, in which all jobs created a slightly different ``CITATION.md`` file::

   $ cd fmriprep
   $ git merge -m "Merge results from job cluster 107890" $(git branch -l | grep 'job-' | tr -d ' ')
   Fast-forwarding to: job-107890.0
   Trying simple merge with job-107890.1
   Simple merge did not work, trying automatic merge.
   ERROR: logs/CITATION.md: Not merging symbolic link changes.
   fatal: merge program failed
   Automated merge did not work.
   Should not be doing an octopus.
   Merge with strategy octopus failed.

This merge conflict is in principle helpful -- since there are multiple different ``CITATION.md`` files in each branch, Git refuses to randomly pick one that it likes to keep, and instead aborts so that the user can intervene.

.. find-out-more:: How to fix this?

   As the file ``CITATION.md`` does not contain meaningful changes between jobs, one of the files is kept as a backup (e.g., copied into a temporary location, or brought back to life afterwards with ``git cat-file``), then all ``CITATION.md`` files of all branches are deleted prior to the merge, and the backed-up ``CITATION.md`` file is copied and saved into the dataset as a last step.

   .. code-block:: bash

      # First, checkout any job branch
      $ git checkout job-<insert-number>
      # then, copy the file out of the dataset (here, it is copied into your home directory)
      $ cp logs/CITATION.md ~/CITATION.md
      # checkout main again
      $ git checkout main

   Then, remove the ``CITATION.md`` file from the most recent commit of each job branch.
   Here is a bash loop that would do exactly that::

      $ for b in $(git branch -l | grep 'job-' | tr -d ' ');
        do ( git checkout -b m$b $b && git rm logs/CITATION.md && git commit --amend --no-edit ) ;
        done

   Afterwards, merge the results::

      $ git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'mjob-' | tr -d ' ')

   Finally, move the back-up file into the dataset::

      $ mv ~/CITATION.md logs/
      $ datalad save -m "Add CITATION file from one job" logs/CITATION.md

**Merging without merge conflicts**

If no merge conflicts arise and the octopus merge is successful, all results are aggregated in the ``main`` branch.
The commit log looks like a work of modern art when visualized with tools such as :term:`tig`:

.. figure:: ../artwork/src/octopusmerge_tig.png

Summary
"""""""

Once all jobs have been computed in parallel and the resulting branches have been merged, the superdataset is populated with two subdatasets that hold the preprocessing results.
Each result contains a machine-readable record of provenance on when, how, and by whom it was computed.
From this point, the results in the subdatasets can be used for further analysis, while a record of how they were preprocessed stays attached to them.
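
As a sketch of such follow-up usage, a hypothetical downstream analysis could obtain selected preprocessing outputs on demand -- the subject directory below is only an illustration:

.. code-block:: bash

   # clone the superdataset; subdatasets are registered but not yet installed
   $ datalad clone /data/project/enki/super enki-preprocessed
   $ cd enki-preprocessed
   # retrieve the preprocessed data of a single subject on demand
   # (this installs the fmriprep subdataset and fetches the file content)
   $ datalad get fmriprep/sub-A00008326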

.. rubric:: Footnotes

.. [#f1] If the distinction between annexed and unannexed files is new to you, please read section :ref:`symlink`.

.. [#f2] Note that this requires the ``datalad-container`` extension. Find an overview of all DataLad extensions in :ref:`extensions_intro`.

.. [#f3] The parentheses around the commands are called *command grouping* in bash, and yield a subshell environment: `www.gnu.org/software/bash/manual/html_node/Command-Grouping.html <https://www.gnu.org/software/bash/manual/html_node/Command-Grouping.html>`_.