.. _big_analysis:

Calculate in greater numbers
----------------------------

When creating and populating datasets yourself, it may be easy to monitor the
overall size of the dataset and its file number, and to introduce
subdatasets whenever and wherever necessary. It is less straightforward
when you are not populating datasets yourself, but when *software* or
analysis scripts suddenly dump vast amounts of output.

Certain analysis software can create myriads of files. A standard
`FEAT analysis <https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FEAT/UserGuide>`_ [#f1]_
in `FSL <https://fsl.fmrib.ox.ac.uk>`_, for example, can easily output
several dozen directories and up to thousands of result files per subject.
Maybe your own custom scripts are writing out many files as outputs, too.
Regardless of *why* a lot of files are produced by an analysis, if the analysis
or software in question runs on a substantially sized input dataset, the results
may overwhelm the capacities of a single dataset.

This section demonstrates some tips on how to prevent swamping your datasets
with files. If you have already accidentally gotten stuck with an overflowing
dataset, check out section :ref:`cleanup` first.

Solution: Subdatasets
^^^^^^^^^^^^^^^^^^^^^

To stick to the example of FEAT, here is a quick overview of what this software
does: It models neuroimaging data based on general linear modeling (GLM),
and creates web page analysis reports, color activation images, time-course plots
of data and model, preprocessed intermediate data, images with filtered data,
statistical output images, color rendered output images, log files, and many more
-- in short: A LOT of files.

Plenty of these outputs are text-based, but there are also many sizable files.
Depending on the type of analysis, not all types of outputs
will be relevant. At the end of the analysis, one usually has session-specific,
subject-specific, or aggregated "group" directories with many subdirectories
filled with log files, intermediate and preprocessed files, and results for all
levels of the analysis.

In such a setup, the output directories (be it on a session/run, subject, or group
level) are predictably named, or can be custom-named. In order not to flood a single
dataset, one can therefore pre-create appropriate subdatasets of the necessary
granularity and have them filled by the analyses.
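
A minimal sketch of such pre-creation, assuming a hypothetical layout with one
output directory per subject underneath ``outputs/``:

.. code-block:: bash

   # pre-create one output subdataset per subject, registered in the
   # current superdataset (paths are hypothetical)
   $ datalad create -d . outputs/sub-01
   $ datalad create -d . outputs/sub-02

Afterwards, the analysis can be pointed to these directories as its output
locations, and each subdataset only ever holds the results of one subject.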

This approach is by no means limited to analyses with certain software, and
it can be automated. For scripting languages other than Python or shell, standard
system calls can create output directories as DataLad subdatasets right away,
and Python scripts can even use DataLad's Python API [#f2]_.
Thus, you can create scripts that take care of subdataset creation, or, if you
write analysis scripts yourself, you can take care of subdataset creation right
in the scripts that compute and save your results.
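
As a sketch of how this could look in a shell script, a hypothetical wrapper
could create each output subdataset right before the analysis fills it, and
save the results afterwards -- the subject IDs and the ``run_analysis`` call
below are placeholders for your actual analysis:

.. code-block:: bash

   #!/bin/bash
   # hypothetical wrapper: create the output subdataset just in time,
   # run the analysis, and save the results on both dataset levels
   for sub in sub-01 sub-02 sub-03; do
       datalad create -d . "outputs/${sub}"
       run_analysis --input "inputs/${sub}" --output "outputs/${sub}"
       # save inside the subdataset, then record its new state in the superdataset
       datalad save -d "outputs/${sub}" -m "Add results for ${sub}"
       datalad save -d . -m "Update results of ${sub}" "outputs/${sub}"
   done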

As it is easy to link datasets and operate (e.g., save, clone) across dataset
hierarchies, splitting datasets into a hierarchy of datasets
does not have many downsides. One substantial disadvantage, though, is that,
on their own, results in subdatasets don't have meaningful provenance
attached: The information about what script or software created them is attached
to the superdataset. Should only the subdataset be cloned or inspected, the
information on how it was generated is not found.

Solutions without creating subdatasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to scale up without going through the complexities of
creating several subdatasets, or to tune your scaling beyond the creation of
subdatasets. This involves more thought, or some compromises, though.
The following section highlights a few caveats to bear in mind if you attempt
a big analysis in a single-level dataset, and outlines solutions that may not
need to involve subdatasets. If you have something to add, please
`get in touch <https://github.com/datalad-handbook/book/issues/new>`_.

Too many files
""""""""""""""

**Caveat**: Drown a dataset with too many files.

**Examples**: The FSL FEAT analysis mentioned in the introduction produces
several hundred thousand files, but not all of these files are important.
``tsplot/``, for example, is a directory that contains time series plots for
various data and results, and may be of little interest for many analyses once
general quality control is done.

**Solutions**:

- Don't put irrelevant files under version control at all: Consider creating
  a *.gitignore* file with patterns that match files or directories that are of
  no relevance to you. These files will not be version controlled or saved to
  your dataset. Section :ref:`gitignore` can tell you more about this. Be
  mindful, though: Having too many files in a single directory can still be
  problematic for your file system. A concrete example: Consider that your
  analyses create log files that are not precious enough to be version
  controlled. Adding ``logs/*`` to your ``.gitignore`` file and saving this
  change will keep these files out of version control.
- Similarly, you can instruct :dlcmd:`run` to save only specific directories
  or files by specifying them with the ``--output`` option and executing the
  command with the ``--explicit`` flag. This may be a more suitable approach if
  you know what you want to keep rather than what is irrelevant. Both options
  are sketched right after this list.
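
A minimal sketch of both options, with hypothetical paths and a hypothetical
analysis script ``code/firstlevel.sh``:

.. code-block:: bash

   # keep irrelevant log files out of version control entirely
   $ echo "logs/*" >> .gitignore
   $ datalad save -m "Ignore analysis log files" .gitignore

   # alternatively, only save the outputs you explicitly care about
   $ datalad run -m "First-level analysis for sub-01" \
       --input "inputs/sub-01" \
       --output "outputs/sub-01/stats" \
       --explicit \
       "bash code/firstlevel.sh sub-01"

With ``--explicit``, :dlcmd:`run` saves only modifications to the declared
``--output`` paths and does not require the rest of the dataset to be clean.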

Too many files in Git
"""""""""""""""""""""

**Caveat**: Drown Git because of configurations.

**Example**: If your dataset is configured with a configuration such as
``text2git``, or if you have modified your ``.gitattributes`` file [#f3]_ to
store files below a certain size or of certain types in :term:`Git` instead of
:term:`git-annex`, a sudden excess of text files can still be overwhelming in
terms of total file size. Several thousand, or tens of thousands of, text files
may still add up to several GB in size even if each of them is small.

**Solutions**:

- Add files to git-annex instead of Git: Consider creating custom ``largefile``
  rules for the directories that you generate these files in, or for patterns
  that match file names that do not need to be in Git. This way, these files
  will be put under git-annex's version control. A concrete example: Consider
  that your analyses output a few thousand text files into all
  ``sub-*/correlations/`` directories in your dataset. Appending
  ``sub-*/correlations/* annex.largefiles=anything`` to ``.gitattributes`` and
  saving this change will store all of them in the dataset's annex instead of
  in Git. A sketch of this rule can be found right after this list.
- Don't put irrelevant files under version control at all: Consider creating
  a *.gitignore* file with patterns that match files or directories that are of
  no relevance to you. These files will not be version controlled or saved to
  your dataset. Section :ref:`gitignore` can tell you more about this. Be
  mindful, though: Having too many files in a single directory can still be
  problematic for your file system. A concrete example: Consider that your
  analyses create log files that are not precious enough to be version
  controlled. Adding ``logs/*`` to your ``.gitignore`` file and saving this
  change will keep these files out of version control.
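
A minimal sketch of such a rule, reusing the hypothetical
``sub-*/correlations/`` directories from the example above:

.. code-block:: bash

   # route the correlation text files into git-annex instead of Git
   $ echo "sub-*/correlations/* annex.largefiles=anything" >> .gitattributes
   $ datalad save -m "Keep correlation results in the annex" .gitattributes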

.. todo::

   Add more caveats and examples

.. rubric:: Footnotes

.. [#f1] FEAT is a software tool for model-based fMRI data analysis and part of
         `FSL <https://fsl.fmrib.ox.ac.uk>`_.

.. [#f2] Read more about DataLad's Python API in the
         :ref:`Find-out-more on it <pythonapi>` in :ref:`yoda_project`.

.. [#f3] Read up on these configurations in the chapter :ref:`chapter_config`.