.. _big_analysis:

Calculate in greater numbers
----------------------------

When you create and populate datasets yourself, it may be easy to monitor the
overall size of the dataset and its number of files, and to introduce
subdatasets whenever and wherever necessary. It may not be as straightforward
when you are not populating datasets yourself, but when *software* or
analysis scripts suddenly dump vast amounts of output.

Certain analysis software can create myriads of files. A standard
`FEAT analysis <https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FEAT/UserGuide>`_ [#f1]_
in `FSL <https://fsl.fmrib.ox.ac.uk>`_, for example, can easily output
several dozen directories and up to thousands of result files per subject.
Maybe your own custom scripts write out many files as outputs, too.

Regardless of *why* an analysis produces a lot of files, if the analysis
or software in question runs on a substantially sized input dataset, the results
may overwhelm the capacities of a single dataset.

This section demonstrates a few tips on how to prevent swamping your datasets
with files. If you have already gotten stuck with an overflowing dataset,
check out section :ref:`cleanup` first.

Solution: Subdatasets
^^^^^^^^^^^^^^^^^^^^^

To stick with the example of FEAT, here is a quick overview of what this software
does: It models neuroimaging data based on a general linear model (GLM), and it
creates web page analysis reports, color activation images, time-course plots
of data and model, preprocessed intermediate data, images with filtered data,
statistical output images, color rendered output images, log files, and much more
-- in short: a LOT of files.

Plenty of these outputs are text-based, but there are also many sizable files.
Depending on the type of analysis, not all types of outputs
will be relevant. At the end of the analysis, one usually has session-,
subject-specific, or aggregated "group" directories with many subdirectories
filled with log files, intermediate and preprocessed files, and results for all
levels of the analysis.

In such a setup, the output directories (be it on the session/run, subject, or group
level) are predictably named, or can be given custom names. In order not to flood
a single dataset, one can therefore pre-create appropriate subdatasets of the
necessary granularity and have the analysis fill them, as sketched below.

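The following is only a minimal sketch: it assumes a hypothetical layout with one
FEAT output dataset per subject, and subject identifiers that are known in advance.

.. code-block:: bash

   # Pre-create one output subdataset per (hypothetical) subject.
   # "datalad create -d ." registers each new dataset as a subdataset
   # of the superdataset in the current directory.
   for sub in sub-01 sub-02 sub-03; do
       datalad create -d . "${sub}_feat"
   done
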
This approach is by no means limited to analyses with certain software, and it
can be automated. For scripting languages other than Python or shell, standard
system calls can create output directories as DataLad subdatasets right away, and
Python scripts can even use DataLad's Python API [#f2]_.
Thus, you can create scripts that take care of subdataset creation, or, if you
write analysis scripts yourself, you can take care of subdataset creation right
in the scripts that compute and save your results, as in the sketch below.

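As an illustration only, a hypothetical shell wrapper around an analysis could
create its own output subdataset right before writing results into it. The script
name, the directory layout, and the ``run_feat_analysis.sh`` call are made up for
this sketch:

.. code-block:: bash

   #!/bin/bash
   # hypothetical wrapper: compute_subject.sh <subject-id>
   set -e
   subject=$1
   # create the output location as a subdataset of the current superdataset
   datalad create -d . "outputs/${subject}"
   # run the (made-up) analysis that writes into the new subdataset
   bash code/run_feat_analysis.sh "inputs/${subject}" "outputs/${subject}"
   # save the results with a meaningful commit message in the subdataset
   datalad save -d "outputs/${subject}" -m "Add FEAT results for ${subject}"
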
As it is easy to link datasets and to operate (e.g., save, clone) across dataset
hierarchies, splitting a dataset into a hierarchy of datasets
does not have many downsides. One substantial disadvantage, though, is that,
on their own, results in subdatasets don't have meaningful provenance
attached. The information about what script or software created them is attached
to the superdataset. Should only the subdataset be cloned or inspected, the information
on how it was generated is not found.

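To make this concrete, here is a rough sketch (with a made-up analysis script and
subdataset name) of where provenance ends up when :dlcmd:`run` is invoked from the
superdataset:

.. code-block:: bash

   # run a (hypothetical) analysis from the superdataset, writing into a subdataset
   datalad run -m "Run FEAT for sub-01" \
       --output "sub-01_feat" \
       "bash code/run_feat.sh sub-01"
   # the re-executable run record is attached here, in the superdataset's history
   git log -1
   # inspecting only the subdataset will not reveal how its content was generated
   git -C sub-01_feat log -1
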
Solutions without creating subdatasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to scale up without going through the complexities of
creating several subdatasets, or to tune your setup beyond the creation of
subdatasets. This involves more thought, or some compromises, though.

The following section highlights a few caveats to bear in mind if you attempt
a big analysis in a single-level dataset, and outlines solutions that do not
necessarily involve subdatasets. If you have something to add, please
`get in touch <https://github.com/datalad-handbook/book/issues/new>`_.

Too many files
""""""""""""""

**Caveat**: Drowning a dataset in too many files.

**Examples**: The FSL FEAT analysis mentioned in the introduction produces
several hundred thousand files, but not all of these files are important.
``tsplot/``, for example, is a directory that contains time series plots for
various data and results, and may be of little interest for many analyses once
general quality control is done.

**Solutions**:

- Don't put irrelevant files under version control at all: Consider creating
  a *.gitignore* file with patterns that match files or directories that are of no
  relevance to you. These files will not be version controlled or saved to your
  dataset. Section :ref:`gitignore` can tell you more about this. Be mindful, though:
  Having too many files in a single directory can still be problematic for your
  file system. A concrete example: Suppose your analyses create log files that
  are not precious enough to be version controlled. Adding ``logs/*`` to your
  ``.gitignore`` file and saving this change will keep these files out of
  version control.
- Similarly, you can instruct :dlcmd:`run` to save only specific directories
  or files by specifying them with the ``--output`` option and executing the command
  with the ``--explicit`` flag. This approach may be more suitable if you know
  what you want to keep rather than what is irrelevant. A sketch that combines
  both ideas follows after this list.

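As a rough illustration only, with made-up file names and a made-up analysis
script, the two ideas could be combined like this:

.. code-block:: bash

   # keep the (hypothetical) log files out of version control entirely
   echo "logs/*" >> .gitignore
   datalad save -m "Ignore log files" .gitignore
   # let 'datalad run' save only the declared outputs, nothing else
   datalad run -m "Run FEAT for sub-01" --explicit \
       --input "sub-01/bold.nii.gz" \
       --output "sub-01/feat_results" \
       "bash code/run_feat.sh sub-01"
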
Too many files in Git
"""""""""""""""""""""

**Caveat**: Overwhelming Git because of dataset configurations.

**Example**: If your dataset is configured with a procedure such as ``text2git``, or if
you have modified your ``.gitattributes`` file [#f3]_ to store files of certain types or
below a certain size in :term:`Git` instead of :term:`git-annex`, a sudden
excess of text files can still be overwhelming in terms of total file size.
Several thousand, or tens of thousands of, text files may add up to several GB
even if each individual file is small.

**Solutions**:

- Add files to git-annex instead of Git: Consider creating custom ``largefiles``
  rules for the directories that you generate these files in, or for patterns that
  match file names that do not need to be in Git. This way, these files will be
  put under git-annex's version control. A concrete example: Consider that your
  analyses output a few thousand text files into all ``sub-*/correlations/``
  directories in your dataset. Appending
  ``sub-*/correlations/* annex.largefiles=anything`` to ``.gitattributes`` and
  saving this change will store all of these files in the dataset's annex instead
  of in Git. A sketch of this is shown after this list.
- Don't put irrelevant files under version control at all: As outlined in the
  solutions to the previous caveat, a *.gitignore* file with patterns that match
  irrelevant files or directories (e.g., ``logs/*``) keeps them out of version
  control entirely. Section :ref:`gitignore` can tell you more about this, and the
  same caution applies: too many files in a single directory can still be
  problematic for your file system.

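As before, this is just a sketch that reuses the hypothetical directory pattern
from the example above:

.. code-block:: bash

   # route the (hypothetical) correlation text files into git-annex instead of
   # Git, regardless of their size or type
   echo "sub-*/correlations/* annex.largefiles=anything" >> .gitattributes
   datalad save -m "Store correlation outputs in the annex" .gitattributes
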
.. todo::

   Add more caveats and examples

.. rubric:: Footnotes

.. [#f1] FEAT is a software tool for model-based fMRI data analysis and part of
   `FSL <https://fsl.fmrib.ox.ac.uk>`_.

.. [#f2] Read more about DataLad's Python API in the :ref:`Find-out-more on it <pythonapi>` in
   :ref:`yoda_project`.

.. [#f3] Read up on these configurations in the chapter :ref:`chapter_config`.