
.. _cleanup:

Fixing up too-large datasets
----------------------------

The previous section highlighted the problems of too-large monorepos and advised
strategies to prevent them.
This section introduces some strategies to clean and fix up datasets that got out
of hand size-wise. If there are use cases you would want to see discussed here
or propose solutions for, please
`get in touch <https://github.com/datalad-handbook/book/issues/new>`_.

Getting contents out of Git
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's say you did a :dlcmd:`run` with an analysis that put too
many files under version control by Git, and you want to see them gone.
Sticking to the FSL FEAT analysis example from earlier, you may, for example,
want to get rid of every ``tsplot`` directory, as it contains results that are
irrelevant for you.

Note that there is no way to ``drop`` the files, as they are in Git instead of
git-annex. Removing the files with plain file system operations (``rm``,
``git rm``) does not shrink your dataset either. The files were committed, and
even though they no longer exist in the current state of your dataset, they
still exist in, and thus clutter, your dataset's history. In order to *really*
get committed files out of Git, you need to rewrite history. And for this you
need heavy machinery:
`git-filter-repo <https://github.com/newren/git-filter-repo>`_ [#f1]_.
It is a powerful and potentially dangerous tool to rewrite Git history.
Treat this tool like a chainsaw: very helpful for heavy-duty tasks, but also
life-threatening. The command
``git-filter-repo <path-specification> --force`` will "filter out", i.e., remove
all files **but the ones specified** in ``<path-specification>`` from the
dataset's history. Before you use it, please make sure to read its help page
thoroughly.

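To see why ``git rm`` alone does not shrink a repository, the following minimal sketch commits a file in a throwaway repository, removes it again, and then checks whether its content is still stored. It only uses plain Git. The file name, commit messages, and identity settings are made up for illustration.

```shell
#!/bin/sh
# Throwaway repository in a temporary directory; all names are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Commit a file, then remove it again with `git rm`.
printf 'some result\n' > tsplot.txt
git add tsplot.txt
git commit -q -m "add result"
blob=$(git rev-parse HEAD:tsplot.txt)   # object ID of the file content
git rm -q tsplot.txt
git commit -q -m "remove result"

# The file is gone from the working tree ...
if [ -e tsplot.txt ]; then in_worktree=yes; else in_worktree=no; fi
# ... but its content is still in the object database, because the
# first commit keeps referencing it.
if git cat-file -e "$blob"; then in_history=yes; else in_history=no; fi
echo "in worktree: $in_worktree, in history: $in_history"
```

The same holds for every file that was ever committed: as long as some commit in the history references it, Git keeps its content around.
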
.. find-out-more:: Installing git-filter-repo

   ``git-filter-repo`` is not part of Git but needs to be installed separately.
   Its `GitHub repository <https://github.com/newren/git-filter-repo>`_ contains
   more, and more detailed, instructions, but it can be installed via :term:`pip`
   (``pip install git-filter-repo``) and is available via standard package managers
   for macOS and some Linux distributions (mostly rpm-based ones).

The general procedure is the following:

1. :dlcmd:`clone` the repository. This is a safeguard to protect your
   dataset should something go wrong. The clone you are creating will be your
   new, cleaned up dataset.
2. :dlcmd:`get` all the dataset contents by running ``datalad get .``
   in the clone.
3. ``git-filter-repo`` what you don't want anymore (see below).
4. Run ``git annex unused`` and a subsequent ``git annex dropunused all`` to remove
   stale file contents that are not referenced anymore.
5. Finally, do some aggressive `garbage collection <https://git-scm.com/docs/git-gc>`_
   with ``git gc --aggressive``.

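The five steps can be strung together in a small shell function. This is only a sketch under assumptions: ``run``, ``cleanup_clone``, and the ``DRY_RUN`` variable are hypothetical helpers that are not part of DataLad or Git, and the paths in the usage line are placeholders. By default the function only prints the commands it would run, so that nothing is rewritten by accident.

```shell
#!/bin/sh
# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

# Hypothetical helper: cleanup_clone <source-dataset> <clone-path> <path-regex>
# The body runs in a subshell ( ... ) so that `cd` does not leak to the caller.
cleanup_clone() (
    src=$1; clone=$2; regex=$3
    # 1. clone the dataset as a safeguard
    run datalad clone "$src" "$clone" || exit 1
    # In a dry run the clone does not exist, so only change into it for real.
    if [ "${DRY_RUN:-1}" = 0 ]; then cd "$clone" || exit 1; fi
    # 2. retrieve all file contents
    run datalad get . || exit 1
    # 3. rewrite history, removing everything that matches the regex
    run git-filter-repo --path-regex "$regex" --invert-paths --force || exit 1
    # 4. find and drop annexed contents that are no longer referenced
    run git annex unused || exit 1
    run git annex dropunused all || exit 1
    # 5. aggressive garbage collection
    run git gc --aggressive
)
```

Calling ``cleanup_clone /path/to/dataset /path/to/clean-clone '<regex>'`` prints the six commands. Setting ``DRY_RUN=0`` executes them, which should only be done after reading the ``git-filter-repo`` help page.
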
In order to get the hang of the ``git-filter-repo`` step, consider a directory
structure similar to this exemplary run-wise FEAT analysis output structure:

.. code-block:: bash

   $ tree
   sub-*/run-*_<task>-<level>.feat
   ├── custom_timing_files
   ├── logs
   ├── reg
   ├── reg_standard
   │   ├── reg
   │   └── stats
   ├── stats
   └── tsplot

Each such ``sub-*`` directory contains about 3000 files, and the majority of
them are irrelevant text files in ``tsplot/``.
In order to remove them for all subjects and runs from the dataset history,
the following command can be used::

   $ git-filter-repo --path-regex '^sub-[0-9]{2}/run-[0-9]{1}.*\.feat/tsplot/.*$' --invert-paths --force

The ``--path-regex`` option and the regular expression
``'^sub-[0-9]{2}/run-[0-9]{1}.*\.feat/tsplot/.*$'`` [#f2]_
match all file paths inside of the ``tsplot/`` directories of all subjects and
runs.
The option ``--invert-paths`` then *inverts* this path specification, so that
only the files in ``tsplot/`` are filtered out. Note that non-regex path
specifications are also possible, for example with the options
``--path-match`` or ``--path-glob``, or with a specification placed in a file.
Please see the manual of ``git-filter-repo`` for more information.

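Because a history rewrite is hard to undo, it can help to preview which paths a pattern would match before handing it to ``git-filter-repo``. The sketch below does this with ``grep -E`` on a handful of invented paths. It rests on two assumptions: the file names are hypothetical stand-ins for real FEAT outputs, and ``grep -E`` uses POSIX extended regular expressions, whereas ``git-filter-repo`` uses Python's, so the two only agree for simple patterns like this one. The pattern is written with the "any further characters" part spelled out as ``.*``.

```shell
#!/bin/sh
# Pattern for everything inside the tsplot/ directories of all subjects and runs.
regex='^sub-[0-9]{2}/run-[0-9]{1}.*\.feat/tsplot/.*$'

# Hypothetical paths standing in for the output of `git ls-files`.
paths='sub-01/run-1_task-level1.feat/tsplot/tsplot_zstat1.txt
sub-01/run-1_task-level1.feat/stats/zstat1.nii.gz
sub-02/run-2_task-level1.feat/tsplot/ps_tsplot_zstat1.png
sub-02/run-2_task-level1.feat/logs/feat1
code/tsplot/notes.txt'

# Paths that --path-regex "$regex" --invert-paths would filter out.
removed=$(printf '%s\n' "$paths" | grep -E "$regex")
# Paths that would be kept (grep -v inverts the match).
kept=$(printf '%s\n' "$paths" | grep -Ev "$regex")

printf 'would be removed:\n%s\n' "$removed"
printf 'would be kept:\n%s\n' "$kept"
```

In a real dataset, the invented list can be replaced with ``git ls-files`` for the current state, or with ``git log --all --name-only --pretty=format: | sort -u`` for every path that appears anywhere in the history.
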
.. rubric:: Footnotes

.. [#f1] Wait, what about ``git filter-branch``? Beyond the better performance of
         ``git-filter-repo``, Git also discourages the use of ``filter-branch``
         for safety reasons and points to ``git-filter-repo`` as an alternative.
         For more background info, see this
         `thread <https://lore.kernel.org/git/CABPp-BEr8LVM+yWTbi76hAq7Moe1hyp2xqxXfgVV4_teh_9skA@mail.gmail.com>`_.

.. [#f2] Regular expressions can be a pain to comprehend if you're not used to
         reading them. This one matches paths that start with (``^``) ``sub-``
         followed by exactly two (``{2}``) digits between 0 and 9
         (``[0-9]``), followed by ``/run-`` with exactly one (``{1}``) digit
         between 0 and 9 (``[0-9]``), followed by zero or more other characters
         up to ``.feat/tsplot/``, and ending (``$``) with any amount of
         any character (``.*``). Not exactly easy, but effective.
         One way to practice reading regular expressions, if you're interested
         in that, is by playing `regex crossword <https://regexcrossword.com>`_.