
.. _gists:

Gists
=====

The more complex and larger your DataLad project, the more difficult it is to do
efficient housekeeping.
This section is a selection of code snippets tuned to perform specific,
non-trivial tasks in datasets. Often, they are not limited to single commands of
the version control tools you know, but combine helpful other command line
tools and general Unix command line magic. Just like
`GitHub gists <https://gist.github.com>`_, it's a collection of lightweight
and easily accessible tips and tricks. For a more basic command overview,
take a look at the :ref:`cheat`. The
`tips collection of git-annex <https://git-annex.branchable.com/tips>`_ is also
a very valuable resource.

.. image:: ../artwork/src/gists.svg
   :width: 50%
   :align: center

.. _parallelize:

Parallelize subdataset processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DataLad cannot yet parallelize processes that are performed
independently over a large number of subdatasets. Pushing across a dataset
hierarchy, for example, is performed one dataset after the other.
Unix, however, has a few tools such as `xargs <https://en.wikipedia.org/wiki/Xargs>`_
or the ``parallel`` tool of `moreutils <https://joeyh.name/code/moreutils>`_
that can assist.

Here is an example of pushing all subdatasets (and their respective subdatasets)
recursively to their (identically named) siblings:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d

``datalad -f '{path}' subdatasets`` discovers the paths of all subdatasets,
and ``xargs`` hands them individually (``-n 1``) to a (recursive) :dlcmd:`push`,
but performs 10 of these operations in parallel (``-P 10``), thus achieving
parallelization.

Here is an example of cross-dataset download parallelization:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d

Operations like this can safely be attempted for all commands that are independent
across subdatasets.
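
The ``parallel`` tool can achieve the same. Note, though, that the ``parallel``
shipped with moreutils and the more widespread GNU parallel differ in syntax;
here is a sketch of the download example using GNU parallel's syntax:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | parallel -j 10 datalad get -d {}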

Check whether all file content is present locally
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to check if all the files in a dataset have their file contents locally
available, you can ask git-annex:

.. code-block:: console

   $ git annex find --not --in=here

Any file that does not have its contents locally available will be listed.

If there are subdatasets you want to recurse into, use the following command:

.. code-block:: console

   $ git submodule foreach --quiet --recursive \
     'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'

Alternatively, to get very comprehensive output, you can use

.. code-block:: console

   $ datalad -f json status --recursive --annex availability

The output will be returned as JSON, and the key ``has_content`` indicates local
content availability (``true`` or ``false``). To filter through it, the command
line tool `jq <https://stedolan.github.io/jq>`_ works well:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
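
Conversely, to list only the paths of files whose content is not locally
present, invert the filter:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == false).path'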

Drop annexed files from all past commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there is annexed file content that is not used anymore (i.e., data in the
annex that no files in any branch point to anymore, such as corrupt files),
you can find out about it and remove this file content from your dataset
(i.e., completely and irrecoverably delete it) with git-annex's commands
:gitannexcmd:`unused` and :gitannexcmd:`dropunused`.

Find out which file contents are unused (not referenced by any current branch):

.. code-block:: console

   $ git annex unused
   unused . (checking for unused data...)
     Some annexed data is no longer used by any files in the repository.
       NUMBER  KEY
       1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
       2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
     (To see where data was previously used, try: git log --stat -S'KEY')
     (To remove unwanted data: git-annex dropunused NUMBER)
   ok
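
As the hint in the output suggests, you can check where a piece of unused data
was previously used before dropping it, e.g., for key number 2 from the
listing above:

.. code-block:: console

   $ git log --stat -S'SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1'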

Remove a single unused file by specifying its number in the listing above:

.. code-block:: console

   $ git annex dropunused 1
   dropunused 1 ok

Or a range of unused data with

.. code-block:: console

   $ git annex dropunused 1-1000

Or all

.. code-block:: console

   $ git annex dropunused all

Getting single file sizes prior to downloading from the Python API and the CLI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single file, :dlcmd:`status --annex -- myfile` will report on
the size of the file prior to a :dlcmd:`get`.
If you want to do it in Python, try this approach:

.. code-block:: python

   import datalad.api as dl

   ds = dl.Dataset("/path/to/some/dataset")
   results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
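
The returned ``results`` is a list of result records (dictionaries), one per
queried path. Below is a sketch of how size information could be read out of
them; the key name ``bytesize`` is an assumption that may differ between
DataLad versions, so inspect a record first to verify:

.. code-block:: python

   # look at one record to see which keys your DataLad version reports
   print(results[0])
   # sum the sizes of all annexed files (key name assumed: 'bytesize')
   total = sum(int(r["bytesize"]) for r in results if r.get("bytesize"))
   print(f"{total} bytes would be downloaded")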

Check whether a dataset contains an annex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Datasets can either be GitRepos (i.e., sole Git repositories; this happens when
they are created with the ``--no-annex`` flag, for example) or AnnexRepos
(i.e., datasets that contain an annex). Information on which kind of repository
a dataset is can be found in the dataset report of :dlcmd:`wtf` under the key
``repo``. Here is a one-liner to get this info:

.. code-block:: console

   $ datalad -f'{infos[dataset][repo]}' wtf
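
A lower-level alternative that needs only Git is to check for the presence of
a :term:`git-annex branch`. This is a sketch, with the caveat that it only
detects an annex that has already been initialized locally:

.. code-block:: console

   $ git show-ref --verify --quiet refs/heads/git-annex && echo "contains an annex"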

.. index::
   pair: create-sibling; DataLad command

Backing-up datasets
^^^^^^^^^^^^^^^^^^^

In order to back up datasets you can publish them to a
:term:`Remote Indexed Archive (RIA) store` or to a sibling dataset. The former
solution does not require Git, git-annex, or DataLad to be installed on the
machine that the backup is pushed to; the latter does require them.
To find out more about RIA stores, check out the online handbook.
A sketch of how to implement a sibling for backups is below:

.. code-block:: console

   $ # create a backup sibling
   $ datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
   $ # publish a full backup of the current branch
   $ datalad publish --to=myserver -r
   $ # subsequently, publish updates to be backed up with
   $ datalad publish --to=myserver -r --since= --missing=inherit

In order to push not only the current branch, but all refs, add the option
``--publish-by-default "refs/*"`` to the :dlcmd:`create-sibling` call.
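
For example, the first call in the sketch above would then become:

.. code-block:: console

   $ datalad create-sibling --annex-wanted anything --publish-by-default "refs/*" -r myserver:/path/to/backup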

Should you want to back up all annexed data, even past versions of files, use
:gitannexcmd:`sync` to push to the sibling:

.. code-block:: console

   $ git annex sync --all --content <sibling-name>

For an in-depth explanation and example take a look at the
`GitHub issue that raised this question <https://github.com/datalad/datalad/issues/4369>`_.

.. _retrieveHCP:

Retrieve partial content from a hierarchy of (uninstalled) datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to :dlcmd:`get` dataset content across a range of subdatasets, a bit
of Unix command line foo can increase the efficiency of your command.
Example: consider retrieving all ``ribbon.nii.gz`` files for all subjects in the
`HCP open access dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
(a dataset with about 4500 subdatasets -- read more about it in
:ref:`usecase_HCP_dataset`).

If all subject-subdatasets are installed (e.g., with ``datalad get -n -r`` for
a recursive installation without file retrieval), :term:`globbing` with the
shell works fine:

.. code-block:: console

   $ datalad get HCP1200/*/T1w/ribbon.nii.gz

The Gist :ref:`parallelize` can show you how to parallelize this.

If the subdatasets are not yet installed, globbing will not work, because the
shell can't expand non-existent paths. As an alternative, you can pipe the output
of an (arbitrarily complex) :dlcmd:`search` command into
:dlcmd:`get`:

.. code-block:: console

   $ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get

However, if you know the file locations within the dataset hierarchy and they
are predictably named and consistent, you can create a file containing all paths to
be retrieved and pipe that into :dlcmd:`get` as well:

.. code-block:: console

   $ # create a file with all file paths
   $ for sub in HCP1200/*; do echo ${sub}/T1w/ribbon.nii.gz; done > toget.txt
   $ # pipe it into datalad get
   $ cat toget.txt | xargs -n 100 datalad get

.. _speedystatus:

Speed up status reports in large datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In datasets with deep dataset hierarchies or large numbers of files,
:dlcmd:`status` calls can be expensive. Handily,
the command provides options that can boost performance by limiting what is being
tested and reported. In order to speed up subdataset state evaluation,
``-e/--eval-subdataset-state`` can be set to ``commit`` or ``no``. Instead of checking
recursively for uncommitted modifications in subdatasets, this leads ``status``
to only compare the most recent commit :term:`shasum` in the subdataset against
the recorded subdataset state in the superdataset (``commit``), or to skip subdataset
state evaluation completely (``no``). In order to speed up file type evaluation,
the option ``-t/--report-filetype`` can be set to ``raw``. This skips the evaluation
of whether symlinks are pointers to annexed files (in which case a symlink
would be reported as type "file"). Instead, all symlinks will be reported as
being of type "symlink".
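
Combining both speed-ups, a faster recursive status call could look like this:

.. code-block:: console

   $ datalad status -r -e commit -t raw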

Squashing git-annex history
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A large number of commits in the :term:`git-annex branch` (think: thousands
rather than hundreds) can inflate your repository and increase the size of the
``.git`` directory, which can lead to slower cloning operations.
There are, however, ways to shrink the commit history in the annex branch.
In order to :term:`squash` the entire git-annex history into a single commit, run

.. code-block:: console

   $ git annex forget --drop-dead --force

Afterwards, if your dataset has a sibling, the branch needs to be
:term:`force-push`\ ed. If you attempt an operation to shrink your git-annex
history, also check out
`this thread <https://git-annex.branchable.com/forum/safely_dropping_git-annex_history>`_
for more information on shrinking git-annex's history and helpful safeguards and
potential caveats.
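
For example, with a sibling named ``<sibling-name>`` (a placeholder; substitute
the actual name of your sibling), the rewritten branch can be pushed with:

.. code-block:: console

   $ git push --force <sibling-name> git-annex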