.. _gists:

Gists
=====

The more complex and larger your DataLad project, the more difficult it is to do
efficient housekeeping.
This section is a selection of code snippets tuned to perform specific,
non-trivial tasks in datasets. Often, they are not limited to single commands of
the version control tools you know, but combine helpful other command line
tools and general Unix command line magic. Just like
`GitHub gists <https://gist.github.com>`_, it's a collection of lightweight
and easily accessible tips and tricks. For a more basic command overview,
take a look at the :ref:`cheat`. The
`tips collection of git-annex <https://git-annex.branchable.com/tips>`_ is also
a very valuable resource.

.. image:: ../artwork/src/gists.svg
   :width: 50%
   :align: center

.. _parallelize:

Parallelize subdataset processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DataLad cannot yet parallelize processes that are performed independently over
a large number of subdatasets. Pushing across a dataset hierarchy, for example,
is performed one subdataset after the other.
Unix, however, has tools such as `xargs <https://en.wikipedia.org/wiki/Xargs>`_
or the ``parallel`` tool of `moreutils <https://joeyh.name/code/moreutils>`_
that can assist.

Here is an example of pushing all subdatasets (and their respective subdatasets)
recursively to their (identically named) siblings:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d

``datalad -f '{path}' subdatasets`` discovers the paths of all subdatasets,
and ``xargs`` hands them individually (``-n 1``) to a (recursive) :dlcmd:`push`,
but runs 10 of these operations in parallel (``-P 10``), thus achieving
parallelization.

Here is an example of cross-dataset download parallelization:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d

Operations like this can safely be attempted for all commands that are independent
across subdatasets.
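
If you prefer the ``parallel`` tool of moreutils mentioned above, an equivalent
parallel download could look like this (a sketch; note that moreutils'
``parallel`` appends each argument after ``--`` to a separate invocation of the
command, and its syntax differs from GNU parallel):

.. code-block:: console

   $ parallel -j 10 datalad get -d -- $(datalad -f '{path}' subdatasets)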

Check whether all file content is present locally
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to check whether all files in a dataset have their file contents locally
available, you can ask git-annex:

.. code-block:: console

   $ git annex find --not --in=here

Any file that does not have its contents locally available will be listed.

If there are subdatasets you want to recurse into, use the following command:

.. code-block:: console

   $ git submodule foreach --quiet --recursive \
     'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'

Alternatively, to get very comprehensive output, you can use

.. code-block:: console

   $ datalad -f json status --recursive --annex availability

The output will be returned as JSON, and the key ``has_content`` indicates local
content availability (``true`` or ``false``). To filter through it, the command
line tool `jq <https://stedolan.github.io/jq>`_ works well:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
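
Conversely, to list only the files whose content is *not* locally present, the
same filter can be inverted:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == false).path'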

Drop annexed files from all past commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there is annexed file content that is not used anymore (i.e., data in the
annex that no files in any branch point to anymore, such as corrupt files),
you can find out about it and remove this file content from your dataset
(i.e., completely and irrecoverably delete it) with git-annex's commands
:gitannexcmd:`unused` and :gitannexcmd:`dropunused`.

Find out which file contents are unused (not referenced by any current branch):

.. code-block:: console

   $ git annex unused
   unused . (checking for unused data...)
     Some annexed data is no longer used by any files in the repository.
       NUMBER  KEY
       1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
       2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
     (To see where data was previously used, try: git log --stat -S'KEY')
     (To remove unwanted data: git-annex dropunused NUMBER)
   ok

Remove a single unused file by specifying its number in the listing above:

.. code-block:: console

   $ git annex dropunused 1
   dropunused 1 ok

Remove a range of unused data with

.. code-block:: console

   $ git annex dropunused 1-1000

or all of it with

.. code-block:: console

   $ git annex dropunused all

Getting single file sizes prior to downloading from the Python API and the CLI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single file, :dlcmd:`status --annex -- myfile` will report the
size of the file prior to a :dlcmd:`get`.
If you want to do it in Python, try this approach:

.. code-block:: python

   import datalad.api as dl

   ds = dl.Dataset("/path/to/some/dataset")
   # annex="basic" includes git-annex information, such as file sizes,
   # in the returned result records
   results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
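
The return value is a list of result records (dictionaries). A minimal sketch of
reading the recorded file size from them, assuming the ``bytesize`` key that
DataLad's annex-aware status reporting provides:

.. code-block:: python

   for res in results:
       # 'bytesize' holds the size of an annexed file in bytes
       # (assumed key name; absent for files tracked by plain Git)
       if "bytesize" in res:
           print(res["path"], res["bytesize"])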

Check whether a dataset contains an annex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Datasets can either be GitRepos (i.e., sole Git repositories; this happens when
they are created with the ``--no-annex`` flag, for example) or AnnexRepos
(i.e., datasets that contain an annex). Which kind of repository a dataset is
can be found in the dataset report of :dlcmd:`wtf` under the key ``repo``.
Here is a one-liner to get this info:

.. code-block:: console

   $ datalad -f'{infos[dataset][repo]}' wtf
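
A similar check is possible from Python: a dataset's ``repo`` property is an
``AnnexRepo`` instance if the dataset contains an annex, and a plain ``GitRepo``
otherwise. A minimal sketch:

.. code-block:: python

   import datalad.api as dl

   ds = dl.Dataset("/path/to/some/dataset")
   # prints, e.g., "AnnexRepo" for datasets with an annex
   print(type(ds.repo).__name__)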

.. index::
   pair: create-sibling; DataLad command

Backing up datasets
^^^^^^^^^^^^^^^^^^^

In order to back up datasets you can publish them to a
:term:`Remote Indexed Archive (RIA) store` or to a sibling dataset. The former
solution does not require Git, git-annex, or DataLad to be installed on the
machine that the backup is pushed to; the latter does require them.
To find out more about RIA stores, check out the online handbook.
A sketch of how to implement a sibling for backups is below:

.. code-block:: console

   $ # create a backup sibling
   $ datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
   $ # publish a full backup of the current branch
   $ datalad publish --to=myserver -r
   $ # subsequently, publish updates to be backed up with
   $ datalad publish --to=myserver -r --since= --missing=inherit

In order to push not only the current branch, but all refs, add the option
``--publish-by-default "refs/*"`` to the :dlcmd:`create-sibling` call.
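
A sketch of the resulting :dlcmd:`create-sibling` call, combining this option
with the sibling name and path placeholders from the example above:

.. code-block:: console

   $ datalad create-sibling --annex-wanted anything --publish-by-default 'refs/*' -r myserver:/path/to/backup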

Should you want to back up all annexed data, including past versions of files,
use :gitannexcmd:`sync` to push to the sibling:

.. code-block:: console

   $ git annex sync --all --content <sibling-name>

For an in-depth explanation and example, take a look at the
`GitHub issue that raised this question <https://github.com/datalad/datalad/issues/4369>`_.

.. _retrieveHCP:

Retrieve partial content from a hierarchy of (uninstalled) datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to :dlcmd:`get` dataset content across a range of subdatasets, a bit
of Unix command line foo can increase the efficiency of your command.
Example: consider retrieving all ``ribbon.nii.gz`` files for all subjects in the
`HCP open access dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
(a dataset with about 4500 subdatasets -- read more about it in
:ref:`usecase_HCP_dataset`).

If all subject subdatasets are installed (e.g., with ``datalad get -n -r`` for
a recursive installation without file retrieval), :term:`globbing` with the
shell works fine:

.. code-block:: console

   $ datalad get HCP1200/*/T1w/ribbon.nii.gz

The Gist :ref:`parallelize` can show you how to parallelize this.

If the subdatasets are not yet installed, globbing will not work, because the
shell cannot expand non-existent paths. As an alternative, you can pipe the output
of an (arbitrarily complex) :dlcmd:`search` command into :dlcmd:`get`:

.. code-block:: console

   $ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get

However, if you know the file locations within the dataset hierarchy and they
are predictably named and consistent, you can create a file containing all paths to
be retrieved and pipe that into :dlcmd:`get` as well:

.. code-block:: console

   $ # create a file with all file paths
   $ for sub in HCP1200/*; do echo ${sub}/T1w/ribbon.nii.gz; done > toget.txt
   $ # pipe it into datalad get
   $ cat toget.txt | xargs -n 100 datalad get

.. _speedystatus:

Speed up status reports in large datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In datasets with deep dataset hierarchies or large numbers of files,
:dlcmd:`status` calls can be expensive. Handily,
the command provides options that can boost performance by limiting what is being
tested and reported. In order to speed up subdataset state evaluation,
``-e/--eval-subdataset-state`` can be set to ``commit`` or ``no``. Instead of checking
recursively for uncommitted modifications in subdatasets, this leads ``status``
to only compare the most recent commit :term:`shasum` in the subdataset against
the recorded subdataset state in the superdataset (``commit``), or to skip subdataset
state evaluation completely (``no``). In order to speed up file type evaluation,
the option ``-t/--report-filetype`` can be set to ``raw``. This skips the evaluation
of whether symlinks are pointers to annexed files (upon which, if true, the symlink
would be reported as type "file"). Instead, all symlinks will be reported as
being of type "symlink".
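
As a sketch, a sped-up recursive status call that combines both options
described above could look like this:

.. code-block:: console

   $ datalad status -r -e commit -t raw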

Squashing git-annex history
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A large number of commits in the :term:`git-annex branch` (think: thousands
rather than hundreds) can inflate your repository and increase the size of the
``.git`` directory, which can lead to slower cloning operations.
There are, however, ways to shrink the commit history in the annex branch.
In order to :term:`squash` the entire git-annex history into a single commit, run

.. code-block:: console

   $ git annex forget --drop-dead --force

Afterwards, if your dataset has a sibling, the branch needs to be
:term:`force-push`\ed (see the sketch below). If you attempt an operation to
shrink your git-annex history, also check out
`this thread <https://git-annex.branchable.com/forum/safely_dropping_git-annex_history>`_
for more information on shrinking git-annex's history, helpful safeguards, and
potential caveats.
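
A force-push of the rewritten git-annex branch can be done with plain Git
(``<sibling-name>`` is your sibling's name, as before):

.. code-block:: console

   $ git push --force <sibling-name> git-annex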