.. _gists:

Gists
=====

The more complex and larger your DataLad project, the more difficult it is to do
efficient housekeeping.
This section is a selection of code snippets tuned to perform specific,
non-trivial tasks in datasets. Often, they are not limited to single commands of
the version control tools you know, but combine helpful other command line
tools and general Unix command line magic. Just like
`GitHub gists <https://gist.github.com>`_, it's a collection of lightweight
and easily accessible tips and tricks. For a more basic command overview,
take a look at the :ref:`cheat`. The
`tips collection of git-annex <https://git-annex.branchable.com/tips>`_ is also
a very valuable resource.

.. image:: ../artwork/src/gists.svg
   :width: 50%
   :align: center

.. _parallelize:

Parallelize subdataset processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DataLad cannot yet parallelize processes that are performed independently over
a large number of subdatasets. Pushing across a dataset hierarchy, for example,
is performed one subdataset after the other.
Unix, however, has tools such as `xargs <https://en.wikipedia.org/wiki/Xargs>`_
or the ``parallel`` tool of `moreutils <https://joeyh.name/code/moreutils>`_
that can assist.

Here is an example of pushing all subdatasets (and their respective subdatasets)
recursively to their (identically named) siblings:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d

``datalad -f '{path}' subdatasets`` discovers the paths of all subdatasets,
and ``xargs`` hands them individually (``-n 1``) to a (recursive) :dlcmd:`push`,
but runs 10 of these operations in parallel (``-P 10``), thus achieving
parallelization.

Here is an example of cross-dataset download parallelization:

.. code-block:: console

   $ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d

Operations like this can safely be attempted for all commands that are independent
across subdatasets.
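
If you prefer the ``parallel`` tool of moreutils mentioned above, an equivalent
parallel download could look like this (a sketch; note that moreutils'
``parallel`` appends each argument after ``--`` to a separate invocation of the
command, and its syntax differs from GNU parallel):

.. code-block:: console

   $ parallel -j 10 datalad get -d -- $(datalad -f '{path}' subdatasets)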

Check whether all file content is present locally
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to check whether all files in a dataset have their file contents locally
available, you can ask git-annex:

.. code-block:: console

   $ git annex find --not --in=here

Any file that does not have its contents locally available will be listed.

If there are subdatasets you want to recurse into, use the following command:

.. code-block:: console

   $ git submodule foreach --quiet --recursive \
     'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'

Alternatively, to get very comprehensive output, you can use

.. code-block:: console

   $ datalad -f json status --recursive --annex availability

The output will be returned as JSON, and the key ``has_content`` indicates local
content availability (``true`` or ``false``). To filter through it, the command
line tool `jq <https://stedolan.github.io/jq>`_ works well:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
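
Conversely, to list only the files whose content is *not* locally present, the
same filter can be inverted:

.. code-block:: console

   $ datalad -f json status --recursive --annex all | jq '. | select(.has_content == false).path'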

Drop annexed files from all past commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there is annexed file content that is not used anymore (i.e., data in the
annex that no files in any branch point to anymore, such as corrupt files),
you can find out about it and remove this file content from your dataset
(i.e., completely and irrecoverably delete it) with git-annex's commands
:gitannexcmd:`unused` and :gitannexcmd:`dropunused`.

Find out which file contents are unused (not referenced by any current branch):

.. code-block:: console

   $ git annex unused
   unused . (checking for unused data...)
     Some annexed data is no longer used by any files in the repository.
       NUMBER  KEY
       1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
       2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
     (To see where data was previously used, try: git log --stat -S'KEY')
     (To remove unwanted data: git-annex dropunused NUMBER)
   ok

Remove a single unused file by specifying its number in the listing above:

.. code-block:: console

   $ git annex dropunused 1
   dropunused 1 ok

Remove a range of unused data with

.. code-block:: console

   $ git annex dropunused 1-1000

or all of it with

.. code-block:: console

   $ git annex dropunused all

Getting single file sizes prior to downloading from the Python API and the CLI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single file, :dlcmd:`status --annex -- myfile` will report the
size of the file prior to a :dlcmd:`get`.
If you want to do it in Python, try this approach:

.. code-block:: python

   import datalad.api as dl

   ds = dl.Dataset("/path/to/some/dataset")
   # annex="basic" includes git-annex information, such as file sizes,
   # in the returned result records
   results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
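
The return value is a list of result records (dictionaries). A minimal sketch of
reading the recorded file size from them, assuming the ``bytesize`` key that
DataLad's annex-aware status reporting provides:

.. code-block:: python

   for res in results:
       # 'bytesize' holds the size of an annexed file in bytes
       # (assumed key name; absent for files tracked by plain Git)
       if "bytesize" in res:
           print(res["path"], res["bytesize"])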

Check whether a dataset contains an annex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Datasets can either be GitRepos (i.e., sole Git repositories; this happens when
they are created with the ``--no-annex`` flag, for example) or AnnexRepos
(i.e., datasets that contain an annex). Which kind of repository a dataset is
can be found in the dataset report of :dlcmd:`wtf` under the key ``repo``.
Here is a one-liner to get this info:

.. code-block:: console

   $ datalad -f'{infos[dataset][repo]}' wtf
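
A similar check is possible from Python: a dataset's ``repo`` property is an
``AnnexRepo`` instance if the dataset contains an annex, and a plain ``GitRepo``
otherwise. A minimal sketch:

.. code-block:: python

   import datalad.api as dl

   ds = dl.Dataset("/path/to/some/dataset")
   # prints, e.g., "AnnexRepo" for datasets with an annex
   print(type(ds.repo).__name__)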

.. index::
   pair: create-sibling; DataLad command

Backing up datasets
^^^^^^^^^^^^^^^^^^^

In order to back up datasets you can publish them to a
:term:`Remote Indexed Archive (RIA) store` or to a sibling dataset. The former
solution does not require Git, git-annex, or DataLad to be installed on the
machine that the backup is pushed to; the latter does require them.
To find out more about RIA stores, check out the online handbook.
A sketch of how to implement a sibling for backups is below:

.. code-block:: console

   $ # create a backup sibling
   $ datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
   $ # publish a full backup of the current branch
   $ datalad publish --to=myserver -r
   $ # subsequently, publish updates to be backed up with
   $ datalad publish --to=myserver -r --since= --missing=inherit

In order to push not only the current branch, but all refs, add the option
``--publish-by-default "refs/*"`` to the :dlcmd:`create-sibling` call.
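
A sketch of the resulting :dlcmd:`create-sibling` call, combining this option
with the sibling name and path placeholders from the example above:

.. code-block:: console

   $ datalad create-sibling --annex-wanted anything --publish-by-default 'refs/*' -r myserver:/path/to/backup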

Should you want to back up all annexed data, including past versions of files,
use :gitannexcmd:`sync` to push to the sibling:

.. code-block:: console

   $ git annex sync --all --content <sibling-name>

For an in-depth explanation and example, take a look at the
`GitHub issue that raised this question <https://github.com/datalad/datalad/issues/4369>`_.

.. _retrieveHCP:

Retrieve partial content from a hierarchy of (uninstalled) datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to :dlcmd:`get` dataset content across a range of subdatasets, a bit
of Unix command line foo can increase the efficiency of your command.
Example: consider retrieving all ``ribbon.nii.gz`` files for all subjects in the
`HCP open access dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
(a dataset with about 4500 subdatasets -- read more about it in
:ref:`usecase_HCP_dataset`).

If all subject subdatasets are installed (e.g., with ``datalad get -n -r`` for
a recursive installation without file retrieval), :term:`globbing` with the
shell works fine:

.. code-block:: console

   $ datalad get HCP1200/*/T1w/ribbon.nii.gz

The Gist :ref:`parallelize` can show you how to parallelize this.

If the subdatasets are not yet installed, globbing will not work, because the
shell cannot expand non-existent paths. As an alternative, you can pipe the output
of an (arbitrarily complex) :dlcmd:`search` command into :dlcmd:`get`:

.. code-block:: console

   $ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get

However, if you know the file locations within the dataset hierarchy and they
are predictably named and consistent, you can create a file containing all paths to
be retrieved and pipe that into :dlcmd:`get` as well:

.. code-block:: console

   $ # create a file with all file paths
   $ for sub in HCP1200/*; do echo ${sub}/T1w/ribbon.nii.gz; done > toget.txt
   $ # pipe it into datalad get
   $ cat toget.txt | xargs -n 100 datalad get

.. _speedystatus:

Speed up status reports in large datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In datasets with deep dataset hierarchies or large numbers of files,
:dlcmd:`status` calls can be expensive. Handily,
the command provides options that can boost performance by limiting what is being
tested and reported. In order to speed up subdataset state evaluation,
``-e/--eval-subdataset-state`` can be set to ``commit`` or ``no``. Instead of checking
recursively for uncommitted modifications in subdatasets, this leads ``status``
to only compare the most recent commit :term:`shasum` in the subdataset against
the recorded subdataset state in the superdataset (``commit``), or to skip subdataset
state evaluation completely (``no``). In order to speed up file type evaluation,
the option ``-t/--report-filetype`` can be set to ``raw``. This skips the evaluation
of whether symlinks are pointers to annexed files (upon which, if true, the symlink
would be reported as type "file"). Instead, all symlinks will be reported as
being of type "symlink".
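
As a sketch, a sped-up recursive status call that combines both options
described above could look like this:

.. code-block:: console

   $ datalad status -r -e commit -t raw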

Squashing git-annex history
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A large number of commits in the :term:`git-annex branch` (think: thousands
rather than hundreds) can inflate your repository and increase the size of the
``.git`` directory, which can lead to slower cloning operations.
There are, however, ways to shrink the commit history in the annex branch.
In order to :term:`squash` the entire git-annex history into a single commit, run

.. code-block:: console

   $ git annex forget --drop-dead --force

Afterwards, if your dataset has a sibling, the branch needs to be
:term:`force-push`\ed (see the sketch below). If you attempt an operation to
shrink your git-annex history, also check out
`this thread <https://git-annex.branchable.com/forum/safely_dropping_git-annex_history>`_
for more information on shrinking git-annex's history, helpful safeguards, and
potential caveats.
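
A force-push of the rewritten git-annex branch can be done with plain Git
(``<sibling-name>`` is your sibling's name, as before):

.. code-block:: console

   $ git push --force <sibling-name> git-annex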