123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453 |
- .. index:: ! procedures, run-procedures
- .. _procedures:
- Configurations to go
- --------------------
- The past two sections should have given you a comprehensive
- overview on the different configuration options the tools
- Git, git-annex, and DataLad provide. They not only
- showed you a way to configure everything you may need to
- configure, but also gave explanations about what the
- configuration options actually mean.
- But figuring out which configurations are useful and how
- to apply them are also not the easiest tasks. Therefore,
- some clever people decided to assist with
- these tasks, and created pre-configured *procedures*
- that process datasets in a particular way.
- These procedures can be shipped within DataLad or its extensions,
- lie on a system, or can be shared together with datasets.
- One of such procedures is the ``text2git`` configuration.
- In order to learn about procedures in general, let's demystify
- what the ``text2git`` procedure exactly is: It is
- nothing more than a simple script that
- - writes the relevant ``annex_largefiles`` configuration, i.e., "Do not put anything that is a text file in the annex") to the ``.gitattributes`` file of a dataset, and
- - saves this modification with the commit message "Instruct annex to add text files to Git".
- This particular procedure lives in a script called
- ``cfg_text2git`` in the sourcecode of DataLad. The amount of code
- in this script is not large, and the relevant lines of code
- are highlighted:
- .. code-block:: python
- :emphasize-lines: 12, 16-17
- import sys
- import os.path as op
- from datalad.distribution.dataset import require_dataset
- ds = require_dataset(
- sys.argv[1],
- check_installed=True,
- purpose='configuration')
- # the relevant configuration:
- annex_largefiles = '((mimeencoding=binary)and(largerthan=0))'
- attrs = ds.repo.get_gitattributes('*')
- if not attrs.get('*', {}).get(
- 'annex.largefiles', None) == annex_largefiles:
- ds.repo.set_gitattributes([
- ('*', {'annex.largefiles': annex_largefiles})])
- git_attributes_file = op.join(ds.path, '.gitattributes')
- ds.save(
- git_attributes_file,
- message="Instruct annex to add text files to Git",
- )
- Just like ``cfg_text2git``, all DataLad procedures are
- executables (such as a script, or compiled code).
- In principle, they can be written in any language, and perform
- any task inside of a dataset.
- The ``text2git`` configuration, for example, applies a configuration for how
- git-annex treats different file types. Other procedures do not
- only modify ``.gitattributes``, but can also populate a dataset
- with particular content, or automate routine tasks such as
- synchronizing dataset content with certain siblings.
- What makes them a particularly versatile and flexible tool is
- that anyone can write their own procedures.
- If a workflow is a standard in a team and needs to be applied often, turning it into
- a script can save time and effort.
- To learn how to do this, read the :ref:`tutorial on writing own procedures <fom-procedures>`.
- By pointing DataLad to the location the procedures reside in they can be applied, and by
- including them in a dataset they can even be shared.
- And even if the script is simple, it is very handy to have preconfigured
- procedures that can be run in a single command line call. In the
- case of ``text2git``, all text files in a dataset will be stored
- in Git -- this is a useful configuration that is applicable to a
- wide range of datasets. It is a shortcut that
- spares naive users the necessity to learn about the ``.gitattributes``
- file when setting up a dataset.
- .. index::
- pair: run-procedure; DataLad command
- pair: discover dataset procedures; with DataLad
- pair: discover; dataset procedure
- To find out available procedures, the command
- :dlcmd:`run-procedure --discover` is helpful.
- This command will make DataLad search the default location for
- procedures in a dataset, the source code of DataLad or
- installed DataLad extensions, and the default locations for
- procedures on the system for available procedures:
- .. runrecord:: _examples/DL-101-124-101
- :workdir: dl-101/DataLad-101
- :language: console
- $ datalad run-procedure --discover
- The output shows that four procedures available in this particular dataset and the system it exists on:
- ``cfg_metadatatypes``, ``cfg_text2git``, ``cfg_yoda``, and ``cfg_noannex``.
- It also lists where they are stored -- in this case,
- they are all part of the source code of DataLad [#f1]_.
- - ``cfg_noannex`` configures a dataset to not have an annex at all.
- - ``cfg_yoda`` configures a dataset according to the yoda
- principles -- the section :ref:`yoda` talks about this in detail.
- - ``cfg_text2git`` configures text files to be stored in Git.
- - ``cfg_metadatatypes`` lets users configure additional metadata
- types.
- .. index::
- pair: run dataset procedure; with DataLad
- pair: run; dataset procedure
- Applying procedures
- ^^^^^^^^^^^^^^^^^^^
- :dlcmd:`run-procedure` not only *discovers*
- but also *executes* procedures. If given the name of
- a procedure, this command will apply the procedure to
- the current dataset, or the dataset that is specified
- with the ``-d/--dataset`` flag:
- .. code-block:: bash
- datalad run-procedure [-d <PATH>] cfg_text2git
- .. index::
- pair: run dataset procedure on dataset creation; with DataLad
- pair: run on dataset creation; dataset procedure
- The typical workflow is to create a dataset and apply
- a procedure afterwards.
- However, some procedures shipped with DataLad or its extensions with a
- ``cfg_`` prefix can also be applied right at the creation of a dataset
- with the ``-c/--cfg-proc <name>`` option in a :dlcmd:`create`
- command. This is a peculiarity of these procedures because, by convention,
- all of these procedures are written to not require arguments.
- The command structure looks like this:
- .. code-block:: console
- $ datalad create -c text2git DataLad-101
- Note that the ``cfg_`` prefix of the procedures is omitted in these
- calls to keep it extra simple and short. The
- available procedures in this example (``cfg_yoda``, ``cfg_text2git``)
- could thus be applied within a :dlcmd:`create` as
- - ``datalad create -c yoda <DSname>``
- - ``datalad create -c text2git <DSname>``
- .. index:: dataset procedure; apply more than one configuration
- .. find-out-more:: Applying multiple procedures
- If you want to apply several configurations at once, feel free to do so,
- for example like this:
- .. code-block:: console
- $ datalad create -c yoda -c text2git
- .. index:: dataset procedure; apply to subdatasets
- .. find-out-more:: Applying procedures in subdatasets
- Procedures can be applied in datasets on any level in the dataset hierarchy, i.e.,
- also in subdatasets. Note, though, that a subdataset will show up as being
- ``modified`` in :dlcmd:`status` *in the superdataset*
- after applying a procedure.
- This is expected, and it would also be the case with any other modification
- (saved or not) in the subdataset, as the version of the subdataset that is tracked
- in the superdataset simply changed. A :dlcmd:`save` in the superdataset
- will make sure that the version of the subdataset gets updated in the superdataset.
- The section :ref:`nesting2` will elaborate on this general principle later in the
- handbook.
- As a general note, it can be useful to apply procedures
- early in the life of a dataset. Procedures such
- as ``cfg_yoda``, explained in detail in section :ref:`yoda`,
- create files, change ``.gitattributes``, or apply other configurations.
- If many other (possibly complex) configurations are
- already in place, or if files of the same name as the ones created by
- a procedure are already in existence, this can lead to unexpected
- problems or failures, especially for naive users. Applying ``cfg_text2git``
- to a default dataset in which one has saved many text files already
- (as per default added to the annex) will not place the existing, saved
- files into Git -- only those text files created *after* the configuration
- was applied.
- .. index::
- single: configuration item; datalad.locations.system-procedures
- single: configuration item; datalad.locations.user-procedures
- single: configuration item; datalad.locations.dataset-procedures
- single: configuration item; datalad.procedures.<name>.call-format
- single: configuration item; datalad.procedures.<name>.help
- single: datasets procedures; write your own
- .. find-out-more:: Write your own procedures
- :name: fom-procedures
- :float:
- Procedures can come with DataLad or its extensions, but anyone can
- write their own ones in addition, and deploy them on individual machines,
- or ship them within DataLad datasets. This allows to
- automate routine configurations or tasks in a dataset, or share configurations that would otherwise not "stick" to the dataset.
- Here are some general rules for creating a custom procedure:
- - A procedure can be any executable. Executables must have the
- appropriate permissions and, in the case of a script,
- must contain an appropriate :term:`shebang`.
- - If a procedure is not executable, but its filename ends with
- ``.sh``, it is automatically executed via :term:`bash`.
- - Procedures can implement any argument handling, but must be capable
- of taking at least one positional argument (the absolute path to the
- dataset they shall operate on).
- - Custom procedures rely heavily on configurations in ``.datalad/config``
- (or the associated environment variables). Within ``.datalad/config``,
- each procedure should get an individual entry that contains at least
- a short "help" description on what the procedure does. Below is a minimal
- ``.datalad/config`` entry for a custom procedure:
- .. code-block:: ini
- [datalad "procedures.<NAME>"]
- help = This is a string to describe what the procedure does
- - By default, on GNU/Linux systems, DataLad will search for system-wide procedures
- (i.e., procedures on the *system* level) in ``/etc/xdg/datalad/procedures``,
- for user procedures (i.e., procedures on the *global* level) in ``~/.config/datalad/procedures``,
- and for dataset procedures (i.e., the *local* level [#f2]_) in ``.datalad/procedures``
- relative to a dataset root.
- Note that ``.datalad/procedures`` does not exist by default, and the ``procedures``
- directory needs to be created first.
- - Alternatively to the default locations, DataLad can be pointed to the
- location of a procedure with a configuration in ``.datalad/config``
- (or with the help of the associated :term:`environment variable`\s).
- The appropriate configuration keys for ``.datalad/config`` are either
- ``datalad.locations.system-procedures`` (for changing the *system* default),
- ``datalad.locations.user-procedures`` (for changing the *global* default),
- or ``datalad.locations.dataset-procedures`` (for changing the *local* default).
- An example ``.datalad/config`` entry for the local scope is shown below.
- .. code-block:: ini
- [datalad "locations"]
- dataset-procedures = relative/path/from/dataset-root
- - By default, DataLad will call a procedure with a standard template
- defined by a format string:
- .. code-block::
- interpreter {script} {ds} {arguments}
- where arguments can be any additional command line arguments a script
- (procedure) takes or requires. This default format string can be
- customized within ``.datalad/config`` in ``datalad.procedures.<NAME>.call-format``.
- An example ``.datalad/config`` entry with a changed call format string
- is shown below.
- .. code-block:: ini
- [datalad "procedures.<NAME>"]
- help = This is a string to describe what the procedure does
- call-format = python {script} {ds} {somearg1} {somearg2}
- - By convention, procedures should leave a dataset in a clean state.
- Therefore, in order to create a custom procedure, an executable script
- in the appropriate location is fine. Placing a script ``myprocedure``
- into ``.datalad/procedures`` will allow running
- ``datalad run-procedure myprocedure`` in your dataset, and because
- it is part of the dataset it will also allow distributing the procedure.
- Below is a toy-example for a custom procedure:
- .. runrecord:: _examples/DL-101-124-103
- :language: console
- :workdir: procs
- $ datalad create somedataset; cd somedataset
- .. runrecord:: _examples/DL-101-124-104
- :language: console
- :workdir: procs/somedataset
- $ mkdir .datalad/procedures
- $ cat << EOT > .datalad/procedures/example.py
- """A simple procedure to create a file 'example' and store
- it in Git, and a file 'example2' and annex it. The contents
- of 'example' must be defined with a positional argument."""
- import sys
- import os.path as op
- from datalad.distribution.dataset import require_dataset
- from datalad.utils import create_tree
- ds = require_dataset(
- sys.argv[1],
- check_installed=True,
- purpose='showcase an example procedure')
- # this is the content for file "example"
- content = """\
- This file was created by a custom procedure! Neat, huh?
- """
- # create a directory structure template. Write
- tmpl = {
- 'somedir': {
- 'example': content,
- },
- 'example2': sys.argv[2] if sys.argv[2] else "got no input"
- }
- # actually create the structure in the dataset
- create_tree(ds.path, tmpl)
- # rule to store 'example' Git
- ds.repo.set_gitattributes([('example', {'annex.largefiles': 'nothing'})])
- # save the dataset modifications
- ds.save(message="Apply custom procedure")
- EOT
- .. runrecord:: _examples/DL-101-124-105
- :language: console
- :workdir: procs/somedataset
- $ datalad save -m "add custom procedure"
- At this point, the dataset contains the custom procedure ``example``.
- This is how it can be executed and what it does:
- .. runrecord:: _examples/DL-101-124-106
- :language: console
- :workdir: procs/somedataset
- $ datalad run-procedure example "this text will be in the file 'example2'"
- .. runrecord:: _examples/DL-101-124-107
- :language: console
- :workdir: procs/somedataset
- $ # the directory structure has been created
- $ tree
- .. runrecord:: _examples/DL-101-124-108
- :workdir: procs/somedataset
- :language: console
- $ # lets check out the contents in the files
- $ cat example2 && echo '' && cat somedir/example
- .. runrecord:: _examples/DL-101-124-109
- :workdir: procs/somedataset
- :language: console
- $ git config -f .datalad/config datalad.procedures.example.help "A toy example"
- $ datalad save -m "add help description"
- To find out more about a given procedure, you can ask for help:
- .. runrecord:: _examples/DL-101-124-110
- :workdir: procs/somedataset
- :language: console
- $ datalad run-procedure --help-proc example
- Summing up, DataLad's :dlcmd:`run-procedure` command is a handy tool
- with useful existing procedures but much flexibility for your own
- DIY procedure scripts. With the information of the last three sections
- you should be able to write and understand necessary configurations,
- but you can also rely on existing, preconfigured templates in the
- form of procedures, and even write and distribute your own.
- Therefore, envision procedures as
- helper-tools that can minimize technical complexities
- in a dataset -- users can concentrate on the actual task while
- the dataset is set-up, structured, processed, or configured automatically
- with the help of a procedure.
- Especially in the case of trainees and new users, applying procedures
- instead of doing relevant routines "by hand" can help to ease
- working with the dataset. Other than by users, procedures can also be triggered to automatically
- run after any command execution if a command results matches a specific
- requirement.
- Finally, make a note about running procedures inside of ``notes.txt``:
- .. runrecord:: _examples/DL-101-124-111
- :language: console
- :workdir: dl-101/DataLad-101
- $ cat << EOT >> notes.txt
- It can be useful to use pre-configured procedures that can apply
- configurations, create files or file hierarchies, or perform arbitrary
- tasks in datasets. They can be shipped with DataLad, its extensions,
- or datasets, and you can even write your own procedures and distribute
- them.
- The "datalad run-procedure" command is used to apply such a procedure
- to a dataset. Procedures shipped with DataLad or its extensions
- starting with a "cfg" prefix can also be applied at the creation of a
- dataset with "datalad create -c <PROC-NAME> <PATH>" (omitting the
- "cfg" prefix).
- EOT
- .. runrecord:: _examples/DL-101-124-112
- :workdir: dl-101/DataLad-101
- :language: console
- $ datalad save -m "add note on DataLad's procedures"
- .. only:: adminmode
- Add a tag at the section end.
- .. runrecord:: _examples/DL-101-124-112
- :language: console
- :workdir: dl-101/DataLad-101
- $ git branch sct_configurations_to_go
- .. rubric:: Footnotes
- .. [#f1] In theory, because procedures can exist on different levels, and
- because anyone can create (and thus name) their own procedures, there
- can be name conflicts. The order of precedence in such cases is:
- user-level, system-level, dataset, DataLad extension, DataLad, i.e.,
- local procedures take precedence over those coming from "outside" via
- datasets or DataLad extensions.
- If procedures in a higher-level dataset and a subdataset have the same
- name, the procedure closer to the dataset ``run-procedure`` is
- operating on takes precedence.
- .. [#f2] Note that we simplify the level of procedures that exist within a dataset
- by calling them *local*. Even though they apply to a dataset just as *local*
- Git configurations, unlike Git's *local* configurations in ``.git/config``,
- the procedures and procedure configurations in ``.datalad/config`` are committed
- and can be shared together with a dataset. The procedure level *local* therefore
- does not exactly corresponds to the *local* scope in the sense that Git uses it.
|