.. index:: ! procedures, run-procedures
.. _procedures:

Configurations to go
--------------------

The past two sections should have given you a comprehensive
overview of the different configuration options that the tools
Git, git-annex, and DataLad provide. They not only
showed you a way to configure everything you may need to
configure, but also explained what the
configuration options actually mean.

But figuring out which configurations are useful and how
to apply them is not the easiest task either. Therefore,
some clever people decided to assist with
these tasks, and created pre-configured *procedures*
that process datasets in a particular way.
These procedures can be shipped within DataLad or its extensions,
lie on a system, or can be shared together with datasets.

One such procedure is the ``text2git`` configuration.
In order to learn about procedures in general, let's demystify
what exactly the ``text2git`` procedure is: It is
nothing more than a simple script that

- writes the relevant ``annex.largefiles`` configuration (i.e., "Do not put anything that is a text file in the annex") to the ``.gitattributes`` file of a dataset, and
- saves this modification with the commit message "Instruct annex to add text files to Git".

This particular procedure lives in a script called
``cfg_text2git`` in the source code of DataLad. The amount of code
in this script is not large, and the relevant lines of code
are highlighted:

.. code-block:: python
   :emphasize-lines: 12, 16-17

   import sys
   import os.path as op

   from datalad.distribution.dataset import require_dataset

   ds = require_dataset(
       sys.argv[1],
       check_installed=True,
       purpose='configuration')

   # the relevant configuration:
   annex_largefiles = '((mimeencoding=binary)and(largerthan=0))'
   attrs = ds.repo.get_gitattributes('*')
   if not attrs.get('*', {}).get(
           'annex.largefiles', None) == annex_largefiles:
       ds.repo.set_gitattributes([
           ('*', {'annex.largefiles': annex_largefiles})])

   git_attributes_file = op.join(ds.path, '.gitattributes')
   ds.save(
       git_attributes_file,
       message="Instruct annex to add text files to Git",
   )
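
To make the effect of the highlighted lines more tangible, here is a minimal
sketch that uses only the Python standard library. It writes the same rule into
a ``.gitattributes`` file in a temporary directory and skips the write if the
rule is already present. The helper ``add_text2git_rule`` is hypothetical and
not part of DataLad; unlike the real procedure, it neither needs a dataset nor
saves anything.

.. code-block:: python

   import tempfile
   from pathlib import Path

   # The rule that cfg_text2git registers, copied from the script above.
   ANNEX_LARGEFILES = "((mimeencoding=binary)and(largerthan=0))"


   def add_text2git_rule(directory):
       """Append the rule to <directory>/.gitattributes unless it is there already.

       Hypothetical stand-in for ds.repo.set_gitattributes(): it only edits
       the file and does not commit the change.
       """
       attributes_file = Path(directory) / ".gitattributes"
       rule = f"* annex.largefiles={ANNEX_LARGEFILES}"
       lines = []
       if attributes_file.exists():
           lines = attributes_file.read_text().splitlines()
       if rule not in lines:
           lines.append(rule)
           attributes_file.write_text("\n".join(lines) + "\n")
       return rule


   with tempfile.TemporaryDirectory() as tmp:
       add_text2git_rule(tmp)
       add_text2git_rule(tmp)  # the second call changes nothing
       print((Path(tmp) / ".gitattributes").read_text(), end="")

The check before writing mirrors the ``if not attrs.get(...)`` test in the
procedure: running it twice should not duplicate the rule.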

Just like ``cfg_text2git``, all DataLad procedures are
executables (such as a script, or compiled code).
In principle, they can be written in any language, and perform
any task inside of a dataset.
The ``text2git`` configuration, for example, applies a configuration for how
git-annex treats different file types. Other procedures do not
only modify ``.gitattributes``, but can also populate a dataset
with particular content, or automate routine tasks such as
synchronizing dataset content with certain siblings.
What makes them a particularly versatile and flexible tool is
that anyone can write their own procedures.
If a workflow is a standard in a team and needs to be applied often, turning it into
a script can save time and effort.
To learn how to do this, read the :ref:`tutorial on writing your own procedures <fom-procedures>`.
By pointing DataLad to the location in which the procedures reside, they can be applied, and by
including them in a dataset, they can even be shared.
And even if the script is simple, it is very handy to have preconfigured
procedures that can be run in a single command line call. In the
case of ``text2git``, all text files in a dataset will be stored
in Git -- this is a useful configuration that is applicable to a
wide range of datasets. It is a shortcut that
spares new users the necessity to learn about the ``.gitattributes``
file when setting up a dataset.

.. index::
   pair: run-procedure; DataLad command
   pair: discover dataset procedures; with DataLad
   pair: discover; dataset procedure

To find out which procedures are available, the command
:dlcmd:`run-procedure --discover` is helpful.
This command will make DataLad search the default location for
procedures in a dataset, the source code of DataLad or
installed DataLad extensions, and the default locations for
procedures on the system:

.. runrecord:: _examples/DL-101-124-101
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad run-procedure --discover

The output shows that four procedures are available in this particular dataset and on the system it exists on:
``cfg_metadatatypes``, ``cfg_text2git``, ``cfg_yoda``, and ``cfg_noannex``.
It also lists where they are stored -- in this case,
they are all part of the source code of DataLad [#f1]_.

- ``cfg_noannex`` configures a dataset to not have an annex at all.
- ``cfg_yoda`` configures a dataset according to the yoda
  principles -- the section :ref:`yoda` talks about this in detail.
- ``cfg_text2git`` configures text files to be stored in Git.
- ``cfg_metadatatypes`` lets users configure additional metadata
  types.
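
The search that ``--discover`` performs can be pictured with a small sketch.
It is an illustration only and not DataLad's implementation: the functions
``procedure_search_dirs`` and ``discover`` are hypothetical, the directories
are the GNU/Linux defaults for user, system, and dataset procedures, the order
follows the precedence given in the footnote, and procedures shipped with
DataLad or its extensions are left out. Naming a procedure after its file name
without the extension is likewise an assumption.

.. code-block:: python

   from pathlib import Path


   def procedure_search_dirs(dataset_root, home=None):
       """Return candidate directories, highest precedence first (assumed order)."""
       home = Path(home) if home is not None else Path.home()
       return [
           home / ".config" / "datalad" / "procedures",     # user level
           Path("/etc/xdg/datalad/procedures"),              # system level
           Path(dataset_root) / ".datalad" / "procedures",   # dataset level
       ]


   def discover(dataset_root, home=None):
       """Map procedure names to the first matching file in the search order."""
       found = {}
       for directory in procedure_search_dirs(dataset_root, home):
           if not directory.is_dir():
               continue  # e.g. .datalad/procedures does not exist by default
           for entry in sorted(directory.iterdir()):
               if entry.is_file():
                   # keep the first hit so that earlier directories win
                   found.setdefault(entry.stem, entry)
       return found


   for name, path in discover(".").items():
       print(f"{name}: {path}")

Because ``setdefault`` keeps the first hit, a user-level procedure shadows a
dataset-level procedure of the same name in this sketch.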

.. index::
   pair: run dataset procedure; with DataLad
   pair: run; dataset procedure

Applying procedures
^^^^^^^^^^^^^^^^^^^

:dlcmd:`run-procedure` not only *discovers*
but also *executes* procedures. If given the name of
a procedure, this command will apply the procedure to
the current dataset, or the dataset that is specified
with the ``-d/--dataset`` flag:

.. code-block:: bash

   datalad run-procedure [-d <PATH>] cfg_text2git

.. index::
   pair: run dataset procedure on dataset creation; with DataLad
   pair: run on dataset creation; dataset procedure

The typical workflow is to create a dataset and apply
a procedure afterwards.
However, procedures with a ``cfg_`` prefix that are shipped with DataLad or
its extensions can also be applied right at the creation of a dataset
with the ``-c/--cfg-proc <name>`` option in a :dlcmd:`create`
command. This is a peculiarity of these procedures because, by convention,
all of them are written to not require arguments.
The command structure looks like this:

.. code-block:: console

   $ datalad create -c text2git DataLad-101

Note that the ``cfg_`` prefix of the procedures is omitted in these
calls to keep them simple and short. The
available procedures in this example (``cfg_yoda``, ``cfg_text2git``)
could thus be applied within a :dlcmd:`create` as

- ``datalad create -c yoda <DSname>``
- ``datalad create -c text2git <DSname>``
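
The naming rule can be sketched in a few lines of Python. The helper
``create_command`` is hypothetical and only assembles the command line as a
list of arguments; it drops the ``cfg_`` prefix from each procedure name and
passes the result to ``-c``. It does not call DataLad.

.. code-block:: python

   import shlex


   def create_command(dataset_name, procedures):
       """Build a 'datalad create' call that applies the given cfg_ procedures."""
       argv = ["datalad", "create"]
       for name in procedures:
           # 'cfg_yoda' is requested as '-c yoda'
           short = name[len("cfg_"):] if name.startswith("cfg_") else name
           argv += ["-c", short]
       argv.append(dataset_name)
       return argv


   print(shlex.join(create_command("DataLad-101", ["cfg_text2git"])))
   # prints: datalad create -c text2git DataLad-101

The returned list could be handed to ``subprocess.run`` on a machine with
DataLad installed.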

.. index:: dataset procedure; apply more than one configuration
.. find-out-more:: Applying multiple procedures

   If you want to apply several configurations at once, feel free to do so,
   for example like this:

   .. code-block:: console

      $ datalad create -c yoda -c text2git

.. index:: dataset procedure; apply to subdatasets
.. find-out-more:: Applying procedures in subdatasets

   Procedures can be applied in datasets on any level in the dataset hierarchy, i.e.,
   also in subdatasets. Note, though, that a subdataset will show up as being
   ``modified`` in :dlcmd:`status` *in the superdataset*
   after applying a procedure.
   This is expected, and it would also be the case with any other modification
   (saved or not) in the subdataset, as the version of the subdataset that is tracked
   in the superdataset simply changed. A :dlcmd:`save` in the superdataset
   will make sure that the version of the subdataset gets updated in the superdataset.
   The section :ref:`nesting2` will elaborate on this general principle later in the
   handbook.

As a general note, it can be useful to apply procedures
early in the life of a dataset. Procedures such
as ``cfg_yoda``, explained in detail in section :ref:`yoda`,
create files, change ``.gitattributes``, or apply other configurations.
If many other (possibly complex) configurations are
already in place, or if files with the same names as the ones created by
a procedure already exist, this can lead to unexpected
problems or failures, especially for new users. For example, applying ``cfg_text2git``
to a default dataset in which one has already saved many text files
(which were added to the annex by default) will not move the existing, saved
files into Git -- only text files created *after* the configuration
was applied are stored in Git.

.. index::
   single: configuration item; datalad.locations.system-procedures
   single: configuration item; datalad.locations.user-procedures
   single: configuration item; datalad.locations.dataset-procedures
   single: configuration item; datalad.procedures.<name>.call-format
   single: configuration item; datalad.procedures.<name>.help
   single: datasets procedures; write your own
.. find-out-more:: Write your own procedures
   :name: fom-procedures
   :float:

   Procedures can come with DataLad or its extensions, but anyone can
   write their own in addition, and deploy them on individual machines,
   or ship them within DataLad datasets. This makes it possible to
   automate routine configurations or tasks in a dataset, or share configurations that would otherwise not "stick" to the dataset.
   Here are some general rules for creating a custom procedure:

   - A procedure can be any executable. Executables must have the
     appropriate permissions and, in the case of a script,
     must contain an appropriate :term:`shebang`.
   - If a procedure is not executable, but its filename ends with
     ``.sh``, it is automatically executed via :term:`bash`.
   - Procedures can implement any argument handling, but must be capable
     of taking at least one positional argument (the absolute path to the
     dataset they shall operate on).
   - Custom procedures rely heavily on configurations in ``.datalad/config``
     (or the associated environment variables). Within ``.datalad/config``,
     each procedure should get an individual entry that contains at least
     a short "help" description of what the procedure does. Below is a minimal
     ``.datalad/config`` entry for a custom procedure:

     .. code-block:: ini

        [datalad "procedures.<NAME>"]
           help = This is a string to describe what the procedure does

   - By default, on GNU/Linux systems, DataLad will search for system-wide procedures
     (i.e., procedures on the *system* level) in ``/etc/xdg/datalad/procedures``,
     for user procedures (i.e., procedures on the *global* level) in ``~/.config/datalad/procedures``,
     and for dataset procedures (i.e., the *local* level [#f2]_) in ``.datalad/procedures``
     relative to a dataset root.
     Note that ``.datalad/procedures`` does not exist by default, and the ``procedures``
     directory needs to be created first.
   - As an alternative to the default locations, DataLad can be pointed to the
     location of a procedure with a configuration in ``.datalad/config``
     (or with the help of the associated :term:`environment variable`\s).
     The appropriate configuration keys for ``.datalad/config`` are
     ``datalad.locations.system-procedures`` (for changing the *system* default),
     ``datalad.locations.user-procedures`` (for changing the *global* default),
     and ``datalad.locations.dataset-procedures`` (for changing the *local* default).
     An example ``.datalad/config`` entry for the local scope is shown below.

     .. code-block:: ini

        [datalad "locations"]
           dataset-procedures = relative/path/from/dataset-root

   - By default, DataLad will call a procedure with a standard template
     defined by a format string:

     .. code-block::

        interpreter {script} {ds} {arguments}

     where ``arguments`` can be any additional command line arguments a script
     (procedure) takes or requires. This default format string can be
     customized within ``.datalad/config`` in ``datalad.procedures.<NAME>.call-format``.
     An example ``.datalad/config`` entry with a changed call format string
     is shown below.

     .. code-block:: ini

        [datalad "procedures.<NAME>"]
           help = This is a string to describe what the procedure does
           call-format = python {script} {ds} {somearg1} {somearg2}

   - By convention, procedures should leave a dataset in a clean state.
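
   How a call format string turns into a command can be illustrated with
   Python's ``str.format``. This is a sketch under assumptions, not DataLad's
   implementation: ``render_call`` is a hypothetical helper, the placeholder
   names follow the template shown in the list above, and the file paths are
   made up for the example.

   .. code-block:: python

      import shlex


      def render_call(call_format, script, ds, arguments=()):
          """Fill the placeholders of a call format string (illustrative only)."""
          return call_format.format(
              script=shlex.quote(script),
              ds=shlex.quote(ds),
              # quote each extra argument so that spaces survive the shell
              arguments=" ".join(shlex.quote(arg) for arg in arguments),
          )


      print(render_call(
          "python {script} {ds} {arguments}",
          script=".datalad/procedures/example.py",
          ds="/tmp/somedataset",
          arguments=["some text"],
      ))
      # prints: python .datalad/procedures/example.py /tmp/somedataset 'some text'

   A format string with placeholders that the helper does not know, such as
   ``{somearg1}``, would raise a ``KeyError`` in this sketch.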

   In short, an executable script in the appropriate location is all it
   takes to create a custom procedure. Placing a script ``myprocedure``
   into ``.datalad/procedures`` will allow running
   ``datalad run-procedure myprocedure`` in your dataset, and because
   it is part of the dataset, it can also be distributed together with it.
   Below is a toy example of a custom procedure:

   .. runrecord:: _examples/DL-101-124-103
      :language: console
      :workdir: procs

      $ datalad create somedataset; cd somedataset

   .. runrecord:: _examples/DL-101-124-104
      :language: console
      :workdir: procs/somedataset

      $ mkdir .datalad/procedures
      $ cat << EOT > .datalad/procedures/example.py
      """A simple procedure to create a file 'example' and store
      it in Git, and a file 'example2' and annex it. The contents
      of 'example2' can be defined with a positional argument."""

      import sys
      import os.path as op
      from datalad.distribution.dataset import require_dataset
      from datalad.utils import create_tree

      ds = require_dataset(
          sys.argv[1],
          check_installed=True,
          purpose='showcase an example procedure')

      # this is the content for file "example"
      content = """\
      This file was created by a custom procedure! Neat, huh?
      """

      # create a directory structure template; 'example2' gets the
      # optional second argument as content, or a fallback text
      tmpl = {
          'somedir': {
              'example': content,
          },
          'example2': sys.argv[2] if len(sys.argv) > 2 else "got no input"
      }

      # actually create the structure in the dataset
      create_tree(ds.path, tmpl)

      # rule to store 'example' in Git
      ds.repo.set_gitattributes([('example', {'annex.largefiles': 'nothing'})])

      # save the dataset modifications
      ds.save(message="Apply custom procedure")

      EOT

   .. runrecord:: _examples/DL-101-124-105
      :language: console
      :workdir: procs/somedataset

      $ datalad save -m "add custom procedure"

   At this point, the dataset contains the custom procedure ``example``.
   This is how it can be executed and what it does:

   .. runrecord:: _examples/DL-101-124-106
      :language: console
      :workdir: procs/somedataset

      $ datalad run-procedure example "this text will be in the file 'example2'"

   .. runrecord:: _examples/DL-101-124-107
      :language: console
      :workdir: procs/somedataset

      $ # the directory structure has been created
      $ tree

   .. runrecord:: _examples/DL-101-124-108
      :workdir: procs/somedataset
      :language: console

      $ # let's check out the contents of the files
      $ cat example2 && echo '' && cat somedir/example

   .. runrecord:: _examples/DL-101-124-109
      :workdir: procs/somedataset
      :language: console

      $ git config -f .datalad/config datalad.procedures.example.help "A toy example"
      $ datalad save -m "add help description"

   To find out more about a given procedure, you can ask for help:

   .. runrecord:: _examples/DL-101-124-110
      :workdir: procs/somedataset
      :language: console

      $ datalad run-procedure --help-proc example

Summing up, DataLad's :dlcmd:`run-procedure` command is a handy tool
with useful existing procedures and plenty of flexibility for your own
DIY procedure scripts. With the information from the last three sections,
you should be able to write and understand the necessary configurations,
but you can also rely on existing, preconfigured templates in the
form of procedures, and even write and distribute your own.

Think of procedures as
helper tools that can minimize technical complexities
in a dataset -- users can concentrate on the actual task while
the dataset is set up, structured, processed, or configured automatically
with the help of a procedure.
Especially in the case of trainees and new users, applying procedures
instead of doing relevant routines "by hand" can make
working with the dataset easier. Besides being run by users, procedures can also be triggered to
run automatically after any command execution if a command's results match a specific
requirement.

Finally, make a note about running procedures inside of ``notes.txt``:

.. runrecord:: _examples/DL-101-124-111
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat << EOT >> notes.txt
   It can be useful to use pre-configured procedures that can apply
   configurations, create files or file hierarchies, or perform arbitrary
   tasks in datasets. They can be shipped with DataLad, its extensions,
   or datasets, and you can even write your own procedures and distribute
   them.
   The "datalad run-procedure" command is used to apply such a procedure
   to a dataset. Procedures shipped with DataLad or its extensions
   starting with a "cfg" prefix can also be applied at the creation of a
   dataset with "datalad create -c <PROC-NAME> <PATH>" (omitting the
   "cfg" prefix).

   EOT

.. runrecord:: _examples/DL-101-124-112
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad save -m "add note on DataLad's procedures"

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-124-112
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_configurations_to_go

.. rubric:: Footnotes

.. [#f1] In theory, because procedures can exist on different levels, and
   because anyone can create (and thus name) their own procedures, there
   can be name conflicts. The order of precedence in such cases is:
   user-level, system-level, dataset, DataLad extension, DataLad, i.e.,
   local procedures take precedence over those coming from "outside" via
   datasets or DataLad extensions.
   If procedures in a higher-level dataset and a subdataset have the same
   name, the procedure closer to the dataset that ``run-procedure`` is
   operating on takes precedence.

.. [#f2] Note that we simplify the level of procedures that exist within a dataset
   by calling them *local*. Even though they apply to a dataset just like *local*
   Git configurations, unlike Git's *local* configurations in ``.git/config``,
   the procedures and procedure configurations in ``.datalad/config`` are committed
   and can be shared together with a dataset. The procedure level *local* therefore
   does not exactly correspond to the *local* scope in the sense that Git uses it.