.. _config2:

More on DIY configurations
--------------------------

As the last section already suggested, within a Git repository,
``.git/config`` is not the only configuration file.
There are also ``.gitmodules`` and ``.gitattributes``, and in DataLad datasets
there is also a ``.datalad/config`` file.
All of these files store configurations, but they differ from ``.git/config`` in one important way:
They are version controlled, and upon sharing a dataset these configurations
will be shared as well. An example of a shared configuration
is the one that the ``text2git`` configuration template applied:
In the shared copy of your dataset from :ref:`sibling`, text files are also saved with Git,
and not git-annex. The configuration responsible
for this behavior is in a ``.gitattributes`` file, and we'll start this
section by looking into it.

.. index:: ! configuration file; .gitattributes

``.gitattributes``
^^^^^^^^^^^^^^^^^^

This file lies right in the root of your superdataset:

.. runrecord:: _examples/DL-101-123-101
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .gitattributes

This looks neither spectacular nor pretty. Also, it does not follow the ``section-option-value``
organization of the ``.git/config`` file anymore. Instead, there are three lines,
and all of these seem to have something to do with the configuration of git-annex.
There even is one key word that you recognize: MD5E.
If you have read the :ref:`Find-out-more on object trees <objecttree>`
you will recognize it as a reference to the type of
key used by git-annex to identify and store file content in the object-tree.
The first row, ``* annex.backend=MD5E``, therefore translates to "The ``MD5E`` git-annex backend should be used for any file".
But what is the rest? We'll start with the last row:

.. code-block:: bash

   * annex.largefiles=((mimeencoding=binary)and(largerthan=0))

Uhhh, cryptic. The lecturer explains: "git-annex will *annex*, that is, *store in the object-tree*,
anything it considers to be a "large file". By default, anything
in your dataset would be a "large file", that means anything would be annexed.
However, in section :ref:`symlink` I already mentioned that exceptions to this
behavior can be defined based on

#. file size

#. and/or path/pattern, and thus for example file extensions,
   or names, or file types (e.g., text files, as with the
   ``text2git`` configuration template).

"In ``.gitattributes``, you can define what is a large file and what is not
by writing such rules."

What you can see in this ``.gitattributes`` file is a rule based on **file types**:
With ``(mimeencoding=binary)`` [#f1]_, the ``text2git`` configuration template
configured git-annex to regard all files of type "binary" as a large file.
Thanks to this little line, your text files are not annexed, but stored
directly in Git.

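If you are curious what "binary" means in practice, the ``file`` utility can report the encoding it detects in a file's content. The following is only a rough sketch: it assumes that ``file`` is installed, that git-annex's ``mimeencoding`` check agrees with what ``file`` reports, and the two file names are made up for illustration.

```shell
# scratch directory with two hypothetical example files
tmpdir=$(mktemp -d)
printf 'just some text\n' > "$tmpdir/notes.txt"
printf '\000\001\002\377' > "$tmpdir/blob.dat"

# -b omits the file name, --mime-encoding prints only the detected encoding
text_enc=$(file -b --mime-encoding "$tmpdir/notes.txt")
blob_enc=$(file -b --mime-encoding "$tmpdir/blob.dat")

echo "notes.txt: $text_enc"
echo "blob.dat:  $blob_enc"

# clean up the scratch directory
rm -rf "$tmpdir"
```

Under these assumptions, only the second file would be matched by ``mimeencoding=binary`` and thus be regarded as a "large file".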
The patterns ``*`` and ``**`` are so-called "wildcards" that you might recognize from :term:`globbing`.
In Git configuration files, an asterisk "*" matches anything except a slash.
The third row therefore
translates to "Do not annex anything that is a text file" for git-annex.
Two leading "``**``" followed by a slash match
*recursively* in all directories.
Therefore, the second row instructs git-annex to regard nothing starting with ``.git`` as a "large file", including contents inside of ``.git`` directories.
This way, the ``.git`` repositories are protected from being annexed.

If you had a single file (``myfile.pdf``) you would not want annexed, specifying a rule such as:

.. code-block:: bash

   myfile.pdf annex.largefiles=nothing

will keep it stored in Git. To see an example of this, navigate into the longnow subdataset,
and view this dataset's ``.gitattributes`` file:

.. runrecord:: _examples/DL-101-123-102
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat recordings/longnow/.gitattributes

The relevant part is ``README.md annex.largefiles=nothing``.
This instructs git-annex to specifically not annex ``README.md``.

Lastly, if you wanted to configure a rule based on **size**, you could add a row such as:

.. code-block:: bash

   ** annex.largefiles=(largerthan=20kb)

to store only files exceeding 20KB in size in git-annex [#f2]_.

As you may have noticed, unlike ``.git/config`` files,
there can be multiple ``.gitattributes`` files within a dataset. So far, you have seen one
in the root of the superdataset, and in the root of the ``longnow`` subdataset.
In principle, you can add one to every directory-level of your dataset.
For example, there is another ``.gitattributes`` file within the
``.datalad`` directory:

.. runrecord:: _examples/DL-101-123-103
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .datalad/.gitattributes

As with Git configuration files, more specific or lower-level configurations take precedence
over more general or higher-level configurations. Specifications in a subdirectory can
therefore overrule specifications made in the ``.gitattributes`` file of the parent
directory.

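If you want to try out such rules, plain Git can tell you which attribute value applies to a given path with ``git check-attr``. The sketch below works in a throwaway repository; the file names and the ``code/`` directory are hypothetical and only serve as illustration. Note that Git merely reports the value here, and it is git-annex that would interpret it when a file is saved.

```shell
# throwaway repository for experimenting with attribute rules
repo=$(mktemp -d)
cd "$repo" && git init -q .
mkdir code

# root-level rules: a size-based default and an exception for one file
cat > .gitattributes << 'EOT'
* annex.largefiles=(largerthan=20kb)
myfile.pdf annex.largefiles=nothing
EOT

# a rule in a subdirectory overrules the root-level default
echo '* annex.largefiles=nothing' > code/.gitattributes

# ask Git which value applies to each path (the files need not exist)
attrs=$(git check-attr annex.largefiles -- data.dat myfile.pdf code/script.py)
echo "$attrs"
```

Each output line has the form ``<path>: annex.largefiles: <value>``, so you can see that the more specific rule wins for ``myfile.pdf`` and that the ``.gitattributes`` file in ``code/`` takes precedence for everything underneath it.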
In summary, the ``.gitattributes`` files will give you the possibility to configure
what should be annexed and what should not be annexed up to individual file level.
This can be very handy, and allows you to tune your dataset to your custom needs.
For example, files you will often edit by hand could be stored in Git if they are
not too large to ease modifying them [#f3]_.
Once you know the basics of this type of configuration syntax, writing
your own rules is easy. For more tips on how to configure git-annex's content
management in ``.gitattributes``, take a look at `the git-annex documentation <https://git-annex.branchable.com/tips/largefiles>`_.
Later, however, you will see preconfigured DataLad *procedures* such as ``text2git`` that
can apply useful configurations for you, just as ``text2git`` added the last line
in the root ``.gitattributes`` file.

.. index:: ! configuration file; .gitmodules

``.gitmodules``
^^^^^^^^^^^^^^^

One last configuration file that Git creates is the ``.gitmodules`` file.
There is one right in the root of your dataset:

.. runrecord:: _examples/DL-101-123-104
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .gitmodules

Based on these contents, you might have already guessed what this file
stores. The ``.gitmodules`` file is a configuration file that stores the mapping between
your own dataset and any subdatasets you have installed in it.
There will be an entry for each submodule (subdataset) in your dataset.
The name *submodule* is Git terminology, and describes a Git repository inside of
another Git repository, i.e., the super- and subdataset principles.
Upon sharing your dataset, the information about subdatasets and where to retrieve
them from is stored and shared with this file.

In addition to modifying it with the ``git config`` command or by hand, the ``datalad subdatasets`` command also has a ``--set-property NAME VALUE`` option that you can use to set subdataset properties.

Section :ref:`sharelocal1` already mentioned one additional configuration option in a footnote: The ``datalad-recursiveinstall`` key.
This key is defined on a per subdataset basis, and if set to "``skip``", the given subdataset will not be recursively installed unless it is explicitly specified as a path to :dlcmd:`get [-n/--no-data] -r`.
If you are a maintainer of a superdataset with monstrous amounts of subdatasets, you can set this option and share it together with the dataset to prevent an accidental, large recursive installation in particularly deeply nested subdatasets.

Below is a minimal working example of how to apply the configuration and how it works:
Let's create a dataset hierarchy to work with (note that we concatenate multiple commands into a single line using bash's "and" ``&&`` operator):

.. code-block:: console

   $ # create a superdataset with two subdatasets
   $ datalad create superds && datalad -C superds create -d . subds1 && datalad -C superds create -d . subds2
   create(ok): /tmp/superds (dataset)
   add(ok): subds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subds1 (dataset)
   add(ok): subds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subds2 (dataset)

Next, we create subdatasets in the subdatasets:

.. code-block:: console

   $ # create two subdatasets in subds1
   $ datalad -C superds/subds1 create -d . subsubds1 && datalad -C superds/subds1 create -d . subsubds2
   add(ok): subsubds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds1 (dataset)
   add(ok): subsubds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds2 (dataset)

   $ # create two subdatasets in subds2
   $ datalad -C superds/subds2 create -d . subsubds1 && datalad -C superds/subds2 create -d . subsubds2
   add(ok): subsubds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds1 (dataset)
   add(ok): subsubds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds2 (dataset)

Here is the directory structure:

.. code-block:: console

   $ cd superds && tree
   .
   ├── subds1
   │   ├── subsubds1
   │   └── subsubds2
   └── subds2
       ├── subsubds1
       └── subsubds2

   $ # save in the superdataset
   $ datalad save -m "add a few sub and subsub datasets"
   add(ok): subds1 (file)
   add(ok): subds2 (file)
   save(ok): . (dataset)

Now, we can apply the ``datalad-recursiveinstall`` configuration to skip recursive installations for ``subds1``:

.. code-block:: console

   $ git config -f .gitmodules --add submodule.subds1.datalad-recursiveinstall skip

   $ # save this configuration
   $ datalad save -m "prevent recursion into subds1, unless explicitly given as path"
   add(ok): .gitmodules (file)
   save(ok): . (dataset)

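To double-check which subdatasets carry this setting, the same ``git config`` interface can read ``.gitmodules`` back. The sketch below recreates a simplified ``.gitmodules`` file in a scratch directory so that it can be run without DataLad; the ``path`` and ``url`` values are assumptions that merely mimic the hierarchy from above.

```shell
# scratch directory with a simplified, hand-made .gitmodules file
scratch=$(mktemp -d)
cd "$scratch"
git config -f .gitmodules submodule.subds1.path subds1
git config -f .gitmodules submodule.subds1.url ./subds1
git config -f .gitmodules submodule.subds2.path subds2
git config -f .gitmodules submodule.subds2.url ./subds2

# mark subds1 just like in the example above
git config -f .gitmodules --add submodule.subds1.datalad-recursiveinstall skip

# list every subdataset whose datalad-recursiveinstall value matches "skip"
skipped=$(git config -f .gitmodules --get-regexp \
    '^submodule\..*\.datalad-recursiveinstall$' skip)
echo "$skipped"
```

Only the entry for ``subds1`` is printed, as ``subds2`` has no such key.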
If the dataset is cloned, and someone runs a recursive :dlcmd:`get`, the subdatasets of ``subds1`` will not be installed, whereas the subdatasets of ``subds2`` will be.

.. code-block:: console

   $ # clone the dataset somewhere else
   $ cd ../ && datalad clone superds clone_of_superds
   [INFO ] Cloning superds into '/tmp/clone_of_superds'
   install(ok): /tmp/clone_of_superds (dataset)

   $ # recursively get all contents (without data)
   $ cd clone_of_superds && datalad get -n -r .
   get(ok): /tmp/clone_of_superds/subds2 (dataset)
   get(ok): /tmp/clone_of_superds/subds2/subsubds1 (dataset)
   get(ok): /tmp/clone_of_superds/subds2/subsubds2 (dataset)

   $ # only subsubds of subds2 are installed, not of subds1:
   $ tree
   .
   ├── subds1
   └── subds2
       ├── subsubds1
       └── subsubds2

   4 directories, 0 files

Nevertheless, if ``subds1`` is provided with an explicit path, its subdatasets ``subsubds1`` and ``subsubds2`` will be cloned, essentially overriding the configuration:

.. code-block:: console

   $ datalad get -n -r subds1 && tree
   install(ok): /tmp/clone_of_superds/subds1 (dataset) [Installed subdataset in order to get /tmp/clone_of_superds/subds1]
   .
   ├── subds1
   │   ├── subsubds1
   │   └── subsubds2
   └── subds2
       ├── subsubds1
       └── subsubds2

   6 directories, 0 files

.. index:: ! configuration file; .datalad/config

``.datalad/config``
^^^^^^^^^^^^^^^^^^^

DataLad adds a repository-specific configuration file as well.
It can be found in the ``.datalad`` directory, and just like ``.gitattributes``
and ``.gitmodules`` it is version controlled and is thus shared together with
the dataset. One can configure
`many options <https://docs.datalad.org/en/latest/generated/datalad.config.html>`_,
but currently, our ``.datalad/config`` file only stores a :term:`dataset ID`.
This ID serves to identify a dataset as a unit, across its entire history and flavors.
In a geeky way, this is your dataset's social security number: It will only exist
one time on this planet.

.. runrecord:: _examples/DL-101-123-105
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .datalad/config

Note, though, that local configurations within a Git configuration file
will take precedence over configurations that can be distributed with a dataset.
Otherwise, dataset updates with :dlcmd:`update` (or, for Git-users,
:gitcmd:`pull`) could suddenly and unintentionally alter local DataLad
behavior that was specifically configured.
Also, :term:`Git` and :term:`git-annex` will not query this file for configurations, so please store only sticky options that are specific to DataLad (i.e., under the ``datalad.*`` namespace) in it.

.. index::
   pair: modify configuration; with Git

Writing to configuration files other than ``.git/config``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

"Didn't you say that knowing the :gitcmd:`config` command is already
half of what I need to know?" you ask. "Now there are three other configuration
files, and I do not know with which command I can write into these files."

"Excellent question", you hear in return, "but in reality, you **do** know:
it's also the :gitcmd:`config` command. The only part of it you need to
adjust is the ``-f``, ``--file`` parameter. By default, the command writes to
a Git config file. But it can write to a different file if you specify it
appropriately. For example,

``git config --file=.gitmodules --replace-all submodule."name".url "new URL"``

will update your submodule's URL. Keep in mind though that you would need
to commit this change, as ``.gitmodules`` is version controlled".

Let's try this:

.. runrecord:: _examples/DL-101-123-106
   :workdir: dl-101/DataLad-101
   :language: console

   $ git config --file=.gitmodules --replace-all submodule."recordings/longnow".url "git@github.com:datalad-datasets/longnow-podcasts.git"

This command will replace the submodule's https URL with an SSH URL.
The latter is often used if someone has an *SSH key pair* and added the
public key to their GitHub account (you can read more about this
`here <https://docs.github.com/en/get-started/getting-started-with-git/about-remote-repositories>`_).
We will revert this change shortly, but use it to show the difference between
a :gitcmd:`config` on a ``.git/config`` file and on a version controlled file:

.. runrecord:: _examples/DL-101-123-107
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad status

.. runrecord:: _examples/DL-101-123-108
   :workdir: dl-101/DataLad-101
   :language: console

   $ git diff

As these two commands show, the ``.gitmodules`` file is modified. The https URL
has been deleted (note the ``-``), and an SSH URL has been added. To keep these
changes, we would need to :dlcmd:`save` them. However, as we want to stay with
https URLs, we will just *checkout* this change -- using a Git tool to undo an
unstaged modification.

.. runrecord:: _examples/DL-101-123-109
   :workdir: dl-101/DataLad-101
   :language: console

   $ git checkout .gitmodules
   $ datalad status

Note, though, that the ``.gitattributes`` file cannot be modified with a :gitcmd:`config`
command. This is due to its different format that does not comply with the
``section.variable.value`` structure of all other configuration files. This file, therefore,
has to be edited by hand, with an editor of your choice.

.. index:: ! environment variable
.. _envvars:

Environment variables
^^^^^^^^^^^^^^^^^^^^^

An :term:`environment variable` is a variable set up in your shell
that affects the way the shell or certain software works -- for example,
the environment variables ``HOME``, ``PWD``, or ``PATH``.
Many configuration options that determine the behavior of Git, git-annex, and
DataLad and that could be defined in a configuration file have an associated
environment variable.
If such an environment variable is set, it takes precedence over options set in
configuration files, thus providing both an alternative way to define configurations
and an override mechanism. For example, the ``user.name``
configuration of Git can be overridden by its associated environment variable,
``GIT_AUTHOR_NAME``. Likewise, one can define the environment variable instead
of setting the ``user.name`` configuration in a configuration file.

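One way to observe this override without creating a commit is ``git var GIT_AUTHOR_IDENT``, which prints the author identity Git would use. In the sketch below, the names and the email address are placeholders, and the configuration is passed with ``-c`` on the command line so that no configuration file is touched.

```shell
# identity taken from configuration only
# (the environment variable is removed for this one call)
from_config=$(env -u GIT_AUTHOR_NAME \
    git -c user.name='Config Name' -c user.email='me@example.com' \
    var GIT_AUTHOR_IDENT)

# same configuration, but the environment variable is set for this command
from_env=$(GIT_AUTHOR_NAME='Env Name' \
    git -c user.name='Config Name' -c user.email='me@example.com' \
    var GIT_AUTHOR_IDENT)

echo "$from_config"
echo "$from_env"
```

The first line should start with ``Config Name``, the second with ``Env Name``: the environment variable wins although ``user.name`` is configured in both calls.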
.. index:: configuration item; datalad.log.level

Git, git-annex, and DataLad have more environment variables than anyone would want to
remember. `The ProGit book <https://git-scm.com/book/en/v2/Git-Internals-Environment-Variables>`__
has a good overview on Git's most useful available environment variables for a start.
All of DataLad's configuration options can be translated to their
associated environment variables. Any environment variable with a name that starts with ``DATALAD_``
will be available as the corresponding ``datalad.`` configuration variable,
replacing any ``__`` (two underscores) with a hyphen, then any ``_`` (single underscore)
with a dot, and finally converting all letters to lower case. The ``datalad.log.level``
configuration option thus is the environment variable ``DATALAD_LOG_LEVEL``.

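The translation rule can be written down as a small shell function. This is only an illustration of the three steps described above, not code taken from DataLad; the function name ``envvar_to_config`` and the second variable name are made up.

```shell
# hypothetical helper: translate a DATALAD_* environment variable name
# into the name of the corresponding configuration option
envvar_to_config() {
    printf '%s\n' "$1" |
        sed -e 's/__/-/g' -e 's/_/./g' |   # "__" becomes "-", then "_" becomes "."
        tr '[:upper:]' '[:lower:]'         # finally, lower-case everything
}

envvar_to_config DATALAD_LOG_LEVEL          # prints datalad.log.level
envvar_to_config DATALAD_SOME__OPTION_NAME  # prints datalad.some-option.name
```

The order of the two ``sed`` expressions matters: if single underscores were replaced first, a double underscore would turn into two dots instead of a hyphen.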
.. index:: operating system concept; environment variable
.. find-out-more:: Some more general information on environment variables
   :name: fom-envvar

   Names of environment variables are often all-uppercase. While the ``$`` is not part of
   the name of the environment variable, it is necessary to *refer* to the environment
   variable: To reference the value of the environment variable ``HOME``, for example, you would
   need to use ``echo $HOME`` and not ``echo HOME``. However, environment variables are
   set without a leading ``$``. There are several ways to set an environment variable
   (note that there are no spaces before and after the ``=``!), leading to different
   levels of availability of the variable:

   - ``THEANSWER=42 <command>`` makes the variable ``THEANSWER`` available for the process in ``<command>``.
     For example, ``DATALAD_LOG_LEVEL=debug datalad get <file>`` will execute the :dlcmd:`get`
     command (and only this one) with the log level set to "debug".
   - ``export THEANSWER=42`` makes the variable ``THEANSWER`` available for other processes in the
     same session, but it will not be available to other shells.
   - ``echo 'export THEANSWER=42' >> ~/.bashrc`` will write the variable definition into the
     ``.bashrc`` file and thus make it available to all future shells of the user (i.e., this will make
     the variable permanent for the user).

   To list all of the configured environment variables, type ``env`` into your terminal.

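The difference between the first two ways of setting a variable can be tried out directly in a terminal. The following sketch reuses the ``THEANSWER`` example variable and starts small child shells with ``sh -c`` to check what they get to see; the other variable names are only there to collect the results.

```shell
# start from a clean slate
unset THEANSWER

# 1) prefix assignment: only this one child process sees the variable
one_off=$(THEANSWER=42 sh -c 'echo "${THEANSWER:-unset}"')
afterwards=$(sh -c 'echo "${THEANSWER:-unset}"')

# 2) export: every process started from this shell from now on sees it
export THEANSWER=42
exported=$(sh -c 'echo "${THEANSWER:-unset}"')

echo "one-off: $one_off, afterwards: $afterwards, exported: $exported"
```

The second child shell reports ``unset`` because the prefix assignment did not outlive the command it was attached to.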
Summary
^^^^^^^

This has been an intense lecture, you have to admit. One definite
take-away from it has been that you now know a second reason why the hidden
``.git`` and ``.datalad`` directory contents and also the contents of ``.gitmodules`` and
``.gitattributes`` should not be carelessly tampered with -- they contain all of
the repository's configurations.

But you now also know how to modify these configurations with enough
care and background knowledge such that nothing should go wrong once you
want to work with and change them. You can use the :gitcmd:`config` command
for Git configuration files on different scopes, and even the ``.gitmodules`` or ``.datalad/config``
files. Of course you do not yet know all of the available configuration options. However,
you already know some core Git configurations such as name, email, and editor. Even more
important, you know how to configure git-annex's content management based on ``largefiles``
rules, and you understand the variables within ``.gitmodules`` or the sections
in ``.git/config``. Slowly, you realize with pride,
you are more and more becoming a DataLad power-user.

Write a note about configurations in datasets into ``notes.txt``.

.. runrecord:: _examples/DL-101-123-110
   :workdir: dl-101/DataLad-101
   :language: console

   $ cat << EOT >> notes.txt
   Configurations for datasets exist on different levels (systemwide,
   global, and local), and in different types of files (not version
   controlled (git)config files, or version controlled .datalad/config,
   .gitattributes, or .gitmodules files), or environment variables.
   With the exception of .gitattributes, all configuration files share a
   common structure, and can be modified with the git config command, but
   also with an editor by hand.

   Depending on whether a configuration file is version controlled or
   not, the configurations will be shared together with the dataset.
   More specific configurations and not-shared configurations will always
   take precedence over more global or shared configurations, and
   environment variables take precedence over configurations in files.

   The git config --list --show-origin command is a useful tool to give
   an overview over existing configurations. Particularly important may
   be the .gitattributes file, in which one can set rules for git-annex
   about which files should be version-controlled with Git instead of
   being annexed.

   EOT

.. runrecord:: _examples/DL-101-123-111
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad save -m "add note on configurations and git config"

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-123-112
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_more_on_DYI_configurations

.. rubric:: Footnotes

.. [#f1] When opening any file on a UNIX system, the file does not need to have a file
         extension (such as ``.txt``, ``.pdf``, ``.jpg``) for the operating system to know
         how to open or use this file (in contrast to Windows, which does not know how to
         open a file without an extension). To do this, Unix systems rely on a file's
         MIME type -- a piece of information about a file's content. A ``.txt`` file, for example,
         has MIME type ``text/plain`` as does a bash script (``.sh``), a Python
         script has MIME type ``text/x-python``, a ``.jpg`` file is ``image/jpeg``, and
         a ``.pdf`` file has MIME type ``application/pdf``. You can find out the MIME type
         of a file by running:

         .. code-block:: console

            $ file --mime-type path/to/file

.. [#f2] Specifying ``annex.largefiles`` in your ``.gitattributes`` file will make the configuration
         "portable" -- shared copies of your dataset will retain these configurations.
         You could however also set largefiles rules in your ``.git/config`` file. Rules
         specified in there take precedence over rules in ``.gitattributes``. You can set
         them using the :gitcmd:`config` command:

         .. code-block:: console

            $ git config annex.largefiles 'largerthan=100kb and not (include=*.c or include=*.h)'

         The above command annexes files larger than 100KB, and will never annex files with a
         ``.c`` or ``.h`` extension.

.. [#f3] Should you ever need to, this file is also where one would change the git-annex
         backend in order to store new files with a new backend. Switching the backend of
         *all* files (new as well as existing ones) requires the :gitannexcmd:`migrate`
         command
         (see `the documentation <https://git-annex.branchable.com/git-annex-migrate>`_ for
         more information on this command).