.. _config2:

More on DIY configurations
--------------------------

As the last section already suggested, within a Git repository,
``.git/config`` is not the only configuration file.
There are also ``.gitmodules`` and ``.gitattributes``, and in DataLad datasets
there is also a ``.datalad/config`` file.

All of these files store configurations, but they differ from ``.git/config`` in one important respect:
They are version controlled, and upon sharing a dataset these configurations
will be shared as well. An example of a shared configuration
is the one that the ``text2git`` configuration template applied:
In the shared copy of your dataset, text files are also saved with Git,
and not git-annex (see section :ref:`sibling`). The configuration responsible
for this behavior is in a ``.gitattributes`` file, and we'll start this
section by looking into it.
.. index:: ! Config files; .gitattributes

``.gitattributes``
^^^^^^^^^^^^^^^^^^

This file lies right in the root of your superdataset:

.. runrecord:: _examples/DL-101-123-101
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .gitattributes

This looks neither spectacular nor pretty. Also, it does not follow the ``section-option-value``
organization of the ``.git/config`` file anymore. Instead, there are three lines,
and all of these seem to have something to do with the configuration of git-annex.
There even is one key word that you recognize: MD5E.
If you have read the hidden section in :ref:`symlink`
you will recognize it as a reference to the type of
key used by git-annex to identify and store file content in the object-tree.
The first row, ``* annex.backend=MD5E``, therefore translates to "Everything in this
directory should be hashed with an MD5E hash function".

But what is the rest? We'll start with the last row:

.. code-block:: bash

   * annex.largefiles=((mimeencoding=binary)and(largerthan=0))
Uhhh, cryptic. The lecturer explains: "git-annex will *annex*, that is, *store in the object-tree*,
anything it considers to be a 'large file'. By default, anything
in your dataset would be a 'large file', which means that anything would be annexed.
However, in section :ref:`symlink` I already mentioned that exceptions to this
behavior can be defined based on

#. file size

#. and/or path/pattern, and thus, for example, file extensions,
   or names, or file types (e.g., text files, as with the
   ``text2git`` configuration template).

"In ``.gitattributes``, you can define what is a large file and what is not
simply by writing such rules for git-annex."

What you can see in this ``.gitattributes`` file is a rule based on **file types**:
With ``(mimeencoding=binary)`` [#f1]_, the ``text2git`` configuration template
configured git-annex to regard all files of type "binary" as large files.
Thanks to this little line, your text files are not annexed, but stored
directly in Git.
The patterns ``*`` and ``**`` are so-called "wildcards" used in :term:`globbing`.
``*`` matches any file or directory in the current directory, and ``**`` matches
all files and directories in the current directory *and subdirectories*. In technical
terms, ``**`` matches *recursively*. The third row therefore
translates to "Do not annex anything that is a text file in this directory" for git-annex.

However, rules can be even simpler. The second row, ``**/.git* annex.largefiles=nothing``,
simply takes a complete directory (``.git``) and instructs git-annex to regard nothing
in it as a "large file": no ``.git`` repository in this directory or a subdirectory
should be annexed. This way, the ``.git`` repositories are protected from being annexed.

If you had a single file (``myfile.pdf``) that you did not want annexed, specifying a rule such as:

.. code-block:: bash

   myfile.pdf annex.largefiles=nothing
will keep it stored in Git. To see an example of this, navigate into the longnow subdataset,
and view this dataset's ``.gitattributes`` file:

.. runrecord:: _examples/DL-101-123-102
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat recordings/longnow/.gitattributes

The relevant part is ``README.md annex.largefiles=nothing``.
This instructs git-annex to specifically not annex ``README.md``.
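
To check which rule ends up applying to a given path, Git's own ``git check-attr``
command can be asked for the ``annex.largefiles`` attribute. The following is a minimal
sketch, assuming a throwaway repository in a temporary directory and the hypothetical
file names ``myfile.pdf`` and ``other.pdf``. It only reports the attribute value that
git-annex would evaluate; it does not annex anything:

.. code-block:: bash

   # throwaway repository (assumption: not part of DataLad-101)
   demo=$(mktemp -d) && cd "$demo" && git init -q .
   # a general rule first, then an exception for a single file
   printf '%s\n' '* annex.largefiles=((mimeencoding=binary)and(largerthan=0))' \
                 'myfile.pdf annex.largefiles=nothing' > .gitattributes
   # ask Git which value each path receives (the files need not exist)
   rules=$(git check-attr annex.largefiles -- myfile.pdf other.pdf)
   echo "$rules"

When several lines of one ``.gitattributes`` file match a path, the last matching line
determines the value, which is why the exception is listed after the general rule.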

Lastly, if you wanted to configure a rule based on **size**, you could add a row such as:

.. code-block:: bash

   ** annex.largefiles=(largerthan=20kb)

to store only files exceeding 20KB in size in git-annex [#f2]_.

As you may have noticed, unlike ``.git/config`` files,
there can be multiple ``.gitattributes`` files within a dataset. So far, you have seen one
in the root of the superdataset, and one in the root of the ``longnow`` subdataset.
In principle, you can add one to every directory-level of your dataset.
For example, there is another ``.gitattributes`` file within the
``.datalad`` directory:

.. runrecord:: _examples/DL-101-123-103
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .datalad/.gitattributes

As with Git configuration files, more specific or lower-level configurations take precedence
over more general or higher-level configurations. Specifications in a subdirectory can
therefore overrule specifications made in the ``.gitattributes`` file of the parent
directory.
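
This precedence can be observed with ``git check-attr`` as well. The sketch below is
an illustration under assumptions: it uses a scratch repository, a hypothetical ``code``
subdirectory, and the hypothetical paths ``data.dat`` and ``code/script.py``. The root
rule marks everything as a large file, and the rule in the subdirectory overrules it:

.. code-block:: bash

   # scratch repository with one subdirectory (all names are hypothetical)
   demo=$(mktemp -d) && cd "$demo" && git init -q . && mkdir code
   # root level: treat everything as a large file
   echo '* annex.largefiles=anything' > .gitattributes
   # subdirectory level: treat nothing as a large file
   echo '* annex.largefiles=nothing' > code/.gitattributes
   result=$(git check-attr annex.largefiles -- data.dat code/script.py)
   echo "$result"

The path inside ``code`` reports ``nothing``, while the path in the root reports
``anything``.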

In summary, the ``.gitattributes`` files give you the possibility to configure
what should be annexed and what should not be annexed, down to the level of individual files.
This can be very handy, and allows you to tune your dataset to your custom needs.
For example, files you will often edit by hand could be stored in Git if they are
not too large, to ease modifying them [#f3]_.
Once you know the basics of this type of configuration syntax, writing
your own rules is easy. For more tips on how to configure git-annex's content
management in ``.gitattributes``, take a look at `the git-annex documentation <https://git-annex.branchable.com/tips/largefiles>`_.
Later, however, you will see preconfigured DataLad *procedures* such as ``text2git`` that
can apply useful configurations for you, just as ``text2git`` added the last line
in the root ``.gitattributes`` file.

.. index:: ! Config files; .gitmodules

``.gitmodules``
^^^^^^^^^^^^^^^

One last configuration file that Git creates is the ``.gitmodules`` file.
There is one right in the root of your dataset:

.. runrecord:: _examples/DL-101-123-104
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .gitmodules

Based on these contents, you might have already guessed what this file
stores. The ``.gitmodules`` file is a configuration file that stores the mapping between
your own dataset and any subdatasets you have installed in it.
There will be an entry for each submodule (subdataset) in your dataset.
The name *submodule* is Git terminology, and describes a Git repository inside of
another Git repository, i.e., the super- and subdataset principles.
Upon sharing your dataset, the information about subdatasets and where to retrieve
them from is stored and shared with this file.

Section :ref:`sharelocal1` already mentioned one additional configuration option in a footnote: The ``datalad-recursiveinstall`` key.
This key is defined on a per-subdataset basis, and if set to "``skip``", the given subdataset will not be recursively installed unless it is explicitly specified as a path to :dlcmd:`get [-n/--no-data] -r`.
If you are a maintainer of a superdataset with a monstrous number of subdatasets, you can set this option and share it together with the dataset to prevent an accidental, large recursive installation in particularly deeply nested subdatasets.

Below is a minimal working example of how to apply the configuration and how it works.
Let's create a dataset hierarchy to work with (note that we concatenate multiple commands into a single line using bash's "and" ``&&`` operator):

.. code-block:: bash

   # create a superdataset with two subdatasets
   $ datalad create superds && datalad -C superds create -d . subds1 && datalad -C superds create -d . subds2
   create(ok): /tmp/superds (dataset)
   add(ok): subds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subds1 (dataset)
   add(ok): subds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subds2 (dataset)

Next, we create subdatasets in the subdatasets:

.. code-block:: bash

   # create two subdatasets in subds1
   $ datalad -C superds/subds1 create -d . subsubds1 && datalad -C superds/subds1 create -d . subsubds2
   add(ok): subsubds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds1 (dataset)
   add(ok): subsubds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds2 (dataset)

   # create two subdatasets in subds2
   $ datalad -C superds/subds2 create -d . subsubds1 && datalad -C superds/subds2 create -d . subsubds2
   add(ok): subsubds1 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds1 (dataset)
   add(ok): subsubds2 (file)
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
   create(ok): subsubds2 (dataset)

Here is the directory structure:

.. code-block:: bash

   # the previous commands ran from the parent directory, so enter the superdataset first
   $ cd superds && tree
   .
   ├── subds1
   │   ├── subsubds1
   │   └── subsubds2
   └── subds2
       ├── subsubds1
       └── subsubds2

   # save in the superdataset
   $ datalad save -m "add a few sub and subsub datasets"
   add(ok): subds1 (file)
   add(ok): subds2 (file)
   save(ok): . (dataset)

Now, we can apply the ``datalad-recursiveinstall`` configuration to skip recursive installations for ``subds1``:

.. code-block:: bash

   $ git config -f .gitmodules --add submodule.subds1.datalad-recursiveinstall skip

   # save this configuration
   $ datalad save -m "prevent recursion into subds1, unless explicitly given as path"
   add(ok): .gitmodules (file)
   save(ok): . (dataset)
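
Whether the key was written as intended can be checked by reading it back with
``git config --get``. The following is a minimal sketch, assuming a scratch
``.gitmodules`` file in a temporary directory rather than the ``superds`` hierarchy
created above:

.. code-block:: bash

   # work in a scratch directory (assumption: not the superds dataset)
   cd "$(mktemp -d)"
   # write the key, exactly as in the command shown above
   git config -f .gitmodules --add submodule.subds1.datalad-recursiveinstall skip
   # read the value back from the same file
   value=$(git config -f .gitmodules --get submodule.subds1.datalad-recursiveinstall)
   echo "$value"
   # inspect the resulting section in the file itself
   cat .gitmodules

The file then contains a ``[submodule "subds1"]`` section with a
``datalad-recursiveinstall = skip`` entry.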

If the dataset is cloned, and someone runs a recursive :dlcmd:`get`, the subdatasets of ``subds1`` will not be installed; the subdatasets of ``subds2``, however, will be.

.. code-block:: bash

   # clone the dataset somewhere else
   $ cd ../ && datalad clone superds clone_of_superds
   [INFO ] Cloning superds into '/tmp/clone_of_superds'
   install(ok): /tmp/clone_of_superds (dataset)

   # recursively get all contents (without data)
   $ cd clone_of_superds && datalad get -n -r .
   get(ok): /tmp/clone_of_superds/subds2 (dataset)
   get(ok): /tmp/clone_of_superds/subds2/subsubds1 (dataset)
   get(ok): /tmp/clone_of_superds/subds2/subsubds2 (dataset)

   # only subsubds of subds2 are installed, not of subds1:
   $ tree
   .
   ├── subds1
   └── subds2
       ├── subsubds1
       └── subsubds2

   4 directories, 0 files

Nevertheless, if ``subds1`` is provided with an explicit path, its subdatasets ``subsubds1`` and ``subsubds2`` will be cloned, essentially overriding the configuration:

.. code-block:: bash

   $ datalad get -n -r subds1 && tree
   install(ok): /tmp/clone_of_superds/subds1 (dataset) [Installed subdataset in order to get /tmp/clone_of_superds/subds1]
   .
   ├── subds1
   │   ├── subsubds1
   │   └── subsubds2
   └── subds2
       ├── subsubds1
       └── subsubds2

   6 directories, 0 files

.. index:: ! Config files; .datalad/config

``.datalad/config``
^^^^^^^^^^^^^^^^^^^

DataLad adds a repository-specific configuration file as well.
It can be found in the ``.datalad`` directory, and just like ``.gitattributes``
and ``.gitmodules`` it is version controlled and is thus shared together with
the dataset. One can configure
`many options <https://docs.datalad.org/en/latest/generated/datalad.config.html>`_,
but currently, our ``.datalad/config`` file only stores a :term:`dataset ID`.
This ID serves to identify a dataset as a unit, across its entire history and flavors.
In a geeky way, this is your dataset's social security number: It will only exist
one time on this planet.

.. runrecord:: _examples/DL-101-123-105
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat .datalad/config

Note, though, that local configurations within a Git configuration file
will take precedence over configurations that can be distributed with a dataset.
Otherwise, dataset updates with :dlcmd:`update` (or, for Git-users,
:gitcmd:`pull`) could suddenly and unintentionally alter local DataLad
behavior that was specifically configured.
Also, :term:`Git` and :term:`git-annex` will not query this file for configurations, so please store only sticky options that are specific to DataLad (i.e., under the ``datalad.*`` namespace) in it.

Writing to configuration files other than ``.git/config``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

"Didn't you say that knowing the :gitcmd:`config` command is already
half of what I need to know?" you ask. "Now there are three other configuration
files, and I do not know with which command I can write into these files."

"Excellent question", you hear in return, "but in reality, you **do** know:
it's also the :gitcmd:`config` command. The only part of it you need to
adjust is the ``-f``, ``--file`` parameter. By default, the command writes to
a Git config file. But it can write to a different file if you specify it
appropriately. For example,
``git config --file=.gitmodules --replace-all submodule."name".url "new URL"``
will update your submodule's URL. Keep in mind, though, that you would need
to commit this change, as ``.gitmodules`` is version controlled."

Let's try this:

.. runrecord:: _examples/DL-101-123-106
   :workdir: dl-101/DataLad-101
   :language: console

   $ git config --file=.gitmodules --replace-all submodule."recordings/longnow".url "git@github.com:datalad-datasets/longnow-podcasts.git"

This command will replace the submodule's https URL with an SSH URL.
The latter is often used if someone has an *SSH key pair* and added the
public key to their GitHub account (you can read more about this
`here <https://docs.github.com/en/get-started/getting-started-with-git/about-remote-repositories>`_).
We will revert this change shortly, but use it to show the difference between
a :gitcmd:`config` on a ``.git/config`` file and on a version controlled file:

.. runrecord:: _examples/DL-101-123-107
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad status

.. runrecord:: _examples/DL-101-123-108
   :workdir: dl-101/DataLad-101
   :language: console

   $ git diff

As these two commands show, the ``.gitmodules`` file is modified. The https URL
has been deleted (note the ``-``), and an SSH URL has been added. To keep these
changes, we would need to :dlcmd:`save` them. However, as we want to stay with
https URLs, we will just *checkout* this change -- using a Git tool to undo an
unstaged modification.

.. runrecord:: _examples/DL-101-123-109
   :workdir: dl-101/DataLad-101
   :language: console

   $ git checkout .gitmodules
   $ datalad status

Note, though, that the ``.gitattributes`` file cannot be modified with a :gitcmd:`config`
command. This is due to its different format, which does not comply with the
``section.variable.value`` structure of all other configuration files. This file, therefore,
has to be edited by hand, with an editor of your choice.

.. index:: ! environment variable
.. _envvars:

Environment variables
^^^^^^^^^^^^^^^^^^^^^

An :term:`environment variable` is a variable set up in your shell
that affects the way the shell or certain software works -- for example,
the environment variables ``HOME``, ``PWD``, or ``PATH``.
Many configuration options that determine the behavior of Git, git-annex, and
DataLad and that could be defined in a configuration file have an associated
environment variable.
If such an environment variable is set, it takes precedence over options set in
configuration files, thus providing both an alternative way to define configurations
and an override mechanism. For example, the ``user.name``
configuration of Git can be overridden by its associated environment variable,
``GIT_AUTHOR_NAME``. Likewise, one can define the environment variable instead
of setting the ``user.name`` configuration in a configuration file.
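
The override can be observed with ``git var GIT_AUTHOR_IDENT``, which prints the author
identity that Git would record in a commit. This is a minimal sketch, assuming a scratch
repository and made-up names and email address:

.. code-block:: bash

   # scratch repository; the names and email address below are made up
   cd "$(mktemp -d)" && git init -q .
   # make sure no override is inherited from the calling shell
   unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
   git config user.name "Name From Config"
   git config user.email "elena@example.com"
   # identity as taken from the configuration file
   from_config=$(git var GIT_AUTHOR_IDENT)
   # identity with the environment variable set for this one command
   from_env=$(GIT_AUTHOR_NAME="Name From Environment" git var GIT_AUTHOR_IDENT)
   echo "$from_config"
   echo "$from_env"

The first line starts with ``Name From Config``, the second with
``Name From Environment``; the email address, which is not overridden here, is the same
in both.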

Git, git-annex, and DataLad have more environment variables than anyone would want to
remember. `The ProGit book <https://git-scm.com/book/en/v2/Git-Internals-Environment-Variables>`__
has a good overview on Git's most useful available environment variables for a start.
All of DataLad's configuration options can be translated to their
associated environment variables. Any environment variable with a name that starts with ``DATALAD_``
will be available as the corresponding ``datalad.`` configuration variable,
replacing any ``__`` (two underscores) with a hyphen, then any ``_`` (single underscore)
with a dot, and finally converting all letters to lower case. The ``datalad.log.level``
configuration option thus is the environment variable ``DATALAD_LOG_LEVEL``.
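
The translation rule can be written down as a small shell function. This is only an
illustration of the three steps described above, not DataLad's own implementation;
the function name ``envvar_to_config`` and the variable ``DATALAD_SOME__OPTION_NAME``
are hypothetical:

.. code-block:: bash

   # hypothetical helper: translate an environment variable name
   # into the corresponding configuration option name
   envvar_to_config() {
       # 1. "__" becomes "-", 2. "_" becomes ".", 3. lower-case everything
       printf '%s\n' "$1" | sed -e 's/__/-/g' -e 's/_/./g' | tr '[:upper:]' '[:lower:]'
   }

   envvar_to_config DATALAD_LOG_LEVEL           # prints datalad.log.level
   envvar_to_config DATALAD_SOME__OPTION_NAME   # prints datalad.some-option.name

The order of the two replacements matters: if single underscores were replaced first,
no double underscore would be left to turn into a hyphen.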

.. index:: operating system concept; environment variable
.. find-out-more:: Some more general information on environment variables
   :name: fom-envvar

   Names of environment variables are often all-uppercase. While the ``$`` is not part of
   the name of the environment variable, it is necessary to *refer* to the environment
   variable: To reference the value of the environment variable ``HOME``, for example, you would
   need to use ``echo $HOME`` and not ``echo HOME``. However, environment variables are
   set without a leading ``$``. There are several ways to set an environment variable
   (note that there are no spaces before and after the ``=``!), leading to different
   levels of availability of the variable:

   - ``THEANSWER=42 <command>`` makes the variable ``THEANSWER`` available for the process in ``<command>``.
     For example, ``DATALAD_LOG_LEVEL=debug datalad get <file>`` will execute the :dlcmd:`get`
     command (and only this one) with the log level set to "debug".
   - ``export THEANSWER=42`` makes the variable ``THEANSWER`` available for other processes in the
     same session, but it will not be available to other shells.
   - ``echo 'export THEANSWER=42' >> ~/.bashrc`` will write the variable definition into the
     ``.bashrc`` file and thus make it available to all future shells of the user (i.e., this will make
     the variable permanent for the user).

   To list all of the configured environment variables, type ``env`` into your terminal.
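
The difference between a one-off definition and an exported variable can be tried out
directly. The sketch below uses child shells (``sh -c``) as stand-ins for arbitrary
commands and reuses the example variable ``THEANSWER``:

.. code-block:: bash

   # start from a clean slate
   unset THEANSWER
   # one-off definition: visible only to this single child process
   one_off=$(THEANSWER=42 sh -c 'echo "${THEANSWER:-unset}"')
   # the next child process does not see it anymore
   afterwards=$(sh -c 'echo "${THEANSWER:-unset}"')
   # exported definition: visible to all child processes of this session
   export THEANSWER=42
   exported=$(sh -c 'echo "${THEANSWER:-unset}"')
   echo "one-off: $one_off, afterwards: $afterwards, exported: $exported"

This prints ``one-off: 42, afterwards: unset, exported: 42``.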

Summary
^^^^^^^

This has been an intense lecture, you have to admit. One definite
take-away from it has been that you now know a second reason why the hidden
``.git`` and ``.datalad`` directory contents and also the contents of ``.gitmodules`` and
``.gitattributes`` should not be carelessly tampered with -- they contain all of
the repository's configurations.

But you now also know how to modify these configurations with enough
care and background knowledge such that nothing should go wrong once you
want to work with and change them. You can use the :gitcmd:`config` command
for Git configuration files on different scopes, and even the ``.gitmodules`` or ``.datalad/config``
files. Of course you do not yet know all of the available configuration options. However,
you already know some core Git configurations such as name, email, and editor. Even more
important, you know how to configure git-annex's content management based on ``largefiles``
rules, and you understand the variables within ``.gitmodules`` or the sections
in ``.git/config``. Slowly, you realize with pride,
you're more and more becoming a DataLad power-user.

Write a note about configurations in datasets into ``notes.txt``.

.. runrecord:: _examples/DL-101-123-110
   :workdir: dl-101/DataLad-101
   :language: console

   $ cat << EOT >> notes.txt
   Configurations for datasets exist on different levels (systemwide,
   global, and local), and in different types of files (not version
   controlled (git)config files, or version controlled .datalad/config,
   .gitattributes, or .gitmodules files), or environment variables.
   With the exception of .gitattributes, all configuration files share a
   common structure, and can be modified with the git config command, but
   also with an editor by hand.
   Depending on whether a configuration file is version controlled or
   not, the configurations will be shared together with the dataset.
   More specific configurations and not-shared configurations will always
   take precedence over more global or shared configurations, and
   environment variables take precedence over configurations in files.
   The git config --list --show-origin command is a useful tool to give
   an overview over existing configurations. Particularly important may
   be the .gitattributes file, in which one can set rules for git-annex
   about which files should be version-controlled with Git instead of
   being annexed.
   EOT

.. runrecord:: _examples/DL-101-123-111
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad save -m "add note on configurations and git config"

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-123-112
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_more_on_DYI_configurations

.. rubric:: Footnotes

.. [#f1] When opening any file on a UNIX system, the file does not need to have a file
         extension (such as ``.txt``, ``.pdf``, ``.jpg``) for the operating system to know
         how to open or use this file (in contrast to Windows, which does not know how to
         open a file without an extension). To do this, Unix systems rely on a file's
         MIME type -- information about a file's content. A ``.txt`` file, for example,
         has MIME type ``text/plain`` as does a bash script (``.sh``), a Python
         script has MIME type ``text/x-python``, a ``.jpg`` file is ``image/jpeg``, and
         a ``.pdf`` file has MIME type ``application/pdf``. You can find out the MIME type
         of a file by running:

         .. code-block:: bash

            $ file --mime-type path/to/file

.. [#f2] Specifying ``annex.largefiles`` in your ``.gitattributes`` file will make the configuration
         "portable" -- shared copies of your dataset will retain these configurations.
         You could, however, also set largefiles rules in your ``.git/config`` file. Rules
         specified in there take precedence over rules in ``.gitattributes``. You can set
         them using the :gitcmd:`config` command:

         .. code-block:: bash

            $ git config annex.largefiles 'largerthan=100kb and not (include=*.c or include=*.h)'

         The above command annexes files larger than 100KB, and will never annex files with a
         ``.c`` or ``.h`` extension.

.. [#f3] Should you ever need to, this file is also where one would change the git-annex
         backend in order to store new files with a new backend. Switching the backend of
         *all* files (new as well as existing ones) requires the :gitannexcmd:`migrate`
         command
         (see `the documentation <https://git-annex.branchable.com/git-annex-migrate>`_ for
         more information on this command).