  1. .. _config:
  2. DIY configurations
  3. ------------------
  4. Back in section :ref:`text2git`, you already learned that there
  5. are dataset configurations, and that these configurations can
  6. be modified, for example, with the ``-c text2git`` option.
  7. This option applies a configuration template to store text
  8. files in :term:`Git` instead of :term:`git-annex`, and thereby
  9. overrides the DataLad dataset's default configuration, which is to store
  10. every file in git-annex.
  11. The lecture today focuses entirely on the topic of configurations,
  12. and aims to equip everyone with the basics to configure
  13. their general and dataset specific setup to their needs.
  14. This is not only a handy way to tune a dataset to one's
  15. wishes, but also helpful to understand potential differences in
  16. command execution and file handling between two users,
  17. computers, or datasets.
  18. "First of all, when we talk about configurations, we have
  19. to differentiate between different scopes of configuration,
  20. and different tools the configuration belongs or applies to",
  21. our lecturer starts. "In DataLad datasets, different tools can
  22. have a configuration: :term:`Git`, :term:`git-annex`, and
  23. DataLad itself. Because these tools are all
  24. combined by DataLad to help you manage your data,
  25. it is important to understand how the configuration of one
  26. software is used by or influences a second tool, or the overall
  27. dataset performance."
  28. "Oh crap, one of these theoretical lectures again," groans a
  29. student from the row behind you. Personally, you'd also
  30. be much more excited
  31. about any hands-on lecture filled with commands. But the
  32. recent lecture about :term:`git-annex` and the :term:`object-tree`
  33. was surprisingly captivating, so you are actually looking forward to today.
  34. "Shht! I want to hear this!", you shush him with a wink.
  35. "We will start by looking into the very first configuration
  36. you did, already before the course started: The *global*
  37. Git configuration," the lecturer says.
  38. .. index::
  39. pair: config; Git command
  40. At one point in time, you likely followed instructions such as
  41. in :ref:`install` and configured your
  42. *Git identity* with the commands:
  43. .. code-block:: console
  44. $ git config --global --add user.name "Elena Piscopia"
  45. $ git config --global --add user.email elena@example.net
  46. "What the above commands do is very simple: They search for
  47. a specific configuration file, and set the variables specified
  48. in the command -- in this case user name and user email address
  49. -- to the values provided with the command," she explains.
  50. "This general procedure, specifying a value for a configuration
  51. variable in a configuration file, is how you can configure the
  52. different tools to your needs. The configuration, therefore,
  53. is really easy. Even if you are only used to ticking boxes
  54. in the ``settings`` tab of a software tool so far, it's intuitive
  55. to understand how a configuration file in principle works and also
  56. how to use it. The only pieces of information you will need
  57. are the necessary files, or the command that writes to them, and
  58. the available options for configuration, that's it. And what's
  59. really cool is that all tools we'll be looking at -- Git, git-annex,
  60. and DataLad -- can be configured using the :gitcmd:`config`
  61. command [#f1]_. Therefore, once you understand the syntax of this
  62. command, you already know half of what's relevant. The other half
  63. is understanding what you are doing. Now then, let's learn *how*
  64. to configure settings, but also *understand* what we are doing
  65. with these configurations."
  66. "This seems easy enough", you think. Let's see what types of
  67. configurations there are.
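
To make the lecturer's point concrete, here is a minimal sketch of what such a command does to a file. It assumes a scratch file created with ``mktemp`` (the variable name ``demo_config`` is made up for illustration) and uses the ``--file`` option of :gitcmd:`config`, so your real configuration stays untouched:

.. code-block:: bash

   # Scratch file standing in for a real configuration file (hypothetical name).
   demo_config="$(mktemp)"

   # Write two variables, mirroring the identity commands above.
   git config --file "$demo_config" user.name "Elena Piscopia"
   git config --file "$demo_config" user.email elena@example.net

   # Read one value back, then look at the plain-text file itself.
   name="$(git config --file "$demo_config" --get user.name)"
   echo "$name"
   cat "$demo_config"

The file that ``cat`` prints is ordinary text: a ``[user]`` section followed by one ``variable = value`` line per setting.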
  68. Git config files
  69. ^^^^^^^^^^^^^^^^
  70. The user name and email configuration
  71. is a *user-specific* configuration (called *global*
  72. configuration by Git), and therefore applies to your user account.
  73. Wherever on your computer *you* run a Git, git-annex, or DataLad
  74. command, this global configuration will
  75. associate the name and email address you supplied in
  76. the :gitcmd:`config` commands above with this action.
  77. For example, whenever you
  78. ``datalad save``, the information in this file is used for the
  79. history entry about commit author and email.
  80. Apart from *global* Git configurations, there are also *system-wide* [#f2]_
  81. and *repository* configurations. Each of these configurations
  82. resides in its own file. The global configuration is stored in a file called
  83. ``.gitconfig`` in your home directory. Besides
  84. your name and email address, this file can store general
  85. per-user configurations, such as a default editor [#f3]_, or highlighting
  86. options.
  87. The *repository-specific* configurations apply to each individual
  88. repository. Their scope is more limited than the *global*
  89. configuration (namely to a single repository), but it can overrule global
  90. configurations: The more specific the scope of a configuration file is, the more
  91. important it is, and the variables in the more specific configuration
  92. will take precedence over variables in less specific configuration files.
  93. One could, for example, have :term:`vim` configured to be the default editor
  94. on a global scope, but could overrule this by setting the editor to ``nano``
  95. in a given repository. For this reason, the repository-specific configuration
  96. does not reside in a file in your home directory, but in ``.git/config``
  97. within every Git repository (and thus DataLad dataset).
  98. Thus, there are three different scopes of Git configuration, and each is defined
  99. in a ``config`` file in a different location. The configurations will determine
  100. how Git behaves. In principle, all of these files can configure
  101. the same variables differently, but more specific scopes take precedence over broader
  102. scopes. Conveniently, not only can DataLad and git-annex be configured with
  103. the same command as Git, but in many cases they will also use exactly the same
  104. files as Git for their own configurations.
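
The precedence rule can be tried out safely with a sketch like the following. It is not part of ``DataLad-101``: it assumes a throwaway directory from ``mktemp -d``, and the helper ``sandbox_git`` is a made-up wrapper that points ``HOME`` and ``XDG_CONFIG_HOME`` at that directory, so the "global" scope in this demo is a scratch file rather than your own ``~/.gitconfig``:

.. code-block:: bash

   # Throwaway playground; nothing here touches your real configuration.
   sandbox="$(mktemp -d)"
   mkdir -p "$sandbox/home"
   git init --quiet "$sandbox/repo"

   # Hypothetical helper: run git with a redirected home directory.
   sandbox_git() {
       HOME="$sandbox/home" XDG_CONFIG_HOME="$sandbox/home/.config" git "$@"
   }

   # Broad scope: the (sandboxed) user-specific configuration.
   sandbox_git config --global core.editor vim
   # Narrow scope: the repository's own .git/config.
   sandbox_git -C "$sandbox/repo" config --local core.editor nano

   # Inside the repository, the more specific scope takes precedence.
   effective="$(sandbox_git -C "$sandbox/repo" config --get core.editor)"
   global_only="$(sandbox_git config --global --get core.editor)"
   echo "effective in repository: $effective"
   echo "global value:            $global_only"

The global value is still there, it is merely overruled within this one repository.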
  105. .. index:: ! configuration file; .git/config
  106. Let's find out what the repository-specific configuration file in the ``DataLad-101``
  107. superdataset looks like:
  108. .. runrecord:: _examples/DL-101-122-101
  109. :language: console
  110. :workdir: dl-101/DataLad-101
  111. $ cat .git/config
  112. This file consists of so-called "sections" with the section names
  113. in square brackets (e.g., ``core``). Occasionally, a section can have
  114. subsections: This is indicated by subsection names in
  115. quotation marks after the section name. For example, ``roommate`` is a subsection
  116. of the section ``remote``.
  117. Within each section, ``variable = value`` pairs specify configurations
  118. for the given (sub)section.
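
As a small illustration of this layout, the sketch below writes one plain variable and one variable in a subsection into a scratch file (``demo_config`` is a made-up name, and ``--file`` keeps the experiment away from any real configuration). The URL is the ``roommate`` path used in this book:

.. code-block:: bash

   demo_config="$(mktemp)"

   # section.variable
   git config --file "$demo_config" core.editor nano
   # section.subsection.variable
   git config --file "$demo_config" remote.roommate.url ../mock_user/onemoredir/DataLad-101

   # The file now holds a [core] section and a [remote "roommate"] subsection.
   cat "$demo_config"
   url="$(git config --file "$demo_config" --get remote.roommate.url)"
   echo "$url"
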
  119. .. index::
  120. pair: configure editor; with Git
  121. The first section is called ``core`` -- as the name suggests,
  122. this configures core Git functionality. There are
  123. `many more <https://git-scm.com/docs/git-config#Documentation/git-config.txt-corefileMode>`_
  124. configurations than the ones in this config file, but
  125. they are related to Git, and less related or important to the configuration of
  126. a DataLad dataset. We will use this section to showcase the anatomy of the
  127. :gitcmd:`config` command. If, for example, you would want to specifically
  128. configure :term:`nano` to be the default editor in this dataset, you
  129. can do it like this:
  130. .. runrecord:: _examples/DL-101-122-102
  131. :language: console
  132. :workdir: dl-101/DataLad-101
  133. $ git config --local --add core.editor "nano"
  134. The command consists of the base command :gitcmd:`config`,
  135. a specification of the scope of the configuration with the ``--local``
  136. flag, a ``name`` specification consisting of section and key with the
  137. notation ``section.variable`` (here: ``core.editor``), and finally the value
  138. specification ``"nano"``.
  139. Let's see what has changed:
  140. .. runrecord:: _examples/DL-101-122-103
  141. :language: console
  142. :workdir: dl-101/DataLad-101
  143. :emphasize-lines: 7
  144. $ cat .git/config
  145. With this additional line in your repository's Git configuration, ``nano`` will
  146. be used as a default editor regardless of the configuration in your global
  147. or system-wide configuration. Note that the flag ``--local`` applies the
  148. configuration to your repository's ``.git/config`` file, whereas ``--global``
  149. would apply it as a user specific configuration, and ``--system`` as a
  150. system-wide configuration.
  151. If you would want to change this existing line in your ``.git/config``
  152. file, you would replace ``--add`` with ``--replace-all`` such as in:
  153. .. code-block:: console
  154. $ git config --local --replace-all core.editor "vim"
  155. to configure :term:`vim` to be your default editor.
  156. Note that while being a good toy example, it is not a common thing to
  157. configure repository-specific editors.
  158. This example demonstrated the structure of a :gitcmd:`config`
  159. command. By specifying the ``name`` option with ``section.variable``
  160. (or ``section.subsection.variable`` if there is a subsection), and
  161. a value, one can configure Git, git-annex, and DataLad.
  162. *Most* of these configurations will be written to a ``config`` file
  163. of Git, depending on the scope (local, global, system-wide)
  164. specified in the command.
  165. .. index::
  166. pair: unset configuration; with Git
  167. .. find-out-more:: If things go wrong during Git config
  168. If something goes wrong during the :gitcmd:`config` command,
  169. for example, you end up having two keys of the same name because you
  170. added a key instead of replacing an existing one, you can use the
  171. ``--unset`` option to remove the line (or ``--unset-all`` if the key occurs more than once). Alternatively, you can also open
  172. the config file in an editor and remove or change sections by hand.
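
The difference between adding, replacing, and unsetting can be observed in a scratch file. The following is only a sketch under that assumption (``demo_config`` and the counter variables are made-up names). It counts how many ``core.editor`` entries exist at each step:

.. code-block:: bash

   demo_config="$(mktemp)"

   # Adding the same key twice leaves two entries behind.
   git config --file "$demo_config" --add core.editor nano
   git config --file "$demo_config" --add core.editor vim
   before="$(git config --file "$demo_config" --get-all core.editor | grep -c .)"

   # --replace-all collapses them into a single entry.
   git config --file "$demo_config" --replace-all core.editor vim
   after="$(git config --file "$demo_config" --get-all core.editor | grep -c .)"

   # --unset removes the remaining entry; a later --get finds nothing.
   git config --file "$demo_config" --unset core.editor
   if git config --file "$demo_config" --get core.editor >/dev/null; then
       still_set=yes
   else
       still_set=no
   fi
   echo "entries before: $before, after: $after, still set: $still_set"
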
  173. The only information you need, therefore, is the name of a section and
  174. variable to configure, and the value you want to specify. But in many cases
  175. it is also useful to find out which configurations are already set in
  176. which way and where. For this, the :gitcmd:`config --list --show-origin`
  177. command is useful. It will display all configurations and their location:
  178. .. code-block:: console
  179. $ git config --list --show-origin
  180. file:/home/bob/.gitconfig user.name=Bob McBobface
  181. file:/home/bob/.gitconfig user.email=bob@mcbobface.com
  182. file:.git/config annex.uuid=1f83595e-bcba-4226-aa2c-6f0153eb3c54
  183. file:.git/config annex.backends=MD5E
  184. file:.git/config submodule.recordings/longnow.url=https://github.com/✂
  185. file:.git/config submodule.recordings/longnow.active=true
  186. file:.git/config remote.roommate.url=../mock_user/onemoredir/DataLad-101
  187. file:.git/config remote.roommate.annex-uuid=a5ae24de-1533-4b09-98b9-cd9ba6bf303c
  188. file:.git/config submodule.longnow.url=https://github.com/✂
  189. file:.git/config submodule.longnow.active=true
  190. ...
  191. This example shows some configurations in the global ``.gitconfig``
  192. file, and the configurations within ``DataLad-101/.git/config``.
  193. The command is very handy to display all configurations at once to identify
  194. configuration problems, find the right configuration file to make a change to,
  195. or simply remind oneself of the existing configurations, and it is a useful
  196. helper to keep in the back of your head.
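
``--show-origin`` also works for a single variable, which is handy when you only care about one setting. A minimal sketch, assuming a throwaway repository created with ``mktemp -d`` (the variable names are made up):

.. code-block:: bash

   sandbox="$(mktemp -d)"
   git init --quiet "$sandbox"
   git -C "$sandbox" config --local core.editor nano

   # Ask for one key and the file it was read from.
   origin="$(git -C "$sandbox" config --local --show-origin --get core.editor)"
   echo "$origin"

The printed line names the file first and the value second, just like the entries in the listing above.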
  197. At this point you may feel like many of these configurations or the configuration file
  198. inside of ``DataLad-101`` do not appear to be
  199. intuitively understandable enough to confidently apply changes to them,
  200. or identify necessary changes. And indeed, most of the sections and variables
  201. or values in there are irrelevant for understanding the book, your dataset,
  202. or DataLad, and can just be left as they are. The previous section merely served
  203. to de-mystify the :gitcmd:`config` command and the configuration files.
  204. Nevertheless, it might be helpful to get an overview about the meaning of the
  205. remaining sections in that file, and the Find-out-more :ref:`that dissects this config file further <fom_gitconfig>` can give you a glimpse of this.
  206. .. index:: dataset configuration
  207. .. find-out-more:: Dissecting a Git config file further
  208. :name: fom_gitconfig
  209. :float:
  210. Let's walk through the Git config file of ``DataLad-101``:
  211. As mentioned above, git-annex will use the
  212. :term:`Git config file` for some of its configurations, such as the second section.
  213. It lists the repository version and :term:`annex UUID` [#f4]_ (:gitannexcmd:`whereis` displays information about where the
  214. annexed content is with these UUIDs).
  215. You may recognize the fourth part of the configuration, the subsection
  216. ``"recordings/longnow"`` in the section ``submodule``.
  217. Clearly, this is a reference to the ``longnow`` podcasts
  218. we cloned as a subdataset. The name *submodule* is Git
  219. terminology, and describes a Git repository inside of
  220. another Git repository -- just like
  221. the super- and subdataset principles you discovered in the
  222. section :ref:`nesting`. When you clone a DataLad dataset
  223. as a subdataset, it gets *registered* in this file.
  224. For each subdataset, an individual submodule entry
  225. will store the information about the subdataset's
  226. ``--source`` or *origin* (the "url").
  227. Thus, every subdataset in your dataset
  228. will be listed in this file.
  229. If you want, go back to section :ref:`installds` to see that the
  230. "url" is the same URL we cloned the longnow dataset from, and
  231. go back to section :ref:`sharelocal1` to remind yourself of
  232. what cloning a dataset with subdatasets looked and felt like.
  233. Another interesting part is the last section, "remote".
  234. Here we can find the :term:`sibling` "roommate" we defined
  235. in :ref:`sibling`. The term :term:`remote` is Git-terminology and is
  236. used to describe other repositories or DataLad datasets that the
  237. repository knows about.
  238. This file, therefore, is where DataLad *registered* the sibling
  239. with :dlcmd:`siblings add`, and thanks to it you can
  240. collaborate with your roommate.
  241. The value to the ``url`` variable is a *path*. If at any point
  242. either your superdataset or the remote moves on your file system,
  243. the association between the two datasets breaks -- this can be fixed by adjusting this
  244. path, and a demonstration of this is in section :ref:`file system`.
  245. ``fetch`` contains a specification of which parts of the repository are
  246. updated -- in this case everything (all of the branches).
  247. Lastly, the ``annex-ignore = false`` configuration allows git-annex
  248. to query the remote when it tries to retrieve data from annexed content.
  249. .. index::
  250. pair: configuration; DataLad command
  251. pair: set configuration; with DataLad
  252. The ``datalad configuration`` command
  253. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  254. Although this section put a focus on the ``git config`` command, it is important to mention that there also is a :dlcmd:`configuration` command.
  255. It is not identical to ``git config``, but while it lacks some features of ``git config``, such as the ability to set system-wide configuration, it has additional features.
  256. Beyond the ``local`` and ``global`` scopes, it also supports :term:`branch` specific configurations in the ``.datalad/config`` file (further discussed in the next section), setting configurations recursively through dataset hierarchies, and multi-configuration queries (such as ``datalad configuration get user.name user.email``).
  257. By default, ``datalad configuration`` will ``dump`` (list) the effective configuration including relevant ``DATALAD_*`` :term:`environment variable`\s, and also annotate the purpose of many common configuration items.
  258. The subcommands ``datalad configuration get`` or ``datalad configuration set`` perform queries or set configurations.
  259. You can find out more information on this command in the command documentation.
  260. ``.git/config`` versus other (configuration) files
  261. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  262. One crucial aspect distinguishes the ``.git/config`` file from many other files
  263. in your dataset: Even though it is part of your dataset, it won't be shared together
  264. with the dataset. The reason for this is that this file is not version
  265. controlled, as it lies within the ``.git`` directory.
  266. Repository-specific configurations within your ``.git/config``
  267. file are thus not written to history. Any local configuration in ``.git/config``
  268. applies to the dataset, but it does not *stick* to the dataset.
  269. One can have the misconception that because the configurations were made *in*
  270. the dataset, these configurations will also be shared together with the dataset.
  271. ``.git/config``, however, behaves just as your global or system-wide configurations.
  272. These configurations are in effect on a system, or for a user, or for a dataset,
  273. but are not shared.
  274. A :dlcmd:`clone` command of someone's dataset will not get you their
  275. editor configuration, should they have included one in their config file.
  276. Instead, upon a :dlcmd:`clone`, a new config file will be created.
  277. This means, however, that configurations that should "stick" to a dataset [#f5]_
  278. need to be defined in different files -- files that are version controlled.
  279. The next section will talk about them.
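
That a local configuration does not travel with a clone can be checked with a short experiment. The sketch below uses plain ``git clone`` on a throwaway, empty repository instead of :dlcmd:`clone` on a real dataset, so it is an analogy and not a DataLad workflow. All paths and variable names are made up:

.. code-block:: bash

   sandbox="$(mktemp -d)"
   git init --quiet "$sandbox/original"

   # A repository-specific setting in the original ...
   git -C "$sandbox/original" config --local core.editor nano

   # ... is not carried over, because the clone gets a fresh .git/config.
   git clone --quiet "$sandbox/original" "$sandbox/clone"

   in_original="$(git -C "$sandbox/original" config --local --get core.editor)"
   if git -C "$sandbox/clone" config --local --get core.editor >/dev/null; then
       in_clone=yes
   else
       in_clone=no
   fi
   echo "original: $in_original, set in clone: $in_clone"

Git may warn that the cloned repository is empty. That is expected here and does not affect the result.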
  280. .. rubric:: Footnotes
  281. .. [#f1] As an alternative to a ``git config`` command, you could also run configuration
  282. templates or procedures that apply predefined configurations or in some cases even
  283. add the information to the configuration file by hand and save it using an editor of your choice. See :ref:`procedures` for more info.
  284. .. [#f2] The third scope of a Git configuration is the system-wide configuration.
  285. These are stored (if they exist) in ``/etc/gitconfig`` and contain settings that would
  286. apply to every user on the computer you are using. These configurations
  287. are not relevant for DataLad-101, and we will thus skip them. You can
  288. read more about Git's configurations and different files
  289. `here <https://git-scm.com/docs/git-config>`_.
  290. .. [#f3] If your default editor is :term:`vim` and you do not like this, now can be the time
  291. to change it! Choose either of two options:
  292. 1) Open up the file with an editor of your choice (e.g., `nano <https://www.howtogeek.com/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor>`_), and either paste the following configuration or edit it if it already exists:
  293. .. code-block:: ini
  294. [core]
  295. editor = nano
  296. 2) Run the following command, but exchange ``nano`` with an editor of your choice:
  297. .. code-block:: console
  298. $ git config --global --add core.editor "nano"
  299. .. [#f4] A UUID is a universally unique identifier -- a 128-bit number
  300. that unambiguously identifies information.
  301. .. [#f5] Please note that not all configurations can be written to files other than ``.git/config``.
  302. Some of the files introduced in the next section will not be queried by Git, and in principle, it is a good thing that one cannot share arbitrary configurations together with a dataset, as this could be a potential security threat.
  303. In those cases where you need dataset clones to inherit certain non-sticky configurations, it is advised to write a custom procedure and distribute it together with the dataset.
  304. The next two sections contain concrete usecases and tutorials.