- .. _config:
- DIY configurations
- ------------------
- Back in section :ref:`text2git`, you already learned that there
- are dataset configurations, and that these configurations can
- be modified, for example, with the ``-c text2git`` option.
- This option applies a configuration template to store text
- files in :term:`Git` instead of :term:`git-annex`, and thereby
- overrides the DataLad dataset's default configuration, which stores
- every file in git-annex.
- The lecture today focuses entirely on the topic of configurations,
- and aims to equip everyone with the basics to configure
- their general and dataset-specific setup to their needs.
- This is not only a handy way to tune a dataset to one's
- wishes, but also helpful to understand potential differences in
- command execution and file handling between two users,
- computers, or datasets.
- "First of all, when we talk about configurations, we have
- to differentiate between different scopes of configuration,
- and different tools the configuration belongs or applies to",
- our lecturer starts. "In DataLad datasets, different tools can
- have a configuration: :term:`Git`, :term:`git-annex`, and
- DataLad itself. Because these tools are all
- combined by DataLad to help you manage your data,
- it is important to understand how the configuration of one
- software is used by or influences a second tool, or the overall
- dataset performance."
- "Oh crap, one of these theoretical lectures again," moans a
- student from the row behind you. Personally, you'd also
- be much more excited
- about any hands-on lecture filled with commands. But the
- recent lecture about :term:`git-annex` and the :term:`object-tree`
- was surprisingly captivating, so you are actually looking forward to today.
- "Shht! I want to hear this!", you shush him with a wink.
- "We will start by looking into the very first configuration
- you did, already before the course started: The *global*
- Git configuration," the lecturer says.
- .. index::
- pair: config; Git command
- At one point in time, you likely followed instructions such as
- in :ref:`install` and configured your
- *Git identity* with the commands:
- .. code-block:: console
- $ git config --global --add user.name "Elena Piscopia"
- $ git config --global --add user.email elena@example.net
- "What the above commands do is very simple: They search for
- a specific configuration file, and set the variables specified
- in the command -- in this case user name and user email address
- -- to the values provided with the command," she explains.
- "This general procedure, specifying a value for a configuration
- variable in a configuration file, is how you can configure the
- different tools to your needs. The configuration, therefore,
- is really easy. Even if you are only used to ticking boxes
- in the ``settings`` tab of a software tool so far, it's intuitive
- to understand how a configuration file in principle works and also
- how to use it. The only pieces of information you need
- are the necessary files, or the command that writes to them, and
- the available options for configuration. That's it. And what's
- really cool is that all tools we'll be looking at -- Git, git-annex,
- and DataLad -- can be configured using the :gitcmd:`config`
- command [#f1]_. Therefore, once you understand the syntax of this
- command, you already know half of what's relevant. The other half
- is understanding what you are doing. Now then, let's learn *how*
- to configure settings, but also *understand* what we are doing
- with these configurations."
- "This seems easy enough", you think. Let's see what types of
- configurations there are.
- Git config files
- ^^^^^^^^^^^^^^^^
- The user name and email configuration
- is a *user-specific* configuration (called *global*
- configuration by Git), and therefore applies to your user account.
- Wherever on your computer *you* run a Git, git-annex, or DataLad
- command, this global configuration will
- associate the name and email address you supplied in
- the :gitcmd:`config` commands above with this action.
- For example, whenever you
- ``datalad save``, the information in this file is used for the
- history entry about commit author and email.
- Apart from *global* Git configurations, there are also *system-wide* [#f2]_
- and *repository* configurations. Each of these configurations
- resides in its own file. The global configuration is stored in a file called
- ``.gitconfig`` in your home directory. Besides
- your name and email address, this file can store general
- per-user configurations, such as a default editor [#f3]_, or highlighting
- options.
- The *repository-specific* configurations apply to each individual
- repository. Their scope is more limited than the *global*
- configuration (namely to a single repository), but it can overrule global
- configurations: The more specific the scope of a configuration file is, the more
- important it is, and the variables in the more specific configuration
- will take precedence over variables in less specific configuration files.
- One could, for example, have :term:`vim` configured to be the default editor
- on a global scope, but could overrule this by setting the editor to ``nano``
- in a given repository. For this reason, the repository-specific configuration
- does not reside in a file in your home directory, but in ``.git/config``
- within every Git repository (and thus DataLad dataset).
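- As a minimal sketch of this precedence rule, the following throwaway script sets ``core.editor`` on both the global and the repository scope and then asks Git for the effective value. The sandbox directory and the redirected ``HOME`` are assumptions made purely for illustration, so that your real ``~/.gitconfig`` stays untouched.

```shell
#!/bin/sh
# Sketch: a repository-level value overrides a global one.
# Everything happens in a throwaway sandbox (hypothetical paths).
set -eu
sandbox=$(mktemp -d)
export HOME="$sandbox/home"                 # redirect the "global" scope
export XDG_CONFIG_HOME="$HOME/.config"      # keep Git away from an existing XDG config
export GIT_CONFIG_NOSYSTEM=1                # ignore /etc/gitconfig
mkdir -p "$HOME"
git init --quiet "$sandbox/repo"
cd "$sandbox/repo"

git config --global core.editor vim         # lands in $HOME/.gitconfig
git config --local core.editor nano         # lands in .git/config

global_value=$(git config --global --get core.editor)
effective_value=$(git config --get core.editor)   # no scope flag: all scopes are merged
echo "global: $global_value, effective in this repository: $effective_value"
```

- Queried without a scope flag, Git should report ``nano`` here, because the value in ``.git/config`` takes precedence over the one in the global file.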
- Thus, there are three different scopes of Git configuration, and each is defined
- in a ``config`` file in a different location. The configurations will determine
- how Git behaves. In principle, all of these files can configure
- the same variables differently, but more specific scopes take precedence over broader
- scopes. Conveniently, not only can DataLad and git-annex be configured with
- the same command as Git, but in many cases they will also use exactly the same
- files as Git for their own configurations.
- .. index:: ! configuration file; .git/config
- Let's find out what the repository-specific configuration file in the ``DataLad-101``
- superdataset looks like:
- .. runrecord:: _examples/DL-101-122-101
- :language: console
- :workdir: dl-101/DataLad-101
- $ cat .git/config
- This file consists of so-called "sections" with the section names
- in square brackets (e.g., ``core``). Occasionally, a section can have
- subsections: This is indicated by subsection names in
- quotation marks after the section name. For example, ``roommate`` is a subsection
- of the section ``remote``.
- Within each section, ``variable = value`` pairs specify configurations
- for the given (sub)section.
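- To see how section, subsection, and variable names map onto the file layout, here is a small sketch that writes two variables into a scratch file with ``git config --file`` instead of touching a real repository. The scratch file is an assumption for illustration; the ``remote.roommate.url`` value is copied from the ``DataLad-101`` example.

```shell
#!/bin/sh
# Sketch: how "section.variable" and "section.subsection.variable"
# end up in a Git config file. Uses a scratch file, not a real repository.
set -eu
cfg=$(mktemp)

# section "core", variable "bare"
git config --file "$cfg" core.bare false
# section "remote", subsection "roommate", variable "url"
git config --file "$cfg" remote.roommate.url ../mock_user/onemoredir/DataLad-101

cat "$cfg"     # shows a [core] header and a [remote "roommate"] header
url=$(git config --file "$cfg" --get remote.roommate.url)
echo "url read back: $url"
```

- The subsection appears in quotation marks inside the square brackets, and each variable sits on its own ``variable = value`` line below its header.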
- .. index::
- pair: configure editor; with Git
- The first section is called ``core`` -- as the name suggests,
- this configures core Git functionality. There are
- `many more <https://git-scm.com/docs/git-config#Documentation/git-config.txt-corefileMode>`_
- configurations than the ones in this config file, but
- they are related to Git, and less related or important to the configuration of
- a DataLad dataset. We will use this section to showcase the anatomy of the
- :gitcmd:`config` command. If, for example, you would want to specifically
- configure :term:`nano` to be the default editor in this dataset, you
- can do it like this:
- .. runrecord:: _examples/DL-101-122-102
- :language: console
- :workdir: dl-101/DataLad-101
- $ git config --local --add core.editor "nano"
- The command consists of the base command :gitcmd:`config`,
- a specification of the scope of the configuration with the ``--local``
- flag, a ``name`` specification consisting of section and key with the
- notation ``section.variable`` (here: ``core.editor``), and finally the value
- specification ``"nano"``.
- Let's see what has changed:
- .. runrecord:: _examples/DL-101-122-103
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 7
- $ cat .git/config
- With this additional line in your repository's Git configuration, ``nano`` will
- be used as a default editor regardless of the configuration in your global
- or system-wide configuration. Note that the flag ``--local`` applies the
- configuration to your repository's ``.git/config`` file, whereas ``--global``
- would apply it as a user-specific configuration, and ``--system`` as a
- system-wide configuration.
- If you wanted to change this existing line in your ``.git/config``
- file, you would replace ``--add`` with ``--replace-all`` such as in:
- .. code-block:: console
- $ git config --local --replace-all core.editor "vim"
- to configure :term:`vim` to be your default editor.
- Note that while being a good toy example, it is not a common thing to
- configure repository-specific editors.
- This example demonstrated the structure of a :gitcmd:`config`
- command. By specifying the ``name`` option with ``section.variable``
- (or ``section.subsection.variable`` if there is a subsection), and
- a value, one can configure Git, git-annex, and DataLad.
- *Most* of these configurations will be written to a ``config`` file
- of Git, depending on the scope (local, global, system-wide)
- specified in the command.
- .. index::
- pair: unset configuration; with Git
- .. find-out-more:: If things go wrong during Git config
- If something goes wrong during the :gitcmd:`config` command,
- for example, you end up having two keys of the same name because you
- added a key instead of replacing an existing one, you can use the
- ``--unset`` option to remove the line. If the key occurs more than once, use ``--unset-all`` instead, as plain ``--unset`` refuses to act on a key with several values. Alternatively, you can also open
- the config file in an editor and remove or change sections by hand.
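- The duplicate-key situation can be reproduced safely in a scratch file. The sketch below is illustrative and assumes a temporary file rather than a real ``.git/config``: it adds ``core.editor`` twice, shows that a plain ``--unset`` does not succeed in that state, and then cleans up with ``--unset-all``.

```shell
#!/bin/sh
# Sketch: recover from an accidentally duplicated key.
# Works on a scratch file (hypothetical), not on a real .git/config.
set -eu
cfg=$(mktemp)

# Using --add twice leaves two entries for the same key
git config --file "$cfg" --add core.editor nano
git config --file "$cfg" --add core.editor vim
count_before=$(git config --file "$cfg" --get-all core.editor | wc -l | tr -d ' ')

# Plain --unset exits non-zero while several values exist; record the status
git config --file "$cfg" --unset core.editor 2>/dev/null && unset_status=0 || unset_status=$?

# Remove every occurrence, then set the one value that is wanted
git config --file "$cfg" --unset-all core.editor
git config --file "$cfg" --add core.editor nano
count_after=$(git config --file "$cfg" --get-all core.editor | wc -l | tr -d ' ')

echo "entries before: $count_before, --unset status: $unset_status, entries after: $count_after"
```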
- The only information you need, therefore, is the name of a section and
- variable to configure, and the value you want to specify. But in many cases
- it is also useful to find out which configurations are already set in
- which way and where. For this, the command :gitcmd:`config --list --show-origin`
- is useful. It displays all configurations and their location:
- .. code-block:: console
- $ git config --list --show-origin
- file:/home/bob/.gitconfig user.name=Bob McBobface
- file:/home/bob/.gitconfig user.email=bob@mcbobface.com
- file:.git/config annex.uuid=1f83595e-bcba-4226-aa2c-6f0153eb3c54
- file:.git/config annex.backends=MD5E
- file:.git/config submodule.recordings/longnow.url=https://github.com/✂
- file:.git/config submodule.recordings/longnow.active=true
- file:.git/config remote.roommate.url=../mock_user/onemoredir/DataLad-101
- file:.git/config remote.roommate.annex-uuid=a5ae24de-1533-4b09-98b9-cd9ba6bf303c
- file:.git/config submodule.longnow.url=https://github.com/✂
- file:.git/config submodule.longnow.active=true
- ...
- This example shows some configurations in the global ``.gitconfig``
- file, and the configurations within ``DataLad-101/.git/config``.
- The command is very handy to display all configurations at once to identify
- configuration problems, find the right configuration file to make a change to,
- or simply remind oneself of the existing configurations, and it is a useful
- helper to keep in the back of your head.
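- When the full listing gets long, the same flags can be narrowed down. The following sketch, again in a throwaway sandbox with a redirected ``HOME`` (both assumptions for illustration), asks for the origin of single variables and restricts the listing to the repository scope.

```shell
#!/bin/sh
# Sketch: find out which file a particular setting comes from.
# The sandbox and the redirected HOME are hypothetical and protect your real configuration.
set -eu
sandbox=$(mktemp -d)
export HOME="$sandbox/home"
export XDG_CONFIG_HOME="$HOME/.config"
export GIT_CONFIG_NOSYSTEM=1
mkdir -p "$HOME"
git init --quiet "$sandbox/repo"
cd "$sandbox/repo"

git config --global user.name "Bob McBobface"
git config --local core.editor nano

# Origin of a single variable instead of the whole list
name_origin=$(git config --show-origin --get user.name)
editor_origin=$(git config --show-origin --get core.editor)
printf '%s\n%s\n' "$name_origin" "$editor_origin"

# Listing limited to the repository scope: global entries do not appear
local_only=$(git config --local --list --show-origin)
printf '%s\n' "$local_only"
```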
- At this point you may feel like many of these configurations or the configuration file
- inside of ``DataLad-101`` do not appear to be
- intuitively understandable enough to confidently apply changes to them,
- or identify necessary changes. And indeed, most of the sections and variables
- or values in there are irrelevant for understanding the book, your dataset,
- or DataLad, and can just be left as they are. The previous section merely served
- to de-mystify the :gitcmd:`config` command and the configuration files.
- Nevertheless, it might be helpful to get an overview of the meaning of the
- remaining sections in that file, and the :ref:`Find-out-more that dissects this config file further <fom_gitconfig>` can give you a glimpse of this.
- .. index:: dataset configuration
- .. find-out-more:: Dissecting a Git config file further
- :name: fom_gitconfig
- :float:
- Let's walk through the Git config file of ``DataLad-101``:
- As mentioned above, git-annex will use the
- :term:`Git config file` for some of its configurations, such as the second section.
- It lists the repository version and :term:`annex UUID` [#f4]_ (:gitannexcmd:`whereis` displays information about where the
- annexed content is with these UUIDs).
- You may recognize the fourth part of the configuration, the subsection
- ``"recordings/longnow"`` in the section ``submodule``.
- Clearly, this is a reference to the ``longnow`` podcasts
- we cloned as a subdataset. The name *submodule* is Git
- terminology, and describes a Git repository inside of
- another Git repository -- just like
- the super- and subdataset principles you discovered in the
- section :ref:`nesting`. When you clone a DataLad dataset
- as a subdataset, it gets *registered* in this file.
- For each subdataset, an individual submodule entry
- will store the information about the subdataset's
- ``--source`` or *origin* (the "url").
- Thus, every subdataset in your dataset
- will be listed in this file.
- If you want, go back to section :ref:`installds` to see that the
- "url" is the same URL we cloned the longnow dataset from, and
- go back to section :ref:`sharelocal1` to remind yourself of
- what cloning a dataset with subdatasets looked and felt like.
- Another interesting part is the last section, "remote".
- Here we can find the :term:`sibling` "roommate" we defined
- in :ref:`sibling`. The term :term:`remote` is Git-terminology and is
- used to describe other repositories or DataLad datasets that the
- repository knows about.
- This file, therefore, is where DataLad *registered* the sibling
- with :dlcmd:`siblings add`, and thanks to it you can
- collaborate with your roommate.
- The value to the ``url`` variable is a *path*. If at any point
- either your superdataset or the remote moves on your file system,
- the association between the two datasets breaks -- this can be fixed by adjusting this
- path, and a demonstration of this is in section :ref:`file system`.
- ``fetch`` contains a specification of which parts of the repository are
- updated -- in this case everything (all of the branches).
- Lastly, the ``annex-ignore = false`` configuration allows git-annex
- to query the remote when it tries to retrieve annexed file content.
- .. index::
- pair: configuration; DataLad command
- pair: set configuration; with DataLad
- The ``datalad configuration`` command
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Although this section put a focus on the ``git config`` command, it is important to mention that there also is a :dlcmd:`configuration` command.
- It is not identical to ``git config``: while it lacks some features of ``git config``, such as the ability to set system-wide configuration, it has additional ones.
- Beyond the ``local`` and ``global`` scopes, it also supports :term:`branch` specific configurations in the ``.datalad/config`` file (further discussed in the next section), setting configurations recursively through dataset hierarchies, and multi-configuration queries (such as ``datalad configuration get user.name user.email``).
- By default, ``datalad configuration`` will ``dump`` (list) the effective configuration including relevant ``DATALAD_*`` :term:`environment variable`\s, and also annotate the purpose of many common configuration items.
- The subcommands ``datalad configuration get`` or ``datalad configuration set`` perform queries or set configurations.
- You can find out more information on this command in the command documentation.
- ``.git/config`` versus other (configuration) files
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- One crucial aspect distinguishes the ``.git/config`` file from many other files
- in your dataset: Even though it is part of your dataset, it won't be shared together
- with the dataset. The reason for this is that this file is not version
- controlled, as it lies within the ``.git`` directory.
- Repository-specific configurations within your ``.git/config``
- file are thus not written to history. Any local configuration in ``.git/config``
- applies to the dataset, but it does not *stick* to the dataset.
- One can have the misconception that because the configurations were made *in*
- the dataset, these configurations will also be shared together with the dataset.
- ``.git/config``, however, behaves just like your global or system-wide configurations.
- These configurations are in effect on a system, or for a user, or for a dataset,
- but are not shared.
- A :dlcmd:`clone` command of someone's dataset will not get you their
- editor configuration, should they have included one in their config file.
- Instead, upon a :dlcmd:`clone`, a new config file will be created.
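- This behavior can be checked with a short sketch. It uses plain ``git clone`` as a stand-in for :dlcmd:`clone` and works on two throwaway repositories in a sandbox directory (all paths are hypothetical).

```shell
#!/bin/sh
# Sketch: a local configuration does not travel with a clone.
# Uses plain "git clone" as a stand-in for "datalad clone"; all paths are throwaway.
set -eu
sandbox=$(mktemp -d)
export HOME="$sandbox/home"
export XDG_CONFIG_HOME="$HOME/.config"
export GIT_CONFIG_NOSYSTEM=1
mkdir -p "$HOME"

git init --quiet "$sandbox/original"
git -C "$sandbox/original" config --local core.editor nano

# Cloning an empty repository only triggers a warning, which is silenced here
git clone --quiet "$sandbox/original" "$sandbox/clone" 2>/dev/null

original_editor=$(git -C "$sandbox/original" config --local --get core.editor)
# --get exits non-zero for a missing key, so fall back to a marker string
clone_editor=$(git -C "$sandbox/clone" config --local --get core.editor || echo "unset")
echo "original: $original_editor, clone: $clone_editor"
```

- The clone starts with its own freshly written ``.git/config``, so the editor setting of the original is reported as ``unset`` there.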
- This means, however, that configurations that should "stick" to a dataset [#f5]_
- need to be defined in different files -- files that are version controlled.
- The next section will talk about them.
- .. rubric:: Footnotes
- .. [#f1] As an alternative to a ``git config`` command, you could also run configuration
- templates or procedures that apply predefined configurations or in some cases even
- add the information to the configuration file by hand and save it using an editor of your choice. See :ref:`procedures` for more info.
- .. [#f2] The third scope of Git configuration consists of the system-wide configurations.
- These are stored (if they exist) in ``/etc/gitconfig`` and contain settings that would
- apply to every user on the computer you are using. These configurations
- are not relevant for DataLad-101, and we will thus skip them. You can
- read more about Git's configurations and different files
- `here <https://git-scm.com/docs/git-config>`_.
- .. [#f3] If your default editor is :term:`vim` and you do not like this, now can be the time
- to change it! Choose either of two options:
- 1) Open up the file with an editor of your choice (e.g., `nano <https://www.howtogeek.com/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor>`_), and either paste the following configuration or edit it if it already exists:
- .. code-block:: ini
- [core]
- editor = nano
- 2) Run the following command, but exchange ``nano`` with an editor of your choice:
- .. code-block:: console
- $ git config --global --add core.editor "nano"
- .. [#f4] A UUID is a universally unique identifier -- a 128-bit number
- that unambiguously identifies information.
- .. [#f5] Please note that not all configurations can be written to files other than ``.git/config``.
- Some of the files introduced in the next section will not be queried by Git, and in principle, it is a good thing that one cannot share arbitrary configurations together with a dataset, as this could be a potential security threat.
- In those cases where you need dataset clones to inherit certain non-sticky configurations, it is advised to write a custom procedure and distribute it together with the dataset.
- The next two sections contain concrete use cases and tutorials.