- .. _share_hostingservice:
- Publishing datasets to Git repository hosting
- ---------------------------------------------
- Because DataLad datasets are :term:`Git` repositories, it is possible to
- :dlcmd:`push` datasets to any Git repository hosting service, such as
- :term:`GitHub`, :term:`GitLab`, :term:`GIN`, :term:`Bitbucket`, `Gogs <https://gogs.io>`_, or Gitea_.
- These published datasets are ordinary :term:`sibling`\s of your dataset. Among other advantages, they can serve as a backup, as an entry point from which you or others can retrieve the dataset, as the backbone for collaboration, or as a means to enhance the visibility, findability, and citeability of your work [#f1]_.
- This section gives a brief overview of how to publish your dataset to different services.
- Git repository hosting and annexed data
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- As outlined in several earlier sections, Git repository hosting sites typically do not support dataset annexes, although some, such as :term:`GIN`, do.
- Depending on whether or not an annex is supported, you can push either only your Git history to the sibling, or the complete dataset including annexed file contents.
- You can find out whether a sibling on a remote hosting service carries an annex by running the :dlcmd:`siblings` command.
- A ``+``, ``-``, or ``?`` sign in parentheses indicates that the sibling carries an annex, does not carry an annex, or that this information is not yet known, respectively.
- In the example below you can see that the public GitHub repository `github.com/psychoinformatics-de/studyforrest-data-phase2 <https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ does not carry an annex on GitHub (the sibling ``origin``), but that the annexed data are served from an additional sibling ``mddatasrc`` (a :term:`special remote` with annex support).
- Even though the dataset sibling on GitHub does not serve the data, it constitutes a simple, findable access point to retrieve the dataset, and can be used to provide updates and fixes via :term:`pull request`\s, issues, etc.
- .. code-block:: console
- $ # a clone of github.com/psychoinformatics-de/studyforrest-data-phase2 has the following siblings:
- $ datalad siblings
- .: here(+) [git]
- .: mddatasrc(+) [https://datapub.fz-juelich.de/studyforrest/studyforrest/phase2/.git (git)]
- .: origin(-) [git@github.com:psychoinformatics-de/studyforrest-data-phase2.git (git)]
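The annex marker can also be read by a script. The following is a minimal sketch and not part of DataLad: ``annex_status`` is a hypothetical helper, and it assumes the line format shown in the output above (``.: <name>(<sign>) [...]``), which may differ between DataLad versions.

```shell
#!/bin/sh
# Hypothetical helper (not a DataLad command): report the annex status of
# every sibling listed in `datalad siblings` output read from stdin.
# Assumes lines of the form ".: <name>(<sign>) [...]" as shown above.
annex_status() {
    sed -n 's/^\.: \([^(]*\)(\([-+?]\)).*/\1 \2/p' |
    while read -r name sign; do
        case "$sign" in
            +) echo "$name: carries an annex" ;;
            -) echo "$name: does not carry an annex" ;;
            \?) echo "$name: annex status not yet known" ;;
        esac
    done
}

# Usage inside a dataset: datalad siblings | annex_status
```

Lines that do not match the assumed format are skipped silently, so an empty result may simply mean that the output format has changed.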
- There are multiple ways to create a dataset sibling on a repository hosting site to push your dataset to.
- How to add a sibling on a Git repository hosting site: The manual way
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- #. Create a new repository via the web interface of the hosting service of your choice. The screenshots in :numref:`fig-newrepogin` and :numref:`fig-newrepogithub` show examples of this.
- The new repository does not need to have the same name as your local dataset, but it helps to associate local dataset and remote siblings.
- #. Afterwards, copy the :term:`SSH` or :term:`HTTPS` URL of the repository. Usually, repository hosting services will provide you with a convenient way to copy it to your clipboard. An SSH URL takes the form ``git@<hosting-service>:/<user>/<repo-name>.git`` and an HTTPS URL takes the form ``https://<hosting-service>/<user>/<repo-name>.git``. The type of URL you choose determines whether and how you will be able to ``push`` to your repository. Note that many services will require you to use the SSH URL to your repository in order to do :dlcmd:`push` operations, so make sure to take the :term:`SSH` and not the :term:`HTTPS` URL if this is the case.
- #. If you pick the :term:`SSH` URL, make sure to have an :term:`SSH key` set up. This usually requires generating an SSH key pair if you do not have one yet, and uploading the public key to the repository hosting service. The :find-out-more:`on SSH keys <fom-sshkey>` points to a useful tutorial for this.
- #. Use the URL to add the repository as a sibling. There are two commands that allow you to do that; both require you to give the sibling a name of your choice (common choices are ``upstream``, or a shorthand for your user name or the hosting platform, but it's completely up to you to decide):
- #. ``git remote add <name> <url>``
- #. ``datalad siblings add --dataset . --name <name> --url <url>``
- #. Push your dataset to the new sibling: ``datalad push --to <name>``
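The steps above can be sketched in the shell as follows. This is an illustration under stated assumptions: ``ssh_url`` and ``https_url`` are hypothetical helpers that only compose the two URL forms from step 2, and the host, user name, repository name, and sibling name are placeholders. The ``datalad`` calls are left as comments because they need an existing dataset and an existing remote repository.

```shell
#!/bin/sh
# Hypothetical helpers: compose the two URL forms given in step 2.
# In practice, copy the URL from the hosting service instead; some services
# display a slightly different SSH form (e.g., without the slash after ":").
ssh_url()   { printf 'git@%s:/%s/%s.git\n' "$1" "$2" "$3"; }
https_url() { printf 'https://%s/%s/%s.git\n' "$1" "$2" "$3"; }

# Placeholder host, user, and repository name
url=$(ssh_url gin.g-node.org me my-dataset)
echo "$url"

# Steps 4 and 5, run inside the dataset once the repository exists
# (the sibling name "gin" is an arbitrary choice):
# datalad siblings add --dataset . --name gin --url "$url"
# datalad push --to gin
```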
- .. _fig-newrepogin:
- .. figure:: ../artwork/src/GIN_newrepo.png
- :width: 80%
- Web interface of :term:`GIN` during the creation of a new repository.
- .. _fig-newrepogithub:
- .. figure:: ../artwork/src/newrepo-github.png
- :width: 80%
- Web interface of :term:`GitHub` during the creation of a new repository.
- .. index:: concepts; SSH key, SSH; key
- .. _sshkey:
- .. find-out-more:: What is an SSH key and how can I create one?
- :name: fom-sshkey
- An SSH key is an access credential in the :term:`SSH` protocol that can be used
- to login from one system to remote servers and services, such as from your private
- computer to an :term:`SSH server`. For repository hosting services such as :term:`GIN`,
- :term:`GitHub`, or :term:`GitLab`, it can be used to connect and authenticate
- without supplying your username or password for each action.
- A tutorial by GitHub at `docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh <https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent>`_
- has a detailed step-by-step instruction to generate and use SSH keys for authentication.
- You will also learn how to add your public SSH key to your hosting service account
- so that you can install or clone datasets or Git repositories via ``SSH`` (in addition
- to the ``http`` protocol).
- Don't be intimidated if you have never done this before -- it is fast and easy:
- First, you need to create a private and a public key (an SSH key pair).
- All this takes is a single command in the terminal. The resulting files are
- text files that look like someone spilled alphabet soup in them, but constitute
- a secure password procedure.
- You keep the private key on your own machine (the system you are connecting from,
- and that **only you have access to**),
- and copy the public key to the system or service you are connecting to.
- On the remote system or service, you make the public key an *authorized key* to
- allow authentication via the SSH key pair instead of your password. This
- either takes a single command in the terminal, or a few clicks in a web interface
- to achieve.
- You should protect your SSH keys on your machine with a passphrase to prevent
- others -- e.g., in case of theft -- from logging in to servers or services with
- SSH authentication [#f2]_, and configure an ``ssh agent``
- to handle this passphrase for you with a single command. How to do all of this
- is detailed in the tutorial.
- How to add a sibling on a Git repository hosting site: The automated way
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- DataLad provides ``create-sibling-*`` commands that create sibling repositories on certain hosting sites directly from the command line.
- They are available for :term:`GitHub`, :term:`GitLab`, :term:`GIN`, `Gogs <https://gogs.io>`__, and Gitea_: :dlcmd:`create-sibling-github`, :dlcmd:`create-sibling-gitlab`, :dlcmd:`create-sibling-gin`, :dlcmd:`create-sibling-gogs`, and :dlcmd:`create-sibling-gitea`.
- Each command is slightly tuned towards the peculiarities of each particular platform, but the most important common parameters are streamlined across commands as follows:
- - ``[REPONAME]`` (required): The name of the repository on the hosting site. It will be created under a user's namespace, unless this argument includes an organization name prefix. For example, ``datalad create-sibling-github my-awesome-repo`` will create a new repository under ``github.com/<user>/my-awesome-repo``, while ``datalad create-sibling-github <orgname>/my-awesome-repo`` will create a new repository of this name under the GitHub organization ``<orgname>`` (given appropriate permissions).
- - ``-s/--name <name>`` (optional): The name under which the sibling is identified locally. By default, it is derived from the hosting site. For example, the sibling created with ``datalad create-sibling-github`` will be called ``github`` by default.
- - ``--credential <name>`` (optional): Credentials used for authentication are stored internally by DataLad under specific names. These names allow you to have multiple credentials, and flexibly decide which one to use. When ``--credential <name>`` is the name of an existing credential, DataLad tries to authenticate with the specified credential; when it does not yet exist, DataLad will prompt interactively for a credential, such as an access token, and store it under the given ``<name>`` for future authentications. By default, DataLad will name a credential according to the hosting service URL it is used for, such as ``datalad-api.github.com`` as the default for credentials used to authenticate against GitHub.
- - ``--access-protocol {https|ssh|https-ssh}`` (default ``https``): Whether to use :term:`SSH` or :term:`HTTPS` URLs, or a hybrid version in which HTTPS is used to *pull* and SSH is used to *push*. Using :term:`SSH` URLs requires an :term:`SSH key` setup, but is a very convenient authentication method, especially when pushing updates -- which would need manual input on user name and token with every ``push`` over HTTPS.
- - ``--dry-run`` (optional): With this flag set, the command will not actually create the target repository, but only perform tests for name collisions and report repository name(s).
- - ``--private`` (optional): A switch that, if set, makes sure that the created repository is private.
- Other streamlined arguments, such as ``--recursive`` or ``--publish-depends`` allow you to perform more complex configurations, such as publication of dataset hierarchies or connections to :term:`special remote`\s. Upcoming walk-throughs will demonstrate them.
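To make the three ``--access-protocol`` values concrete, here is a small sketch of which URL ends up being used for fetching and for pushing. It is an illustration only, not DataLad's implementation: ``sibling_urls`` is a hypothetical function, the URL shapes are assumed to follow the ``origin`` URL shown in the ``datalad siblings`` output earlier in this section, and the host, user, and repository names are placeholders.

```shell
#!/bin/sh
# Illustration only, not DataLad's implementation: print which URL a sibling
# would use for fetching and for pushing under each --access-protocol value.
# Arguments: protocol, host, user-or-organization, repository name.
sibling_urls() {
    https="https://$2/$3/$4.git"
    ssh="git@$2:$3/$4.git"
    case "$1" in
        https)     echo "fetch=$https push=$https" ;;
        ssh)       echo "fetch=$ssh push=$ssh" ;;
        https-ssh) echo "fetch=$https push=$ssh" ;;
        *)         echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}

sibling_urls https-ssh github.com my-user my-awesome-repo
```

In plain Git, a hybrid setup of this kind corresponds to a remote with a separate push URL, which can be set with ``git remote set-url --push <name> <url>``.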
- Self-hosted repository services, e.g., Gogs or Gitea instances, have an additional required argument, the ``--api`` flag.
- It needs to point to the URL of the instance, for example
- .. code-block:: console
- $ datalad create-sibling-gogs my_repo_on_gogs --api "https://try.gogs.io"
- :term:`GitLab`'s internal organization differs from that of the other hosting services, and as there are multiple different GitLab instances, ``create-sibling-gitlab`` requires slightly more configuration than the other commands.
- Thus, a short walk-through is at the :ref:`end of this section <gitlab>`.
- .. _token:
- Authentication by token
- ^^^^^^^^^^^^^^^^^^^^^^^
- To create or update repositories on remote hosting services you will need to set up appropriate authentication and permissions.
- In most cases, this will be in the form of an authorization token with a specific permission scope.
- What is a token?
- """"""""""""""""
- Personal access tokens are an alternative to authenticating via your password, and take the form of a long character string, associated with a human-readable name or description.
- If you are prompted for ``username`` and ``password`` in the command line, you would enter your token in place of the ``password`` [#f3]_.
- Note that you do not have to type your token at every authentication -- your token will be stored on your system the first time you have used it and automatically reused whenever relevant.
- .. index:: credential; storage
- .. find-out-more:: How does the authentication storage work?
- Passwords, user names, tokens, or any other login information is stored in
- your system's (encrypted) `keyring <https://en.wikipedia.org/wiki/GNOME_Keyring>`_.
- It is a built-in credential store, used in all major operating systems, and
- can store credentials securely.
- You can have multiple tokens, and each of them can get a different scope of permissions, but it is important to treat your tokens like passwords and keep them secret.
- Which permissions do they need?
- """""""""""""""""""""""""""""""
- The most convenient way to generate tokens is typically via the web interface of the hosting service of your choice.
- Often, you can specifically select which set of permissions a specific token has in a drop-down menu similar (but likely not identical) to the screenshot from GitHub in :numref:`fig-token`.
- .. _fig-token:
- .. figure:: ../artwork/src/github-token.png
- :width: 80%
- Web interface to generate an authentication token on GitHub. One typically has to set a name and
- permission set, and potentially an expiration date.
- For creating and updating repositories with DataLad commands it is usually sufficient to grant only repository-related permissions.
- However, broader permission sets may also make sense.
- Should you employ GitHub workflows, for example, a token without "workflow" scope could not push changes to workflow files, resulting in errors like this one:
- .. code-block:: console
- [remote rejected] (refusing to allow a Personal Access Token to create or update workflow `.github/workflows/benchmarks.yml` without `workflow` scope)]
- .. _gitlab:
- Creating a sibling on GitLab
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :term:`GitLab` is an open source Git repository hosting platform, and many institutions and companies deploy their own instance.
- This short walk-through demonstrates the necessary steps to create a GitLab sibling, and the different options GitLab allows for when creating siblings recursively for a dataset hierarchy.
- Step 1: Configure your site
- """""""""""""""""""""""""""
- As a first step, users will need to create a configuration file following the format of `python-gitlab <https://python-gitlab.readthedocs.io/en/stable/cli-usage.html#configuration-file-format>`_.
- This configuration file is typically called ``.python-gitlab.cfg`` and placed into a user's home directory.
- It contains one section per GitLab instance, and a ``[global]`` section that defines the default instance to use.
- Here is an example:
- .. code-block:: console
- $ cat ~/.python-gitlab.cfg
- [global]
- default = my-university-gitlab
- ssl_verify = true
- timeout = 5
- [my-university-gitlab]
- url = https://gitlab.my-university.com
- private_token = <here-is-your-token>
- api_version = 4
- [gitlab-general]
- url = https://gitlab.com
- api_version = 4
- private_token = <here-is-your-token>
- Once this configuration is in place, ``create-sibling-gitlab``'s ``--site`` parameter can be supplied with the name of the instance you want to use (e.g., ``datalad create-sibling-gitlab --site gitlab-general``).
- Ensure that the token for each instance has appropriate permissions to create new groups and projects under your user account using the GitLab API. :numref:`fig-gitlabtoken` shows the web interface for generating such a token.
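To check which names are available for ``--site``, the section headers of the configuration file can be listed. The following is a minimal sketch: ``gitlab_sites`` is a hypothetical helper, not part of DataLad or python-gitlab, and it assumes the plain ``[section]`` header format of the example above.

```shell
#!/bin/sh
# Hypothetical helper: list the instance names defined in a python-gitlab
# configuration file, i.e., all section headers other than [global].
gitlab_sites() {
    sed -n 's/^\[\(.*\)]$/\1/p' "${1:-$HOME/.python-gitlab.cfg}" | grep -v '^global$'
}

# Usage: gitlab_sites            (reads ~/.python-gitlab.cfg)
#        gitlab_sites other.cfg  (reads the given file)
```

Because the file holds access tokens, restricting its permissions (for example with ``chmod 600 ~/.python-gitlab.cfg``) is a sensible precaution.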
- .. _fig-gitlabtoken:
- .. figure:: ../artwork/src/gitlab-token.png
- :width: 80%
- Web interface to generate an authentication token on GitLab. One typically has to set a name and
- permission set, and potentially an expiration date.
- Step 2: Create or select a group
- """"""""""""""""""""""""""""""""
- GitLab's organization consists of *projects* and *groups*.
- Projects are single repositories, and groups can be used to manage one or more projects at the same time.
- In order to use ``create-sibling-gitlab``, a user **must** `create a group <https://docs.gitlab.com/ee/user/group/#create-a-group>`_ via the web interface, or specify a pre-existing group, because `GitLab does not allow root-level groups to be created via their API <https://docs.gitlab.com/ee/api/groups.html#new-group>`_.
- Only when a "parent" group already exists can DataLad and other tools create sub-groups and projects automatically.
- In the screenshots :numref:`fig-rootgroup-gitlab1` and :numref:`fig-rootgroup-gitlab2`, a new group ``my-datalad-root-level-group`` is created right underneath the user account.
- The group name as shown in the URL bar is what DataLad needs in order to create sibling datasets.
- .. _fig-rootgroup-gitlab1:
- .. figure:: ../artwork/src/gitlab-rootgroup.png
- :width: 80%
- Web interface to create a root-level group on GitLab.
- .. _fig-rootgroup-gitlab2:
- .. figure:: ../artwork/src/gitlab-rootgroup2.png
- :width: 80%
- A created root-level group in GitLab's web interface.
- Step 3: Select a layout
- """""""""""""""""""""""
- Due to the distinction between groups and projects, GitLab allows two different layouts that DataLad can use to publish datasets or dataset hierarchies:
- * **flat**:
- All datasets become projects in the same, pre-existing group.
- The name of a project is its relative path within the root dataset, with all path separator characters replaced by '-' [#f4]_.
- * **collection**:
- A new group is created for the dataset. The root dataset (the topmost superdataset) is placed in a "project" project inside this group, and all nested subdatasets are represented inside the group using a "flat" layout [#f4]_. This layout is the default.
- Consider the ``DataLad-101`` dataset, a superdataset with several subdatasets in the following layout:
- .. code-block:: bash
- /home/me/dl-101/DataLad-101 # dataset
- ├── books/
- │ └── [...]
- ├── code/
- │ └── [...]
- ├── midterm_project/ # subdataset
- │ ├── code/
- │ └── [...]
- │ └── input/ # sub-subdataset
- ├── recordings/
- │ └── longnow/ # subdataset
- │ ├── [...]
- How the ``collection`` and ``flat`` layouts for this dataset look in practice is shown in :numref:`fig-gitlab-layout`.
- .. _fig-gitlab-layout:
- .. figure:: ../artwork/src/gitlab-layouts.png
- :width: 50%
- The ``collection`` layout has a group (``DataLad-101_collection``, defined by the user with a configuration) with four projects underneath. The ``project`` project contains the root-level dataset, and all contained subdatasets are named according to their location in the dataset. The ``flat`` layout consists of projects in the root-level group. The project name for the superdataset (``DataLad-101_flat``) is defined by the user with a configuration, and the names of the subdatasets extend this project name based on their location in the dataset hierarchy.
- Publishing a single dataset
- """""""""""""""""""""""""""
- When publishing a single dataset, users can configure the project or group name as a command argument ``--project``.
- Here are two command examples and their outcomes.
- For a **flat** layout, the ``--project`` parameter determines the project name, shown in :numref:`fig-gitlab-flat`.
- .. code-block:: console
- $ datalad create-sibling-gitlab --site gitlab-general --layout flat --project my-datalad-root-level-group/this-will-be-the-project-name
- create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-project-name]
- configure-sibling(ok): . (sibling)
- action summary:
- configure-sibling (ok: 1)
- create_sibling_gitlab (ok: 1)
- .. _fig-gitlab-flat:
- .. figure:: ../artwork/src/gitlab-layout-flat.png
- :width: 50%
- An example dataset using GitLab's "flat" layout.
- For a **collection** layout, the ``--project`` parameter determines the group name, shown in :numref:`fig-gitlab-collection`.
- .. code-block:: console
- $ datalad create-sibling-gitlab --site gitlab-general --layout collection --project my-datalad-root-level-group/this-will-be-the-group-name
- create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-group-name/project]
- configure-sibling(ok): . (sibling)
- action summary:
- configure-sibling (ok: 1)
- create_sibling_gitlab (ok: 1)
- .. _fig-gitlab-collection:
- .. figure:: ../artwork/src/gitlab-layout-collection.png
- :width: 50%
- An example dataset using GitLab's "collection" layout.
- Publishing datasets recursively
- """""""""""""""""""""""""""""""
- When publishing a series of datasets recursively, the ``--project`` argument can no longer be used; otherwise, all datasets in the hierarchy would attempt to create the same group or project over and over again.
- Instead, one configures the root-level dataset, and the names for underlying datasets are derived from this configuration:
- .. index::
- single: configuration item; datalad.gitlab-<name>-project
- .. code-block:: console
- $ # do the configuration for the top-most dataset
- $ # either configure with Git
- $ git config --local --replace-all \
- datalad.gitlab-<gitlab-site>-project \
- 'my-datalad-root-level-group/DataLad-101_flat'
- $ # or configure with DataLad
- $ datalad configuration set \
- datalad.gitlab-<gitlab-site>-project='my-datalad-root-level-group/DataLad-101_flat'
- Afterwards, publish dataset hierarchies with the ``--recursive`` flag:
- .. code-block:: console
- $ datalad create-sibling-gitlab --site gitlab-general --recursive --layout flat
- create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat]
- configure-sibling(ok): . (sibling)
- create_sibling_gitlab(ok): midterm_project (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project]
- configure-sibling(ok): . (sibling)
- create_sibling_gitlab(ok): midterm_project/input (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project-input]
- configure-sibling(ok): . (sibling)
- create_sibling_gitlab(ok): recordings/longnow (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-recordings-longnow]
- configure-sibling(ok): . (sibling)
- action summary:
- configure-sibling (ok: 4)
- create_sibling_gitlab (ok: 4)
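The project names in the output above follow the rule described for the ``flat`` layout: the configured root project name, extended by each subdataset's relative path with path separators replaced. The sketch below reproduces that rule for illustration; ``flat_project_name`` is a hypothetical function, not DataLad's implementation, and it assumes the default separator ``-``.

```shell
#!/bin/sh
# Illustration only, not DataLad's implementation: derive a project name in
# the "flat" layout from the configured root project and a subdataset's path
# relative to the root dataset. The separator defaults to "-".
flat_project_name() {
    root=$1 relpath=$2 sep=${3:--}
    if [ -z "$relpath" ] || [ "$relpath" = "." ]; then
        # The root dataset keeps the configured project name
        echo "$root"
    else
        echo "$root$sep$(printf '%s' "$relpath" | sed "s|/|$sep|g")"
    fi
}

flat_project_name my-datalad-root-level-group/DataLad-101_flat midterm_project/input
```

For the sub-subdataset ``midterm_project/input`` this prints ``my-datalad-root-level-group/DataLad-101_flat-midterm_project-input``, the same path that appears in the URL reported by ``create-sibling-gitlab`` above.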
- Final step: Pushing to GitLab
- """""""""""""""""""""""""""""
- Once you have set up your dataset sibling(s), you can push individual datasets with ``datalad push --to gitlab`` or push recursively across a hierarchy by adding the ``--recursive`` flag to the push command.
- .. _gitea: https://about.gitea.com
- .. rubric:: Footnotes
- .. [#f1] Many repository hosting services have useful features to make your work citeable.
- For example, :term:`gin` is able to assign a :term:`DOI` to your dataset, and GitHub allows ``CITATION.cff`` files. At the same time, archival services such as `Zenodo <https://zenodo.org>`_ often integrate with published repositories, allowing you to preserve your dataset with them.
- .. [#f2] Your private SSH key is incredibly valuable, and it is important to keep
- it secret!
- Anyone who gets your private key has access to anything that the public key
- is protecting. If the private key does not have a passphrase, simply copying
- this file grants a person access!
- .. [#f3] GitHub `deprecated user-password authentication <https://developer.github.com/changes/2020-02-14-deprecating-password-auth>`_ in favor of authentication via personal access token. Supplying a password instead of a token will fail to authenticate.
- .. index::
- single: configuration item; datalad.gitlab-default-projectname
- single: configuration item; datalad.gitlab-default-pathseparator
- .. [#f4] The default project name ``project`` and path separator ``-`` are configurable using the dataset-level configurations ``datalad.gitlab-default-projectname`` and ``datalad.gitlab-default-pathseparator``.