
  1. .. _share_hostingservice:
  2. Publishing datasets to Git repository hosting
  3. ---------------------------------------------
  4. Because DataLad datasets are :term:`Git` repositories, it is possible to
  5. :dlcmd:`push` datasets to any Git repository hosting service, such as
  6. :term:`GitHub`, :term:`GitLab`, :term:`GIN`, :term:`Bitbucket`, `Gogs <https://gogs.io>`_, or Gitea_.
These published datasets are ordinary :term:`sibling`\s of your dataset. Among other advantages, they can serve as a backup, an entry point from which you or others can retrieve the dataset, the backbone for collaboration on datasets, or a means to enhance the visibility, findability, and citeability of your work [#f1]_.
This section contains a brief overview of how to publish your dataset to different services.
  9. Git repository hosting and annexed data
  10. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As outlined in a number of sections before, Git repository hosting sites typically do not support dataset annexes; some, such as :term:`GIN`, do.
  12. Depending on whether or not an annex is supported, you can push either only your Git history to the sibling, or the complete dataset including annexed file contents.
You can find out whether a sibling on a remote hosting service carries an annex by running the :dlcmd:`siblings` command.
A ``+``, ``-``, or ``?`` sign in parentheses indicates whether the sibling carries an annex, does not carry an annex, or whether this information is not yet known.
  15. In the example below you can see that the public GitHub repository `github.com/psychoinformatics-de/studyforrest-data-phase2 <https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ does not carry an annex on GitHub (the sibling ``origin``), but that the annexed data are served from an additional sibling ``mddatasrc`` (a :term:`special remote` with annex support).
  16. Even though the dataset sibling on GitHub does not serve the data, it constitutes a simple, findable access point to retrieve the dataset, and can be used to provide updates and fixes via :term:`pull request`\s, issues, etc.
.. code-block:: console

   $ # a clone of github.com/psychoinformatics-de/studyforrest-data-phase2 has the following siblings:
   $ datalad siblings
   .: here(+) [git]
   .: mddatasrc(+) [https://datapub.fz-juelich.de/studyforrest/studyforrest/phase2/.git (git)]
   .: origin(-) [git@github.com:psychoinformatics-de/studyforrest-data-phase2.git (git)]
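If you want to act on this information in a script, the annex marker can be extracted from the report. The following is a minimal sketch, assuming the plain-text line format shown in the example output (``.: <name>(<marker>) [...]``); the function name ``annex_status`` is made up for illustration, and the sample lines are copied from the output rather than produced by a live ``datalad siblings`` call.

```shell
# Translate the annex marker of each sibling into words.
# Reads `datalad siblings`-style lines on standard input.
annex_status() {
    # Keep only "<name> <marker>" from lines such as ".: origin(-) [...]".
    sed -n 's/^[^:]*: \([^( ]*\)(\([-+?]\)).*$/\1 \2/p' |
    while read -r name marker; do
        case "$marker" in
            +)  echo "$name: annex available" ;;
            -)  echo "$name: no annex" ;;
            \?) echo "$name: annex status unknown" ;;
        esac
    done
}

# Sample input copied from the example output; in a real dataset one
# would run `datalad siblings | annex_status` instead.
sample='.: here(+) [git]
.: mddatasrc(+) [https://datapub.fz-juelich.de/studyforrest/studyforrest/phase2/.git (git)]
.: origin(-) [git@github.com:psychoinformatics-de/studyforrest-data-phase2.git (git)]'

report=$(printf '%s\n' "$sample" | annex_status)
printf '%s\n' "$report"
```

For the sample input, this reports an annex for ``here`` and ``mddatasrc`` and none for ``origin``.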
  23. There are multiple ways to create a dataset sibling on a repository hosting site to push your dataset to.
  24. How to add a sibling on a Git repository hosting site: The manual way
  25. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  26. #. Create a new repository via the webinterface of the hosting service of your choice. The screenshots in :numref:`fig-newrepogin` and :numref:`fig-newrepogithub` show examples of this.
  27. The new repository does not need to have the same name as your local dataset, but it helps to associate local dataset and remote siblings.
#. Afterwards, copy the :term:`SSH` or :term:`HTTPS` URL of the repository. Usually, repository hosting services provide a convenient way to copy it to your clipboard. An SSH URL takes the form ``git@<hosting-service>:<user>/<repo-name>.git`` and an HTTPS URL takes the form ``https://<hosting-service>/<user>/<repo-name>.git``. The type of URL you choose determines whether and how you will be able to ``push`` to your repository. Note that many services require the SSH URL for :dlcmd:`push` operations, so make sure to take the :term:`SSH` and not the :term:`HTTPS` URL if this is the case.
  29. #. If you pick the :term:`SSH` URL, make sure to have an :term:`SSH key` set up. This usually requires generating an SSH key pair if you do not have one yet, and uploading the public key to the repository hosting service. The :find-out-more:`on SSH keys <fom-sshkey>` points to a useful tutorial for this.
#. Use the URL to add the repository as a sibling. Two commands allow you to do this; both require you to give the sibling a name of your choice (common choices are ``upstream``, or a shorthand for your user name or the hosting platform, but it is completely up to you to decide):
  31. #. ``git remote add <name> <url>``
  32. #. ``datalad siblings add --dataset . --name <name> --url <url>``
  33. #. Push your dataset to the new sibling: ``datalad push --to <name>``
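Put together, the manual steps could be scripted roughly as follows. This is a sketch under assumptions: the host, user, repository, and sibling names are placeholders, and the helper ``publish_manually`` is a hypothetical wrapper around the two DataLad commands from the list. It is only defined here, not called, because it needs an existing dataset and a repository that was already created in the web interface.

```shell
# Placeholder values (assumptions); replace them with your own.
host="gin.g-node.org"
user="me"
repo="DataLad-101"
sibling="gin"

# The two URL flavors a hosting service typically offers.
ssh_url="git@${host}:${user}/${repo}.git"
https_url="https://${host}/${user}/${repo}.git"

# Hypothetical helper: register the hosted repository as a sibling of the
# dataset in the current directory, then push to it.
#   $1: sibling name, $2: repository URL
publish_manually() {
    datalad siblings add --dataset . --name "$1" --url "$2" &&
    datalad push --to "$1"
}

echo "SSH URL:   $ssh_url"
echo "HTTPS URL: $https_url"
# Inside a dataset, and with an SSH key in place, one would then run:
#   publish_manually "$sibling" "$ssh_url"
```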
  34. .. _fig-newrepogin:
  35. .. figure:: ../artwork/src/GIN_newrepo.png
  36. :width: 80%
  37. Webinterface of :term:`GIN` during the creation of a new repository.
  38. .. _fig-newrepogithub:
  39. .. figure:: ../artwork/src/newrepo-github.png
  40. :width: 80%
  41. Webinterface of :term:`GitHub` during the creation of a new repository.
  42. .. index:: concepts; SSH key, SSH; key
  43. .. _sshkey:
  44. .. find-out-more:: What is an SSH key and how can I create one?
  45. :name: fom-sshkey
  46. An SSH key is an access credential in the :term:`SSH` protocol that can be used
  47. to login from one system to remote servers and services, such as from your private
  48. computer to an :term:`SSH server`. For repository hosting services such as :term:`GIN`,
  49. :term:`GitHub`, or :term:`GitLab`, it can be used to connect and authenticate
  50. without supplying your username or password for each action.
  51. A tutorial by GitHub at `docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh <https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent>`_
  52. has a detailed step-by-step instruction to generate and use SSH keys for authentication.
You will also learn how to add your public SSH key to your hosting service account
  54. so that you can install or clone datasets or Git repositories via ``SSH`` (in addition
  55. to the ``http`` protocol).
  56. Don't be intimidated if you have never done this before -- it is fast and easy:
  57. First, you need to create a private and a public key (an SSH key pair).
  58. All this takes is a single command in the terminal. The resulting files are
  59. text files that look like someone spilled alphabet soup in them, but constitute
  60. a secure password procedure.
  61. You keep the private key on your own machine (the system you are connecting from,
  62. and that **only you have access to**),
  63. and copy the public key to the system or service you are connecting to.
  64. On the remote system or service, you make the public key an *authorized key* to
  65. allow authentication via the SSH key pair instead of your password. This
  66. either takes a single command in the terminal, or a few clicks in a web interface
  67. to achieve.
You should protect your SSH keys on your machine with a passphrase to prevent
others -- e.g., in case of theft -- from logging in to servers or services with
  70. SSH authentication [#f2]_, and configure an ``ssh agent``
  71. to handle this passphrase for you with a single command. How to do all of this
  72. is detailed in the tutorial.
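As a rough orientation, the commands involved usually look like the sketch below. The key type (``ed25519``), the key location, and the e-mail comment are assumptions; follow the linked tutorial for authoritative instructions. The small ``run`` helper is made up for this example: it only prints each command unless ``DO_IT=1`` is set, so that the sketch can be read and tried without creating keys by accident.

```shell
# Print commands by default; set DO_IT=1 to actually execute them.
run() {
    if [ "${DO_IT:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Assumed key location; ssh-keygen will ask for a passphrase when executed.
keyfile="${HOME}/.ssh/id_ed25519"

# 1. Create the key pair (private key: $keyfile, public key: $keyfile.pub).
run ssh-keygen -t ed25519 -C "you@example.com" -f "$keyfile"

# 2. Hand the key to a running SSH agent so that the passphrase is only
#    requested once per session.
run ssh-add "$keyfile"

# 3. The public key is what gets uploaded to the hosting service.
echo "Upload ${keyfile}.pub (never the private key) to your account settings."
```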
  73. How to add a sibling on a Git repository hosting site: The automated way
  74. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  75. DataLad provides ``create-sibling-*`` commands to automatically create datasets on certain hosting sites.
  76. You can automatically create new repositories from the command line for :term:`GitHub`, :term:`GitLab`, :term:`GIN`, `Gogs <https://gogs.io>`__, or Gitea_.
  77. This is implemented with a set of commands called :dlcmd:`create-sibling-github`, :dlcmd:`create-sibling-gitlab`, :dlcmd:`create-sibling-gin`, :dlcmd:`create-sibling-gogs`, and :dlcmd:`create-sibling-gitea`.
  78. Each command is slightly tuned towards the peculiarities of each particular platform, but the most important common parameters are streamlined across commands as follows:
  79. - ``[REPONAME]`` (required): The name of the repository on the hosting site. It will be created under a user's namespace, unless this argument includes an organization name prefix. For example, ``datalad create-sibling-github my-awesome-repo`` will create a new repository under ``github.com/<user>/my-awesome-repo``, while ``datalad create-sibling-github <orgname>/my-awesome-repo`` will create a new repository of this name under the GitHub organization ``<orgname>`` (given appropriate permissions).
- ``-s/--name <name>`` (optional): A name under which the sibling is identified. By default, it will be based on or similar to the hosting site. For example, the sibling created with ``datalad create-sibling-github`` will be called ``github`` by default.
- ``--credential <name>`` (optional): Credentials used for authentication are stored internally by DataLad under specific names. These names allow you to have multiple credentials, and flexibly decide which one to use. When ``--credential <name>`` is the name of an existing credential, DataLad tries to authenticate with the specified credential; when it does not yet exist, DataLad will prompt interactively for a credential, such as an access token, and store it under the given ``<name>`` for future authentications. By default, DataLad will name a credential according to the hosting service URL it is used for, such as ``datalad-api.github.com`` as the default for credentials used to authenticate against GitHub.
  82. - ``--access-protocol {https|ssh|https-ssh}`` (default ``https``): Whether to use :term:`SSH` or :term:`HTTPS` URLs, or a hybrid version in which HTTPS is used to *pull* and SSH is used to *push*. Using :term:`SSH` URLs requires an :term:`SSH key` setup, but is a very convenient authentication method, especially when pushing updates -- which would need manual input on user name and token with every ``push`` over HTTPS.
  83. - ``--dry-run`` (optional): With this flag set, the command will not actually create the target repository, but only perform tests for name collisions and report repository name(s).
  84. - ``--private`` (optional): A switch that, if set, makes sure that the created repository is private.
  85. Other streamlined arguments, such as ``--recursive`` or ``--publish-depends`` allow you to perform more complex configurations, such as publication of dataset hierarchies or connections to :term:`special remote`\s. Upcoming walk-throughs will demonstrate them.
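The parameters above can be combined as in the following sketch, which first checks for name collisions and then creates a private repository that is pulled over HTTPS and pushed over SSH. The repository name is the placeholder used above, and ``run`` is a helper made up for this example that only prints the commands unless ``DO_IT=1`` is set, because a real call needs a dataset and valid credentials.

```shell
# Print commands by default; set DO_IT=1 to actually execute them.
run() {
    if [ "${DO_IT:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

reponame="my-awesome-repo"

# Step 1: only test for name collisions, do not create anything yet.
run datalad create-sibling-github "$reponame" --dry-run

# Step 2: create the repository as a private one, name the sibling "github",
# and use HTTPS for pulling but SSH for pushing.
run datalad create-sibling-github "$reponame" \
    --name github \
    --access-protocol https-ssh \
    --private
```

Running the dry run first is a cheap way to catch an already existing repository name before anything is created on the hosting site.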
  86. Self-hosted repository services, e.g., Gogs or Gitea instances, have an additional required argument, the ``--api`` flag.
  87. It needs to point to the URL of the instance, for example
  88. .. code-block:: console
  89. $ datalad create-sibling-gogs my_repo_on_gogs --api "https://try.gogs.io"
  90. :term:`GitLab`'s internal organization differs from that of the other hosting services, and as there are multiple different GitLab instances, ``create-sibling-gitlab`` requires slightly more configuration than the other commands.
  91. Thus, a short walk-through is at the :ref:`end of this section <gitlab>`.
  92. .. _token:
  93. Authentication by token
  94. ^^^^^^^^^^^^^^^^^^^^^^^
  95. To create or update repositories on remote hosting services you will need to set up appropriate authentication and permissions.
  96. In most cases, this will be in the form of an authorization token with a specific permission scope.
  97. What is a token?
  98. """"""""""""""""
  99. Personal access tokens are an alternative to authenticating via your password, and take the form of a long character string, associated with a human-readable name or description.
  100. If you are prompted for ``username`` and ``password`` in the command line, you would enter your token in place of the ``password`` [#f3]_.
  101. Note that you do not have to type your token at every authentication -- your token will be stored on your system the first time you have used it and automatically reused whenever relevant.
  102. .. index:: credential; storage
  103. .. find-out-more:: How does the authentication storage work?
  104. Passwords, user names, tokens, or any other login information is stored in
  105. your system's (encrypted) `keyring <https://en.wikipedia.org/wiki/GNOME_Keyring>`_.
  106. It is a built-in credential store, used in all major operating systems, and
  107. can store credentials securely.
  108. You can have multiple tokens, and each of them can get a different scope of permissions, but it is important to treat your tokens like passwords and keep them secret.
  109. Which permissions do they need?
  110. """""""""""""""""""""""""""""""
  111. The most convenient way to generate tokens is typically via the webinterface of the hosting service of your choice.
  112. Often, you can specifically select which set of permissions a specific token has in a drop-down menu similar (but likely not identical) to the screenshot from GitHub in :numref:`fig-token`.
  113. .. _fig-token:
  114. .. figure:: ../artwork/src/github-token.png
  115. :width: 80%
  116. Webinterface to generate an authentication token on GitHub. One typically has to set a name and
  117. permission set, and potentially an expiration date.
  118. For creating and updating repositories with DataLad commands it is usually sufficient to grant only repository-related permissions.
  119. However, broader permission sets may also make sense.
  120. Should you employ GitHub workflows, for example, a token without "workflow" scope could not push changes to workflow files, resulting in errors like this one:
  121. .. code-block:: console
  122. [remote rejected] (refusing to allow a Personal Access Token to create or update workflow `.github/workflows/benchmarks.yml` without `workflow` scope)]
  123. .. _gitlab:
  124. Creating a sibling on GitLab
  125. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  126. :term:`GitLab` is an open source Git repository hosting platform, and many institutions and companies deploy their own instance.
  127. This short walk-through demonstrates the necessary steps to create a GitLab sibling, and the different options GitLab allows for when creating siblings recursively for a dataset hierarchy.
  128. Step 1: Configure your site
  129. """""""""""""""""""""""""""
  130. As a first step, users will need to create a configuration file following the format of `python-gitlab <https://python-gitlab.readthedocs.io/en/stable/cli-usage.html#configuration-file-format>`_.
This configuration file is typically called ``.python-gitlab.cfg`` and placed into a user's home directory.
  132. It contains one section per GitLab instance, and a ``[global]`` section that defines the default instance to use.
  133. Here is an example:
.. code-block:: console

   $ cat ~/.python-gitlab.cfg
   [global]
   default = my-university-gitlab
   ssl_verify = true
   timeout = 5

   [my-university-gitlab]
   url = https://gitlab.my-university.com
   private_token = <here-is-your-token>
   api_version = 4

   [gitlab-general]
   url = https://gitlab.com
   api_version = 4
   private_token = <here-is-your-token>
  148. Once this configuration is in place, ``create-sibling-gitlab``'s ``--site`` parameter can be supplied with the name of the instance you want to use (e.g., ``datalad create-sibling-gitlab --site gitlab-general``).
Ensure that the token for each instance has appropriate permissions to create new groups and projects under your user account via the GitLab API (see :numref:`fig-gitlabtoken`).
  150. .. _fig-gitlabtoken:
  151. .. figure:: ../artwork/src/gitlab-token.png
  152. :width: 80%
  153. Webinterface to generate an authentication token on GitLab. One typically has to set a name and
  154. permission set, and potentially an expiration date.
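Because the configuration file contains your tokens in plain text, it can make sense to restrict who may read it. The following sketch writes the example configuration to a temporary stand-in file (an assumption, made so that a real ``~/.python-gitlab.cfg`` is not overwritten), restricts access to the owner, and lists the section names that could be passed to ``--site``.

```shell
# Temporary stand-in for ~/.python-gitlab.cfg (assumption for this sketch).
cfg=$(mktemp)

cat > "$cfg" <<'EOF'
[global]
default = my-university-gitlab
ssl_verify = true
timeout = 5

[my-university-gitlab]
url = https://gitlab.my-university.com
private_token = <here-is-your-token>
api_version = 4

[gitlab-general]
url = https://gitlab.com
private_token = <here-is-your-token>
api_version = 4
EOF

# Tokens are as sensitive as passwords: allow only the owner to read the file.
chmod 600 "$cfg"

# Every section except [global] names a GitLab instance usable with --site.
sites=$(sed -n 's/^\[\(.*\)\]$/\1/p' "$cfg" | grep -v '^global$')
printf 'Available sites:\n%s\n' "$sites"

rm -f "$cfg"
```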
  155. Step 2: Create or select a group
  156. """"""""""""""""""""""""""""""""
  157. GitLab's organization consists of *projects* and *groups*.
  158. Projects are single repositories, and groups can be used to manage one or more projects at the same time.
  159. In order to use ``create-sibling-gitlab``, a user **must** `create a group <https://docs.gitlab.com/ee/user/group/#create-a-group>`_ via the web interface, or specify a pre-existing group, because `GitLab does not allow root-level groups to be created via their API <https://docs.gitlab.com/ee/api/groups.html#new-group>`_.
Only when a "parent" group already exists can DataLad and other tools create sub-groups and projects automatically.
  161. In the screenshots :numref:`fig-rootgroup-gitlab1` and :numref:`fig-rootgroup-gitlab2`, a new group ``my-datalad-root-level-group`` is created right underneath the user account.
  162. The group name as shown in the URL bar is what DataLad needs in order to create sibling datasets.
  163. .. _fig-rootgroup-gitlab1:
  164. .. figure:: ../artwork/src/gitlab-rootgroup.png
  165. :width: 80%
  166. Webinterface to create a root-level group on GitLab.
  167. .. _fig-rootgroup-gitlab2:
  168. .. figure:: ../artwork/src/gitlab-rootgroup2.png
  169. :width: 80%
  170. A created root-level group in GitLab's webinterface.
  171. Step 3: Select a layout
  172. """""""""""""""""""""""
  173. Due to the distinction between groups and projects, GitLab allows two different layouts that DataLad can use to publish datasets or dataset hierarchies:
  174. * **flat**:
  175. All datasets become projects in the same, pre-existing group.
  176. The name of a project is its relative path within the root dataset, with all path separator characters replaced by '-' [#f4]_.
  177. * **collection**:
  178. A new group is created for the dataset. The root dataset (the topmost superdataset) is placed in a "project" project inside this group, and all nested subdatasets are represented inside the group using a "flat" layout [#f4]_. This layout is the default.
Consider the ``DataLad-101`` dataset, a superdataset with several subdatasets in the following layout:
.. code-block:: bash

   /home/me/dl-101/DataLad-101    # dataset
   ├── books/
   │   └── [...]
   ├── code/
   │   └── [...]
   ├── midterm_project/           # subdataset
   │   ├── code/
   │   └── [...]
   │   └── input/                 # sub-subdataset
   ├── recordings/
   │   └── longnow/               # subdataset
   │       ├── [...]
  193. How the ``collection`` and ``flat`` layouts for this dataset look in practice is shown in :numref:`fig-gitlab-layout`.
  194. .. _fig-gitlab-layout:
  195. .. figure:: ../artwork/src/gitlab-layouts.png
  196. :width: 50%
  197. The ``collection`` layout has a group (``DataLad-101_collection``, defined by the user with a configuration) with four projects underneath. The ``project`` project contains the root-level dataset, and all contained subdatasets are named according to their location in the dataset. The ``flat`` layout consists of projects in the root-level group. The project name for the superdataset (``DataLad-101_flat``) is defined by the user with a configuration, and the names of the subdatasets extend this project name based on their location in the dataset hierarchy.
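The naming rules of both layouts can be sketched as a small function. This is an illustration, not DataLad's implementation: the function name ``gitlab_project_path`` is made up, it assumes the default project name ``project`` and the default path separator ``-``, and the naming of subdatasets in the ``collection`` layout is inferred from the description above.

```shell
# Map a dataset to its GitLab project path.
#   $1: layout ("flat" or "collection")
#   $2: configured project (flat) or group (collection) name
#   $3: dataset path relative to the root dataset ("." for the root itself)
gitlab_project_path() {
    layout=$1 base=$2 relpath=$3
    # Replace every path separator with the default separator "-".
    flat=$(printf '%s' "$relpath" | tr '/' '-')
    case "$layout:$relpath" in
        flat:.)       echo "$base" ;;
        flat:*)       echo "$base-$flat" ;;
        collection:.) echo "$base/project" ;;
        collection:*) echo "$base/$flat" ;;
    esac
}

# The datasets of the DataLad-101 hierarchy shown above.
for ds in . midterm_project midterm_project/input recordings/longnow; do
    gitlab_project_path flat my-datalad-root-level-group/DataLad-101_flat "$ds"
    gitlab_project_path collection my-datalad-root-level-group/DataLad-101_collection "$ds"
done
```

Under these assumptions, ``midterm_project/input`` becomes ``DataLad-101_flat-midterm_project-input`` in the ``flat`` layout and ``midterm_project-input`` inside the ``DataLad-101_collection`` group in the ``collection`` layout.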
  198. Publishing a single dataset
  199. """""""""""""""""""""""""""
  200. When publishing a single dataset, users can configure the project or group name as a command argument ``--project``.
  201. Here are two command examples and their outcomes.
  202. For a **flat** layout, the ``--project`` parameter determines the project name, shown in :numref:`fig-gitlab-flat`.
  203. .. code-block:: console
  204. $ datalad create-sibling-gitlab --site gitlab-general --layout flat --project my-datalad-root-level-group/this-will-be-the-project-name
  205. create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-project-name]
  206. configure-sibling(ok): . (sibling)
  207. action summary:
  208. configure-sibling (ok: 1)
  209. create_sibling_gitlab (ok: 1)
  210. .. _fig-gitlab-flat:
  211. .. figure:: ../artwork/src/gitlab-layout-flat.png
  212. :width: 50%
  213. An example dataset using GitLab's "flat" layout.
For a **collection** layout, the ``--project`` parameter determines the group name, shown in :numref:`fig-gitlab-collection`.
  215. .. code-block:: console
  216. $ datalad create-sibling-gitlab --site gitlab-general --layout collection --project my-datalad-root-level-group/this-will-be-the-group-name
  217. create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-group-name/project]
  218. configure-sibling(ok): . (sibling)
  219. action summary:
  220. configure-sibling (ok: 1)
  221. create_sibling_gitlab (ok: 1)
  222. .. _fig-gitlab-collection:
  223. .. figure:: ../artwork/src/gitlab-layout-collection.png
  224. :width: 50%
  225. An example dataset using GitLab's "collection" layout.
  226. Publishing datasets recursively
  227. """""""""""""""""""""""""""""""
When publishing a series of datasets recursively, the ``--project`` argument can no longer be used; otherwise, all datasets in the hierarchy would attempt to create the same group or project over and over again.
  229. Instead, one configures the root level dataset, and the names for underlying datasets will be derived from this configuration:
  230. .. index::
  231. single: configuration item; datalad.gitlab-<name>-project
.. code-block:: console

   $ # do the configuration for the top-most dataset
   $ # either configure with Git
   $ git config --local --replace-all \
       datalad.gitlab-<gitlab-site>-project \
       'my-datalad-root-level-group/DataLad-101_flat'
   $ # or configure with DataLad
   $ datalad configuration set \
       datalad.gitlab-<gitlab-site>-project='my-datalad-root-level-group/DataLad-101_flat'
  241. Afterwards, publish dataset hierarchies with the ``--recursive`` flag:
  242. .. code-block:: console
  243. $ datalad create-sibling-gitlab --site gitlab-general --recursive --layout flat
  244. create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat]
  245. configure-sibling(ok): . (sibling)
  246. create_sibling_gitlab(ok): midterm_project (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project]
  247. configure-sibling(ok): . (sibling)
  248. create_sibling_gitlab(ok): midterm_project/input (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project-input]
  249. configure-sibling(ok): . (sibling)
  250. create_sibling_gitlab(ok): recordings/longnow (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-recordings-longnow]
  251. configure-sibling(ok): . (sibling)
  252. action summary:
  253. configure-sibling (ok: 4)
  254. create_sibling_gitlab (ok: 4)
  255. Final step: Pushing to GitLab
  256. """""""""""""""""""""""""""""
  257. Once you have set up your dataset sibling(s), you can push individual datasets with ``datalad push --to gitlab`` or push recursively across a hierarchy by adding the ``--recursive`` flag to the push command.
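The complete recursive workflow, from configuration to push, can be summarized in one short script. This is a sketch that reuses the site name ``gitlab-general`` and the group and project names from the examples in this walk-through; the ``run`` helper is made up for illustration and only prints each command unless ``DO_IT=1`` is set, since a real run requires a dataset, a configured site, and an existing root-level group.

```shell
# Print commands by default; set DO_IT=1 to actually execute them.
run() {
    if [ "${DO_IT:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

site="gitlab-general"
project="my-datalad-root-level-group/DataLad-101_flat"

# 1. Configure the project name for the top-most dataset.
run git config --local --replace-all "datalad.gitlab-${site}-project" "$project"

# 2. Create siblings for the whole dataset hierarchy in the flat layout.
run datalad create-sibling-gitlab --site "$site" --recursive --layout flat

# 3. Push every dataset in the hierarchy to its new sibling.
run datalad push --to gitlab --recursive
```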
  258. .. _gitea: https://about.gitea.com
  259. .. rubric:: Footnotes
  260. .. [#f1] Many repository hosting services have useful features to make your work citeable.
For example, :term:`GIN` is able to assign a :term:`DOI` to your dataset, and GitHub allows ``CITATION.cff`` files. At the same time, archival services such as `Zenodo <https://zenodo.org>`_ often integrate with published repositories, allowing you to preserve your dataset with them.
  262. .. [#f2] Your private SSH key is incredibly valuable, and it is important to keep
  263. it secret!
  264. Anyone who gets your private key has access to anything that the public key
  265. is protecting. If the private key does not have a passphrase, simply copying
  266. this file grants a person access!
  267. .. [#f3] GitHub `deprecated user-password authentication <https://developer.github.com/changes/2020-02-14-deprecating-password-auth>`_ in favor of authentication via personal access token. Supplying a password instead of a token will fail to authenticate.
  268. .. index::
  269. single: configuration item; datalad.gitlab-default-projectname
  270. single: configuration item; datalad.gitlab-default-pathseparator
.. [#f4] The default project name ``project`` and path separator ``-`` are configurable using the dataset-level configurations ``datalad.gitlab-default-projectname`` and ``datalad.gitlab-default-pathseparator``.