  1. .. _gin:
  2. Walk-through: Dataset hosting on GIN
  3. ------------------------------------
`GIN <https://gin.g-node.org/G-Node/Info/wiki>`__ (G-Node infrastructure) is a
free data management system designed for comprehensive and reproducible management
of scientific data. It is a web-based repository store and provides
fine-grained access control to share data. :term:`GIN` builds on :term:`Git` and
:term:`git-annex`, and is an easy alternative to other third-party services to host
and share your DataLad datasets [#f1]_. It allows you to share datasets and their
contents with selected collaborators or to make them publicly and anonymously
available.
  12. :ref:`And even if you prefer to expose and share your datasets via GitHub, you can still use GIN to host your data <ginbts>`.
  13. .. figure:: ../artwork/src/publishing/publishing_network_publishgin.svg
  14. :width: 80%
  15. Some repository hosting services such as GIN have annex support, and can thus hold the complete dataset. This makes publishing datasets very easy.
  16. Prerequisites
  17. ^^^^^^^^^^^^^
  18. In order to use GIN for hosting and sharing your datasets, you need to
  19. - register
  20. - upload your public :term:`SSH key` for SSH access
  21. Once you have `registered <https://gin.g-node.org/user/sign_up>`_
  22. an account on the GIN server by providing your e-mail address, affiliation,
  23. and name, and selecting a user name and password, you should upload your
  24. :term:`SSH key` to allow SSH access
  25. (you can find an explanation of what SSH keys are and how you can create one in :ref:`this Findoutmore <fom-sshkey>` in the general section :ref:`share_hostingservice`).
  26. To do this, visit the settings of your user account. On the left hand side, select
  27. the tab "SSH Keys", and click the button "Add Key":
  28. .. figure:: ../artwork/src/GIN_SSH_1.png
  29. Upload your SSH key to GIN
  30. You should copy the contents of your public key file into the field labeled
  31. ``content``, and enter an arbitrary but informative ``Key Name``, such as
  32. "My private work station". Afterwards, you are done!
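If you are unsure where your public key lives, the following sketch prints every public key it finds.
It assumes the conventional OpenSSH location ``~/.ssh/id_*.pub``, which is an assumption about your setup and not something GIN requires.
The printed line is what goes into the ``content`` field.

```shell
# Print all public keys found in the conventional OpenSSH location.
# Assumption: keys are stored as ~/.ssh/id_*.pub; adjust the pattern otherwise.
found=0
for key in "$HOME"/.ssh/id_*.pub; do
    # If the glob matches nothing, the literal pattern is returned; skip it.
    [ -f "$key" ] || continue
    found=1
    echo "== $key"
    cat "$key"
done
if [ "$found" -eq 0 ]; then
    echo "No public key found; create one first (see the SSH key Findoutmore)."
fi
```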
  33. .. index::
  34. pair: create-sibling-gin; DataLad command
  35. Publishing your dataset to GIN
  36. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As outlined in the section :ref:`share_hostingservice`, there are two ways in which you can publish your dataset to GIN:
either 1) by creating a new, empty repository on GIN via the web interface, or 2) via the :dlcmd:`create-sibling-gin` command.

**1) via web interface:** If you choose to create a new repository via GIN's web interface, make sure not to initialize it with a README:
  40. .. figure:: ../artwork/src/GIN_newrepo.png
  41. Create a new repository on GIN using the web interface.
Afterwards, add this repository as a sibling of your dataset. To do this, use the
:dlcmd:`siblings add` command and the SSH URL of the repository as shown below.
Note that since this is the first time you will be connecting to the GIN server
via SSH, you will likely be asked to confirm the connection. This is a safety measure,
and you can type "yes" to continue:
  47. .. code-block:: text
  48. $ datalad siblings add -d . \
  49. --name gin \
  50. --url git@gin.g-node.org:/adswa/DataLad-101.git
  51. The authenticity of host 'gin.g-node.org (141.84.41.219)' can't be established.
  52. ECDSA key fingerprint is SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
  53. Are you sure you want to continue connecting (yes/no)? yes
  54. [INFO ] Failed to enable annex remote gin, could be a pure git or not accessible
  55. [WARNING] Failed to determine if gin carries annex.
  56. .: gin(-) [git@gin.g-node.org:/adswa/DataLad-101.git (git)]
  57. .. ifconfig:: internal
  58. .. runrecord:: _examples/DL-101-139-101
  59. :language: console
  60. $ python3 /home/me/makepushtarget.py '/home/me/dl-101/DataLad-101' 'gin' '/home/me/pushes/DataLad-101' True True
  61. **2) via command-line:**
  62. If you choose to use the :dlcmd:`create-sibling-gin` command, supply the command with a name for the repository, and optionally add a ``-s/--siblingname [NAME]`` parameter (if unconfigured it will be ``gin``), and ``--access-protocol [https|ssh|https-ssh]`` (ideally ``ssh``).
  63. The command has a number of additional useful parameters, so make sure to take a look at its ``--help``.
  64. Afterwards, you can publish your dataset with :dlcmd:`push`. As the
  65. repository on GIN supports a dataset annex, there is no publication dependency
  66. to an external data hosting service necessary, and the dataset contents
  67. stored in Git and in git-annex are published to the same place:
  68. .. runrecord:: _examples/DL-101-139-102
  69. :language: console
  70. :workdir: dl-101/DataLad-101
  71. $ datalad push --to gin
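The command-line route can be sketched as a small shell function.
The function name ``publish_to_gin`` is hypothetical, and the sketch assumes that the current directory is the dataset root, that the default sibling name ``gin`` is wanted, and that SSH access is set up.
Because :dlcmd:`create-sibling-gin` creates the repository on GIN on your behalf, it may ask for an access token the first time you run it.

```shell
# Hypothetical helper: create a GIN repository for the dataset in the
# current directory and push Git history plus annexed contents to it.
# $1: name for the new repository on GIN, e.g. DataLad-101
publish_to_gin() {
    if [ -z "$1" ]; then
        echo "usage: publish_to_gin <repository-name>" >&2
        return 1
    fi
    # -s names the sibling; "gin" is also the default when it is omitted.
    datalad create-sibling-gin -d . -s gin --access-protocol ssh "$1" &&
        datalad push --to gin
}

# Usage (run inside the dataset): publish_to_gin DataLad-101
```

Defining the function runs nothing by itself; the two DataLad commands are only executed once you call it with a repository name.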
  72. On the GIN web interface you will find all of your dataset -- including annexed contents!
  73. What is especially cool is that the GIN web interface (unlike :term:`GitHub`) can even preview your annexed contents.
  74. .. figure:: ../artwork/src/GIN_dl101_repo.png
  75. A published dataset in a GIN repository at gin.g-node.org.
  76. .. _access:
  77. Sharing and accessing the dataset
  78. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once your dataset is published, you can point collaborators and friends to it.
If it is a **public** repository, retrieving the dataset and getting access to
all published data contents (in a read-only fashion) is done by cloning the
repository's ``https`` URL. This does not require a user account on GIN.
  83. .. index::
  84. pair: clone; DataLad command
  85. .. importantnote:: Take the URL in the browser, not the copy-paste URL
  86. Please note that you need to use the browser URL of the repository, not the copy-paste URL on the upper right hand side of the repository if you want to get anonymous HTTPS access!
  87. The two URLs differ only by a ``.git`` extension:
  88. * Browser bar: ``https://gin.g-node.org/<user>/<repo>``
  89. * Copy-paste "HTTPS clone": ``https://gin.g-node.org/<user>/<repo>.git``
  90. A dataset cloned from ``https://gin.g-node.org/<user>/<repo>.git``, however, cannot retrieve annexed files!
  91. .. runrecord:: _examples/DL-101-139-107
  92. :language: console
  93. :workdir: dl-101/clone_of_dl-101
  94. $ datalad clone https://gin.g-node.org/adswa/DataLad-101
  95. Subsequently, :dlcmd:`get` calls will be able to retrieve all annexed
  96. file contents that have been published to the repository.
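If you only have the copy-paste "HTTPS clone" URL at hand, the browser URL can be derived from it by stripping the ``.git`` suffix.
This is a minimal sketch; the variable names are illustrative, and the commented-out commands need network access.

```shell
# Copy-paste "HTTPS clone" URL as shown on the repository page.
clone_url="https://gin.g-node.org/adswa/DataLad-101.git"

# Strip a trailing ".git" to obtain the browser URL, which allows
# anonymous retrieval of annexed contents.
browser_url="${clone_url%.git}"
echo "$browser_url"

# Then clone and retrieve contents (commented out, as both need network access):
# datalad clone "$browser_url" && cd DataLad-101 && datalad get <path>
```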
  97. .. index::
  98. pair: clone; DataLad command
  99. If it is a **private** dataset, cloning the dataset from GIN requires a user
  100. name and password for anyone you want to share your dataset with.
  101. The "Collaboration" tab under Settings lets you set fine-grained access rights,
  102. and it is possible to share datasets with collaborators that are not registered
  103. on GIN with provided Guest accounts.
  104. If you are unsure if your dataset is private, :ref:`this find-out-more shows you how to find out <fom-private-gin>`.
In order to get access to annexed contents, cloning *requires* setting up
an SSH key as detailed above, and cloning via the SSH URL:
  107. .. code-block:: console
  108. $ datalad clone git@gin.g-node.org:/adswa/DataLad-101.git
Likewise, in order to publish changes back to a GIN repository, the repository needs
to be cloned via its SSH URL.
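The SSH URL follows the pattern used in the examples of this section, ``git@gin.g-node.org:/<user>/<repo>.git``.
As a small sketch, it can be built from a repository's browser URL with plain shell parameter expansion (the variable names are illustrative):

```shell
# Browser URL of a repository on GIN.
browser_url="https://gin.g-node.org/adswa/DataLad-101"

# Keep only "<user>/<repo>" and drop an optional trailing ".git".
repo_path="${browser_url#https://gin.g-node.org/}"
repo_path="${repo_path%.git}"

# SSH URLs on GIN have the form git@gin.g-node.org:/<user>/<repo>.git
ssh_url="git@gin.g-node.org:/${repo_path}.git"
echo "$ssh_url"
```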
  111. .. index:: dataset hosting; GIN
  112. .. find-out-more:: How do I know if my repository is private?
  113. :name: fom-private-gin
  114. :float:
   Private repos are marked with a lock sign. To make a repository public, untick the
   "Private" box, found under "Settings":
  117. ..
  118. the image below can't become a figure because it can't be used in LaTeXs minipage environment
  119. .. image:: ../artwork/src/GIN_private.png
  120. .. index::
  121. pair: subdatasets; DataLad command
  122. .. _subdspublishing:
  123. Subdataset publishing
  124. ^^^^^^^^^^^^^^^^^^^^^
Just as the input subdataset ``iris_data`` in your published ``midterm_project``
was referencing its source on :term:`GitHub`, the ``longnow`` subdataset in your
published ``DataLad-101`` dataset directly references the original
dataset on :term:`GitHub`. If you click on ``recordings`` and then ``longnow`` in GIN's web interface, you will
be redirected to the podcast's original dataset.

The subdataset ``midterm_project``, however, is not successfully referenced. If
you click on it, you get a 404 error page. The crucial difference between this
subdataset and the ``longnow`` dataset is its entry in the ``.gitmodules`` file of
``DataLad-101``:
  134. .. code-block:: ini
  135. :emphasize-lines: 4, 8
  136. $ cat .gitmodules
  137. [submodule "recordings/longnow"]
  138. path = recordings/longnow
  139. url = https://github.com/datalad-datasets/longnow-podcasts.git
  140. datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
  141. [submodule "midterm_project"]
  142. path = midterm_project
  143. url = ./midterm_project
  144. datalad-id = e5a3d370-223d-11ea-af8b-e86a64c8054c
  145. While the longnow subdataset is referenced with a valid URL to GitHub, the midterm
  146. project's URL is a relative path from the root of the superdataset. This is because
the ``longnow`` subdataset was installed with :dlcmd:`clone -d .`
(which records the source of the subdataset), whereas the ``midterm_project`` dataset
was created as a subdataset with :dlcmd:`create -d . midterm_project`.
  150. Since there is no repository at
  151. ``https://gin.g-node.org/<USER>/DataLad-101/midterm_project`` (which this submodule
  152. entry would resolve to), accessing the subdataset fails.
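To spot such entries before publishing, you can list all submodule URLs and flag the relative ones.
The following is a minimal sketch: it writes a throwaway ``.gitmodules`` that mirrors the entries shown above (without the ``datalad-id`` lines) so that it runs anywhere.
In a real dataset you would point ``git config -f`` at the dataset's own ``.gitmodules`` instead.

```shell
# Build a throwaway .gitmodules that mirrors the one shown above.
demo_dir=$(mktemp -d)
cat > "$demo_dir/.gitmodules" <<'EOF'
[submodule "recordings/longnow"]
    path = recordings/longnow
    url = https://github.com/datalad-datasets/longnow-podcasts.git
[submodule "midterm_project"]
    path = midterm_project
    url = ./midterm_project
EOF

# List every submodule URL and flag the relative ones, which only
# resolve if a repository exists at that path below the superdataset.
relative=""
while read -r key url; do
    case "$url" in
        ./*|../*) echo "relative: $key -> $url"; relative="$relative $key" ;;
        *)        echo "absolute: $key -> $url" ;;
    esac
done <<EOF
$(git config -f "$demo_dir/.gitmodules" --get-regexp '^submodule\..*\.url$')
EOF

# Clean up the throwaway directory.
rm -rf "$demo_dir"
```

Here only ``submodule.midterm_project.url`` is reported as relative, which matches the entry that leads to the 404 page.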
However, since you have already published this dataset (to GitHub), you could
update the submodule entry and provide the accessible GitHub URL instead. This
can be done via the ``--set-property <NAME> <VALUE>`` option of
:dlcmd:`subdatasets` [#f3]_ (replace the URL shown here with the URL
your dataset was published to -- likely, you only need to change the user name):
  158. .. runrecord:: _examples/DL-101-139-103
  159. :language: console
  160. :workdir: dl-101/DataLad-101
  161. $ datalad subdatasets --contains midterm_project \
  162. --set-property url https://github.com/adswa/midtermproject
  163. .. runrecord:: _examples/DL-101-139-104
  164. :language: console
  165. :workdir: dl-101/DataLad-101
  166. $ cat .gitmodules
  167. Handily, the :dlcmd:`subdatasets` command saved this change to the
  168. ``.gitmodules`` file automatically and the state of the dataset is clean:
  169. .. runrecord:: _examples/DL-101-139-105
  170. :language: console
  171. :workdir: dl-101/DataLad-101
  172. $ datalad status
  173. Afterwards, publish these changes to ``gin`` and see for yourself how this fixed
  174. the problem:
  175. .. runrecord:: _examples/DL-101-139-106
  176. :language: console
  177. :workdir: dl-101/DataLad-101
  178. $ datalad push --to gin
If the subdataset had not been published before, you could publish the subdataset to
a location of your choice, and modify the ``.gitmodules`` entry accordingly.
  181. .. index::
  182. single: configuration item; remote.<name>.annex-ignore
  183. pair: configure sibling; with DataLad
  184. .. _ginbts:
  185. Using GIN as a data source behind the scenes
  186. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  187. Even if you do not want to point collaborators to yet another hosting site but want to be able to expose your datasets via services they use and know already (such as GitHub or GitLab), GIN can be very useful:
  188. You can let GIN perform data hosting in the background by using it as an "autoenabled data source" that a dataset :term:`sibling` (even if it is published to GitHub or GitLab) can retrieve data from.
You will need to have a GIN account and an SSH key set up, so please take a look at the first part of this section if you do not yet know how to do this.
  190. Then, follow these steps:
  191. - First, create a new repository on GIN (see step by step instructions above).
- In your to-be-published dataset, add this repository as a sibling, this time setting the ``--url`` and ``--pushurl`` arguments explicitly. Make sure to configure an :term:`SSH` URL as ``--pushurl`` but an :term:`HTTPS` URL as ``--url``.
  Please also note that the :term:`HTTPS` URL written after ``--url`` DOES NOT have the ``.git`` suffix.
  194. Here is the command:
  195. .. code-block:: console
     $ datalad siblings add \
       -d . \
       --name gin \
       --pushurl git@gin.g-node.org:/studyforrest/aggregate-fmri-timeseries.git \
       --url https://gin.g-node.org/studyforrest/aggregate-fmri-timeseries
- Locally, run ``git config --unset-all remote.gin.annex-ignore`` to prevent :term:`git-annex` from ignoring this new sibling.
- Push your data to the repository on GIN (``datalad push --to gin``). This pushes the current state of the repository, including content, but also adjusts the :term:`git-annex` configuration.
  203. - Configure this sibling as a "common data source". Use the same name as previously in ``--name`` (to indicate which sibling you are configuring) and give a new, different, name after ``--as-common-datasrc``:
  204. .. code-block:: console
  205. $ datalad siblings configure \
  206. --name gin \
  207. --as-common-datasrc gin-src
- Push to the repository on GIN again (``datalad push --to gin``) to make the configuration change known to the GIN sibling.
- Publish your dataset to GitHub/GitLab/..., or update an existing published dataset (``datalad push``).
  210. Afterwards, :dlcmd:`get` retrieves files from GIN, even if the dataset has been cloned from GitHub.
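The steps listed above can be bundled into one shell function.
This is a sketch under stated assumptions: the function name ``setup_gin_datasrc`` is hypothetical, the repository must already exist on GIN, the current directory must be the dataset root, and the data source name ``gin-src`` is just the example name used above.

```shell
# Hypothetical helper bundling the steps listed above.
# $1: GIN user or organization, $2: repository name (must already exist on GIN)
setup_gin_datasrc() {
    if [ "$#" -ne 2 ]; then
        echo "usage: setup_gin_datasrc <user> <repo>" >&2
        return 1
    fi
    # Push via SSH, but advertise the anonymous HTTPS URL (no .git suffix).
    datalad siblings add -d . --name gin \
        --pushurl "git@gin.g-node.org:/$1/$2.git" \
        --url "https://gin.g-node.org/$1/$2" || return 1
    # Stop git-annex from ignoring the new sibling. If the key is not set,
    # git config exits non-zero; that is fine here.
    git config --unset-all remote.gin.annex-ignore || true
    datalad push --to gin || return 1
    # "gin-src" names the auto-enabled data source; any name that differs
    # from the sibling name works.
    datalad siblings configure --name gin --as-common-datasrc gin-src || return 1
    # The second push makes the new configuration known on GIN.
    datalad push --to gin
}

# Usage (run inside the dataset):
# setup_gin_datasrc studyforrest aggregate-fmri-timeseries
```

Publishing to GitHub or GitLab is left out on purpose, since that step depends on where and how the dataset is exposed.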
  211. .. index::
  212. pair: common data source; DataLad concept
  213. .. gitusernote:: Siblings as a common data source
   The argument ``--as-common-datasrc <name>`` configures a sibling as a common data source -- in technical terms, as an auto-enabled git-annex special remote.
  215. .. rubric:: Footnotes
.. [#f1] GIN looks and feels similar to GitHub, and among a number of advantages, it can
   assign a :term:`DOI` to your dataset, making it cite-able. Moreover, its
   `web interface <https://gin.g-node.org/G-Node/Info/wiki/WebInterface>`_
   and `client <https://gin.g-node.org/G-Node/Info/wiki/GinUsageTutorial>`_ are
   useful tools with a variety of features that are worthwhile to check out, as well.
.. [#f3] Alternatively, you can configure the subdataset's URL with :gitcmd:`config`:
  222. .. code-block:: console
  223. $ git config -f .gitmodules --replace-all submodule.midterm_project.url https://github.com/adswa/midtermproject
  224. Remember, though, that this command modifies ``.gitmodules`` *without*
  225. an automatic, subsequent :dlcmd:`save`, so that you will have to save
  226. this change manually.