
.. _sharethirdparty:

Beyond shared infrastructure
------------------------------

Data sharing potentially involves a number of different elements:

.. figure:: ../artwork/src/publishing/startingpoint.svg
   :width: 60%

   An overview of all elements potentially included in a publication workflow.

Users on a common, shared computational infrastructure such as an :term:`SSH server`
can share datasets via simple installations with paths, without any involvement of
third party storage providers or repository hosting services, as shown in
:numref:`fig-clonecompute`.

.. _fig-clonecompute:

.. figure:: ../artwork/src/publishing/clone_combined.svg

   Cloning from local or remote compute infrastructure.
But at some point in a dataset's life, you may want to share it with people who
can't access the computer or server your dataset lives on, store it on other
infrastructure to save disk space, or create a backup.
When this happens, you will want to publish your dataset to repository hosting
services (for example, :term:`GitHub`, :term:`GitLab`, or :term:`GIN`)
and/or third party storage providers (such as Dropbox_, Google_,
`Amazon S3 buckets <https://aws.amazon.com/s3>`_,
the `Open Science Framework`_ (OSF), and many others).
This chapter tackles different aspects of dataset publishing.
The remainder of this section talks about general aspects of dataset publishing, and
illustrates the idea of using third party services as :term:`special remote`\s from
which annexed file contents can be retrieved via :dlcmd:`get`.
The upcoming section :ref:`gin` shows you one of the easiest ways to publish your
dataset publicly or for selected collaborators and friends.
If you don't want to dive into all the details of dataset sharing, it is safe to
skip directly ahead to this section, and have your dataset published in only a few minutes.
Other sections in this chapter showcase a variety of ways to publish datasets
and their contents to different services:
The section :ref:`share_hostingservice` demonstrates how to publish datasets to any
kind of Git repository hosting service.
The sections :ref:`s3` and :ref:`dropbox` are concrete examples of sharing datasets
publicly or with selected others via different cloud services.
The section :ref:`gitlfs` talks about using the centralized, for-pay service
`Git LFS`_ for sharing dataset content on GitHub, and the
section :ref:`figshare` shows built-in dataset export to services such as
`figshare.com <https://figshare.com>`__.
Leveraging third party infrastructure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are several ways to make datasets available to others:

- You can **publish your dataset to a repository with annex support** such as
  :term:`GIN` or the OSF_ [#f1]_. This is the easiest way to share datasets and all
  their contents. Read on in the section :ref:`gin` or consult the tutorials of the
  `datalad-osf extension`_ to learn how to do this.
- You can **publish your dataset to a repository hosting service**, and **configure
  an external resource that stores your annexed data**. Such a resource can be a
  private web server, but also third party cloud storage such as Dropbox_, Google_,
  `Amazon S3 buckets <https://aws.amazon.com/s3>`_, `Box.com <https://www.box.com>`_,
  `owncloud <https://owncloud.com>`_, `sciebo <https://hochschulcloud.nrw>`_, and
  many more.
- You can **export your dataset statically** as a snapshot to a service such as
  `Figshare <https://figshare.com>`__ or the OSF_ [#f1]_.
- You can **publish your dataset to a repository hosting service** and ensure that
  all dataset contents are either available from pre-existing public sources or can
  be recomputed from a :term:`run record`.
Dataset contents and third party services influence sharing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because DataLad datasets are :term:`Git` repositories, it is possible to
:dlcmd:`push` datasets to any Git repository hosting service, such as
:term:`GitHub`, :term:`GitLab`, :term:`GIN`, :term:`Bitbucket`, `Gogs <https://gogs.io>`_,
or Gitea_.
You have already done this in section :ref:`yoda_project` when you shared your
``midterm_project`` dataset via :term:`GitHub`.
However, most Git repository hosting services do not support hosting the content
of files managed by :term:`git-annex`.
For example, the results of the analysis in section :ref:`yoda_project`,
``pairwise_comparisons.png`` and ``prediction_report.csv``, were not published to
GitHub: Metadata about their file availability was published, but if a friend cloned
this dataset and ran a :dlcmd:`get` command, content retrieval would fail,
because the only known location of their content is your private computer, to which
only you have access.
Instead, they would need to be recomputed from the :term:`run record` in the dataset.
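You can inspect this availability metadata yourself by querying :term:`git-annex`
directly. A minimal sketch, run inside the dataset (the file name is taken from the
midterm project example):

.. code-block:: bash

   # list every location that git-annex knows to hold this file's content
   git annex whereis pairwise_comparisons.png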
When you are sharing DataLad datasets with other people or third party services,
an important distinction thus lies in *annexed* versus *not-annexed* content, i.e.,
files that are stored in your dataset's :term:`annex` versus files that are committed
into :term:`Git`.
The third party service of your choice may have support for both annexed and
non-annexed files, or for only one of them.

.. figure:: ../artwork/src/publishing/publishing_network_publishparts2.svg
   :width: 80%

   Schematic difference between the Git and git-annex aspects of your dataset, and where each part *usually* gets published to.
The common case: Repository hosting without annex support and special remotes
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

As DataLad datasets are :term:`Git` repositories, they can be pushed to any Git
repository hosting service, such as :term:`GitHub`, :term:`GitLab`, :term:`GIN`,
:term:`Bitbucket`, `Gogs <https://gogs.io>`_, or Gitea_.
But while anything that is managed by Git is accessible on repository hosting
services, they usually don't support storing annexed data [#f2]_.
When you want to publish a dataset to a Git repository hosting service to allow
others to easily find and clone it, but you also want others to be able to retrieve
annexed files in this dataset via :dlcmd:`get`, the annexed contents need to be
pushed to an additional storage hosting service.
This service can be any kind of private, institutional, or commercial storage, and
its location will be registered in the dataset as a :term:`special remote`.
.. index::
   pair: special remote; git-annex concept

.. find-out-more:: What is a special remote?

   A special remote is an extension to Git's concept of remotes, and can
   enable :term:`git-annex` to transfer data from and possibly to places that are
   not Git repositories (e.g., cloud services or external machines such as an HPC
   system). For example, an *S3* special remote uploads and downloads content
   to and from AWS S3, a *web* special remote downloads files from the web, the
   *datalad-archives* special remote extracts files from annexed archives, etc.
   Don't envision a special remote as merely a physical place or location -- a
   special remote is a protocol that defines the underlying transport of your
   files to and/or from a specific location.
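   As a minimal sketch, a plain directory is one of the simplest special remotes
   and needs no third party account at all (the remote name ``mybackup`` and the
   target path are placeholders):

   .. code-block:: bash

      # register a local directory as a special remote for annexed content
      git annex initremote mybackup type=directory directory=/mnt/backup encryption=none
      # transfer annexed file content to it
      git annex copy --to mybackup .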
To register a special remote in your dataset and use it for file storage, you need
to configure the service of your choice and *publish* the annexed contents to it.
Afterwards, the published dataset (e.g., on :term:`GitHub` or :term:`GitLab`) stores
the information about where to obtain annexed file contents from, such that
:dlcmd:`get` works.

Once you have configured the service of your choice, you can push your dataset's
Git history to the repository hosting service and the annexed contents to the
special remote. DataLad also makes it easy to automatically push these different
dataset contents exactly where they need to be via a :term:`publication dependency`.
Exemplary walk-throughs for Dropbox_, `Amazon S3 buckets <https://aws.amazon.com/s3>`_,
and `Git LFS`_ can be found in the upcoming sections of this chapter.
But the general workflow looks as follows, and is sketched in the code block after
this list:
From your perspective (as someone who wants to share data), you will need to

- (potentially) install/set up the relevant *special remote*,
- create a dataset sibling on GitHub/GitLab/... for others to install from,
- set up a *publication dependency* between the repository hosting service and the
  special remote, so that annexed contents are automatically pushed to the special
  remote whenever you update the sibling on the Git repository hosting site,
- publish your dataset.
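A minimal sketch of these steps, assuming a special remote named ``myremote`` has
already been initialized (the sibling and repository names are placeholders):

.. code-block:: bash

   # create a sibling repository on GitHub (requires a configured access token)
   datalad create-sibling-github -d . my-dataset -s github
   # make pushes to "github" depend on pushing annexed contents to "myremote" first
   datalad siblings configure -s github --publish-depends myremote
   # publish Git history and annexed contents in one go
   datalad push --to github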
This gives you the freedom to decide where your data lives and who can have
access to it. Once this setup is complete, updating and accessing a published
dataset and its data is almost as easy as if it lived on your own machine.

From the perspective of a consumer (someone who wants to obtain your dataset),
they will need to

- (potentially) install the relevant *special remote* (depending on the third
  party service you chose) and
- perform the standard :dlcmd:`clone` and :dlcmd:`get` commands as necessary,
  as the sketch below shows.
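A minimal sketch from the consumer's side (the clone URL is a placeholder):

.. code-block:: bash

   # obtain the dataset from the repository hosting service
   datalad clone https://github.com/yourname/my-dataset.git
   cd my-dataset
   # retrieve annexed file content from the registered special remote
   datalad get pairwise_comparisons.png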
Thus, from a collaborator's perspective, with the exception of potentially
installing/setting up the relevant *special remote*, obtaining your dataset and
its data is as easy as with any public DataLad dataset.
While you have to invest some setup effort in the beginning, once this is done,
your own workflows and those of others are the ones you are already very familiar
with, as :numref:`fig-cloneurls` illustrates.
.. _fig-cloneurls:

.. figure:: ../artwork/src/publishing/clone_url.svg
   :width: 60%

   Cloning from remote URLs.

If you are interested in learning how to set up different services as special
remotes, take a look at the sections :ref:`s3`, :ref:`dropbox`, or :ref:`gitlfs`
for concrete examples with DataLad datasets, and at the general section
:ref:`share_hostingservice` on setting up dataset siblings.
In addition, the git-annex documentation contains step-by-step walk-throughs for
services such as `S3 <https://git-annex.branchable.com/tips/public_Amazon_S3_remote>`_,
`Google Cloud Storage <https://git-annex.branchable.com/tips/using_Google_Cloud_Storage>`_,
`Box.com <https://git-annex.branchable.com/tips/using_box.com_as_a_special_remote>`__,
`Amazon Glacier <https://git-annex.branchable.com/tips/using_Amazon_Glacier>`_,
`OwnCloud <https://git-annex.branchable.com/tips/owncloudannex>`__, and many more.
Here is the complete list: `git-annex.branchable.com/special_remotes <https://git-annex.branchable.com/special_remotes>`_.
The easy case: Repository hosting with annex support
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

There are a few Git repository hosting services with support for annexed contents,
as illustrated in :numref:`fig-ginpublishing`.
One of them is :term:`GIN`.
What makes them extremely convenient is that there is no need to configure a
special remote -- creating a :term:`sibling` and running :dlcmd:`push` is enough.

.. _fig-ginpublishing:

.. figure:: ../artwork/src/publishing/publishing_network_publishgin.svg
   :width: 80%

   Some repository hosting services have annex support -- they can host both the Git and git-annex parts of your dataset.

Read the section :ref:`gin` for a walk-through.
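As a teaser, a minimal sketch with placeholder names, assuming a GIN account and
an SSH key are already set up:

.. code-block:: bash

   # create a sibling repository on GIN
   datalad create-sibling-gin my-dataset -s gin
   # push Git history and annexed contents in one go
   datalad push --to gin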
The uncommon case: Special remotes with repository hosting support
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Typically, storage hosting services such as cloud storage providers do not provide
the ability to host Git repositories.
Therefore, it is typically not possible to :dlcmd:`clone` from cloud storage.
However, a number of :term:`datalad extension`\s have been created that equip cloud
storage providers with the ability to also host Git repositories, as
:numref:`fig-publishosf` illustrates.
While these services do not gain the ability to display repositories the way that
pure Git repository hosting services like GitHub do, they do gain the superpower
of becoming clonable.
One example of this is the Open Science Framework, which can become the home of
datasets by using the `datalad-osf extension`_.
As long as you and your collaborators have the extension installed, annexed dataset
contents and the Git repository part of your dataset can be pushed or cloned in one go.
.. _fig-publishosf:

.. figure:: ../artwork/src/publishing/publishing_network_publishosf.svg
   :width: 80%

   With some :term:`datalad extension`\s, third party storage services can host Git repositories in addition to annexed contents.

Please take a look at the documentation and tutorials of the `datalad-osf extension`_ for examples and more information.
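Based on the extension's tutorials, a minimal sketch, assuming the extension is
installed and OSF credentials are configured (the project title and ID are
placeholders):

.. code-block:: bash

   # create an OSF project that can hold both Git history and annexed contents
   datalad create-sibling-osf --title "my-dataset" -s osf
   # publish everything in one go
   datalad push --to osf

   # collaborators with the extension installed can clone directly from OSF
   datalad clone osf://<project-id>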
The creative case: Ensuring availability using only repository hosting
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

When you only want to use pure Git repository hosting services without annex
support, you can still allow others to obtain (some) file contents with some
creativity:
For one, you can use commands such as :dlcmd:`download-url` or :dlcmd:`addurls`
to retrieve files from web sources and automatically register their location.
The first chapter, :ref:`chapter_datasets`, demonstrates :dlcmd:`download-url`.
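A minimal sketch (the URL is a placeholder): the file is downloaded and saved to
the dataset, and its web origin is registered, so that a consumer's :dlcmd:`get`
can retrieve the content from the web rather than from your machine:

.. code-block:: bash

   datalad download-url -m "Add input data from the web" \
     https://example.com/files/input_data.csv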
Other than this, you can rely on digital provenance in the form of
:term:`run record`\s that allow consumers of your dataset to recompute a result
instead of :dlcmd:`get`\ing it.
The midterm project example in section :ref:`yoda_project` is an example of this.
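A minimal sketch from the consumer's perspective, assuming the output's history
contains a :term:`run record` (the commit hash is a placeholder):

.. code-block:: bash

   # re-execute the command stored in the run record to recreate the output
   datalad rerun <commit-hash>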
The static case: Exporting dataset snapshots
""""""""""""""""""""""""""""""""""""""""""""""

While DataLad datasets have the great advantage that they carry a history with
all kinds of useful digital provenance and previous versions of files, it may not
always be necessary to make use of this advantage.
Sometimes, you may just want to share or archive the most recent state of the
dataset as a snapshot.
DataLad provides the ability to do this out of the box for arbitrary locations,
and supports specific services such as `Figshare <https://figshare.com>`__.
Find more information on this in the section :ref:`figshare`.
Beyond that, some :term:`datalad extension`\s allow an export to additional
services such as the Open Science Framework.
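For a snapshot export to an arbitrary location, a minimal sketch using
:dlcmd:`export-archive` (the archive file name is a placeholder):

.. code-block:: bash

   # write the most recent state of the dataset into a compressed tarball,
   # without any version history
   datalad export-archive my-dataset-snapshot.tar.gz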
General information on publishing datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Beyond concrete examples of publishing datasets, some general information may be
useful in addition:
The section :ref:`push` illustrates the DataLad command :dlcmd:`push`, the command
that handles every publication operation, regardless of the type of published
content or its destination.
In addition, the section :ref:`privacy` contains tips and strategies for
publishing datasets without leaking potentially private contents or information.
.. _dropbox: https://www.dropbox.com
.. _google: https://www.google.com
.. _gitea: https://about.gitea.com
.. _git lfs: https://git-lfs.com
.. _Open Science Framework: https://osf.io
.. _OSF: https://osf.io
.. _datalad-osf extension: https://docs.datalad.org/projects/osf
.. rubric:: Footnotes

.. [#f1] Requires the `datalad-osf extension`_.

.. [#f2] In addition to not storing annexed data, most Git repository hosting
   services also have a size limit for files kept in Git. So while you could
   *theoretically* commit a sizable file into Git, this would not only negatively
   impact the performance of your dataset, as Git doesn't handle large files well,
   but it would also `prevent your dataset from being published to a Git repository
   hosting service like GitHub
   <https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github>`_.

.. [#f5] Old versions of :term:`GitLab`, on the other hand, provide a git-annex
   configuration. It is disabled by default, and to enable it you would need
   administrative access to the server and client side of your GitLab instance.
   Alternatively, GitHub can integrate with `Git LFS`_, a non-free, centralized
   service that allows storing large file contents. The section :ref:`gitlfs`
   shows an example of how to use its free trial version.