.. _sharelocal1:

Looking without touching
------------------------

Only now, several weeks into the DataLad-101 course, does your room
mate realize that he has enrolled in the course as well, but has not
yet attended at all. "Oh man, can you help me catch up?" he asks
you one day. "Sharing just your notes would be really cool for a
start already!"

"Sure thing", you say, and decide that it's probably best if he gets
all of the ``DataLad-101`` course dataset. Sharing datasets was
something you wanted to look into soon, anyway.

This is one exciting aspect of DataLad datasets that has so far been missing
from this course: How does one share a dataset?
In this section, we will cover the simplest way of sharing a dataset:
on a local or shared file system, via an *installation* with a path as
a source.

.. importantnote:: More on public data sharing

   Interested in sharing datasets *publicly*? Read this chapter to get a feel
   for all relevant basic concepts of sharing datasets. Afterwards, head over
   to chapter :ref:`chapter_thirdparty` to find out how to share a dataset
   on third-party infrastructure.

In this scenario, multiple people can access the very same files at the
same time, often on the same machine (e.g., a shared workstation, or
a server that people can ":term:`SSH`" into). You might think: "What do I need
DataLad for, if everyone can already access everything?" However,
universal, unrestricted access can easily lead to chaos. DataLad can
help facilitate collaboration without requiring ultimate trust in and
reliability of all participants. Essentially, with a shared dataset,
collaborators can see and use your dataset without any danger
of undesired or uncontrolled modification.

To demonstrate how to share a DataLad dataset on a common file system,
we will pretend that your personal computer
can be accessed by other users. Let's say that
your room mate has access, and you are making sure that there is
a ``DataLad-101`` dataset in a different place on the file system
for him to access and work with.

This is indeed a common real-world use case: Two users on a shared
file system sharing a dataset with each other.
But as we cannot easily simulate a second user in this handbook,
for now, you will have to share your dataset with yourself.

This endeavor serves several purposes: For one, you will experience a very easy
way of sharing a dataset. Secondly, it will show you
how a dataset can be obtained from a path, instead of a URL as shown in section
:ref:`installds`. Thirdly, ``DataLad-101`` is a dataset that can
showcase many different properties of a dataset already, but it will
be an additional learning experience to see how the different parts
of the dataset -- text files, larger files, subdatasets,
:term:`run record`\s -- will appear upon installation when shared.
And lastly, you will likely "share a dataset with yourself" whenever you
use a particular dataset of your own creation as input for
one or more projects.

"Awesome!" exclaims your room mate as you take out your laptop to
share the dataset. "You are really saving my ass
here. I'll make up for it when we prepare for the final", he promises.

To install ``DataLad-101`` into a different part
of your file system, navigate out of ``DataLad-101``, and -- for
simplicity -- create a new directory, ``mock_user``, right next to it:

.. runrecord:: _examples/DL-101-116-101
   :language: console
   :workdir: dl-101
   :realcommand: mkdir mock_user
   :notes: (hope this works)
   :cast: 04_collaboration

   $ cd ../
   $ mkdir mock_user

For simplicity, pretend that this is a second user's -- your room mate's --
home directory. Furthermore, let's for now disregard anything about
:term:`permissions`. In a real-world example you likely would not be able to read and write
to a different user's directories, but we will talk about permissions later.

.. index::
   pair: clone; DataLad command
   pair: clone dataset (set location description); with DataLad

After creation, navigate into ``mock_user`` and install the dataset ``DataLad-101``.
To do this, use :dlcmd:`clone`, and provide a path to your original
dataset:

.. runrecord:: _examples/DL-101-116-102
   :language: console
   :workdir: dl-101
   :notes: We pretend to clone the DataLad-101 dataset into a different user's home directory. To do this, we use datalad clone with a path
   :cast: 04_collaboration

   $ cd mock_user
   $ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101

This will install your dataset ``DataLad-101`` into your room mate's home
directory. Note that we have given this new
dataset a description about its location. Note further that we
have not provided the optional destination path to :dlcmd:`clone`,
and hence it installed the dataset under its original name in the current directory.

Together with your room mate, you go ahead and see what this dataset looks
like. Before running the command, try to predict what you will see.

.. runrecord:: _examples/DL-101-116-103
   :language: console
   :workdir: dl-101/mock_user
   :notes: What do you think the dataset looks like?
   :cast: 04_collaboration

   $ cd DataLad-101
   $ tree

There are a number of interesting things, and your room mate is the
first to notice them:

"Hey, can you explain some things to me?", he asks. "This directory
here, "``longnow``", why is it empty?"
True, the subdataset has a directory name but apart from this,
the ``longnow`` directory appears empty.

"Also, why do the PDFs in ``books/`` and the ``.jpg`` files
appear so weird? They have
this cryptic path right next to them, and look, if I try to open
one of them, it fails! Did something go wrong when we installed
the dataset?" he worries.

Indeed, the PDFs and pictures appear just as they did in the original dataset
at first sight: They are symlinks pointing to some location in the
object tree. To reassure your room mate that everything is fine you
quickly explain to him the concept of a symlink and the :term:`object-tree`
of :term:`git-annex`.

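
The situation your room mate ran into can be reproduced without DataLad at all.
The following is a minimal sketch in plain shell, assuming a POSIX shell with
``mktemp`` available. The file name ``book.pdf`` and the target path are
hypothetical stand-ins and not real git-annex keys. It shows that a symlink
can exist while the file it points to does not:

.. code-block:: bash

   # Work in a throwaway directory so that nothing real is touched
   demo_dir=$(mktemp -d)
   cd "$demo_dir"

   # Hypothetical target path, loosely modeled on a location in an object tree
   target=".git/annex/objects/XX/YY/KEY/KEY"

   # Create a symlink whose target does not exist yet (a "dangling" link)
   ln -s "$target" book.pdf

   # -L tests the link itself, -e follows the link to its target
   if [ -L book.pdf ] && [ ! -e book.pdf ]; then
       state_before="link present, content missing"
   fi

   # Simulate content retrieval by creating the target file
   mkdir -p "$(dirname "$target")"
   printf 'file content\n' > "$target"

   if [ -e book.pdf ]; then
       state_after="content available"
   fi

   echo "before: $state_before"
   echo "after:  $state_after"

In this sketch, creating the target file by hand stands in for the content
retrieval that a real dataset performs on demand.
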
.. index::
   pair: clone; DataLad command

"But why does the PDF not open when I try to open it?" he repeats.
True, these files cannot be opened. This mimics our experience when
installing the ``longnow`` subdataset: Right after installation,
the ``.mp3`` files also could not be opened, because their file
content was not yet retrieved. You begin to explain to your room mate
how DataLad retrieves only minimal metadata about which files actually
exist in a dataset upon a :dlcmd:`clone`. "It's really handy",
you tell him. "This way you can decide which book you want to read,
and then retrieve what you need. Everything that is *annexed* is retrieved
on demand. Note though that the contents of the text files
are present, and the files can be opened -- this is because
these files are stored in :term:`Git`. So you already have my notes,
and you can decide for yourself whether you want to ``get`` the books."

To demonstrate this, you decide to examine the PDFs further.
"Try to get one of the books", you instruct your room mate:

.. runrecord:: _examples/DL-101-116-104
   :language: console
   :workdir: dl-101/mock_user/DataLad-101
   :notes: how does it feel to get a file?
   :cast: 04_collaboration

   $ datalad get books/progit.pdf

"Opening this file will work, because the content was retrieved from
the original dataset.", you explain, proud that this worked just as you
thought it would.

Let's now turn to the fact that the subdataset ``longnow`` contains neither
file content nor file metadata information to explore the contents of the
dataset: there are no subdirectories or any files under ``recordings/longnow/``.
This is behavior that you have not observed until now.

To fix this and obtain file availability metadata,
you have to run a somewhat unexpected command:

.. runrecord:: _examples/DL-101-116-107
   :language: console
   :workdir: dl-101/mock_user/DataLad-101
   :notes: how do we get the subdataset? currently it looks empty. --> datalad get -n
   :cast: 04_collaboration

   $ datalad get -n recordings/longnow

Before we look further into :dlcmd:`get` and the
``-n/--no-data`` option, let's first see what has changed after
running the above command (excerpt):

.. runrecord:: _examples/DL-101-116-108
   :language: console
   :workdir: dl-101/mock_user/DataLad-101
   :lines: 1-20
   :notes: what has changed? --> file metadata information!
   :cast: 04_collaboration

   $ tree

Interesting! The file metadata information is now present, and we can
explore the file hierarchy. The file content, however, is not present yet.

What has happened here?

When DataLad installs a dataset, it will by default only obtain the
superdataset, and not any subdatasets. The superdataset contains the
information that a subdataset exists though -- the subdataset is *registered*
in the superdataset. This is why the subdataset name exists as a directory.
A subsequent :dlcmd:`get -n path/to/longnow` will then install the registered
subdataset, just as we did in the example above.

But what about the ``-n`` option for :dlcmd:`get`?
Previously, we used :dlcmd:`get` to get file content. However,
:dlcmd:`get` operates on more than just the level of *files* or *directories*.
Instead, it can also operate on the level of *datasets*. Regardless of whether
it is a single file (such as ``books/TLCL.pdf``) or a registered subdataset
(such as ``recordings/longnow``), :dlcmd:`get` will operate on it to 1) install
it -- if it is a not yet installed subdataset -- and 2) retrieve the contents of any files.
That makes it very easy to get your file content, regardless of
how your dataset may be structured -- it is always the same command, and DataLad
blurs the boundaries between superdatasets and subdatasets.

In the above example, we called :dlcmd:`get` with the option ``-n/--no-data``.
This option prevents :dlcmd:`get` from obtaining the data of individual files or
directories, thus limiting its scope to the level of datasets as only a
:dlcmd:`clone` is performed. Without this option, the command would
have retrieved all of the subdataset's contents right away. But with ``-n/--no-data``,
it only installed the subdataset to retrieve the metadata about file availability.

.. index::
   pair: get all dataset content; with DataLad

To explicitly install all potential subdatasets *recursively*, that is,
all of the subdatasets inside it as well, one can give the
``-r``/``--recursive`` option to :dlcmd:`get`:

.. code-block:: console

   $ datalad get -n -r <subds>

This would install the ``subds`` subdataset and all potential further
subdatasets inside of it, and the metadata about file hierarchies would
be available right away for every subdataset inside of ``subds``. If you
had several subdatasets and did not provide a path to a single dataset,
but, say, the current directory (``.`` as in :dlcmd:`get -n -r .`), it
would clone all registered subdatasets recursively.

So why is a recursive get not the default behavior?
In :ref:`nesting` we learned that datasets can be nested *arbitrarily* deep.
Upon getting the metadata of one dataset you might not want to also install
a few dozen levels of nested subdatasets right away.

However, there is a middle way [#f1]_: The ``--recursion-limit`` option lets
you specify how many levels of subdatasets should be installed together
with the first subdataset:

.. code-block:: console

   $ datalad get -n -r --recursion-limit 1 <subds>

To summarize what you learned in this section, write a note on how to
install a dataset using a path as a source on a common file system.

Write this note in "your own" (the original) ``DataLad-101`` dataset, though!

.. runrecord:: _examples/DL-101-116-109
   :language: console
   :workdir: dl-101/mock_user/DataLad-101
   :notes: note in original DataLad-101 dataset
   :cast: 04_collaboration

   $ # navigate back into the original dataset
   $ cd ../../DataLad-101
   $ # write the note
   $ cat << EOT >> notes.txt
   A source to install a dataset from can also be a path, for example as
   in "datalad clone ../DataLad-101".

   Just as in creating datasets, you can add a description on the
   location of the new dataset clone with the -D/--description option.

   Note that subdatasets will not be installed by default, but are only
   registered in the superdataset -- you will have to do a
   "datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
   availability metadata. The -n/--no-data option prevents file
   contents from also being downloaded.

   Note that a recursive "datalad get" would install all further
   registered subdatasets underneath a subdataset, so a safer way to
   proceed is to set a decent --recursion-limit:
   "datalad get -n -r --recursion-limit 2 <subds>"

   EOT

Save this note.

.. runrecord:: _examples/DL-101-116-110
   :language: console
   :workdir: dl-101/DataLad-101
   :cast: 04_collaboration

   $ datalad save -m "add note about cloning from paths and recursive datalad get"

.. index::
   pair: clone; DataLad concept
.. gitusernote:: Get a clone

   A dataset that is installed from an existing source, e.g., a path or URL,
   is the DataLad equivalent of a *clone* in Git.

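
To illustrate this equivalence, here is a minimal sketch that uses plain Git
only. It assumes that ``git`` and ``mktemp`` are available. The directory
names ``original`` and ``mock_user`` as well as the identity passed via ``-c``
are made up for this example:

.. code-block:: bash

   # Throwaway sandbox so that no existing repository is touched
   sandbox=$(mktemp -d)
   cd "$sandbox"

   # Create a small repository with one committed text file
   git init -q original
   printf 'my course notes\n' > original/notes.txt
   git -C original add notes.txt
   git -C original -c user.name="Example" -c user.email="example@example.com" \
       commit -q -m "add notes"

   # Clone it with a path as the source, just as one would with a URL
   mkdir mock_user
   git clone -q original mock_user/original

   # The clone remembers where it came from as its "origin" remote
   origin_url=$(git -C mock_user/original remote get-url origin)
   cloned_notes=$(cat mock_user/original/notes.txt)
   echo "origin: $origin_url"
   echo "notes:  $cloned_notes"

As with the text files in the cloned dataset, the content of ``notes.txt`` is
available right after cloning, because it is stored in Git.
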
.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-116-111
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_looking_without_touching

.. rubric:: Footnotes

.. [#f1] Another alternative to a recursion limit to :dlcmd:`get -n -r` is
         a dataset configuration that specifies subdatasets that should *not* be
         cloned recursively, unless explicitly given to the command with a path. With
         this configuration, a superdataset's maintainer can safeguard users and prevent
         potentially large numbers of subdatasets from being cloned.
         You can learn more about this configuration in the section :ref:`config2`.