  1. .. index::
  2. pair: clone; DataLad command
  3. pair: clone dataset; with DataLad
  4. .. _installds:
  5. Install datasets
  6. ----------------
  7. So far, we have created a ``DataLad-101`` course dataset. We saved some additional readings
  8. into the dataset, and have carefully made and saved notes on the DataLad
  9. commands we discovered. Up to this point, we therefore know the typical, *local*
  10. workflow to create and populate a dataset from scratch.
  11. But we've been told that with DataLad we could very easily get vast amounts of data to our
  12. computer. Rumor has it that this would be only a single command in the terminal!
  13. Therefore, everyone in today's lecture excitedly awaits today's topic: Installing datasets.
"With DataLad, users can install *clones* of existing DataLad datasets from paths, URLs, or
open-data collections", our lecturer begins.
  16. "This makes accessing data fast and easy. A dataset that others could install can be
  17. created by anyone, without a need for additional software. Your own datasets can be
  18. installed by others, should you want that, for example. Therefore, not only accessing
  19. data becomes fast and easy, but also *sharing*."
  20. "That's so cool!", you think. "Exam preparation will be a piece of cake if all of us
  21. can share our mid-term and final projects easily!"
  22. "But today, let's only focus on how to install a dataset", she continues.
"Damn it! Can we not have longer lectures?", you think and set alarms for all of the
  24. upcoming lecture dates in your calendar.
  25. There is so much exciting stuff to come, you cannot miss a single one.
  26. "Psst!" a student from the row behind reaches over. "There are
  27. a bunch of audio recordings of a really cool podcast, and they have been shared in the form
  28. of a DataLad dataset! Shall we try whether we can install that?"
  29. "Perfect! What a great way to learn how to install a dataset. Doing it
  30. now instead of looking at slides for hours is my preferred type of learning anyway",
  31. you think as you fire up your terminal and navigate into your ``DataLad-101`` dataset.
  32. In this demonstration, we are using one of the many openly available datasets that
  33. DataLad provides in a public registry that anyone can access. One of these datasets is a
  34. collection of audio recordings of a great podcast, the longnow seminar series [#f2]_.
  35. It consists of audio recordings about long-term thinking, and while the DataLad-101
  36. course is not a long-term thinking seminar, those recordings are nevertheless a
  37. good addition to the large stash of yet-to-read text books we piled up.
  38. Let's get this dataset into our existing ``DataLad-101`` dataset.
To keep the ``DataLad-101`` dataset neat and organized, we first create a new directory
called ``recordings``.
  41. .. runrecord:: _examples/DL-101-105-101
  42. :language: console
  43. :workdir: dl-101/DataLad-101
  44. :cast: 01_dataset_basics
  45. :notes: The next challenge is to clone an existing dataset from the web as a subdataset. First, we create a location for this
  46. $ # we are in the root of DataLad-101
  47. $ mkdir recordings
  48. The command that can be used to obtain a dataset is :dlcmd:`clone`,
but we often refer to the process of cloning a dataset as *installing*.
  50. Let's install the longnow podcasts in this new directory.
  51. The :dlcmd:`clone` command takes a location of an existing dataset to clone. This *source*
  52. can be a URL or a path to a local directory, or an SSH server [#f1]_. The dataset
  53. to be installed lives on :term:`GitHub`, at
  54. `https://github.com/datalad-datasets/longnow-podcasts.git <https://github.com/datalad-datasets/longnow-podcasts>`_,
  55. and we can give its GitHub URL as the first positional argument.
Optionally, the command also takes a second positional argument: a path to the *destination*,
that is, the location where we want to install the dataset. In this case it is ``recordings/longnow``.
  58. Because we are installing a dataset (the podcasts) into an existing dataset (the ``DataLad-101``
  59. dataset), we also supply a ``-d/--dataset`` flag to the command.
  60. This specifies the dataset to perform the operation on, and allows us to install
  61. the podcasts as a *subdataset* of ``DataLad-101``. Because we are in the root
  62. of the ``DataLad-101`` dataset, the pointer to the dataset is a ``.`` (which is Unix'
  63. way of saying "current directory").
  64. As before with long commands, we line break the code with a ``\``. You can
  65. copy it as it is presented here into your terminal, but in your own work you
  66. can write commands like this into a single line.
  67. .. runrecord:: _examples/DL-101-105-102
  68. :language: console
  69. :workdir: dl-101/DataLad-101/
  70. :cast: 01_dataset_basics
  71. :notes: We need to clone the dataset as a subdataset. For this, we use the datalad clone command with a --dataset option and a path. Else the dataset would not be registered as a subdataset!
  72. $ datalad clone --dataset . \
  73. https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
  74. This command copied the repository found at the URL https://github.com/datalad-datasets/longnow-podcasts
  75. into the existing ``DataLad-101`` dataset, into the directory ``recordings/longnow``.
  76. The optional destination is helpful: If we had not specified the path
  77. ``recordings/longnow`` as a destination for the dataset clone, the command would
  78. have installed the dataset into the root of the ``DataLad-101`` dataset, and instead
  79. of ``longnow`` it would have used the name of the remote repository "``longnow-podcasts``".
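The naming rule can be tried out without touching the network. The following is only a sketch of the convention described above, not DataLad's actual implementation: it takes the last path component of the URL and strips a trailing ``.git``. The variable names are made up for illustration.

```shell
# Illustration only: derive the default directory name from a clone URL
url="https://github.com/datalad-datasets/longnow-podcasts.git"

# basename keeps the last path component and removes the ".git" suffix
default_name="$(basename "$url" .git)"

echo "Without a destination, the clone would land in: $default_name"
```

Passing ``recordings/longnow`` as the destination replaces this default with a location and name of your own choosing.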
  80. But the coolest feature of :dlcmd:`clone` is yet invisible: This command
  81. also recorded where this dataset came from, thus capturing its *origin* as
  82. :term:`provenance`. Even though this is not obvious at this point in time, later
  83. chapters in this handbook will demonstrate how useful this information can be.
  84. .. index::
  85. pair: clone; DataLad concept
  86. .. gitusernote:: Clone internals
  87. The :dlcmd:`clone` command uses :gitcmd:`clone`.
  88. A dataset that is installed from an existing source, e.g., a path or URL,
  89. is the DataLad equivalent of a *clone* in Git.
  90. .. index::
  91. pair: clone into another dataset; with DataLad
  92. .. find-out-more:: Do I have to install from the root of datasets?
  93. No. Instead of from the *root* of the ``DataLad-101`` dataset, you could have also
  94. installed the dataset from within the ``recordings``, or ``books`` directory.
  95. In the case of installing datasets into existing datasets you however need
  96. to adjust the paths that are given with the ``-d/--dataset`` option:
  97. ``-d`` needs to specify the path to the root of the dataset. This is
  98. important to keep in mind whenever you do not execute the :dlcmd:`clone` command
from the root of this dataset. Luckily, there is a shortcut: ``-d^`` will always
point to the root of the top-most dataset. For example, if you navigate into ``recordings``,
  101. the command would be:
  102. .. code-block:: console
  103. $ datalad clone -d^ https://github.com/datalad-datasets/longnow-podcasts.git longnow
  104. .. find-out-more:: What if I do not install into an existing dataset?
  105. If you do not install into an existing dataset, you only need to omit the ``-d/--dataset``
  106. option. You can try:
  107. .. code-block:: console
  108. $ datalad clone https://github.com/datalad-datasets/longnow-podcasts.git
  109. anywhere outside of your ``DataLad-101`` dataset to install the podcast dataset into a new directory
  110. called ``longnow-podcasts``. You could even do this inside of an existing dataset.
However, whenever you install datasets into other datasets, the ``-d/--dataset``
  112. option is necessary to not only install the dataset, but also *register* it
  113. automatically into the higher level *superdataset*. The upcoming section will
  114. elaborate on this.
  115. Here is the repository structure:
  116. .. index::
  117. pair: tree; terminal command
  118. pair: display directory tree; on Windows
  119. .. windows-wit:: use tree
  120. .. include:: topic/tree-windows.rst
  121. .. runrecord:: _examples/DL-101-105-103
  122. :language: console
  123. :workdir: dl-101/DataLad-101
  124. :cast: 01_dataset_basics
  125. :notes: Let's take a look at the directory structure after cloning
  126. $ tree -d # we limit the output to directories
  127. We can see that ``recordings`` has one subdirectory, our newly installed ``longnow``
  128. dataset with two subdirectories.
  129. If we navigate into one of them and list its content, we'll see many ``.mp3`` files (here is an excerpt).
  130. .. runrecord:: _examples/DL-101-105-104
  131. :language: console
  132. :workdir: dl-101/DataLad-101/
  133. :lines: 1-15
  134. :cast: 01_dataset_basics
:notes: And now let's look into these seminar series folders: There are hundreds of mp3 files, yet the download only took a few seconds! How can that be?
  136. $ cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
  137. $ ls
  138. Dataset content identity and availability information
  139. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  140. Surprised, you turn to your fellow student and wonder about
  141. how fast the dataset was installed. Should
  142. a download of that many ``.mp3`` files not take much more time?
Here you can see another important feature of DataLad datasets
  144. and the :dlcmd:`clone` command:
  145. Upon installation of a DataLad dataset, DataLad retrieves only small files
  146. (for example, text files or markdown files) and (small) metadata
  147. about the dataset. It does not, however, download any large files
  148. (yet). The metadata exposes the dataset's file hierarchy
  149. for exploration (note how you are able to list the dataset contents with ``ls``),
  150. and downloading only this metadata speeds up the installation of a DataLad dataset
  151. of many TB in size to a few seconds. Just now, after installing, the dataset is
  152. small in size:
  153. .. index::
  154. pair: show file size; in a terminal
  155. .. runrecord:: _examples/DL-101-105-105
  156. :language: console
  157. :workdir: dl-101/DataLad-101/recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
  158. :cast: 01_dataset_basics
:notes: Upon cloning of a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size. The files are non-functional at the moment (try opening one)
  160. $ cd ../ # in longnow/
  161. $ du -sh # Unix command to show size of contents
  162. This is tiny indeed!
  163. If you executed the previous ``ls`` command in your own terminal, you might have seen
the ``.mp3`` files highlighted in a different color than usual.
  165. On your computer, try to open one of the ``.mp3`` files.
  166. You will notice that you cannot open any of the audio files.
  167. This is not your fault: *None of these files exist on your computer yet*.
  168. Wait, what?
  169. This sounds strange, but it has many advantages. Apart from a fast installation,
  170. it allows you to retrieve precisely the content you need, instead of all the contents
of a dataset. Thus, even if you install a dataset that is many TB in size,
it takes up only a few MB of space after the install, and you can retrieve only those
  173. components of the dataset that you need.
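If you would like to see in the terminal why such a file is listed but cannot be opened, the following sketch may help. It assumes a typical Unix-like file system, where a not-yet-retrieved file appears as a symbolic link whose target is missing (section :ref:`symlink` has the details). The example does not touch the ``longnow`` dataset; it simulates such a placeholder in a temporary directory, and the file name and the ``has_content`` helper are hypothetical.

```shell
# Simulate a placeholder: a symbolic link whose target does not exist
demo_dir="$(mktemp -d)"
ln -s .git/annex/objects/not-yet-retrieved "$demo_dir/episode.mp3"

has_content() {
    # -e follows symbolic links, so it only succeeds if the target exists
    [ -e "$1" ]
}

if has_content "$demo_dir/episode.mp3"; then
    state="present"
else
    state="not retrieved"
fi

# ls would still list the name, which is why the file hierarchy is visible
echo "episode.mp3: $state"
```

The same ``[ -e FILE ]`` check should succeed for a real file in the dataset once its content has been retrieved.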
  174. Let's see how large the dataset would be in total if all of the files were present.
  175. For this, we supply an additional option to :dlcmd:`status`. Make sure to be
  176. (somewhere) inside of the ``longnow`` dataset to execute the following command:
  177. .. runrecord:: _examples/DL-101-105-106
  178. :language: console
  179. :workdir: dl-101/DataLad-101/recordings/longnow
  180. :cast: 01_dataset_basics
  181. :notes: But how large would the dataset be if we had all the content?
  182. $ datalad status --annex
  183. Woah! More than 200 files, totaling more than 15 GB?
  184. You begin to appreciate that DataLad did not
  185. download all of this data right away! That would have taken hours given the crappy
  186. internet connection in the lecture hall, and you are not even sure whether your
  187. hard drive has much space left...
But you are nevertheless curious about how to actually listen to one of these ``.mp3``\s now.
  189. So how does one actually "get" the files?
  190. .. index::
  191. pair: get; DataLad command
  192. The command to retrieve file content is :dlcmd:`get`.
  193. You can specify one or more specific files, or ``get`` all of the dataset by
  194. specifying :dlcmd:`get .` at the root directory of the dataset (with ``.`` denoting "current directory").
  195. .. index::
  196. pair: get file content; with DataLad
  197. First, we get one of the recordings in the dataset -- take any one of your choice
  198. (here, it's the first).
  199. .. runrecord:: _examples/DL-101-105-107
  200. :language: console
  201. :workdir: dl-101/DataLad-101/recordings/longnow
  202. :cast: 01_dataset_basics
  203. :notes: Now let's finally get some content in this dataset. This is done with the datalad get command
  204. $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
  205. Try to open it -- it will now work.
If you wanted to get the rest of the missing data, instead of specifying all files individually,
you could use ``.`` to refer to *all* of the dataset like this:
  208. .. code-block:: console
  209. $ datalad get .
  210. However, with a total size of more than 15GB, this might take a while, so do not do that now.
  211. If you did execute the command above, interrupt it by pressing ``CTRL`` + ``C`` -- Do not worry,
  212. this will not break anything.
  213. .. index::
  214. pair: show dataset size; with DataLad
  215. Isn't that easy?
  216. Let's see how much content is now present locally. For this, :dlcmd:`status --annex all`
  217. has a nice summary:
  218. .. runrecord:: _examples/DL-101-105-108
  219. :language: console
  220. :workdir: dl-101/DataLad-101/recordings/longnow
  221. :cast: 01_dataset_basics
  222. :notes: DataLad status can also summarize how much of the content is already present locally:
  223. $ datalad status --annex all
  224. This shows you how much of the total content is present locally. With one file,
  225. it is only a fraction of the total size.
  226. Let's ``get`` a few more recordings, just because it was so mesmerizing to watch
  227. DataLad's fancy progress bars.
  228. .. runrecord:: _examples/DL-101-105-109
  229. :language: console
  230. :workdir: dl-101/DataLad-101/recordings/longnow
  231. :cast: 01_dataset_basics
  232. :notes: Let's get a few more files. Note how already obtained files are not downloaded again:
  233. $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
  234. Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
  235. Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
  236. Note that any data that is already retrieved (the first file) is not downloaded again.
  237. DataLad summarizes the outcome of the execution of ``get`` in the end and informs
  238. that the download of one file was ``notneeded`` and the retrieval of the other files was ``ok``.
  239. .. index::
  240. pair: get; DataLad concept
  241. .. gitusernote:: Get internals
  242. :dlcmd:`get` uses :gitannexcmd:`get` underneath the hood.
  243. .. index::
  244. pair: drop file content; with DataLad
  245. Keep whatever you like
  246. ^^^^^^^^^^^^^^^^^^^^^^
  247. "Oh shit, oh shit, oh shit..." you hear from right behind you. Your fellow student
  248. apparently downloaded the *full* dataset accidentally. "Is there a way to get rid
of file contents in a dataset, too?", they ask. "Yes", the lecturer responds,
  250. "you can remove file contents by using :dlcmd:`drop`. This is
  251. really helpful to save disk space for data you can easily reobtain, for example".
  252. .. index::
  253. pair: drop; DataLad command
  254. The :dlcmd:`drop` command will remove
  255. file contents completely from your dataset.
  256. You should only use this command to remove contents that you can :dlcmd:`get`
again, or generate again (for example, with next chapter's :dlcmd:`run`
  258. command), or that you really do not need anymore.
  259. Let's remove the content of one of the files that we have downloaded, and check
  260. what this does to the total size of the dataset. Here is the current amount of
  261. retrieved data in this dataset:
  262. .. runrecord:: _examples/DL-101-105-110
  263. :language: console
  264. :workdir: dl-101/DataLad-101/recordings/longnow
  265. $ datalad status --annex all
  266. We drop a single recording's content that we previously downloaded with
  267. :dlcmd:`get` ...
  268. .. runrecord:: _examples/DL-101-105-111
  269. :language: console
  270. :workdir: dl-101/DataLad-101/recordings/longnow
  271. $ datalad drop Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
  272. ... and check the size of the dataset again:
  273. .. runrecord:: _examples/DL-101-105-112
  274. :language: console
  275. :workdir: dl-101/DataLad-101/recordings/longnow
  276. $ datalad status --annex all
  277. Dropping the file content of one ``mp3`` file saved roughly 40MB of disk space.
  278. Whenever you need the recording again, it is easy to re-retrieve it:
  279. .. runrecord:: _examples/DL-101-105-113
  280. :language: console
  281. :workdir: dl-101/DataLad-101/recordings/longnow
  282. $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
  283. Reobtained!
  284. This was only a quick digression into :dlcmd:`drop`. The main principles
  285. of this command will become clear after chapter
  286. :ref:`chapter_gitannex`, and its precise use is shown in the paragraph on
  287. :ref:`removing file contents <remove>`.
At this point, however, you already know that datasets allow you to
  289. :dlcmd:`drop` file contents flexibly. If you want to, you could have more
  290. podcasts (or other data) on your computer than you have disk space available
  291. by using DataLad datasets -- and that really is a cool feature to have.
  292. Dataset archeology
  293. ^^^^^^^^^^^^^^^^^^
  294. You have now experienced how easy it is to (re)obtain shared data with DataLad.
  295. But beyond sharing only the *data* in the dataset, when sharing or installing
  296. a DataLad dataset, all copies also include the dataset's *history*.
  297. .. index::
  298. pair: log; Git command
  299. pair: show history (reverse); with Git
  300. For example, we can find out who created the dataset in the first place
  301. (the output shows an excerpt of ``git log --reverse``, which displays the
  302. history from first to most recent commit):
  303. .. runrecord:: _examples/DL-101-105-114
  304. :language: console
  305. :workdir: dl-101/DataLad-101/recordings/longnow
  306. :emphasize-lines: 3
  307. :lines: 1-13
  308. :cast: 01_dataset_basics
  309. :notes: On Dataset nesting: You have seen the history of DataLad-101. But the subdataset has a standalone history as well! We can find out who created it!
  310. $ git log --reverse
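The effect of ``--reverse`` is easy to try out in a throwaway repository, without touching any dataset. This is a minimal sketch: the temporary repository, the commit messages, and the placeholder author identity are all made up for illustration.

```shell
# Create a throwaway repository with two (empty) commits
repo="$(mktemp -d)"
git -C "$repo" init --quiet
for msg in "first commit" "second commit"; do
    git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet --allow-empty -m "$msg"
done

# Default order: the most recent commit comes first
newest_first="$(git -C "$repo" log --format=%s | head -n 1)"
# With --reverse: the oldest commit comes first
oldest_first="$(git -C "$repo" log --reverse --format=%s | head -n 1)"

echo "default:   $newest_first"
echo "--reverse: $oldest_first"
```

In a cloned dataset, the commit shown first by ``git log --reverse`` is therefore the one that started the dataset's history.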
  311. But that's not all. The seminar series is ongoing, and more recordings can get added
  312. to the original repository shared on GitHub.
  313. Because an installed dataset knows the dataset it was installed from,
  314. your local dataset clone can be updated from its origin, and thus get the new recordings,
  315. should there be some. Later in this handbook, we will see examples of this.
  316. .. index::
  317. pair: update heredoc; in a terminal
  318. pair: save dataset modification; with DataLad
  319. Now you can not only create datasets and work with them locally, you can also consume
  320. existing datasets by installing them. Because that's cool, and because you will use this
  321. command frequently, make a note of it into your ``notes.txt``, and :dlcmd:`save` the
  322. modification.
  323. .. runrecord:: _examples/DL-101-105-115
  324. :language: console
  325. :workdir: dl-101/DataLad-101/recordings/longnow
  326. :cast: 01_dataset_basics
  327. :notes: We can make a note about this:
  328. $ # in the root of DataLad-101:
  329. $ cd ../../
  330. $ cat << EOT >> notes.txt
  331. The command 'datalad clone URL/PATH [PATH]' installs a dataset from
  332. e.g., a URL or a path. If you install a dataset into an existing
  333. dataset (as a subdataset), remember to specify the root of the
  334. superdataset with the '-d' option.
  335. EOT
  336. $ datalad save -m "Add note on datalad clone"
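The note above is written with a heredoc and the ``>>`` redirection, which appends to the file instead of overwriting it. The sketch below shows this on a temporary file rather than on ``notes.txt``; the file and its contents are placeholders.

```shell
# Start with a file that already contains one line
notes="$(mktemp)"
echo "an earlier note" > "$notes"

# Append everything up to the closing EOT marker; ">>" keeps existing lines
cat << EOT >> "$notes"
a new note
EOT

# Both the old and the new line are now in the file
line_count="$(wc -l < "$notes" | tr -d ' ')"
echo "lines in file: $line_count"
```

Using a single ``>`` instead would have replaced the earlier note, so ``>>`` is the safer choice for a growing notes file.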
  337. .. index::
  338. pair: placeholder files; on Mac
  339. .. importantnote:: Empty files can be confusing
  340. Listing files directly after the installation of a dataset will
  341. work if done in a terminal with ``ls``.
  342. However, certain file managers (such as OSX's Finder [#f3]_) may fail to
  343. display files that are not yet present locally (i.e., before a
  344. :dlcmd:`get` was run). Therefore, be mindful when exploring
  345. a dataset hierarchy with a file manager -- it might not show you
  346. the available but not yet retrieved files.
  347. Consider browsing datasets with the :term:`DataLad Gooey` to be on the safe side.
Section :ref:`symlink` explains why this happens.
  349. .. only:: adminmode
  350. Add a tag at the section end.
  351. .. runrecord:: _examples/DL-101-105-116
  352. :language: console
  353. :workdir: dl-101/DataLad-101
  354. $ git branch sct_install_datasets
  355. .. rubric:: Footnotes
  356. .. [#f1] Additionally, a source can also be a pointer to an open-data collection,
  357. for example :term:`the DataLad superdataset ///` -- more on what this is and how to
  358. use it later, though.
  359. .. [#f2] The longnow podcasts are lectures and conversations on long-term thinking produced by
  360. the LongNow foundation and we can wholeheartedly recommend them for their worldly
  361. wisdoms and compelling, thoughtful ideas. Subscribe to the podcasts at https://longnow.org/seminars/podcast.
  362. Support the foundation by becoming a member: https://longnow.org/join.
.. [#f3] You can also upgrade your file manager to display file types in
DataLad datasets (e.g., with the
`git-annex-turtle extension <https://github.com/andrewringler/git-annex-turtle>`_
for Finder).