  1. .. _populate:
  2. Populate a dataset
  3. ------------------
  4. The first lecture in DataLad-101 referenced some useful literature.
  5. Even if we end up not reading those books at all, let's download
  6. them nevertheless and put them into our dataset. You never know, right?
  7. Let's first create a directory to save books for additional reading in.
  8. .. runrecord:: _examples/DL-101-102-101
  9. :language: console
  10. :workdir: dl-101/DataLad-101
  11. :cast: 01_dataset_basics
  12. :notes: The dataset is empty, let's put some PDFs inside. First, create a directory to store them in:
  13. $ mkdir books
  14. .. index::
  15. pair: tree; terminal command
  16. Let's take a look at the current directory structure with the tree command [#f1]_:
  17. .. runrecord:: _examples/DL-101-102-102
  18. :language: console
  19. :workdir: dl-101/DataLad-101
  20. :cast: 01_dataset_basics
  21. :notes: The tree command shows us the directory structure in the dataset. Apart from the directory, it's empty.
  22. $ tree
  23. Arguably, not the most exciting thing to see. So let's put some PDFs inside.
  24. Below is a short list of optional readings. We decide to download them (they
  25. are all free, in total about 15 MB), and save them in ``DataLad-101/books``.
  26. - Additional reading about the command line: `The Linux Command Line <https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download>`_
  27. - An intro to Python: `A byte of Python <https://github.com/swaroopch/byte-of-python/releases/download/vadb91fc6fce27c58e3f931f5861806d3ccd1054c/byte-of-python.pdf>`_
  28. You can either visit the links and save them in ``books/``,
  29. or run the following commands [#f2]_ to download the books right from the terminal.
  30. Note that we break each command across lines with ``\`` line continuation characters. In your own work you can write
  31. commands like this on a single line. If you copy them into your terminal as they
  32. are presented here, make sure to check the :windows-wit:`on peculiarities of its terminals
  33. <ww-no-multiline-commands>`.
  34. .. index::
  35. pair: line continuation; on Windows in a terminal
  36. .. windows-wit:: Terminals other than Git Bash can't handle multi-line commands
  37. :name: ww-no-multiline-commands
  38. .. include:: topic/terminal-linecontinuation.rst
  39. .. index::
  40. pair: download file; with wget
  41. .. runrecord:: _examples/DL-101-102-103
  42. :language: console
  43. :workdir: dl-101/DataLad-101
  44. :cast: 01_dataset_basics
  45. :notes: We use wget to download a few books from the web. CAVE: longish realcommand!
  46. $ cd books
  47. $ wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download \
  48. -O TLCL.pdf
  49. $ wget -q https://github.com/swaroopch/byte-of-python/releases/download/vadb91fc6fce27c58e3f931f5861806d3ccd1054c/byte-of-python.pdf \
  50. -O byte-of-python.pdf
  51. $ # get back into the root of the dataset
  52. $ cd ../
  53. Some machines will not have :shcmd:`wget` available by default, but any command that can
  54. download a file can work as an alternative. See the :windows-wit:`for the popular alternative
  55. curl <ww-curl-instead-wget>`.
  56. .. index::
  57. pair: curl instead of wget; on Windows
  58. pair: download file; with curl
  59. .. windows-wit:: You can use curl instead of wget
  60. :name: ww-curl-instead-wget
  61. .. include:: topic/curl-instead-wget.rst
  62. Let's see what happened. First of all, in the root of ``DataLad-101``, show the directory
  63. structure with tree:
  64. .. runrecord:: _examples/DL-101-102-104
  65. :language: console
  66. :workdir: dl-101/DataLad-101
  67. :cast: 01_dataset_basics
  68. :notes: Here they are:
  69. $ tree
  70. .. index::
  71. pair: status; DataLad command
  72. pair: check dataset for modification; with DataLad
  73. Now what does DataLad do with this new content? One command you will use very
  74. often is :dlcmd:`status`.
  75. It reports on the state of dataset content, and
  76. regular status reports should become a habit over the course of ``DataLad-101``.
  77. .. runrecord:: _examples/DL-101-102-105
  78. :language: console
  79. :workdir: dl-101/DataLad-101
  80. :cast: 01_dataset_basics
  81. :notes: What has happened to our dataset now with this new content? We can use datalad status to find out:
  82. $ datalad status
  83. .. index::
  84. pair: save; DataLad command
  85. pair: save dataset modification; with DataLad
  86. Interesting; the ``books/`` directory is "untracked". Remember how content
  87. *can* be tracked *if a user wants to*?
  88. Untracked means that DataLad does not know about this directory or its content,
  89. because we have not instructed DataLad to actually track it. This means that DataLad
  90. does not store the downloaded books in its history yet. Let's change this by
  91. *saving* the files to the dataset's history with the :dlcmd:`save` command.
  92. This time, it is your turn to specify a helpful :term:`commit message`
  93. with the ``-m`` option (although the DataLad command is :dlcmd:`save`, we talk
  94. about commit messages because :dlcmd:`save` ultimately uses the command
  95. :gitcmd:`commit` to do its work):
  96. .. runrecord:: _examples/DL-101-102-106
  97. :language: console
  98. :workdir: dl-101/DataLad-101
  99. :cast: 01_dataset_basics
  100. :notes: ATM the files are untracked and thus unknown to any version control system. In order to version control the PDFs we need to save them. We attach a meaningful summary of this with the -m option:
  101. $ datalad save -m "add books on Python and Unix to read later"
  102. If you ever forget to specify a message, or make a typo, not all is lost. A
  103. :find-out-more:`explains how to amend a saved state <fom-amend-save>`.
  104. .. index::
  105. pair: amend commit message; with Git
  106. .. find-out-more:: "Oh no! I forgot the -m option for 'datalad save'!"
  107. :name: fom-amend-save
  108. :float:
  109. If you forget to specify a commit message with the ``-m`` option, DataLad will write
  110. ``[DATALAD] Recorded changes`` as a commit message into your history.
  111. This is not particularly informative.
  112. You can change the *last* commit message with the Git command
  113. :gitcmd:`commit --amend`. This will open up your default editor
  114. and you can edit
  115. the commit message. Careful -- the default editor might be :term:`vim`!
  116. The section :ref:`history` will show you many more ways in which you can
  117. interact with a dataset's history.
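As a minimal sketch of such an amendment, the following commands use a throwaway Git repository. The temporary directory, the file name, and the author identity are made up for illustration and are not part of ``DataLad-101``. Passing ``-m`` to ``git commit --amend`` replaces the message directly, so no editor opens:

```shell
# Scratch repository for illustration only (hypothetical paths and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Example"
git config user.email "elena@example.com"

echo "some notes" > notes.txt
git add notes.txt
# Mimic a save without -m by using DataLad's default message
git commit -q -m "[DATALAD] Recorded changes"

# Replace the message of the most recent commit without opening an editor
git commit -q --amend -m "Add first notes"

# Show the title line of the last commit
git log -1 --format=%s
```

Amending rewrites the last commit instead of adding a new one, so the history still contains a single commit afterwards.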
  118. As already noted, any files you ``save`` in this dataset, and all modifications
  119. to these files that you ``save``, are tracked in this history.
  120. Importantly, this file tracking works
  121. regardless of the size of the files -- a DataLad dataset could be
  122. your private music or movie collection with single files being many GB in size.
  123. This is one aspect that distinguishes DataLad from many other
  124. version control tools, among them Git.
  125. Large content is tracked in an *annex* that is automatically
  126. created and handled by DataLad. Whether text files or larger files change,
  127. all of these changes can be written to your DataLad dataset's history.
  128. .. index::
  129. pair: log; Git command
  130. pair: show last commit; with Git
  131. Let's see how the saved content shows up in the history of the dataset with :gitcmd:`log`.
  132. The option ``-n 1`` specifies that we want to take a look at the most recent commit.
  133. In order to get a bit more details, we add the ``-p`` flag. If you end up in a
  134. :term:`pager`, navigate with up and down arrow keys and leave the log by typing ``q``:
  135. .. runrecord:: _examples/DL-101-102-107
  136. :language: console
  137. :workdir: dl-101/DataLad-101
  138. :lines: 1-20
  139. :emphasize-lines: 3-4, 6, 8, 12, 16, 20
  140. :cast: 01_dataset_basics
  141. :notes: Save command reports what has been added to the dataset. Now we can see what this action looks like in our dataset's history:
  142. $ git log -p -n 1
  143. Now this might look a bit cryptic (and honestly, tig [#f3]_ makes it look prettier).
  144. But this tells us the date and time at which a particular author added two PDFs to
  145. the directory ``books/``, and thanks to that commit message we have a nice
  146. human-readable summary of that action. A :find-out-more:`explains what makes
  147. a good message <fom-commit-message-guidance>`.
  148. .. index::
  149. pair: recommendation; commit message
  150. .. find-out-more:: DOs and DON'Ts for commit messages
  151. :name: fom-commit-message-guidance
  152. :float: tbp
  153. **DOs**
  154. - Write a *title line* with 72 characters or less
  155. - Use imperative voice, e.g., "Add notes from lecture 2"
  156. - If a title line is not enough to express your changes and the reasoning behind them, add a body to your commit message: hit enter twice (before closing the quotation marks), and continue writing a brief summary of the changes after a blank line. This summary should explain "what" has been done and "why", but not "how". Close the quotation marks, and hit enter to save the change with your message.
  157. **DON'Ts**
  158. - Do not use passive voice
  159. - Extensive formatting (hashes, asterisks, quotes, ...) will most likely make your shell complain
  160. - Do not say nasty things about other people
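To illustrate the title-plus-body layout described in the DOs, here is a small sketch with plain Git in a throwaway repository. The directory, file, identity, and message text are invented for this example. The same quoting with a blank line inside the quotation marks should also work for the ``-m`` option of ``datalad save``, though that is not shown here:

```shell
# Scratch repository for illustration only (hypothetical paths and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Example"
git config user.email "elena@example.com"

echo "lecture 2" > notes.txt
git add notes.txt

# Title line, blank line, then a body that explains "what" and "why".
# The quotation marks stay open across the line breaks.
git commit -q -m "Add notes from lecture 2

The notes summarize the section on saving content, so that
the commands can be looked up without the slides."

# Print the title line, then the body, of the last commit
git log -1 --format=%s
git log -1 --format=%b
```

Git treats everything before the first blank line as the title, which is what compact views such as ``git log --oneline`` display.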
  161. .. index::
  162. pair: no staging; with DataLad
  163. .. gitusernote:: There is no staging area in DataLad
  164. Just as in Git, new files are not tracked from their creation on, but only when
  165. explicitly added to Git (in Git terms, with an initial :gitcmd:`add`). But different
  166. from the common Git workflow, DataLad skips the staging area. A :dlcmd:`save`
  167. combines a :gitcmd:`add` and a :gitcmd:`commit`, and therefore, the commit message
  168. is specified with :dlcmd:`save`.
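For readers less familiar with Git, the two steps that a save combines can be sketched with plain Git in a throwaway repository. The directory, file name, and identity below are hypothetical, and the sketch leaves out everything DataLad does in addition, such as handling large files:

```shell
# Scratch repository for illustration only (hypothetical paths and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Example"
git config user.email "elena@example.com"

mkdir books
echo "placeholder" > books/example.txt

# Step 1 in plain Git: put the new file into the staging area
git add books/example.txt

# Step 2 in plain Git: record the staged content with a message
git commit -q -m "Add example file"

# Nothing is left to report once both steps are done
git status --short
```

With DataLad, a single save call with a message and an optional path takes the place of both steps.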
  169. Cool, so now you have added some files to your dataset history. But what is a bit
  170. inconvenient is that both books were saved *together*. You begin to wonder: "A Python
  171. book and a Unix book do not have that much in common. I probably should not save them
  172. in the same commit. And ... what happens if I have files I do not want to track?
  173. :dlcmd:`save -m "some commit message"` would save all of what is currently
  174. untracked or modified in the dataset into the history!"
  175. Regarding your first remark, you are absolutely right!
  176. It is good practice to save only those changes
  177. together that belong together. We do not want to squish completely unrelated changes
  178. into the same spot of our history, because it would get very nasty should we want to
  179. revert *some* of the changes without affecting others in this commit.
  180. Luckily, we can point :dlcmd:`save` to exactly the changes we want it to record.
  181. Let's try this by adding yet another book, a good reference work about Git,
  182. `Pro Git <https://git-scm.com/book/en/v2>`_:
  183. .. runrecord:: _examples/DL-101-102-108
  184. :language: console
  185. :workdir: dl-101/DataLad-101
  186. :cast: 01_dataset_basics
  187. :notes: It's inconvenient that we saved two books together - we should have saved them as independent modifications of the dataset. To see how single modifications can be saved, let's download another book
  188. $ cd books
  189. $ wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
  190. $ cd ../
  191. :dlcmd:`status` shows that there is a new untracked file:
  192. .. runrecord:: _examples/DL-101-102-109
  193. :language: console
  194. :workdir: dl-101/DataLad-101
  195. :cast: 01_dataset_basics
  196. :notes: Check the dataset state with the status command frequently
  197. $ datalad status
  198. Let's give :dlcmd:`save` precisely this file by specifying its path after the commit message:
  199. .. runrecord:: _examples/DL-101-102-110
  200. :language: console
  201. :workdir: dl-101/DataLad-101
  202. :cast: 01_dataset_basics
  203. :notes: To save a single modification, provide a path to it!
  204. $ datalad save -m "add reference book about git" books/progit.pdf
  205. Regarding your second remark, you are right that a :dlcmd:`save` without a
  206. path specification would write all of the currently untracked files or modifications
  207. to the history. But check the :find-out-more:`on how to tell it otherwise <fom-save-updated-only>`.
  208. .. index::
  209. pair: save already tracked files only; with DataLad
  210. .. find-out-more:: How to save already tracked dataset components only?
  211. :name: fom-save-updated-only
  212. :float:
  213. A :dlcmd:`save -m "concise message" --updated` (or the shorter
  214. form of ``--updated``, ``-u``) will only write *modifications* to the
  215. history, not untracked files. Later, we will also see ``.gitignore`` files
  216. that let you hide content from version control. However, it is good
  217. practice to safely store away modifications or new content. This improves
  218. your dataset and workflow, and will be a requirement for executing certain
  219. commands.
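A rough analogy in plain Git, offered as an illustration and not as a description of how DataLad implements ``--updated``, is ``git add -u``: it stages modifications of files that are already tracked and leaves untracked files alone. The repository, file names, and identity in this sketch are made up:

```shell
# Scratch repository for illustration only (hypothetical paths and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Example"
git config user.email "elena@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "second draft" >> notes.txt   # modification of a tracked file
echo "scratch" > untracked.txt     # new, untracked file

# Stage only changes to files Git already tracks, then commit them
git add -u
git commit -q -m "Extend notes"

# The untracked file is still reported as untracked ("??")
git status --short
```

The modification of ``notes.txt`` ends up in the history, while ``untracked.txt`` stays outside of version control until it is saved explicitly.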
  220. A :dlcmd:`status` should now be empty, and our dataset's history should look like this:
  221. .. index::
  222. pair: show history (compact); with Git
  223. .. runrecord:: _examples/DL-101-102-111
  224. :workdir: dl-101/DataLad-101
  225. :language: console
  226. :cast: 01_dataset_basics
  227. :notes: Let's view the growing history (concise with the --oneline option):
  228. $ # let's make the output a bit more concise with the --oneline option
  229. $ git log --oneline
  230. “Wonderful! I’m getting the hang of this quickly”, you think. “Version controlling
  231. files is not as hard as I thought!”
  232. But downloading and adding content to your dataset “manually” has two
  233. disadvantages: For one, it requires you to download the content and save it.
  234. Compared to a workflow with no DataLad dataset, this is one additional command
  235. you have to perform (`and that additional time adds up, after a while <https://xkcd.com/1205>`_). But a more
  236. serious disadvantage is that you have no electronic record of the source of the
  237. contents you added. The :term:`provenance` recorded so far (the time, date, and author
  238. of each change) is already quite nice, but we don't know anything about where you downloaded
  239. these files from. If you wanted to find out, you would have to *remember*
  240. where you got the content from – and brains are not made for such tasks.
  241. Luckily, DataLad has a command that will solve both of these problems:
  242. the :dlcmd:`download-url` command.
  243. We will dive deeper into the provenance-related benefits of using it in later chapters, but for now,
  244. we’ll start with best-practice-building. :dlcmd:`download-url` can retrieve content
  245. from a URL (using one of the URL schemes https, http, ftp, or s3) and save it
  246. into the dataset together with a human-readable commit message and a hidden,
  247. machine-readable record of the origin of the content. This saves you time,
  248. and captures :term:`provenance` information about the data you add to your dataset.
  249. To experience this, let's add a final book,
  250. `a beginner’s guide to bash <https://tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf>`_,
  251. to the dataset. We provide the command with a URL, a pointer to the dataset the
  252. file should be saved in (``.`` denotes "current directory"), and a commit message.
  253. .. runrecord:: _examples/DL-101-102-112
  254. :language: console
  255. :workdir: dl-101/DataLad-101
  256. :cast: 01_dataset_basics
  257. :notes: finally, datalad-download-url
  258. $ datalad download-url \
  259. https://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
  260. --dataset . \
  261. -m "add beginners guide on bash" \
  262. -O books/bash_guide.pdf
  263. Afterwards, a fourth book is inside your ``books/`` directory:
  264. .. runrecord:: _examples/DL-101-102-113
  265. :language: console
  266. :workdir: dl-101/DataLad-101
  267. :cast: 01_dataset_basics
  268. $ ls books
  269. However, the :dlcmd:`status` command does not return any output –
  270. the dataset state is “clean”:
  271. .. runrecord:: _examples/DL-101-102-114
  272. :language: console
  273. :workdir: dl-101/DataLad-101
  274. :cast: 01_dataset_basics
  275. $ datalad status
  276. This is because :dlcmd:`download-url` took care of saving for you:
  277. .. runrecord:: _examples/DL-101-102-115
  278. :language: console
  279. :workdir: dl-101/DataLad-101
  280. $ git log -p -n 1
  281. At this point in time, the biggest advantage may seem to be the time saved. However,
  282. soon you will experience how useful it is to have DataLad keep track of where
  283. file content came from.
  284. To conclude this section, let's take a final look at the history of your dataset at
  285. this point:
  286. .. runrecord:: _examples/DL-101-102-116
  287. :language: console
  288. :workdir: dl-101/DataLad-101
  289. $ git log --oneline
  290. Well done! Your ``DataLad-101`` dataset and its history are slowly growing.
  291. .. only:: adminmode
  292. Add a tag at the section end.
  293. .. runrecord:: _examples/DL-101-102-117
  294. :language: console
  295. :workdir: dl-101/DataLad-101
  296. $ git branch sct_populate_a_dataset
  297. .. rubric:: Footnotes
  298. .. [#f1] ``tree`` is a Unix command to list file system content. If it is not yet installed,
  299. you can get it with your native package manager (e.g., ``apt``, ``brew``, or ``conda``).
  300. For example, if you use macOS, ``brew install tree`` will get you this tool.
  301. Windows has its own ``tree`` command.
  302. Note that this ``tree`` works slightly differently from its Unix equivalent: by default, it displays only directories, not files, and the only options it accepts are ``/f`` (display file names) and ``/a`` (draw the tree with text characters instead of graphic characters).
  303. .. [#f2] ``wget`` is a Unix command for non-interactively downloading files from the
  304. web. If it is not yet installed, you can get it with your native package manager (e.g.,
  305. ``apt`` or ``brew``). For example, if you use macOS, ``brew install wget``
  306. will get you this tool.
  307. .. [#f3] See :term:`tig`. Once installed, exchange any git log command you
  308. see here with the single word ``tig``.