.. _run:

Keeping track
-------------

In previous examples, with the exception of :dlcmd:`download-url`, all
changes that happened to the dataset or the files it contains were
saved to the dataset's history by hand. We added larger and smaller
files and saved them, and we also modified smaller file contents and
saved these modifications.

Often, however, files get changed by shell commands
or by scripts.
Consider a data scientist.
She has data files with numeric data,
and code scripts in Python, R, Matlab or any other programming language
that will use the data to compute results or figures. Such output is
stored in new files, or modifies existing files.
But only a few weeks after these scripts were executed she finds it hard
to remember which script was modified for which reason or created which
output. How did this result come to be? Which script would she need
to run again on which data to produce this particular figure?

In this section we will see how DataLad can help
to record the changes in a dataset after executing a script
from the shell. Just as :dlcmd:`download-url` was able to associate
a file with its origin and store this information, we want to be
able to associate a particular file with the commands, scripts, and inputs
it was produced from, and thus capture and store full :term:`provenance`.

Let's say, for example, that you enjoyed the longnow podcasts a lot,
and you start a podcast-night with friends to wind down from all of
the exciting DataLad lectures. They propose to make a
list of speakers and titles to cross out what they've already listened
to, and ask you to prepare such a list.

"Mhh... probably there is a DataLad way to do this... wasn't there also
a note about metadata extraction at some point?" But as we are not that
far into the lectures, you decide to write a short shell script
that generates a text file listing speakers and titles
instead.

To do this, we will follow a best practice that will reappear in the
later section on :ref:`YODA principles <yoda>`: Collecting all
additional scripts that work with content of a subdataset *outside*
of this subdataset, in a dedicated ``code/`` directory,
and collating the output of the execution of these scripts
*outside* of the subdataset as well -- and
therefore not modifying the subdataset.

The motivation behind this will become clear in later sections,
but for now we'll start with best-practice building.
Therefore, create a subdirectory ``code/`` in the ``DataLad-101``
superdataset:

.. runrecord:: _examples/DL-101-108-101
   :language: console
   :workdir: dl-101
   :notes: it's impossible to remember how a large data analysis produced which result for a human, but datalad can help to keep track. To see this in action, we'll do a data analysis. Start with yoda principles and structure ds with code directory.
   :cast: 02_reproducible_execution
   :realcommand: cd DataLad-101 && mkdir code && tree -d

   $ mkdir code
   $ tree -d

Inside of ``DataLad-101/code``, create a simple shell script ``list_titles.sh``.
This script will carry out a simple task:
It will loop through the file names of the ``.mp3`` files and
write out speaker names and talk titles in a very basic fashion.
The ``cat`` command will write the script content into ``code/list_titles.sh``.

.. windows-wit:: Here's a script for Windows users

   .. include:: topic/globscript1-windows.rst

.. index::
   pair: hidden file name extensions; on Windows
.. windows-wit:: Be mindful of hidden extensions when creating files!

   .. include:: topic/hidden-extensions.rst

.. runrecord:: _examples/DL-101-108-102
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: We will create a script to execute. Let's make one that summarizes the podcasts titles in the longnow dataset:
   :cast: 02_reproducible_execution

   $ cat << EOT > code/list_titles.sh
   for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
      # get the filename
      base=\$(basename "\$i");
      # strip the extension
      base=\${base%.mp3};
      # date as yyyy-mm-dd
      printf "\${base%%__*}\t" | tr '_' '-';
      # name and title without underscores
      printf "\${base#*__}\n" | tr '_' ' ';
   done
   EOT

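
To see what the parameter expansions in this script do, it can help to apply
them to a single file name by hand. The following is only a sketch: the file
name is hypothetical (it merely follows the ``date__speaker__title`` pattern
the script assumes), and the variable names ``date_part`` and ``title_part``
are introduced here for illustration. Outside of a here-document, the
backslashes before ``$`` are not needed.

.. code-block:: bash

   # Hypothetical file name following the date__speaker__title pattern;
   # it is not an actual file in the dataset.
   f="recordings/longnow/Long_Now__Seminars_example/2003_11_15__Jane_Doe__An_Example_Talk.mp3"

   base=$(basename "$f")   # drop the directory part
   base=${base%.mp3}       # strip the extension

   # everything before the first "__" is the date; turn "_" into "-"
   date_part=$(printf '%s' "${base%%__*}" | tr '_' '-')
   # everything after the first "__" is speaker and title; turn "_" into " "
   title_part=$(printf '%s' "${base#*__}" | tr '_' ' ')

   # one tab-separated line, as the script would print it
   printf '%s\t%s\n' "$date_part" "$title_part"

Because every underscore is replaced, the ``__`` that separates speaker and
title ends up as two spaces in the output.
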
Save this script to the dataset.

.. runrecord:: _examples/DL-101-108-103
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: We have to save the script first: status and save
   :cast: 02_reproducible_execution

   $ datalad status

.. runrecord:: _examples/DL-101-108-104
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: ... preferably with a helpful commit message
   :cast: 02_reproducible_execution

   $ datalad save -m "Add short script to write a list of podcast speakers and titles"

When we run this script, it will simply print dates, names and titles to
your terminal. We can save its outputs to a new file
``recordings/podcasts.tsv`` in the superdataset by redirecting these
outputs with ``bash code/list_titles.sh > recordings/podcasts.tsv``.

Obviously, we could create this file, and subsequently save it to the superdataset.
However, just as in the example about the data scientist,
after a while we will forget how this file came into existence, or
that the script ``code/list_titles.sh`` is associated with this file, and
can be used to update it later on.

.. index::
   pair: run; DataLad command
   pair: run command with provenance capture; with DataLad
   pair: run command with provenance capture; with DataLad run

The :dlcmd:`run` command
can help with this. Put simply, it records a command's impact on a dataset. Put
more technically, it will record a shell command, and :dlcmd:`save` all changes
this command triggered in the dataset -- be that new files or changes to existing
files.

Let's try the simplest way to use this command: :dlcmd:`run`,
followed by a commit message (``-m "a concise summary"``), and the
command that executes the script from the shell: ``bash code/list_titles.sh > recordings/podcasts.tsv``.
It is helpful to enclose the command in quotation marks.

Note that we execute the command from the root of the superdataset.
It is recommended to use :dlcmd:`run` in the root of the dataset
you want to record the changes in, so make sure to run this
command from the root of ``DataLad-101``.

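
Why the quotation marks help can be illustrated without DataLad. In the sketch
below, ``record_cmd`` is a hypothetical stand-in for a wrapper command such as
``datalad run``; it does nothing but report the arguments it was handed.
Without quotes, the calling shell interprets ``>`` itself, so the wrapper never
learns about the redirection. With quotes, the complete command line reaches
the wrapper.

.. code-block:: bash

   # record_cmd is a hypothetical stand-in for a wrapper such as "datalad run":
   # it only reports the arguments it received and executes nothing.
   record_cmd() { printf 'wrapper received: %s\n' "$*"; }

   # work in a scratch directory so that no dataset is touched
   workdir=$(mktemp -d)
   cd "$workdir"

   # Unquoted: the shell handles ">" before the wrapper starts. The wrapper
   # sees only "echo hello", and its own output lands in the file.
   record_cmd echo hello > unquoted.txt
   cat unquoted.txt

   # Quoted: the whole command line, redirection included, is one argument.
   record_cmd "echo hello > quoted.txt"

In the unquoted case, a record of the command would lack the redirection, and
anything the wrapper itself prints to standard output would end up in the
output file.
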
.. runrecord:: _examples/DL-101-108-105
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: The datalad run command records a command's impact on a dataset. We try it in the most simple way:
   :cast: 02_reproducible_execution

   $ datalad run -m "create a list of podcast titles" \
     "bash code/list_titles.sh > recordings/podcasts.tsv"

Let's take a look into the history:

.. runrecord:: _examples/DL-101-108-106
   :language: console
   :workdir: dl-101/DataLad-101
   :lines: 1-30
   :emphasize-lines: 6, 11, 25
   :notes: Let's now check what has been written into the history. (runrecord)
   :cast: 02_reproducible_execution

   $ git log -p -n 1   # On Windows, you may just want to type "git log".

The commit message we have supplied with ``-m`` directly after :dlcmd:`run` appears
in our history as a short summary.
Additionally, the output of the command, ``recordings/podcasts.tsv``,
was saved right away.
But there is more in this log entry, a section in between the markers
``=== Do not change lines below ===`` and
``^^^ Do not change lines above ^^^``.
This is the so-called ``run record`` -- a recording of all of the
information in the :dlcmd:`run` command, generated by DataLad.
In this case, it is a very simple summary. One informative
part is highlighted:
``"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv"`` is the command
that was run in the terminal.
This information therefore maps the command, and with it the script,
to the output file, in one commit. Nice, isn't it?

Arguably, the :term:`run record` is not the most human-readable way to display information.
This representation, however, is meant less for the human user (the human user should
rely on their informative commit message) than for DataLad, in particular for the
:dlcmd:`rerun` command, which you will see in action shortly. This
``run record`` is machine-readable provenance that associates an output with
the command that produced it.

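
"Machine-readable" means that a program can pick the record out of the commit
message by looking for the two markers. The following sketch shows the idea
with standard shell tools. The commit message in it is illustrative only: it is
modeled on the layout described above, the ``dsid`` value is a placeholder, and
the fields shown are an assumption rather than a complete run record. This is
not how DataLad itself reads the record.

.. code-block:: bash

   # Illustrative commit message; the "dsid" value is a placeholder and the
   # selection of fields is an assumption made for this example.
   msg=$(printf '%s\n' \
     '[DATALAD RUNCMD] create a list of podcast titles' \
     '' \
     '=== Do not change lines below ===' \
     '{' \
     ' "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",' \
     ' "dsid": "<placeholder>",' \
     ' "exit": 0,' \
     ' "pwd": "."' \
     '}' \
     '^^^ Do not change lines above ^^^')

   # keep the lines from the first marker to the second, then drop both markers
   record=$(printf '%s\n' "$msg" \
     | sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' \
     | sed '1d;$d')

   # extract the value of "cmd" (a simple pattern that suffices for this example)
   cmd=$(printf '%s\n' "$record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
   printf '%s\n' "$cmd"

In a real dataset, the message of the most recent commit could be obtained with
``git log -1 --format=%B`` instead of the hard-coded text.
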
You have probably already guessed that every :dlcmd:`run` command
ends with a ``datalad save``. A logical consequence of this fact is that any
:dlcmd:`run` that does not result in any changes in a dataset (no modification
of existing content; no additional files) will not produce any record in the
dataset's history (just as a :dlcmd:`save` with no modifications present
will not create a history entry). Try to run the exact same
command as before, and check whether anything in your log changes:

.. runrecord:: _examples/DL-101-108-107
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: A run command that does not result in changes (no modifications, no additional files) will not produce a record in the dataset history. So what happens if we do the same again?
   :cast: 02_reproducible_execution

   $ datalad run -m "Try again to create a list of podcast titles" \
     "bash code/list_titles.sh > recordings/podcasts.tsv"

.. runrecord:: _examples/DL-101-108-108
   :language: console
   :workdir: dl-101/DataLad-101
   :lines: 1-5
   :emphasize-lines: 2
   :notes: as the result is byte-identical, there is no new commit
   :cast: 02_reproducible_execution

   $ git log --oneline

The most recent commit is still the :dlcmd:`run` command from before,
and there was no second :dlcmd:`run` commit created.

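
The reason is that the second execution produced a byte-identical file, so
there was nothing new to save. The following sketch illustrates this outside of
any dataset. ``gen_list`` is a hypothetical, deterministic stand-in for
``code/list_titles.sh``, and the file names are chosen for this example only.

.. code-block:: bash

   # work in a scratch directory so that no dataset is touched
   workdir=$(mktemp -d)
   cd "$workdir"

   # gen_list is a hypothetical, deterministic stand-in for code/list_titles.sh
   gen_list() { printf '2003-11-15\tJane Doe  An Example Talk\n'; }

   gen_list > podcasts.tsv          # first execution
   cp podcasts.tsv first_run.tsv    # keep a copy for the comparison
   gen_list > podcasts.tsv          # second execution overwrites the file

   # cmp -s is silent and exits with 0 only if both files are byte-identical
   if cmp -s first_run.tsv podcasts.tsv; then
      echo "identical: nothing new to save"
   else
      echo "content changed: a save would record it"
   fi

A command whose output differs between executions, for example because it
writes a timestamp, would take the second branch.
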
The :dlcmd:`run` command can therefore help you to keep track of what you are doing
in a dataset and capture provenance of your files: When, by whom, and how exactly
was a particular file created or modified?
The next sections will demonstrate how to make use of this information,
and also how to extend the command with additional arguments that will prove to
be helpful over the course of this chapter.

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-108-109
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_keeping_track