  1. .. index:: ! Usecase; Basic provenance tracking
  2. .. _usecase_provenance_tracking:
  3. Basic provenance tracking
  4. -------------------------
  5. This use case demonstrates how the provenance of downloaded and generated files
  6. can be captured with DataLad by
  7. #. downloading a data file from an arbitrary URL on the web,
  8. #. performing changes to this data file, and
  9. #. capturing provenance for all of this.
  10. .. importantnote:: How to become a Git pro
  11. This section uses advanced Git commands and concepts on the side
  12. that are not covered in the book. If you want to learn more about
  13. the Git commands shown here, the `ProGit book <https://git-scm.com/book/en/v2>`_
  14. is an excellent resource.
  15. The Challenge
  16. ^^^^^^^^^^^^^
  17. Rob needs to turn in an art project at the end of the high school year.
  18. He wants to make it as easy as possible and decides to just make a
  19. photomontage of some pictures from the internet. When he submits the project,
  20. he does not remember where he got the input data from, nor the exact steps to
  21. create his project, even though he tried to take notes.
  22. The DataLad Approach
  23. ^^^^^^^^^^^^^^^^^^^^
  24. Rob starts his art project as a DataLad dataset. When downloading the
  25. images he wants to use for his project, he tracks where they come from.
  26. And when he changes or creates output, he tracks how, when, and why
  27. this was done using standard DataLad commands.
  28. This will make it easy for him to find out or remember what he has
  29. done in his project, and how it has been done, a long time after he
  30. finished the project, without any note taking.
  31. Step-by-Step
  32. ^^^^^^^^^^^^
  33. Rob starts by creating a dataset, because everything in a dataset can
  34. be version controlled and tracked:
  35. .. runrecord:: _examples/prov-101
  36. :workdir: usecases/provenance
  37. :language: console
  38. $ datalad create artproject && cd artproject
  39. For his art project, Rob decides to download a mosaic image composed of flowers
  40. from Wikimedia. As a first step, he extracts some of the flowers into individual
  41. files to reuse them later.
  42. He uses the :dlcmd:`download-url` command, which gets the resource straight
  43. from the web, captures its provenance automatically, and saves the
  44. resource in his dataset together with a useful commit message:
  45. .. runrecord:: _examples/prov-102
  46. :workdir: usecases/provenance/artproject
  47. :language: console
  48. $ mkdir sources
  49. $ datalad download-url -m "Added flower mosaic from wikimedia" \
  50. https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
  51. --path sources/flowers.jpg
  52. If he later wants to find out where he obtained this file from, a
  53. :gitannexcmd:`whereis` [#f1]_ command will tell him:
  54. .. runrecord:: _examples/prov-103
  55. :workdir: usecases/provenance/artproject
  56. :language: console
  57. $ git annex whereis sources/flowers.jpg
  58. To extract some image parts for the first step of his project, he uses
  59. the ``-extract`` option of the ``convert`` tool from `ImageMagick <https://imagemagick.org/index.php>`_ to
  60. extract the St. Bernard's Lily from the upper left corner, and the pimpernel
  61. from the upper right corner. The commands will take the
  62. Wikimedia poster as an input and produce output files from it. To capture
  63. provenance on this action, Rob wraps it into :dlcmd:`run` [#f2]_
  64. commands.
  65. .. runrecord:: _examples/prov-104
  66. :workdir: usecases/provenance/artproject
  67. :language: console
  68. $ datalad run -m "extract st-bernard lily" \
  69. --input "sources/flowers.jpg" \
  70. --output "st-bernard.jpg" \
  71. "convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg"
  72. .. runrecord:: _examples/prov-105
  73. :workdir: usecases/provenance/artproject
  74. :language: console
  75. $ datalad run -m "extract pimpernel" \
  76. --input "sources/flowers.jpg" \
  77. --output "pimpernel.jpg" \
  78. "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg"
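The geometry argument passed to ``-extract`` follows ImageMagick's ``WIDTHxHEIGHT+XOFFSET+YOFFSET`` notation, where the offsets locate the top-left corner of the cropped region. As a minimal sketch, the hypothetical helper ``describe_geometry`` below (not part of ImageMagick or DataLad) splits the two geometry strings used above into their parts:

```shell
# Hypothetical helper: split a WIDTHxHEIGHT+XOFF+YOFF geometry string.
# It only handles this one form, not ImageMagick's full geometry syntax.
describe_geometry() {
    IFS='x+' read -r width height xoff yoff <<EOF
$1
EOF
    echo "crop ${width}x${height} px, top-left corner at x=${xoff}, y=${yoff}"
}

describe_geometry "1522x1522+0+0"
describe_geometry "1522x1522+1470+1470"
```

The first call describes a region anchored at the image origin; the second describes a region of the same size shifted by 1470 pixels along both axes.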
  79. He continues to process the images, capturing all provenance with DataLad.
  80. Later, he can always find out which commands produced or changed which file.
  81. This information is easily accessible within the history of his dataset,
  82. both with Git and DataLad commands such as :gitcmd:`log` or
  83. :dlcmd:`diff`.
  84. .. runrecord:: _examples/prov-106
  85. :workdir: usecases/provenance/artproject
  86. :language: console
  87. $ git log --oneline HEAD~3..HEAD
  88. .. runrecord:: _examples/prov-107
  89. :workdir: usecases/provenance/artproject
  90. :language: console
  91. $ datalad diff -f HEAD~3
  92. Based on this information, he can always reconstruct how and when
  93. any data file came to be, across the entire lifetime of a project.
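The provenance that :dlcmd:`run` captures is stored as a small JSON record inside the commit message, enclosed by two marker lines. The following sketch shows one way to pull that record out with standard shell tools. The function name ``extract_run_record`` is hypothetical, the sample message is illustrative (its ``dsid`` is a placeholder), and the exact set of fields should be treated as an assumption that may differ between DataLad versions:

```shell
# Print the JSON run record embedded in a commit message read from stdin.
# Assumes the record sits between the two marker lines DataLad writes.
extract_run_record() {
    sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
        sed '1d;$d'   # drop the two marker lines themselves
}

# Illustrative commit message; the "dsid" value is a placeholder.
sample_message='[DATALAD RUNCMD] extract st-bernard lily

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "sources/flowers.jpg"
 ],
 "outputs": [
  "st-bernard.jpg"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^'

record=$(printf '%s\n' "$sample_message" | extract_run_record)
printf '%s\n' "$record"
```

In a real dataset, the same function could be fed from Git, for example with ``git log -1 --format=%B HEAD | extract_run_record``.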
  94. He decides that one image manipulation for his art project will
  95. be to displace pixels of an image by a random amount to blur the image:
  96. .. runrecord:: _examples/prov-108
  97. :workdir: usecases/provenance/artproject
  98. :language: console
  99. $ datalad run -m "blur image" \
  100. --input "st-bernard.jpg" \
  101. --output "st-bernard-displaced.jpg" \
  102. "convert -spread 10 st-bernard.jpg st-bernard-displaced.jpg"
  103. Because he is not completely satisfied with the first random pixel displacement,
  104. he decides to retry the operation. Because everything was wrapped in :dlcmd:`run`,
  105. he can rerun the command. Rerunning the command will produce a commit, because the displacement is
  106. random and the output file changes slightly from its previous version.
  107. .. runrecord:: _examples/prov-109
  108. :workdir: usecases/provenance/artproject
  109. :language: console
  110. $ git log -1 --oneline HEAD
  111. .. runrecord:: _examples/prov-110
  112. :workdir: usecases/provenance/artproject
  113. :language: console
  114. :realcommand: echo "$ datalad rerun $(git rev-parse HEAD)" && datalad rerun $(git rev-parse HEAD)
  115. This blur also does not yet fulfill Rob's expectations, so he decides to
  116. discard the change, using standard Git tools [#f3]_.
  117. .. runrecord:: _examples/prov-111
  118. :workdir: usecases/provenance/artproject
  119. :language: console
  120. $ git reset --hard HEAD~1
  121. He knows that within a DataLad dataset, he can also rerun *a range*
  122. of commands with the ``--since`` flag, and even specify alternative
  123. starting points for rerunning them with the ``--onto`` flag. Every
  124. command from commits reachable from the specified checksum until
  125. ``--since`` (but not including ``--since``) will be re-executed.
  126. For example, ``datalad rerun --since=HEAD~5`` will re-execute any
  127. commands in the last five commits.
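The range that ``--since`` selects can be approximated with plain Git revision ranges. The sketch below builds a throwaway repository with empty commits (no DataLad involved; the ``[DATALAD RUNCMD]`` prefixes are typed by hand purely to imitate run commits, and ``empty_commit`` is a hypothetical helper) and lists what lies between ``HEAD~2`` and ``HEAD``. It only illustrates the "reachable from the revision, but not including ``--since``" rule, not how :dlcmd:`rerun` is implemented:

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository, unrelated to Rob's dataset
cd "$repo"
git init -q .

# Hypothetical helper: record an empty commit with a fixed identity.
empty_commit() {
    git -c user.name=Rob -c user.email=rob@example.com \
        commit -q --allow-empty -m "$1"
}

empty_commit "initial commit"
empty_commit "[DATALAD RUNCMD] extract st-bernard lily"
empty_commit "add notes"
empty_commit "[DATALAD RUNCMD] blur image"

# Commits reachable from HEAD but not from HEAD~2, oldest first.
in_range=$(git log --reverse --format=%s HEAD~2..HEAD)
# Of those, only commits whose message carries the run marker.
rerun_candidates=$(git log --reverse --format=%s \
    --fixed-strings --grep='[DATALAD RUNCMD]' HEAD~2..HEAD)

printf 'in range:\n%s\n' "$in_range"
printf 'rerun candidates:\n%s\n' "$rerun_candidates"
```

The commit at ``HEAD~2`` itself ("extract st-bernard lily") is excluded from the range, and of the two remaining commits only the one that imitates a run commit would be a candidate for re-execution.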
  128. ``--onto`` indicates where to start rerunning the commands from.
  129. The default is ``HEAD``, but anything other than HEAD will be
  130. checked out prior to execution, such that re-execution happens in
  131. a detached HEAD state, or checked out on the new branch specified
  132. by the ``--branch`` flag.
  133. If ``--since`` is an empty string, it is set to rerun every command from the
  134. first commit that contains a recorded command. If ``--onto`` is an empty
  135. string, re-execution is performed on top of the parent of the first
  136. run commit in the revision list specified with ``--since``.
  137. When both arguments are set to empty strings, it therefore means
  138. "rerun all commands with HEAD at the parent of the first commit that contains a recorded command".
  139. In other words, Rob can "replay" all the history for his artproject in a single
  140. command. Using the ``--branch`` option of :dlcmd:`rerun`,
  141. he does it on a new branch he names ``replay``:
  142. .. runrecord:: _examples/prov-112
  143. :workdir: usecases/provenance/artproject
  144. :language: console
  145. $ datalad rerun --since= --onto= --branch=replay
  146. Now he is on a new branch of his project, which contains "replayed" history.
  147. .. runrecord:: _examples/prov-113
  148. :workdir: usecases/provenance/artproject
  149. :language: console
  150. $ git log --oneline --graph main replay
  151. He can even compare the two branches:
  152. .. runrecord:: _examples/prov-114
  153. :workdir: usecases/provenance/artproject
  154. :language: console
  155. $ datalad diff -t main -f replay
  156. He can see that the blurring, which involved a random element,
  157. produced different results. Because his dataset contains two branches,
  158. he can compare the two branches using normal Git operations.
  159. The next command, for example, marks which commits are "patch-equivalent"
  160. between the branches.
  161. Notice that all commits are marked as equivalent (=) except the ones that apply the random spread (the ``blur image`` commits).
  162. .. runrecord:: _examples/prov-115
  163. :workdir: usecases/provenance/artproject
  164. :language: console
  165. $ git log --oneline --left-right --cherry-mark main...replay
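To see how ``--cherry-mark`` behaves independently of DataLad, the following sketch recreates the situation in a throwaway Git repository. Everything in it is an assumption made for illustration: the file names and contents, the fixed commit dates, and the hypothetical helper ``commit_file``. Two branches apply an identical "extract" change and a differing "blur" change on top of the same parent:

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository, unrelated to Rob's dataset
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is "main"

# Hypothetical helper: commit_file <file> <content> <message> <date>
commit_file() {
    printf '%s\n' "$2" > "$1"
    git add "$1"
    GIT_AUTHOR_DATE="$4" GIT_COMMITTER_DATE="$4" \
        git -c user.name=Rob -c user.email=rob@example.com commit -q -m "$3"
}

commit_file base.txt "poster" "add source" "2024-01-01 10:00:00 +0000"
git branch replay   # replay starts from the same parent as main

commit_file lily.txt "deterministic crop" "extract lily" "2024-01-01 11:00:00 +0000"
commit_file blur.txt "random spread A" "blur image" "2024-01-01 12:00:00 +0000"

git checkout -q replay
commit_file lily.txt "deterministic crop" "extract lily" "2024-01-02 11:00:00 +0000"
commit_file blur.txt "random spread B" "blur image" "2024-01-02 12:00:00 +0000"

# "=" marks patch-equivalent commits, "<" and ">" mark commits unique to one side.
marks=$(git log --oneline --left-right --cherry-mark main...replay)
printf '%s\n' "$marks"
```

Both "extract lily" commits introduce the same patch and are therefore marked ``=``, while the two "blur image" commits differ in content and are marked ``<`` (only on ``main``) and ``>`` (only on ``replay``).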
  166. Rob can continue processing images, and will turn in a successful art project.
  167. Long after he finishes high school, he finds his dataset on his old computer
  168. again and remembers this small project fondly.
  169. .. rubric:: Footnotes
  170. .. [#f1] If you want to learn more about :gitannexcmd:`whereis`, re-read
  171. section :ref:`sharelocal2`.
  172. .. [#f2] If you want to learn more about :dlcmd:`run`, read on from
  173. section :ref:`run`.
  174. .. [#f3] Find out more about working with the history of a dataset with Git in
  175. section :ref:`file system`.