- .. index::
- pair: rerun; DataLad command
- .. _run2:
- DataLad, rerun!
- ----------------
- So far, you created a ``.tsv`` file of all
- speakers and talk titles in the ``longnow/`` podcasts subdataset.
- Let's actually take a look into this file now:
- .. runrecord:: _examples/DL-101-109-101
- :language: console
- :workdir: dl-101/DataLad-101
- :lines: 1-3,5-7
- :append: -✂--✂-
- :notes: The script produced a simple list of podcast titles. Let's take a look into our output file. What's cool is that it was created in a way that the code and output are linked:
- :cast: 02_reproducible_execution
- $ less recordings/podcasts.tsv
- Not too bad, and certainly good enough for the podcast night people.
- What's been cool about creating this file is that it was created with
- a script within a :dlcmd:`run` command. Thanks to :dlcmd:`run`,
- the output file ``podcasts.tsv`` is associated with the script that
- generated it.
- Upon reviewing the list you realized that you made a mistake, though: you only
- listed the talks in the SALT series (the
- ``Long_Now__Seminars_About_Long_term_Thinking/`` directory), but not
- in the ``Long_Now__Conversations_at_The_Interval/`` directory.
- Let's fix this in the script. Replace the contents of ``code/list_titles.sh``
- with the following, fixed script:
- .. windows-wit:: Here's a script adjustment for Windows users
- .. include:: topic/globscript2-windows.rst
- .. runrecord:: _examples/DL-101-109-102
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 2
- :notes: Dang, we made a mistake in our script: we only listed a part of the podcasts! Let's fix the script:
- :cast: 02_reproducible_execution
- $ cat << EOT >| code/list_titles.sh
- for i in recordings/longnow/Long_Now*/*.mp3; do
- # get the filename
- base=\$(basename "\$i");
- # strip the extension
- base=\${base%.mp3};
- printf "\${base%%__*}\t" | tr '_' '-';
- # name and title without underscores
- printf "\${base#*__}\n" | tr '_' ' ';
- done
- EOT
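- To see what the parameter expansions in this script do, here is a minimal sketch that applies them to a single file name.
- The helper function ``format_title`` and the file name are made up for illustration; the name only follows the ``date__speaker__title.mp3`` pattern that the script assumes and is not taken from the dataset.

```shell
#!/usr/bin/env bash
# format_title is a hypothetical helper that mirrors the loop body of
# code/list_titles.sh for one path and prints one tab-separated line.
format_title() {
    local base
    base=$(basename "$1")      # drop the directory part
    base=${base%.mp3}          # strip the extension
    # "%%__*" removes the longest suffix starting at "__": the date remains
    printf '%s\t' "${base%%__*}" | tr '_' '-'
    # "#*__" removes the shortest prefix ending in "__": name and title remain
    printf '%s\n' "${base#*__}" | tr '_' ' '
}

# Placeholder path, not a real file in the dataset
format_title "recordings/longnow/Some_Series/2020_01_31__Jane_Doe__An_Example_Talk.mp3"
```

- Note that the ``__`` between speaker and title turns into two spaces, because ``tr`` replaces every single underscore.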
- Because the script is now modified, save the modifications to the dataset.
- We can use the shorthand "BF" to denote "Bug fix" in the commit message.
- .. runrecord:: _examples/DL-101-109-103
- :language: console
- :workdir: dl-101/DataLad-101
- :cast: 02_reproducible_execution
- $ datalad status
- .. runrecord:: _examples/DL-101-109-104
- :language: console
- :workdir: dl-101/DataLad-101
- :cast: 02_reproducible_execution
- $ datalad save -m "BF: list both directories content" \
- code/list_titles.sh
- What we *could* do is run the same :dlcmd:`run` command as before to recreate
- the file, but now with all of the contents:
- .. code-block:: console
- $ # do not execute this!
- $ datalad run -m "create a list of podcast titles" \
- "bash code/list_titles.sh > recordings/podcasts.tsv"
- However, think about a situation where the command is longer than this,
- or where many months have passed since the first execution. It would not be easy to remember
- the command, nor would it be very convenient to copy it from the ``run record``.
- Luckily, a fellow student remembered the DataLad way of re-executing
- a ``run`` command, and he's eager to show it to you.
- "In order to re-execute a :dlcmd:`run` command,
- find the commit and use its :term:`shasum` (or a :term:`tag`, or anything else that Git
- understands) as an argument for the
- :dlcmd:`rerun` command! That's it!",
- he says happily.
- So you go ahead and find the commit :term:`shasum` in your history:
- .. runrecord:: _examples/DL-101-109-105
- :language: console
- :workdir: dl-101/DataLad-101
- :lines: 1-12
- :emphasize-lines: 8
- :notes: We could execute the same command as before. However, we can also let DataLad take care of it, and use the datalad rerun command.
- :cast: 02_reproducible_execution
- $ git log -n 2
- Take that shasum and paste it after :dlcmd:`rerun`
- (the first 6-8 characters of the shasum would be sufficient,
- here we are using all of them).
- .. runrecord:: _examples/DL-101-109-106
- :language: console
- :workdir: dl-101/DataLad-101
- :realcommand: echo "$ datalad rerun $(git rev-parse HEAD~1)" && datalad rerun $(git rev-parse HEAD~1)
- :notes: We'll find the shasum of the run commit and plug it into rerun
- :cast: 02_reproducible_execution
- Now DataLad has made use of the ``run record``, and
- re-executed the original command based on the information in it.
- Because we updated the script, the output ``podcasts.tsv``
- has changed and now contains the podcast
- titles of both subdirectories.
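- The fellow student's remark that a full shasum, an abbreviated one, or "anything else that Git understands" all identify the same commit can be checked with plain Git.
- The following is a minimal sketch in a throwaway repository; the file, the commit messages, and the committer identity are placeholders, and no DataLad command is involved.

```shell
#!/usr/bin/env bash
set -eu
# Work in a temporary repository so nothing in DataLad-101 is touched
repo=$(mktemp -d)
cd "$repo"
git init -q
# Wrapper that supplies a placeholder identity for the commits
g() { git -c user.name="Example" -c user.email="example@example.com" "$@"; }

echo one > file.txt && git add file.txt && g commit -q -m "first"
echo two > file.txt && g commit -q -am "second"

full=$(git rev-parse HEAD~1)              # full shasum of the previous commit
short=$(git rev-parse --short=8 HEAD~1)   # abbreviated form (at least 8 characters)
resolved=$(git rev-parse "$short")        # Git expands the abbreviation again

echo "full:     $full"
echo "short:    $short"
echo "resolved: $resolved"
```

- All three spellings (``HEAD~1``, the abbreviated shasum, and the full shasum) resolve to the same commit, which is why any of them can be used to point at it.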
- You've probably already guessed it, but the easiest way
- to check whether a :dlcmd:`rerun`
- has changed the desired output file is
- to check whether the rerun command appears in the dataset's history.
- If a :dlcmd:`rerun` does not add or change any content in the dataset,
- it will also not be recorded in the history.
- .. runrecord:: _examples/DL-101-109-107
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: how does a rerun look in the history?
- :cast: 02_reproducible_execution
- $ git log -n 1
- In the dataset's history,
- we can see that a new :dlcmd:`run` was recorded. This action is
- committed by DataLad under the original commit message of the ``run``
- command, and looks just like the previous :dlcmd:`run` commit.
- .. index::
- pair: diff; DataLad command
- Two cool tools that go beyond the :gitcmd:`log`
- are the :dlcmd:`diff` and :gitcmd:`diff` commands.
- Both commands can report differences between two states of
- a dataset. Thus, you can get an overview of what changed between two commits.
- Both commands have a similar, but not identical structure: :dlcmd:`diff`
- compares one state (a commit specified with ``-f``/``--from``,
- by default the latest change)
- and another state from the dataset's history (a commit specified with
- ``-t``/``--to``). Let's do a :dlcmd:`diff` between the current state
- of the dataset and the previous commit (called "``HEAD~1``" in Git terminology [#f1]_):
- .. index::
- pair: show dataset modification; on Windows with DataLad
- pair: diff; DataLad command
- pair: corresponding branch; in adjusted mode
- .. windows-wit:: please use 'datalad diff --from main --to HEAD~1'
- .. include:: topic/adjustedmode-diff.rst
- .. index::
- pair: diff; Git command
- pair: show dataset modification; with DataLad
- .. runrecord:: _examples/DL-101-109-108
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: The datalad diff command can help us find out what changed between the last two commands:
- :cast: 02_reproducible_execution
- $ datalad diff --to HEAD~1
- .. index::
- pair: diff; Git command
- pair: show dataset modification; with Git
- This indeed shows the output file as "modified". However, we do not know
- what exactly changed. This is a task for :gitcmd:`diff` (get out of the
- diff view by pressing ``q``):
- .. runrecord:: _examples/DL-101-109-109
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: The git diff command has even more insights:
- :cast: 02_reproducible_execution
- :lines: 1-20
- $ git diff HEAD~1
- This output actually shows the precise changes between the contents created
- with the first version of the script and the second script with the bug fix.
- All of the entries that were added once the second directory
- was queried as well are shown in the ``diff``, preceded by a ``+``.
- Quickly create a note about these two helpful commands in ``notes.txt``:
- .. runrecord:: _examples/DL-101-109-110
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: Let's make a note about this.
- :cast: 02_reproducible_execution
- $ cat << EOT >> notes.txt
- There are two useful functions to display changes between two
- states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
- and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
- in the history.
- EOT
- Finally, save this note.
- .. runrecord:: _examples/DL-101-109-111
- :language: console
- :workdir: dl-101/DataLad-101
- :cast: 02_reproducible_execution
- $ datalad save -m "add note datalad and git diff"
- Note that :dlcmd:`rerun` can re-execute the run records of both a :dlcmd:`run`
- and a :dlcmd:`rerun` command,
- but it cannot re-execute any other type of DataLad command in your history,
- such as a :dlcmd:`save` on results or outputs after you executed a script.
- Therefore, make it a
- habit to record the execution of scripts by plugging them into :dlcmd:`run`.
- This very basic example of a :dlcmd:`run` is as simple as it can get, but it
- is already
- convenient from a memory-load perspective: Now you do not need to
- remember the commands or scripts involved in creating an output. DataLad kept track
- of what you did, and you can instruct it to "``rerun``" it.
- Also, incidentally, we have generated :term:`provenance` information. It is
- now recorded in the history of the dataset how the output ``podcasts.tsv`` came
- into existence. And we can interact with and use this provenance information with
- other tools, not only through the machine-readable ``run record``.
- For example, to find out who (or what) created or modified a file,
- give the file path to :gitcmd:`log` (prefixed by ``--``):
- .. index::
- pair: show history for particular paths; on Windows with Git
- pair: log; Git command
- pair: corresponding branch; in adjusted mode
- .. windows-wit:: use 'git log main -- recordings/podcasts.tsv'
- .. include:: topic/adjustedmode-log-path.rst
- .. index::
- pair: show history for particular paths; with Git
- .. runrecord:: _examples/DL-101-109-112
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: An amazing thing is that DataLad captured all of the provenance of the output file, and we can use Git tools to find out about it
- :cast: 02_reproducible_execution
- $ git log -- recordings/podcasts.tsv
- Neat, isn't it?
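- How the ``--`` path filter narrows down a history can be tried out with plain Git.
- The following is a minimal sketch in a throwaway repository; the commits and the committer identity are placeholders that only mimic the layout of ``DataLad-101``.

```shell
#!/usr/bin/env bash
set -eu
# Work in a temporary repository so nothing in DataLad-101 is touched
repo=$(mktemp -d)
cd "$repo"
git init -q
# Wrapper that supplies a placeholder identity for the commits
g() { git -c user.name="Example" -c user.email="example@example.com" "$@"; }

mkdir recordings
echo "a note" > notes.txt && git add notes.txt && g commit -q -m "add notes"
printf 'a\n' > recordings/podcasts.tsv && git add recordings && g commit -q -m "create a list"
echo "more" >> notes.txt && g commit -q -am "extend notes"
printf 'a\nb\n' > recordings/podcasts.tsv && g commit -q -am "recreate the list"

# Count all commits, then only those that touched the given path
all_commits=$(git rev-list --count HEAD)
file_commits=$(git log --oneline -- recordings/podcasts.tsv | wc -l | tr -d ' ')

echo "commits in total:          $all_commits"
echo "commits touching the file: $file_commits"
git log --oneline -- recordings/podcasts.tsv
```

- Only the two commits that created or modified ``recordings/podcasts.tsv`` are listed; the commits that changed ``notes.txt`` are left out.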
- Still, this :dlcmd:`run` was very simple.
- The next section will demonstrate how :dlcmd:`run` becomes handy in
- more complex standard use cases: situations with *locked* contents.
- But prior to that, make a note about :dlcmd:`run` and :dlcmd:`rerun` in your
- ``notes.txt`` file.
- .. runrecord:: _examples/DL-101-109-113
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: Another final note on run and rerun
- :cast: 02_reproducible_execution
- $ cat << EOT >> notes.txt
- The datalad run command can record the impact a script or command has
- on a Dataset. In its simplest form, datalad run only takes a commit
- message and the command that should be executed.
- Any datalad run command can be re-executed by using its commit shasum
- as an argument in datalad rerun CHECKSUM. DataLad will take
- information from the run record of the original commit, and re-execute
- it. If no changes happen with a rerun, the command will not be written
- to history. Note: you can also rerun a datalad rerun command!
- EOT
- Finally, save this note.
- .. runrecord:: _examples/DL-101-109-114
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: Another final note on run and rerun
- :cast: 02_reproducible_execution
- $ datalad save -m "add note on basic datalad run and datalad rerun"
- .. only:: adminmode
- Add a tag at the section end.
- .. runrecord:: _examples/DL-101-109-115
- :language: console
- :workdir: dl-101/DataLad-101
- $ git branch sct_datalad_rerun
- .. rubric:: Footnotes
- .. [#f1] The section :ref:`history` will elaborate more on common :term:`Git` commands
- and terminology.