123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247 |
- .. _run:
- Keeping track
- -------------
- In previous examples, with the exception of :dlcmd:`download-url`, all
- changes that happened to the dataset or the files it contains were
- saved to the dataset's history by hand. We added larger and smaller
- files and saved them, and we also modified smaller file contents and
- saved these modifications.
- Often, however, files get changed by shell commands
- or by scripts.
- Consider a data scientist.
- She has data files with numeric data,
- and code scripts in Python, R, Matlab or any other programming language
- that will use the data to compute results or figures. Such output is
- stored in new files, or modifies existing files.
- But only a few weeks after these scripts were executed she finds it hard
- to remember which script was modified for which reason or created which
- output. How did this result came to be? Which script would she need
- to run again on which data to produce this particular figure?
- In this section we will experience how DataLad can help
- to record the changes in a dataset after executing a script
- from the shell. Just as :dlcmd:`download-url` was able to associate
- a file with its origin and store this information, we want to be
- able to associate a particular file with the commands, scripts, and inputs
- it was produced from, and thus capture and store full :term:`provenance`.
- Let's say, for example, that you enjoyed the longnow podcasts a lot,
- and you start a podcast-night with friends to wind down from all of
- the exciting DataLad lectures. They propose to make a
- list of speakers and titles to cross out what they've already listened
- to, and ask you to prepare such a list.
- "Mhh... probably there is a DataLad way to do this... wasn't there also
- a note about metadata extraction at some point?" But as we are not that
- far into the lectures, you decide to write a short shell script
- to generate a text file that lists speaker and title
- name instead.
- To do this, we are following a best practice that will reappear in the
- later section on :ref:`YODA principles <yoda>`: Collecting all
- additional scripts that work with content of a subdataset *outside*
- of this subdataset, in a dedicated ``code/`` directory,
- and collating the output of the execution of these scripts
- *outside* of the subdataset as well -- and
- therefore not modifying the subdataset.
- The motivation behind this will become clear in later sections,
- but for now we'll start with best-practice building.
- Therefore, create a subdirectory ``code/`` in the ``DataLad-101``
- superdataset:
- .. runrecord:: _examples/DL-101-108-101
- :language: console
- :workdir: dl-101
- :notes: it's impossible to remember how a large data analysis produced which result for a human, but datalad can help to keep track. To see this in action, we'll do a data analysis. Start with yoda principles and structure ds with code directory.
- :cast: 02_reproducible_execution
- :realcommand: cd DataLad-101 && mkdir code && tree -d
- $ mkdir code
- $ tree -d
- Inside of ``DataLad-101/code``, create a simple shell script ``list_titles.sh``.
- This script will carry out a simple task:
- It will loop through the file names of the ``.mp3`` files and
- write out speaker names and talk titles in a very basic fashion.
- The ``cat`` command will write the script content into ``code/list_titles.sh``.
- .. windows-wit:: Here's a script for Windows users
- .. include:: topic/globscript1-windows.rst
- .. index::
- pair: hidden file name extensions; on Windows
- .. windows-wit:: Be mindful of hidden extensions when creating files!
- .. include:: topic/hidden-extensions.rst
- .. runrecord:: _examples/DL-101-108-102
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: We will create a script to execute. Let's make one that summarizes the podcasts titles in the longnow dataset:
- :cast: 02_reproducible_execution
- $ cat << EOT > code/list_titles.sh
- for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
- # get the filename
- base=\$(basename "\$i");
- # strip the extension
- base=\${base%.mp3};
- # date as yyyy-mm-dd
- printf "\${base%%__*}\t" | tr '_' '-';
- # name and title without underscores
- printf "\${base#*__}\n" | tr '_' ' ';
- done
- EOT
- Save this script to the dataset.
- .. runrecord:: _examples/DL-101-108-103
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: We have to save the script first: status and save
- :cast: 02_reproducible_execution
- $ datalad status
- .. runrecord:: _examples/DL-101-108-104
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: ... preferably with a helpful commit message
- :cast: 02_reproducible_execution
- $ datalad save -m "Add short script to write a list of podcast speakers and titles"
- Once we run this script, it will simply print dates, names and titles to
- your terminal. We can save its outputs to a new file
- ``recordings/podcasts.tsv`` in the superdataset by redirecting these
- outputs with ``bash code/list_titles.sh > recordings/podcasts.tsv``.
- Obviously, we could create this file, and subsequently save it to the superdataset.
- However, just as in the example about the data scientist,
- in a bit of time, we will forget how this file came into existence, or
- that the script ``code/list_titles.sh`` is associated with this file, and
- can be used to update it later on.
- .. index::
- pair: run; DataLad command
- pair: run command with provenance capture; with DataLad
- pair: run command with provenance capture; with DataLad run
- The :dlcmd:`run` command
- can help with this. Put simply, it records a command's impact on a dataset. Put
- more technically, it will record a shell command, and :dlcmd:`save` all changes
- this command triggered in the dataset -- be that new files or changes to existing
- files.
- Let's try the simplest way to use this command: :dlcmd:`run`,
- followed by a commit message (``-m "a concise summary"``), and the
- command that executes the script from the shell: ``bash code/list_titles.sh > recordings/podcasts.tsv``.
- It is helpful to enclose the command in quotation marks.
- Note that we execute the command from the root of the superdataset.
- It is recommended to use :dlcmd:`run` in the root of the dataset
- you want to record the changes in, so make sure to run this
- command from the root of ``DataLad-101``.
- .. runrecord:: _examples/DL-101-108-105
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: The datalad run command records a command's impact on a dataset. We try it in the most simple way:
- :cast: 02_reproducible_execution
- $ datalad run -m "create a list of podcast titles" \
- "bash code/list_titles.sh > recordings/podcasts.tsv"
- Let's take a look into the history:
- .. runrecord:: _examples/DL-101-108-106
- :language: console
- :workdir: dl-101/DataLad-101
- :lines: 1-30
- :emphasize-lines: 6, 11, 25
- :notes: Let's now check what has been written into the history. (runrecord)
- :cast: 02_reproducible_execution
- $ git log -p -n 1 # On Windows, you may just want to type "git log".
- The commit message we have supplied with ``-m`` directly after :dlcmd:`run` appears
- in our history as a short summary.
- Additionally, the output of the command, ``recordings/podcasts.tsv``,
- was saved right away.
- But there is more in this log entry, a section in between the markers
- ``=== Do not change lines below ===`` and
- ``^^^ Do not change lines above ^^^``.
- This is the so-called ``run record`` -- a recording of all of the
- information in the :dlcmd:`run` command, generated by DataLad.
- In this case, it is a very simple summary. One informative
- part is highlighted:
- ``"cmd": "bash code/list_titles.sh"`` is the command that was run
- in the terminal.
- This information therefore maps the command, and with it the script,
- to the output file, in one commit. Nice, isn't it?
- Arguably, the :term:`run record` is not the most human-readable way to display information.
- This representation however is less for the human user (the human user should
- rely on their informative commit message), but for DataLad, in particular for the
- :dlcmd:`rerun` command, which you will see in action shortly. This
- ``run record`` is machine-readable provenance that associates an output with
- the command that produced it.
- You have probably already guessed that every :dlcmd:`run` command
- ends with a ``datalad save``. A logical consequence from this fact is that any
- :dlcmd:`run` that does not result in any changes in a dataset (no modification
- of existing content; no additional files) will not produce any record in the
- dataset's history (just as a :dlcmd:`save` with no modifications present
- will not create a history entry). Try to run the exact same
- command as before, and check whether anything in your log changes:
- .. runrecord:: _examples/DL-101-108-107
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: A run command that does not result in changes (no modifications, no additional files) will not produce a record in the dataset history. So what happens if we do the same again?
- :cast: 02_reproducible_execution
- $ datalad run -m "Try again to create a list of podcast titles" \
- "bash code/list_titles.sh > recordings/podcasts.tsv"
- .. runrecord:: _examples/DL-101-108-108
- :language: console
- :workdir: dl-101/DataLad-101
- :lines: 1-5
- :emphasize-lines: 2
- :notes: as the result is byte-identical, there is no new commit
- :cast: 02_reproducible_execution
- $ git log --oneline
- The most recent commit is still the :dlcmd:`run` command from before,
- and there was no second :dlcmd:`run` commit created.
- The :dlcmd:`run` can therefore help you to keep track of what you are doing
- in a dataset and capture provenance of your files: When, by whom, and how exactly
- was a particular file created or modified?
- The next sections will demonstrate how to make use of this information,
- and also how to extend the command with additional arguments that will prove to
- be helpful over the course of this chapter.
- .. only:: adminmode
- Add a tag at the section end.
- .. runrecord:: _examples/DL-101-108-109
- :language: console
- :workdir: dl-101/DataLad-101
- $ git branch sct_keeping_track
|