- .. _run3:
- Input and output
- ----------------
- In the previous two sections, you created a simple ``.tsv`` file of all
- speakers and talk titles in the ``longnow/`` podcasts subdataset, and you have
- re-executed a :dlcmd:`run` command after a bug-fix in your script.
- But these previous :dlcmd:`run` and :dlcmd:`rerun` commands were very simple.
- Maybe you noticed some values in the ``run record`` were empty:
- ``inputs`` and ``outputs`` for example did not have an entry. Let's experience
- a few situations in which
- these two arguments can become necessary.
- In our DataLad-101 course we were given a group assignment. Everyone should
- give a small presentation about an open DataLad dataset they found. Conveniently,
- you decided to settle for the longnow podcasts right away.
- After all, you know the dataset quite well already,
- and after listening to almost a third of the podcasts
- and enjoying them a lot,
- you also want to recommend them to the others.
- Almost all of the slides are ready, but what's still missing is the logo of the
- longnow podcasts. Good thing that this is part of the subdataset,
- so you can simply retrieve it from there.
- The logos (one for the SALT series, one for the Interval series -- the two
- directories in the subdataset)
- were originally extracted from the podcasts metadata information by DataLad.
- In a while, we will dive into the metadata aggregation capabilities of DataLad,
- but for now, let's just use the logos instead of finding out where they
- come from -- this will come later.
- As part of the metadata of the dataset, the logos are
- in the hidden paths
- ``.datalad/feed_metadata/logo_salt.jpg`` and
- ``.datalad/feed_metadata/logo_interval.jpg``:
- .. runrecord:: _examples/DL-101-110-101
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: We saw a very simple datalad run. Now we are going to extend it with useful options. Narrative: prepare talk about dataset, add logo to slides. For this, we'll try to resize a logo in the meta data of the subdataset
- :cast: 02_reproducible_execution
- $ ls recordings/longnow/.datalad/feed_metadata/*jpg
- For the slides you decide to prepare images of size 400x400 px, but
- the logos' original size is much larger (both are 3000x3000 pixels). Therefore
- let's try to resize the images -- currently, they are far too large to fit on a slide.
- To resize an image from the command line we can use the Unix
- command ``convert -resize`` from the `ImageMagick tool <https://imagemagick.org/index.php>`_.
- The command takes a new size in pixels as an argument, a path to the file that should be
- resized, and a filename and path under which a new,
- resized image will be saved.
- To resize one image to 400x400 px, the command would thus be
- ``convert -resize 400x400 path/to/file.jpg path/to/newfilename.jpg``.
- .. index::
- pair: install ImageMagick; on Windows
- single: installation; ImageMagick
- .. windows-wit:: Tool installation
- .. include:: topic/installation-imagemagick.rst
- Remembering the last lecture on :dlcmd:`run`, you decide to plug this into
- :dlcmd:`run`. Even though this is not a script, it is a command, and you can wrap
- commands like this conveniently with :dlcmd:`run`.
- Because they will be quite long, we line break the commands in the upcoming examples
- for better readability -- in your terminal, you can always write the commands into
- a single line.
- .. index::
- pair: run command with provenance capture; with DataLad run
- .. runrecord:: _examples/DL-101-110-102
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 4
- :notes: This command resizes the logo to 400 by 400 px -- but it will fail!
- :cast: 02_reproducible_execution
- :exitcode: 1
- $ datalad run -m "Resize logo for slides" \
- "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- *Oh, crap!* Why didn't this work?
- Let's take a look at the error message DataLad provides. In general, these error messages
- might seem wordy, and maybe a bit intimidating as well, but usually they provide helpful
- information to find out what is wrong. Whenever you encounter an error message,
- make sure to read it, even if it feels like a mushroom cloud exploded in your terminal.
- A :dlcmd:`run` error message has several parts. The first starts after
- ``[INFO ] == Command start (output follows) =====``.
- This is displaying errors that the
- terminal command threw: The ``convert`` tool complains that it cannot open
- the file, because there is "No such file or directory".
- The second part starts after
- ``[INFO ] == Command exit (modification check follows) =====``.
- DataLad adds information about a "non-zero exit code". A non-zero exit code indicates
- that something went wrong [#f1]_. In principle, you could go ahead and google what this
- specific exit status indicates. However, the solution might have already occurred to you when
- reading the first error report: The file is not present.
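Exit codes can also be inspected outside of DataLad. The following is a minimal sketch in plain shell, assuming nothing beyond the standard ``true`` and ``cat`` utilities; the path ``/nonexistent/logo.jpg`` is a made-up name chosen because it should not exist on any system. The special parameter ``$?`` holds the exit status of the most recent command.

```shell
# A successful command exits with status 0.
true
ok_status=$?
echo "exit status of 'true': $ok_status"

# Reading a file that does not exist fails, much like convert did above.
# The path is hypothetical; the error message is discarded via 2>/dev/null.
fail_status=0
cat /nonexistent/logo.jpg 2>/dev/null || fail_status=$?
echo "exit status of the failed 'cat': $fail_status"
```

The exact non-zero value depends on the tool that failed, which is why looking up a specific exit status is sometimes worthwhile.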
- How can that be?
- "Right!", you exclaim with a facepalm.
- Just as with the ``.mp3`` files, the ``.jpg`` file content is not present
- locally after a :dlcmd:`clone`, and we did not :dlcmd:`get` it yet!
- .. index::
- pair: declare command input; with DataLad run
- This is where the ``-i``/``--input`` option for a ``datalad run`` becomes useful.
- The content of everything that is specified as an ``input`` will be retrieved
- prior to running the command.
- .. runrecord:: _examples/DL-101-110-103
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 8
- :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- :notes: The problem is that the content (logo) is not yet retrieved. The --input option makes sure that all content is retrieved prior to command execution.
- :cast: 02_reproducible_execution
- $ datalad run -m "Resize logo for slides" \
- --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
- "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- $ # or shorter:
- $ datalad run -m "Resize logo for slides" \
- -i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
- "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- Cool! You can see in this output that prior to the command execution, DataLad did a :dlcmd:`get`.
- This is useful for several reasons. For one, it saved us the work of manually
- getting content. But moreover, this is useful for anyone with whom we might share the
- dataset: With an installed dataset one can very simply rerun :dlcmd:`run` commands
- if they have the input argument appropriately specified. It is therefore good practice to
- specify the inputs appropriately. Remember from section :ref:`installds`
- that :dlcmd:`get` will only retrieve content if
- it is not yet present. Input that is already downloaded will not be downloaded again, so
- specifying inputs even though they are already present will not do any harm.
- .. index::
- pair: path globbing; with DataLad run
- .. find-out-more:: What if there are several inputs?
- Often, a command needs several inputs. In principle, every input (which could be files, directories, or subdatasets) gets its own ``-i``/``--input``
- flag. However, you can make use of :term:`globbing`. For example,
- .. code-block:: console
- $ datalad run --input "*.jpg" "COMMAND"
- will retrieve all ``.jpg`` files prior to command execution.
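The quotes around the glob matter. The sketch below uses plain shell in a throwaway directory with made-up file names to show the difference: an unquoted pattern is expanded by the shell before any command sees it, whereas a quoted pattern is passed on literally, so that the expansion is left to the receiving program (here, ``datalad run``).

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical file names, for illustration only.
touch logo_salt.jpg logo_interval.jpg notes.txt

# Unquoted: the shell expands the pattern itself, in sorted order.
matched=$(echo *.jpg)
echo "expanded by the shell: $matched"

# Quoted: the pattern is handed over unchanged.
literal=$(echo "*.jpg")
echo "passed on unchanged:   $literal"
```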
- If outputs already exist...
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- .. index::
- pair: files are unlocked by default; on Windows
- pair: unlocked files; in adjusted mode
- .. windows-wit:: Good news! Here is something that is easier on Windows
- .. include:: topic/adjustedmode-unlockedfiles.rst
- Looking at the resulting image, you wonder whether 400x400 might be a tiny bit too small.
- Maybe we should try to resize it to 450x450, and see whether that looks better?
- Note that we cannot use a :dlcmd:`rerun` for this: if we want to change the dimension option
- in the command, we have to define a new :dlcmd:`run` command.
- To establish best practices, let's specify the input even though it is already present:
- .. runrecord:: _examples/DL-101-110-104
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 9
- :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- :notes: Maybe 400x400 is too small. We should try 450x450. Can we use a datalad rerun for this? (no)
- :exitcode: 1
- :cast: 02_reproducible_execution
- $ datalad run -m "Resize logo for slides" \
- --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
- "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- $ # or shorter:
- $ datalad run -m "Resize logo for slides" \
- -i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
- "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
- **Oh wtf**... *What is it now?*
- A quick glimpse into the error message shows a different error than before:
- The tool complains that it is "unable to open" the image, because the "Permission [is] denied".
- We have not seen anything like this before.
- Confused about what we might have
- done wrong, we raise our hand to ask the instructor.
- Knowingly, she smiles, and tells us about how DataLad protects content given
- to it:
- "Content in your DataLad dataset is protected by :term:`git-annex` from
- accidental changes" our instructor begins.
- "Wait!" we interrupt. "First off, that wasn't accidental. And second, I was told this
- course does not have ``git-annex-101`` as a prerequisite?"
- "Yes, hear me out" she says. "I promise you two different solutions at
- the end of this explanation, and the concept behind this is quite relevant".
- DataLad usually gives content to :term:`git-annex` to store and track.
- git-annex, let's just say, takes this task *really* seriously. One of its
- features that you have just experienced is that it *locks* content.
- If files are *locked down*, their content cannot be modified. In principle,
- that's not a bad thing: It could be your late grandma's secret cherry-pie
- recipe, and you do not want to *accidentally* change that.
- Therefore, a file needs to be consciously *unlocked* to apply modifications.
- In the attempt to resize the image to 450x450 you tried to overwrite
- ``recordings/salt_logo_small.jpg``, a file that was given to DataLad
- and thus protected by git-annex.
- .. index::
- pair: unlock; DataLad command
- pair: unlock file; with DataLad
- There is a DataLad command that takes care of unlocking file content,
- and thus making locked files modifiable again: :dlcmd:`unlock`.
- Let us check out what it does:
- .. index::
- pair: files are unlocked by default; on Windows
- single: adjusted branch; unlocked files
- .. windows-wit:: What happens if I run this on Windows?
- .. include:: topic/adjustedmode-unlockedfiles2.rst
- .. runrecord:: _examples/DL-101-111-101
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: The created output is protected from accidental modifications, we have to unlock it first:
- :cast: 02_reproducible_execution
- $ datalad unlock recordings/salt_logo_small.jpg
- Well, ``unlock(ok)`` does not sound too bad for a start. As always, we
- feel the urge to run a :dlcmd:`status` on this:
- .. runrecord:: _examples/DL-101-111-102
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: What does the file look like after an unlock?
- :cast: 02_reproducible_execution
- $ datalad status
- "Ah, do not mind that for now", our instructor says, and with a wink she
- continues: "We'll talk about symlinks and object trees a while later".
- You are not really sure whether that's a good thing, but you have a task to focus
- on. Hastily, you run the command right from the terminal:
- .. runrecord:: _examples/DL-101-111-103
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: In principle, you could rerun the command now, outside of any datalad run. The unlocked output can be overwritten
- :cast: 02_reproducible_execution
- $ convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg
- Hey, no permission denied error! You note that the instructor still stands
- right next to you. "Sooo... now what do I do to *lock* the file again?" you ask.
- "Well... what you just did there was quite suboptimal. Didn't you want to
- use :dlcmd:`run`? But, anyway, in order to lock the file again, you would need to
- run a :dlcmd:`save`."
- .. runrecord:: _examples/DL-101-111-104
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: Afterwards you'd need to save, to lock everything again
- :cast: 02_reproducible_execution
- $ datalad save -m "resized picture by hand"
- "So", you wonder aloud, "whenever I want to modify a file, I need to
- :dlcmd:`unlock` it, do the modifications, and then :dlcmd:`save` it?"
- "Well, this is certainly one way of doing it, and a completely valid workflow
- if you would do that outside of a :dlcmd:`run` command.
- But within :dlcmd:`run` there is actually a much easier way of doing this.
- Let's use the ``--output`` argument."
- :dlcmd:`run` *retrieves* everything that is specified as ``--input`` prior to
- command execution, and it *unlocks* everything specified as ``--output`` prior to
- command execution. Therefore, whenever the output of a :dlcmd:`run` command already
- exists and is tracked, it should be specified as an argument in
- the ``-o``/``--output`` option.
- .. index::
- pair: path globbing; with DataLad run
- .. find-out-more:: But what if I have a lot of outputs?
- The use case here is simplistic -- a single file gets modified.
- But there are commands and tools that create full directories with
- many files as an output.
- The easiest way to specify this type of output
- is by supplying the directory name, or the directory name and a :term:`globbing` character, such as
- ``-o directory/*.dat``.
- This would unlock all files with a ``.dat`` extension inside of ``directory``.
- To glob for files in multiple levels of directories, use ``**`` (a so-called `globstar <https://www.linuxjournal.com/content/globstar-new-bash-globbing-option>`_) for a recursive glob through any number of directories.
- And, just as for ``-i``/``--input``, you could use multiple ``--output`` specifications.
- .. index::
- pair: declare command output; with DataLad run
- In order to execute :dlcmd:`run` with both the ``-i``/``--input`` and ``-o``/``--output``
- flag and see their magic, let's resize the second logo, ``logo_interval.jpg``:
- .. index::
- pair: files are unlocked by default; on Windows
- pair: run; DataLad command
- pair: unlocked files; in adjusted mode
- .. windows-wit:: Wait, would I need to specify outputs, too?
- .. include:: topic/adjustedmode-unlockedfiles-output.rst
- .. runrecord:: _examples/DL-101-111-105
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 11
- :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" --output "recordings/interval_logo_small.jpg" "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
- :notes: but it is way easier to just use the --output option of datalad run: it takes care of unlocking if necessary
- :cast: 02_reproducible_execution
- $ datalad run -m "Resize logo for slides" \
- --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
- --output "recordings/interval_logo_small.jpg" \
- "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
- $ # or shorter:
- $ datalad run -m "Resize logo for slides" \
- -i "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
- -o "recordings/interval_logo_small.jpg" \
- "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
- This time, with both ``--input`` and ``--output``
- options specified, DataLad informs about the :dlcmd:`get`
- operations it performs prior to the command
- execution, and :dlcmd:`run` executes the command successfully.
- It does *not* inform about any :dlcmd:`unlock` operation,
- because the output ``recordings/interval_logo_small.jpg`` does not
- exist before the command is run. Should you rerun this command however,
- the summary will include a statement about content unlocking. You will
- see an example of this in the next section.
- Note now how many individual commands a :dlcmd:`run` saves us:
- :dlcmd:`get`, :dlcmd:`unlock`, and :dlcmd:`save`!
- But even better: Beyond saving time *now*, running commands reproducibly and
- recorded with :dlcmd:`run` saves us plenty of time in the future as soon
- as we want to rerun a command, or find out how a file came into existence.
- With this last code snippet, you have experienced a full :dlcmd:`run` command: commit message,
- input and output definitions (the order in which you give those two options is irrelevant),
- and the command to be executed. Whenever a command takes input or produces output you should specify
- this with the appropriate option.
- Make a note of this behavior in your ``notes.txt`` file.
- .. runrecord:: _examples/DL-101-111-106
- :language: console
- :workdir: dl-101/DataLad-101
- :notes: Finally, let's add a note on this
- :cast: 02_reproducible_execution
- $ cat << EOT >> notes.txt
- You should specify all files that a command takes as input with an
- -i/--input flag. These files will be retrieved prior to the command
- execution. Any content that is modified or produced by the command
- should be specified with an -o/--output flag. Upon a run or rerun of
- the command, the contents of these files will get unlocked so that
- they can be modified.
- EOT
- Save yourself the preparation time
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- It's generally good practice to specify ``--input`` and ``--output`` even if your input files are already retrieved and your output files unlocked -- it makes sure that a recomputation can succeed, even if inputs are not yet retrieved, or if output needs to be unlocked.
- However, the internal preparation steps of checking that inputs exist or that outputs are unlocked can take a bit of time, especially if it involves checking a large number of files.
- If you want to avoid the expense of unnecessary preparation steps you can make use of the ``--assume-ready`` argument of :dlcmd:`run`.
- Depending on whether your inputs are already retrieved, your outputs already unlocked (or not needed to be unlocked), or both, specify ``--assume-ready`` with the argument ``inputs``, ``outputs`` or ``both`` and save yourself a few seconds, without sacrificing the ability to rerun your command under conditions in which the preparation would be necessary.
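As a sketch of how this could look for the logo example: the input has already been retrieved, but the existing output still needs to be unlocked, so only the inputs are declared ready. The wrapper function ``resize_logo_assume_ready`` is a hypothetical helper introduced here for illustration; it assumes it is called from the root of the ``DataLad-101`` dataset with DataLad and ImageMagick installed.

```shell
# Hypothetical helper: rerun the resize without re-checking the input.
# The output is still listed, so DataLad takes care of unlocking it.
resize_logo_assume_ready() {
    datalad run -m "Resize logo for slides" \
        --assume-ready inputs \
        --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
        --output "recordings/interval_logo_small.jpg" \
        "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
}
```

Calling ``resize_logo_assume_ready`` inside the dataset would execute the command. Using ``both`` instead of ``inputs`` would only be appropriate if the output file were already unlocked or did not exist yet.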
- Placeholders
- ^^^^^^^^^^^^
- Just after writing the note, you had to relax your fingers a bit. "Man, this was
- so much typing. Not only did I need to specify the inputs and outputs, I also had
- to repeat all of these lengthy paths in the command line call..." you think.
- There is a neat little trick to spare you half of this typing effort, though: *Placeholders*
- for inputs and outputs. This is how it works:
- Instead of running
- .. code-block:: console
- $ datalad run -m "Resize logo for slides" \
- --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
- --output "recordings/interval_logo_small.jpg" \
- "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
- you could shorten this to
- .. code-block:: console
- :emphasize-lines: 4
- $ datalad run -m "Resize logo for slides" \
- --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
- --output "recordings/interval_logo_small.jpg" \
- "convert -resize 450x450 {inputs} {outputs}"
- The placeholder ``{inputs}`` will expand to the path given as ``--input``, and
- the placeholder ``{outputs}`` will expand to the path given as ``--output``.
- This means instead of writing the full paths in the command, you can simply reuse
- the ``--input`` and ``--output`` specification done before.
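To make the substitution tangible, here is a minimal sketch in plain shell. It is not how DataLad implements placeholders; ``sed`` merely stands in for the expansion, and the variable names ``inputs``, ``outputs``, ``template``, and ``expanded`` are chosen for illustration.

```shell
# Values as they would be given to --input and --output.
inputs="recordings/longnow/.datalad/feed_metadata/logo_interval.jpg"
outputs="recordings/interval_logo_small.jpg"

# The command template, with placeholders instead of paths.
template="convert -resize 450x450 {inputs} {outputs}"

# Stand-in for the expansion: substitute each placeholder with its value.
# '|' serves as the sed delimiter because the paths contain '/'.
expanded=$(printf '%s\n' "$template" |
    sed -e "s|{inputs}|$inputs|" -e "s|{outputs}|$outputs|")

echo "$expanded"
```

The printed line is identical to the long command that was typed out by hand in the first variant.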
- .. index::
- pair: multiple command inputs; with DataLad run
- .. find-out-more:: What if I have multiple inputs or outputs?
- If multiple values are specified, e.g., as in
- .. code-block:: console
- $ datalad run -m "move a few files around" \
- --input "file1" --input "file2" --input "file3" \
- --output "directory_a/" \
- "mv {inputs} {outputs}"
- the values will be joined by a space like this:
- .. code-block:: console
- $ datalad run -m "move a few files around" \
- --input "file1" --input "file2" --input "file3" \
- --output "directory_a/" \
- "mv file1 file2 file3 directory_a/"
- The order of the values will match the order given on the command line.
- If you use globs for input specification, as in
- .. code-block:: console
- $ datalad run -m "move a few files around" \
- --input "file*" \
- --output "directory_a/" \
- "mv {inputs} {outputs}"
- the globs will be expanded in alphabetical order (like bash):
- .. code-block:: console
- $ datalad run -m "move a few files around" \
- --input "file1" --input "file2" --input "file3" \
- --output "directory_a/" \
- "mv file1 file2 file3 directory_a/"
- If the command only needs a subset of the inputs or outputs, individual values
- can be accessed with an integer index, e.g., ``{inputs[0]}`` for the very first
- input.
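A rough shell analogy, using positional parameters as a stand-in for the list of inputs, may help to picture the joined form and the indexed form. The file names are the hypothetical ones from the example above. Note that the shell counts positional parameters from 1, whereas the placeholder index starts at 0.

```shell
# Three hypothetical inputs, in the order given on the command line.
set -- file1 file2 file3

# Analogous to {inputs}: all values joined by single spaces.
all_inputs="$*"
# Analogous to {inputs[0]}: the first value only.
first_input="$1"

echo "mv $all_inputs directory_a/"
echo "first input: $first_input"
```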
- .. index::
- pair: run command with curly brackets; with DataLad run
- .. find-out-more:: ... wait, what if I need a curly bracket in my 'datalad run' call?
- If your command call involves a ``{`` or ``}`` character, you will need to escape
- this brace character by doubling it, i.e., ``{{`` or ``}}``.
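The following sketch illustrates the doubling with an ``awk`` call, whose program text needs literal braces. Again, ``sed`` only imitates the expansion in a simplified way, and the file name ``speakers.tsv`` is made up for this example.

```shell
# Hypothetical command template: the braces of the awk program are doubled,
# while {inputs} remains an ordinary placeholder.
template="awk '{{print \$1}}' {inputs}"
inputs="speakers.tsv"   # made-up file name

# Simplified stand-in for the expansion: fill in the placeholder first,
# then turn each doubled brace back into a single one.
expanded=$(printf '%s\n' "$template" |
    sed -e "s|{inputs}|$inputs|" -e 's|{{|{|g' -e 's|}}|}|g')

echo "$expanded"
```

The result is the command as it would be typed directly in a terminal, with single braces around the ``awk`` program.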
- .. index::
- pair: dry-run; with DataLad run
- .. _dryrun:
- Dry-running your run call
- ^^^^^^^^^^^^^^^^^^^^^^^^^
- :dlcmd:`run` commands can become confusing and long, especially when you make heavy use of placeholders or wrap complex bash commands.
- To better anticipate what you will be running, or help debug a failed command, you can make use of the ``--dry-run`` flag of ``datalad run``.
- This option needs a mode specification (``--dry-run=basic`` or ``--dry-run=command``), followed by the ``run`` command you want to execute, and it will decipher the command's elements:
- The mode ``command`` will display the command that is about to be run.
- The mode ``basic`` will report a few important details about the execution:
- Apart from displaying the command that will be run, you will learn *where* the command runs, what its *inputs* are (helpful if your ``--input`` specification includes a :term:`globbing` term), and what its *outputs* are.
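Applied to the logo example, a dry run could be wrapped as sketched below. The function ``dry_run_resize`` is a hypothetical helper, not part of DataLad; it takes the mode (``basic`` or ``command``) as its only argument and assumes it is called from the root of the ``DataLad-101`` dataset.

```shell
# Hypothetical helper; pass the dry-run mode ("basic" or "command") as $1.
# Nothing is executed or saved: DataLad only reports what it would do.
dry_run_resize() {
    datalad run -m "Resize logo for slides" \
        --dry-run="$1" \
        --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
        --output "recordings/interval_logo_small.jpg" \
        "convert -resize 450x450 {inputs} {outputs}"
}
```

Running ``dry_run_resize command`` is a quick way to check how the ``{inputs}`` and ``{outputs}`` placeholders get expanded before committing to the real call.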
- .. only:: adminmode
- Add a tag at the section end.
- .. runrecord:: _examples/DL-101-111-107
- :language: console
- :workdir: dl-101/DataLad-101
- $ git branch sct_input_and_output
- .. [#f1] In shell programming, commands exit with a specific code that indicates
- whether they failed, and if so, how. Successful commands have the exit code zero. All failures
- have exit codes greater than zero.