.. _containersrun:

Computational reproducibility with software containers
------------------------------------------------------

Just after submitting your midterm data analysis project, you get together
with your friends. "I'm curious: So what kind of analyses did y'all carry out?"
you ask. The variety of methods and datasets the others used is huge, and
one analysis interests you in particular. Later that day, you decide to
install this particular analysis dataset to learn more about the methods used
in it. However, when you :dlcmd:`rerun` your friend's analysis script,
it throws an error. Hastily, you call her -- maybe she can quickly fix her
script and resubmit the project with only minor delays. "I don't know what
you mean", you hear in return.
"On my machine, everything works fine!"

On their own, DataLad datasets can contain almost anything that is relevant to
ensure reproducibility: data, code, human-readable analysis descriptions
(e.g., ``README.md`` files), provenance on the origin of all files
obtained from elsewhere, and machine-readable records that link generated
outputs to the commands, scripts, and data they were created from.

This, however, may not be sufficient to ensure that an analysis *reproduces*
(i.e., produces the same or highly similar results), let alone *works* on a
computer different from the one it was initially composed on. This is because
the analysis does not only depend on data and code, but also on the
*software environment* that it is conducted in.

A lack of information about the operating system of the computer, the precise
versions of installed software, or their configurations may
make it impossible to replicate your analysis on a different machine, or even
on your own machine once a new software update is installed. Therefore, it is
important to communicate all details about the computational environment for
an analysis as thoroughly as possible. Luckily, DataLad provides an extension
that can link computational environments to datasets: the
`datalad containers <https://docs.datalad.org/projects/container>`_
extension.

This section will give a quick overview of what containers are and
demonstrate how ``datalad-container`` helps to capture full provenance of an
analysis by linking containers to datasets and analyses.

.. importantnote:: Install the datalad-container extension

   This section uses the :term:`DataLad extension` ``datalad-container``.
   Like other extensions, it is a stand-alone Python package that can be installed using :term:`pip`:

   .. code-block:: bash

      $ pip install datalad-container

   As with DataLad and other Python packages, you might want to do the installation in a :term:`virtual environment`.

.. index::
   pair: recipe; software container concept
   pair: image; software container concept
   pair: container; software container concept

Containers
^^^^^^^^^^

To put it simply, computational containers are cut-down virtual machines that
allow you to package all software libraries and their dependencies (in the
precise versions your analysis requires) into a bundle you can share with
others. On your own and others' machines, the container constitutes a secluded
software environment that

- contains the exact software environment that you specified, ready to run
  analyses,
- does not affect any software outside of the container.

Unlike virtual machines, software containers do not run a full operating
system on virtualized hardware. Instead, they use basic services of the host
operating system (in a read-only fashion). This makes them lightweight and
still portable. By sharing software environments with containers,
others (and also your future self) have easy access to the correct software
without the need to modify the software environment of the machine the
container runs on. Thus, containers are ideal to encapsulate the software
environment and share it together with the analysis code and data to ensure
computational reproducibility of your analyses, or to create a suitable
software environment on a computer that you do not have permissions to deploy
software on.

There are a number of different tools to create and use containers, with
`Docker <https://www.docker.com>`_ being one of the most well-known of them.
While it is a powerful tool, it is only rarely used on high performance
computing (HPC) infrastructure [#f2]_. An alternative is
`Singularity <https://sylabs.io/docs>`_.
Both of these tools share core terminology:

:term:`container recipe`
   A text file that lists all required components of the computational
   environment. It is written by a human user.

:term:`container image`
   This is *built* from the recipe file. It is a static file system inside a
   file, populated with the software specified in the recipe, and some initial
   configuration.

:term:`container`
   A running instance of an image that you can actually use for your
   computations. If you want to create and run your own software container,
   you start by writing a recipe file and building an image from it.
   Alternatively, you can also *pull* an image built from a publicly shared
   recipe from the *hub* of the tool you are using.

hub
   A storage resource to share and consume images. Examples are
   :term:`Singularity-Hub`, :term:`Docker-Hub`, and
   `Amazon ECR <https://aws.amazon.com/ecr>`_, which hosts Docker images.

Note that, as of now, the ``datalad-container`` extension supports
Singularity and Docker images.
Singularity is furthermore compatible with Docker -- you can use
Docker images as a basis for Singularity images, or run Docker images with
Singularity (even without having Docker installed).
See the :windows-wit:`on Docker <ww-docker>` for installation options.
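
To make the recipe concept concrete, here is a minimal, hypothetical
Singularity definition file. The base image and the package names are
illustrative only; this is not the recipe used later in this section:

.. code-block:: singularity

   Bootstrap: docker
   From: debian:stable

   %post
       # install the analysis software in the versions the analysis
       # needs (packages here are illustrative)
       apt-get update && apt-get install -y python3 python3-pip
       pip3 install pandas seaborn scikit-learn

   %runscript
       exec "$@"

An image built from such a recipe would contain a Debian system with the
listed Python packages, and the ``%runscript`` section defines what happens
when the container is executed.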

.. importantnote:: Additional requirement: Singularity

   To use Singularity containers, you have to
   `install <https://docs.sylabs.io/guides/3.4/user-guide/installation.html>`_ the Singularity software.

.. index::
   pair: installation; Docker
   pair: install Docker; on Windows

.. find-out-more:: Docker installation on Windows
   :name: ww-docker

   Singularity is not available for Windows.
   Windows users therefore need to install :term:`Docker`.
   The currently recommended way to do so is by installing `Docker Desktop <https://docs.docker.com/desktop/install/windows-install/>`_ and using its "WSL2" backend (a choice one can set during the installation).
   In the case of an "outdated WSL kernel version" issue, run ``wsl --update`` in a regular Windows Command Prompt (CMD).
   After the installation, run Docker Desktop, and wait several minutes for it to start the Docker engine in the background.
   To verify that everything works as it should, run ``docker ps`` in a Windows Command Prompt (CMD).
   If it reports an error asking "Is the docker daemon running?", give it a few more minutes to let Docker Desktop start the engine.
   If the ``docker`` command cannot be found, something went wrong during installation.

.. index::
   pair: containers-add; DataLad command
   pair: containers-run; DataLad command

Using ``datalad containers``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One core feature of the ``datalad containers`` extension is that it registers
computational containers with a dataset. This is done with the
:dlcmd:`containers-add` command.
Once a container is registered, arbitrary commands can be executed inside of
it, i.e., in the precise software environment the container encapsulates. All
it takes is to swap the :dlcmd:`run` command introduced in
section :ref:`run` with the :dlcmd:`containers-run` command.

Let's see this in action for the ``midterm_analysis`` dataset by rerunning
the analysis you did for the midterm project within a Singularity container.
We start by registering a container to the dataset.
For this, we will pull an image from Singularity Hub. This image was made
for the handbook, and it contains the relevant Python setup for
the analysis. Its recipe lives in the handbook's
`resources repository <https://github.com/datalad-handbook/resources>`_.
If you are curious how to create a Singularity image, the :find-out-more:`on this topic <fom-container-creation>` has some pointers:

.. index::
   pair: build container image; with Singularity

.. windows-wit:: How to make a Singularity image
   :name: fom-container-creation

   Singularity images are built from recipe files, often called
   "definition files", that hold a definition of the software container and
   its contents and components. The
   `Singularity documentation <https://docs.sylabs.io/guides/3.4/user-guide/build_a_container.html>`_
   has its own tutorial on how to build such images from scratch.
   An alternative to writing the recipe file by hand is to use
   `Neurodocker <https://github.com/ReproNim/neurodocker>`_. This
   command-line program can help you generate custom Singularity recipes (and
   also ``Dockerfiles``, from which Docker images are built). A wonderful tutorial
   on how to use Neurodocker is
   `this introduction <https://miykael.github.io/nipype_tutorial/notebooks/introduction_neurodocker.html>`_
   by Michael Notter.

   Once a recipe exists, the command

   .. code-block:: console

      $ sudo singularity build <NAME> <RECIPE>

   will build an image (called ``<NAME>``) from the recipe. Note that this
   command requires ``root`` privileges ("``sudo``"). You can build the image
   on any machine, though, not necessarily the one that is later supposed to
   actually run the analysis, e.g., your own laptop versus a compute cluster.

.. index::
   pair: add container image to dataset; with DataLad

The :dlcmd:`containers-add` command takes an arbitrary
name to give to the container, and a path or URL to a container image:

.. runrecord:: _examples/DL-101-133-101
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project
   :cast: 10_yoda
   :notes: Computational reproducibility: add a software container

   $ # we are in the midterm_project subdataset
   $ datalad containers-add midterm-software --url shub://adswa/resources:2

.. index::
   pair: hub; Docker

.. find-out-more:: How do I add an image from Docker-Hub, Amazon ECR, or a local container?

   Should the image you want to use sit on Docker-Hub, specify the ``--url``
   option prefixed with ``docker://`` or ``dhub://`` instead of ``shub://``:

   .. code-block:: console

      $ datalad containers-add midterm-software --url docker://adswa/resources:2

   If your image lives on Amazon ECR, use a ``dhub://`` prefix followed by the
   AWS ECR URL, as in

   .. code-block:: console

      $ datalad containers-add --url dhub://12345678.dkr.ecr.us-west-2.amazonaws.com/maze-code/data-import:latest data-import

   If you want to add a container that exists locally, specify the path to it
   like this:

   .. code-block:: console

      $ datalad containers-add midterm-software --url path/to/container

This command downloaded the container from Singularity Hub, added it to
the ``midterm_project`` dataset, and recorded basic information on the
container under its name "midterm-software" in the dataset's configuration at
``.datalad/config``. You can find out more about these records in a dedicated :ref:`find-out-more on these additional configurations <fom-containerconfig>`.
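
These records are plain ``git config``-style entries. As an illustration, the
sketch below reads such an entry with Python's standard-library
``configparser``. The configuration content shown is a hypothetical
approximation of what ``containers-add`` records, not its exact output, and
the image path is made up:

.. code-block:: python

   import configparser

   # Hypothetical approximation of a container registration in
   # .datalad/config (git-config syntax, written here without the
   # tab indentation git normally uses, so configparser can read it).
   cfg_text = '''\
   [datalad "containers.midterm-software"]
   updateurl = shub://adswa/resources:2
   image = .datalad/environments/midterm-software/image
   cmdexec = singularity exec {img} {cmd}
   '''

   cfg = configparser.ConfigParser()
   cfg.read_string(cfg_text)

   # The image path and the command template live under the
   # section named after the container.
   section = cfg['datalad "containers.midterm-software"']
   print(section['image'])    # path of the image file inside the dataset
   print(section['cmdexec'])  # template used to invoke the container
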

.. index::
   pair: DataLad concept; container image registration

.. find-out-more:: What changes in .datalad/config when one adds a container?
   :name: fom-containerconfig
   :float:

   .. include:: topic/container-imgcfg.rst

Such configurations can, among other things, be important to ensure correct
container invocation on specific systems or across systems. One example is
*bind-mounting* directories into containers, i.e., making a specific directory
and its contents available inside a container. Different containerization
software, and different versions or configurations of it, determine *default
bind-mounts* on a given system. Thus, depending on the system and the location
of the dataset on it, a shared dataset may be automatically bind-mounted or not.
To ensure that the dataset is correctly bind-mounted on all systems, let's add
a call-format specification with a bind-mount to the current working directory,
following the information in the :ref:`find-out-more on additional container
configurations <fom-containerconfig>`.

.. index::
   single: configuration.item; datalad.containers.<name>.cmdexec

.. runrecord:: _examples/DL-101-133-104
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project
   :cast: 10_yoda

   $ git config -f .datalad/config datalad.containers.midterm-software.cmdexec 'singularity exec -B {{pwd}} {img} {cmd}'
   $ datalad save -m "Modify the container call format to bind-mount the working directory"
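
The doubled braces in ``{{pwd}}`` are there because the template is expanded
in more than one substitution pass. Here is a small sketch of that mechanism,
assuming Python ``str.format``-style placeholder handling; the image path and
working directory used below are made up for illustration:

.. code-block:: python

   # Sketch of the two-pass placeholder expansion (illustrative, not
   # DataLad's actual implementation).
   template = 'singularity exec -B {{pwd}} {img} {cmd}'

   # Pass 1: the container invocation fills in image and command;
   # '{{pwd}}' survives this pass as the literal placeholder '{pwd}'.
   step1 = template.format(
       img='.datalad/environments/midterm-software/image',  # made-up path
       cmd='python3 code/script.py',
   )
   assert '{pwd}' in step1

   # Pass 2: the generic run machinery resolves '{pwd}' to the dataset's
   # working directory, yielding the final command line.
   final = step1.format(pwd='/home/me/DataLad-101/midterm_project')
   print(final)

Without the doubled braces, ``{pwd}`` would already be consumed in the first
formatting pass and raise a ``KeyError``, since no ``pwd`` value is available
at that point.
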

.. index::
   pair: run command with provenance capture; with DataLad
   pair: run command; with DataLad containers-run

Now that we have a complete computational environment linked to the ``midterm_project``
dataset, we can execute commands in this environment. Let us, for example, try to repeat
the :dlcmd:`run` command from the section :ref:`yoda_project` as a
:dlcmd:`containers-run` command.
The previous ``run`` command looked like this:

.. code-block:: console

   $ datalad run -m "analyze iris data with classification analysis" \
     --input "input/iris.csv" \
     --output "pairwise_relationships.png" \
     --output "prediction_report.csv" \
     "python3 code/script.py {inputs} {outputs}"

How would it look as a ``containers-run`` command?

.. runrecord:: _examples/DL-101-133-105
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project
   :cast: 10_yoda
   :notes: The analysis can be rerun in a software container

   $ datalad containers-run -m "rerun analysis in container" \
     --container-name midterm-software \
     --input "input/iris.csv" \
     --output "pairwise_relationships.png" \
     --output "prediction_report.csv" \
     "python3 code/script.py {inputs} {outputs}"

Almost exactly like a :dlcmd:`run` command! The only additional parameter
is ``--container-name``. At this point, though, the ``--container-name``
flag is even *optional*, because there is only a single container registered to the dataset.
But if your dataset contains more than one container, you will *need* to specify
the name of the container you want to use in your command.
The complete command structure looks like this:

.. code-block:: console

   $ datalad containers-run --container-name <containername> [-m ...] [--input ...] [--output ...] <COMMAND>

.. index::
   pair: containers-remove; DataLad command
   pair: containers-list; DataLad command
   pair: list known containers; with DataLad

.. find-out-more:: How can I list available containers or remove them?

   The command :dlcmd:`containers-list` will list all containers in
   the current dataset:

   .. runrecord:: _examples/DL-101-133-110
      :language: console
      :workdir: dl-101/DataLad-101/midterm_project

      $ datalad containers-list

   The command :dlcmd:`containers-remove` will remove a container from the
   dataset, if a container with the given name exists. Note that this will
   remove not only the image from the dataset, but also its configuration in
   ``.datalad/config``.

Here is what the history entry looks like:

.. runrecord:: _examples/DL-101-133-111
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project
   :cast: 10_yoda
   :notes: Here is how that looks like in the history:

   $ git log -p -n 1

If you :dlcmd:`rerun` this commit, it will be re-executed in the
software container registered to the dataset. If you share the dataset
with a friend and they :dlcmd:`rerun` this commit, the image will first
be obtained from its registered URL, and thus your
friend can obtain the correct execution environment automatically.

Note that because this new :dlcmd:`containers-run` command modified the
``midterm_project`` subdirectory, we also need to save
the most recent state of the subdataset to the superdataset ``DataLad-101``.

.. runrecord:: _examples/DL-101-133-112
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project
   :cast: 10_yoda
   :notes: Save the change in the superdataset

   $ cd ../
   $ datalad status

.. runrecord:: _examples/DL-101-133-113
   :language: console
   :workdir: dl-101/DataLad-101
   :cast: 10_yoda
   :notes: Save the change in the superdataset

   $ datalad save -d . -m "add container and execute analysis within container" midterm_project

Software containers, the ``datalad-container`` extension, and DataLad thus work well together
to make your analysis completely reproducible -- by not only linking code, data,
and outputs, but also the software environment of an analysis. And this does not
only benefit your future self, but also whomever you share your dataset with, as
the information about the container is shared together with the dataset. How cool
is that?

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-133-114
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_computational_reproducibility

.. rubric:: Footnotes

.. [#f2] The main reason why Docker is not deployed on HPC systems is that
         it grants users "`superuser privileges <https://en.wikipedia.org/wiki/Superuser>`_".
         On multi-user systems such as HPC, users should not have those
         privileges, as it would enable them to tamper with others' or shared
         data and resources, posing a severe security threat.