  1. .. _dvc:
  2. Reproducible machine learning analyses: DataLad as DVC
  3. ------------------------------------------------------
  4. Machine learning analyses are complex:
  5. Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics.
  6. This complexity can jeopardize reproducibility -- it is hard to remember or figure out which model was trained on which version of which data, and which optimization turned out to be the best.
  7. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.
  8. .. figure:: ../artwork/src/ML.svg
  9. DataLad provides the functionality to achieve this, and :ref:`previous <yoda>` :ref:`sections <containersrun>` have given some demonstrations on how to do it.
  10. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too.
  11. One of the most well-known is `DVC (Data Version Control) <https://dvc.org>`__, a "version control system for machine learning projects".
  12. This section compares the two tools and demonstrates `workflows for data versioning, data sharing, and analysis execution <https://realpython.com/python-data-version-control>`_ in the context of a machine learning project with DVC and DataLad.
  13. While they share a number of similarities and goals, their respective workflows are quite distinct.
  14. The workflows showcased here are based on a `DVC tutorial <https://realpython.com/python-data-version-control>`__.
  15. This tutorial consists of the following steps:
  16. - A data set with pictures of 10 classes of objects (`Imagenette <https://github.com/fastai/imagenette>`_) is version controlled with DVC
  17. - The data is pushed to a "storage remote" on a local path
  18. - The data is analyzed using various ML models in DVC pipelines
  19. This handbook section demonstrates how DataLad could be used as an alternative to DVC.
  20. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow.
  21. The use case :ref:`usecase_ML` demonstrates a similar analysis in a completely DataLad-centric fashion.
  22. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands.
  23. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section can point you to relevant chapters for more insights on a command or concept.
  24. If you have never used DVC, `its documentation <https://dvc.org/doc>`_ (including the `command reference <https://dvc.org/doc/command-reference>`_) can answer further questions.
  25. .. admonition:: If you are not a Git user
  26. DVC relies heavily on Git workflows.
  27. Understanding the DVC workflows requires a solid understanding of :term:`branch`\es, Git's concepts of `Working tree, Index ("Staging Area"), and Repository <https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository>`_, and some basic Git commands such as ``add``, ``commit``, and ``checkout``.
  28. `The Turing Way <https://the-turing-way.netlify.app/index.html>`_ has an excellent `chapter on version control with Git <https://the-turing-way.netlify.app/reproducible-research/vcs.html>`_ if you want to catch up on those basics first.
  29. .. gitusernote:: Terminology
  30. Be mindful: DVC (like DataLad) comes with a range of commands and concepts that share their names with Git commands and concepts, but differ in functionality from their Git namesakes.
  31. Make sure to read the `DVC documentation <https://dvc.org/doc/command-reference>`_ for each command to get more information on what it does.
  32. Setup
  33. ^^^^^
  34. The `DVC tutorial <https://realpython.com/python-data-version-control>`_ comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, `the repository <https://github.com/datalad-handbook/data-version-control>`_ needs to be :term:`fork`\ed (requires a GitHub account) and cloned from your own fork [#f1]_.
  35. .. runrecord:: _examples/DL-101-168-101
  36. :workdir: DVCvsDL
  37. :language: console
  38. ### DVC
  39. # please clone this repository from your own fork when coding along
  40. $ git clone https://github.com/datalad-handbook/data-version-control DVC
  41. .. only:: adminmode
  42. We need to reconfigure the remote origin to a local push target to simulate pushing back.
  43. First, rename origin:
  44. .. runrecord:: _examples/DL-101-168-102a
  45. :language: console
  46. :workdir: DVCvsDL/DVC
  47. $ git remote rename origin github
  48. Then, create a local push target under the remote name ``origin``:
  49. .. runrecord:: _examples/DL-101-168-102b
  50. :language: console
  51. $ python3 /home/me/makepushtarget.py '/home/me/DVCvsDL/DVC' 'origin' '/home/me/pushes/data-version-control' True True
  52. The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories ``model`` and ``metrics``, and a set of Python scripts for a machine learning analysis in ``src/``.
  53. .. runrecord:: _examples/DL-101-168-102
  54. :workdir: DVCvsDL
  55. :language: console
  56. ### DVC
  57. $ tree DVC
  58. For a comparison, we will recreate a similarly structured DataLad dataset.
  59. For greater compliance with DataLad's :ref:`YODA principles <yoda>`, the dataset structure will differ marginally in that scripts will be kept in ``code/`` instead of ``src/``.
  60. We create the dataset with two configurations, ``yoda`` and ``text2git`` [#f2]_.
  61. .. runrecord:: _examples/DL-101-168-103
  62. :workdir: DVCvsDL
  63. :language: console
  64. ### DVC-DataLad
  65. $ datalad create -c text2git -c yoda DVC-DataLad
  66. $ cd DVC-DataLad
  67. $ mkdir -p data/{raw,prepared} model metrics
  68. Afterwards, we make sure to get the same scripts.
  69. .. runrecord:: _examples/DL-101-168-104
  70. :workdir: DVCvsDL/DVC-DataLad
  71. :language: console
  72. ### DVC-DataLad
  73. # get the scripts
  74. $ datalad download-url -m "download scripts for ML analysis" \
  75. https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
  76. -O 'code/'
  77. Here's the final directory structure:
  78. .. runrecord:: _examples/DL-101-168-105
  79. :workdir: DVCvsDL/DVC-DataLad
  80. :language: console
  81. ### DVC-DataLad
  82. $ tree
  83. .. find-out-more:: Required software for coding along
  84. In order to code along, `DVC <https://dvc.org/doc/install>`__, `scikit-learn <https://scikit-learn.org>`_, `scikit-image <https://scikit-image.org>`_, `pandas <https://pandas.pydata.org>`_, and `numpy <https://numpy.org>`_ are required.
  85. All tools are available via `pip <https://pypi.org/project/pip>`_ or `conda <https://docs.conda.io>`_.
  86. We recommend installing them in a `virtual environment <https://realpython.com/python-data-version-control/#set-up-your-working-environment>`_ -- the DVC tutorial has `step-by-step instructions <https://realpython.com/python-data-version-control/#set-up-your-working-environment>`_.
  87. Version controlling data
  88. ^^^^^^^^^^^^^^^^^^^^^^^^
  89. In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.
  90. Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar:
  91. (Large) data is stored outside of :term:`Git` -- :term:`Git` only tracks information on where this data can be found.
  92. In DataLad datasets, (large) data is handled by :term:`git-annex`.
  93. Data content is `hashed <https://en.wikipedia.org/wiki/Hash_function>`_ and only the hash (represented as the original file name) is stored in Git [#f3]_.
  94. Actual data is stored in the :term:`annex` of the dataset, and annexed data can be transferred from and to a `large number of storage solutions <https://git-annex.branchable.com/special_remotes>`_ using either DataLad or git-annex commands.
  95. Information on where data is available from is :ref:`stored in an internal representation of git-annex <gitannexbranch>`.
  96. In DVC repositories, (large) data is also supposed to be stored in external remotes such as Google Drive.
  97. For internal representation of where files are available from, DVC uses one ``.dvc`` text file for each data file or directory given to DVC.
  98. The ``.dvc`` files contain information on the path to the data in the repository, where the associated data file is available from, and a hash, and those files should be committed to :term:`Git`.
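
The shared idea behind both tools (content lives in a hash-addressed store, and Git only tracks a small pointer to it) can be illustrated with a short Python sketch.
It is not how git-annex or DVC are implemented: the function names ``file_md5`` and ``add_to_store``, the two-character directory fan-out, and the pointer layout are assumptions made purely for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's content."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit into memory
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_store(path, store):
    """Copy a file into a hash-addressed store and return a small pointer.

    Only the pointer (file name plus hash) would be committed to Git;
    the content itself lives in the store.
    """
    path, store = Path(path), Path(store)
    md5 = file_md5(path)
    target = store / md5[:2] / md5[2:]  # hypothetical store layout
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return {"path": path.name, "md5": md5}
```

Because the store location is derived from the content hash, two files with identical content end up as a single object in the store, and any change to a file results in a new pointer.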
  99. DVC workflow
  100. """"""""""""
  101. Prior to adding and version controlling data, a "DVC project" needs to be initialized in the Git repository:
  102. .. runrecord:: _examples/DL-101-168-106
  103. :workdir: DVCvsDL/DVC-DataLad
  104. :language: console
  105. ### DVC
  106. $ cd ../DVC
  107. $ dvc init
  108. This populates the repository with a range of `staged <https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository>`_ files -- most of them are internal directories and files for DVC's configuration.
  109. .. runrecord:: _examples/DL-101-168-107
  110. :workdir: DVCvsDL/DVC
  111. :language: console
  112. ### DVC
  113. $ git status
  114. As they are only *staged* but not *committed*, we need to commit them (into Git):
  115. .. runrecord:: _examples/DL-101-168-108
  116. :workdir: DVCvsDL/DVC
  117. :language: console
  118. ### DVC
  119. $ git commit -m "initialize dvc"
  120. The DVC project is now ready to version control data.
  121. In the tutorial, data comes from the "Imagenette" dataset.
  122. This data is available `from an Amazon S3 bucket <https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz>`_ as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the :term:`Open Science Framework` (OSF).
  123. We'll download it and extract it into the ``data/raw/`` directory of the repository.
  124. .. runrecord:: _examples/DL-101-168-109
  125. :workdir: DVCvsDL/DVC
  126. :language: console
  127. ### DVC
  128. # download the data
  129. $ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
  130. # extract it
  131. $ tar -xzf imagenette2-160.tgz
  132. # move it into the directories
  133. $ mv train data/raw/
  134. $ mv val data/raw/
  135. # remove the archive
  136. $ rm -rf imagenette2-160.tgz
  137. The data directories in ``data/raw`` are then version controlled with the :shcmd:`dvc add` command that can place files or complete directories under version control by DVC.
  138. .. runrecord:: _examples/DL-101-168-110
  139. :workdir: DVCvsDL/DVC
  140. :language: console
  141. ### DVC
  142. $ dvc add data/raw/train
  143. $ dvc add data/raw/val
  144. Here is what this command has accomplished:
  145. The data files were copied into a *cache* in ``.dvc/cache`` (a non-human-readable directory structure based on hashes, similar to ``.git/annex/objects`` used by :term:`git-annex`), data file names were added to a ``.gitignore`` [#f4]_ file to become invisible to Git, and two ``.dvc`` files, ``train.dvc`` and ``val.dvc``, were created [#f5]_.
  146. :gitcmd:`status` shows these changes:
  147. .. runrecord:: _examples/DL-101-168-111
  148. :workdir: DVCvsDL/DVC
  149. :language: console
  150. ### DVC
  151. $ git status
  152. In order to complete the version control workflow, Git needs to know about the ``.dvc`` files, and forget about the data directories.
  153. For this, the modified ``.gitignore`` file and the untracked ``.dvc`` files need to be added to Git:
  154. .. runrecord:: _examples/DL-101-168-112
  155. :workdir: DVCvsDL/DVC
  156. :language: console
  157. ### DVC
  158. $ git add --all
  159. Finally, we commit.
  160. .. runrecord:: _examples/DL-101-168-113
  161. :workdir: DVCvsDL/DVC
  162. :language: console
  163. ### DVC
  164. $ git commit -m "control data with DVC"
  165. The data is now version controlled with DVC.
  166. .. find-out-more:: How does DVC represent modifications to data?
  167. When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective ``.dvc`` file.
  168. If any file in the directory changes, this hash would change, and the :shcmd:`dvc status` command would report the directory to be "changed".
  169. To demonstrate this, we pretend to accidentally delete a single file::
  170. # if one or more files in the val/ data changes, dvc status reports a change
  171. $ dvc status
  172. data/raw/val.dvc:
  173. changed outs:
  174. modified: data/raw/val
  175. **Important**: Detecting a data modification **requires** the :shcmd:`dvc status` command -- :gitcmd:`status` will not be able to detect changes in this directory, as it is git-ignored!
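
Why a single deleted file marks the whole directory as changed can be sketched in a few lines of Python.
The function ``dir_hash`` below is a hypothetical helper, and the manifest format is an assumption: the sketch only mirrors the principle of hashing a directory as one unit, not DVC's actual on-disk format.

```python
import hashlib
import json
from pathlib import Path


def dir_hash(root):
    """Hash a directory as a whole: any added, removed, or changed file alters it."""
    root = Path(root)
    manifest = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        manifest.append(
            {
                "relpath": path.relative_to(root).as_posix(),
                "md5": hashlib.md5(path.read_bytes()).hexdigest(),
            }
        )
    # Hash the sorted manifest itself to get one identifier for the directory
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

Comparing the stored hash with a freshly computed one reveals *that* something changed, but not *which* file; finding the affected file requires comparing the manifests entry by entry.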
  176. DataLad workflow
  177. """"""""""""""""
  178. DataLad has means to get data or data archives from web sources and store this availability information within :term:`git-annex`.
  179. This has several advantages:
  180. For one, the original OSF file URL is known and stored as a location to re-retrieve the data from.
  181. This enables reliable data access for yourself and others that you share the dataset with.
  182. Beyond this, the data is also automatically extracted and saved, and thus put under version control.
  183. Note that this strays slightly from DataLad's :ref:`YODA principles <yoda>`: in a DataLad-centric workflow, the data would become a standalone, reusable dataset that is linked as a subdataset into a study- or analysis-specific dataset.
  184. Here, we stick to the project organization of DVC though.
  185. .. runrecord:: _examples/DL-101-168-114
  186. :workdir: DVCvsDL/DVC
  187. :language: console
  188. ### DVC-DataLad
  189. $ cd ../DVC-DataLad
  190. $ datalad download-url \
  191. --archive \
  192. --message "Download Imagenette dataset" \
  193. https://osf.io/d6qbz/download \
  194. -O 'data/raw/'
  195. At this point, the data is already version controlled [#f6]_, and we have the following directory tree::
  196. $ tree
  197. .
  198. ├── code
  199. │   └── [...]
  200. ├── data
  201. │   └── raw
  202. │       ├── train
  203. │       │   ├── [...]
  204. │       └── val
  205. │           ├── [...]
  206. ├── metrics
  207. └── model
  208. 29 directories
  209. .. find-out-more:: How does DataLad represent modifications to data?
  210. As DataLad always tracks files individually, :dlcmd:`status` (or, alternatively, :gitcmd:`status` or :gitannexcmd:`status`) will show modifications on the level of individual files::
  211. $ datalad status
  212. deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)
  213. $ git status
  214. On branch main
  215. Your branch is ahead of 'origin/main' by 2 commits.
  216. (use "git push" to publish your local commits)
  217. Changes not staged for commit:
  218. (use "git add/rm <file>..." to update what will be committed)
  219. (use "git restore <file>..." to discard changes in working directory)
  220. deleted: data/raw/val/n01440764/n01440764_12021.JPEG
  221. $ git annex status
  222. D data/raw/val/n01440764/n01440764_12021.JPEG
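
The file-level reporting shown above can be mimicked with a per-file comparison.
The following Python sketch is only an illustration of that principle; ``snapshot`` and ``file_status`` are hypothetical helpers and do not reflect how DataLad, Git, or git-annex determine their status output.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Record one hash per file, keyed by its path relative to ``root``."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def file_status(recorded, root):
    """Compare a recorded snapshot with the current state, file by file."""
    current = snapshot(root)
    status = {}
    for relpath, md5 in recorded.items():
        if relpath not in current:
            status[relpath] = "deleted"
        elif current[relpath] != md5:
            status[relpath] = "modified"
    # Anything present now but absent from the record is new
    for relpath in current.keys() - recorded.keys():
        status[relpath] = "untracked"
    return status
```

In contrast to a single directory-level hash, the result names every affected file individually.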
  223. Sharing data
  224. ^^^^^^^^^^^^
  225. In the second part of the tutorial, the versioned data is transferred to a local directory to demonstrate data sharing.
  226. The general mechanisms of DVC and DataLad data sharing are similar: (Large) data files are kept somewhere where potentially large files can be stored. They can be retrieved on demand as the location information is stored in Git.
  227. DVC uses the term "data remote" to refer to external storage locations for (large) data, whereas DataLad would refer to them as (storage-) :term:`sibling`\s.
  228. Both DVC and DataLad support a range of hosting solutions, from local paths and SSH servers to providers such as S3 or GDrive.
  229. For DVC, every supported remote is pre-implemented, which restricts the number of available services (a list is `here <https://dvc.org/doc/command-reference/remote/add>`_), but results in a convenient, streamlined procedure for adding remotes based on URL schemes.
  230. DataLad, largely thanks to the "external special remotes" mechanism of git-annex, supports more storage options (in addition, for example, :ref:`Dropbox <sharethirdparty>`, `the Open Science Framework (OSF) <https://docs.datalad.org/projects/osf>`_, :ref:`Git LFS <gitlfs>`, :ref:`Figshare <figshare>`, :ref:`GIN <gin>`, or :ref:`RIA stores <riastore>`), but depending on the selected storage provider, the procedure to add a sibling may differ.
  231. In addition, DataLad is able to store complete datasets (annexed data *and* Git repository) in certain services (e.g., OSF, GIN, GitHub if used with GitLFS, Dropbox, ...), enabling a clone from, for example, Google Drive. And while DVC can never keep data in Git repository hosting services, DataLad can do this if the hosting service supports hosting annexed data (default on :term:`Gin` and possible with :term:`GitHub`, :term:`GitLab` or :term:`BitBucket` if used with `GitLFS <https://git-lfs.com>`_).
  232. DVC workflow
  233. """"""""""""
  234. **Step 1: Set up a remote**
  235. The `DVC tutorial <https://realpython.com/python-data-version-control>`__ demonstrates data sharing via a local data remote [#f7]_.
  236. As a first step, there needs to exist a directory to use as a remote, so we will create a new directory:
  237. .. runrecord:: _examples/DL-101-168-120
  238. :workdir: DVCvsDL/DVC-DataLad
  239. :language: console
  240. ### DVC
  241. # go back to DVC (we were in DVC-Datalad)
  242. $ cd ../DVC
  243. # create a directory somewhere else
  244. $ mkdir ../dvc-remote
  245. Afterwards, the new, empty directory can be added as a data remote using :shcmd:`dvc remote add`.
  246. The ``-d`` option sets it as the default remote, which simplifies pushing later on:
  247. .. runrecord:: _examples/DL-101-168-121
  248. :workdir: DVCvsDL/DVC
  249. :language: console
  250. ### DVC
  251. $ dvc remote add -d remote_storage ../dvc-remote
  252. The location of the remote is written into a config file:
  253. .. runrecord:: _examples/DL-101-168-122
  254. :workdir: DVCvsDL/DVC
  255. :language: console
  256. ### DVC
  257. $ cat .dvc/config
  258. Note that ``dvc remote add`` only *modifies* the config file, and it still needs to be added and committed to Git:
  259. .. runrecord:: _examples/DL-101-168-123
  260. :workdir: DVCvsDL/DVC
  261. :language: console
  262. ### DVC
  263. $ git status
  264. .. runrecord:: _examples/DL-101-168-124
  265. :workdir: DVCvsDL/DVC
  266. :language: console
  267. ### DVC
  268. $ git add .dvc/config
  269. $ git commit -m "add local remote"
  270. .. gitusernote:: Remotes
  271. The DVC and Git concepts of a "remote" are related, but not identical.
  272. Therefore, DVC remotes are invisible to :gitcmd:`remote`, and likewise, Git :term:`remote`\s are invisible to the :shcmd:`dvc remote list` command.
  273. **Step 2: Push data to the remote**
  274. Once the remote is set up, the data that is managed by DVC can be pushed from the *cache* of the project to the remote.
  275. During this operation, all data for which ``.dvc`` files exist will be copied from ``.dvc/cache`` to the remote storage.
  276. .. runrecord:: _examples/DL-101-168-125
  277. :workdir: DVCvsDL/DVC
  278. :language: console
  279. ### DVC
  280. $ dvc push
  281. **Step 3: Push Git history**
  282. At this point, all changes that were committed to :term:`Git` (such as the ``.dvc`` files) still need to be pushed to a Git repository hosting service.
  283. .. runrecord:: _examples/DL-101-168-126
  284. :workdir: DVCvsDL/DVC
  285. :language: console
  286. ### DVC
  287. # this will only work if you have cloned from your own fork
  288. $ git push origin master
  289. **Step 4: Data retrieval**
  290. In DVC projects, there are several ways to retrieve data into its original location or the project cache.
  291. In order to demonstrate this, we start by deleting a data directory (in its original location, ``data/raw/val/``).
  292. .. runrecord:: _examples/DL-101-168-127
  293. :workdir: DVCvsDL/DVC
  294. :language: console
  295. ### DVC
  296. $ rm -rf data/raw/val
  297. .. gitusernote:: Status
  298. Do note that this deletion would not be detected by :gitcmd:`status` -- you have to use :shcmd:`dvc status` instead.
  299. At this point, a copy of the data still resides in the cache of the repository.
  300. These data are copied back to ``val/`` with the :shcmd:`dvc checkout` command:
  301. .. runrecord:: _examples/DL-101-168-128
  302. :workdir: DVCvsDL/DVC
  303. :language: console
  304. ### DVC
  305. $ dvc checkout data/raw/val.dvc
  306. If the cache of the repository were empty, the data could be re-retrieved into the cache from the data remote.
  307. To demonstrate this, let's look at a repository with an empty cache by cloning this repository from GitHub into a new location.
  308. .. runrecord:: _examples/DL-101-168-129
  309. :workdir: DVCvsDL/DVC
  310. :language: console
  311. :realcommand: cd ../ && git clone -b master /home/me/pushes/data-version-control DVC-2
  312. ### DVC
  313. # clone the repo into a new location for demonstration purposes:
  314. $ cd ../
  315. $ git clone https://github.com/datalad-handbook/data-version-control DVC-2
  316. Retrieving the data from the data remote to repopulate the cache is done with the :shcmd:`dvc fetch` command:
  317. .. runrecord:: _examples/DL-101-168-130
  318. :workdir: DVCvsDL
  319. :language: console
  320. ### DVC
  321. $ cd DVC-2
  322. $ dvc fetch data/raw/val.dvc
  323. Afterwards, another :shcmd:`dvc checkout` will copy the files from the cache back to ``val/``.
  324. Alternatively, the command :shcmd:`dvc pull` performs ``fetch`` (get data into the cache) and ``checkout`` (copy data from the cache to its original location) in a single command.
  325. Unless DVC is used on one of a small subset of file systems (Btrfs, XFS, OCFS2, or APFS), copying data between its original location and the cache is the default.
  326. This results in a "built-in data duplication" on most current file systems [#f8]_.
  327. An alternative is to switch from copies to :term:`symlink`\s (as done by :term:`git-annex`) or `hardlinks <https://en.wikipedia.org/wiki/Hard_link>`_.
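
The relationship between ``fetch``, ``checkout``, and ``pull``, and the difference between copies, symlinks, and hardlinks, can be made concrete with a small Python sketch.
The functions below are hypothetical stand-ins that operate on plain directories; they are meant to illustrate the concepts and are not DVC's implementation.

```python
import os
import shutil
from pathlib import Path


def fetch(remote, cache, key):
    """Copy one object from the remote storage into the local cache."""
    src, dst = Path(remote) / key, Path(cache) / key
    dst.parent.mkdir(parents=True, exist_ok=True)
    if not dst.exists():
        shutil.copy2(src, dst)
    return dst


def checkout(cache, key, target, link="copy"):
    """Materialize a cached object at ``target`` as a copy, symlink, or hardlink."""
    src, target = Path(cache) / key, Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        target.unlink()
    if link == "copy":
        shutil.copy2(src, target)  # duplicates the content on disk
    elif link == "symlink":
        os.symlink(src.resolve(), target)  # pointer to the cached file
    elif link == "hardlink":
        os.link(src, target)  # second name for the same inode
    else:
        raise ValueError(f"unknown link type: {link}")
    return target


def pull(remote, cache, key, target, link="copy"):
    """``pull`` is ``fetch`` followed by ``checkout``."""
    fetch(remote, cache, key)
    return checkout(cache, key, target, link=link)
```

With ``link="copy"``, every file exists twice on disk (once in the cache, once in the workspace), whereas symlinks and hardlinks make the workspace file refer to the single cached copy.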
  328. DataLad workflow
  329. """"""""""""""""
  330. Because the OSF archive containing the raw data is known and stored in the dataset, it strictly speaking isn't necessary to create a storage sibling to push the data to -- DataLad already treats the original web location as storage.
  331. Currently, the dataset can thus be shared via :term:`GitHub` or similar hosting services, and the data can be retrieved using :dlcmd:`get`.
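
The idea that a recorded web location can serve as storage can be sketched as a lookup over known locations.
In the following Python sketch, ``get_content`` is a hypothetical helper and local paths stand in for URLs or siblings; it illustrates the principle of availability tracking only and is not how git-annex records or queries locations.

```python
import shutil
from pathlib import Path


def get_content(key, locations, target):
    """Try each recorded location for ``key`` until one yields the content.

    ``locations`` maps a content key to a list of places the content is
    known to be available from (local paths stand in for URLs or siblings).
    """
    for source in locations.get(key, []):
        source = Path(source)
        if source.is_file():
            Path(target).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            return str(source)
    raise FileNotFoundError(f"no known location provides {key!r}")
```

As long as at least one recorded location still provides the content, it can be re-obtained, which is why an additional storage sibling is optional rather than required here.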
  332. .. index::
  333. pair: create-sibling-github; DataLad command
  334. .. find-out-more:: Really?
  335. Sure.
  336. Let's demonstrate this.
  337. First, we create a sibling on GitHub for this dataset and push its contents to the sibling:
  338. .. code-block:: bash
  339. ### DVC-DataLad
  340. $ cd ../DVC-DataLad
  341. $ datalad create-sibling-github DVC-DataLad --github-organization datalad-handbook
  342. [INFO ] Successfully obtained information about organization datalad-handbook using UserPassword(name='github', url='https://github.com/login') credential
  343. .: github(-) [https://github.com/datalad-handbook/DVC-DataLad.git (git)]
  344. 'https://github.com/datalad-handbook/DVC-DataLad.git' configured as sibling 'github' for Dataset(/home/me/DVCvsDL/DVC-DataLad)
  345. $ datalad push --to github
  346. Update availability for 'github': [...] [00:00<00:00, 28.9k Steps/s]Username for 'https://github.com': <user>
  347. Password for 'https://adswa@github.com': <password>
  348. publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
  349. publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
  350. Next, we can clone this dataset, and retrieve the files:
  351. .. runrecord:: _examples/DL-101-168-131
  352. :workdir: DVCvsDL
  353. :language: console
  354. ### DVC-DataLad
  355. # outside of a dataset
  356. $ datalad clone https://github.com/datalad-handbook/DVC-DataLad.git DVC-DataLad-2
  357. $ cd DVC-DataLad-2
  358. .. runrecord:: _examples/DL-101-168-132
  359. :workdir: DVCvsDL/DVC-DataLad-2
  360. :language: console
  361. :realcommand: datalad get data/raw/val | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
  362. ### DVC-DataLad2
  363. $ datalad get data/raw/val
  364. The data was retrieved by re-downloading the original archive from OSF and extracting the required files.
  365. Here's an example of pushing a dataset to a local sibling nevertheless:
  366. .. index::
  367. pair: create-sibling; DataLad command
  368. **Step 1: Set up the sibling**
  369. The easiest way to share data is via a local sibling [#f7]_.
  370. This shares not only the annexed data, but pushes everything, including the Git history of the dataset.
  371. First, we need to create a local sibling:
  372. .. runrecord:: _examples/DL-101-168-140
  373. :workdir: DVCvsDL
  374. :language: console
  375. ### DVC-DataLad
  376. $ cd DVC-DataLad
  377. $ datalad create-sibling --name mysibling ../datalad-sibling
  378. **Step 2: Push the data**
  379. Afterwards, the dataset contents can be pushed using :dlcmd:`push`.
  380. .. runrecord:: _examples/DL-101-168-141
  381. :workdir: DVCvsDL/DVC-DataLad
  382. :language: console
  383. :realcommand: datalad push --to mysibling | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
  384. ### DVC-DataLad
  385. $ datalad push --to mysibling
  386. This pushed all of the annexed data and the Git history of the dataset.
  387. **Step 3: Retrieve the data**
  388. The data in the dataset (complete directories or individual files) can be dropped using :dlcmd:`drop`, and reobtained using :dlcmd:`get`.
  389. .. runrecord:: _examples/DL-101-168-142
  390. :workdir: DVCvsDL/DVC-DataLad
  391. :language: console
  392. :realcommand: datalad drop data/raw/val | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
  393. ### DVC-DataLad
  394. $ datalad drop data/raw/val
  395. .. runrecord:: _examples/DL-101-168-143
  396. :workdir: DVCvsDL/DVC-DataLad
  397. :language: console
  398. :realcommand: datalad get data/raw/val | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
  399. ### DVC-DataLad
  400. $ datalad get data/raw/val
  401. Data analysis
  402. ^^^^^^^^^^^^^
  403. DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines.
  404. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models.
  405. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps.
  406. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.
  407. In this general overview, we will only rush through the analysis:
  408. In short, it consists of three steps, each associated with a script.
  409. ``src/prepare.py`` creates two ``.csv`` files with mappings of file names in ``train/`` and ``val/`` to image categories.
  410. Later, these files will be used to train and test the classifiers.
  411. ``src/train.py`` loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the ``model/`` directory as ``model.joblib``.
  412. The final script, ``src/evaluate.py`` is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file ``metrics/accuracy.json``.
  413. There are more detailed insights and explanations of the actual analysis code in the `Tutorial <https://realpython.com/python-data-version-control>`_ if you're interested in finding out more.
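
To give an impression of what the preparation step produces, here is a minimal Python sketch that maps image files to categories based on their parent directory.
It is not the tutorial's ``prepare.py``: the function name ``write_label_csv``, the column names, and the assumption that each class has its own subdirectory containing ``.JPEG`` files are illustrative choices.

```python
import csv
from pathlib import Path


def write_label_csv(image_dir, csv_path):
    """Write one CSV row per image: its path and the name of its class folder."""
    image_dir, csv_path = Path(image_dir), Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    rows = [
        (path.as_posix(), path.parent.name)
        for path in sorted(image_dir.glob("*/*.JPEG"))
    ]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```

Calling such a function once for ``data/raw/train`` and once for ``data/raw/val`` would yield the two CSV files that the training and evaluation steps consume.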
  414. For workflow management, DVC has the concept of a "DVC pipeline".
  415. A pipeline consists of multiple stages, which are set up and executed using a :shcmd:`dvc stage add [--run]` command.
  416. Each stage has three components: "deps", "outs", and "command".
  417. Each of the scripts in the repository will be represented by a stage in the DVC pipeline.
  418. DataLad does not have any workflow management functions.
  419. The closest to it are :dlcmd:`run` to record any command execution or analysis, :dlcmd:`rerun` to recompute such an analysis, and :dlcmd:`containers-run` to perform and record a command execution or analysis inside of a tracked software container [#f10]_.
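
The stage concept (dependencies, outputs, and a command) implies an execution order: a stage can only run after the stages that produce its dependencies.
The Python sketch below derives this order for the three scripts of this analysis.
The dictionary layout and the ``stage_order`` helper are assumptions for illustration and not DVC's file format or API; ``graphlib`` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Stage definitions mirroring the three scripts of the analysis; the layout
# is an assumption for illustration, not DVC's file format. The metrics file
# is listed under "outs" here for simplicity.
STAGES = {
    "prepare": {
        "cmd": "python src/prepare.py",
        "deps": ["src/prepare.py", "data/raw"],
        "outs": ["data/prepared/train.csv", "data/prepared/test.csv"],
    },
    "train": {
        "cmd": "python src/train.py",
        "deps": ["src/train.py", "data/prepared/train.csv"],
        "outs": ["model/model.joblib"],
    },
    "evaluate": {
        "cmd": "python src/evaluate.py",
        "deps": ["src/evaluate.py", "model/model.joblib"],
        "outs": ["metrics/accuracy.json"],
    },
}


def stage_order(stages):
    """Order stages so that each runs after the stages producing its deps."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

For the stages above, ``stage_order(STAGES)`` yields preparation first, then training, then evaluation, because each stage consumes an output of the previous one.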
  420. DVC workflow
  421. """"""""""""
  422. **Model 1: SGD classifier**
  423. Each model will be analyzed in a different branch of the repository.
  424. Therefore, we start by creating a new branch.
  425. .. runrecord:: _examples/DL-101-168-150
  426. :workdir: DVCvsDL/DVC-DataLad
  427. :language: console
  428. ### DVC
  429. $ cd ../DVC
  430. $ git checkout -b sgd-pipeline
  431. The first stage in the pipeline is data preparation (performed by the script ``prepare.py``).
  432. The following command sets up the stage:
  433. .. runrecord:: _examples/DL-101-168-151
  434. :workdir: DVCvsDL/DVC
  435. :language: console
  436. ### DVC
  437. $ dvc stage add -n prepare \
  438. -d src/prepare.py -d data/raw \
  439. -o data/prepared/train.csv -o data/prepared/test.csv \
  440. --run \
  441. python src/prepare.py
  442. The ``-n`` parameter gives the stage a name, the ``-d`` parameter passes the dependencies -- the raw data -- to the command, and the ``-o`` parameter defines the outputs of the command -- the CSV files that ``prepare.py`` will create.
  443. ``python src/prepare.py`` is the command that will be executed in the stage.
  444. The resulting changes can be added to Git:
  445. .. runrecord:: _examples/DL-101-168-152
  446. :workdir: DVCvsDL/DVC
  447. :language: console
  448. ### DVC
  449. $ git add dvc.yaml data/prepared/.gitignore dvc.lock
  450. The :shcmd:`dvc stage add` call above, thanks to its ``--run`` flag, executed the stage's command and also created two `YAML <https://en.wikipedia.org/wiki/YAML>`_ files, ``dvc.yaml`` and ``dvc.lock``.
  451. They contain the pipeline description, which currently comprises only the first stage:
  452. .. runrecord:: _examples/DL-101-168-153
  453. :workdir: DVCvsDL/DVC
  454. :language: console
  455. ### DVC
  456. $ cat dvc.yaml
  457. The lock file tracks the versions of all relevant files via MD5 hashes.
  458. This allows DVC to track all dependencies and outputs and detect if any of these files change.
  459. .. runrecord:: _examples/DL-101-168-154
  460. :workdir: DVCvsDL/DVC
  461. :language: console
  462. ### DVC
  463. $ cat dvc.lock
  464. The command also added the results from the stage, ``train.csv`` and ``test.csv``, to a ``.gitignore`` file.
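
How recorded hashes allow a tool to decide whether a stage needs to be re-run can be sketched as follows.
The helpers ``hash_files``, ``write_lock``, and ``is_stale`` are hypothetical, and the lock layout is a simplification that does not match the actual ``dvc.lock`` format.

```python
import hashlib
from pathlib import Path


def hash_files(paths):
    """Map each existing path to the MD5 digest of its content."""
    return {
        str(p): hashlib.md5(Path(p).read_bytes()).hexdigest()
        for p in paths
        if Path(p).is_file()
    }


def write_lock(deps, outs):
    """Record the current state of all dependencies and outputs."""
    return {"deps": hash_files(deps), "outs": hash_files(outs)}


def is_stale(deps, outs, lock):
    """A stage must be re-run if any dependency or output differs from the lock."""
    return hash_files(deps) != lock.get("deps") or hash_files(outs) != lock.get("outs")
```

A stage whose dependencies and outputs still match the lock can be skipped; editing a script or changing input data alters a hash and marks the stage as outdated.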
  465. The next pipeline stage is training, in which ``train.py`` will be used to train a classifier on the data.
  466. Initially, this classifier is an SGD classifier.
  467. The following command sets it up:
  468. .. runrecord:: _examples/DL-101-168-155
  469. :workdir: DVCvsDL/DVC
  470. :language: console
  471. $ dvc stage add -n train \
  472. -d src/train.py -d data/prepared/train.csv \
  473. -o model/model.joblib \
  474. --run \
  475. python src/train.py
  476. Afterwards, ``train.py`` has been executed, and the pipeline files ``dvc.yaml`` and ``dvc.lock`` have been updated with a second stage.
  477. The resulting changes can be added to Git:
  478. .. runrecord:: _examples/DL-101-168-156
  479. :workdir: DVCvsDL/DVC
  480. :language: console
  481. ### DVC
  482. $ git add dvc.yaml model/.gitignore dvc.lock
  483. Finally, we create the last stage, model evaluation.
  484. The following command sets it up:
  485. .. runrecord:: _examples/DL-101-168-157
  486. :workdir: DVCvsDL/DVC
  487. :language: console
  488. $ dvc stage add -n evaluate \
  489. -d src/evaluate.py -d model/model.joblib \
  490. -M metrics/accuracy.json \
  491. --run \
  492. python src/evaluate.py
  493. .. runrecord:: _examples/DL-101-168-158
  494. :workdir: DVCvsDL/DVC
  495. :language: console
  496. ### DVC
  497. $ git add dvc.yaml dvc.lock
  498. Instead of "outs", this final stage uses the ``-M`` flag to denote a "metric".
  499. This type of flag can be used if floating-point or integer values that summarize model performance (e.g. accuracies, receiver operating characteristics, or area under the curve values) are saved in hierarchical files (JSON, YAML).
  500. DVC can then read from these files to display model performances and comparisons:
  501. .. runrecord:: _examples/DL-101-168-159
  502. :workdir: DVCvsDL/DVC
  503. :language: console
  504. ### DVC
  505. $ dvc metrics show
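Because metrics live in small, hierarchical text files, they are also easy to read and compare by hand. The following is a minimal sketch, assuming flat JSON files with numeric values; the helper names and the accuracy values are made up for illustration and are not results from this analysis.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(path):
    """Read a flat JSON metrics file such as {"accuracy": 0.5}."""
    with open(path) as handle:
        return json.load(handle)


def compare_metrics(old, new):
    """Return {metric: (old, new, new - old)} for metrics present in both."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in sorted(old.keys() & new.keys())
    }


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "accuracy_a.json"
    second = Path(tmp) / "accuracy_b.json"
    first.write_text(json.dumps({"accuracy": 0.5}))    # made-up value
    second.write_text(json.dumps({"accuracy": 0.75}))  # made-up value
    comparison = compare_metrics(load_metrics(first), load_metrics(second))

for name, (old, new, delta) in comparison.items():
    print(f"{name}: {old:.2f} -> {new:.2f} ({delta:+.2f})")
```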
  506. The complete pipeline now consists of preparation, training, and evaluation.
  507. It now needs to be committed, tagged, and pushed:
  508. .. runrecord:: _examples/DL-101-168-160
  509. :workdir: DVCvsDL/DVC
  510. :language: console
  511. ### DVC
  512. $ git add --all
  513. $ git commit -m "Add SGD pipeline"
  514. $ dvc commit
  515. $ git push --set-upstream origin sgd-pipeline
  516. $ git tag -a sgd -m "Trained SGD as DVC pipeline."
  517. $ git push origin --tags
  518. $ dvc push
  519. **Model 2: random forest classifier**
  520. In order to explore a second model, a random forest classifier, we start with a new branch.
  521. .. runrecord:: _examples/DL-101-168-161
  522. :workdir: DVCvsDL/DVC
  523. :language: console
  524. ### DVC
  525. $ git checkout -b random-forest
  526. To switch from SGD to a random forest classifier, a few lines of code within ``train.py`` need to be changed.
  527. The following `here doc <https://en.wikipedia.org/wiki/Here_document>`_ changes the script accordingly (changes are highlighted):
  528. .. runrecord:: _examples/DL-101-168-162
  529. :workdir: DVCvsDL/DVC
  530. :language: console
  531. :emphasize-lines: 10, 37-38
  532. ### DVC
  533. $ cat << EOT >| src/train.py
  534. from joblib import dump
  535. from pathlib import Path
  536. import numpy as np
  537. import pandas as pd
  538. from skimage.io import imread_collection
  539. from skimage.transform import resize
  540. from sklearn.ensemble import RandomForestClassifier
  541. def load_images(data_frame, column_name):
  542. filelist = data_frame[column_name].to_list()
  543. image_list = imread_collection(filelist)
  544. return image_list
  545. def load_labels(data_frame, column_name):
  546. label_list = data_frame[column_name].to_list()
  547. return label_list
  548. def preprocess(image):
  549. resized = resize(image, (100, 100, 3))
  550. reshaped = resized.reshape((1, 30000))
  551. return reshaped
  552. def load_data(data_path):
  553. df = pd.read_csv(data_path)
  554. labels = load_labels(data_frame=df, column_name="label")
  555. raw_images = load_images(data_frame=df, column_name="filename")
  556. processed_images = [preprocess(image) for image in raw_images]
  557. data = np.concatenate(processed_images, axis=0)
  558. return data, labels
  559. def main(repo_path):
  560. train_csv_path = repo_path / "data/prepared/train.csv"
  561. train_data, labels = load_data(train_csv_path)
  562. rf = RandomForestClassifier()
  563. trained_model = rf.fit(train_data, labels)
  564. dump(trained_model, repo_path / "model/model.joblib")
  565. if __name__ == "__main__":
  566. repo_path = Path(__file__).parent.parent
  567. main(repo_path)
  568. EOT
  569. Afterwards, since ``train.py`` is changed, :shcmd:`dvc status` will realize that one dependency of the pipeline stage "train" has changed:
  570. .. runrecord:: _examples/DL-101-168-163
  571. :workdir: DVCvsDL/DVC
  572. :language: console
  573. ### DVC
  574. $ dvc status
  575. Since the code change (stage 2) will likely affect the metric (stage 3), it is best to reproduce the whole chain.
  576. You can reproduce a DVC pipeline up to and including a given stage with the :shcmd:`dvc repro <stagename>` command:
  577. .. runrecord:: _examples/DL-101-168-164
  578. :workdir: DVCvsDL/DVC
  579. :language: console
  580. ### DVC
  581. $ dvc repro evaluate
  582. DVC checks the dependencies of the pipeline and re-executes commands that need to be executed again.
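The logic of deciding what needs to be re-executed can be sketched as a walk over the stage graph. This is a simplified illustration under stated assumptions, not DVC's algorithm: the function ``stages_to_rerun`` is hypothetical, the metrics file is treated like an ordinary output, and every stage downstream of a stale stage is conservatively marked stale as well.

```python
# Stage definitions mirroring the three stages set up in this section.
STAGES = {
    "prepare": {"deps": ["src/prepare.py", "data/raw"],
                "outs": ["data/prepared/train.csv", "data/prepared/test.csv"]},
    "train": {"deps": ["src/train.py", "data/prepared/train.csv"],
              "outs": ["model/model.joblib"]},
    "evaluate": {"deps": ["src/evaluate.py", "model/model.joblib"],
                 "outs": ["metrics/accuracy.json"]},
}


def stages_to_rerun(stages, changed, target):
    """Return the stale stages needed for `target`, in execution order."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    order, stale = [], set()

    def visit(name):
        if name in order:
            return
        upstream = [producer[d] for d in stages[name]["deps"] if d in producer]
        for parent in upstream:
            visit(parent)
        order.append(name)
        # A stage is stale if a dependency changed or an upstream stage is stale.
        direct = any(d in changed for d in stages[name]["deps"])
        inherited = any(p in stale for p in upstream)
        if direct or inherited:
            stale.add(name)

    visit(target)
    return [name for name in order if name in stale]


rerun = stages_to_rerun(STAGES, changed={"src/train.py"}, target="evaluate")
print(rerun)  # ['train', 'evaluate']
```

With only ``src/train.py`` changed, the preparation stage is left alone, while training and evaluation are scheduled again.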
  583. Compared to the branch ``sgd-pipeline``, the workspace in the current ``random-forest`` branch contains a changed script (``src/train.py``), a changed trained classifier (``model/model.joblib``), and a changed metric (``metrics/accuracy.json``).
  584. All these changes need to be committed, tagged, and pushed now.
  585. .. runrecord:: _examples/DL-101-168-165
  586. :workdir: DVCvsDL/DVC
  587. :language: console
  588. ### DVC
  589. $ git add --all
  590. $ git commit -m "Train Random Forest classifier"
  591. $ dvc commit
  592. $ git push --set-upstream origin random-forest
  593. $ git tag -a randomforest -m "Random Forest classifier with 80.99% accuracy."
  594. $ git push origin --tags
  595. $ dvc push
  596. At this point, you can compare metrics across multiple tags:
  597. .. runrecord:: _examples/DL-101-168-166
  598. :workdir: DVCvsDL/DVC
  599. :language: console
  600. ### DVC
  601. $ dvc metrics show -T
  602. Done!
  603. DataLad workflow
  604. """"""""""""""""
  605. For a direct comparison to DVC, we'll try to mimic the DVC workflow as closely as it is possible with DataLad.
  606. **Model 1: SGD classifier**
  607. .. runrecord:: _examples/DL-101-168-170
  608. :workdir: DVCvsDL/DVC
  609. :language: console
  610. ### DVC-DataLad
  611. $ cd ../DVC-DataLad
  612. As there is no workflow manager in DataLad [#f9]_, each script execution needs to be done separately.
  613. To record the execution, get all relevant inputs, and recompute outputs at later points, we can set up a :dlcmd:`run` call [#f10]_.
  614. Later on, we can rerun a range of :dlcmd:`run` calls at once to recompute the relevant aspects of the analysis.
  615. To harmonize execution and to assist with reproducibility of the results, we generally recommend to create a container (Docker or Singularity), add it to the repository as well, and use :dlcmd:`containers-run` call [#f11]_ and have that reran, but we'll stay basic here.
  616. Let's start with data preparation.
  617. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.
  618. .. runrecord:: _examples/DL-101-168-171
  619. :workdir: DVCvsDL/DVC-DataLad
  620. :language: console
  621. :realcommand: datalad run --message "Prepare the train and testing data" --input "data/raw/*" --output "data/prepared/*" python code/prepare.py | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
  622. ### DVC-DataLad
  623. $ datalad run --message "Prepare the train and testing data" \
  624. --input "data/raw/*" \
  625. --output "data/prepared/*" \
  626. python code/prepare.py
  627. The results of this computation are automatically saved and associated with their inputs and command execution.
  628. This information isn't stored in a separate file, but in the Git history, and saved with the commit message we have attached to the :dlcmd:`run` command.
  629. To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could also use a range of other identifiers (such as commit hashes) to identify this computation.
  630. As we have now set up our data and are ready for the analysis, we will name the first tag "ready-for-analysis".
  631. This can be done with :gitcmd:`tag`, but also with :dlcmd:`save`.
  632. .. runrecord:: _examples/DL-101-168-172
  633. :workdir: DVCvsDL/DVC-DataLad
  634. :language: console
  635. ### DVC-DataLad
  636. $ datalad save --version-tag ready-for-analysis
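To make the idea of provenance stored in the Git history more concrete, the following sketch extracts a machine-readable record from a commit message. It rests on assumptions: the sample message is hand-written and abbreviated, the two marker lines reflect how :dlcmd:`run` commit messages commonly look, and ``extract_run_record`` is a hypothetical helper. Inspect a real commit with ``git log`` to see the actual record in your dataset.

```python
import json

# Assumed marker lines that delimit the record inside the commit message.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"

# Hand-written, abbreviated example; a real message would come from `git log`.
MESSAGE = """[DATALAD RUNCMD] Prepare the train and testing data

=== Do not change lines below ===
{
 "cmd": "python code/prepare.py",
 "exit": 0,
 "inputs": ["data/raw/*"],
 "outputs": ["data/prepared/*"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""


def extract_run_record(message):
    """Return the JSON record embedded in a commit message, or None."""
    if START not in message or END not in message:
        return None
    payload = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(payload)


record = extract_run_record(MESSAGE)
print(record["cmd"], record["inputs"], record["outputs"])
```

Because the record travels with the commit, no additional pipeline file has to be kept in sync with the history.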
  637. Let's continue with training by running ``code/train.py`` on the prepared data.
  638. .. runrecord:: _examples/DL-101-168-173
  639. :workdir: DVCvsDL/DVC-DataLad
  640. :language: console
  641. ### DVC-DataLad
  642. $ datalad run --message "Train an SGD classifier" \
  643. --input "data/prepared/*" \
  644. --output "model/model.joblib" \
  645. python code/train.py
  646. As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.
  647. As a last step, we evaluate the first model:
  648. .. runrecord:: _examples/DL-101-168-174
  649. :workdir: DVCvsDL/DVC-DataLad
  650. :language: console
  651. ### DVC-DataLad
  652. $ datalad run --message "Evaluate SGD classifier model" \
  653. --input "model/model.joblib" \
  654. --output "metrics/accuracy.json" \
  655. python code/evaluate.py
  656. At this point, the first accuracy metric is saved in ``metrics/accuracy.json``.
  657. Let's add a tag to declare that it belongs to the SGD classifier.
  658. .. runrecord:: _examples/DL-101-168-175
  659. :workdir: DVCvsDL/DVC-DataLad
  660. :language: console
  661. ### DVC-DataLad
  662. $ datalad save --version-tag SGD
  663. Let's now change the training script to use a random forest classifier as before:
  664. .. runrecord:: _examples/DL-101-168-176
  665. :workdir: DVCvsDL/DVC-DataLad
  666. :language: console
  667. :emphasize-lines: 10, 38-39
  668. ### DVC-DataLad
  669. $ cat << EOT >| code/train.py
  670. from joblib import dump
  671. from pathlib import Path
  672. import numpy as np
  673. import pandas as pd
  674. from skimage.io import imread_collection
  675. from skimage.transform import resize
  676. from sklearn.ensemble import RandomForestClassifier
  677. def load_images(data_frame, column_name):
  678. filelist = data_frame[column_name].to_list()
  679. image_list = imread_collection(filelist)
  680. return image_list
  681. def load_labels(data_frame, column_name):
  682. label_list = data_frame[column_name].to_list()
  683. return label_list
  684. def preprocess(image):
  685. resized = resize(image, (100, 100, 3))
  686. reshaped = resized.reshape((1, 30000))
  687. return reshaped
  688. def load_data(data_path):
  689. df = pd.read_csv(data_path)
  690. labels = load_labels(data_frame=df, column_name="label")
  691. raw_images = load_images(data_frame=df, column_name="filename")
  692. processed_images = [preprocess(image) for image in raw_images]
  693. data = np.concatenate(processed_images, axis=0)
  694. return data, labels
  695. def main(repo_path):
  696. train_csv_path = repo_path / "data/prepared/train.csv"
  697. train_data, labels = load_data(train_csv_path)
  698. rf = RandomForestClassifier()
  699. trained_model = rf.fit(train_data, labels)
  700. dump(trained_model, repo_path / "model/model.joblib")
  701. if __name__ == "__main__":
  702. repo_path = Path(__file__).parent.parent
  703. main(repo_path)
  704. EOT
  705. We need to save this change:
  706. .. runrecord:: _examples/DL-101-168-177
  707. :workdir: DVCvsDL/DVC-DataLad
  708. :language: console
  709. $ datalad save -m "Switch to random forest classification" code/train.py
  710. Afterwards, we can rerun all run records between the tags ``ready-for-analysis`` and ``SGD`` using :dlcmd:`rerun`.
  711. If we wanted to, we could compute this on a different branch by using the ``--branch`` option:
  712. .. runrecord:: _examples/DL-101-168-178
  713. :workdir: DVCvsDL/DVC-DataLad
  714. :language: console
  715. $ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready-for-analysis..SGD
  716. Done!
  717. The difference in accuracies between models could now, for example, be compared with a ``git diff``:
  718. .. runrecord:: _examples/DL-101-168-179
  719. :workdir: DVCvsDL/DVC-DataLad
  720. :language: console
  721. $ git diff SGD -- metrics/accuracy.json
  722. Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.
  723. .. only:: adminmode
  724. We need to clean up -- reset the state of the "data version control" repo to its original state, force push
  725. DISABLED, NOT NECESSARY WITH A CHANGE TO A LOCAL PUSH TARGET
  726. .. runrecord:: _examples/DL-101-168-190
  727. :workdir: DVCvsDL/DVC
  728. :language: console
  729. #$ git checkout master
  730. #$ git reset --hard b796ba195447268ebc51e20a778fb2db9f11e341
  731. #$ git push --force origin master
  732. ## delete the branches and tags
  733. ## note: tags & branches were renamed; if uncommenting, check GitHub repo first
  734. #$ git push origin :random-forest :sgd-pipeline
  735. #$ git tag -d randomforest sgd
  736. Summary
  737. ^^^^^^^
  738. DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses.
  739. DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines.
  740. Despite their similar purpose, the looks, feels and functions of both tools are different, and it is a personal decision which one you feel more comfortable with.
  741. Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and ``.gitignore`` files.
  742. But despite the reliance on Git, DVC barely integrates with Git -- changes done to files in DVC cannot be detected by Git and vice versa, DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have distinct command functions and concepts that nevertheless share the same name.
  743. Thus, DVC users need to master Git *and* DVC workflows and intertwine them correctly.
  744. In return, DVC provides users with workflow management and reporting tuned to machine learning analyses. It also provides an approach to "data version control" that is somewhat more lightweight and more uniform across operating and file systems than git-annex, which DataLad uses.
  745. .. rubric:: Footnotes
  746. .. [#f1] Instructions on :term:`fork`\ing and cloning the repo are in the README of the repository: `github.com/realpython/data-version-control <https://github.com/realpython/data-version-control>`_.
  747. .. [#f2] The two procedures provide the dataset with useful structures and configurations for its purpose: ``yoda`` creates a dataset structure with a ``code`` directory and makes sure that everything kept in ``code`` will be committed to :term:`Git` (thus allowing for direct sharing of code). ``text2git`` makes sure that any other text file in the dataset will be stored in Git as well. The sections :ref:`text2git` and :ref:`yodaproc` explain the two configurations in detail.
  748. .. [#f3] To re-read about how :term:`git-annex` handles versioning of (large) files, go back to section :ref:`symlink`.
  749. .. [#f4] You can read more about ``.gitignore`` files in the section :ref:`gitignore`.
  750. .. [#f5] If you are curious about why data is duplicated in a cache or why the paths to the data are placed into a ``.gitignore`` file, this section in the `DVC tutorial <https://realpython.com/python-data-version-control/#tracking-files>`__ has more insights on the internals of this process.
  751. .. [#f6] The sections :ref:`populate` and :ref:`modify` introduce the concepts of saving and modifying files in DataLad datasets.
  752. .. [#f7] A similar procedure for sharing data on a local file system for DataLad is shown in the chapter :ref:`sharelocal1`.
  753. .. [#f8] In DataLad datasets, data duplication is usually avoided as :term:`git-annex` uses :term:`symlink`\s. Only on file systems that lack support for symlinks or for removing write :term:`permissions` from files (so called "crippled file systems" such as ``/sdcard`` on Android, FAT or NTFS) git-annex needs to duplicate data.
  754. .. [#f9] yet.
  755. .. [#f10] To re-read about :dlcmd:`run` and :dlcmd:`rerun`, check out chapter :ref:`chapter_run`.
  756. .. [#f11] To re-read about joining code, execution, data, results and software environment in a re-executable record with :dlcmd:`containers-run`, check out section :ref:`containersrun`.