.. index:: ! Usecase; Machine Learning Analysis
.. _usecase_ML:

DataLad for reproducible machine-learning analyses
--------------------------------------------------

This use case demonstrates an automatically and computationally reproducible analysis in the context of a machine learning (ML) project.
Using an example image classification project, it demonstrates how one can

- link data, models, parametrization, software, and results using ``datalad containers-run``,
- keep track of results and compare them across models or parametrizations,
- stay computationally reproducible, transparent, and, importantly, intuitive and clear.

The Challenge
^^^^^^^^^^^^^

Chad is a recent college graduate and has just started at a wicked start-up that prides itself on using "AI and ML for individualized medicine" in the Bay Area.
Even though he's extraordinarily motivated, the fast pace and pressure to deliver at his job are still stressful.
For his first project, he's tasked with training a machine learning model to detect cancerous tissue in `computer tomography (CT) <https://en.wikipedia.org/wiki/CT_scan>`_ images.
Excited and eager to impress, he builds his first image classification ML model with state-of-the-art Python libraries and a `stochastic gradient descent (SGD) <https://en.wikipedia.org/wiki/Stochastic_gradient_descent>`_ classifier.
"Not too bad", he thinks, when he shares the classification accuracy with his team lead, "way higher than chance level!"
"Fantastic, Chad, but listen, we really need a higher accuracy than this.
Our customers deserve that. Turn up the number of iterations. Also, try a random forest classification instead. And also, I need that done by tomorrow morning at the latest, Chad.
Take a bag of organic sea-weed-kale crisps from the kitchen, oh, and also, you're coming to our next project pitch at the roof-top bar on Sunday?"
Hastily, Chad pulls an all-nighter to adjust his models by dawn.
Increase iterations here, switch classifiers there, oh no, did this increase or decrease the overall accuracy? Tune some parameters here and there, re-do that previous one just one more time just to be sure.
A quick two-hour nap on the office couch, and he is ready for the `daily scrum <https://en.wikipedia.org/wiki/Scrum_(software_development)#Daily_scrum>`_ in the morning.
"Shit, what accuracy belonged to which parametrization again?", he thinks to himself as he pitches his analysis and presents his results.
But everyone rushes to the next project already.
A week later, when a senior colleague is tasked with checking his analyses, Chad needs to spend a few hours with them to guide them through his chaotic analysis directory full of Jupyter notebooks.
They struggle to figure out which Python libraries to install on the colleague's computer, have to adjust hard-coded :term:`absolute path`\s, and fail to reproduce the results that he presented.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics.
This complexity can jeopardize reproducibility -- it is hard to remember or figure out which model was trained on which version of what data, and which optimization worked best.
But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.
DataLad provides many concepts and tools that assist in creating transparent, computationally and automatically reproducible analyses.
They range from general principles on how to structure analysis projects, to linking and versioning software and data alongside code, to capturing analysis executions as re-executable run records.
To make a machine-learning project intuitively structured and transparent, Chad applies DataLad's YODA principles to his work.
He keeps the training and testing data as a reusable, standalone component, installed as a subdataset, and keeps his analysis dataset completely self-contained with :term:`relative path`\s in all his scripts.
Later, he can share his dataset without the need to adjust paths.
Chad also attaches a software container to his dataset, so that others don't need to recreate his Python environment.
And lastly, he wraps every command that he executes in a ``datalad containers-run`` call, such that others don't need to rely on his brain to understand the analysis, but can have a computer recompute every analysis step in the correct software environment.
Using concise commit messages and :term:`tag`\s, Chad creates a transparent and intuitive dataset history.
With these measures in place, he can experiment flexibly with various models and data, and not only has means to compare his models, but can also set his dataset to the state in which his most preferred model is ready to be used.

Step-by-Step
^^^^^^^^^^^^

.. admonition:: Required software

   The analysis requires the Python packages `scikit-learn <https://scikit-learn.org>`_, `scikit-image <https://scikit-image.org>`_, `pandas <https://pandas.pydata.org>`_, and `numpy <https://numpy.org>`_.
   We have built a :term:`Singularity` :term:`software container` with all relevant software, and the code below will use the ``datalad-container`` extension [#f1]_ to download the container from :term:`Singularity-Hub` and execute all analysis steps in this software environment.
   If you prefer not to install the ``datalad-container`` extension or Singularity, you can instead create a :term:`virtual environment` with all necessary software [#f2]_ and exchange the ``datalad containers-run`` commands below for ``datalad run`` commands, as sketched below.
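
   A minimal sketch of this alternative setup (the package list mirrors the requirements above; pinning versions to match the container exactly is up to you):

   .. code-block:: console

      $ python3 -m venv ~/env/ml-project
      $ source ~/env/ml-project/bin/activate
      $ pip install datalad scikit-learn scikit-image pandas numpy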

Let's start with an overview of the analysis plan:
We're aiming for an image classification analysis.
In this type of ML analysis, a `classifier` is trained on a subset of the data, the `training set`, and is then used for predictions on a previously unseen subset of the data, the `test set`.
Its task is to label the test data with one of several class attributes it is trained to classify, such as `"cancerous" or "non-cancerous" with medical data <https://www.nature.com/articles/d41586-020-00847-2>`_, `"cat" or "dog" <https://www.kaggle.com/c/dogs-vs-cats>`_ with your pictures of pets, or "spam" versus "not spam" in your emails.
In most cases, classification analyses are `supervised` learning methods: The correct class attributes are known, and the classifier is trained on a `labeled` set of training data.
Its classification accuracy is calculated by comparing its predictions on the unlabeled testing set with the correct labels.
As a first analysis step, training and testing data therefore need to be labeled -- both to allow model training and model evaluation.
In a second step, a classifier needs to be trained on the labeled training data.
It learns which features are to be associated with which class attribute.
In a final step, the trained classifier classifies the test data, and its results are evaluated against the true labels.
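
In scikit-learn terms, this train-test logic fits into a few lines.
Here is a minimal, self-contained sketch with toy data to make the concepts concrete -- it is not part of the analysis below:

.. code-block:: python

   from sklearn.datasets import make_classification
   from sklearn.linear_model import SGDClassifier
   from sklearn.metrics import accuracy_score
   from sklearn.model_selection import train_test_split

   # toy data: 100 labeled samples from two classes
   X, y = make_classification(n_samples=100, n_classes=2, random_state=0)
   # split into a labeled training set and a held-out test set
   X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
   # fit the classifier on the training set only
   clf = SGDClassifier(max_iter=100).fit(X_train, y_train)
   # accuracy: proportion of test samples whose predicted label matches the true label
   print(accuracy_score(y_test, clf.predict(X_test)))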

Below, we will go through an image classification analysis on a few categories of the `Imagenette dataset <https://github.com/fastai/imagenette>`_, a smaller subset of the `Imagenet dataset <https://image-net.org>`_, one of the most widely used large-scale datasets for benchmarking image classification algorithms. It contains images from ten categories (tench (a type of fish), English springer (a type of dog), cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).
We will prepare a subset of the data, and train and evaluate different types of classifiers.
The analysis is based on `this tutorial <https://realpython.com/python-data-version-control>`_.
First, let's create an input data dataset.
Later, this dataset will be installed as a subdataset of the analysis.
This complies with the :ref:`YODA principles <yoda>` and helps to keep the input data modular, reusable, and transparent.

.. runrecord:: _examples/ml-101
   :language: console
   :cast: usecase_ml
   :workdir: usecases

   $ datalad create imagenette

The original Imagenette dataset contains 10 image categories and can be downloaded as an archive from Amazon (`s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz <https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz>`_), but for this tutorial we're using a subset of this dataset with only two categories.
It is available as an archive from the :term:`Open Science Framework` (OSF).
The :dlcmd:`download-url` command with the ``--archive`` flag not only extracts and saves the data, but also registers the dataset's origin, such that the data can be re-retrieved on demand from its original location.

.. runrecord:: _examples/ml-102
   :language: console
   :cast: usecase_ml
   :workdir: usecases
   :realcommand: cd imagenette && datalad download-url --archive --message "Download Imagenette dataset" 'https://osf.io/d6qbz/download' | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)$' && sleep 15

   $ cd imagenette
   $ datalad download-url \
     --archive \
     --message "Download Imagenette dataset" \
     'https://osf.io/d6qbz/download'
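
Because ``--archive`` registered the archive as the origin of each extracted file, file contents can later be dropped to free disk space and re-obtained on demand.
A quick sketch (the category folder is just one example path from the extracted data):

.. code-block:: console

   $ datalad drop train/n03445777
   $ datalad get train/n03445777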

Next, let's create an analysis dataset.
For a pre-structured and pre-configured starting point, the dataset can be created with the ``yoda`` and ``text2git`` :term:`run procedure`\s [#f3]_.
These configurations create a ``code/`` directory, place place-holder ``README`` files in appropriate locations, and make sure that all text files, e.g., scripts or evaluation results, are kept in :term:`Git` to allow for easier modifications.

.. runrecord:: _examples/ml-103
   :language: console
   :cast: usecase_ml
   :workdir: usecases/imagenette

   $ cd ../
   $ datalad create -c text2git -c yoda ml-project

Afterwards, the input dataset can be installed from a local path as a subdataset, using :dlcmd:`clone` with the ``-d``/``--dataset`` flag and a ``.`` to denote the current dataset:

.. runrecord:: _examples/ml-104
   :language: console
   :cast: usecase_ml
   :workdir: usecases

   $ cd ml-project
   $ mkdir -p data
   # install the dataset into data/
   $ datalad clone -d . ../imagenette data/raw

Here are the dataset contents up to now:

.. runrecord:: _examples/ml-105
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   # show the directory hierarchy
   $ tree -d

Next, let's add the necessary software to the dataset.
This is done using the ``datalad-container`` extension and the :dlcmd:`containers-add` command. This command takes an arbitrary name and a path or URL to a :term:`software container`, registers the container's origin, and adds it under the specified name to the dataset.
If used with a public URL, for example to :term:`Singularity-Hub`, others that you share your dataset with can retrieve the container as well [#f1]_.

.. runrecord:: _examples/ml-106
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project
   :realcommand: datalad containers-add software --call-fmt 'singularity exec -B {{pwd}} --cleanenv {img} {cmd}' --url shub://adswa/python-ml:1

   $ datalad containers-add software --url shub://adswa/python-ml:1

At this point, with input data and software set up, we can start with the first step: data preparation.
The Imagenette dataset is structured into ``train/`` and ``val/`` folders, and each folder contains one sub-folder per image category.
To prepare the dataset for training and testing a classifier, we create a mapping between file names and image categories.
In this example we only use two categories, "golf balls" (subdirectory ``n03445777``) and "parachutes" (subdirectory ``n03888257``).
The following script creates two files, ``data/train.csv`` and ``data/test.csv``, from the input data.
Each contains file names and category associations for the files in those subdirectories.
Note how, in accordance with the :ref:`YODA principles <yoda>`, the script only contains :term:`relative path`\s to make the dataset portable.

.. runrecord:: _examples/ml-107
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ cat << EOT > code/prepare.py
   #!/usr/bin/env python3

   import pandas as pd
   from pathlib import Path

   FOLDERS_TO_LABELS = {"n03445777": "golf ball",
                        "n03888257": "parachute"}

   def get_files_and_labels(source_path):
       images = []
       labels = []
       for image_path in source_path.rglob("*/*.JPEG"):
           filename = image_path
           folder = image_path.parent.name
           if folder in FOLDERS_TO_LABELS:
               images.append(filename)
               label = FOLDERS_TO_LABELS[folder]
               labels.append(label)
       return images, labels

   def save_as_csv(filenames, labels, destination):
       data_dictionary = {"filename": filenames, "label": labels}
       data_frame = pd.DataFrame(data_dictionary)
       data_frame.to_csv(destination)

   def main(repo_path):
       data_path = repo_path / "data"
       train_path = data_path / "raw/train"
       test_path = data_path / "raw/val"
       train_files, train_labels = get_files_and_labels(train_path)
       test_files, test_labels = get_files_and_labels(test_path)
       save_as_csv(train_files, train_labels, data_path / "train.csv")
       save_as_csv(test_files, test_labels, data_path / "test.csv")

   if __name__ == "__main__":
       repo_path = Path(__file__).parent.parent
       main(repo_path)
   EOT

Executing the `heredoc <https://en.wikipedia.org/wiki/Here_document>`_ in the code block above has created a script ``code/prepare.py``:

.. runrecord:: _examples/ml-108
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ datalad status

We add it to the dataset using :dlcmd:`save`:

.. runrecord:: _examples/ml-109
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ datalad save -m "Add script for data preparation for 2 categories" code/prepare.py

This script can now be used to prepare the data.
Note how it, in accordance with the :ref:`YODA principles <yoda>`, saves the files into the superdataset and leaves the input dataset untouched.
When run, it will create files with the following structure::

   ,filename,label
   0,data/raw/val/n03445777/n03445777_20061.JPEG,golf ball
   1,data/raw/val/n03445777/n03445777_9740.JPEG,golf ball
   2,data/raw/val/n03445777/n03445777_3900.JPEG,golf ball
   3,data/raw/val/n03445777/n03445777_5862.JPEG,golf ball
   4,data/raw/val/n03445777/n03445777_4172.JPEG,golf ball
   5,data/raw/val/n03445777/n03445777_14301.JPEG,golf ball
   6,data/raw/val/n03445777/n03445777_2951.JPEG,golf ball
   7,data/raw/val/n03445777/n03445777_8732.JPEG,golf ball
   8,data/raw/val/n03445777/n03445777_5810.JPEG,golf ball
   9,data/raw/val/n03445777/n03445777_3132.JPEG,golf ball
   [...]
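
Once these files exist, a quick look at the label distribution can serve as a sanity check, for example with pandas in an interactive Python session (a sketch, not part of the recorded analysis):

.. code-block:: python

   import pandas as pd

   df = pd.read_csv("data/train.csv")   # columns: index, filename, label
   print(df["label"].value_counts())    # number of images per category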

To capture all provenance and perform the computation in the correct software environment, this is best done in a :dlcmd:`containers-run` command:

.. runrecord:: _examples/ml-110
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project
   :realcommand: datalad containers-run -n software -m "Prepare the data for categories golf balls and parachutes" --input 'data/raw/train/n03445777' --input 'data/raw/val/n03445777' --input 'data/raw/train/n03888257' --input 'data/raw/val/n03888257' --output 'data/train.csv' --output 'data/test.csv' "python3 code/prepare.py" | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'

   $ datalad containers-run -n software \
     -m "Prepare the data for categories golf balls and parachutes" \
     --input 'data/raw/train/n03445777' \
     --input 'data/raw/val/n03445777' \
     --input 'data/raw/train/n03888257' \
     --input 'data/raw/val/n03888257' \
     --output 'data/train.csv' \
     --output 'data/test.csv' \
     "python3 code/prepare.py"

Beyond the command to execute and the container name (``-n``/``--container-name``), this command can take a human-readable commit message to summarize the operation (``-m``/``--message``) and input and output specifications (``-i``/``--input``, ``-o``/``--output``).
DataLad will make sure to retrieve everything labeled as ``--input`` prior to running the command, and specifying ``--output`` ensures that the files can be updated should the command be rerun at a later point [#f4]_.
It saves the results of this command together with a machine-readable run record in the dataset history.
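
As the run record is stored in the commit message, it can be inspected with plain Git, for instance like this (a sketch; the exact fields of the record may differ between DataLad versions):

.. code-block:: console

   $ git log -n 1 --format=%B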

Next, the first model can be trained.

.. runrecord:: _examples/ml-111
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ cat << EOT > code/train.py
   #!/usr/bin/env python3

   from joblib import dump
   from pathlib import Path

   import numpy as np
   import pandas as pd
   from skimage.io import imread_collection
   from skimage.transform import resize
   from sklearn.linear_model import SGDClassifier

   def load_images(data_frame, column_name):
       filelist = data_frame[column_name].to_list()
       image_list = imread_collection(filelist)
       return image_list

   def load_labels(data_frame, column_name):
       label_list = data_frame[column_name].to_list()
       return label_list

   def preprocess(image):
       resized = resize(image, (100, 100, 3))
       reshaped = resized.reshape((1, 30000))
       return reshaped

   def load_data(data_path):
       df = pd.read_csv(data_path)
       labels = load_labels(data_frame=df, column_name="label")
       raw_images = load_images(data_frame=df, column_name="filename")
       processed_images = [preprocess(image) for image in raw_images]
       data = np.concatenate(processed_images, axis=0)
       return data, labels

   def main(repo_path):
       train_csv_path = repo_path / "data/train.csv"
       train_data, labels = load_data(train_csv_path)
       sgd = SGDClassifier(max_iter=10)
       trained_model = sgd.fit(train_data, labels)
       dump(trained_model, repo_path / "model.joblib")

   if __name__ == "__main__":
       repo_path = Path(__file__).parent.parent
       main(repo_path)
   EOT

This script trains a stochastic gradient descent (SGD) classifier on the training data.
The files listed in ``train.csv`` are read and preprocessed into the same shape, and an SGD model is fitted to predict the image labels from the data.
The trained model is then saved into a ``model.joblib`` file -- this allows the classifier to be transparently cached as a Python object on disk.
Later, `the cached model can be applied to various data without the need to retrain the classifier <https://scikit-learn.org/stable/modules/model_persistence.html>`_.
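
For illustration, reusing such a cached model amounts to only a few lines -- the ``new_data`` array below is a hypothetical stand-in for any preprocessed images of shape ``(n_images, 30000)``:

.. code-block:: python

   import numpy as np
   from joblib import load

   model = load("model.joblib")         # restore the trained classifier
   new_data = np.random.rand(5, 30000)  # hypothetical preprocessed images
   print(model.predict(new_data))       # predicted labels, no retraining needed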

Let's save the script.

.. runrecord:: _examples/ml-112
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ datalad save -m "Add SGD classification script" code/train.py

The last analysis step is to test the trained classifier.
We will use the following script for this:

.. runrecord:: _examples/ml-113
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ cat << EOT > code/evaluate.py
   #!/usr/bin/env python3

   from joblib import load
   import json
   from pathlib import Path
   from sklearn.metrics import accuracy_score

   from train import load_data

   def main(repo_path):
       test_csv_path = repo_path / "data/test.csv"
       test_data, labels = load_data(test_csv_path)
       model = load(repo_path / "model.joblib")
       predictions = model.predict(test_data)
       accuracy = accuracy_score(labels, predictions)
       metrics = {"accuracy": accuracy}
       print(metrics)
       accuracy_path = repo_path / "accuracy.json"
       accuracy_path.write_text(json.dumps(metrics))

   if __name__ == "__main__":
       repo_path = Path(__file__).parent.parent
       main(repo_path)
   EOT

It will load the trained and dumped model and use it to test its prediction performance on the previously unseen test data.
To evaluate the model performance, it calculates the accuracy of the prediction, i.e., the proportion of correctly labeled images, prints it to the terminal, and saves it into a JSON file in the superdataset.
As this script constitutes the last analysis step, let's save it with a :term:`tag`.
It's entirely optional to do this, but just as commit messages are an easier way for humans to get an overview of a commit's contents, a tag is an easier way for humans to identify a change than a commit hash.
With this script set up, we're ready for analysis, and thus can tag this state ``ready4analysis`` to identify it more easily later.

.. runrecord:: _examples/ml-114
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ datalad save -m "Add script to evaluate model performance" --version-tag "ready4analysis" code/evaluate.py
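
Such tags are plain Git tags: at any later point, they can be listed, or used in place of a commit hash wherever Git or DataLad expect a reference (a quick sketch):

.. code-block:: console

   $ git tag                                 # list all tags
   $ git log --oneline -n 1 ready4analysis   # show the tagged commit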

Afterwards, we can train the first model:

.. runrecord:: _examples/ml-115
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project
   :realcommand: datalad containers-run -n software -m "Train an SGD classifier on the data" --input 'data/raw/train/n03445777' --input 'data/raw/train/n03888257' --output 'model.joblib' "python3 code/train.py" | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)$'

   $ datalad containers-run -n software \
     -m "Train an SGD classifier on the data" \
     --input 'data/raw/train/n03445777' \
     --input 'data/raw/train/n03888257' \
     --output 'model.joblib' \
     "python3 code/train.py"

And finally, we're ready to find out how well the model did and run the last script:

.. runrecord:: _examples/ml-116
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project
   :realcommand: datalad containers-run -n software -m "Evaluate SGD classifier on test data" --input 'data/raw/val/n03445777' --input 'data/raw/val/n03888257' --output 'accuracy.json' "python3 code/evaluate.py" | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)$'

   $ datalad containers-run -n software \
     -m "Evaluate SGD classifier on test data" \
     --input 'data/raw/val/n03445777' \
     --input 'data/raw/val/n03888257' \
     --output 'accuracy.json' \
     "python3 code/evaluate.py"

This initial accuracy isn't quite satisfying yet.
What could have gone wrong?
The model would probably benefit from a few more training iterations, for a start.
Instead of 10, the change below increases the number of iterations to 100.
Note that the code block below makes this change with the stream editor :term:`sed` for the sake of automatically executed code in the handbook, but you could also apply this change with a text editor "by hand".

.. runrecord:: _examples/ml-117
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ sed -i 's/SGDClassifier(max_iter=10)/SGDClassifier(max_iter=100)/g' code/train.py

Here's what has changed:

.. runrecord:: _examples/ml-118
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ git diff

Let's save the change...

.. runrecord:: _examples/ml-119
   :language: console
   :cast: usecase_ml
   :workdir: usecases/ml-project

   $ datalad save -m "Increase the number of iterations to 100" --version-tag "SGD-100" code/train.py

... and try again.
As we need to retrain the classifier and re-evaluate its performance, we rerun every run record between the point in time we created the ``ready4analysis`` tag and now.
This will update both the ``model.joblib`` and the ``accuracy.json`` files, but their past versions remain in the dataset history.
One way to do this is to specify a range between the two tags, but likewise, commit hashes would work, or a specification using ``--since`` [#f5]_.
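
Equivalent invocations would be, for example (sketches only, not executed here):

.. code-block:: console

   $ datalad rerun --since=ready4analysis   # every run record since the tag
   $ datalad rerun <commit-hash>            # a single run record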

.. runrecord:: _examples/ml-130
   :workdir: usecases/ml-project
   :cast: usecase_ml
   :language: console

   $ datalad rerun -m "Recompute classification with more iterations" ready4analysis..SGD-100

Any better? Mhh, not so much. Maybe a different classifier does the job better.
Let's switch from SGD to a `random forest classification <https://en.wikipedia.org/wiki/Random_forest>`_.
The code block below writes the relevant changes (highlighted) into the script.

.. runrecord:: _examples/ml-131
   :workdir: usecases/ml-project
   :language: console
   :cast: usecase_ml
   :emphasize-lines: 11, 38-39

   $ cat << EOT >| code/train.py
   #!/usr/bin/env python3

   from joblib import dump
   from pathlib import Path

   import numpy as np
   import pandas as pd
   from skimage.io import imread_collection
   from skimage.transform import resize
   from sklearn.ensemble import RandomForestClassifier

   def load_images(data_frame, column_name):
       filelist = data_frame[column_name].to_list()
       image_list = imread_collection(filelist)
       return image_list

   def load_labels(data_frame, column_name):
       label_list = data_frame[column_name].to_list()
       return label_list

   def preprocess(image):
       resized = resize(image, (100, 100, 3))
       reshaped = resized.reshape((1, 30000))
       return reshaped

   def load_data(data_path):
       df = pd.read_csv(data_path)
       labels = load_labels(data_frame=df, column_name="label")
       raw_images = load_images(data_frame=df, column_name="filename")
       processed_images = [preprocess(image) for image in raw_images]
       data = np.concatenate(processed_images, axis=0)
       return data, labels

   def main(repo_path):
       train_csv_path = repo_path / "data/train.csv"
       train_data, labels = load_data(train_csv_path)
       rf = RandomForestClassifier()
       trained_model = rf.fit(train_data, labels)
       dump(trained_model, repo_path / "model.joblib")

   if __name__ == "__main__":
       repo_path = Path(__file__).parent.parent
       main(repo_path)
   EOT

We need to save this change:

.. runrecord:: _examples/ml-132
   :workdir: usecases/ml-project
   :cast: usecase_ml
   :language: console

   $ datalad save -m "Switch to random forest classification" --version-tag "random-forest" code/train.py

And now we can retrain and re-evaluate again.
This time, in order to have very easy access to the trained models and evaluation results, we're rerunning the sequence of run records on a new :term:`branch` [#f6]_.
This way, we have access to a trained random forest model, a trained SGD model, and their respective results by simply switching branches.

.. runrecord:: _examples/ml-133
   :workdir: usecases/ml-project
   :cast: usecase_ml
   :language: console

   $ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready4analysis..SGD-100

This updated ``model.joblib`` to a trained random forest classifier, and also updated ``accuracy.json`` with the current model's evaluation.
The difference in accuracy between models can now, for example, be inspected with a ``git diff`` of the contents of ``accuracy.json`` against the :term:`main` :term:`branch`:

.. runrecord:: _examples/ml-134
   :workdir: usecases/ml-project
   :cast: usecase_ml
   :language: console

   $ git diff main -- accuracy.json

And if you decide to rather do more work on the SGD classifier, you can go back to the previous :term:`main` :term:`branch`:

.. runrecord:: _examples/ml-135
   :workdir: usecases/ml-project
   :cast: usecase_ml
   :language: console

   $ git checkout main
   $ cat accuracy.json

Your Git history becomes a log of everything you did, as well as the means to go back and forth between analysis states.
And this is not only useful for yourself -- it also makes your analyses and results transparent to others that you share your dataset with.
If you cache your trained models, there is no need to retrain them when traveling to past states of your dataset.
And if any aspect of your dataset changes -- from changes to the input data to changes to your trained model or code -- you can rerun these analysis stages automatically.
The attached software container makes sure that your analysis will always be rerun in the correct software environment, even if the dataset is shared with collaborators whose systems lack a Python installation.

References
^^^^^^^^^^

The analysis is adapted from the chapter :ref:`dvc`, which in turn is based on `this tutorial at RealPython.org <https://realpython.com/python-data-version-control>`_.

.. rubric:: Footnotes

.. [#f1] You can install the ``datalad-container`` extension from :term:`pip` via ``pip install datalad-container``. You can find out more about extensions in general in the section :ref:`extensions_intro`, and more on computationally reproducible analyses using ``datalad containers-run`` in the chapter :ref:`containersrun` and the use case :ref:`usecase_reproduce_neuroimg`.

.. [#f2] Unsure how to create a :term:`virtual environment`? You can find a tutorial using :term:`pip` and the ``virtualenv`` module `in the Python docs <https://packaging.python.org/guides/installing-using-pip-and-virtual-environments>`_.

.. [#f3] To re-read about :term:`run procedure`\s, check out section :ref:`procedures`.

.. [#f4] The chapter :ref:`chapter_run` introduces the options of ``datalad run`` and demonstrates their use. Note that ``--output``\s don't need to be individual files, but can also be directories or :term:`globbing` terms.

.. [#f5] In order to re-execute any run record in the last five commits, you could use ``--since=HEAD~5``, for example. You could also, however, rerun previous run commands individually, with ``datalad rerun <commit-hash>``.

.. [#f6] Rerunning on a different :term:`branch` is optional but handy. Alternatively, you could check out a previous state in the dataset's history to get access to a previous version of a file, reset the dataset to a previous state, or use commands like :gitcmd:`cat-file` to read out a non-checked-out file. The section :ref:`history` summarizes a number of common Git operations to interact with the dataset history.