  1. .. _yoda_project:
  2. YODA-compliant data analysis projects
  3. -------------------------------------
  4. Now that you know about the YODA principles, it is time to start working on
  5. ``DataLad-101``'s midterm project. Because the midterm project guidelines
  6. require a YODA-compliant data analysis project, you will not only have theoretical
  7. knowledge about the YODA principles, but also gain practical experience.
  8. In principle, you can prepare YODA-compliant data analyses in any programming
  9. language of your choice. But because you are already familiar with
  10. the `Python <https://www.python.org>`__ programming language, you decide
  11. to script your analysis in Python. Delighted, you find out that there is even
  12. a Python API for DataLad's functionality that you can read about in :ref:`a Findoutmore on DataLad in Python<fom-pythonapi>`.
  13. .. _pythonapi:
  14. .. index::
  15. pair: use DataLad API; with Python
  16. .. find-out-more:: DataLad's Python API
  17. :name: fom-pythonapi
  18. :float:
  19. .. _python:
  20. Whatever you can do with DataLad from the command line, you can also do it with
  21. DataLad's Python API.
  22. Thus, DataLad's functionality can also be used within interactive Python sessions
  23. or Python scripts.
  24. All of DataLad's user-oriented commands are exposed via ``datalad.api``.
  25. Thus, any command can be imported as a stand-alone command like this:
  26. .. code-block:: python
  27. >>> from datalad.api import <COMMAND>
  28. Alternatively, to import all commands, one can use
  29. .. code-block:: python
  30. >>> import datalad.api as dl
  31. and subsequently access commands as ``dl.get()``, ``dl.clone()``, and so forth.
  32. The `developer documentation <https://docs.datalad.org/en/latest/modref.html>`_
  33. of DataLad lists an overview of all commands, but naming is congruent to the
  34. command line interface. The only functionality that is not available at the
  35. command line is ``datalad.api.Dataset``, DataLad's core Python data type.
  36. Just like any other command, it can be imported like this:
  37. .. code-block:: python
  38. >>> from datalad.api import Dataset
  39. or like this:
  40. .. code-block:: python
  41. >>> import datalad.api as dl
  42. >>> dl.Dataset()
  43. A ``Dataset`` is a `class <https://docs.python.org/3/tutorial/classes.html>`_
  44. that represents a DataLad dataset. In addition to the
  45. stand-alone commands, all of DataLad's functionality is also available via
  46. `methods <https://docs.python.org/3/tutorial/classes.html#method-objects>`_
  47. of this class. Thus, these are two equally valid ways to create a new
  48. dataset with DataLad in Python:
  49. .. code-block:: python
  50. >>> from datalad.api import create, Dataset
  51. # create as a stand-alone command
  52. >>> create(path='scratch/test')
  53. [INFO ] Creating a new annex repo at /.../scratch/test
  54. Out[3]: <Dataset path=/home/me/scratch/test>
  55. # create as a dataset method
  56. >>> ds = Dataset(path='scratch/test')
  57. >>> ds.create()
  58. [INFO ] Creating a new annex repo at /.../scratch/test
  59. Out[3]: <Dataset path=/home/me/scratch/test>
  60. As shown above, the only required parameter for a Dataset is the ``path`` to
  61. its location, and this location may or may not exist yet.
  62. Stand-alone functions have a ``dataset=`` argument, corresponding to the
  63. ``-d/--dataset`` option in their command-line equivalent. You can specify
  64. the ``dataset=`` argument with a path (string) to your dataset (such as
  65. ``dataset='.'`` for the current directory, or ``dataset='path/to/ds'`` to
  66. another location). Alternatively, you can pass a ``Dataset`` instance to it:
  67. .. code-block:: python
  68. >>> from datalad.api import save, Dataset
  69. # use save with dataset specified as a path
  70. >>> save(dataset='path/to/dataset/')
  71. # use save with dataset specified as a dataset instance
  72. >>> ds = Dataset('path/to/dataset')
  73. >>> save(dataset=ds, message="saving all modifications")
  74. # use save as a dataset method (no dataset argument)
  75. >>> ds.save(message="saving all modifications")
  76. **Use cases for DataLad's Python API**
Using the command line or the Python API of DataLad are both valid ways to accomplish the same results.
Depending on your workflow, the Python API can help to automate dataset operations, provide an alternative
to the command line, or be useful for scripting reproducible data analyses.
One unique advantage of the Python API is the ``Dataset`` class:
As the Python API does not suffer from the startup time cost of the command line,
many consecutive calls to the API on a persistent ``Dataset`` object instance
can yield a substantial speed-up.
  84. You will also notice that the output of Python commands can be more verbose as the result records returned by each command do not get filtered by command-specific result renderers.
  85. Thus, the outcome of ``dl.status('myfile')`` matches that of :dlcmd:`status` only when ``-f``/``--output-format`` is set to ``json`` or ``json_pp``, as illustrated below.
  86. .. code-block:: python
  87. >>> import datalad.api as dl
  88. >>> dl.status('myfile')
  89. [{'type': 'file',
  90. 'gitshasum': '915983d6576b56792b4647bf0d9fa04d83ce948d',
  91. 'bytesize': 85,
  92. 'prev_gitshasum': '915983d6576b56792b4647bf0d9fa04d83ce948d',
  93. 'state': 'clean',
  94. 'path': '/home/me/my-ds/myfile',
  95. 'parentds': '/home/me/my-ds',
  96. 'status': 'ok',
  97. 'refds': '/home/me/my-ds',
  98. 'action': 'status'}]
  99. .. code-block:: console
  100. $ datalad -f json_pp status myfile
  101. {"action": "status",
  102. "bytesize": 85,
  103. "gitshasum": "915983d6576b56792b4647bf0d9fa04d83ce948d",
  104. "parentds": "/home/me/my-ds",
  105. "path": "/home/me/my-ds/myfile",
  106. "prev_gitshasum": "915983d6576b56792b4647bf0d9fa04d83ce948d",
  107. "refds": "/home/me/my-ds/",
  108. "state": "clean",
  109. "status": "ok",
  110. "type": "file"}
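Because the Python API returns these result records as plain dictionaries, you can post-process them with ordinary Python. Below is a minimal sketch using hand-crafted records that mimic the output above (no DataLad installation needed; in a real session the records would come from ``dl.status(...)``):

```python
# Hypothetical result records, shaped like the `datalad status` output above;
# in a real session they would come from `dl.status(...)`.
records = [
    {'action': 'status', 'status': 'ok', 'state': 'clean',
     'type': 'file', 'path': '/home/me/my-ds/myfile'},
    {'action': 'status', 'status': 'ok', 'state': 'modified',
     'type': 'file', 'path': '/home/me/my-ds/otherfile'},
]

# Plain list comprehensions are enough to filter the records,
# e.g. to collect the paths of all modified files:
modified = [r['path'] for r in records if r['state'] == 'modified']
print(modified)
```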
  111. .. index::
  112. pair: use DataLad API; with Matlab
  113. pair: use DataLad API; with R
  114. .. importantnote:: Use DataLad in languages other than Python
  115. While there is a dedicated API for Python, DataLad's functions can of course
  116. also be used with other programming languages, such as Matlab, or R, via standard
  117. system calls.
  118. Even if you do not know or like Python, you can just copy-paste the code
  119. and follow along -- the high-level YODA principles demonstrated in this
  120. section generalize across programming languages.
  121. For your midterm project submission, you decide to create a data analysis on the
  122. `iris flower data set <https://en.wikipedia.org/wiki/Iris_flower_data_set>`_.
  123. It is a multivariate dataset on 50 samples of each of three species of Iris
  124. flowers (*Setosa*, *Versicolor*, or *Virginica*), with four variables: the length and width of the sepals and petals
  125. of the flowers in centimeters. It is often used in introductory data science
  126. courses for statistical classification techniques in machine learning, and
  127. widely available -- a perfect dataset for your midterm project!
  128. .. index::
  129. pair: reproducible paper; with DataLad
  130. .. importantnote:: Turn data analysis into dynamically generated documents
Beyond the contents of this section, we have also transformed the example analysis into a template to write a reproducible paper.
  132. If you are interested in checking that out, please head over to `github.com/datalad-handbook/repro-paper-sketch/ <https://github.com/datalad-handbook/repro-paper-sketch>`_.
  133. Raw data as a modular, independent entity
  134. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  135. The first YODA principle stressed the importance of modularity in a data analysis
  136. project: Every component that could be used in more than one context should be
  137. an independent component.
  138. The first aspect this applies to is the input data of your dataset: There can
  139. be thousands of ways to analyze it, and it is therefore immensely helpful to
  140. have a pristine raw iris dataset that does not get modified, but serves as
input for these analyses.
  142. As such, the iris data should become a standalone DataLad dataset.
  143. For the purpose of this analysis, the DataLad handbook provides an ``iris_data``
  144. dataset at `https://github.com/datalad-handbook/iris_data <https://github.com/datalad-handbook/iris_data>`_.
  145. You can either use this provided input dataset, or find out how to create an
  146. independent dataset from scratch in a :ref:`dedicated Findoutmore <fom-iris>`.
  147. .. index::
  148. pair: create and publish dataset as dependency; with DataLad
  149. .. find-out-more:: Creating an independent input dataset
  150. :name: fom-iris
  151. If you acquire your own data for a data analysis, you will have
  152. to turn it into a DataLad dataset in order to install it as a subdataset.
  153. Any directory with data that exists on
  154. your computer can be turned into a dataset with :dlcmd:`create --force`
  155. and a subsequent :dlcmd:`save -m "add data" .` to first create a dataset inside of
  156. an existing, non-empty directory, and subsequently save all of its contents into
  157. the history of the newly created dataset.
  158. To create the ``iris_data`` dataset at https://github.com/datalad-handbook/iris_data
  159. we first created a DataLad dataset...
  160. .. runrecord:: _examples/DL-101-130-101
  161. :language: console
  162. :workdir: dl-101/DataLad-101
  163. :env:
  164. DATALAD_SEED=1
  165. $ # make sure to move outside of DataLad-101!
  166. $ cd ../
  167. $ datalad create iris_data
and subsequently got the data from a publicly available
`GitHub Gist <https://gist.github.com/netj/8836201>`__ (a GitHub service for sharing code snippets or other short standalone information) with a
:dlcmd:`download-url` command:
  171. .. runrecord:: _examples/DL-101-130-102
  172. :workdir: dl-101
  173. :language: console
  174. $ cd iris_data
  175. $ datalad download-url https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv
  176. Finally, we *published* the dataset to :term:`GitHub`.
  177. With this setup, the iris dataset (a single comma-separated (``.csv``)
  178. file) is downloaded, and, importantly, the dataset recorded *where* it
was obtained from, thanks to :dlcmd:`download-url`, thus complying
with the second YODA principle.
  181. This way, upon installation of the dataset, DataLad knows where to
  182. obtain the file content from. You can :dlcmd:`clone` the iris
  183. dataset and find out with a ``git annex whereis iris.csv`` command.
"Nice, this input dataset comes with sufficient provenance capture, and I can
install it as a modular component", you think as you
mentally tick off YODA principles 1 and 2. "But before I can install it,
  187. I need an analysis superdataset first."
  188. Building an analysis dataset
  189. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  190. There is an independent raw dataset as input data, but there is no place
  191. for your analysis to live, yet. Therefore, you start your midterm project
  192. by creating an analysis dataset. As this project is part of ``DataLad-101``,
  193. you do it as a subdataset of ``DataLad-101``.
  194. Remember to specify the ``--dataset`` option of :dlcmd:`create`
  195. to link it as a subdataset!
  196. You naturally want your dataset to follow the YODA principles, and, as a start,
  197. you use the ``cfg_yoda`` procedure to help you structure the dataset [#f1]_:
  198. .. runrecord:: _examples/DL-101-130-103
  199. :language: console
  200. :workdir: dl-101/DataLad-101
  201. :cast: 10_yoda
  202. :env:
  203. DATALAD_SEED=2
  204. :notes: Let's create a data analysis project with a yoda procedure
  205. $ # inside of DataLad-101
  206. $ datalad create -c yoda --dataset . midterm_project
  207. .. index::
  208. pair: subdatasets; DataLad command
  209. pair: list subdatasets; with DataLad
  210. The :dlcmd:`subdatasets` command can report on which subdatasets exist for
  211. ``DataLad-101``. This helps you verify that the command succeeded and the
  212. dataset was indeed linked as a subdataset to ``DataLad-101``:
  213. .. runrecord:: _examples/DL-101-130-104
  214. :language: console
  215. :workdir: dl-101/DataLad-101
  216. $ datalad subdatasets
Not only the ``longnow`` subdataset but also the newly created
``midterm_project`` subdataset is displayed -- wonderful!
  219. But back to the midterm project now. So far, you have created a pre-structured
  220. analysis dataset. As a next step, you take care of installing and linking the
  221. raw dataset for your analysis adequately to your ``midterm_project`` dataset
  222. by installing it as a subdataset. Make sure to install it as a subdataset of
  223. ``midterm_project``, and not ``DataLad-101``!
  224. .. runrecord:: _examples/DL-101-130-105
  225. :language: console
  226. :workdir: dl-101/DataLad-101
  227. :cast: 10_yoda
  228. :notes: Now clone input data as a subdataset
  229. $ cd midterm_project
  230. $ # we are in midterm_project, thus -d . points to the root of it.
  231. $ datalad clone -d . \
  232. https://github.com/datalad-handbook/iris_data.git \
  233. input/
Note that we did not keep its original name, ``iris_data``, but rather provided
a path with a new name, ``input``, because this is much more intuitive.
  236. After the input dataset is installed, the directory structure of ``DataLad-101``
  237. looks like this:
  238. .. runrecord:: _examples/DL-101-130-106
  239. :language: console
  240. :workdir: dl-101/DataLad-101/midterm_project
  241. :cast: 10_yoda
:notes: this is what the directory structure looks like
  243. $ cd ../
  244. $ tree -d
  245. $ cd midterm_project
  246. Importantly, all of the subdatasets are linked to the higher-level datasets,
  247. and despite being inside of ``DataLad-101``, your ``midterm_project`` is an independent
  248. dataset, as is its ``input/`` subdataset. An overview is shown in :numref:`fig-linkeddl101`.
  249. .. _fig-linkeddl101:
  250. .. figure:: ../artwork/src/virtual_dstree_dl101_midterm.svg
  251. :width: 50%
  252. Overview of (linked) datasets in DataLad-101.
  253. YODA-compliant analysis scripts
  254. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  255. Now that you have an ``input/`` directory with data, and a ``code/`` directory
  256. (created by the YODA procedure) for your scripts, it is time to work on the script
  257. for your analysis. Within ``midterm_project``, the ``code/`` directory is where
  258. you want to place your scripts.
  259. But first, you plan your research question. You decide to do a
  260. classification analysis with a k-nearest neighbors algorithm [#f2]_. The iris
  261. dataset works well for such questions. Based on the features of the flowers
  262. (sepal and petal width and length) you will try to predict what type of
  263. flower (*Setosa*, *Versicolor*, or *Virginica*) a particular flower in the
  264. dataset is. You settle on two objectives for your analysis:
  265. #. Explore and plot the relationship between variables in the dataset and save
  266. the resulting graphic as a first result.
  267. #. Perform a k-nearest neighbor classification on a subset of the dataset to
  268. predict class membership (flower type) of samples in a left-out test set.
  269. Your final result should be a statistical summary of this prediction.
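To build an intuition for what the classifier will do, here is a toy, stdlib-only sketch of the k-nearest-neighbors idea on made-up two-dimensional points. This is not the scikit-learn ``KNeighborsClassifier`` used in the actual analysis below, just an illustration of the algorithm:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    # sort training samples by Euclidean distance to the query point
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pl: math.dist(pl[0], query),
    )
    # majority vote among the k closest labels
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical data, not the iris measurements)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(points, labels, (1.1, 1.0)))  # a point near cluster 'a'
```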
  270. To compute the analysis you create the following Python script inside of ``code/``:
  271. .. runrecord:: _examples/DL-101-130-107
  272. :language: console
  273. :workdir: dl-101/DataLad-101/midterm_project
  274. :emphasize-lines: 11-13, 23, 43
  275. :cast: 10_yoda
  276. :notes: Let's create code for an analysis
  277. $ cat << EOT > code/script.py
  278. import argparse
  279. import pandas as pd
  280. import seaborn as sns
  281. from sklearn import model_selection
  282. from sklearn.neighbors import KNeighborsClassifier
  283. from sklearn.metrics import classification_report
  284. parser = argparse.ArgumentParser(description="Analyze iris data")
  285. parser.add_argument('data', help="Input data (CSV) to process")
  286. parser.add_argument('output_figure', help="Output figure path")
  287. parser.add_argument('output_report', help="Output report path")
  288. args = parser.parse_args()
  289. # prepare the data as a pandas dataframe
  290. df = pd.read_csv(args.data)
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
  292. df.columns = attributes
  293. # create a pairplot to plot pairwise relationships in the dataset
  294. plot = sns.pairplot(df, hue='class', palette='muted')
  295. plot.savefig(args.output_figure)
  296. # perform a K-nearest-neighbours classification with scikit-learn
  297. # Step 1: split data in test and training dataset (20:80)
  298. array = df.values
  299. X = array[:,0:4]
  300. Y = array[:,4]
  301. test_size = 0.20
  302. seed = 7
  303. X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
  304. X, Y,
  305. test_size=test_size,
  306. random_state=seed)
  307. # Step 2: Fit the model and make predictions on the test dataset
  308. knn = KNeighborsClassifier()
  309. knn.fit(X_train, Y_train)
  310. predictions = knn.predict(X_test)
  311. # Step 3: Save the classification report
  312. report = classification_report(Y_test, predictions, output_dict=True)
pd.DataFrame(report).transpose().to_csv(args.output_report)
  314. EOT
  315. This script will
  316. - take three positional arguments: The input data, a path to save a figure under, and path to save the final prediction report under. By including these input and output specifications in a :dlcmd:`run` command when we run the analysis, we can ensure that input data is retrieved prior to the script execution, and that as much actionable provenance as possible is recorded [#f5]_.
  317. - read in the data, perform the analysis, and save the resulting figure and ``.csv`` prediction report into the root of ``midterm_project/``. Note how this helps to fulfil YODA principle 1 on modularity:
  318. Results are stored outside of the pristine input subdataset.
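For illustration: ``classification_report(..., output_dict=True)`` returns a dictionary mapping each class label to its metrics, and the ``pd.DataFrame(report).transpose().to_csv(...)`` call in the script flattens it into one CSV row per label. Here is a stdlib-only sketch of that flattening, with made-up numbers:

```python
import csv
import io

# Hypothetical per-class entries, shaped like sklearn's
# classification_report(..., output_dict=True); the numbers are made up.
report = {
    'Iris-setosa':     {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10},
    'Iris-versicolor': {'precision': 0.9, 'recall': 0.9, 'f1-score': 0.9, 'support': 11},
}

# Write one row per class label, like the transposed DataFrame does.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['', 'precision', 'recall', 'f1-score', 'support'])
for label, metrics in report.items():
    writer.writerow([label] + [metrics[col] for col in
                               ('precision', 'recall', 'f1-score', 'support')])
print(buf.getvalue())
```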
  319. A short help text explains how the script shall be used:
  320. .. code-block:: console
  321. $ python code/script.py -h
  322. usage: script.py [-h] data output_figure output_report
  323. Analyze iris data
positional arguments:
  data           Input data (CSV) to process
  output_figure  Output figure path
  output_report  Output report path

optional arguments:
  -h, --help     show this help message and exit
  330. The script execution would thus be ``python3 code/script.py <path-to-input> <path-to-figure-output> <path-to-report-output>``.
When parametrizing the input and output path parameters, we just need to make sure that all paths are *relative*, such that the ``midterm_project`` analysis is completely self-contained within the dataset, contributing to fulfill the second YODA principle.
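To see how the three positional arguments resolve, the script's parser can be exercised with an explicit (and purely hypothetical) argument list passed to ``parse_args``; note that all paths stay relative:

```python
import argparse

# Same argument interface as code/script.py
parser = argparse.ArgumentParser(description="Analyze iris data")
parser.add_argument('data', help="Input data (CSV) to process")
parser.add_argument('output_figure', help="Output figure path")
parser.add_argument('output_report', help="Output report path")

# Parse a hypothetical, relative-path invocation instead of sys.argv
args = parser.parse_args(
    ['input/iris.csv', 'pairwise_relationships.png', 'prediction_report.csv'])
print(args.data, args.output_figure, args.output_report)
```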
  332. Let's run a quick :dlcmd:`status`...
  333. .. runrecord:: _examples/DL-101-130-108
  334. :language: console
  335. :workdir: dl-101/DataLad-101/midterm_project
  336. :cast: 10_yoda
  337. :notes: datalad status will show a new file
  338. $ datalad status
  339. .. index::
  340. pair: tag dataset version; with DataLad
  341. ... and save the script to the subdataset's history. As the script completes your
  342. analysis setup, we *tag* the state of the dataset to refer to it easily at a later
  343. point with the ``--version-tag`` option of :dlcmd:`save`.
  344. .. runrecord:: _examples/DL-101-130-109
  345. :language: console
  346. :workdir: dl-101/DataLad-101/midterm_project
  347. :cast: 10_yoda
  348. :notes: Save the analysis to the history
  349. $ datalad save -m "add script for kNN classification and plotting" \
  350. --version-tag ready4analysis \
  351. code/script.py
  352. .. index::
  353. pair: tag; Git concept
  354. pair: show; Git command
  355. pair: rerun command; with DataLad
  356. .. find-out-more:: What is a tag?
  357. :term:`tag`\s are markers that you can attach to commits in your dataset history.
  358. They can have any name, and can help you and others to identify certain commits
or dataset states in the history of a dataset. Let's take a look at what the tag
you just created looks like in your history with :gitcmd:`show`.
  361. Note how we can use a tag just as easily as a commit :term:`shasum`:
  362. .. runrecord:: _examples/DL-101-130-110
  363. :workdir: dl-101/DataLad-101/midterm_project
  364. :lines: 1-13
  365. :language: console
  366. $ git show ready4analysis
  367. This tag thus identifies the version state of the dataset in which this script
  368. was added.
  369. Later we can use this tag to identify the point in time at which
  370. the analysis setup was ready -- much more intuitive than a 40-character shasum!
  371. This is handy in the context of a :dlcmd:`rerun`, for example:
  372. .. code-block:: console
  373. $ datalad rerun --since ready4analysis
  374. would rerun any :dlcmd:`run` command in the history performed between tagging
  375. and the current dataset state.
  376. Finally, with your directory structure being modular and intuitive,
  377. the input data installed, the script ready, and the dataset status clean,
  378. you can wrap the execution of the script in a :dlcmd:`run` command. Note that
simply executing the script would work as well.
  380. But using :dlcmd:`run` will capture full provenance, and will make
  381. re-execution with :dlcmd:`rerun` easy.
  382. .. importantnote:: Additional software requirements: pandas, seaborn, sklearn
  383. Note that you need to have the following Python packages installed to run the
  384. analysis [#f3]_:
  385. - `pandas <https://pandas.pydata.org>`_
  386. - `seaborn <https://seaborn.pydata.org>`_
  387. - `sklearn <https://scikit-learn.org>`_
  388. The packages can be installed via :term:`pip`.
  389. However, if you do not want to install any
  390. Python packages, do not execute the remaining code examples in this section
  391. -- an upcoming section on ``datalad containers-run`` will allow you to
  392. perform the analysis without changing your Python software-setup.
  393. .. index::
  394. pair: python instead of python3; on Windows
  395. .. windows-wit:: You may need to use 'python', not 'python3'
  396. .. include:: topic/py-or-py3.rst
  397. .. index::
  398. pair: run command with provenance capture; with DataLad
  399. .. runrecord:: _examples/DL-101-130-111
  400. :language: console
  401. :workdir: dl-101/DataLad-101/midterm_project
  402. :cast: 10_yoda
:notes: The datalad run command can execute a command reproducibly
  404. $ datalad run -m "analyze iris data with classification analysis" \
  405. --input "input/iris.csv" \
  406. --output "pairwise_relationships.png" \
  407. --output "prediction_report.csv" \
  408. "python3 code/script.py {inputs} {outputs}"
  409. As the successful command summary indicates, your analysis seems to work! Two
  410. files were created and saved to the dataset: ``pairwise_relationships.png``
  411. and ``prediction_report.csv``. If you want, take a look and interpret
  412. your analysis. But what excites you even more than a successful data science
  413. project on first try is that you achieved complete provenance capture:
  414. - Every single file in this dataset is associated with an author and a time
  415. stamp for each modification thanks to :dlcmd:`save`.
  416. - The raw dataset knows where the data came from thanks to :dlcmd:`clone`
  417. and :dlcmd:`download-url`.
  418. - The subdataset is linked to the superdataset thanks to
  419. :dlcmd:`clone -d`.
  420. - The :dlcmd:`run` command took care of linking the outputs of your
  421. analysis with the script and the input data it was generated from, fulfilling
  422. the third YODA principle.
  423. Let's take a look at the history of the ``midterm_project`` analysis
  424. dataset:
  425. .. runrecord:: _examples/DL-101-130-112
  426. :language: console
  427. :workdir: dl-101/DataLad-101/midterm_project
  428. :cast: 10_yoda
  429. :notes: Let's take a look at the history
  430. $ git log --oneline
  431. "Wow, this is so clean and intuitive!" you congratulate yourself. "And I think
  432. this was and will be the fastest I have ever completed a midterm project!"
  433. But what is still missing is a human readable description of your dataset.
  434. The YODA procedure kindly placed a ``README.md`` file into the root of your
  435. dataset that you can use for this [#f4]_.
  436. .. runrecord:: _examples/DL-101-130-113
  437. :language: console
  438. :workdir: dl-101/DataLad-101/midterm_project
  439. :cast: 10_yoda
  440. :notes: create human readable information for your project
  441. $ # with the >| redirection we are replacing existing contents in the file
  442. $ cat << EOT >| README.md
  443. # Midterm YODA Data Analysis Project
  444. ## Dataset structure
  445. - All inputs (i.e. building blocks from other sources) are located in input/.
  446. - All custom code is located in code/.
- All results (i.e., generated files) are located in the root of the dataset:
  - "prediction_report.csv" contains the main classification metrics.
  - "pairwise_relationships.png" is a plot of the relations between features.
  450. EOT
  451. .. runrecord:: _examples/DL-101-130-114
  452. :language: console
  453. :workdir: dl-101/DataLad-101/midterm_project
  454. :cast: 10_yoda
  455. :notes: The README file is now modified
  456. $ datalad status
  457. .. runrecord:: _examples/DL-101-130-115
  458. :language: console
  459. :workdir: dl-101/DataLad-101/midterm_project
  460. :cast: 10_yoda
  461. :notes: Let's save this change
  462. $ datalad save -m "Provide project description" README.md
  463. Note that one feature of the YODA procedure was that it configured certain files
  464. (for example, everything inside of ``code/``, and the ``README.md`` file in the
  465. root of the dataset) to be saved in Git instead of git-annex. This was the
  466. reason why the ``README.md`` in the root of the dataset was easily modifiable.
  467. .. index::
  468. pair: save; DataLad command
  469. pair: save file content directly in Git (no annex); with DataLad
  470. .. find-out-more:: Saving contents with Git regardless of configuration with --to-git
  471. The ``yoda`` procedure in ``midterm_project`` applied a different configuration
  472. within ``.gitattributes`` than the ``text2git`` procedure did in ``DataLad-101``.
  473. Within ``DataLad-101``, any text file is automatically stored in :term:`Git`.
This is not true in ``midterm_project``: Only the existing ``README.md`` file and
anything within ``code/`` are stored in Git -- everything else will be annexed.
  476. That means that if you create any other file, even text files, inside of
  477. ``midterm_project`` (but not in ``code/``), it will be managed by :term:`git-annex`
  478. and content-locked after a :dlcmd:`save` -- an inconvenience if it
  479. would be a file that is small enough to be handled by Git.
  480. Luckily, there is a handy shortcut to saving files in Git that does not
  481. require you to edit configurations in ``.gitattributes``: The ``--to-git``
  482. option for :dlcmd:`save`.
  483. .. code-block:: console
  484. $ datalad save -m "add sometextfile.txt" --to-git sometextfile.txt
  485. After adding this short description to your ``README.md``, your dataset now also
  486. contains sufficient human-readable information to ensure that others can understand
  487. everything you did easily.
  488. The only thing left to do is to hand in your assignment. According to the
  489. syllabus, this should be done via :term:`GitHub`.
  490. .. index:: dataset hosting; GitHub
  491. .. find-out-more:: What is GitHub?
GitHub is a web-based hosting service for Git repositories. Among many
other useful features, it supports collaboration on
Git repositories. `GitLab <https://about.gitlab.com>`_ is a similar
service with a highly similar feature set, but its source code is free and open,
whereas GitHub is a subsidiary of Microsoft.
Web-hosting services like GitHub and :term:`GitLab` integrate wonderfully with
DataLad. They are especially useful for making your dataset publicly available,
if you have figured out storage for your large files otherwise (as large content
cannot be hosted for free by GitHub). You can make DataLad publish large file content to one location
and afterwards automatically push an update to GitHub, such that
users can install directly from GitHub/GitLab and seemingly also obtain large file
content from GitHub. GitHub can also resolve subdataset links to other GitHub
repositories, which lets you navigate through nested datasets in the web interface.
..
   the images below can't become figures because they can't be used in LaTeX's minipage environment

.. image:: ../artwork/src/screenshot_midtermproject.png
   :alt: The midterm project repository, published to GitHub

The above screenshot shows the linkage between the analysis project you will create
and its subdataset. Clicking on the subdataset (highlighted) will take you to the iris dataset
the handbook provides, shown below.

.. image:: ../artwork/src/screenshot_submodule.png
   :alt: The input dataset is linked
.. index::
   pair: create-sibling-github; DataLad command

.. _publishtogithub:

Publishing the dataset to GitHub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For this, you need to

- create a GitHub account, if you do not yet have one,
- create a repository for this dataset on GitHub,
- configure this GitHub repository to be a :term:`sibling` of the ``midterm_project`` dataset,
- and *publish* your dataset to GitHub.
.. index::
   pair: create-sibling-gitlab; DataLad command

Luckily, DataLad can make this very easy with the
:dlcmd:`create-sibling-github`
command (or, for `GitLab <https://about.gitlab.com>`_, :dlcmd:`create-sibling-gitlab`).
The two commands have different arguments and options.
Here, we look at :dlcmd:`create-sibling-github`.
The command takes a repository name and GitHub authentication credentials
(either in the command line call with the option ``--github-login <TOKEN>``, with an *OAuth* `token <https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens>`_ stored in the Git
configuration, or interactively).
.. index::
   pair: GitHub token; credential

.. importantnote:: Generate a GitHub token

   GitHub `deprecated user-password authentication <https://developer.github.com/changes/2020-02-14-deprecating-password-auth>`_ and instead supports authentication via a personal access token.
   To ensure successful authentication, don't supply your password, but create a personal access token at `github.com/settings/tokens <https://github.com/settings/tokens>`_ [#f6]_ instead, and either

   * supply the token with the argument ``--github-login <TOKEN>`` from the command line,
   * or supply the token from the command line when queried for a password.
Based on the credentials and the
repository name, it will create a new, empty repository on GitHub, and
configure this repository as a sibling of the dataset:

.. ifconfig:: internal

   .. runrecord:: _examples/DL-101-130-116
      :language: console

      $ python3 /home/me/makepushtarget.py '/home/me/dl-101/DataLad-101/midterm_project' 'github' '/home/me/pushes/midterm_project' False True
.. index:: credential; entry
   pair: typed credentials are not displayed; on Windows

.. windows-wit:: Your shell will not display credentials

   .. include:: topic/credential-nodisplay.rst

.. code-block:: console

   $ datalad create-sibling-github -d . midtermproject
   .: github(-) [https://github.com/adswa/midtermproject.git (git)]
   'https://github.com/adswa/midtermproject.git' configured as sibling 'github' for <Dataset path=/home/me/dl-101/DataLad-101/midterm_project>
Verify that this worked by listing the siblings of the dataset:

.. code-block:: console

   $ datalad siblings
   [WARNING] Failed to determine if github carries annex.
   .: here(+) [git]
   .: github(-) [https://github.com/adswa/midtermproject.git (git)]
.. index::
   pair: sibling (GitHub); DataLad concept

.. gitusernote:: Create-sibling-github internals

   Creating a sibling on GitHub will create a new empty repository under the
   account that you provide and set up a *remote* to this repository. Upon a
   :dlcmd:`push` to this sibling, your dataset's history
   will be pushed there.
.. index::
   pair: push; DataLad concept
   pair: push (dataset); with DataLad

On GitHub, you will see a new, empty repository with the name
``midtermproject``. However, the repository does not yet contain
any of your dataset's history or files. This requires *publishing* the current
state of the dataset to this :term:`sibling` with the :dlcmd:`push`
command.
.. importantnote:: Learn how to push "on the job"

   Publishing is one of the remaining big concepts that this handbook tries to
   convey. However, it is a complex concept that builds on a large
   proportion of the previous handbook content as a prerequisite. To avoid
   overwhelming detail, the upcoming sections will approach
   :dlcmd:`push` from a "learning-by-doing" perspective:
   First, you will see a :dlcmd:`push` to GitHub, and the :ref:`Findoutmore on the published dataset <fom-midtermclone>`
   at the end of this section will already give a practical glimpse into the
   difference between annexed contents and contents stored in Git when pushed
   to GitHub. The chapter :ref:`chapter_thirdparty` will expand on this,
   and the section :ref:`push`
   will finally combine and link all the previous contents into a comprehensive
   and detailed wrap-up of the concept of publishing datasets. In that section,
   you will also find a detailed overview of how :dlcmd:`push` works and which
   options are available. If you are impatient or need an overview on publishing,
   feel free to skip ahead. If you have time to follow along, reading the next
   sections will get you to a complete picture of publishing in smaller
   and gentler steps.
For now, we will start with learning by doing, and
the fundamental basics of :dlcmd:`push`: The command
will make the last saved state of your dataset available (i.e., publish it)
to the :term:`sibling` you provide with the ``--to`` option.

.. runrecord:: _examples/DL-101-130-118
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project

   $ datalad push --to github
Thus, you have now published your dataset's history to a public place for others
to see and clone. Now we will explore how this may look and feel for others.

There is one important detail first, though: By default, your tags will not be published.
Thus, the tag ``ready4analysis`` is not pushed to GitHub, and currently this
version identifier is unavailable to anyone but you.
The reason for this is that tags are viral -- they can be removed locally, and old
published tags can cause confusion or unwanted changes. In order to publish a tag,
an additional :gitcmd:`push` with the ``--tags`` option is required:
.. index::
   pair: push; DataLad concept
   pair: push (tag); with Git

.. code-block:: console

   $ git push github --tags
.. index::
   pair: push (tag); with DataLad

.. gitusernote:: Pushing tags

   Note that this is a :gitcmd:`push`, not :dlcmd:`push`.
   Tags could be pushed upon a :dlcmd:`push`, though, if one
   configures (what kind of) tags to be pushed. This would need to be done
   on a per-sibling basis in ``.git/config`` in the ``remote.*.push``
   configuration. If you had a :term:`sibling` "github", the following
   configuration would push all tags that start with a ``v`` upon a
   :dlcmd:`push --to github`:

   .. code-block:: console

      $ git config --local remote.github.push 'refs/tags/v*'

   This configuration would result in the following entry in ``.git/config``:

   .. code-block:: ini

      [remote "github"]
              url = git@github.com/adswa/midtermproject.git
              fetch = +refs/heads/*:refs/remotes/github/*
              annex-ignore = true
              push = refs/tags/v*
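To get an intuition for which tags a push refspec like ``refs/tags/v*`` would select, note that the trailing ``*`` behaves much like a shell glob over fully qualified tag references. The following sketch illustrates this matching in plain Python (the tag names are made up for illustration; this is not how Git itself is implemented):

```python
from fnmatch import fnmatch

# Hypothetical local tags; 'ready4analysis' mirrors the tag used in this section
tags = ["ready4analysis", "v0.1", "v1.0", "experiment-2"]

# A push refspec matches against the fully qualified ref, e.g. 'refs/tags/v0.1'
refspec = "refs/tags/v*"
selected = [t for t in tags if fnmatch("refs/tags/" + t, refspec)]
print(selected)  # only tags whose names start with 'v' would be pushed
```

With this configuration, ``ready4analysis`` would still stay local; only tags following the ``v``-prefixed naming scheme travel with a push.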
Yay! Consider your midterm project submitted! Others can now install your
dataset and check out your data science project -- and even better: they can
reproduce your data science project easily from scratch (take a look into the :ref:`Findoutmore <fom-midtermclone>` to see how)!

.. index::
   pair: work on published YODA dataset; with DataLad
   pair: rerun command; with DataLad
.. find-out-more:: On the looks and feels of this published dataset
   :name: fom-midtermclone
   :float:

   Now that you have created and published such a YODA-compliant dataset, you
   are understandably excited about how this dataset must look and feel for others.
   Therefore, you decide to install this dataset into a new location on your
   computer, just to get a feel for it.

   Replace the ``url`` in the :dlcmd:`clone` command with the path
   to your own ``midtermproject`` GitHub repository, or clone the "public"
   ``midterm_project`` repository that is available via the Handbook's GitHub
   organization at `github.com/datalad-handbook/midterm_project <https://github.com/datalad-handbook/midterm_project>`_:

   .. runrecord:: _examples/DL-101-130-119
      :language: console
      :workdir: dl-101/DataLad-101/midterm_project

      $ cd ../../
      $ datalad clone "https://github.com/adswa/midtermproject.git"

   Let's start with the subdataset, and see whether we can retrieve the
   input ``iris.csv`` file. This should not be a problem, since its origin
   is recorded:

   .. runrecord:: _examples/DL-101-130-120
      :language: console
      :workdir: dl-101

      $ cd midtermproject
      $ datalad get input/iris.csv
   Nice, this worked well. The output files, however, cannot be easily
   retrieved:

   .. runrecord:: _examples/DL-101-130-121
      :language: console
      :exitcode: 1
      :workdir: dl-101/midtermproject

      $ datalad get prediction_report.csv pairwise_relationships.png

   Why is that? This is the first detail of publishing datasets we will dive into.
   When publishing dataset content to GitHub with :dlcmd:`push`, it is
   the dataset's *history*, i.e., everything that is stored in Git, that is
   published. The file *content* of these particular files, though, is managed
   by :term:`git-annex` and not stored in Git, and
   thus only information about the file name and location is known to Git.
   Because GitHub does not host large data for free, annexed file content always
   needs to be deposited somewhere else (e.g., a web server) to make it
   accessible via :dlcmd:`get`. The chapter :ref:`chapter_thirdparty`
   will demonstrate how this can be done. For this dataset, it is not
   necessary to make the outputs available, though: Because all provenance
   on their creation was captured, we can simply recompute them with the
   :dlcmd:`rerun` command. If the tag had been published, we could simply
   rerun any :dlcmd:`run` command since this tag:

   .. code-block:: console

      $ datalad rerun --since ready4analysis

   But without the published tag, we can rerun the analysis by specifying its
   shasum:

   .. runrecord:: _examples/DL-101-130-122
      :language: console
      :workdir: dl-101/midtermproject
      :realcommand: echo "$ datalad rerun $(git rev-parse HEAD~1)" && datalad rerun $(git rev-parse HEAD~1)
   Hooray, your analysis was reproduced! You happily note that rerunning your
   analysis was incredibly easy -- it would not even be necessary to have any
   knowledge about the analysis at all to reproduce it!
   With this, you realize again how letting DataLad take care of linking input,
   output, and code can make your life and others' lives so much easier.
   Applying the YODA principles to your data analysis was very beneficial indeed.
   Proud of your midterm project, you cannot wait to apply those principles
   again next time.

   .. image:: ../artwork/src/reproduced.svg
      :width: 50%
      :align: center
.. index::
   pair: push; DataLad concept

.. gitusernote:: Push internals

   The :dlcmd:`push` command uses ``git push`` and ``git annex copy`` under
   the hood. Publication targets need to either be configured remote Git repositories,
   or git-annex :term:`special remote`\s (if they support data upload).
.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-130-123
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_yoda_project

.. rubric:: Footnotes
.. [#f1] Note that you could have applied the YODA procedure not only right at
         creation of the dataset with ``-c yoda``, but also after creation
         with the :dlcmd:`run-procedure` command:

         .. code-block:: console

            $ cd midterm_project
            $ datalad run-procedure cfg_yoda

         Both ways of applying the YODA procedure will lead to the same
         outcome.
.. [#f2] The choice of analysis method
         for the handbook is rather arbitrary, and understanding the k-nearest
         neighbor algorithm is by no means required for this section.
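For the curious, the gist of a k-nearest neighbor classification can be sketched in a few lines. This is only an illustration of the idea with scikit-learn on the iris data, not the handbook's actual analysis script; split parameters and the choice of ``k`` are arbitrary:

```python
# Illustrative k-nearest neighbor classification on the iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out a test set to judge how well the classifier generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Classify each test flower by the majority label of its 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Iris is a small, well-separated dataset, so even this minimal setup classifies most held-out flowers correctly.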
.. [#f3] It is recommended (but optional) to create a
         `virtual environment <https://docs.python.org/3/tutorial/venv.html>`_ and
         install the required Python packages inside of it:

         .. code-block:: console

            $ # create and enter a new virtual environment (optional)
            $ virtualenv --python=python3 ~/env/handbook
            $ . ~/env/handbook/bin/activate

         .. code-block:: console

            $ # install the Python packages from PyPI via pip
            $ pip install seaborn pandas scikit-learn
.. [#f4] All ``README.md`` files the YODA procedure created are
         version controlled by Git, not git-annex, thanks to the
         configurations that YODA supplied. This makes it easy to change the
         ``README.md`` file. The previous section detailed how the YODA procedure
         configured your dataset. If you want to re-read the full chapter on
         configurations and run-procedures, start with section :ref:`config`.

.. [#f5] Alternatively, if you were to use DataLad's Python API, you could import it (e.g., as ``dl``) and ``dl.get()`` the relevant files. This, however, would not record them as provenance in the dataset's history.

.. [#f6] Instead of using GitHub's WebUI, you could also obtain a token using the command-line GitHub interface (https://github.com/sociomantic-tsunami/git-hub) by running ``git hub setup`` (if no 2FA is used).