
.. index:: ! Usecase; reproducible paper
.. _usecase_reproducible_paper:

Writing a reproducible paper
----------------------------

This use case demonstrates how to use nested DataLad datasets to create a fully
reproducible paper by linking

#. (different) DataLad dataset sources with
#. the code needed to compute results and
#. LaTeX files to compile the resulting paper.

The different components each exist in individual DataLad datasets and are
aggregated into a single :term:`DataLad superdataset` complying with the YODA principles
for data analysis projects [#f1]_. The resulting superdataset can be publicly
shared, data can be obtained effortlessly on demand by anyone who has the superdataset,
and results and paper can be generated and recomputed anywhere on demand.

A template to start your own reproducible paper with the same setup can be found
`on GitHub <https://github.com/datalad-handbook/repro-paper-sketch>`__.
The Challenge
^^^^^^^^^^^^^
Over the past year, Steve worked on the implementation of an algorithm as a software package.
For testing purposes, he used one of his own data collections, and later also included a publicly shared
data collection. After completion, he continued to work on validation analyses to
prove the functionality and usefulness of his software. Next to a directory in which he developed
his code, and directories with data he tested his code on, he now also has other directories
with different data sources used for validation analyses.

"This cannot take too long!" Steve thinks optimistically when he finally sits down to write up a paper.
His scripts run his algorithm on the different data collections, create derivatives of his raw data,
pretty figures, and impressive tables.
Just after he hand-copies and checks the last decimal of the final result in the very
last table of his manuscript, he realizes that the script specified the wrong parameter
values, and all of the results need to be recomputed - and obviously updated in his manuscript.
When writing the discussion, he finds a paper that reports an error in the publicly shared
data collection he uses. After many more days of updating tables and fixing data columns
by hand, he finally submits the paper. Trying to stand by his values of
open and reproducible science, he struggles to bundle all scripts, algorithm code, and data
he used in a shareable form, and frankly, with all the extra time this manuscript has taken
him so far, he lacks motivation and time. In the end, he writes a three-page-long README
file in his GitHub code repository, includes his email address for data requests, and
secretly hopes that no one will want to recompute his results, because by now even he
himself has forgotten which script ran on which dataset and what data was fixed in which way,
or whether he was careful enough to copy all of the results correctly. In the review process,
reviewer 2 demands that the figures his software produces get a new color scheme,
which requires updates to his software package, and more recomputations.
The DataLad Approach
^^^^^^^^^^^^^^^^^^^^
Steve sets up a DataLad dataset and calls it ``algorithm-paper``. In this
dataset, he creates several subdirectories to collate everything that is relevant for
the manuscript: data, code, and a manuscript backbone without results.
``code/`` contains a Python script that he uses for validation analyses, and,
prior to computing results, the script uses DataLad's Python API to download
the data, should the files not yet be present locally.
``data/`` contains a separate DataLad subdataset for every dataset he uses. An
``algorithm/`` directory is a DataLad dataset containing a clone of his software repository,
and within it, in the directory ``test/data/``, are additional DataLad subdatasets that
contain the data he used for testing.
Lastly, the DataLad superdataset contains a LaTeX ``.tex`` file with the text of the manuscript.

When everything is set up, a single command line call triggers (optional) data retrieval
from the GitHub repositories of the datasets, computation of
results and figures, automatic embedding of results and figures into his manuscript,
and PDF compilation.

When he notices the error in his script, his manuscript is recompiled and updated
with a single command line call, and when he learns about the data error,
he updates the respective DataLad dataset
to the fixed state while preserving the history of the data repository.

He makes his superdataset a public repository on GitHub, and anyone who clones it can obtain the
data automatically and recompute and recompile the full manuscript with all results.
Steve has never had more confidence in his research results, and he proudly submits his manuscript.
During review, the color scheme update in his algorithm's source code is integrated with a simple
update of the ``algorithm/`` subdataset, and upon command-line invocation his manuscript updates
itself with the new figures.
.. importantnote:: Take a look at the real manuscript dataset

   The actual manuscript this use case is based on can be found
   `here <https://github.com/psychoinformatics-de/paper-remodnav>`_:
   https://github.com/psychoinformatics-de/paper-remodnav. :dlcmd:`clone`
   the repository and follow the few instructions in the README to experience the
   DataLad approach described above.

   There is also a slimmed-down template that uses the analysis demonstrated in
   :ref:`yoda_project` and packages it up into a reproducible paper using the same
   tools: `github.com/datalad-handbook/repro-paper-sketch <https://github.com/datalad-handbook/repro-paper-sketch>`_.
Step-by-Step
^^^^^^^^^^^^
:dlcmd:`create` a DataLad dataset. In this example, it is named "algorithm-paper",
and :dlcmd:`create` uses the yoda procedure [#f1]_ to apply useful configurations
for a data analysis project:

.. code-block:: bash

   $ datalad create -c yoda algorithm-paper

   [INFO ] Creating a new annex repo at /home/adina/repos/testing/algorithm-paper
   create(ok): /home/adina/repos/testing/algorithm-paper (dataset)

This newly created directory already has a ``code/`` directory that will be tracked with Git,
and some ``README.md`` and ``CHANGELOG.md`` files,
thanks to the yoda procedure applied above. Additionally, create a subdirectory ``data/`` within
the dataset. This project thus already has a comprehensible structure:

.. code-block:: bash

   $ cd algorithm-paper
   $ mkdir data

   # You can check out the directory structure with the tree command
   $ tree

   algorithm-paper
   ├── CHANGELOG.md
   ├── code
   │   └── README.md
   ├── data
   └── README.md
All of your analysis scripts should live in the ``code/`` directory, and all input data should
live in the ``data/`` directory.

To populate the DataLad dataset, add all the
data collections you want to perform analyses on as individual DataLad subdatasets within
``data/``.
In this example, all data collections are already DataLad datasets or Git repositories hosted on GitHub.
:dlcmd:`clone` therefore installs them as subdatasets, with ``-d ../``
registering them as subdatasets of the superdataset [#f2]_.

.. code-block:: bash

   $ cd data

   # clone existing Git repositories with data; -d points to the root of the superdataset
   $ datalad clone -d ../ https://github.com/psychoinformatics-de/studyforrest-data-phase2.git raw_eyegaze
   [INFO ] Cloning https://github.com/psychoinformatics-de/studyforrest-data-phase2.git [1 other candidates] into '/home/adina/repos/testing/algorithm-paper/data/raw_eyegaze'
   install(ok): /home/adina/repos/testing/algorithm-paper/data/raw_eyegaze (dataset)

   $ datalad clone -d ../ git@github.com:psychoinformatics-de/studyforrest-data-eyemovementlabels.git
   [INFO ] Cloning git@github.com:psychoinformatics-de/studyforrest-data-eyemovementlabels.git into '/home/adina/repos/testing/algorithm-paper/data/studyforrest-data-eyemovementlabels'
   Cloning (compressing objects):  45% 1.80k/4.00k [00:01<00:01, 1.29k objects/s]
   [...]
Any script needed for the analysis should live inside ``code/``. During script writing, save any changes
you want to record in your history with :dlcmd:`save`.

The eventual outcome of this work is a GitHub repository that anyone can use to get the data
and recompute all results
by running the script after cloning and setting up the necessary software.
This requires minor preparation:

* The final analysis should be able to run on anyone's file system.
  It is therefore important to reference data files in the scripts in ``code/`` with
  :term:`relative path`\s instead of hard-coding :term:`absolute path`\s.

* After cloning the ``algorithm-paper`` repository, data files are not yet present
  locally. To spare users the work of a manual :dlcmd:`get`, you can have your
  script take care of data retrieval via DataLad's Python API.

These two preparations can be seen in this excerpt from the Python script:
.. code-block:: python

   # import DataLad's API and the helpers this excerpt relies on
   import os.path as op
   from glob import glob

   from datalad.api import get

   # note that the datapath is relative
   datapath = op.join('data',
                      'studyforrest-data-eyemovementlabels',
                      'sub*',
                      '*run-2*.tsv')
   data = sorted(glob(datapath))

   # this will get the data if it is not yet retrieved
   get(dataset='.', path=data)
Lastly, :dlcmd:`clone` the software repository as a subdataset in the
root of the superdataset [#f3]_.

.. code-block:: bash

   # in the root of ``algorithm-paper`` run
   $ datalad clone -d . git@github.com:psychoinformatics-de/remodnav.git

This repository also has subdatasets in which the datasets used for testing live (``tests/data/``):

.. code-block:: bash

   $ tree
   [...]
   ├── remodnav
   │   ├── clf.py
   │   ├── __init__.py
   │   ├── __main__.py
   │   └── tests
   │       ├── data
   │       │   ├── anderson_etal
   │       │   └── studyforrest
At this stage, a public ``algorithm-paper`` repository shares code and data, and changes to any
dataset can easily be handled by updating the respective subdataset.
This already is a big leap towards open and reproducible science. Thanks to DataLad, code,
data, and the history of all code and data are easily shared - with exact versions of all
components, bound together in a single, fully tracked research object.
By making use of DataLad's Python API and :term:`relative path`\s in scripts,
data retrieval is automated, and the scripts can run on any other computer.
Automation with existing tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Going beyond that and including freshly computed results in a manuscript on the fly does not
require DataLad anymore, only some understanding of Python, LaTeX, and Makefiles. As with most things,
it is a surprisingly simple challenge once one has seen how to do it.
This last section will therefore outline how to compile the results into a PDF manuscript and
automate this process.

In principle, the challenge boils down to:

#. have the script output results (only requires ``print()`` statements),
#. capture these results automatically (done with a single line of Unix commands),
#. embed the captured results in the PDF (done with one line in the ``.tex`` file and
   some clever referencing),
#. automate as much as possible to keep it as simple as possible (done with a Makefile).

That does not sound too bad, does it?
Let's start by revealing how this magic trick works. Everything relies on printing
the results in the form of user-defined LaTeX definitions (using the ``\newcommand``
command), referencing those definitions in your manuscript where the
results should end up, and binding the ``\newcommand``\s as ``\input{}`` to your ``.tex``
file. But let's get there in small steps.

First, if you want to read up on ``\newcommand``, please see
`its documentation <https://en.wikibooks.org/wiki/LaTeX/Macros>`_.
The command syntax looks like this:

``\newcommand{\name}[num]{definition}``

What we want to do, expressed in the most human-readable form, is this:

``\newcommand{\TableOneCellOneRowOne}{0.67}``

where ``0.67`` would be a single result computed by your script.
(Note that LaTeX command names may only contain letters, hence the spelled-out numbers.)
This requires ``print()`` statements that look like this in their most simple
form (excerpt from the script):
.. code-block:: python

   print('\\newcommand{\\maxmclf}{{%.2f}}' % max_mclf)

where ``max_mclf`` is a variable that stores the value of one computation.
Tables and references to results within the ``.tex`` files then do not contain the
specific value ``0.67`` (this value would change if the data or other parameters change),
but ``\maxmclf`` (and similarly unique names for other results).
For full tables, one can come up with naming schemes that make it easy
to fill tables with unique names with minimal work, for example like this (excerpt):
.. code-block:: tex

   \begin{table}[tbp]
   \caption{Cohen's Kappa reliability between human coders (MN, RA),
     and \remodnav\ (AL) with each of the human coders.
   }
   \label{tab:kappa}
   \begin{tabular*}{0.5\textwidth}{c @{\extracolsep{\fill}}llll}
   \textbf{Fixations} & & \\
   \hline\noalign{\smallskip}
   Comparison & Images & Dots \\
   \noalign{\smallskip}\hline\noalign{\smallskip}
   MN versus RA & \kappaRAMNimgFix & \kappaRAMNdotsFix \\
   AL versus RA & \kappaALRAimgFix & \kappaALRAdotsFix \\
   AL versus MN & \kappaALMNimgFix & \kappaALMNdotsFix \\
   \noalign{\smallskip}
   \textbf{Saccades} & & \\
   \hline\noalign{\smallskip}
   Comparison & Images & Dots \\
   \noalign{\smallskip}\hline\noalign{\smallskip}
   MN versus RA & \kappaRAMNimgSac & \kappaRAMNdotsSac \\
   AL versus RA & \kappaALRAimgSac & \kappaALRAdotsSac \\
   AL versus MN & \kappaALMNimgSac & \kappaALMNdotsSac \\
   \noalign{\smallskip}
   % [..] more content omitted
   \end{tabular*}
   \end{table}
Without diving into the context of the paper, this table contains results for three
comparisons ("MN versus RA", "AL versus RA", "AL versus MN"), three
event types (fixations, saccades, and post-saccadic oscillations (PSOs)), and three different
stimulus types (images, dots, and videos). The latter event and stimulus types are omitted for
better readability of the ``.tex`` excerpt. Here is how this table looks in the manuscript
(cropped to match the ``.tex`` snippet):
.. figure:: ../artwork/src/img/remodnav.png
It might appear tedious to write scripts that output results for such tables with individual names.
However, the ``print()`` statements to fill those tables can utilize Python's string concatenation methods
and loops to keep the code for a full table within a few lines, such as:
.. code-block:: python

   # iterate over stimulus categories
   for stim in ['img', 'dots', 'video']:
       # iterate over event categories
       for ev in ['Fix', 'Sac', 'PSO']:
           [...]
           # create the combinations
           for rating, comb in [('RAMN', [RA_res_flat, MN_res_flat]),
                                ('ALRA', [RA_res_flat, AL_res_flat]),
                                ('ALMN', [MN_res_flat, AL_res_flat])]:
               kappa = cohen_kappa_score(comb[0], comb[1])
               label = 'kappa{}{}{}'.format(rating, stim, ev)
               # print the result
               print('\\newcommand{\\%s}{%s}' % (label, '%.2f' % kappa))
Running the Python script will hence print plenty of LaTeX commands to your screen (try it out
on the actual manuscript, if you want!). This was step number 1 of 4.
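The whole trick can be seen in one small, runnable sketch. The result names and
values below are invented for illustration; in the real setup they are computed by
the analysis script:

.. code-block:: python

   # hypothetical stand-in results; the real values come from the analysis
   results = {'maxmclf': 0.6667, 'kappaRAMNimgFix': 0.854}

   # print one \newcommand definition per result, ready to be captured
   # into a results_def.tex file later on
   for name in sorted(results):
       print('\\newcommand{\\%s}{%.2f}' % (name, results[name]))

Each printed line is a valid LaTeX definition, for example
``\newcommand{\maxmclf}{0.67}``.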
.. find-out-more:: How about figures?

   To include figures, the figures just need to be saved to a dedicated location (for example,
   a directory ``img/``) and included in the ``.tex`` file with standard LaTeX syntax.
   Larger figures with subfigures can be created by combining several figures:

   .. code-block:: tex

      \begin{figure*}[tbp]
      \includegraphics[trim=0 8mm 3mm 0,clip,width=.5\textwidth]{img/mainseq_lab}
      \includegraphics[trim=8mm 8mm 0 0,clip,width=.5\textwidth-3.3mm]{img/mainseq_sub_lab} \\
      \includegraphics[trim=0 0 3mm 0,clip,width=.5\textwidth]{img/mainseq_mri}
      \includegraphics[trim=8mm 0 0 0,clip,width=.5\textwidth-3.3mm]{img/mainseq_sub_mri}
      \caption{Main sequence of eye movement events during one 15 minute sequence of
        the movie (segment 2) for lab (top), and MRI participants (bottom). Data
        across all participants per dataset is shown on the left, and data for a single
        exemplary participant on the right.}
      \label{fig:overallComp}
      \end{figure*}

   This figure looks like this in the manuscript:

   ..
      the image can't become a figure because it can't be used in LaTeX's minipage environment

   .. image:: ../artwork/src/img/remodnav2.png
For steps 2 and 3, the print statements need to be captured and bound to the ``.tex`` file.
The `tee <https://en.wikipedia.org/wiki/Tee_(command)>`_ command can write all of the output to
a file (called ``results_def.tex``):

.. code-block:: bash

   code/mk_figuresnstats.py -s | tee results_def.tex

This will redirect every print statement the script writes to the terminal into a file called
``results_def.tex``. This file will hence be full of ``\newcommand`` definitions that contain
the results of the computations.
For step 3, one can include this file as an input source into the ``.tex`` file with

.. code-block:: tex

   \begin{document}
   \input{results_def.tex}

Upon compilation of the ``.tex`` file into a PDF, the results of the
computations captured with ``\newcommand`` definitions are inserted into the respective parts
of the manuscript.
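As a minimal sketch, a complete document that pulls in the definitions could look
like this (``\maxmclf`` is assumed to be defined in ``results_def.tex``; the
sentence is invented for illustration):

.. code-block:: tex

   \documentclass{article}
   \begin{document}
   % insert all result definitions captured from the script
   \input{results_def.tex}
   The classifier reached a maximum performance of \maxmclf.
   \end{document}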
.. index:: ! Make

The last step is to automate this procedure. So far, the script would need to be executed
with a command line call, and the PDF compilation would require another command line call.
One way to automate this process is a `Makefile <https://en.wikipedia.org/wiki/Make_(software)>`_.
``make`` is a decades-old tool known to many, and it bears the important advantage that a single
``make`` call will deliver results regardless of what actually needs to be done --
whether it is executing a Python script, running bash commands, rendering figures, or all of this.
Here is the one used for the manuscript:
.. code-block:: make
   :linenos:

   all: main.pdf
   main.pdf: main.tex tools.bib EyeGaze.bib results_def.tex figures
   	latexmk -pdf -g $<

   results_def.tex: code/mk_figuresnstats.py
   	bash -c 'set -o pipefail; code/mk_figuresnstats.py -s | tee results_def.tex'

   figures: figures-stamp
   figures-stamp: code/mk_figuresnstats.py
   	code/mk_figuresnstats.py -f -r -m
   	$(MAKE) -C img
   	touch $@

   clean:
   	rm -f main.bbl main.aux main.blg main.log main.out main.pdf main.tdo main.fls main.fdb_latexmk example.eps img/*eps-converted-to.pdf texput.log results_def.tex figures-stamp
   	$(MAKE) -C img clean
One can read a Makefile as a recipe:

- Line 1: "The overall target is ``main.pdf`` (the final PDF of
  the manuscript)."
- Lines 2-3: "To make the target ``main.pdf``, the following files are required:
  ``main.tex`` (the manuscript's ``.tex`` file), ``tools.bib`` & ``EyeGaze.bib`` (bibliography files), ``results_def.tex``
  (the results definitions), and ``figures`` (a target not covered here, about rendering figures
  with Inkscape prior to including them in the manuscript). If all of these files are present,
  the target ``main.pdf`` can be made by running the command ``latexmk -pdf -g``."
- Lines 5-6: "To make the target ``results_def.tex``, the script ``code/mk_figuresnstats.py`` is
  required. If the file is present, the target ``results_def.tex`` can be made by running the
  command ``bash -c 'set -o pipefail; code/mk_figuresnstats.py -s | tee results_def.tex'``."

Typing ``make`` thus triggers the execution of the script, the collection of results in
``results_def.tex``, and PDF compilation.
The last three lines define that ``make clean`` removes all computed files, and also all
images.
Finally, by wrapping ``make`` in a :dlcmd:`run` command, the computation of results
and the compilation of the manuscript with all generated output can be written to the history of
the superdataset. ``datalad run make`` will thus capture all provenance for the results
and the final PDF.
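As a sketch, the invocation could look like this when run in the root of the
superdataset (the commit message is just an example):

.. code-block:: bash

   # record the full build, including all generated outputs, in the dataset history
   $ datalad run -m "recompute results and compile manuscript" make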
Thus, by using DataLad and its Python API, a few clever Unix and LaTeX tricks,
and Makefiles, anyone can create a reproducible paper. This saves time, increases your own
trust in the results, and helps to make a more convincing case with your research.
If you have not yet, but are curious, check out the
`manuscript this use case is based on <https://github.com/psychoinformatics-de/paper-remodnav>`_.
Any questions can be asked by `opening an issue <https://github.com/psychoinformatics-de/paper-remodnav/issues/new>`_.
.. rubric:: Footnotes

.. [#f1] You can read up on the YODA principles again in section :ref:`yoda`.

.. [#f2] You can read up on cloning datasets as subdatasets again in section :ref:`installds`.

.. [#f3] Note that the software repository may just as well be cloned into ``data/``.