.. _usecase_student_supervision:

Student supervision in a research project
-----------------------------------------

.. index:: ! Usecase; Student supervision

This use case demonstrates a workflow that uses DataLad tools and principles
to assist in the technical aspects of supervising research projects with computational
components.
It shows how a DataLad dataset mitigates technical
complexities for trainees and allows high-quality supervision from afar with minimal
effort and time commitment from busy supervisors. It furthermore serves to log
undertaken steps, establishes trust in an analysis, and eases collaboration.

Successful workflows rely on more knowledgeable "trainers" (i.e., supervisors or more
experienced collaborators) for a quick initial dataset setup with optimal configuration, and
an introduction to the YODA principles and basic DataLad commands.
Subsequently, supervision and collaboration are made easy by the distributed nature of a dataset.
Afterwards, reuse of a student's work is made possible by the modular nature of the dataset.
Students can concentrate on questions relevant for the field and research topic,
and computational complexities are minimized.

The Challenge
^^^^^^^^^^^^^

Megan is a graduate student doing an internship in a lab
at a partnering research institution. As she already has experience in data analysis,
and her supervisor's time is limited, she is given a research question
to work on autonomously. The data are already collected, and everyone involved
is certain that Megan will be fine performing the analyses she has
experience with. Her supervisor confidently proposes the research project as a
conference talk Megan should give at the end of her stay. Megan is excited about the
responsibility and her project, and cannot wait to start.

On the first day, her supervisor spends an hour showing her the office and
the coffee machine, and they chat about the high-level aspects
of the project: which literature is relevant, who collected the data,
and how long the final talk should be. Megan has many procedural questions,
but the hour is over fast, and it is difficult to find time to meet again.
As it turns out, her supervisor will soon leave the country for a three-month visit
to a lab in Japan, and is very busy preparing this stay and coordinating
other projects. However, everyone is confident that Megan will be just fine.
The IT office issues an account on the computational cluster for her,
and the postdoc who collected the data points her to the directories in which
the data are stored.

When she starts, Megan realizes that she has no experience with the
Linux-based operating system running on the compute cluster. She knows very well how
to write scripts to perform very complex analyses, but needs to invest much
time to understand basic concepts and relevant commands on the cluster
because no one is around to give her a quick introduction.

When she starts her computations, she accidentally overwrites a data file in the
data collection, and emails the postdoc for help. Luckily, he has a backup
of the data and is able to restore the original state, but he grimly CCs her supervisor
in his response email to her. Not having been told where to store analysis results,
Megan saves them in a ``scratch`` directory that is not backed up. Her supervisor sends
ambiguous, hard-to-make-sense-of emails at 3am. Megan tries to
comply with the instructions she extracts from them, and reports back with lengthy
explanations of what she is doing that her supervisor rarely has time to read.
Without an interactive discussion or feedback component, Megan is very unsure
about what she is supposed to do, and saves multiple different analysis scripts
and their results inside of the scratch folder.

When her supervisor returns and meets her for a project update, he scolds her for the
bad organization and the choice of a storage location without backup. With a pressing timeline,
Megan is told to write down her results. She is discouraged when she finally gets
feedback on them and learns that she interpreted one of her supervisor's instructions
differently from what was meant, rendering all of her results irrelevant.
No longer trusting Megan's analyses, her supervisor cancels the talk and has the
postdoc take over.

Megan feels incompetent and regards the stay as a waste of time, her supervisor
is unhappy about the miscommunication and lack of results, and the postdoc
taking over is unable to comprehend what was done so far and needs to start over,
even though all analysis scripts were correct and very relevant for the future
of the project.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

When Megan arrives in the lab, her supervisor and the postdoc who collected the
data take an hour to meet and talk about the upcoming project. To ease the technical
complexities for a new student like Megan on an unfamiliar computational infrastructure,
they talk about the YODA principles and basic DataLad commands, and
set up a project dataset for Megan to work in. Inside of this dataset, the original
data are cloned as a subdataset, code is tracked with Git, and the appropriate software
is provided with a containerized image tracked in the dataset.
Megan quickly adopts the version control workflow and data
analysis principles and is thankful for the brief but sufficient introduction.

When her supervisor leaves for Japan, they stay in touch via email, but her
supervisor also checks the development of the project from afar every other week
and skims through Megan's code updates. When he notices that one of his
instructions was ambiguous and Megan's approach to it misguided, he can intervene right away.
Megan feels comfortable and confident that she is doing something useful and learns a lot
about data management in the safe space of a version-controlled dataset.
Her supervisor can see how well-crafted Megan's analysis methods are, and trusts her results.

Megan proudly presents the results of her analysis and leaves with many good experiences
and lots of new knowledge. Her supervisor is happy about the progress made on the project,
and the dataset is a standalone "lab notebook" that anyone can later use as a detailed log
to make sense of what was done. As an ongoing collaboration, Megan, the postdoc, and her
supervisor write up a paper on the analysis and use the analysis dataset as a subdataset
in this project.

Step-by-Step
^^^^^^^^^^^^

Megan's supervisor is excited that she comes to visit the lab and trusts her to be a diligent,
organized, and capable researcher. But he also does not have much time for a lengthy introduction
to technical aspects unrelated to the project, interactive teaching, or in-person supervision.
Megan in turn is a competent student and eager to learn new things, but she
does not have experience with DataLad, version control, or the computational cluster.

As a first step, therefore, her supervisor and the postdoc prepare a preconfigured
dataset in a dedicated directory everyone involved in the project has access to:

.. code-block:: bash

   $ datalad create -c yoda project-megan

Every data collection that this lab generates or uses is kept as a standalone DataLad dataset
in a dedicated ``data/`` directory on a server. To give Megan access to the data without
endangering or potentially modifying the pristine data kept in there, and in compliance with the
YODA principles, they clone the data she is supposed to analyze as a subdataset:

.. code-block:: bash

   $ cd project-megan
   $ datalad clone -d . \
     /home/data/ABC-project \
     data/ABC-project
   [INFO ] Cloning /home/data/ABC-project [1 other candidates] into '/home/projects/project-megan/data/ABC-project'
   [INFO ] Remote origin not usable by git-annex; setting annex-ignore
   install(ok): data/ABC-project (dataset)
   action summary:
     add (ok: 2)
     install (ok: 1)
     save (ok: 1)

The ``yoda`` configuration and the data installation created a comprehensive directory
structure and configured the ``code/`` directory to be tracked in Git, to allow
for easy, version-controlled modifications without the necessity to learn about
locked content in the annex.

.. code-block:: bash

   $ tree
   .
   ├── CHANGELOG.md
   ├── code
   │   └── README.md
   ├── data
   │   └── ABC-project [13 entries exceeds filelimit, not opening dir]
   └── README.md

Within a 20-minute walk-through, Megan learns the general concepts of version
control, gets an overview of the YODA principles [#f1]_,
configures her Git identity with the help of her supervisor, and is
given an introduction to the most important DataLad commands relevant to her:
:dlcmd:`save` [#f2]_, :dlcmd:`containers-run` [#f3]_,
and :dlcmd:`rerun` [#f4]_.
For reference, they also give her the :ref:`cheat sheet <cheat>` and the link
to the DataLad handbook as a resource if she has further questions.

To make the analysis reproducible, they spend the final part of the meeting
adding the lab's default Singularity image to the dataset.
The lab has a Singularity image with all the relevant software on
`Singularity-Hub <https://singularity-hub.org>`_,
and it can easily be added to the dataset with the ``datalad-container`` extension [#f3]_:

.. code-block:: bash

   $ datalad containers-add somelabsoftware --url shub://somelab/somelab-container:Softwaresetup

With the container image registered in the dataset, Megan can perform her analysis
in the correct software environment, does not need to set up software herself,
and creates a more reproducible analysis.

With only a single command to run, Megan finds it easy to version control her
scripts and gets into the habit of
running :dlcmd:`save` frequently. This way, she can fully concentrate
on writing up the analysis. In the beginning, her commit messages
may not be optimal, and the changes she bundles into a single commit might
better have been split up into separate commits. But from the very beginning she is
able to version control her progress, and she gets more and more proficient as
the project develops.

Knowing the YODA principles gives her clear and easy-to-follow guidelines
on how to work. Her scripts write their results into dedicated ``output/`` directories
and are executed with :dlcmd:`containers-run` to capture the provenance of how
each result came to be and with which software. These guidelines are not complex, and yet
make her whole workflow much more comprehensible, organized, and transparent.

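
A minimal sketch of such a call, wrapped in a small shell helper that could live in ``code/``.
The script name ``code/analysis.py``, its ``--out`` option, and the output directory are
hypothetical placeholders; only the container name ``somelabsoftware`` and the input path
``data/ABC-project`` come from the setup described here, and the sketch assumes that the
container provides ``python``.

```shell
# Sketch only: run one analysis script inside the registered container and let
# DataLad record the command, the container, and the declared inputs/outputs.
# The script path and the output directory are hypothetical placeholders.
run_analysis() {
    script=$1    # analysis script tracked in code/
    outdir=$2    # dedicated directory below output/
    datalad containers-run \
        --container-name somelabsoftware \
        --message "Run $script" \
        --input data/ABC-project \
        --output "$outdir" \
        "python $script --out $outdir"
}
```

Calling ``run_analysis code/analysis.py output/model1`` from the dataset root would then
create a single commit that ties the results in ``output/model1`` to the script, the input
data, and the software environment that produced them.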

The preconfigured DataLad dataset thus minimizes the visible technical complexity.
Just a few commands and standards have a large positive impact on her project,
and Megan learns these new skills fast. It did not take her supervisor much time
to configure the dataset or give her an introduction to the relevant commands,
and yet it ensured that she could work productively and contribute her
expertise to the project.

Her supervisor can also check how the project develops if Megan asks for assistance or if
he is curious, even from afar and whenever he has 15 minutes of spare time.
When he notices that Megan must have misunderstood one of his emails, he can
intervene and contact Megan by their preferred method of communication,
or push a fix or comment to the project, as he has write access.
This enables him to stay up to date independent of emails
or meetings with Megan, and to help when necessary without much trouble. When they
talk, they focus on the code and analysis at hand, and not solely on verbal reports.

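
One way such a check-in could look is sketched below, assuming that the supervisor works
in his own clone of the project dataset. The clone location passed to the helper is a
placeholder, and ``--how merge`` assumes a reasonably recent DataLad release.

```shell
# Sketch only: bring a supervisor's clone up to date and summarize recent work.
# The clone path is supplied by the caller; the default time window is arbitrary.
review_progress() {
    clone=$1
    since=${2:-"2 weeks ago"}
    # Fetch and merge the student's latest commits into the clone.
    datalad update -d "$clone" --how merge
    # List recent commits with author, date, message, and the files they touched.
    git -C "$clone" log --since="$since" --stat --date=short \
        --pretty=format:'%h %ad %an: %s'
}
```

A call such as ``review_progress ~/project-megan "1 week ago"`` (path hypothetical) gives
a compact overview of which scripts and results changed, which is usually enough to spot
a misunderstanding before it affects further work.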

Megan finishes her analysis well ahead of time and can prepare her talk.
Together with her supervisor she decides which figures look good and
which results are important. All results that are deemed irrelevant can be dropped
to keep the dataset lean, but could be recomputed as their provenance was tracked.

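
The sketch below illustrates both halves of this under stated assumptions: the result
files were produced with :dlcmd:`containers-run`, so the commits that created them carry
a ``[DATALAD RUNCMD]`` message, and the installed DataLad release supports
``--reckless availability``. The helper names and any file names passed to them are
hypothetical.

```shell
# Sketch only: free the local content of result files that are no longer needed.
# Computed results typically have no second copy, so the availability check
# is skipped explicitly (assumes a DataLad release with --reckless availability).
drop_results() {
    datalad drop --reckless availability "$@"
}

# Sketch only: recompute a result from its recorded provenance by finding the
# most recent run record that touched the file and re-executing it.
recompute_result() {
    file=$1
    commit=$(git log -1 --format=%H --grep='DATALAD RUNCMD' -- "$file")
    if [ -z "$commit" ]; then
        echo "no run record found for $file" >&2
        return 1
    fi
    datalad rerun "$commit"
}
```

Both helpers are meant to be called from the root of the analysis dataset. Skipping the
availability check is only safe here because the run record makes the result reproducible.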

Finally, the data analysis project is cloned as an input into a new dataset
created for collaborative paper-writing on the analysis:

.. code-block:: bash

   $ datalad create megans-paper
   $ cd megans-paper
   $ datalad clone -d . \
     /home/projects/project-megan \
     analysis
   [INFO ] Cloning /home/projects/project-megan [1 other candidates] into '/home/paper/megans-paper/analysis'
   [INFO ] Remote origin not usable by git-annex; setting annex-ignore
   install(ok): analysis (dataset)
   action summary:
     add (ok: 2)
     install (ok: 1)
     save (ok: 1)

Even after Megan returns to her home institution, they can write up the paper
on her analysis collaboratively, and her co-authors have a detailed research log
of the project within the dataset's history.

In summary, DataLad can help to effectively manage student supervision in computational
projects. It requires minimal effort, but comes with great benefit:

- Appropriate data management is made a key element of the project and handled from the start,
  rather than being an afterthought that needs to be addressed at the end of its lifetime.
- The dataset becomes the lab notebook, hence a valid and detailed log is always
  available and accessible to supervisor and trainee.
- Supervisors can efficiently prepare for meetings in a way that does not rely
  exclusively on a student's report. This shifts the focus from trust in a student
  to trust in a student's work.
- Supervisors can provide feedback that is not only high-level and based on a presentation,
  but also much more detailed, including on process aspects if desired or necessary.
  Supervisors can directly contribute in a way that is as auditable and accountable as
  the student's own contributions. For both parties, the strict separation and tracking
  of any external inputs of a project makes it possible, once the project is completed,
  for a supervisor to efficiently test the integrity of the inputs, discard them
  if they are unmodified, and archive only the outputs that are unique to the project.
  These outputs can then become a modular component for reuse in a future project.

.. rubric:: Footnotes

.. [#f1] Find out more about the YODA principles in section :ref:`yoda`
.. [#f2] Find out more about ``datalad save`` in section :ref:`modify`
.. [#f3] Find out more about the ``datalad containers`` extension in section TODO:link once it exists
.. [#f4] Find out more about the ``datalad rerun`` command in section :ref:`run2`