adswa
/
handbook


			
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265
							.. _usecase_student_supervision:

Student supervision in a research project
-----------------------------------------

.. index:: ! Usecase; Student supervision

This use case will demonstrate a workflow that uses DataLad tools and principles
to assist in technical aspects of supervising research projects with computational
components.
It demonstrates how a DataLad dataset comes with advantages that mitigate technical
complexities for trainees and allows high-quality supervision from afar with minimal
effort and time commitment from busy supervisors. It furthermore serves to log
undertaken steps, establishes trust in an analysis, and eases collaboration.

Successful workflows rely on more knowledgeable "trainers" (i.e., supervisors, or a more
experienced collaborator) for a quick initial dataset setup with optimal configuration, and
an introduction to the YODA principles and basic DataLad commands.
Subsequently, supervision and collaboration is made easy by the distributed nature of a dataset.
Afterwards, reuse of a students work is made possible by the modular nature of the dataset.
Students can concentrate on questions relevant for the field and research topic,
and computational complexities are minimized.

The Challenge
^^^^^^^^^^^^^

Megan is a graduate student and does an internship in a lab
at a partnering research institution. As she already has experience in data analysis,
and the time of her supervisor is limited, she is given a research question
to work on autonomously. The data are already collected, and everyone involved
is certain that Megan will be fine performing the analyses she has
experience with. Her supervisor confidently proposes the research project as a
conference talk Megan should give at the end of her stay. Megan is excited about the
responsibility and her project, and can not wait to start.

On the first day, her supervisor spends an hour to show her the office,
the coffee machine, and they chat about the high-level aspects
of the projects: Which is the relevant literature, who collected the data,
how long should the final talk be. Megan has many procedural questions,
but the hour is over fast, and it is difficult to find time to meet again.
As it turns out, her supervisor will leave the country for a three month visit
to a lab in Japan soon, and is very busy preparing this stay and coordinating
other projects. However, everyone is confident that Megan will be just fine.
The IT office issues an account on the computational cluster for her,
and the postdoc that collected the data points her to the directories in which
the data are stored.

When she starts, Megan realizes that she has no experience with the
Linux-based operating system running on the compute cluster. She knows very well how
to write scripts to perform very complex analyses, but needs to invest much
time to understand basic concepts and relevant commands on the cluster
because no-one is around to give her a quick introduction.
When she starts her computations, she accidentally overwrites a data file in the
data collection, and emails the postdoc for help. He luckily has a backup
of the data and is able to restore the original state, but grimly CCs her supervisor
in his response email to her. Not being told where to store analysis results in,
Megan saves the results in a not backed-up ``scratch`` directory. With ambiguous,
hard-to-make-sense-of emails her supervisor sends at 3am, Megan tries to
comply to the instructions she extracts from the emails, and reports back lengthy
explanations of what she is doing that her supervisor rarely has time to read.
Without an interactive discussion or feedback component, Megan is very unsure
about what she is supposed to do, and saves multiple different analysis scripts
and results of them inside of the scratch folder.

When her supervisor returns and meets for a project update, he scolds her for the
bad organization, and the no-backup storage choice. With a pressing timeline,
Megan is told to write down her results. She is discouraged when she finally gets
feedback on them and learns that she interpreted one instruction of her supervisor
differently from what was meant by it, deeming all of her results irrelevant.
Not trusting Megan's analyses anymore, her supervisor cancels the talk and has the
postdoc take over.
Megan feels incompetent and regards the stay as a waste of time, her supervisor
is unhappy about the mis-communication and lack of results, and the postdoc
taking over is unable to comprehend what was done so far and needs to start over new,
even though all analysis scripts were correct and very relevant for the future
of the project.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

When Megan arrives in the lab, her supervisor and the postdoc that collected the
data take an hour to meet and talk about the upcoming project. To ease the technical
complexities for a new student like Megan on an unfamiliar computational infrastructure,
they talk about the YODA principles, basic DataLad commands, and
set up a project dataset for Megan to work in. Inside of this dataset, the original
data are cloned as a subdataset, code is tracked with Git, and the appropriate software
is provided with a containerized image tracked in the dataset.
Megan can adopt the version control workflow and data
analysis principles very fast and is thankful for the brief but sufficient introduction.
When her supervisor leaves for Japan, they stay in touch via email, but her
supervisor also checks the development of the project and occasionally skims through Megan's code
updates from afar every other week. When he notices that one of his
instructions was ambiguous and Megan's approach to it misguided, he can intervene right away.
Megan feels comfortable and confident that she is doing something useful and learns a lot
about data management in the safe space of a version controlled dataset.
Her supervisor can see how well made Megan's analysis methods are, and has trust in her results.
Megan proudly presents the results of her analysis and leaves with many good experiences
and lots of new knowledge. Her supervisor is happy about the progress done on the project,
and the dataset is a standalone "lab-notebook" that anyone can later use as a detailed log
to make sense of what was done. As an ongoing collaboration, Megan, the postdoc, and her
supervisor write up a paper on the analysis and use the analysis dataset as a subdataset
in this project.

Step-by-Step
^^^^^^^^^^^^

Megan's supervisor is excited that she comes to visit the lab and trusts her to be a diligent,
organized, and capable researcher. But he also does not have much time for a lengthy introduction
to technical aspects unrelated to the project, interactive teaching, or in-person supervision.
Megan in turn is a competent student and eager to learn new things, but she
does not have experience with DataLad, version control, or the computational cluster.

As a first step, therefore, her supervisor and the postdoc prepare a preconfigured
dataset in a dedicated directory everyone involved in the project has access to:

.. code-block:: bash

   $ datalad create -c yoda project-megan

All data that this lab generates or uses is a standalone DataLad dataset that lives
in a dedicated ``data\`` directory on a server. To give Megan access to the data without
endangering or potentially modifying the pristine data kept in there, complying to the
YODA principles, they clone the data she is supposed to analyze as a subdataset:

.. code-block:: bash

   $ cd project-megan
   $ datalad clone -d . \
     /home/data/ABC-project \
     data/ABC-project

    [INFO   ] Cloning /home/data/ABC-project [1 other candidates] into '/home/projects/project-megan/data/ABC-project'
    [INFO   ] Remote origin not usable by git-annex; setting annex-ignore
    install(ok): data/ABC-project (dataset)
    action summary:
      add (ok: 2)
      install (ok: 1)
      save (ok: 1)

The YODA principle and the data installation created a comprehensive directory
structure and configured the ``code\`` directory to be tracked in Git, to allow
for easy, version-controlled modifications without the necessity to learn about
locked content in the annex.

.. code-block:: bash

   $ tree
   .
   ├── CHANGELOG.md
   ├── code
   │   └── README.md
   ├── data
   │   └── ABC-project [13 entries exceeds filelimit, not opening dir]
   └── README.md

Within a 20-minute walk-through, Megan learns the general concepts of version-
control, gets an overview of the YODA principles [#f1]_,
configures her Git identity with the help of her supervisor, and is
given an introduction to the most important DataLad commands relevant to her,
:dlcmd:`save` [#f2]_, :dlcmd:`containers-run` [#f3]_,
and :dlcmd:`rerun` [#f4]_.
For reference, they also give her the :ref:`cheat sheet <cheat>` and the link
to the DataLad handbook as a resource if she has further questions.

To make the analysis reproducible, they spent the final part of the meeting
on adding the labs default singularity image to the dataset.
The lab has a singularity image with all the relevant software on
`Singularity-Hub <https://singularity-hub.org>`_,
and it can easily be added to the dataset with the DataLad-containers extension [#f3]_:

.. code-block:: bash

   $ datalad containers-add somelabsoftware --url shub://somelab/somelab-container:Softwaresetup

With the container image registered in the dataset, Megan can perform her analysis
in the correct software environment, does not need to setup software herself,
and creates a more reproducible analysis.

With only a single command to run, Megan finds it easy to version control her
scripts and gets into the habit of
running :dlcmd:`save` frequently. This way, she can fully concentrate
on writing up the analysis. In the beginning, her commit messages
may not be optimal, and the changes she commits into a single commit might have
better been split up into separate commits. But from the very beginning she is
able to version control her progress, and she gets more and more proficient as
the project develops.

Knowing the YODA principles gives her clear and easy-to-follow guidelines
on how to work. Her scripts are producing results in dedicated ``output/`` directories
and are executed with :dlcmd:`containers-run` to capture the provenance of how
which result came to be with which software. These guidelines are not complex, and yet
make her whole workflow much more comprehensible, organized, and transparent.

The preconfigured DataLad dataset thus minimized the visible technical complexity.
Just a few commands and standards have a large positive impact on her project
and Megan learns these new skills fast. It did not take her supervisor much time
to configure the dataset or give her an introduction to the relevant commands,
and yet it ensured her to be able to productively work and contribute her
expertise to the project.

Her supervisor can also check how the project develops if Megan asks for assistance or if
he is curious -- even from afar and whenever he has some 15 minutes of spare-time.
When he notices that Megan must have misunderstood one of his emails, he can
intervene and contact Megan by their preferred method of communication,
and/or push a fix or comment to the project, as he has write-access.
This enables him to stay up-to-date independent of emails
or meetings with Megan, and to help when necessary without much trouble. When they
talk, they focus on the code and analysis at hand, and not solely on verbal reports.

Megan finishes her analysis well ahead of time and can prepare her talk.
Together with her supervisor she decides which figures look good and
which results are important. All results that are deemed irrelevant can be dropped
to keep the dataset lean, but could be recomputed as their provenance was tracked.
Finally, the data analysis project is cloned as an input into a new dataset
created for collaborative paper-writing on the analysis:

.. code-block:: bash

   $ datalad create megans-paper
   $ cd megans-paper
   $ datalad clone -d . \
     /home/projects/project-megan \
     analysis

   [INFO   ] Cloning /home/projects/project-megan [1 other candidates] into '/home/paper/megans-paper'
   [INFO   ]   Remote origin not usable by git-annex; setting annex-ignore
   install(ok): analysis (dataset)
   action summary:
     add (ok: 2)
     install (ok: 1)
     save (ok: 1)

Even as Megan returns to her home institution, they can write up the paper
on her analysis collaboratively, and her co-authors have a detailed research log
of the project within the dataset's history.

In summary, DataLad can help to effectively manage student supervision in computational
projects. It requires minimal effort, but comes with great benefit:

- Appropriate data management is made a key element of the project and handled from the start,
  not an afterthought that needs to be addressed at the end of its lifetime.

- The dataset becomes the lab notebook, hence a valid and detailed log is always
  available and accessible to supervisor and trainee.

- supervisors can efficiently prepare for meetings in a way that does not rely
  exclusively on a students report. This shifts the focus from trust in a student
  to trust in a student's work.

- supervisors can provide feedback, not only high-level based on a presentation,
  but much more detailed, and also on process aspects if desired/necessary:
  Supervisors can directly contribute in a way that is as auditable/accountable as
  the student's own contributions -- for both parties the strict separation and tracking
  of any external inputs of a project make it possible (when a project is completed)
  that a supervisor can efficiently test the integrity of the inputs, discard them
  (if unmodified), and only archive the outputs that are unique to the project --
  which then can become a modular component for reuse in a future project.


.. rubric:: Footnotes

.. [#f1] Find out more about the YODA principles in section :ref:`yoda`
.. [#f2] Find out more about datalad save in section :ref:`modify`
.. [#f3] Find out more about the ``datalad containers`` extension in section TODO:link once it exists
.. [#f4] Find out more about the ``datalad rerun`` command in section :ref:`run2`