adswa
/
handbook


			
			
				
					
						
							1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768
							.. _summary_yoda:

Summary
-------

The YODA principles are a small set of guidelines that can make a huge
difference towards reproducibility, comprehensibility, and transparency
in a data analysis project. By applying them in your own midterm analysis
project, you have experienced their immediate benefits.

You also noticed that these standards are not complex -- quite the opposite,
they are very intuitive.
They structure essential components of a data analysis project --
data, code, potentially computational environments, and lastly also the results --
in a modular and practical way, and use basic principles and commands
of DataLad you are already familiar with.

There are many advantages to this organization of contents.

- Having input data as independent dataset(s) that are not influenced (only
  consumed) by an analysis allows for a modular reuse of pure data datasets,
  and does not conflate the data of an analysis with the results or the code.
  You have experienced this with the ``iris_data`` subdataset.

- Keeping code within an independent, version-controlled directory, but as a part
  of the analysis dataset, makes sharing code easy and transparent, and helps
  to keep directories neat and organized. Moreover,
  with the data as subdatasets, data and code can be automatically shared together.
  By complying to this principle, you were able to submit both code and data
  in a single superdataset.

- Keeping an analysis dataset fully self-contained with relative instead of
  absolute paths in scripts is critical to ensure that an analysis reproduces
  easily on a different computer.

- DataLad's Python API makes all of DataLad's functionality available in
  Python, either as standalone functions that are exposed via ``datalad.api``,
  or as methods of the ``Dataset`` class.
  This provides an alternative to the command line, but it also opens up the
  possibility of performing DataLad commands directly inside of scripts.

- Including the computational environment into an analysis dataset encapsulates
  software and software versions, and thus prevents re-computation failures
  (or sudden differences in the results) once
  software is updated, and software conflicts arising on different machines
  than the one the analysis was originally conducted on. You have not yet
  experienced how to do this first-hand, but you will in a later section.

- Having all of these components as part of a DataLad dataset allows version
  controlling all pieces within the analysis regardless of their size, and
  generates provenance for everything, especially if you make use of the tools
  that DataLad provides. This way, anyone can understand and even reproduce
  your analysis without much knowledge about your project.

- The yoda procedure is a good starting point to build your next data analysis
  project up on.

Now what can I do with it?
^^^^^^^^^^^^^^^^^^^^^^^^^^

Using tools that DataLad provides you are able to make the most out of
your data analysis project. The YODA principles are a guide to accompany
you on your path to reproducibility and provenance-tracking.

What should have become clear in this section is that you are already
equipped with enough DataLad tools and knowledge that complying to these
standards felt completely natural and effortless in your midterm analysis
project.