.. _dataladdening:

Transitioning existing projects into DataLad
--------------------------------------------

Using DataLad offers exciting and useful features that warrant transitioning existing projects into DataLad datasets -- and in most cases, transforming your project into one or many DataLad datasets is easy.
This section outlines the basic steps to do so, and offers examples as well as advice and caveats.

Important: Your safety net
^^^^^^^^^^^^^^^^^^^^^^^^^^

Chances are high that you are reading this section of the handbook after you stumbled across DataLad and were intrigued by its features, and you're now looking for a quick way to get going.
If you haven't read much of the handbook, but are now planning to DataLad-ify the gigantic project you have been working on for the past months or years, this first paragraph is warning, advice, and a call for safety nets to prevent unexpected misery that can arise from transitioning to a new tool.
Because while DataLad *can* do amazing things, you shouldn't blindly trust it to do everything *you* think it can or should do, but gain some familiarity with it.

If you're a DataLad novice, we highly recommend that you read through the :ref:`basics-intro` part of the handbook.
This part of the book provides you with a solid understanding of DataLad's functionality and a playground to experience working with DataLad.
If you're really pressed for time because your dog is sick, your toddler keeps eating your papers, and your boss is behind you with a whip, the find-out-more below summarizes the most important sections from the Basics for you to read:

.. find-out-more:: The Basics for the impatient

   To get a general idea about DataLad, please read the sections :ref:`philo` and :ref:`executive_summary` from the introduction (reading time: 15 min).
   To gain a good understanding of some important parts of DataLad, please read the chapters :ref:`chapter_datasets`, :ref:`chapter_run`, and :ref:`chapter_gitannex` (reading time: 60 minutes).
   To become confident in using DataLad, the sections :ref:`help` and :ref:`file system` can be very useful.
   Depending on your aim, :ref:`chapter_collaboration` (for collaborative workflows), :ref:`chapter_thirdparty` (for data sharing), or :ref:`chapter_yoda` (for data analysis) may contain the relevant background for you.

Prior to transforming your project, regardless of how advanced a user you are, **we recommend to create a copy of it**.
We don't believe there is much that can go wrong from the software side of things, but data is precious and backups are a necessity, so better be safe than sorry.

Step 1: Planning
^^^^^^^^^^^^^^^^

The first step to DataLad-ify your project is to turn it into one or several nested datasets.
Whether you turn a project into a single dataset or several depends on the current size of your project and on how much you expect it to grow over time, but also on its contents.
You can find guidance on this in the paragraphs below.

The next step is to save dataset contents.
You should take your time and invest thought into this, as it determines the look and feel of your dataset, in particular the decision on which contents should be saved into :term:`Git` or :term:`git-annex`.
The section :ref:`symlink` should give you the necessary background information, and the chapter :ref:`chapter_config` the relevant skills to configure your dataset appropriately.
You should consider the size, file type, and modification frequency of files in your decisions, as well as potential plans to share a dataset via a particular third party infrastructure.

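To get a rough idea of the size of your project before deciding, a quick shell check can help; this is a minimal sketch, assuming your project lives in a hypothetical ``~/myproject`` directory::

  # count the number of files (relevant for the 100k-file benchmark discussed below)
  $ find ~/myproject -type f | wc -l
  # check the total size on disk
  $ du -sh ~/myproject
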
Step 2: Dataset creation
^^^^^^^^^^^^^^^^^^^^^^^^

Transforming a directory into a dataset is done with :dlcmd:`create --force`.
The ``-f``/``--force`` option enforces dataset creation in non-empty directories.
Consider :ref:`applying procedures <procedures>` with ``-c <procedure-name>`` to apply configurations that suit your use case.

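As a minimal sketch, assuming a hypothetical project directory and using the ``text2git`` procedure (which configures text files to be saved in Git)::

  $ cd ~/myproject
  # turn the non-empty current directory into a dataset
  $ datalad create --force -c text2git .
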
.. find-out-more:: What if my directory is already a Git repository?

   If you want to transform a Git repository into a DataLad dataset, a :dlcmd:`create -f` is the way to go, too, and completely safe.
   Your Git history will stay intact and will not be tampered with.

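   You can reassure yourself afterwards; a quick sketch with a hypothetical repository path::

     $ cd ~/my-git-repo
     $ datalad create -f .
     # the pre-existing commit history is untouched
     $ git log --oneline
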
If you want to transform a series of nested directories into nested datasets, continue with :dlcmd:`create -f` commands in all further subdirectories.

.. find-out-more:: One or many datasets?

   In deciding how many datasets you need, try to follow the benchmarks in chapter :ref:`chapter_gobig` and the YODA principles in section :ref:`yoda`.
   Two simple questions can help you make a decision:

   #. Do you have independently reusable components in your directory, such as data from several studies, or data and code/results? If yes, make each individual component a dataset.
   #. How large is each individual component? If it exceeds 100k files, split it up into smaller datasets. The decision on where to place subdataset boundaries can be guided by the existing directory structure or by common access patterns, for example, based on data type (raw, processed, ...) or subject association. One straightforward organization may be a top-level superdataset and subject-specific subdatasets, mimicking the structure chosen in the use case :ref:`usecase_HCP_dataset`.

You can automate this with :term:`bash` loops, if you want.

.. find-out-more:: Example bash loops

   Consider a directory structure that follows a naming standard such as `BIDS <https://bids.neuroimaging.io>`_::

     # create a mock-directory structure:
     $ mkdir -p study/sub-0{1,2,3,4,5}/{anat,func}
     $ tree study
     study
     ├── sub-01
     │   ├── anat
     │   └── func
     ├── sub-02
     │   ├── anat
     │   └── func
     ├── sub-03
     │   ├── anat
     │   └── func
     ├── sub-04
     │   ├── anat
     │   └── func
     └── sub-05
         ├── anat
         └── func

   Consider further that you have transformed the top-level ``study`` directory into a dataset and now want to transform all ``sub-*`` directories into further subdatasets, registered in ``study``.
   Here is a one-liner that would do this for the example above::

     $ for dir in study/sub-0{1,2,3,4,5}; do datalad -C "$dir" create -d^. --force .; done

Step 3: Saving dataset contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Any existing content in your newly created dataset(s) still needs to be saved into its dataset at this point (unless it was already under version control with Git).
This can be done with the :dlcmd:`save` command -- either "in one go" using a plain ``datalad save`` (which saves all untracked files and modifications to a dataset, by default into the dataset annex), or step by step by attaching paths to the ``save`` command.
Make sure to run :dlcmd:`status` frequently.

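A minimal sketch of a step-by-step save; the path and commit message are hypothetical::

  # check what is untracked or modified
  $ datalad status
  # save one directory with an informative commit message
  $ datalad save -m "Add raw data for subject 01" sub-01
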
.. find-out-more:: Save things to Git or to git-annex?

   By default, all dataset contents are saved into :term:`git-annex`.
   Depending on your data and use case, this may or may not be useful for all files.
   Here are a few things to keep in mind:

   - Large files, in particular binary files, should almost always go into :term:`git-annex`. If you have a pure data dataset made up of large files, put it into the dataset annex.
   - Small files, especially if they are text files that undergo frequent modifications (e.g., code, manuscripts, notes), are best put under version control by :term:`Git`.
   - If you plan to publish a dataset to a repository hosting site without annex support, such as :term:`GitHub` or :term:`GitLab`, and do not intend to set up third party storage for annexed contents, be aware that only contents placed in Git will be available to others after cloning your repository. At the same time, be mindful of the file size limits these services impose. The largest file size GitHub allows is 100MB -- a dataset with files exceeding 100MB in size in Git will be rejected by GitHub. :term:`GIN` is an alternative hosting service with annex support, and the `Open Science Framework (OSF) <https://readthedocs.org/projects/datalad-osf>`_ may also be a suitable option to share datasets including their annexed files.

   You can find guidance on how to create configurations for your dataset (which need to be in place and saved prior to saving contents!) in the chapter :ref:`chapter_config`, in particular section :ref:`config2`.

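   As a sketch of two common mechanisms (the paths and patterns are hypothetical examples): a ``largefiles`` rule in a ``.gitattributes`` file, or an explicit ``--to-git`` save::

     # keep everything under code/ in Git via a .gitattributes rule
     $ echo '* annex.largefiles=nothing' > code/.gitattributes
     $ datalad save -m "Instruct git-annex to keep code in Git" code/.gitattributes
     # alternatively, save selected files directly to Git
     $ datalad save --to-git -m "Add analysis script" code/script.py
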
.. importantnote:: Create desired subdatasets first

   Be mindful during saving if you have a directory that should hold more, yet uncreated, datasets down its hierarchy, as a plain ``datalad save`` will save *all* files and directories to the dataset!
   It is best to first create all subdatasets, and only then save their contents.

If you are operating in a hierarchy of datasets, running a recursive save from the top-most dataset (``datalad save -r``) will save you time: all contents are saved to their respective datasets, and all subdatasets are registered to their respective superdatasets.

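In the ``study`` example from above, such a recursive save could look like this (the commit message is hypothetical)::

  # save everything across the dataset hierarchy in one sweep
  $ datalad -C study save -r -m "Add all existing study contents"
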
Step 4: Rerunning analyses reproducibly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are transforming a complete data analysis into a dataset, you may also want to rerun any computation with DataLad's ``run`` commands.
You can compose any :dlcmd:`run` or :dlcmd:`containers-run` [#f1]_ command to recreate and capture your previous analysis.
Make sure to specify your previous results as ``--output`` in order to unlock them [#f2]_.

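A minimal sketch; the script, input, and output paths are hypothetical::

  $ datalad run -m "Recreate results figure" \
      --input "data/raw/" \
      --output "results/figure1.png" \
      "python code/make_figure.py"
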
Summary
^^^^^^^

Existing projects and analyses can be DataLad-ified with a few standard commands.
Be mindful about dataset sizes and about whether you save contents into Git or git-annex, though, as these choices could potentially spoil your DataLad experience.
The sections :ref:`file system` and :ref:`cleanup` can help you undo unwanted changes, but it's better to do things right from the start than to have to fix them up afterwards.
If you can, read up on the DataLad Basics to understand what you are doing, and create a backup in case things do not go as planned in your first attempts.

.. rubric:: Footnotes

.. [#f1] Prior to using a software container, install the :ref:`datalad-container <extensions_intro>` extension and add the container with the :dlcmd:`containers-add` command. You can find a concrete data analysis example with ``datalad-container`` in the section :ref:`containersrun`.

.. [#f2] If you are unfamiliar with ``datalad run``, please work through chapter :ref:`chapter_run` first.