
.. _text2git:

Data safety
-----------

Later in the day, after seeing and solving so many DataLad error messages,
you fall into bed, tired. Just as you are about to fall asleep, a thought
crosses your mind:

"I now know that tracked content in a dataset is protected by :term:`git-annex`.
Whenever tracked contents are ``saved``, they get locked and should not be
modifiable. But... what about the notes that I have been taking since the first day?
Shouldn't I need to unlock them before I can modify them? And also the script!
I was able to modify it despite giving it to DataLad to track, with
no "permission denied" errors whatsoever! How does that work?"

That night, though, your question stays unanswered and you fall into a restless
sleep filled with bad dreams about "permission denied" errors. The next day you are
the first student at your lecturer's office hours.

"Oh, you are really attentive. This is a great question!" our lecturer starts
to explain.

.. figure:: ../artwork/src/teacher.svg
   :width: 50%

.. index:: ! dataset procedure; text2git

Do you remember that we created the ``DataLad-101`` dataset with a
specific configuration template? It was the ``-c text2git`` option we
provided at the beginning of :ref:`createDS`. It is because of this configuration
that we can modify ``notes.txt`` without unlocking its content first.
The second commit message in our dataset's history summarizes this (outputs are shortened):

.. runrecord:: _examples/DL-101-114-101
   :language: console
   :workdir: dl-101
   :emphasize-lines: 3
   :lines: 1-10
   :realcommand: cd DataLad-101 && git log --reverse --oneline
   :notes: Confusing: Why could we modify the tsv file without unlocking? The reason is in the dataset configuration with text2git
   :cast: 03_git_annex_basics

   $ git log --reverse --oneline

Instead of giving text files such as your notes or your script
to git-annex, the dataset stores them in :term:`Git`.
But what does it mean if files are in Git instead of git-annex?
Well, procedurally it means that everything that is stored in git-annex is
content-locked, and everything that is stored in Git is not. You can modify
content stored in Git straight away, without unlocking it first.

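
One way to see this difference for yourself is to ask the tools which files
they manage. The following is only a sketch: it assumes that you are in the
root of the ``DataLad-101`` dataset and that ``notes.txt`` exists as in the
earlier sections. No output is shown, because it depends on the state of your
own dataset.

.. code-block:: console

   $ # list annexed files whose content is present locally (managed by git-annex)
   $ git annex find
   $ # a long listing shows whether a file is a regular file or a symbolic link
   $ ls -l notes.txt

A file stored in Git, such as ``notes.txt``, should appear as an ordinary file
and should not be listed by ``git annex find``. Locked annexed files typically
appear as symbolic links that point into ``.git/annex/objects``.
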
.. _fig-gitvsannex:

.. figure:: ../artwork/src/git_vs_gitannex.svg
   :alt: A simplified illustration of content lock in files managed by git-annex.
   :width: 50%

   A simplified overview of the tools that manage data in your dataset.

That's easy enough, and illustrated in :numref:`fig-gitvsannex`.

"So, first of all: If we hadn't provided the ``-c text2git`` argument, text files
would get content-locked, too?"

"Yes, indeed. However, there are also ways to
later change how file content is handled based on its type or size. This can be specified
in the ``.gitattributes`` file, using ``annex.largefiles`` options.
But there will be a lecture on that [#f1]_."

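
As a brief preview, and only as a sketch: rules of this kind are single lines
in ``.gitattributes`` that pair a path pattern with an ``annex.largefiles``
expression. The first rule below is, to our knowledge, what the ``text2git``
procedure writes (check with ``cat .gitattributes`` in your own dataset). The
other two rules, including the ``*.md`` and ``data/`` patterns, are
hypothetical illustrations and not part of ``DataLad-101``.

.. code-block:: text

   # believed to be the text2git rule: annex only non-empty files with binary content
   * annex.largefiles=((mimeencoding=binary)and(largerthan=0))
   # hypothetical: keep every Markdown file in Git, whatever its size
   *.md annex.largefiles=nothing
   # hypothetical: annex any file under data/ that is larger than 100 kilobytes
   data/** annex.largefiles=(largerthan=100kb)

When several lines match the same file, the later line takes precedence, so
the order of the rules matters.
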
"Okay, well, second: Isn't it much easier to just not bother with locking and
unlocking, and have everything 'stored in Git'? Even if :dlcmd:`run` takes care
of unlocking content, I do not see the point of git-annex", you continue.

Here it gets tricky. To begin with the most important and most straightforward fact:
It is not feasible to store
large files in Git, because Git would very quickly run into severe performance
issues. In addition, hosting sites for projects using Git, such as :term:`GitHub` or :term:`GitLab`,
restrict the size of individual files (GitHub, for example, rejects files larger than 100 MB).

For now, we have solved the mystery of why text files can be modified
without unlocking, and this is a small
dent in the vast number of questions that have piled up in our curious
minds. Essentially, git-annex protects your data from accidental modifications
and thus keeps it safe. :dlcmd:`run` commands handle this technical
complexity for you if ``-o/--output`` is specified properly, and
:dlcmd:`unlock` commands can be used to unlock content "by hand" if
modifications are performed outside of a :dlcmd:`run`.

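
The two routes can be sketched as follows. This is a minimal illustration
under stated assumptions, not a recipe from this dataset: the file
``results/summary.csv``, the script ``code/analyze.py``, and the commit
messages are hypothetical placeholders.

.. code-block:: console

   $ # route 1: declare the output so that "datalad run" unlocks it before the command executes
   $ datalad run -m "Recompute summary" --output "results/summary.csv" "python code/analyze.py"
   $ # route 2: unlock by hand, modify the file, then save (saving locks the content again)
   $ datalad unlock results/summary.csv
   $ python code/analyze.py
   $ datalad save -m "Recompute summary by hand"

The first route additionally records the command in the dataset history, which
is why it is usually preferable whenever a command produces the modification.
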
.. index::
   pair: adjusted mode; git-annex concept

But here comes the second, tricky part: There are ways to get rid of locking and
unlocking within git-annex, using so-called :term:`adjusted branch`\es.
Whether this functionality is available and advisable depends on the installed git-annex version, the git-annex version of the repository, and a use-case-dependent weighing of pros and cons.
On Windows systems, this *adjusted mode* is even the *only* mode of operation.
In later sections we will see how to use this feature.
In any case, the next lecture will guide us deeper into git-annex and improve our understanding a little further.

.. rubric:: Footnotes

.. [#f1] If you cannot wait to read about ``.gitattributes`` and other
         configuration files, jump ahead to chapter :ref:`chapter_config`,
         starting with section :ref:`config`.