  1. .. _2-002:
  2. .. _symlink:
  3. Data integrity
  4. --------------
  5. So far, we mastered quite a number of challenges:
  6. Creating and populating a dataset with large and small files, modifying content and saving the changes to history, installing datasets, even as subdatasets within datasets, recording the impact of commands on a dataset with the :dlcmd:`run` and :dlcmd:`rerun` commands, and capturing plenty of :term:`provenance` on the way.
  7. We further noticed that when we modified content in ``notes.txt`` or ``list_titles.sh``, the modified content was in a *text file*.
  8. We learned that this precise type of file, in conjunction with the initial configuration template ``text2git`` we gave to :dlcmd:`create`, is meaningful:
  9. As the text file is stored in Git and not git-annex, no content unlocking is necessary.
  10. As we saw within the demonstrations of :dlcmd:`run`, modifying content of non-text files, such as ``.jpg``\s, typically requires the additional step of *unlocking* file content, either by hand with the :dlcmd:`unlock` command, or within :dlcmd:`run` using the ``-o``/``--output`` flag.
  11. There is one detail about DataLad datasets that we have not covered yet.
  12. It is a crucial component to understanding certain aspects of a dataset, but it is also a potential source of confusion that we want to eradicate.
  13. You might have noticed already that an ``ls -l`` or ``tree`` command in your dataset shows small arrows and quite cryptic paths following each non-text file.
  14. Maybe your shell also displays these files in a different color than text files when listing them.
  15. We'll take a look together, using the ``books/`` directory as an example:
  16. .. index::
  17. pair: no symlinks; on Windows
  18. pair: tree; terminal command
  19. .. windows-wit:: Dataset directories look different on Windows
  20. First of all, the Windows ``tree`` command lists only directories by default, unless you parametrize it with ``/f``.
  21. And, secondly, even if you list the individual files, you would not see the :term:`symlink`\s shown below.
  22. Due to insufficient support for symlinks on Windows, git-annex does not use them.
  23. The :windows-wit:`on git-annex's adjusted mode <ww-adjusted-mode>` has more on that.
  24. .. runrecord:: _examples/DL-101-115-101
  25. :language: console
  26. :workdir: dl-101/DataLad-101
  27. :notes: We have to talk about symlinks now.
  28. :cast: 03_git_annex_basics
  29. $ # in the root of DataLad-101
  30. $ cd books
  31. $ tree
  32. If you do not know what you are looking at,
  33. this looks weird, if not worse: intimidating, wrong, or broken.
  34. First of all: no, **it is all fine**. But let's start with the basics of what is displayed
  35. here to understand it.
  36. The small ``->`` symbol connecting one path (the book's name) to another path (the weird
  37. sequence of characters ending in ``.pdf``) is what is called a
  38. *symbolic link* (short: :term:`symlink`) or *softlink*.
  39. It is a term for any file that contains a reference to another file or directory as
  40. a :term:`relative path` or :term:`absolute path`.
  41. If you use Windows, you are familiar with a related, although more basic concept: a shortcut.
  42. This means that the files that are in the locations in which you saved content
  43. and are named as you named your files (e.g., ``TLCL.pdf``),
  44. do *not actually contain your files' content*:
  45. they just point to the place where the actual file content resides.
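The following is a minimal sketch of this idea outside of any dataset, assuming a POSIX shell. The directory ``store`` and both file names are made up for illustration and have nothing to do with how git-annex names things.

.. code-block:: bash

   # Work in a throwaway directory so that no real dataset is touched.
   workdir=$(mktemp -d)
   cd "$workdir"

   # A regular file that holds the actual content.
   mkdir -p store
   printf 'actual file content\n' > store/content.txt

   # A symlink carrying the name a user would expect to see.
   ln -s store/content.txt notes-link.txt

   # The link itself only records the path of its target ...
   target=$(readlink notes-link.txt)
   echo "link points to: $target"

   # ... but reading through the link yields the target's content.
   content=$(cat notes-link.txt)
   echo "content via link: $content"

Programs that open ``notes-link.txt`` are transparently redirected to ``store/content.txt``, which is why annexed files can be used like ordinary files.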
  46. This sounds weird, and like an unnecessary complication of things. But we will
  47. get to why this is relevant and useful shortly. First, however,
  48. where exactly are the contents of the files you created or saved?
  49. The start of the link path is ``../.git``. The section :ref:`createDS` contained
  50. a note that strongly advised you not to tamper with
  51. (or in the worst case, delete) the ``.git``
  52. repository in the root of any dataset. One reason
  53. why you should not do this is because *this* ``.git`` directory is where all of your file content
  54. is actually stored.
  55. But why is that? We have to talk a bit about git-annex now in order to understand it.
  56. When a file is saved into a dataset to be tracked,
  57. by default -- that is in a dataset created without any configuration template --
  58. DataLad gives this file to git-annex. Exceptions to this behavior can be
  59. defined based on
  60. #. file size
  61. #. and/or path/pattern, and thus, for example, file extensions,
  62. or names, or file types (e.g., text files, as with the
  63. ``text2git`` configuration template).
  64. git-annex, in order to version control the data, takes the file content
  65. and moves it under ``.git/annex/objects`` -- the so-called :term:`object-tree`.
  66. It further renames the file into the sequence of characters you can see
  67. in the path, and in its place
  68. creates a symlink with the original file name, pointing to the new location.
  69. This process is often referred to as a file being *annexed*, and the object
  70. tree is also known as the *annex* of a dataset.
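The three steps described above (move the content, rename it, leave a symlink behind) can be imitated by hand. This is only a rough sketch under stated assumptions: it uses a plain directory called ``objects`` as a stand-in for ``.git/annex/objects``, names the content after its bare MD5 checksum, and assumes GNU ``md5sum`` is available. Real git-annex uses a richer naming scheme and additional directory levels.

.. code-block:: bash

   # Throwaway directory; "objects" is a hypothetical stand-in for the annex.
   workdir=$(mktemp -d)
   cd "$workdir"
   mkdir -p books objects
   printf 'pretend this is a large PDF\n' > books/book.pdf

   # Derive a name from the file content (here simply its MD5 checksum).
   checksum=$(md5sum books/book.pdf | cut -d ' ' -f 1)

   # Move the content into the object store under the derived name ...
   mv books/book.pdf "objects/$checksum.pdf"

   # ... and leave a symlink with the original file name in its place.
   # The target is relative to the directory that contains the link.
   ln -s "../objects/$checksum.pdf" books/book.pdf

   # The listing shows the arrow notation; the content is still readable.
   ls -l books
   cat books/book.pdf

Do not run commands like these inside the ``.git`` directory of an actual dataset; the sketch only illustrates the principle.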
  71. .. index::
  72. pair: elevated storage demand; in adjusted mode
  73. pair: no symlinks; on Windows
  74. pair: adjusted mode; on Windows
  75. .. windows-wit:: File content management on Windows (adjusted mode)
  76. :name: ww-adjusted-mode
  77. :float:
  78. .. include:: topic/adjustedmode-nosymlinks.rst
  79. For a demonstration that this file path is not complete gibberish,
  80. take the target path of any of the book's symlinks and
  81. open it, for example with ``evince <path>``, or any other PDF reader in exchange for ``evince``:
  82. .. runrecord:: _examples/DL-101-115-102
  83. :language: console
  84. :workdir: dl-101/DataLad-101/books
  85. :realcommand: echo "evince $(readlink TLCL.pdf)"
  86. :notes: we can just open the cryptic file path and it works just as any pdf!
  87. :cast: 03_git_annex_basics
  88. Even though the path looks cryptic, it works and opens the file. Whenever you
  89. use a command like ``evince TLCL.pdf``, internally, programs will follow
  90. the same cryptic symlink as the one you have just opened.
  91. But *why* does this symlink-ing happen? Up until now, it still seems like a very
  92. unnecessary, superfluous thing to do, right?
  93. The resulting symlinks that look like
  94. your files but only point to the actual content in ``.git/annex/objects`` are
  95. small in size. An ``ls -lh`` reveals that all of these symlinks have roughly the same,
  96. small size of ~130 Bytes:
  97. .. runrecord:: _examples/DL-101-115-103
  98. :language: console
  99. :workdir: dl-101/DataLad-101/books
  100. :realcommand: ls -lh --time-style=long-iso
  101. :notes: Symlinks are super small in size, just the amount of characters in the symlink!
  102. :cast: 03_git_annex_basics
  103. $ ls -lh
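Why are the symlinks so small, and why do they all have a similar size? On common file systems, the size of a symlink is simply the number of bytes in the path it points to. The sketch below checks this in a throwaway directory, assuming a POSIX shell; the target path is hypothetical and does not need to exist.

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"

   # A made-up target path, loosely shaped like a path into an annex.
   target="../.git/annex/objects/example/target-path.pdf"
   ln -s "$target" book.pdf

   # Number of bytes in the target path.
   path_length=$(printf '%s' "$target" | wc -c | tr -d ' ')

   # Size of the symlink itself: the fifth column of "ls -l".
   link_size=$(ls -l book.pdf | awk '{print $5}')

   echo "target path length: $path_length bytes"
   echo "symlink size:       $link_size bytes"

The size of the file that a link points to plays no role here: a link to a multi-gigabyte file is as small as a link to an empty one.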
  104. Here you can see the reason why content is symlinked: Small file size means that
  105. *Git can handle those symlinks*!
  106. Therefore, instead of large file content, only the symlinks are committed into
  107. Git, and the Git repository thus stays lean. Simultaneously, still, all
  108. files stored in Git as symlinks can point to arbitrarily large files in the
  109. object tree. Within the object tree, git-annex handles file content tracking,
  110. and is busy creating and maintaining appropriate symlinks so that your data
  111. can be version controlled just as any text file.
  112. This comes with two very important advantages:
  113. One, should you have copies of the
  114. same data in different places of your dataset, the symlinks of these files
  115. point to the same place - in order to understand why this is the case, you
  116. will need to read the :find-out-more:`about the object tree <fom-objecttree>`.
  117. Therefore, any amount of copies of a piece of data
  118. is only one single piece of data in your object tree. This, depending on
  119. how much identical file content lies in different parts of your dataset,
  120. can save you much disk space and time.
  121. The second advantage is less intuitive but clear for users familiar with Git.
  122. Compared to copying and deleting huge data files, small symlinks can be written very fast, for example, when switching dataset versions, or :term:`branch`\es.
  123. .. gitusernote:: Speedy branch switches
  124. Switching branches fast, even when they track vast amounts of data, lets you work with data with the same routines as in software development.
  125. This leads to a few conclusions:
  126. The first is that you should not be worried
  127. to see cryptic looking symlinks in your repository -- this is how it should look.
  128. You can read the :ref:`find-out-more on why these paths look so weird <fom-objecttree>` and what all of this has to do with data integrity, if you want to.
  129. It's additional information that can help to establish trust that your data are safely stored and tracked. Understanding more about the object tree and knowing bits of the git-annex basics can make you more confident in working with your datasets.
  130. The second is that it should now be clear to you why the ``.git`` directory
  131. should not be deleted or in any way modified by hand. This place is where
  132. your data are stored, and you can trust git-annex to be better able to
  133. work with the paths in the object tree than you or any other human are.
  134. Lastly, understanding that annexed files in your dataset are symlinked
  135. will be helpful to understand how common file system operations such as
  136. moving, renaming, or copying content translate to dataset modifications
  137. in certain situations. Later in this book, the section :ref:`file system`
  138. will take a closer look at that.
  139. .. _objecttree:
  140. .. index::
  141. pair: key; git-annex concept
  142. .. find-out-more:: Data integrity and annex keys
  143. :name: fom-objecttree
  144. So how do these cryptic paths and names in the object tree come into existence?
  145. It's not malicious intent that leads to these paths and file names - it's checksums.
  146. When a file is annexed, git-annex typically generates a *key* (or :term:`annex key`) from the **file content**.
  147. It uses this key (in part) as a name for the file and as the path
  148. in the object tree.
  149. Thus, the key is associated with the content of the file (the *value*),
  150. and therefore, using this key, file content can be identified.
  151. Most key types contain a :term:`checksum`. This is a string of a fixed number of characters
  152. computed from some input, for example the content of a PDF file,
  153. by a *hash* function.
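  154. For all practical purposes, this checksum *uniquely* identifies a file's content: it is extremely unlikely that two different contents produce the same checksum by accident.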
  155. A hash function will generate the same character sequence for the same file content, and once file content changes, the generated checksum changes, too.
  156. Basing the file name on its contents thus becomes a way of ensuring data integrity:
  157. File content cannot be changed without git-annex noticing, because the file's checksum, and thus its key in its symlink, will change.
  158. Furthermore, if two files have identical checksums, the content in these files is identical.
  159. Consequently, if two files have the same symlink, and thus link to the same file in the object-tree, they are identical in content.
  160. This can save disk space if a dataset contains many identical files: Copies of the same data only need one instance of that content in the object tree, and all copies will symlink to it.
  161. If you want to read more about the computer science basics of hash functions, check out the `Wikipedia page <https://en.wikipedia.org/wiki/Hash_function>`_.
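The two properties of checksums described above, same content gives the same checksum and changed content gives a different one, can be tried out directly. This is a small sketch assuming GNU ``md5sum``; the file names and the helper function ``sum_of`` are invented for this example.

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"

   # Two files with identical content, and one with modified content.
   printf 'identical content\n' > a.txt
   printf 'identical content\n' > copy-of-a.txt
   printf 'identical content, edited\n' > b.txt

   # Hypothetical helper: print only the checksum, not the file name.
   sum_of() { md5sum "$1" | cut -d ' ' -f 1; }

   sum_a=$(sum_of a.txt)
   sum_copy=$(sum_of copy-of-a.txt)
   sum_b=$(sum_of b.txt)

   echo "a.txt:         $sum_a"
   echo "copy-of-a.txt: $sum_copy"
   echo "b.txt:         $sum_b"

The first two checksums match although the file names differ, because only the content enters the computation. The third differs because the content was edited.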
  162. .. runrecord:: _examples/DL-101-115-104
  163. :language: console
  164. :workdir: dl-101/DataLad-101/books
  165. :realcommand: ls -lh --time-style=long-iso TLCL.pdf
  166. :notes: how does the symlink relate to the shasum of the file?
  167. :cast: 03_git_annex_basics
  168. $ # take a look at the last part of the target path:
  169. $ ls -lh TLCL.pdf
  170. Let's take a closer look at the structure of the symlink.
  171. The key from the hash function is the last part of the name of the file the symlink links to (in which the actual data content is stored).
  172. .. index::
  173. pair: compute checksum; in a terminal
  174. .. runrecord:: _examples/DL-101-115-105
  175. :language: console
  176. :workdir: dl-101/DataLad-101/books
  177. :notes: let's look at how the shasum would look like
  178. :cast: 03_git_annex_basics
  179. $ # compare it to the checksum (here of type md5sum) of the PDF file and the subdirectory name
  180. $ md5sum TLCL.pdf
  181. The extension (e.g., ``.pdf``) is appended because some programs require it and would fail if they worked not with the symlink but directly with the file that it points to.
  182. Right at the beginning, the symlink starts with two directories just after ``.git/annex/objects/``,
  183. consisting of two letters each.
  184. These two letters are derived from the md5sum of the key, and their sole purpose is to avoid issues with too many files in one directory (which is a situation that certain file systems have problems with).
  185. The next subdirectory in the symlink helps to prevent accidental deletions and changes, as it does not have write :term:`permissions`, so that users cannot modify any of its underlying contents.
  186. This is the reason that annexed files need to be unlocked prior to modifications, and this information will be helpful to understand some file system management operations such as removing files or datasets. Section :ref:`file system` takes a look at that.
  187. The next part of the symlink contains the actual checksum.
  188. There are different :term:`annex key` backends that use different checksums.
  189. Depending on which is used, the resulting :term:`checksum` has a certain length and structure, and the first part of the key in the symlink actually states which hash function is used.
  190. By default, DataLad uses the ``MD5E`` git-annex backend (the ``E`` adds file extensions to annex keys), but should you want to, you can change this default to `one of many other types <https://git-annex.branchable.com/backends>`_.
  191. The reason why MD5E is used is the relatively short length of the underlying MD5 checksums -- which facilitates cross-platform compatibility for sharing datasets also with users on operating systems that have restrictions on total path length, such as Windows.
  192. The one remaining unidentified bit in the file name is the one after the checksum identifier.
  193. This part is the size of the content in bytes.
  194. An annexed file in the object tree thus has a file name following this structure
  195. (but see `the git-annex documentation on keys <https://git-annex.branchable.com/internals/key_format>`_ for the complete details):
  196. ``<backend type>-s<size>--<checksum>.<extension>``
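To make the pattern tangible, the sketch below assembles a name of this shape for a small stand-in file, assuming GNU ``md5sum`` and the ``MD5E`` backend described above. It only mirrors the pattern shown here; the git-annex documentation on keys remains the authoritative description, and the file ``book.pdf`` is made up for the example.

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"
   printf 'pretend this is a PDF\n' > book.pdf

   # Size of the content in bytes.
   size=$(wc -c < book.pdf | tr -d ' ')

   # MD5 checksum of the content.
   checksum=$(md5sum book.pdf | cut -d ' ' -f 1)

   # <backend type>-s<size>--<checksum>.<extension>
   key="MD5E-s${size}--${checksum}.pdf"
   echo "$key"

Comparing the printed name with the last part of a symlink target in your own dataset shows the same building blocks in the same order.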
  197. You now know a great deal more about git-annex and the object tree.
  198. Maybe you are as amazed as we are about some of the ingenuity used behind the scenes.
  199. Even more mesmerizing things about git-annex can be found in its `documentation <https://git-annex.branchable.com/git-annex>`_.
  200. .. index:: ! broken symlink, ! symlink; broken
  201. .. _wslfiles:
  202. Broken symlinks
  203. ^^^^^^^^^^^^^^^
  204. Whenever a symlink points to a non-existent target, this symlink is called
  205. *broken*, and opening the symlink would not work as it does not resolve. The
  206. section :ref:`file system` will give a thorough demonstration of how symlinks can
  207. break, and how one can fix them again. Even though *broken* sounds
  208. troublesome, most types of broken symlinks you will encounter can be fixed,
  209. or are not problematic. At this point, you actually have already seen broken
  210. symlinks: Back in section :ref:`installds` we explored
  211. the file hierarchy in an installed subdataset that contained many annexed
  212. ``mp3`` files. Upon the initial :dlcmd:`clone`, the annexed files were not present locally.
  213. Instead, their symlinks (stored in Git) existed and allowed you to explore which
  214. files' contents could be retrieved. These symlinks point to nothing, though, as
  215. the content isn't yet present locally, and are thus *broken*. This state,
  216. however, is not problematic at all. Once the content is retrieved via
  217. :dlcmd:`get`, the symlink is functional again.
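The life cycle of such a link can be reproduced without DataLad. In this sketch, assuming a POSIX shell, a link is created before its target exists, and the target is then put in place by hand as a stand-in for content retrieval. All names are invented for illustration.

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"
   mkdir -p objects

   # Create a link whose target is not (yet) present.
   ln -s objects/song.mp3 song.mp3

   # "-L" tests the link itself, "-e" follows it to the target.
   if [ -L song.mp3 ] && [ ! -e song.mp3 ]; then
       state_before="broken"
   else
       state_before="ok"
   fi

   # Stand-in for retrieving content: place a file where the link points.
   printf 'audio data\n' > objects/song.mp3

   if [ -e song.mp3 ]; then
       state_after="ok"
   else
       state_after="broken"
   fi

   echo "before: $state_before, after: $state_after"

The link itself is never modified; it starts to resolve as soon as its target appears.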
  218. Nevertheless, it may be important to know that some tools that you would expect to work in a dataset with not yet retrieved file contents can encounter unintuitive problems.
  219. Some **file managers** (e.g., macOS's Finder) may not display broken symlinks.
  220. In these cases, it will be impossible to browse and explore the file hierarchy of not-yet-retrieved files with the file manager.
  221. You can make sure to always be able to see the file hierarchy in two separate ways:
  222. Upgrade your file manager to display file types in DataLad datasets (e.g., the `git-annex-turtle extension <https://github.com/andrewringler/git-annex-turtle>`_ for Finder), or use the `DataLad Gooey <https://docs.datalad.org/projects/gooey>`_ to browse datasets.
  223. Alternatively, use the :shcmd:`ls` command in a terminal instead of a file manager GUI.
  224. Other tools may be more specialized, smaller, or domain-specific, and may fail to correctly work with broken symlinks, or display unhelpful error messages when handling them, or require additional flags to modify their behavior.
  225. When encountering unexpected behavior or failures, try to keep in mind that a dataset without retrieved content appears to be a pile of broken symlinks to a range of tools, consult a tool's documentation with regard to symlinks, and check whether data retrieval fixes persisting problems.
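When a tool misbehaves, it can help to find out which links in a directory currently do not resolve. One possible way, sketched here with POSIX ``find`` in a throwaway directory with made-up file names, is to list every symlink for which ``test -e`` fails:

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"

   # One link that resolves and one that does not.
   printf 'present\n' > real.txt
   ln -s real.txt present-link.txt
   ln -s missing-target.txt dangling-link.txt

   # Print symlinks whose target does not exist.
   # "test -e" follows the link, so it fails for dangling ones.
   broken=$(find . -type l ! -exec test -e {} \; -print)
   echo "$broken"

In a dataset, the files listed this way are candidates for content retrieval before running the tool again.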
  226. A last special case on symlinks exists if you are using DataLad on the Windows Subsystem for Linux.
  227. If so, please take a look into the Windows Wit below.
  228. .. index::
  229. pair: access WSL2 symlinked files; on Windows
  230. single: WSL2; symlink access
  231. pair: log; Git command
  232. .. windows-wit:: Accessing symlinked files from your Windows system
  233. .. include:: topic/wsl2-symlinkaccess.rst
  234. Finally, if you are still in the ``books/`` directory, go back into the root of
  235. the superdataset.
  236. .. runrecord:: _examples/DL-101-115-106
  237. :workdir: dl-101/DataLad-101/books
  238. :language: console
  239. :notes: understanding how symlinks work will help you with everyday file management operations.
  240. :cast: 03_git_annex_basics
  241. $ cd ../