  1. .. _FAQ:
  2. Frequently asked questions
  3. --------------------------
  4. This section answers frequently asked questions about high-level DataLad
  5. concepts or commands. If you have a question you want to see answered in here,
  6. `please create an issue <https://github.com/datalad-handbook/book/issues/new>`_
  7. or a `pull request <https://handbook.datalad.org/contributing.html>`_.
  8. For a series of specialized command snippets for various use cases, please see
  9. section :ref:`gists`.
  10. What is Git?
  11. ^^^^^^^^^^^^
  12. Git is a free and open source distributed version control system. In a
  13. directory that is initialized as a Git repository, it can track small-sized
  14. files and the modifications done to them.
  15. Git thinks of its data like a *series of snapshots* -- it basically takes a
  16. picture of what all files look like whenever a modification in the repository
  17. is saved. It is a powerful and yet small and fast tool with many features such
  18. as *branching and merging* for independent development, *checksumming* of
  19. contents for integrity, and *easy collaborative workflows* thanks to its
  20. distributed nature.
  21. DataLad uses Git underneath the hood. Every DataLad dataset is a Git
  22. repository, and you can use any Git command within a DataLad dataset. Based
  23. on the configurations in ``.gitattributes``, file content can be version
  24. controlled by Git or managed by git-annex, based on path pattern, file types,
  25. or file size. The section :ref:`config2` details how these configurations work.
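As a rough illustration of such rules, the sketch below writes two hypothetical ``annex.largefiles`` entries into the ``.gitattributes`` file of a throwaway Git repository and asks Git which rule applies to a given path. The size threshold, the file pattern, and the file names are assumptions chosen for this example; they are not DataLad defaults.

.. code-block:: bash

   # Illustrative only: a throwaway repository with two hypothetical rules.
   ds="$(mktemp -d)"
   git init -q "$ds"

   # Rule 1: by default, annex only files larger than 100 kB.
   # Rule 2: never annex Markdown files, so that Git tracks them directly.
   printf '%s\n' \
     '* annex.largefiles=(largerthan=100kb)' \
     '*.md annex.largefiles=nothing' > "$ds/.gitattributes"

   # Ask Git which rule applies to each path (later lines take precedence).
   git -C "$ds" check-attr annex.largefiles -- README.md data.bin

With these rules, ``README.md`` resolves to ``nothing`` and would be tracked by Git, while ``data.bin`` resolves to the size-based expression that git-annex evaluates when the file is saved.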
  26. `This chapter <https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F>`_
  27. gives a comprehensive overview of what Git is.
  28. Where is Git's "staging area" in DataLad datasets?
  29. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  30. As mentioned in :ref:`populate`, a local version control workflow with
  31. DataLad "skips" the staging area (which is typical for Git workflows) from the
  32. user's point of view: a single :dlcmd:`save` stages and commits changes in one step.
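As a sketch of what this looks like in practice, the following session shows the plain Git route next to the DataLad alternative. The file name ``notes.txt`` and the commit message are placeholders, and only one of the two routes would be used for a given change.

.. code-block:: console

   $ # plain Git: stage the change explicitly, then commit it
   $ git add notes.txt
   $ git commit -m "Add notes"
   $ # DataLad alternative: one command records the change
   $ datalad save -m "Add notes" notes.txt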
  33. What is git-annex?
  34. ^^^^^^^^^^^^^^^^^^
  35. git-annex (`https://git-annex.branchable.com/ <https://git-annex.branchable.com>`_)
  36. is a distributed file synchronization system written by Joey Hess. It can
  37. share and synchronize large files independent from a commercial service or a
  38. central server. It does so by managing all file *content* in a separate
  39. directory (the *annex*, *object tree*, or *key-value-store* in ``.git/annex/objects/``),
  40. and placing only file names and
  41. metadata into version control by Git. Among many other features, git-annex
  42. can ensure a sufficient number of file copies to prevent accidental data loss and
  43. enables a variety of data transfer mechanisms.
  44. DataLad uses git-annex underneath the hood for file content tracking and
  45. transport logistics. git-annex offers an astonishing range of functionality
  46. that DataLad tries to expose in full. In addition, any DataLad dataset
  47. (with the exception of datasets configured to be pure Git repositories) is
  48. fully compatible with git-annex -- you can use any git-annex command inside a
  49. DataLad dataset.
  50. The chapter :ref:`chapter_gitannex` can give you more insights into how git-annex
  51. takes care of your data. git-annex's `website <https://git-annex.branchable.com>`_
  52. can give you a complete walk-through and detailed technical background
  53. information.
  54. What does DataLad add to Git and git-annex?
  55. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  56. DataLad sits on top of Git and git-annex and tries to integrate and expose
  57. their functionality fully. While DataLad thus is a "thin layer" on top of
  58. these tools and tries to minimize the use of unique/idiosyncratic functionality,
  59. it also tries to simplify working with repositories and adds a range of useful concepts
  60. and functions:
  61. - Both Git and git-annex are made to work with a single repository at a time.
  62. For example, while nesting pure Git repositories is possible via Git
  63. submodules (that DataLad also uses internally), *cleaning up* after
  64. placing a random file somewhere into this repository hierarchy can be very
  65. painful. A key advantage that DataLad brings to the table is that it makes
  66. the boundaries between repositories vanish from a user's point
  67. of view. Most core commands have a ``--recursive`` option that will discover
  68. and traverse any subdatasets and do-the-right-thing.
  69. Whereas Git and git-annex would require the caller to first ``cd`` to the target
  70. repository, DataLad figures out which repository the given paths belong to and
  71. then works within that repository.
  72. :dlcmd:`save . --recursive` will solve the subdataset problem above,
  73. for example, no matter what was changed/added, no matter where in a tree
  74. of subdatasets.
  75. - DataLad provides users with the ability to act on "virtual" file paths. If
  76. software needs data files that are carried in a subdataset (in Git terms:
  77. submodule) for a computation or test, a ``datalad get`` will discover if
  78. there are any subdatasets to install at a particular version to eventually
  79. provide the file content.
  80. - DataLad adds metadata facilities for metadata extraction in various flavors,
  81. and can store extracted and aggregated metadata under ``.datalad/metadata``.
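The first two points can be sketched with a short session, run from the root of a superdataset. The commit message and the file path are hypothetical; the path is assumed to lie inside a subdataset that is not yet installed.

.. code-block:: console

   $ # save all modifications, in the superdataset and in every subdataset
   $ datalad save -m "Update analysis inputs" --recursive
   $ # request a single file; subdatasets on the way to it are installed as needed
   $ datalad get inputs/rawdata/sub-01/anat.nii.gz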
  82. .. todo::
  83. more here.
  84. Does DataLad host my data?
  85. ^^^^^^^^^^^^^^^^^^^^^^^^^^
  86. No, DataLad manages your data, but it does not host it. When publishing a
  87. dataset with annexed data, you will need to find a place that the large file
  88. content can be stored in -- this could be a web server, a cloud service such
  89. as `Dropbox <https://www.dropbox.com>`_, an S3 bucket, or many other storage
  90. solutions -- and set up a publication dependency on this location.
  91. This gives you all the freedom to decide where your data lives, and who can
  92. have access to it. Once this setup is complete, publishing and accessing a
  93. published dataset and its data are as easy as if it were on your own
  94. machine.
  95. You can find a typical workflow in the chapter :ref:`chapter_thirdparty`.
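A minimal sketch of such a publication dependency, assuming that a sibling for the Git history (here called ``origin``) and a sibling for annexed file content (here called ``mystorage``) are already configured. Both sibling names are placeholders.

.. code-block:: console

   $ # declare that pushes to "origin" depend on a prior push to "mystorage"
   $ datalad siblings configure -s origin --publish-depends mystorage
   $ # file content now goes to "mystorage" first, then the history goes to "origin"
   $ datalad push --to origin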
  96. How does GitHub relate to DataLad?
  97. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  98. DataLad can make good use of GitHub, provided you have arranged storage for your
  99. large files elsewhere. You can make DataLad publish file content to one location
  100. and afterwards automatically push an update to GitHub, such that
  101. users can install directly from GitHub and seemingly also obtain large file
  102. content from GitHub. GitHub is also capable of resolving submodule/subdataset
  103. links to other GitHub repos, which makes for a nice UI.
  104. Does DataLad scale to large dataset sizes?
  105. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  106. In general, yes. The largest dataset managed by DataLad at this point is the `Human Connectome Project <http://www.humanconnectomeproject.org>`_ data, encompassing 80 terabytes of data in 15 million files, and larger projects (up to 500 TB) are actively being worked on.
  107. The chapter :ref:`chapter_gobig` is a guide to "beyond-household-quantity datasets".
  108. What is the difference between a superdataset, a subdataset, and a dataset?
  109. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  110. Conceptually and technically, there is no difference between a dataset, a
  111. subdataset, or a superdataset. The only aspect that makes a dataset a sub- or
  112. superdataset is whether it is *registered* in another dataset (by means of an entry in
  113. ``.gitmodules``, created automatically by an appropriate ``datalad
  114. install -d`` or ``datalad create -d`` command) or contains registered datasets.
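The registration can be inspected directly. The following sketch assumes it is run in the root of an existing dataset; the subdataset name ``inputs`` is a placeholder.

.. code-block:: console

   $ # create a new dataset and register it in the dataset in the current directory
   $ datalad create -d . inputs
   $ # the registration is a plain-text entry in .gitmodules
   $ cat .gitmodules
   $ # list all subdatasets registered in this dataset
   $ datalad subdatasets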
  115. How can I convert/import/transform an existing Git or git-annex repository into a DataLad dataset?
  116. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  117. You can transform any existing Git or git-annex repository of yours into a
  118. DataLad dataset by running:
  119. .. code-block:: console
  120. $ datalad create -f
  121. inside of it. Afterwards, you may want to tweak settings in ``.gitattributes``
  122. according to your needs (see sections :ref:`config` and :ref:`config2` for
  123. additional insights on this).
  124. The chapter :ref:`chapter_retro` guides you through transitioning an existing project into DataLad.
  125. How can I convert an existing DataLad dataset with annexed data back to a plain Git repository?
  126. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  127. If you decide to stop using git-annex or DataLad, or if you want to turn an annex repo back into a Git repo, you can do so with the ``git annex uninit`` command.
  128. The section :ref:`uninit` contains more details.
  129. How can I cite DataLad?
  130. ^^^^^^^^^^^^^^^^^^^^^^^
  131. Please cite the official paper on DataLad:
  132. Halchenko et al. (2021). DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262, `https://doi.org/10.21105/joss.03262 <https://doi.org/10.21105/joss.03262>`_.
  133. .. _dataset_textblock:
  134. How can I help others get started with a shared dataset?
  135. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  136. If you want to share your dataset with users that are not already familiar with
  137. DataLad, it is helpful to include some information on how to interact with
  138. DataLad datasets in your dataset's ``README`` (or similar) file.
  139. If you do not want to invent a description yourself, you can run
  140. :dlcmd:`add-readme` in your dataset, and have one added automatically.
  141. .. only:: html
  142. Below, we provide a standard text block that you can use (and adapt as you
  143. wish) for such purposes.
  144. .. find-out-more:: Textblock in .rst format:
  145. .. code-block:: rst
  146. DataLad datasets and how to use them
  147. ------------------------------------
  148. This repository is a `DataLad <https://www.datalad.org>`__ dataset. It provides
  149. fine-grained data access down to the level of individual files, and allows for
  150. tracking future updates. In order to use this repository for data retrieval,
  151. `DataLad <https://www.datalad.org>`_ is required.
  152. It is a free and open source command line tool, available for all
  153. major operating systems, and builds on Git and `git-annex
  154. <https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and
  155. version controlling collections of large files. You can find information on
  156. how to install DataLad at `handbook.datalad.org/intro/installation.html
  157. <https://handbook.datalad.org/intro/installation.html>`_.
  158. Get the dataset
  159. ^^^^^^^^^^^^^^^
  160. A DataLad dataset can be ``cloned`` by running:
  161. .. code-block:: bash
  162. datalad clone <url>
  163. Once a dataset is cloned, it is a light-weight directory on your local machine.
  164. At this point, it contains only small metadata and information on the
  165. identity of the files in the dataset, but not actual *content* of the
  166. (sometimes large) data files.
  167. Retrieve dataset content
  168. ^^^^^^^^^^^^^^^^^^^^^^^^
  169. After cloning a dataset, you can retrieve file contents by running:
  170. .. code-block:: bash
  171. datalad get <path/to/directory/or/file>
  172. This command will trigger a download of the files, directories, or
  173. subdatasets you have specified.
  174. DataLad datasets can contain other datasets, so-called *subdatasets*. If you
  175. clone the top-level dataset, subdatasets do not yet contain metadata and
  176. information on the identity of files, but appear to be empty directories. In
  177. order to retrieve file availability metadata in subdatasets, run:
  178. .. code-block:: bash
  179. datalad get -n <path/to/subdataset>
  180. Afterwards, you can browse the retrieved metadata to find out about
  181. subdataset contents, and retrieve individual files with ``datalad get``. If you
  182. use ``datalad get <path/to/subdataset>``, all contents of the subdataset will
  183. be downloaded at once.
  184. Stay up-to-date
  185. ^^^^^^^^^^^^^^^
  186. DataLad datasets can be updated. The command ``datalad update`` will *fetch*
  187. updates and store them on a different branch (by default
  188. ``remotes/origin/main``). Running
  189. .. code-block:: bash
  190. datalad update --merge
  191. will *pull* available updates and integrate them in one go.
  192. Find out what has been done
  193. ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  194. DataLad datasets contain their history in the ``git log``.
  195. By running ``git log`` (or a tool that displays Git history) in the dataset or on
  196. specific files, you can find out what has been done to the dataset or to individual files
  197. by whom, and when.
  198. More information
  199. ^^^^^^^^^^^^^^^^
  200. More information on DataLad and how to use it can be found in the DataLad Handbook at
  201. `handbook.datalad.org <https://handbook.datalad.org>`_. The
  202. chapter "DataLad datasets" can help you to familiarize yourself with the
  203. concept of a dataset.
  204. .. find-out-more:: Textblock in markdown format
  205. .. code-block:: md
  206. [![made-with-datalad](https://www.datalad.org/badges/made_with.svg)](https://datalad.org)
  207. ## DataLad datasets and how to use them
  208. This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
  209. fine-grained data access down to the level of individual files, and allows for
  210. tracking future updates. In order to use this repository for data retrieval,
  211. [DataLad](https://www.datalad.org/) is required. It is a free and
  212. open source command line tool, available for all major operating
  213. systems, and builds on Git and [git-annex](https://git-annex.branchable.com/)
  214. to allow sharing, synchronizing, and version controlling collections of
  215. large files. You can find information on how to install DataLad at
  216. [handbook.datalad.org/intro/installation.html](https://handbook.datalad.org/intro/installation.html).
  217. ### Get the dataset
  218. A DataLad dataset can be `cloned` by running
  219. ```
  220. datalad clone <url>
  221. ```
  222. Once a dataset is cloned, it is a light-weight directory on your local machine.
  223. At this point, it contains only small metadata and information on the
  224. identity of the files in the dataset, but not actual *content* of the
  225. (sometimes large) data files.
  226. ### Retrieve dataset content
  227. After cloning a dataset, you can retrieve file contents by running
  228. ```
  229. datalad get <path/to/directory/or/file>
  230. ```
  231. This command will trigger a download of the files, directories, or
  232. subdatasets you have specified.
  233. DataLad datasets can contain other datasets, so-called *subdatasets*.
  234. If you clone the top-level dataset, subdatasets do not yet contain
  235. metadata and information on the identity of files, but appear to be
  236. empty directories. In order to retrieve file availability metadata in
  237. subdatasets, run
  238. ```
  239. datalad get -n <path/to/subdataset>
  240. ```
  241. Afterwards, you can browse the retrieved metadata to find out about
  242. subdataset contents, and retrieve individual files with `datalad get`.
  243. If you use `datalad get <path/to/subdataset>`, all contents of the
  244. subdataset will be downloaded at once.
  245. ### Stay up-to-date
  246. DataLad datasets can be updated. The command `datalad update` will
  247. *fetch* updates and store them on a different branch (by default
  248. `remotes/origin/main`). Running
  249. ```
  250. datalad update --merge
  251. ```
  252. will *pull* available updates and integrate them in one go.
  253. ### Find out what has been done
  254. DataLad datasets contain their history in the `git log`.
  255. By running `git log` (or a tool that displays Git history) in the dataset or on
  256. specific files, you can find out what has been done to the dataset or to individual files
  257. by whom, and when.
  258. ### More information
  259. More information on DataLad and how to use it can be found in the DataLad Handbook at
  260. [handbook.datalad.org](https://handbook.datalad.org/index.html). The chapter
  261. "DataLad datasets" can help you to familiarize yourself with the concept of a dataset.
  262. .. find-out-more:: Textblock without formatting
  263. .. code-block:: text
  264. DataLad datasets and how to use them
  265. This repository is a DataLad (https://www.datalad.org) dataset. It provides
  266. fine-grained data access down to the level of individual files, and allows
  267. for tracking future updates. In order to use this repository for data
  268. retrieval, DataLad is required. It is a free and open source command line
  269. tool, available for all major operating systems, and builds on Git and
  270. git-annex (https://git-annex.branchable.com) to allow sharing,
  271. synchronizing, and version controlling collections of large files. You can
  272. find information on how to install DataLad at
  273. https://handbook.datalad.org/intro/installation.html.
  274. Get the dataset
  275. A DataLad dataset can be "cloned" by running 'datalad clone <url>'.
  276. Once a dataset is cloned, it is a light-weight directory on your local
  277. machine.
  278. At this point, it contains only small metadata and information on the
  279. identity of the files in the dataset, but not actual *content* of the
  280. (sometimes large) data files.
  281. Retrieve dataset content
  282. After cloning a dataset, you can retrieve file contents by running
  283. 'datalad get <path/to/directory/or/file>'
  284. This command will trigger a download of the files, directories, or
  285. subdatasets you have specified.
  286. DataLad datasets can contain other datasets, so-called "subdatasets".
  287. If you clone the top-level dataset, subdatasets do not yet contain
  288. metadata and information on the identity of files, but appear to be
  289. empty directories. In order to retrieve file availability metadata in
  290. subdatasets, run 'datalad get -n <path/to/subdataset>'
  291. Afterwards, you can browse the retrieved metadata to find out about
  292. subdataset contents, and retrieve individual files with 'datalad get'.
  293. If you use 'datalad get <path/to/subdataset>', all contents of the
  294. subdataset will be downloaded at once.
  295. Stay up-to-date
  296. DataLad datasets can be updated. The command 'datalad update' will
  297. "fetch" updates and store them on a different branch (by default
  298. 'remotes/origin/main'). Running 'datalad update --merge' will "pull"
  299. available updates and integrate them in one go.
  300. Find out what has been done
  301. DataLad datasets contain their history in the Git log.
  302. By running 'git log' (or a tool that displays Git history) in the dataset or on
  303. specific files, you can find out what has been done to the dataset or to individual files
  304. by whom, and when.
  305. More information
  306. More information on DataLad and how to use it can be found in the DataLad Handbook at
  307. https://handbook.datalad.org/index.html. The chapter "DataLad datasets"
  308. can help you to familiarize yourself with the concept of a dataset.
  309. What is the difference between DataLad, Git LFS, and Flywheel?
  310. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  311. `Flywheel <https://flywheel.io>`_ is an informatics platform for biomedical
  312. research and collaboration.
  313. `Git Large File Storage <https://github.com/git-lfs/git-lfs>`_ (Git LFS) is a
  314. command line tool that extends Git with the ability to manage large files. In
  315. that respect, it appears similar to git-annex.
  316. .. todo::
  317. this.
  318. A more elaborate delineation from related solutions can be found in the DataLad
  319. `developer documentation <https://docs.datalad.org/related.html>`_.
  320. What is the difference between DataLad and DVC?
  321. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  322. `DVC <https://dvc.org>`_ is a version control system for machine learning projects.
  323. We have compared the two tools in a dedicated handbook section, :ref:`dvc`.
  324. DataLad version-controls my large files -- great. But how much is saved in total?
  325. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  326. .. todo::
  327. this.
  328. .. _copydata:
  329. How can I copy data out of a DataLad dataset?
  330. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  331. Moving or copying data out of a DataLad dataset is always possible and works in
  332. many cases just like in any regular directory. The only
  333. caveat exists in the case of annexed data: If file content is managed with
  334. git-annex and stored in the :term:`object-tree`, what *appears* to be the
  335. file in the dataset is merely a symlink (please read section :ref:`symlink`
  336. for details). Moving or copying this symlink will not yield the
  337. intended result -- instead you will have a broken symlink outside of your
  338. dataset.
  339. When using the terminal command ``cp`` [#f1]_, it is sufficient to use the
  340. ``-L``/``--dereference`` option. This will follow symbolic links, and make
  341. sure that content gets copied instead of symlinks.
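The effect of ``-L`` can be tried without a dataset. The sketch below only imitates the layout described above: ``data.bin`` is a symlink to a read-only file, similar to how annexed content appears in a dataset. All directory and file names are hypothetical stand-ins.

.. code-block:: bash

   # Illustrative stand-in for a dataset with one "annexed" file.
   work="$(mktemp -d)"
   mkdir -p "$work/dataset/objects" "$work/outside"
   printf 'payload\n' > "$work/dataset/objects/key"
   chmod a-w "$work/dataset/objects/key"
   ln -s objects/key "$work/dataset/data.bin"

   # -L dereferences the symlink, so the copy is a regular file with the content.
   cp -L "$work/dataset/data.bin" "$work/outside/data.bin"

   # The copy inherits the read-only mode; add write permission to edit it.
   chmod +w "$work/outside/data.bin"

Without ``-L``, many ``cp`` invocations (for example with ``-a`` or ``-P``) would copy the link itself, which no longer resolves outside of the dataset.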
  342. Remember that if you are copying some annexed content out of a dataset without
  343. unlocking it first, you will only have "read" :term:`permissions` on the files you have just
  344. copied. Therefore, you can:
  345. - either unlock the files before copying them out,
  346. - or copy them and then use the command ``chmod`` to be able to edit the file.
  347. .. code-block:: console
  348. $ # this will give you 'write' permission on the file
  349. $ chmod +w filename
  350. If you are not familiar with how ``chmod`` works (or if you forgot; let's be honest, we
  351. all google it sometimes), this is `a nice tutorial <https://bids.github.io/2015-06-04-berkeley/shell/07-perm.html>`_.
  352. With tools other than ``cp`` (e.g., graphical file managers), to copy or move
  353. annexed content, make sure it is *unlocked* first:
  354. After a :dlcmd:`unlock`, copying and moving contents will work fine.
  355. A subsequent :dlcmd:`save` in the dataset will annex the content
  356. again.
  357. Is there Python 2 support for DataLad?
  358. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  359. No, Python 2 support was dropped in
  360. `September 2019 <https://github.com/datalad/datalad/pull/3629>`_.
  361. Is there a graphical user interface for DataLad?
  362. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  363. Yes, a dedicated :term:`DataLad extension`, ``datalad-gooey``, provides a graphical user interface for DataLad.
  364. You can read more about it in the section :ref:`gooey`.
  365. How does DataLad interface with OpenNeuro?
  366. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  367. `OpenNeuro <https://openneuro.org>`_ is a free and open platform for sharing MRI,
  368. MEG, EEG, iEEG, and ECoG data. It publishes hosted data as DataLad datasets on
  369. :term:`GitHub`. The entire collection can be found at
  370. `github.com/OpenNeuroDatasets <https://github.com/OpenNeuroDatasets>`_. You can
  371. obtain the datasets just as any other DataLad datasets with :dlcmd:`clone`
  372. or :dlcmd:`install`.
  373. There is more info about this in the :ref:`OpenNeuro Quickstart Guide <openneuro>`.
  374. .. _bidsvalidator:
  375. BIDS validator issues in datasets with missing file content
  376. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  377. As outlined in section :ref:`symlink`, all unretrieved files in datasets are broken symlinks.
  378. This is desired, and not a problem per se, but some tools, among them the `BIDS validator <https://github.com/bids-standard/bids-validator>`_, can be confused by this.
  379. Should you attempt to validate a dataset in which all or some file contents are missing, for example after cloning a dataset or after dropping file contents, the validator may fail to report on the validity of the complete dataset or the specific unretrieved files.
  380. If you aim for a complete validation of your dataset, re-do the validation after retrieving all necessary file contents.
  381. If you only aim to validate file names and structure, invoke the BIDS validator with the additional flags ``--ignoreNiftiHeaders`` and ``--ignoreSymlinks``.
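Such a call could look like the following sketch, which assumes that the command line version of the validator is installed as ``bids-validator``. The dataset path is a placeholder.

.. code-block:: console

   $ # validate names and structure only, without reading file content
   $ bids-validator <path/to/dataset> --ignoreNiftiHeaders --ignoreSymlinks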
  382. .. _gitannexbranch:
  383. What is the git-annex branch?
  384. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  385. If your DataLad dataset contains an annex, there is also a ``git-annex`` :term:`branch`
  386. that is created, used, and maintained solely by :term:`git-annex`. It is completely
  387. unconnected to any other branches in your dataset, and contains different types
  388. of log files.
  389. The contents of this branch are used for git-annex internal tracking of the
  390. dataset and its annexed contents. For example, git-annex stores information on where
  391. file content can be retrieved from in a ``.log`` file for each object, and if the object
  392. was obtained from web-sources (e.g., with :dlcmd:`download-url`), a
  393. ``.log.web`` file stores the URL. Other files in this branch store information about
  394. the known remotes of the dataset and their description, if they have one.
  395. You can find out much more about the ``git-annex`` branch and its contents in the
  396. `documentation <https://git-annex.branchable.com/internals>`_.
  397. This branch, however, is managed by git-annex, and you should not tamper with it.
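Looking at the branch is harmless as long as nothing is modified. A short, read-only sketch, assuming it is run inside a dataset that contains an annex:

.. code-block:: console

   $ # list a few of the files that git-annex keeps on its branch
   $ git ls-tree -r --name-only git-annex | head
   $ # show the log of known repositories and their descriptions
   $ git show git-annex:uuid.log

Neither command checks out or changes the branch; both only print its contents.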
  398. .. _gitannexdefault:
  399. Help - Why does GitHub display my dataset with git-annex as the default branch?
  400. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  401. If your dataset is represented on GitHub with cryptic directories instead of actual file names, GitHub probably declared the :term:`git-annex branch` to be your repository's "default branch".
  402. Here is an example:
  403. .. figure:: ../artwork/src/defaultgitannex_light.png
  404. This is related to GitHub's decision to make ``main`` `the default branch for newly created repositories <https://github.blog/changelog/2020-10-01-the-default-branch-for-newly-created-repositories-is-now-main>`_ -- datasets that do not have a ``main`` branch (but, for example, a ``master`` branch) may end up with a different branch being displayed on GitHub than intended.
  405. To fix this for present and/or future datasets, the default branch can be configured to a branch name of your choice on a repository- or organizational level `via GitHub's web-interface <https://github.blog/changelog/2020-08-26-set-the-default-branch-for-newly-created-repositories>`_.
  406. Alternatively, you can rename existing ``master`` branches to ``main`` using ``git branch -m master main`` (but beware of unforeseen consequences: your collaborators may try to ``update`` the ``master`` branch but fail, continuous integration workflows could still try to use ``master``, etc.).
  407. Lastly, you can initialize new datasets with ``main`` instead of ``master`` -- either with a global Git configuration [#f2]_ for ``init.defaultBranch`` (``git config --global init.defaultBranch main``), or by appending ``--initial-branch main`` to a ``datalad create`` command (``datalad create mydataset --initial-branch main``) [#f3]_.
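The Git side of these options can be tried in throwaway repositories. This is a minimal sketch, assuming Git 2.28 or later (the first release with ``--initial-branch``); the directory names and the commit identity are placeholders.

.. code-block:: bash

   # Illustrative only: throwaway repositories in a temporary directory.
   tmp="$(mktemp -d)"

   # Option 1: rename an existing "master" branch to "main".
   git init -q --initial-branch=master "$tmp/old"
   git -C "$tmp/old" -c user.name="Example" -c user.email="example@example.com" \
       commit -q --allow-empty -m "Initial commit"
   git -C "$tmp/old" branch -m master main

   # Option 2: start a new repository on "main" right away
   # (datalad create passes --initial-branch on to git init).
   git init -q --initial-branch=main "$tmp/new"

   # Both repositories now report "main" as the current branch.
   git -C "$tmp/old" symbolic-ref --short HEAD
   git -C "$tmp/new" symbolic-ref --short HEAD

A renamed branch still has to be pushed, and the default branch on GitHub has to be switched in the repository settings before the old ``master`` branch can be removed there.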
  408. .. rubric:: Footnotes
  409. .. [#f1] The absolutely amazing `Midnight Commander <https://github.com/MidnightCommander/mc>`_
  410. ``mc`` can also follow symlinks.
  411. .. [#f2] See the section :ref:`config` for more info on configurations.
  412. .. [#f3] ``--initial-branch`` is not one of ``datalad create``'s parameters, but a parameter of a ``git init`` call. You can specify any of ``git init``'s parameters as the last arguments of ``datalad create`` (after the ``PATH``) and it will be passed to ``git init``.