.. index::
   pair: create; DataLad command
   pair: create dataset; with DataLad
.. _createDS:

Create a dataset
----------------

We are about to start the educational course ``DataLad-101``.
In order to follow along and organize course content, let us create
a directory on our computer to collate the materials, assignments, and
notes in.

Since this is ``DataLad-101``, let's do it as a :term:`DataLad dataset`.
You might associate the term "dataset" with a large spreadsheet containing
variables and data.
But for DataLad, a dataset is the core data type:
As noted in :ref:`philo`, a dataset is a collection of *files*
in folders, and a file is the smallest unit any dataset can contain.
Although this is a very simple concept, datasets come with many
useful features.
Because experiencing is more insightful than just reading, we will explore the
concepts of DataLad datasets together by creating one.

Find a nice place on your computer's file system to put a dataset for ``DataLad-101``,
and create a fresh, empty dataset with the :dlcmd:`create` command.

Note the command structure of :dlcmd:`create` (optional bits are enclosed in ``[ ]``):

.. code-block::

   datalad create [--description "..."] [-c <config options>] PATH

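
To make this command structure concrete, here is a minimal Python sketch that
assembles the argument list for ``datalad create`` from its optional and
required parts. The helper name ``build_create_command`` is hypothetical and
serves illustration only; the sketch builds the command but does not run DataLad.

.. code-block:: python

   import shlex


   def build_create_command(path, description=None, config=None):
       """Assemble the argument list for ``datalad create``.

       Hypothetical helper: ``description`` and ``config`` correspond to the
       optional ``--description`` and ``-c`` parts, ``path`` to the required PATH.
       """
       cmd = ["datalad", "create"]
       if description is not None:
           cmd += ["--description", description]
       if config is not None:
           cmd += ["-c", config]
       cmd.append(path)
       return cmd


   if __name__ == "__main__":
       cmd = build_create_command("DataLad-101", config="text2git")
       # Print the command as it would be typed in a shell.
       print(shlex.join(cmd))

Passing the resulting list to ``subprocess.run`` would execute the command,
provided DataLad is installed.
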
.. _createdescription:
.. index::
   pair: set description for dataset location; with DataLad
.. find-out-more:: What is the description option of 'datalad create'?

   The optional ``--description`` flag allows you to provide a short description of
   the *location* of your dataset, for example with

   .. code-block:: console

      $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101

   If you want, use the above command instead to provide a description. Its purpose may not be clear yet, but the chapter
   :ref:`chapter_collaboration` shows where this description
   ends up and how it may be useful.

Let's start:

.. index::
   pair: create dataset; with DataLad
.. runrecord:: _examples/DL-101-101-101
   :language: console
   :workdir: dl-101
   :env:
      DATALAD_SEED=0
   :realcommand: ( mkdir DataLad-101 && cd DataLad-101 && git init && git config annex.uuid 46b169aa-bb91-42d6-be06-355d957fb4f7 ) &> /dev/null && datalad create --force -c text2git DataLad-101
   :cast: 01_dataset_basics
   :notes: Datasets are DataLad's core data type. We will explore the concepts of datasets by creating one with datalad create. optional configuration template and a description

   $ datalad create -c text2git DataLad-101

This will create a dataset called ``DataLad-101`` in the directory you are currently
in. For now, disregard ``-c text2git``. It applies a configuration template, which
later parts of this book explain in detail.

Once created, a DataLad dataset looks like any other directory on your file system.
Currently, it seems empty.

.. runrecord:: _examples/DL-101-101-102
   :language: console
   :workdir: dl-101
   :cast: 01_dataset_basics
   :notes: DataLad informs about what it is doing during a command. At the end is a summary, in this case it is ok. What is inside of a newly created dataset? We list contents with ls.

   $ cd DataLad-101
   $ ls # ls does not show any output, because the dataset is empty.

However, all files and directories you store within the DataLad dataset
can be tracked (should you want them to be tracked).
*Tracking* in this context means that edits done to a file are automatically
associated with information about the change, the author of the edit,
and the time of this change. This is already informative on its own
-- the :term:`provenance` captured with this can, for example, be used to learn
about a file's lineage, and can establish trust in it.
But what is especially helpful is that previous states of files or directories
can be restored. Remember the last time you accidentally deleted content
in a file, but only realized *after* you saved it? With DataLad, no
mistakes are forever. We will see many examples of this later in the book,
and such information is stored in what we will refer
to as the *history* of a dataset.

.. index::
   pair: log; Git command
   pair: exit pager; in a terminal
   pair: show history; with Git

This history is almost as small as it can be at the current state, but let's take
a look at it. For looking at the history, the code examples will use :gitcmd:`log`,
a built-in :term:`Git` command [#f1]_ that works right in your terminal. Your log
*might* be opened in a terminal :term:`pager`
that lets you scroll up and down with your arrow keys, but not enter any more commands.
If this happens, you can get out of ``git log`` by pressing ``q``.

.. runrecord:: _examples/DL-101-101-103
   :language: console
   :workdir: dl-101/DataLad-101
   :emphasize-lines: 3-4, 6, 9-10, 12
   :cast: 01_dataset_basics
   :notes: GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means, edits done to a file are associated with information about the change, the author, and the time + ability to restore previous states of the dataset. Let's take a look into the history, even if it is small atm

   $ git log

We can see two :term:`commit`\s in the history of the repository.
Each of them is identified by a unique 40-character sequence, called a
:term:`shasum`.

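
As a rough illustration of where such 40-character identifiers come from:
Git has traditionally derived them by computing a SHA-1 hash over an object's
type, size, and content. The minimal sketch below reproduces this for a
file-content object (a "blob"). The function name ``git_object_id`` is our own
and not part of Git or DataLad. Commit identifiers are derived the same way from
the commit's content, which includes the author, the time, and the message.

.. code-block:: python

   import hashlib


   def git_object_id(content: bytes, obj_type: str = "blob") -> str:
       """Return the SHA-1 identifier Git would assign to an object.

       Git hashes a header of the form "<type> <size>\\0" followed by the
       object's content.
       """
       header = f"{obj_type} {len(content)}\0".encode()
       return hashlib.sha1(header + content).hexdigest()


   if __name__ == "__main__":
       oid = git_object_id(b"")
       # The identifier of an empty file, printed with its length (40).
       print(oid, len(oid))

Because the identifier depends on every byte of the content, any change to a
file or to a commit's metadata yields a different shasum.
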
.. index::
   pair: log; Git command
   pair: corresponding branch; in adjusted mode
   pair: show history; on Windows
.. windows-wit:: Your Git log may be more extensive - use 'git log main' instead!

   .. include:: topic/adjustedmode-log.rst

Highlighted in this output is information about the author and about
the time, as well as a :term:`commit message` that summarizes the
performed action concisely. In this case, both commit messages were written by
DataLad itself. The most recent change is at the top. The first commit
written to the history therefore states that a new dataset was created,
and the second commit is related to the ``-c text2git`` option (which
uses a configuration template to instruct DataLad to store text files
in Git, but more on this later).
While these commits were produced and described by DataLad,
in most other cases, you will have to create the commit and
an informative commit message yourself.

.. index::
   pair: create dataset; DataLad concept
.. gitusernote:: Create internals

   :dlcmd:`create` uses :gitcmd:`init` and :gitannexcmd:`init`. Therefore,
   the DataLad dataset is a Git repository.
   Large file content in the
   dataset is tracked with git-annex. An ``ls -a``
   reveals that Git has secretly done its work:

   .. runrecord:: _examples/DL-101-101-104
      :language: console
      :workdir: dl-101/DataLad-101
      :emphasize-lines: 4-6
      :cast: 01_dataset_basics
      :notes: DataLad, git-annex, and git create hidden files and directories in your dataset. Make sure to not delete them!

      $ ls -a # show also hidden files

   **For non-Git-Users: these hidden** *dot-directories* and *dot-files* **are necessary for all Git magic**
   **to work. Please do not tamper with them, and, importantly,** *do not delete them.*

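
If you would like to check which hidden entries a directory contains without
reading through ``ls -a`` output, a few lines of Python suffice. This is a
minimal sketch: ``hidden_entries`` is a hypothetical helper, and the
demonstration runs on a temporary stand-in directory whose entry names are
assumptions chosen to resemble a fresh dataset, not on a real one.

.. code-block:: python

   import tempfile
   from pathlib import Path


   def hidden_entries(directory):
       """Return the sorted names of entries that start with a dot."""
       return sorted(
           p.name for p in Path(directory).iterdir() if p.name.startswith(".")
       )


   if __name__ == "__main__":
       with tempfile.TemporaryDirectory() as tmp:
           root = Path(tmp)
           # Stand-in entries (assumed names), plus one visible file.
           (root / ".git").mkdir()
           (root / ".datalad").mkdir()
           (root / ".gitattributes").touch()
           (root / "notes.txt").touch()
           print(hidden_entries(root))

The function only reads directory listings and never modifies anything, in
keeping with the advice not to tamper with these entries.
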
Congratulations, you just created your first DataLad dataset!
Let us now put some content inside.

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-101-105
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_create_a_dataset

.. rubric:: Footnotes

.. [#f1] A tool we can recommend as an alternative to :gitcmd:`log` is :term:`tig`.
         Once installed, exchange any ``git log`` command you see here with the single word ``tig``.

.. ifconfig:: internal

   create a script to help make push targets

   .. runrecord:: _examples/DL-101-101-106
      :language: console
      :workdir: dl-101/DataLad-101

      $ cat << EOT >| /home/me/makepushtarget.py
      #!/usr/bin/python3
      from datalad.core.distributed.tests.test_push import mk_push_target
      from datalad.api import Dataset as ds
      import sys
      ds_path = sys.argv[1]
      name = sys.argv[2]
      path = sys.argv[3]
      annex = sys.argv[4]
      bare = sys.argv[5]
      if __name__ == '__main__':
          mk_push_target(ds=ds(ds_path),
                         name=name,
                         path=path,
                         annex=annex,
                         bare=bare)
      EOT