123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210 |
- .. index::
- pair: create; DataLad command
- pair: create dataset; with DataLad
- .. _createDS:
- Create a dataset
- ----------------
- We are about to start the educational course ``DataLad-101``.
- In order to follow along and organize course content, let us create
- a directory on our computer to collate the materials, assignments, and
- notes in.
- Since this is ``DataLad-101``, let's do it as a :term:`DataLad dataset`.
- You might associate the term "dataset" with a large spreadsheet containing
- variables and data.
- But for DataLad, a dataset is the core data type:
- As noted in :ref:`philo`, a dataset is a collection of *files*
- in folders, and a file is the smallest unit any dataset can contain.
- Although this is a very simple concept, datasets come with many
- useful features.
- Because experiencing is more insightful than just reading, we will explore the
- concepts of DataLad datasets together by creating one.
- Find a nice place on your computer's file system to put a dataset for ``DataLad-101``,
- and create a fresh, empty dataset with the :dlcmd:`create` command.
- Note the command structure of :dlcmd:`create` (optional bits are enclosed in ``[ ]``):
- .. code-block::
- datalad create [--description "..."] [-c <config options>] PATH
- .. _createdescription:
- .. index::
- pair: set description for dataset location; with DataLad
- .. find-out-more:: What is the description option of 'datalad create'?
- The optional ``--description`` flag allows you to provide a short description of
- the *location* of your dataset, for example with
- .. code-block:: console
- $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
- If you want, use the above command instead to provide a description. Its use will not be immediately clear now, but the chapter
- :ref:`chapter_collaboration` shows where this description
- ends up and how it may be useful.
- Let's start:
- .. index::
- pair: create dataset; with DataLad
- .. runrecord:: _examples/DL-101-101-101
- :language: console
- :workdir: dl-101
- :env:
- DATALAD_SEED=0
- :realcommand: ( mkdir DataLad-101 && cd DataLad-101 && git init && git config annex.uuid 46b169aa-bb91-42d6-be06-355d957fb4f7 ) &> /dev/null && datalad create --force -c text2git DataLad-101
- :cast: 01_dataset_basics
- :notes: Datasets are datalads core data type. We will explore the concepts of datasets by creating one with datalad create. optional configuration template and a description
- $ datalad create -c text2git DataLad-101
- This will create a dataset called ``DataLad-101`` in the directory you are currently
- in. For now, disregard ``-c text2git``. It applies a configuration template, but there
- will be other parts of this book to explain this in detail.
- Once created, a DataLad dataset looks like any other directory on your file system.
- Currently, it seems empty.
- .. runrecord:: _examples/DL-101-101-102
- :language: console
- :workdir: dl-101
- :cast: 01_dataset_basics
- :notes: DataLad informs about what it is doing during a command. At the end is a summary, in this case it is ok. What is inside of a newly created dataset? We list contents with ls.
- $ cd DataLad-101
- $ ls # ls does not show any output, because the dataset is empty.
- However, all files and directories you store within the DataLad dataset
- can be tracked (should you want them to be tracked).
- *Tracking* in this context means that edits done to a file are automatically
- associated with information about the change, the author of the edit,
- and the time of this change. This is already informative important on its own
- -- the :term:`provenance` captured with this can, for example, be used to learn
- about a file's lineage, and can establish trust in it.
- But what is especially helpful is that previous states of files or directories
- can be restored. Remember the last time you accidentally deleted content
- in a file, but only realized *after* you saved it? With DataLad, no
- mistakes are forever. We will see many examples of this later in the book,
- and such information is stored in what we will refer
- to as the *history* of a dataset.
- .. index::
- pair: log; Git command
- pair: exit pager; in a terminal
- pair: show history; with Git
- This history is almost as small as it can be at the current state, but let's take
- a look at it. For looking at the history, the code examples will use :gitcmd:`log`,
- a built-in :term:`Git` command [#f1]_ that works right in your terminal. Your log
- *might* be opened in a terminal :term:`pager`
- that lets you scroll up and down with your arrow keys, but not enter any more commands.
- If this happens, you can get out of ``git log`` by pressing ``q``.
- .. runrecord:: _examples/DL-101-101-103
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 3-4, 6, 9-10, 12
- :cast: 01_dataset_basics
- :notes: GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means, edits done to a file are associated with information about the change, the author, and the time + ability to restore previous states of the dataset. Let's take a look into the history, even if it is small atm
- $ git log
- We can see two :term:`commit`\s in the history of the repository.
- Each of them is identified by a unique 40 character sequence, called a
- :term:`shasum`.
- .. index::
- pair: log; Git command
- pair: corresponding branch; in adjusted mode
- pair: show history; on Windows
- .. windows-wit:: Your Git log may be more extensive - use 'git log main' instead!
- .. include:: topic/adjustedmode-log.rst
- Highlighted in this output is information about the author and about
- the time, as well as a :term:`commit message` that summarizes the
- performed action concisely. In this case, both commit messages were written by
- DataLad itself. The most recent change is on the top. The first commit
- written to the history therefore states that a new dataset was created,
- and the second commit is related to the ``-c text2git`` option (which
- uses a configuration template to instruct DataLad to store text files
- in Git, but more on this later).
- While these commits were produced and described by DataLad,
- in most other cases, you will have to create the commit and
- an informative commit message yourself.
- .. index::
- pair: create dataset; DataLad concept
- .. gitusernote:: Create internals
- :dlcmd:`create` uses :gitcmd:`init` and :gitannexcmd:`init`. Therefore,
- the DataLad dataset is a Git repository.
- Large file content in the
- dataset is tracked with git-annex. An ``ls -a``
- reveals that Git has secretly done its work:
- .. runrecord:: _examples/DL-101-101-104
- :language: console
- :workdir: dl-101/DataLad-101
- :emphasize-lines: 4-6
- :cast: 01_dataset_basics
- :notes: DataLad, git-annex, and git create hidden files and directories in your dataset. Make sure to not delete them!
- $ ls -a # show also hidden files
- **For non-Git-Users: these hidden** *dot-directories* and *dot-files* **are necessary for all Git magic**
- **to work. Please do not tamper with them, and, importantly,** *do not delete them.*
- Congratulations, you just created your first DataLad dataset!
- Let us now put some content inside.
- .. only:: adminmode
- Add a tag at the section end.
- .. runrecord:: _examples/DL-101-101-105
- :language: console
- :workdir: dl-101/DataLad-101
- $ git branch sct_create_a_dataset
- .. rubric:: Footnotes
- .. [#f1] A tool we can recommend as an alternative to :gitcmd:`log` is :term:`tig`.
- Once installed, exchange any ``git log`` command you see here with the single word ``tig``.
- .. ifconfig:: internal
- create a script to help make push targets
- .. runrecord:: _examples/DL-101-101-106
- :language: console
- :workdir: dl-101/DataLad-101
- $ cat << EOT >| /home/me/makepushtarget.py
- #!/usr/bin/python3
- from datalad.core.distributed.tests.test_push import mk_push_target
- from datalad.api import Dataset as ds
- import sys
- ds_path = sys.argv[1]
- name = sys.argv[2]
- path = sys.argv[3]
- annex = sys.argv[4]
- bare = sys.argv[5]
- if __name__ == '__main__':
- mk_push_target(ds=ds(ds_path),
- name=name,
- path=path,
- annex=annex,
- bare=bare)
- EOT
|