add note on helpful git resource

Elena Piscopia 2 years ago
1 changed files with 8 additions and 106 deletions

@@ -1,108 +1,10 @@
-One can create a new dataset with 'datalad create [--description] PATH'.
-The dataset is created empty
-The command "datalad save [-m] PATH" saves the file
-(modifications) to history. Note to self:
-Always use informative, concise commit messages.
-The command 'datalad clone URL/PATH [PATH]'
-clones a dataset from e.g., a URL or a path.
-If you clone a dataset into an existing
-dataset (as a subdataset), remember to specify the
-root of the superdataset with the '-d' option.
-There are two useful functions to display changes between two
-states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
-and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
-in the history.
-The datalad run command can record the impact a script or command has on a Dataset.
-In its simplest form, datalad run only takes a commit message and the command that
-should be executed.
-Any datalad run command can be re-executed by using its commit shasum as an argument
-in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
-commit, and re-execute it. If no changes happen with a rerun, the command will not be written
-to history. Note: you can also rerun a datalad rerun command!
-You should specify all files that a command takes as input with an -i/--input flag. These
-files will be retrieved prior to the command execution. Any content that is modified or
-produced by the command should be specified with an -o/--output flag. Upon a run or rerun
-of the command, the contents of these files will get unlocked so that they can be modified.
-Important! If the dataset is not "clean" (that is, if the datalad status output is not empty),
-datalad run will not work - you will have to save modifications present in your dataset.
-A suboptimal alternative is the --explicit flag,
-used to record only those changes done
-to the files listed with --output flags.
-A source to clone a dataset from can also be a path,
-for example as in "datalad clone ../DataLad-101".
-Just as in creating datasets, you can add a
-description on the location of the new dataset clone
-with the -D/--description option.
-Note that subdatasets will not be installed by default,
-but are only registered in the superdataset -- you will
-have to do a "datalad get -n PATH/TO/SUBDATASET"
-to clone the subdataset for file availability meta data.
-The -n/--no-data option prevents file contents from
-also being downloaded.
-Note that a recursive "datalad get" would clone all further
-registered subdatasets underneath a subdataset, so a safer
-way to proceed is to set a decent --recursion-limit:
-"datalad get -n -r --recursion-limit 2 <subds>"
-The command "git annex whereis PATH" lists the repositories that have
-the file content of an annexed file. When using "datalad get" to retrieve
-file content, those repositories will be queried.
-To update a shared dataset, run the command "datalad update --merge".
-This command will query its origin for changes, and integrate the
-changes into the dataset.
-To update from a dataset with a shared history, you
-need to add this dataset as a sibling to your dataset.
-"Adding a sibling" means providing DataLad with info about
-the location of a dataset, and a name for it. Afterwards,
-a "datalad update --merge -s name" will integrate the changes
-made to the sibling into the dataset.
-A safe step in between is to do a "datalad update -s name"
-and checkout the changes with "git/datalad diff"
-to remotes/origin/master
-Configurations for datasets exist on different levels
-(systemwide, global, and local), and in different types
-of files (not version controlled (git)config files, or
-version controlled .datalad/config, .gitattributes, or
-gitmodules files), or environment variables.
-With the exception of .gitattributes, all configuration
-files share a common structure, and can be modified with
-the git config command, but also with an editor by hand.
-Depending on whether a configuration file is version
-controlled or not, the configurations will be shared together
-with the dataset. More specific configurations and not-shared
-configurations will always take precedence over more global or
-shared configurations, and environment variables take precedence
-over configurations in files.
-The git config --list --show-origin command is a useful tool
-to give an overview over existing configurations. Particularly
-important may be the .gitattributes file, in which one can set
-rules for git-annex about which files should be version-controlled
-with Git instead of being annexed.
-It can be useful to use pre-configured procedures that can apply
-configurations, create files or file hierarchies, or perform
-arbitrary tasks in datasets. They can be shipped with DataLad,
-its extensions, or datasets, and you can even write your own
-procedures and distribute them. The "datalad run-procedure"
-command is used to apply such a procedure to a dataset. Procedures
-shipped with DataLad or its extensions starting with a "cfg" prefix
-can also be applied at the creation of a dataset with
-"datalad create -c <PROC-NAME> <PATH>" (omitting the "cfg" prefix).
+Git has many handy tools to go back and forth in
+time and work with the history of datasets.
+Among many other things, you can rewrite commit
+messages, undo changes, or look at previous versions
+of datasets. A superb resource to find out more about
+this and practice such Git operations is this
+chapter in the Pro Git book: