# How to get this dataset?
You have the following options to get the dataset:
* use Datalad to clone and manage this dataset
* use Gin to clone and manage this dataset
* download the data manually

This guide will only cover the first option.

## Install Datalad
The Datalad handbook offers a comprehensive guide on [how to install Datalad](https://handbook.datalad.org/en/latest/intro/installation.html). You can simply follow it, or follow Meng's suggestions below. In principle, Datalad is a command-line tool, so you'll need a shell: on Windows that is either CMD (which is really ugly) or Git Bash, which looks like a Linux terminal and uses the same commands as Linux.
1. Download and install [Git](https://gitforwindows.org/)
2. Download and install [Git-Annex](https://git-annex.branchable.com/install/Windows/)
3. Download and install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/download/); either works, since you only need conda, and maybe you already have Anaconda installed
4. Now open Git Bash and type `conda` to see if the command is found. If not, you need to add the path to conda to your environment variables:
    * Hit the Windows key, search for "environ...", and select "Edit the system environment variables"
    * Click "Environment Variables..." in the System Properties window that pops up
    * Edit "Path" under system variables
    * Add the path to your Miniconda/Anaconda to it; mine looks like "C:\Users\jiame\Anaconda3\Scripts"
    * Save the setting, then type `conda` again in a new Bash session; you should see conda's usage text
5. Install Datalad via Bash using `conda install -c conda-forge datalad`
6. Set up user configuration:
```
# enter your home directory using the ~ shortcut
% cd ~
% git config --global --add user.name "Bob McBobFace"
% git config --global --add user.email bob@example.com
```

## Set up SSH connection to Gin
Skip this section if you are already using Gin on your computer, i.e. you already have the SSH connection set up.
1. Generate a new SSH key pair: `ssh-keygen -t ed25519 -C "your_email@example.com"`, and press ENTER to accept the default settings; you can choose a file name by supplying the command with `-f <filename>`
2. Start the SSH agent: `eval "$(ssh-agent -s)"`
3. Add the private key to the SSH agent: `ssh-add ~/.ssh/id_ed25519`, replacing `id_ed25519` with your chosen filename
4. Copy the public key to the clipboard: `clip < ~/.ssh/id_ed25519.pub`, replacing `id_ed25519` with your chosen filename
5. Go to your Gin personal settings on the website, add a new SSH key, give it a reasonable name that refers to the specific device, and `ctrl+V` to paste the public key
You can read more about SSH keys in [this tutorial](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent).
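
Optionally, you can test the connection right away. Gin does not provide shell access, so if the key is set up correctly the server should just send a short greeting and close the connection (the exact wording of the greeting may vary):

```
# -T disables terminal allocation; a greeting followed by a closed
# connection means authentication worked
ssh -T git@gin.g-node.org
```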

## Add personal access token
A personal access token is needed so you don't have to enter your username and password every time you clone or publish a private repo.
1. Go to Gin - Settings - Applications: click the blue button on the top right to create a new token
2. Give it a meaningful name, e.g. sara-laptop
3. Once the token is generated, it will be displayed only once, so make sure to copy it or save it in a text file temporarily
4. Go to Git Bash and enter `datalad credentials set [name] token=<personal-access-token>`; if the command is not recognized, you need to (see the sketch below):
    * `pip install datalad-next` (if you don't have `pip`, install it with `conda install pip`)
    * try setting the credentials again
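
Put together, the fallback might look like this; the credential name `gin-token` is just an example, use whatever name makes sense to you:

```
# the `credentials` command is provided by the datalad-next extension
pip install datalad-next
# store the token under a name of your choice (here: gin-token)
datalad credentials set gin-token token=<personal-access-token>
```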

## Clone and get the dataset
You are finally good to go! Fingers crossed ...
1. In Git Bash: go to the directory you want to clone the dataset into using the `cd` command, and run `datalad clone git@gin.g-node.org:/jwu/ds_SM_PT_2P.git`; if you want to name it differently, simply append a directory name to the command
2. Cloning should not take very long, since all your files are annexed, i.e. unless you explicitly get them, you only have placeholders on your computer
3. Run `datalad status --annex all` and verify that the size of your dataset on the computer is only a few KB
4. Now choose your favorite file and run `datalad get <path/to/file>` to see if you can indeed get it
5. Run `datalad status --annex all` again and see if you have a bigger file size now
6. If it all looks good, you can either get the entire dataset by running `datalad get .`, which might take days, or select specific directories to get one mouse at a time (a full first session is sketched below)
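
For example, a first session could look like this; the target directory and the file path are made up, so substitute your own:

```
cd /c/Users/jiame/datasets                  # any directory you like
datalad clone git@gin.g-node.org:/jwu/ds_SM_PT_2P.git
cd ds_SM_PT_2P
datalad status --annex all                  # placeholders only, a few KB
datalad get mouse01/session01/data.tif      # hypothetical file path
datalad status --annex all                  # local size has grown
```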

## Drop & get annexed content
Sometimes you run out of storage on your device, so you might want to drop annexed content to free up space. Datalad will not drop anything that is not available on at least one sibling, in this case, on gin.

Run `datalad drop [path]`.

You can run `datalad status --annex all` before and after dropping to see for yourself how much file content is being dropped.
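
A minimal round trip, using a hypothetical directory `mouse01/`:

```
datalad status --annex all   # note how much content is present locally
datalad drop mouse01/        # frees the space; gin still has the content
datalad get mouse01/         # ... and you can always get it back
```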

## Update a dataset with the remote
You always want to "sync" your local repo with the remote (gin). This requires you to push & pull updates to your dataset. It is recommended to do this on a daily basis: start your day with pulling and finish your day with pushing! Always follow these steps to avoid merge conflicts:
1. Check if you have unsaved changes on your local device: `datalad status`.
2. If yes, commit the changes to datalad: `datalad save -m "Say what has happened to the dataset to the future you"`. Note, it is recommended to do the commits in logical units that make sense to you, i.e. when you have changed something that affects multiple files, for instance "rerun suite2p with different parameters", specify the files that are associated with this change. You simply add the files/directories to the command above after the commit message.
3. Repeat 1. & 2. until you have committed all changes and there is nothing to save.
4. Before you publish these changes, you have to collect the changes from the remote. Imagine someone (maybe yourself) has published changes to the remote that you haven't fetched yet, or you have changed the README.md using the browser, so the remote is ahead of you. But your local changes are not available to the remote yet either, so you are ahead of the remote as well. This can get complicated, so it is best to first fetch all updates from the remote and integrate them into your local dataset before you publish your own changes: `datalad update --how merge`.
5. Finally, you can publish your changes: `datalad push --to gin`. (The full routine is sketched below.)
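
For instance, an end-of-day routine might look like this; the commit message and path are only examples:

```
datalad status                                              # anything unsaved?
datalad save -m "rerun suite2p with different parameters" mouse01/suite2p/
datalad update --how merge                                  # pull first ...
datalad push --to gin                                       # ... then publish
```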

## Create a datalad dataset
When you start with a new dataset, you want to turn it into a datalad dataset on your local device (e.g. the acquisition system), and then publish the dataset to gin.
1. In Git Bash, go into the root directory of your dataset.
2. Create a datalad dataset here: `datalad create --force -c text2git`. `--force` is needed to create a dataset in a non-empty directory. `-c text2git` is a configuration that tells datalad not to annex text files. Yes, everything else is annexed, which means the content of the files is actually hidden from the user in a separate object folder; the user "only" sees a link to the object file. This is what allows you to drop & get content as you wish, and is the key to dealing with large files. For convenience we don't want that for text files, so that we can edit them anytime without getting them first.
3. Run `datalad status` and you should see that your dataset is empty so far, i.e. all your files are untracked. So use the `datalad save -m "add files"` command to add your files. You can add all at once by not specifying any particular files.
4. If you want to have subdatasets within this dataset, visit the Subdatasets section below first, and do not add the directory that you want to turn into a subdataset as a normal directory to your superdataset.
5. Now you are ready to publish your dataset to gin: `datalad create-sibling-gin --private --access-protocol ssh --credential [name] [name of the gin repo]`. This step requires you to have set up the SSH connection and the personal access token to your gin account. `[name]` refers to the name you gave to the personal access token. `[name of the gin repo]` will be the name this repo has on gin. The recommendation is to be consistent with the local dataset name so you can easily identify which gin repo belongs to which dataset.
6. Run `datalad siblings` to see that gin has indeed been added as a sibling of your dataset.
7. Now you can push your content to gin: `datalad push --to gin`. (The whole sequence is sketched below.)
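
End to end, assuming a hypothetical project folder `my_2p_project` and a token credential named `gin-token`:

```
cd /c/Users/jiame/my_2p_project        # hypothetical dataset root
datalad create --force -c text2git     # the folder may be non-empty
datalad save -m "add files"            # track everything at once
datalad create-sibling-gin --private --access-protocol ssh \
    --credential gin-token my_2p_project
datalad siblings                       # 'gin' should now be listed
datalad push --to gin
```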

## Subdatasets
One big advantage of datalad is that it allows a modular setup of your dataset. This means you can, and should, organize your project so that data, code, results, and manuscripts are all separated from each other; within data, you would separate raw, processed, and derived data, as well as different data types: imaging, behavior, ephys, histology etc. For each of these modules (as far as it makes sense to you) you would create a subdataset within your superdataset (which is your project folder). We assume you have turned your project folder into a datalad dataset already. Now you want to add the raw 2p data as a subdataset.
1. Go into your project folder and run `datalad create --force -c text2git -d . [path/to/raw_2p_data]`.
2. Now go into the subdataset and add files to datalad (see the previous section), until `datalad status` shows nothing to save.
3. Go back to the superds and run `datalad status`; you should see that the subds is modified. This is important: the superds only records the most recent version of the subds, not its entire history. For all changes happening in the subds, you commit them there and datalad stores the history there; then you save the newest version to the superds.
4. Run `datalad save -d . -m "save subds" [path/to/subds]`.
5. Run `datalad status` again and you should see a clean working tree. (The whole cycle is sketched below.)
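
For example, with a hypothetical subdataset at `data/raw_2p`:

```
cd /c/Users/jiame/my_2p_project                  # the superdataset (hypothetical)
datalad create --force -c text2git -d . data/raw_2p
cd data/raw_2p
datalad save -m "add raw 2p files"               # history is recorded in the subds
cd ../..
datalad save -d . -m "save subds" data/raw_2p    # superds records the new subds version
datalad status                                   # clean working tree
```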