
How to get this dataset?

You have the following options to get the dataset:

  • use Datalad to clone and manage this dataset
  • use Gin to clone and manage this dataset
  • download the data manually

This guide will only cover the first option.

Install Datalad

The Datalad handbook offers a comprehensive guide on how to install Datalad. You can simply follow it, or follow Meng's suggestions below. Datalad is a command-line tool, so you need a shell to use it. On Windows you can use CMD, but Git Bash is more comfortable: it looks like the terminal on Linux and uses the same set of commands.

  1. Download and install Git
  2. Download and install Git-Annex
  3. Download and install Miniconda or Anaconda. It doesn't matter which one, since you only need conda; if you already have Anaconda installed, you can keep using it.
  4. Now open Git Bash and type conda to see if the command is found. If not, you need to add the path to conda to your environment variable:
    • Hit the Windows key and search for "environ...", select "Edit the system environment variables"
    • Click "Environment Variables..." in the System Properties window that pops up
    • Edit "Path" under system variables
    • Add the path to your Miniconda/Anaconda to it; mine looks like "C:\Users\jiame\Anaconda3\Scripts"
    • Save the setting and type conda again in a new Git Bash window; you should see the help text for conda
  5. Install Datalad via Bash using conda install -c conda-forge datalad
  6. Set up user configuration:

    # enter your home directory using the ~ shortcut
    % cd ~
    % git config --global --add user.name "Bob McBobFace"
    % git config --global --add user.email bob@example.com
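
Before moving on, it can help to confirm that every tool from the steps above is actually visible to Git Bash. The following is a minimal sketch, not part of the Datalad handbook: the function names check_tool and check_datalad_setup are made up for this guide, and it assumes Git-Annex is exposed on the path as git-annex.

```bash
# Report whether a single command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# Check every tool installed in the steps above.
# Returns non-zero if at least one of them is missing.
check_datalad_setup() {
    local tool missing=0
    for tool in git git-annex conda datalad; do
        check_tool "$tool" || missing=1
    done
    return "$missing"
}

# Usage (in a fresh Git Bash window):
#   check_datalad_setup
```

If a tool is reported as MISSING although you installed it, its directory is most likely not in your Path variable yet (see step 4).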
    

Set up the SSH connection to Gin

Skip this one if you are already using Gin on your computer, i.e. you already have the ssh connection set up.

  1. Generate a new SSH key pair: ssh-keygen -t ed25519 -C "your_email@example.com", and press ENTER to accept the default settings. You can choose a file name by adding -f <filename> to the command.
  2. Start the SSH agent: eval "$(ssh-agent -s)"
  3. Add the private key to the SSH agent: ssh-add ~/.ssh/id_ed25519 (replace id_ed25519 with your chosen filename)
  4. Copy the public key to the clipboard: clip < ~/.ssh/id_ed25519.pub (replace id_ed25519 with your chosen filename)
  5. Go to your personal settings on the Gin website and add a new SSH key: give it a reasonable name that refers to the specific device, and press Ctrl+V to paste the public key. You can read more about SSH keys in this tutorial.
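
Steps 1 to 4 can also be collected into two small shell functions, so the chosen key file name is typed only once. This is a sketch under assumptions: the function names and the example file name id_ed25519_gin are hypothetical, and -N "" is used to set an empty passphrase without prompting, which has the same result as pressing ENTER at the passphrase prompts.

```bash
# Step 1: create an ed25519 key pair at the given path.
# -N "" sets an empty passphrase (same result as pressing ENTER at the prompts).
generate_gin_key() {
    local email=$1 keyfile=$2
    ssh-keygen -t ed25519 -C "$email" -f "$keyfile" -N ""
}

# Steps 2-4: start the agent, register the private key, copy the public key.
load_and_copy_key() {
    local keyfile=$1
    eval "$(ssh-agent -s)"
    ssh-add "$keyfile" || return 1
    if command -v clip >/dev/null 2>&1; then
        clip < "$keyfile.pub"     # Windows clipboard
    else
        cat "$keyfile.pub"        # no clip available: print the key to copy it by hand
    fi
}

# Usage (hypothetical file name):
#   generate_gin_key "your_email@example.com" ~/.ssh/id_ed25519_gin
#   load_and_copy_key ~/.ssh/id_ed25519_gin
```

Step 5 (pasting the public key into your Gin settings) still has to be done in the browser.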

Add personal access token

A personal access token is needed so you don't have to enter username and password when you try to either clone or publish a private repo.

  1. Go to Gin - Settings - Applications: click the blue button on top right to create a new token
  2. Give it a meaningful name, e.g. sara-laptop
  3. Once the token is generated, it is displayed only once, so make sure to copy it or save it in a text file temporarily
  4. Go to Git Bash and enter datalad credentials set [name] token=<personal-access-token>. If the command is unknown, you need to:
    • run pip install datalad-next (if you don't have pip, install it with conda install pip)
    • try setting the credentials again

Clone and get the dataset

You are finally good to go! Fingers crossed ...

  1. In Git Bash, use cd to go to the directory you want to clone the dataset into, and run datalad clone git@gin.g-node.org:/jwu/ds_SM_PT_2P.git. If you want to give the clone a different name, add that name at the end of the command.
  2. Cloning should not take very long, since all your files are annexed: unless you explicitly get them, you only have placeholders on your computer.
  3. Run datalad status --annex all and verify that the size of your dataset on the computer is only a few KB.
  4. Now choose your favorite file and run datalad get <path/to/file/> to see if you can indeed get it.
  5. Run datalad status --annex all again and check that the local file size has grown.
  6. If it all looks good, you can either get the entire dataset by running datalad get . (which might take days), or select specific directories to get one mouse at a time.
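
Steps 1 to 5 can be rehearsed as one function. This is only a sketch: clone_and_probe is a name made up for this guide, and the DATALAD variable is a convenience introduced here (it is not a Datalad feature) so that you can set DATALAD=echo and print the commands instead of running them.

```bash
# Convenience variable for this sketch only: set DATALAD=echo for a dry run.
DATALAD="${DATALAD:-datalad}"

# Clone the dataset into <target>, then fetch one file to confirm retrieval works.
# $1: directory name for the clone, $2: path of one file inside the dataset
clone_and_probe() {
    local target=$1 probe=$2
    "$DATALAD" clone git@gin.g-node.org:/jwu/ds_SM_PT_2P.git "$target" || return 1
    (
        cd "$target" || exit 1
        "$DATALAD" status --annex all    # expect only a few KB of local content
        "$DATALAD" get "$probe" || exit 1
        "$DATALAD" status --annex all    # the local size should now be larger
    )
}

# Usage:
#   clone_and_probe ds_SM_PT_2P <path/to/file>
```

In a dry run the target directory has to exist already, because echo does not create it.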

Drop & get annexed content

Sometimes you run out of storage on your device, so you might want to drop the annexed content to free up space. Datalad will not drop anything that is not available on at least one sibling; in this case, that sibling is gin.

Run datalad drop [path].

You can run datalad status --annex all before and after dropping to see for yourself how much file content is being dropped.

Update a dataset with the remote

You always want to "sync" your local repo with the remote (gin). This requires you to push and pull updates to your dataset. It is recommended to do this on a daily basis: start your day with pulling and finish your day with pushing! Always follow these steps to avoid merge conflicts:

  1. Check if you have unsaved changes on your local device: datalad status.
  2. If yes, commit the changes to datalad: datalad save -m "Say what has happened to the dataset to the future you". Note that it is recommended to commit in logical units that make sense to you. For instance, when one change affects multiple files, such as "rerun suite2p with different parameters", save exactly the files associated with this change by adding the files/directories to the command above, after the commit message.
  3. Repeat steps 1 and 2 until you have committed all changes and there is nothing left to save.
  4. Before you publish these changes, you have to collect the changes from the remote. Imagine someone (maybe yourself) has published changes to the remote that you haven't fetched yet, or you have changed the README.md using the browser, so the remote is ahead of you. At the same time, your local changes are not available to the remote yet, so you are ahead of the remote as well. This can get complicated, so it is best to first fetch all possible updates from the remote and integrate them into your local dataset before you publish your changes: datalad update --how merge.
  5. Finally, you can publish your changes: datalad push --to gin.
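
The order of the steps above matters, so it can be worth wrapping them in a function that stops as soon as one step fails (for example, when the merge reports a conflict). A minimal sketch, assuming you run it inside the dataset and your sibling is called gin; sync_with_gin and the DATALAD variable are names introduced for this example, not part of Datalad.

```bash
# Convenience variable for this sketch only: set DATALAD=echo for a dry run.
DATALAD="${DATALAD:-datalad}"

# Save local changes (if a message is given), merge remote updates, then push.
# Stops at the first failing step so nothing is pushed after a failed merge.
# $1: commit message; leave empty to skip the save step
sync_with_gin() {
    local message=$1
    "$DATALAD" status || return 1
    if [ -n "$message" ]; then
        "$DATALAD" save -m "$message" || return 1
    fi
    "$DATALAD" update --how merge || return 1
    "$DATALAD" push --to gin
}

# Usage:
#   sync_with_gin "rerun suite2p with different parameters"
```

This saves all changes in one commit. If you prefer commits in logical units, run datalad save with explicit paths first and call the function without a message afterwards.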

Create a datalad dataset

When you start with a new dataset, you want to turn it into a datalad dataset on your local device (e.g. acquisition system), and then publish the dataset to gin.

  1. In git bash, go into the root directory of your dataset.
  2. Create a datalad dataset here: datalad create --force -c text2git. --force is needed to create a dataset in a non-empty directory. -c text2git is a configuration that tells datalad not to annex text files. Everything else is annexed, which means the content of those files is hidden from the user in a separate object folder, and the user "only" sees a link to the object file. This allows you to drop and get content as you wish, and is the key to dealing with large files. For convenience, we don't want that for text files, so that we can edit them anytime without getting them first.
  3. Run datalad status and you should see that your dataset is empty so far, i.e. all your files are untracked. Use the datalad save -m "add files" command to add your files. You can add them all at once by not specifying any particular files.
  4. If you want to have subdatasets within this dataset, I suggest you visit the section subdatasets first, and do not add the directory that you want to turn into a subdataset as normal directory to your superdataset.
  5. Now you are ready to publish your dataset to gin: datalad create-sibling-gin --private --access-protocol ssh --credential [name] [name of the gin repo]. This step requires you to have set up the ssh connection and the personal access token for your gin account. [name] refers to the name you gave to the personal access token. [name of the gin repo] will be the name this repo has on gin. The recommendation is to use the same name as your local dataset, so you can easily identify which gin repo belongs to which dataset.
  6. Run datalad siblings to see that gin has indeed been added as a sibling of your dataset.
  7. Now you can push your content to gin: datalad push --to gin.
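
Put together, the steps above look roughly like the following function. Treat it as a sketch under assumptions: create_and_publish and the DATALAD variable are made up for this guide, the function must be run from the root directory of the dataset, and it saves everything in one go, so create any subdatasets (step 4) before using it.

```bash
# Convenience variable for this sketch only: set DATALAD=echo for a dry run.
DATALAD="${DATALAD:-datalad}"

# Turn the current directory into a dataset and publish it to a private gin repo.
# $1: name of the stored personal access token, $2: name of the gin repo
create_and_publish() {
    local credential=$1 repo=$2
    "$DATALAD" create --force -c text2git || return 1
    "$DATALAD" save -m "add files" || return 1
    "$DATALAD" create-sibling-gin --private --access-protocol ssh \
        --credential "$credential" "$repo" || return 1
    "$DATALAD" siblings               # gin should now be listed as a sibling
    "$DATALAD" push --to gin
}

# Usage (hypothetical names):
#   cd /path/to/my_project
#   create_and_publish sara-laptop my_project
```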

Subdatasets

One big advantage of datalad is that it allows you to have a modular setup of your dataset. This means you can and should organize your project so that data, code, results and manuscripts are all separated from each other. Within data, you would separate raw, processed and derived data, as well as different data types: imaging, behavior, ephys, histology etc. For each of these modules (as far as it makes sense to you) you would create a subdataset within your superdataset (which is your project folder). We assume you have turned your project folder into a datalad dataset already. Now you want to add the raw 2p data as a subdataset.

  1. Go into your project folder, run datalad create --force -c text2git -d . [path/to/raw_2p_data].
  2. Now go into the subdataset and add files to datalad (see the previous section) until datalad status shows nothing to save.
  3. Go back to the superds and run datalad status; you should see that the subds is modified. This is important: the superds records ONLY the most recent version of the subds, not its entire history. So for all changes happening in the subds, you commit them there, and the history is stored there by datalad. Then you save the newest version to the superds.
  4. Run datalad save -d . -m "save subds" [path/to/subds].
  5. Run datalad status again and you should see a clean working tree.
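
The two-level saving described above (first inside the subds, then in the superds) can be sketched as a single function. The name add_subdataset and the DATALAD variable are assumptions made for this example; the function has to be run from the root of the superdataset, and the directory that becomes the subdataset must already exist.

```bash
# Convenience variable for this sketch only: set DATALAD=echo for a dry run.
DATALAD="${DATALAD:-datalad}"

# Run from the superdataset root.
# $1: path of the existing directory to turn into a subdataset
add_subdataset() {
    local subds=$1
    "$DATALAD" create --force -c text2git -d . "$subds" || return 1
    # Commit the files inside the subdataset; its history lives there.
    ( cd "$subds" && "$DATALAD" save -m "add files" ) || return 1
    "$DATALAD" status                 # the subds should show up as modified
    # Record the newest version of the subdataset in the superdataset.
    "$DATALAD" save -d . -m "save subds" "$subds" || return 1
    "$DATALAD" status                 # expect a clean working tree
}

# Usage:
#   add_subdataset [path/to/raw_2p_data]
```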