instructions for adding a dataset into EL1000

This is a stub, to remind myself what needs to be explained/done

Explain:

- datalad
- dataset nesting
- GIN
- templates

Do typically only once per computer:

- install ChildProject
- make sure to create a virtual environment
- https://github.com/LAAC-LSCP/datalad-procedures/blob/alecristia-patch-1/README.md -- installation steps only

Do once per dataset:

- https://github.com/LAAC-LSCP/datalad-procedures/blob/alecristia-patch-1/README.md -- steps under Usage (only the appropriate template)
- Find your original dataset on your computer, from which you're going to draw your files for all the processes below.
- Get a copy of the relevant creation.sh. There are two types: one for datasets where the .its files need to be anonymized (example from rague: https://gin.g-node.org/EL1000/rague/src/main/scripts/creation.sh), and one for datasets where the .its files do NOT need to be anonymized (example from lyon: https://gin.g-node.org/EL1000/lyon/src/main/scripts/creation.sh).
- Also get a copy of the two .py files from lyon/scripts/:
  - import.py (https://gin.g-node.org/EL1000/lyon/src/main/scripts/import.py)
  - metadata.py (https://gin.g-node.org/EL1000/lyon/src/main/scripts/metadata.py)
- Put these three files (creation.sh, metadata.py, import.py) inside your own scripts/ folder.

Next we'll run one group of lines at a time within creation.sh, checking after each group whether you get any errors. I'll be following the version that requires anonymization (the rague example: https://gin.g-node.org/EL1000/rague/src/main/scripts/creation.sh).

Start with SECTION ONE: CREATING LOCAL COPIES OF FILES. Adapt the paths so that you create the folders you need (with or without confidential in the path) and make local copies of the files in the folders you want (again, with or without confidential).

- metadata: locate your original metadata file. We'll make a backup copy inside your newly-minted datalad dataset, so that there is a record of your original metadata. Put it in metadata/confidential/original/ if it contains anything confidential, or in metadata/original/ otherwise.
- its: locate your .its files, and note whether you need a wildcard expression to find them all (e.g., if you have several .its files spread over several folders). Decide also whether you want to anonymize the .its files. If you do, plan on copying them to annotations/its/confidential/raw/; otherwise, your local copies can go in annotations/its/raw/.
- extra: note any other files in your original dataset that you do not want to lose, such as notes files or records of other tasks. We'll just make a copy of those for backup purposes; you don't need to do anything in particular with them.
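As a rough illustration of the folder layout described above, here is a minimal Python sketch that makes the local copies. It is not what creation.sh does (that is a shell script); the function name `make_local_copies`, its parameters, and the source locations are hypothetical, and only the destination folders come from the instructions.

```python
import shutil
from pathlib import Path


def make_local_copies(dataset, metadata_file, its_root, confidential=True):
    """Copy the original metadata and every .its file into the dataset.

    Hypothetical helper: only the destination folders are taken from the
    instructions; everything else is an assumption for illustration.
    """
    dataset = Path(dataset)
    if confidential:
        metadata_dir = dataset / "metadata" / "confidential" / "original"
        its_dir = dataset / "annotations" / "its" / "confidential" / "raw"
    else:
        metadata_dir = dataset / "metadata" / "original"
        its_dir = dataset / "annotations" / "its" / "raw"

    # Create the destination folders before copying into them.
    metadata_dir.mkdir(parents=True, exist_ok=True)
    its_dir.mkdir(parents=True, exist_ok=True)

    shutil.copy2(metadata_file, metadata_dir)

    # rglob finds .its files spread over several sub-folders.
    copied = []
    for its in sorted(Path(its_root).rglob("*.its")):
        shutil.copy2(its, its_dir / its.name)
        copied.append(its.name)
    return copied
```

Note that this sketch flattens the sub-folders, so two .its files with the same name in different folders would overwrite each other; check for that before relying on anything like it.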

Some errors that could arise here are:

- You have spaces in your path. Avoid those! Put your files somewhere where the paths won't contain spaces.
- You didn't create the folders you needed (e.g., you forgot to create the folder with confidential in the path).
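Both of these problems can be caught before running anything. The following is a small sketch of such a check; `check_paths` and its arguments are hypothetical names, not something provided by creation.sh or ChildProject.

```python
from pathlib import Path


def check_paths(source_paths, required_dirs):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    for path in source_paths:
        # Spaces in paths are the first error mentioned above.
        if " " in str(path):
            problems.append(f"space in path: {path}")
    for folder in required_dirs:
        # A missing destination folder is the second one.
        if not Path(folder).is_dir():
            problems.append(f"missing folder: {folder}")
    return problems
```

You would call it with the source paths you plan to copy from and the destination folders you expect to exist, and only proceed when it returns an empty list.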

Next we move on to SECTION TWO: CREATE METADATA. Before you adapt and run the command, first open your original metadata, as well as metadata.py. There are two areas where you need to adapt things inside metadata.py:

From around line 35 onward, there are columns that need to be renamed. You can read about the standardized names of the children.csv and recordings.csv columns here (https://childproject.readthedocs.io/en/latest/format.html#metadata). Make sure you map your columns to these standardized names as much as possible.

Then, at line 155, there is the list of columns that are going to be dropped from children.csv, because they pertain to recordings rather than to children and/or because they contain potentially identifying information. Remove from this example any columns that are not found in your metadata, and add any columns of your own that you want to drop.
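To make the two adaptations concrete, here is a standard-library sketch of the rename-then-drop idea. It is not the code in metadata.py, which may well do this differently; the original column names (`ID`, `DOB`, `file`, `parent_name`), the `RENAME` and `DROP_FROM_CHILDREN` names, and the sample rows are all made up for illustration. Only `child_id`, `child_dob` and `recording_filename` are standardized ChildProject column names.

```python
import csv
import io

# Hypothetical mapping from your own column names (left) to the
# standardized ChildProject names (right).
RENAME = {"ID": "child_id", "DOB": "child_dob", "file": "recording_filename"}

# Hypothetical list of columns to drop from children.csv: one pertains to
# recordings, the other is potentially identifying.
DROP_FROM_CHILDREN = ["recording_filename", "parent_name"]


def build_children(rows):
    """Rename columns, drop unwanted ones, and keep one row per child."""
    children = {}
    for row in rows:
        renamed = {RENAME.get(key, key): value for key, value in row.items()}
        for column in DROP_FROM_CHILDREN:
            renamed.pop(column, None)
        # Keep only the first row seen for each child.
        children.setdefault(renamed["child_id"], renamed)
    return list(children.values())


# Made-up metadata with two recordings for the same child.
original = io.StringIO(
    "ID,DOB,file,parent_name\n"
    "c1,2019-01-02,rec1.its,Alice\n"
    "c1,2019-01-02,rec2.its,Alice\n"
)
children = build_children(csv.DictReader(original))
print(children)
```

Any column that is neither renamed nor dropped passes through unchanged, which is how extra columns end up being reported by the validation step.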

When you've made these changes, run the lines under "SECTION TWO: CREATE METADATA". Pay particular attention to the output under child-project validate . --ignore-files -- are there columns that remain that you intended to remove?

Once the only 'warnings' left are about extra columns that you do want to keep, you can keep going.

Next we move on to SECTION THREE: IMPORT ITS. These commands should work ok. The following are warnings; they are normal and you can ignore them:

scripts/import.py:66: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
input['filter'] = input['recording_filename'].str.extract(r"_([0-9]{1,})$")

could not delete './annotations/its/confidential/converted', as it does not exist (yet?)

could not delete './annotations/its/converted', as it does not exist (yet?)
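For context on the first warning: the line it quotes applies the regular expression `_([0-9]{1,})$` to the `recording_filename` column. The sketch below shows what that pattern captures, using Python's `re` module on made-up names rather than pandas; `trailing_number` is a hypothetical helper, and how import.py prepares the filenames beforehand is not shown here.

```python
import re

# The pattern from the warning: an underscore, then one or more digits,
# anchored at the end of the string.
PATTERN = re.compile(r"_([0-9]{1,})$")


def trailing_number(name):
    """Return the digits after the final underscore, or None if the name
    does not end in an underscore followed only by digits."""
    match = PATTERN.search(name)
    return match.group(1) if match else None
```

With a made-up name such as `e20200101_123456_012345`, this returns `012345`; a name that does not end in underscore-plus-digits yields `None`.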

The same goes for SECTION FOUR: WRAP UP: the commands should work without any trouble. You may get warnings about columns, as you did in section two.


I'm working on rague and I'm up to the point at which we'd need to create the siblings. .... Text removed to avoid confusion! See my comment below for the full instructions.

Hello,

I recommend you follow the procedures described here (in this case, the EL1000 procedures)

https://github.com/LAAC-LSCP/datalad-procedures

Sorry, I have not updated the docs accordingly yet!

lucasgautheron closed this 3 years ago

lucasgautheron reopened this 3 years ago

This will fetch and save all updates that have been pushed to the subdatasets:

git submodule update --remote --checkout --init
datalad save . -m "subdatasets update"
datalad push
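If you prefer to run these three commands from a script and stop at the first failure, a minimal Python sketch could look like this. The function name `update_subdatasets` and its `run` parameter are hypothetical; the commands themselves are exactly the ones above, and the sketch assumes you run it from the root of the superdataset.

```python
import subprocess

# The three commands from the comment above, in order.
COMMANDS = [
    ["git", "submodule", "update", "--remote", "--checkout", "--init"],
    ["datalad", "save", ".", "-m", "subdatasets update"],
    ["datalad", "push"],
]


def update_subdatasets(run=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in COMMANDS:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed update is never saved or pushed.
        run(command, check=True)
```

The `run` parameter exists only so that the function can be tried out with a stand-in that records the commands instead of executing them.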

