EL1000

Requesting access to the data

The procedure to request access to the data can be found here.

Gaining access to the data

Once your project has been approved, the technical advisor will grant you access to the datasets. Please note that you may not be given access to all of the corpora, either because data donors declined or because you are not a HomeBank member.

Data (including .its files and metadata) have been formatted using the ChildProject package; for an overview of the formatting and structure, see this introduction. We strongly encourage you to build on this structure (i.e., do not move data around and do not make other copies), which will allow you to maintain compatibility with others and increase reproducibility. For an example of how to set up an analysis that relates to datasets like this one, see this example or this one.

To access the data, you'll need to:

  1. Create an account on https://gin.g-node.org/user/sign_up
  2. Give your username to the technical advisor
  3. Follow the instructions to install the ChildProject package and DataLad
  4. Wait until the technical advisor confirms that you have access, then follow the instructions below.

Re-using EL1000 datasets

Requirements

You will first need to install the ChildProject package as well as DataLad. Instructions to install these packages can be found here.

Configuring your SSH key on GIN

This step only needs to be done once.

  1. Copy your SSH public key (usually located in ~/.ssh/id_rsa.pub) to your clipboard. If you don't have one, please create one following these instructions.
  2. In your browser, go to GIN > Your settings > SSH keys.
  3. Click on the blue "Add a key" button, then paste the content of your public key in the Content field, and submit.

Your key should now appear in your list of SSH keys - you can add as many as necessary.
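
If you do not have a key pair yet, step 1 can be sketched as below. This is a minimal sketch that assumes OpenSSH is installed; the helper name `ensure_ssh_key`, the ed25519 key type and the `el1000-access` comment are illustrative choices, not GIN requirements.

```shell
# ensure_ssh_key KEY_FILE
# Hypothetical helper: creates an SSH key pair at KEY_FILE if none exists,
# then prints the public key so that it can be pasted into the GIN form.
ensure_ssh_key() {
    key_file="$1"
    if [ ! -f "$key_file" ]; then
        # Create the parent directory (private to the user) if it is missing.
        mkdir -p -m 700 "$(dirname "$key_file")"
        # -N "" creates the key without a passphrase so that the sketch stays
        # non-interactive; drop the -N option to be prompted for one.
        ssh-keygen -q -t ed25519 -N "" -C "el1000-access" -f "$key_file"
    fi
    cat "$key_file.pub"
}
```

For example, `ensure_ssh_key ~/.ssh/id_ed25519` prints the public key to copy, and leaves an existing key untouched.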

Installing datasets

First, clone the EL1000 superdataset:

datalad install -r git@gin.g-node.org:/LAAC-LSCP/EL1000.git
cd EL1000

To get data from any of the EL1000 datasets (e.g. kidd), cd into it, then run the setup procedure:

cd kidd
datalad run-procedure setup

If you would also like to access the confidential files, do the following instead (note the --confidential flag):

cd kidd
datalad run-procedure setup --confidential

Note: you may not have been given access to all of the corpora, either because data donors declined or because you are not a HomeBank member. If you think you should have access to more corpora, please get in touch with the technical advisor.
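
When you need several datasets, the setup step above can be repeated in a loop. The following is a minimal sketch, to be run from the EL1000 directory; the helper name `setup_datasets` is hypothetical, and it only wraps the `datalad run-procedure setup` command shown above. Datasets you have no access to are reported and skipped rather than stopping the loop.

```shell
# setup_datasets [--confidential] DATASET...
# Hypothetical helper: runs the setup procedure in each listed dataset.
# Returns a non-zero status if at least one dataset could not be set up.
setup_datasets() {
    extra=""
    if [ "$1" = "--confidential" ]; then
        extra="--confidential"
        shift
    fi
    failed=""
    for ds in "$@"; do
        # The subshell keeps the caller's working directory unchanged.
        # $extra is left unquoted on purpose so that it vanishes when empty.
        if (cd "$ds" && datalad run-procedure setup $extra); then
            echo "setup ok: $ds"
        else
            echo "setup failed: $ds" >&2
            failed="$failed $ds"
        fi
    done
    [ -z "$failed" ]
}
```

For example, `setup_datasets kidd lyon` sets up both datasets, and `setup_datasets --confidential kidd` also requests the confidential files.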

Getting data

You can get data from a dataset using the datalad get command, e.g.:

datalad get annotations # get all files under annotations/

Or:

datalad get . # get everything

You can download many files in parallel using the -J (or --jobs) option:

datalad get . -J 4 # get everything, with 4 parallel transfers

For more help with using DataLad, please refer to our cheatsheet or DataLad's own cheatsheet. If this is not enough, check DataLad's documentation and Handbook.

Fetching updates

If you are notified of changes to the data, please retrieve them by issuing the following commands:

datalad update --merge
datalad get .

Removing the data

It is important that you delete the data once your project is complete. This can be done with datalad remove:

datalad remove -r path/to/your/dataset

Data description

Data documentation

Datasets are structured according to the ChildProject package standards detailed here.

Participants

The matrix of how many children are exposed to language X in corpus Y can be found in documentation/languages.csv.

Available annotations

Derived datasets

  • metrics: metrics derived from ACLEW and LENA annotations.
  • reliability: reliability estimations for ACLEW and LENA annotations based on manual annotations.

Maintainers

The EL1000 package

To maintain EL1000 datasets (e.g. to export metadata from .its annotations, or to import annotations), you need the EL1000 package. It can be installed with pip using the following command:

pip install git+ssh://git@gin.g-node.org:/LAAC-LSCP/tools.git --upgrade

How to import new datasets

Instructions to import new datasets can be found here.