Towards unsupervised metrics of child language development. Yaya Sy


Towards unsupervised metrics of child language competences

Folder structure

All source code is located in code/.
All datasets are located in datasets/:
- datasets/childes_json_corpora/ contains one test corpus per language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: {family : {age : {speaker : utterances} } }
- datasets/opensubtitles_corpora/ contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
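
As a minimal sketch, the nested layout of a test corpus can be traversed in Python as shown below. It assumes that the utterances of a speaker are stored as a list of strings, which this README does not specify. The sample is built in memory and its family, age and utterance values are made up; it does not reproduce a real file from datasets/childes_json_corpora/.

```python
import json

# Hypothetical sample mirroring {family: {age: {speaker: utterances}}}.
# The family name, age value and utterances are invented for illustration.
sample = json.loads("""
{
  "family_a": {
    "24": {
      "Target_Child": ["utterance one", "utterance two"],
      "Mother": ["utterance three"]
    }
  }
}
""")


def iter_utterances(corpus):
    """Yield (family, age, speaker, utterance) tuples from the nested dict."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


rows = list(iter_utterances(sample))
for row in rows:
    print(row)
```

Flattening the structure this way makes it straightforward to group utterances by age or by speaker afterwards.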

extra/ contains configuration files. The important ones are:

extra/languages_to_download_informations.yaml details all the information needed to build the training and test data for each language. This file is organized as follows:

    language:
        1 - identifier for the espeak backend
                
        2 - full language name
                
        3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
                
        4 - Whether to extract the orthography tier or not
                
        5 - The urls of the selected corpora for this language
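
The five fields above can be pictured as one small record per language. The sketch below is only an illustration in Python: the attribute names (`espeak_id`, `full_name`, `speakers`, `extract_orthography`, `urls`), the sample values and the placeholder URL are hypothetical and are not taken from the actual YAML file, whose real key names may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LanguageEntry:
    """Hypothetical in-memory view of one language entry."""
    espeak_id: str                  # 1 - identifier for the espeak backend
    full_name: str                  # 2 - full language name
    speakers: Dict[str, List[str]]  # 3 - speakers used for the CHILDES test corpus
    extract_orthography: bool       # 4 - whether to extract the orthography tier
    urls: List[str] = field(default_factory=list)  # 5 - URLs of the selected corpora


# Sample values are assumptions made for this example only.
entry = LanguageEntry(
    espeak_id="en-us",
    full_name="English",
    speakers={"adults": ["Mother", "Father"], "child": ["Target_Child"]},
    extract_orthography=True,
    urls=["https://example.org/placeholder-corpus.zip"],
)
print(entry.speakers["child"])
```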

extra/markers.json is a JSON file containing the markers and patterns used to clean the CHILDES corpora.

Run the experiments

We provide the OpenSubtitles and CHILDES datasets already pre-processed (i.e., cleaned and phonemized). The script code/download_opensubtitles_corpora.py was used to download the OpenSubtitles training data, and the script code/download_childes_corpora.py was used to download the CHILDES test data.

Run the trainings for all languages

The script that runs the training is code/train_language_models.sh. It takes the following arguments:

-t : the folder containing the training corpora for each language (here, the opensubtitles_corpora folder).

-o : the output folder where the estimated language models will be stored.

-k : path to the kenlm folder

-n : the size of the ngrams
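
To make the -n argument concrete, the sketch below lists the n-grams of size 5 found in one tokenized utterance. It assumes that the phonemes of an utterance are separated by spaces, which the README does not state explicitly, and it uses placeholder tokens instead of real phonemes. It only illustrates what an n-gram is and is not how the language models themselves are estimated.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Placeholder tokens standing in for the phonemes of one utterance.
utterance = "a b c d e f"
tokens = utterance.split()

five_grams = ngrams(tokens, 5)
for gram in five_grams:
    print(gram)
```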

For example, assuming KenLM is installed in the root folder of the project, our results can be reproduced by running the script as follows:

sh code/train_language_models.sh -t datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ -o estimated/ -k kenlm/ -n 5

The trained language models will then be stored in the estimated/ folder at the root of the project.
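
If the training has to be launched from Python, for instance to try several n-gram sizes, the command can be assembled as in the sketch below. The helper `build_training_command` is hypothetical and not part of the repository; it only rebuilds the command line documented in this README with the -t, -o, -k and -n arguments.

```python
import shlex


def build_training_command(train_dir, out_dir, kenlm_dir, ngram_size):
    """Return the argument list for code/train_language_models.sh."""
    return [
        "sh", "code/train_language_models.sh",
        "-t", train_dir,        # folder with the training corpora
        "-o", out_dir,          # output folder for the estimated models
        "-k", kenlm_dir,        # path to the kenlm folder
        "-n", str(ngram_size),  # size of the n-grams
    ]


cmd = build_training_command(
    "datasets/opensubtitles_corpora/tokenized_in_phonemes_train/",
    "estimated/",
    "kenlm/",
    5,
)
# Print the command instead of running it; to launch it from the project
# root, one could call subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```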
