Towards unsupervised metrics of child language development. Yaya Sy

README.md

Towards unsupervised metrics of child language competences

Set up the conda environment

First, create a conda environment with all the required dependencies and packages:

conda env create -f environment.yml

and activate it:

conda activate measuring_cld

You will also need to install KenLM (https://github.com/kpu/kenlm).

We provide all the data already pre-processed and phonemized. If you want to download the raw data again and pre-process it from scratch, you will also need to install phonemizer (https://github.com/bootphon/phonemizer) with the espeak backend.

Folder structure

  • All source code is located in code/
  • All datasets are located in datasets/:
    • datasets/childes_json_corpora/ contains a test corpus for each language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: {family : {age : {speaker : utterances}}}
    • datasets/opensubtitles_corpora/ contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
  • extra/ contains configuration files. The important ones are:

    • extra/languages_to_download_informations.yaml details all the information needed to build the training and test data for each language. For each language, the following information is given:

      • The language's identifier for the espeak backend
      • The full name of the language
      • Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
      • Whether to extract the orthography tier or not
      • The urls of the selected corpora for this language
    • extra/markers.json is a JSON file containing the markers and patterns to be removed from the CHILDES corpora
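
As an illustration of the nested layout of the CHILDES test corpora, {family : {age : {speaker : utterances}}}, here is a minimal Python sketch that walks over such a structure. The family name, ages and utterances in the sample are invented for the example, and storing the utterances as a list of strings is an assumption; only the nesting order and the speaker labels come from the description above.

```python
import json

# Hypothetical sample mimicking {family: {age: {speaker: utterances}}}.
# The family name, ages and utterances are made up for illustration.
SAMPLE = """
{
  "family_a": {
    "24": {
      "Target_Child": ["d o g", "m o r"],
      "Mother": ["w e r i z d o g"]
    },
    "30": {
      "Target_Child": ["d a d o g"]
    }
  }
}
"""


def iter_utterances(corpus):
    """Yield (family, age, speaker, utterance) tuples from the nested dict."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


corpus = json.loads(SAMPLE)
rows = list(iter_utterances(corpus))
# Keep only the utterances produced by the target child.
child_rows = [row for row in rows if row[2] == "Target_Child"]
print(len(rows), "utterances,", len(child_rows), "from the target child")
```

Flattening the structure into tuples like this makes it easy to filter by speaker or to group utterances by age before scoring them.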

Run the experiments

We provide the OpenSubtitles and CHILDES datasets already pre-processed (i.e., cleaned and phonemized). The script code/download_opensubtitles_corpora.py was used to download the OpenSubtitles training data, and the script code/download_childes_corpora.py was used to download the CHILDES test data.

Run the trainings for all languages

The script to run the training is code/train_language_models.sh. This script takes the following arguments:

-t : the folder containing the training corpora for each language (here, the `datasets/opensubtitles_corpora` folder).

-o : the output folder where the estimated language models will be stored.

-k : the path to the KenLM folder

-n : the order of the n-grams

For example, assuming that KenLM is installed in the root folder of the project, run the script as follows to reproduce our results:

sh code/train_language_models.sh -t datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ -o estimated/ -k kenlm/ -n 5

The trained language models will then be stored in the estimated/ folder.
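
To make the -n argument concrete: a model of order 5 conditions each phoneme on the four preceding ones. The sketch below lists the 5-grams of one phoneme-tokenized utterance. It assumes that phonemes are separated by spaces and uses the conventional <s> and </s> boundary symbols; the example utterance is invented, and this is not the code that KenLM runs internally.

```python
def ngrams(utterance, n):
    """Return the n-grams of a space-separated utterance, with boundary symbols."""
    tokens = ["<s>"] + utterance.split() + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical phonemized utterance, one phoneme per token.
example = "h e l o w"
for gram in ngrams(example, 5):
    print(" ".join(gram))
```

Utterances shorter than the requested order simply yield no full n-gram in this sketch.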

Evaluate the language models on each corresponding language

We can use the script code/evaluate_language_models.py in order to assess the quality of the language models. The arguments of this script are:

--train_files_directory : The directory containing the OpenSubtitles training files

--dev_files_directory : The directory containing the OpenSubtitles development files

--models_directory : The directory containing the trained language models

If you stored the language models in the estimated/ folder in the previous step, you can run the script as follows:

python code/evaluate_language_models.py --train_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --dev_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_dev/ --models_directory estimated/

This will output an evalution.csv file in a results folder.

Testing the trained language models on the CHILDES corpora

We can now compute the entropies of the CHILDES utterances with the script code/test_on_all_languages.py. This script takes the following arguments:

--train_directory : The directory containing the train files tokenized in phonemes.

--models_directory : The directory containing the trained language models.

--json_files_directory : The directory containing the CHILDES utterances in JSON format for each language.

--add_noise, --no-add_noise : Whether or not to add noise to the CHILDES utterances.

If you stored the language models in the estimated/ folder, you can run the script as follows:

python code/test_on_all_languages.py --word_train_directory datasets/opensubtitles_corpora/tokenized_in_words/ --phoneme_train_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --models_directory estimated/ --json_files_directory datasets/childes_json_corpo

This will output a results.csv file in a results folder.
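
The entropy of an utterance under a language model can be read as the average negative log-probability per predicted token. The repository obtains these probabilities from the KenLM models; the sketch below swaps in a tiny add-one-smoothed bigram model trained on three made-up utterances, only so that the computation is self-contained. The function names, the training data and the choice of bits per token as the unit are assumptions for illustration, not the project's implementation.

```python
import math
from collections import Counter


def train_bigram(utterances):
    """Count bigram contexts and bigrams over space-separated utterances."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for utterance in utterances:
        tokens = ["<s>"] + utterance.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def entropy(utterance, model):
    """Average negative log2 probability per predicted token (add-one smoothing)."""
    context_counts, bigram_counts, vocab = model
    tokens = ["<s>"] + utterance.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
        total -= math.log2(prob)
    return total / (len(tokens) - 1)


# Toy stand-in for a trained model: three invented phonemized utterances.
model = train_bigram(["d o g", "d o t", "g o t"])
seen = entropy("d o g", model)    # sequence attested in the training data
unseen = entropy("t g d", model)  # same phonemes in an unattested order
print(f"seen: {seen:.2f} bits/token, unseen: {unseen:.2f} bits/token")
```

Lower values mean that the utterance is more predictable under the model, so the attested sequence scores below the scrambled one.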

Analyzing and visualizing the results

You can reproduce the plots and analyses with the final_results_analysis.Rmd R Markdown script.

datacite.yml
Title: Unsupervised metrics of child language development
Authors: Sy, Yaya; LSCP; ORCID:0000-0002-0292-451X
Description: Code to extract metrics of child language development from language models based on n-grams.
License: Creative Commons CC 4.0 By (https://creativecommons.org/licenses/by/4.0/)
References: Sy, Y. (2022, July 18). Vers des métriques non supervisées des compétences langagières des enfants. [doi:10.31219/osf.io/4pe2u] (IsSupplementTo)
Funding:
  • ANR, ANR-17-EURE-0017
  • J. S. McDonnell Foundation, Understanding Human Cognition Scholar Award
  • ERC, European Union’s Horizon 2020 research and innovation programme (ExELang, Grant agreement No. 101001095)
Keywords: Language acquisition; Language models; Development; Children; n-gram; Metrics; Entropy
Resource Type: Software