Towards unsupervised metrics of child language competences
Set up the conda environment
First, create a conda environment with all the required dependencies and packages:
conda env create -f environment.yml
and activate it:
conda activate measuring_cld
You will also need to install KenLM (https://github.com/kpu/kenlm).
We provide all the data already pre-processed and phonemized. If you want to download the raw data again and redo the pre-processing from scratch, you will also need to install phonemizer (https://github.com/bootphon/phonemizer) with the espeak backend.
Folder structure
Run the experiments
We provide the OpenSubtitles and CHILDES datasets already pre-processed (cleaned and phonemized). The script code/download_opensubtitles_corpora.py
was used to download the OpenSubtitles training data, and the script code/download_childes_corpora.py
was used to download the CHILDES testing data.
Run the trainings for all languages
The script to run the training is code/train_language_models.sh
. This script takes the following arguments:
-t
: the folder containing the training corpora for each language (here, the datasets/opensubtitles_corpora folder).
-o
: the output folder where the estimated language models will be stored.
-k
: the path to the KenLM folder.
-n
: the size (order) of the n-grams.
For example, assuming KenLM is installed in the root folder of the project, the script has to be run like this to reproduce our results:
sh code/train_language_models.sh -t datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ -o estimated/ -k kenlm/ -n 5
The trained language models will then be stored in the estimated/
folder.
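To make the training step more concrete, here is a minimal Python sketch of what training one model per language could look like. It is not the content of code/train_language_models.sh. It assumes that training relies on KenLM's lmplz binary, that KenLM was built with CMake so the binary sits at kenlm/build/bin/lmplz, and that each model is written as an ARPA file named after its corpus. The function names and the .arpa naming are hypothetical.

```python
import subprocess
from pathlib import Path


def build_lmplz_jobs(corpus_paths, out_dir, kenlm_dir, ngram_size):
    """Return one (command, corpus, model) triple per training corpus.

    Assumptions, not taken from the repository: the lmplz binary lives at
    <kenlm_dir>/build/bin/lmplz and models are saved as <corpus stem>.arpa.
    """
    lmplz = Path(kenlm_dir) / "build" / "bin" / "lmplz"
    jobs = []
    for corpus in sorted(Path(p) for p in corpus_paths):
        model = Path(out_dir) / (corpus.stem + ".arpa")
        # lmplz reads the corpus on stdin and writes the ARPA model to stdout;
        # -o sets the n-gram order.
        command = [str(lmplz), "-o", str(ngram_size)]
        jobs.append((command, corpus, model))
    return jobs


def run_jobs(jobs):
    """Run each job, wiring the corpus to stdin and the model file to stdout."""
    for command, corpus, model in jobs:
        model.parent.mkdir(parents=True, exist_ok=True)
        with open(corpus, "rb") as src, open(model, "wb") as dst:
            subprocess.run(command, stdin=src, stdout=dst, check=True)
```

With the folders used above, the jobs could be built with build_lmplz_jobs(Path("datasets/opensubtitles_corpora/tokenized_in_phonemes_train/").iterdir(), "estimated/", "kenlm/", 5) and then passed to run_jobs.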
Evaluate the language models on each corresponding language
We can use the script code/evaluate_language_models.py
in order to assess the quality of the language models. The arguments of this script are:
--train_files_directory
: The directory containing the OpenSubtitles training files
--dev_files_directory
: The directory containing the OpenSubtitles test files
--models_directory
: The directory containing the trained language models
If in the previous step you stored the language models in the estimated/
folder, then you can run the script like this:
python code/evaluate_language_models.py --train_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --dev_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_dev/ --models_directory estimated/
This will output an evalution.csv
file in a results
folder.
Testing the trained language models on the CHILDES corpora
We can now compute the entropies on the CHILDES utterances with the script code/test_on_all_languages.py
. This script takes the following arguments:
--train_directory
: The directory containing the train files tokenized in phonemes.
--models_directory
: The directory containing the trained language models.
--json_files_directory
: The directory containing CHILDES utterances in json format for each language.
--add_noise
, --no-add_noise
: Whether or not to add noise to the CHILDES utterances.
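A flag pair of the form --add_noise / --no-add_noise is what Python's argparse.BooleanOptionalAction (Python 3.9+) generates. The sketch below shows how such a pair behaves. Whether code/test_on_all_languages.py is implemented this way, and what its default value is, are assumptions; the default of False here is purely illustrative.

```python
import argparse


def build_parser():
    """Parser with a paired on/off flag (illustrative, not the project's parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--add_noise",
        action=argparse.BooleanOptionalAction,  # also registers --no-add_noise
        default=False,  # assumed default, not confirmed by the source
        help="Whether or not to add noise to the CHILDES utterances.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.add_noise)
```

Passing --add_noise sets args.add_noise to True, passing --no-add_noise sets it to False, and omitting both leaves the default.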
If you stored the language models in the estimated/
folder, then you can run the script like this:
python code/test_on_all_languages.py --word_train_directory datasets/opensubtitles_corpora/tokenized_in_words/ --phoneme_train_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --models_directory estimated/ --json_files_directory datasets/childes_json_corpo
This will output a results.csv
file in a results
folder.
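As an illustration of the quantity computed in this step, the sketch below derives an entropy in bits per token from the log-probabilities a language model assigns to the tokens of one utterance. It assumes base-10 log-probabilities, which is how KenLM reports scores, and assumes the entropy is averaged per token; the exact normalization used by code/test_on_all_languages.py is not stated here, and the function name is hypothetical.

```python
import math


def entropy_bits_per_token(log10_probs):
    """Average negative log2-probability per token of one utterance.

    log10_probs: base-10 log-probabilities, one per token (assumed input format).
    """
    if not log10_probs:
        raise ValueError("the utterance must contain at least one token")
    # Convert each log10 probability to log2, then average and negate.
    total_log2 = sum(lp / math.log10(2) for lp in log10_probs)
    return -total_log2 / len(log10_probs)
```

For example, an utterance whose tokens each receive probability 0.5 has an entropy of 1 bit per token; lower values mean the model finds the utterance more predictable.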
Analyzing and visualizing the results
You can reproduce the plots and analyses by using the analyses_of_results.Rmd
Rmarkdown script.