code/
datasets/:
childes_json_corpora/
contains a test corpus for each language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age, nested as {family: {age: {speaker: utterances}}}
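As a rough illustration of the nesting described above, the sketch below walks a small, made-up corpus. The family name, ages, and utterances are invented, and it assumes that each `utterances` value is a list of strings, which the description does not state explicitly. Ages appear as strings because JSON object keys are always strings.

```python
import json

# Hypothetical miniature corpus following the documented nesting:
# {family: {age: {speaker: utterances}}}
# All names and utterances here are invented for illustration.
example = json.loads("""
{
  "family_a": {
    "24": {
      "Mother": ["look at the ball", "where is it"],
      "Target_Child": ["ball"]
    },
    "30": {
      "Father": ["do you want more"]
    }
  }
}
""")


def iter_utterances(corpus):
    """Yield (family, age, speaker, utterance) tuples from a nested corpus."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                # Assumption: utterances is a list of strings.
                for utterance in utterances:
                    yield family, age, speaker, utterance


def count_by_speaker(corpus):
    """Count how many utterances each speaker produced across the corpus."""
    counts = {}
    for _, _, speaker, _ in iter_utterances(corpus):
        counts[speaker] = counts.get(speaker, 0) + 1
    return counts
```

For an actual test corpus, `example` would be replaced by the result of `json.load` on one of the files in this folder.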
opensubtitles_corpora/
contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
extra/
contains configuration files. The important ones are:
languages_to_download_informations.yaml
details all the information needed to construct the training and test data for each language. This file is organized as follows:
language:
1 - Identifier for the espeak backend
2 - Full language name
3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father] and child=[Target_Child])
4 - Whether to extract the orthography tier
5 - URLs of the selected corpora for this language
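To make the five fields above concrete, here is a minimal sketch of how one language entry might be represented once loaded. The class name, attribute names, and all values are hypothetical: the actual keys used in the YAML file are not shown here, and the URL is a placeholder, not a real corpus location.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LanguageConfig:
    """Hypothetical in-memory view of one language entry (names are assumptions)."""
    espeak_id: str                  # 1 - identifier for the espeak backend
    full_name: str                  # 2 - full language name
    speakers: Dict[str, List[str]]  # 3 - speakers for the CHILDES test corpus
    extract_orthography: bool       # 4 - whether to extract the orthography tier
    urls: List[str]                 # 5 - URLs of the selected corpora


def selected_speakers(config: LanguageConfig) -> List[str]:
    """Flatten the speaker groups into a single list of speaker roles."""
    return [name for group in config.speakers.values() for name in group]


# Illustrative values only; the URL is a placeholder.
english = LanguageConfig(
    espeak_id="en-us",
    full_name="English",
    speakers={"adults": ["Mother", "Father"], "child": ["Target_Child"]},
    extract_orthography=True,
    urls=["https://example.org/placeholder-corpus.zip"],
)
```

Grouping the speakers by role keeps the adult and child sets separate, mirroring the adults/child distinction mentioned in item 3.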
markers.json