
Towards unsupervised metrics of child language competences

Folder structure

  • All source code is located in code/
  • All datasets are located in datasets/:
    • childes_json_corpora/ contains one test corpus per language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: {family : {age : {speaker : utterances} } }
    • opensubtitles_corpora/ contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
  • extra/ contains the configuration files. The important ones are:

    • languages_to_download_informations.yaml details all the information needed to construct the training and test data for each language. This file is organized as follows:

          language:
              1 - identifier for the espeak backend
                      
              2 - full language name
                      
              3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
                      
              4 - Whether to extract the orthography tier or not
                      
              5 - The URLs of the selected corpora for this language
      
    • markers.json
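
The nested layout of the CHILDES test corpora described above ({family : {age : {speaker : utterances} } }) can be traversed with a few nested loops. The following is a minimal sketch, not the repository's own loader: the family name, ages, and utterances are made up, and it assumes that `utterances` is a list of strings and that ages appear as string keys (JSON object keys are always strings). The helper `iter_utterances` is a hypothetical name introduced for illustration.

```python
import json

# Hypothetical miniature corpus following the documented nesting:
# {family: {age: {speaker: utterances}}}. All names and values are invented.
corpus_json = """
{
  "FamilyA": {
    "24": {
      "Target_Child": ["more juice", "doggy"],
      "Mother": ["do you want more juice ?"]
    },
    "30": {
      "Target_Child": ["I want the ball"]
    }
  }
}
"""


def iter_utterances(corpus):
    """Yield (family, age, speaker, utterance) tuples from the nested dict."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


corpus = json.loads(corpus_json)

# Keep only the utterances produced by the target child.
child_rows = [row for row in iter_utterances(corpus) if row[2] == "Target_Child"]
for row in child_rows:
    print(row)
```

To read a real test corpus, replace `json.loads(corpus_json)` with `json.load` on an open file from childes_json_corpora/. Flattening the structure into tuples makes it easy to filter by speaker or to group by age afterwards.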