Towards unsupervised metrics of child language development. Yaya Sy

Towards unsupervised metrics of child language competences

Folder structure

  • All source code is located in code/
  • All datasets are located in datasets/:
    • childes_json_corpora/ contains one test corpus per language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: {family : {age : {speaker : utterances} } }
    • opensubtitles_corpora/ contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
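
To illustrate the nested layout {family : {age : {speaker : utterances} } } described above, here is a minimal Python sketch that flattens such a test corpus into one record per utterance. It assumes that the utterances of a speaker are stored as a list of strings. The function name iter_utterances and the toy data are hypothetical and are not part of this repository.

```python
import json
from typing import Iterator, Tuple


def iter_utterances(corpus: dict) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (family, age, speaker, utterance) from the nested corpus layout."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                # Assumption: each speaker maps to a list of utterance strings.
                for utterance in utterances:
                    yield family, age, speaker, utterance


# Hypothetical toy corpus; real files live in datasets/childes_json_corpora/.
# JSON object keys are always strings, so the age is read back as a string.
toy_json = """
{
  "FamilyA": {
    "24": {
      "Target_Child": ["more juice"],
      "Mother": ["do you want more juice", "here you go"]
    }
  }
}
"""

rows = list(iter_utterances(json.loads(toy_json)))
for row in rows:
    print(row)
```

To read a real test corpus, the toy string can be replaced by json.load on an open file handle. The training and development corpora need no such parsing, since they hold one utterance per line.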
  • extra/ contains configuration files. The important ones are:

    • languages_to_download_informations.yaml details all the information needed to construct the training and test data for each language. This file is organized as follows:

      ` language:

          1 - identifier for the espeak backend

          2 - full language name

          3 - speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father] and child=[Target_Child])

          4 - whether to extract the orthography tier

          5 - URLs of the selected corpora for this language`
      
    • markers.json