Towards unsupervised metrics of child language development. Yaya Sy

Towards unsupervised metrics of child language competences

Folder structure

  • All source code is located in code/
  • All datasets are located in datasets/:
    • childes_json_corpora/ contains a test corpus for each language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: {family: {age: {speaker: utterances}}}
    • opensubtitles_corpora/ contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
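
The nested layout of a test corpus can be walked with a few loops. The following is a minimal sketch, not code from this repository: it assumes that `utterances` is a list of strings, and the helper name `iter_utterances`, the family name, the age value and the sentences are all made up for illustration. The speaker labels `Target_Child` and `Mother` are the ones mentioned in this README.

```python
import json
from typing import Dict, Iterator, List, Tuple

# Nested layout described above: {family: {age: {speaker: utterances}}}.
# Assumption: utterances is a list of strings. JSON object keys are always
# strings, so ages are read back as strings.
Corpus = Dict[str, Dict[str, Dict[str, List[str]]]]


def iter_utterances(corpus: Corpus) -> Iterator[Tuple[str, str, str, str]]:
    """Yield one (family, age, speaker, utterance) tuple per utterance."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


# Toy data for illustration only; not taken from CHILDES.
example_json = """
{
  "family_a": {
    "24": {
      "Target_Child": ["want ball"],
      "Mother": ["do you want the ball", "here it is"]
    }
  }
}
"""

corpus = json.loads(example_json)
rows = list(iter_utterances(corpus))
for row in rows:
    print(row)
```

To read a real file, replace `json.loads(example_json)` with `json.load` on an opened file from childes_json_corpora/.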
  • extra/ contains configuration files. The important ones are:

    • languages_to_download_informations.yaml details all the information needed to construct the training and test data for each language. This file is organized as follows:

      ` language:
          1 - identifier for the espeak backend
          2 - full language name
          3 - speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father] and child=[Target_Child])
          4 - whether to extract the orthography tier
          5 - the URLs of the selected corpora for this language`
      
    • markers.json