|
@@ -1,7 +1,24 @@
|
|
|
-# Project <insert name>
|
|
|
+# Towards unsupervised metrics of child language competences
|
|
|
|
|
|
-## Dataset structure
|
|
|
+## Folder structure
|
|
|
|
|
|
-- All inputs (i.e. building blocks from other sources) are located in
|
|
|
- `inputs/`.
|
|
|
-- All custom code is located in `code/`.
|
|
|
+- All source code is located in `code/`.
|
|
|
+- All datasets are located in `datasets/`:
|
|
|
+ - `childes_json_corpora/` contains one test corpus per language. Each test corpus is a JSON file that groups utterances by family, then by child age, then by speaker: `{family : {age : {speaker : utterances} } }`
|
|
|
+ - `opensubtitles_corpora/` contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
|
|
|
+- `extra/` contains configuration files. The important ones are:
|
|
|
+ - `languages_to_download_informations.yaml` details all the information needed to construct the training and test data for each language. This file is organized as follows:
|
|
|
+
|
|
|
+ `
|
|
|
+ language:
|
|
|
+ 1 - identifier for the espeak backend
|
|
|
+
|
|
|
+ 2 - full language name
|
|
|
+
|
|
|
+ 3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
|
|
|
+
|
|
|
+ 4 - Whether to extract the orthography tier or not
|
|
|
+
|
|
|
+ 5 - The URLs of the selected corpora for this language
|
|
|
+ `
|
|
|
+ - `markers.json`
|
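The nested layout described for the `childes_json_corpora/` test corpora, `{family : {age : {speaker : utterances} } }`, can be walked with a few nested loops. The following is a minimal sketch, not the project's own loader: it assumes that `utterances` is a list of strings and that ages are stored as JSON object keys (and are therefore strings). The helper name `iter_utterances` and the sample family, age, and utterances are hypothetical.

```python
import json
from typing import Dict, Iterator, List, Tuple

# Assumed shape: {family: {age: {speaker: [utterance, ...]}}}
Corpus = Dict[str, Dict[str, Dict[str, List[str]]]]


def iter_utterances(corpus: Corpus) -> Iterator[Tuple[str, str, str, str]]:
    """Yield one (family, age, speaker, utterance) tuple per utterance."""
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


# Hypothetical in-memory sample standing in for a real corpus file.
sample: Corpus = json.loads(
    '{"FamilyA": {"24": {"Mother": ["look at the dog"],'
    ' "Target_Child": ["dog", "doggy go"]}}}'
)

rows = list(iter_utterances(sample))

# Keep only the child's productions (speaker label taken from the README).
child_rows = [row for row in rows if row[2] == "Target_Child"]
```

For a real file, `json.load` on an open file handle would replace the `json.loads` call; the path and file name depend on the language and are not specified here.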