update readme

yaya-sy 1 year ago
parent
commit
55a21c0876
1 changed file with 22 additions and 5 deletions
README.md

@@ -1,7 +1,24 @@
-# Project <insert name>
+# Towards unsupervised metrics of child language competences
 
-## Dataset structure
+## Folder structure
 
-- All inputs (i.e. building blocks from other sources) are located in
-  `inputs/`.
-- All custom code is located in `code/`.
+- All source code is located in `code/`
+- All datasets are located in `datasets/`:
+  - `childes_json_corpora/` contains one test corpus per language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: `{family: {age: {speaker: utterances}}}`
+  - `opensubtitles_corpora/` contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
+- `extra/` contains configuration files. The important ones are:
+    - `languages_to_download_informations.yaml` lists the information needed to build the training and test data for each language. The file is organized as follows:
+    
+        `
+            language:
+                1 - Identifier for the espeak backend
+                
+                2 - full language name
+                
+                3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
+                
+                4 - Whether to extract the orthography tier or not
+                
+                5 - The URLs of the selected corpora for this language
+        `
+    - `markers.json`
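
The nested layout described for the `childes_json_corpora/` files, `{family: {age: {speaker: utterances}}}`, can be walked with three nested loops. The following is only a minimal sketch under assumptions that the README does not confirm: that `utterances` is a list of strings, that ages appear as string keys (JSON object keys are always strings), and that the family name `family_a` and the sample sentences are made up for illustration. The helper `iter_utterances` is hypothetical and is not part of the repository's code.

```python
import json

# Hypothetical sample mirroring {family: {age: {speaker: utterances}}}.
# The family name, ages and sentences are invented for illustration.
sample = json.loads("""
{
  "family_a": {
    "24": {
      "Target_Child": ["more juice"],
      "Mother": ["do you want more juice", "here you go"]
    },
    "30": {
      "Target_Child": ["I want the ball", "look"]
    }
  }
}
""")


def iter_utterances(corpus, speakers=None):
    """Yield (family, age, speaker, utterance) tuples from a nested corpus.

    If `speakers` is given, only utterances from those speakers are yielded.
    """
    for family, ages in corpus.items():
        for age, by_speaker in ages.items():
            for speaker, utterances in by_speaker.items():
                if speakers is not None and speaker not in speakers:
                    continue
                for utterance in utterances:
                    yield family, age, speaker, utterance


# Keep only the child's productions, as one might for a child-side test set.
child_rows = list(iter_utterances(sample, speakers={"Target_Child"}))
for row in child_rows:
    print(row)
```

Flattening the structure into tuples keeps the family and age next to each utterance, which is convenient when results later need to be grouped by child age.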