1 year ago · 55a21c0876
--- a/README.md
+++ b/README.md
@@ -1,7 +1,24 @@
 
				-# Project <insert name>
			
 
				+# Towards unsupervised metrics of child language competences
			
 
				 
			
 
				-## Dataset structure
			
 
				+## Folder structure
			
 
				 
			
 
				-- All inputs (i.e. building blocks from other sources) are located in
			
 
				-  `inputs/`.
			
 
				-- All custom code is located in `code/`.
			
 
				+- All source code is located in `code/`
			
 
				+- All datasets are located in `datasets/`:
			
 
				+  - `childes_json_corpora/` contains a test corpus for each language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: `{family : {age : {speaker : utterances} } }`
			
 
				+  - `opensubtitles_corpora/` contains a training corpus and a development corpus for each language. Each corpus contains one utterance per line.
			
 
				+- `extra/` contains configuration files. The important ones are:
			
 
				+    - `languages_to_download_informations.yaml` details all the information needed to construct the training and test data for each language. This file is organized as follows:
			
 
				+    
			
 
				+        ```
			
 
				+            language:
			
 
				+                1 - identifier for the espeak backend
			
 
				+                
			
 
				+                2 - full language name
			
 
				+                
			
 
				+                3 - Speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father], and child=[Target_Child])
			
 
				+                
			
 
				+                4 - Whether to extract the orthography tier or not
			
 
				+                
			
 
				+                5 - The urls of the selected corpora for this language   
			
 
				+        ```
			
 
				+    - `markers.json`
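
The nested layout described for the CHILDES test corpora, `{family : {age : {speaker : utterances} } }`, can be walked with three nested loops. The following is a minimal sketch, not code from this repository: the helper name `iter_utterances`, the sample data, and the assumption that `utterances` is a list of strings are all hypothetical. Because JSON object keys are always strings, ages come back as strings.

```python
import json


def iter_utterances(corpus):
    """Yield (family, age, speaker, utterance) tuples from a nested corpus dict.

    Assumes the layout {family: {age: {speaker: [utterance, ...]}}}.
    """
    for family, ages in corpus.items():
        for age, speakers in ages.items():
            for speaker, utterances in speakers.items():
                for utterance in utterances:
                    yield family, age, speaker, utterance


# Hypothetical sample standing in for one file of `childes_json_corpora/`.
sample = json.loads("""
{
  "family_a": {
    "24": {
      "Mother": ["look at the dog", "what is that"],
      "Target_Child": ["dog"]
    }
  }
}
""")

rows = list(iter_utterances(sample))

# Keep only the child's utterances, using the speaker label from the config.
child_rows = [row for row in rows if row[2] == "Target_Child"]
```

Flattening to tuples keeps the family, age, and speaker attached to each utterance, which makes it straightforward to filter by speaker role or group by child age later on.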