  - `datasets/childes_json_corpora/` contains a test corpus for each language. Each test corpus is a json file containing utterances produced by a given speaker from a given family at a given child age: `{family: {age: {speaker: utterances}}}`
  - `datasets/opensubtitles_corpora/` contains a training and a development corpus for each language. Each corpus contains one utterance per line.
- `extra/` contains configuration files. The important ones are:
    - `extra/languages_to_download_informations.yaml` details all the information needed to build the training and test data for each language. For each language, the following information is given (see also the sketch after this list):
        - The language's identifier for the espeak backend
        - The full name of the language
        - The speakers to consider when creating the CHILDES test corpus (in our case, adults=[Mother, Father] and child=[Target_Child])
        - Whether to extract the orthography tier or not
        - The URLs of the selected corpora for this language

    - `extra/markers.json` is a json file containing the markers and patterns to be cleaned from the CHILDES corpora
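
For a quick sanity check of these two files, the snippet below simply loads them and prints their top-level structure. This is only an illustrative sketch, not code from the repository: it assumes PyYAML is installed, and the exact keys inside each language entry are whatever the YAML file itself defines.

```python
# Illustrative sketch (not part of the repository): inspect the two
# configuration files described above.
import json
import yaml  # assumes PyYAML is available

with open("extra/languages_to_download_informations.yaml") as f:
    languages = yaml.safe_load(f)

# One entry per language; each entry holds the espeak identifier, the full
# language name, the speakers to keep, the orthography flag and the corpus
# URLs (under whatever key names the YAML file defines).
for language, info in languages.items():
    print(language, list(info))

with open("extra/markers.json") as f:
    markers = json.load(f)
print(len(markers), "cleaning markers/patterns")
```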
 
 ## Run the experiments
 
We provide the OpenSubtitles and CHILDES datasets already pre-processed (= cleaned).
 
 ### Run the trainings for all languages
 
The script to run the training is `code/train_language_models.sh`. This script takes the following arguments:
> `-t` : the folder containing the training corpora for each language (here, the `datasets/opensubtitles_corpora/` folder).
 
 > `-o` : the output folder where the estimated language models will be stored.
 
> `-k` : path to the KenLM folder
 
> `-n` : the size of the n-grams
 
For example, if we assume that KenLM is installed in the root folder of the project, then to reproduce our results the script has to be run as follows:
 
 ```sh code/train_language_models.sh -t datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ -o estimated/ -k kenlm/ -n 5```
 
Then, the trained language models will be stored in the `estimated/` folder.
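
Once training has finished, one way to sanity-check a model is to load it with the KenLM Python bindings and score an utterance. This is only a sketch: it assumes the `kenlm` Python module is available and that the models are standard ARPA (or binary) files; the file name below is a placeholder, since the actual names written into `estimated/` depend on the training script.

```python
import kenlm

# Placeholder path: the real model file names under estimated/ depend on
# code/train_language_models.sh.
model = kenlm.Model("estimated/english.arpa")

# The utterance must use the same tokenization as the training corpora
# (space-separated phonemes); this string is just an example.
utterance = "h ɛ l oʊ"
print(model.score(utterance, bos=True, eos=True))  # log10 probability
print(model.perplexity(utterance))                 # per-token perplexity
```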

## Evaluate the language models on each corresponding language

We can use the script `code/evaluate_language_models.py` to assess the quality of the language models. The arguments of this script are:

> `--train_files_directory` : The directory containing the OpenSubtitles training files

> `--dev_files_directory` : The directory containing the OpenSubtitles development files

> `--models_directory` : The directory containing the trained language models

If you stored the language models in the `estimated/` folder in the previous step, then you can run the script as follows:

```python code/evaluate_language_models.py --train_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --dev_files_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_dev/ --models_directory estimated/```

This will output an `evalution.csv` file in a `results/` folder.

### Testing the trained language models on the CHILDES corpora

We can now compute the entropies of the CHILDES utterances with the script `code/test_on_all_languages.py`. This script takes the following arguments:

> `--train_directory` : The directory containing the train files tokenized in phonemes.

> `--models_directory` : The directory containing the trained language models.

> `--json_files_directory` : The directory containing the CHILDES utterances in json format for each language.

> `--add_noise`, `--no-add_noise` : Whether to add noise to the CHILDES utterances or not.

If you stored the language models in the `estimated/` folder, then you can run the script as follows:

```python code/test_on_all_languages.py --word_train_directory datasets/opensubtitles_corpora/tokenized_in_words/ --phoneme_train_directory datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ --models_directory estimated/ --json_files_directory datasets/childes_json_corpora/```

This will output a `results.csv` file in the `results/` folder.
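
Before moving on to the R analysis, the output can be inspected quickly with pandas. This is only a sketch: it assumes pandas is installed and that the file sits directly under `results/`; no assumption is made about the columns, which are defined by the script itself.

```python
import pandas as pd

# Load the CSV produced by the test step (path follows the README) and print
# its column layout and first rows.
results = pd.read_csv("results/results.csv")
print(results.columns.tolist())
print(results.head())
```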

## Analyzing and visualizing the results
 
You can reproduce the plots and analyses with the `code/analyses_of_results.Rmd` R Markdown script.