
update readme

yaya-sy 1 year ago
parent commit 7ac48a81ab
2 changed files with 28 additions and 4 deletions:
  1. README.md (+28 -4)
  2. code/train_language_models.sh (+0 -0)

+ 28 - 4
README.md

@@ -4,10 +4,10 @@
 
 - All source code is located in `code/`
 - All datasets are located in `datasets/` :
-  - `childes_json_corpora/` contains a test corpus for each language. Each test corpus is a json file containing utterances produced by a given speaker from a given family at a given child age : `{family : {age : {speaker : utterances} } }`
-  - `opensubtitles_corpora\` contains a train and development corpora for each language. Each corpus contains one utterance per line.
+  - `datasets/childes_json_corpora/` contains a test corpus for each language. Each test corpus is a JSON file containing the utterances produced by a given speaker from a given family at a given child age: `{family : {age : {speaker : utterances} } }` (see the loading sketch after this list).
+  - `datasets/opensubtitles_corpora/` contains training and development corpora for each language. Each corpus contains one utterance per line.
 - `extra/` contains configuration files. These are important:
-    - `languages_to_download_informations.yaml` details all the information needed to construct the training and test data for each language. This file is organized as follows:
+    - `extra/languages_to_download_informations.yaml` details all the information needed to build the training and test data for each language. This file is organized as follows (see the parsing sketch after this list):
     
         ```
             language:
@@ -21,4 +21,28 @@
                 
                 5 - The urls of the selected corpora for this language   
         ```
-    - `markers.json`
+        
+    - `extra/markers.json` is a JSON file containing the markers and patterns used to clean the CHILDES corpora.
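+
+As a minimal sketch, one of these test corpora can be loaded and traversed like this (the filename `english.json` is a hypothetical example; actual names depend on the language):
+
+```python
+import json
+
+# Load one CHILDES test corpus (hypothetical filename).
+with open("datasets/childes_json_corpora/english.json", encoding="utf-8") as f:
+    corpus = json.load(f)
+
+# The nesting is {family: {age: {speaker: utterances}}}.
+for family, ages in corpus.items():
+    for age, speakers in ages.items():
+        for speaker, utterances in speakers.items():
+            print(family, age, speaker, len(utterances))
+```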
+
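+The configuration file can be read in the same spirit; this sketch assumes PyYAML is installed and relies only on the top-level keys being languages, as shown in the structure above:
+
+```python
+import yaml
+
+# Read the per-language download configuration.
+with open("extra/languages_to_download_informations.yaml", encoding="utf-8") as f:
+    config = yaml.safe_load(f)
+
+# The top-level keys are the languages described in the file.
+print(list(config))
+```
+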
+## Run the experiments
+
+We provide the OpenSubtitles and CHILDES datasets already pre-processed (i.e. cleaned and phonemized). The script `code/download_opensubtitles_corpora.py` was used to download the OpenSubtitles training data, and the script `code/download_childes_corpora.py` was used to download the CHILDES test data.
+
+### Run the training for all languages
+
+The script that runs the training is `code/train_language_models.sh`. It takes the following arguments:
+> `-t` : the folder containing the training corpora for each language (here, `datasets/opensubtitles_corpora/`).
+
+> `-o` : the output folder where the estimated language models will be stored.
+
+> `-k` : the path to the KenLM folder.
+
+> `-n` : the order of the n-grams.
+
+For example, assuming KenLM is installed in the root folder of the project, the script can be run as follows to reproduce our results:
+
+```sh
+code/train_language_models.sh -t datasets/opensubtitles_corpora/tokenized_in_phonemes_train/ -o estimated/ -k kenlm/ -n 5
+```
+
+The trained language models will then be stored in the `estimated/` folder at the root of the project.
+
+## Testing the trained language models on the CHILDES corpora
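+
+As a minimal sketch of this step (not the authors' evaluation code), a trained model could be applied to the CHILDES utterances using the `kenlm` Python bindings, assuming they are installed and that a model file such as `estimated/english.arpa` exists (both names are hypothetical):
+
+```python
+import json
+import kenlm
+
+# Hypothetical paths; actual names depend on the language and the training run.
+model = kenlm.Model("estimated/english.arpa")
+
+with open("datasets/childes_json_corpora/english.json", encoding="utf-8") as f:
+    corpus = json.load(f)
+
+# kenlm returns log10 probabilities; bos/eos add sentence boundary symbols.
+for family, ages in corpus.items():
+    for age, speakers in ages.items():
+        for speaker, utterances in speakers.items():
+            for utterance in utterances:
+                print(family, age, speaker, model.score(utterance, bos=True, eos=True))
+```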

code/train_language_models_cp.sh → code/train_language_models.sh