
Update with new commands

William N. Havard 1 year ago
commit 53a8e21004
1 changed files with 26 additions and 38 deletions

README.md

@@ -4,40 +4,36 @@ Uruguayan Chatbot Project
 
 ## Description
 
-This repository contains the code to extract the metrics defined in the Uruguayan Chatbot Project.
+This repository contains the code to extract the metrics and generate the messages of the Uruguayan Chatbot Project.
 
 ## Installation
 
-Clone this repository:
+This code should not be used directly but should be embedded in a [datalad](https://www.datalad.org/) repository in [ChildProject](https://childproject.readthedocs.io/en/latest/) format.
 
+To add this repository as a submodule, run the following command:
 ```bash
-git clone git@github.com:LAAC-LSCP/URUMETRICS.git
+git submodule add git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git
 ```
 
-If you which to install the dependencies directly you can run:
-
+and install the necessary dependencies:
 ```bash
 pip install -r requirements.txt
 ```
 
 ## Repository structure 
 
+* `acoustic_annotations` contains the code that computes the acoustic annotations
+* `import_data` contains the code that imports the new recordings and new annotations to the data set
+* `compute_metrics` contains the code that computes and saves the metrics
+* `generate_messages` contains the code that reads the metrics file and generates the messages sent to the families
+
+## Requirements
 
-* `src` contains the code that is necessary to handle the data
-    * `src/acoustic_annotations` contains the code that compute the acoustic annotations
-    * `src/import_data` contains code that allows to convert the data to the ChildProject format
-    * `src/compute_metrics` contains the code that extract the metrics
-    * `src/generate_messages` is the code that reads the metrics file and generates the messages sent to the families
-* `egs` contains example files that allow you to test your installation
-* `tst` contains the code that allows you to test your installation
-* `dat` is the directory where the input data should be stored
-    * `dat/data_set` is the directory where the input files should be stored (if the directory does not exist, see step 1 of [How to use?](#how-to-use))
-    * `dat/out` is the directory that contains the CSV files of the messages to send to the families
-    * `dat/utility` contains utility files (such as the definition of the messages sent to the families)
+### Running requirements
 
-## Data requirements
+All the runnable files **should be run from the root of the data set** (i.e. the directory containing the `scripts` directory). Otherwise, an exception will be raised and the code will stop running.
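A minimal sketch of such a root check, assuming the root is recognised simply by the presence of a `scripts` directory. The function name `ensure_data_set_root` and the exception type are hypothetical; the actual check in the repository may differ.

```python
from pathlib import Path


def ensure_data_set_root(cwd: Path, marker: str = "scripts") -> Path:
    """Return `cwd` if it looks like the data set root, raise otherwise.

    Assumption: the root is identified by a `scripts` sub-directory.
    """
    if not (cwd / marker).is_dir():
        raise RuntimeError(
            f"{cwd} does not contain a '{marker}' directory; "
            "run this command from the root of the data set."
        )
    return cwd


# Typical use at the top of a runnable file:
#     ensure_data_set_root(Path.cwd())
```

Failing early keeps relative paths such as `./recordings/raw/` from silently resolving against the wrong directory.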
 
-### Recording file names
+### Naming convention: recording file names
 
 Recordings should be **WAV files** whose names follow this **naming convention**:
 
@@ -54,42 +50,34 @@ Additional information will be store in the metadata file `metadata/recordings.c
 
 ## How to use?
 
-1. Set up the `dat` directory by running `python -um src.import_data.prepare_data_set`
-
-
-This command only needs to be run once, however if won't break anything if it is run several times. This commands creates a ChildProject `data_set` directory with several subdirectories:
-* `recordings`: stores the recordings for which we need to run the pipeline
-* `annotations`: contains the annotations pertaining to the recordings
-* `metadata`: stores the metadata of the whole data set (children, recordings, annotations, etc.)
-* `extra`: used to store extra item (used to store `metrics.csv`)
-
-2. Place the recordings in `dat/in/recordings/raw` and run `python -um src.import_data.import_recordings`
+1. Place the recordings in `dat/in/recordings/raw` and run ``python -u scripts/URUMETRICS-CODE/import_data/import_recordings.py --experiment Uruguayan_Chatbot_2022``
 
-This command will look at the new recordings found in the `raw` directory and add them to the metadata file `metadata/recordings.csv`. If some of the recordings belong to previously unknown children, they will be added to the metadata file `metadata/children.csv`.
+This command will look at the new recordings found in the `raw` directory and add them to the metadata file `metadata/recordings.csv`. If some recordings belong to previously unknown children, they will be added to the metadata file `metadata/children.csv`.
 
 Note that the recording file names **should comply with the file naming convention described above**!
 
-4. Compute the acoustic annotations using the following command `python -um src.acoustic_annotations.compute_acoustic_annotations --path-vtc /path/to/vtc/file.rttm --path-recordings /path/to/recordings/raw --save-path /path/where/to/save/the/annotations`.
+2. Extract the acoustic annotations using the following command ``python -u scripts/URUMETRICS-CODE/acoustic_annotations/compute_acoustic_annotations.py --path-vtc ./annotations/vtc/raw/VTC_FILE_FOR_WHICH_TO_DERIVE_ACOUSTIC_ANNOTATIONS_FOR.rttm --path-recordings ./recordings/raw/ --save-path ./annotations/acoustic/raw``.
 
-This command will compute acoustic annotations (mean pitch, pitch range) given *raw* VTC annotations. The output file will be named according to the following pattern `acoustic_annotations_YYYYMMDD_HHMMSS.csv`.
+This command will compute acoustic annotations (mean pitch, pitch range) for the VTC file passed as argument. The output file will be named according to the following pattern `acoustic_annotations_YYYYMMDD_HHMMSS.csv`.
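The naming pattern above can be illustrated with a short sketch. The helper `acoustic_annotations_file_name` is hypothetical, and the sketch assumes the timestamp is the time at which the file is written, which the README does not state.

```python
from datetime import datetime


def acoustic_annotations_file_name(now: datetime) -> str:
    """Build a name following `acoustic_annotations_YYYYMMDD_HHMMSS.csv`."""
    return f"acoustic_annotations_{now:%Y%m%d_%H%M%S}.csv"


# Example: name for a file written on 7 March 2022 at 09:05:01.
example_name = acoustic_annotations_file_name(datetime(2022, 3, 7, 9, 5, 1))
```

Because the timestamp goes down to the second, two runs normally produce distinct file names.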
 
-5. Place the annotations in their respective folder in `dat/in/annotations/{acoustic|vtc|vcm|alice}/raw`
+3. Place the annotations in their respective folders in `dat/in/annotations/{acoustic|vtc|vcm|alice}/raw`
 
 Note that the annotation files should have **unique names** (e.g. like the acoustic annotations) and **should by no means overwrite the files already present** in the aforementioned directories.
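One way to respect the no-overwrite rule when copying files by script is sketched below. The helper `place_annotation` is hypothetical and is not part of the repository; it only shows the idea of refusing to replace an existing file.

```python
import shutil
from pathlib import Path


def place_annotation(source: Path, target_dir: Path) -> Path:
    """Copy `source` into `target_dir`, refusing to overwrite an existing file."""
    target = target_dir / source.name
    if target.exists():
        raise FileExistsError(f"{target} already exists; give the new file a unique name.")
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves the file's modification time.
    return Path(shutil.copy2(source, target))
```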
 
-6. Run the following command `python -um src.import_data.import_annotations` to convert the annotations to the ChildProject format.
+4. Run the following command ``python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py`` to convert the annotations to the ChildProject format.
 
-7. Run the following command `python -um src.compute_metrics.metrics` to compute ACLEW metrics as well as additional metrics defined in `compute_metrics/metrics_functions.py`
+5. Run the following command `python -u scripts/URUMETRICS-CODE/compute_metrics/metrics.py` to compute ACLEW metrics as well as the additional metrics defined in `compute_metrics/metrics_functions.py`
 
-This command will generate a file `metrics.csv` which will be stored in `dat/data_set/extra`. If the file already exists, new lines will be added at the end.
+This command will generate a file `metrics.csv` which will be stored in `extra/metrics`. If the file already exists, new lines will be added at the end.
 
 Note that the metrics are only computed for newly imported recordings and not for all the files. If no annotations are linked to the new files (e.g. you forgot to import them), the columns will be empty.
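The append behaviour described for `metrics.csv` can be sketched as follows. This is an illustration only: the helper `append_metrics` and the column names used in the usage note are hypothetical, and the real implementation may rely on other tools.

```python
import csv
from pathlib import Path
from typing import Dict, List


def append_metrics(metrics_file: Path, rows: List[Dict[str, object]], fieldnames: List[str]) -> None:
    """Append rows to the metrics file, writing the header only on creation."""
    is_new = not metrics_file.exists()
    metrics_file.parent.mkdir(parents=True, exist_ok=True)
    with metrics_file.open("a", newline="") as handle:
        # restval="" leaves a column empty when a row has no value for it,
        # e.g. when no annotation was linked to the recording.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling it twice on the same file, for instance with the placeholder columns `recording` and `voc_chi`, yields one header line followed by the rows of both calls.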
 
-8. Generate the message using the following command `python -um src.generate_messages.messages [--date YYYYMMDD]`
+6. Generate the messages using the following command `python -u scripts/URUMETRICS-CODE/generate_messages/messages.py [--date YYYYMMDD]`
 
-This command will create a file in `dat/out` with the following name pattern `messages_YYYYMMDD.csv`
+This command will create a file in `extra/messages/generated` with the following name pattern `messages_YYYYMMDD.csv`
 
-The file will contain the message that correspond to each new audio file. The date parameter is used to specify the date for which to generate messages. If the date is before the current date, only recording available at the specified date will be considered to generate the messages. This allows to re-generate past messages if needed. If no date is specified, the current date is used.
+The file will contain the messages that correspond to each new audio file. The date parameter specifies the date for which to generate messages. If the date is before the current date, only the recordings available at the specified date will be considered when generating the messages. This allows past messages to be regenerated if needed. If no date is specified, the current date is used.
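The handling of `--date` can be sketched as below. All three helpers are hypothetical, and the sketch assumes that "available at the specified date" means recorded on or before that date, which the README does not spell out. The recording names in the usage note are placeholders.

```python
from datetime import date, datetime
from typing import Dict, List, Optional


def parse_date_argument(value: Optional[str], today: date) -> date:
    """Parse a YYYYMMDD string, falling back to `today` when no date is given."""
    return today if value is None else datetime.strptime(value, "%Y%m%d").date()


def recordings_available_at(recordings: Dict[str, date], target: date) -> List[str]:
    """Keep the recordings dated on or before `target` (assumed meaning of 'available')."""
    return sorted(name for name, recorded in recordings.items() if recorded <= target)


def messages_file_name(target: date) -> str:
    """Build a name following `messages_YYYYMMDD.csv`."""
    return f"messages_{target:%Y%m%d}.csv"
```

With recordings `a.wav` dated 2022-01-10 and `b.wav` dated 2022-02-01, passing `20220115` keeps only `a.wav` and targets the file `messages_20220115.csv`.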
 
 ## Return codes