# URUMETRICS Uruguayan Chatbot Project

## Description

This repository contains the code to extract the metrics and generate the messages of the Uruguayan Chatbot Project.

## Installation

This code should not be used directly but should be embedded in a [datalad](https://www.datalad.org/) repository in [ChildProject](https://childproject.readthedocs.io/en/latest/) format.

To add this repository as a submodule, run the following command:

```bash
mkdir scripts && cd scripts
git submodule add git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git
```

and install the necessary dependencies

```bash
pip install -r requirements.txt
```

## Repository structure

* `acoustic_annotations` contains the code that computes the acoustic annotations
* `turn_annnotations` contains the code that computes the conversational annotations
* `import_data` contains the code that imports the new recordings and new annotations to the data set
* `compute_metrics` contains the code that computes and saves the metrics
* `generate_messages` contains the code that reads the metrics file and generates the messages sent to the families

## Requirements

### Running requirements

All the runnable files **should be run from the root of the data set** (i.e. the directory containing the `scripts` directory). If not, an exception will be raised and the code will stop running.

### Naming convention: recording file names

Recordings should be **WAV files** whose names follow this **naming convention**

```
CHILD-ID_[info1_info2_..._infoX]_YYYYMMDD_HHMMSS.wav
```

where

* `YYYYMMDD` corresponds to the date formatted according to the ISO 8601 format (i.e. `YYYY` for the year, `MM` for the month (from 01 to 12), and `DD` for the day (from 01 to 31))
* `HHMMSS` corresponds to the time formatted according to the ISO 8601 format (`HH` for hours (from 00 to 23, 24-hour clock system), `MM` for minutes (from 00 to 59), and `SS` for seconds (from 00 to 59))
* `[info1_info2_..._infoX]` corresponds to **optional** information (without `[` and `]`) separated by underscores `_`
* `CHILD-ID` may use **any character except the underscore character (`_`)**.

Additional information will be stored in the metadata file `metadata/recordings.csv` in the column `experiment_stage`.

## How to use?

The following instructions explain how to use this code when it is embedded (as a submodule) in a ChildProject project using datalad.

**(0)** Define the following bash variables

```bash
today=$(date '+%Y%m%d')
dataset="URUMETRICS"
```

The content of `dataset` should be the name of the data set you are interested in and should exist in the [LAAC-LSCP GIN repository](https://gin.g-node.org/LAAC-LSCP).

**(1)** Run the following commands to install the data set

```bash
datalad install -r git@gin.g-node.org:/LAAC-LSCP/${dataset}.git
cd ${dataset}
datalad get extra/messages/definition
datalad get extra/metrics && datalad unlock extra/metrics
datalad get metadata && datalad unlock metadata
```

**Note that you will only be allowed to clone and install the data set if you are added as a collaborator on GIN.** Please ask [William](mailto:william.havard@gmail.com) or [Alex](mailto:alecristia@gmail.com) for more information.

**(2)** Prepare the data set by running the following command

```bash
python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py
```

This will create the directories required by ChildProject if they do not exist.

**(3)** Place the new recordings in `dat/in/recordings/raw`

**(4)** Place the VTC, VCM, and ALICE annotation files in their respective folder in `dat/in/annotations/{vtc|vcm|alice}/raw`

Note that the annotation files should have **unique names** (e.g. like the acoustic annotations) and **should by no means overwrite the files already present** in the aforementioned directories.
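
Before saving the data set, it can help to check that the new recording file names follow the naming convention described above, since the import of recordings relies on it. The following is only a minimal sketch and is not part of this repository: the regular expression and the function name `parse_recording_name` are assumptions derived from the convention as written in this README, and the import script may apply different or stricter checks.

```python
import re
from datetime import datetime

# Hypothetical pattern derived from the naming convention in this README:
# CHILD-ID_[info1_info2_..._infoX]_YYYYMMDD_HHMMSS.wav
RECORDING_NAME = re.compile(
    r"^(?P<child_id>[^_]+)"   # CHILD-ID: any character except '_'
    r"(?:_(?P<info>.+))?"     # optional information fields separated by '_'
    r"_(?P<date>\d{8})"       # YYYYMMDD
    r"_(?P<time>\d{6})"       # HHMMSS
    r"\.wav$"
)


def parse_recording_name(filename):
    """Return the parsed fields of a recording file name, or None if invalid."""
    match = RECORDING_NAME.match(filename)
    if match is None:
        return None
    try:
        # strptime rejects impossible values such as month 13 or hour 25
        start = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return None
    info = match["info"].split("_") if match["info"] else []
    return {"child_id": match["child_id"], "info": info, "start": start}
```

For instance, `parse_recording_name("CHI01_20221301_093000.wav")` returns `None` because month 13 is not a valid date. Note that a child ID wrongly containing an underscore cannot be detected this way, as its second part would simply be read as optional information.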

**(5)** Save the data set and push the new annotations to GIN

```bash
datalad save recordings -m "Added new recordings for date ${today}"
datalad save annotations/*/raw -m "Added raw annotations for date ${today}"
datalad push --to origin
```

This is a very important step: it pushes the new recordings and annotations before any script that could potentially fail is run.

**(6)** Import the new recordings

```bash
python -u scripts/URUMETRICS-CODE/import_data/import_recordings.py --experiment Uruguayan_Chatbot_2022
```

This command will look at the new recordings found in the `raw` directory and add them to the metadata file `metadata/recordings.csv`. If some recordings belong to previously unknown children, these children will be added to the metadata file `metadata/children.csv`.

Note that the recording file names **should comply with the file naming convention described above**!

**(7)** Run the following commands to convert and import the annotations to the ChildProject format:

```bash
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type VTC --annotation-file VTC_FILE.rttm
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type VCM --annotation-file VCM_FILE.vcm
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type ALICE --annotation-file ALICE_FILE.txt
```

This will import the VTC, VCM, ALICE and ACOUSTIC annotations contained in the specified files. **Note that you shouldn't specify the full path to the file, but only its raw filename and extension.**

This script can also take the additional (optional) parameter `--recording`. When used, only the annotations pertaining to the specified recording (`filename.wav`) will be imported. This can be useful when you need to import the annotations for a specific recording only, rather than the annotations for all the recordings.

You may also use the option `--recordings-from-annotation-file` (incompatible with the previous one) to import only the annotations contained in `--annotation-file` for the recordings that also appear in the annotation file given to `--recordings-from-annotation-file` (i.e. the annotations of `--annotation-file` are filtered by the recordings of `--recordings-from-annotation-file`).

You may now save and push. **Do not forget to unlock the `metadata` directory afterwards!**

```bash
datalad save annotations/*/converted -m "Converted RAW annotations for date ${today}"
datalad save metadata -m "Imported RAW annotations for date ${today}"
datalad push --to origin
datalad unlock metadata
```

**(8)** Compute the conversational annotations using the following command:

```bash
python scripts/URUMETRICS-CODE/compute_annotations/compute_derived_annotations.py --annotation-type CONVERSATIONS --save-path annotations/conversations/raw/
```

This command will only compute the conversational annotations for the newly imported VTC files.

**(9)** Import the conversational annotations using the following command

```bash
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type CONVERSATIONS --annotation-file CONVERSATIONS_VTC_${today}.csv --recordings-from-annotation-file VTC_${today}.rttm
```

The name given to `--annotation-file` is the name of the VTC file prefixed by `CONVERSATIONS_`.

⚠️ You may need to import more than one file. Annotations are computed for all the files that are missing this type of annotation, so more than one output file may be generated if annotations were also computed for files imported earlier.
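
When several derived files have to be imported, the mapping between a derived annotation file and its VTC file can be scripted. The following is a minimal sketch under stated assumptions: it takes the file names to follow the pattern of the command above (a `.csv` file named after the VTC `.rttm` file with a `CONVERSATIONS_` prefix), the helper name `import_command` is hypothetical, and it only builds the command without running it.

```python
from pathlib import Path

# Path of the import script, as used in the commands of this README
IMPORT_SCRIPT = "scripts/URUMETRICS-CODE/import_data/import_annotations.py"


def import_command(derived_file, annotation_type="CONVERSATIONS"):
    """Build the import command for one derived annotation file.

    Assumption: the derived file is named '<TYPE>_<VTC stem>.csv' and the
    matching VTC file is '<VTC stem>.rttm', as in the example command.
    """
    name = Path(derived_file).name
    prefix = annotation_type + "_"
    if not name.startswith(prefix) or not name.endswith(".csv"):
        raise ValueError(f"unexpected file name: {name}")
    # Strip the prefix and the '.csv' extension to recover the VTC stem
    vtc_file = name[len(prefix):-len(".csv")] + ".rttm"
    return [
        "python", "-u", IMPORT_SCRIPT,
        "--annotation-type", annotation_type,
        "--annotation-file", name,
        "--recordings-from-annotation-file", vtc_file,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the root of the data set, once per generated file.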

**(10)** Extract the acoustic annotations using the following command

```bash
python scripts/URUMETRICS-CODE/compute_annotations/compute_derived_annotations.py --annotation-type ACOUSTIC --save-path annotations/acoustic/raw/ --target-sr 16000
```

This command will compute acoustic annotations (mean pitch, pitch range) for the files that do not have any acoustic annotation yet (provided that both the VTC annotations and the recordings are available).

Then, similarly, import these annotations using this command

```bash
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type ACOUSTIC --annotation-file ACOUSTIC_VTC_${today}.csv --recordings-from-annotation-file VTC_${today}.rttm
```

As for conversational annotations, the name given to `--annotation-file` is the name of the VTC file prefixed by `ACOUSTIC_`.

⚠️ You may need to import more than one file. Annotations are computed for all the files that are missing this type of annotation, so more than one output file may be generated if annotations were also computed for files imported earlier.

You may now save and push. **Do not forget to unlock the `metadata` directory afterwards!**

```bash
datalad save annotations/*/raw -m "Imported DERIVED RAW annotations for date ${today}"
datalad save annotations/*/converted -m "Converted DERIVED annotations for date ${today}"
datalad save metadata -m "Imported DERIVED CONVERTED annotations for date ${today}"
datalad unlock metadata
```

**(11)** Run the following command to compute ACLEW metrics as well as the additional metrics defined in `compute_metrics/metrics_functions.py`:

```bash
python -u scripts/URUMETRICS-CODE/compute_metrics/metrics.py
```

This command will generate a file `metrics.csv` which will be stored in `extra/metrics`. If the file already exists, new lines will be appended.

Note that the metrics are only computed for newly imported recordings and not for all the files. If no annotations are linked to the new files (e.g.
you forgot to import them) the columns will be empty.

**(12)** Generate the messages using the following command

```bash
python -u scripts/URUMETRICS-CODE/generate_messages/messages.py [--date YYYYMMDD]
```

This command will create a file in `extra/messages/generated` with the following name pattern `messages_YYYYMMDD.csv`

The file will contain the messages that correspond to each new audio file. The date parameter is used to specify the date for which to generate messages. If the date is before the current date, only the recordings available at the specified date will be considered to generate the messages. This makes it possible to regenerate past messages if needed. If no date is specified, the current date is used.

**Do something with the generated message file**

**(13)** Save the data set and push everything to GIN

```bash
datalad save extra/metrics -m "Computed new metrics for date ${today}"
datalad save extra/messages/generated -m "Message generated for date ${today}"
datalad save .
datalad push --to origin
```

**(14)** Uninstall the data set

```bash
git annex dead here
datalad push --to origin
cd ..
datalad remove -d ${dataset}
```

## Return codes

Every command returns either a `0` (i.e. no problem) or `1` (i.e. problem) return code. They might print information, warning and error messages to STDERR.

## Test

TO DO!

## Version Requirements

* VTC: [66f87c2a8cef25c80c9d9b91f4023ab4757413da](https://github.com/MarvinLvn/voice-type-classifier/tree/66f87c2a8cef25c80c9d9b91f4023ab4757413da)
* VCM: [37e27e75c613ef78f375ff43f1d69940b02d0713](https://github.com/LAAC-LSCP/vcm/tree/37e27e75c613ef78f375ff43f1d69940b02d0713)
* Alice: [f7962f46615a6a433f0da5398f61282d9961c101](https://github.com/orasanen/ALICE/tree/f7962f46615a6a433f0da5398f61282d9961c101)
* Conversations: [1f89a3e818df6e4e2cf87c0c17e7cfe31867e84e](https://github.com/LAAC-LSCP/conversations/tree/1f89a3e818df6e4e2cf87c0c17e7cfe31867e84e)