Uruguayan Chatbot Project - Code

URUMETRICS

Uruguayan Chatbot Project

Description

This repository contains the code that extracts the metrics and generates the messages of the Uruguayan Chatbot Project.

Installation

This code should not be used directly but should be embedded in a datalad repository in ChildProject format.

To add this repository as a submodule, run the following commands:

mkdir scripts && cd scripts
git submodule add git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git

Then install the necessary dependencies:

pip install -r requirements.txt

Repository structure

acoustic_annotations contains the code that computes the acoustic annotations
import_data contains the code that imports the new recordings and new annotations to the data set
compute_metrics contains the code that computes and saves the metrics
generate_messages contains the code that reads the metrics file and generates the messages sent to the families

Requirements

Running requirements

All runnable files should be run from the root of the data set (i.e. the directory containing the scripts directory). Otherwise, an exception is raised and the code stops running.
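A minimal sketch of such a check is shown below. It assumes the root is recognised by the presence of the scripts directory created during installation; the function name `ensure_dataset_root` is hypothetical and the actual code may use different criteria.

```python
from pathlib import Path


def ensure_dataset_root(cwd: Path, scripts_dir: str = "scripts") -> None:
    """Raise if `cwd` does not look like the root of the data set.

    Assumption: the root is the directory that contains `scripts_dir`.
    """
    if not (cwd / scripts_dir).is_dir():
        raise RuntimeError(
            f"{cwd} does not contain a '{scripts_dir}' directory; "
            "run this command from the root of the data set."
        )


if __name__ == "__main__":
    # Check the directory the command was launched from.
    ensure_dataset_root(Path.cwd())
```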

Naming convention: recording file names

Recordings should be WAV files whose names follow this convention:

CHILD-ID_[info1_info2_..._infoX]_YYYYMMDD_HHMMSS.wav

where

YYYYMMDD corresponds to the date formatted according to the ISO 8601 format (i.e. YYYY for the year, MM for the month (from 01 to 12), and DD for the day (from 01 to 31)),
HHMMSS corresponds to the time formatted according to the ISO 8601 format (HH for hours (from 00 to 23, 24-hour clock system), MM for minutes (from 00 to 59), and SS for seconds (from 00 to 59)),
[info1_info2_..._infoX] corresponds to optional information (written without [ and ]) separated by underscores (_), and
CHILD-ID may use any character except the underscore character (_).

The optional information will be stored in the metadata file metadata/recordings.csv in the column experiment_stage.
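The convention above can be checked with a regular expression. The following is only an illustrative sketch, not the parser used by this repository: `RecordingName`, `parse_recording_name` and the example file name are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple


class RecordingName(NamedTuple):
    child_id: str
    info: list[str]
    start: datetime


# CHILD-ID contains no underscore; the optional info fields sit between
# CHILD-ID and the date, separated by underscores.
_PATTERN = re.compile(
    r"^(?P<child>[^_]+)"
    r"(?:_(?P<info>.+))?"
    r"_(?P<date>\d{8})_(?P<time>\d{6})\.wav$"
)


def parse_recording_name(file_name: str) -> RecordingName:
    """Split a recording file name into child ID, info fields and start time."""
    match = _PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {file_name}")
    # strptime rejects impossible values such as month 13 or hour 25.
    start = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    info = match["info"].split("_") if match["info"] else []
    return RecordingName(match["child"], info, start)


if __name__ == "__main__":
    print(parse_recording_name("CHILD-01_pre_home_20220315_093000.wav"))
```

Parsing the date and time with `datetime.strptime` rather than with the regular expression alone means that out-of-range values are rejected as well as malformed names.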

How to use?

The following instructions explain how to use this code when it is embedded (as a submodule) in a ChildProject project.

Prepare the data set by running the following command: python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py

This will create the necessary directories required by ChildProject if they do not exist.

Place the recordings in dat/in/recordings/raw and run the following command: python -u scripts/URUMETRICS-CODE/import_data/import_recordings.py --experiment Uruguayan_Chatbot_2022

This command will look at the new recordings found in the raw directory and add them to the metadata file metadata/recordings.csv. If some recordings belong to previously unknown children, they will be added to the metadata file metadata/children.csv.

Note that the recording file names should comply with the file naming convention described above!
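The import step described above boils down to two set differences. The sketch below illustrates that logic on plain lists; `find_new_recordings` and the file names are hypothetical, and the real script reads and updates the metadata CSV files instead.

```python
def find_new_recordings(
    raw_files: list[str],
    known_recordings: list[str],
    known_children: list[str],
) -> tuple[list[str], list[str]]:
    """Return the recordings and child IDs that are not yet in the metadata."""
    new_recordings = sorted(set(raw_files) - set(known_recordings))
    # CHILD-ID is everything before the first underscore of the file name.
    child_ids = {name.split("_", 1)[0] for name in new_recordings}
    new_children = sorted(child_ids - set(known_children))
    return new_recordings, new_children


if __name__ == "__main__":
    recordings, children = find_new_recordings(
        ["A1_20220315_093000.wav", "B2_home_20220316_101500.wav"],
        ["A1_20220315_093000.wav"],
        ["A1"],
    )
    print(recordings, children)
```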

Place the annotations in their respective folders under dat/in/annotations/{vtc|vcm|alice}/raw

Note that the annotation files should have unique names (e.g. like the acoustic annotations) and should by no means overwrite the files already present in the aforementioned directories.

Extract the acoustic annotations using the following command: python -u scripts/URUMETRICS-CODE/acoustic_annotations/compute_acoustic_annotations.py --path-vtc ./annotations/vtc/raw/VTC_FILE_FOR_WHICH_TO_DERIVE_ACOUSTIC_ANNOTATIONS_FOR.rttm --path-recordings ./recordings/raw/ --save-path ./annotations/acoustic/raw

This command will compute acoustic annotations (mean pitch, pitch range) for the VTC file passed as argument. The output file will have the same name as the input VTC file with the rttm extension replaced by csv.
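The output naming rule can be expressed with `pathlib`. This is a sketch of the rule only; `acoustic_output_path` and the file name `example.rttm` are hypothetical.

```python
from pathlib import Path


def acoustic_output_path(vtc_path: str, save_dir: str) -> Path:
    """Return the output path: same file name as the VTC file, rttm replaced by csv."""
    return Path(save_dir) / Path(vtc_path).with_suffix(".csv").name


if __name__ == "__main__":
    print(
        acoustic_output_path(
            "./annotations/vtc/raw/example.rttm", "./annotations/acoustic/raw"
        )
    )
```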

Run the following command to convert the annotations to the ChildProject format: python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py
Run the following command to compute ACLEW metrics as well as the additional metrics defined in compute_metrics/metrics_functions.py: python -u scripts/URUMETRICS-CODE/compute_metrics/metrics.py

This command will generate a file metrics.csv which will be stored in extra/metrics. If the file already exists, new lines will be added at the end.
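The append behaviour can be sketched with the standard `csv` module. The function `append_metrics` and the column names are hypothetical and do not reflect the actual columns of metrics.csv.

```python
import csv
from pathlib import Path


def append_metrics(rows: list[dict], path: Path, fieldnames: list[str]) -> None:
    """Append rows to the metrics file, writing the header only on creation."""
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical columns, for illustration only.
    append_metrics(
        [{"recording": "a.wav", "metric_a": 1}],
        Path("extra/metrics/metrics.csv"),
        ["recording", "metric_a"],
    )
```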

Note that the metrics are only computed for newly imported recordings, not for all the files. If no annotations are linked to the new files (e.g. you forgot to import them), the columns will be empty.

Generate the messages using the following command: python -u scripts/URUMETRICS-CODE/generate_messages/messages.py [--date YYYYMMDD]

This command will create a file in extra/messages/generated named according to the pattern messages_YYYYMMDD.csv

The file will contain the messages that correspond to each new audio file. The date parameter specifies the date for which to generate messages. If the date is before the current date, only the recordings available at the specified date are considered when generating the messages. This makes it possible to regenerate past messages if needed. If no date is specified, the current date is used.
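The date handling can be sketched as follows. All three function names are hypothetical, and the sketch assumes that "available at the specified date" means the recording date is on or before that date; the actual script may apply a different rule.

```python
from datetime import date, datetime
from typing import Optional


def resolve_message_date(raw: Optional[str], today: Optional[date] = None) -> date:
    """Parse a YYYYMMDD string, falling back to the current date."""
    if raw is None:
        return today or date.today()
    return datetime.strptime(raw, "%Y%m%d").date()


def messages_file_name(day: date) -> str:
    """Build the output file name messages_YYYYMMDD.csv."""
    return f"messages_{day:%Y%m%d}.csv"


def recordings_available(recording_dates: dict[str, date], day: date) -> list[str]:
    """Keep recordings dated on or before `day` (assumed meaning of 'available')."""
    return sorted(name for name, when in recording_dates.items() if when <= day)


if __name__ == "__main__":
    day = resolve_message_date("20220320")
    print(messages_file_name(day))
```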

Return codes

Every command exits with a return code of either 0 (no problem) or 1 (problem). Commands may also print information, warning and error messages to STDERR.
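Since each command reports success or failure through its return code, the steps can be chained and stopped at the first failure. The following is a sketch of one way to do so; `run_pipeline` is a hypothetical helper, not part of this repository.

```python
import subprocess
import sys


def run_pipeline(steps: list[list[str]]) -> int:
    """Run each command in order and stop at the first non-zero return code."""
    for step in steps:
        code = subprocess.run(step).returncode
        if code != 0:
            print(
                f"Step failed with return code {code}: {' '.join(step)}",
                file=sys.stderr,
            )
            return code
    return 0


if __name__ == "__main__":
    # Stand-in commands; replace them with the commands described above.
    sys.exit(run_pipeline([[sys.executable, "-c", "pass"]]))
```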

Test

TO DO!
