Uruguayan Chatbot Project - Code



URUMETRICS

Uruguayan Chatbot Project.

This project, within the LAAC team, involves:

  • William N Havard : Postdoctoral Researcher, william.havard@ens.fr / william.havard@gmail.com
  • Loann Peurey : Data manager, loannpeurey@gmail.com
  • Alejandrina Cristia : Research director, acristia@gmail.com

Description

This repository contains the code to extract the metrics and generate the messages of the Uruguayan Chatbot Project. The Uruguayan Chatbot Project uses this repo embedded in the URUMETRICS dataset. A server was set up on Amazon AWS (S3, Lambda, API Gateway) to:

  • receive and store audio files from collaborators (Simple Mind) collecting audios in the field (via an HTTP API),
  • run the importation of new audios into the dataset and the extraction of metrics,
  • launch the generation of messages,
  • make the messages available via an HTTP API to collaborators (Simple Mind), who can then send them to the participating families.

The intended usage of the repository is simply to feed audios (following the naming convention given below) into a dataset. The pipeline is then able to generate feedback messages on the evolution of certain metrics, for each child.

For example, child number 1 (C1) has a first recording (R1C1) posted; the audio is integrated into the dataset, and the number of vocalizations the child produced and the number of interactions with their mother are calculated. A week later, a second audio is posted (R2C1) and the same process is applied. The system then compares vocalizations and interactions between R1C1 and R2C1. Say interactions increased but the number of vocalizations did not: a message is generated for C1, congratulating the mother on that improvement and suggesting she encourage the child to vocalize more.
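The comparison step in this example can be sketched roughly as follows. The function and metric names (compare_metrics, build_message, vocalizations, interactions) and the message wording are illustrative assumptions, not the actual generate_messages implementation:

```python
# Hypothetical sketch of the per-child comparison between two consecutive
# recordings; names and messages are illustrative only.

def compare_metrics(previous: dict, current: dict) -> dict:
    """For each metric, report whether it improved since the last recording."""
    return {metric: current[metric] > previous[metric] for metric in previous}

def build_message(improvements: dict) -> str:
    """Assemble a feedback message from the per-metric improvement flags."""
    parts = []
    if improvements.get("interactions"):
        parts.append("Great job, interactions with your child increased!")
    else:
        parts.append("Try to engage in more back-and-forth exchanges.")
    if improvements.get("vocalizations"):
        parts.append("Your child is vocalizing more, keep it up!")
    else:
        parts.append("Encourage your child to vocalize more.")
    return " ".join(parts)

# R1C1 vs R2C1 from the example: interactions went up, vocalizations did not.
r1c1 = {"vocalizations": 120, "interactions": 30}
r2c1 = {"vocalizations": 115, "interactions": 45}
message = build_message(compare_metrics(r1c1, r2c1))
```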

You can use this code if you have a dataset available and would like to automate the process of incorporating new audios and/or annotations into that dataset, then extract metrics and generate feedback messages on the new data. For the automation to work, you will need to structure and name your audios accordingly.

URUMETRICS-CODE is structured to be used inside a ChildProject-organized dataset. It should be integrated as a nested datalad repository inside the scripts folder of your dataset.

Installation

This code should not be used directly; it should be embedded in a datalad repository in ChildProject format.

Example: we use URUMETRICS in the current project.

To add this repository as a submodule, run the following command:

cd path/to/dataset
mkdir scripts && cd scripts
datalad install -d .. git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git

and install the necessary dependencies

pip install -r requirements.txt

Repository structure

  • acoustic_annotations contains the code that computes the acoustic annotations
  • turn_annotations contains the code that computes the conversational annotations
  • import_data contains the code that imports the new recordings and new annotations to the data set
  • compute_metrics contains the code that computes and saves the metrics
  • generate_messages is the code that reads the metrics file and generates the messages sent to the families

Naming constraints

Recordings should be WAV files whose names follow this naming convention:

CHILD-ID_[info1_info2_..._infoX]_YYYYMMDD_HHMMSS.wav

where

  • YYYYMMDD corresponds to the date formatted according to the ISO 8601 format (i.e. YYYY for the year, MM for the month (from 01 to 12), and DD for the day (from 01 to 31)) and
  • HHMMSS corresponds to the time formatted according to the ISO 8601 format (HH for hours (from 00 to 23, 24-hour clock system), MM for minutes (from 00 to 59), and SS for seconds (from 00 to 59)),
  • [info1_info2_..._infoX] corresponds to optional information items (written without [ and ]) separated by underscores (_), and
  • CHILD-ID may use any character except the underscore character (_).
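As a sketch of these constraints, a conforming file name could be validated and split with a regular expression like the one below. The pattern and helper are hypothetical illustrations, not taken from the import_data scripts:

```python
import re
from datetime import datetime

# Hypothetical validator for the naming convention described above; the real
# import scripts may parse file names differently.
PATTERN = re.compile(
    r"^(?P<child_id>[^_]+)"            # CHILD-ID: any characters except "_"
    r"(?:_(?P<info>.+))?"              # optional info1_info2_..._infoX
    r"_(?P<date>\d{8})_(?P<time>\d{6})\.wav$"
)

def parse_recording_name(filename):
    """Return (child_id, [info, ...], recording datetime) or raise ValueError."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"{filename} does not follow the naming convention")
    recorded = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S")
    info = m["info"].split("_") if m["info"] else []
    return m["child_id"], info, recorded
```

For instance, parse_recording_name("C1_stage2_20220314_093000.wav") would yield child C1, the optional item stage2, and the recording timestamp.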

The metadata of the dataset is generated from the file names, so you do not need to edit the metadata files manually.

All of the optional information will be stored in the dataset's metadata files but will not be used by any of the scripts here. Use it if you would like to link other information to your data (you can consult this info in the metadata/recordings.csv file, in the experiment_stage column).

How to use?

You will need to use a terminal and first navigate to the path of your dataset using the cd path/to/dataset command.

All the runnable files should be run from the root of the data set (i.e. the directory containing the scripts directory). If not, an exception will be raised and the code will stop running.

The following instructions explain how to use this code when it is embedded (as a submodule) in a ChildProject project using datalad. In the unlikely event that you need to run those scripts for a dataset that isn't embedded this way, all of the commands accept an additional --project-path path/to/dataset option that lets you specify the considered dataset manually.

We assume that you currently have an empty repository available and ready to use on GIN in the LAAC-LSCP organization.

Use the already set up repository

(0) Define the following bash variables

today=$(date '+%Y%m%d')
dataset="URUMETRICS" #change for your repo name

To use the already prepared dataset, you need to install it (the first time only).

The value of dataset should be the name of the data set you are interested in, which must exist in the LAAC-LSCP GIN organization.

(1) Run the following commands to install the data set

datalad install -r git@gin.g-node.org:/LAAC-LSCP/${dataset}.git
cd ${dataset}
#only run the following if the dataset is already populated, otherwise there is nothing to get yet
datalad get extra/messages/definition
datalad get extra/metrics && datalad unlock extra/metrics
datalad get metadata && datalad unlock metadata

Note that you will only be allowed to clone and install the data set if you are added as a collaborator on GIN. Please ask William or Alex for more information.

If this is the first setup of this repository, you will get a warning that the repository is empty; this is expected.

Run the pipeline

(2) Prepare the data set by running the following command

python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py

N.B. this command will use the default message definition; you can choose a custom one with the --message-definition xxx.yaml option.

This will create the necessary directories required by ChildProject if they do not exist.

(3) Place the new recordings in dat/in/recordings/raw

(4) Place the VTC, VCM, and ALICE annotation files in their respective folder in dat/in/annotations/{vtc|vcm|alice}/raw

Note that the annotation files should have unique names (as, e.g., the acoustic annotation files do) and should by no means overwrite the files already present in the aforementioned directories.

(5) Save the data set and push the new annotations to GIN

datalad save recordings -m "Added new recordings for date ${today}"
datalad save annotations/*/raw -m "Added raw annotations for date ${today}"
datalad push --to origin

This is a very important step. This allows us to push the new recordings and annotations before running any script that could potentially fail.

(6) Import the new recordings

python -u scripts/URUMETRICS-CODE/import_data/import_recordings.py --experiment Uruguayan_Chatbot_2022

This command will look at the new recordings found in the raw directory and add them to the metadata file metadata/recordings.csv. If some recordings belong to previously unknown children, they will be added to the metadata file metadata/children.csv.

Note that the recording file names should comply with the file naming convention described above!

(7) Run the following commands to convert and import the annotations to the ChildProject format:

python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type VTC --annotation-file VTC_FILE.rttm
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type VCM --annotation-file VCM_FILE.vcm
python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type ALICE --annotation-file ALICE_FILE.txt

This will import the VTC, VCM, and ALICE annotations contained in the specified files. Note that you shouldn't specify the full path to the file, only its raw filename and extension.

This script can also take the additional (optional) parameter --recording. When used, only the annotations pertaining to the specified recording (filename.wav) will be imported. This can be useful when you need to import only the annotations for a specific recording rather than all the annotations for all the recordings.

You may also use the option --recordings-from-annotation-file (incompatible with the previous one) to import, from --annotation-file, only the annotations for recordings that also appear in the file given to --recordings-from-annotation-file (i.e. you filter the annotations of --annotation-file by the recordings of --recordings-from-annotation-file).

You may now save and push. Do not forget to unlock the metadata directory afterwards!

datalad save annotations/*/converted -m "Converted RAW annotations for date ${today}"
datalad save metadata -m "Imported RAW annotations for date ${today}"
datalad push --to origin
datalad unlock metadata

(8) Compute the conversational annotations using the following command:

python scripts/URUMETRICS-CODE/compute_annotations/compute_derived_annotations.py --annotation-type CONVERSATIONS --save-path annotations/conversations/raw/

This command will only compute the conversational annotations for the newly imported VTC files.

(9) Import the conversational annotations using the following command

python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type CONVERSATIONS --annotation-file CONVERSATIONS_VTC_${today}.csv --recordings-from-annotation-file VTC_${today}.rttm

The name given to --annotation-file is the name of the VTC file prefixed with CONVERSATIONS_.

⚠️ You may need to import more than one file: annotations are computed for all the files that are missing this type of annotation, so more than one output file may be generated if annotations were produced for files imported earlier.

(10) Extract the acoustic annotations using the following command

python scripts/URUMETRICS-CODE/compute_annotations/compute_derived_annotations.py --annotation-type ACOUSTIC --save-path annotations/acoustic/raw/ --target-sr 16000

This command will compute acoustic annotations (mean pitch, pitch range) for the files that do not have any acoustic annotation yet (if the VTC annotations are available as well as their recordings).

and similarly import these annotations using this command:

python -u scripts/URUMETRICS-CODE/import_data/import_annotations.py --annotation-type ACOUSTIC --annotation-file ACOUSTIC_VTC_${today}.csv --recordings-from-annotation-file VTC_${today}.rttm

As for the conversational annotations, the name given to --annotation-file is the name of the VTC file prefixed with ACOUSTIC_.

⚠️ You may need to import more than one file: annotations are computed for all the files that are missing this type of annotation, so more than one output file may be generated if annotations were produced for files imported earlier.

You may now save and push. Do not forget to unlock the metadata directory afterwards!

datalad save annotations/*/raw -m "Imported DERIVED RAW annotations for date ${today}"
datalad save annotations/*/converted -m "Converted DERIVED annotations for date ${today}"
datalad save metadata -m "Imported DERIVED CONVERTED annotations for date ${today}"
datalad unlock metadata

(11) Run the following command to compute ACLEW metrics as well as the additional metrics defined in compute_metrics/metrics_functions.py:

python -u scripts/URUMETRICS-CODE/compute_metrics/metrics.py

This command will generate a file metrics.csv which will be stored in extra/metrics. If the file already exists, new lines will be appended.

Note that the metrics are only computed for newly imported recordings and not for all the files. If no annotations are linked to the new files (e.g. you forgot to import them) the columns will be empty.

(12) Generate the message using the following command

python -u scripts/URUMETRICS-CODE/generate_messages/messages.py [--date YYYYMMDD]

This command will create a file in extra/messages/generated with the following name pattern messages_YYYYMMDD.csv

The file will contain the messages that correspond to each new audio file. The --date parameter specifies the date for which to generate messages. If the date is before the current date, only recordings available at the specified date will be considered, which allows past messages to be re-generated if needed. If no date is specified, the current date is used.
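The date filter described above can be illustrated as follows; the recordings_for_date helper and the received field are assumptions made for this sketch, not the actual messages.py logic:

```python
from datetime import datetime

# Hypothetical illustration of the --date filter: only recordings received
# on or before the target date are considered for message generation.
def recordings_for_date(recordings, target):
    cutoff = datetime.strptime(target, "%Y%m%d")
    return [r for r in recordings
            if datetime.strptime(r["received"], "%Y%m%d") <= cutoff]

recordings = [
    {"name": "R1C1.wav", "received": "20220301"},
    {"name": "R2C1.wav", "received": "20220308"},
    {"name": "R3C1.wav", "received": "20220322"},
]
# Regenerating messages for 20220315 excludes the later R3C1 recording.
past = recordings_for_date(recordings, "20220315")
```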

Do something with the generated message file

(13) Save the data set and push everything to GIN

datalad save extra/metrics -m "Computed new metrics for date ${today}"
datalad save extra/messages/generated -m "Message generated for date ${today}"
datalad save .
datalad push --to origin

(14) Uninstall the data set

git annex dead here
datalad push --to origin
cd ..
datalad remove -d ${dataset}

Return codes

Every command returns either a 0 (no problem) or 1 (problem) return code. Commands may print informational, warning, and error messages to STDERR.
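A caller automating the pipeline could rely on these return codes to stop at the first failing step. The wrapper below is a hypothetical sketch, not part of the repository:

```python
import subprocess
import sys

# Hypothetical wrapper: run one pipeline step, surface its STDERR on
# failure, and report success as a boolean based on the 0/1 return code.
def run_step(args):
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0

# A trivially succeeding step, standing in for one of the scripts above.
ok = run_step([sys.executable, "-c", "pass"])
```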

Test

TO DO!

Version Requirements