
some change to readme

Loann Peurey, 1 year ago · commit 74edc18a25
1 changed file with 49 additions and 15 deletions

README.md (+49, -15)

@@ -1,19 +1,41 @@
 # URUMETRICS
 
-Uruguayan Chatbot Project
+Uruguayan Chatbot Project.
+
+This project, carried out within the [LAAC](https://lscp.dec.ens.fr/en/research/teams-lscp/language-acquisition-across-cultures) team, involves:
+- William N Havard: Postdoctoral Researcher, william.havard@ens.fr / william.havard@gmail.com
+- Loann Peurey: Data manager, loannpeurey@gmail.com
+- Alejandrina Cristia: Research director, acristia@gmail.com
 
 ## Description
 
-This repository contains the code to extract the metrics and generated the messages of the Uruguayan Chatbot Project.
+This repository contains the code to extract the metrics and generate the messages of the Uruguayan Chatbot Project. The Uruguayan Chatbot Project uses this repository embedded into the [URUMETRICS](https://gin.g-node.org/laac-lscp/urumetrics) dataset.
+A server was set up on Amazon AWS (S3, Lambda, API Gateway) to:
+- receive and store audio files (via an HTTP API, as sketched below) from collaborators (Simple mind) collecting audio in the field.
+- import the new audio files into the dataset and run the extraction of metrics.
+- launch the generation of messages.
+- make the messages available via the HTTP API to the collaborators (Simple mind), who can then send them to the participating families.
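+
+For illustration only, uploading a recording to such an endpoint could look like the sketch below. The URL, the API key header and the upload route are hypothetical placeholders; the actual API Gateway configuration is project-specific and not documented in this repository.
+
+```bash
+# Hypothetical upload call: placeholder URL and authentication, to be adjusted to the real API Gateway setup
+curl -X PUT \
+  -H "x-api-key: $API_KEY" \
+  --data-binary @path/to/recording.wav \
+  "https://<api-id>.execute-api.<region>.amazonaws.com/prod/upload/recording.wav"
+```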
+
+The intended use of the repository is to simply feed audio files that follow the naming standard given below into a dataset. The pipeline can then generate, for each child, feedback messages on the evolution of certain metrics.
+
+For example, child number 1 (C1) has a first recording (R1C1) posted; the audio is integrated into the dataset, and the number of vocalizations the child produced and the number of interactions with their mother are calculated. A week later, a second audio is posted (R2C1) and the same process is applied.
+The system then compares vocalizations and interactions between R1C1 and R2C1. Say interactions increased but the number of vocalizations did not: a message will be generated for C1, congratulating the mother on that improvement and suggesting she encourage the child to vocalize more.
+
+You can use this code if you have a dataset available and would like to automate the process of incorporating new audio files and/or annotations into that dataset, then extract metrics and generate feedback messages on the new data. In order for the automation to work, you will need to structure and [name your audios accordingly](#naming-constraints).
+
+URUMETRICS-CODE is structured to be used inside a [ChildProject](https://childproject.readthedocs.io/en/latest/)-organized dataset. It should be integrated as a [datalad nested repository](http://handbook.datalad.org/en/latest/basics/101-106-nesting.html#nesting) inside the `scripts` folder of your dataset.
 
 ## Installation
 
 This code should not be used directly but should be embedded in a [datalad](https://www.datalad.org/) repository in [ChildProject](https://childproject.readthedocs.io/en/latest/) format.
 
+Example: we use the [URUMETRICS](https://gin.g-node.org/LAAC-LSCP/URUMETRICS) dataset in the current project.
+
 To add this repository as a submodule, run the following command:
 ```bash
+cd path/to/dataset
 mkdir scripts && cd scripts
-git submodule add git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git
+datalad install -d .. git@gin.g-node.org:/LAAC-LSCP/URUMETRICS-CODE.git
 ```
 
 and install the necessary dependencies
@@ -24,18 +46,12 @@ pip install -r requirements.txt
 ## Repository structure 
 
 * `acoustic_annotations` contains the code that computes the acoustic annotations
-* `turn_annnotations` contains the code that compute the conversational annotations
+* `turn_annnotations` contains the code that computes the conversational annotations
 * `import_data` contains the code that imports the new recordings and new annotations to the data set
-* `compute_metrics` contains the code that computes and save the metrics
+* `compute_metrics` contains the code that computes and saves the metrics
 * `generate_messages` is the code that reads the metrics file and generates the messages sent to the families
 
-## Requirements
-
-### Running requirements
-
-All the runnable files **should be run from the root of the data set** (i.e. the directory containing the `scripts` directory). If not, an exception will be raised and the code will stop running.
-
-### Naming convention: recording file names
+## Naming constraints
 
 Recordings should be **WAV files** whose names follow the **naming convention** below
 
@@ -48,19 +64,31 @@ where
 * `[info1_info2_..._infoX]` corresponds to **optional** information (without `[` and `]`) separated by underscores `_`
 * `CHILD-ID` may use **any character except the underscore character (`_`)**.
 
-Additional information will be store in the metadata file `metadata/recordings.csv` in the column `experiment_stage`.
+The metadata of the dataset will be generated from the file names, so you do not need to edit the dataset yourself.
+
+All of the optional information will be stored in the dataset's metadata files but will not be used by any of the scripts here. Use it if you would like to link other information to your data (you can consult this information in the `metadata/recordings.csv` file, in the column `experiment_stage`).
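+
+For instance, one quick way to inspect that column from the root of the dataset is sketched below; it assumes the optional [csvkit](https://csvkit.readthedocs.io/) tools are installed (any CSV viewer or spreadsheet works just as well).
+
+```bash
+# Show the experiment_stage column of the recordings metadata (requires csvkit: pip install csvkit)
+csvcut -c experiment_stage metadata/recordings.csv
+```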
 
 ## How to use?
 
-The following instructions explain how to use this code when it is embedded (as a submodule) in a ChildProject project using datalad
+You will need to use a terminal and first navigate to the path of your dataset using the `cd path/to/dataset` command.
+
+All the runnable files **should be run from the root of the data set** (i.e. the directory containing the `scripts` directory). If not, an exception will be raised and the code will stop running.
+
+The following instructions explain how to use this code when it is embedded (as a submodule) in a ChildProject project using datalad. In the unlikely event that you need to run these scripts for a dataset in which the code is not embedded, all of the commands can take an additional `--project-path path/to/dataset` option that allows you to point to the considered dataset manually.
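+
+As an illustration, the data-set preparation script used later in this guide could be pointed at another dataset as follows; the dataset path is a placeholder, and the script path assumes you call it from a repository that embeds this code under `scripts/URUMETRICS-CODE`:
+
+```bash
+# Run the preparation step against a dataset that does not embed this code (illustrative path)
+python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py --project-path path/to/other/dataset
+```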
+
+We assume that you currently have an empty repository available and ready to use on [GIN](https://gin.g-node.org/) in the LAAC-LSCP organization.
+
+### Use the already set up repository
 
 **(0)** Define the following bash variables
 
 ```bash
 today=$(date '+%Y%m%d')
-dataset="URUMETRICS"
+dataset="URUMETRICS" # change to the name of your repository
 ```
 
+To use the already prepared dataset, you only need to install it the first time.
+
 The content of `dataset` should be the name of the data set you are interested in and should exist in the [LAAC-LSCP GIN repository](https://gin.g-node.org/LAAC-LSCP).
 
 **(1)** Run the following commands to install the data set
@@ -68,6 +96,7 @@ The content of `dataset` should be the name of the data set you are interested i
 ```bash
 datalad install -r git@gin.g-node.org:/LAAC-LSCP/${dataset}.git
 cd ${dataset}
+# only run the following if the dataset is already populated; otherwise there is nothing to get yet
 datalad get extra/messages/definition
 datalad get extra/metrics && datalad unlock extra/metrics
 datalad get metadata && datalad unlock metadata
@@ -75,11 +104,16 @@ datalad get metadata && datalad unlock metadata
 
 **Note that you will only be allowed to clone and install the data set if you are added as a collaborator on GIN.** Please ask [William](mailto:william.havard@gmail.com) or [Alex](mailto:alecristia@gmail.com) for more information.
 
+If this is the first setup of this repository, you will get a warning that the repository is empty; this is expected.
+
+### Run the pipeline
+
 **(2)** Prepare the data set by running the following command 
 
 ```bash
 python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py
 ```
+N.B. This command will use the default message definition; you can choose a custom one with the `--message-definition xxx.yaml` option.
 
 This will create the necessary directories required by ChildProject if they do not exist.
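+
+For example, a custom definition could be passed as shown below; the YAML file name is purely illustrative:
+
+```bash
+# Prepare the data set with a custom message definition (illustrative file name)
+python -u scripts/URUMETRICS-CODE/import_data/prepare_data_set.py --message-definition my_messages.yaml
+```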