|
@@ -0,0 +1,136 @@
|
|
|
+# Zooniverse campaign
|
|
|
+
|
|
|
+- [Summary](#summary)
|
|
|
+ - [Workflow](#workflow)
|
|
|
+ - [Repository structure](#repository-structure)
|
|
|
+- [Preparing samples](#preparing-samples)
+ - [Sampling](#sampling)
+ - [Preparing the chunks for Zooniverse](#preparing-the-chunks-for-zooniverse)
+ - [Uploading audio chunks to Zooniverse](#uploading-audio-chunks-to-zooniverse)
+- [Retrieving classifications](#retrieving-classifications)
|
|
|
+
|
|
|
+## Summary
|
|
|
+
|
|
|
+The present repository showcases the organization
|
|
|
+of a Zooniverse campaign using ChildProject and DataLad.
|
|
|
+
|
|
|
+This campaign requires citizens to listen to 500 ms audio clips
|
|
|
+and to perform the following tasks:
|
|
|
+
|
|
|
+1. Decide whether they hear speech from a Baby, a Child, an Adolescent, or an Adult, or no speech at all.
|
|
|
+2. Guess the gender of the speaker when the speech does not come from a baby.
|
|
|
+3. Classify the type of sound into one of four categories (Canonical, Non-Canonical, Laughing, Crying).
|
|
|
+
|
|
|
+### Workflow
|
|
|
+
|
|
|
+1. We used [DataLad](https://joss.theoj.org/papers/10.21105/joss.03262) to manage this campaign.
|
|
|
+2. The primary dataset (containing the audio and the metadata)
|
|
|
+was included in this repository as a subdataset. It was structured according to ChildProject's standards.
|
|
|
+3. [ChildProject](https://childproject.readthedocs.io/en/latest/) was used to generate the samples, to upload the audio chunks to Zooniverse, and to retrieve the classifications.
|
|
|
+
|
|
|
+This repository contains all the scripts that we used to implement this workflow.
|
|
|
+You are welcome to re-use this code and adapt it to your needs.
|
|
|
+
|
|
|
+### Repository structure
|
|
|
+
|
|
|
+ - `annotations` contains annotations built from the classifications retrieved from Zooniverse.
|
|
|
+ - `classifications` contains the classifications retrieved from Zooniverse.
|
|
|
+ - `samples` contains the samples that were selected as well as the chunks generated from them.
|
|
|
+ - `vandam-data` is a subdataset containing the VanDam Daylong corpus, structured according to ChildProject's standards.
|
|
|
+
|
|
|
+## Preparing samples
|
|
|
+
|
|
|
+### Sampling
|
|
|
+
|
|
|
+Sampling consists of selecting which portions of the recordings should be annotated by humans.
|
|
|
+It can be done using the [samplers provided in the ChildProject package](https://childproject.readthedocs.io/en/latest/samplers.html).
|
|
|
+
|
|
|
+Here, we sample 50 vocalizations per recording among all those detected and attributed to the key child (CHI) or a female adult (FEM) by the Voice Type Classifier:
|
|
|
+
|
|
|
+```bash
|
|
|
+# download VTC annotations
|
|
|
+datalad get vandam-data/annotations/vtc/converted
|
|
|
+
|
|
|
+# sample random CHI and FEM vocalizations from these annotations
|
|
|
+child-project sampler vandam-data samples/chi_fem/ random-vocalizations \
|
|
|
+ --annotation-set vtc \
|
|
|
+ --target-speaker-type CHI FEM \
|
|
|
+ --sample-size 50 \
|
|
|
+ --by recording_filename
|
|
|
+```
|
|
|
+
|
|
|
+The sampler writes the selected segments to a CSV file named `samples/chi_fem/segments_YYYYMMDD_HHMMSS.csv`, e.g. `samples/chi_fem/segments_20210716_184443.csv`.
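
Because the file name embeds a timestamp, later commands need the exact name of the most recent run. The following is a minimal sketch of one way to look it up; `latest_segments` is a hypothetical helper, not part of ChildProject, and it assumes that only the sampler writes `segments_*.csv` files into the directory.

```shell
# Hypothetical helper (not part of ChildProject): print the most recent
# segments file found in a directory.
# It relies on the segments_YYYYMMDD_HHMMSS.csv naming, which sorts
# chronologically when sorted alphabetically.
latest_segments() {
    # $1: directory containing the sampler output, e.g. samples/chi_fem
    ls -1 "$1"/segments_*.csv 2>/dev/null | sort | tail -n 1
}
```

For instance, `latest_segments samples/chi_fem` would print the path of the newest segments file, or nothing if the sampler has not been run yet.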
|
|
|
+
|
|
|
+### Preparing the chunks for Zooniverse
|
|
|
+
|
|
|
+After the samples have been generated, they have to be extracted from the audio and uploaded to Zooniverse, which can be done with ChildProject's [extract-chunks](https://childproject.readthedocs.io/en/latest/zooniverse.html#chunk-extraction) function.
|
|
|
+However, these samples may contain private information about the participants, so they cannot be shared as is on a public crowd-sourcing platform. We therefore configure `extract-chunks` to split them into 500 ms chunks, which are classified in random order, which prevents contributors from recovering sensitive information:
|
|
|
+
|
|
|
+```bash
|
|
|
+datalad get vandam-data/recordings/converted/standard
|
|
|
+
|
|
|
+child-project zooniverse extract-chunks vandam-data \
|
|
|
+ --keyword chi_fem \
|
|
|
+ --chunks-length 500 \
|
|
|
+ --segments samples/chi_fem/segments_20210716_184443.csv \
|
|
|
+ --destination samples/chi_fem/chunks \
|
|
|
+ --profile standard
|
|
|
+```
|
|
|
+
|
|
|
+This will extract the audio chunks into `samples/chi_fem/chunks/chunks/` and write a metadata file into `samples/chi_fem/chunks` (in our case, as `samples/chi_fem/chunks/chunks_20210716_191944.csv`).
|
|
|
+
|
|
|
+See [ChildProject's documentation](https://childproject.readthedocs.io/en/latest/zooniverse.html) for more information about the Zooniverse pipeline.
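
To estimate how many chunks a given sample will produce, a rough calculation can help when planning uploads. The sketch below assumes that each segment is padded up to a whole number of chunks; this is an assumption made for illustration, so check the ChildProject documentation for the exact splitting rule. `chunk_count` is a hypothetical helper.

```shell
# Hypothetical helper: number of fixed-length chunks needed to cover a segment,
# ASSUMING the segment is padded up to a whole number of chunks.
chunk_count() {
    # $1: segment duration in ms, $2: chunk length in ms
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

chunk_count 1200 500   # prints 3
chunk_count 500 500    # prints 1
```

Under that assumption, a 1200 ms vocalization yields three 500 ms chunks, so the number of subjects on Zooniverse can be several times the number of sampled segments.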
|
|
|
+
|
|
|
+### Uploading audio chunks to Zooniverse
|
|
|
+
|
|
|
+Once the chunks have been extracted, the next step is to upload them to Zooniverse.
|
|
|
+Note that due to quotas, it is recommended to upload only a limited number of chunks at a time (e.g. 1000 per day).
|
|
|
+
|
|
|
+You will need to provide the numerical id of your Zooniverse project;
|
|
|
+you will also need to set Zooniverse credentials as environment variables:
|
|
|
+
|
|
|
+```bash
|
|
|
+export ZOONIVERSE_LOGIN=""
|
|
|
+export ZOONIVERSE_PWD=""
|
|
|
+export PROJECT_ID=14957
|
|
|
+```
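
Since the values above are left empty, it is easy to launch the upload without credentials. The following is a minimal sketch of a guard that could be run first; `require_env` is a hypothetical helper, and the variable names are the ones exported above.

```shell
# Hypothetical guard: fail early if a required environment variable
# is unset or empty.
require_env() {
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name
        # (POSIX-compatible indirect expansion).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
}

# Warn (without aborting the shell) if any credential is missing.
require_env ZOONIVERSE_LOGIN ZOONIVERSE_PWD PROJECT_ID \
    || echo "set the Zooniverse credentials before uploading" >&2
```

The function returns a non-zero status at the first missing variable, so it can also be chained with `&&` in front of the upload command.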
|
|
|
+
|
|
|
+```bash
|
|
|
+child-project zooniverse upload-chunks \
|
|
|
+ --chunks samples/chi_fem/chunks/chunks_20210716_191944.csv \
|
|
|
+ --project-id $PROJECT_ID \
|
|
|
+ --set-name vandam_chi_fem \
|
|
|
+ --amount 1000
|
|
|
+```
|
|
|
+
|
|
|
+This will display a message for each chunk:
|
|
|
+
|
|
|
+```
|
|
|
+...
|
|
|
+uploading chunk BN32_010007.mp3 (25153080,25153580)
|
|
|
+uploading chunk BN32_010007.mp3 (45016146,45016646)
|
|
|
+uploading chunk BN32_010007.mp3 (46794141,46794641)
|
|
|
+uploading chunk BN32_010007.mp3 (14107752,14108252)
|
|
|
+uploading chunk BN32_010007.mp3 (35709983,35710483)
|
|
|
+uploading chunk BN32_010007.mp3 (45433933,45434433)
|
|
|
+uploading chunk BN32_010007.mp3 (35711483,35711983)
|
|
|
+uploading chunk BN32_010007.mp3 (38737938,38738438)
|
|
|
+uploading chunk BN32_010007.mp3 (24586156,24586656)
|
|
|
+uploading chunk BN32_010007.mp3 (15556956,15557456)
|
|
|
+uploading chunk BN32_010007.mp3 (28439601,28440101)
|
|
|
+uploading chunk BN32_010007.mp3 (27317629,27318129)
|
|
|
+uploading chunk BN32_010007.mp3 (38391252,38391752)
|
|
|
+...
|
|
|
+```
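
If this output is saved to a file, it can be used to check an upload afterwards. The sketch below counts the uploaded chunks and flags any whose length differs from the expected one. It assumes the two numbers in parentheses are the chunk's onset and offset in milliseconds, and that the output was redirected to a log file by the user; `check_upload_log` is a hypothetical helper.

```shell
# Hypothetical check: count the uploaded chunks in a log and flag any whose
# duration (offset - onset, ASSUMED to be in ms) differs from the expected length.
check_upload_log() {
    # $1: log file, $2: expected chunk length in ms
    awk -v expected="$2" '
        /^uploading chunk / {
            # The last field looks like "(25153080,25153580)".
            gsub(/[()]/, "", $NF)
            split($NF, bounds, ",")
            total++
            if (bounds[2] - bounds[1] != expected) bad++
        }
        END { printf "%d chunks, %d with unexpected length\n", total, bad }
    ' "$1"
}
```

For example, `check_upload_log upload.log 500` (where `upload.log` is a placeholder name) prints a one-line summary such as `1000 chunks, 0 with unexpected length`.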
|
|
|
+
|
|
|
+The subject set and its subjects (i.e. the chunks) now appear in the project:
|
|
|
+
|
|
|
+![Zooniverse subjects](extra/zoo1.png)
|
|
|
+![Zooniverse subjects](extra/zoo2.png)
|
|
|
+
|
|
|
+## Retrieving classifications
|
|
|
+
|