The present repository showcases the organization of a Zooniverse campaign using ChildProject and DataLad. Zooniverse is a crowd-sourcing platform that may be used for large-scale annotation tasks. The ExElang book provides examples of research goals for which Zooniverse may be useful.
The present campaign requires citizens to listen to 500 ms audio clips and then to perform the following tasks:
This repository contains all the scripts that we used to implement this workflow. You are welcome to re-use this code and adapt it to your needs.
annotationscontains annotations built from the classifications retrieved from Zooniverse.
classificationscontains the classifications retrieved from Zooniverse.
samplescontains the samples that were selected as well as the chunks generated from them.
vandam-datais a subdataset containing VanDam Daylong corpus, structed according to ChildProject's format — this is very important as it allows to use all the features of ChildProject that will be used next.
The first step is to prepare the samples to be annotated; this includes choosing the portions of audio to annotate, and then extracting audio files from these portions. See this issue for examples of sampling strategies that have been used or considered. Many of them involve splitting the audio in short chunks in order to preserve privacy. However, this technique may not be suited for certain annotation tasks (e.g. for semantics).
Sampling consists in selecting which portions of the recordings should be annotated by humans. It can be done through using the samplers provided in the ChildProject package.
Here, we sample 50 vocalizations per recording among all those detected and attributed to the key child (CHI) or a female adult (FEM) by the Voice Type Classifier:
# download VTC annotations datalad get vandam-data/annotations/vtc/converted # sample random CHI and FEM vocalizations from these annotations child-project sampler vandam-data samples/chi_fem/ random-vocalizations \ --annotation-set vtc \ --target-speaker-type CHI FEM \ --sample-size 50 \ --by recording_filename
The sampler produces a CSV dataframe as
After the samples have been generated, they have to be extracted from the audio and uploaded to Zooniverse, which can be done with ChildProject's extract-chunks function.
However, these samples may contain private information about the participants. Therefore, they cannot be shared as is on a public crowd-sourcing platform. We therefore configure
extract-chunks to split these samples into 500ms chunks, which will be classified in random order, thus preventing the recovery of sensitive information by the contributors:
datalad get vandam-data/recordings/converted/standard child-project zooniverse extract-chunks vandam-data \ --keyword chi_fem \ --chunks-length 500 \ --segments samples/chi_fem/segments_20210716_184443.csv \ --destination samples/chi_fem/chunks \ --profile standard
This will extract the audio chunks into
samples/chi_fem/chunks/chunks/ and write a metadata file into
samples/chi_fem/chunks (in our case, as
See ChildProject's documentation for more information about the Zooniverse pipeline.
Once the chunks have been extracted, the next step is to upload them to Zooniverse. Note that due to quotas, it is recommended to upload only a few at time (e.g. 1000 per day).
You will need to provide the numerical id of your Zooniverse project. Instructions to create a Zooniverse project are available here.
You will also need to set Zooniverse credentials as environment variables:
export ZOONIVERSE_LOGIN="" export ZOONIVERSE_PWD="" export PROJECT_ID=14957
child-project zooniverse upload-chunks \ --chunks samples/chi_fem/chunks/chunks_20210716_191944.csv \ --project-id $PROJECT_ID \ --set-name vandam_chi_fem \ --amount 1000
This will display a message for each chunk:
... uploading chunk BN32_010007.mp3 (25153080,25153580) uploading chunk BN32_010007.mp3 (45016146,45016646) uploading chunk BN32_010007.mp3 (46794141,46794641) uploading chunk BN32_010007.mp3 (14107752,14108252) uploading chunk BN32_010007.mp3 (35709983,35710483) uploading chunk BN32_010007.mp3 (45433933,45434433) uploading chunk BN32_010007.mp3 (35711483,35711983) uploading chunk BN32_010007.mp3 (38737938,38738438) uploading chunk BN32_010007.mp3 (24586156,24586656) uploading chunk BN32_010007.mp3 (15556956,15557456) uploading chunk BN32_010007.mp3 (28439601,28440101) uploading chunk BN32_010007.mp3 (27317629,27318129) uploading chunk BN32_010007.mp3 (38391252,38391752) ...
The subject set and its subjects (i.e. the chunks) now appears in the project:
The classifications performed by citizens on Zooniverse for this project can be retrieved with ChildProject's retrieve-classifications command:
child-project zooniverse retrieve-classifications \ --destination classifications/classifications.csv \ --project-id $PROJECT_ID
$ head classifications/classifications.csv id,user_id,subject_id,task_id,answer_id,workflow_id,answer 335442480,2202359,61513545,T1,4,17576,Junk 335442502,2202359,61513542,T1,4,17576,Junk 335442523,2202359,61513540,T1,4,17576,Junk 335442614,2202359,61513546,T1,4,17576,Junk 335442630,2202359,61513544,T1,4,17576,Junk 335443247,2221541,61513546,T1,4,17576,Junk 335443443,2202359,61513544,T1,4,17576,Junk 335450309,1078249,61513540,T1,4,17576,Junk 335457868,2202359,61514279,T1,4,17576,Junk
The output contains no information related to the metadata of the input dataset, which is the desired behavior (Zooniverse should not store any data that might compromise the privacy of the participants) It also contains all the classifications for this project, including those for data outside of the current campaign. As a result, these data alone cannot be exploited and they have to be matched to the chunks metadata.
Classifications can be matched to the original metadata (the recording from which the clips were extracted, the timestamps of the clips, etc.) manually, but it is possible to retrieve the classifications from Zooniverse and match them with their metadata at the same time with ChildProject:
child-project zooniverse retrieve-classifications \ --destination classifications/classifications.csv \ --project-id $PROJECT_ID \ --chunks samples/chi_fem/chunks/chunks*.csv
Now, only relevant chunks are returned, and they are associated to all corresponding metadata:
Once the classifications have been recovered, they can be used to enrich the source dataset with more annotations. While this step may depend a lot on the type of annotation that you are doing, this repository provides an example.
The feed-annotations script does just that. It can be run with:
python annotations/feed-annotations.py classifications/classifications.csv
The classifications are then imported into the
vandam-data subdataset using ChildProject:
$ tail -n 1 vandam-data/metadata/annotations.csv zoo,BN32_010007.mp3,0,0,50464512,BN32_010007.csv,csv,,BN32_010007_0_50464512.csv,2021-07-19 11:14:58,,0.0.1
In case several users have classified the same chunks, the majority choice is retained. You can have a look at the source of the script to see how that works - or to adapt it to your needs!
Other strategies can be considered; in previous work, Semenzin et al. (2020) have reconstructed the original segments by combining the classifications of the 500 ms chunks, while Cychosz et al. (2021) used the classifications of the individual chunks without reconstruction.
Once the annotations have been imported into the original dataset, you can use all the functionalities of the ChildProject package e.g. for reliability estimations.
For instance, let's day we'd like to test the ability of the VTC to distinguish children from adults, based on the classifications retrieved with Zooniverse.
Which will output:
This example only contains around a hundred subjects extracted from a sole recording. Real-life projects usually involve much more data - typically tens of thousands of subjects. In order to go big, we advise you of the following:
.gitignorefile. Versioning too many files within one repository may cripple it and render operations much slower. Also, provided the metadata for the selected chunks and the original recordings are properly stored and backed-up, the audio chunks can be extracted again at any later time if necessary.
--threadsoption for parallel processing.