more links and details (see issue #2)

Lucas Gautheron 2 years ago
commit 15731315c8
1 changed file with 7 additions and 1 deletion

README.md (+7 -1)

@@ -47,6 +47,10 @@ You are welcome to re-use this code and adapt it to your needs.
 
 ## Preparing samples
 
+The first step is to prepare the samples to be annotated; this includes choosing the portions of audio to annotate, and then extracting audio files from these portions.
+See [this issue](https://github.com/LAAC-LSCP/ChildProject/issues/109) for examples of sampling strategies that have been used or considered.
+Many of them involve splitting the audio into short chunks in order to [preserve privacy](https://laac-lscp.github.io/exelang-book/irb.html#additional-considerations). However, this technique may not be suited to certain annotation tasks (e.g. semantic annotation).
+
 ### Sampling
 
 Sampling consists in selecting which portions of the recordings should be annotated by humans.
@@ -209,6 +213,8 @@ zoo,BN32_010007.mp3,0,0,50464512,BN32_010007.csv,csv,,BN32_010007_0_50464512.csv
 
 In case several users have classified the same chunks, the majority choice is retained. You can have a look at the [source of the script](https://gin.g-node.org/LAAC-LSCP/zoo-campaign/src/master/annotations/feed-annotations.py) to see how that works - or to adapt it to your needs!
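
The majority rule described above can be illustrated with a minimal sketch. This is not the code of `feed-annotations.py`; the function name `majority_choice`, the `(chunk_id, answer)` input format, and the answer labels are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def majority_choice(classifications):
    """Return the most frequent answer for each chunk.

    `classifications` is an iterable of (chunk_id, answer) pairs
    (a hypothetical format, not the Zooniverse export format).
    Ties are resolved in favour of the answer encountered first.
    """
    answers_by_chunk = defaultdict(list)
    for chunk_id, answer in classifications:
        answers_by_chunk[chunk_id].append(answer)

    # Counter.most_common(1) returns the single most frequent answer.
    return {
        chunk_id: Counter(answers).most_common(1)[0][0]
        for chunk_id, answers in answers_by_chunk.items()
    }


# Hypothetical votes from three users on two chunks.
votes = [
    ("chunk_1", "Canonical"),
    ("chunk_1", "Non-Canonical"),
    ("chunk_1", "Canonical"),
    ("chunk_2", "Crying"),
]
result = majority_choice(votes)
```

Here `result` maps each chunk to its retained label. A real pipeline would also need to decide how to handle ties and chunks with too few classifications.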
 
+Other strategies can be considered; in previous work, Semenzin et al. ([2020](https://osf.io/gpxf5)) reconstructed the original segments by combining the classifications of the 500 ms chunks, while Cychosz et al. ([2021](https://onlinelibrary.wiley.com/doi/10.1111/desc.13090)) used the classifications of the individual chunks without reconstruction.
+
 ### Comparing Zooniverse annotations with other annotations
 
 Once the annotations have been imported into the original dataset, you can use all the functionalities of the ChildProject package e.g. for reliability estimations.
@@ -231,6 +237,6 @@ This example only contains around a hundred subjects extracted from a sole recor
 Real-life projects usually involve much more data - typically tens of thousands of subjects.
 In order to go big, we advise the following:
 
-- Ask Zooniverse for increased subjects quota.
+- Ask Zooniverse for [increased subjects quota](https://help.zooniverse.org/getting-started/lab-policies/).
 - If you are using a version control system such as git/DataLad, you may not want to commit the audio chunks. This can be avoided with appropriate rules in a `.gitignore` file. Versioning too many files within one repository may cripple it and make operations much slower. Also, provided the metadata for the selected chunks and the original recordings are properly stored and backed up, the audio chunks can be extracted again at any later time if necessary.
 - Some operations such as sampling or extracting chunks may be demanding for large datasets. We recommend performing this step on a cluster using several CPU cores. The ChildProject provides a `--threads` option for parallel processing.