
going big

Lucas Gautheron 6 months ago
parent
commit
60c7f88082
4 changed files with 15 additions and 7 deletions
  1. .DS_Store (+1 -0)
  2. README.md (+12 -1)
  3. annotations/compare.py (+1 -5)
  4. annotations/comparison.png (+1 -1)

+ 1 - 0
.DS_Store

@@ -0,0 +1 @@
+.git/annex/objects/f4/Jv/MD5E-s6148--ca848894e6b31acf26b81f5a5dfa66cb/MD5E-s6148--ca848894e6b31acf26b81f5a5dfa66cb

+ 12 - 1
README.md

@@ -11,6 +11,7 @@
   - [Matching classifications back to the metadata](#matching-classifications-back-to-the-metadata)
 - [Importing classifications into the source dataset](#importing-classifications-into-the-source-dataset)
   - [Comparing Zooniverse annotations with other annotations](#comparing-zooniverse-annotations-with-other-annotations)
+- [Going big](#going-big)
 
 ## Summary
 
@@ -218,4 +219,14 @@ The [compare](https://gin.g-node.org/LAAC-LSCP/zoo-campaign/src/master/annotatio
 
  Which will output:
 
- ![Comparing the VTC and Zooniverse classifications](annotations/comparison.png)
+ ![Comparing the VTC and Zooniverse classifications](annotations/comparison.png)
+
+ ## Going big
+
+This example only contains around a hundred subjects extracted from a single recording.
+Real-life projects usually involve much more data, typically tens of thousands of subjects.
+To go big, we recommend the following.
+
+- Ask Zooniverse for an increased subject quota.
+- If you are using a version control system such as git/DataLad, you may not want to commit the audio chunks. This can be avoided with appropriate rules in a `.gitignore` file. Versioning too many files within one repository may cripple it and make operations much slower. Also, provided the metadata for the selected chunks and the original recordings are properly stored and backed up, the audio chunks can be extracted again at any later time if necessary.
+- Some operations, such as sampling or extracting chunks, may be demanding for large datasets. We recommend performing these steps on a cluster using several CPU cores. ChildProject provides a `--threads` option for parallel processing.
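
Note (not part of the commit): the last point above can be pictured with a minimal sketch of per-chunk work spread over a pool of workers. This is not ChildProject's implementation. `extract_chunk` and `extract_all` are hypothetical names introduced only for illustration, and the "work" is reduced to computing chunk boundaries in milliseconds.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def extract_chunk(onset_ms: int, duration_ms: int = 500) -> Tuple[int, int]:
    """Hypothetical stand-in for the per-chunk work (e.g. cutting audio).

    Here it only computes the (onset, offset) boundaries in milliseconds.
    """
    return onset_ms, onset_ms + duration_ms


def extract_all(onsets: List[int], threads: int = 4) -> List[Tuple[int, int]]:
    """Process every chunk using a pool of `threads` workers.

    `pool.map` yields results in the same order as `onsets`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(extract_chunk, onsets))


if __name__ == "__main__":
    # Four chunk onsets, processed by two workers.
    print(extract_all([0, 500, 1000, 1500], threads=2))
```

A thread pool is used here only to keep the sketch short. For CPU-bound work in pure Python, a process pool (`concurrent.futures.ProcessPoolExecutor`) is usually the better fit.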

+ 1 - 5
annotations/compare.py

@@ -1,10 +1,6 @@
-import argparse
+import matplotlib.pyplot as plt
 import numpy as np
-import os
-import pandas as pd
-
 import seaborn as sns
-import matplotlib.pyplot as plt
 
 from ChildProject.projects import ChildProject
 from ChildProject.annotations import AnnotationManager

+ 1 - 1
annotations/comparison.png

@@ -1 +1 @@
-../.git/annex/objects/Jv/vP/MD5E-s34903--a879abed2cb645edc4877fd3c15d4eb3.png/MD5E-s34903--a879abed2cb645edc4877fd3c15d4eb3.png
+../.git/annex/objects/x2/6W/MD5E-s34867--c14191004dfb20b03d3c9fb9d8de8589.png/MD5E-s34867--c14191004dfb20b03d3c9fb9d8de8589.png