Lucas Gautheron 2 years ago
parent
commit
78cb001172
1 changed file with 1 addition and 1 deletion

+ 1 - 1
main.tex

@@ -112,7 +112,7 @@ Metadata                & none           & excel \\ \bottomrule
 
 \subsubsection*{Keeping up with updates and contributions}
 
-Datasets are not frozen. Rather, they are continuously enriched through annotations provided by humans or new algorithms. Human annotations may also undergo corrections as errors are discovered. The process of collecting the recordings may also require a certain amount of time, as they are progressively returned by the field workers or the participants themselves. In the case of longitudinal studies, supplementary audio data may accumulate over several years. Researchers should be able to keep track of these changes while also upgrading their analyses. Moreover, several collaborators may be brought to contribute work to the same dataset simultaneously. To take the example of ACLEW, PIs first annotated a random selection of 2-minute clips for 10 children in-house. They then exchanged some of these audio clips so that the annotators in another lab could re-annotate the same data, for the purposes of inter-rater reliability. This revealed divergences in definitions, and all datasets needed to be revised. Finally, a second sample of 2-minute clips with high levels of speech activity were annotated, and another process of reliability was performed. Particularly in such collaborative projects, one cannot overestimate the importance of "hyperactive curation," a concept promoted by \cite{soska2021hyper}, whereby researchers consider data sharing at every step of the research cycle (including before data collection begins), and "upload [data] as you go". As explained in Section \ref{section:datalad}, datasets can be kept up to date including in a collaborative setting thanks to DataLad \citep{datalad_paper}, an extension of git and one of the components of our proposed design, .
+Datasets are not frozen. Rather, they are continuously enriched with annotations produced by humans or by new algorithms. Human annotations may also undergo corrections as errors are discovered. Collecting the recordings may itself take considerable time, as they are progressively returned by the field workers or by the participants themselves. In longitudinal studies, supplementary audio data may accumulate over several years. Researchers should be able to keep track of these changes while also updating their analyses. Moreover, several collaborators may contribute to the same dataset simultaneously. To take the example of ACLEW, PIs first annotated a random selection of 2-minute clips for 10 children in-house. They then exchanged some of these audio clips so that annotators in another lab could re-annotate the same data, for the purpose of assessing inter-rater reliability. This revealed divergences in definitions, and all datasets had to be revised. Finally, a second sample of 2-minute clips with high levels of speech activity was annotated, and another round of reliability assessment was performed. Particularly in such collaborative projects, one cannot overestimate the importance of ``hyperactive curation,'' a concept promoted by \cite{soska2021hyper}, whereby researchers consider data sharing at every step of the research cycle (including before data collection begins) and ``upload [data] as you go''. As explained in Section \ref{section:datalad}, datasets can be kept up to date, including in a collaborative setting, thanks to DataLad \citep{datalad_paper}, an extension of git and one of the components of our proposed design.
 
 \subsubsection*{Performance and infrastructure}