
link to the 'paper' version of the docs

Lucas Gautheron, 2 years ago · commit 4966f5c5f8
1 changed file with 9 additions and 9 deletions

main.tex (+9 −9)

@@ -256,7 +256,7 @@ We provide a validation script that returns a detailed reporting of all the erro
 
 Fig. \ref{fig:annotations} shows that, whatever their format, annotations are always conceptually segments delimited by onset and offset timestamps, to which a number of properties are attached, such as the speaker's identity or a transcription. Therefore, annotations can almost always be represented as Pandas dataframes, with one row per segment and one column per property.
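For illustration, such a segment table might look as follows in pandas. This is only a minimal sketch: the column names used here are indicative, while the standardized columns actually used by the package are listed in the documentation.

\begin{verbatim}
import pandas as pd

# One row per annotated segment, one column per property
# (illustrative column names; timestamps in milliseconds).
segments = pd.DataFrame([
    {"segment_onset": 1000, "segment_offset": 2400,
     "speaker_type": "FEM", "transcription": "look at the doggy"},
    {"segment_onset": 2600, "segment_offset": 3100,
     "speaker_type": "CHI", "transcription": "doggy"},
])

# Properties can then be manipulated with ordinary dataframe
# operations, e.g. computing each segment's duration.
segments["duration"] = segments["segment_offset"] - segments["segment_onset"]
\end{verbatim}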
 
-Taking advantage of this, the package converts input annotations to standardized, wide-table CSV dataframes\footnote{\url{https://childproject.readthedocs.io/en/latest/annotations.html}}. The columns in these wide-table formats have been determined based on previous work, and are largely specific to the goal of studying infants' language environment and production. However, users can introduce custom columns if required.
+Taking advantage of this, the package converts input annotations to standardized, wide-table CSV dataframes\footnote{\url{https://childproject.readthedocs.io/en/paper/annotations.html}}. The columns in these wide-table formats have been determined based on previous work, and are largely specific to the goal of studying infants' language environment and production. However, users can introduce custom columns if required.
 
 Annotations are indexed into a unique CSV dataframe that stores their location in the dataset, the set of annotations they belong to, and the recording and time interval they cover. The index, therefore, allows easy retrieval of all the annotations that cover any given segment of audio, regardless of their original format and the naming conventions that were used. The system interfaces well with extant annotation standards. Currently, ChildProject supports: LENA annotations in .its \citep{xu2008lenatm}; ELAN annotations following the ACLEW DAS template (\citealt{Casillas2017}, imported using Pympi: \citealt{pympi-1.70}); CHAT annotations \citep{MacWhinney2000}; as well as .rttm files output by ACLEW tools, namely the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}, the Linguistic Unit Count Estimator (ALICE) by \citet{rasanen2020}, and the VoCalisation Maturity Network (VCMNet) by \citet{AlFutaisi2019}. Users can also adapt routines to file types or conventions that vary: for instance, the ELAN import developed for the ACLEW DAS template can be adapted to other templates (e.g., \url{https://github.com/LAAC-LSCP/ChildProject/discussions/204}); and examples are also provided for Praat's .TextGrid files \citep{boersma2006praat}. The package also supports custom, user-defined conversion routines.
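As a sketch of how the index can be queried, the snippet below uses pandas to retrieve all indexed annotations overlapping a given time window of one recording. The index path and column names (recording_filename, set, range_onset, range_offset) are assumptions based on the documentation and may differ from the actual package.

\begin{verbatim}
import pandas as pd

# Load the annotations index (path and column names are assumptions).
index = pd.read_csv("metadata/annotations.csv")

# Target window within one recording, in milliseconds.
recording, onset, offset = "rec001.wav", 3600000, 3900000

# Keep every annotation whose covered interval overlaps the window,
# regardless of the annotations' original format.
covering = index[
    (index["recording_filename"] == recording)
    & (index["range_onset"] < offset)
    & (index["range_offset"] > onset)
]
print(covering[["set", "range_onset", "range_offset"]])
\end{verbatim}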
 
@@ -266,19 +266,19 @@ Relying on the annotations index, the package can also calculate the intersectio
 
 As noted in the Introduction, recordings are too extensive to be manually annotated in their entirety. We and colleagues have typically hand-annotated clips of 0.5--5 minutes in length, and the way these clips are extracted and annotated varies (as illustrated in Table \ref{table:datasets}).
 
-The package allows the use of predefined or custom sampling algorithms\footnote{\url{https://childproject.readthedocs.io/en/latest/samplers.html}}. The samples' timestamps are exported to CSV dataframes. To keep track of the sample-generating process, the input parameters are simultaneously saved into a YAML file. Predefined samplers include a periodic sampler, a sampler targeting specific speakers' vocalizations, a sampler targeting regions of high volubility according to input annotations, and a more agnostic sampler targeting high-energy regions. In all cases, the user can define the number of regions and their duration, as well as the context that may be inspected by human annotators. These options cover all documented sampling strategies. Evaluations of the performance of some of these samplers can be found in \citep[Chapter 15, ``Human annotation'']{exelang-book}.
+The package allows the use of predefined or custom sampling algorithms\footnote{\url{https://childproject.readthedocs.io/en/paper/samplers.html}}. The samples' timestamps are exported to CSV dataframes. To keep track of the sample-generating process, the input parameters are simultaneously saved into a YAML file. Predefined samplers include a periodic sampler, a sampler targeting specific speakers' vocalizations, a sampler targeting regions of high volubility according to input annotations, and a more agnostic sampler targeting high-energy regions. In all cases, the user can define the number of regions and their duration, as well as the context that may be inspected by human annotators. These options cover all documented sampling strategies. Evaluations of the performance of some of these samplers can be found in \citep[Chapter 15, ``Human annotation'']{exelang-book}.
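As an illustration of the simplest of these strategies, the sketch below implements a naive periodic sampler (it is not the package's own implementation): clips of fixed length are drawn at regular intervals, their timestamps are written to a CSV file, and the input parameters are saved alongside them in a YAML file. All names and values are placeholders.

\begin{verbatim}
import pandas as pd
import yaml

# Illustrative parameters (all durations in milliseconds).
params = {"recording": "rec001.wav", "duration": 36000000,
          "period": 3600000, "length": 120000}

# Draw one clip of `length` ms every `period` ms.
onsets = list(range(0, params["duration"], params["period"]))
samples = pd.DataFrame({"recording_filename": params["recording"],
                        "segment_onset": onsets})
samples["segment_offset"] = samples["segment_onset"] + params["length"]

# Export the samples' timestamps and keep track of how they were drawn.
samples.to_csv("samples.csv", index=False)
with open("parameters.yml", "w") as f:
    yaml.safe_dump(params, f)
\end{verbatim}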
 
 \subsubsection*{Generating ELAN files ready to be annotated}
 
-Although there was some variability in the programs used for human annotation, the field has now by and large settled on ELAN \citep{wittenburg2006elan}. ELAN employs XML files with a hierarchical structure that are both customizable and flexible. The ChildProject package can be used to generate .eaf files ready to be annotated with the ELAN software\footnote{\url{https://childproject.readthedocs.io/en/latest/elan.html}}, based on samples of the recordings drawn using the package, as described in Section \ref{section:choosing}.
+Although there was some variability in the programs used for human annotation, the field has now by and large settled on ELAN \citep{wittenburg2006elan}. ELAN employs XML files with a hierarchical structure that are both customizable and flexible. The ChildProject package can be used to generate .eaf files ready to be annotated with the ELAN software\footnote{\url{https://childproject.readthedocs.io/en/paper/elan.html}}, based on samples of the recordings drawn using the package, as described in Section \ref{section:choosing}.
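The sketch below illustrates how such a file can be built programmatically with Pympi. It is only a minimal example: the package's own builder additionally handles the ACLEW DAS template, media linking and naming conventions, and the file names, tier names and timestamps used here are placeholders.

\begin{verbatim}
from pympi.Elan import Eaf

eaf = Eaf()
# Link the sampled recording (placeholder file name).
eaf.add_linked_file("rec001.wav")

# One empty tier per speaker type, spanning the sampled clip
# (timestamps in milliseconds).
for tier in ["CHI", "FEM", "MAL"]:
    eaf.add_tier(tier)
    eaf.add_annotation(tier, 3600000, 3720000, value="")

eaf.to_file("rec001_3600000_3720000.eaf")
\end{verbatim}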
 
 \subsubsection*{Extracting and uploading audio samples to Zooniverse}
 
-The crowd-sourcing platform Zooniverse \citep{zooniverse} has been extensively employed in both natural \citep{gravityspy} and social sciences. More recently, researchers have been investigating its potential to classify samples of audio extracted from daylong recordings of children, and the results have been encouraging \citep{semenzin2020a,semenzin2020b}. We provide tools interfacing with Zooniverse's API for preparing and uploading audio samples to the platform and for retrieving the results, while protecting the privacy of the participants\footnote{\url{https://childproject.readthedocs.io/en/latest/zooniverse.html}}. A step-by-step tutorial including reusable code is also provided \citep{zooniverse_example}.
+The crowd-sourcing platform Zooniverse \citep{zooniverse} has been extensively employed in both natural \citep{gravityspy} and social sciences. More recently, researchers have been investigating its potential to classify samples of audio extracted from daylong recordings of children, and the results have been encouraging \citep{semenzin2020a,semenzin2020b}. We provide tools interfacing with Zooniverse's API for preparing and uploading audio samples to the platform and for retrieving the results, while protecting the privacy of the participants\footnote{\url{https://childproject.readthedocs.io/en/paper/zooniverse.html}}. A step-by-step tutorial including reusable code is also provided \citep{zooniverse_example}.
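As an illustration of the kind of interaction involved, the sketch below uses the panoptes_client library directly to upload one audio chunk as a Zooniverse subject. The project id, file names and metadata are placeholders; the actual pipeline additionally cuts the recordings into short chunks and keeps track of what has been uploaded.

\begin{verbatim}
from panoptes_client import Panoptes, Project, Subject, SubjectSet

# Placeholder credentials and project id.
Panoptes.connect(username="my-username", password="my-password")
project = Project.find(12345)

# Create a subject set for this batch of audio chunks.
subject_set = SubjectSet()
subject_set.links.project = project
subject_set.display_name = "chunks-batch-1"
subject_set.save()

# Upload one chunk; metadata should never identify the participants.
subject = Subject()
subject.links.project = project
subject.add_location("chunk_0001.mp3")
subject.metadata["chunk_onset"] = 3600000
subject.save()
subject_set.add(subject)
\end{verbatim}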
 
 \subsubsection*{Audio processing}
 
-ChildProject allows the batch conversion of recordings to any target audio format (using \citealt{ffmpeg})\footnote{\url{https://childproject.readthedocs.io/en/latest/processors.html}}.
+ChildProject allows the batch conversion of recordings to any target audio format (using \citealt{ffmpeg})\footnote{\url{https://childproject.readthedocs.io/en/paper/processors.html}}.
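A minimal version of this operation could look as follows, calling ffmpeg directly to convert every recording to 16 kHz mono WAV; the paths and target parameters are placeholders, not the package's defaults.

\begin{verbatim}
import subprocess
from pathlib import Path

# Convert every raw recording to 16 kHz mono WAV (placeholder paths).
for wav in Path("recordings/raw").glob("*.wav"):
    out = Path("recordings/converted/standard") / wav.name
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav),
         "-ar", "16000", "-ac", "1", str(out)],
        check=True,
    )
\end{verbatim}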
 
 The package also implements a ``vetting'' \citep{vandam2018vetting,Cychosz2020} pipeline, which mutes segments of the recordings previously annotated by humans as confidential, while preserving the duration of the audio files. After being processed, the recordings can safely be shared with other researchers or annotators.
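The underlying operation amounts to silencing the flagged intervals in place, as in the sketch below. File names and segment boundaries are placeholders, and the actual pipeline reads the intervals from the human annotations rather than from a hard-coded list.

\begin{verbatim}
import soundfile as sf

# Read the recording and list the confidential intervals (in seconds).
audio, rate = sf.read("rec001.wav")
confidential = [(125.0, 180.0), (3020.5, 3055.0)]  # placeholder intervals

# Mute each interval while leaving the total duration unchanged.
for onset, offset in confidential:
    audio[int(onset * rate):int(offset * rate)] = 0.0

sf.write("rec001_vetted.wav", audio, rate)
\end{verbatim}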
 
@@ -286,7 +286,7 @@ Another pipeline allows to perform filtering or linear combinations of audio cha
 
 \subsubsection*{Metrics extraction}
 
-The package includes a pipeline to extract metrics that are commonly used in this research area -- such as each speaker's speech rate -- by aggregating annotations at the desired level, e.g., per recording or per child\footnote{\url{https://childproject.readthedocs.io/en/latest/metrics.html}}. Metrics can also be aggregated by time of day, using time bins whose width is chosen by the user.
+The package includes a pipeline to extract metrics that are commonly used in this research area -- such as each speaker's speech rate -- by aggregating annotations at the desired level, e.g., per recording or per child\footnote{\url{https://childproject.readthedocs.io/en/paper/metrics.html}}. Metrics can also be aggregated by time of day, using time bins whose width is chosen by the user.
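A simplified version of such an aggregation, starting from a table of annotated segments and computing the number and total duration of vocalizations per recording and speaker type, could look as follows; file names and column names are illustrative rather than the package's own.

\begin{verbatim}
import pandas as pd

# Illustrative table of annotated segments (one row per segment).
segments = pd.read_csv("segments.csv")
segments["duration"] = segments["segment_offset"] - segments["segment_onset"]

# Aggregate at the recording level, separately for each speaker type.
metrics = (
    segments.groupby(["recording_filename", "speaker_type"])
    .agg(voc_count=("duration", "size"),
         voc_duration=("duration", "sum"))
    .reset_index()
)
metrics.to_csv("metrics.csv", index=False)
\end{verbatim}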
 
 \subsubsection*{Other functionalities}
 
@@ -348,7 +348,7 @@ Table \ref{table:providers} sums up the most relevant characteristics of a few p
 
 Among the criteria of special interest are the provider's ability to handle complex permissions; how much data it can accept; its ability to assign permanent URLs and identifiers to the datasets; and of course, whether it complies with the legislation regarding privacy. For our purposes, Table \ref{table:providers} suggests GIN is the best option, handling large files well, with highly customizable permissions, and Git-based version control and access (see Appendix \ref{appendix:gin} for a practical use case of GIN). That said, private projects are limited in space, although at the time of writing this limit can be raised by contributions to the GIN administrators. Moreover, there is no long-term guarantee that GIN will keep operating as it currently does. However, GIN's software is open-source, enabling users to run their own instance, to which they could move their data at any time -- which is very straightforward with DataLad. The next best option may be S3; users without access to a local cluster may prefer it, since it allows both easy storage and processing.
 
-To make comparing options easier, detailed examples of storage designs taken from real datasets are listed in Appendix \ref{appendix:examples}. Scripts to implement these strategies can be found on our GitHub and OSF \citep{datalad_procedures}. We also provide a tutorial, based on a public corpus \citep{vandam-day}, showing how to convert existing data to our standards and then publish it with DataLad\footnote{\url{https://childproject.readthedocs.io/en/latest/vandam.html}}.
+To make comparing options easier, detailed examples of storage designs taken from real datasets are listed in Appendix \ref{appendix:examples}. Scripts to implement these strategies can be found on our GitHub and OSF \citep{datalad_procedures}. We also provide a tutorial, based on a public corpus \citep{vandam-day}, showing how to convert existing data to our standards and then publish it with DataLad\footnote{\url{https://childproject.readthedocs.io/en/paper/vandam.html}}.
 We would like to emphasize that the flexibility of DataLad makes it very easy to migrate from one architecture to another: the underlying infrastructure may change with little to no impact on users, and little effort from the maintainers.
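For readers unfamiliar with DataLad, the publication workflow boils down to a handful of commands. The sketch below uses DataLad's Python API (the equivalent shell commands are datalad create, save, siblings and push); the dataset name, sibling name and URL are placeholders.

\begin{verbatim}
import datalad.api as dl

# Create a dataset, save its contents, and publish it to a sibling
# (placeholder sibling name and URL, e.g. a repository hosted on GIN).
dl.create(path="my-corpus")
dl.save(dataset="my-corpus", message="import recordings and metadata")
dl.siblings(action="add", dataset="my-corpus", name="gin",
            url="git@gin.g-node.org:/my-lab/my-corpus.git")
dl.push(dataset="my-corpus", to="gin")
\end{verbatim}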
 
 In any case, we strongly recommend that users bear in mind that redundancy is important to make sure data are not lost; a backup sibling may therefore be hosted at an additional site (e.g., on a computer on campus in addition to the cloud-based version).
@@ -561,7 +561,7 @@ s3 & Amazon S3  &  recordings; annotations  & Collaborators  & AES-128 \\ \botto
 \caption{\label{table:storage2}Example 2 - Storage strategy example relying on GitHub and Amazon S3.}
 \end{table*}
 
-Amazon is superior to most alternatives for a number of reasons, including that it is highly tested, developed by engineers with a high level of knowledge of the platform, and widely used. This means that the code is robust even before it is released, and it is widely tested once it is released. The fact that there are many users also means that issues or questions can be looked up online. In addition, in the context of data durability, Amazon is a good choice because it is ``too big to fail'', and thus probably available for the long term. Moreover, in terms of sheer flexibility and coverage, Amazon provides a whole suite of tools (for data sharing, backups, and processing), which may be useful for researchers with little access to high-capacity infrastructures. Additionally, it is not very costly (see comparison table on \url{https://childproject.readthedocs.io/en/latest/vandam.html?highlight=amazon#where-to-publish-my-dataset}).
+Amazon is superior to most alternatives for a number of reasons, including that it is highly tested, developed by engineers with a high level of knowledge of the platform, and widely used. This means that the code is robust even before it is released, and it is widely tested once it is released. The fact that there are many users also means that issues or questions can be looked up online. In addition, in the context of data durability, Amazon is a good choice because it is ``too big to fail'', and thus probably available for the long term. Moreover, in terms of sheer flexibility and coverage, Amazon provides a whole suite of tools (for data sharing, backups, and processing), which may be useful for researchers with little access to high-capacity infrastructures. Additionally, it is not very costly (see comparison table on \url{https://childproject.readthedocs.io/en/paper/vandam.html?highlight=amazon#where-to-publish-my-dataset}).
 
 \subsection{Example 3 - sharing large datasets with outside collaborators  and multi-tier access (GIN)}\label{appendix:gin}
 
@@ -609,7 +609,7 @@ s3 & Amazon S3  & annotations; recordings & Alice, Bob and Carol  & No \\ \botto
 \caption{\label{table:storage4}Example 4 - Storage strategy example relying on OSF and Amazon S3 to deliver the data.}
 \end{table*}
 
-We use the reverse approach for our demo dataset\footnote{https://github.com/LAAC-LSCP/vandam-daylong-demo} based on \citet{vandam-day}, hosting the git repository on GitHub and the large files on OSF. This is possible only because of the small size of the dataset.
+We use the reverse approach for our demo dataset\footnote{\url{https://github.com/LAAC-LSCP/vandam-daylong-demo}} based on \citet{vandam-day}, hosting the git repository on GitHub and the large files on OSF. This is possible only because of the small size of the dataset.
 
 \bibliographystyle{spbasic}
 \bibliography{references}