Lucas Gautheron 3 years ago
parent
commit
955b6cbb5b
3 changed files with 19 additions and 20 deletions
  1. main.pdf (+0 −1)
  2. main.pdf (+1 −0)
  3. main.tex (+18 −19)

+ 0 - 1
main.pdf

@@ -1 +0,0 @@
-/annex/objects/MD5E-s367152--f8c3725421503da23710aee274222ce1.pdf

+ 1 - 0
main.pdf

@@ -0,0 +1 @@
+.git/annex/objects/2j/8G/MD5E-s369148--257f5395ae626048144e91224e4a7963.pdf/MD5E-s369148--257f5395ae626048144e91224e4a7963.pdf

+ 18 - 19
main.tex

@@ -134,7 +134,7 @@ The field of child-centered long-form recordings has benefited from a purpose-bu
 
 As briefly noted above, Databrary \url{databrary.org} also already hosts some long-form recording data. The aforementioned ACLEW project committed to archiving data there, rather than on HomeBank, because it allowed direct control and updates (without needing to ask the HomeBank management). In our experience as re-users, one of the most useful features of Databrary is the possibility to search the full archive for data pertaining to children of specific ages or origins. Using this archiving option also led us to realize some limitations, including the fact that there is no API, meaning that all updates need to be done via a graphical browser-based interface.
 
-Additional options have been considered by researchers in the community, including OSF \url{osf.io}, and the Language Archive \url{https://archive.mpi.nl/tla/}. Detailing all their features is beyond the scope of the present paper, but some discussion can be found in \cite{casillas2019step}. 
+Additional options have been considered by researchers in the community, including OSF\footnote{\url{osf.io}} and the Language Archive\footnote{\url{https://archive.mpi.nl/tla/}}. Detailing all their features is beyond the scope of the present paper, but some discussion can be found in \cite{casillas2019step}.
 
 Without denying their usefulness and importance, none of these archives provides perfect solutions to all of the problems we laid out above -- and notably, in our vision, researchers should not have to choose among them when archiving their data. These limitations have brought us to envision a new strategy for sharing these datasets, which we detail next. 
 
@@ -143,10 +143,9 @@ Without denying their usefulness and importance, none of these archives provides
 We propose a storing-and-sharing method designed to address the challenges outlined above simultaneously. It can be noted that these problems are, in many respects, similar to those faced by researchers in neuroimaging, a field which has long been confronting the need for reproducible analyses on large datasets of potentially sensitive data \citep{Poldrack2014}.
 Their experience may, therefore, provide precious insight for linguists, psychologists, and developmental scientists engaging with the big-data approach of long-form recordings.
 For instance, in the context of neuroimaging, \citet{Gorgolewski2016} have argued in favor of ``machine-readable metadata'', standard file structures and metadata, as well as consistency tests. Similarly, \citet{Eglen2017} have recommended the application of formatting standards, version control, and continuous testing.\footnote{Note that these concepts are all used in the key archiving options we evoked: HomeBank, Databrary, and the Language Archive all have defined metadata and file structures. However, they are {\it different} standards, which cannot be translated to each other, and which have not considered all the features that are relevant for long-form recordings, such as having multiple layers of annotations, with some based on sparse sampling. Additionally, the use of dataset versioning, automated consistency tests, and analyses based on subsumed datasets are less widespread in the language acquisition community.} In the following, we will demonstrate how all of these practices have been implemented in our proposed design.
+Albeit designed for child-centered daylong recordings, we believe our solution could be replicated across a wider range of datasets with constraints similar to those outlined above.
 
-Albeit designed for child-centered daylong recordings, we believe our solution could be replicated across a wider range of datasets with constraints similar to those exposed above. Furthermore, our approach is flexible and leaves room for customization.
-
-This solution relies on four main components, each of which is conceptually separable from the others: i) a standardized data format optimized for child-centered long-form recordings; ii) ChildProject, a python package to perform basic operations on these datasets; iii) DataLad, ``a decentralized system for integrated discovery, management, and publication of digital objects of science'' \citep{hanke_defense_2021} iv) GIN, a live archiving option for storage and distribution. Our choice for each one of these components can be revisited based on the needs of a project and/or as other options appear. Table \ref{table:components} summarizes which of these components help address each of the challenges listed in Section \ref{section:problemspace}.
+This solution relies on four main components, each of which is conceptually separable from the others: i) a standardized data format optimized for child-centered long-form recordings; ii) ChildProject, a Python package to perform basic operations on these datasets; iii) DataLad, ``a decentralized system for integrated discovery, management, and publication of digital objects of science'' \citep{hanke_defense_2021}; and iv) GIN, a live archiving option for storage and distribution. Our choice for each one of these components can be revisited based on the needs of a project and/or as other options appear. Table \ref{table:components} summarizes which of these components helps address each of the challenges listed in Section \ref{section:problemspace}.
 
 \begin{table*}[ht]
 \centering
@@ -211,22 +210,22 @@ Reproducibility &
 \begin{figure}[ht]
     \centering
     \inputTikZ{0.8}{Fig2.tex}
-    \caption{\textbf{Structure of a dataset}. Metadata, recordings and annotations each belong to their own folder. Raw annotations (i.e., the audio files as they have been collected, before post-processing) are separated from their post-processed counterparts (in this case: standardized and vetted recordings. Similarly, raw annotations -- in this case, LENA's its annotations -- are set apart from the corresponding CSV version.}
+    \caption{\textbf{Structure of a dataset}. Metadata, recordings and annotations each belong to their own folder. Raw recordings (i.e., the audio files as they have been collected, before post-processing) are separated from their post-processed counterparts (in this case: standardized and vetted recordings). Similarly, raw annotations (in this case, LENA's .its annotations) are set apart from the corresponding CSV version.}
     \label{fig:tree}
 \end{figure}
 
-To begin with, we propose a set of proven standards which we use in our lab and which build on previous experience in several collaborative projects including ACLEW. It must be emphasized, however, that standards should be elaborated collaboratively by the community and that the following is merely a starting point.
+To begin with, we propose a set of proven standards which we use in the LAAC Team\footnote{\url{https://lscp.dec.ens.fr/en/research/teams-lscp/language-acquisition-across-cultures}} and which build on previous experience in several collaborative projects including ACLEW. It must be emphasized, however, that standards should be elaborated collaboratively by the community and that the following is merely a starting point.
 
 Data that are part of the same collection effort are bundled together within one folder\footnote{We believe a reasonable unit of bundling is the collection effort, for instance a single field trip, a full bout of data collection for a cross-sectional sample, or a set of recordings done more or less at the same time in a longitudinal sample. Given the possibilities of versioning, some users may decide they want to keep all data from a longitudinal sample in the same dataset, adding to it progressively over months and years, to avoid having duplicate children.csv files. That said, given DataLad's system of subdatasets (see Section \ref{section:datalad}), one can always define different datasets, each of which contains the recordings collected in subsequent time periods.}, preferably a DataLad dataset (see Section \ref{section:datalad}). Datasets are packaged according to the structure given in Fig. \ref{fig:tree}. The \path{metadata} folder contains at least three dataframes in CSV format: (i) \path{children.csv} contains information about the participants, such as their age or the language(s) they speak. (ii) \path{recordings.csv} contains the metadata for each recording, such as when the recording started, which device was used, or its relative path in the dataset. (iii) \path{annotations.csv} contains information about the annotations provided in the dataset, such as how they were produced and which time ranges they cover. The dataframes are standardized according to guidelines which set conventional names for the columns and the range of allowed values. The guidelines are enforced through tests, implemented in the ChildProject package introduced below, which report all the errors and inconsistencies in a dataset.
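
As a minimal, hypothetical illustration of these guidelines, the metadata dataframes could be created as follows (the column names reflect our reading of the guidelines and should be checked against the ChildProject documentation):

\begin{verbatim}
# Hypothetical metadata sketch; column names are assumptions
# based on the ChildProject guidelines and may differ.
from pathlib import Path
import pandas as pd

Path("metadata").mkdir(exist_ok=True)

children = pd.DataFrame([
    {"experiment": "myproject", "child_id": "C01",
     "child_dob": "2020-05-12"},
])
recordings = pd.DataFrame([
    {"experiment": "myproject", "child_id": "C01",
     "date_iso": "2021-03-01", "start_time": "07:30",
     "recording_device_type": "lena",
     "recording_filename": "rec001.wav"},
])

children.to_csv("metadata/children.csv", index=False)
recordings.to_csv("metadata/recordings.csv", index=False)
\end{verbatim}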
 
-The \path{recordings} folder contains two subfolders: \path{raw}, which stores the recordings as delivered by the experimenters, and \path{converted} which contains processed copies of the recordings. All the audio files in \path{recordings/raw} are indexed in the recordings dataframe. Thus, there is no need for naming conventions for the audio files themselves, and maintainers can decide how they want to organize them.
+The \path{recordings} folder contains two subfolders: \path{raw}, which stores the recordings as delivered by the experimenters, and \path{converted}, which contains processed copies of the recordings. All the audio files in \path{recordings/raw} are indexed in the recordings dataframe. Thus, there is no need for naming conventions for the audio files themselves, and maintainers can decide how they want to organize them.
 
-The \path{annotations} folder contains all sets of annotations. Each set itself consists of a folder containing two subfolders : i) \path{raw}, which stores the output of the annotation pipelines and ii) \path{converted}, which stores the annotations after being converted to a standardized CSV format and indexed into \path{metadata/annotations.csv}. A set of annotations can contain an unlimited amount of subsets, with any amount of recursions. For instance, a set of human-produced annotations could include one subset per annotator. Recursion facilitates the inheritance of access permissions, as explained in Section \ref{section:datalad}.
+The \path{annotations} folder contains all sets of annotations. Each set itself consists of a folder containing two subfolders: i) \path{raw}, which stores the output of the annotation pipelines and ii) \path{converted}, which stores the annotations after being converted to a standardized CSV format and indexed into \path{metadata/annotations.csv}. A set of annotations can contain an unlimited number of subsets, with any depth of recursion. For instance, a set of human-produced annotations could include one subset per annotator. Recursion facilitates the inheritance of access permissions, as explained in Section \ref{section:datalad}.
 
 
 \subsection{ChildProject}\label{section:childproject}
 
-The ChildProject package is a Python 3.6+ package that performs common operations on a dataset of child-centered recordings. It can be used from the command-line or by importing the modules from within Python. Assuming the target datasets are packaged according to the standards summarized in section \ref{sec:format}, the package supports the functions listed below.
+ChildProject is a Python 3.6+ package that performs common operations on a dataset of child-centered recordings. It can be used from the command line or by importing the modules from within Python. Assuming the target datasets are packaged according to the standards summarized in Section \ref{sec:format}, the package supports the functions listed below.
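
For instance, loading a dataset and running its consistency checks from within Python might look like the sketch below; module and method names follow the package documentation at the time of writing and may change across versions.

\begin{verbatim}
# Sketch: load a dataset and run the consistency checks.
# The dataset path is illustrative.
from ChildProject.projects import ChildProject

project = ChildProject("path/to/dataset")
errors, warnings = project.validate()
print(len(errors), "error(s),", len(warnings), "warning(s)")
\end{verbatim}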
 
 \subsubsection*{Listing errors and inconsistencies in a dataset}
 
@@ -236,13 +235,13 @@ We provide a validation script that returns a detailed reporting of all the erro
 
 The package converts input annotations to standardized, wide-table CSV dataframes. The columns in these wide-table formats have been determined based on previous work, and are largely specific to the goal of studying infants' language environment and production.
 
-Annotations are indexed into a unique CSV dataframe which stores their location in the dataset, the set of annotations they belong to, and the recording and time interval they cover. The index, therefore, allows an easy retrieval of all the annotations that cover any given segment of audio, regardless of their original format and the naming conventions that were used. The system interfaces well with extant annotation standards. Currently, ChildProject supports: LENA annotations in .its \citep{xu2008lenatm}; ELAN annotations following the ACLEW DAS template  \citep{Casillas2017,pympi-1.70}; the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}; the Linguistic Unit Count Estimator (ALICE) by \citet{rasanen2020}; and the VoCalisation Maturity Network (VCMNet) by \citet{AlFutaisi2019}. Users can also adapt routines for file types or conventions that vary. For instance, users can adapt the ELAN import developed for the ACLEW DAS template for their own template; and examples are also provided for Praat's .TextGrid files \citep{boersma2006praat}. The package also supports custom, user-defined conversion routines.
+Annotations are indexed into a unique CSV dataframe which stores their location in the dataset, the set of annotations they belong to, and the recording and time interval they cover. The index, therefore, allows easy retrieval of all the annotations that cover any given segment of audio, regardless of their original format and the naming conventions that were used. The system interfaces well with extant annotation standards. Currently, ChildProject supports: LENA annotations in .its \citep{xu2008lenatm}; ELAN annotations following the ACLEW DAS template (\citealt{Casillas2017}, imported using Pympi: \citealt{pympi-1.70}); as well as RTTM files output by ACLEW tools, namely the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}, the Linguistic Unit Count Estimator (ALICE) by \citet{rasanen2020}, and the VoCalisation Maturity Network (VCMNet) by \citet{AlFutaisi2019}. Users can also adapt routines for file types or conventions that vary. For instance, users can adapt the ELAN import developed for the ACLEW DAS template for their own template (e.g., \url{https://github.com/LAAC-LSCP/ChildProject/discussions/204}); and examples are also provided for Praat's .TextGrid files \citep{boersma2006praat}. The package also supports custom, user-defined conversion routines.
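
Because the index is itself a CSV dataframe, retrieving the annotations covering a given recording can be sketched with plain pandas (set, column and file names below are illustrative; the package provides dedicated helpers for this):

\begin{verbatim}
# Hypothetical query of the annotations index with plain pandas.
import pandas as pd

index = pd.read_csv("metadata/annotations.csv")
vtc = index[(index["set"] == "vtc") &
            (index["recording_filename"] == "rec001.wav")]
print(vtc[["annotation_filename", "range_onset", "range_offset"]])
\end{verbatim}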
 
-Relying on the annotations index, the package can also calculate the intersection of the portions of audio covered by several annotators and align their annotations. This is useful when annotations from different annotators need to be combined (in order to retain the majority choice for instance) or compared (e.g. for reliability evaluations).
+Relying on the annotations index, the package can also calculate the intersection of the portions of audio covered by several annotators and align their annotations. This is useful when annotations from different annotators need to be combined (for instance, to retain the majority choice) or compared (e.g., for reliability evaluations).
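
The logic of this coverage intersection can be sketched generically as follows; this illustration is ours, not the package's implementation:

\begin{verbatim}
# Clip every pair of annotated ranges (in milliseconds) to
# their overlap, keeping only non-empty intersections.
def intersect(ranges_a, ranges_b):
    out = []
    for a_on, a_off in ranges_a:
        for b_on, b_off in ranges_b:
            onset, offset = max(a_on, b_on), min(a_off, b_off)
            if onset < offset:
                out.append((onset, offset))
    return out

print(intersect([(0, 60000), (120000, 180000)],
                [(30000, 150000)]))
# -> [(30000, 60000), (120000, 150000)]
\end{verbatim}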
 
 \subsubsection*{Choosing audio samples of the recordings to be annotated}\label{section:choosing}
 
-As noted in the Introduction, recordings are too extensive to be manually annotated in their entirety. We and colleagues have typically annotated manually clips of 0.5-5 minutes in length, and the way these clips are extracted and annotated constitutes one of the ways in which there is divergent standards (as illustrated in Table \ref{table:datasets}).
+As noted in the Introduction, recordings are too extensive to be manually annotated in their entirety. We and colleagues have typically annotated clips of 0.5--5 minutes in length manually, and the way these clips are extracted and annotated varies (as illustrated in Table \ref{table:datasets}).
 
 The package allows the use of predefined or custom sampling algorithms. Samples' timestamps are exported to CSV dataframes. In order to keep track of the sample-generating process, input parameters are simultaneously saved into a YAML file. Predefined samplers include a periodic sampler, a sampler targeting specific speakers' vocalizations, a sampler targeting regions of high volubility according to input annotations, and a more agnostic sampler targeting high-energy regions. In all cases, the user can define the number of regions and their duration, as well as the context that may be inspected by human annotators. These options cover all documented sampling strategies.
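
To make the output of such a sampler concrete, a periodic sampler boils down to the following sketch (durations, file names and column names are illustrative):

\begin{verbatim}
# Illustrative periodic sampling: a 2-minute clip every hour
# of an 8-hour recording, exported as a CSV of timestamps.
import pandas as pd

duration, clip, period = 8 * 3600 * 1000, 120000, 3600000  # ms
samples = pd.DataFrame({
    "recording_filename": "rec001.wav",
    "segment_onset": range(0, duration, period),
})
samples["segment_offset"] = samples["segment_onset"] + clip
samples.to_csv("samples.csv", index=False)
\end{verbatim}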
 
@@ -256,21 +255,21 @@ The crowd-sourcing platform Zooniverse \citep{zooniverse} has been extensively e
 
 \subsubsection*{Audio processing}
 
-ChildProject allows the batch-conversion of the recordings to any target audio format (using ffmpeg \citealt{ffmpeg}).
+ChildProject allows the batch conversion of the recordings to any target audio format (using ffmpeg; \citealt{ffmpeg}).
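
Under the hood, such a batch conversion amounts to calls of the kind sketched below, assuming ffmpeg is on the PATH; the actual pipeline adds bookkeeping on top of this:

\begin{verbatim}
# Batch-convert raw recordings to 16 kHz mono WAV with ffmpeg.
import subprocess
from pathlib import Path

for wav in Path("recordings/raw").rglob("*.wav"):
    out = Path("recordings/converted/standard") / wav.name
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(wav),
                    "-ar", "16000", "-ac", "1", str(out)],
                   check=True)
\end{verbatim}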
 
-The package also implements a ``vetting" \citep{Cychosz2020} pipeline, which mutes segments of the recordings previously annotated by humans as confidential while preserving the duration of the audio files. After being processed, the recordings can safely be shared with other researchers or annotators.
+The package also implements a ``vetting'' \citep{vandam2018vetting,Cychosz2020} pipeline, which mutes segments of the recordings previously annotated by humans as confidential while preserving the duration of the audio files. After being processed, the recordings can safely be shared with other researchers or annotators.
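
A minimal sketch of the vetting operation, assuming the confidential segments have already been identified by annotators (an illustration, not the package's implementation):

\begin{verbatim}
# Mute flagged private segments while preserving duration.
import soundfile as sf

audio, sr = sf.read("rec001.wav")
private_ms = [(15000, 42000)]  # hypothetical onset/offset pairs

for onset, offset in private_ms:
    audio[int(onset * sr / 1000):int(offset * sr / 1000)] = 0.0

sf.write("rec001_vetted.wav", audio, sr)
\end{verbatim}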
 
-Another pipeline allows filtering and linear combinations of audio channels for multi-channel recordings such as those produced with the BabyLogger; if necessary, users can easily design custom audio converters suiting more specific needs.
+Another pipeline performs filtering and linear combinations of audio channels for multi-channel recordings such as those produced with the BabyLogger\footnote{\url{https://docs.babycloudlab.com/}}; if necessary, users can easily design custom audio converters suiting more specific needs.
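
A linear combination of channels reduces to a matrix product, as in the following sketch (the weights and the four-channel layout are purely illustrative):

\begin{verbatim}
# Combine the channels of a multi-channel recording into one
# track; assumes a four-channel file.
import numpy as np
import soundfile as sf

audio, sr = sf.read("multichannel.wav")  # (n_samples, n_channels)
weights = np.array([0.5, 0.5, -0.5, -0.5])
sf.write("combined.wav", audio @ weights, sr)
\end{verbatim}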
 
 \subsubsection*{Other functionalities}
 
 The package offers additional functions such as a pipeline that strips LENA's annotations of data that could be used to identify the participants, built upon previous code by \citet{eaf-anonymizer-original}.
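
Since .its files are XML, the gist of this anonymization can be sketched as follows; the attribute names are hypothetical and the actual pipeline is more thorough:

\begin{verbatim}
# Drop potentially identifying attributes from a LENA .its
# file (an XML format); attribute names are hypothetical.
import xml.etree.ElementTree as ET

tree = ET.parse("rec001.its")
for node in tree.iter():
    for attr in ("Birthdate", "DOB"):
        node.attrib.pop(attr, None)
tree.write("rec001_anonymized.its")
\end{verbatim}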
 
-Notably, the package facilitate the computation of a number of typical measures of annotations reliability and accuracy, as demonstrated in Section \ref{section:application}.
+Notably, the package facilitates the computation of a number of typical measures of annotation reliability and accuracy, as demonstrated in Section \ref{section:application}.
 
 \subsubsection*{User empowerment}
 
-The present effort is led by a research lab, and thus with personnel and funding that is not permanent. We therefore have done our best to provide information to help the community adopt and maintain this code in the future. Extensive documentation is provided on \url{https://childproject.readthedocs.io}, including detailed tutorials. The code is accessible on GitHub.com.
+The present effort is led by one research team, and thus relies on personnel and funding that are not permanent. We have therefore done our best to provide information to help the community adopt and maintain this code in the future. Extensive documentation is provided on \url{https://childproject.readthedocs.io}, including detailed tutorials. The code is accessible on GitHub.com.
 
 
 \subsection{DataLad}\label{section:datalad}
@@ -409,7 +408,7 @@ It should be noted that these measures are most useful in the absence of ground
 \centering
 \includegraphics[width=0.8\textwidth]{Fig4.pdf}
 
-\caption{\label{fig:precision}\textbf{Examples of diarization performance evaluation using recall, precision and F1 score}. Audio from the the public VanDam corpus \citep{vandam-day} is annotated according to who-speak-when, using both the LENA diarizer (its) and the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}. Speech segments are classified among four speaker types: the key child (CHI), other children (OCH), male adults (MAL) and female adults (FEM). For illustration purposes, fake annotations are generated from that of the VTC. Two are computed by randomly assigning the speaker type to 50\% and 75\% (conf) of the VTC's speech segments. Two are computed by dropping 50\% of speech segments from the VTC (drop). Recall, precision and F1 score are calculated for each of these annotations, by comparing them to the VTC. The worst F-score for the LENA is reached for OCH segments. Dropping segments does not alter precision; however, as expected, it has a substantially negative impact on recall.
+\caption{\label{fig:precision}\textbf{Examples of diarization performance evaluation using recall, precision and F1 score}. Audio from the public VanDam corpus \citep{vandam-day} is annotated according to who-speaks-when, using both the LENA diarizer (its) and the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}. Speech segments are classified among four speaker types: the key child (CHI), other children (OCH), male adults (MAL) and female adults (FEM). For illustration purposes, fake annotations are generated from those of the VTC. Two are computed by randomly reassigning the speaker type for 50\% or 75\% (conf) of the VTC's speech segments. Two are computed by dropping 50\% or 75\% of speech segments from the VTC (drop). Recall, precision and F1 score are calculated for each of these annotations, by comparing them to the VTC. The worst F-score for the LENA is reached for OCH segments. Dropping segments does not alter precision; however, as expected, it has a substantially negative impact on recall.
 }
 
 \end{figure*}
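
As a toy illustration of the framewise scores reported in the figure, consider a single speaker class (all values are made up):

\begin{verbatim}
# Framewise precision, recall and F1 for one speaker class.
import numpy as np

ref = np.zeros(1000, dtype=bool); ref[100:300] = True  # reference
hyp = np.zeros(1000, dtype=bool); hyp[150:350] = True  # hypothesis

tp = np.sum(ref & hyp)
precision, recall = tp / hyp.sum(), tp / ref.sum()
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
\end{verbatim}
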
@@ -435,7 +434,7 @@ Our proposed solution can be readily adapted to the first body of data: All that
 
 Generalizing our solution to the second body of data requires more adaptation. For such use cases, it would be ideal for the equipment to be left in the patients' homes, so that it can be used for instance one day a week or month. Additional work is needed to facilitate this, for instance by making the equipment easier to use and more robust, and by enabling charging and secure data transfer from such off-site locations.
 
-The third use case requires further adaptation, in addition to those just mentioned (making the sensors easy to use and allowing data transfer from potentially insecure home settings). In particular, multiple sensors' data need to be integrated together and time-stamped. We have made some progress in this sense in the context of the collection of multiple audio tracks collected with different physical devices (example on XXX), but have not yet developed structure and code to support the integration of pictures, videos, heart rate data, parental questionnaire data, etc. 
+The third use case requires further adaptation, in addition to those just mentioned (making the sensors easy to use and allowing data transfer from potentially insecure home settings). In particular, data from multiple sensors need to be integrated and time-stamped. We have made some progress in this sense for multiple audio tracks collected with different physical devices (example forthcoming), but have not yet developed structure and code to support the integration of pictures, videos, heart rate data, parental questionnaire data, etc.
 
 \section{Limitations}