Re-submission

Lucas Gautheron 2 years ago
parent
commit
5c529e2ae1
3 changed files with 47 additions and 41 deletions
  1. BIN
      Fig5a.jpg
  2. + 1 - 0
      Fig5a.jpg
  3. + 46 - 41
      main.tex

BIN
Fig5a.jpg


+ 1 - 0
Fig5a.jpg

@@ -0,0 +1 @@
+.git/annex/objects/1q/K2/MD5E-s41986--dcc61fd0b67dd6cd23ff408292dc567c.jpg/MD5E-s41986--dcc61fd0b67dd6cd23ff408292dc567c.jpg

+ 46 - 41
main.tex

@@ -60,7 +60,7 @@ Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d'Etud
 \maketitle
 
 \abstract{
-The technique of  long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and pathology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora  that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of daylong recordings of children. As part of this solution, we release \emph{ChildProject}, a python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system that allows the importation of annotations from a wide range of existing formats, as well as data validation procedures to assert the conformity of the data, or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to   populations other than children. 
+The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of long-form recordings of children. As part of this solution, we release \emph{ChildProject}, a Python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system, which allows the importation of annotations from a wide range of existing formats, as well as upon data validation procedures, which check the conformity of the data or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to populations other than children and beyond linguistics. 
 }
 
 \keywords{daylong recordings, speech data management, data distribution, annotation evaluation, inter-rater reliability, reproducible research}
@@ -68,41 +68,41 @@ The technique of  long-form recordings via wearables is gaining momentum in diff
 
 \section{Introduction}
 
-Long-form recordings are those collected over extended periods of time, typically via a wearable. Although the technique was used with normotypical adults decades ago \citep{ear1,ear2}, it became widespread in the study of early childhood over the last decade since the publication of a seminal white paper by the LENA Foundation \citep{gilkerson2008power}. The LENA Foundation created a hardware-software combination that illuminated the potential of this technique for theoretical and applied purposes (e.g., \citealt{christakis2009audible,warlaumont2014social}). Fig. \ref{fig:data} summarizes which data are typically found in corpora of day-long recordings used for child language acquisition studies, while Fig. \ref{fig:annotations} illustrates annotations drawn from a public corpus.
-More recently, long-form data is also being discussed in the context of neurological disorders (e.g., \citealt{riad2020vocal}). In this article, we define the unique space of difficulties surrounding long-form recordings, and introduce a set of packages that provides practical solutions, with a focus on child-centered recordings. Put briefly, we provide a solution that is compatible with a wide range of annotation and storage approaches through a package that builds on a common standard to integrate functionalities for data processing and continuous validation, and which is combined with extant solutions allowing collaborative work and striking a balance between privacy on the one hand, reproducibility, findability, and long-term archiving on the other.  We end by discussing ways in which these solutions could be generalized to other populations. \footnote{In order to demonstrate how our proposal could foster reproducible research on day-long recordings of children, we have released the source of the paper and the code used to build the figures which illustrate the capabilities of our python package in Section \ref{section:application}.}
+Long-form recordings are those collected over extended periods of time, typically via a wearable. Although the technique was used with normotypical adults decades ago \citep{ear1,ear2}, it became widespread in the study of early childhood over the last decade since the publication of a seminal white paper by the LENA Foundation \citep{gilkerson2008power}. The LENA Foundation created a hardware-software combination that illuminated the potential of this technique for theoretical and applied purposes (e.g., \citealt{christakis2009audible,warlaumont2014social}). Fig. \ref{fig:data} summarizes which data are typically found in corpora of long-form recordings used for child language acquisition studies, while Fig. \ref{fig:annotations} illustrates annotations drawn from a public corpus.
+More recently, long-form data is also being discussed in the context of neurological disorders (e.g., \citealt{riad2020vocal}). In this article, we define the unique space of difficulties surrounding long-form recordings, and introduce a set of packages that provides practical solutions, with a focus on child-centered recordings. Put briefly, we provide solutions that are compatible with a wide range of annotation and storage methods. These include a package that builds on existing standards to integrate functionalities for data processing and continuous validation, in addition to extant solutions allowing collaborative work and striking a balance between privacy on the one hand and reproducibility, findability, and long-term archiving on the other. We end by discussing ways in which these solutions could be generalized to other populations.\footnote{In order to demonstrate how our proposal could foster reproducible research on long-form recordings of children, we have released the source code of the paper as well as the code used to build the figures in Section \ref{section:application}.}
 
 \begin{figure}
     \centering
     \input{Fig1.tex}
-    \caption{\textbf{Data typically encountered in corpora of child-centered day-long recordings}. Media files (usually only audio recordings) are annotated by either humans or automated tools. Metadata often contain information about both the subject and his or her environment.}
+    \caption{\textbf{Data typically encountered in corpora of child-centered long-form recordings}. Media files (usually only audio recordings) are annotated by either humans or automated tools. Metadata often contain information about both the participant and their environment.}
     \label{fig:data}
 \end{figure}
     
 \begin{figure}
     \centering
     \includegraphics[width=0.8\linewidth]{Fig2.pdf}
-    \caption{\label{fig:annotations}\textbf{Example of annotations derived from \cite{vandam-day}}. Annotator 1 positioned and labelled segments according to who speaks when, using the ELAN software \citep{wittenburg2006elan}; Annotator 2 transcribed speech using CHAT \citep{MacWhinney2000}; The LENA software \citep{gilkerson2008power} performed voice activation detection, speaker classification and count estimation.}
+    \caption{\label{fig:annotations}\textbf{Example of annotations derived from \cite{vandam-day}}. Annotator 1 positioned and labelled segments according to who was speaking and when, using the ELAN software \citep{wittenburg2006elan}. Annotator 2 transcribed speech using CHAT \citep{MacWhinney2000}. The LENA software \citep{gilkerson2008power} performed voice activity detection, speaker classification, and count estimation.}
 \end{figure}
 
 \section{Problem space}\label{section:problemspace}
 
-Management of scientific data is a long-standing issue which has been the subject of substantial progress in the recent years. For instance, FAIR principles (Findability, Accessibility, Interoperability, and Reusability; see \citealt{Wilkinson2016}) have been proposed to help improve the usefulness of data and data analysis pipelines. Similarly, databases implementing these practices have emerged, such as Dataverse \citep{dataverse} and Zenodo \citep{zenodo}. Daylong recordings cannot be treated in precisely the same way \citep{Cychosz2020}, and therefore require specific solutions. Below, we list some of the challenges that researchers are likely to face while employing long-form recordings in naturalistic environments.
+Management of scientific data is a long-standing issue which has been the subject of substantial progress in recent years. For instance, FAIR principles (Findability, Accessibility, Interoperability, and Reusability; see \citealt{Wilkinson2016}) have been proposed to help improve the usefulness of data and data analysis pipelines. Similarly, databases implementing these practices have emerged, such as Dataverse \citep{dataverse} and Zenodo \citep{zenodo}. Long-form recordings cannot be treated in precisely the same way \citep{Cychosz2020}, and therefore require specific solutions. Below, we list some of the challenges that researchers are likely to face while employing long-form recordings in naturalistic environments.
 
 \subsubsection*{The need for standards}
 
-Extant datasets rely on a wide variety of metadata structures, file formats, and naming conventions. For instance, some data from long-form recordings have been archived publicly on Databrary (such as the ACLEW starter set \citep{starter}) and HomeBank (including the VanDam Daylong corpus from \citealt{vandam-day}). Table \ref{table:datasets} shows some divergence across the two, which is simply the result of researchers working in parallel. As a result of this divergence, however, each lab finds itself re-inventing the wheel. For instance, the HomeBankCode organization \footnote{\url{https://github.com/homebankcode/}} contains at least 4 packages that do more or less the same operations, such as aggregating how much speech was produced in each recording, but implemented in different languages (MatLab,  R, perl, and Python). This divergence may also hide different operationalizations, rendering comparisons across labs fraught, effectively diminishing replicability.\footnote{\textit{Replicability} is typically defined as the effort to re-do a study with a new sample, whereas \textit{reproducibility} relates to re-doing the exact same analyses with the exact same data. Reproducibility is addressed in another section.} The variety of annotation formats (as illustrated in Fig. \ref{fig:annotations} for instance) has also led to duplication of efforts, as very similar tasks were implemented for one specific format and then later re-developed for another format.
+Extant datasets rely on a wide variety of metadata structures, file formats, and naming conventions. For instance, some data from long-form recordings have been archived publicly on Databrary (such as the ACLEW starter set, \citealt{starter}) and HomeBank (including the VanDam Daylong corpus from \citealt{vandam-day}). Table \ref{table:datasets} shows some divergence across the two, which is simply the result of researchers working in parallel. As a result of this divergence, however, each lab finds itself re-inventing the wheel. For instance, the HomeBankCode organization\footnote{\url{https://github.com/homebankcode/}} contains at least four packages that perform more or less the same operations, such as aggregating how much speech was produced in each recording, but implemented in different languages (MATLAB, R, Perl, and Python). This divergence may also hide different operationalizations, rendering comparisons across labs fraught, effectively diminishing replicability.\footnote{\textit{Replicability} is typically defined as the effort to re-do a study with a new sample, whereas \textit{reproducibility} relates to re-doing the exact same analyses with the exact same data. Reproducibility is addressed in another section.} The variety of annotation formats (as illustrated in Fig. \ref{fig:annotations} for instance) has also led to duplication of efforts, as very similar tasks were implemented for one specific format and then later re-developed for another format.
 
-Designing pipelines and analyses that are consistent across datasets requires standards for how the datasets are structured. Although this may represent an initial investment, such standards facilitate the pooling of research efforts, by allowing labs to benefit from code developed in other labs. Additionally, this field operates increasingly via collaborative cross-lab efforts. For instance, the ACLEW project\footnote{\url{https://sites.google.com/view/aclewdid/home}} involved nine principal investigators (PIs) from five different countries, who needed a substantive initial investment to agree on a standard organization for the six corpora used in the project. We expect even larger collaborations to emerge in the future, a move that would benefit from standardization, as exemplified by the community that emerged around CHILDES for short-form recordings \citep{macwhinney2000childes}. We show how building on the standards described in Section \ref{sec:format} allows our proposed python package to accomplish a wide variety of tasks summarized in Section \ref{section:childproject}.
+Designing pipelines and analyses that are consistent across datasets requires standards for how the datasets are structured. Although this may represent an initial investment, such standards facilitate the pooling of research efforts, by allowing labs to benefit from code developed in other labs. Additionally, this field operates increasingly via collaborative cross-lab efforts. For instance, the ACLEW project\footnote{\url{https://sites.google.com/view/aclewdid/home}} involved nine principal investigators (PIs) from five different countries, who needed a substantial initial investment to agree on a standard organization for the six corpora used in the project \citep{soderstrom2021developing}. We expect even larger collaborations to emerge in the future, a move that would benefit from standardization, as exemplified by the community that emerged around CHILDES for short-form recordings \citep{macwhinney2000childes}. We show how building on the standards described in Section \ref{sec:format} allows our proposed Python package to accomplish a wide variety of tasks summarized in Section \ref{section:childproject}.
 
 \begin{table}
 \centering
 \begin{tabular}{@{}lll@{}}
 \toprule
-                        & ACLEW starter  & Van Dam \\ \midrule
-\begin{tabular}[t]{@{}l@{}}Audio's scope\end{tabular}             & 5-minute clips & Full day       \\
+                        & ACLEW starter  & VanDam \\ \midrule
+\begin{tabular}[t]{@{}l@{}}Audios' scope\end{tabular}             & 5-minute clips & Full day       \\
 \begin{tabular}[t]{@{}l@{}}Automated annotations'\\format\end{tabular}    & none         & LENA           \\
-\begin{tabular}[t]{@{}l@{}}Automated annotations'\\format\end{tabular} & .eaf           & .cha           \\
-Annotations' scope        & only clips     & Full day       \\
+\begin{tabular}[t]{@{}l@{}}Human annotations'\\format\end{tabular} & .eaf           & .cha           \\
+Human annotations' scope        & only clips     & Full day       \\
 Metadata                & none           & excel \\ \bottomrule
 \end{tabular}
 \caption{\textbf{Divergences between the \cite{starter} and \cite{vandam-day} datasets}. Audios' scope indicates the size of the audio that has been archived: all recordings last for a full day, but for ACLEW starter, three five-minute clips were selected from each child. The automated annotations' format indicates which software was used to annotate the audio automatically. Human annotations' scope shows the scope of human annotation. Metadata indicates whether information about the children and recordings was shared, and in what format.}
@@ -112,11 +112,19 @@ Metadata                & none           & excel \\ \bottomrule
 
 \subsubsection*{Keeping up with updates and contributions}
 
-Datasets are not frozen. Rather, they are continuously enriched through annotations provided by humans or new algorithms. Human annotations may also undergo corrections as errors are discovered. The process of collecting the recordings may also require a certain amount of time, as they are progressively returned by the field workers or the participants themselves. In the case of longitudinal studies, supplementary audio data may accumulate over several years. Researchers should be able to keep track of these changes while also upgrading their analyses. Moreover, several collaborators may be brought to contribute work to the same dataset simultaneously. To take the example of ACLEW, PIs first annotated a random selection of 2-minute clips for 10 children in-house. They then exchanged some of these audio clips so that the annotators in another lab could re-annotate the same data, for the purposes of inter-rater reliability. This revealed divergences in definitions, and all datasets needed to be revised. Finally, a second sample of 2-minute clips with high levels of speech activity were annotated, and another process of reliability was performed. We suggest to solve these problems through the use of DataLad \citep{datalad_paper}, an extension of git and one of the components of our proposed design, as explained in Section \ref{section:datalad}.
+Datasets are not frozen. Rather, they are continuously enriched through annotations provided by humans or new algorithms. Human annotations may also undergo corrections as errors are discovered. The process of collecting the recordings may also require a certain amount of time, as they are progressively returned by the field workers or the participants themselves. In the case of longitudinal studies, supplementary audio data may accumulate over several years. Researchers should be able to keep track of these changes while also upgrading their analyses. Moreover, several collaborators may contribute work to the same dataset simultaneously. To take the example of ACLEW, PIs first annotated a random selection of 2-minute clips for 10 children in-house. They then exchanged some of these audio clips so that the annotators in another lab could re-annotate the same data, for the purposes of inter-rater reliability. This revealed divergences in definitions, and all datasets needed to be revised. Finally, a second sample of 2-minute clips with high levels of speech activity was annotated, and another round of reliability assessment was performed. We propose solving these problems using DataLad \citep{datalad_paper}, an extension of git and one of the components of our proposed design, as explained in Section \ref{section:datalad}.
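+To make this concrete, below is a minimal sketch of one such contribution cycle using DataLad's Python API; the dataset URL, paths, and commit message are hypothetical:
+
+\begin{verbatim}
+import datalad.api as dl
+
+# Obtain a lightweight clone of the dataset (hypothetical URL)
+ds = dl.clone(source="git@gin.g-node.org:/FooLab/daylong-corpus.git",
+              path="daylong-corpus")
+
+# Fetch only the files needed for the task at hand
+ds.get("annotations/eaf/")
+
+# ... annotators revise the .eaf files ...
+
+# Record the new version and publish it to collaborators
+ds.save(message="Revise segments after reliability check")
+ds.push(to="origin")
+\end{verbatim}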
 
-\subsubsection*{Delivering large amounts of data}
+\subsubsection*{Performance and infrastructure}
+
+Corpora of long-form recordings include large amounts of data, which raises a new set of challenges related to performance. 
+Considering typical values for the bit depth and sampling rate of the recordings -- 16 bits and 16 kilohertz, respectively -- yields approximately three gigabytes per day of audio. Although there is a great deal of variation, past studies often involved at least 30 recording days (e.g., three days for each of ten children). The trend, however, is for datasets to be larger; for instance, last year, we collaborated on the collection of a single dataset, in which 200 children each contributed two recordings \citep{solomon-data}. Such datasets may exceed one terabyte. Moreover, these recordings can be associated with annotations spread across thousands of files. In the ACLEW example discussed above, there was one .eaf file per human annotator for each of four types of annotation (i.e., random, high speech, random reliability, high speech reliability). In addition, the full day was analyzed with between one and four automated routines. Thus, for each recording day there were 8 annotation files, leading to 6 corpora $\times$ 10 children $\times \ (4+4)$ annotations = 480 annotation files. Other researchers will use one annotation file per clip selected for annotation, which quickly adds up to thousands of files\footnote{For most ongoing research projects of which we are aware, there is no central annotation system; instead, annotators work in parallel on separate files. Some researchers may prefer to have the ``final'' version of the annotations in a merged format that represents the ``current best guess''. For transparency and clarity, however, such merged formats will emerge at a secondary stage, with a first stage represented by independent files including information about the independent listeners' judgments. Our package provides a solution that considers the current practice of working in parallel, but will adapt easily to alternative habits based on merged or collaborative formats.}.
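+The arithmetic behind these estimates is straightforward; as a quick sanity check (in Python):
+
+\begin{verbatim}
+bit_depth = 16        # bits per sample
+sample_rate = 16000   # samples per second
+seconds_per_day = 24 * 3600
+
+bytes_per_day = bit_depth / 8 * sample_rate * seconds_per_day
+print(bytes_per_day / 1e9)         # ~2.76 GB per recorded day
+
+# 200 children contributing two day-long recordings each:
+print(400 * bytes_per_day / 1e12)  # ~1.11 TB
+\end{verbatim}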
+
+Under such constraints, the storage, data delivery, and data processing infrastructure plays a critical role. However, researchers' own infrastructure may be insufficient -- for instance, they may lack storage capacity or GPU time -- and they may resort to outsourcing, e.g., to cloud providers. On the other hand, researchers who have access to their own cluster may favor that option over costly third-party services.
+Ideally, a proper data management design should behave consistently regardless of the particular choice of infrastructure, provided the hardware meets some minimum standards. Users should not need to learn new techniques, procedures, and software for each dataset they want to access.
+
+
+Our proposal addresses the matter of data access performance by using DataLad (see Section \ref{section:datalad}), which supports a wide range of storage providers and is specifically designed to handle large files. Our Python package (Section \ref{section:childproject}) achieves scalability by parallelising most data processing operations.
 
-Considering typical values for the bit depth and sampling rates of the recordings -- 16 bits and 16 kilohertz respectively -- yields a throughput of approximately three gigabytes per day of audio. Although there is a great deal of variation, past studies often involved at least 30 recording days (e.g., three days for each of ten children). The trend, however, is for datasets to be larger; for instance, last year, we collaborated in the collection of a single dataset, in which 200 children each contributed two recordings. Such datasets may exceed one terabyte. Moreover, these recordings can be associated with annotations spread across thousands of files. In the ACLEW example discussed above, there was one .eaf file per human annotator per each of four types of annotation (i.e., random, high speech, random reliability, high speech reliability). In addition, the full day was analyzed with between one and four automated routines. Thus, for each recording day there were 8 annotation files, leading to 6 corpora $\times$ 10 children $\times \ (4+4)$  annotations = 480 annotation files. Other researchers will use one annotation file per clip selected for annotation, which quickly adds up to thousands of files\footnote{For most ongoing research projects we know of, there is no central annotation system and instead annotators work in parallel on separate files. Some researchers may prefer to have the "final" version of the annotations in a merged format that represents the "current best guess". For transparency and clarity, however, such merged formats will emerge at a secondary stage, with a first stage represented by independent files including information about the independent listeners' judgments. Our package provides a solution that considers the current practice of working in parallel, but will adapt easily to alternative habits based on merged or collaborative formats.}. Even a small processing latency may result in significant overheads while gathering so many files. As a result of these constraints, data access performance is a key aspect of the management of daylong recordings corpora. Our proposal addresses this matter by using DataLad (see Section \ref{section:datalad}), which is specifically designed to handle large files.
 
 
 \subsubsection*{Privacy}
@@ -129,15 +137,15 @@ Therefore, the ideal storing-and-sharing strategy should naturally enforce secur
 
 \subsubsection*{Long-term availability}
 
-The collection of long-form recordings requires a considerable level of investment to explain the technique to families and communities, to ensure a secure data management system, and, in the case of remote populations, to access the site. In our experience, we have spent up to 15 thousand USD to complete one round of data collection, including the cost of travel.\footnote{This grossly underestimates overall costs, because the best way to do any kind of field research is through maintaining strong bonds with the community and helping them in other ways throughout the year, not only during our visits (read more about ethical fieldwork on \citealt{broesch2020navigating}). A successful example for this is that of the UNM-UCSB Tsimane' Project (\url{http://tsimane.anth.ucsb.edu/}), which has been collaborating with the Tsimane' population since 2001. They are currently funded by a 5-year, 3-million US\$ NIH grant \url{https://reporter.nih.gov/project-details/9538306}. } These data are precious not only because of the investment that has gone into them, but also because they capture slices of life at a given point in time, which is particularly informative in the case of populations that are experiencing market integration or other forms of societal change -- which today is most or all populations. Moreover, some communities who are collaborating in such research speak languages that are minority languages in the local context, and thus at a potential risk for being lost in the future. The conservation of naturalistic speech samples of children's language acquisition throughout a normal day could be precious for fueling future efforts of language revitalization \citep{Nee2021}. It would therefore be particularly damaging to lose such data prematurely, from  financial,  scientific, and  human standpoints.
+The collection of long-form recordings requires a considerable level of investment to explain the technique to families and communities, to ensure a secure data management system, and, in the case of remote populations, to access the site. We have spent about 15,000 US\$ on a single round of data collection.\footnote{This grossly underestimates overall costs, because the best way to do any kind of field research is through maintaining strong bonds with the community and helping them in other ways throughout the year, not only during our visits (read more about ethical fieldwork in \citealt{broesch2020navigating}). A successful example of this is the UNM-UCSB Tsimane' Project (\url{http://tsimane.anth.ucsb.edu/}), which has been collaborating with the Tsimane' population since 2001. They are currently funded by a 5-year, 3-million US\$ NIH grant (\url{https://reporter.nih.gov/project-details/9538306}).} These data are precious not only because of the investment that has gone into them, but also because they capture slices of life at a given point in time, which is particularly informative in the case of populations that are experiencing market integration or other forms of societal change -- which today is most or all populations. Moreover, some communities who are collaborating in such research speak languages that are minority languages in the local context, and thus at a potential risk for being lost in the future. The conservation of naturalistic speech samples of children's language acquisition throughout a normal day could be precious for fueling future efforts of language revitalization \citep{Nee2021}. It would therefore be particularly damaging to lose such data prematurely, from financial, scientific, and human standpoints.
 
-In addition, one advantage of daylong recordings over other observational methods such as parental reports is that they can be re-analyzed at later times to observe behaviors that had not been foreseen at the time of data collection. This implies that their interest partly lies in long-term re-usability.
+In addition, one advantage of long-form recordings over other observational methods such as parental reports is that they can be re-analyzed at later times to observe behaviors that had not been foreseen at the time of data collection. This implies that their value partly lies in long-term re-usability.
 
-Moreover, even state-of-the-art speech processing tools still perform poorly on daylong recordings, due to their intrinsic noisy nature \citep{casillas2019step}. As a result, taking full advantage of present data will necessitate new or improved computational models, which may take years to develop. For example, the DIHARD Challenge series has been running for three consecutive years, and documents the difficulty of making headway with complex audio data \citep{ryant2018first,ryant2019second,ryant2020third}. For instance, the best submission for speaker diarization in their meeting subcorpus achieved about 35\% Diarization Error Rate in 2018 and 2019, with improvements seen only in 2020, when the best system scored a 20\% Diarization Error Rate (Neville Ryant, personal communication, 2021-04-09). Other tasks are progressing much more slowly. For instance, the best performance in a classifier for deciding whether adult speech was addressed to the child or to an adult scored about 70\% correct in 2017 \citep{schuller2017interspeech} -- but nobody has been able to beat this record since. Recordings should therefore remain available for long periods of time -- potentially decades --, thus increasing the risk for data loss to occur at some point in their lifespan. For these reasons, the reliability of the storage design is critical, and redundancy is most certainly required. Likewise, persistent URLs may be needed in order to ensure the long-term accessibility of the datasets. These are key features of our proposal, as argued in sections \ref{section:datalad} and \ref{section:gin}.
+Moreover, even state-of-the-art speech processing tools still perform poorly on long-form recordings, due to their intrinsically noisy nature \citep{casillas2019step}. As a result, taking full advantage of present data will necessitate new or improved computational models, which may take years to develop. For example, the DIHARD Challenge series has been running for three consecutive years, and documents the difficulty of making headway with complex audio data \citep{ryant2018first,ryant2019second,ryant2020third}. For instance, the best submission for speaker diarization in their meeting subcorpus achieved about 35\% Diarization Error Rate in 2018 and 2019, with improvements seen only in 2020, when the best system scored a 20\% Diarization Error Rate (Neville Ryant, personal communication, 2021-04-09). Other tasks are progressing much more slowly. For instance, the best classifier for deciding whether adult speech was addressed to the child or to an adult scored about 70\% correct in 2017 \citep{schuller2017interspeech} -- but nobody has been able to beat this record since. Recordings should therefore remain available for long periods of time -- potentially decades -- thus increasing the risk of data loss at some point in their lifespan. For these reasons, the reliability of the storage design is critical, and redundancy is most certainly required. Likewise, persistent URLs may be needed in order to ensure the long-term accessibility of the datasets. These are key features of our proposal, as argued in Sections \ref{section:datalad} and \ref{section:gin}.
 
 \subsubsection*{Findability}
 
-FAIR Principles include findability and accessibility. A crucial aspect of findability of datasets involves their being indexed in ways that potential re-users can discover them.  We would like to emphasize that findability of daylong recordings, especially those from under-represented populations, is of a peculiar importance. Indeed, although one of the many strengths of such recordings is that they can theoretically be sampled from any environment outside the lab, current corpora are still heavily biased in favor of WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations: \cite{cychosz2021using} report that 81\% of samples (and 82\% of first authors) in papers included in systematic reviews of daylong recordings come from North America (with a further 12\% of samples, and 14\% of authors, based in Europe; see Figure 2). Not only more data should be collected from more diverse populations, but they should also be at least equally as findable and accessible in order to overcome the current representativeness bias. We would also like to stress again that the needs for Privacy and for Findability/Accessibility are not mutually exclusive. Although some of the data are of course sensitive, some of them could be made available to a broad audience without any harm to the privacy of the participants -- for instance, annotations that contain no transcription and parts of the metadata -- as discussed in the \textit{Privacy} section above.
+FAIR Principles include findability and accessibility. A crucial aspect of findability of datasets involves their being indexed in ways that potential re-users can discover them. We would like to emphasize that findability of long-form recordings, especially those from under-represented populations, is of particular importance. Indeed, although one of the many strengths of such recordings is that they can theoretically be sampled from any environment outside the lab, current corpora are still heavily biased in favor of WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations: \cite{cychosz2021using} report that 81\% of samples (and 82\% of first authors) in papers included in systematic reviews of long-form recordings come from North America (with a further 12\% of samples, and 14\% of authors, based in Europe; see Figure 2). Not only should more data be collected from more diverse populations, but such data should also be at least as findable and accessible, in order to overcome the current representativeness bias. We would also like to stress again that the needs for Privacy and for Findability/Accessibility are not mutually exclusive. Although some of the data are of course sensitive, some of them could be made available to a broad audience without any harm to the privacy of the participants -- for instance, annotations that contain no transcription and parts of the metadata -- as discussed in the \textit{Privacy} section above.
 
 Although we elaborate on it below, we want to already highlight HomeBank (\url{homebank.talkbank.org}; part of TalkBank, a recognized CLARIN Knowledge Centre) as an archiving option specific to long-form recordings, which makes any corpora hosted there easily discoverable by other researchers using the technique. Also of relevance is Databrary (\url{databrary.org}), an archive specialized in child development, which can thus make the data visible to the developmental science community. However, the current standard practice is archiving data in either one or another of these repositories, despite the fact that if a copy of the corpus were visible from both of these archives, the dataset would be more easily discovered. Additionally, it is uncertain whether these highly re-usable long-form recordings are visible to researchers who are more broadly interested in spoken corpora and/or naturalistic human behavior and/or other topics that could be studied in such data. In fact, one can conceive of a future in which the technique is used with people of different ages, in which case a system that allows users to discover other datasets based on relevant metadata would be ideal. For some research purposes (e.g., trying to separate overlapping voices and noise, technically referred to as ``source separation''), any recording may be useful, whereas for others (neurodegenerative disorders, early language acquisition) only some ages would be. In any case, options exist to allow accessibility once a dataset is archived in one of those databases. We show how our proposed solution can be used to improve the findability of datasets in Sections \ref{section:datalad} and \ref{section:gin}.
 
@@ -147,9 +155,9 @@ Independent verification of results by a third party can be facilitated by impro
 
 \subsubsection*{Current archiving options}
 
-The field of child-centered long-form recordings has benefited from a purpose-built scientific archive from an early stage. HomeBank \cite{vandam2016homebank} builds on the same architecture as CHILDES \cite{MacWhinney2000} and other TalkBank corpora. Although this architecture served the purposes of the language-oriented community well for short recordings, there are numerous issues when using it for long-form recordings. To begin with, curators do not directly control their datasets' contents and structures, and if a curator wants to make a modification, they need to ask the HomeBank management team to make it for them. Similarly, other collaborators who spot errors cannot correct them directly, but again must request changes be made by the HomeBank management team.  Only one type of annotation is innately managed, and that is CHAT \cite{MacWhinney2000}, which is ideal for transcriptions of  recordings. However, transcriptions are of a lesser interest in child-centered daylong recordings because the amounts of audio they generate are such that humans would not be able to transcribe them to their full extent, and automatic transcription of such recordings -- which are very noisy -- is out of the reach of present models.
+The field of child-centered long-form recordings has benefited from a purpose-built scientific archive from an early stage. HomeBank \cite{vandam2016homebank} builds on the same architecture as CHILDES \cite{MacWhinney2000} and other TalkBank corpora. Although this architecture served the purposes of the language-oriented community well for short recordings, there are numerous issues when using it for long-form recordings. To begin with, curators do not directly control their datasets' contents and structures, and if a curator wants to make a modification, they need to ask the HomeBank management team to make it for them. Similarly, other collaborators who spot errors cannot correct them directly, but again must request changes be made by the HomeBank management team. Only one type of annotation is natively supported, namely CHAT \cite{MacWhinney2000}, which is ideal for transcriptions of recordings. However, transcriptions are of lesser interest for child-centered long-form recordings, because the amount of audio they generate is such that humans could not transcribe them in full, and automatic transcription of such recordings -- which are very noisy -- is out of the reach of present models.
 
-As briefly noted above, Databrary \url{databrary.org} also already hosts some long-form recording data. The aforementioned ACLEW project actually committed to archiving data there, rather than on HomeBank, because it allowed direct control and update (without needing to ask the HomeBank management).  As re-users, one of the most useful features of Databrary is the possibility to search the full archive for data pertaining to children of specific ages or origins. Using this archiving option led us to realize there were some limitations, including the fact that there is no API system, meaning that all updates need to be done via a graphical browser-based interface.
+As briefly noted above, Databrary (\url{databrary.org}) also already hosts some long-form recording data. The aforementioned ACLEW project actually committed to archiving data there, rather than on HomeBank, because it allowed direct control and update (without needing to ask the HomeBank management). As re-users, we found one of the most useful features of Databrary to be the possibility of searching the full archive for data pertaining to children of specific ages or origins. Using this archiving option led us to realize there were some limitations, including the fact that there is no Application Programming Interface (API) to retrieve the data programmatically at the time of writing, meaning that all updates need to be done via a graphical browser-based interface.
 
 Additional options have been considered by researchers in the community, including OSF\footnote{\url{osf.io}} and the Language Archive\footnote{\url{https://archive.mpi.nl/tla/}, which holds a CLARIN certificate B}. Detailing all their features is beyond the scope of the present paper, but some discussion can be found in \cite{casillas2019step}. Briefly, as to why we think they are insufficient solutions: OSF provides very limited storage capacity and requires no structure or metadata, and thus solves neither the storage problem nor the standards problem. As for the Language Archive, it does not currently have an API allowing updates of the data, nor automatic tests for its continued integrity.
 
@@ -160,9 +168,9 @@ Without denying their usefulness and importance, none of these archives provides
 We propose a storing-and-sharing method designed to address the challenges outlined above simultaneously. It can be noted that these problems are, in many respects, similar to those faced by researchers in neuroimaging, a field which has long been confronting the need for reproducible analyses on large datasets of potentially sensitive data \citep{Poldrack2014}.
 Their experience may, therefore, provide precious insight for linguists, psychologists, and developmental scientists engaging with the big-data approach of long-form recordings.
 For instance, in the context of neuroimaging, \citet{Gorgolewski2016} have argued in favor of ``machine-readable metadata'', standard file structures and metadata, as well as consistency tests. Similarly, \citet{Eglen2017} have recommended the application of formatting standards, version control, and continuous testing. Before moving on, we would like to note that these concepts are all used in the key archiving options mentioned above: HomeBank, Databrary, and the Language Archive all have defined metadata and file structures. However, they are {\it different} standards, which cannot be translated to each other, and which have not considered all the features that are relevant for long-form recordings, such as having multiple layers of annotations, with some based on sparse sampling. Additionally, the use of dataset versioning, automated consistency tests, and analyses based on subsumed datasets are less widespread in the language acquisition community. In the following, we will demonstrate how these practices have been implemented in our proposed design.
-Albeit designed for child-centered daylong recordings, we believe our solution could be replicated across a wider range of datasets with constraints similar to those exposed above.
+Although our solution was designed for child-centered long-form recordings, we believe it could be replicated across a wider range of datasets subject to constraints similar to those outlined above.
 
-This solution relies on four main components, each of which is conceptually separable from the others: i) a standardized data format optimized for child-centered long-form recordings; ii) ChildProject, a python package to perform basic operations on these datasets; iii) DataLad, ``a decentralized system for integrated discovery, management, and publication of digital objects of science'' \citep{hanke_defense_2021,datalad_paper} iv) GIN, a live archiving option for storage and distribution. Our choice for each one of these components can be revisited based on the needs of a project and/or as other options appear. Table \ref{table:components} summarizes which of these components helps address each of the challenges listed in Section \ref{section:problemspace}.
+This solution relies on four main components, each of which is conceptually separable from the others: i) a standardized data format optimized for child-centered long-form recordings; ii) ChildProject, a Python package to perform basic operations on these datasets; iii) DataLad, ``a decentralized system for integrated discovery, management, and publication of digital objects of science'' \citep{hanke_defense_2021,datalad_paper}; and iv) GIN, a live archiving option for storage and distribution. Our choice for each one of these components can be revisited based on the needs of a project and/or as other options appear. Table \ref{table:components} summarizes which of these components helps address each of the challenges listed in Section \ref{section:problemspace}.
 
 \begin{table*}[ht]
 \centering
@@ -181,9 +189,9 @@ The need for standards &
   \begin{tabular}[t]{@{}l@{}}version control\\(git)\end{tabular} &
   git repository host
    \\ \midrule
-\begin{tabular}[t]{@{}l@{}}Delivering large amounts\\ of data\end{tabular} &
+\begin{tabular}[t]{@{}l@{}}Performance\end{tabular} &
    parallelised processing &
-  git-annex &
+  \begin{tabular}[t]{@{}l@{}}git-annex\\ (supports large files,\\ parallel download)\\\end{tabular} &
   \begin{tabular}[t]{@{}l@{}}git-annex compatible;\\ high storage capacity;\\ parallelised operations\end{tabular}
    \\ \midrule
 Ensuring privacy & \begin{tabular}[t]{@{}l@{}}Optional metadata\\detection;\end{tabular}
@@ -216,7 +224,7 @@ Reproducibility &
   \begin{tabular}[t]{@{}l@{}}run/rerun/container-run\\ functions\end{tabular} &
 
 \end{tabular}
-\caption{\textbf{\label{table:components}Contributions of each component of our proposed design in resolving the difficulties caused by daylong recordings} and laid out in Section \ref{section:problemspace}. ChildProject is a python package designed to perform recurrent tasks on the datasets; DataLad is a python package for the management of large, version-controlled datasets; GIN is a hosting provider dedicated to scientific data.}
+\caption{\textbf{\label{table:components}Contributions of each component of our proposed design in resolving the difficulties caused by long-form recordings} and laid out in Section \ref{section:problemspace}. ChildProject is a Python package designed to perform recurrent tasks on the datasets; DataLad is a Python package for the management of large, version-controlled datasets; GIN is a hosting provider dedicated to scientific data.}
 \end{table*}
 
 \section{Proposed solution}
@@ -244,7 +252,7 @@ The \path{annotations} folder contains all sets of annotations. Each set itself
 
 \subsection{ChildProject}\label{section:childproject}
 
-The ChildProject package is a Python 3.6+ package that performs common operations on a dataset of child-centered recordings. It can be used from the command-line or by importing the modules from within Python. It should be noted that the Python API stores metadata and annotations as Pandas dataframes \citep{pandas-software,pandas-paper}. As a result of relying on such a widely used scientific library, it is not necessary to learn new data types in order to use this package. Moreover, most operations are thus naturally vectorized, which contributes to better performance.
+ChildProject is a Python 3.6+ package that performs common operations on a dataset of child-centered recordings. It can be used from the command line or by importing its modules from within Python. It should be noted that the Python API stores metadata and annotations as Pandas dataframes \citep{Pandas-software,Pandas-paper}. As a result of relying on such a widely used scientific library, it is not necessary to learn new data types in order to use this package. Moreover, most operations are thus naturally vectorized, which contributes to better performance.
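+For instance, loading a dataset and accessing its metadata takes only a few lines (a minimal sketch; the dataset path is hypothetical):
+
+\begin{verbatim}
+from ChildProject.projects import ChildProject
+
+project = ChildProject("/path/to/dataset")
+project.read()  # parse and check the metadata
+
+# children and recordings metadata, as Pandas dataframes
+print(project.children)
+print(project.recordings)
+\end{verbatim}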
 
 Assuming the target datasets are packaged according to the standards summarized in Section \ref{sec:format}, the package supports the functions listed below.
 
@@ -258,7 +266,7 @@ Fig. \ref{fig:annotations} shows that, whatever their format, annotations are al
 
 Taking advantage of this, the package converts input annotations to standardized, wide-table CSV dataframes\footnote{\url{https://childproject.readthedocs.io/en/paper/annotations.html}}. The columns in these wide-table formats have been determined based on previous work, and are largely specific to the goal of studying infants' language environment and production. However, users can introduce custom columns if required.
 
-Annotations are indexed into a unique CSV dataframe which stores their location in the dataset, the set of annotations they belong to, and the recording and time interval they cover. The index, therefore, allows an easy retrieval of all the annotations that cover any given segment of audio, regardless of their original format and the naming conventions that were used. The system interfaces well with extant annotation standards. Currently, ChildProject supports: LENA annotations in .its \citep{xu2008lenatm}; ELAN annotations following the ACLEW DAS template  (\citealt{Casillas2017}, imported using Pympi: \citealt{pympi-1.70}); CHAT annotations \citep{MacWhinney2000}; as well as rttm files outputted by ACLEW tools, namely the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}, the Linguistic Unit Count Estimator (ALICE) by \citet{rasanen2020}, and the VoCalisation Maturity Network (VCMNet) by \citet{AlFutaisi2019}. Users can also adapt routines for file types or conventions that vary. For instance, users can adapt the ELAN import developed for the ACLEW DAS template for their own template (e.g., \url{https://github.com/LAAC-LSCP/ChildProject/discussions/204}); and examples are also provided for Praat's .TextGrid files \citep{boersma2006praat}. The package also supports custom, user-defined conversion routines.
+Annotations are indexed into a unique CSV dataframe which stores their location in the dataset, the set of annotations they belong to, and the recording and time interval they cover. The index, therefore, allows an easy retrieval of all the annotations that cover any given segment of audio, regardless of their original format and the naming conventions that were used. The system interfaces well with extant annotation standards. Currently, ChildProject supports: LENA annotations in .its \citep{xu2008lenatm}; ELAN annotations following the ACLEW DAS template (\citealt{Casillas2017}, imported using Pympi: \citealt{pympi-1.70}); CHAT annotations \citep{MacWhinney2000}; as well as .rttm files produced by ACLEW tools, namely the Voice Type Classifier (VTC) by \citet{lavechin2020opensource}, the Automatic Linguistic Unit Count Estimator (ALICE) by \citet{rasanen2020}, and the VoCalisation Maturity Network (VCMNet) by \citet{AlFutaisi2019}. Users can also adapt routines for file types or conventions that vary. For instance, users can adapt the ELAN import developed for the ACLEW DAS template for their own template (e.g., \url{https://github.com/LAAC-LSCP/ChildProject/discussions/204}); and examples are also provided for Praat's .TextGrid files \citep{boersma2006praat}. The package also supports custom, user-defined conversion routines.
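+As an illustration, declaring and importing one LENA .its file could look as follows (a sketch based on our reading of the package documentation; set name, file names, and time values are hypothetical):
+
+\begin{verbatim}
+import pandas as pd
+from ChildProject.annotations import AnnotationManager
+
+am = AnnotationManager(project)  # project loaded as above
+am.import_annotations(pd.DataFrame([{
+    "set": "its",                    # destination set
+    "recording_filename": "rec001.wav",
+    "time_seek": 0,
+    "range_onset": 0,                # covered interval,
+    "range_offset": 3600000,         # in milliseconds
+    "raw_filename": "rec001.its",
+    "format": "its",
+}]))
+\end{verbatim}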
 
 Relying on the annotations index, the package can also calculate the intersection of the portions of audio covered by several annotators and align their annotations. This is useful when annotations from different annotators need to be combined (in order to retain the majority choice for instance) or compared (e.g., for reliability evaluations).
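+Continuing the example above, the portions of audio covered by two (hypothetical) sets of annotations could be retrieved and aligned along these lines; the function names follow the package documentation as we read it:
+
+\begin{verbatim}
+from ChildProject.annotations import AnnotationManager
+
+# annotations covering audio annotated in both sets
+intersection = AnnotationManager.intersection(
+    am.annotations, ["vtc", "eaf"])
+segments = am.get_collapsed_segments(intersection)
+\end{verbatim}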
 
@@ -274,7 +282,7 @@ Although there was some variability in terms of the program used for human annot
 
 \subsubsection*{Extracting and uploading audio samples to Zooniverse}
 
-The crowd-sourcing platform Zooniverse \citep{zooniverse} has been extensively employed in both natural \citep{gravityspy} and social sciences. More recently, researchers have been investigating its potential to classify samples of audio extracted from daylong recordings of children and the results have been encouraging  \citep{semenzin2020a,semenzin2020b}. We provide tools interfacing with Zooniverse's API for preparing and uploading audio samples to the platform and for retrieving the results, while protecting the privacy of the participants\footnote{\url{https://childproject.readthedocs.io/en/paper/zooniverse.html}}. A step-by-step tutorial including re-usable code is also provided \citep{zooniverse_example}.
+The crowd-sourcing platform Zooniverse \citep{zooniverse} has been extensively employed in both the natural \citep{gravityspy} and social sciences. More recently, researchers have been investigating its potential to classify samples of audio extracted from long-form recordings of children, and the results have been encouraging \citep{semenzin2020a,semenzin2020b}. We provide tools interfacing with Zooniverse's API for preparing and uploading audio samples to the platform and for retrieving the results, while protecting the privacy of the participants\footnote{\url{https://childproject.readthedocs.io/en/paper/zooniverse.html}}. A step-by-step tutorial including re-usable code is also provided \citep{zooniverse_example}.
 
 \subsubsection*{Audio processing}
 
@@ -394,13 +402,13 @@ Practice                                                         & \begin{tabula
 \begin{tabular}[t]{@{}l@{}}privacy\\ safe-guarding\end{tabular} & git-annex & encryption                                                         \\ \midrule
 regular tests                                                   & git-annex & \texttt{fsck}\footnote{integrity check}                                     \\  \bottomrule
 \end{tabular}
-\caption{\label{table:backups}\textbf{Examples of recommended practices for data backups, associated to the software that could be used for their implementation}.}
 \end{minipage}
+\caption{\label{table:backups}\textbf{Examples of recommended practices for data backups, associated with the software that could be used for their implementation}.}
 \end{table*}
 
 \section{Application: evaluating annotations' reliability}\label{section:application}
 
-Assessing the reliability of the annotations is crucial to linguistic research, but it can prove tedious in the case of daylong recordings. On one hand, analysis of the massive amounts of annotations generated by automatic tools may be computationally intensive. On the other hand, human annotations are usually sparse and thus more difficult to match with each other. Moreover, as emphasized in Section \ref{section:problemspace}, the variety of file formats used to store the annotations makes it even harder to compare them.
+Assessing the reliability of the annotations is crucial to linguistic research, but it can prove tedious in the case of long-form recordings. On the one hand, analysis of the massive amounts of annotations generated by automatic tools may be computationally intensive. On the other hand, human annotations are usually sparse and thus more difficult to match with each other. Moreover, as emphasized in Section \ref{section:problemspace}, the variety of file formats used to store the annotations makes it even harder to compare them.
 
 Making use of the consistent data structures that it provides, the ChildProject package implements functions for extracting and aligning annotations regardless of their provenance or nature (human vs algorithm, ELAN vs Praat, etc.). It also provides functions to compute most of the metrics commonly used in linguistics and speech processing for comparing annotations, relying on existing efficient and field-tested implementations.
 
@@ -410,11 +418,8 @@ In real datasets with many recordings and several human and automatic annotators
 
 \begin{figure*}[htb]
 \centering
-\subfloat[]{%
-\centering
-  \includegraphics[trim=0 250 100 25, clip, width=0.8\textwidth]{Fig5a.jpg}
+  \includegraphics[width=0.8\textwidth]{Fig5a.jpg}
   \label{Annotation:1}%
-}
 
 %\subfloat[]{%
 %\centering
@@ -426,9 +431,9 @@ In real datasets with many recordings and several human and automatic annotators
 
 \end{figure*}
 
-In psychometrics, the reliability of annotators is usually evaluated using inter-coder agreement indicators. The python package enables the calculation of some of these measures, including all of the coefficients implemented in the NLTK package \citep{nltk} such as Krippendorff's Alpha \citep{alpha} and Fleiss' Kappa \citep{kappa}. The gamma method by \citet{gamma}, which aims to improve upon previous indicators by evaluating simultaneously the quality of both the segmentation and the categorization of speech, has been included \emph{via} the \texttt{pygamma-agreement} package \citep{pygamma_agreement}.
+In psychometrics, the reliability of annotators is usually evaluated using inter-coder agreement indicators. The Python package enables the calculation of some of these measures, including all of the coefficients implemented in the NLTK package \citep{nltk}, such as Krippendorff's Alpha \citep{alpha} and Fleiss' Kappa \citep{kappa}. The gamma method by \citet{gamma}, which aims to improve upon previous indicators by simultaneously evaluating the quality of both the segmentation and the categorization of speech, has been included \emph{via} the \texttt{pygamma-agreement} package \citep{pygamma_agreement}.
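+
+Building on the aligned segments dataframe from the earlier sketch, the gamma coefficient could be derived through the package's wrapper along the following lines; the \texttt{speaker\_type} column and the parameter values follow the package documentation and are merely indicative:
+
+\begin{verbatim}
+from ChildProject.metrics import gamma
+
+# gamma agreement on the categorization of speech
+# segments; alpha and beta weigh the positional and
+# categorical dissimilarities, respectively
+agreement = gamma(
+    segments, 'speaker_type',
+    alpha=3, beta=1, precision_level=0.05,
+)
+\end{verbatim}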
 
-It should be noted that these measures are most useful in the absence of ground truth, when reliability of the annotations can only be inferred by evaluating their overall agreement. Automatic annotators, however, are usually evaluated against a gold standard produced by human experts. In such cases, the package allows comparisons of pairs of annotators using metrics such as F-score, recall, and precision. Figure \ref{fig:precision} illustrates this functionality. Additionally, the package can compute confusion matrices between two annotators, allowing more informative comparisons, as demonstrated in Figure \ref{fig:confusion}. Finally, the python package interfaces well with \texttt{pyannote.metrics} \citep{pyannote.metrics}, and all the metrics implemented by the latter can be effectively used on the annotations managed with ChildProject.
+It should be noted that these measures are most useful in the absence of ground truth, when reliability of the annotations can only be inferred by evaluating their overall agreement. Automatic annotators, however, are usually evaluated against a gold standard produced by human experts. In such cases, the package allows comparisons of pairs of annotators using metrics such as F-score, recall, and precision. Figure \ref{fig:precision} illustrates this functionality. Additionally, the package can compute confusion matrices between two annotators, allowing more informative comparisons, as demonstrated in Figure \ref{fig:confusion}. Finally, the Python package interfaces well with \texttt{pyannote.metrics} \citep{pyannote.metrics}, and all the metrics implemented by the latter can be effectively used on the annotations managed with ChildProject.
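+
+A sketch of the latter two functionalities, again building on the aligned segments from the earlier example (the 100-ms time step and the set of speaker categories are arbitrary choices; function names follow the package documentation):
+
+\begin{verbatim}
+from ChildProject.metrics import (
+    segments_to_grid, conf_matrix, segments_to_annotation
+)
+from pyannote.metrics.detection import DetectionErrorRate
+
+speakers = ['CHI', 'OCH', 'FEM', 'MAL']
+
+# discretize each annotator's segments into grids
+# of 100 ms bins
+vtc = segments_to_grid(
+    segments[segments['set'] == 'vtc'], 0,
+    segments['segment_offset'].max(), 100,
+    'speaker_type', speakers,
+)
+its = segments_to_grid(
+    segments[segments['set'] == 'its'], 0,
+    segments['segment_offset'].max(), 100,
+    'speaker_type', speakers,
+)
+
+# confusion matrix between the two annotators
+confusion = conf_matrix(vtc, its)
+
+# detection error rate computed with pyannote.metrics
+ref = segments_to_annotation(
+    segments[segments['set'] == 'vtc'], 'speaker_type'
+)
+hyp = segments_to_annotation(
+    segments[segments['set'] == 'its'], 'speaker_type'
+)
+detail = DetectionErrorRate()(ref, hyp, detailed=True)
+\end{verbatim}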
 
 \begin{figure*}[htb]
 
@@ -478,7 +483,7 @@ Finally, this paper has laid out technical solutions to key issues surrounding l
 % removed: assessing data reliability
 % Data managers should be interested in DataLad because it might benefit to many studies, beyond long-form recordings. We should convince them it is worth diving into it
 
-We provide a solution to the technical challenges related to the management, storage and sharing of datasets of child-centered daylong recordings. This solution relies on four components: i) a set of standards for the structuring of the datasets; ii) \emph{ChildProject}, a python package to enforce these standards and perform useful operations on the datasets; iii) DataLad, a mature and actively developed version-control software for the management of scientific datasets; and iv) GIN, a storage provider compatible with Datalad. Building upon these standards, we have also provide tools to simplify the extraction of information from the annotations and the evaluation of their reliability along with the python package. The four components of our proposed design serve partially independent goals and can thus be decoupled, but we believe their combination would greatly benefit the technique of long-form recordings applied to language acquisition studies.
+We provide a solution to the technical challenges related to the management, storage, and sharing of datasets of child-centered long-form recordings. This solution relies on four components: i) a set of standards for the structuring of the datasets; ii) \emph{ChildProject}, a Python package to enforce these standards and perform useful operations on the datasets; iii) DataLad, a mature and actively developed version-control tool for the management of scientific datasets; and iv) GIN, a storage provider compatible with DataLad. Building upon these standards, we also provide, as part of the Python package, tools that simplify the extraction of information from the annotations and the evaluation of their reliability. The four components of our proposed design serve partially independent goals and can thus be decoupled, but we believe their combination would greatly benefit the technique of long-form recordings applied to language acquisition studies.
 
 \section*{Declarations}
 
@@ -503,7 +508,7 @@ This paper does not directly rely on specific data or material.
 The present paper can be reproduced from its source, which is hosted on GIN at \url{https://gin.g-node.org/LAAC-LSCP/managing-storing-sharing-paper}.
 The ChildProject package is available on GitHub at \url{https://github.com/LAAC-LSCP/ChildProject}. 
 A step-by-step tutorial to launch annotation campaigns on Zooniverse is published along with the source code at \url{https://doi.gin.g-node.org/10.12751/g-node.k2h9az} \citep{zooniverse_example}.
-We provide scripts and templates for DataLad managed datasets at \url{http://doi.org/10.17605/OSF.IO/6VCXK} \citep{datalad_procedures}. We also provide a DataLad extension to extract metadata from corpora of daylong recordings \citep{datalad_extension}.
+We provide scripts and templates for DataLad managed datasets at \url{http://doi.org/10.17605/OSF.IO/6VCXK} \citep{datalad_procedures}. We also provide a DataLad extension to extract metadata from corpora of long-form recordings \citep{datalad_extension}.
 
 \appendix
 
@@ -517,7 +522,7 @@ Alice decides to store the git repository itself on GitHub -- or a GitLab instan
 
 Since Bob has been given SSH access to the cluster and belongs to the right UNIX group, he can download recordings and annotations from their institution's cluster. Alice has also configured the dataset, using DataLad's ``publish-depends'' option, so that every change published to GitHub is also propagated to the cluster, as sketched below.
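+
+One way to set this up, e.g.\ from Python through DataLad's API, could be the following sketch; the sibling names (\texttt{github} and \texttt{cluster}) are hypothetical:
+
+\begin{verbatim}
+import datalad.api as dl
+
+# publishing to 'github' now triggers a publication
+# of the annexed data to 'cluster' beforehand
+dl.siblings(
+    action='configure',
+    dataset='.',
+    name='github',
+    publish_depends='cluster',
+)
+\end{verbatim}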
 
-For backup purposes, a third sibling is hosted on Amazon S3 Glacier -- which is cheaper than S3 at the expense of higher retrieval costs and delays -- as a git-annex \href{https://git-annex.branchable.com/special_remotes/}{special remote}. Special remotes do not store the git history and they cannot be used to clone the dataset. However, they can be used as a storage support for the recordings and other large files. In order to increase the security of the data, Alice uses encryption. Git-annex implements several encryption schemes\footnote{\url{https://git-annex.branchable.com/encryption/}}. The hybrid scheme allows to add public GPG keys at any time without additional decryption/encryption steps. Each user can then later decrypt the data with their own private key. This way, as long as at least one private GPG key has not been lost, data are still recoverable. This is especially valuable in that in naturally ensures redundancy of the decryption keys, which is critical in the case of encrypted backups.
+For backup purposes, a third sibling is hosted on Amazon S3 Glacier -- which is cheaper than S3 at the expense of higher retrieval costs and delays -- as a git-annex \href{https://git-annex.branchable.com/special_remotes/}{special remote}\footnote{\url{https://git-annex.branchable.com/special_remotes/}}. Special remotes do not store the git history and cannot be used to clone the dataset, but they can serve as storage support for the recordings and other large files. In order to increase the security of the data, Alice uses encryption. Git-annex implements several encryption schemes\footnote{\url{https://git-annex.branchable.com/encryption/}}. The hybrid scheme allows adding public GPG keys at any time without additional decryption/encryption steps. Each user can later decrypt the data with their own private key. This way, as long as at least one private GPG key has not been lost, the data remain recoverable. This is especially valuable in that it naturally ensures redundancy of the decryption keys, which is critical in the case of encrypted backups.
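+
+How Alice might initialize such an encrypted remote is sketched below, wrapped in Python for consistency with the other examples; the remote name, datacenter, and GPG key identifier are placeholders:
+
+\begin{verbatim}
+import subprocess
+
+# initialize an encrypted git-annex special remote
+# backed by Amazon S3 Glacier; with encryption=hybrid,
+# more GPG keys can be granted access later without
+# re-encrypting the annexed files
+subprocess.run(
+    ['git', 'annex', 'initremote', 'glacier',
+     'type=glacier',
+     'datacenter=us-east-1',
+     'encryption=hybrid',
+     'keyid=alice@example.org'],
+    check=True,
+)
+\end{verbatim}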
 
 By default, file names are hashed with an HMAC algorithm, and their content is encrypted with AES-128 (GPG's default), although another algorithm could be selected.
 
@@ -567,7 +572,7 @@ Amazon is superior to most alternatives for a number of reasons, including that
 
 Due to legislation in some countries, some researchers may not be authorized to store their data on Amazon. If they do not have access to a local cluster (see Example 1), or if they do have one but need finer control over access permissions, there are alternatives that can be used as a workaround.
 
-Finding herself in this setting, Alice decides to use the G-Node Infrastructure (GIN)\footnote{\url{https://gin.g-node.org/}}, which is dedicated to providing ``Modern Research Data Management for Neuroscience''. GIN is similar to GitLab and GitHub in many aspects, except that is also supports git-annex and thus can directly host the large files that required third-party providers while using those platforms.
+Finding herself in this setting, Alice decides to use the G-Node Infrastructure (GIN)\footnote{\url{https://gin.g-node.org/}}, which is dedicated to providing ``Modern Research Data Management for Neuroscience''. GIN is similar to GitLab and GitHub in many respects, except that it also supports git-annex and can thus directly host the large files that would require third-party providers on those platforms.
 
 Just like GitLab or GitHub, it can handle complex permissions at the user or group level, thus surpassing Unix-style permissions management.