alecristia 6 months ago
commit 0806a15c20
1 changed file with 10 additions and 5 deletions

+ 10 - 5
CODE/SM.Rmd

@@ -134,12 +134,13 @@ plot(df.icc.simu$icc_child_id~df.icc.simu$myr,xlab="r used to simulate the data"
 
 ## SM B: More information for benchmarking our results against previously reported reliability studies
 
-First, we looked for measures of language development used with observational data that can be employed with children aged 0-3 years, and which are available at least in English. All of the instruments we found rely on reports from caregivers, who are basing their judgments on their cumulative experience with the child (e.g., the Child Observation Record Advantage, Schweinhart, McNair, & Larner, 1993; the Desired Results Developmental Profile, REF; the MacArthur-Bates Communicative Development Inventory, REF). Readers are likely most familiar with the MB-CDI,  Fenson et al., 1994 report a correlation of r=.95 in their sample of North American, monolingual infants. We did not find a systematic review or meta-analysis providing more such estimates. However, @frank2021 analyzed data archived in a CDI repository, concentrating on American English and Norwegian, where longitudinal data was available. They found that for both datasets, correlations within 2-4 months were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high when considering that the CDI tends to have ceiling and floor effects at these extreme ages. Another report looked at correlations when parents completed two versions of the form in two media (e.g., short form in paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production. It is worth bearing in mind that test-retest reliability in parental report measures does not depend only on the consistency in the individual infants' ranking for a given behavior, but also on the consistency of the adult in reporting it. Moreover, they are based on cumulative experience, rather than a one-shot observation, as in the case of long-form recordings. Therefore, they do not constitute an appropriate comparison point, and by and large we can be quite certain that they will yield higher reliability than metrics based on the children themselves.  For example, a meta-analysis of infant laboratory tasks, including test-retest data for sound discrimination, word recognition, and prosodic processing, found that the meta-analytic weighted average was not different from zero, suggesting that performance in these short laboratory tasks may not be captured in a stable way across testing days. Thus, parental report (or short lab studies) may not be the most appropriate comparisons for our own study.
+First, we looked for measures of language development used with observational data that can be employed with children aged 0-3 years, and which are available at least in English. All of the instruments we found rely on reports from caregivers, who base their judgments on their cumulative experience with the child (e.g., the Child Observation Record Advantage, Schweinhart, McNair, & Larner, 1993; the Desired Results Developmental Profile, REF; the MacArthur-Bates Communicative Development Inventory, REF). Readers are likely most familiar with the MB-CDI, for which Fenson et al. (1994) report a correlation of r=.95 in their sample of North American, monolingual infants. We did not find a systematic review or meta-analysis providing more such estimates. However, @frank2021 analyzed data archived in a CDI repository, concentrating on American English and Norwegian, for which longitudinal data were available. They found that for both datasets, correlations within 2-4 months were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high when considering that the CDI tends to have ceiling and floor effects at these extreme ages. Another report looked at correlations when parents completed two versions of the form in two media (e.g., short form on paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production. It is worth bearing in mind that test-retest reliability in parental report measures does not depend only on the consistency of individual infants' relative scores for a given behavior, but also on the consistency of the adult in reporting it. Moreover, parental reports are based on cumulative experience, rather than a one-shot observation, as is the case for long-form recordings. Therefore, they do not constitute an appropriate comparison point, and by and large we can be quite certain that they will yield higher reliability than metrics based on the children themselves. Metrics based directly on children's observed behavior can be far less stable: for example, a meta-analysis of infant laboratory tasks, including test-retest data for sound discrimination, word recognition, and prosodic processing, found that the meta-analytic weighted average was not different from zero, suggesting that performance in these short laboratory tasks may not be captured in a stable way across testing days. Thus, parental report (or short lab studies) may not be the most appropriate comparison points for our own study.
 
-Second, we did a bibliographic search for systematic reviews of test-retest reliability of standardized instruments to measure language development up to age three years that are available at least in English. Although reliability in these ones may also partially reflect consistency in the adults' reports, they are at least based on a one-shot observation of how the child behaves, rather than the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (REF) and GSED (REF), and thus we cannot include them in the present summary.
+Second, we did a bibliographic search for systematic reviews of test-retest reliability of standardized instruments for measuring language development up to age three years that are available at least in English. Although reliability estimates for these tests may also partially reflect the consistency of adults' reports, the tests are at least based on a one-shot observation of how the child behaves, rather than on the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (REF) and GSED (REF), and thus we cannot include them in the present summary.
 
 The Ages and Stages Questionnaire (ASQ) is a screening tool to measure infants' and children's development based on observation of age-specific behaviors: For instance, at 6 months, in the domain of communication, one of the items asks whether the child smiles. The ASQ's reliability has been the object of a systematic review of independent evidence (Velikonja et al., 2017). Across 10 articles with data from children in the USA, Canada, China, Brazil, India, Chile, the Netherlands, Korea, and Turkey, only three articles reported test-retest correlations, two in the USA (r=.84-1) and one in Turkey (r=.67). However, the meta-analysis authors judged these three studies to be "poor" in quality. Moreover, the ASQ is a questionnaire that an observer fills in, but it can also be administered as a parental questionnaire, reducing its comparability with our purely observational method.
-For the other tests, reliability is mainly available from reports by the companies commercializing the test. The Goldman Fristoe Articulation Test – 3rd edition (GFTA-3) (REF) focuses on the ability to articulate certain sounds through picture-based elicitation and can be used from two to five years of age. Available in English and Spanish for a USA context, it has a reported reliability of r=.92 -- although we do not know whether it is this high for two-year-olds specifically.  The Preschool Language Scales – 5th edition can be used to measure both comprehension and production from birth to 7 years of age. According to Pearson's report, its test-retest reliability is .86 to .95, depending on the age bracket (0;0-2;11, 3;0-4;11, and 5;0-7;11, 0-7 years). The test has also been adapted to other languages, with good reported test-retest reliability (r=.93; Turkish, Sahli & Belgin, 2017). One issue we see with both of these reports is that children tested varied greatly in age, and the correlation seems to have been calculated based on the raw (rather than the normed) score. As a result, children's ranking may have been stable over test and retest mainly because they varied greatly in age. The Expressive Vocabulary Test – 2nd Edition (EVT-2) is a picture-based elicitation method that can be used with children from 2.5 years of age. The company Springer reports a test-retest reliability of r=.95 by age (REF).
+
+For the other tests, reliability is mainly available from reports by the companies commercializing the test. The Goldman Fristoe Articulation Test – 3rd edition (GFTA-3) (REF) focuses on the ability to articulate certain sounds through picture-based elicitation and can be used from two to five years of age. Available in English and Spanish for a USA context, it has a reported reliability of r=.92 -- although we do not know whether it is this high for two-year-olds specifically. The Preschool Language Scales – 5th edition (PLS-5) can be used to measure both comprehension and production from birth to 7 years of age. According to Pearson's report, its test-retest reliability is .86 to .95, depending on the age bracket (0;0-2;11, 3;0-4;11, and 5;0-7;11). The test has also been adapted to other languages, with good reported test-retest reliability (r=.93; Turkish, Sahli & Belgin, 2017). One issue we see with both of these reports is that the children tested varied greatly in age, and the correlation seems to have been calculated based on the raw (rather than the normed) score. As a result, children's relative scores may have been stable over test and retest mainly because the children varied greatly in age, rather than because the scores captured more meaningful individual differences. The Expressive Vocabulary Test – 2nd Edition (EVT-2) is a picture-based elicitation method that can be used with children from 2.5 years of age. The company Springer reports a test-retest reliability of r=.95 by age (REF).
 
 Many other standardized tests exist for children over three years of age. Given that most children included in the present study were under three, these other tests do not constitute an ideal comparison point. Nonetheless, in the spirit of informativeness, it is useful to consider that a systematic review and meta-analysis has been done looking at all psychometric properties (internal consistency, reliability, measurement error, content and structural validity, convergent and discriminant validity) for standardized assessments targeted at children aged 4-12 years (REF). Out of 76 assessments found in the literature, only 15 could be evaluated for their psychometric properties, and 14 reported on reliability based on independent evidence (i.e., the researchers tested and retested children, rather than relying on the company's report of reliability). Among these, correlations for test-retest reliability averaged r=.67, with a range from .35 to .76. The authors concluded that psychometric quality was limited for all assessments, but based on the available evidence, PLS-5 (whose test-retest reliability was r=.69) was among those recommended for use.
 
@@ -280,7 +281,9 @@ rval_tab$Type<-get_type(rval_tab)
 
 ```
 
-Out of our `r length(levels(factor(mydat_aclew$experiment)))` corpora and `r length(levels(factor(mydat_aclew$child_id)))` children, `r length(levels(factor(dist_contig_lena$child_id)))` children (belonging to `r length(levels(factor(gsub(" .*","",dist_contig_lena$child_id))))`  corpora) could be included in this analysis, as some children did not have recordings less than two months apart (in particular, no child from the Warlaumont corpus did). 
+> QUOTED TEXT
+
+Out of `r length(levels(factor(mydat_aclew$child_id)))` children in `r length(levels(factor(mydat_aclew$experiment)))` corpora, `r length(levels(factor(dist_contig_lena$child_id)))` children (belonging to `r length(levels(factor(gsub(" .*","",dist_contig_lena$child_id))))` corpora) could be included in this analysis, as some children did not have recordings less than two months apart. In particular, no child from the Warlaumont corpus had two recordings that close together.
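
As a rough illustration of this selection step, the contiguous-recording filter could be implemented along the following lines. This is a minimal sketch: it assumes a long-format data frame such as `mydat_aclew` with one row per recording and an age-in-months column (here called `age`); the actual column names and the construction of `dist_contig_lena` in SM.Rmd may differ.

```r
# Sketch: keep only children with at least two recordings <= 2 months apart.
# Assumes `mydat_aclew` has one row per recording, with columns `child_id` and
# `age` (child age in months at recording); names are illustrative.
has_contig <- tapply(mydat_aclew$age, mydat_aclew$child_id, function(a) {
  a <- sort(a)
  length(a) > 1 && any(diff(a) <= 2)  # any consecutive pair within two months?
})
contig_ids <- names(has_contig)[has_contig]  # children retained for this analysis
length(contig_ids)                           # number of children included
```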
 
 ## SM F: Exploration: Is Child ICC lower than the correlations because we are controlling for age?
 
@@ -331,6 +334,8 @@ cor_t=t.test(rval_tab$m ~ rval_tab$data_set)
 
 ```
 
+> QUOTED TEXT
+
 To see whether correlations in this analysis differed by talker types and pipelines, we fit a linear model with the formula $lm(cor \sim type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, or other children, and pipeline indicates whether the metrics came from LENA or ACLEW. We found an adjusted R-squared of `r round(reg_sum_cor$adj.r.squared*100)`%, suggesting this model did not explain a great deal of variance in correlation coefficients. A Type 3 ANOVA on this model revealed a significant effect of pipeline (F = `r round(reg_anova_cor["data_set","F value"],2)`, p = `r round(reg_anova_cor["data_set","Pr(>F)"],2)`), due to higher correlations for ACLEW (m = `r r_msds["aclew","x"]`) than for LENA metrics (m = `r r_msds["lena","x"]`). See below for full results.
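
For concreteness, the model and Type 3 ANOVA described above could be obtained roughly as follows. This is a sketch, not the actual SM.Rmd chunk: it assumes the `rval_tab` columns built earlier (`m` for the correlation, `Type` for talker type, `data_set` for pipeline) and the use of `car::Anova` for the Type 3 tests.

```r
# Sketch of the correlation model; column names follow the inline code above
# (`m`, `Type`, `data_set`) but are assumptions about the actual objects.
library(car)

# Sum-to-zero contrasts are usually set before Type 3 tests so main effects are interpretable
options(contrasts = c("contr.sum", "contr.poly"))

reg_cor       <- lm(m ~ Type * data_set, data = rval_tab)
reg_sum_cor   <- summary(reg_cor)            # adjusted R-squared reported in the text
reg_anova_cor <- Anova(reg_cor, type = 3)    # tests for talker type, pipeline, and their interaction
reg_anova_cor
```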
 
 ```{r print out anova results rec on cor}
@@ -378,7 +383,7 @@ for(i in 2:length(corp_w_sib)) corp_w_sib_clean=paste(corp_w_sib_clean,corp_w_si
 ```
 
 
-We reasoned the high Child ICC for metrics related to other children may be because children in our corpora vary in terms of the number of siblings they have, that siblings' presence may be stable across recordings, and that a greater number of siblings would lead to more other child vocalizations. As a result, any measure based on other child vocalizations would result in stable relative ranking of children due to the number of siblings present. To test this hypothesis, we selected the metric with the highest Child ICC, namely ACLEW's total vocalization duration by other children. We fit the full model again to predict this metric, but this time, in addition to controlling for age, we included sibling number as a fixed effect $lmer(metric~ age + sibling_number + (1|corpus/child))$, so that individual variation that was actually due to sibling number was captured by that fixed effect instead of the random effect for child. We had sibling number data for `r sum(has_n_of_sib[,"TRUE"])` recordings from `r length(levels(factor(mydat2$child_id[!is.na(mydat2$n_of_siblings)])))` in `r length(levels(factor(mydat2$experiment[!is.na(mydat2$n_of_siblings)])))` corpora (`r corp_w_sib_clean`). The number of siblings varied from `r min(mydat2$n_of_siblings,na.rm=T)` to `r max(mydat2$n_of_siblings,na.rm=T)`, with a mean of `r round(mean(mydat2$n_of_siblings,na.rm=T),1)` and a median of `r round(median(mydat2$n_of_siblings,na.rm=T),1)`.  Results indicated the full model was singular, so we fitted a No Corpus model to be able to extract a Child ICC. As a sanity check, we verified that the number of siblings predicted the outcome, total vocalization duration by other children -- and found that it did: ß = `r round(summary(model)$coefficients["n_of_siblings","Estimate"],2)`, t = `r round(summary(model)$coefficients["n_of_siblings","t value"],2)`, p < .001. This effect is relatively small: It means that per additional sibling, there is a .2 standard deviation increase in this variable. Turning now to how much variance is allocated to the random factor of Child, there was no difference in Child ICC in our original analysis (`r round(df.icc.mixed[df.icc.mixed$metric=="voc_dur_och_ph" & df.icc.mixed$data_set=="aclew","icc_child_id"],2)`) versus this re-analysis including the number of siblings (`r round(icc.result.split["icc_child_id"],2)`).
+We reasoned that the high Child ICC for metrics related to other children may arise because children in our corpora vary in terms of the number of siblings they have, siblings' presence may be stable across recordings, and a greater number of siblings would lead to more other-child vocalizations. As a result, any measure based on other-child vocalizations would yield stable relative scores for children simply because of the number of siblings present. To test this hypothesis, we selected the metric with the highest Child ICC, namely ACLEW's total vocalization duration by other children. We fit the full model again to predict this metric, but this time, in addition to controlling for age, we included sibling number as a fixed effect, $lmer(metric \sim age + sibling\_number + (1|corpus/child))$, so that individual variation that was actually due to sibling number was captured by that fixed effect instead of the random effect for child. We had sibling number data for `r sum(has_n_of_sib[,"TRUE"])` recordings from `r length(levels(factor(mydat2$child_id[!is.na(mydat2$n_of_siblings)])))` children in `r length(levels(factor(mydat2$experiment[!is.na(mydat2$n_of_siblings)])))` corpora (`r corp_w_sib_clean`). The number of siblings varied from `r min(mydat2$n_of_siblings,na.rm=T)` to `r max(mydat2$n_of_siblings,na.rm=T)`, with a mean of `r round(mean(mydat2$n_of_siblings,na.rm=T),1)` and a median of `r round(median(mydat2$n_of_siblings,na.rm=T),1)`. Results indicated the full model was singular, so we fitted a No Corpus model to be able to extract a Child ICC. As a sanity check, we verified that the number of siblings predicted the outcome, total vocalization duration by other children -- and found that it did: β = `r round(summary(model)$coefficients["n_of_siblings","Estimate"],2)`, t = `r round(summary(model)$coefficients["n_of_siblings","t value"],2)`, p < .001. This effect is relatively small: per additional sibling, there is a .2 standard deviation increase in this variable. Turning now to how much variance is allocated to the random factor of Child, there was no difference in Child ICC between our original analysis (`r round(df.icc.mixed[df.icc.mixed$metric=="voc_dur_och_ph" & df.icc.mixed$data_set=="aclew","icc_child_id"],2)`) and this re-analysis including the number of siblings (`r round(icc.result.split["icc_child_id"],2)`).
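
The re-analysis just described could be sketched as follows. This is only an outline under stated assumptions: it supposes `mydat2` contains the (standardized) ACLEW metric as a column (guessed here as `voc_dur_och_ph`), plus `age`, `n_of_siblings`, `child_id`, and `experiment`, and it uses a common variance-components definition of Child ICC; the exact computation in SM.Rmd may differ.

```r
# Sketch of the sibling re-analysis; column names mirror the inline code above,
# but the metric column name is an assumption.
library(lme4)

dat_sib <- subset(mydat2, !is.na(n_of_siblings))

# Full model (reported as singular in the text), then the "No Corpus" model used for Child ICC
model_full <- lmer(voc_dur_och_ph ~ age + n_of_siblings + (1 | experiment/child_id), data = dat_sib)
model      <- lmer(voc_dur_och_ph ~ age + n_of_siblings + (1 | child_id), data = dat_sib)

# Child ICC = child random-intercept variance / total (child + residual) variance
vc <- as.data.frame(VarCorr(model))
icc_child_id <- vc$vcov[vc$grp == "child_id"] / sum(vc$vcov)
icc_child_id
```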
 
 
 ```{r sib-presence}