alecristia 2 months ago
parent
commit
bb62b5daa3
2 changed files with 17 additions and 10 deletions:
  1. CODE/SM.Rmd (+13 −10)
  2. CODE/create-all-rs.R (+4 −0)

+ 13 - 10
CODE/SM.Rmd

@@ -136,11 +136,17 @@ plot(df.icc.simu$icc_child_id~df.icc.simu$myr,xlab="r used to simulate the data"
 
 ## SM B: More information for benchmarking our results against previously reported reliability studies
 
-First, we looked for measures of language development used with observational data that can be employed with children aged 0-3 years, and which are available at least in English. To this end, we did searches in scholar.google.com and google.com with keywords like ("language development" AND "standardized test" AND "infancy"), complementing these searches with our own knowledge. All of the instruments we found rely on reports from caregivers, including the Child Observation Record Advantage (Schweinhart, McNair, & Larner, 1993); the Desired Results Developmental Profile (Kriener-Althen, Newton, Draney, & Mangione, 2020); and the MacArthur-Bates Communicative Development Inventory (MB-CDI, Fenson et al., 1994). Readers are likely most familiar with the MB-CDI, so we focus our discussion on this instrument here, to illustrate the level of test-retest reliability such instruments can attain. In the work most commonly cited in relation to this instrument's creation, Fenson et al. (1994) reported a correlation of r=.95 in their sample of North American, monolingual infants. Since it has been over 20 years, and this instrument has been used with other samples and languages, a more relevant comparison point for our own study would be one that takes into account more attempts at using this instrument. We did not find a systematic review or meta-analysis on test-retest estimates for the MB-CDI. However, Frank et al. (2021) analyzed data archived in an MB-CDI repository, concentrating on two corpora where longitudinal data was available, American English and Norwegian. They found that, for both datasets, correlations within 2-4 months were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high when considering that the MB-CDI tends to have ceiling and floor effects at these extreme ages. Alcock, Meints, and Rowland (n.d.) looked at correlations when British English-speaking parents completed two versions of the form in two media (e.g., short form in paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production. Thus, although evidence is scant, it is sufficient to see that the MB-CDI can lead to high correlations, that are nearly comparable to those expected in standardized testing among adults. It is worth bearing in mind that test-retest reliability in parental report measures does not depend only on the consistency in the individual infants' relative scores for a given behavior, but also on the consistency of the adult in reporting it. Moreover, they are based on cumulative experience, rather than a one-shot observation, as in the case of long-form recordings. Therefore, they do not constitute an appropriate comparison point, and by and large we can be quite certain that they will yield higher reliability than metrics based on the children themselves.  Thus, parental report may not be the most appropriate comparisons for our own study because they may overestimate reliability in the child's own behavior.
+First, we looked for measures of language development that rely on observational data, can be employed with children aged 0-3 years, and are available at least in English. To this end, we searched scholar.google.com and google.com around February 2021 with keywords like ("language development" AND "standardized test" AND "infancy"), complementing these searches with our own knowledge.
+
+All of the instruments we found rely on reports from caregivers, including the Child Observation Record Advantage (Schweinhart, McNair, & Larner, 1993); the Desired Results Developmental Profile (Kriener-Althen, Newton, Draney, & Mangione, 2020); and the MacArthur-Bates Communicative Development Inventory (MB-CDI; Fenson et al., 1994). Readers are likely most familiar with the MB-CDI, so we focus our discussion on this instrument to illustrate the level of test-retest reliability such instruments can attain.
+
+In the work most commonly cited in relation to the MB-CDI's creation, Fenson et al. (1994) reported a correlation of r=.95 in their sample of North American, monolingual infants. Since that report is over 20 years old, and the instrument has since been used with other samples and languages, a more relevant comparison point for our own study is one that takes into account more attempts at using it. We did not find a systematic review or meta-analysis of test-retest estimates for the MB-CDI. However, Frank et al. (2021) analyzed data archived in an MB-CDI repository, concentrating on two corpora where longitudinal data were available, American English and Norwegian. They found that, for both datasets, correlations between administrations 2-4 months apart were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high considering that the MB-CDI tends to show ceiling and floor effects at these extreme ages. Alcock, Meints, and Rowland (2020) looked at correlations when British English-speaking parents completed two versions of the form in two media (e.g., short form on paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production.
+
+Thus, although evidence is scant, it suffices to show that the MB-CDI can lead to high correlations, nearly comparable to those expected in standardized testing among adults. It is worth bearing in mind that test-retest reliability in parental report measures depends not only on the consistency of the individual infants' relative scores for a given behavior, but also on the consistency of the adult in reporting it. Moreover, such reports are based on cumulative experience, rather than a one-shot observation as in the case of long-form recordings. Parental report measures may therefore not be the most appropriate comparison for our own study: by and large, we can be quite certain that they will yield higher reliability than metrics based on the children themselves, and may thus overestimate the reliability attainable from the child's own behavior.
 
 <!-- For example, a meta-analysis of infant laboratory tasks, including test-retest data for sound discrimination, word recognition, and prosodic processing, found that the meta-analytic weighted average was not different from zero , suggesting that performance in these short laboratory tasks may not be captured in a stable way across testing days.  -->
 
-Second, we did a bibliographic search for systematic reviews of test-retest reliability of standardized instruments to measure language development up to age three years that are available at least in English using similar methods to those mentioned above. Although reliability in these tests may also partially reflect consistency in the adults' reports, they are at least based on a one-shot observation of how the child behaves, rather than the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (NIH, 2023) and GSED (Cavallera et al., 2023), and thus we cannot include them in the present summary.
+Second, in June 2023, we did a bibliographic search, using methods similar to those described above, for systematic reviews of the test-retest reliability of standardized instruments that measure language development up to age three years and are available at least in English. Although reliability in these tests may also partially reflect consistency in the adults' reports, they are at least based on a one-shot observation of how the child behaves, rather than on the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (NIH, 2023) and the GSED (Cavallera et al., 2023), and thus we cannot include them in the present summary.
 
 The Ages and Stages Questionnaire (ASQ) is a screening tool that measures infants' and children's development based on observation of age-specific behaviors: for instance, at 6 months, in the domain of communication, one item asks whether the child smiles. The ASQ's reliability has been the object of a systematic review of independent evidence (Velikonja et al., 2017). Across 10 articles with data from children in the USA, Canada, China, Brazil, India, Chile, the Netherlands, Korea, and Turkey, only three reported test-retest correlations, two in the USA (r=.84-1) and one in Turkey (r=.67). However, the meta-analysis authors judged these three studies to be "poor" in quality. Moreover, although the ASQ is a questionnaire that an observer fills in, it can also be administered as a parental questionnaire, reducing its comparability with our purely observational method.
 
@@ -148,7 +154,7 @@ For the other tests, reliability is mainly available from reports by the compani
 
 Many other standardized tests exist for children over three years of age. Given that most children included in the present study were under three, these tests do not constitute an ideal comparison point. Nonetheless, in the spirit of informativeness, it is useful to consider a systematic review and meta-analysis of all psychometric properties (internal consistency, reliability, measurement error, content and structural validity, convergent and discriminant validity) of standardized assessments targeted at children aged 4-12 years (Denman et al., 2017). Out of 76 assessments found in the literature, only 15 could be evaluated for their psychometric properties, and 14 reported on reliability based on independent evidence (i.e., the researchers tested and retested children, rather than relying on the publishing company's report of reliability). Among these, correlations for test-retest reliability averaged r=.67, ranging from .35 to .76. The authors concluded that psychometric quality was limited for all assessments but that, based on the available evidence, the PLS-5 (whose test-retest reliability was r=.69) was among those recommended for use.
 
-Third, and perhaps most relevant, we looked for references that evaluated the psychometric properties of measures extracted from wearable data. We found no previous work attempting to do so on the basis of completely ecological, unconstrained data like ours. The closest references we could find reported on reliability and/or validity of measurements from wearable data collected in constrained situations, such as having 4.5 year old children wear interior sensors and asking them to complete four tests of balance (e.g., standing with their eyes closed; Liu et al., 2022). It is likely that consistency and test-retest reliability are higher in such cases than in data like ours, making it hard to compare. Nonetheless, to give an idea, a recent meta-analysis of wearable inertial sensors in healthy adults (Kobsar et al., 2020) found correlations between these instruments and gold standards above r= .88 for one set of measures (based on means) but much lower for another (based on variability, max weighted mean effect r = .58). Regarding test-retest reliability, the meta-analysts report ICCs above .6 for all measures for which they could find multiple studies reporting them. However, those authors point out that the majority of the included studies were classified as low quality, according to a standardized quality assessment for that work. 
+Third, and perhaps most relevant, we looked for references that evaluated the psychometric properties of measures extracted from wearable data, also in June 2023. We found no previous work attempting to do so on the basis of completely ecological, unconstrained data like ours. The closest references we could find reported on the reliability and/or validity of measurements from wearable data collected in constrained situations, such as having 4.5-year-old children wear inertial sensors while completing four tests of balance (e.g., standing with their eyes closed; Liu et al., 2022). Consistency and test-retest reliability are likely higher in such cases than in data like ours, making comparisons difficult. Nonetheless, to give an idea, a recent meta-analysis of wearable inertial sensors in healthy adults (Kobsar et al., 2020) found correlations between these instruments and gold standards above r=.88 for one set of measures (based on means) but much lower for another (based on variability; max weighted mean effect r=.58). Regarding test-retest reliability, the meta-analysts report ICCs above .6 for all measures for which they could find multiple studies. However, those authors point out that the majority of the included studies were classified as low quality according to a standardized quality assessment.
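
Because the benchmarks above are mostly reported as Pearson correlations while our own results are reported as ICCs, a minimal sketch may help relate the two. This is illustrative code of our own (not from the repository; all names and values are assumptions), using simulated two-administration data and the lme4 package:

```{r r-vs-icc-sketch, eval=FALSE}
# Illustrative sketch (not repository code): how a Pearson test-retest r
# and a child-level ICC relate on the same simulated data.
library(lme4)

set.seed(42)
n_children <- 100
true_r <- .6  # assumed underlying test-retest correlation
child_effect <- rnorm(n_children, 0, sqrt(true_r))
scores <- data.frame(
  child_id = factor(rep(1:n_children, each = 2)),
  score = rep(child_effect, each = 2) + rnorm(n_children * 2, 0, sqrt(1 - true_r))
)

# Pearson r between the two administrations
wide <- matrix(scores$score, ncol = 2, byrow = TRUE)  # rows = children
cor(wide[, 1], wide[, 2])  # ~ .6

# ICC from a random-intercept model: child variance / total variance
fit <- lmer(score ~ 1 + (1 | child_id), data = scores)
vc <- as.data.frame(VarCorr(fit))
vc$vcov[vc$grp == "child_id"] / sum(vc$vcov)  # ~ .6
```

With two administrations per child, both estimates converge on roughly the same quantity (the share of variance due to stable child differences), which is why r-based benchmarks are informative, if imperfect, comparison points for ICCs.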
 
 
 
@@ -192,10 +198,9 @@ nrecs=length(levels(mydat_aclew$session_id))
 
 
 
+## SM D: Code to reproduce Fig. 1
 
-## SM D: Code to reproduce Fig. 2
-
-```{r icc-examples-fig2,  fig.width=6, fig.height=4.5,fig.cap="Figure 2 (reproduced). Scatterplots for two selected variables. The left one has relatively low ICCs; the right one has relatively higher ICCs."}
+```{r icc-examples-fig1, fig.width=6, fig.height=4.5, fig.cap="Figure 1 (reproduced). Scatterplots for two selected variables. The left one has relatively low ICCs; the right one has relatively higher ICCs."}
 # figure with bad ICC (LENA): used to be avg_voc_dur_chi, now peak_wc_adu_ph; figure with good ICC (LENA): used to be voc_och_ph, now voc_dur_och_ph
 
 # remove missing data points altogether
@@ -744,12 +749,8 @@ bias_tab$recXcor<-bias_tab$recXcor/sum(bias_tab$recXcor)
 Another potential negative contribution to reliability, currently not discussed, is variability in the experimental setup. In a corpus collected in the Solomon Islands, children wore two recorders simultaneously; these were USB devices sourced from two different providers. In this dataset, the durations of the recordings could differ considerably within the same pair (a ~10% difference was not atypical), which means that what is actually recorded is somewhat random in itself. Even when comparing the identical audio ranges covered by both recordings of each pair, the corresponding ACLEW metrics differed slightly; they were strongly correlated (R^2 was close to 0.95) but not perfectly so. This suggests that randomness in the recorders' properties and placement may also contribute to decreased reliability. Importantly, this unreliability is not due to variability in the underlying behaviors, since both recorders picked up the exact same day; nor is it due to algorithmic variation, because the ACLEW algorithms are deterministic. This variability is thus due only to hardware differences, and potentially also to differences in, e.g., USB placement.
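
To make this agreement analysis concrete, here is an illustrative sketch using simulated values only (no real data from the Solomon Islands corpus; the noise level is tuned merely to mimic the reported R^2 of ~0.95):

```{r paired-recorder-sketch, eval=FALSE}
# Illustrative sketch with simulated values: agreement between the same
# ACLEW metric from two recorders worn simultaneously, computed over the
# identical audio range of each pair.
set.seed(1)
n_pairs <- 40
true_value <- rnorm(n_pairs, mean = 1000, sd = 200)  # hypothetical per-recording metric
device_A <- true_value + rnorm(n_pairs, 0, 32)       # hardware/placement noise,
device_B <- true_value + rnorm(n_pairs, 0, 32)       # chosen to mimic R^2 ~ .95

cor(device_A, device_B)^2  # strong but imperfect agreement

plot(device_A, device_B,
     xlab = "metric from recorder A", ylab = "metric from recorder B")
abline(0, 1, lty = 2)  # identity line = perfect agreement
```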
 
 ## SM Z: References
-Schweinhart, L. J., Mcnair, S., Barnes, H., & Larner, A. M. (1993). Observing Young Children in Action to Assess their Development: The High/Scope Child Observation Record Study. Educational and Psychological Measurement, 53(2), 445–455. https://doi.org/10.1177/0013164493053002014
 
 
-
-REFERENCES
-
 Alcock, K., Meints, K., & Rowland, C. (2020). The UK Communicative Development Inventories. London, UK: J&R Press. 
 
 Cavallera, V., Lancaster, G., Gladstone, M., Black, M. M., McCray, G., Nizar, A., ... & Janus, M. (2023). Protocol for validation of the Global Scales for Early Development (GSED) for children under 3 years of age in seven countries. BMJ Open, 13(1), e062562. http://dx.doi.org/10.1136/bmjopen-2022-062562
@@ -777,6 +778,8 @@ NIH (2023). NIH Infant and Toddler (Baby) Toolbox. Retrieved March 16, 2023, fro
 
 Sahli, A. S., & Belgin, E. (2017). Adaptation, validity, and reliability of the Preschool Language Scale–Fifth Edition (PLS–5) in the Turkish context: The Turkish Preschool Language Scale–5 (TPLS–5). International Journal of Pediatric Otorhinolaryngology, 98, 143–149. https://doi.org/10.1016/j.ijporl.2017.05.003
 
+Schweinhart, L. J., McNair, S., Barnes, H., & Larner, A. M. (1993). Observing young children in action to assess their development: The High/Scope Child Observation Record study. Educational and Psychological Measurement, 53(2), 445–455. https://doi.org/10.1177/0013164493053002014
+
 Velikonja, T., Edbrooke-Childs, J., Calderon, A., Sleed, M., Brown, A., & Deighton, J. (2017). The psychometric properties of the Ages & Stages Questionnaires for ages 2-2.5: A systematic review. Child: Care, Health and Development, 43(1), 1–17. https://doi.org/10.1111/cch.12397
 
 Williams, K. T. (1997). Expressive vocabulary test second edition (EVT™ 2). Journal of the American Academy of Child and Adolescent Psychiatry, 42, 864-872.

+ 4 - 0
CODE/create-all-rs.R

@@ -60,6 +60,10 @@ colnames(all_iccs[,1:nsamples])<-paste0("sample",1:nsamples)
 all_iccs$data_set<-df.icc.mixed$data_set
 all_iccs$metric<-df.icc.mixed$metric
 
+#dist_contig_lena[,c("child_id","age","age_dist_next_rec")]
+table(dist_contig_lena[,c("age_dist_next_rec")]) # FYI: distribution of gaps to each child's next recording
+table(dist_contig_lena[,c("age")]) # FYI: distribution of child ages across recordings
+
 for(i in 1:nsamples){#i=1
   
   #for each child, sample 2 contiguous recordings that are less than 2 months away
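
For orientation, the sampling step this loop performs can be sketched as follows. This is a simplified standalone version under assumed column semantics (in particular, that sorting by `age` makes `age_dist_next_rec` the gap, in months, to the next row for the same child); the actual implementation in create-all-rs.R is authoritative:

```r
# Simplified sketch of the sampling step (assumed column semantics).
sample_contiguous_pair <- function(d, max_gap = 2) {
  d <- d[order(d$age), ]  # chronological order within one child
  # indices of recordings whose next recording is less than max_gap months away
  eligible <- which(!is.na(d$age_dist_next_rec) & d$age_dist_next_rec < max_gap)
  if (length(eligible) == 0) return(NULL)
  i <- if (length(eligible) == 1) eligible else sample(eligible, 1)
  d[c(i, i + 1), ]  # the sampled recording and the next one in time
}

# one pair of contiguous recordings per child; children with no eligible
# pair return NULL and are dropped by rbind
pairs <- do.call(rbind,
                 lapply(split(dist_contig_lena, dist_contig_lena$child_id),
                        sample_contiguous_pair))
```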