
added dfs

alecristia · 4 months ago · parent commit 6e1a52398b
1 changed file with 54 additions and 10 deletions: CODE/SM.Rmd (+54 −10)

@@ -135,17 +135,19 @@ plot(df.icc.simu$icc_child_id~df.icc.simu$myr,xlab="r used to simulate the data"
 
 ## SM B: More information for benchmarking our results against previously reported reliability studies
 
-First, we looked for measures of language development used with observational data that can be employed with children aged 0-3 years, and which are available at least in English. All of the instruments we found rely on reports from caregivers, who are basing their judgments on their cumulative experience with the child (e.g., the Child Observation Record Advantage, Schweinhart, McNair, & Larner, 1993; the Desired Results Developmental Profile, REF; the MacArthur-Bates Communicative Development Inventory, REF). Readers are likely most familiar with the MB-CDI,  Fenson et al., 1994 report a correlation of r=.95 in their sample of North American, monolingual infants. We did not find a systematic review or meta-analysis providing more such estimates. However, @frank2021 analyzed data archived in a CDI repository, concentrating on American English and Norwegian, where longitudinal data was available. They found that for both datasets, correlations within 2-4 months were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high when considering that the CDI tends to have ceiling and floor effects at these extreme ages. Another report looked at correlations when parents completed two versions of the form in two media (e.g., short form in paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production. It is worth bearing in mind that test-retest reliability in parental report measures does not depend only on the consistency in the individual infants' relative scores for a given behavior, but also on the consistency of the adult in reporting it. Moreover, they are based on cumulative experience, rather than a one-shot observation, as in the case of long-form recordings. Therefore, they do not constitute an appropriate comparison point, and by and large we can be quite certain that they will yield higher reliability than metrics based on the children themselves.  For example, a meta-analysis of infant laboratory tasks, including test-retest data for sound discrimination, word recognition, and prosodic processing, found that the meta-analytic weighted average was not different from zero, suggesting that performance in these short laboratory tasks may not be captured in a stable way across testing days. Thus, parental report (or short lab studies) may not be the most appropriate comparisons for our own study.
+First, we looked for measures of language development used with observational data that can be employed with children aged 0-3 years, and which are available at least in English. To this end, we searched scholar.google.com and google.com with keywords like ("language development" AND "standardized test" AND "infancy"), complementing these searches with our own knowledge. All of the instruments we found rely on reports from caregivers, including the Child Observation Record Advantage (Schweinhart, McNair, & Larner, 1993); the Desired Results Developmental Profile (**REF missing fix**); and the MacArthur-Bates Communicative Development Inventory (MB-CDI, Fenson et al., 1994). Readers are likely most familiar with the MB-CDI, so we focus our discussion on this instrument here, to illustrate the level of test-retest reliability such instruments can attain. In the work most commonly cited in relation to this instrument's creation, Fenson et al. (1994) reported a correlation of r=.95 in their sample of North American, monolingual infants. Since that report appeared over 20 years ago, and the instrument has since been used with other samples and languages, a more relevant comparison point for our own study would be one that takes into account more attempts at using it. We did not find a systematic review or meta-analysis of test-retest estimates for the MB-CDI. However, Frank et al. (2021) analyzed data archived in an MB-CDI repository, concentrating on two corpora for which longitudinal data were available, American English and Norwegian. They found that, for both datasets, correlations within 2-4 months were above r=.8, with correlations at 16 months of distance (i.e., a form filled in at 8 and 24 months) at their lowest, r=.5. These correlations are very high when considering that the MB-CDI tends to have ceiling and floor effects at these extreme ages. Alcock, Meints, and Rowland (n.d.) looked at correlations when British English-speaking parents completed two versions of the form in two media (e.g., short form on paper and long form online, or vice versa) within a month. Here, the correlation was r=.8 for comprehension and r=.6 for production. Thus, although evidence is scant, it suffices to show that the MB-CDI can lead to high correlations, nearly comparable to those expected in standardized testing among adults. It is worth bearing in mind that test-retest reliability in parental report measures depends not only on the consistency of the individual infants' relative scores for a given behavior, but also on the consistency of the adult in reporting it. Moreover, such reports are based on cumulative experience, rather than a one-shot observation, as in the case of long-form recordings. Therefore, they do not constitute an appropriate comparison point, and by and large we can be quite certain that they will yield higher reliability than metrics based on the children themselves. In short, parental reports may not be the most appropriate comparison for our own study because they may overestimate reliability in the child's own behavior.
 
-Second, we did a bibliographic search for systematic reviews of test-retest reliability of standardized instruments to measure language development up to age three years that are available at least in English. Although reliability in these tests may also partially reflect consistency in the adults' reports, they are at least based on a one-shot observation of how the child behaves, rather than the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (REF) and GSED (REF), and thus we cannot include them in the present summary.
+<!-- For example, a meta-analysis of infant laboratory tasks, including test-retest data for sound discrimination, word recognition, and prosodic processing, found that the meta-analytic weighted average was not different from zero, suggesting that performance in these short laboratory tasks may not be captured in a stable way across testing days. -->
+
+Second, using search methods similar to those mentioned above, we did a bibliographic search for systematic reviews of test-retest reliability of standardized instruments to measure language development up to age three years that are available at least in English. Although reliability in these tests may also partially reflect consistency in the adults' reports, they are at least based on a one-shot observation of how the child behaves, rather than the adult's cumulative experience with the child. Note that some promising tests are currently in development or have only recently been released, including the NIH Baby Toolbox (**I THINK REF IS WEIRD, LOOK UP AGAIN**) and the GSED (**REF MISSING**), and thus we cannot include them in the present summary.
 
 The Ages and Stages Questionnaire (ASQ) is a screening tool to measure infants' and children's development based on observation of age-specific behaviors: For instance, at 6 months, in the domain of communication, one of the items asks whether the child smiles. The ASQ's reliability has been the object of a systematic review of independent evidence (Velikonja et al., 2017). Across 10 articles with data from children in the USA, Canada, China, Brazil, India, Chile, the Netherlands, Korea, and Turkey, only three articles reported test-retest correlations, two in the USA (r=.84-1) and one in Turkey (r=.67). However, the meta-analysis authors judged these three studies to be "poor" in quality. Moreover, the ASQ is a questionnaire that an observer fills in, but it can also be administered as a parental questionnaire, reducing its comparability with our purely observational method.
 
-For the other tests, reliability is mainly available from reports by the companies commercializing the test. The Goldman Fristoe Articulation Test – 3rd edition (GFTA-3) (REF) focuses on the ability to articulate certain sounds through picture-based elicitation and can be used from two to five years of age. Available in English and Spanish for a USA context, it has a reported reliability of r=.92 -- although we do not know whether it is this high for two-year-olds specifically.  The Preschool Language Scales – 5th edition can be used to measure both comprehension and production from birth to 7 years of age. According to Pearson's report, its test-retest reliability is .86 to .95, depending on the age bracket (0;0-2;11, 3;0-4;11, and 5;0-7;11, 0-7 years). The test has also been adapted to other languages, with good reported test-retest reliability (r=.93; Turkish, Sahli & Belgin, 2017). One issue we see with both of these reports is that children tested varied greatly in age, and the correlation seems to have been calculated based on the raw (rather than the normed) score. As a result, children's relative scores may have been stable over test and retest mainly because they varied greatly in age, rather than capturing variance from more meaningful individual differences.. The Expressive Vocabulary Test – 2nd Edition (EVT-2) is a picture-based elicitation method that can be used with children from 2.5 years of age. The company Springer reports a test-retest reliability of r=.95 by age (REF).
+For the other tests, reliability is mainly available from reports by the companies commercializing the test. The Goldman Fristoe Articulation Test – 3rd edition (GFTA-3) (**REF MISSING**) focuses on the ability to articulate certain sounds through picture-based elicitation and can be used from two to five years of age. Available in English and Spanish for a USA context, it has a reported reliability of r=.92 -- although we do not know whether it is this high for two-year-olds specifically. The Preschool Language Scales – 5th edition (PLS-5) (**REF MISSING**) can be used to measure both comprehension and production from birth to 7 years of age. According to Pearson's report (**REF MISSING**), its test-retest reliability is .86 to .95, depending on the age bracket (0;0-2;11, 3;0-4;11, and 5;0-7;11). The test has also been adapted to other languages, with good reported test-retest reliability (r=.93; Turkish, Sahli & Belgin, 2017). One issue we see with both of these reports is that the children tested varied greatly in age, and the correlation seems to have been calculated based on the raw (rather than the normed) score. As a result, children's relative scores may have been stable over test and retest mainly because they varied greatly in age, rather than capturing variance from more meaningful individual differences. The Expressive Vocabulary Test – 2nd Edition (EVT-2) (**REF MISSING**) is a picture-based elicitation method that can be used with children from 2.5 years of age. The company Springer reports a test-retest reliability of r=.95 by age (**REF MISSING**).
 
-Many other standardized tests exist for children over three years of age. Given that most children included in the present study were under three, these other tests do not constitute an ideal comparison point. Nonetheless, in the spirit of informativeness, it is useful to consider that a systematic review and meta-analysis has been done looking at all psychometric properties (internal consistency, reliability, measurement error, content and structural validity, convergent and discriminant validity) for standardized assessments targeted at children aged 4-12 years (REF). Out of 76 assessments found in the literature, only 15 could be evaluated for their psychometric properties, and 14 reported on reliability based on independent evidence (i.e., the researchers tested and retested children, rather than relying on the company's report of reliability). Among these, correlations for test-retest reliability averaged r=.67, with a range from .35 to .76. The authors concluded that psychometric quality was limited for all assessments, but based on the available evidence, PLS-5 (whose test-retest reliability was r=.69) was among those recommended for use.
+Many other standardized tests exist for children over three years of age. Given that most children included in the present study were under three, these other tests do not constitute an ideal comparison point. Nonetheless, in the spirit of informativeness, it is useful to consider a systematic review and meta-analysis that examined all psychometric properties (internal consistency, reliability, measurement error, content and structural validity, convergent and discriminant validity) of standardized assessments targeted at children aged 4-12 years (**REF MISSING**). Out of 76 assessments found in the literature, only 15 could be evaluated for their psychometric properties, and 14 reported on reliability based on independent evidence (i.e., the researchers tested and retested children, rather than relying on the company's report of reliability). Among these, correlations for test-retest reliability averaged r=.67, with a range from .35 to .76. The authors concluded that psychometric quality was limited for all assessments, but based on the available evidence, the PLS-5 (whose test-retest reliability was r=.69) was among those recommended for use.
 
-Third, and perhaps most relevant, we looked for references that evaluated the psychometric properties of measures extracted from wearable data. We found no previous work attempting to do so on the basis of completely ecological, unconstrained data like ours. The closest references we could find reported on reliability and/or validity of measurements from wearable data collected in constrained situations, such as having 4.5 year old children wear interior sensors and asking them to complete four tests of balance (e.g., standing with their eyes closed; Liu et al., 2022). It is likely that consistency and test-retest reliability are higher in such cases than in data like ours, making it hard to compare. Nonetheless, to give an idea, a recent meta-analysis of wearable inertial sensors in healthy adults found correlations between these instruments and gold standards above r= .88 for one set of measures (based on means) but much lower for another (based on variability, max weighted mean effect r = .58). Regarding test-retest reliability, the meta-analysts report ICCs above .6 for all measures for which they could find multiple studies reporting them. However, those authors point out that the majority of the included studies were classified as low quality, according to a standardized quality assessment for that work. 
+Third, and perhaps most relevant, we looked for references that evaluated the psychometric properties of measures extracted from wearable data. We found no previous work attempting to do so on the basis of completely ecological, unconstrained data like ours. The closest references we could find reported on the reliability and/or validity of measurements from wearable data collected in constrained situations, such as having 4.5-year-old children wear inertial sensors and asking them to complete four tests of balance (e.g., standing with their eyes closed; Liu et al., 2022). It is likely that consistency and test-retest reliability are higher in such cases than in data like ours, making it hard to compare. Nonetheless, to give an idea, a recent meta-analysis of wearable inertial sensors in healthy adults (Kobsar et al., 2020) found correlations between these instruments and gold standards above r=.88 for one set of measures (based on means) but much lower for another (based on variability, max weighted mean effect r=.58). Regarding test-retest reliability, the meta-analysts report ICCs above .6 for all measures for which they could find multiple studies reporting them. However, those authors point out that the majority of the included studies were classified as low quality, according to a standardized quality assessment for that work.
 
 
 
@@ -315,7 +317,7 @@ cor_t=t.test(rval_tab$m ~ rval_tab$data_set)
 
 ```
 
-> To see whether correlations in this analysis differed by talker types and pipelines, we fit a linear model with the formula $lm(cor ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children; and pipeline LENA or ACLEW. Although the model was overall significant (F We found an adjusted R-squared of `r round(reg_sum_cor$adj.r.squared*100)`%, suggesting this model did not explain a great deal of variance in correlation coefficients. A Type 3 ANOVA on this model revealed a significant effect of pipeline (F = `r round(reg_anova_cor["data_set","F value"],2)`, p = `r round(reg_anova_cor["data_set","Pr(>F)"],2)`), due to higher correlations for ACLEW (`r r_msds["aclew","x"]`) than for LENA metrics (m = `r r_msds["lena","x"]`). 
+> To see whether correlations in this analysis differed by talker types and pipelines, we fit a linear model with the formula $lm(cor ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, or other children; and pipeline LENA or ACLEW. The model was overall significant (F(`r round(reg_sum_cor$fstatistic["numdf"])`, `r round(reg_sum_cor$fstatistic["dendf"])`) = `r round(reg_sum_cor$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum_cor$adj.r.squared*100)`%, suggesting this model did not explain a great deal of variance in correlation coefficients. A Type 3 ANOVA on this model revealed a significant effect of pipeline (F = `r round(reg_anova_cor["data_set","F value"],2)`, p = `r round(reg_anova_cor["data_set","Pr(>F)"],2)`), due to higher correlations for ACLEW (`r r_msds["aclew","x"]`) than for LENA metrics (`r r_msds["lena","x"]`).
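+
+As a reproducibility note, the overall-test values interpolated above come from the `fstatistic` slot of `summary.lm()`, a named vector with elements `value`, `numdf`, and `dendf`; R does not store the associated p-value, but it can be derived with `pf()`. A minimal sketch, assuming `reg_sum_cor` is the `summary()` of the fitted model:
+
+```{r}
+# fstatistic is a named vector: value (F), numdf (model df), dendf (residual df)
+f <- reg_sum_cor$fstatistic
+p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # upper tail
+```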
 
 See table below for results of the Type 3 ANOVA.
 
@@ -475,7 +477,7 @@ rownames(msds_p)<-msds_p$Group.1
 ```
 
 
-> Next, we explored how similar Child ICCs were across different talker types and pipelines. We fit a linear model with the formula $lm(icc\_child\_id ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children; and pipeline LENA or ACLEW. We found an adjusted R-squared of `r round(reg_sum$adj.r.squared*100)`%, suggesting much of the variance across Child ICCs was explained by these factors. A Type 3 ANOVA on this model revealed type was a signficant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. The main effect of type emerged because output metrics tended to have higher Child ICC (`r msds["Output","x"]`)  than those associated to adults in general (`r msds["Adults","x"]`), females (`r msds["Female","x"]`), and males (`r msds["Male","x"]`); whereas those associated with other children had even higher Child ICCs (`r msds["Other children","x"]`). The main effect of pipeline arose because of slightly higher Child ICCs for the ACLEW metrics (`r msds_p["aclew","x"]`) than for LENA metrics (`r msds_p["lena","x"]`). 
+> Next, we explored how similar Child ICCs were across different talker types and pipelines. We fit a linear model with the formula $lm(icc\_child\_id ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, or other children; and pipeline LENA or ACLEW. The model was overall significant (F(`r round(reg_sum$fstatistic["numdf"])`, `r round(reg_sum$fstatistic["dendf"])`) = `r round(reg_sum$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum$adj.r.squared*100)`%, suggesting much of the variance across Child ICCs was explained by these factors. A Type 3 ANOVA on this model revealed type was a significant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. The main effect of type emerged because output metrics tended to have higher Child ICC (`r msds["Output","x"]`) than those associated with adults in general (`r msds["Adults","x"]`), females (`r msds["Female","x"]`), and males (`r msds["Male","x"]`); whereas those associated with other children had even higher Child ICCs (`r msds["Other children","x"]`).
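+
+The group means interpolated above (e.g., `msds["Output","x"]`) appear to be produced with base R's `aggregate()`, whose default output names the grouping column `Group.1` and the summary column `x` (consistent with `rownames(msds_p)<-msds_p$Group.1` in the chunk above). A minimal sketch of that pattern, with `df.icc` standing in as a hypothetical input data frame:
+
+```{r}
+# hypothetical input: one row per metric, with columns Type and icc_child_id
+msds <- aggregate(df.icc$icc_child_id, by = list(df.icc$Type), FUN = mean)
+rownames(msds) <- msds$Group.1   # aggregate() names the grouping column Group.1
+msds["Output", "x"]              # mean Child ICC for output metrics ("x" = value)
+```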
 
 
 ## SM O: Code to reproduce Table 4
@@ -553,7 +555,7 @@ reg_anova_age_icc=Anova(age_icc)
 
 ```
 
-> To interrogate these results statistically, and assess whether Child ICCs tended to be higher or lower in certain age bins, we fit a linear model with the formula $lm(Child_ICC ~ type * pipeline * age_bin)$. We found an adjusted R-squared of `r round(reg_sum_age_icc$adj.r.squared*100)`%, suggesting this model explained about a third of the variance in Child ICC.  A Type 3 ANOVA on this model revealed type was a signficant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), whereas as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. 
+> To interrogate these results statistically, and assess whether Child ICCs tended to be higher or lower in certain age bins, we fit a linear model with the formula $lm(Child\_ICC ~ type * pipeline * age\_bin)$. The model was overall significant (F(`r round(reg_sum_age_icc$fstatistic["numdf"])`, `r round(reg_sum_age_icc$fstatistic["dendf"])`) = `r round(reg_sum_age_icc$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum_age_icc$adj.r.squared*100)`%, suggesting this model explained about a third of the variance in Child ICC. A Type 3 ANOVA on this model revealed type was a significant predictor (F(`r reg_anova_age_icc["Type","Df"]`) = `r round(reg_anova_age_icc["Type","F value"],1)`, p<.001), as was pipeline (F(`r reg_anova_age_icc["data_set","Df"]`) = `r round(reg_anova_age_icc["data_set","F value"],1)`, p = `r round(reg_anova_age_icc["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant.
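+
+A caveat on the Type 3 label: `car::Anova()` defaults to Type II tests (the hunk context above shows `reg_anova_age_icc=Anova(age_icc)` with no `type` argument), and Type III tests are only interpretable when factors use sum-to-zero contrasts set before the model is fit. If Type 3 tests are intended, a sketch of the needed setup would be as follows (the data frame name `df.icc.age` and the column `age_bin` are hypothetical):
+
+```{r}
+library(car)
+# sum-to-zero contrasts must be in place BEFORE fitting for valid Type III tests
+options(contrasts = c("contr.sum", "contr.poly"))
+age_icc <- lm(icc_child_id ~ Type * data_set * age_bin, data = df.icc.age)
+reg_anova_age_icc <- Anova(age_icc, type = 3)
+```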
 
 See table below for results of the Type 3 ANOVA.
 
@@ -631,7 +633,7 @@ panel.background = element_blank(), legend.key=element_blank(), axis.line = elem
 ## SM V: Code to reproduce text below Figure 8
 
 ```{r reg model corpus}
-
+# note, there is a warning here because some of the corpus names contain a dash. This does not affect the results
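+# If the dash warning ever needs to be avoided, one option (not applied here,
+# since the results are unaffected) could be to sanitize corpus labels into
+# syntactic R names before fitting, e.g.:
+# df.icc.corpus$corpus <- make.names(df.icc.corpus$corpus)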
 
 cor_icc <- lm(icc_child_id ~ Type * data_set * corpus, data=df.icc.corpus) 
 #plot(cor_icc)
@@ -643,7 +645,7 @@ reg_anova_cor_icc=Anova(cor_icc)
 
 ```
 
-> The fact that we cannot infer reliability from one corpus based on another one was confirmed statistically: We checked whether Child ICC differed by talker types and pipelines across corpora by fitting a linear model with the formula $lm(Child_ICC ~ type * pipeline * corpus)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children;  pipeline LENA or ACLEW; and corpus the corpus ID. We found an adjusted R-squared of `r round(reg_sum_cor_icc$adj.r.squared*100)`%, suggesting this model explained nearly half of the variance in Child ICC. A Type 3 ANOVA on this model revealed several significant effects and interactions, including a three-way interaction of type, pipeline, and corpus  (F(`r reg_anova_cor_icc["Type:data_set:corpus","Df"]`) = `r round(reg_anova_cor_icc["Type:data_set:corpus","F value"],1)`, p<.001); a two-way interaction of type and corpus  (F(`r reg_anova_cor_icc["data_set:corpus","Df"]`) = `r round(reg_anova_cor_icc["data_set:corpus","F value"],1)`, p<.001); and a main effect of corpus (F(`r reg_anova_cor_icc["corpus","Df"]`) = `r round(reg_anova_cor_icc["corpus","F value"],1)`, p<.001). 
+> The fact that we cannot infer reliability from one corpus based on another one was confirmed statistically: We checked whether Child ICC differed by talker types and pipelines across corpora by fitting a linear model with the formula $lm(Child\_ICC ~ type * pipeline * corpus)$, where type indicates whether the measure pertained to the key child, (female/male) adults, or other children; pipeline LENA or ACLEW; and corpus the corpus ID. The model was overall significant (F(`r round(reg_sum_cor_icc$fstatistic["numdf"])`, `r round(reg_sum_cor_icc$fstatistic["dendf"])`) = `r round(reg_sum_cor_icc$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum_cor_icc$adj.r.squared*100)`%, suggesting this model explained nearly half of the variance in Child ICC. A Type 3 ANOVA on this model revealed several significant effects and interactions, including a three-way interaction of type, pipeline, and corpus (F(`r reg_anova_cor_icc["Type:data_set:corpus","Df"]`) = `r round(reg_anova_cor_icc["Type:data_set:corpus","F value"],1)`, p<.001); a two-way interaction of pipeline and corpus (F(`r reg_anova_cor_icc["data_set:corpus","Df"]`) = `r round(reg_anova_cor_icc["data_set:corpus","F value"],1)`, p<.001); and a main effect of corpus (F(`r reg_anova_cor_icc["corpus","Df"]`) = `r round(reg_anova_cor_icc["corpus","F value"],1)`, p<.001).
 
 See table below for results of the Type 3 ANOVA.
 
@@ -713,6 +715,48 @@ bias_tab$rec_per_corpus<-bias_tab$rec_per_corpus/sum(bias_tab$rec_per_corpus)
 
 Another potential negative contribution to reliability that is currently not discussed is variability in the experimental setup. In a corpus collected in the Solomon Islands, children wore two recorders simultaneously. These were USB devices, sourced from two different providers. In this dataset, the duration of the recordings could be very different within the same pair (a ~10% difference was not atypical), which means that what is actually recorded is somewhat random in itself. Even comparing identical audio ranges covered by both recordings of each pair, the corresponding ACLEW metrics differed slightly; they were strongly correlated (R^2 was close to 0.95) but not perfectly so. This suggests that randomness in the recorders' properties and their placement may also contribute to a decrease in reliability. Importantly, this variability is not at all due to changing underlying conditions, as both recorders captured the exact same day, so it is not due to variability in underlying behaviors. It is also not due to algorithmic variation, because the ACLEW algorithms are deterministic. Thus, this variability is only due to hardware differences and potentially also differences in, e.g., USB placement.
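+
+To make the within-pair comparison concrete, the agreement described above could be computed with a few lines of R. This is a hypothetical sketch: the data frame `pairs_df` and its columns `metric_usb1`/`metric_usb2` (the same ACLEW metric from each device over the identical audio range) are invented for illustration.
+
+```{r}
+# hypothetical data: one row per recording pair, same ACLEW metric per device
+fit_pair <- lm(metric_usb2 ~ metric_usb1, data = pairs_df)
+summary(fit_pair)$r.squared                       # close to 0.95 in these data
+cor(pairs_df$metric_usb1, pairs_df$metric_usb2)   # Pearson r between devices
+```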
 
+## SM Z: References
+
+Alcock, K., Meints, K., & Rowland, C. (n.d.). The UK Communicative Development Inventories.
+
+Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., & Pethick, S. J. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, 59(5), 1–173; discussion 174–185.
+
+Frank, M. C., Braginsky, M., Yurovsky, D., & Marchman, V. A. (2021). Variability and consistency in early language learning: The Wordbank Project. MIT Press.
+
+Kobsar, D., Charlton, J. M., Tse, C. T. F., Esculier, J.-F., Graffos, A., Krowchuk, N. M., Thatcher, D., & Hunt, M. A. (2020). Validity and reliability of wearable inertial sensors in healthy adult walking: A systematic review and meta-analysis. Journal of NeuroEngineering and Rehabilitation, 17(1), 62. https://doi.org/10.1186/s12984-020-00685-3
+
+Liu, R., Zhang, R., Qu, Y., Jin, W., Dong, B., Liu, Y., & Mao, L. (2022). Reliability analysis of inertial sensors for testing static balance of 4-to-5-year-old preschoolers. Gait & Posture, 92, 176–180. https://doi.org/10.1016/j.gaitpost.2021.11.029
+
+Sahli, A. S., & Belgin, E. (2017). Adaptation, validity, and reliability of the Preschool Language Scale–Fifth Edition (PLS–5) in the Turkish context: The Turkish Preschool Language Scale–5 (TPLS–5). International Journal of Pediatric Otorhinolaryngology, 98, 143–149. https://doi.org/10.1016/j.ijporl.2017.05.003
+
+Schweinhart, L. J., McNair, S., Barnes, H., & Larner, A. M. (1993). Observing young children in action to assess their development: The High/Scope Child Observation Record study. Educational and Psychological Measurement, 53(2), 445–455. https://doi.org/10.1177/0013164493053002014
+
+Velikonja, T., Edbrooke-Childs, J., Calderon, A., Sleed, M., Brown, A., & Deighton, J. (2017). The psychometric properties of the Ages & Stages Questionnaires for ages 2–2.5: A systematic review. Child: Care, Health and Development, 43(1), 1–17. https://doi.org/10.1111/cch.12397
+
+**LOOK UP AGAIN / references still missing:**
+
+- NIH Infant and Toddler (Baby) Toolbox | Blueprint. (n.d.). Retrieved March 16, 2023, from https://neuroscienceblueprint.nih.gov/resources-tools/blueprint-resources-tools-library/nih-infant-and-toddler-baby-toolbox
+- GSED (**REF MISSING**)
+- Desired Results Developmental Profile (**REF missing fix**)
+- Goldman Fristoe Articulation Test – 3rd edition (GFTA-3) (**REF MISSING**)
+- Preschool Language Scales – 5th edition (PLS-5) (**REF MISSING**)
+- Pearson's report on PLS-5 test-retest reliability (**REF MISSING**)
+- Expressive Vocabulary Test – 2nd Edition (EVT-2) (**REF MISSING**)
+- Springer's report of EVT-2 test-retest reliability (r=.95 by age) (**REF MISSING**)
+- Systematic review and meta-analysis of psychometric properties of standardized assessments for children aged 4–12 years (**REF MISSING**)
+
 
 ```{r}
 # Save information about packages used