
with all numbers checked

alecristia 4 months ago
parent
commit
43e1b80649
7 changed files with 97 additions and 93 deletions
  1. 13 10
      CODE/SM.Rmd
  2. 59 57
      CODE/SM.html
  3. 25 26
      CODE/SM.log
  4. BIN
      CODE/SM.pdf
  5. BIN
      CODE/fig5.png
  6. BIN
      CODE/fig8.png
  7. BIN
      CODE/fig9.png

+ 13 - 10
CODE/SM.Rmd

@@ -379,7 +379,7 @@ best_metric$icc_child_id=round(best_metric$icc_child_id,2)
 ```
 
 
-> Figure 5 shows the distribution of Child ICC across all `r dim(df.icc.mixed)[1]` metrics, separately for each pipeline. The majority of measures had Child ICCs between .3 and .5. `r sum(df.icc.mixed$icc_child_id > .5)` measures had Child ICCs higher or equal to .5. Surprisingly, the top 6 metrics in terms of Child ICC corresponded to the "other child" category, known to have the worst accuracy according to previous analyses (Cristia et al., 2020). In an analysis fully reported in the SM, we find some evidence that this may be due to the presence versus absence of siblings. The next metric with the highest Child ICC corresponded to an output measure, namely the total vocalization duration per hour extracted from ACLEW annotations (`r best_metric[best_metric$Type=="Output",c("metric","data_set")]`), with a Child ICC of `r best_metric[best_metric$Type=="Output","icc_child_id"]`. Among adult metrics, the average vocalization duration for female vocalizations for ACLEW (`r best_metric[best_metric$Type=="Female",c("metric","data_set")]`) and the ACLEW equivalent of CTC had the highest Child ICC (`r best_metric[best_metric$Type=="Female","icc_child_id"]` and `r best_metric[best_metric$Type=="Adults","icc_child_id"]`, respectively). 
+> Figure 5 shows the distribution of Child ICC across all `r dim(df.icc.mixed)[1]` metrics, separately for each pipeline. The majority of measures had Child ICCs between .3 and .5. `r sum(df.icc.mixed$icc_child_id > .5)` measures had Child ICCs higher or equal to .5. Surprisingly, the top 6 metrics in terms of Child ICC corresponded to the "other child" category, known to have the worst accuracy according to previous analyses (Cristia et al., 2020). In an analysis fully reported in supplementary materials (SM K), we find some evidence that this may be due to the presence versus absence of siblings. The next metric with the highest Child ICC corresponded to an output measure, namely the total vocalization duration per hour extracted from ACLEW annotations (`r best_metric[best_metric$Type=="Output",c("metric","data_set")]`), with a Child ICC of `r best_metric[best_metric$Type=="Output","icc_child_id"]`. Among adult metrics, the average vocalization duration for female vocalizations for ACLEW (`r best_metric[best_metric$Type=="Female",c("metric","data_set")]`) and the ACLEW equivalent of CTC had the highest Child ICC (`r best_metric[best_metric$Type=="Female","icc_child_id"]` and `r best_metric[best_metric$Type=="Adults","icc_child_id"]`, respectively). 
 
 ## SM K: Are high Child ICCs for "other child" measures due to number or presence of siblings? (Exploration)
 
@@ -429,7 +429,7 @@ As in the sibling number analysis, the full model was singular, so we fitted a N
 
 Among ACLEW measures, a fair number of them come from VCM, a module that classifies child vocalizations in terms of vocal maturity types into cry, canonical, and non-canonical categories. In unpublished analyses, we have found that VCM labels are inaccurate when compared to human labels of the same vocalizations, relatively to other metrics. In this analysis, we checked whether VCM-derived measures had lower Child ICC than other ACLEW measures. As shown in the next Figure, this was not the case: Some output measures from the ACLEW pipeline have lower Child ICC than VCM ones.
 
-```{r}
+```{r,fig.cap="Figure SM L. Violin plot reflecting the distribution of Child ICC for ACLEW VCM versus other ACLEW or LENA metrics."}
 vcm_type<-rep("Other ACLEW",dim(df.icc.mixed)[1])
 vcm_type[df.icc.mixed$data_set=="lena"]<-"LENA"
 vcm_type[grep("lp",df.icc.mixed$metric)]<-"ACLEW VCM"
@@ -449,13 +449,13 @@ panel.background = element_blank(), legend.key=element_blank(), axis.line = elem
 
 ## SM M: Code to reproduce Figure 5
 
-```{r icc-allexp-fig5, echo=F,fig.width=4, fig.height=3,fig.cap="Figure 5 (reproduced). Violin plot reflecting the distribution of Child ICC."}
+```{r icc-allexp-fig5, echo=F,fig.width=6, fig.height=3,fig.cap="Figure 5 (reproduced). Violin plot reflecting the distribution of Child ICC."}
 
 
 fig5 <- ggplot(df.icc.mixed, aes(y = icc_child_id, x = toupper(data_set))) +
   geom_violin(alpha = 0.5) +
   geom_quasirandom(aes(colour = Type,shape = Type)) +  
-  labs( y = "Child ICC",x="Pipeline") +  theme(text = element_text(size = 20)) + 
+  labs( y = "Child ICC",x="Pipeline") +  theme(text = element_text(size = 16)) + 
   theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
 panel.background = element_blank(), legend.key=element_blank(), axis.line = element_line(colour = "black")) 
 
@@ -488,7 +488,7 @@ rownames(msds_p)<-msds_p$Group.1
 ```
 
 
-> Next, we explored how similar Child ICCs were across different talker types and pipelines. We fit a linear model with the formula $lm(icc\_child\_id ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children; and pipeline LENA or ACLEW. The model was overall significant (F(`r round(reg_sum$fstatistic["dendf"],2)`) = `r round(reg_sum$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum$adj.r.squared*100)`%, suggesting much of the variance across Child ICCs was explained by these factors. A Type 3 ANOVA on this model revealed type was a signficant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. The main effect of type emerged because output metrics tended to have higher Child ICC (`r msds["Output","x"]`)  than those associated to adults in general (`r msds["Adults","x"]`), females (`r msds["Female","x"]`), and males (`r msds["Male","x"]`); whereas those associated with other children had even higher Child ICCs (`r msds["Other children","x"]`). The main effect of pipeline arose because of slightly higher Child ICCs for the ACLEW metrics (`r msds_p["aclew","x"]`) than for LENA metrics (`r msds_p["lena","x"]`). 
+> Next, we explored how similar Child ICCs were across different talker types and pipelines. We fit a linear model with the formula $lm(icc\_child\_id ~ type * pipeline)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children; and pipeline LENA or ACLEW. The model was overall significant (F(`r round(reg_sum$fstatistic["dendf"],2)`) = `r round(reg_sum$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum$adj.r.squared*100)`%, suggesting much of the variance across Child ICCs was explained by these factors. A Type 3 ANOVA on this model revealed type was a significant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. The main effect of type emerged because output metrics tended to have higher Child ICC (`r msds["Output","x"]`)  than those associated to adults in general (`r msds["Adults","x"]`), females (`r msds["Female","x"]`), and males (`r msds["Male","x"]`); whereas those associated with other children had even higher Child ICCs (`r msds["Other children","x"]`). The main effect of pipeline arose because of slightly higher Child ICCs for the ACLEW metrics (`r msds_p["aclew","x"]`) than for LENA metrics (`r msds_p["lena","x"]`). 
 
 
 ## SM O: Code to reproduce Table 4
@@ -569,7 +569,10 @@ reg_anova_age_icc=Anova(age_icc)
 
 ```
 
-> To interrogate these results statistically, and assess whether Child ICCs tended to be higher or lower in certain age bins, we fit a linear model with the formula $lm(Child_ICC ~ type * pipeline * age_bin)$. The model was overall significant (F(`r round(reg_sum_age_icc$fstatistic["dendf"],2)`) = `r round(reg_sum_age_icc$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum_age_icc$adj.r.squared*100)`%, suggesting this model explained about a third of the variance in Child ICC.  A Type 3 ANOVA on this model revealed type was a signficant predictor (F(`r reg_anova["Type","Df"]`) = `r round(reg_anova["Type","F value"],1)`, p<.001), whereas as was pipeline (F(`r reg_anova["data_set","Df"]`) = `r round(reg_anova["data_set","F value"],1)`, p = `r round(reg_anova["data_set","Pr(>F)"],3)`); the interaction between type and pipeline was not significant. 
+> To interrogate these results statistically, and assess whether Child ICCs tended to be higher or lower in certain age bins, we fit a linear model with the formula $lm(Child\_ICC ~ type * pipeline * age\_bin)$. The model was overall significant (F(`r round(reg_sum_age_icc$fstatistic["dendf"],2)`) = `r round(reg_sum_age_icc$fstatistic["value"],2)`, p < .001). We found an adjusted R-squared of `r round(reg_sum_age_icc$adj.r.squared*100)`%, suggesting this model explained more than a third of the variance in Child ICC.  A Type 3 ANOVA on this model revealed that the interactions between type and pipeline, pipeline and age, and the three-way interaction (type, pipeline, age) were not significant. However, both the type by age bin interaction (F(`r reg_anova_age_icc["Type:age_bin","Df"]`) = `r round(reg_anova_age_icc["Type:age_bin","F value"],1)`, p < .001) and the three main effects were significant (
+type: F(`r reg_anova_age_icc["Type","Df"]`) = `r round(reg_anova_age_icc["Type","F value"],1)`, p < .001; 
+age: F(`r reg_anova_age_icc["age_bin","Df"]`) = `r round(reg_anova_age_icc["age_bin","F value"],1)`, p < .001; 
+pipeline: F(`r reg_anova_age_icc["data_set","Df"]`) = `r round(reg_anova_age_icc["data_set","F value"],1)`, p = .01).
 
 See table below for results of the Type 3 ANOVA.
 
@@ -627,7 +630,7 @@ ggsave("fig7.png", plot = fig7, width = 4, height = 4, units = "in")
 ## SM U: Code to reproduce Figure 8
 
 
-```{r icc-bycor-fig8, echo=F,fig.width=4, fig.height=4,fig.cap="Figure 8 (reproduced). Child ICC by metric type and pipeline, when considering each corpus separately."}
+```{r icc-bycor-fig8, echo=F,fig.width=8, fig.height=4,fig.cap="Figure 8 (reproduced). Child ICC by metric type and pipeline, when considering each corpus separately."}
 
 facet_labels_chi = paste0("N chi=",chiXcor)
 
@@ -647,7 +650,7 @@ panel.background = element_blank(), legend.key=element_blank(), axis.line = elem
 
 fig8
 
-ggsave("fig8.png", plot = fig8, width = 4, height = 4, units = "in")
+ggsave("fig8.png", plot = fig8, width = 8, height = 4, units = "in")
 
 ```
 
@@ -678,7 +681,7 @@ kable(round(reg_anova_cor_icc,2),caption="Type 3 ANOVA on model attempting to ex
 
 ## SM W: Code to reproduce Figure 9
 
-```{r icc-bycor-fig9, echo=F,fig.width=4, fig.height=4,fig.cap="Figure 9 (reproduced). Correlations in Child ICC across corpora."}
+```{r icc-bycor-fig9, echo=F,fig.width=8, fig.height=4,fig.cap="Figure 9 (reproduced). Correlations in Child ICC across corpora."}
 
 
 
@@ -707,7 +710,7 @@ panel.background = element_blank(), legend.key=element_blank(), axis.line = elem
 
 fig9
 
-ggsave("fig9.png", plot = fig9, width = 4, height = 4, units = "in")
+ggsave("fig9.png", plot = fig9, width = 8, height = 4, units = "in")
 ```
 
 ## SM X: Code to reproduce text in the Discussion section

File diff suppressed because it is too large
+ 59 - 57
CODE/SM.html


+ 25 - 26
CODE/SM.log

@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live 2021) (preloaded format=pdflatex 2021.11.25)  12 DEC 2023 15:43
+This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live 2021) (preloaded format=pdflatex 2021.11.25)  14 DEC 2023 16:20
 entering extended mode
  restricted \write18 enabled.
  %&-line parsing enabled.
@@ -661,53 +661,52 @@ Package pdftex.def Info: SM_files/figure-latex/r-fig4-1.pdf  used on input line
 File: SM_files/figure-latex/unnamed-chunk-2-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/unnamed-chunk-2-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/unnamed-chunk-2-1.pdf  used on i
-nput line 556.
+nput line 559.
 (pdftex.def)             Requested size: 452.69014pt x 315.17673pt.
-<SM_files/figure-latex/icc-allexp-fig5-1.pdf, id=152, 272.01625pt x 206.7725pt>
-
+<SM_files/figure-latex/icc-allexp-fig5-1.pdf, id=152, 417.56pt x 206.7725pt>
 File: SM_files/figure-latex/icc-allexp-fig5-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/icc-allexp-fig5-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/icc-allexp-fig5-1.pdf  used on i
-nput line 564.
-(pdftex.def)             Requested size: 272.02388pt x 206.77829pt.
-[8 <./SM_files/figure-latex/unnamed-chunk-2-1.pdf>] [9 <./SM_files/figure-latex
-/icc-allexp-fig5-1.pdf>]
-<SM_files/figure-latex/relBYage-fig6-1.pdf, id=188, 421.575pt x 712.6625pt>
+nput line 570.
+(pdftex.def)             Requested size: 417.5717pt x 206.77829pt.
+<SM_files/figure-latex/relBYage-fig6-1.pdf, id=153, 421.575pt x 712.6625pt>
 File: SM_files/figure-latex/relBYage-fig6-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/relBYage-fig6-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/relBYage-fig6-1.pdf  used on inp
-ut line 637.
+ut line 643.
 (pdftex.def)             Requested size: 384.77885pt x 650.45953pt.
 
-LaTeX Warning: Float too large for page by 35.97395pt on input line 640.
+LaTeX Warning: Float too large for page by 35.97395pt on input line 646.
 
-<SM_files/figure-latex/icc-bycor-fig7-1.pdf, id=189, 283.0575pt x 279.0425pt>
+[8] [9 <./SM_files/figure-latex/unnamed-chunk-2-1.pdf> <./SM_files/figure-latex
+/icc-allexp-fig5-1.pdf>]
+<SM_files/figure-latex/icc-bycor-fig7-1.pdf, id=190, 283.0575pt x 279.0425pt>
 File: SM_files/figure-latex/icc-bycor-fig7-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/icc-bycor-fig7-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/icc-bycor-fig7-1.pdf  used on in
-put line 693.
+put line 702.
 (pdftex.def)             Requested size: 283.06543pt x 279.05032pt.
-<SM_files/figure-latex/icc-bycor-fig8-1.pdf, id=190, 289.08pt x 268.00125pt>
+<SM_files/figure-latex/icc-bycor-fig8-1.pdf, id=191, 568.1225pt x 269.005pt>
 File: SM_files/figure-latex/icc-bycor-fig8-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/icc-bycor-fig8-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/icc-bycor-fig8-1.pdf  used on in
-put line 718.
-(pdftex.def)             Requested size: 289.07928pt x 268.00058pt.
+put line 727.
+(pdftex.def)             Requested size: 469.7731pt x 222.43672pt.
 [10] [11 <./SM_files/figure-latex/relBYage-fig6-1.pdf>] [12 <./SM_files/figure-
 latex/icc-bycor-fig7-1.pdf> <./SM_files/figure-latex/icc-bycor-fig8-1.pdf>]
-<SM_files/figure-latex/icc-bycor-fig9-1.pdf, id=244, 283.0575pt x 279.0425pt>
+<SM_files/figure-latex/icc-bycor-fig9-1.pdf, id=245, 572.1375pt x 279.0425pt>
 File: SM_files/figure-latex/icc-bycor-fig9-1.pdf Graphic file (type pdf)
 <use SM_files/figure-latex/icc-bycor-fig9-1.pdf>
 Package pdftex.def Info: SM_files/figure-latex/icc-bycor-fig9-1.pdf  used on in
-put line 780.
-(pdftex.def)             Requested size: 283.06543pt x 279.05032pt.
+put line 789.
+(pdftex.def)             Requested size: 469.75816pt x 229.1101pt.
 [13 <./SM_files/figure-latex/icc-bycor-fig9-1.pdf>] [14] [15] (./SM.aux) ) 
 Here is how much of TeX's memory you used:
- 14222 strings out of 478994
- 213348 string characters out of 5858183
+ 14225 strings out of 478994
+ 213378 string characters out of 5858183
  567492 words of memory out of 5000000
  31199 multiletter control sequences out of 15000+600000
- 458561 words of font info for 137 fonts, out of 8000000 for 9000
+ 458644 words of font info for 138 fonts, out of 8000000 for 9000
  1141 hyphenation exceptions out of 8191
  85i,9n,88p,1305b,397s stack positions out of 5000i,500n,10000p,200000b,80000s
 {/usr
@@ -721,10 +720,10 @@ ocal/texlive/2021/texmf-dist/fonts/type1/public/lm/lmmi7.pfb></usr/local/texliv
 e/2021/texmf-dist/fonts/type1/public/lm/lmr10.pfb></usr/local/texlive/2021/texm
 f-dist/fonts/type1/public/lm/lmr17.pfb></usr/local/texlive/2021/texmf-dist/font
 s/type1/public/lm/lmsy10.pfb>
-Output written on SM.pdf (15 pages, 519763 bytes).
+Output written on SM.pdf (15 pages, 528225 bytes).
 PDF statistics:
- 424 PDF objects out of 1000 (max. 8388607)
- 370 compressed objects within 4 object streams
- 87 named destinations out of 1000 (max. 500000)
+ 425 PDF objects out of 1000 (max. 8388607)
+ 371 compressed objects within 4 object streams
+ 88 named destinations out of 1000 (max. 500000)
  30990 words of extra memory for PDF output out of 35830 (max. 10000000)
 

BIN
CODE/SM.pdf


BIN
CODE/fig5.png


BIN
CODE/fig8.png


BIN
CODE/fig9.png