Finished pass on results; now waiting for the new metrics and corpora to finalize them

alecristia 1 year ago
37 13

@@ -489,7 +489,7 @@ kable(reg_anova_cor)
-## Code to reproduce "Exploratory analyses: Reliability within corpus"
Code to reproduce text and figures in "Exploratory analyses: Reliability within corpus"
 ```{r read in icc by corpus}
@@ -497,6 +497,8 @@ df.icc.corpus$Type <- get_type(df.icc.corpus)
Figure 5A addresses this question, showing the distribution of ICC across our 53 metrics in each of the `r length(levels(factor(df.icc.corpus$corpus)))` included corpora.  Out of `r dim(df.icc.corpus)[1]` fitted models (53 metrics times `r length(levels(factor(df.icc.corpus$corpus)))` corpora), `r sum(df.icc.corpus$formula=="no_chi_effect")` were singular when including a random intercept per child, and therefore they could not be included in these analyses at all, and the remaining `r sum(df.icc.corpus$formula=="no_exp")` were singular when including a random intercept per corpus.
 ```{r icc-bycor-fig5A, echo=F,fig.width=4, fig.height=10,fig.cap="Child ICC by metric type and pipeline, when considering each corpus separately."}
@@ -535,7 +537,7 @@ ggplot(r_X_corpus, aes(y = cor, x = corpusA)) +
-```{r reg model cor}
+```{r reg model corpusm}
 cor_icc <- lm(icc_child_id ~ Type * data_set * corpus, data=df.icc.corpus) 
@@ -548,15 +550,17 @@ reg_anova_cor_icc=Anova(cor_icc)
-We checked whether Child ICC differed by talker types and pipelines across corpora by fitting a linear model with the formula $lm(Child_ICC ~ type * pipeline * corpus)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children;  pipeline LENA or ACLEW; and corpus the corpus ID. We found an adjusted R-squared of `r round(reg_sum_cor_icc$adj.r.squared*100)`%, suggesting this model explained over half of the variance in Child ICC. A Type 3 ANOVA on this model revealed several significant effects and interactions, including a three-way interaction of type, pipeline, and corpus.
The fact that we cannot infer reliability from one corpus based on another one was confirmed statistically: We checked whether Child ICC differed by talker types and pipelines across corpora by fitting a linear model with the formula $lm(Child_ICC ~ type * pipeline * corpus)$, where type indicates whether the measure pertained to the key child, (female/male) adults, other children;  pipeline LENA or ACLEW; and corpus the corpus ID. We found an adjusted R-squared of `r round(reg_sum_cor_icc$adj.r.squared*100)`%, suggesting this model explained over half of the variance in Child ICC. A Type 3 ANOVA on this model revealed several significant effects and interactions, including a three-way interaction of type, pipeline, and corpus; a two-way interaction of type and corpus; and a main effect of corpus. See the Supplementary Materials for more information.
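
The model check described in this paragraph can be sketched as follows. This is a minimal illustration on simulated data, not the project's `df.icc.corpus`: the `toy` data frame, its factor levels, and the seed are assumptions made up for the example. One point worth checking against the real chunk: `car::Anova()` computes Type II tests unless `type = 3` is passed, and Type 3 tests are usually paired with sum-to-zero contrasts.

```r
library(car)  # provides Anova(); assumed to be installed

set.seed(1)

# Hypothetical stand-in for df.icc.corpus:
# 3 talker types x 2 pipelines x 4 corpora, 5 metrics per cell
toy <- expand.grid(
  Type = c("CHI", "FEM", "OCH"),
  data_set = c("aclew", "lena"),
  corpus = paste0("corpus", 1:4),
  rep = 1:5,
  stringsAsFactors = TRUE
)
toy$icc_child_id <- runif(nrow(toy), min = 0, max = 1)

# Same model structure as in the text: type * pipeline * corpus,
# with sum-to-zero contrasts so Type 3 main effects are interpretable
fit <- lm(
  icc_child_id ~ Type * data_set * corpus,
  data = toy,
  contrasts = list(Type = contr.sum, data_set = contr.sum, corpus = contr.sum)
)

# Adjusted R-squared as a percentage, as reported inline in the text
adj_r2_pct <- round(summary(fit)$adj.r.squared * 100)

# Explicitly request Type 3 tests (the default would be Type II)
anova_t3 <- Anova(fit, type = 3)

print(adj_r2_pct)
print(anova_t3)
```

Because the toy outcome is random noise, no effect is expected to be significant here; the sketch only shows the shape of the call and of the resulting table.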
-```{r print out anova results rec on cor}
+```{r print out anova results rec on icc by corpus}
-## Code to reproduce "Exploratory analyses: Reliability across age groups"
Code to reproduce text and figures in "Exploratory analyses: Reliability across age groups"
 ```{r prepAge}
@@ -568,20 +572,24 @@ df.icc.age$age_bin<-factor(df.icc.age$age_bin,levels=age_levels)
Out of `r dim(df.icc.age)[1]` fitted models (53 metrics times `r length(levels(factor(df.icc.age$age_bin)))` age bins), `r sum(df.icc.age$formula=="no_chi_effect")` were singular when including a random intercept per child, and therefore they could not be included in these analyses at all. In addition, `r sum(df.icc.age$formula=="no_exp")` were singular when including a random intercept per corpus. The remaining `r sum(df.icc.age$formula=="full")` could be analyzed with the full model.
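
The counts quoted inline above all come from the `formula` column, which records which model specification could be fitted. A minimal sketch of that bookkeeping, using a made-up `toy` data frame in place of `df.icc.age` (the metric names and row values are assumptions for illustration):

```r
# Hypothetical stand-in for df.icc.age: one row per fitted model
toy <- data.frame(
  metric = paste0("m", 1:6),
  formula = c("full", "full", "no_exp", "no_chi_effect", "full", "no_exp")
)

# Total number of fitted models (metrics times age bins in the real data)
n_models <- nrow(toy)

# Count each outcome; fixing the levels keeps zero counts visible
counts <- table(factor(toy$formula,
                       levels = c("full", "no_exp", "no_chi_effect")))

print(n_models)
print(counts)

# The three categories should partition the fitted models
stopifnot(sum(counts) == n_models)
```

Checking that the three counts add up to the total guards against a model being silently left out of the reported numbers.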
 ```{r relBYage-fig6A, echo=F,fig.width=6, fig.height=10,fig.cap="Distribution of ICC attributed to corpus (a) and children (b), when binning children's age."}
 #this complicated section is just to add N of participants in each facet, we first estimate it:
 for(thisage in levels(df.icc.age$age_bin)){#thisage="(0,6]"
-  if(min(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T) !=max(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T)){
-    facet_labels = c(facet_labels,paste0("N chi=",paste(range(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T),collapse="-"))) 
-  } else facet_labels = c(facet_labels,paste0("N chi=",min(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T))) 
+  facet_labels_cor = c(facet_labels_cor,paste0("N cor=",min(df.icc.age$ncor[df.icc.age$age_bin==thisage],na.rm=T))) #checked: there is no variation across metrics in n of corpora included
+    if(min(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T) !=max(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T)){
+    facet_labels_chi = c(facet_labels_chi,paste0("N chi=",paste(range(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T),collapse="-"))) 
+  } else {
+    facet_labels_chi = c(facet_labels_chi,paste0("N chi=",min(df.icc.age$nchi[df.icc.age$age_bin==thisage],na.rm=T))) 
+  }
-#and then we structure it so that it goes ont he plot
+#and then we structure it so that it goes on the plot
@@ -589,12 +597,28 @@ ggplot(df.icc.age, aes(y = icc_child_id, x = toupper(data_set))) +
   geom_violin(alpha = 0.5) +
   geom_quasirandom(aes(colour = Type,shape = Type)) +  
   theme(legend.position="none") +labs( y = "r",x="Pipeline") + facet_wrap(~age_bin, ncol = 3) +
-  geom_text(x=1.5,y=max(df.icc.age$icc_child_id,na.rm=T),aes(label=label),data=f_labels,size=2)
+  geom_text(x=1.5,y=max(df.icc.age$icc_child_id,na.rm=T),aes(label=facet_labels_chi),data=f_labels,size=2) +
+  geom_text(x=1.5,y=max(df.icc.age$icc_child_id,na.rm=T)*.95,aes(label=facet_labels_cor),data=f_labels,size=2)
+```{r reg model age}
+age_icc <- lm(icc_child_id ~ Type * data_set * age_bin, data=df.icc.age) 
+#binomial could be used,  diagnostic plots look good
As we did in the previous section for corpus, we checked whether Child ICC differed by talker types and pipelines across age bins by fitting a linear model with the formula $lm(Child_ICC ~ type * pipeline * age_bin)$. We found an adjusted R-squared of `r round(reg_sum_age_icc$adj.r.squared*100)`%, suggesting this model explained over half of the variance in Child ICC. However, a Type 3 ANOVA on this model revealed only an interaction of type and age bin, as well as a main effect of age bin, suggesting less complex effects than in the case of corpus. See the Supplementary Materials for more information.
 ```{r icc-bycor-fig6B, echo=F,fig.width=4, fig.height=4,fig.cap="Correlations in Child ICC across corpora. Each point indicates the correlation in Child ICC for the corpus named in the x-axis with every other corpus."}

File diff suppressed because it is too large
+ 508 - 501

+ 23 - 22

