\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{stmaryrd}
\title{Speaker confusion models}
\author{}
\date{}
\begin{document}
\maketitle
\tableofcontents
\section{Models}
\subsection{Simple model}
This simple model assumes that confusion rates (the probabilities that the algorithm attributes a vocalization from one speaker to another speaker) depend only on the child and that, for a given pair of speakers, they are all drawn from the same distribution, regardless of the corpus (and of the surveyed population).
The simple model is defined as follows:
\begin{align}
\mu_{ij} &\sim \mathcal{U}(0,1) \\
\eta_{ij} &\sim \mathrm{Pareto}(1.5,1) \\
\alpha_{ij} &= \mu_{ij} \eta_{ij} \\
\beta_{ij} &= (1-\mu_{ij}) \eta_{ij} \\
p_{c,i,j} \mid \alpha_{ij},\beta_{ij} &\sim \mathrm{Beta}(\alpha_{ij},\beta_{ij})\label{eqref:pcij} \\
N_{k,i,j} \mid t_{k,j},p_{c,i,j} &\sim \mathrm{Binomial}(t_{k,j},p_{c,i,j})
\end{align}
where:
\begin{itemize}
\item $i$ is the speaker label returned by the diarizer (one of FEM, MAL, CHI, OCH)
\item $j$ is the speaker label assigned by the human annotator (one of FEM, MAL, CHI, OCH)
\item $k$ is a clip (i.e., a section of a recording that has been annotated by both a human and the diarizer)
\item $c$ is the key child to which a clip belongs
\item $N_{k,i,j}$ is the number of vocalizations in clip $k$ that the human attributed to $j$ and the diarizer attributed to $i$ ($i$ and $j$ may be the same category or different categories)
\item $t_{k,j}$ is the number of vocalizations that the human attributed to speaker $j$ in clip $k$, as observed in the data
\item $p_{c,i,j}$ is the probability that a vocalization from speaker $j$ triggers a detection for speaker $i$, for child $c$
\item $\alpha_{ij}$ are the $\alpha$ hyperparameters of the Beta priors
\item $\beta_{ij}$ are the $\beta$ hyperparameters of the Beta priors
\item $\mu_{ij} = \alpha_{ij}/(\alpha_{ij}+\beta_{ij})$ are the means of the Beta priors
\item $\eta_{ij} = \alpha_{ij}+\beta_{ij}$ are the effective sample sizes of the Beta priors
\end{itemize}
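As an illustration of this parameterization, the following worked example uses hypothetical values ($\mu_{ij}=0.1$ and $\eta_{ij}=20$, chosen for illustration only and not taken from any fitted model). It relies on the standard expressions for the mean and variance of a Beta distribution:
\begin{align*}
\alpha_{ij} &= \mu_{ij}\eta_{ij} = 0.1 \times 20 = 2 \\
\beta_{ij} &= (1-\mu_{ij})\eta_{ij} = 0.9 \times 20 = 18 \\
\mathbb{E}[p_{c,i,j}] &= \frac{\alpha_{ij}}{\alpha_{ij}+\beta_{ij}} = \mu_{ij} = 0.1 \\
\mathrm{Var}[p_{c,i,j}] &= \frac{\mu_{ij}(1-\mu_{ij})}{\eta_{ij}+1} = \frac{0.09}{21} \approx 0.0043
\end{align*}
For a fixed $\mu_{ij}$, a larger $\eta_{ij}$ therefore corresponds to less variation in the confusion rates across children.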
\subsection{Model with corpus bias}
We extend the previous model to include potential biases at the level of each corpus. In this model, the confusion rates are not drawn directly from a Beta distribution as in \eqref{eqref:pcij}; instead, they are shifted on the logit scale by an amount that depends on the corpus:
\begin{align}
\sigma_{i,j} &\sim \mathrm{HalfNormal}(0, 1) \\
b_{\text{corpus},i,j} &\sim \mathrm{Normal}(0, \sigma_{i,j}) \\
\pi_{c,i,j} \mid \alpha_{ij},\beta_{ij} &\sim \mathrm{Beta}(\alpha_{ij},\beta_{ij}) \\
\operatorname{logit}(p_{c,i,j}) &= \operatorname{logit}(\pi_{c,i,j}) + b_{\text{corpus},i,j}\label{eqref:pcij_bias}
\end{align}
In this model, $\pi_{c,i,j}$ (which is still drawn from a Beta distribution) captures the child-level effects, and $b_{\text{corpus},i,j}$ captures the corpus-level biases.
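Inverting the logit in \eqref{eqref:pcij_bias} gives the confusion rate explicitly, which may help to gauge the size of a given bias. The short derivation below is a sketch; the numerical values ($\pi_{c,i,j}=0.1$ and $b_{\text{corpus},i,j}=0.5$) are hypothetical and serve only as an illustration:
\begin{align*}
\frac{p_{c,i,j}}{1-p_{c,i,j}} &= \frac{\pi_{c,i,j}}{1-\pi_{c,i,j}}\, e^{b_{\text{corpus},i,j}} \\
p_{c,i,j} &= \frac{\pi_{c,i,j}\, e^{b_{\text{corpus},i,j}}}{1-\pi_{c,i,j}+\pi_{c,i,j}\, e^{b_{\text{corpus},i,j}}} \\
&= \frac{0.1 \times e^{0.5}}{0.9 + 0.1 \times e^{0.5}} \approx \frac{0.165}{1.065} \approx 0.155
\end{align*}
A bias of zero leaves the child-level rate unchanged ($p_{c,i,j}=\pi_{c,i,j}$), a positive bias raises it, and a negative bias lowers it, while $p_{c,i,j}$ always remains between 0 and 1.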
\section{Synthetic datasets}
We generate datasets under the null hypothesis, i.e., the hypothesis that the amounts of speech from the different speakers are uncorrelated:
\begin{align}
t_{c,j} \mid \lambda_{c,j} &\sim \mathrm{Poisson}(\lambda_{c,j}) \\
\lambda_{c,j} &\sim \mathrm{Gamma}(a_j, b_j)\\
c &\in \llbracket 1,n_{\text{children}}\rrbracket
\end{align}
where:
\begin{itemize}
\item $t_{c,j}$ is the true number of vocalizations from speaker $j$ for child $c$
\item $\lambda_{c,j}$ is the latent expected number of vocalizations for speaker $j$ and child $c$ (assuming 9 recorded hours per child)
\item $n_{\text{children}}$ is the number of children
\item $a_j$ and $b_j$ are parameters fitted to the distribution of speech rates derived from manual annotations of recordings made between 9~am and 6~pm.
\end{itemize}
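The moments of the generated counts follow from marginalizing over $\lambda_{c,j}$. The sketch below assumes that $a_j$ is the shape and $b_j$ the rate of the Gamma distribution; this convention is an assumption, since the parameterization is not stated here. It uses the laws of total expectation and total variance:
\begin{align*}
\mathbb{E}[t_{c,j}] &= \mathbb{E}[\lambda_{c,j}] = \frac{a_j}{b_j} \\
\mathrm{Var}[t_{c,j}] &= \mathbb{E}[\lambda_{c,j}] + \mathrm{Var}[\lambda_{c,j}] = \frac{a_j}{b_j} + \frac{a_j}{b_j^2}
\end{align*}
Under this assumption, the counts are overdispersed relative to a Poisson distribution with the same mean (marginally, they follow a negative binomial distribution). If $b_j$ were instead a scale parameter, the same reasoning would give a mean of $a_j b_j$ and a variance of $a_j b_j + a_j b_j^2$.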
We simultaneously simulate the effect of the diarization algorithm by applying the selected model to the generated datasets:
\begin{align}
v_{c,i,j} \mid t_{c,j},p_{c,i,j} &\sim \mathrm{Binomial}(t_{c,j},p_{c,i,j}) \\
v_{c,i} &= \sum_{j} v_{c,i,j}\\
c &\in \llbracket 1,n_{\text{children}}\rrbracket
\end{align}
where $p_{c,i,j}$ is sampled according to the distributions derived from the selected model (\eqref{eqref:pcij} or \eqref{eqref:pcij_bias}).
For the model that includes corpus-level bias, the user specifies the corpus whose bias is applied.
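To make the effect of the simulated confusion concrete, the expected number of detections for speaker $i$ can be written conditionally on the true counts and the confusion rates. The example below is a sketch restricted to two true speakers (CHI and FEM) for brevity; all numerical values are hypothetical, and the variance line additionally assumes that the $v_{c,i,j}$ are conditionally independent across $j$:
\begin{align*}
\mathbb{E}[v_{c,i} \mid t, p] &= \sum_{j} t_{c,j}\, p_{c,i,j} \\
\mathbb{E}[v_{c,\text{CHI}} \mid t, p] &= 100 \times 0.7 + 200 \times 0.05 = 80 \\
\mathrm{Var}[v_{c,\text{CHI}} \mid t, p] &= 100 \times 0.7 \times 0.3 + 200 \times 0.05 \times 0.95 = 30.5
\end{align*}
Here $t_{c,\text{CHI}}=100$, $t_{c,\text{FEM}}=200$, $p_{c,\text{CHI},\text{CHI}}=0.7$ and $p_{c,\text{CHI},\text{FEM}}=0.05$. In this example, 10 of the 80 expected CHI detections come from FEM vocalizations, which shows how confusion can couple the detected counts of different speakers even when the true counts are uncorrelated.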
\end{document}