# Updates

* *04/03/2021*: Added `normative` and `normative_criterion` to metadata as one child is non-normative.

# Conversion to ChildProject

## Preprocessing

### File renaming

All files (MP4/MP3/WAV/CHA) were prefixed with the child's name using the following snippet.

```bash
PREFIX="child_id"; EXT="wav"; for FILENAME in *.${EXT}; do mv "$FILENAME" "${PREFIX}_${FILENAME}"; done;
```

### MP4/MP3 file conversion

The audio track of each MP4 (video) file was extracted to WAV, and each MP3 file was converted to WAV. In all cases, extraction/resampling was set up to produce 16 kHz mono WAV files.

```bash
# -exec passes each path as a single argument, so filenames containing spaces are handled correctly
find . -type f -name "*.mp3" -exec sh -c 'ffmpeg -i "$1" -ac 1 -ar 16000 "${1%.mp3}.wav"' _ {} \;
find . -type f -name "*.mp4" -exec sh -c 'ffmpeg -i "$1" -ac 1 -ar 16000 "${1%.mp4}.wav"' _ {} \;
```

### CHA `@Date` extraction

The recording dates of all CHA files were extracted from their `@Date` header using the following snippet:

```bash
find . -type f -name "*.cha" -exec sh -c "echo -n {}' '; cat {} | grep Date | sed -e 's/^@Date://' | tr -d '\n'; echo" \;
```

The extracted dates were then dumped to a CSV file and converted to the ISO date format (`YYYY-MM-DD`) expected by ChildProject.
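
The date conversion can be sketched as follows. This is a minimal illustration, not the script that was actually used: it assumes the `@Date` values follow the CHAT `DD-MMM-YYYY` convention (e.g. `20-DEC-2002`), and `chat_date_to_iso` is a hypothetical helper name.

```python
from datetime import date

# English month abbreviations as used in CHAT headers. A fixed table avoids
# depending on the system locale (unlike strptime's %b directive).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def chat_date_to_iso(chat_date: str) -> str:
    """Convert a CHAT-style date such as '20-DEC-2002' to '2002-12-20'."""
    # strip() also removes the tab that follows '@Date:' in CHA headers
    day, month, year = chat_date.strip().split("-")
    return date(int(year), MONTHS[month.upper()], int(day)).isoformat()


print(chat_date_to_iso("20-DEC-2002"))  # prints 2002-12-20
```

Building a `datetime.date` rather than reformatting the string means that impossible dates (e.g. `31-FEB-2002`) raise an error instead of being written silently to the metadata.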

### `recordings` structure

* *recordings/original/CHILD_ID/MP4* and *recordings/original/CHILD_ID/MP3*: original MP4/MP3 files for each child
* *recordings/raw/CHILD_ID/wav*: extracted WAV files

### Manual corrections

* *Lily/Lily_011107.wav* & *Lily/Lily_011107.cha*: updated date from 2001-12-20 to 2002-12-20 (obviously a typo)
* *Alex/Alex_010700.cha* ("Dummy file to permit playback from the TalkBank browser"): the recording date was inferred from the filename because it is missing from the CHA file
* *Violet/Violet_030200.cha* ("Dummy file to permit playback from the TalkBank browser"): the recording date was inferred from the filename because it is missing from the CHA file

## Importation

Annotations were imported using *./scripts/import_annotations.py*. With minor modifications, this script should work with any data set, regardless of its organisation.

### Duration

Recording durations were computed using ChildProject's command-line tool:

```bash
child-project compute-durations .
```
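
To check a duration independently, the length of a PCM WAV file can be read with the Python standard library. This sketch is not ChildProject's implementation: `wav_duration_seconds` is a hypothetical helper, and returning `0.0` for unreadable files is a choice made here for illustration.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds, or 0.0 if it cannot be read."""
    try:
        with wave.open(path, "rb") as wav:
            # Duration = number of frames / frames per second
            return wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError, OSError, ZeroDivisionError):
        # Corrupted, truncated, or missing file, or a header with a zero sample rate
        return 0.0
```

A result of `0.0` for a file that should contain audio is a hint that the WAV file, or the MP4/MP3 file it was extracted from, is damaged.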

### VTC/ALICE

VTC annotations were computed through ALICE (commit [f7962f46615a6a433f0da5398f61282d9961c101](https://github.com/orasanen/ALICE/tree/f7962f46615a6a433f0da5398f61282d9961c101)).

### VCM

The VCM annotations were created with the following command, using vcm at commit [71cec64eff8563956e67a20c473834d53634eb68](https://github.com/LAAC-LSCP/vcm/tree/71cec64eff8563956e67a20c473834d53634eb68):

```bash
python ./src/vcm.py -a ~/DATA/LSFER/providence/recordings/raw/ -r ~/DATA/LSFER/providence/annotations/vtc/raw/ -s ~/PACKAGES/opensmile/bin/linux_x64_standalone_static/SMILExtract -o ~/DATA/LSFER/providence/annotations/vcm/raw --keep-other
```

### Failures

* Duration

  * *Violet/Violet_030200.wav*: failed as the MP4 file (and resulting WAV file) is corrupted (duration automatically set to 0)

* CHA

  * *Alex/Alex_010700.cha*: importation failed (expected as the file is a dummy CHA file with no transcriptions)
  * *Violet/Violet_030200.cha*: importation failed (expected as the file is a dummy CHA file with no transcriptions)

* ALICE

  Sylnet failed for the following files:

  * *Violet/Violet_020527.txt*
  * *Violet/Violet_020707.txt*
  * *Violet/Violet_020724.txt*
  * *Violet/Violet_020807.txt*
  * *Violet/Violet_021126.txt*
  * *Violet/Violet_030400.txt*
  * *Violet/Violet_030600.txt*

  It seems that portions of these WAV files are corrupted and contain non-finite values (`np.nan` or `np.inf`), which causes librosa to reject the audio buffer:

  ```python
  Traceback (most recent call last):
    File "/scratch2/whavard/PACKAGES/ALICE/SylNet/run_SylNet.py", line 104, in <module>
      X[i] = np.transpose(20*np.log10(librosa.feature.melspectrogram(y=y, sr=Fs, n_mels=24, n_fft=w_l, hop_length=w_h)))
    File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/feature/spectral.py", line 2004, in melspectrogram
      pad_mode=pad_mode,
    File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/core/spectrum.py", line 2519, in _spectrogram
      pad_mode=pad_mode,
    File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/core/spectrum.py", line 217, in stft
      util.valid_audio(y)
    File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/util/utils.py", line 310, in valid_audio
      raise ParameterError("Audio buffer is not finite everywhere")
  librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
  ```
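
One way to confirm this diagnosis is to scan the decoded samples for non-finite values before running SylNet. The sketch below uses only the standard library and a hypothetical helper, `non_finite_runs`. It assumes the samples have already been loaded as floats (for instance with `librosa.load`), in which case `numpy.isfinite` would be the vectorized equivalent.

```python
import math
from typing import Iterable, List, Tuple


def non_finite_runs(samples: Iterable[float]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of consecutive NaN/inf samples."""
    runs = []
    start = None  # index where the current non-finite run began, if any
    i = -1        # keeps `i + 1` valid below even when `samples` is empty
    for i, value in enumerate(samples):
        if not math.isfinite(value):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        # The input ended in the middle of a non-finite run
        runs.append((start, i + 1))
    return runs


audio = [0.0, 0.1, float("nan"), float("inf"), -0.2, float("-inf")]
print(non_finite_runs(audio))  # prints [(2, 4), (5, 6)]
```

Dividing the returned indices by the sampling rate (16000 for the WAV files produced above) gives the time offsets of the damaged portions in seconds.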