# Updates

* *04/03/2021*: Added `normative` and `normative_criterion` to metadata as one child is non-normative.

# Conversion to ChildProject

## Preprocessing

### File renaming

All files (MP4/MP3/WAV/CHA) were prefixed with the child's name using the following snippet.

```bash
PREFIX="child_id"; EXT="wav"; for FILENAME in *.${EXT}; do mv "$FILENAME" "${PREFIX}_${FILENAME}"; done;
```

### MP4/MP3 file conversion

WAV audio was extracted from each MP4 (video) file, and MP3 files were converted to WAV. In all cases, extraction/resampling was done so as to obtain 16 kHz mono WAV files.

```bash
for f in $( find . -type f -name "*.mp3" ); do ffmpeg -i "$f" -ac 1 -ar 16000 "${f%mp3}wav" ; done
for f in $( find . -type f -name "*.mp4" ); do ffmpeg -i "$f" -ac 1 -ar 16000 "${f%mp4}wav" ; done
```

### CHA `@Date` extraction

The dates of all CHA files were extracted using the following snippet:

```bash
find . -type f -name "*.cha" -exec sh -c "echo -n {}' '; cat {} | grep Date | sed -e 's/^@Date://' | tr -d '\n'; echo" \;
```

They were then dumped to a CSV file, and the dates were converted to the ISO format expected by ChildProject.

### `recordings` structure

* *recordings/original/CHILD_ID/MP4* and *recordings/original/CHILD_ID/MP3*: original MP4/MP3 files for each child
* *recordings/raw/CHILD_ID/wav*: extracted WAV files

### Manual corrections

* *Lily/Lily_011107.wav* & *Lily/Lily_011107.cha*: updated date from 2001-12-20 to 2002-12-20 (obviously a typo)
* *Alex/Alex_010700.cha* ("Dummy file to permit playback from the TalkBank browser"): the recording date was inferred from the filename as it is missing from the CHA file
* *Violet/Violet_030200.cha* ("Dummy file to permit playback from the TalkBank browser"): the recording date was inferred from the filename as it is missing from the CHA file

## Importation

Annotation importation was done using *./scripts/import_annotations.py*. This script should work with any data set, regardless of its organisation, with minor modifications.
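
As a side note on the preprocessing step above (CHA `@Date` extraction), the date conversion could be sketched as follows. This is only an illustration, not the procedure actually used: it assumes the CHA headers use the `DD-MMM-YYYY` form (e.g. `20-DEC-2002`) and that the target format is `YYYY-MM-DD`. The helper name `chat_date_to_iso` is hypothetical.

```python
from datetime import date

# Month abbreviations, assumed to be upper-case English as in CHAT headers.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def chat_date_to_iso(chat_date: str) -> str:
    """Convert a date such as '20-DEC-2002' to '2002-12-20' (hypothetical helper)."""
    # Strip the tab/newline left over from the header line before splitting.
    day, month, year = chat_date.strip().split("-")
    # date() raises ValueError for impossible dates, which helps catch typos.
    return date(int(year), MONTHS[month.upper()], int(day)).isoformat()


print(chat_date_to_iso("20-DEC-2002"))
```

An explicit month table is used instead of `strptime("%b")` so that the result does not depend on the system locale.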

### Duration

Duration was computed using ChildProject's command line tool:

```bash
child-project compute-durations .
```

### VTC/ALICE

VTC annotations were computed through ALICE (at commit [f7962f46615a6a433f0da5398f61282d9961c101](https://github.com/orasanen/ALICE/tree/f7962f46615a6a433f0da5398f61282d9961c101)).

### VCM

The following command was used to create the VCM annotations (at commit [71cec64eff8563956e67a20c473834d53634eb68](https://github.com/LAAC-LSCP/vcm/tree/71cec64eff8563956e67a20c473834d53634eb68)):

```bash
python ./src/vcm.py -a ~/DATA/LSFER/providence/recordings/raw/ -r ~/DATA/LSFER/providence/annotations/vtc/raw/ -s ~/PACKAGES/opensmile/bin/linux_x64_standalone_static/SMILExtract -o ~/DATA/LSFER/providence/annotations/vcm/raw --keep-other
```

### Failures

* Duration
    * *Violet/Violet_030200.wav*: failed as the MP4 file (and resulting WAV file) is corrupted (duration automatically set to 0)
* CHA
    * *Alex/Alex_010700.cha*: importation failed (expected as the file is a dummy CHA file with no transcriptions)
    * *Violet/Violet_030200.cha*: importation failed (expected as the file is a dummy CHA file with no transcriptions)
* ALICE: SylNet failed for the following files:
    * *Violet/Violet_020527.txt*
    * *Violet/Violet_020707.txt*
    * *Violet/Violet_020724.txt*
    * *Violet/Violet_020807.txt*
    * *Violet/Violet_021126.txt*
    * *Violet/Violet_030400.txt*
    * *Violet/Violet_030600.txt*

It seems that portions of these WAV files are corrupted and contain non-finite values (`np.nan` or `np.inf`), as suggested by the traceback below.

```python
Traceback (most recent call last):
  File "/scratch2/whavard/PACKAGES/ALICE/SylNet/run_SylNet.py", line 104, in <module>
    X[i] = np.transpose(20*np.log10(librosa.feature.melspectrogram(y=y, sr=Fs, n_mels=24, n_fft=w_l, hop_length=w_h)))
  File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/feature/spectral.py", line 2004, in melspectrogram
    pad_mode=pad_mode,
  File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/core/spectrum.py", line 2519, in _spectrogram
    pad_mode=pad_mode,
  File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/core/spectrum.py", line 217, in stft
    util.valid_audio(y)
  File "/scratch2/whavard/.conda/envs/ALICE/lib/python3.6/site-packages/librosa/util/utils.py", line 310, in valid_audio
    raise ParameterError("Audio buffer is not finite everywhere")
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
```
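
To find out which portions of a recording trigger this error, the decoded samples could be scanned for non-finite values before running SylNet. The following is a minimal sketch under stated assumptions: it expects the samples to be already decoded into a sequence of floats by whichever audio loader is used, and `non_finite_indices` is a hypothetical helper that is not part of ALICE or librosa.

```python
import math
from typing import Iterable, List


def non_finite_indices(samples: Iterable[float]) -> List[int]:
    """Return the positions of samples that are NaN or +/-inf (hypothetical helper)."""
    return [i for i, s in enumerate(samples) if not math.isfinite(s)]


# Toy buffer standing in for decoded audio samples (made-up values).
buffer = [0.0, 0.25, float("nan"), -0.5, float("inf")]
bad = non_finite_indices(buffer)
print(f"{len(bad)} non-finite sample(s) at indices {bad}")
```

If the samples are held in a NumPy array, `np.isfinite(y).all()` gives the same yes/no answer in a vectorized way.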