
This is a ChildProject version of the Tsay corpus: https://phonbank.talkbank.org/access/Chinese/Taiwanese/Tsay.html

The Tsay corpus was not downloaded exactly as it is presented on the website, because some recordings and some information in the annotations are missing. Missing recordings: 'CEY_020226.wav', 'HBL_031102.wav', 'HBL_040003.wav', 'LMC_030009.wav', 'LMC_030021.wav', 'LMC_030115.wav', 'LMC_030205.wav', 'LMC_030226.wav', 'LMC_030324.wav', 'LMC_030412.wav', 'LMC_030503.wav', 'LMC_030525.wav', 'LMC_030614.wav', 'LMC_030707.wav', 'LMC_030804.wav', 'LMC_030825.wav', 'LMC_030925.wav', 'LMC_031004.wav', 'LMC_031025.wav', 'LMC_031122.wav', 'LMC_040015.wav', 'LMC_040101.wav', 'LMC_040115.wav', 'LMC_040129.wav', 'LMC_040212.wav', 'LMC_040226.wav', 'LMC_040310.wav', 'LMC_040323.wav', 'LMC_040419.wav', 'LMC_040505.wav', 'LMC_040607.wav', 'LMC_040802.wav', 'TWX_010805.wav', 'TWX_020901.wav'.

Time stamps are missing in the following annotations: CEY_022226.cha; HBL_031102.cha; HBL_040003.cha; all LMC files except LMC_050321.cha; LYC_020810.cha; LYC_021124.cha; TWX_010721.cha; TWX_010805.cha; TWX_020901.cha; and all files for LJX, WZX, YCX, YJK, YDA, YSW, and ZQM. None of these files were downloaded from the corpus website, and none are included in the LAAC_Tsay corpus.
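
The exclusion rules for annotations can be expressed as a small filter. The following is a minimal sketch, not the code used in scripts: the function name `is_excluded` and the constants are hypothetical, and the file names are copied from the list above. It covers only the annotations without time stamps, not the separate list of missing recordings.

```python
from pathlib import PurePath

# Children whose annotation files all lack time stamps (from the list above).
EXCLUDED_CHILDREN = {"LJX", "WZX", "YCX", "YJK", "YDA", "YSW", "ZQM"}

# Individual annotation files without time stamps (from the list above).
EXCLUDED_FILES = {
    "CEY_022226.cha", "HBL_031102.cha", "HBL_040003.cha",
    "LYC_020810.cha", "LYC_021124.cha",
    "TWX_010721.cha", "TWX_010805.cha", "TWX_020901.cha",
}

# The only LMC annotation that has time stamps and is therefore kept.
LMC_KEPT = "LMC_050321.cha"


def is_excluded(path: str) -> bool:
    """Return True if the annotation file is left out of the corpus."""
    name = PurePath(path).name
    child = name.split("_", 1)[0]  # child code precedes the first underscore
    if child in EXCLUDED_CHILDREN:
        return True
    if child == "LMC":
        return name != LMC_KEPT
    return name in EXCLUDED_FILES


if __name__ == "__main__":
    for candidate in ("LMC_030009.cha", "LMC_050321.cha", "TWX_010721.cha"):
        print(candidate, is_excluded(candidate))
```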

After preprocessing, the corpus includes data from the children CEY, HBL, HYS, LWJ, LYC, and TWX, plus one recording of the child LMC, for a total of 490 recordings.

To recreate a ChildProject version of the Tsay corpus, run main.py (in scripts) with the following arguments: --corpus /path/to the directory where you would like to create the DataLad folder with your corpus \ --url /link/to the corpus on phonbank.talkbank.org

Then, you can validate your corpus with child-project and add recording durations with the following command: $ child-project compute-durations /path/to/dataset

Generate a dataframe for the bulk import of annotations by running dataframe_for_ann_importation.py with the argument --corpus /path/to your DataLad folder, then run the bulk import: child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv
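
For orientation, here is a rough sketch of what such a dataframe might contain. It is not the content of dataframe_for_ann_importation.py. The column names are an assumption about what import-annotations expects and should be checked against the ChildProject documentation; the set name `cha`, the helper names, and the duration value are hypothetical.

```python
import csv
import io

# Assumed input columns for the bulk import (verify against the ChildProject docs).
COLUMNS = ["set", "recording_filename", "time_seek", "raw_filename",
           "range_onset", "range_offset", "format"]


def build_rows(recordings, set_name="cha"):
    """Build one import row per recording.

    recordings: iterable of (recording_filename, duration_ms) pairs.
    Each recording is paired with the .cha file that shares its stem.
    """
    rows = []
    for recording_filename, duration_ms in recordings:
        stem = recording_filename.rsplit(".", 1)[0]
        rows.append({
            "set": set_name,
            "recording_filename": recording_filename,
            "time_seek": 0,               # no offset between audio and annotation
            "raw_filename": stem + ".cha",
            "range_onset": 0,             # annotate from the start...
            "range_offset": duration_ms,  # ...to the end of the recording
            "format": "cha",
        })
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # The duration below is a made-up value for illustration only.
    demo = build_rows([("LMC_050321.wav", 1_800_000)])
    buffer = io.StringIO()
    write_csv(demo, buffer)
    print(buffer.getvalue())
```

In this sketch the durations are passed in by hand; in practice they would come from the durations computed in the previous step.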