LAAC-LSCP/tsay


Loann Peurey bfe4c4bd6c merge vtc and alice		2 weeks ago
.datalad	398c579655 [DATALAD] new dataset	2 years ago
annotations	bfe4c4bd6c merge vtc and alice	2 weeks ago
extra	31965c3d42 [DATALAD] Recorded changes	2 years ago
metadata	bfe4c4bd6c merge vtc and alice	2 weeks ago
recordings	5239ca87e8 move wav converted files to the right format and into converted/standard	1 year ago
scripts	bafc25a722 unannex scripts	1 year ago
.gitattributes	159773a59f change gitattributes	1 year ago
README.md	c08bea3ae3 Ajout de 'README.md'	2 years ago

This is a ChildProject version of the Tsay corpus: https://phonbank.talkbank.org/access/Chinese/Taiwanese/Tsay.html

The Tsay corpus was not downloaded exactly as it is presented on the website, because some recordings, or some information in the annotations, are missing. Missing recordings: 'CEY_020226.wav', 'HBL_031102.wav', 'HBL_040003.wav', 'LMC_030009.wav', 'LMC_030021.wav', 'LMC_030115.wav', 'LMC_030205.wav', 'LMC_030226.wav', 'LMC_030324.wav', 'LMC_030412.wav', 'LMC_030503.wav', 'LMC_030525.wav', 'LMC_030614.wav', 'LMC_030707.wav', 'LMC_030804.wav', 'LMC_030825.wav', 'LMC_030925.wav', 'LMC_031004.wav', 'LMC_031025.wav', 'LMC_031122.wav', 'LMC_040015.wav', 'LMC_040101.wav', 'LMC_040115.wav', 'LMC_040129.wav', 'LMC_040212.wav', 'LMC_040226.wav', 'LMC_040310.wav', 'LMC_040323.wav', 'LMC_040419.wav', 'LMC_040505.wav', 'LMC_040607.wav', 'LMC_040802.wav', 'TWX_010805.wav', 'TWX_020901.wav'.
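A list like the one above could be re-derived by comparing annotation and recording file names. The following is only a minimal sketch, assuming that each .cha transcript shares its base name with its .wav recording; the function name `missing_recordings` and the file name `HYS_010101` are hypothetical and not part of the scripts in this repository.

```python
from pathlib import Path


def missing_recordings(annotation_names, recording_names):
    """Return the annotation files that have no recording with the same base name."""
    # Base names (without extension) of all available recordings.
    recorded = {Path(name).stem for name in recording_names}
    # Keep the annotations whose base name has no matching recording.
    return sorted(
        name for name in annotation_names if Path(name).stem not in recorded
    )


if __name__ == "__main__":
    annotations = ["HYS_010101.cha", "CEY_020226.cha"]  # hypothetical listing
    recordings = ["HYS_010101.wav"]
    print(missing_recordings(annotations, recordings))
```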

Time stamps are missing in the following annotations: CEY_022226.cha, HBL_031102.cha, HBL_040003.cha, all LMC files (except LMC_050321.cha), LYC_020810.cha, LYC_021124.cha, TWX_010721.cha, TWX_010805.cha, TWX_020901.cha, and all files of LJX, WZX, YCX, YJK, YDA, YSW, and ZQM. None of these files were downloaded from the corpus website, and none are included in the LAAC_Tsay corpus.
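Whether a transcript carries time stamps can be checked automatically. The sketch below assumes the usual CHAT convention in which a time mark is written as `start_end` in milliseconds between two U+0015 control characters; older transcripts may use a different bullet form, and the function name `has_time_stamps` is hypothetical rather than taken from the scripts in this repository.

```python
import re

# A CHAT time mark: start_end in milliseconds, enclosed in U+0015 characters.
# Assumption: the transcripts use this bullet form.
BULLET = re.compile("\x15(\\d+)_(\\d+)\x15")


def has_time_stamps(cha_text: str) -> bool:
    """Return True if the transcript text contains at least one time mark."""
    return BULLET.search(cha_text) is not None


if __name__ == "__main__":
    with_marks = "*CHI:\tma1 ma0 . \x1512345_13000\x15\n"
    without_marks = "*CHI:\tma1 ma0 .\n"
    print(has_time_stamps(with_marks), has_time_stamps(without_marks))
```

Searching the whole text rather than individual lines also catches marks that sit on continuation lines of a long utterance.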

After preprocessing, the corpus includes data from the children CEY, HBL, HYS, LWJ, LYC, and TWX, plus one recording of the child LMC, for a total of 490 recordings.

To recreate a ChildProject version of the Tsay corpus, run main.py (in scripts) with the following arguments: --corpus /path/to the directory where you would like to create the DataLad folder with your corpus \ --url /link/to the corpus on phonbank.talkbank.org

Then validate your corpus with child-project and add the recording durations with the following command: $ child-project compute-durations /path/to/dataset

Generate a dataframe for the bulk import of annotations by running dataframe_for_ann_importation.py with the argument --corpus /path/to your DataLad folder, then run the bulk import: child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv
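The steps above can be chained from a single Python script. This is a sketch under several assumptions: it is run from the repository root so that `scripts/` resolves, the `validate` subcommand of child-project is used for the validation step (the README does not spell that command out), and the function names, the example URL and the dataframe path are placeholders that must be replaced with real values.

```python
import subprocess


def build_commands(corpus, url, dataframe):
    """Return the command lines for each step, in the order described above."""
    return [
        # Step 1: create the dataset from the PhonBank corpus.
        ["python", "scripts/main.py", "--corpus", corpus, "--url", url],
        # Step 2: validate the dataset (assumed subcommand, see lead-in).
        ["child-project", "validate", corpus],
        # Step 3: add recording durations to the metadata.
        ["child-project", "compute-durations", corpus],
        # Step 4: build the dataframe used for the bulk import.
        ["python", "scripts/dataframe_for_ann_importation.py", "--corpus", corpus],
        # Step 5: import the annotations listed in the dataframe.
        ["child-project", "import-annotations", corpus, "--annotations", dataframe],
    ]


def run_pipeline(corpus, url, dataframe):
    """Run every step and stop at the first one that fails."""
    for command in build_commands(corpus, url, dataframe):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("/path/to/dataset", "<corpus url>", "/path/to/dataframe.csv")` would execute the five steps in order; `check=True` makes a failing step raise an exception instead of letting later steps run on an incomplete dataset.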
