

This is a ChildProject version of the Tsay corpus: https://phonbank.talkbank.org/access/Chinese/Taiwanese/Tsay.html

The Tsay corpus was not downloaded exactly as it is presented on the website, because some recordings are missing and some annotations lack information. Missing recordings: 'CEY_020226.wav', 'HBL_031102.wav', 'HBL_040003.wav', 'LMC_030009.wav', 'LMC_030021.wav', 'LMC_030115.wav', 'LMC_030205.wav', 'LMC_030226.wav', 'LMC_030324.wav', 'LMC_030412.wav', 'LMC_030503.wav', 'LMC_030525.wav', 'LMC_030614.wav', 'LMC_030707.wav', 'LMC_030804.wav', 'LMC_030825.wav', 'LMC_030925.wav', 'LMC_031004.wav', 'LMC_031025.wav', 'LMC_031122.wav', 'LMC_040015.wav', 'LMC_040101.wav', 'LMC_040115.wav', 'LMC_040129.wav', 'LMC_040212.wav', 'LMC_040226.wav', 'LMC_040310.wav', 'LMC_040323.wav', 'LMC_040419.wav', 'LMC_040505.wav', 'LMC_040607.wav', 'LMC_040802.wav', 'TWX_010805.wav', 'TWX_020901.wav'.

Time stamps are missing in the following annotations: CEY_022226.cha; HBL_031102.cha; HBL_040003.cha; all LMC files except LMC_050321.cha; LYC_020810.cha; LYC_021124.cha; TWX_010721.cha; TWX_010805.cha; TWX_020901.cha; and all files of LJX, WZX, YCX, YJK, YDA, YSW and ZQM. None of these files were downloaded from the corpus website, and none are included in the LAAC_Tsay corpus.
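
The exclusion rules above can be sketched as a small filter. This is only an illustration, not the code used in `scripts`: the function name `is_included` is hypothetical, and the identifiers are copied verbatim from the list of annotations without time stamps. The missing recordings could be handled the same way.

```python
# Hypothetical sketch of the exclusion rules; not the repository's actual code.

# Children whose files are all excluded (no time stamps in any annotation).
EXCLUDED_CHILDREN = {"LMC", "LJX", "WZX", "YCX", "YJK", "YDA", "YSW", "ZQM"}

# The single LMC session that is kept despite the child-level exclusion.
KEPT_EXCEPTIONS = {"LMC_050321"}

# Individual sessions excluded for children that are otherwise kept.
EXCLUDED_SESSIONS = {
    "CEY_022226", "HBL_031102", "HBL_040003",
    "LYC_020810", "LYC_021124",
    "TWX_010721", "TWX_010805", "TWX_020901",
}


def is_included(filename: str) -> bool:
    """Return True if a file such as 'HBL_031102.cha' should be kept."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    if stem in KEPT_EXCEPTIONS:
        return True
    child = stem.split("_", 1)[0]       # child code precedes the underscore
    return child not in EXCLUDED_CHILDREN and stem not in EXCLUDED_SESSIONS
```

For example, `is_included("LMC_050321.cha")` is true, while `is_included("LMC_030009.cha")` and `is_included("HBL_031102.cha")` are false.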

After preprocessing, the corpus includes data from the children CEY, HBL, HYS, LWJ, LYC and TWX, plus one recording of the child LMC, for a total of 490 recordings.

To recreate a ChildProject version of the Tsay corpus, run `main.py` (from `scripts`) with the following arguments: `--corpus` /path/to the directory where you would like to create the DataLad folder with your corpus, and `--url` /link/to the corpus on phonbank.talkbank.org
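
As a rough illustration of the two arguments, a command-line interface of this shape could be declared with `argparse` as below. This is a sketch under assumptions: the actual `main.py` may define its arguments differently, and `build_parser` and the help texts are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two arguments described in this README."""
    parser = argparse.ArgumentParser(
        description="Recreate a ChildProject version of the Tsay corpus."
    )
    # Directory in which the DataLad dataset holding the corpus is created.
    parser.add_argument("--corpus", required=True,
                        help="path to the directory for the DataLad dataset")
    # Address of the corpus on phonbank.talkbank.org.
    parser.add_argument("--url", required=True,
                        help="link to the corpus on phonbank.talkbank.org")
    return parser
```

Calling `build_parser().parse_args(["--corpus", "/path/to/dataset", "--url", "https://phonbank.talkbank.org/access/Chinese/Taiwanese/Tsay.html"])` returns a namespace with `corpus` and `url` attributes.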

Then you can validate your corpus with child-project and add recording durations with the following command: `child-project compute-durations /path/to/dataset`

Generate a dataframe for the bulk import of annotations by running `dataframe_for_ann_importation.py` with the argument `--corpus` /path/to your DataLad folder, then run the bulk import: `child-project import-annotations /path/to/dataset --annotations /path/to/dataframe.csv`
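
To give an idea of what such a dataframe may contain, here is a minimal sketch that writes one row per annotation file. It is not the content of `dataframe_for_ann_importation.py`. The column names and the `cha` format value are assumptions based on ChildProject's bulk-import format and should be checked against its documentation; the helper names, the set name, and the idea of passing durations as a dictionary are hypothetical.

```python
import csv
from pathlib import Path

# Assumed column names for ChildProject bulk annotation import; verify before use.
COLUMNS = ["set", "recording_filename", "time_seek", "range_onset",
           "range_offset", "raw_filename", "format"]


def build_import_rows(cha_names, durations_ms, set_name="cha"):
    """Build one row per .cha file that has a matching recording.

    cha_names: annotation file names, e.g. ["CEY_010101.cha"] (hypothetical name).
    durations_ms: mapping from recording file name to duration in milliseconds.
    """
    rows = []
    for name in sorted(cha_names):
        recording = f"{Path(name).stem}.wav"
        if recording not in durations_ms:
            continue  # skip annotations without a known recording
        rows.append({
            "set": set_name,
            "recording_filename": recording,
            "time_seek": 0,                          # no offset between audio and annotation
            "range_onset": 0,                        # annotate from the start...
            "range_offset": durations_ms[recording], # ...to the end of the recording
            "raw_filename": name,
            "format": "cha",
        })
    return rows


def write_import_csv(rows, csv_path):
    """Write the rows to a CSV file usable with --annotations."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

The resulting file would then be passed as `/path/to/dataframe.csv` in the `import-annotations` command.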