tsimane-glottal

This repository contains recordings of three native Tsimane' speakers collected in San Francisco de Borja (Bolivia) between September and November 2022. The data was collected by William N. Havard (WNH) in coordination with Camila Scaff and Alejandrina Cristia.

The goal of this study was to collect recordings of words containing a glottal stop ⟨'⟩ in order to carry out phonetic analyses of how this sound is realised in Tsimane'. We tried to find as many minimal pairs as possible using Wayne Gill's dictionary and the dictionary compiled by the Instituto de Idioma y Cultura of the Gran Consejo Tsimane'. The list was curated by two native Tsimane' speakers in coordination with WNH, and was then recorded by two native Tsimane' speakers (and partly respoken by a third native Tsimane' speaker). In order to reduce pronunciation biases, the words/sentences to be recorded were presented in a random order. All the words were recorded in isolation, in a natural sentence, and in three carrier sentences (see below).

The data in this repository served as the basis for a perception experiment (see the paper reference below) focusing on the qui'/qui sound pair. The data from this experiment can be found here.

If you use any of the data in this repository, please cite the following article:

@inproceedings{havard:hal-03852211,
TITLE = {{A study of the production and perception of ' in Tsimane'}},
AUTHOR = {Havard, William and Scaff, Camila and Peurey, Loann and Cristia, Alejandrina},
URL = {https://hal.archives-ouvertes.fr/hal-03852211},
BOOKTITLE = {{Journ{\'e}es Jointes des Groupements de Recherche Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement Automatique des Langues (TAL)}},
ADDRESS = {Marseille, France},
EDITOR = {Becerra, Leonor and Favre, Beno{\^i}t and Gardent, Claire and Parmentier, Yannick},
PUBLISHER = {{CNRS}},
PAGES = {1-8},
YEAR = {2022},
MONTH = Nov,
KEYWORDS = {phonology ; perception ; production ; adapted lab experiments.},
PDF = {https://hal.archives-ouvertes.fr/hal-03852211/file/8189.pdf},
HAL_ID = {hal-03852211},
HAL_VERSION = {v1},
}

Data Access

Getting access to the data

To gain access to the data, please email William N. Havard or Alejandrina Cristia. You will be granted access to the data only if you agree to the points mentioned in the Data Usage section.

Re-using the dataset

Requirements

You will first need to install the ChildProject package for Python (optional) as well as DataLad. Instructions to install these packages can be found here.
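For instance, assuming a working Python environment, both packages can typically be installed with pip (package names as published on PyPI; a virtual environment is a reasonable choice):

pip install datalad        # data retrieval and versioning
pip install ChildProject   # optional, for metadata/annotation processing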

This data set was formatted as a ChildProject data set even though it does not contain any child recordings. We decided to use this format as it allowed us to quickly import and process TextGrid annotation files. Hence, metadata files are formatted according to ChildProject's standards (e.g. children.csv with a child_id column, etc.). When using this data set, keep in mind that the recordings were not made by children but by ADULTS.
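Because the data set follows ChildProject's standards, ChildProject's own tooling should work on it; for instance, assuming you installed the optional package, its validation command can be used to check that the metadata is consistent (a sketch; run it from the root of the cloned dataset):

child-project validate .   # checks metadata and file layout against ChildProject's standards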

This data set was also formatted so that the recording software that was used (Williaikuma, see below) is still able to open the recording sessions. This makes it easy to listen to the audio recordings while reading the target sentence/word at the same time.

Configuring your SSH key on GIN

This step should only be done once:

  1. Create an account on GIN (https://gin.g-node.org/) if you don't have one already

  2. Copy your SSH public key to your clipboard (usually located in ~/.ssh/id_rsa.pub). If you don't have one, please create one following these instructions (a sample command is shown after this list)

  3. In your browser, go to GIN > Your parameters > SSH keys

  4. Click on the blue "Add a key" button, then paste the content of your public key in the Content field, and submit

Your key should now appear in your list of SSH keys - you can add as many as necessary.
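For reference, a minimal sketch of generating a key pair (the email comment is a placeholder; any key type accepted by GIN will do):

ssh-keygen -t rsa -b 4096 -C "you@example.org"  # creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
cat ~/.ssh/id_rsa.pub                           # paste this output into GIN's Content field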

Installing the dataset

The next step is to clone the dataset:

datalad install git@gin.g-node.org:/William-N-Havard/tsimane-glottal.git
cd tsimane-glottal

Getting data

You can get data from a dataset using the datalad get command, e.g.:

datalad get recordings/* # download recordings
datalad get annotations/* # get converted annotations

Or:

datalad get . # get everything

You can download many files in parallel using the -J or --jobs parameters:

datalad get . -J 4 # get everything, with 4 parallel transfers

For more help with using DataLad, please refer to our cheatsheet or DataLad's own cheatsheet. If this is not enough, check DataLad's documentation and Handbook.

Fetching updates

If you are notified of changes to the data, please retrieve them by issuing the following commands:

datalad update --merge
datalad get .

Removing the data

It is important that you delete the data once your project is complete. This can be done with datalad remove:

datalad remove -r path/to/your/dataset

Maintainers

Maintainers should install the dataset from LAAC-LSCP and run the setup procedure as follows:

datalad install git@gin.g-node.org:/William-N-Havard/tsimane-glottal.git
cd tsimane-glottal
datalad run-procedure setup --public --confidential

Changes should be pushed to origin, which will trigger a push to the other remotes:

datalad push

Data Usage

The words and sentences featured in this data set were translated by two native Tsimane' speakers and recorded by three Tsimane' speakers (from now on referred to as participants). Data collection was done in accordance with the European GDPR and was approved by an ethics committee (CER U-Paris 2022-84-CRISTIA).

Both the European GDPR and the protocol that was approved by our ethics board guarantee participants the right to withdraw from this study at any time. If that happens, the recordings made by that participant may no longer be used. By using this data, you agree to comply with these regulations and to stop using the recordings of that participant. This is non-negotiable.

Because this data set uses a file-versioning system, the recordings of a participant who withdraws will be deleted from the current and future releases, but will still be accessible when checking out previous releases. You agree not to check out the releases containing these recordings and not to use them for any new experiment. These recordings are kept only to ensure the reproducibility of past experiments.

When using this data set, we require you to mention the release number of this data set in any research output you make public. This allows us to verify that only the current release is being used.
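Assuming releases correspond to git tags (the usual convention for DataLad datasets; check with the maintainers if unsure), the release you are working from can be identified with standard git commands:

git describe --tags   # report the tag (release) the current checkout derives from
git tag --list        # list all available releases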

In case you work in collaboration with other researchers, we require that at least one member of the collaboration has been granted access to the data. Other members may use the data but must agree to delete it once the collaboration is over.

By agreeing to these terms, you agree that your name, current affiliation, and email will be stored on LAAC/LSCP servers as well as on GIN (G-Node). This data will be kept private and will be handled by authorised members affiliated with the LSCP.

You are granted the right to use this data for commercial purposes, provided you comply with the data usage terms mentioned above.

Data Description

Recordings

Spoken data was collected using Williaikuma, a partial Python re-implementation of LIG-Aikuma. Participants recorded the target sentences in a quiet room using a Dell Precision 3561 computer running Ubuntu 20.04.5 LTS. Participants used JBL Quantum 300 headphones, equipped with a foam windscreen, to record the stimuli. Headsets were connected to the computer via a USB audio cable adapter. The recordings were saved as single-channel 44.1 kHz WAVE files.
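If you want to double-check the audio format after downloading, a tool such as SoX's soxi (or ffprobe) will print it; the path below assumes ChildProject's usual recordings/raw/ layout:

soxi recordings/raw/*.wav   # prints channels, sample rate, and encoding for each file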

TextGrids

The TextGrid files were generated using Williaikuma. They were further postprocessed in order to add the target word and key syllable(s) that need to be annotated (see scripts/annotate_textgrids.py).

Stimuli

The text files that were used in the application to elicit speech can be found in extra/sentences (batch_1.txt, batch_2.txt, and batch_3.txt).

sentences/summary/{batch_1.csv,batch_2.csv,batch_3.csv} contain the same data as the above in a human-readable format. More specifically, the files present the words that constitute the minimal pairs ⟨'⟩/⟨∅⟩ in isolation, embedded in a natural sentence, and in three carrier sentences (see below). Not all the words that were recorded belong to a minimal pair (i.e. in some cases, the other word of the pair did not exist, or contained a mistake), but they were recorded anyway as they feature a glottal stop (see below).

Mono-syllabic pseudo-words were also used. They can be found in extra/sentences/mono_syl_batch_1.txt.
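To take a quick look at the stimuli, fetch and preview the files (paths assumed from the description above):

datalad get extra/sentences/*                   # fetch the stimulus files
head -n 5 extra/sentences/summary/batch_1.csv   # preview the first lines of a summary file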

⟨'⟩/⟨∅⟩ Minimal pairs

Please note that we did not consider stress when deciding whether words belonged to a minimal pair or not, as we hypothesise there might be an interaction between stress and glottal stops. This might also not be the case, and thus the number of true minimal pairs may be lower (please keep that in mind when using this data). Also, there is a lot of intra- and inter-speaker variation when transcribing a word (accentuation is almost never written, nasality is inconsistently written). We (informants and WNH) tried our best to add this information, but we might have made some mistakes. Similarly, Tsimane' (presumably) distinguishes between dental and alveolar [n], [t], and [d], and this information is not shown orthographically. Hence, the minimal pairs we collected should be used with caution.

Carrier Sentences

The carrier sentences used are the following (also in extra/sentences/other/carrier_sentences.txt):

{{WORD}} mo' nash peyacdye' yu yi.
yu ra' yi {{WORD}} jeñej peyacdye'.
yu ra' yi mo' peyacdye' {{WORD}}.

These sentences translate as follows:

{{WORD}} is the word I am saying.
I will say {{WORD}} as a word.
I will say the word {{WORD}}.

The natural sentences were written by the two informants who recorded the text-elicited data.

Issues

Duplicates

Some words appear in several minimal pairs and were therefore recorded several times. We decided to keep these semi-duplicate recordings as they might be useful (e.g. to test for speaker consistency). Some pairs were also inadvertently recorded twice (all found in batch_3). These duplicate words are listed below:

  • batch_1

    • ĉó'chaqui/ĉocháqui (pair 8) and ĉó'chaqui'/ĉó'chaqui (pair 9)
    • ji'jun'taqui/ji'juntaqui (pair 16) and ji'jun'taqui'/ji'jun'taqui (pair 18)
    • ji'jun'taqui/ji'juntaqui (pair 16) and ji'juntaqui'/ji'juntaqui (pair 17)
  • batch_2

    • ji'shucaqui'/ji'shucaqui (pair 11) and ji'shu'caqui/ji'shucaqui (pair 12)
    • ji'shu'caqui'/ji'shu'caqui (pair 10) and ji'shu'caqui/ji'shucaqui (pair 12)
  • batch_3

    • tó'caqui/tocáqui (pair 15) and tó'caqui/tócaqui (pair 16)
    • duplicate pair yọcyi'/yọcyi: pairs 22 and 25
    • duplicate pair yútyi'/yútyi: pairs 23 and 26
    • duplicate pair ĉhá'baqui/ĉhabáqui: pairs 24 and 27

Isolated words

The words that do not belong to a minimal pair but were still recorded can be found in batch_3. These words are listed below:

  • jí'cham'tacsis
  • tyetyéjta'
  • vó'vodye'
  • jäjä́m'
  • cajótaqui
  • cäi'tsii'ya'
  • jambí'dyem'
  • fọjjẹyạquị
  • jí'juban'

The IDs of the sentences these words appear in are labeled pairX.wordX even though these words do not belong to minimal pairs. The word index (word1, word2) is also meaningless for these words and only reflects the index these words would have had if the first (resp. second) word of the pair had existed or had been correct.

Acknowledgements

We thank our informants, Arnulfo Cary and Manuel Roca, who manually checked each minimal pair and wrote example sentences for each of the selected words. We also thank the three native Tsimane' speakers who recorded the stimuli.

This study was approved by CER U-Paris 2022-84-CRISTIA.

Funding for this study comes from the J. S. McDonnell Foundation (Understanding Human Cognition Scholar Award) and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (ExELang, Grant agreement No. 101001095).