# tsimane-glottal

This repository contains recordings of three native [Tsimane'](https://www.ethnologue.com/language/cas) speakers collected in [San Francisco de Borja](https://www.wikiwand.com/en/San_Borja,_Bolivia) (Bolivia) in September-November 2022. The data was collected by [William N. Havard](mailto:william.havard@gmail.com) (WNH) in coordination with Camila Scaff and Alejandrina Cristia.

The goal of this study was to collect recordings of words containing a glottal stop ⟨'⟩ in order to carry out phonetic analyses of how this sound is realised in Tsimane'. We tried to find as many minimal pairs as possible using [Wayne Gill's dictionary](https://www.pueblos-originarios.ucb.edu.bo/Record/106000410) and [the dictionary](http://www.illaa.org/index.php/tsimane-peib-tsimane/) compiled by the Instituto de Idioma y Cultura of the Gran Consejo Tsimane'. The list was then curated by two native Tsimane' speakers in coordination with WNH, and recorded by two native Tsimane' speakers (and partly respoken by a third native Tsimane' speaker). To reduce pronunciation biases, the words/sentences to be recorded were presented in a random order. All the words were recorded in isolation, in a natural sentence, and in three carrier sentences (see below).

The data in this repository served as the basis for a perception experiment (see the paper reference below) focusing on the qui'/qui sound pair. The data for that experiment can be found [here](https://gin.g-node.org/William-N-Havard/tsimane-glottal-perception-qui).

If you use any of the data in this repository, please cite the following article:

```latex
@inproceedings{havard:hal-03852211,
TITLE = {{A study of the production and perception of ' in Tsimane'}},
AUTHOR = {Havard, William and Scaff, Camila and Peurey, Loann and Cristia, Alejandrina},
URL = {https://hal.archives-ouvertes.fr/hal-03852211},
BOOKTITLE = {{Journ{\'e}es Jointes des Groupements de Recherche Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement Automatique des Langues (TAL)}},
ADDRESS = {Marseille, France},
EDITOR = {Becerra, Leonor and Favre, Beno{\^i}t and Gardent, Claire and Parmentier, Yannick},
PUBLISHER = {{CNRS}},
PAGES = {1-8},
YEAR = {2022},
MONTH = Nov,
KEYWORDS = {phonology ; perception ; production ; adapted lab experiments.},
PDF = {https://hal.archives-ouvertes.fr/hal-03852211/file/8189.pdf},
HAL_ID = {hal-03852211},
HAL_VERSION = {v1},
}
```

# Data Access

## Getting access to the data

To gain access to the data, please email [William N. Havard](mailto:william.havard@gmail.com) or [Alejandrina Cristia](mailto:alecristia@gmail.com). You will be granted access to the data only if you agree to the points mentioned in section `Data Usage`.


## Re-using the dataset

### Requirements

You will first need to install the [ChildProject](https://childproject.readthedocs.io/en/latest/) package for Python (optional) as well as DataLad. Instructions to install these packages can be found [here](https://childproject.readthedocs.io/en/latest/install.html).

This data set was formatted as a ChildProject data set even though it does not contain any child recordings. We decided to use this format as it allowed us to quickly import and process TextGrid annotation files. Hence, metadata files are formatted according to ChildProject's standards (e.g. `children.csv` with `child_id` columns, etc.). **When using this data set, keep in mind that the recordings were not made by children but by ADULTS**.

This data set was formatted so that the recording software that was used ([Williaikuma](https://github.com/William-N-Havard/williaikuma), see below) is still able to open the recording sessions. This makes it easy to listen to the audio recordings while reading the target sentence/word at the same time.

### Configuring your SSH key on GIN

This step should only be done once:

0. Create an account on [GIN](https://gin.g-node.org/) if you don't have one already

1. Copy your SSH public key to your clipboard (usually located in ~/.ssh/id_rsa.pub). If you don't have one, please create one following [these instructions](https://docs.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent)
2. In your browser, go to [GIN > Your parameters > SSH keys](https://gin.g-node.org/user/settings/ssh)
3. Click on the blue "Add a key" button, then paste the content of your public key in the Content field, and submit

Your key should now appear in your list of SSH keys - you can add as many as necessary.

### Installing the dataset

The next step is to clone the dataset:

```bash
datalad install git@gin.g-node.org:/William-N-Havard/tsimane-glottal.git
cd tsimane-glottal
```

### Getting data

You can get data from a dataset using the `datalad get` command, e.g.:

```bash
datalad get recordings/* # download recordings
datalad get annotations/* # get converted annotations
```

Or:

```bash
datalad get . # get everything
```

You can download many files in parallel using the `-J` or `--jobs` parameter:

```bash
datalad get . -J 4 # get everything, with 4 parallel transfers
```

For more help with using DataLad, please refer to our [cheatsheet](https://childproject.readthedocs.io/en/latest/cheatsheet.html) or DataLad's own [cheatsheet](http://handbook.datalad.org/en/latest/basics/101-136-cheatsheet.html). If this is not enough, check DataLad's [documentation](http://docs.datalad.org/en/stable/) and [Handbook](http://handbook.datalad.org/en/latest/).

### Fetching updates

If you are notified of changes to the data, please retrieve them by issuing the following commands:

```bash
datalad update --merge
datalad get .
```

### Removing the data

It is important that you delete the data once your project is complete.
This can be done with `datalad remove`:

```bash
datalad remove -r path/to/your/dataset
```


## Maintainers

Maintainers should install the dataset from LAAC-LSCP and run the setup procedure as follows:

```bash
datalad install git@gin.g-node.org:/William-N-Havard/tsimane-glottal.git
cd tsimane-glottal
datalad run-procedure setup --public --confidential
```

Changes should be pushed to origin, which will trigger a push to the other remotes:

```bash
datalad push
```

# Data Usage

The words and sentences featured in this data set were translated by two native Tsimane' speakers and recorded by three Tsimane' speakers (from now on referred to as *participants*). Data collection was done in accordance with the European GDPR and was approved by an ethics committee (CER U-Paris 2022-84-CRISTIA).

Both the European GDPR and the protocol approved by our ethics board guarantee participants the right to withdraw from this study at any time. If that happens, **the recordings made by that participant must no longer be used**. By using this data, you agree to comply with these regulations and **to stop using the recordings of that participant**. **This is non-negotiable**.

Because this data set uses a file-versioning system, the recordings of a participant who has withdrawn will be deleted from the current and future releases, but will remain accessible by checking out previous releases. **You agree not to check out releases containing these recordings in order to use them in any new experiment**. These recordings are kept only to ensure the **reproducibility of past experiments**.

When using this data set, **we require you to state the release number of the data set in any research output you make public**. This allows us to check that only the current release is being used.

If you work in collaboration with other researchers, **we require that at least one member of the collaboration has been granted access to the data**. Other members may use the data but agree to delete it once the collaboration is over.

By agreeing to these terms, you agree that your name, current affiliation and email will be stored on LAAC/LSCP servers as well as on GIN-NODE. **This data will be kept private** and will be handled by authorised members affiliated with LSCP.

You are **granted the right to use this data for commercial purposes**, provided you comply with the data usage terms mentioned above.

# Data Description

## Recordings

Spoken data was collected using [Williaikuma](https://github.com/William-N-Havard/williaikuma), a partial Python re-implementation of LIG-Aikuma. Participants recorded the target sentences in a quiet room using a Dell Precision 3561 computer running Ubuntu 20.04.5 LTS. To record the stimuli, participants used JBL Quantum 300 headsets equipped with a foam windscreen. Headsets were connected to the computer via a USB audio cable adapter. The recordings were saved as single-channel (mono) 44.1 kHz WAVE files.
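
As a quick sanity check after downloading, the format stated above can be verified with Python's standard `wave` module. This is only a minimal sketch: the function names are illustrative, and the path in the usage comment is hypothetical rather than an actual file name from this repository.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def matches_expected_format(path, channels=1, rate=44100):
    """True if the file is single-channel and sampled at 44.1 kHz."""
    found_channels, found_rate, _ = describe_wav(path)
    return found_channels == channels and found_rate == rate


# Usage (hypothetical path, adapt it to a file you have fetched):
# print(matches_expected_format("recordings/some_session/some_file.wav"))
```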

## TextGrids

The TextGrid files were generated using Williaikuma. They were further postprocessed in order to add the target word and key syllable(s) that need to be annotated (see [`scripts/annotate_textgrids.py`](scripts/annotate_textgrids.py)).

## Stimuli

The text files that were used in the application to elicit speech are to be found in `extra/sentences` ([`batch_1.txt`](extra/sentences/williaikuma/batch_1.txt), [`batch_2.txt`](annotations/extra/williaikuma/batch_2.txt), and [`batch_3.txt`](extra/sentences/williaikuma/batch_3.txt)).

`sentences/summary/{batch_1.csv,batch_2.csv,batch_3.csv}` contain the same data as the above in a human-readable format. More specifically, the files present the words that constitute the minimal pairs ⟨'⟩/⟨∅⟩ in isolation, embedded in a natural sentence, and in three carrier sentences (see below). Not all the words that were recorded belong to a minimal pair (i.e. in some cases, the other word of the pair did not exist, or contained a mistake), but they were recorded as they feature a glottal stop (see below).

Monosyllabic pseudo-words were also used. They are to be found in [`extra/sentences/mono_syl_batch_1.txt`](annotations/sentences/williaikuma/mono_syl_batch_1.txt).

## ⟨'⟩/⟨∅⟩ Minimal pairs

Please note that **we did not consider stress when deciding whether words belonged to a minimal pair or not**, as we hypothesise there might be an interaction between stress and glottal stops. **This might also not be the case**, in which case the number of true minimal pairs is lower (please keep that in mind when using this data). Also, there is a lot of intra- and inter-speaker variation in how a word is transcribed (accentuation is almost never written, and nasality is written inconsistently). We (informants and WNH) tried our best to add this information, but we might have made some mistakes. Similarly, Tsimane' (presumably) distinguishes between dental and alveolar [n], [t], and [d], and this information is not shown orthographically. **Hence, the minimal pairs we collected should be used with caution.**
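
To make the stress caveat concrete, the sketch below tests whether two orthographic forms differ by exactly one ⟨'⟩, with or without ignoring stress. It is not the procedure we used to build the list (pairs were selected by hand). It assumes that stress is written with an acute accent and that the glottal stop is written with a plain ASCII apostrophe. The function names are illustrative.

```python
import unicodedata

GLOTTAL = "'"      # orthographic glottal stop, as written in this README
ACUTE = "\u0301"   # combining acute accent; ASSUMED here to mark stress


def normalise(word, ignore_stress):
    """Return the NFC form of `word`, optionally without acute accents.

    Only the acute accent is removed; other diacritics (ĉ, ä, ọ, ñ) are kept.
    """
    decomposed = unicodedata.normalize("NFD", word)
    if ignore_stress:
        decomposed = decomposed.replace(ACUTE, "")
    return unicodedata.normalize("NFC", decomposed)


def is_glottal_pair(word_a, word_b, ignore_stress=True):
    """True if the two words differ by exactly one glottal stop."""
    word_a = normalise(word_a, ignore_stress)
    word_b = normalise(word_b, ignore_stress)
    longer, shorter = sorted((word_a, word_b), key=len, reverse=True)
    if len(longer) != len(shorter) + 1:
        return False
    # Try deleting each glottal stop in turn from the longer word.
    return any(
        char == GLOTTAL and longer[:i] + longer[i + 1:] == shorter
        for i, char in enumerate(longer)
    )


print(is_glottal_pair("ĉó'chaqui", "ĉocháqui"))                       # stress ignored
print(is_glottal_pair("ĉó'chaqui", "ĉocháqui", ignore_stress=False))  # stress kept
```

With stress ignored, the first call treats the two forms as a pair. With stress kept, the differing position of the accent means they are not, which is exactly the situation in which the number of true minimal pairs would be lower.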

## Carrier Sentences

The carrier sentences used are the following (also in [`extra/sentences/other/carrier_sentences.txt`](annotations/sentences/other/carrier_sentences.txt)):

```
{{WORD}} mo' nash peyacdye' yu yi.
yu ra' yi {{WORD}} jeñej peyacdye'.
yu ra' yi mo' peyacdye' {{WORD}}.
```

These sentences translate as follows:

```
{{WORD}} is the word I am saying.
I will say {{WORD}} as a word.
I will say the word {{WORD}}.
```

The natural sentences were written by the two informants who recorded the text-elicited data.
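
For illustration, the three carrier templates above can be filled in for any target word with a simple string substitution. This is a minimal sketch rather than the code used by Williaikuma; the function name `fill_carriers` is hypothetical, and no capitalisation or other adjustment is applied to the inserted word.

```python
# The three carrier sentence templates, copied from the list above.
CARRIER_SENTENCES = [
    "{{WORD}} mo' nash peyacdye' yu yi.",
    "yu ra' yi {{WORD}} jeñej peyacdye'.",
    "yu ra' yi mo' peyacdye' {{WORD}}.",
]


def fill_carriers(word, templates=CARRIER_SENTENCES, placeholder="{{WORD}}"):
    """Return one sentence per template with the placeholder replaced by `word`."""
    return [template.replace(placeholder, word) for template in templates]


for sentence in fill_carriers("qui'"):
    print(sentence)
```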

# Issues

## Duplicates

Some words appear in several minimal pairs and were therefore recorded several times. We decided to keep these semi-duplicate recordings as they might be useful (e.g. to test for speaker consistency). Some pairs were also inadvertently recorded twice (all found in batch_3). These duplicate words are listed below:

* **batch_1**
  * **ĉó'chaqui**/ĉocháqui (pair 8) and ĉó'chaqui'/**ĉó'chaqui** (pair 9)
  * **ji'jun'taqui**/ji'juntaqui (pair 16) and ji'jun'taqui'/**ji'jun'taqui** (pair 18)
  * ji'jun'taqui/**ji'juntaqui** (pair 16) and ji'juntaqui'/**ji'juntaqui** (pair 17)

* **batch_2**
  * ji'shucaqui'/**ji'shucaqui** (pair 11) and ji'shu'caqui/**ji'shucaqui** (pair 12)
  * ji'shu'caqui'/**ji'shu'caqui** (pair 10) and **ji'shu'caqui**/ji'shucaqui (pair 12)

* **batch_3**
  * **tó'caqui**/tocáqui (pair 15) and **tó'caqui**/tócaqui (pair 16)
  * duplicate pair **yọcyi'/yọcyi**: pairs 22 and 25
  * duplicate pair **yútyi'/yútyi**: pairs 23 and 26
  * duplicate pair **ĉhá'baqui/ĉhabáqui**: pairs 24 and 27
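
If you need to locate such repeated words programmatically, a sketch along the following lines may help. It uses only the batch_1 pairs quoted above, typed in by hand. The dictionary layout and the function name are illustrative and do not reflect the format of the files in this repository.

```python
from collections import defaultdict

# Pairs from batch_1 quoted in the list above: pair number -> (word 1, word 2).
BATCH_1_PAIRS = {
    8: ("ĉó'chaqui", "ĉocháqui"),
    9: ("ĉó'chaqui'", "ĉó'chaqui"),
    16: ("ji'jun'taqui", "ji'juntaqui"),
    17: ("ji'juntaqui'", "ji'juntaqui"),
    18: ("ji'jun'taqui'", "ji'jun'taqui"),
}


def repeated_words(pairs):
    """Map each word found in more than one pair to the pairs it occurs in."""
    seen = defaultdict(list)
    for pair_id, words in sorted(pairs.items()):
        for word in set(words):
            seen[word].append(pair_id)
    return {word: ids for word, ids in seen.items() if len(ids) > 1}


for word, pair_ids in sorted(repeated_words(BATCH_1_PAIRS).items()):
    print(word, pair_ids)
```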


## Isolated words

The words that do not belong to a minimal pair and that were still recorded are to be found in **batch_3**. These words are listed below:

* jí'cham'tacsis
* tyetyéjta'
* vó'vodye'
* jäjä́m'
* cajótaqui
* cäi'tsii'ya'
* jambí'dyem'
* fọjjẹyạquị
* jí'juban'

The IDs of the sentences these words appear in are labeled **pairX.wordX** even though these words do not appear in minimal pairs. The word index (word1, word2) is also meaningless for these words: it is simply the index the word would have had if the first (resp. second) word of the pair had existed or been correct.
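
When processing sentence IDs, the pair and word indices can be extracted with a regular expression. The sketch below assumes that IDs contain a substring of the form `pair22.word1`; the exact spelling used in the files may differ, so check it against the data before relying on it. The function name is illustrative.

```python
import re

# ASSUMPTION: IDs contain "pair<number>.word<number>", e.g. "pair22.word1".
SENTENCE_ID = re.compile(r"pair(?P<pair>\d+)\.word(?P<word>\d+)")


def parse_sentence_id(sentence_id):
    """Return (pair_index, word_index) extracted from a sentence ID."""
    match = SENTENCE_ID.search(sentence_id)
    if match is None:
        raise ValueError(f"Unrecognised sentence ID: {sentence_id!r}")
    return int(match["pair"]), int(match["word"])


print(parse_sentence_id("pair22.word1"))
```

For the isolated words listed above, remember that the returned word index carries no meaning.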

# Acknowledgements

We thank our informants, Arnulfo Cary and Manuel Roca, who manually checked each minimal pair and wrote example sentences for each of the selected words. We also thank the three native Tsimane' speakers who recorded the stimuli.

This study was approved by CER U-Paris 2022-84-CRISTIA.

Funding for this study comes from the J. S. McDonnell Foundation (Understanding Human Cognition Scholar Award) and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (ExELang, Grant agreement No. 101001095).