The first two INEL corpora published online
Texts are provided with interlinear glossing (with lexical glosses in English and Russian) and translations into English, Russian and German. Some texts also have (partial) annotations for syntactic functions, semantic roles, information status, lexical borrowings and code-switching.
The corpora are published in open access under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). See below for details on using the corpora.
The corpora are primarily intended for typologically aware corpus-based grammatical research but may also be of interest to linguists of other branches as well as to specialists in folklore, anthropology and history.
1. INEL Selkup Corpus (v0.1)
Selkup is an endangered Samoyedic language (Uralic family), which used to be spoken in many small settlements dispersed over a large territory in Western Siberia.
The INEL Selkup corpus is composed of texts from the archive of Angelina Ivanovna Kuzmina (1924–2002), who gathered a large amount of material on Selkup in almost all regions where the Selkup people lived in 1962–1977. Most texts in the corpus originate from the handwritten part of the archive, which she transferred to Hamburg in 2001; the others come from her sound recordings, digitized in 2001, which have been transcribed and translated within the INEL project.
The present version of the corpus comprises 78 texts (18 673 words), mostly representing Northern varieties of Selkup.
2. INEL Kamas Corpus (v0.1)
Kamas belongs to the Samoyedic branch of the Uralic language family. The language became extinct in the late 20th century, with the death of its last known speaker, Klavdiya Plotnikova (1895–1989). All the surviving Kamas texts document Forest Kamas varieties spoken in the settlement of Abalakovo, in the present Krasnoyarsk Krai in Southern Siberia.
The INEL Kamas corpus is the first publicly available digital resource with annotated Kamas texts. It consists of two parts: folklore texts collected by Kai Donner in 1912–1914, and transcribed audio recordings of Klavdiya Plotnikova made between 1964 and 1970 in Abalakovo, Tartu and Tallinn. Most of these recordings were transcribed within the INEL project (including re-transcription of some tapes, fragments of which were published by Ago Künnap in 1976–1992).
The present version of the corpus comprises 137 texts (48 293 words); this includes 16 texts collected by Kai Donner and 121 texts from the recordings of Klavdiya Plotnikova (ca. 10.5 hours).
Working with the corpora
The data in the corpora (annotated texts as well as corresponding metadata) are represented in XML formats of the freely distributed EXMARaLDA suite (http://exmaralda.org/en/).
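Because the files are plain XML, they can also be read with general-purpose tools. The following is a minimal Python sketch, not an official INEL or EXMARaLDA utility. It assumes the usual layout of an EXB basic transcription, with `tier` elements that contain `event` elements carrying `start` and `end` timeline references. The sample document, the tier categories `tx` and `ge`, and the function name `read_tiers` are illustrative assumptions, not data from the corpora.

```python
import xml.etree.ElementTree as ET

# A tiny hand-made EXB-like document (placeholder content, not corpus data).
# The tier categories "tx" (text) and "ge" (English gloss) are assumed here.
SAMPLE_EXB = """<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0"/>
      <tli id="T1"/>
      <tli id="T2"/>
    </common-timeline>
    <tier id="tx" category="tx" type="t">
      <event start="T0" end="T1">word1 </event>
      <event start="T1" end="T2">word2</event>
    </tier>
    <tier id="ge" category="ge" type="a">
      <event start="T0" end="T1">gloss1</event>
      <event start="T1" end="T2">gloss2</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def read_tiers(exb_text):
    """Return {tier category: [(start, end, text), ...]} for an EXB string."""
    root = ET.fromstring(exb_text)
    tiers = {}
    for tier in root.iter("tier"):
        events = [
            (ev.get("start"), ev.get("end"), (ev.text or "").strip())
            for ev in tier.iter("event")
        ]
        tiers[tier.get("category")] = events
    return tiers


if __name__ == "__main__":
    for category, events in read_tiers(SAMPLE_EXB).items():
        print(category, events)
```

Events on different tiers that share the same `start` and `end` references are aligned with each other, which is how a word can be paired with its gloss.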
For browsing (and playback) of individual texts, use the «Sessions» tab on the main corpus page. Each text can be viewed in one of three online formats (e.g. Visualizations: Score) and downloaded in EXB (an EXMARaLDA format). The sources of the texts, i.e. scanned pages (PDF) or sound files (WAV, MP3), can also be viewed or downloaded.
For searching across the whole corpus, the complete archive of the corpus files can be downloaded and searched with the EXAKT program of the EXMARaLDA suite.
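For quick scripted checks, a simple substring search over the downloaded files can be sketched in Python. This is only an illustration and not a replacement for EXAKT. It assumes that the archive has been unpacked into a local directory of `.exb` files and that the transcription text sits in a tier with the category `tx`; the function name `search_corpus` and the directory name in the usage note are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def search_corpus(corpus_dir, query, tier_category="tx"):
    """Find events containing `query` in one tier category of all .exb files.

    Returns a list of (file name, start timeline id, event text) tuples.
    The default tier category "tx" is an assumption about the corpus layout.
    """
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.exb")):
        root = ET.parse(path).getroot()
        for tier in root.iter("tier"):
            if tier.get("category") != tier_category:
                continue
            for event in tier.iter("event"):
                text = (event.text or "").strip()
                if query in text:
                    hits.append((path.name, event.get("start"), text))
    return hits
```

A call such as `search_corpus("unpacked_corpus", "some_word")` would then list every matching event together with the file it comes from.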
Furthermore, in the next few weeks, an online search interface will be open for both corpora, based on the Tsakonian Corpus Platform (Tsakorpus). A test search across a fragment of the Selkup corpus is currently available.
Please send your comments and suggestions to: email@example.com.