BilingBank C-ORAL-IC Corpus

Nada Poropat Jeletić
Humanities and Social Sciences
Juraj Dobrila University of Pula

Gordana Hržica
Faculty of Education
University of Zagreb

Eliana Moscarda Mirković
Humanities and Social Sciences
Juraj Dobrila University of Pula

Participants: 87
Type of Study: naturalistic
Location: Istria, Croatia
Media type: audio
DOI: doi:10.21415/PAZ2-EP87

Citation information

Poropat Jeletić, Nada; Moscarda Mirković, Eliana; Bortoletto, Anna. Incidenza e implicazioni di alcuni tratti formali pertinenti tipici del discorso bilingue istriano: i casi di commutazione di codice. Annales. Series historia et sociologia, 31 (2021), 2, 329-340. DOI 10.19233/ASHS.2021.21

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Corpus Description

The Corpus of Spoken Istrovenetian/Fiuman and Croatian (C-ORAL-IC) consists of transcripts of conversations among 87 adult bilingual participants, speakers of Croatian and Istrovenetian or Fiuman, recorded on the Istrian peninsula and in the city of Rijeka. The corpus is accompanied by an MS Excel spreadsheet containing demographic and sociolinguistic data about each speaker. Each row presents the data for one speaker in one transcript: gender, year of birth, place of birth, place of living, education, education of parents, number of family members (nuclear family), average income, and the onset and amount of language exposure for each language variety. This information facilitates statistical analyses of the database and the selection of subsets of transcripts or speakers for more detailed and specific research purposes, and it is very useful for determining the speakers' bilingual/multilingual status.
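As a rough illustration of how the speaker spreadsheet could be used to select subsets of transcripts, the sketch below assumes the sheet has been exported to CSV. The column names (`transcript`, `speaker`, `gender`, `year_of_birth`, `place_of_living`) and the three sample rows are hypothetical and made up for the example; they are not taken from the actual file, whose headers may differ.

```python
import csv
import io

# Made-up rows for illustration only; the real spreadsheet's column
# names and values are not known here and will likely differ.
SAMPLE = """transcript,speaker,gender,year_of_birth,place_of_living
t01,SP1,female,1975,Pula
t01,SP2,male,1950,Pula
t02,SP1,female,1998,Rovinj
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per speaker per transcript."""
    return list(csv.DictReader(io.StringIO(text)))


def select_transcripts(rows, **criteria):
    """Return sorted transcript IDs that have at least one speaker
    matching every column=value pair in criteria."""
    return sorted(
        {
            row["transcript"]
            for row in rows
            if all(row[col] == val for col, val in criteria.items())
        }
    )


def born_before(rows, year):
    """Return the rows of speakers born before the given year."""
    return [row for row in rows if int(row["year_of_birth"]) < year]


rows = load_rows(SAMPLE)
female_transcripts = select_transcripts(rows, gender="female")
older_speakers = born_before(rows, 1960)
```

The same pattern extends to any other column in the spreadsheet, for example filtering by place of living before running a statistical analysis on the matching transcripts.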

Sociolinguistic situation

The Istrian peninsula is the westernmost part of Croatia, surrounded by the northern Adriatic Sea. It stretches from the bays of Trieste and Venice (in the north-west) to the bay of Rijeka and Kvarner (in the north-east) and Cape Premantura (in the south). The part of the peninsula belonging to Croatia partially coincides with Istria County, the only statutorily bilingual county in Croatia, where Croatian-Italian bilingualism is recognized de jure and de facto (hence its bilingual official name: Istarska županija/Regione istriana). Istria County covers most of the Istrian peninsula, with the exception of the Liburnian part, namely the marginal area in the north-east, which belongs to Primorje-Gorski Kotar County, whose main center is the city of Rijeka (Rijeka in Croatian; Fiume in Italian). The whole area is characterized by permanent contact between Croatophone and Italophone cultures and language varieties, which dates back several centuries. These varieties are still used in everyday spoken communication and form a complex and fragmented sociolinguistic macro-system shaped by the interplay of asymmetric and diglossic/polyglossic relations between the two official languages (Croatian and Italian), complemented by macro-regional dialects (the Istrovenetian koine and the Chakavian koine), micro-regional dialects (Chakavian, Kaikavian, Shtokavian) and local dialects in Istria (such as the Istriot dialects, the Istroromanian dialects, etc.) (Blagoni 2007). The most widespread varieties of the Italophone repertoire in Istria County and Rijeka are Istrovenetian and Fiuman respectively, low varieties with high ethnolinguistic prestige. They are considered to be the L1 of the members of the Italian National Community and the primary/privileged code of their community identification (the ethnolinguistic identity of the Italophone community: generational memory and heritage ethnic history that serve as vital factors of personal identity).


Data were collected from 2018 to 2021. Language sampling was performed by investigators from bilingual communities with access to groups of bilingual speakers. The investigators, mostly students, were recruited and trained to collect bilingual speech samples, and they recorded within their own social networks, so they were themselves part of the communicative setting. Sampling took place in various everyday informal interactive situations, mostly spontaneous conversations among family members, friends, colleagues or acquaintances, such as informal gatherings, socializing or family meals. Genre and formality were thus controlled (conversations in informal situations). To ensure a bilingual mode, informal spontaneous conversation was recorded, and all the speakers participating in the recorded conversations were proficient bilinguals or multilinguals.

The corpus was collected in the course of the project Multilevel approach to discourse in language development (Croatian Science Foundation, UIP-2017-05-6603). All participants signed an informed consent form in which the data collection was described. They were informed that their data would be published as part of a corpus but would be anonymized. They could, and still can, withdraw from the study and/or withdraw their transcripts from the corpus.

With the aim of mitigating the observer's paradox (Labov, 1972), two measures were applied. First, all the participants were informed about the research aims and the speech sampling procedure. They all provided written informed consent in which they agreed to be recorded, without their explicit knowledge, at a random point within one month of signing the consent. Second, the investigators were trained to participate in the recorded sessions as little as possible. Almost all the recording sessions lasted approximately 15 minutes. As transcripts inevitably cannot capture all communicative features and events, most of the transcripts were linked to their corresponding source audio files.

Participants were administered a background questionnaire providing information about their sociodemographic, sociolinguistic and socioeconomic status, language exposure and language usage in their social networks.


When selecting the participants, we did our best to ensure that they were representative of the bilingual community in Croatia. This was achieved by selecting participants of different ages, with different levels of education and from different geographic areas, while at the same time trying to preserve the most common characteristics of the bilingual community: command and usage of the language variants spoken in the area (namely the Croatian standard language, the Croatian Chakavian dialect, the Italian standard language and the Italian dialect (Istrovenetian/Fiuman)) and early exposure to at least two language variants.

A conversational sampling method was employed for building the corpus, which included 91 native bilingual speakers of Istrovenetian/Fiuman and Croatian, recruited across three generations and living in different areas of the Istrian peninsula and the city of Rijeka. The speakers were proficient bilinguals or multilinguals with at least two mother tongues and first languages, or speakers whose mother tongue is Istrovenetian/Fiuman and who acquired Croatian/Chakavian in early childhood. The original group of recorded participants consisted of 102 bilingual speakers. In total, 15 of them are not included, either because they withdrew during sampling or because of problems encountered during the COVID-19 pandemic, which lasted from 2019 to 2021. All the transcripts were annotated with the participants' basic information regarding gender, age and the location of the conversation.

Participants included 59 women and 28 men, aged 18-82, with an average age of 42. Education ranged from elementary school to PhD, and income from very low to very high.

Place of recording. To ensure diatopic representativeness, participants were recruited by the investigators in the city of Rijeka and five towns of Istria County, namely the area around Pula, Fažana, Rovinj, Poreč, Grožnjan, Umag and Rijeka (Figure 1). It is important to note that not all areas of Istria are equally populated by bilingual speakers, and the numbers of participants in the individual geographic areas roughly reflect that.


The C-ORAL-IC was collected in the course of the project Multilevel approach to discourse in language development (Croatian Science Foundation, UIP-2017-05-6603). This work would not have been possible without the help of all the investigators, namely: Antonella Antonac, Katja Bartulić, Daniela Bonassin, Nina Crevatin, Chiara Detoffi, Elia Filinić, Lucija Gržinić, Daniela Jugovac, Romana Macuka, Olga Maier, Valentina Malusa', Petra Orbanić, Ilaria Paus, Valentina Perić Turčinović, Matteo Putigna, Anna Radin, Francesca Rajković, Ornella Sciucca, Viktorija Štokovac, Greta Šverko, Nicol Verbanac. We are deeply grateful for their work.