BilingBank Spanish-English Bangor Miami Corpus

Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University


Participants: 84
Type of Study: naturalistic
Location: USA
Media type: audio
DOI: 10.21415/T5J01D

Citation information

Some citation here.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Miami corpus of Spanish-English bilingual speech was recorded and transcribed between 2008 and 2011 as part of a research project funded by the Economic and Social Research Council (ESRC). The main theoretical aim of the project was to test alternative models of code-switching with Spanish-English data.

The Data

The corpus consists of 56 audio recordings and their corresponding transcripts of informal conversation between two or more speakers, involving a total of 84 speakers living in Miami, Florida (USA). Participants were recruited via a variety of methods, including advertising and using the research team’s extended social network.

Of the 56 audio recordings, 15 are of conversations involving one individual, recorded over a longer period of time with a range of interlocutors. The participant (‘María’) was already known by the research team to be a balanced bilingual who frequently and consistently code-switched in daily conversation, and so she was invited to make recordings of her interactions with colleagues, family and friends. María decided when and with whom to make recordings, using a small digital recorder worn on her belt with a moderately concealed lapel microphone. She recorded 42 conversations, 15 of which were selected for transcription on the basis of their acoustic quality. The research team had no control over when or where the recordings were made, nor over technical aspects such as checking audio levels, environmental noise and changing batteries in the recorder. María’s interlocutors did not sign consent forms or fill in questionnaires, so the transcripts of the 15 recordings represent only María’s speech, while utterances from other speakers are transcribed as “www”.

In total, the corpus consists of 242,475 words of text from 35 hours of recorded conversation. The transcriptions (in CHAT format) are linked to the digitized recordings through sound links at the end of each main tier. Most recordings were in stereo, and were made using Marantz, Zoom or Microtrack digital audio recorders.
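For readers who want to work with the sound links programmatically, the following is a minimal Python sketch of reading one main tier and its time span. It rests on assumptions not stated in this description: that a main tier starts with `*` followed by a speaker code and a colon, and that the sound link is a `start_end` pair in milliseconds wrapped in the control character U+0015. The function name `parse_main_tier`, the speaker code `MAR` and the sample line are hypothetical; check the CHAT manual before relying on any of this.

```python
import re

# Assumption: a sound link is "start_end" in milliseconds, delimited by the
# control character U+0015, and sits at the end of the main tier.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15\s*$")


def parse_main_tier(line):
    """Return (speaker, text, start_ms, end_ms), or None if not a main tier.

    start_ms and end_ms are None when the line carries no sound link.
    """
    if not line.startswith("*") or ":" not in line:
        return None
    speaker, rest = line[1:].split(":", 1)
    match = BULLET.search(rest)
    if match is None:
        return speaker, rest.strip(), None, None
    text = rest[:match.start()].strip()
    return speaker, text, int(match.group(1)), int(match.group(2))


# Hypothetical main tier: an untranscribed utterance ("www") with a sound link.
sample = "*MAR:\twww . \x151000_2500\x15"
result = parse_main_tier(sample)
```

Keeping the time span separate from the utterance text makes it straightforward to skip untranscribed “www” utterances or to locate the matching stretch of audio.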

The recordings were made at a place convenient for the speakers, e.g. at their homes or workplaces. After setting up the equipment, the researcher would leave the speakers to talk freely with one another. In some cases the researcher re-entered briefly during the recording. This is noted in the transcripts, and speech by the researcher is usually not transcribed. The first five minutes of each recording after the point when the researcher left the room have been deleted, in case the participants’ speech was initially affected by the presence of the recorder.

At the end of each recording, all participants were asked to fill in questionnaires giving background information such as their age, gender and the places they had lived, for use in sociolinguistic analysis. They were also asked to sign consent forms giving permission for their recording and its transcript to be used for research purposes and to be submitted to online linguistic archives. The consent form included the provision that the names of speakers and other people named in the recording would be replaced by pseudonyms in the transcript. In the case of children aged 16 or younger, a consent form was also signed by a parent or guardian.

Usage Restrictions

Acknowledgments: Please refer to the corpus as the Bangor Miami corpus, and provide a link to the website by which you accessed the corpus. We request that a copy of any publications that make use of this corpus be sent to us at the above address.

Canonical version of the data: The most up-to-date version of the data as well as more detailed documentation is available on