International conference "Talking data"

Methodological and theoretical challenges raised by spoken interaction data

Organizing Committee:

Caterina Mauri, Eleonora Zucchini, Silvia Ballarè, Ludovica Pannitto.

Scientific Committee:

All the members of the PRIN 2022 PNRR DiverSIta – Diversity in Spoken Italian:

Cecilia Andorno, Silvia Ballarè, Beatrice Bernasconi, Claudia Borghetti, Massimo Cerruti, Paolo Antonio Della Putta, Eugenio Goria, Nicola Grandi, Guglielmo Inglese, Yahis Martari, Caterina Mauri, Ludovica Pannitto, Rosa Pugliese, Eleonora Zucchini.

When and where: University of Bologna, 9-10 October 2025

The conference aims to gather scholars working on spoken interaction data from a variety of perspectives, with different approaches and goals, across different fields of linguistics. We are especially interested in contributions addressing how this type of data raises both methodological and theoretical challenges at every stage, from collection, through transcription, to annotation and analysis.

The conference is the closing event of the project DiverSIta (Diversity in Spoken Italian), which is dedicated mainly to the expansion of KIParla (Mauri et al. 2019, www.kiparla.it), a corpus that aims to document spoken Italian over time, in its internal diversity of speakers and communicative situations, with a focus on naturally occurring data (Ballarè, Mauri & Goria 2022). The conference will be an opportunity to present the corpus and the whole KIParla enterprise and to learn about further resources, in different languages, that share the focus on spoken interaction data. Participants will have the chance to discuss the theoretical and methodological challenges that this type of data raises in various fields and approaches to the study of language, and to find common or complementary objectives to pursue.

Collecting, transcribing, and publishing data of spoken interaction notoriously pose more challenges than building resources of written data or of spoken but monological data. For many years, spoken corpora were therefore limited to so-called WEIRD and LOL languages, the latter being languages with standardized written forms (Literate), official recognition (Official), and large speaker populations (Lots of users) (Dahl 2015). Only recently have resources containing spoken data become available for a variety of languages that includes less described ones, although a significant portion of such data consists of monological narratives (cf. Multi-CAST, Haig & Schnell 2015; SCOPIC, Barth & Evans 2021; Dingemanse & Liesenfeld 2022; DoReCo, Seifart, Paschen & Stave 2024).

Access to spoken data is crucial for various linguistic analytical perspectives that focus on language variation in a broad sense. Observing spoken interaction, despite its inherent messiness and unpredictability, is essential for developing comprehensive and accurate descriptions of language as it is truly used in real-life contexts. This approach helps mitigate biases toward overly polished or artificially structured data, allowing for a more authentic representation of linguistic diversity.

We welcome contributions discussing the issues, solutions, and challenges in building, annotating, using and comparing corpora of spoken interaction data, also in a cross-disciplinary perspective, highlighting the role of this specific type of data in shaping linguistic analyses, linguistic models, and methodological choices. A non-exhaustive list of topics includes:

Methodologies: corpus design, data collection, transcription, annotation and publication

Sampling and balancing: reconciling representativeness and spoken data
Ecological and ethical practices for data collection
Challenges and possible solutions for manual or (semi-)automated transcription
Data formats and standards
Data annotation: units of transcription, units of analysis, disfluencies, co-constructions, multilingual interactions, …
Data FAIRness and accessibility: privacy protection and data sharing
Main problems and solutions for multilingual corpora annotation
Treebanks of spoken interactional data: how to deal with overlaps or utterance co-construction, …
LLM training based on conversational data and LLM interactional performance evaluation
…

Analysis: Spoken interaction data in different approaches

Language variation and spoken data: how interaction shapes internal variation
Sociolinguistic perspectives on spoken data: to what extent can social categories explain variation in spoken language?
Typological approaches to spoken interaction data, e.g. universal vs. language-specific phenomena, available resources
Computational approaches to spoken interaction data, e.g. LLM training and fine-tuning, automatic detection of interactional phenomena
Diachronic approaches to spoken interaction data, e.g. emergent constructions, studies highlighting the role of dialogical interaction in language change
Studies on interactional data involving L2 speakers or speakers with multilingual repertoires, e.g. what we can learn about language acquisition and learners’ varieties from this type of data, and how the presence of L2 speakers or speakers with complex repertoires shapes language in interaction
Psycholinguistic approaches, e.g. experimental settings involving spoken interactions

Confirmed invited speakers

Robbie Love (Aston University)
Lorenza Mondada (University of Basel)
Marlou Rasenberg (Radboud University)
Stefan Schnell (University of Zurich)

Submission information

Abstract submission: please send a one-page abstract (references excluded) in PDF to caterina.mauri@unibo.it and eleonora.zucchini2@unibo.it
Deadline for abstract submission: 20th May
Notification of acceptance: 31st May

References

Barth, Danielle & Nicholas Evans (eds.). 2017–2021. Social Cognition Parallax Interview Corpus (SCOPIC). “Language Documentation & Conservation Special Publication” 12. Honolulu, University of Hawai'i Press.

Dingemanse, Mark & Andreas Liesenfeld. 2022. From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Association for Computational Linguistics, pp. 5614–5633.

Dobrovoljc, Kaja. 2022. Spoken Language Treebanks in Universal Dependencies: an Overview. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, European Language Resources Association, pp. 1798–1806.

Haig, Geoffrey & Stefan Schnell (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/).

Mauri, Caterina, Silvia Ballarè, Eugenio Goria, Massimo Cerruti & Francesco Suriano. 2019. KIParla corpus: A new resource for spoken Italian. In CEUR Workshop Proceedings, CEUR-WS 2481, pp. 1–7.

Mauri, Caterina, Silvia Ballarè, Eugenio Goria & Massimo Cerruti. 2022. Il corpus KIParla. In Corpora e studi linguistici. Milano, Officinaventuno, pp. 109–118.

Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2024. Language Documentation Reference Corpus (DoReCo) 2.0. Lyon, Laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).

Yu, Wenyi, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma & Chao Zhang. 2024. Connecting speech encoder and large language model for ASR. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12637–12641.