Methodological and theoretical challenges raised by spoken interaction data
Organizing Committee:
Caterina Mauri, Eleonora Zucchini, Silvia Ballarè, Ludovica Pannitto.
Scientific Committee:
All the members of the PRIN 2022 PNRR DiverSIta – Diversity in Spoken Italian:
Cecilia Andorno, Silvia Ballarè, Beatrice Bernasconi, Claudia Borghetti, Massimo Cerruti, Paolo Antonio Della Putta, Eugenio Goria, Nicola Grandi, Guglielmo Inglese, Yahis Martari, Caterina Mauri, Ludovica Pannitto, Rosa Pugliese, Eleonora Zucchini.
When and where: University of Bologna, 9-10 October 2025
The conference aims to gather scholars working on spoken interaction data from a variety of perspectives, with different approaches and goals, across different fields of linguistics. We are especially interested in contributions addressing how this type of data raises both methodological and theoretical challenges at every stage, from collection, through transcription, to annotation and analysis.
The conference is the closing event of the project DiverSIta, Diversity in Spoken Italian, which is mainly dedicated to the expansion of KIParla (Mauri et al. 2019, www.kiparla.it), a corpus that aims to document spoken Italian over time, in its internal diversity of speakers and communicative situations, with a focus on naturally occurring data (Ballarè, Mauri & Goria 2022). The conference will be an opportunity to describe the corpus and the whole KIParla enterprise and to learn about further resources, in different languages, that share the focus on spoken interaction data. Participants will have the chance to discuss the theoretical and methodological challenges that this type of data raises in various fields and approaches to the study of language, and to find common or complementary objectives to pursue.
Collecting, transcribing, and publishing spoken interaction data notoriously pose more challenges than building resources of written or spoken but monological data. For many years, therefore, spoken corpora were limited to so-called WEIRD and LOL languages, i.e. languages with standardized written forms (Literate), official recognition (Official), and large speaker populations (Lots of users) (Dahl 2015). Only recently have resources containing spoken data become available for a wider variety of languages, including less described ones, although a significant portion of such data consists of monological narratives (cf. Multi-CAST, Haig & Schnell 2015; SCOPIC, Barth & Evans 2021; Dingemanse & Liesenfeld 2022; DoReCo, Seifart, Paschen & Stave 2024).
Access to spoken data is crucial for various linguistic analytical perspectives that focus on language variation in a broad sense. Observing spoken interaction, despite its inherent messiness and unpredictability, is essential for developing comprehensive and accurate descriptions of language as it is truly used in real-life contexts. This approach helps mitigate biases toward overly polished or artificially structured data, allowing for a more authentic representation of linguistic diversity.
We welcome contributions discussing the issues, solutions, and challenges in building, annotating, using and comparing corpora of spoken interaction data, also in a cross-disciplinary perspective, highlighting the role of this specific type of data in shaping linguistic analyses, linguistic models, and methodological choices. A non-exhaustive list of topics includes:
Methodologies: corpus design, data collection, transcription, annotation and publication
Analysis: Spoken interaction data in different approaches
Confirmed invited speakers
Robbie Love (Aston University)
Lorenza Mondada (University of Basel)
Marlou Rasenberg (Radboud University)
Stefan Schnell (University of Zurich)
Submission information
References
Barth, Danielle & Nicholas Evans (eds.). 2017-2021. Social Cognition Parallax Interview Corpus (SCOPIC). “Language Documentation & Conservation Special Publication” 12. Honolulu, University of Hawai'i Press.
Dingemanse, Mark & Andreas Liesenfeld. 2022. From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Association for Computational Linguistics, pp. 5614–5633.
Dobrovoljc, Kaja. 2022. Spoken Language Treebanks in Universal Dependencies: an Overview. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, European Language Resources Association, pp. 1798–1806.
Haig, Geoffrey & Stefan Schnell (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/).
Mauri, Caterina, Silvia Ballarè, Eugenio Goria, Massimo Cerruti & Francesco Suriano. 2019. KIParla corpus: A new resource for spoken Italian. In CEUR Workshop Proceedings, CEUR-WS 2481, pp. 1–7.
Mauri, Caterina, Silvia Ballarè, Eugenio Goria & Massimo Cerruti. 2022. Il corpus KIParla. In Corpora e studi linguistici. Milano, Officinaventuno, pp. 109–118.
Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2024. Language Documentation Reference Corpus (DoReCo) 2.0. Lyon, Laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).
Yu, Wenyi, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma & Chao Zhang. 2024. Connecting speech encoder and large language model for ASR. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12637–12641.