Speakers: Prof. Johanna Miecznikowski, Elena Battaglia
Date: 6 November 2024, 15:00 to 17:00
Venue: Laboratorio 76, via Cartoleria 5, Bologna - In-person and online event
The TIGR corpus of spoken Italian includes 23.5 hours of video-recorded and transcribed discourse. It was gathered in 2021-2022 by a team of linguists at USI Università della Svizzera italiana in the Swiss cantons of Ticino and Grisons. It is currently being prepared for sharing via the Swiss FAIR-compliant (cf. Wilkinson et al. 2016) LaRS repository.
We describe the corpus design and some changes it underwent during fieldwork; the interaction with the speakers, including their informed consent to participate in the study; the set-up of audio and video recordings; sociolinguistic characteristics of the participants; transcription conventions and techniques; workflows of transcript processing for qualitative analysis and annotation; de-identification measures; scenarios of data reuse; the organization of the data and metadata on LaRS; an open science approach to problem solving during the preparation of the data to be shared; and future perspectives.
TIGR was collected within a research project conducted at USI that focused on epistemic aspects of talk (InfinIta, SNSF grant no. 192771). At the same time, it was designed to increase the diversity of available resources for spoken Italian (for an overview see Mauri et al. 2019). It includes 23.5h of video recordings documenting 23 face-to-face interactions. These vary in genre and in external criteria (Sinclair & Ball 1996), more specifically event-related parameters (Deppermann/Hartung 2011:423-424) such as institutionality, the number of participants, speaker roles and the presence of multi-activity (Mondada 2009): table conversations (6h5'), food preparation (1h40'), tutoring encounters (4h40'), lessons and practical instruction (7h20'), interviews (3h40'). The data collection process underwent some changes due to the Covid-19 pandemic, which had an impact on the corpus structure. The 115 speakers are 10-70 years old (most represented range: 20-29 years) and about three quarters of them completed upper secondary school. They gave their consent to data use and re-use for scientific purposes and expressed some de-identification demands. The technical set-up included two camcorders and 2-4 pocket audio recorders equipped with clip-on microphones, all synchronized through timecode generators. The A/V files were aligned and cut to equal length in Adobe Premiere. The team then transcribed them in ELAN (Sloetjes/Seibert 2016) using an adapted version of the GAT 2 conventions (Selting et al. 2011). The transcription technique adopted prioritized aligning segment boundaries with the boundaries of overlapping speech, so as to facilitate the revision of transcripts in ELAN and the manual layout of complex sequences with overlapping speech. Proper names were pseudonymized.
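As a quick consistency check, the per-genre durations listed above can be summed and compared with the stated corpus size. The sketch below only copies the figures given in the abstract; the variable names are illustrative, and reading "23.5h" as a rounded total is an assumption.

```python
# Genre durations as (hours, minutes), copied from the abstract.
durations = {
    "table conversations": (6, 5),
    "food preparation": (1, 40),
    "tutoring encounters": (4, 40),
    "lessons and practical instruction": (7, 20),
    "interviews": (3, 40),
}

# Convert everything to minutes, sum, and convert back to hours and minutes.
total_minutes = sum(h * 60 + m for h, m in durations.values())
hours, minutes = divmod(total_minutes, 60)

print(f"{hours}h{minutes:02d}' = {total_minutes / 60:.2f} h")
# prints: 23h25' = 23.42 h
```

The genre figures add up to 23h25', which is consistent with the rounded total of 23.5 hours.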
The data will be made available on the SWISSUbase repository using the metadata scheme provided by LaRS, which is tailored for data in linguistics. The data will be downloadable upon signing a user agreement with the corpus owners. To enhance interoperability and reusability, we plan to provide two transcript versions in addition to the EAF file generated by ELAN. So far, we have implemented a script-assisted workflow to produce TXT transcripts that are optimized for the human eye and preserve a reduced number of timecode stamps. Later we intend to create tokenized transcripts readable by corpus linguistic software. In the A/V files we are masking faces and voices where required and replacing proper names with noise. For each recorded event, we are editing a single compact, easy-to-use movie file with split screen and mixed audio. Once ready, the corpus will be uploaded to the repository, accompanied by metadata and documentation. Users could have the following download options: event by event, either a full version (A/V files, compact movie, EAF file, transcripts) or a light version (compact movie, transcripts); or transcripts only for all events at once, as raw TXT or tokenized.
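To illustrate what a script-assisted conversion from EAF to a reader-friendly TXT transcript might involve, here is a minimal sketch. It is not the project's actual script. The sample document, the speaker labels (SPK1, SPK2), the `{mm:ss}` stamp format and the 10-second stamp interval are all assumptions made for illustration. Only the element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) follow the EAF XML format written by ELAN.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written EAF-like sample (hypothetical speakers and content).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="4200"/>
    <TIME_SLOT TIME_SLOT_ID="ts5" TIME_VALUE="12000"/>
    <TIME_SLOT TIME_SLOT_ID="ts6" TIME_VALUE="13500"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK1">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
        TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
      <ANNOTATION_VALUE>allora cominciamo</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a3"
        TIME_SLOT_REF1="ts5" TIME_SLOT_REF2="ts6">
      <ANNOTATION_VALUE>prima domanda</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER TIER_ID="SPK2">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
        TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
      <ANNOTATION_VALUE>va bene</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_segments(eaf_xml):
    """Return (start_ms, tier_id, text) tuples sorted by start time.

    Assumes every time slot carries a TIME_VALUE; real EAF files may
    contain unaligned slots, which this sketch does not handle.
    """
    root = ET.fromstring(eaf_xml)
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            segments.append((start, tier.get("TIER_ID"), text))
    return sorted(segments)


def format_stamp(ms):
    """Format milliseconds as a {mm:ss} stamp (hypothetical layout)."""
    minutes, seconds = divmod(ms // 1000, 60)
    return "{%02d:%02d}" % (minutes, seconds)


def to_txt(segments, min_gap_ms=10_000):
    """Render segments as text, emitting a stamp at most every min_gap_ms."""
    lines, last_stamp = [], None
    for start, speaker, text in segments:
        if last_stamp is None or start - last_stamp >= min_gap_ms:
            lines.append(format_stamp(start))
            last_stamp = start
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)


print(to_txt(read_segments(SAMPLE_EAF)))
```

On the sample, the second segment starts only 1.5 seconds after the first stamp, so it gets no stamp of its own; the third starts at 12 seconds and triggers a new one. Thinning the stamps this way keeps the text readable while still letting a reader locate a passage in the recording.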
While preparing the data, we are gradually building a webpage to present the corpus. In parallel, we use a lab blog (sharetigr.usi.ch) to publicly report on our experience and discuss the issues we are facing, thus building a case study of open research data practices in linguistics.
A desirable further step is to make the corpus accessible on a platform that allows for online viewing and querying. Currently, no such platform is available in Switzerland. USI has started a collaboration with LiRI (Linguistic Research Infrastructure, University of Zurich) to explore possible software developments.
Blog | ShareTIGR. (n.d.). Retrieved September 23, 2024, from https://sharetigr.usi.ch/en/news-events/blog
Couper-Kuhlen, E., & Barth-Weingarten, D. (2011). A system for transcribing talk-in-interaction: GAT 2: English translation and adaptation of Selting, Margret et al: Gesprächsanalytisches Transkriptionssystem 2. Gesprächsforschung, 12, 1–51.
Language Repository of Switzerland | LaRS – Language Repository of Switzerland | UZH. (n.d.). Retrieved September 23, 2024, from https://www.lars.uzh.ch/en.html
Mauri, C., Ballarè, S., Goria, E., Cerruti, M., & Suriano, F. (2019). KIParla Corpus: A New Resource for Spoken Italian. Proceedings of the Sixth Italian Conference on Computational Linguistics. CLiC-it 2019, Bari, Italy. https://ceur-ws.org/Vol-2481/paper45.pdf
Mondada, L. (2009). Multimodalità e multi-attività nelle conversazioni a tavola (p. 88). Franco Angeli. https://shs.hal.science/halshs-00376006
Sinclair, J. M., & Ball, J. (1996). EAGLES Text typology. https://www.ilc.cnr.it/EAGLES96/texttyp/texttyp.html
Sloetjes, H., & Seibert, O. (2016). Measuring by marking; the multimedia annotation tool ELAN. Measuring Behavior 2016, 10th International Conference on Methods and Techniques in Behavioral Research, 492–495. https://hdl.handle.net/11858/00-001M-0000-002B-98DF-2
SNSF Data Portal. (n.d.). Retrieved September 23, 2024, from https://data.snf.ch/
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., Da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18
After graduating in Italian Philology, Russian Philology and French Linguistics at the University of Basel, Johanna Miecznikowski-Fünfschilling worked as an assistant in French Linguistics and General Linguistics at the same university. In Basel she worked on the language biographies of multilingual speakers (cf. the volume Leben mit mehreren Sprachen. Sprachbiographien / Vivre avec plusieurs langues. Biographies langagières, Bern, Peter Lang, 2004, co-edited with Rita Franceschini) and on the interactive construction of scientific discourse in multilingual settings, and in 2002 she obtained her doctorate in French Linguistics (cf. Le traitement de problèmes lexicaux lors de discussions scientifiques en situation plurilingue. Procédés interactionnels et effets sur le développement du savoir, Bern, Peter Lang, 2005). Following a research stay at the University of Turin devoted to the study of the conditional and of discourse markers in spoken language, she was awarded the venia legendi in Romance Linguistics by the University of Basel in 2010. From 2012 to 2016 she was president of the Swiss Association of Applied Linguistics.
She is a titular professor at the Institute of Italian Studies and the Institute of Argumentation, Linguistics and Semiotics of USI Università della Svizzera italiana, where she teaches linguistics and pragmatics in the Bachelor and Master programmes in Italian language, literature and civilisation. She currently leads the following research projects: "The categorization of information sources in face-to-face interaction: a study based on the TIGR corpus of spoken Italian" (SNSF grant no. 192771, 2020-2024), "Data-sharing skills in corpus-based research on talk-in-interaction" (swissuniversities ORD programme, in collaboration with partners at the universities of Basel, Lausanne and Neuchâtel, April 2023-September 2024), and "ShareTIGR - Sharing the TIGR corpus of spoken Italian: an ORD case study" (USI ORD programme, February 2024-January 2025).
Elena Battaglia is a doctoral assistant in Italian Linguistics at the Institute of Italian Studies.
She works within the SNSF project [no. 100012_192771] "The categorization of information sources in face-to-face interaction: a study based on the TIGR corpus of spoken Italian", under the supervision of Prof. Johanna Miecznikowski-Fünfschilling.
She obtained a bachelor's degree in Foreign Languages and Literatures from the University of Milan in 2018 and a master's degree in Sciences du langage from the University of Lille in 2020. She was an adjunct lecturer in Linguistique de l'oral at the University of Lille in 2022 and a visiting scholar at the Department of Linguistics of KU Leuven in 2023.