The background

An important part of cultural heritage is represented by the complex and diverse set of linguistic resources used in everyday spoken communication, which we refer to as oral heritage.
Compared to documenting written language, documenting oral heritage poses a number of difficulties, some of which arise from the great variability of spoken language and can be seen as a reflection of diversity in society. This project aims to fill a gap in the documentation of Italian oral heritage, in the belief that better representing the diversity subsumed in spoken Italian may be a crucial step towards a society that is more inclusive and better able to represent the diversity of its individuals.

Some of the existing corpora and oral archives for spoken Italian, such as ParVa (Guerini 2016), ViVo and Voci (Piccardi et al. 2019), DIA (Dialogic ItAlian; Mereu & Vietti 2021) or Kontatto (Ciccolone & Dal Negro 2021), are relatively small but well balanced. Other, larger and more general corpora designed as reference resources for variation in spoken Italian are VoLIP (Voghera et al. 2014), LABLITA (Cresti & Moneglia 2005), CLIPS (Sobrero & Tempesta 2007), and the Perugia Corpus (Spina 2014). However, not all of them are accessible online, and most of them do not provide metadata for speakers and conversations, which hinders the analysis of sociolinguistic variation.
Major advances have been introduced by the KIParla corpus (Mauri et al. 2019), the first corpus of spoken Italian available online that, in compliance with GDPR, is able to share a wide array of metadata concerning the speakers’ profiles. More importantly, KIParla has been conceived as a modular corpus that grows over time through the addition of new sections (Ballarè et al. 2022). Two modules of KIParla have already been published: KIP (Goria & Mauri 2018), consisting of interactions in academic settings, and ParlaTO (Cerruti & Ballarè 2021), consisting of semi-structured interviews with speakers from different socio-educational backgrounds. Despite these innovations, KIParla is still biased towards speakers with an indigenous multilingualism, and thus draws a picture that is distant from reality. Due to the intensification of international migration flows, Italian has come to be spoken by a more heterogeneous set of individuals and communities, and is often embedded in multilingual practices that include both indigenous and exogenous linguistic systems and that form part of unstable repertoires. Therefore, we aim to develop the KIParla corpus so as to provide a better representation of the diverse constellation of speakers and of their varieties of Italian.

Speakers with an International Migration Background (SIMB, possibly L2 speakers) are generally not included in corpora. Although in the field of Second Language Acquisition (SLA) L2 varieties have long been interpreted as part of the overall target language system (Klein 1997), they are generally represented only separately, in the so-called Learner Corpora (LC). As for L2 Italian, six main collections of LC are available, of which only the Corpus of Chinese Learners of Italian (COLI) and Lessico dell’italiano parlato per stranieri (LIPS; Gallina 2013) contain spoken data.

While LC are valuable tools for studying different aspects of SLA, they have never really had a robust impact on SLA research (McEnery et al. 2019). This is due to technical and epistemological reasons. In building an LC, one should pay attention to how L2ers’ productions can be codified, and how to deal with non-target-like productions (Andorno & Rastelli 2009, Benazzo & Watorek 2021, Lüdeling & Hirschmann 2015). This can be a complex problem of interpretation and tagging. Moreover, SLA research is mostly based on the study of spontaneous, oral language, but the majority of available LC consist of elicited productions, often gathered in an educational environment or even in testing conditions (see Gallina 2010, LIPS corpus). In such contexts, learners try to control their performance, steer clear of non-standard variants and hide their L1 as much as possible, for example by avoiding language mixing and switching, which are otherwise constitutive phenomena of L2 discourse (Macaro 2005).

Therefore, this project aims to bring together the sociolinguistically oriented methodology of the KIParla corpus and the body of knowledge from SLA research in the creation of a new resource. For the first time, data from speakers with an international migration background will be recorded with the same methods used for speakers of national origin and included in the same resource. From a sociolinguistic perspective, this will enable us to add data from SIMB varieties to the sociolinguistic description of Italian; from an SLA perspective, data from unsupervised speech are of paramount importance, and could allow the characteristics of these varieties to emerge in their entirety, both infra-varietistically and multilingually.

References

Abel, A. (2014). A Trilingual Learner Corpus illustrating European Reference Levels. RiCOGNIZIONI, V. 1, 111-126.
Andorno, C. (2017) ‘Definire l’oggetto: che cos’è una seconda lingua e che cosa significa acquisire una lingua’, in Verso una nuova lingua. Capire l’acquisizione di L2. Torino: UTET Università, pp. 3–28.
Andorno, C. and Rastelli, S. (2009) ‘Un’annotazione orientata alla ricerca acquisizionale’, Corpora di Italiano L2: tecnologie, metodi, spunti teorici, 19(3), pp. 49–70.
Ballarè, S., Goria, E., Mauri, C. (2022). Italiano parlato e variazione linguistica. Teoria e prassi nella costruzione del corpus KIParla. Bologna: Pàtron.
Ballarè, S., Mauri, C., Cerruti, M., & Goria, E. (2019). Il corpus KIParla. Tra linguistica dei corpora e sociolinguistica dell’italiano. RiCOGNIZIONI, 275-278.
Berruto, G. (2003) ‘Sul parlante nativo (di italiano)’, in H.I. Radatz and R. Schlosser (eds) Donum grammaticorum. Festschrift für Harro Stammerjohann. Tübingen: Niemeyer, pp. 1–14.
Cerruti, M., & Ballarè, S. (2021). ParlaTO: corpus del parlato di Torino. Bollettino dell’Atlante Linguistico Italiano, 44, 171-196.
Ciccolone, S., & Dal Negro, S. (2021). Comunità bilingui e lingue in contatto: Uno studio sul parlato bilingue in Alto Adige (1st ed. 2021, Athenaeum series). Caissa Italia.
Corino, E., & Marello, C. (Eds.). (2009). Valico: Studi di linguistica e didattica (1st ed.). Perugia: Guerra.
Cortinovis, E. (2011) ‘Local, Global and Ethnic Orientation in the Communicative Practices of Albanian Speaking Adolescents in Bolzano, Italy’, Zeitschrift für Literaturwissenschaft und Linguistik, 41(164), pp. 121–132.
Cresti, E., & Moneglia, M. (Eds.). (2005). C-ORAL-ROM: Integrated reference corpora for spoken Romance languages. J. Benjamins.
Dewaele, J.-M. (2018). Why the dichotomy ‘L1 versus LX user’ is better than ‘native versus non-native speaker’. Applied Linguistics, 39, 236–240.
Goria, E., & Mauri, C. (2018). Il corpus KIParla: Una nuova risorsa per lo studio dell’italiano parlato. In F. Masini & F. Tamburini (Eds.), Club working papers in linguistics. Vol II (pp. 96–116). CLUB. Circolo Linguistico dell’Università di Bologna.
Guerini, F. (Ed.). (2016). Italiano e dialetto bresciano in racconti di partigiani (1st ed.). Aracne.
Klein, W. (1997). ‘Learner varieties are the normal case’, The Clarion, 3, pp. 4–6.
Lüdeling, A., & Hirschmann, H. (2015). Error annotation systems. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (1st ed., pp. 135–158). Cambridge University Press.
Mauri, C., Masini, F., Borghetti, C., & Bolognesi, M. (2022). Posizionamento del sé e rappresentazione dell’Altro nel discorso: una prospettiva interculturale. In S. Fusari, B. Ivancic, & C. Mauri (Eds.), Diversità e inclusione. Quando le parole sono importanti (pp. 51–84). Milano: Meltemi editore.
Mauri, C., Ballarè, S., Goria, E., Cerruti, M., & Suriano, F. (2019). KIParla Corpus: A New Resource for Spoken Italian. Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy, November 13-15, 2019.
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus Linguistics, Learner Corpora, and SLA: Employing Technology to Analyze Language Use. Annual Review of Applied Linguistics, 39, 74–92.
Mereu, D., & Vietti, A. (2021). Dialogic ItAlian: The creation of a corpus of Italian spontaneous speech. Speech Communication, 130, 1–14.
Sobrero, A., & Tempesta, I. (2007). Definizione delle caratteristiche generali del corpus: informatori, località. Documento di progetto. http://www.clips.unina.it/it/documenti.jsp
Spina, S. (2014). Il Perugia Corpus: Una risorsa di riferimento per l’italiano. Composizione, annotazione e valutazione. In R. Basili, A. Lenci, & B. Magnini (Eds.), Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014. Vol. 1 (pp. 197–202). Pisa University Press.
Vietti, A. (2005) Come gli immigrati cambiano l’italiano. L’italiano di peruviane come varietà etnica. Milano: Angeli.
Voghera, M., Iacobini, C., Savy, R., Cutugno, F., Alfano, I., & Rosa, A. (2014). VoLIP: A Searchable Corpus of Spoken Italian. In L. Veselovská & M. Janebová (Eds.), Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium: Language Use and Linguistic Structure (pp. 628–640). Palacký University.
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., ... Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018.