Corpora

Arabic 

•       Latifa Al-Sulaiti’s List of Arabic Corpora Online

•       arabiCorpus: http://arabicorpus.byu.edu

•       Leipzig Corpora Collection, corpora for 230 languages

•       Leeds Collection of Internet Corpora

•       Quranic Arabic Corpus

 

Catalan

•       Corpus del català contemporani: http://www.ub.edu/cccub/

•       Leipzig Corpora Collection, corpora for 230 languages

•       Corpus Textual Informatizat de la Llengua Catalana (CTILC): http://ctilc.iec.cat/


Chinese 

•       Leipzig Corpora Collection, corpora for 230 languages

•       Leeds Collection of Internet Corpora

 

Finnish

•       The Advanced Finnish Learners’ Corpus: http://www.utu.fi/fi/yksikot/hum/yksikot/suomi-sgr/tutkimus/tutkimushankkeet/las2/Sivut/home.aspx

•       Leipzig Corpora Collection, corpora for 230 languages

•       HANCO, the Helsinki Annotated Corpus: http://www.ling.helsinki.fi/projects/hanco/index_e.html

•       Institute for the Languages of Finland’s corpora: http://www.kotus.fi/collections

•       Leeds Collection of Internet Corpora

•       Oulun Korpus (429,058 words from 5,800 texts drawn from literary works, radio broadcast transcriptions, advertising, and Finnish newspaper and magazine articles): https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/KielipankkiAineistotOulu

 

French

•       ARTFL Project: http://artfl-project.uchicago.edu/projects/LFA/

•       C-ORAL-ROM corpus, which samples formal and informal spoken language: http://lablita.dit.unifi.it/coralrom/

•       Leipzig Corpora Collection, corpora for 230 languages

•       Corpus de Référence du Français parlé: http://sites.univ-provence.fr/delic/corpus/index.html

•       Diachronic corpus, Frantext: http://www.frantext.fr/

•       Leeds Collection of Internet Corpora

•       http://www.llas.ac.uk/resources/mb/80, a corpus of spontaneous interviews

 

Japanese

•       Leipzig Corpora Collection, corpora for 230 languages

•       Corpus of Spontaneous Japanese: http://www.ninjal.ac.jp/english/products/csj/

•       BCCWJ: Balanced Corpus of Contemporary Written Japanese (KOTONOHA): http://www.kotonoha.gr.jp/shonagon/

•       Leeds Collection of Internet Corpora

•       Japanese Speech Corpora of Major City Dialects: http://www.age.ne.jp/x/oswcjlrc/tahara/jcmd.htm


English 

•       ACE, Australian Corpus of English: http://icame.uib.no/ace/aceman.htm

•       Diachronic: The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

•       http://www.natcorp.ox.ac.uk/ (British National Corpus, University of Oxford)

•       http://www.corpora4learning.net/resources/corpora.html, with links to the main corpora of varieties of English

•       Diachronic: Corpus of Early English Correspondence (CEEC) http://www.helsinki.fi/varieng/domains/CEEC.html

•       The Bergen Corpus of London Teenage Language (COLT): http://www.hit.uib.no/colt/

•       Leipzig Corpora Collection, corpora for 230 languages

•       Corpus of Spoken Professional American English: http://www.athel.com/cspa.html

•       Diachronic corpus: The Early Modern English Dictionaries Database (EMEDD) http://homes.chass.utoronto.ca/~ian/emedd.html

•       Corpus of Indian English: http://icame.uib.no/kolhapur/kolman.htm

•       Diachronic: Lampeter Corpus of Early Modern English http://khnt.hit.uib.no/icame/manuals/LAMPETER/LAMPHOME.HTM

•       Leeds Collection of Internet Corpora

•       London-Lund Corpus of spoken British English: http://khnt.hit.uib.no/icame/manuals/LONDLUND/INDEX.HTM

•       Diachronic: Penn Corpora of Historical English http://www.ling.upenn.edu/hist-corpora/

•       Wellington Corpus of Spoken New Zealand English: http://icame.uib.no/wsc/index.htm


Italian 

•       BAdIP (database of spoken Italian): http://badip.uni-graz.at/

•       Digital library of Italian literary texts, sortable by author, period, genre, and other parameters: http://www.bibliotecaitaliana.it/

•       Corpora and Lexicons of Spoken and Written Italian (CLIPS): http://www.clips.unina.it/it/corpus.jsp

•       Corpus and frequency lexicon of written Italian (CoLFIS): http://www.istc.cnr.it/grouppage/colfis

•       C-ORAL-ROM corpus, which samples formal and informal speech: http://lablita.dit.unifi.it/coralrom/

•       CORIS/CODIS, a corpus of written Italian: http://corpora.dslo.unibo.it/coris_ita.html

•       Leipzig Corpora Collection, corpora for 230 languages

•       The Old Italian corpus of the Opera del Vocabolario Italiano (OVI), with about 22 million words from vernacular texts written before 1375: http://www.ovi.cnr.it/index.php?page=banchedati

•       la Repubblica corpus, journalistic Italian: http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica

•       Corpus Taurinense, thirteenth-century texts (about 260,000 words): http://www.bmanuel.org/projects/ct-HOME.html

•       Audio lectures on grammar, available for download in MP3 format: http://www.gaudio.org/lezioni/grammatica/index.htm

•       The itWaC corpus consists of texts downloaded by automated methods; with more than one and a half billion words, it is the largest available corpus of Italian to date: http://wacky.sslmit.unibo.it

•       Leeds Collection of Internet Corpora

•       The VALICO corpus (learner varieties of Italian) consists of 570,000 words from texts written by learners of Italian as a second language: http://www.bmanuel.org/projects/br-HOME.html

 

Dutch

•       Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN): http://www.mpi.nl/IMDI/overview/Overview_CGN.html

•       Leipzig Corpora Collection, corpora for 230 languages

•       DutchSemCor Project Homepage: http://www2.let.vu.nl/oz/cltl/dutchsemcor/

•       Diachronic: De Geïntegreerde TaalBank http://gtb.inl.nl/

•       INL, Schatkamer van de Nederlandse taal: http://www.inl.nl/

 

Persian

•       Bijankhan corpus, based on journalistic texts: http://ece.ut.ac.ir/dbrg/bijankhan/

•       Leipzig Corpora Collection, corpora for 230 languages

•       Hamshahri Collection, a standard collection of Persian texts: http://ece.ut.ac.ir/DBRG/Hamshahri/

•       Uppsala Persian Corpus (derived from the Bijankhan corpus): http://stp.lingfil.uu.se/~mojgan/UPC.html

 

Polish

•       Leipzig Corpora Collection, corpora for 230 languages

•       Korpus IPI PAN: http://korpus.pl/

•       Leeds Collection of Internet Corpora

•       Narodowy Korpus Języka Polskiego: http://nkjp.pl/ or http://www.nkjp.uni.lodz.pl

 

Portuguese (European and Brazilian)

•       Journalistic Portuguese corpus: http://www.linguateca.pt/cetempublico/

•       Diachronic: medieval Portuguese corpus: http://cipm.fcsh.unl.pt/

•       English-Portuguese parallel corpus (COMPARA): http://www.linguateca.pt/COMPARA/Welcome.html

•       C-ORAL-ROM corpus, which samples formal and informal speech: http://lablita.dit.unifi.it/coralrom/

•       Leipzig Corpora Collection, corpora for 230 languages

•       Corpus do Português: http://www.corpusdoportugues.org/

•       Leeds Collection of Internet Corpora

•       Tycho Brahe Parsed Corpus of Historical Portuguese: http://www.tycho.iel.unicamp.br/~tycho/corpus/index.html

 

Russian 

•       Leipzig Corpora Collection, corpora for 230 languages

•       Leeds Collection of Internet Corpora

•       National Corpus of Written Russian: http://narusco.ru/

•       Russian National Corpus: www.ruscorpora.ru

 

Spanish

•       CODEA 2011, Old Spanish documents

•       C-ORAL-ROM corpus, which samples formal and informal speech: http://lablita.dit.unifi.it/coralrom/

•       Leipzig Corpora Collection, corpora for 230 languages

•       COSER, Corpus Oral y Sonoro del Español Rural

•       Corpus del Español, Mark Davies, Brigham Young University

•       Leeds Collection of Internet Corpora

•       Real Academia Española, Banco de datos section: CORPES XXI (Corpus del Español del Siglo XXI), CDH (Corpus del Nuevo diccionario histórico del español), CREA (Corpus de Referencia del Español Actual), CORDE (Corpus Diacrónico del Español), Fichero General de la Real Academia Española


German 

•       Spoken language corpus of the IDS (Institut für Deutsche Sprache); it is diachronic (divided into five-year periods) and diatopic (based on the location of the publishers), and is frequently updated

•       Leipzig Corpora Collection, corpora for 230 languages

•       List of corpora at IMS (Institut für Maschinelle Sprachverarbeitung), Universität Stuttgart

•       Leeds Collection of Internet Corpora

•       LIMAS-Korpus

 

Hungarian

•       Leipzig Corpora Collection, corpora for 230 languages

•       Hunglish Corpus, a sentence-aligned English-Hungarian corpus

•       Hungarian Webcorpus

•       morphdb.hu: Hungarian lexical database and morphological grammar

•       www.nytud.hu, with access to various corpora, including the Hungarian National Corpus, a large open-access corpus

•       Szeged Corpus: a Hungarian corpus annotated for natural language processing