Corpora
Textual corpora that document language use are invaluable for research in many areas of linguistics, and for collecting the statistical information needed to build natural language processing applications. MILA has collected or acquired a number of Hebrew corpora from various domains. All are available in plain text format, and most are also available in tokenized, morphologically analyzed, and morphologically disambiguated versions.
All corpora follow the standards developed by MILA.
Corpus | Description |
---|---|
HaAretz | News and articles from the HaAretz news website, 1990-1991. |
Arutz 7 | News and articles from the Arutz 7 news website, 2001-2006. |
TheMarker | Articles from the financial newspaper TheMarker, May - October 2002. |
HaKnesset | Session protocols of the Knesset (Israeli Parliament), January 2004 - November 2005. |
Wikipedia 2013 | Articles from the Hebrew Wikipedia online encyclopedia, 2013. |
Doctors | Articles from the Doctors medical website. |
Infomed | Question and answer discussions from the Infomed website's medical forum, January 2006 - September 2007. |
Nature of Healing | Articles and forum discussions from the Nature of Healing neuropathy medical website. |
To Be Healthy | Articles and forum discussions from the To Be Healthy (L'Hiyot Bari, 2b-bari) medical website. |
Tapuz People Forum | Forum discussions from the Tapuz People website, on a variety of subjects. |
Hebrew CHILDES | Spoken Hebrew conversations among children, and between children and adults. |
Spoken Israeli Hebrew | Spoken Hebrew conversations and parts of the Corpus of Spoken Israeli Hebrew (CoSIH). |
Hebrew Dotted Text | Articles from the beginner-level Hebrew newspapers Shaar LaMatchil and Yanshuf. The text includes vowel points (niqqud/vocalization). |
Dependency parsed corpora | A dependency-parsed corpus. It is a part of the Hebrew Wikipedia corpus, with dependencies produced by Yoav Goldberg’s automatic dependency parser. |
Walla Food Corpus | Articles from the Walla Food website, 2014-2015. |
Foodpage Corpus | Articles from the Foodpage.co.il website, 2014-2015. |
Walla Sport Corpus | Articles from the Walla Sport website, 2014-2015. |
Sport5 Corpus | Articles from the Sport5 website, 2014-2015. |
Learning Man | Articles from the "Learning man in the technological era" conference. |