Text tokenization divides text into meaningful units such as words, sentences, and paragraphs. Many languages, including Hebrew, mark word boundaries explicitly (with spaces and some punctuation marks) and sentence boundaries (with periods), but these markers are sometimes ambiguous: a period, for example, may end a sentence or belong to an abbreviation or a number.
The MILA Hebrew Tokenization Tool divides undotted Hebrew input text into tokens, sentences, and paragraphs. Its XML output follows MILA's standards for corpora.
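To make the three levels of segmentation concrete, here is a minimal Python sketch of a naive rule-based segmenter. It is not the MILA tool's algorithm and does not produce MILA's XML format; the function names (`split_paragraphs`, `split_sentences`, `tokenize`, `segment`) and the rules themselves are assumptions chosen for illustration only. It also shows why boundary markers are ambiguous: a period after an initial is wrongly treated as a sentence end.

```python
import re

# Illustrative sketch only: hypothetical helpers, not the MILA tool's rules.
# A token is a run of word characters (optionally joined by a quote, geresh,
# or gershayim, as in Hebrew acronyms) or a single punctuation mark.
TOKEN_RE = re.compile(r'''\w+(?:["'״׳]\w+)*|[^\w\s]''')


def split_paragraphs(text):
    """Split on blank lines; drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naively break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return TOKEN_RE.findall(sentence)


def segment(text):
    """Return paragraphs -> sentences -> tokens as nested lists."""
    return [
        [tokenize(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]


if __name__ == "__main__":
    sample = "שלום עולם. מה נשמע?\n\nזו פסקה שנייה."
    for paragraph in segment(sample):
        print(paragraph)

    # The naive rule misfires on an initial followed by a period:
    # this single sentence is split in two.
    print(split_sentences("י. כהן הגיע."))
```

A real tokenizer needs extra knowledge, such as abbreviation lists or context, to resolve cases like the last one; the sketch deliberately leaves the ambiguity unresolved to show where simple rules break down.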
Online Demo (segments into tokens).
Segments input into tokens, sentences, and paragraphs.
XML output follows MILA's XML standards for corpora.
- Developed by Dalia Bojan.
- Maintained by Slava Demender, MILA Research Engineer (contact).