User: apassos Date: 8/19/13 Time: 2:00 PM
Concatenates words that were split by hyphens in the original text, based on a user-provided dictionary or on other words in the same document.
Segments a sequence of tokens into sentences.
Splits a String into a sequence of Tokens.
A linear-chain CRF model for Chinese word segmentation with four companion objects, each pre-trained on a different corpus that corresponds to a different variety of written Mandarin.
A sequence of sections which are tokenized as phrases.
A tokenizer that merges existing tokens when together they form one of the given phrases.
Clean up Token.
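
To make two of the descriptions above more concrete (merging tokens that form a given phrase, and rejoining words split by hyphens), here is a minimal sketch in plain Python. It is not this library's API: the function names `merge_phrases` and `dehyphenate` are hypothetical, tokens are assumed to be plain strings, and a hyphen-split word is assumed to arrive as a fragment ending in "-" followed by its continuation. The phrase merger uses greedy longest-match, which is one reasonable choice and not necessarily what the original implementation does.

```python
def merge_phrases(tokens, phrases):
    """Merge runs of tokens that spell out a known phrase (longest match first)."""
    # Assumption: phrases are given as whitespace-separated strings.
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def dehyphenate(tokens, dictionary=frozenset()):
    """Join 'foo-' + 'bar' into 'foobar' when the joined word is known."""
    # Known words: the user-provided dictionary plus every token in the document.
    known = {w.lower() for w in dictionary} | {t.lower() for t in tokens}
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.endswith("-") and len(tok) > 1 and i + 1 < len(tokens):
            joined = tok[:-1] + tokens[i + 1]
            if joined.lower() in known:
                out.append(joined)
                i += 2
                continue
        out.append(tok)
        i += 1
    return out
```

For example, `merge_phrases("I live in New York City".split(), ["New York", "New York City"])` prefers the three-token phrase over the two-token one, and `dehyphenate` leaves a fragment such as "well-" untouched when the joined form appears neither in the dictionary nor elsewhere in the document.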