A collection of standard English "stop words": common words often left out of processing.
Rewritten from http://tartarus.
Return strings representing all possible character sub-sequences of length between "min" and "max", with "<" prepended and ">" appended to indicate the start and end of the input string.
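As an illustration of the description above, here is a minimal Python sketch of boundary-marked character n-grams. It is not the library's implementation: the name `char_ngrams` is hypothetical, and it assumes that "sub-sequences" means contiguous substrings and that the "<" and ">" markers are added to the input before the n-grams are extracted.

```python
from __future__ import annotations


def char_ngrams(text: str, min_len: int, max_len: int) -> list[str]:
    """Return all contiguous character n-grams of "<" + text + ">".

    Assumption: the boundary markers are part of the string that is
    sliced, so n-grams at the edges contain "<" or ">".
    """
    marked = "<" + text + ">"
    grams = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the marked string.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


print(char_ngrams("ab", 2, 3))
# prints ['<a', 'ab', 'b>', '<ab', 'ab>']
```

N-grams are grouped by length, shortest first; lengths longer than the marked string simply contribute nothing.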
For segmenting fields of a comma-separated-value file.
Implements Levenshtein distance from this String to String s2, with a specific cost for each edit operation.
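A cost-weighted Levenshtein distance can be sketched in Python as follows. This is a generic dynamic-programming version, not the library's code; the function name, the parameter names (`insert_cost`, `delete_cost`, `substitute_cost`), and their default values are assumptions made for illustration.

```python
def weighted_levenshtein(s1: str, s2: str,
                         insert_cost: float = 1.0,
                         delete_cost: float = 1.0,
                         substitute_cost: float = 1.0) -> float:
    """Minimum total cost of edits that turn s1 into s2."""
    # prev[j] holds the cost of turning the empty prefix of s1 into s2[:j].
    prev = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        curr = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            keep_or_substitute = prev[j - 1] + (0.0 if c1 == c2 else substitute_cost)
            curr.append(min(prev[j] + delete_cost,      # delete c1
                            curr[j - 1] + insert_cost,  # insert c2
                            keep_or_substitute))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("kitten", "sitting"))  # prints 3.0
```

With unequal costs the distance is generally asymmetric, which is why the direction (from this string to `s2`) matters.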
Read the entire contents of the InputStream with the given encoding, and return them as a String.
Read the entire contents of the Reader and return them as a String.
Return the input string with digits replaced: either the whole string is replaced with "<YEAR>" or "<NUM>", or only the digits are replaced with "#".
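The two replacement modes might look like the Python sketch below. The description does not say how the mode is selected or what counts as a year, so the `whole` flag and the "exactly four ASCII digits means a year" rule are assumptions made here for illustration, as is the name `replace_digits`.

```python
import re


def replace_digits(text: str, whole: bool = False) -> str:
    """Replace digits in text, either wholesale or digit by digit."""
    # Strings without any ASCII digit are returned unchanged.
    if re.search(r"[0-9]", text) is None:
        return text
    if whole:
        # Assumption: a string of exactly four digits is treated as a year.
        return "<YEAR>" if re.fullmatch(r"[0-9]{4}", text) else "<NUM>"
    # Otherwise keep the string and mask each digit individually.
    return re.sub(r"[0-9]", "#", text)


print(replace_digits("1999", whole=True))  # prints <YEAR>
print(replace_digits("42", whole=True))    # prints <NUM>
print(replace_digits("a1b22"))             # prints a#b##
```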
Return a string that captures the generic "shape" of the original word, mapping lowercase alphabetics to 'a', uppercase to 'A', digits to '1', whitespace to ' '.
Return a string that captures the generic "shape" of the original word, mapping lowercase alphabetics to 'a', uppercase to 'A', digits to '1', whitespace to ' '. Within a run of the same character class, characters beyond the first 'maxRepetitions' are skipped.
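Both word-shape variants can be sketched with one Python function. This is an illustrative version rather than the library's code: the name `word_shape` is hypothetical, `max_repetitions=None` stands in for the variant without a limit, and leaving all other characters (such as punctuation) unchanged is an assumption, since the description does not say how they are handled.

```python
from __future__ import annotations


def word_shape(word: str, max_repetitions: int | None = None) -> str:
    """Map each character to its class symbol, optionally capping runs."""
    shape = []
    last_class = None
    run_length = 0
    for ch in word:
        if ch.islower():
            cls = "a"
        elif ch.isupper():
            cls = "A"
        elif ch.isdigit():
            cls = "1"
        elif ch.isspace():
            cls = " "
        else:
            cls = ch  # assumption: other characters are kept as they are
        # Count how long the current run of one class has been.
        run_length = run_length + 1 if cls == last_class else 1
        last_class = cls
        if max_repetitions is None or run_length <= max_repetitions:
            shape.append(cls)
    return "".join(shape)


print(word_shape("Hello"))        # prints Aaaaa
print(word_shape("Hello", 2))     # prints Aaa
print(word_shape("USA-2024", 1))  # prints A-1
```

Capping runs collapses words of different lengths onto the same shape, which keeps the number of distinct shapes small.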