cc.factorie.app.nlp.segment

DeterministicTokenizer

class DeterministicTokenizer extends DocumentAnnotator

Splits a String into a sequence of Tokens. Aims to adhere to the tokenization rules used in OntoNotes and Penn Treebank. Note that CoNLL tokenization would use tokenizeAllDashedWords=true. Punctuation that ends a sentence should be placed alone in its own Token, so this segmentation implicitly defines sentence segmentation as well (although the DeterministicSentenceSegmenter makes a few adjustments beyond this tokenizer).

Linear Supertypes
  DocumentAnnotator, AnyRef, Any

Instance Constructors

  1. new DeterministicTokenizer(caseSensitive: Boolean = false, tokenizeSgml: Boolean = false, tokenizeNewline: Boolean = false, tokenizeAllDashedWords: Boolean = false, abbrevPreceedsLowercase: Boolean = false)
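
A minimal usage sketch, assuming FACTORIE is on the classpath. It builds one tokenizer with the default constructor arguments and one with tokenizeAllDashedWords = true (the setting the class description associates with CoNLL tokenization), then runs both through apply. The object name and the sample sentence are illustrative only. No expected output is shown, because the exact token boundaries depend on the tokenizer's patterns.

```scala
import cc.factorie.app.nlp.segment.DeterministicTokenizer

object TokenizerUsageSketch {
  // Sample text is illustrative; any String works.
  val text = "The state-of-the-art parser wasn't ready, Mr. Smith said."

  // Default settings: all five constructor flags are false.
  val defaultTokenizer = new DeterministicTokenizer()

  // CoNLL-style setting named in the class description.
  val conllTokenizer = new DeterministicTokenizer(tokenizeAllDashedWords = true)

  def main(args: Array[String]): Unit = {
    // apply(s) returns the token strings as a Seq[String].
    println(defaultTokenizer(text).mkString(" | "))
    println(conllTokenizer(text).mkString(" | "))
  }
}
```

Named arguments keep the call readable when only one of the five Boolean flags differs from its default.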

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. val abbrev: String

  7. val abbrevs: String

  8. val ap: String

  9. val ap2: String

  10. def apply(s: String): Seq[String]

    Convenience function to run the tokenizer on an arbitrary String. The implementation builds a Document internally, then maps to token strings.

  11. val apword: String

  12. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  13. val atuser: String

  14. val caps: String

  15. val catchAll: String

  16. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  17. val consonantNonAbbrevs: String

  18. val contractedWord: String

  19. val contraction: String

  20. val contraction2: String

  21. val currency: String

  22. val dash: String

  23. val dashedPrefixWord: String

  24. val dashedPrefixes: String

  25. val dashedSuffixWord: String

  26. val dashedSuffixes: String

  27. val date: String

  28. val day: String

  29. def documentAnnotationString(document: Document): String

    How the annotation of this DocumentAnnotator should be printed as extra information after a one-word-per-line (OWPL) format. If there is no document annotation, return the empty string. Used in Document.owplString.

    Definition Classes
    DocumentAnnotator
  30. val ellipsis: String

  31. val email: String

  32. val emoticon: String

  33. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  34. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  35. val filename: String

  36. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  37. val fraction: String

  38. val frphone: String

  39. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  40. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  41. val hashtag: String

  42. val honorific: String

  43. val html: String

  44. val htmlAccentedLetter: String

  45. val htmlChar: String

  46. val htmlComment: String

  47. val htmlSymbol: String

  48. val initials: String

  49. val initials2: String

  50. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  51. val latin: String

  52. val latin2: String

  53. val letter: String

  54. val mdash: String

  55. def mentionAnnotationString(mention: Mention): String

    Definition Classes
    DocumentAnnotator
  56. val month: String

  57. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  58. val newline: String

  59. val noAbbrev: String

  60. final def notify(): Unit

    Definition Classes
    AnyRef
  61. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  62. val number: String

  63. val number2: String

  64. val ordinals: String

  65. val org: String

  66. val patterns: ArrayBuffer[String]

  67. def phraseAnnotationString(phrase: Phrase): String

    Definition Classes
    DocumentAnnotator
  68. val place: String

  69. def postAttrs: Iterable[Class[_]]

  70. def prereqAttrs: Iterable[Class[_]]

  71. def process(document: Document): Document

  72. def processParallel(documents: Iterable[Document], nThreads: Int = ...): Iterable[Document]

    Definition Classes
    DocumentAnnotator
  73. def processSequential(documents: Iterable[Document]): Iterable[Document]

    Definition Classes
    DocumentAnnotator
  74. val punc: String

  75. val quote: String

  76. val repeatedPunc: String

  77. val sgml: String

  78. val sgml2: String

  79. val space: String

  80. val state: String

  81. val state2: String

  82. val suffix: String

  83. val symbol: String

  84. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  85. def toString(): String

    Definition Classes
    AnyRef → Any
  86. def tokenAnnotationString(token: Token): String

    How the annotation of this DocumentAnnotator should be printed in one-word-per-line (OWPL) format. If there is no per-token annotation, return null. Used in Document.owplString.

    Definition Classes
    DeterministicTokenizer → DocumentAnnotator
  87. val tokenRegex: Regex

  88. val tokenRegexString: String

  89. val units: String

  90. val url: String

  91. val url2: String

  92. val url3: String

  93. val usphone: String

  94. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  95. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  96. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  97. val word: String
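
The apply member is described as building a Document internally and then mapping to token strings. The sketch below spells out that path by hand with process, assuming FACTORIE is on the classpath. The Document(String) constructor, Document.tokens and Token.string are assumptions about FACTORIE's Document and Token classes, which this page does not document. tokenizeViaDocument is a hypothetical helper, not a library member.

```scala
import cc.factorie.app.nlp.Document
import cc.factorie.app.nlp.segment.DeterministicTokenizer

object ApplyEquivalenceSketch {
  val tokenizer = new DeterministicTokenizer()

  // Hypothetical helper mirroring the documented behaviour of apply:
  // wrap the String in a Document, run process, read back the token strings.
  // Assumes Document(String), Document.tokens and Token.string exist as used here.
  def tokenizeViaDocument(s: String): Seq[String] = {
    val document = tokenizer.process(new Document(s))
    document.tokens.toSeq.map(_.string)
  }

  def main(args: Array[String]): Unit = {
    val text = "Tokenizers are deterministic. Are they?"
    // Compare the convenience call with the explicit Document route.
    println(tokenizer(text) == tokenizeViaDocument(text))
  }
}
```

Going through process directly is the route to take when the Document itself is needed afterwards, for example to pass it to further annotators.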

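The members patterns (an ArrayBuffer[String]), tokenRegexString and tokenRegex, together with the caseSensitive constructor flag, suggest that the individual pattern strings are combined into a single regular expression. The following standalone sketch shows that general technique with only the Scala standard library. It is not FACTORIE's implementation: the three toy patterns, the object name and the tokenize helper are all invented for illustration.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

object AlternationTokenizerSketch {
  // Toy patterns, ordered from most to least specific (all hypothetical).
  val patterns = new ArrayBuffer[String]
  patterns += "[a-z]+'[a-z]+" // word with an internal apostrophe
  patterns += "[a-z0-9]+"     // plain alphanumeric run
  patterns += "\\S"           // catch-all: any single non-space character

  // Join into one alternation; at a given position the earliest
  // alternative that matches wins.
  val tokenRegexString: String = patterns.mkString("|")

  // Prefix the inline flag (?i) unless case-sensitive matching is requested.
  def tokenRegex(caseSensitive: Boolean): Regex =
    if (caseSensitive) tokenRegexString.r else ("(?i)" + tokenRegexString).r

  // Collect every non-overlapping match, left to right.
  def tokenize(s: String, caseSensitive: Boolean = false): Seq[String] =
    tokenRegex(caseSensitive).findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    println(tokenize("Don't stop, OK?").mkString(" | "))
    // prints: Don't | stop | , | OK | ?
  }
}
```

Because the alternatives are tried in order, the more specific patterns have to come before the catch-all. With caseSensitive = true the lowercase-only toy patterns no longer match the capital "D", so "Don't" splits into "D" and "on't".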