cc.factorie.app.nlp

Document

class Document extends DocumentSubstring with Attr

A Document holds a String containing the original raw string contents of a natural language document to be processed. The Document also holds a sequence of Sections, each of which is delineated by character offsets into the Document's string, and each of which contains a sequence of Tokens, Sentences and other TokenSpans which may be annotated.

Documents may be constructed with their full string contents, or they may have their string contents augmented by the appendString method.

Documents also have an optional "name" which can be set by Document.setName. This is typically used to hold a filename in the file system, or some other similar identifier.
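As a small illustration of naming, the sketch below sets a name and reads it back. It uses only the constructor, setName and name members listed on this page and assumes FACTORIE is on the classpath; the object NameExample, the helper namedDocument and the file name are made up for the example.

```scala
import cc.factorie.app.nlp.Document

object NameExample {
  /** Build a Document from `text` and label it with `name` (hypothetical helper). */
  def namedDocument(text: String, name: String): Document =
    // setName returns the Document itself, so it can be chained onto the constructor.
    new Document(text).setName(name)

  def main(args: Array[String]): Unit = {
    val doc = namedDocument("The quick brown fox.", "fox.txt") // illustrative file name
    println(doc.name)
  }
}
```

Because setName returns the Document, the call can sit at the end of whatever expression constructs the Document, which keeps loading code compact.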

The Document.stringLength method may be a faster alternative to Document.string.length when you are in the middle of multiple appendString calls because it will efficiently use the underlying string buffer length, rather than flushing the buffer to create a string.
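A minimal sketch of the append pattern described above, assuming FACTORIE is on the classpath. It relies only on appendString, stringLength and string as documented on this page; the object AppendExample and the helper appendAll are hypothetical names introduced for illustration.

```scala
import cc.factorie.app.nlp.Document

object AppendExample {
  /** Append each chunk in turn and return the offset at which each chunk starts. */
  def appendAll(doc: Document, chunks: List[String]): List[Int] =
    // appendString returns the Document's string length before the append,
    // which is exactly the start offset of the newly appended chunk.
    chunks.map(chunk => doc.appendString(chunk))

  def main(args: Array[String]): Unit = {
    val doc = new Document("Title\n")
    val offsets = appendAll(doc, List("First line.\n", "Second line.\n"))

    // While still appending, query the length through stringLength
    // rather than string.length, as recommended above.
    println("chunk offsets: " + offsets.mkString(", "))
    println("length so far: " + doc.stringLength)

    // Ask for the full string once, after all appends are done.
    print(doc.string)
  }
}
```

The returned offsets can be kept as the character positions of each chunk, for example to delineate Sections later.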

The canonical sequence of Sections in the Document is available through the Document.sections method.

By default the canonical sequence of Sections holds a single Section that covers the entire string contents of the Document (even as the Document grows). This canonical sequence of Sections may be modified by the user, but this special all-encompassing Section instance will always be available as Document.asSection.
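The default behaviour can be checked with a short sketch like the following, which assumes FACTORIE is on the classpath and uses only sections, asSection and clearSections as documented on this page. The object DefaultSectionExample and the helper hasOnlyDefaultSection are hypothetical names.

```scala
import cc.factorie.app.nlp.Document

object DefaultSectionExample {
  /** True when the canonical sequence holds exactly the all-encompassing Section. */
  def hasOnlyDefaultSection(doc: Document): Boolean =
    doc.sections.length == 1 && (doc.sections.head eq doc.asSection)

  def main(args: Array[String]): Unit = {
    val doc = new Document("A short document.")
    println("default section only: " + hasOnlyDefaultSection(doc))

    // Emptying the canonical sequence does not discard the special Section:
    // it is still reachable through asSection.
    doc.clearSections()
    println("canonical sections after clear: " + doc.sections.length)
    println("asSection still available: " + (doc.asSection ne null))
  }
}
```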

Even though Tokens, Sentences and TokenSpans are really stored in the Sections, Document has basic convenience methods for obtaining iterable collections of these by concatenating them from the canonical sequence of Sections. These iterable collections are of type Iterable[Token], not Seq[Token], however. If you need the Tokens as a Seq[Token] rather than an Iterable[Token], or you need more advanced queries for TokenSpan types, you should use methods on a Section, not on the Document. In this case typical processing looks like: "for (section <- document.sections) section.tokens.someMethodOnSeq()...".
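The per-Section processing pattern can be sketched as follows. This assumes FACTORIE is on the classpath and that Section.tokens returns a Seq[Token], as the paragraph above implies; the object SectionTokensExample and its two helpers are hypothetical names. On a Document that has not been tokenized yet, the collections are simply empty.

```scala
import cc.factorie.app.nlp.{Document, Token}

object SectionTokensExample {
  /** Collect the Tokens of every canonical Section into one Seq, in Section order. */
  def tokensAsSeq(doc: Document): Seq[Token] =
    doc.sections.flatMap(section => section.tokens)

  /** Number of Tokens in each canonical Section, using a Seq method on the Section. */
  def tokensPerSection(doc: Document): Seq[Int] =
    doc.sections.map(section => section.tokens.length)

  def main(args: Array[String]): Unit = {
    // No tokenizer is run here, so the counts reflect an unannotated Document.
    val doc = new Document("Tokens live in Sections, not in the Document.")
    println("tokens per section: " + tokensPerSection(doc).mkString(", "))
    println("total tokens: " + tokensAsSeq(doc).length)
  }
}
```

Note that if Sections overlap, a concatenation like tokensAsSeq will contain Tokens from each overlapping Section, so per-Section processing is usually the safer choice.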

Linear Supertypes

Attr, DocumentSubstring, AnyRef, Any

Instance Constructors

  1. new Document(stringContents: String)

    Create a new Document, initializing it to have contents given by the argument.

  2. new Document()

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. def +=(s: Section): Buffer[Section]

    Add a new Section to this Document's canonical list of Sections.

  5. def -=(s: Section): Buffer[Section]

    Remove a Section from this Document's canonical list of Sections.

  6. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  8. def annotatorFor(c: Class[_]): Option[Class[_]]

    Optionally return the DocumentAnnotator that produced the annotation of class 'c' within this Document.

  9. val annotators: LinkedHashMap[Class[_], Class[_]]

    The collection of DocumentAnnotators that have been run on this Document, kept as a record of which DocumentAnnotator produced which annotation. This is a Map from the annotation class to the DocumentAnnotator that produced it, for example from classOf[cc.factorie.app.nlp.pos.PennPos] to classOf[cc.factorie.app.nlp.pos.ChainPosTagger]. Note that this map records annotations placed not just on the Document itself, but also on its constituents, such as NounPhraseNumberLabel on NounPhrase, PennPos on Token, ParseTree on Sentence, etc.

  10. def appendString(s: String): Int

    Append the string 's' to this Document.

    returns

    the length of the Document's string before string 's' was appended.

  11. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  12. val asSection: Section

    A predefined Section that covers the entirety of the Document string, and even grows as the length of this Document may grow. If the user does not explicitly add Sections to the document, this Section is the only one returned by the "sections" method.

  13. object attr

    A collection of attributes, keyed by the attribute class.

  14. def clearSections(): Unit

    Remove all Sections from this Document's canonical list of Sections.

  15. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  16. def document: Document

    A method required by the DocumentSubstring trait, which in this case simply returns this Document itself.

    Definition Classes
    Document → DocumentSubstring
  17. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  19. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  20. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  21. def hasAnnotation(c: Class[_]): Boolean

    Return true if an annotation of class 'c' has been placed somewhere within this Document.

  22. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  23. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  24. def name: String

    Return the "name" assigned to this Document by the 'setName' method. This may be any String, but is typically a filename or other similar identifier.

  25. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  26. final def notify(): Unit

    Definition Classes
    AnyRef
  27. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  28. def owplString(annotator: DocumentAnnotator): String

    Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated from the 'annotator.tokenAnnotationString' method.

  29. def owplString(attributes: Iterable[(Token) ⇒ Any]): String

    Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated as specified by the argument.

  30. def sections: Seq[Section]

    The canonical list of Sections containing the tokens of the document. The user may create and add Sections covering various substrings within the Document. If the user does not explicitly add any Sections, by default there will be one Section that covers the entire Document string; this one Section is the one returned by "Document.asSection". Note that Sections may overlap with each other, representing alternative tokenizations or annotations.

  31. def sentenceCount: Int

    An efficient way to get the total number of Sentences in the canonical Sections of this Document.

  32. def sentences: Iterable[Sentence]

    Return an Iterable collection of all Sentences in all canonical Sections of this Document.

  33. def setName(s: String): Document.this.type

    Set the value that will be returned by the 'name' method. It accomplishes this by setting the DocumentName attr on the Document. If the String argument is null, it removes the DocumentName attr if present.

  34. def string: String

    The string contents of this Document.

    Definition Classes
    Document → DocumentSubstring
  35. def stringEnd: Int

    A method required by the DocumentSubstring trait, which in this case simply returns Document.stringLength.

    Definition Classes
    Document → DocumentSubstring
  36. def stringLength: Int

    The number of characters in this Document's string. Use this instead of Document.string.length because it is more efficient when the Document's string is growing with appendString.

  37. def stringStart: Int

    A method required by the DocumentSubstring trait, which in this case simply returns 0.

    Definition Classes
    Document → DocumentSubstring
  38. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  39. def toString(): String

    Definition Classes
    AnyRef → Any
  40. def tokenCount: Int

    An efficient way to get the total number of Tokens in the canonical Sections of this Document.

  41. def tokens: Iterable[Token]

    Return an Iterable collection of all Tokens in all canonical Sections of this Document.

  42. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  43. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  44. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
