Document

A Document holds a String containing the original raw string contents of a natural language document to be processed. The Document also holds a sequence of Sections, each of which is delineated by character offsets into the Document's string, and each of which contains a sequence of Tokens, Sentences and other TokenSpans which may be annotated.

Documents may be constructed with their full string contents, or they may have their string contents augmented by the appendString method.

Documents also have an optional "name" which can be set by Document.setName. This is typically used to hold a filename in the file system, or some other similar identifier.

The Document.stringLength method may be a faster alternative to Document.string.length when you are in the middle of multiple appendString calls because it will efficiently use the underlying string buffer length, rather than flushing the buffer to create a string.
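As a minimal sketch of incremental construction, assuming FACTORIE is on the classpath, the following appends chunks to an empty Document and collects the offset returned by each appendString call. The object name AppendSketch and the helper appendAll are hypothetical names introduced only for this illustration. The example reads stringLength between appends rather than string.length, so the underlying buffer is not flushed while the Document is still growing.

```scala
import cc.factorie.app.nlp.Document

object AppendSketch {
  /** Appends each chunk to `doc` and returns the character offset at which each chunk starts. */
  def appendAll(doc: Document, chunks: Seq[String]): Vector[Int] =
    // appendString returns the Document's length before the chunk was appended,
    // which is exactly the start offset of that chunk.
    chunks.toVector.map(chunk => doc.appendString(chunk))

  def main(args: Array[String]): Unit = {
    val doc = new Document()
    val offsets = appendAll(doc, Seq("Hello ", "world."))
    println(offsets.mkString(", "))  // start offset of each chunk
    println(doc.stringLength)        // length read without building the String
    println(doc.string)              // full contents, built on demand
  }
}
```

The returned offsets can be kept alongside the chunks if later code needs to map a character position back to the chunk it came from.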

The canonical sequence of Sections in the Document is available through the Document.sections method.

By default the canonical sequence of Sections holds a single Section that covers the entire string contents of the Document (even as the Document grows). This canonical sequence of Sections may be modified by the user, but this special all-encompassing Section instance will always be available as Document.asSection.
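The default behavior described above can be checked with a short sketch, assuming FACTORIE is on the classpath. The object DefaultSectionSketch and the helper hasOnlyDefaultSection are hypothetical names, not part of the library.

```scala
import cc.factorie.app.nlp.Document

object DefaultSectionSketch {
  /** True when the canonical sequence is exactly the built-in all-encompassing Section. */
  def hasOnlyDefaultSection(doc: Document): Boolean =
    doc.sections.length == 1 && (doc.sections.head eq doc.asSection)
}
```

A freshly constructed Document should satisfy this check. After clearSections() it no longer does, although the all-encompassing Section remains reachable through Document.asSection.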

Even though Tokens, Sentences and TokenSpans are really stored in the Sections, Document has basic convenience methods for obtaining iterable collections of these by concatenating them from the canonical sequence of Sections. These iterable collections are of type Iterable[Token], not Seq[Token], however. If you need the Tokens as a Seq[Token] rather than an Iterable[Token], or you need more advanced queries for TokenSpan types, you should use methods on a Section, not on the Document. In this case typical processing looks like: "for (section <- document.sections) section.tokens.someMethodOnSeq()...".
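The per-Section pattern above might look like the following in practice. This is a sketch under two assumptions: FACTORIE is on the classpath, and Token exposes a string method returning its text. The object SectionTokensSketch and the helper tokenStringAt are hypothetical. Indexed lookup is used because it needs a Seq and is not available on the Iterable[Token] returned by Document.tokens.

```scala
import cc.factorie.app.nlp.Document

object SectionTokensSketch {
  /** For each canonical Section, the string of its i-th Token, or None if the Section is too short. */
  def tokenStringAt(doc: Document, i: Int): Seq[Option[String]] =
    // `lift` turns indexed access into an Option, so an out-of-range index yields None
    // instead of throwing.
    doc.sections.map(section => section.tokens.lift(i).map(_.string))
}
```

On a Document that has not been tokenized, each Section has no Tokens, so every entry of the result is None.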

Linear Supertypes

Attr, DocumentSubstring, AnyRef, Any

Instance Constructors

new Document(stringContents: String)

Create a new Document, initializing it to have contents given by the argument.
new Document()

Value Members

final def !=(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def !=(arg0: Any): Boolean

Definition Classes
Any
final def ##(): Int

Definition Classes
AnyRef → Any
def +=(s: Section): Buffer[Section]

Add a new Section to this Document's canonical list of Sections.
def -=(s: Section): Buffer[Section]

Remove a Section from this Document's canonical list of Sections.
final def ==(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def ==(arg0: Any): Boolean

Definition Classes
Any
def annotatorFor(c: Class[_]): Option[Class[_]]

Optionally return the class of the DocumentAnnotator that produced the annotation of class 'c' within this Document.
val annotators: LinkedHashMap[Class[_], Class[_]]

The collection of DocumentAnnotators that have been run on this Document, kept as a record of which DocumentAnnotators produced which annotations. A Map from the annotation class to the class of the DocumentAnnotator that produced it, for example from classOf[cc.factorie.app.nlp.pos.PennPos] to classOf[cc.factorie.app.nlp.pos.ChainPosTagger]. Note that this map records annotations placed not just on the Document itself, but also on its constituents, such as NounPhraseNumberLabel on NounPhrase, PennPos on Token, ParseTree on Sentence, etc.
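One possible use of this record is to check prerequisites before running further processing. The sketch below assumes FACTORIE is on the classpath. The object AnnotationCheckSketch and its helpers missing and describe are hypothetical names introduced for illustration.

```scala
import cc.factorie.app.nlp.Document

object AnnotationCheckSketch {
  /** The annotation classes in `required` that have not yet been recorded on `doc`. */
  def missing(doc: Document, required: Seq[Class[_]]): Seq[Class[_]] =
    required.filterNot(c => doc.hasAnnotation(c))

  /** One line per recorded annotation: the annotation class name, then the annotator class name. */
  def describe(doc: Document): Seq[String] =
    doc.annotators.toSeq.map { case (annotation, annotator) =>
      annotation.getName + " <- " + annotator.getName
    }
}
```

For a single class, annotatorFor gives the same information as a lookup in this map, returned as an Option.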
def appendString(s: String): Int

Append the string 's' to this Document.
returns
the length of the Document's string before string 's' was appended.
final def asInstanceOf[T0]: T0

Definition Classes
Any
val asSection: Section

A predefined Section that covers the entirety of the Document string, and even grows as the length of this Document may grow. If the user does not explicitly add Sections to the document, this Section is the only one returned by the "sections" method.
object attr

A collection of attributes, keyed by the attribute class.
def clearSections(): Unit

Remove all Sections from this Document's canonical list of Sections.
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def document: Document

A method required by the DocumentSubstring trait, which in this case simply returns this Document itself.

Definition Classes
Document → DocumentSubstring
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hasAnnotation(c: Class[_]): Boolean

Return true if an annotation of class 'c' has been placed somewhere within this Document.
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def name: String

Return the "name" assigned to this Document by the 'setName' method. This may be any String, but is typically a filename or other similar identifier.
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def owplString(annotator: DocumentAnnotator): String

Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated from the 'annotator.tokenAnnotationString' method.
def owplString(attributes: Iterable[(Token) ⇒ Any]): String

Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated as specified by the argument.
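A call to this overload might be sketched as follows, assuming FACTORIE is on the classpath and that Token exposes a string method. The object OwplSketch, the value columns, and the helper render are hypothetical, and the two column functions are arbitrary choices made only to show the argument's shape.

```scala
import cc.factorie.app.nlp.{Document, Token}

object OwplSketch {
  // Hypothetical per-token columns: the token's lowercased string and its character length.
  val columns: Seq[Token => Any] = Seq(
    (t: Token) => t.string.toLowerCase,
    (t: Token) => t.string.length
  )

  /** Renders `doc` in one-word-per-line form, with one extra column per function in `columns`. */
  def render(doc: Document): String = doc.owplString(columns)
}
```

Because each function returns Any, columns of different types (strings, numbers, labels) can be mixed in the same call.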
def sections: Seq[Section]

The canonical list of Sections containing the tokens of the document. The user may create and add Sections covering various substrings within the Document. If the user does not explicitly add any Sections, by default there will be one Section that covers the entire Document string; this one Section is the one returned by "Document.asSection". Note that Sections may overlap with each other, representing alternative tokenizations or annotations.
def sentenceCount: Int

An efficient way to get the total number of Sentences in the canonical Sections of this Document.
def sentences: Iterable[Sentence]

Return an Iterable collection of all Sentences in all canonical Sections of this Document.
def setName(s: String): Document.this.type

Set the value that will be returned by the 'name' method. It accomplishes this by setting the DocumentName attr on Document. If the String argument is null, it will remove DocumentName attr if present.
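Because setName returns the Document itself, construction and naming can be chained. A minimal sketch, assuming FACTORIE is on the classpath; the object NamedDocumentSketch and the helper named are hypothetical:

```scala
import cc.factorie.app.nlp.Document

object NamedDocumentSketch {
  /** Builds a Document from `contents` and labels it with `name` in a single expression. */
  def named(name: String, contents: String): Document =
    new Document(contents).setName(name)
}
```

The name is then available through the 'name' method, for example to report which file a Document was read from.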
def string: String

The string contents of this Document.

Definition Classes
Document → DocumentSubstring
def stringEnd: Int

A method required by the DocumentSubstring trait, which in this case simply returns Document.stringLength.

Definition Classes
Document → DocumentSubstring
def stringLength: Int

The number of characters in this Document's string. Use this instead of Document.string.length because it is more efficient when the Document's string is growing with appendString.
def stringStart: Int

A method required by the DocumentSubstring trait, which in this case simply returns 0.

Definition Classes
Document → DocumentSubstring
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
def tokenCount: Int

An efficient way to get the total number of Tokens in the canonical Sections of this Document.
def tokens: Iterable[Token]

Return an Iterable collection of all Tokens in all canonical Sections of this Document.
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

class Document extends DocumentSubstring with Attr
