Add a new Section to this Document's canonical list of Sections.
Remove a Section from this Document's canonical list of Sections.
Optionally return the DocumentAnnotator that produced the annotation of class 'c' within this Document.
The collection of DocumentAnnotators that have been run on this Document, kept as a record of which DocumentAnnotator produced which annotations. It is a Map from the annotation class to the DocumentAnnotator that produced it, for example from classOf[cc.factorie.app.nlp.pos.PennPos] to classOf[cc.factorie.app.nlp.pos.ChainPosTagger]. Note that this map records annotations placed not just on the Document itself, but also on its constituents, such as NounPhraseNumberLabel on NounPhrase, PennPos on Token, ParseTree on Sentence, etc.
Append the string 's' to this Document.
Returns the length of the Document's string before the string 's' was appended.
A predefined Section that covers the entirety of the Document string and grows along with it as the Document's string grows. If the user does not explicitly add Sections to the document, this Section is the only one returned by the "sections" method.
A collection of attributes, keyed by the attribute class.
Remove all Sections from this Document's canonical list of Sections.
A method required by the DocumentSubstring trait, which in this case simply returns this Document itself.
Return true if an annotation of class 'c' has been placed somewhere within this Document.
Return the "name" assigned to this Document by the 'setName' method. This may be any String, but is typically a filename or other similar identifier.
Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated from the 'annotator.tokenAnnotationString' method.
Return a String containing the Token strings in the document, formatted with one-word-per-line and various tab-separated attributes appended on each line, generated as specified by the argument.
The canonical list of Sections containing the tokens of the document. The user may create and add Sections covering various substrings within the Document. If the user does not explicitly add any Sections, by default there will be one Section that covers the entire Document string; this one Section is the one returned by "Document.asSection". Note that Sections may overlap with each other, representing alternative tokenizations or annotations.
An efficient way to get the total number of Sentences in the canonical Sections of this Document.
Return an Iterable collection of all Sentences in all canonical Sections of this Document.
Set the value that will be returned by the 'name' method. It accomplishes this by setting the DocumentName attr on the Document. If the String argument is null, it removes the DocumentName attr if present.
The string contents of this Document.
A method required by the DocumentSubstring trait, which in this case simply returns Document.stringLength.
The number of characters in this Document's string. Use this instead of Document.string.length because it is more efficient when the Document's string is growing with appendString.
A method required by the DocumentSubstring trait, which in this case simply returns 0.
An efficient way to get the total number of Tokens in the canonical Sections of this Document.
Return an Iterable collection of all Tokens in all canonical Sections of this Document.
A Document holds a String containing the original raw string contents of a natural language document to be processed. The Document also holds a sequence of Sections, each of which is delineated by character offsets into the Document's string, and each of which contains a sequence of Tokens, Sentences and other TokenSpans which may be annotated.
Documents may be constructed with their full string contents, or they may have their string contents augmented by the appendString method.
Documents also have an optional "name" which can be set by Document.setName. This is typically used to hold a filename in the file system, or some other similar identifier.
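The name mechanism described above (a name stored as a class-keyed attribute, removed again when null is passed) can be illustrated with a small self-contained sketch. The types NameAttr and NamedThing are hypothetical stand-ins invented for this example; they are not FACTORIE's DocumentName or Attr implementation, and returning an Option from 'name' is a choice made only for the sketch.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an attribute class that holds a name.
final case class NameAttr(value: String)

// Hypothetical stand-in for an object with class-keyed attributes.
final class NamedThing {
  // Attributes keyed by their class, with at most one value per class.
  private val attr = mutable.Map.empty[Class[_], Any]

  /** Stores the name as an attribute; a null argument removes it instead. */
  def setName(s: String): this.type = {
    if (s eq null) attr.remove(classOf[NameAttr])
    else attr.update(classOf[NameAttr], NameAttr(s))
    this
  }

  /** Sketch-only choice: a missing name is reported as None. */
  def name: Option[String] =
    attr.get(classOf[NameAttr]).collect { case NameAttr(v) => v }
}

object NamedThingDemo {
  def main(args: Array[String]): Unit = {
    val doc = new NamedThing().setName("report.txt")
    println(doc.name)               // the stored name
    println(doc.setName(null).name) // the name attribute has been removed
  }
}
```

Keying the attribute by its class means setting a name twice simply overwrites the earlier value, which matches the "optional, at most one name" behaviour described above.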
The Document.stringLength method may be a faster alternative to Document.string.length when you are in the middle of multiple appendString calls because it will efficiently use the underlying string buffer length, rather than flushing the buffer to create a string.
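To make the cost difference concrete, here is a minimal self-contained sketch of a buffer-backed string. The class GrowableText is a hypothetical stand-in written for this example, not FACTORIE's Document code; only the method names appendString, string and stringLength are taken from the description above.

```scala
// Hypothetical stand-in illustrating the buffering behaviour described above.
final class GrowableText(initial: String = "") {
  // Immutable view of the contents; null while appends are pending.
  private var flushed: String = initial
  // Mutable buffer; only allocated while appends are pending.
  private var buffer: java.lang.StringBuilder = null

  /** Appends s and returns the length of the contents before the append. */
  def appendString(s: String): Int = {
    if (buffer eq null) buffer = new java.lang.StringBuilder(flushed)
    val before = buffer.length()
    buffer.append(s)
    flushed = null // the immutable view is now stale
    before
  }

  /** Flushes the buffer into an immutable String (the expensive step). */
  def string: String = {
    if (flushed eq null) { flushed = buffer.toString; buffer = null }
    flushed
  }

  /** Reads the length from whichever representation is current, without flushing. */
  def stringLength: Int =
    if (flushed ne null) flushed.length else buffer.length()
}

object GrowableTextDemo {
  def main(args: Array[String]): Unit = {
    val text = new GrowableText("Hello")
    val offset = text.appendString(", world")
    println(offset)            // 5, the length before the append
    println(text.stringLength) // 12, read from the buffer without flushing
    println(text.string)       // Hello, world
  }
}
```

In this sketch, calling string between appends copies the whole buffer each time, while stringLength never copies anything. That is the reason to prefer the length accessor inside an append loop.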
The canonical sequence of Sections in the Document is available through the Document.sections method.
By default the canonical sequence of Sections holds a single Section that covers the entire string contents of the Document (even as the Document grows). This canonical sequence of Sections may be modified by the user, but this special all-encompassing Section instance will always be available as Document.asSection.
Even though Tokens, Sentences and TokenSpans are really stored in the Sections, Document has basic convenience methods for obtaining iterable collections of these by concatenating them from the canonical sequence of Sections. These iterable collections are of type Iterable[Token], not Seq[Token], however. If you need the Tokens as a Seq[Token] rather than an Iterable[Token], or you need more advanced queries for TokenSpan types, you should use methods on a Section, not on the Document. In this case typical processing looks like: "for (section <- document.sections) section.tokens.someMethodOnSeq()...".
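The difference between the Document-level Iterable and the Section-level Seq can be sketched with simplified stand-ins. SectionSketch and DocumentSketch below are hypothetical types that use plain Strings in place of Tokens; they only mirror the shape described above and are not the FACTORIE classes.

```scala
// Hypothetical stand-in for a Section: tokens are stored here, as an indexed sequence.
final case class SectionSketch(tokens: IndexedSeq[String])

// Hypothetical stand-in for a Document: it only concatenates what its sections hold.
final class DocumentSketch(val sections: Seq[SectionSketch]) {
  /** All tokens of all sections, exposed only as an Iterable. */
  def tokens: Iterable[String] = sections.flatMap(_.tokens)

  /** Token count computed from the sections without building a new collection of tokens. */
  def tokenCount: Int = sections.map(_.tokens.length).sum
}

object SectionSketchDemo {
  def main(args: Array[String]): Unit = {
    val doc = new DocumentSketch(Seq(
      SectionSketch(Vector("Title", "line")),
      SectionSketch(Vector("Body", "text", "here"))))

    // Document level: suitable for a single pass over every token.
    println(doc.tokens.mkString(" "))

    // Section level: indexed access and other Seq operations are available.
    for (section <- doc.sections)
      println(section.tokens(section.tokens.length - 1))
  }
}
```

The per-section loop is the pattern to reach for whenever positional access matters, because token positions are only meaningful within the Section that owns the tokens.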