Note: This page is currently obsolete. See the tutorials for the correct description. For the most up-to-date code, please see examples/ in the factorie source.
Introduction
This document is intended to be the first in a series of tutorials for learning FACTORIE by example. It assumes a passing familiarity with machine learning and Gibbs sampling.
We will apply FACTORIE to the problem of information extraction, specifically named entity recognition, which consists of searching plain text documents for words and phrases that mention some person, place, organization or other "named entity." Since the mid-2000's the most common methodology for doing this is the linear-chain conditional random field (CRF). Fortunately (well, really by design), FACTORIE easily expresses CRFs since CRFs are naturally represented as factor graphs.
Our model for this problem will have two types of variables. The first are variables with observed values representing the input words. We will call each of these a Token. They come in sequences naturally appearing as sentences in the input document. Second are variables we will callLabels, one each per Token. These variables' predicted values will indicate the category of named entity referred to by each Token; the categories are PER for people names, LOC for location names, ORG for organization names and MISC for miscellaneous named entities such as nationalities and sporting events. The many "background" words that do not correspond to any named entity should have their corresponding Label assigned the value O, which stands for "other." There are actually two different label values for each named entity category. For example,B-PER marks the beginning of a person name phrase; I-PER marks the continuation or "inside" of a person name phrase.
Defining the Variables
In this first example we will work with variable classes that are already defined by the FACTORIE library. The packagecc.factorie.app.tokenseq.labeled contains a collection of variable class definitions and other functions useful for linear-chain CRFs. It defines abstract variables classes Token and Label that we will subclass here. In the next example? we will show how to create a linear-chain CRF for this NER problem without the benefit of pre-declared variable classes.
Token inherits from BinaryFeatureVectorVariable which you can think of as a traditional machine learning "feature vector." To aid the printing of diagnostics, each Token also contains a java.lang.String called word which is the original word from which it was created. (Since Token is a Scala class it can contain many such additional "instance variables". Note that "variable" here is used in the object-oriented programming sense, not the graphical model sense; hence I will call Scala variables such as word "instance members" to avoid confusion). It also has an instance member pointing to its corresponding label, called label. In addition, Token mixes in the the trait VarInSeq, which provides methods called prevand next for accessing the preceeding and proceeding Tokens in the sentence.
Label inherits from LabelVariable, which in turn inherits from CategoricalVariable with TrueSetting. Such variables store not only a current assigned value from a finite set of categories, but also independently store their original "true" value as indicated in labeled data. Instances of the Label class also have an instance member called token for accessing its corresponding token. Label's prev and next methods are then implemented in terms of the corresponding methods of its associated Tokens.
The predefined classes tokenseq.labeled.Token and tokenseq.labeled.Label have all the functionality we need, but we must subclass them in our own model so that our token and label classes will get their own unique domain (i.e. "feature vocabulary space"). This would allow us to create more than one linear-chain CRF in the same executable each with different label and token-feature sets.
We also use the predefined class tokenseq.labeled.Sentence to hold sequences of Tokens.
We subclass our own tokens, labels and sentences as follows:
class Token(word:String, labelString:String) extends labeled.Token[Sentence,Label,Token](word) {
val label = new Label(labelString, this)
}
class Label(tag:String, token:Token) extends labeled.Label[Sentence,Token,Label](tag, token)
class Sentence extends labeled.TokenSeq[Token,Label,Sentence]We also call our sub-classed token Token. The first constructor argument after class Token is the word string. The second is the true string value to be assigned to its corresponding label. The first argument is passed to the super class constructor at the end of the first line. The second argument is used in the construction of the label "instance member" of the token. In Scala square brackets are used to indicate type parameters (somewhat like Java's <> bracketing. The square brackets after the superclass name contain type parameters that tell the superclass about the pairing of our newly defined Token, Label and Sentence classes.
We also call our sub-classed label Label. It passes both constructor arguments to its super-class constructor.
Defining the Model
We can define our model using FACTORIE's entity-relationship language (whose general underpinnings are implemented in cc.factorie.er). Here is a model definition:
// Create the model. It consists of two factor templates
val model = new Model(
Foreach[Label] { label => Score(label) },
Foreach[Label] { label => Score(label.prev, label, label.token) }
)
In FACTORIE a model is simply a list of factor templates. (A "model" is not "data"; it is not a "learning method" or even an "inference procedure". These are all flexibly specified independently from the model. Factor templates do sometimes contain parameters, however.)
We can construct a new model, passing an arbitrary-length list of factor templates as the Model constructor arguments. This model has two factor templates, each constructed by the Foreach expression shown. The first factor template examines only individual Label variables, and has parameters for every possible value Labels can take on. The second factor template uses methods available on a Label to find other related variables to examine together when scoring its preferences. It has as many parameters as the cross-product between two Labels and one Token. (The parameterization of these templates is dense; sparse alternatives are also available.)
For labels at the beginning of a sentence label.prev may not exist. This case is automatically correctly handled by the entity-relationship language for template construction.
These two factor templates are simple, reasonable choices. We might like to experiment with alternative or additional templates. For example the following template would make the model 2nd-order Markov:
Foreach[Label] { label => Score(label.prev.prev, label.prev, label) }Alternatively we could also make the model 2nd-order Markov jointly with the Token features:
Foreach[Label] { label => Score(label.prev.prev, label.prev, label, label.token) }The number of parameters in this (densely-parameterized) factor will be the size of the label domain cubed, times the size of the Token's feature domain. This could be quite big. We can give the template a sparse parameterization (in which there will be parameters only for those cross-product combinations encountered during training) as follows:
Foreach[Label] { label => SparseScore(label.prev.prev, label.prev, label, label.token) }In later examples we will explain in detail what these "Foreach"-style template definitions do and how they work. For now, just be assured that they define a template for creating factors among variables that exhibit the described relationships, and that they create all corresponding factors for the model as necessary. (Part of the magic of these seemingly simple constructions is that the resulting template not only knows how to go from its initial "label" argument to the other Score arguments, but also in the reverse direction, so that the template can find every other variable of a factor given any single variable.)
After writing the few lines above we have finished defining our model. What remains are the steps that are often done in shell scripts, Perl, other scripting languages, or "application-specific driver classes": reading the data, defining the features, choosing the inference and learning methods, setting meta-parameters, printing diagnostics interspersed with learning, transitioning among different inference or learning variants, etc. However FACTORIE encourages an alternative style that keeps model definition and the procedural aspects experimentation together---where you can see it all at once, and leverage beneficial synergies.
Loading Data, Creating Variable Instances
We again use some of the infrastructure of cc.factorie.application.LabeledTokenSeqs to read our data and create some random variable instances. Our data will come from a file containing "one word per line", plus part-of-speech tags and the true Label value for each word.
// Read training and testing data.
val trainSentences = labeled.TokenSeq.fromOWPL(Source.fromFile(new File(args(0))), ()=>new Sentence, (word,lab)=>new Token(word,lab))
val testSentences = labeled.TokenSeq.fromOWPL(Source.fromFile(new File(args(1))), ()=>new Sentence, (word,lab)=>new Token(word,lab))
Each of the above calls to fromOWPL returns a Sentence. These, in turn, are sequences of Token variables, with a few extra convenience methods that will be demonstrated later.
The values args(0) and args(1) indicate the names of the files in which to find the data; they come from this example's main method, which is shown in full at the bottom of this page. The third argument, (word,lab)=>new Token(word,lab), is a function that fromOWPL can use to construct one of our own Token instances from a word and label string obtained from the data file.
Note that we have done no special feature engineering here. The default features are simply the current word and part-of-speech, character n-grams and word shape (capitalization, letters versus digits). In a future example?, I will demonstrate how to create more sophisticated features.
In order to do learning and inference we need lists of the variables to be inferred. These are not the Tokens but the Labels. The following vanilla Scala code extracts the Label variables from the LabeledTokenSeq in the training and testing data.
// Get the variables to be inferred
val trainLabels = trainSentences.flatMap(_.labels)
val testLabels = testSentences.flatMap(_.labels)
(trainLabels ++ testLabels).foreach(_.setRandomly())
The last line scrambles the values of the Label variables. When read in by the lines above they were set to their true values. Scrambling ensures that training and testing below will not be unfairly initialized. (Recall that for purposes of supervised training and evaluation Label variables still remember their true values separately from their "current value".)
Learning and Evaluation
Next we are ready for learning. In this example we will use "Sample Rank" a discriminative online learning algorithm that is embedded inside an MCMC sampler (Culotta, 2007; Wick, Rohanimanesh, Culotta, McCallum 2009). In the next line we construct a vanilla (Gibbs) sampler that has mixed in this learning behavior.
val learner = new VariableSettingsSampler[Label](model) with SampleRank with GradientAscentUpdates
Later when we ask this VariableSettingsSampler to "process" a (discrete) variable, it will sample it by considering every possible value for that variable, efficiently calculating the difference in model score for each value, and randomly selecting a value according to the normalized distribution on those score differences. So this is essentially a GibbsSampler?. (It is not called a GibbsSampler here because for various technical reasons this name is reserved for cc.factorie.generative.GibbsSampler which only operates in generative models.)
The Sample Rank trait will also detect cases in which the ranking of the proposals according to the model's scores differ from the ranking of of those changes according to the objective function. We haven't seen the objective function in this example. It was automatically filled in with library-provided function that expresses 0/1 loss for any CategoricalValue with TrueSetting, of which Label is an example. In later examples, you will see how you can create your own objective functions, and how they are actually (somewhat surprisingly, but quite powerfully) simply another Model---a model whose score does not come from learned parameters but from hard coded functions (which in the supervised case examine true values). Thus objective functions are also defined in terms of factor templates. But we digress.
Note that in addition to specifying Sample Rank as the learning method, we also specify the style of parameter updates. Sample Rank (as well as many other learning methods such as Structured Perceptron and Contrastive Divergence) provide a method for obtaining a gradient; there remains the question how to use that gradient to change the parameters. FACTORIE currently provides GradientAscentUpdates, MIRAUpdates(Crammer and Singer 2003) and ConfidenceWeightedUpdates (Dredze, Crammer, Pereira, 2008). In this example we simply use the first one.
Now let's perform training. We tell our Gibbs sampler to visit all of our training Labels 5 times. This makes five sweeps through all our training data and should be adequate to squeeze the majority of the juice from the training data.
learner.processAll(trainLabels, 5) // Train for 5 iterations through all Labels
Gibbs sampling with embedded Sample Rank will update parameters every time the ranked preferences of the model differ from the ranked preferences of the objective function.
Note that we do not need a Gibbs sampling implementation specific to our model---the implementation is completely generalized across models. We could add many more templates to our model and still create and activate our Gibbs sampler with precisely the same two lines of source code above. When Gibbs sampling changes a variable, the model knows how to find all factors relevant to that changed variable, and also find the other variables relevant to those factors.
We are done training. Now it is time to test. We begin by asking our trained model to predict the values of the test Label variables. Although (exact) belief propagation would apply to this simple graphical model, it is not yet robustly implemented in FACTORIE. Fortunately (and perhaps surprisingly) Gibbs sampling is quite fast and accurate for inference on the test set. It would be sensible in this case to put a low "temperature" meta-parameter on the sampler to make the distributions more peaked on their maximum value. This temperature could start at a medium value and then become lower and lower over the course of several iterations. FACTORIE's Gibbs sampler has such a temperature meta-parameter. But in this first simple example we won't bother. It works fairly well with the default temperature of 1.0.
We create a new VariableSettingsSampler (one without SampleRank learning) make three passes of sampling through the testing data.
// Predict, also by sampling, visiting each variable 3 times.
val predictor = new VariableSettingsSampler[Label](model)
predictor.processAll(testLabels, 3)
Now our predictor has chosen values for the labels in the test set. We simply need to look at them, and measure the percentage of times that they are the correct, true values. The following library-provided one-liners do that.
// Evaluate
println("TRAIN "+labeled.labelEvaluation(trainLabels).accuracy)
println("TEST "+labeled.labelEvaluation(testLabels).accuracy)
We can run the above code on the CoNLL 2003 Shared Task NER data (about 200,000 words of training data in eng.train and 50,000 words of testing data in eng.testa). Data reading and pre-processing, training, prediction and testing complete in less than 10 minutes, and result in about 90% token accuracy on the training set and 75% on the testing set. Not bad for a little FACTORIE script thrown together in a few minutes. It is fairly fast to run. (I remember the original MALLET implementation I used to enter this competition would take several hours to complete training.) The accuracy is not good, however. This is due primarily to extremely poor feature set used in this example. In the next example? I will show how FACTORIE can be used to train an extraction system that approaches state-of-the-art accuracy.
The third example will show how to do everything we did in this example using only the basic, application-independent FACTORIE library facilities (avoiding cc.factorie.app.tokenseq.labeled). It demonstrates more of what you would need to know to apply FACTORIE to your own new tasks that different from linear chains.
Note that we have not yet shown examples of usage that leverage the imperative-defined factor graph approach that lies at the heart of FACTORIE. As you see from this example, that style of model definition is optional in FACTORIE. Those examples will come after we have explained the more traditional basics.
All Together
To recap, here is the entirety of the code together, this time also including the imports and object wrapper.
Note: For the most up-to-date code, please see examples/ in the factorie source.
package cc.factorie.example
import scala.io.Source
import java.io.File
import cc.factorie._
import cc.factorie.er._
import cc.factorie.app.tokenseq.labeled
object ChainNER1 {
// Define the variable classes
class Token(word:String, labelString:String) extends labeled.Token[Sentence,Label,Token](word) {
val label = new Label(labelString, this)
}
class Label(tag:String, token:Token) extends labeled.Label[Sentence,Token,Label](tag, token)
class Sentence extends labeled.TokenSeq[Token,Label,Sentence]
// Define the model:
val model = new Model(
Foreach[Label] { label => Score(label) },
Foreach[Label] { label => Score(label.prev, label, label.token) }
)
def main(args: Array[String]) : Unit = {
if (args.length != 2) throw new Error("Usage: ChainNER1 trainfile testfile")
// Read training and testing data.
val trainSentences = labeled.TokenSeq.fromOWPL(Source.fromFile(new File(args(0))), ()=>new Sentence, (word,lab)=>new Token(word,lab))
val testSentences = labeled.TokenSeq.fromOWPL(Source.fromFile(new File(args(1))), ()=>new Sentence, (word,lab)=>new Token(word,lab))
// Get the variables to be inferred
val trainLabels = trainSentences.flatMap(_.labels)
val testLabels = testSentences.flatMap(_.labels)
(trainLabels ++ testLabels).foreach(_.setRandomly())
// Train for 5 iterations
val learner = new VariableSettingsSampler[Label](model) with SampleRank with GradientAscentUpdates
learner.processAll(trainLabels, 5) // Train for 5 iterations through all Labels
// Predict, also by sampling, visiting each variable 3 times.
val predictor = new VariableSettingsSampler[Label](model)
predictor.processAll(testLabels, 3)
// Evaluate
println("TRAIN "+labeled.labelEvaluation(trainLabels).accuracy)
println("TEST "+labeled.labelEvaluation(testLabels).accuracy)
}
}