First Examples

Here are three examples providing a brief sense of FACTORIE usage.

Topic Modeling, Document Classification, and NLP on the Command-line

FACTORIE comes with a pre-built implementation of the latent Dirichlet allocation (LDA) topic model. If “mytextdir” is a directory name containing many plain text documents each in its own file, then typing

$ bin/fac lda --read-dirs mytextdir --num-topics 20 --num-iterations 100

will run 100 iterations of sparse collapsed Gibbs sampling on all the documents, printing the current topics every 10 iterations. FACTORIE’s LDA implementation is faster than MALLET’s.
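For intuition about what the sampler does, here is a minimal sketch of one sweep of (non-sparse) collapsed Gibbs sampling for LDA in plain Scala. This is illustrative only, not FACTORIE’s implementation; the function name, count-array layout, and parameters are assumptions for the sketch.

import scala.util.Random

// One sweep of collapsed Gibbs sampling for LDA (illustrative; not FACTORIE internals).
// docs(d)(i) is the word id of token i in document d; z holds the current topic assignments.
def gibbsSweep(docs: Array[Array[Int]], z: Array[Array[Int]],
               nDocTopic: Array[Array[Int]],  // nDocTopic(d)(k): tokens in doc d assigned topic k
               nTopicWord: Array[Array[Int]], // nTopicWord(k)(w): times word w is assigned topic k
               nTopic: Array[Int],            // nTopic(k): total tokens assigned topic k
               alpha: Double, beta: Double, vocabSize: Int, rng: Random): Unit = {
  val numTopics = nTopic.length
  for (d <- docs.indices; i <- docs(d).indices) {
    val w = docs(d)(i)
    val old = z(d)(i)
    // Remove this token from the counts, then resample its topic from the collapsed posterior
    nDocTopic(d)(old) -= 1; nTopicWord(old)(w) -= 1; nTopic(old) -= 1
    val p = Array.tabulate(numTopics) { k =>
      (nDocTopic(d)(k) + alpha) * (nTopicWord(k)(w) + beta) / (nTopic(k) + beta * vocabSize)
    }
    // Draw a topic with probability proportional to p(k)
    var r = rng.nextDouble() * p.sum
    var k = 0
    while (k < numTopics - 1 && r > p(k)) { r -= p(k); k += 1 }
    z(d)(i) = k
    nDocTopic(d)(k) += 1; nTopicWord(k)(w) += 1; nTopic(k) += 1
  }
}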

You can also train a document classifier. If “sportsdir” and “politicsdir” are each directories containing plain text files in the categories sports and politics, then typing

$ bin/fac classify --read-text-dirs sportsdir politicsdir --write-classifier mymodel.factorie

will train a log-linear classifier by maximum likelihood (same as maximum entropy) and save it in the file “mymodel.factorie”.
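In a log-linear (maximum entropy) classifier, the probability of each class is a softmax over dot-products of a per-class weight vector with the document’s features. Here is a minimal sketch in plain Scala over sparse bag-of-words features; it is illustrative only and not FACTORIE’s classifier API, and the weight layout is an assumption.

// p(label | doc) for a log-linear classifier over sparse bag-of-words features.
// weights(label)(feature) are the parameters learned by maximum likelihood.
def classProbabilities(features: Map[Int, Double], weights: Array[Array[Double]]): Array[Double] = {
  val scores = weights.map(w => features.map { case (f, c) => w(f) * c }.sum)
  val max = scores.max                          // subtract the max score for numerical stability
  val exps = scores.map(s => math.exp(s - max))
  val z = exps.sum
  exps.map(_ / z)                               // normalized class probabilities
}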

If you also have the Maven-supplied factorie-nlp-resources JAR in your classpath, you can run many natural language processing tools. For example,

$ bin/fac nlp --wsj-forward-pos --transition-based-parser --conll-chain-ner

will launch an NLP server that will perform part-of-speech tagging, dependency parsing and named entity recognition on its input.
The server listens for text on a socket and spawns a parallel document processor for each request.
To feed it input, run the following in a separate shell:

$ echo "Mr. Jones took a job at Google in New York.  He and his Australian wife moved from New South Wales on 4/1/12." | nc localhost 3228

which then produces output with one line per token, giving the token’s index in the document, its index within the sentence, the word itself, its part-of-speech tag, the index of its dependency-parse head, its dependency label, and its named-entity tag:

1       1       Mr.             NNP     2       nn      O
2       2       Jones           NNP     3       nsubj   U-PER
3       3       took            VBD     0       root    O
4       4       a               DT      5       det     O
5       5       job             NN      3       dobj    O
6       6       at              IN      3       prep    O
7       7       Google          NNP     6       pobj    U-ORG
8       8       in              IN      7       prep    O
9       9       New             NNP     10      nn      B-LOC
10      10      York            NNP     8       pobj    L-LOC
11      11      .               .       3       punct   O

12      1       He              PRP     6       nsubj   O
13      2       and             CC      1       cc      O
14      3       his             PRP$    5       poss    O
15      4       Australian      JJ      5       amod    U-MISC
16      5       wife            NN      6       nsubj   O
17      6       moved           VBD     0       root    O
18      7       from            IN      6       prep    O
19      8       New             NNP     9       nn      B-LOC
20      9       South           NNP     10      nn      I-LOC
21      10      Wales           NNP     7       pobj    L-LOC
22      11      on              IN      6       prep    O
23      12      4/1/12          NNP     11      pobj    O
24      13      .               .       6       punct   O
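If you prefer not to shell out to nc, the same request can be made from Scala by writing to the socket and reading lines back. This is a hedged sketch that assumes only the newline-delimited plain-text protocol shown above; the port, 3228, is the one used in the nc example.

import java.net.Socket
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Send one document to the NLP server and print every line it returns.
val socket = new Socket("localhost", 3228)
val out = new PrintWriter(socket.getOutputStream, true)
val in = new BufferedReader(new InputStreamReader(socket.getInputStream))
out.println("Mr. Jones took a job at Google in New York.")
socket.shutdownOutput()   // signal end of input, as closing nc's stdin does
Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)
socket.close()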

Univariate Gaussian

The following code creates a model that holds factors connecting random variables for a Gaussian’s mean and variance to 1000 values sampled from that Gaussian. It then re-estimates the mean and variance from the sampled data by maximum likelihood.

package cc.factorie.tutorial
object ExampleGaussian extends App {
  import cc.factorie._                             // The base library
  import cc.factorie.directed._                    // Factors for directed graphical models
  import cc.factorie.variable._                    // Random variable types
  import cc.factorie.model._
  import cc.factorie.infer._
  implicit val model = DirectedModel()             // Define the "model" that will implicitly store the new factors we create
  implicit val random = new scala.util.Random(0)   // Define a source of randomness that will be used implicitly in data generation below 
  val mean = new DoubleVariable(10)                // A random variable for holding the mean of the Gaussian
  val variance = new DoubleVariable(1.0)           // A random variable for holding the variance of the Gaussian
  println("true mean %f variance %f".format(mean.value, variance.value))
  // Generate 1000 new random variables from this Gaussian distribution
  //  "~" would mean just "Add to the model a new factor connecting the parents and the child"
  //  ":~" does this and also assigns a new value to the child by sampling from the factor
  val data = for (i <- 1 to 1000) yield new DoubleVariable :~ Gaussian(mean, variance) 
  // Set mean and variance to values that maximize the likelihood of the children
  Maximize(mean)
  Maximize(variance)
  println("estimated mean %f variance %f".format(mean.value, variance.value))
}
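As a sanity check on what Maximize does here, the maximum-likelihood estimates for a Gaussian are just the sample mean and the (biased) sample variance of the 1000 generated values. A small sketch, meant to be placed inside ExampleGaussian after the Maximize calls, assuming .value on a DoubleVariable returns the underlying Double as in the println calls above:

// Recompute the maximum-likelihood estimates directly from the sampled data.
val xs = data.map(_.value)
val sampleMean = xs.sum / xs.size
val sampleVariance = xs.map(x => (x - sampleMean) * (x - sampleMean)).sum / xs.size
println("sample mean %f variance %f".format(sampleMean, sampleVariance))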

Linear-chain Conditional Random Field for part-of-speech tagging

The following code declares data, model, inference and learning for a linear-chain CRF for part-of-speech tagging.

object ExampleLinearChainCRF extends App {
  import cc.factorie._            // The base library: variables, factors
  import cc.factorie.la           // Linear algebra: tensors, dot-products, etc.
  import cc.factorie.optimize._   // Gradient-based optimization and training
  // Declare random variable types
  // A domain and variable type for storing words
  object TokenDomain extends CategoricalDomain[String]
  class Token(str:String) extends CategoricalVariable(str) { def domain = TokenDomain }
  // A domain and variable type for storing part-of-speech tags
  object LabelDomain extends CategoricalDomain[String]
  class Label(str:String, val token:Token) extends LabeledCategoricalVariable(str){
    def domain = LabelDomain
  }
  class LabelSeq extends scala.collection.mutable.ArrayBuffer[Label]
  // Create random variable instances from data (a toy amount of data for this example)
  val data = List("See/V Spot/N run/V", "Spot/N is/V a/DET big/J dog/N", "He/N is/V fast/J") 
  val labelSequences = for (sentence <- data) yield new LabelSeq ++= sentence.split(" ").map(s => {
   val a = s.split("/")
   new Label(a(1), new Token(a(0)))
  })
  // Define a model structure
  val model = new Model with Parameters {
    // Two families of factors, where factor scores are dot-products of sufficient statistics and weights.
    // (The weights will be set in training below.)
    val markov = new DotFamilyWithStatistics2[Label,Label] { 
      val weights = Weights(new la.DenseTensor2(LabelDomain.size, LabelDomain.size))
    }
    val observ = new DotFamilyWithStatistics2[Label,Token] {
      val weights = Weights(new la.DenseTensor2(LabelDomain.size, TokenDomain.size))
    }
    // Given some variables, return the collection of factors that neighbor them.
    def factors(labels:Iterable[Var]) = labels match {
      case labels:LabelSeq => 
        labels.map(label => new observ.Factor(label, label.token)) ++
          labels.sliding(2).map(window => new markov.Factor(window.head, window.last))
    }
  }
  // Learn parameters
  val trainer = new BatchTrainer(model.parameters, new ConjugateGradient)
  trainer.trainFromExamples(labelSequences.map(labels => new LikelihoodExample(labels, model, InferByBPChain)))
  // Inference on the same data.  We could let FACTORIE choose the inference method, 
  // but here instead we specify that it should use max-product belief propagation
  // specialized to a linear chain
  labelSequences.foreach(labels => BP.inferChainMax(labels, model))
  // Print the learned parameters on the Markov factors.
  println(model.markov.weights)
  // Print the inferred tags
  labelSequences.foreach(_.foreach(l => println(s"Token: ${l.token.value} Label: ${l.value}")))
}
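Since each Label is constructed with its gold category and keeps it as a target, you can also report token accuracy after inference. A hedged sketch, meant to be placed inside ExampleLinearChainCRF after the inference step, assuming LabeledCategoricalVariable exposes valueIsTarget (true when the current value matches the gold target):

// Fraction of tokens whose inferred tag matches the gold tag they were constructed with.
val allLabels = labelSequences.flatten
val accuracy = allLabels.count(_.valueIsTarget).toDouble / allLabels.size
println(f"Token accuracy on the training sentences: $accuracy%.3f")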