Core Data Structures

Note: This page is currently obsolete. See the tutorials for the correct description.

Variables and Domains

The root of the variable class hierarchy is cc.factorie.Variable.

Naming conventions: Except for cc.factorie.Variable itself, Variable means mutable. Observation means immutable. The names Var and Vars could be either mutable or immutable. The plural Vars indicates that objects of the class might actually hold multiple values of that type (as in BinaryVectorVariable). The singular Var indicates that the object holds only one value, such as a single integer or a single real number.

Each class of Variable is associated with its own Domain, a representation of the set of valid values for variables of that type. The domain of variables of class Token extends EnumVariable[String] is accessible by Domain[Token]. The default Domain class provides little functionality. An important subclass is CategoricalDomain, which stores a bi-directional mapping between valid values (categories) of its variables and their integer indices 0 to N. A CategoricalDomain is also a scala.Seq containing its constituent categories. Thus, for example, since CategoricalVariables have a CategoricalDomain, you can print all unique Token strings by Domain[Token].foreach(println(_)).
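The bi-directional mapping described above can be illustrated with a minimal sketch. ToyCategoricalDomain and DomainDemo are hypothetical names invented for this example; they are not FACTORIE classes, and the real CategoricalDomain may differ in its method names and behavior.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a categorical domain; NOT the FACTORIE class.
final class ToyCategoricalDomain[T] extends Iterable[T] {
  private val categories = mutable.ArrayBuffer.empty[T] // index -> category
  private val indices = mutable.HashMap.empty[T, Int]   // category -> index

  /** Returns the index of `category`, adding it to the domain if unseen. */
  def index(category: T): Int =
    indices.getOrElseUpdate(category, { categories += category; categories.length - 1 })

  /** Returns the category stored at index `i`. */
  def category(i: Int): T = categories(i)

  /** Iterates over the categories in index order. */
  def iterator: Iterator[T] = categories.iterator
}

object DomainDemo {
  /** Collects the unique words of `words` in first-seen order. */
  def uniqueTokens(words: Seq[String]): Seq[String] = {
    val domain = new ToyCategoricalDomain[String]
    words.foreach(w => domain.index(w))
    domain.toSeq
  }
}
```

Because the toy domain is itself a collection of its categories, a call such as domain.foreach(println(_)) prints each unique category once, in the same spirit as the Domain[Token] example.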

Integer-valued variables:

  • IntegerVariable and IntegerObservation hold arbitrary integers.
  • DiscreteVariable and DiscreteObservation hold integers from 0 to N. Their method maxValue returns the value of N. Their method domainSize returns N+1.
  • CategoricalVariable[T] and CategoricalObservation[T] hold values from a finite enumerated set of categorical values of type T. For example, a common definition for a variable holding character strings that are mapped to densely-packed integers (which will be ultimately used as indices into arrays of model parameters) is Token extends CategoricalVariable[String].
  • ItemizedObservation instances are Observation instances that have themselves been mapped to densely-packed integers. For example, if you define class Person(val name:String) extends ItemizedObservation then Domain[Person] contains the collection of all constructed Person instances, and you can get a unique integer for a person val p = new Person("Karl") by p.index.

Label variables:

  • LabelVariable is a CategoricalVariable that also has a trueValue, which can be used for supervised learning.

Boolean-valued variables:

  • BooleanVariable and BooleanObservation each hold a single boolean value.

Real-valued variables:

  • RealVariable and RealObservation each hold a single real number, represented as a Scala native type Double. They have implicit conversions to their underlying Double, so you can write expressions such as val alpha = new RealVariable(0.7); val beta = 1.0 - alpha.
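The implicit conversion to Double can be sketched as follows. ToyRealVariable and RealDemo are hypothetical names for this example only; this is one way such a conversion could be set up in Scala, not necessarily how FACTORIE defines it.

```scala
import scala.language.implicitConversions

// Hypothetical toy class; NOT the FACTORIE RealVariable.
final class ToyRealVariable(private var current: Double) {
  def value: Double = current
  def set(newValue: Double): Unit = { current = newValue }
}

object ToyRealVariable {
  // Implicit view placed in the companion object, so callers need no import.
  implicit def toDouble(v: ToyRealVariable): Double = v.value
}

object RealDemo {
  /** The variable is converted to its Double value before the subtraction. */
  def complement(alpha: ToyRealVariable): Double = 1.0 - alpha
}
```

The conversion reads the value at the time the expression is evaluated, so a later call to set does not change a Double that was computed earlier.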

Reference-valued variables:

  • RefVariable and RefObservation each hold a pointer to an arbitrary Scala object, including other Variables.
  • RefLabel is a RefVariable that also has a trueValue, used for supervised learning.

Vector-valued variables:

  • SparseBinaryVectorVariable holds a collection of discrete-valued indices. It is essentially a binary feature vector with true represented by 1 and false represented by 0. It is stored with a sparse representation so that vectors with hundreds of thousands of dimensions but only a few non-zero dimensions (such as those over NLP vocabularies) can be stored efficiently.
  • RealVectorVariable holds a multi-dimensional collection of real-valued numbers. In all the library's vector-valued variables, the underlying vectors are represented as objects of type Vector from cc.factorie.la.Vector (where la stands for "linear algebra").
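To make the sparse binary representation concrete, here is a minimal sketch that stores only the active indices. ToySparseBinaryVector is a hypothetical type for this example; it is not cc.factorie.la.Vector, and the library's own storage layout is not shown here.

```scala
import scala.collection.immutable.SortedSet

// Hypothetical toy type; NOT cc.factorie.la.Vector or any FACTORIE class.
final case class ToySparseBinaryVector(dimension: Int, activeIndices: SortedSet[Int]) {
  require(activeIndices.forall(i => i >= 0 && i < dimension), "index out of range")

  /** 1.0 if the index is active (true), 0.0 otherwise (false). */
  def apply(i: Int): Double = if (activeIndices.contains(i)) 1.0 else 0.0

  /** Dot product with dense weights: only the active indices contribute. */
  def dot(weights: Array[Double]): Double = {
    require(weights.length == dimension, "dimension mismatch")
    activeIndices.iterator.map(i => weights(i)).sum
  }
}
```

Memory use and the cost of dot grow with the number of active indices rather than with the full dimension, which is the point of the sparse representation.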

Variables of Generative Models

GeneratedVariable is a variable that knows its "parent" source. A Parameter knows its "children". These variable classes encapsulate not only their value type, but also the distribution from which they were generated. Following the convention in statistics (unlike computer science) they are named after their parent distribution. Hence the value of a Dirichlet variable is a sequence of floating-point numbers that sum to one, and which was generated from a Dirichlet distribution.

  • Discrete and Categorical
  • Proportions and Dirichlet
  • MultinomialDiscrete, which integrates out uncertainty about a DiscreteOutcome with a Multinomial prior.
  • Multinomial is a set of counts whose parent is a set of Proportions
  • DirichletMultinomial is the collapsed representation of a Dirichlet
  • Poisson
  • Gaussian
  • Gamma
  • Exponential

Mixture models. Trait MixtureComponent and MixtureChoice.

  • MixtureChoice and MixtureChoiceMixture
  • MixtureOutcome
  • DiscreteMixture and CategoricalMixture
  • GaussianMixture

Factor Templates

A factor in a factor graph measures the "compatibility" of values in its neighboring variables. This compatibility is expressed as a real-valued score, which corresponds to an unnormalized log-probability. The score of an entire factor graph is the sum of the scores of all its factors.
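The relationship between factor scores and the score of the whole graph can be sketched as follows. ToyFactorGraph and its members are hypothetical names for this example and do not reflect the FACTORIE Factor API; variables are identified by name and take integer values purely for brevity.

```scala
// Hypothetical toy types; NOT the FACTORIE Factor API.
object ToyFactorGraph {
  /** A possible world: a value for every variable, keyed by variable name. */
  type Assignment = Map[String, Int]

  /** A factor scores the values of the variables it neighbors. */
  final case class ToyFactor(neighbors: Seq[String], score: Seq[Int] => Double)

  /** Score of the whole graph: the sum of the scores of all its factors. */
  def graphScore(factors: Seq[ToyFactor], world: Assignment): Double =
    factors.map(f => f.score(f.neighbors.map(world))).sum

  /** Treating the score as a log-probability, exp(score) is the unnormalized probability. */
  def unnormalizedProbability(factors: Seq[ToyFactor], world: Assignment): Double =
    math.exp(graphScore(factors, world))
}
```

Summing scores corresponds to multiplying the exponentiated factor values, which is why the graph score behaves like a log-probability up to a normalization constant.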

In many cases, useful factor graphs have multiple factors that have the same cardinality and types of variable neighbors and that also share the same parameters. A "factor template" efficiently captures these common attributes. It is a template, or "generator", of individual factors neighboring particular variables. A factor template consists of (1) a description of the arbitrary relationship among its variable neighbors, (2) a sufficient statistics function that maps those neighbors to the statistics necessary to return a real-valued score (and optionally a vector of sufficient statistics), (3) an aggregator for multiple statistics of the same template, (4) a function mapping those aggregated statistics to a real-valued score, and (5) optionally, the parameters used in the function to calculate that score (alternatively, the score may be calculated in some fixed way without learned parameters).

In FACTORIE the "description of the arbitrary relationship among its variable neighbors" can be defined in several alternative ways. There is support for an entity-attribute-relationship language that can be used to describe relations among variables. Factor templates with boolean sufficient statistics can also be defined using this entity-attribute-relationship language plus formulas in first-order logic. But most flexibly, you can use a Turing-complete language (actually the full power of Scala) to define the relationship among a template's variable neighbors.

Probability distribution between two alternative possible worlds can be calculated by...

Efficiently score only the changes to a possible world...

The DiffList and its management...

Finding the factors that touch changed variables, and the other variable neighbors of those factors...

Given one changed variable, the unroll methods find the other variable neighbors for the factor template, construct one or more Factors, and return them...

Steps of scoring a change to the model

  1. Make the change, automatically building up a DiffList.
  2. Pass the DiffList to each Template, which creates Factors through its unroll methods. Each unroll invocation may return multiple Factors.
  3. Factors are uniqued for each template.
  4. Each Factor is transformed into one or more Stat objects representing the sufficient statistics of the factor's scoring function. These Stat objects are not uniqued.
  5. Stat objects are aggregated into a Statistic object. For example, when the sufficient statistics are vectors, the vectors may be summed.
  6. The Statistic object has a score method that returns a Double.
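The six steps above can be sketched with toy types. Everything in ToyScoring is hypothetical and chosen for illustration: the DiffList is modeled as a plain sequence of recorded changes, the template covers adjacent pairs in a chain, each factor yields exactly one statistics vector, and the final score is assumed to be a dot product with a weight vector. None of this is the FACTORIE API.

```scala
// Hypothetical toy types that mirror the six steps; NOT the FACTORIE API.
object ToyScoring {
  final class ToyVar(val name: String, var value: Int)

  /** One recorded change; a DiffList is modeled as a plain sequence of these. */
  final case class ToyDiff(variable: ToyVar, oldValue: Int, newValue: Int)

  /** A factor over an adjacent pair of variables in a chain. */
  final case class ToyFactor(left: ToyVar, right: ToyVar)

  /** Step 1: apply the change and record it. */
  def change(v: ToyVar, newValue: Int): ToyDiff = {
    val diff = ToyDiff(v, v.value, newValue)
    v.value = newValue
    diff
  }

  /** Template over adjacent pairs in `chain`, with a 2-dimensional statistic. */
  final class ToyDotTemplate(chain: IndexedSeq[ToyVar], weights: Array[Double]) {
    /** Step 2: build every factor that touches the changed variable. */
    def unroll(v: ToyVar): Seq[ToyFactor] = {
      val i = chain.indexOf(v)
      val left = if (i > 0) Seq(ToyFactor(chain(i - 1), v)) else Seq.empty
      val right =
        if (i >= 0 && i < chain.length - 1) Seq(ToyFactor(v, chain(i + 1))) else Seq.empty
      left ++ right
    }

    /** Step 4: one-hot statistic, laid out as [values agree, values differ]. */
    def stat(f: ToyFactor): Array[Double] =
      if (f.left.value == f.right.value) Array(1.0, 0.0) else Array(0.0, 1.0)

    /** Steps 2 to 6, evaluated on the current values of the variables. */
    def score(diffs: Seq[ToyDiff]): Double = {
      // Step 3: unique the factors produced by all unroll calls.
      val factors = diffs.flatMap(d => unroll(d.variable)).distinct
      // Step 5: aggregate the per-factor statistics by summing the vectors.
      val summed = factors.map(f => stat(f)).foldLeft(Array(0.0, 0.0)) {
        (acc, s) => Array(acc(0) + s(0), acc(1) + s(1))
      }
      // Step 6: score the aggregate as a dot product with the weights.
      summed.zip(weights).map { case (x, w) => x * w }.sum
    }
  }
}
```

In this sketch only the factors adjacent to a changed variable are created and scored, so the cost of scoring depends on the size of the change rather than on the size of the whole chain.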

Types of factor templates

  • Template. The base class for all factor templates.
  • VectorTemplate. Its Stat objects each have a Vector. They are aggregated into a Statistic by summing the vectors.
  • DotTemplate. Scores each Statistic by a dot product between its Vector and a vector of weights.