The AdaGrad algorithm.
The AdaGrad regularized dual averaging algorithm from Duchi et al., "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization".
The combination of AdaGrad with MIRA.
This implements the adaptive learning rates from the AdaGrad algorithm (with Composite Mirror Descent update) from "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization" by Duchi et al.
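The adaptive per-coordinate rate can be sketched as follows (a minimal sketch, not Factorie's API; `rate` and `delta` are assumed hyperparameter names, and the step adds the gradient because these optimizers maximize the objective):

```scala
// Sketch of the per-coordinate AdaGrad update (composite mirror descent
// form, no regularizer). Assumed hyperparameters: base rate and a small
// delta to avoid division by zero.
object AdaGradSketch {
  def step(weights: Array[Double], sumSqGrad: Array[Double],
           gradient: Array[Double], rate: Double, delta: Double): Unit = {
    var i = 0
    while (i < weights.length) {
      sumSqGrad(i) += gradient(i) * gradient(i)  // accumulate squared gradients
      // each coordinate gets its own learning rate, shrinking with history
      weights(i) += rate * gradient(i) / (delta + math.sqrt(sumSqGrad(i)))
      i += 1
    }
  }
}
```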
Convenience name for the averaged perceptron.
A backtracking line optimizer.
Learns the parameters of a Model by summing the gradients and values of all Examples, and passing them to a GradientOptimizer (such as ConjugateGradient or LBFGS).
A gradient from a single DiscreteVar, where the set of factors is allowed to change based on its value.
Generalization of pseudo likelihood to sets of variables, instead of a single one.
A conjugate gradient optimizer.
A simple gradient descent algorithm with constant learning rate.
A simple gradient descent algorithm with constant norm-independent learning rate.
Mixin trait for a step size which is normalized by the length of the gradient and is constant.
Mixin trait for a constant step size.
A training example for using contrastive divergence.
Contrastive divergence with the hinge loss.
An example for a single labeled discrete variable.
Implements the domination loss function: it penalizes models that rank any of the badCandidates above any of the goodCandidates.
Implements a variant of the domination loss function.
Main abstraction over a training example.
This implements the Exponentiated Gradient algorithm of Kivinen and Warmuth, also known as Entropic Mirror Descent (Beck and Teboulle).
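The EG update can be sketched as a multiplicative step followed by renormalization onto the probability simplex (a minimal sketch, not Factorie's API; `eta` is an assumed step-size name):

```scala
// Sketch of an Exponentiated Gradient (entropic mirror descent) ascent
// step: multiply each weight by exp(eta * gradient), then renormalize
// so the weights remain a distribution.
object ExpGradSketch {
  def step(w: Array[Double], gradient: Array[Double], eta: Double): Array[Double] = {
    val unnorm = w.zip(gradient).map { case (wi, gi) => wi * math.exp(eta * gi) }
    val z = unnorm.sum
    unnorm.map(_ / z)  // project back onto the simplex
  }
}
```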
Base trait for optimizers that update weights according to a gradient.
Base trait for optimizers whose operational form can be described as
A parallel online trainer which has no locks or synchronization.
Mixin trait for a step size which is normalized by the length of the gradient and looks like 1/sqrt(T).
Mixin trait for a step size which looks like 1/sqrt(T).
Mixin trait for a step size which is normalized by the length of the gradient and looks like 1/T.
Mixin trait for a step size which looks like 1/T.
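The two decay schedules these mixins describe can be sketched as simple functions of the iteration count (assuming a `baseRate` hyperparameter; the norm-normalized variants additionally divide by the gradient's length):

```scala
// Sketch of the decaying step-size schedules: 1/sqrt(T) decays slowly
// (common for non-strongly-convex objectives), 1/T decays faster
// (common for strongly convex ones).
object StepSchedules {
  def invSqrtT(baseRate: Double, t: Int): Double = baseRate / math.sqrt(t)
  def invT(baseRate: Double, t: Int): Double = baseRate / t
}
```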
Include L2 regularization (Gaussian with given scalar as the spherical covariance) in the gradient and value.
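Folding such a Gaussian prior into an objective can be sketched as follows (not Factorie's API; the scalar argument is the spherical variance, and the penalty is subtracted because the objective is being maximized):

```scala
// Sketch of adding an L2 (Gaussian prior) penalty to an objective's
// value and gradient. The log-prior is -||w||^2 / (2 * variance), so
// its gradient contribution is -w / variance.
object L2Sketch {
  // Returns the regularized value; mutates `gradient` in place.
  def addL2(weights: Array[Double], gradient: Array[Double],
            value: Double, variance: Double): Double = {
    var penalty = 0.0
    var i = 0
    while (i < weights.length) {
      penalty += weights(i) * weights(i)
      gradient(i) -= weights(i) / variance  // derivative of the log-prior
      i += 1
    }
    value - penalty / (2 * variance)
  }
}
```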
Simple, efficient l2-regularized SGD with a constant learning rate.
Maximize using Limited-memory BFGS, as described in Byrd, Nocedal, and Schnabel, "Representations of Quasi-Newton Matrices and Their Use in Limited Memory Methods".
Base example for maximizing log likelihood.
Change the weights in the direction of the gradient, using backtracking line search to make sure we step uphill.
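Backtracking along the gradient direction can be sketched on a 1-D slice with the Armijo sufficient-increase condition (`c` and `rho` are conventional constants, not Factorie's parameter names):

```scala
// Sketch of backtracking (Armijo) line search for gradient ascent in 1-D:
// shrink the step size until it yields a sufficient increase in f.
object BacktrackSketch {
  def search(f: Double => Double, grad: Double, w: Double,
             init: Double = 1.0, c: Double = 1e-4, rho: Double = 0.5): Double = {
    var s = init
    // require f to increase by at least c * s * grad^2 before accepting s
    while (f(w + s * grad) < f(w) + c * s * grad * grad) s *= rho
    s
  }
}
```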
An implementation of the liblinear algorithm.
The MIRA algorithm.
Mixin trait for implementing a MIRA step.
Treats many examples as one.
Learns the parameters of a model by computing the gradient and calling the optimizer one example at a time.
Abstract trait for any (sub)differentiable objective function used to train predictors.
Mixin trait to add parameter averaging to any GradientStep
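Parameter averaging itself can be sketched as reporting the mean of the weight iterates produced by the gradient steps (a minimal sketch, not the mixin's actual implementation, which keeps a running sum):

```scala
// Sketch of parameter averaging: the final predictor uses the average
// of all weight vectors seen during training, which often generalizes
// better than the last iterate.
object AveragingSketch {
  def averaged(iterates: Seq[Array[Double]]): Array[Double] = {
    val sum = new Array[Double](iterates.head.length)
    for (w <- iterates; i <- w.indices) sum(i) += w(i)
    sum.map(_ / iterates.length)
  }
}
```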
This implements an efficient version of the Pegasos SGD algorithm for l2-regularized hinge loss. It won't necessarily work with other losses, because of the aggressive projection steps. Note that adding a learning rate here is nontrivial, since the update relies on baseRate / step < 1.
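One Pegasos step can be sketched as follows (a minimal sketch in the usual minimization convention for lambda/2 * ||w||^2 plus hinge loss, with labels y in {-1, +1}; the shrink factor 1 - eta * lambda = 1 - 1/t is where the requirement that baseRate / step < 1 comes from):

```scala
// Sketch of one Pegasos step: shrink weights, take a hinge subgradient
// step if the margin is violated, then aggressively project onto the
// ball of radius 1/sqrt(lambda).
object PegasosSketch {
  def step(w: Array[Double], x: Array[Double], y: Double,
           lambda: Double, t: Int): Unit = {
    val eta = 1.0 / (lambda * t)  // so eta * lambda = 1/t < 1
    val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
    var i = 0
    while (i < w.length) {
      w(i) *= (1 - eta * lambda)              // shrink from the l2 regularizer
      if (margin < 1) w(i) += eta * y * x(i)  // hinge subgradient on violation
      i += 1
    }
    val norm = math.sqrt(w.map(wi => wi * wi).sum)
    if (norm > 0) {
      val scale = math.min(1.0, 1.0 / (math.sqrt(lambda) * norm))
      var j = 0
      while (j < w.length) { w(j) *= scale; j += 1 }
    }
  }
}
```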
Convenience name for the perceptron.
A variant of the contrastive divergence algorithm which does not reset to the ground truth.
A contrastive divergence hinge example which keeps the chain going.
Base example for all OptimizablePredictors.
Trains a model with pseudo likelihood.
An example which independently maximizes each label with respect to its neighbors' true assignments.
A variant of PseudomaxExample which enforces a margin.
Implements the Regularized Dual Averaging algorithm of Xiao, with support for l1 and l2 regularization.
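The l1 part of RDA admits a closed-form, sparsity-inducing update computed from the running average gradient (a sketch of Xiao's l1-RDA with a quadratic auxiliary function, written in the minimization convention; `gamma` and `l1` are assumed hyperparameter names):

```scala
// Sketch of the l1-RDA closed-form weights at step t: soft-threshold the
// average gradient, which produces exact zeros and hence sparse models.
object RdaL1Sketch {
  def weightsAt(avgGrad: Array[Double], t: Int, gamma: Double, l1: Double): Array[Double] =
    avgGrad.map { g =>
      if (math.abs(g) <= l1) 0.0  // soft-threshold: coordinate stays exactly zero
      else -(math.sqrt(t) / gamma) * (g - math.signum(g) * l1)
    }
}
```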
Provides a gradient that encourages the model.
Maximum likelihood in one semi-supervised setting.
Implements the structured perceptron.
Implements the structured SVM objective function, by doing loss-augmented inference.
Learns the parameters of a Model by processing the gradients and values from a collection of Examples.
Train using one trainer, until it has converged, and then use the second trainer instead.