A large value of delta slows the rate at which the learning rates go down initially
The initial learning rate
The strength of l1 regularization
The strength of l2 regularization
The number of examples for online training, used to scale regularizers
Once learning is done, the weights should be copied back into normal tensors.
Some optimizers swap out weights with special purpose tensors for e.g. efficient scoring while learning.
The weights
Whether the optimizer has converged yet.
Reset the optimizer's internal state (such as Hessian approximation, etc.)
Updates the weights according to the gradient.
The weights
The gradient
The value
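Taken together, these descriptions outline a small gradient-optimizer interface. The following is a minimal sketch of such an interface in Python, assuming hypothetical method names (step, is_converged, reset, finalize_weights) chosen to match the descriptions above; a concrete library may name and type these members differently.

    from abc import ABC, abstractmethod

    class GradientOptimizer(ABC):
        """Hypothetical optimizer interface matching the method descriptions above."""

        @abstractmethod
        def step(self, weights, gradient, value):
            """Updates the weights according to the gradient; value is the current objective value."""

        @abstractmethod
        def is_converged(self):
            """Whether the optimizer has converged yet."""

        @abstractmethod
        def reset(self):
            """Reset the optimizer's internal state (such as Hessian approximation, etc.)."""

        @abstractmethod
        def finalize_weights(self, weights):
            """Copy any special-purpose tensors swapped in during learning back into normal weight tensors."""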
The AdaGrad regularized dual averaging algorithm from Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
It works by keeping a (reweighted) sum of the gradients seen so far and applying regularization at prediction time instead of update time.
Tuning the rate and delta parameters is often not necessary.
The regularizers, however, are per-example, which means that their values should be set to a very small number, on the order of 0.01/num_training_examples, and these values should be tuned.
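As an illustration of the description above, the following is a minimal, hypothetical sketch of such an update for a dense weight vector: gradients are accumulated into running sums, and the l1/l2 regularization is applied lazily when the weights are read out. The class and parameter names, the minimization sign convention, and the exact way the per-example regularizers are scaled by the number of examples are assumptions made for illustration, not any particular library's implementation.

    import numpy as np

    class AdaGradRDASketch:
        """Sketch of AdaGrad regularized dual averaging (for illustration only)."""

        def __init__(self, dim, rate=1.0, delta=0.1, l1=0.0, l2=0.0, num_examples=1):
            self.rate = rate                  # the initial learning rate
            self.delta = delta                # a large delta slows the initial decay of the learning rates
            self.l1 = l1 * num_examples       # per-example regularizers, scaled by the number of examples
            self.l2 = l2 * num_examples
            self.t = 0                        # number of updates seen so far
            self.grad_sum = np.zeros(dim)     # running sum of gradients
            self.grad_sq_sum = np.zeros(dim)  # running sum of squared gradients, per coordinate

        def step(self, gradient):
            """Accumulate the gradient; no regularization is applied at update time."""
            self.t += 1
            self.grad_sum += gradient
            self.grad_sq_sum += gradient ** 2

        def weights(self):
            """Reconstruct the weights lazily; l1/l2 regularization is applied here, at prediction time."""
            if self.t == 0:
                return np.zeros_like(self.grad_sum)
            # Per-coordinate curvature term: the effective step size shrinks as squared gradients accumulate.
            h = (self.delta + np.sqrt(self.grad_sq_sum)) / self.rate + self.t * self.l2
            # Soft-thresholding the gradient sum implements the l1 penalty and zeroes out weak coordinates.
            shrunk = np.sign(self.grad_sum) * np.maximum(0.0, np.abs(self.grad_sum) - self.t * self.l1)
            return -shrunk / h  # minimization convention: move against the accumulated gradient

Because the weights are reconstructed from the accumulated sums on demand, the l1 soft-threshold can set coordinates to exactly zero, which is why regularization is applied at prediction time rather than at update time.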