AdaGradSolver – this adaptive gradient solver is simpler than SGD, for AdaGrad does not use a momentum term. Instead, according to Mayanglambam, this solver “uses different learning rates for each parameter based on the iteration” in an attempt to find rarely seen features. Adaptive Gradient, as described by Duchi et al., is designed to “find needles in haystacks in the form of very predictive but rarely seen features.” In the blog post “An overview of gradient descent optimization algorithms,” Ruder notes that AdaGrad “adapts the learning rate to parameters, performing smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and large updates (i.e., higher learning rates) for parameters associated with sparse data.”
The SolverParameter specifies all parameters used to configure the solver. This section describes the ADAGRAD-specific parameters.
momentum: this parameter is not used by this solver and should be set to 0.
delta: this parameter is used for numerical stability by the solver and defaults to 1e-8.
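The per-parameter learning rates and the role of delta can be sketched as follows. This is a minimal, self-contained illustration of the standard AdaGrad update rule, not the solver's actual implementation; the function name and the plain-list representation are illustrative choices.

```python
import math

def adagrad_update(w, grad, hist, lr=0.01, delta=1e-8):
    # Each parameter i keeps its own running sum of squared gradients
    # in hist[i]. Frequently updated parameters accumulate history
    # quickly, shrinking their effective step size, while rarely seen
    # (sparse) parameters keep a comparatively large learning rate.
    # delta is the numerical-stability term (the solver's default, 1e-8),
    # which prevents division by zero when hist[i] is still 0.
    for i in range(len(w)):
        hist[i] += grad[i] ** 2
        w[i] -= lr * grad[i] / (math.sqrt(hist[i]) + delta)
    return w, hist

# One step: a frequently seen feature (index 0), an unseen feature
# (index 1), and a rarely seen feature with a small gradient (index 2).
w = [0.0, 0.0, 0.0]
hist = [0.0, 0.0, 0.0]
w, hist = adagrad_update(w, [1.0, 0.0, 0.1], hist)
```

Note that after the first step both the large-gradient and small-gradient parameters move by roughly the same amount (about lr), since each gradient is normalized by its own accumulated history; this is the sense in which AdaGrad gives sparse features comparatively large updates.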
When To Use
AdaGrad may be helpful in finding predictive but rarely seen features. In the article “Large Scale Distributed Deep Networks,” Dean et al. note, “One technique that we have found to greatly increase the robustness of Downpour SGD is the use of the AdaGrad adaptive learning rate procedure.”