The AdaDeltaSolver – this adaptive gradient solver extends the AdaGradSolver by, as noted in the blog post “Deep Learning Optimizers” by Mayanglambam, using a ‘restricted window size’ of exponentially weighted averages over the past ‘t’ gradients. As Ruder notes in “An overview of gradient descent optimization algorithms,” this windowing seeks to “reduce [the AdaGrad] aggressive monotonically decreasing learning rate” by “restricting the window of accumulated past gradients to some fixed [window] size ‘w’.”
The SolverParameter specifies all parameters used to configure the solver. This section describes the AdaDelta-specific parameters.
momentum: this parameter is not used by this solver and should be set to 0.
delta: this parameter is used for numerical stability by the solver and defaults to 1e-8.
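To show where the delta parameter enters the computation, here is a minimal sketch of one AdaDelta update in the style of Zeiler’s original formulation. This is an illustrative stand-alone function, not the solver’s actual implementation; the names `acc_g2` and `acc_dx2` are assumptions for the two running averages the algorithm maintains.

```python
import math

def adadelta_step(x, grad, acc_g2, acc_dx2, rho=0.95, delta=1e-8):
    """One AdaDelta update (illustrative sketch, not the solver's code).

    acc_g2  - running average of squared gradients, E[g^2]
    acc_dx2 - running average of squared updates,   E[dx^2]
    delta   - small constant for numerical stability (the 'delta' parameter)
    """
    # Exponentially weighted window over past squared gradients
    acc_g2 = rho * acc_g2 + (1.0 - rho) * grad * grad
    # delta keeps both square roots well-defined and the ratio finite
    dx = -math.sqrt(acc_dx2 + delta) / math.sqrt(acc_g2 + delta) * grad
    # Accumulate the squared update for the next step
    acc_dx2 = rho * acc_dx2 + (1.0 - rho) * dx * dx
    return x + dx, acc_g2, acc_dx2

# Usage sketch: minimize f(x) = x^2, whose gradient is 2x
x, g2, d2 = 5.0, 0.0, 0.0
for _ in range(1000):
    x, g2, d2 = adadelta_step(x, 2.0 * x, g2, d2)
```

Note that no global learning rate appears in the update: the ratio of the two running RMS values provides the step scale, with delta preventing division by zero early on when both accumulators are near zero.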
When To Use
Like AdaGrad, AdaDelta may be helpful in finding predictive but infrequently seen parameters. However, as described in “Optimization techniques comparison in Julia: SGD, Momentum, Adagrad, AdaDelta, Adam” by Int8, the AdaDelta algorithm was designed to “improve [the] AdaGrad weakness of learning rate converging to zero with increase of time.”
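The weakness noted above can be made concrete with a small side-by-side sketch. The functions below are hypothetical illustrations (not part of any solver API) that feed a constant gradient to each algorithm’s accumulator and record the effective step size: AdaGrad divides by the sum of all past squared gradients, so its steps shrink without bound, while AdaDelta’s windowed averages keep its steps from vanishing.

```python
import math

def adagrad_steps(n, lr=0.1, eps=1e-8):
    """Record AdaGrad's effective step for a constant gradient g = 1.0."""
    acc, steps = 0.0, []
    for _ in range(n):
        g = 1.0
        acc += g * g  # sum over ALL past squared gradients - grows forever
        steps.append(lr * g / math.sqrt(acc + eps))
    return steps

def adadelta_steps(n, rho=0.95, delta=1e-8):
    """Record AdaDelta's effective step for a constant gradient g = 1.0."""
    g2, d2, steps = 0.0, 0.0, []
    for _ in range(n):
        g = 1.0
        g2 = rho * g2 + (1.0 - rho) * g * g  # windowed average, stays bounded
        dx = math.sqrt(d2 + delta) / math.sqrt(g2 + delta) * g
        d2 = rho * d2 + (1.0 - rho) * dx * dx
        steps.append(dx)
    return steps
```

Running both for a few thousand iterations shows AdaGrad’s step decaying toward zero while AdaDelta’s step size does not collapse, which matches the motivation quoted above.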