The AdamSolver implements the Adaptive Moment Estimation (Adam) solver, which updates weights by computing "individual adaptive learning rates for different parameters based on estimates of the first and second moments of gradients," as described by Mat Ruiz and Kaivalya Tota in "Adam: The Birthchild of AdaGrad and RMSProp". The paper "Adam: A Method for Stochastic Optimization" by Kingma and Ba describes Adam as combining the advantages of AdaGrad and RMSProp.
The SolverParameter specifies all parameters used to configure the solver. This section describes the Adam-specific parameters.
momentum: specifies the beta1 value used for the first-moment (mean) update.
momentum2: specifies the beta2 value used for the second-moment (uncentered variance) update.
delta: provides numerical stability in the denominator of the update; defaults to 1e-8.
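To make the roles of these parameters concrete, the following is a minimal sketch of a single Adam update step in Python with NumPy. The function name `adam_step` and its signature are illustrative, not part of the solver's actual API; the keyword names mirror the solver parameters above (momentum = beta1, momentum2 = beta2, delta = epsilon).

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001,
              momentum=0.9, momentum2=0.999, delta=1e-8):
    """Illustrative single Adam update; t is the 1-based step count."""
    # Exponential moving averages of the gradient and squared gradient.
    m = momentum * m + (1 - momentum) * grad        # first moment (mean)
    v = momentum2 * v + (1 - momentum2) * grad**2   # second moment (variance)
    # Bias correction compensates for zero initialization of m and v.
    m_hat = m / (1 - momentum**t)
    v_hat = v / (1 - momentum2**t)
    # delta keeps the denominator away from zero (numerical stability).
    w = w - lr * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v
```

After one step from zero-initialized moments, the bias-corrected update is approximately the learning rate in magnitude, which illustrates the bounded-stepsize property discussed below.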
When To Use
As noted above, in the paper by Kingma and Ba the Adam solver combines the benefits of RMSProp, which "works well in on-line and non-stationary settings," with the benefits of AdaGrad, which "works well with sparse gradients." Kingma and Ba further note, "Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing."
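The rescaling invariance mentioned above can be demonstrated with a small sketch: because the update direction is the ratio of the first moment to the square root of the second moment, multiplying the gradient by a constant cancels out. The helper `adam_direction` below is hypothetical, written only for this demonstration.

```python
import numpy as np

def adam_direction(grad, t=1, momentum=0.9, momentum2=0.999, delta=1e-8):
    """Effective update direction after one step from zero moments."""
    m = (1 - momentum) * grad
    v = (1 - momentum2) * grad**2
    m_hat = m / (1 - momentum**t)
    v_hat = v / (1 - momentum2**t)
    # The grad scale appears in both numerator and denominator, so it cancels.
    return m_hat / (np.sqrt(v_hat) + delta)

g = np.array([0.5, -2.0])
d1 = adam_direction(g)
d2 = adam_direction(1000.0 * g)  # rescaled gradient, nearly identical direction
```

Both calls produce (approximately) the elementwise sign of the gradient, so the magnitude of each parameter update stays bounded by the stepsize regardless of gradient scale.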
However, according to "A 2021 Guide to improving CNNs-Optimizers: Adam vs SGD" by Park, Adam may not generalize as well as SGD.