NesterovSolver – this solver is similar to SGD with momentum, but the error gradient is computed at the weights with the momentum step already applied (a “look-ahead” position), so the momentum term is “pointed in the right direction.” For more information on Nesterov momentum and how it differs from SGD + momentum, see the article “Understanding Nesterov Momentum (NAG)” by Dominik Schmidt.
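As a rough sketch of the difference (a hand-rolled update loop on a toy quadratic loss, not the solver’s actual API), classical momentum evaluates the gradient at the current weights, while Nesterov evaluates it at the look-ahead point:

```python
# Minimal sketch contrasting classical momentum with Nesterov momentum
# on f(w) = 0.5 * w^2; all names and values here are illustrative only.

def grad(w):
    # gradient of f(w) = 0.5 * w^2
    return w

def momentum_step(w, v, lr=0.1, mu=0.9):
    # classical momentum: gradient taken at the current weights w
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, lr=0.1, mu=0.9):
    # Nesterov: gradient taken at the look-ahead point w + mu * v,
    # i.e. where the momentum term is about to carry the weights
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

w, v = 1.0, 0.0
for _ in range(50):
    w, v = nesterov_step(w, v)
print(abs(w) < 0.1)  # True: the iterate has moved close to the minimum at 0
```

The only change between the two step functions is where the gradient is evaluated; that single look-ahead is what keeps the momentum term pointed toward the minimum.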
The SolverParameter specifies all parameters used to configure the solver. This section describes the NESTEROV-specific parameters.
momentum: specifies the amount of the previous weight update (the velocity) to carry over into the current update. For example, when momentum = 0.9, 90% of the previous update is retained and combined with the new, learning-rate-scaled gradient step. For more discussion on the impact of momentum, see the article “Stochastic Gradient Descent with Momentum” by Vitaly Bushaev.
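As a hedged numeric illustration of this parameter (the values below are made up for demonstration and are not the solver’s internal code):

```python
# Illustrative values only; not taken from the solver.
momentum = 0.9
lr = 0.01
prev_update = 0.5   # previous weight update (velocity)
gradient = 2.0      # current error gradient for this weight

# new update = 90% of the previous update minus the scaled gradient
update = momentum * prev_update - lr * gradient
print(round(update, 4))  # 0.43
```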
When to Use
The article “SGD with Nesterov acceleration – How it reduces the oscillation in SGD with Momentum” by neuralthreads suggests that Nesterov momentum can reduce the oscillation seen during optimization and thus “get faster convergence.”
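A small experiment along those lines (a hand-written loop on an ill-conditioned quadratic, not the solver itself; all values are assumptions for illustration) shows both variants converging, with the look-ahead gradient damping the oscillation along the steep axis:

```python
import numpy as np

def grad(w):
    # gradient of the ill-conditioned quadratic f(w) = 5*w0^2 + 0.5*w1^2
    return np.array([10.0 * w[0], 1.0 * w[1]])

def run(use_nesterov, steps=100, lr=0.05, mu=0.9):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w + mu*v
        g = grad(w + mu * v) if use_nesterov else grad(w)
        v = mu * v - lr * g
        w = w + v
    return float(np.linalg.norm(w))

print(run(use_nesterov=False))  # classical momentum: converges, oscillating
print(run(use_nesterov=True))   # Nesterov momentum: converges as well
```

With these particular settings the Nesterov update has a smaller spectral radius along both axes, which is one concrete way the reduced oscillation translates into faster convergence.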