The SGDSolver performs stochastic gradient descent optimization; when used with momentum, it updates the weights with a linear combination of the negative gradient and the previous weight update.
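The update rule described above can be sketched in a few lines of plain Python (this is an illustrative sketch of the standard momentum update, not Caffe's actual C++ implementation; the function name and default values are examples only):

```python
# SGD with momentum: the new update ("velocity") V is a linear
# combination of the previous update and the negative gradient,
# and the weights move by that update.
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """Return updated weights and update vector for one SGD step."""
    v_new = [momentum * vi - lr * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vni for wi, vni in zip(w, v_new)]
    return w_new, v_new

# One step from rest: v starts at zero, so the first update is
# just the learning-rate-scaled negative gradient.
w, v = [1.0, -2.0], [0.0, 0.0]
grad = [0.5, -0.5]
w, v = sgd_momentum_step(w, v, grad)
```

After this first step the update is `v = [-0.005, 0.005]`; on subsequent steps 90% of that carries over into the next update, which is what smooths the optimization trajectory.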
The SolverParameter specifies all parameters used to configure the solver. This section describes the SGD-specific parameters.
momentum: specifies the amount of the previous weight update to carry into the current weight update. For example, when momentum = 0.9, the new update is 90% of the previous update plus the learning-rate-scaled negative gradient. For more discussion of the impact of momentum, see the article “Stochastic Gradient Descent with Momentum” by Vitaly Bushaev.
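In Caffe, these parameters are set in the solver prototxt file. A minimal illustrative fragment (the values shown are common example settings, not recommendations):

```protobuf
# Example SolverParameter settings for SGD with momentum
base_lr: 0.01        # initial learning rate
momentum: 0.9        # fraction of the previous update to carry over
weight_decay: 0.0005 # L2 regularization strength
lr_policy: "step"    # drop the learning rate in steps
gamma: 0.1           # factor by which the learning rate drops
stepsize: 100000     # iterations between learning-rate drops
```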
When to Use
According to “A 2021 Guide to improving CNNs-Optimizers: Adam vs SGD” by Sieun Park, some articles suggest that “SGD better generalizes than Adam.” However, tuning “the initial learning rate and decay scheme for Adam yields significant improvements.” “Improving Generalization Performance by Switching from Adam to SGD” by Keskar and Socher also makes this suggestion.
We have found that SGD is a good baseline to start with when training.