
Optimization Solvers

When training, the MyCaffe Solver classes optimize the network, using different strategies to apply the gradients to the weights of each learnable layer.

Solver Configuration

The SolverParameter specifies which solver to use and holds the configuration settings specific to each solver.

There are four main areas of settings to consider when configuring a solver: Testing parameters, Learning Rate parameters, Snapshot parameters and Solver-specific parameters.

Testing Parameters

The testing parameters configure how testing is performed during training and include the following settings.

test_interval: the test interval defines how often the test cycle is performed during training.  For example, if training for 1000 iterations with test_interval = 100, the test cycle runs every 100th iteration (10 times in total).

test_iter: specifies the number of test iterations to run at each test interval.  For example, when test_iter = 10, ten test iterations are run, each using the batch size defined by the TEST phase data input.

test_initialization: when true, a test cycle is performed before training starts.
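
As a minimal sketch of how these settings might be applied in code (the SolverParameter property names and types shown here are assumptions that mirror the parameter names above; consult the MyCaffe API documentation for the exact signatures):

using MyCaffe.param;   // assumed namespace for SolverParameter

// Sketch only: property names mirror the settings described above.
SolverParameter solverParam = new SolverParameter();
solverParam.test_interval = 100;        // run a test cycle every 100 training iterations
solverParam.test_iter = 10;             // run 10 test iterations per test cycle
solverParam.test_initialization = true; // run one test cycle before training starts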

Learning Rate Parameters

The learning rate controls how much of the gradient the solver applies to the weights.  Over time the learning rate may be changed using various strategies.

base_lr: the base learning rate defines the starting learning rate, which may then change on each iteration depending on the learning rate policy used.

lr_policy: the learning rate policy defines how the learning rate is changed during the training process.  For example, a FIXED learning rate policy leaves the learning rate unchanged, whereas a SIGMOID policy follows a sigmoid decay curve.  For more information on the learning rate policies supported by MyCaffe, see the LearningRatePolicyType.  The formulas used by several of these policies are sketched below, after the stepvalue parameter.

gamma: specifies the ‘gamma’ term used to calculate STEP, EXP, INV and SIGMOID learning rate policies.

power: specifies the ‘power’ term used to calculate the INV and POLY learning rate policies.

stepsize: specifies the step size term used to calculate the STEP learning rate policy.

stepvalue: specifies the step values used to calculate the MULTISTEP learning rate policy.
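
For reference, the learning rate \alpha_t produced at iteration t by several of these policies follows the standard Caffe-style formulas sketched below, where \alpha_0 is base_lr and max_iter (not listed above) is the total number of training iterations:

FIXED:   \alpha_t = \alpha_0
STEP:    \alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / \text{stepsize} \rfloor}
EXP:     \alpha_t = \alpha_0 \cdot \gamma^{t}
INV:     \alpha_t = \alpha_0 \cdot (1 + \gamma t)^{-\text{power}}
POLY:    \alpha_t = \alpha_0 \cdot (1 - t / \text{max\_iter})^{\text{power}}
SIGMOID: \alpha_t = \alpha_0 \cdot \frac{1}{1 + \exp(-\gamma (t - \text{stepsize}))}

MULTISTEP behaves like STEP, except the exponent increases by one each time the iteration passes one of the stepvalue entries rather than at a fixed stepsize.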

weight_decay: when updating the weights, the learning rate multiplied by the gradient is subtracted from the weights.  Weight decay additionally subtracts the weight_decay rate multiplied by the weight itself, which decays each weight by a small amount on every update and can help reduce the chance of overfitting the model being trained.  For more information on weight decay, see the article “This thing called Weight Decay” by Dipam Vasani.

regularization_type: MyCaffe supports both L1 and L2 regularization, which help avoid overfitting a model during training by adding a penalty to the loss function.  For a more detailed discussion comparing the difference between L1 and L2 regularization, see the article “L1 vs L2 Regularization: The intuitive difference” by Dhaval Taunk.
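
As a sketch of the standard behavior (with \alpha the learning rate, \lambda the weight_decay value and w a weight), L2 regularization adds a penalty of \frac{\lambda}{2} \|w\|_2^2 to the loss while L1 adds \lambda \|w\|_1; for L2 the resulting weight update becomes

w \leftarrow w - \alpha \left( \nabla L(w) + \lambda w \right)

so, in addition to the usual gradient step, each weight is shrunk by \alpha \lambda w on every update.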

Snapshot Parameters

A ‘snapshot’ saves the weights learned during training.  Taking a snapshot whenever a new best accuracy is reached avoids losing any learning in the event the training is stopped early.  The following parameters define how snapshots are taken.

snapshot_include_weights: when true, the weight values themselves are saved in the snapshot.  When running inference, these weights are loaded and used by the inferencing solution.

snapshot_include_state: when true, the solver state, which includes the weight state, learning rate and iteration information, is saved in the snapshot.  Saving the state allows training to restart where it left off in the previous training session.

snapshot_format: MyCaffe supports the BINARYPROTO format for snapshot data.

snapshot: the snapshot value defines the fixed interval (in iterations) at which snapshots are saved.  By default, snapshots are saved on each testing cycle where the best accuracy of the training session is found.
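
Continuing the SolverParameter sketch from the Testing Parameters section (again, the property names and types are assumptions that mirror the settings above):

solverParam.snapshot = 1000;                  // also save a snapshot every 1000 iterations
solverParam.snapshot_include_weights = true;  // store the learned weights in the snapshot
solverParam.snapshot_include_state = true;    // store the solver state so training can resume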

Solvers

MyCaffe supports the following solver types: SGD, NESTEROV, ADAGRAD, RMSPROP, ADADELTA, ADAM and LBFGS.

SGD Solver

SGDSolver – performs Stochastic Gradient Descent optimization which, when used with momentum, updates weights with a linear combination of the negative gradient and the previous weight update.
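
Written out (a sketch of the standard Caffe-style formulation, with learning rate \alpha, momentum \mu, weights W_t and previous update V_t):

V_{t+1} = \mu V_t - \alpha \nabla L(W_t)
W_{t+1} = W_t + V_{t+1}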

NesterovSolver – this solver is similar to SGD, but the error gradient is computed on the weights with the momentum step already applied, so the momentum term is “pointed in the right direction”.  For more information on Nesterov momentum and how it differs from SGD + momentum, see the article “Understanding Nesterov Momentum (NAG)” by Dominik Schmidt.
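
As a sketch, the only change from the SGD + momentum update above is that the gradient is evaluated at the weights after the momentum step has been applied:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t)
W_{t+1} = W_t + V_{t+1}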

AdaGradSolver – this solver is simpler than SGD, for AdaGrad does not use a momentum term.  Instead, this solver uses “different learning rates for each parameter based on the iteration”* in an attempt to find rarely seen features.
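
A sketch of the standard AdaGrad update, which divides the learning rate by the square root of the sum of all past squared gradients so that rarely updated parameters keep a comparatively larger effective step (here g_t = \nabla L(W_t) and \varepsilon is a small constant for numerical stability):

W_{t+1} = W_t - \alpha \frac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2} + \varepsilon}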

AdaDeltaSolver – this solver is an extension of AdaGrad in that it also uses an adaptive learning rate and no momentum term, but it differs by using a “restricted window size”, an exponentially weighted average of the past ‘t’ gradients*.
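
A sketch of the standard AdaDelta update, where E[\cdot] denotes the exponentially decayed (windowed) average with decay rate \rho:

E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2
\Delta W_t = - \frac{\sqrt{E[\Delta W^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} \, g_t
E[\Delta W^2]_t = \rho E[\Delta W^2]_{t-1} + (1 - \rho) \Delta W_t^2
W_{t+1} = W_t + \Delta W_t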

AdamSolver – this solver is often the preferred optimizer, for it combines SGD+momentum with the adaptive learning rate of AdaDelta*.  Note, however, that according to “Adam vs. SGD: Closing the generalization gap on image classification” by Gupta et al., “Adam finds solutions that generalize worse than those found by SGD”.
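
A sketch of the standard Adam update, which keeps exponentially decayed averages of the gradient (m_t) and the squared gradient (v_t) with decay rates \beta_1 and \beta_2, corrects them for their bias toward zero, and then takes an adaptive step:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)
W_{t+1} = W_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}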

RmsPropSolver – like Adam, this solver uses an adaptive learning rate and “tries to resolve the problem that gradients may vary widely in magnitudes.”  To solve this, “Rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size individually for each weight.”  For more information on the RmsPropSolver, see the article “Understanding RMSProp – faster neural network learning” by Vitaly Bushaev.
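
A sketch of the standard RMSProp update, which divides the step by a decayed average of recent squared gradients (decay rate \delta) so that the step size adapts to the recent gradient magnitude:

MS_t = \delta \, MS_{t-1} + (1 - \delta) g_t^2
W_{t+1} = W_t - \alpha \frac{g_t}{\sqrt{MS_t} + \varepsilon}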

LBFGSSolver – this solver optimizes the parameters of a net using the L-BFGS algorithm, based on the minFunc implementation by Mark Schmidt.  MyCaffe uses the LBFGSSolver in neural style transfer applications.

*For more information comparing AdaGrad, AdaDelta and Adam optimizers, see the article “Deep Learning Optimizers” by Gunand Mayanglambam.

To learn more about the difference between deep learning optimization techniques, see “Various Optimization Algorithms for Training Neural Network” by Sanket Doshi.
