On the Origin of Implicit Regularization in Minibatch Stochastic Gradient Descent
In the limit of vanishing learning rates, stochastic gradient descent (SGD) follows the path taken by gradient flow on the full-batch loss function. However, larger learning rates often achieve higher test accuracies. This generalization benefit is not explained by convergence bounds, since it arises even for large compute budgets. To help interpret this phenomenon, we prove that, for small but finite learning rates, the discrete SGD iterates also follow the path of gradient flow, but on a modified loss. This modified loss is composed of the full-batch loss function and an implicit regularizer, which penalizes the mean squared Euclidean norm of the minibatch gradients. Under reasonable assumptions, when the batch size is small, the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
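To make the structure of such a modified loss concrete, the following is a minimal sketch in Python with NumPy, not the implementation used in this work. It adds to the full-batch loss a penalty on the mean squared norm of the minibatch gradients. The least-squares model, the function names, the fixed partition of the data into equal-sized disjoint minibatches, and the free coefficient `coeff` are all illustrative assumptions; the abstract does not state the exact coefficient, only that it grows with the learning rate.

```python
import numpy as np


def full_batch_loss(w, X, y):
    """Least-squares loss 0.5 * mean((Xw - y)^2) over all examples."""
    return 0.5 * np.mean((X @ w - y) ** 2)


def batch_grad(w, X, y):
    """Gradient of the least-squares loss on the examples in (X, y)."""
    return X.T @ (X @ w - y) / len(y)


def mean_sq_grad_norm(w, X, y, batch_size):
    """Mean squared Euclidean norm of the minibatch gradients.

    Assumption: the data are split, in order, into disjoint minibatches
    of equal size, so batch_size must divide the number of examples.
    """
    n = len(y)
    if n % batch_size != 0:
        raise ValueError("batch_size must divide the number of examples")
    sq_norms = []
    for start in range(0, n, batch_size):
        g = batch_grad(w, X[start:start + batch_size], y[start:start + batch_size])
        sq_norms.append(g @ g)
    return float(np.mean(sq_norms))


def modified_loss(w, X, y, batch_size, coeff):
    """Full-batch loss plus coeff times the minibatch gradient penalty.

    coeff is a hypothetical free parameter standing in for a
    learning-rate-dependent coefficient.
    """
    return full_batch_loss(w, X, y) + coeff * mean_sq_grad_norm(w, X, y, batch_size)


# Synthetic regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=64)
w = np.zeros(3)

g_full = batch_grad(w, X, y)
penalty_full = mean_sq_grad_norm(w, X, y, batch_size=64)   # one batch: full gradient
penalty_small = mean_sq_grad_norm(w, X, y, batch_size=8)   # eight minibatches

print("squared norm of full-batch gradient:", g_full @ g_full)
print("penalty with batch size 64:", penalty_full)
print("penalty with batch size 8:", penalty_small)
print("modified loss (batch size 8, coeff 0.01):", modified_loss(w, X, y, 8, 0.01))
```

With a single batch, the penalty reduces to the squared norm of the full-batch gradient. With equal-sized minibatches it can only be larger, because the mean of the squared norms is at least the squared norm of the mean gradient. The excess reflects the spread of the minibatch gradients, which is the part of the penalty that depends on the batch size.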