On the Role of Optimization in Double Descent: A Least Squares Study
Empirically, it has been observed that the performance of deep neural networks steadily improves as model size increases, contradicting the classical view on overfitting and generalization. Recently, the double-descent phenomenon has been proposed to reconcile this observation with theory, suggesting that the test error undergoes a second descent when the model becomes sufficiently overparametrized, as the model size itself acts as an implicit regularizer. In this paper we add to the growing body of work in this space, providing a careful study of learning dynamics as a function of model size for the least squares scenario. We show an excess risk bound for ordinary least squares that depends on the smallest positive eigenvalue of the covariance matrix of the input features. We observe that, under mild assumptions, this smallest positive eigenvalue follows the Bai-Yin law, and therefore exhibits a "U-shaped" behaviour as the number of features increases. Since the risk is essentially controlled by the inverse of this eigenvalue, this gives rise to the double descent curve. Our analysis of the excess risk allows us to decouple the effects of the optimisation and generalisation errors on the double descent. In particular, we find that in the case of noiseless regression the double descent is explained solely by the optimisation error, which was missed in studies focusing on the Moore-Penrose pseudoinverse solution. We believe that our derivation provides an alternative view compared to existing work, shedding some light on a possible cause of this phenomenon, at least in the considered least squares setting. We empirically explore whether the covariance of intermediary hidden activations exhibits behaviour similar to that assumed in our derivations, and whether our predictions hold for neural networks.
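The Bai-Yin behaviour described above can be illustrated numerically: for an n × p matrix of i.i.d. Gaussian features, the smallest positive eigenvalue of the sample covariance (1/n)XᵀX dips towards zero at the interpolation threshold p = n and grows on either side of it, tracing the "U-shape". The following sketch is illustrative only; the Gaussian feature model, the sample size n = 200, and the helper name are assumptions for this demonstration, not the paper's exact setup.

```python
import numpy as np

def smallest_positive_eigenvalue(n, p, seed=0, tol=1e-8):
    """Smallest positive eigenvalue of the sample covariance (1/n) X^T X,
    where X is n x p with i.i.d. standard Gaussian entries."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # For p > n the covariance is rank-deficient, so keep only eigenvalues
    # above a small tolerance (the "positive" part of the spectrum).
    eigvals = np.linalg.eigvalsh(X.T @ X / n)
    return eigvals[eigvals > tol].min()

# Fixed sample size; sweep the number of features p across the
# interpolation threshold p = n.
n = 200
for p in (50, 100, 200, 400, 800):
    lam = smallest_positive_eigenvalue(n, p)
    bai_yin = (np.sqrt(p / n) - 1) ** 2  # asymptotic Bai-Yin prediction
    print(f"p={p:4d}  lambda_min+={lam:.4f}  Bai-Yin~{bai_yin:.4f}")
```

Since the excess risk is controlled by the inverse of this eigenvalue, the dip at p = n corresponds to the peak of the double descent curve.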