Modern neural networks, with billions of parameters, are so overparametrized that they can “overfit” even random, structureless data. Yet when trained on datasets with structure, they learn the underlying features. Understanding why overparametrization does not destroy their effectiveness is a fundamental challenge in AI. Two researchers, Andrea Montanari (Stanford) and Pierfrancesco Urbani (IPhT), propose that feature learning and overfitting coexist but occur on distinct timescales during training.
Computer science grew out of Alan Turing’s pioneering work. He conceived a programmable machine capable of automatically evaluating complex functions from a given input. Crucially, the instructions for these operations, known as an algorithm (or, once implemented, software), must be supplied to the machine externally and must be changed whenever the task or computation changes. This separation between the “machine” and the “instructions” is the architecture that still governs how laptops, smartphones, and other devices operate through programs and apps.
A paradigm shift occurred in the 1950s, when researchers proposed a general-purpose computational framework that learns the necessary instructions directly from a large training dataset of examples. Consider a self-driving car: rather than programmers coding every single decision for every possible road scenario, the system is trained on a large set of driving conditions and corresponding solutions. The machine learns the underlying function required to make safe, real-time driving decisions, effectively discovering its own driving software. This is the core idea behind machine learning and neural networks.
Artificial Neural Networks (ANNs) are computational systems in which a huge number of adjustable parameters, called weights, are automatically tuned (fitted) during training to learn the complex mathematical functions that allow them to perform a desired task. The theoretical understanding of simple neural networks is grounded in statistical learning theory. A central pillar of this theory is that networks are expected to work effectively when (roughly) their fitting complexity (for example, the number of weights) is kept low relative to the quantity of training data, favouring the simplest effective model. This is essentially a computational form of Occam’s Razor.
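To make this concrete, the following minimal sketch (a purely illustrative toy in Python, not code from the paper; the network sizes, learning rate, and target function are arbitrary choices) shows a small two-layer network whose weights are tuned by plain gradient descent to fit a one-dimensional target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n noisy samples of a smooth one-dimensional target.
n, width = 20, 200                       # width >> n: more weights than data
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x) + 0.1 * rng.normal(size=(n, 1))

# Two-layer network f(x) = a^T tanh(W x + b); W, b, a are the "weights".
W = rng.normal(size=(width, 1))
b = rng.normal(size=(width,))
a = rng.normal(size=(width, 1)) / np.sqrt(width)

lr = 0.05
for step in range(5000):
    h = np.tanh(x @ W.T + b)             # hidden activations, shape (n, width)
    err = h @ a - y                      # residuals, shape (n, 1)
    # Gradient-descent update of every weight (mean-squared-error loss).
    a -= lr * h.T @ err / n
    d_h = (err @ a.T) * (1 - h ** 2)     # back-propagation through tanh
    W -= lr * d_h.T @ x / n
    b -= lr * d_h.mean(axis=0)

h = np.tanh(x @ W.T + b)
print("final training loss:", float(np.mean((h @ a - y) ** 2)))
```

Training here means nothing more than repeatedly nudging every weight in the direction that reduces the error on the examples; no task-specific instructions are ever written by hand.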
A major theoretical crisis began around fifteen years ago, when deep neural networks, a class of very complex fitting functions, were empirically shown to be highly effective even on simple tasks. This was astonishing because these models often have more weights than training examples, a situation called overparametrization: they are so complex that they can fit even completely featureless (random) data. Despite this, such models learn the latent features of meaningful data (feature learning) and are therefore said to generalize well. Understanding why overparametrization does not harm, and may even benefit, the performance of modern neural networks has become a central problem at the core of the foundations of AI and of learning paradigms in general.
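The “fits even random data” claim is easy to reproduce in a toy setting. The sketch below (an illustration under assumed sizes, not an experiment from the paper) gives the same kind of two-layer network far more weights than training examples and trains it on completely random labels; gradient descent can still drive the training error toward zero, i.e. the network memorizes pure noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, width = 20, 500                       # ~2000 weights for only 20 examples
x = rng.normal(size=(n, 2))
y = rng.normal(size=(n, 1))              # random, structureless labels

W = rng.normal(size=(width, 2))
a = rng.normal(size=(width, 1)) / np.sqrt(width)

lr = 0.1
for _ in range(20000):
    h = np.tanh(x @ W.T)                 # hidden activations, shape (n, width)
    err = h @ a - y
    a -= lr * h.T @ err / n              # gradient step on the output weights
    W -= lr * ((err @ a.T) * (1 - h ** 2)).T @ x / n   # and on the first layer

print("training error on pure noise:",
      float(np.mean((np.tanh(x @ W.T) @ a - y) ** 2)))
```

By the classical Occam’s Razor reasoning above, a model flexible enough to do this should generalize poorly; the puzzle is that in practice it does not.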
In recent work (accepted as an oral contribution to the NeurIPS 2025 conference), Andrea Montanari (Stanford University) and Pierfrancesco Urbani (IPhT) have proposed a solution to this puzzle [1]. Using a combination of innovative theoretical-physics techniques and rigorous statistical analysis, they show that overfitting and feature learning coexist in overparametrized neural networks but arise at different moments of the training dynamics, an emergent separation of timescales. This dynamical decoupling between feature learning and overfitting results from the interplay between the training algorithm and the architecture of the network, with bigger models exhibiting a larger separation of timescales. Because feature learning arises before overfitting sets in, this scenario provides a robust mechanism explaining why and how huge overparametrized neural networks work.
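The qualitative scenario can be illustrated with a toy simulation (again an illustrative construction, not the authors’ analysis, which concerns large two-layer networks in a precise asymptotic regime): on noisy data, the test error of a small two-layer network typically drops early in training, as the signal is picked up, while the slow memorization of noise only shows up much later, when the training error is pushed toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, width = 30, 500, 300

def target(x):                           # the latent "feature" to be learned
    return np.sin(3 * x)

x_tr = rng.uniform(-1, 1, size=(n_train, 1))
y_tr = target(x_tr) + 0.3 * rng.normal(size=(n_train, 1))  # noisy labels
x_te = rng.uniform(-1, 1, size=(n_test, 1))
y_te = target(x_te)                      # clean targets to measure test error

W = rng.normal(size=(width, 1))
b = rng.normal(size=(width,))
a = rng.normal(size=(width, 1)) / width

def forward(x):
    return np.tanh(x @ W.T + b) @ a

lr = 0.2
for step in range(1, 100001):
    h = np.tanh(x_tr @ W.T + b)
    err = h @ a - y_tr
    a -= lr * h.T @ err / n_train
    d_h = (err @ a.T) * (1 - h ** 2)
    W -= lr * d_h.T @ x_tr / n_train
    b -= lr * d_h.mean(axis=0)
    if step in (100, 1000, 10000, 100000):   # checkpoints on a log scale
        tr = float(np.mean((forward(x_tr) - y_tr) ** 2))
        te = float(np.mean((forward(x_te) - y_te) ** 2))
        print(f"step {step:>6}:  train MSE {tr:.3f}   test MSE {te:.3f}")
```

In this toy, the gap between the time at which the test error is already low and the much later time at which the noise is fully fitted plays the role of the timescale separation; in the paper this separation is shown to grow with the size of the network.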
[1] Andrea Montanari and Pierfrancesco Urbani (2025). Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks. The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025). https://openreview.net/forum?id=ImpizBSKcu, https://arxiv.org/abs/2502.21269.


