Modern neural networks, with billions of parameters, are so overparametrized that they can “overfit” even random, structureless data. Yet when trained on datasets with structure, they learn the underlying features. Understanding why overparametrization does not destroy their effectiveness is a fundamental challenge in AI. Two researchers, Andrea Montanari (Stanford) and Pierfrancesco Urbani (IPhT), propose that feature learning and overfitting coexist but occur on distinct timescales during training.
Computer science started from Alan Turing’s pioneering work. He conceived a programmable machine capable of automatically evaluating complex functions from an initial input. Very importantly, the instructions for these operations, known as an algorithm (or software), must be supplied to the machine externally and must be changed whenever the task or computation changes. This separation between the “machine” and the “instructions” is the architecture that still governs how laptops, smartphones and other devices operate through code, programs and apps.
A paradigm shift occurred in the 1950s when researchers proposed a different kind of architecture, one that would learn the necessary instructions directly from a massive training dataset. Consider a system that translates text from one language to another. One approach is to code an algorithm that translates each word from one language to the other and rearranges the words according to the syntax of the target language. Modern systems instead discover their own translation algorithm autonomously from a large collection of example translations, without any external notion of syntax or vocabulary supplied to the system. This is the very essence of machine learning and neural networks.
Artificial Neural Networks (ANNs) are computational systems in which a huge number of adjustable parameters, called weights, are automatically tuned (fitted) during training to learn the complex mathematical functions that allow the system to perform a desired task. The theoretical understanding of simple neural networks is grounded in statistical learning theory. A central pillar of this theory is the idea that networks are expected to work effectively when (roughly) their fitting complexity (for example, the number of weights) is kept low relative to the quantity of training data, favouring the simplest effective model. This is essentially a computational form of Occam’s razor.
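To make “tuning the weights” concrete, here is a minimal sketch of a two-layer network, the architecture studied in the paper cited below, trained by plain gradient descent on a toy regression task. All sizes, the data-generating rule and the learning rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (illustrative): n examples in d dimensions; the target
# depends only on a single latent feature, the first coordinate.
n, d, m = 100, 10, 50              # m = number of hidden units
X = rng.standard_normal((n, d))
y = np.tanh(X[:, 0])               # the simple rule the network must learn

# Adjustable parameters ("weights") of a two-layer network
# f(x) = a . tanh(W x), initialized at random.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

def forward(X):
    return np.tanh(X @ W.T) @ a

lr = 0.1
for step in range(2000):           # "training": gradient descent on the loss
    H = np.tanh(X @ W.T)           # hidden-layer activations
    err = H @ a - y                # residuals on the training set
    a -= lr * (H.T @ err) / n                                 # output weights
    W -= lr * (((err[:, None] * a) * (1 - H**2)).T @ X) / n   # first layer

print("training loss:", np.mean((forward(X) - y) ** 2))
```

Running the loop drives the training loss down: the weights, not any externally supplied instructions, end up encoding the rule hidden in the data.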
A major theoretical crisis began around fifteen years ago, when deep neural networks, a class of very complex fitting functions, were empirically shown to be highly effective even on simple tasks. This was astonishing because these models often have more weights than training examples, a situation typically called overparametrization: they are so complex that they can fit even completely featureless (random) data. Despite this, these models learn the latent features of meaningful data (feature learning) and are therefore said to generalize well. Understanding why overparametrization does not harm the performance of modern neural networks, and whether it is in fact beneficial, has become a central problem at the core of the foundations of AI and of learning paradigms in general.
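The fitting-random-data phenomenon can be reproduced in a few lines. The sketch below, with arbitrary toy sizes, shows a linear model with ten times more parameters than training examples fitting completely random labels almost exactly (the minimum-norm least-squares solution computed here is also what gradient descent from zero initialization converges to in this linear setting).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: far more parameters than training examples (p >> n).
n, p = 50, 500
X = rng.standard_normal((n, p))    # random inputs
y = rng.standard_normal(n)         # completely random, structureless labels

# Minimum-norm least-squares fit of the random labels.
w = np.linalg.lstsq(X, y, rcond=None)[0]

print("residual on random labels:", np.linalg.norm(X @ w - y))  # ~ 0
```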
In a recent work (accepted as an oral contribution to the NeurIPS 2025 conference), Andrea Montanari (Stanford University) and Pierfrancesco Urbani (IPhT) have proposed a solution to this puzzle. Using a combination of innovative theoretical physics techniques and rigorous statistical analysis, they have shown that overfitting and feature learning coexist in overparametrized neural networks but arise at different moments during the training dynamics (an emergent separation of timescales). This dynamical decoupling between feature learning and overfitting results from the interplay between the training algorithm and the architecture of the network, with bigger models exhibiting a larger separation of timescales. Because feature learning arises before overfitting, this scenario suggests a robust mechanism that explains why and how huge overparametrized neural networks work.
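The separation of timescales can be illustrated qualitatively. The following is a numerical cartoon, not the authors’ analysis: an overparametrized two-layer network is trained on noisy data while the training and test errors are recorded at early, intermediate and late times. The expected picture is that the test error drops early, as the latent feature is learned, while driving the training error down to the noise floor, i.e. overfitting, happens on a much longer timescale. All sizes, noise levels and step counts are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes: the network has ~m*d weights, far more than n examples.
n, d, m = 80, 20, 400
X = rng.standard_normal((n, d))
X_test = rng.standard_normal((1000, d))

def signal(Z):
    return np.tanh(Z[:, 0])                   # latent feature in the data

y = signal(X) + 0.5 * rng.standard_normal(n)  # noisy training labels
y_test = signal(X_test)                       # clean test targets

W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

def f(Z):
    return np.tanh(Z @ W.T) @ a

lr = 0.05
for step in range(1, 20001):                  # gradient-descent training
    H = np.tanh(X @ W.T)
    err = H @ a - y
    a -= lr * (H.T @ err) / n
    W -= lr * (((err[:, None] * a) * (1 - H**2)).T @ X) / n
    if step in (100, 1000, 20000):            # early / intermediate / late
        print(f"step {step:6d}"
              f"  train {np.mean((f(X) - y) ** 2):.3f}"
              f"  test {np.mean((f(X_test) - y_test) ** 2):.3f}")
```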
This work paves the way for a better understanding of the training dynamics of modern machine‑learning models. The corresponding learning scenario could also find applications in other systems that need to extract information and adapt accordingly, such as biological systems.
[1] Andrea Montanari and Pierfrancesco Urbani (2025). Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025). https://openreview.net/forum?id=ImpizBSKcu; https://arxiv.org/abs/2502.21269


