Researchers from Carnegie Mellon, Stanford, Harvard, and Princeton describe a phenomenon they call "catastrophic overtraining". Their study shows that overly prolonged pre-training can hurt a model's performance after fine-tuning.
AI models deteriorate when overtrained
The study compared two versions of the OLMo-1B model: one pre-trained on 2.3 trillion tokens, the other on 3 trillion. The surprise: the model trained on less data performs up to 3% better on benchmarks such as AlpacaEval and ARC.
According to the researchers, the performance degradation is caused by "progressive sensitivity": the more tokens a model is trained on, the more fragile it becomes. Small adjustments during fine-tuning, or the addition of noise, can then undo earlier gains.
To demonstrate this fragility, the researchers injected Gaussian noise into pre-trained models and found that performance degraded in step with how long the model had been trained.
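The noise-injection check is straightforward to reproduce in spirit. The sketch below is a minimal, hypothetical PyTorch example, not the authors' actual protocol: it perturbs every parameter of a pre-trained model with zero-mean Gaussian noise, and the names `load_checkpoint` and `evaluate_benchmark` are assumed placeholder helpers, not real APIs.

```python
import copy
import torch

def perturb_with_gaussian_noise(model: torch.nn.Module, std: float, seed: int = 0) -> torch.nn.Module:
    """Return a copy of `model` whose parameters receive i.i.d. zero-mean Gaussian noise of scale `std`."""
    torch.manual_seed(seed)
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * std)
    return noisy

# Hypothetical usage: compare how quickly two checkpoints degrade as the noise grows.
# `load_checkpoint` and `evaluate_benchmark` stand in for whatever loading and
# evaluation code a given setup provides.
#
# for tokens in ("2.3T", "3T"):
#     base = load_checkpoint(tokens)
#     for std in (0.0, 0.01, 0.02, 0.04):
#         score = evaluate_benchmark(perturb_with_gaussian_noise(base, std))
#         print(tokens, std, score)
```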
The point at which further training starts to degrade performance is called the "inflection point". Once it is reached, the gains from training are outweighed by internal instability. According to the study, this critical point generally occurs beyond 2.5 trillion tokens in smaller models such as OLMo-1B.
"Catastrophic overtraining may be inevitable … especially when the pre-training and fine-tuning tasks are misaligned," the researchers warn.
However, the researchers do not suggest abandoning pre-training; they invite developers to consider how much pre-training is optimal from the outset: "Our findings call for a renewed focus on model scaling that takes the entire training pipeline into account."