
Scaling Laws from the Data Manifold Dimension

Utkarsh Sharma, Jared Kaplan.

Year: 2022, Volume: 23, Issue: 9, Pages: 1−34


Abstract

When data is plentiful, the test loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
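As a rough illustration of the relation stated in the abstract, the sketch below fits a scaling exponent from (parameter count, loss) pairs by linear regression in log-log space and compares it with the predicted value 4/d. It is not the authors' code: the function names, the prefactor 2.0, and the synthetic data are assumptions made up for this example, and the losses are generated to follow an exact power law rather than taken from any experiment.

```python
import math


def fit_scaling_exponent(param_counts, losses):
    """Least-squares fit of log L = log C - alpha * log N; returns alpha."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The slope of log L against log N is -alpha, so flip the sign.
    return -cov / var


def predicted_exponent(intrinsic_dim):
    """Exponent predicted by the abstract's relation alpha ~ 4/d."""
    return 4.0 / intrinsic_dim


# Synthetic data (assumption): an exact power law for a manifold with d = 8.
d = 8
param_counts = [10**3, 10**4, 10**5, 10**6]
losses = [2.0 * n ** (-predicted_exponent(d)) for n in param_counts]

alpha_hat = fit_scaling_exponent(param_counts, losses)
print(f"fitted alpha = {alpha_hat:.3f}, predicted 4/d = {predicted_exponent(d):.3f}")
```

On real measurements the fitted exponent would only approximate 4/d, and the intrinsic dimension d would itself have to be estimated separately, as the abstract describes.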
