2/15/2022

Computer Scientists Prove Why Bigger Neural Networks Do Better

Mordechai Rorvig, Computer Scientists Prove Why Bigger Neural Networks Do Better, Quanta Magazine, February 10, 2022.

Interpolation 

An old mathematical result says that to fit n data points with a curve, you need a function with n parameters. (In the previous example, the two points were described by a curve with two parameters.) When neural networks first emerged as a force in the 1980s, it made sense to think the same thing. They should need only n parameters to fit n data points — regardless of the dimension of the data.
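
Polynomial interpolation is a concrete instance of that old rule: a degree-(n - 1) polynomial has exactly n coefficients and can pass through any n points with distinct x-values. A minimal sketch in Python (the data values here are arbitrary, chosen only for illustration):

import numpy as np

# Five toy data points with distinct x-values (n = 5).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# A degree-(n - 1) polynomial has exactly n coefficients,
# so it can pass through all n points exactly.
coeffs = np.polyfit(x, y, deg=n - 1)
fitted = np.polyval(coeffs, x)

print(np.allclose(fitted, y))  # True: n parameters fit n points exactly

Notice that the dimension of each data point never enters the count, which is why the classical rule seemed to carry over to neural networks.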

“This is no longer what’s happening,” said Alex Dimakis of the University of Texas, Austin. “Right now, we are routinely creating neural networks that have a number of parameters more than the number of training samples. This says that the books have to be rewritten.”

Robustness

Bubeck and Sellke didn’t set out to rewrite anything. They were studying a different property that neural networks often lack, called robustness, which is the ability of a network to deal with small changes. For example, a network that’s not robust may have learned to recognize a giraffe, but it would mislabel a barely modified version as a gerbil. In 2019, Bubeck and colleagues were seeking to prove theorems about the problem when they realized it was connected to a network’s size.
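
The giraffe-to-gerbil failure has a simple numerical caricature (not the authors' construction): for a linear classifier in high dimension, nudging every input coordinate by a tiny amount in the direction of the weights shifts the score enough to flip the label. The dimension, the seed, and the perturbation size below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
d = 784                              # input dimension, e.g. a 28 x 28 image
w = rng.normal(size=d)               # weights of a toy linear classifier
w /= np.linalg.norm(w)

x = rng.normal(size=d)               # an input, labeled by the sign of the score
score = w @ x

# Move every coordinate by only epsilon, pushing against the current label.
epsilon = 0.25
x_adv = x - np.sign(score) * epsilon * np.sign(w)

print("original score :", score)
print("perturbed score:", w @ x_adv)
print("label flipped  :", np.sign(w @ x_adv) != np.sign(score))

No single coordinate moves by more than 0.25, yet the prediction reverses; robustness is about ruling out exactly this kind of sensitivity.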

“We were studying adversarial examples — and then scale imposed itself on us,” said Bubeck. “We recognized it was this incredible opportunity, because there was this need to understand scale itself.”

In their new proof, the pair show that overparameterization is necessary for a network to be robust. They do it by figuring out how many parameters are needed to fit data points with a curve that has a mathematical property equivalent to robustness: smoothness.
Geometry

To see this, again imagine a curve in the plane, where the x-coordinate represents the color of a single pixel, and the y-coordinate represents an image label. Since the curve is smooth, if you were to slightly modify the pixel’s color, moving a short distance along the curve, the corresponding prediction would only change a small amount. On the other hand, for an extremely jagged curve, a small change in the x-coordinate (the color) can lead to a dramatic change in the y-coordinate (the image label). Giraffes can become gerbils.
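
A rough way to see the smooth-versus-jagged contrast numerically (this is an illustration, not anything from the proof) is to fit the same ten label values twice: once with the unique degree-9 interpolating polynomial, which oscillates between the points, and once with a piecewise-linear interpolant, which does not. The same small shift in x then produces very different changes in the predicted label:

import numpy as np

x = np.linspace(0.0, 1.0, 10)                             # ten "pixel colors"
y = np.array([0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])    # alternating toy labels

# Jagged fit: the unique degree-9 polynomial through all ten points.
jagged = np.polynomial.polynomial.Polynomial.fit(x, y, deg=9)

# Smoother fit of the same points: piecewise-linear interpolation.
def smooth(q):
    return np.interp(q, x, y)

q, dq = 0.90, 0.02                      # a query point and a small color change
print("jagged fit change:", abs(jagged(q + dq) - jagged(q)))   # a large swing
print("smooth fit change:", abs(smooth(q + dq) - smooth(q)))   # only 0.18 here

Both fits memorize the same ten labels, but the jagged one pays for it with huge local sensitivity, which is exactly the failure that robustness forbids.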

Bubeck and Sellke showed that smoothly fitting high-dimensional data points requires not just n parameters, but n × d parameters, where d is the dimension of the input (for example, 784 for a 784-pixel image). In other words, if you want a network to robustly memorize its training data, overparameterization is not just helpful — it’s mandatory. The proof relies on a curious fact about high-dimensional geometry: randomly distributed points placed on the surface of a sphere are almost all far from one another, separated by a distance comparable to the sphere’s diameter. The large separation between points means that fitting them all with a single smooth curve requires many extra parameters.
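
That geometric fact is easy to check numerically: sample random points on a high-dimensional unit sphere, and their pairwise distances concentrate near sqrt(2) times the radius, roughly 70 percent of the diameter, so every pair is far apart. A small sketch, with the dimension and the number of points chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
d, n = 784, 200                          # ambient dimension and number of points

# Uniform points on the unit sphere in R^d: normalize Gaussian vectors.
points = rng.normal(size=(n, d))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Pairwise distances via the Gram matrix: ||u - v||^2 = 2 - 2 <u, v>.
gram = points @ points.T
dists = np.sqrt(np.clip(2.0 - 2.0 * gram, 0.0, None))
pair = dists[np.triu_indices(n, k=1)]

print("min  pairwise distance:", pair.min())
print("mean pairwise distance:", pair.mean())   # close to sqrt(2), about 1.41
print("max  pairwise distance:", pair.max())    # for reference, the diameter is 2.0

In low dimension, random points often land close together; in high dimension they spread out, and a single smooth function that hits a prescribed label at each of these well-separated points needs on the order of n × d parameters to do it.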

Sébastien Bubeck and Mark Sellke, A Universal Law of Robustness via Isoperimetry, arXiv:2105.12806v3, 2021.
