I’m going to mention a problem that’s similar to what Haohan Wang (https://www.quora.com/profile/Haohan-Wang) mentioned, but with a different twist:

Why does deep learning generalize so well, despite using parameters that are orders of magnitude more numerous than the training samples?

Let me give you an example. Consider VGG19 (https://arxiv.org/pdf/1409.1556.pdf), the deep convolutional neural net that won the ImageNet challenge in 2014 (in the classification+localization category). It has upwards of 130 million parameters, and the same architecture still performs amazingly well on a puny dataset like CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), which only has 60,000 thumbnail-sized images!

Let me repeat: the algorithm has 130 million tunable parameters! Stored as 32-bit floats, that is several times more bits than the entire raw CIFAR-10 dataset, and over two thousand parameters for every training image, which is more than enough capacity to simply memorize the training labels. But somehow VGG19 doesn’t overfit and actually gives a decent test accuracy! How in the world is it doing that?
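If you want to check those numbers yourself, here is a minimal sketch (assuming PyTorch and torchvision are installed; torchvision’s VGG19 comes out to about 143.7 million parameters, consistent with the “upwards of 130 million” figure above):

```python
# Sanity-checking the numbers above. Assumes PyTorch and torchvision
# are installed; torchvision's VGG19 includes the fully connected
# layers, which hold most of the parameters.
import torchvision.models as models

vgg19 = models.vgg19(weights=None)  # architecture only, no pretrained weights
n_params = sum(p.numel() for p in vgg19.parameters())
print(f"VGG19 parameters:    {n_params:,}")        # 143,667,240

# CIFAR-10: 60,000 images, 32x32 pixels, 3 color channels
n_values = 60_000 * 32 * 32 * 3
print(f"CIFAR-10 raw values: {n_values:,}")        # 184,320,000
print(f"Params per image:    {n_params / 60_000:,.0f}")  # ~2,394
```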

Now compare that to a typical “non-deep” machine learning algorithm like logistic regression. For CIFAR-10’s 32x32 color images there are 3,072 input features, so the parameter count is just over 3,000 per class. Even if you add polynomial features of order up to 10 for every pixel of every color channel (which will lead to terrible overfitting), the parameter count is still thousands of times lower than VGG19’s!
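Here is the arithmetic behind that comparison, spelled out (plain Python, no libraries needed):

```python
# Back-of-the-envelope parameter counts for the logistic regression
# comparison above.
features = 32 * 32 * 3              # 3,072 raw pixel values per image
per_class = features + 1            # weights + bias: 3,073 ("just over 3,000")
softmax_10way = 10 * per_class      # full 10-class model: 30,730

# Degree-10 polynomial expansion of each pixel, as suggested above
poly_per_class = features * 10 + 1  # 30,721 parameters per class

vgg19 = 143_667_240                 # torchvision's VGG19 parameter count
print(f"{vgg19 / poly_per_class:,.0f}x")  # ~4,676x: thousands of times more
```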

Conventional deep learning techniques used to prevent overfitting (dropout, L1 regularization, L2 regularization, etc.) don’t even come close to offering a convincing explanation for this mysterious phenomenon. There is still a lot to be understood.
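For concreteness, this is roughly how those techniques are typically wired up in PyTorch. It is an illustrative sketch on a toy model, not a claim about how VGG19 itself was trained:

```python
# How the regularizers named above usually appear in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),              # dropout: randomly zero activations
    nn.Linear(512, 10),
)

# L2 regularization is usually applied as weight decay in the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)

# L1 regularization has no built-in optimizer knob; add it to the loss.
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())
```

The comment’s point stands even with all of these in place: the capacity mismatch between 130+ million parameters and 60,000 images remains, and these penalties alone don’t explain why the network generalizes.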
