How many images do you need to train a neural network?

Photo by Glenn Scott

Today I got an email with a question I’ve heard many times: “How many images do I need to train my classifier?” In the early days I would reply with the technically correct but useless answer of “it depends”, but over the last couple of years I’ve realized that even a very approximate rule of thumb is useful, so here it is for posterity:

You need 1,000 representative images for each class.

Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for.

The 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with roughly 1,000 training images (the count varies by class, and some I looked at had around seven or eight hundred). This was good enough to train the early generations of image classifiers like AlexNet, which suggests that around 1,000 images per class can be enough.

Can you get away with fewer, though? Anecdotally, based on my experience, you can in some cases, but once you get into the low hundreds it seems to get trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because you’re using a network that has already seen a lot of images and learned features that distinguish between its original classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples.
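To make the transfer learning idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: `frozen_features` is a hypothetical stand-in for the penultimate layer of a pretrained network, the "images" are flat lists of pixel values, and the two classes are simply dark and bright images. The only point is the shape of the approach, where the feature extractor is left untouched and a small new softmax layer is trained on a handful of examples.

```python
import math
import random

def frozen_features(image):
    """Hypothetical stand-in for a pretrained network's penultimate layer.

    A real pipeline would run a pretrained CNN here. Nothing in this
    function is trained; it only turns pixels into a short feature vector.
    """
    half = len(image) // 2
    mean = sum(image) / len(image)
    contrast = sum(image[:half]) / half - sum(image[half:]) / (len(image) - half)
    return [mean, contrast, 1.0]  # trailing 1.0 acts as a bias input

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_head(examples, num_classes, epochs=200, lr=0.5):
    """Train only a new softmax layer on top of the frozen features."""
    feats = [(frozen_features(img), label) for img, label in examples]
    dim = len(feats[0][0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, label in feats:
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            for c in range(num_classes):
                # Gradient of the cross-entropy loss for a softmax layer.
                grad = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights

def predict(weights, image):
    x = frozen_features(image)
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return scores.index(max(scores))

def toy_image(bright, rng, size=16):
    """Toy data: bright images have pixels in [0.6, 1.0], dark ones in [0.0, 0.4]."""
    low = 0.6 if bright else 0.0
    return [low + 0.4 * rng.random() for _ in range(size)]

rng = random.Random(0)
# Ten examples per new class, interleaved so updates alternate between classes.
train = [(toy_image(label == 1, rng), label) for _ in range(10) for label in (0, 1)]
head = train_head(train, num_classes=2)
print(predict(head, toy_image(True, rng)), predict(head, toy_image(False, rng)))
```

The new layer here has only six weights, which is why twenty examples are plenty. With a real pretrained network the feature vector is much longer, but the principle is the same: most of what the model knows was learned from the original dataset, not from your images.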

What does “in the same domain” mean? It’s a lot easier to teach a network that’s been trained on photos of real-world objects (like those in ImageNet) to recognize other objects. Taking that same network and asking it to categorize completely different types of images, like x-rays, faces, or satellite photos, is likely to be less successful, and will at least require a lot more training images.

Another key point is the “representative” modifier in my rule of thumb. That’s there because the quality of the images is important, not just the quantity. What’s crucial is that the training images are as close as possible to the inputs that the model will see when it’s deployed. When I first tried to run a model trained with ImageNet on a robot I didn’t see great results, and it turned out that was because the robot’s camera had a lot of fisheye distortion, and the objects weren’t well-framed in the viewfinder. ImageNet consists of photos taken from the web, so they’re usually well-framed and without much distortion. Once I retrained my network with images that were taken by the robot itself the results got a lot better. The same applies to almost any application: a smaller number of training images taken in the same environment the model will run in will produce better end results than a larger number of less representative images.

Andreas just reminded me that augmentations are important too. You can augment the training data by randomly cropping, rotating, brightening, or warping the original images. TensorFlow for Poets controls this with command line flags like ‘flip_left_right’ and ‘random_scale’. This effectively increases the size of your training set, and is standard for most ImageNet-style training pipelines. It can be very useful for helping out transfer learning on smaller sets of images as well. In my experience, distorted copies are not worth quite as much as new original images when it comes to overall accuracy, but if you only have a few images it’s a great way to boost the results, and it will reduce the overall number of images you need.
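As a rough illustration of what these augmentations do, here is a minimal sketch in plain Python. It is not the TensorFlow for Poets implementation. It assumes an image is just a list of rows of grayscale pixel values in the range 0 to 255, and the function names and default values are made up for this example.

```python
import random

def flip_left_right(image):
    """Mirror each row of a 2D image (a list of rows of pixel values)."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def random_brightness(image, max_delta, rng):
    """Shift every pixel by the same random amount, clamped to [0, 255]."""
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

def augment(image, rng, crop_h, crop_w, max_delta=30):
    """Produce one distorted copy: random crop, coin-flip mirror, brightness shift."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        out = flip_left_right(out)
    return random_brightness(out, max_delta, rng)

rng = random.Random(0)
# A synthetic 8x8 gradient standing in for a real photo.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
copies = [augment(image, rng, crop_h=6, crop_w=6) for _ in range(5)]
print(len(copies), len(copies[0]), len(copies[0][0]))
```

Each call to `augment` gives a slightly different copy of the same original, so one source image can feed many training steps. Which distortions are safe depends on the data: a left-right flip is fine for most photos of objects, but not for images where orientation carries meaning, such as text.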

The real answer is to try for yourself, so if you have fewer images than the rule suggests don’t let it stop you, but I hope this rule of thumb will give you a good starting point for planning your approach at least.
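If you want to see how a dataset compares with the rule of thumb before you start, a quick count per class is enough. The sketch below assumes one subfolder per class under a root folder, with the images stored as files inside each subfolder. The function names, the list of file extensions, and the example path are assumptions for illustration, so adjust them to match your own data.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed extensions, extend as needed
RULE_OF_THUMB = 1000  # images per class, from the rule above

def images_per_class(root):
    """Count image files in each class subfolder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts

def shortfall(counts, target=RULE_OF_THUMB):
    """Return how many images each under-target class is missing."""
    return {name: target - n for name, n in counts.items() if n < target}

if __name__ == "__main__":
    root = Path("training_images")  # hypothetical folder name
    if root.is_dir():
        for name, missing in shortfall(images_per_class(root)).items():
            print(f"{name}: {missing} images short of the rule of thumb")
```

A shortfall is not a reason to stop. It is a hint that transfer learning and augmentation are likely to matter more for that class.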

Responses

  1. Can’t you also calculate the number of free parameters in your model, and use that to get an order of magnitude estimate for the minimum you’d need?

    I.e., the top layer has as many nodes as pixels you’re using, then there’s something like an MxN matrix at the interface of each layer (plus shifts). You need enough data to saturate the number of free parameters in your network.

    Then (just like you can’t claim a great fit with linear regression if you only have 2 data points) you need a multiplicative factor on top of that to ensure that you have enough data to sample.

    I don’t have a lot of experience working with large NN, so I’m not sure this intuition helps — does this make sense at all?

  2. That kind of intuition helps for ML techniques other than deep learning. But in deep learning, the guidelines for how many samples you need appear to be different, as deep networks (like convolutional neural networks, or CNNs) are routinely trained with far fewer total samples than the number of weights in the network. Take for example the original AlexNet model. It had on the order of 60 million parameters, while the ImageNet dataset on which it was trained had about 1.3 million training images. That’s roughly a factor of 50 fewer training images than weights that must be learned. Data augmentation does help somewhat by artificially expanding the dataset, but it doesn’t account for all the reasons that we can train such large deep learning models with so few samples.
