Today I got an email with a question I’ve heard many times – “How many images do I need to train my classifier?”. In the early days I would reply with the most technically correct, but also least useful, answer of “it depends”, but over the last couple of years I’ve realized that having even a very approximate rule of thumb is useful, so here it is for posterity:
You need 1,000 representative images for each class.
Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for.
The 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with a bit fewer than 1,000 images (most of the ones I looked at had around seven or eight hundred). That was good enough to train the early generations of image classifiers like AlexNet, which shows that around 1,000 images per class is enough.
Can you get away with fewer though? In my experience you sometimes can, but once you get into the low hundreds it seems to get much trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because the network has already seen a lot of images and learned to distinguish between classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples.
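To make that concrete, here’s a minimal sketch of what transfer learning can look like with tf.keras and a pretrained MobileNetV2. The “my_images/” folder, the image size, and the class count are all placeholder assumptions you’d swap for your own data, and the details (optimizer, epochs, which base model) are just illustrative choices rather than a recipe from this post.

```python
import tensorflow as tf

# Load a base network pretrained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # Freeze the pretrained weights; only the new head is trained.

NUM_CLASSES = 5  # Placeholder: however many categories you want to recognize.

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Scale pixels to [-1, 1] for MobileNetV2.
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# "my_images/" is a hypothetical folder with one sub-folder per class;
# for same-domain transfer, even a few dozen images per class can be enough.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_images/", image_size=(224, 224), batch_size=32)

model.fit(train_ds, epochs=10)
```

Because only the small classification head is being trained, a handful of examples per class can be enough to get useful results, which is exactly why the 1,000-image rule relaxes so much in this case.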
What does “in the same domain” mean? It’s a lot easier to teach a network that’s been trained on photos of real-world objects (like ImageNet) to recognize other objects, but taking that same network and asking it to categorize completely different types of images like x-rays, faces, or satellite photos is likely to be less successful, and will at least require a lot more training images.
Another key point is the “representative” modifier in my rule of thumb. That’s there because the quality of the images matters, not just the quantity. What’s crucial is that the training images are as close as possible to the inputs the model will see when it’s deployed. When I first tried to run a model trained with ImageNet on a robot I didn’t see great results, and it turned out that was because the robot’s camera had a lot of fisheye distortion and the objects weren’t well-framed in the viewfinder. ImageNet consists of photos taken from the web, so they’re usually well-framed and without much distortion. Once I retrained my network with images that were taken by the robot itself, the results got a lot better. The same applies to almost any application: a smaller number of training images taken in the same environment the model will run in will produce better end results than a larger number of less representative images.
Andreas just reminded me that augmentations are important too. You can augment the training data by randomly cropping, rotating, brightening, or warping the original images. TensorFlow for Poets controls this with command line flags like ‘flip_left_right’ and ‘random_scale’. This effectively increases the size of your training set, and is standard for most ImageNet-style training pipelines. It can also be very useful for helping out transfer learning on smaller sets of images. In my experience, distorted copies aren’t worth quite as much as new original images when it comes to overall accuracy, but if you only have a few images they’re a great way to boost the results and will reduce the overall number you need.
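If you’re working in Keras rather than the TensorFlow for Poets script, here’s a minimal sketch of the same idea using tf.keras preprocessing layers. The specific layers and ranges below are illustrative assumptions, not settings from this post, and “my_images/” is the same hypothetical folder as above.

```python
import tensorflow as tf

# A small augmentation pipeline: each epoch sees a slightly different
# flipped, rotated, zoomed, shifted, or contrast-adjusted copy of every image.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),          # up to ~36 degrees either way
    tf.keras.layers.RandomZoom(0.2),              # zoom in or out by up to 20%
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift by up to 10% of height/width
    tf.keras.layers.RandomContrast(0.2),          # rough stand-in for brightness jitter
])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_images/", image_size=(224, 224), batch_size=32)

# Apply augmentation only to the training split; validation images stay untouched.
train_ds = train_ds.map(
    lambda images, labels: (augment(images, training=True), labels))
```

These random layers are only active when called with training=True, so the same pipeline can be reused at inference time without distorting the inputs.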
The real answer is to try it for yourself, so don’t let the rule stop you if you have fewer images than it suggests, but I hope this rule of thumb at least gives you a good starting point for planning your approach.