
When I was new to Google Brain, I got involved in a long and heated discussion about evaluation numbers for some models we were using. As we walked out of the room, the most senior researcher told me, “Look, the only metrics that matter are app store ratings. Everything else is just an approximation.”
The Word Lens team, who were acquired around the same time Jetpac was, soon gave me a vivid example of this. Google Translate already had a visual translation feature for signs and menus, and the evaluation scores on test datasets were higher than Word Lens’s model achieved. What surprised the Google product managers was that consumers still preferred the Word Lens app over Google Translate for this use case, despite the lower metrics. It turned out the key difference was latency. With Google Translate you snapped a picture, it was uploaded to the server, and a result was returned in a second or two. Word Lens ran at multiple frames per second. This meant that users got instant on-screen feedback about the results, and would jiggle the camera angle until it locked on to a good translation. Google Translate had a higher chance of providing the right translation for a single still image, but because Word Lens was interactive, users ended up with better results overall. Smart product design allowed them to beat Google’s best models, despite apparently falling short on metrics.
I was thinking of this again today as I prepared a data sheet for a potential customer. They wanted to know the BLEU score for our on-device translation solutions. Calculating it caused me almost physical pain, because while BLEU remains the most common metric for evaluating machine translation, it doesn’t correlate well with human judgments of translation quality. BLEU is a purely textual measure: it counts how many of the n-grams in a model’s output also appear in one or more reference translations prepared as ground truth by fluent speakers of the language. There are a lot of problems with this approach. For example, think of a simple French phrase like “Le lac est très beau en automne”. One translation could be “The lake is very beautiful in the autumn”. Another could be “The lake is very pretty in the fall”. “In the fall, the lake’s very pretty” would also be a fair translation that captures the meaning, and might read better in some contexts. You can probably imagine many more variations, and as the sentences get more complex, the possibilities multiply rapidly. Unless the ground truth in the dataset includes all of them, any output that differs textually from the listed references will be given a low score, even if it conveys the meaning effectively. This means that an absolute BLEU score doesn’t tell you much about how good a model is, and using it to compare different models against each other isn’t a reliable way to tell which one users will be happier with.
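To make this concrete, here’s a rough sketch using NLTK’s BLEU implementation (my choice for illustration, not anything tied to our data sheet) that scores those paraphrases against a single reference. The exact numbers depend on tokenization and smoothing, but both meaning-preserving variants land well below a perfect score.

```python
# Sketch of how a valid paraphrase can score poorly under BLEU.
# Uses NLTK's sentence_bleu; treat the output as illustrative, since the
# exact values depend on tokenization and the smoothing method chosen.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The lake is very beautiful in the autumn".lower().split()
close_paraphrase = "The lake is very pretty in the fall".lower().split()
reordered = "In the fall , the lake 's very pretty".lower().split()

# Smoothing avoids zero scores when a higher-order n-gram has no matches.
smooth = SmoothingFunction().method1

for name, hypothesis in [("close paraphrase", close_paraphrase),
                         ("reordered", reordered)]:
    score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")

# Both hypotheses convey the same meaning as the reference, but each differing
# word breaks several overlapping n-gram matches at once, so the scores drop
# well below 1.0 even though a human would rate the translations as fine.
```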
So why does BLEU still dominate the machine translation field? Model creators need a number that’s straightforward to calculate to optimize towards. If you’re running experiments comparing changes to datasets, optimization techniques, and architectures, you need to be able to quickly tell which seem to be improving the results, and it’s impractical to evaluate all of them by A/B testing with actual users. The only way to iterate quickly and at scale is with metrics you can run in an automated way. While BLEU isn’t great for comparing different models, relative changes do at least tend to track improvements or declines for a single model. If an experiment drops the BLEU score significantly, there’s a good chance users will be less happy with that version of the model than with the original. That makes it a helpful directional signal.
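As a hypothetical sketch of that automated loop (with toy stand-in “models” in place of real checkpoints), the comparison usually boils down to scoring each variant on the same held-out set and reading the delta rather than the absolute number:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy stand-in "models": in a real experiment these would be two checkpoints
# of the same translation model, not hard-coded lookups.
def translate_baseline(src):
    return {"Le lac est très beau en automne": "The lake is pretty in fall"}[src]

def translate_candidate(src):
    return {"Le lac est très beau en automne": "The lake is very beautiful in the autumn"}[src]

# Held-out (source, reference) pairs; a real set would have thousands of sentences.
held_out = [("Le lac est très beau en automne",
             "The lake is very beautiful in the autumn")]

def evaluate(translate_fn):
    """Corpus BLEU for one model variant over the held-out set."""
    hypotheses = [translate_fn(src).lower().split() for src, _ in held_out]
    references = [[ref.lower().split()] for _, ref in held_out]
    return corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method1)

# The absolute numbers mean little on their own; the delta between the two
# runs is the directional signal used to accept or reject the experiment.
print(f"baseline:  {evaluate(translate_baseline):.3f}")
print(f"candidate: {evaluate(translate_candidate):.3f}")
```

In a real pipeline the translate functions would be model checkpoints and the held-out set far larger, but the accept-or-reject decision still hinges on that relative change rather than the score itself.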
This is why people who are actively working on training models are obsessed with benchmarks and metrics. They sound boring to outsiders, and they’re inherently poor approximations of the properties you actually need in your product, but without them it’s impossible to make progress. As George Box said – “All models are wrong, but some are useful”. You can see this clearly with modern LLMs. In general I’m pretty skeptical about the advantages OpenAI and Anthropic gain from their scale, but they have millions of people using their products every day, and with that the data to understand which metrics correlate with customer satisfaction. There are lots of external efforts to benchmark LLMs, but it’s not clear what those benchmarks tell us about how well the models actually work, or which are best.
This is important because a lot of big decisions get made based on benchmarks. Research papers need to show they beat the state of the art on commonly accepted metrics to be published. Companies raise investment funding on the strength of their benchmark results. The outputs of the LLMs we use in our daily lives are shaped by which metrics are used during their training. What the numbers capture, and what they miss, has a direct and growing impact on our world as LLMs are adopted in more and more applications.
That’s a big reason why Natalie and I started the AI Benchmark Club meetup in SF. There are a lot of AI events in the Bay Area, but if you’re actually training models from scratch, it can be hard to find other people facing similar challenges amongst all the business, marketing, and sales discussions that often dominate. The nice thing about benchmarks is that they sound unimportant to everyone except those of us who rely on them to build new models. This works as a great filter to ensure we have a lot of actual researchers and engineers, with talks and discussions on the practical challenges of our job. As Picasso said – “When art critics get together they talk about content, style, trend and meaning, but when painters get together they talk about where can you get the best turpentine”. I think benchmarks are turpentine for ML researchers, and if you agree then come join us at our next meetup!
