
Have you ever wondered why ChatGPT and similar advanced AI systems are known as Large Language Models? What are “language models”, even? To answer that, and understand how remarkable the current state of the art is, I need to jump back a few decades.
Understanding language has always been a goal for artificial intelligence researchers, right from the field’s start in the 1950s, but what might surprise you is that language models were traditionally seen as just one processing step in a larger workflow, not as the catch-all solution they are now. A good way of thinking about a language model’s job is that, given a sequence of words, it predicts which words are most likely to come next. For example, given “Move it over”, it might predict “to”, “there”, or “more” as likely next words. It’s very similar to autocomplete. If you build a speech recognition system that takes in audio and tries to output the text corresponding to the speech, having this kind of prediction can help decide between two words that sound the same. For example, if the speech-to-text model had previously heard “Move it over”, and the next word sounded like “their” or “there”, the information from the language model will tell you that “there” is more likely to be right. You can probably see how language models could be used in similar ways to post-process the results of machine translation or optical character recognition.
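To make the “predict the next word” idea concrete, here’s a minimal sketch of a classic bigram language model in Python. The tiny corpus and the homophone example are mine, purely for illustration, and real speech systems use far more sophisticated models, but the principle of rescoring ambiguous words is the same.

```python
# A toy bigram language model: count which word follows which,
# then turn the counts into next-word probabilities.
from collections import Counter, defaultdict

# Hypothetical mini-corpus, just for illustration.
corpus = [
    "move it over there please",
    "move it over to the left",
    "put it over there",
    "their dog is over there",
]

# Count how often each word follows the previous one.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def next_word_probs(prev_word):
    """Probability of each candidate word given the previous word."""
    counts = follows[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# After "over", the model prefers "there" to "their" or "to" -- exactly the
# kind of signal a speech recognizer can use to pick between homophones.
print(next_word_probs("over"))
# {'there': 0.75, 'to': 0.25}
```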
For many decades, the primary focus of AI research was on symbolic approaches, and language models were seen as a hacky statistical technique that might be useful for cleaning up data, but not a promising avenue towards any kind of general intelligence. They didn’t seem to embody knowledge about the world, they were just predicting strings, so how could they be more than low-level tools? Even now, language models are criticized as “Stochastic Parrots”, mindlessly regurgitating plausible text with no underlying understanding of anything. There’s a whole genre of autofill games that use text prediction on phones to generate surreal sentences, highlighting the uncanny-valley quality of these comparatively primitive language models.
To understand how they have the potential to be more useful, think about the words “The sky is”. As people, we’d guess “blue” or maybe “cloudy” as likely next words, and good enough language models would do the same. If you add in a preceding question, so the full prefix is “What color is the sky? The sky is”, we’d be even more likely to guess “blue”, and so would a model. This is purely because in a large enough collection of writing, a model will have come across enough instances of the question “What color is the sky?” to know that “blue” is a likely answer, but crucially, this means it has acquired some knowledge of the world! This is despite having no eyes to see, and having never been explicitly programmed with what color the sky is. The prompt you give to a modern LLM is essentially just that question placed at the start of the string to kick things off, so even the latest models still work in the same basic fashion.
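You can try this conditioning effect yourself. Here’s a rough sketch using the Hugging Face transformers library with GPT-2 (my choice of model, just because it’s small and freely available, and assuming transformers and torch are installed). The exact numbers will vary, but prepending the question should push “blue” up the ranking.

```python
# Compare next-token probabilities with and without the question prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_words(prefix, k=5):
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probabilities for the token that would come right after the prefix.
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(i).strip(), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

print(top_next_words("The sky is"))
print(top_next_words("What color is the sky? The sky is"))
```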
What has happened since BERT in 2018 is that language models have been trained on larger and larger sets of data, and we’ve discovered that they’re incredibly effective at solving all sorts of problems that were considered challenging, and that were seen as significant stepping stones towards general intelligence. For a lot of us, me included, this was very surprising, and challenged a lot of our beliefs about what makes up intelligence. After all, language models are fundamentally just auto-complete. If intelligence can seem to emerge from repeatedly predicting the next word in a sentence, what does that mean about how we ourselves think? Is this truly a path towards general intelligence, or just a mirage that disappears once we run out of ever-larger sets of data to feed it?
You probably know from your own experience that modern chatbots can handily pass the Turing Test, and act as convincing conversation partners, even joking, detecting sarcasm, and exhibiting other behaviors we usually consider to require intelligence. They clearly work in practice, but as the old joke about French engineers goes, do they work in theory? This is where research is just getting off the ground. Since we have systems that exhibit intelligence, but we don’t understand how or why, the work is more experimental than theory-driven right now, and has far fewer resources available than applied and commercial applications of LLMs, but I think it will reveal some mind-bending results over the next few years. We have something approaching an intelligence that we’ve constructed, so how could we not discover new insights by analyzing these models?
I love that language models are the Cinderellas of the AI world, rising from being humble servants of more popular techniques to solving the hardest problems all on their own. I would never have predicted this myself a few years ago, but it is more evidence that larger datasets solve most problems in machine learning, and I can’t wait to see where they go next!
Update – I talked with Ann Spencer about this topic on my YouTube podcast.