Opinion · AI · LLMs
It’s just next token prediction. So why does it feel like reasoning?
TL;DR – LLMs are “just” next-token predictors at the mechanical level, but that framing obscures what the weights actually encode. This post digs into why that explanation feels incomplete, where the genuine capability comes from, and what is still missing before we can call any of this intelligence in the deeper sense.
There’s a moment that happens to most people who push past the surface of how modern AI works. You read the papers, watch the talks, maybe poke around some PyTorch. And then it lands:
It’s just predicting the next token based on learned weights.
And your immediate reaction is something like: wait, that’s it?
No symbolic reasoning engine. No internal world simulation. No logic layer quietly verifying things. Just a neural network outputting a probability distribution over possible next tokens, sampling from it, and repeating. That’s the complete mechanical description.
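That mechanical description is short enough to sketch in code. The snippet below is a toy, not any real model’s implementation: a hand-written bigram logit table stands in for the network, and the vocabulary and numbers are invented for illustration. But the cycle – score, normalise, sample, append, repeat – is structurally the loop every autoregressive LLM runs.

```python
import math
import random

# Toy stand-in for a language model: a hand-written bigram logit table
# over a four-word vocabulary. A real LLM replaces this lookup with a
# neural network forward pass; the decoding loop below is unchanged.
VOCAB = ["the", "cat", "sat", "<eos>"]
LOGITS = {
    "<bos>": [2.0, 0.5, 0.1, 0.1],
    "the":   [0.1, 2.5, 0.5, 0.1],
    "cat":   [0.1, 0.1, 2.5, 0.5],
    "sat":   [0.5, 0.1, 0.1, 2.5],
}

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    prev, out = "<bos>", []
    for _ in range(max_tokens):
        probs = softmax(LOGITS[prev])                 # distribution over next tokens
        token = rng.choices(VOCAB, weights=probs)[0]  # sample from it
        if token == "<eos>":
            break
        out.append(token)
        prev = token                                  # feed the choice back in; repeat
    return " ".join(out)
```

That really is the whole control flow. Everything interesting lives in what replaces the lookup table.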
So why does it write working code? Why does it explain architectural tradeoffs it’s never been explicitly asked about before? Why does it feel, in certain moments, like something that actually understands what you’re trying to do?
That tension – between the mechanism and the behaviour – is what I want to unpick here.
The reductive description is technically correct, but analytically useless
Here’s the thing: saying “it’s just next-token prediction” is a bit like saying chess is “just moving pieces on a board according to fixed rules.” True at the lowest level of description. Completely fails to explain anything interesting about the behaviour you actually observe.
Two systems can share the same mechanical description – both doing next-token prediction – and produce wildly different outcomes. A basic phone keyboard autocomplete predicts the next token. So does GPT-4o. The mechanism is nominally the same. The outputs are not remotely comparable.
What changes between those two systems isn’t the label we put on the mechanism. It’s the scale and structure of what the weights encode. And that’s where the interesting question actually lives.
What the weights actually learn
This is the part that tends to get glossed over. The weights in a large model aren’t just storing word co-occurrence frequencies. They’re encoding something much richer – patterns in how problems are framed, how constraints interact, how explanations are structured, how solutions tend to be organised across a huge variety of domains.
None of this is explicit. There’s no “design patterns” module sitting inside the network. What happens during training is that the model is forced, over billions of gradient updates, to develop internal representations that are predictive of good continuations across wildly diverse contexts. That pressure – predict the next token well, at scale, across everything – ends up shaping the weights into something that looks a lot like compressed domain knowledge.
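The “pressure” here is concrete: minimise cross-entropy on the token that actually came next. As a sketch of what one gradient update does at toy scale (pure Python, no framework, a four-token vocabulary invented for illustration) – the gradient of softmax cross-entropy with respect to the logits is simply `probs - one_hot(target)`:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(logits, target_idx, lr=0.5):
    """One gradient step of softmax cross-entropy on a single example.
    d(loss)/d(logit_i) = p_i - 1[i == target], so each step pushes
    probability mass toward the token that actually came next."""
    probs = softmax(logits)
    return [x - lr * (p - (1.0 if i == target_idx else 0.0))
            for i, (x, p) in enumerate(zip(logits, probs))]

# Start indifferent; repeatedly observe that token 2 follows this context.
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(50):
    logits = sgd_step(logits, target_idx=2)
# The distribution now concentrates on token 2.
```

Scale the vocabulary to tens of thousands of tokens, the context to thousands, and the lookup to a deep network, and this same objective, applied billions of times, is what carves those richer structures into the weights.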
So when you ask the model to solve a problem it hasn’t seen before, what you’re actually asking it to do is find a continuation that fits the pattern of how problems like this tend to be resolved. The query, your implied intent, the structure of the problem, the prior steps in the reasoning – all of it conditions the output distribution simultaneously.
That’s not word frequency. That’s something considerably stranger.
Why this isn’t search, and why that matters
A common mental model people reach for is that the model must be doing something like a chess engine – exploring possible answer paths, evaluating them, picking the best one. It isn’t. There’s no tree search. No rollout. No branching and pruning.
The model makes a single forward pass and produces a distribution. That’s it. The “reasoning” – if we’re going to use that word – happens implicitly in the structure of how the weights route activations through the network, not through any explicit deliberation process you could point to.
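For contrast with the chess-engine picture, a complete decoding step is just this. One set of logits in, one token out; nothing is explored, evaluated, or backtracked. The function name and temperature parameter are mine, but this is the standard sampling recipe:

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """One decoding step: a single distribution, a single draw.
    No tree search, no rollout, no pruning - the choice is made
    here and never revisited."""
    if temperature == 0:
        # Greedy decoding: take the single most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]
```

Temperature only reshapes the one distribution – flatter above 1, sharper below it – and even greedy decoding is just taking the argmax of that same single forward pass.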
This is actually closer to how skilled human intuition works than to how a chess engine works. An experienced architect doesn’t enumerate all possible building designs before settling on one. They’ve internalised enough patterns that plausible good solutions surface immediately. LLMs do something structurally similar – they jump to a path that usually works for this class of problem, based on what they’ve seen work before.
The obvious failure mode of that approach is also the same: it works until it doesn’t, and the failure, when it comes, can look deceptively confident.
Recombination, not recall
One of the things that surprises people most is that models can produce useful outputs on novel problems – problems where it’s implausible they’ve seen the exact answer in training data. That ability often gets attributed to memorisation, which misses what’s actually happening.
The more accurate description is recombination. The model has learned structural patterns – dynamic programming applies here, this constraint satisfaction approach tends to work for that class of problem, this explanation structure fits this domain – and it applies them to new configurations it hasn’t seen explicitly.
Is that reasoning? Philosophically, I think it’s genuinely unclear. It doesn’t fit the classical definition – there’s no formal inference, no symbolic manipulation, no verifiable proof trace. But it’s also not simple recall. It’s something in between, and we don’t have great vocabulary for it yet. “Pattern-driven generalisation at scale” gets closest, but it’s a mouthful.
The description problem
Here’s what I think is the most important and least discussed limitation. The model knows how the world is described. It does not know how the world behaves.
This sounds like a subtle distinction but it has enormous practical consequences. If you ask the model what happens when you drop a glass on a tile floor, it’ll give you a correct answer. But it got there by learning the statistical patterns in text about glass-dropping, not by simulating material stress, modelling fracture propagation, or running any physics internally.
Most of the time, descriptions of reality are close enough to reality that this distinction doesn’t matter. But in the edge cases – novel physical scenarios, complex multi-step causal chains, situations where the training distribution is thin – the gap opens up. The model confidently follows the description it has, which may not track what would actually happen.
This is part of why hallucination is so hard to eliminate. It’s not a bug in the sense of a fixable implementation error. It’s a structural property of a system that learned descriptions rather than mechanisms.
What’s genuinely missing
If you strip it back to first principles, the gaps between current LLMs and something we’d comfortably call general intelligence are pretty specific:
Reliable self-verification. The model can produce a wrong answer and a convincing explanation of why it’s right, simultaneously, without any internal flag being raised. It can’t distinguish between “I know this” and “this is the kind of thing that sounds right.”
Grounded world models. Not descriptions of how the world behaves – actual internal representations that can be used to simulate novel scenarios and predict outcomes that weren’t in training data.
Persistent learning from interaction. The model doesn’t update its weights from your conversation. Every session starts from the same frozen snapshot. There’s no feedback loop between what it observes in use and what it knows.
Goal-directed planning. Not just producing outputs that look like plans, but actually modelling future states, evaluating them against a goal, and selecting actions based on that evaluation. What we currently have is a very good approximation of what planning looks like in text.
These aren’t minor gaps. They’re architecturally significant. More compute doesn’t obviously close them. Better training data doesn’t obviously close them. They probably require different approaches, not just more of the same.
So is it intelligent?
Depends entirely on what you mean, and I think it’s worth being precise rather than reaching for a comfortable yes or no.
Functionally? In the sense of solving problems, adapting to novel inputs, producing outputs that are genuinely useful across a huge range of tasks? Yes. Unambiguously. It does things that, if a human did them, we’d call intelligent without hesitation.
In the deeper sense – grounded understanding, genuine reasoning, awareness of its own epistemic state? No. Not yet, and possibly not by any amount of scaling the current paradigm.
The honest framing is probably this:
Next-token prediction, scaled to the point where it starts to approximate reasoning – without yet being reasoning.
That’s not a dismissal. The approximation is remarkably good. It’s good enough to be practically transformative. But understanding the gap between approximation and the real thing matters, especially if you’re building systems on top of it and need to know where they’ll fail.
The question that actually matters
The interesting question isn’t “is it just weights?” Of course it’s weights – that’s true of your brain too, if you squint at it from the right angle.
The interesting question is: how far can pattern-based systems go before they hit a hard wall?
Because right now, they’re going much further than the mechanism description would suggest. The gap between “next-token prediction” and the behaviour you observe is doing a lot of work that we don’t yet fully understand.
Whether the next frontier is better world models, explicit reasoning traces, hybrid architectures, or something we haven’t named yet – that’s where the real research question sits. Not in arguing about whether it’s “really” intelligent. That argument generates more heat than light.
What matters is understanding precisely where the current approach breaks down, because that’s where the next architectural leap will need to start.
