Context Windows Are No Longer the Bottleneck: The Next Battle After 128K
Context windows are finally getting interesting.
Last month, Anthropic pushed Claude 4.6’s context window to 200K tokens, and Google immediately announced that Gemini 2.5 Pro supports 1M tokens. What does that mean? You can load a 300-page book into the prompt and have the AI answer questions about it.
But my personal feeling is: context length isn’t the endgame. The real battleground lies ahead.
Why do I say that?
Think about it. Context windows have grown from 4K → 32K → 128K → 200K → 1M. That’s a steep curve. But the cost curve is climbing just as fast.
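Take Gemini 2.5 Pro: 1M tokens of input costs $7, and 1M tokens of output costs $21. Fill the window with a long document and every request costs several dollars in input alone; since the whole context is resent with each question, a session of follow-ups quickly runs to dozens of dollars. That’s not something average users can afford.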
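To make the arithmetic concrete, here is a small sketch using the prices quoted above. The 500-token answer length and the ten-question session are assumptions for illustration, and the sketch assumes the full context is resent with every question, with no caching discount.

```python
# Per-1M-token prices as quoted in the text (not verified against a price list).
INPUT_USD_PER_M = 7.0
OUTPUT_USD_PER_M = 21.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at flat per-token pricing."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def session_cost(context_tokens: int, questions: int, answer_tokens: int = 500) -> float:
    """Cost of asking several questions when the full context is resent each time."""
    return questions * request_cost(context_tokens, answer_tokens)


full_window = request_cost(1_000_000, 0)
ten_questions = session_cost(1_000_000, questions=10)
print(f"one full-window request: ${full_window:.2f}")
print(f"ten questions over the same context: ${ten_questions:.2f}")
```

The point is less the exact numbers than the shape: the context is paid for again on every turn, so the bill grows with the number of questions, not just the size of the document.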
So now LLM vendors are all working on something: memory compression.
What does that mean? Instead of stuffing all historical conversations into the context window, you extract important information and compress it into a “memory summary.”
For example, if you’ve had 100 rounds of conversation with an AI, totaling 50K tokens, the system won’t stuff all 50K into the next round. Instead:
- Extract key information (user preferences, historical decisions, important context)
- Compress into a 2K token summary
- Send only this summary plus the most recent few rounds with each new request
This brings costs down: each request carries a few thousand tokens instead of the full 50K.
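The steps above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: `summarize` is a hypothetical placeholder where a real system would call a model, `count_tokens` is a crude word count standing in for a tokenizer, and the budgets are made-up numbers.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(old_summary: str, turns: list[str], budget: int) -> str:
    """Hypothetical placeholder for an LLM summarization call.

    It simply keeps the last `budget` words; a real summarizer would
    extract preferences, decisions, and other key facts instead.
    """
    words = (old_summary + " " + " ".join(turns)).split()
    return " ".join(words[-budget:])


@dataclass
class CompressedMemory:
    keep_recent: int = 4       # rounds kept verbatim (must be >= 1)
    summary_budget: int = 50   # maximum tokens in the summary
    summary: str = ""
    recent: list[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        overflow = self.recent[:-self.keep_recent]
        if overflow:
            # Fold rounds that fall out of the recent window into the summary.
            self.summary = summarize(self.summary, overflow, self.summary_budget)
            self.recent = self.recent[-self.keep_recent:]

    def build_context(self) -> str:
        """What actually gets sent: the summary plus the recent rounds."""
        return "\n".join([f"[summary] {self.summary}"] + self.recent)


memory = CompressedMemory()
history = [f"round {i}: " + "word " * 20 for i in range(100)]
for turn in history:
    memory.add_turn(turn)

full = sum(count_tokens(t) for t in history)
sent = count_tokens(memory.build_context())
print(f"full history: {full} tokens, sent per request: {sent} tokens")
```

The size of what gets sent stays bounded no matter how long the conversation runs; whether the summary kept the right things is exactly the hard part discussed next.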
But here’s the technical challenge: How do you ensure the compressed “memory summary” doesn’t lose critical information?
Current mainstream approaches include:
- Vector retrieval: Convert historical conversations to vectors, only retrieve relevant segments
- Key information extraction: Use small models to filter first, extract important content
- Layered memory: Short-term (recent rounds) + long-term (compressed summary) + knowledge base (external)
None of these techniques is new, but combined, they work well.
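As an illustration of the vector retrieval idea, here is a toy sketch. The bag-of-words `embed` function is an assumption made to keep the example self-contained; a real system would use a learned embedding model and a vector index, but the ranking logic has the same shape.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, segments: list[str], k: int = 2) -> list[str]:
    """Return the k history segments most similar to the query."""
    q = embed(query)
    ranked = sorted(segments, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]


history = [
    "User prefers answers in Python, not Java.",
    "We decided to deploy the service on Kubernetes.",
    "User asked about the weather in Berlin.",
    "The database migration is scheduled for Friday.",
]
top = retrieve("When is the database migration?", history, k=1)
print(top)
```

Only the retrieved segments go into the prompt, so the cost depends on `k` rather than on how much history exists.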
For example, OpenAI recently open-sourced its Agents SDK, which includes a “memory manager” component. It does exactly this: it automatically compresses historical conversations to prevent context window overflow.
Honestly, I think this direction is much more valuable than “competing on context length.” Because what users really need isn’t “how many characters can fit,” but “how much key information can be remembered.”
A 200K context window stuffed with irrelevant content is useless. Conversely, a 32K window with good memory management might outperform 200K.
That comes down to technical depth, not raw window size.
There’s another direction: Retrieval-Augmented Generation (RAG).
The idea behind RAG is: don’t stuff everything into context; retrieve on demand. When you ask a question, the system first retrieves relevant documents from a knowledge base, then puts only those results into the context.
This approach is hot right now, but the problem is: How do you guarantee retrieval quality? If retrieved documents aren’t relevant, they’ll interfere with the model’s judgment.
So some companies are doing “intelligent retrieval”: not simple keyword matching, but understanding user intent before retrieving.
This requires small models. First, a small model (say 7B parameters) interprets the user’s intent and generates a retrieval strategy; then the system retrieves according to that strategy.
This “large-small model collaboration” pattern, I think, is the future direction.
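A rough sketch of that two-stage flow is below. Everything here is hypothetical: `small_model_plan` uses hand-written rules where a real system would call a small model, and `RetrievalPlan`, the intent labels, and the `top_k` values are illustrative names and numbers, not any product's API.

```python
from dataclasses import dataclass

STOP_WORDS = {"how", "do", "i", "the", "a", "is", "to", "what"}


@dataclass
class RetrievalPlan:
    intent: str
    search_terms: list[str]
    top_k: int


def small_model_plan(question: str) -> RetrievalPlan:
    """Stand-in for the small model: rule-based intent detection."""
    q = question.lower()
    if "compare" in q or " vs " in q:
        intent, top_k = "comparison", 4   # comparisons need more sources
    elif q.startswith(("how do", "how to")):
        intent, top_k = "how_to", 2
    else:
        intent, top_k = "lookup", 1
    words = (w.strip("?.,!") for w in q.split())
    terms = [w for w in words if w and w not in STOP_WORDS]
    return RetrievalPlan(intent, terms, top_k)


def retrieve(plan: RetrievalPlan, docs: list[str]) -> list[str]:
    """Score each document by how many search terms it contains."""
    scored = [(sum(t in d.lower() for t in plan.search_terms), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0][: plan.top_k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Small model plans, retriever fetches, large model gets a compact prompt."""
    plan = small_model_plan(question)
    context = "\n".join(f"- {d}" for d in retrieve(plan, docs))
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    "Speculative decoding uses a draft model.",
    "KV cache stores attention keys and values.",
    "Dynamic batching groups requests.",
]
question = "How do I reuse the KV cache?"
plan = small_model_plan(question)
prompt = build_prompt(question, docs)
print(prompt)
```

The large model only ever sees the output of `build_prompt`, so the expensive call stays small while the cheap model decides what is worth fetching.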
One final point: cost optimization.
The larger the context window, the higher the inference cost. At per-token pricing this is linear: 1M tokens of input costs 10x what 100K costs.
But users’ willingness to pay isn’t linear. You can’t charge users 10x for 1M tokens.
So now LLM vendors are finding ways to reduce inference costs. For example:
- KV cache reuse: Cache the attention keys and values already computed for earlier tokens, so they are not recomputed
- Speculative decoding: A small model drafts several tokens first, and the large model verifies them
- Dynamic batching: Combine multiple requests, improve GPU utilization
These techniques seem low-level, but their impact on user experience is direct: when costs come down, prices come down, and when prices come down, users can afford it.
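Of the three, speculative decoding is the easiest to show in a few lines. The sketch below is the simple greedy variant with toy stand-in "models" (plain functions over a fixed sentence, purely an assumption for illustration); production systems verify all drafted positions in one batched forward pass and use a sampling-aware acceptance rule.

```python
from typing import Callable

Model = Callable[[list[str]], str]  # maps a token prefix to the next token

TEXT = "the quick brown fox jumps over the lazy dog".split()


def target(prefix: list[str]) -> str:
    """Toy 'large model': continues TEXT cyclically based on position."""
    return TEXT[len(prefix) % len(TEXT)]


def draft(prefix: list[str]) -> str:
    """Toy 'small model': agrees with the target except at every 5th position."""
    return "???" if len(prefix) % 5 == 4 else TEXT[len(prefix) % len(TEXT)]


def speculative_decode(target: Model, draft: Model, prompt: list[str],
                       max_new: int, k: int = 4) -> tuple[list[str], int]:
    """Greedy speculative decoding; returns (new tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: list[str] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal; counted as one pass because a
        #    real model would score all k positions in a single forward pass.
        passes += 1
        accepted: list[str] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            accepted.append(target(out + accepted))  # all matched: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], passes


prompt = ["the"]
tokens, passes = speculative_decode(target, draft, prompt, max_new=8)

# Baseline: the target model alone, one token per pass.
baseline, ctx = [], list(prompt)
for _ in range(8):
    ctx.append(target(ctx))
    baseline.append(ctx[-1])

print(tokens == baseline, passes)
```

The output is identical to what the large model would have produced on its own; the saving comes from needing fewer large-model passes whenever the draft guesses well.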
So I think the context length war is basically over (1M tokens is basically enough). The next war is memory management and cost optimization.
This direction is worth watching.