The Myopia of Chatbots
Why Your Bot Can’t See the Full Picture—And How to Give It Real Sight

This is Part 1 of The Architecture of Agency, a 5-part series translating Agentic AI jargon into the software architecture paradigms you already know.
Here’s the scene. Your company’s VP of Customer Experience just got back from a conference. She’s fired up. She walks into your stand-up and says: “We’re deploying an AI support bot. I want it live by end of quarter.”
Your team spins up a chatbot. You pick a frontier model, write a friendly system prompt—”You are a helpful customer support agent for Acme Corp”—and wire it up to your website. The demo looks amazing. The bot is polite, responsive, and handles the softball questions with ease.
Then real customers show up.
A customer asks about your refund policy. The bot confidently quotes a 60-day window. Your actual policy is 30 days. A billing customer provides their account number, asks a follow-up question two messages later, and the bot has already forgotten who they are. Another customer asks about a product you discontinued last year. The bot describes it in glowing detail, complete with pricing, as if it’s still on the shelf.
Your VP is furious. Your team is confused. The bot seemed so smart in the demo.
What happened?
What happened is that you deployed a stateless prediction engine and expected it to behave like an informed employee. You gave it no access to your actual policies, no connection to your billing system, and no strategy for managing the information it was trying to juggle. You gave it a handheld magnifying glass and expected it to read the entire employee handbook.
I know something about that particular problem.
The LLM: A Brilliant, Forgetful Stranger
Let’s start at the foundation. When people say “AI” in 2026, they usually mean a Large Language Model—an LLM. GPT-4, Claude, Gemini, Llama. These are the engines underneath every chatbot, copilot, and agent you’ve heard about.
Here’s the single most important thing to understand about an LLM: it is a stateless next-token prediction engine.
That sounds reductive, but it’s the architectural truth that explains almost every failure mode you’ll encounter. An LLM takes in a sequence of tokens—words, fragments of words, punctuation—and predicts what token should come next. Then it predicts the next one. And the next. That’s it. That’s the whole trick.
It doesn’t “know” anything the way you know your home address. It has learned statistical patterns across an enormous corpus of text. When it tells you that water boils at 100°C, it’s not retrieving a fact from a database. It’s producing the most statistically likely continuation of the tokens “water boils at” given everything it absorbed during training.
This is why LLMs can be breathtakingly fluent and confidently wrong at the same time. The prediction engine optimizes for plausibility, not truth. When your support bot quoted a 60-day refund window, it wasn’t lying. It was generating the most plausible-sounding refund policy based on patterns it had seen across thousands of company websites. It had never read your policy. It was guessing—and its guesses sound so polished that nobody thought to check.
The industry calls this hallucination. I have a different word for it.
The Bioptic Lens
I navigate code with 20/150 vision. When I’m using a screen magnifier at 4x zoom, I can see about 15 lines of code at a time. The rest of the file exists—I know it’s there—but I can’t see it. If someone asks me what’s on line 247, I have to scroll there, losing my place in whatever I was reading.
Now imagine I didn’t know the rest of the file existed. Imagine I could only see those 15 lines, and when someone asked about line 247, I just... made something up based on the patterns I’d seen in other codebases. Something plausible. Something confident. Something wrong.
That’s an LLM without context. It’s not stupid. It’s myopic. It can only see what’s directly in front of it, and when it can’t see what it needs, it fills in the gaps with educated guesses. Hallucination isn’t a bug in the algorithm. It’s the inevitable result of asking a system to answer questions about things it cannot see.
This is the exact architectural failure we’re analyzing in this series—a stateless engine squinting at a tiny slice of memory, expected to see the whole picture.
The Chatbot: Giving the Stranger a Name Tag
So you have this powerful, stateless prediction engine. How do you turn it into a customer support bot?
The most common approach is the simplest: you write a system prompt. This is a block of text that gets prepended to every conversation, invisible to the user, that tells the model who it is and how to behave.
```
You are SupportBot, a friendly and professional customer
support agent for Acme Corp. You help customers with
questions about their accounts, billing, and products.
Always be polite and concise.
```

This is what the industry calls a persona. And it works—sort of. The model adopts the tone. It stops talking about topics outside customer support. It says “Thank you for contacting Acme Corp!” with convincing warmth.
But a persona is a costume, not a brain. The model is still the same stateless prediction engine underneath. It still doesn’t know your refund policy, can’t look up a customer’s account, and has no idea what products you actually sell. You’ve given the stranger a name tag and a script, but you haven’t given them access to the employee handbook.
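Mechanically, the costume is nothing more than the first message in every request. Here’s a minimal sketch, assuming the common OpenAI-style chat message format (role names and payload shape vary by provider):

```python
# A minimal sketch of how a persona is wired in: the system prompt is
# prepended to every request, invisible to the user. The message shape
# follows the common OpenAI-style chat format; details vary by provider.

SYSTEM_PROMPT = (
    "You are SupportBot, a friendly and professional customer "
    "support agent for Acme Corp. Always be polite and concise."
)

def build_request(history: list[dict], user_message: str) -> list[dict]:
    """Assemble the full message list sent to the model on every turn."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]  # the costume
        + history                                        # prior turns
        + [{"role": "user", "content": user_message}]    # latest question
    )

messages = build_request([], "What is your refund policy?")
# The model sees the persona and the question—and nothing else.
# No policy document, no account data, no product catalog.
```

Notice what isn’t in that list: anything about Acme Corp that the model didn’t learn in pre-training.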
This is where most enterprise chatbot deployments stall. The gap between “sounds helpful” and “is helpful” turns out to be enormous, and it’s a gap that no amount of prompt tuning will close. You can rewrite that system prompt fifty times—make it longer, more detailed, more emphatic—and you’ll get marginal improvements at best. The fundamental problem isn’t what you’re telling the model to do. It’s what the model can see.
Which brings us to the concept that underpins everything else in this series.
The Context Window: Your Agent’s Working Memory
Every LLM has a context window—a fixed limit on the total number of tokens it can process in a single request. This includes everything: your system prompt, the conversation history, any documents you’ve stuffed in, and the model’s response. Everything the model “knows” about the current interaction has to fit inside this window.
Think of it as working memory. Human working memory holds about seven items before things start falling out. LLMs can hold tens of thousands of tokens—sometimes hundreds of thousands. Claude can handle 200,000 tokens. Gemini advertises a million. GPT-4 Turbo offers 128,000.
Those numbers sound enormous. They are not.
Here’s why. In our support bot scenario, let’s add up what’s competing for space in that window:
System prompt with persona instructions, tone guidelines, and behavioral rules
Company policies you’ve pasted in so the bot stops hallucinating
The current conversation — every message from the customer and every response from the bot
Customer data you’ve injected (account info, order history, past tickets)
The model’s response — yes, the output counts against the window too
A single customer support conversation that touches billing, refunds, and product questions can burn through tokens fast. Paste in your 40-page policy document and you’ve consumed half your context window before the customer says hello.
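You can do this arithmetic yourself. The numbers below are back-of-envelope assumptions (using the common rule of thumb of roughly four characters per English token), not measurements against any particular tokenizer—but the conclusion holds for any realistic values:

```python
# Back-of-envelope context budget for the support bot. All token counts
# are rough illustrative assumptions (~4 characters per token for English
# text), not measurements against any real tokenizer.

CONTEXT_WINDOW = 128_000  # e.g. a GPT-4 Turbo-sized window

budget = {
    "system_prompt":    1_500,   # persona, tone, behavioral rules
    "policy_document":  60_000,  # a 40-page policy doc pasted in verbatim
    "conversation":     8_000,   # twenty-ish turns of back-and-forth
    "customer_data":    3_000,   # account info, order history, tickets
    "response_reserve": 4_000,   # the output counts against the window too
}

used = sum(budget.values())
print(f"{used:,} / {CONTEXT_WINDOW:,} tokens ({used / CONTEXT_WINDOW:.0%})")
```

Well over half the window gone, and the customer hasn’t asked a single hard question yet.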
But the real problem isn’t running out of space. It’s what happens as you approach the limit.
Lost in the Middle
Researchers at Stanford and UC Berkeley discovered something that should terrify anyone building production AI systems: LLMs don’t pay equal attention to everything in their context window.
Information at the beginning and end of the context gets the most attention. Information buried in the middle gets progressively ignored. The researchers call it the “lost in the middle” phenomenon, and it follows a U-shaped curve—high attention at the start, high attention at the end, and a valley of neglect in between.
This isn’t a quirk of one model. It’s a structural property of the transformer architecture that powers every major LLM. And it means that your 200,000-token context window doesn’t behave like 200,000 tokens of perfect memory. It behaves more like a spotlight that illuminates the edges and leaves the center in shadow.
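One practical mitigation falls directly out of the U-shaped curve: put the material the model must not miss at the edges of the context, and let lower-priority material fill the middle. A hypothetical sketch (the priority scores are illustrative, not from any real system):

```python
# Sketch of edge-aware context ordering: the two most important chunks go
# at the start and end, where attention is highest; everything else is
# buried in the middle where attention is weakest anyway.
# Priority scores here are illustrative assumptions.

def order_for_attention(chunks: list[tuple[int, str]]) -> list[str]:
    """chunks: (priority, text) pairs; higher priority = more important."""
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    texts = [text for _, text in ranked]
    if len(texts) < 3:
        return texts
    # Top-ranked chunk first, second-ranked chunk last, rest in the middle.
    return [texts[0]] + texts[2:] + [texts[1]]

ordered = order_for_attention([
    (1, "marketing boilerplate"),
    (9, "refund policy: 30 days"),
    (8, "customer's current question"),
])
# The refund policy lands at the start, the question at the end,
# and the boilerplate sits in the attention valley.
```

This is a blunt instrument, but it illustrates the principle: where information sits in the window matters almost as much as whether it’s there at all.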
The Bioptic Lens
This is exactly what happens when I try to read a massive codebase through my screen magnifier.
I can see the beginning clearly—I just opened the file, it’s fresh, I know where I am. I can see where I currently am—that’s what’s on my screen right now. But everything in between? If I scrolled past it ten minutes ago, it’s gone from my active awareness. I know it exists. I might vaguely remember seeing a function definition somewhere around line 150. But the details? Blurry at best. Invisible at worst.
The way I solve this problem is the same way production AI systems need to solve it: I don’t try to hold everything in my head at once. I use tools. I search. I bookmark important sections. I zoom in when I need detail and zoom out when I need the big picture. I manage my limited field of vision as a resource, not just a constraint.
That discipline has a name now. And it’s arguably the most important concept in this entire series.
Context Engineering: The Discipline That Changes Everything
For the past few years, the industry has been obsessed with prompt engineering—the art of crafting the right question to get the right answer. Write a better prompt, get a better response. Add “think step by step” and watch the quality improve.
Prompt engineering matters. But it’s like arguing about the wording of a question you’re shouting to someone in the next room. It helps at the margin. What helps fundamentally is opening the door and handing them the documents they need.
Context engineering is the practice of designing, managing, and curating everything that enters the model’s context window—not just the prompt, but the system instructions, retrieved documents, conversation history, tool outputs, and any other information the model needs to do its job.
Where prompt engineering asks: “How should I phrase this question?”
Context engineering asks: “What information does this model need to see, in what order, in what format, and what should I deliberately leave out?”
It’s the difference between asking a good question and building a good briefing. And it turns out, it’s the difference between a chatbot that hallucinates and one that doesn’t.
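The shift in mindset is easier to see side by side. Both functions below are hypothetical sketches—names and structure are mine, not any library’s—but they show where the effort goes in each discipline:

```python
# Illustrative contrast between the two disciplines. Function names and
# structure are hypothetical; the point is where the effort is spent.

def prompt_engineering(question: str) -> str:
    # Tweak the wording of the question itself.
    return f"{question}\nThink step by step."

def context_engineering(question: str, sources: dict[str, str],
                        max_chars: int = 2_000) -> str:
    # Decide what the model sees: select only relevant sources, place
    # them ahead of the question, and drop anything that won't fit.
    relevant = [
        text for topic, text in sources.items()
        if topic in question.lower()
    ]
    briefing = "\n".join(relevant)[:max_chars]
    return f"Reference material:\n{briefing}\n\nQuestion: {question}"

sources = {
    "refund": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 5-7 business days.",
}
prompt = context_engineering("What is the refund window?", sources)
# The briefing contains the 30-day policy and omits the irrelevant
# shipping material entirely.
```

Keyword matching on the question is a toy selection strategy, of course—real systems use retrieval, which is exactly where Part 2 picks up.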
Context Rot: When Good Context Goes Bad
Here’s a failure mode that doesn’t show up in demos but cripples production systems: context rot.
As a conversation progresses, the context window fills with old messages, stale data, and outdated tool outputs. Early in a support conversation, the context is clean—just the system prompt and the customer’s first message. By message twenty, you’re carrying the full weight of every exchange, including the customer’s off-topic tangent about their dog, three redundant explanations of the same policy, and a billing lookup that’s now irrelevant because the customer changed their question.
All of that stale context is competing for the model’s attention. Worse, the transformer’s attention mechanism—remember the lost-in-the-middle problem—means the model is disproportionately focused on the beginning (your system prompt) and the end (the most recent message), while the critical details from the middle of the conversation are fading from view.
Manus, the well-known AI agent platform, reported that their agents operate at an average input-to-output token ratio of roughly 100:1. For every token the agent generates, it’s processing a hundred tokens of accumulated context. That ratio gets worse over time, not better.
Context rot is why your support bot works great for simple, two-message interactions and falls apart on complex, multi-turn conversations. The context isn’t just growing—it’s decaying.
Context Compression: Zoom Out, Then Zoom In
The solution isn’t a bigger context window. A million tokens of garbage context doesn’t produce better answers than fifty thousand tokens of garbage context. The solution is context compression—the practice of actively managing what stays in the window and what gets summarized, evicted, or externalized.
The strategies mirror exactly what I do when navigating a large codebase with limited vision:
Summarize aggressively. When I’ve been reading through a complex module, I don’t try to remember every line. I write myself a note: “This module handles authentication via OAuth, entry point is authenticate() on line 42.” That’s compression. In an agent system, you periodically ask the model to summarize the conversation so far, then replace the raw history with the summary. You lose some detail, but you gain coherence.
Prioritize what’s in front of you. I keep my screen magnifier focused on the code I’m actively working with, not the file I read an hour ago. In context engineering, this means injecting fresh, relevant data close to the end of the context (where the model pays the most attention) and pushing older material toward the beginning or out of the window entirely.
Use external memory. I can’t hold the whole codebase in my field of vision, so I use tools—search, bookmarks, the file navigator. Similarly, production AI systems store information in external databases and retrieve it on demand rather than trying to hold everything in the context window. (This is the foundation of RAG, which we’ll dive into in Part 2.)
Evict what’s irrelevant. If I’m debugging a billing issue, I don’t need the authentication module on my screen. In context engineering, this means actively removing conversation turns, tool outputs, and documents that are no longer relevant to the current task. Most teams don’t do this. They treat the context window like an append-only log. It should be treated like a carefully curated briefing that evolves with the conversation.
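The first two strategies can be combined into a single summarize-and-evict loop. This is a minimal sketch: in a real system, `summarize` would be an LLM call, stubbed out here so the structure is visible:

```python
# Sketch of context compression for a multi-turn conversation: once the
# history exceeds a budget, older turns are collapsed into a summary and
# evicted, while the most recent turns stay verbatim near the end of the
# window. `summarize` stands in for what would normally be an LLM call.

def summarize(turns: list[str]) -> str:
    # Stand-in for an LLM summarization call.
    return f"[summary of {len(turns)} earlier turns]"

def compress(history: list[str], keep_recent: int = 4) -> list[str]:
    """Keep the last few turns verbatim; replace the rest with a summary."""
    if len(history) <= keep_recent:
        return history  # still fits: prefer raw data
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent  # compact when it doesn't

history = [f"turn {i}" for i in range(1, 11)]
compressed = compress(history)
# Ten turns collapse into one summary plus the four most recent turns,
# keeping fresh material where the model pays the most attention.
```

You lose detail in the summarized turns—that’s the fidelity trade-off the next paragraph describes—but you keep the conversation coherent instead of letting it rot.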
Anthropic’s own engineering guidance distills this into a hierarchy: prefer raw data when it fits, compact when it doesn’t, summarize only as a last resort. Every level of compression loses fidelity. The art is knowing when that trade-off is worth it.
What We Built (And What Went Wrong)
Let’s return to our support bot and look at the naive architecture we actually shipped:
Customer Ticket → Chat UI → System Prompt + Conversation History → Stateless LLM → Response
That’s it. A single linear path from question to answer, with nothing in between to ground, verify, or manage the flow of information. Every failure we experienced maps directly to this broken pipeline:
No ground truth. The model’s only knowledge base was its pre-training—what it learned in “college”—plus a few paragraphs we pasted into the system prompt. Every answer about Acme Corp’s specific policies was a statistical guess dressed in confident language.
No context strategy. Every conversation was append-only, ensuring that by message twenty, the critical details from message three were lost in the middle. We treated the context window like a bottomless log file. It isn’t one.
No verification. When the model quoted a 60-day refund window, nothing checked that claim against reality before it reached the customer. The system prompt was a costume, not a safety boundary.
Every one of these failures maps to a missing architectural component—components we’ll build over the next four parts of this series. RAG and tools in Part 2. Reasoning chains and guardrails in Part 3. Protocols and memory in Part 4. The right framework to wire it all together in Part 5.
But the foundation—the mental model that makes everything else click—is what we covered today. An LLM is a stateless prediction engine. A chatbot is that engine wearing a costume. A context window is the engine’s limited field of vision. And context engineering is the discipline of managing that field of vision as carefully as an architect manages the flow of a building.
Or, if you prefer my version: the model is myopic, and we need to build it a better pair of glasses.

