AI in Production: The Hard Part Starts After Your Pilot Works

The chatbot performs well in demos. The pilot numbers are green. The team is satisfied and the board deck is ready to go.

Then production arrives and the problems nobody put on the roadmap start showing up: latency spikes under concurrent load, hallucinations that were marginal in testing appearing daily, inference costs doubling the initial estimate, and a support team receiving tickets about "weird assistant responses." The pilot was perfect. Production is a different animal entirely.

This pattern repeats with concerning regularity. And the root cause is almost never deeply technical: it's not that the model fails, it's that the system design assumed a context that real production doesn't respect. Three tension vectors tend to break first:

The gap between model behavior in controlled evaluation and its behavior with real users, unexpected prompts, and messy data.
The retrieval and integration architecture that worked fine with a small dataset and starts limping at scale.
The absence of an observability system designed specifically for AI components — traditional logs simply aren't enough.

The Pilot Trap: Why Test Success Is Misleading

When a team evaluates an AI system, they tend to use questions they know, data they control, and users who are either the team itself or a small panel of motivated beta testers. That's not a test environment — it's a rehearsal with the script in hand.

Real production introduces three variables that pilots systematically underestimate. First: input diversity. Real users don't phrase questions the way the product team did; they write inconsistently, switch languages mid-sentence, ask things outside the system's domain, and still expect a useful answer. Second: concurrency. A model that responds in 800ms with one user might take four seconds under a hundred simultaneous requests — and from the user's perspective, that turns a functional app into one that "feels slow" even if the model technically works. Third: context drift. The data the system was built on ages. The company launches new products, changes policies, reorganizes its catalog — and the model keeps answering with six-month-old information.
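The concurrency effect can be sanity-checked with a back-of-envelope queue model. The sketch below is a toy under stated assumptions, not a load test: it assumes a fixed 0.8 s service time, a hypothetical backend that can process 20 requests at once, and requests served first come, first served. The function names are illustrative.

```python
import heapq
import math


def simulate_latencies(arrival_times_s: list[float], service_time_s: float, workers: int) -> list[float]:
    """FIFO queue with a fixed number of workers; returns per-request latency in seconds."""
    free_at = [0.0] * workers  # min-heap: the time at which each worker becomes free
    heapq.heapify(free_at)
    latencies = []
    for arrival in sorted(arrival_times_s):
        # The request starts when it has arrived AND the earliest worker is free.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service_time_s
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, enough for a rough read of the tail."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Assumption: 100 requests arrive at the same instant, 20 can run in parallel.
latencies = simulate_latencies([0.0] * 100, service_time_s=0.8, workers=20)
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
```

Under these assumptions the requests are served in five waves, so the last wave waits roughly five times the single-user latency. Real traffic is burstier and service times vary, which is why the model is only a first estimate.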

A successful pilot proves the model can do the task. It doesn't prove the system can sustain it.

We've seen this across very different sectors. Take a customer support assistant that resolved 87% of queries without human escalation in testing. After two weeks in production, that figure dropped to 61%. The model hadn't degraded: real users arrived with contexts the pilot hadn't anticipated, and the system had no mechanism to detect when it was operating outside its competence zone. Without that signal, the model kept responding with the same apparent confidence, and it was simply wrong.

Architecture: Where the Seemingly Solid Breaks

The retrieval architecture is the component that most frequently underlies enterprise AI system failures. Not the model itself, but the scaffolding around it: how data is indexed, how prompts are constructed, how context is managed across conversation turns, how queries are routed when the system draws from multiple information sources.

We've written about the illusion that RAG alone solves the retrieval problem: choosing between dense, sparse, or hybrid retrieval isn't a neutral technical decision — it depends on query type, document volume, and how much lexical precision you need. In production this becomes obvious because the use cases the pilot didn't cover — exactly the hardest ones — start arriving frequently.
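To make the hybrid option concrete, here is a minimal sketch of one common way to merge a sparse and a dense result list: reciprocal rank fusion. It is one technique among several, not a recommendation for every workload. The document ids are invented, and k=60 is a widely used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from two retrievers.
sparse_hits = ["policy-returns", "faq-shipping", "catalog-2024"]
dense_hits = ["faq-refunds", "policy-returns", "faq-shipping"]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still survives in the list. Whether that trade-off suits a given system depends on the query mix described above.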

The accumulating context problem

In conversational systems, managing context across turns is one of the quietest failure points. Each conversation turn adds tokens to the prompt. In long conversations, inference costs climb with every turn, and response quality can degrade as the model "loses the thread" once the context window fills up. The solution isn't to increase the context window indefinitely, which amounts to paying a luxury tax for using raw capacity as a substitute for system design. It is to build a compression or summarization mechanism that preserves the relevant information without dragging along the full history.
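One way to sketch such a mechanism is a rolling summary: keep the most recent turns verbatim and fold older ones into a running summary. Everything below is illustrative. The whitespace token count is a crude stand-in for a real tokenizer, and naive_summarize is a placeholder where a real system would likely call a small model.

```python
from dataclasses import dataclass, field
from typing import Callable


def rough_token_count(text: str) -> int:
    # Crude whitespace proxy; a real system would use the model's tokenizer.
    return len(text.split())


@dataclass
class ConversationMemory:
    summarize: Callable[[str, list[str]], str]  # hypothetical summarizer hook
    max_recent_tokens: int = 200
    summary: str = ""
    recent: list[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        evicted: list[str] = []
        # Evict the oldest turns until the recent window fits the budget,
        # always keeping at least the latest turn verbatim.
        while (len(self.recent) > 1
               and sum(map(rough_token_count, self.recent)) > self.max_recent_tokens):
            evicted.append(self.recent.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def build_context(self) -> str:
        parts = [f"Summary of earlier turns: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.recent)


def naive_summarize(previous: str, evicted: list[str]) -> str:
    # Placeholder only: keeps the first 8 words of each evicted turn.
    clipped = [" ".join(t.split()[:8]) for t in evicted]
    return " | ".join(([previous] if previous else []) + clipped)
```

The prompt size now stays roughly bounded by the budget plus the summary length, instead of growing with every turn. The quality of the result depends entirely on the summarizer, which is the part this sketch leaves open.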

The static indexing trap

Many enterprise AI systems deploy with static indexing: documents are processed once, embeddings are generated and stored, and that is the end of it. It works fine in the pilot because the data is fresh and the team knows it. In production, six months later, that knowledge base is outdated and nobody has established an incremental reindexing process. The model keeps retrieving the old returns policy, last quarter's product catalog, the contract version that was already superseded. The model isn't lying. It's answering with what it has. The problem is what it has.
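An incremental reindexing process can start very small. The sketch below assumes each document has a stable id and that the hash of its text was stored when it was last indexed. It only plans the work; re-embedding and writing to the index are left out. plan_reindex and its return shape are hypothetical names for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current documents with the hashes stored at last indexing time.

    documents:      doc id -> current text in the source of truth
    indexed_hashes: doc id -> content hash recorded when the doc was last embedded
    """
    # New or changed documents need to be re-embedded and upserted.
    to_upsert = [doc_id for doc_id, text in documents.items()
                 if indexed_hashes.get(doc_id) != content_hash(text)]
    # Documents that disappeared from the source should leave the index too.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return {"upsert": sorted(to_upsert), "delete": sorted(to_delete)}
```

Run on a schedule, a plan like this makes staleness visible: an empty plan after a known policy change is itself a warning that the source of truth is not wired in.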

Observability: The Component That Always Arrives Late

In traditional software systems, observability is reasonably mature: logs, latency metrics, distributed traces, alerts on HTTP errors. That's enough to know if something breaks. In AI systems, those signals tell you if the system is down — not if it's working well.

An AI system can be technically operational — normal latency, no 500 errors, 100% response rate — while producing incorrect, partially hallucinated, or simply useless answers for the query at hand. Normal logs don't capture this. You need a different level of observability.

This means instrumenting the system to capture, at minimum: the full question-answer pair, the source or sources from which information was retrieved, a confidence signal when the model produces one, and user feedback when available (explicit or implicit via subsequent behavior). Without that, you're managing the system blind. You might feel it's working because nobody complains openly — but the hidden cost is that users simply stop using it.
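As a starting point, the minimum record described above fits in a few lines. This is a sketch, not a full tracing setup: the field names are assumptions, interaction_id and timestamp are additions for joining feedback to answers later, and the sink can be any writable text stream, including a flat file.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional, TextIO


@dataclass
class InteractionRecord:
    question: str
    answer: str
    sources: list[str]                 # ids of the retrieved documents used
    confidence: Optional[float] = None  # only when the system produces one
    feedback: Optional[str] = None      # explicit or implicit user signal
    # Assumed extras, not listed in the text: useful for joining and ordering.
    interaction_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def log_interaction(record: InteractionRecord, sink: TextIO) -> None:
    # One JSON object per line keeps the log appendable and easy to analyze.
    sink.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
```

In use, the pipeline would call log_interaction once per answer, for example with a file opened in append mode. Feedback that arrives later can be logged as a separate line carrying the same interaction_id.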

AI observability doesn't ask whether the system responds. It asks whether the system is right.

There's also an organizational problem: observability tends to arrive late because it isn't prioritized during the pilot phase. The team is focused on making the model work. Instrumentation gets deferred to "when we're in production." But when you reach production without that infrastructure, adding it is more expensive, and you have weeks or months of production data you can't analyze retroactively. It's the same mistake as deferring tests: always seems reasonable in the moment, always hurts later.

Scaling: The Costs Nobody Modeled

The cost model for an AI system in production rarely matches the pilot estimate. Not because teams are careless, but because the variables governing real cost are hard to estimate without production data: average tokens per query, conversation length distribution, retry rates, the percentage of queries requiring multiple model calls due to agent logic or reranking.

An architecture that uses a frontier model for every operation — intent classification, query reformulation, retrieval, generation, and response validation — may be making five or six model calls per user interaction. With a large model, that's a per-interaction cost that can make the product's unit economics unviable at scale.
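The arithmetic is worth writing down explicitly. In the sketch below every number is a placeholder: the token counts and the prices per million tokens are invented for illustration and should be replaced with measured production values and your provider's actual rates. The retry model is also a simplification that treats a retry as one full repeat of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    name: str
    input_tokens: int
    output_tokens: int
    usd_per_million_input: float   # placeholder price, not a real quote
    usd_per_million_output: float  # placeholder price, not a real quote

    def cost_usd(self) -> float:
        return (self.input_tokens * self.usd_per_million_input
                + self.output_tokens * self.usd_per_million_output) / 1_000_000


def cost_per_interaction(calls: list[ModelCall], retry_rate: float = 0.0) -> float:
    # Simplification: a fraction retry_rate of interactions repeats the pipeline once.
    return sum(call.cost_usd() for call in calls) * (1.0 + retry_rate)


# Five calls per interaction, all on the same large model (illustrative numbers).
steps = ["intent", "reformulation", "retrieval", "generation", "validation"]
pipeline = [ModelCall(step, 2000, 300, 3.0, 15.0) for step in steps]
per_interaction = cost_per_interaction(pipeline)
monthly = per_interaction * 1_000_000  # assumed volume: one million interactions
```

With these placeholder figures each call costs about one cent, so five calls put an interaction above five cents before retries. The point is less the exact number than the multiplier: every extra call on the large model scales the whole bill.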

The solution isn't always "use a smaller model for everything." It's designing an architecture where each operation uses the right model for that operation. Intent classification can often be handled by a 1B or 3B parameter model with more than sufficient accuracy, and so can query reformulation. Final response generation may require a more capable model. This heterogeneous orchestration approach, often called model routing and loosely comparable to a mixture-of-experts applied at the system level rather than inside a single model, isn't new, but it's still ignored by projects that start by picking one model and applying it to the entire pipeline.
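A routing layer does not need to be elaborate to be useful. The following sketch maps each pipeline operation to a named model and falls back to a default for anything unmapped. The model names and the stub callables are hypothetical; in a real system each callable would wrap an actual inference client.

```python
from typing import Callable

ModelFn = Callable[[str], str]


class ModelRouter:
    def __init__(self, default_model: str) -> None:
        self._models: dict[str, ModelFn] = {}
        self._routes: dict[str, str] = {}
        self._default = default_model
        self.calls: list[tuple[str, str]] = []  # (operation, model) audit trail

    def register_model(self, name: str, fn: ModelFn) -> None:
        self._models[name] = fn

    def route(self, operation: str, model_name: str) -> None:
        if model_name not in self._models:
            raise KeyError(f"unknown model: {model_name}")
        self._routes[operation] = model_name

    def run(self, operation: str, prompt: str) -> str:
        # Unmapped operations fall back to the default model.
        name = self._routes.get(operation, self._default)
        self.calls.append((operation, name))
        return self._models[name](prompt)


router = ModelRouter(default_model="large")
router.register_model("small", lambda p: f"[small] {p}")  # stub, e.g. a 1B-3B model
router.register_model("large", lambda p: f"[large] {p}")  # stub, a more capable model
router.route("intent_classification", "small")
router.route("query_reformulation", "small")
```

Keeping the operation-to-model mapping in one place also makes the cost model easier to maintain, because the audit trail in router.calls shows which model each step actually used.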

This connects to a broader concern about the cognitive dependency AI systems generate when they become critical infrastructure: if the system scales poorly and the human team has reduced its manual resolution capacity because "the assistant handled it," a system failure carries an operational cost far beyond the technical one.

Modeling production costs requires at minimum a load simulation exercise before launch. Not a synthetic model benchmark: a simulation with realistic traffic, representative query distribution, and the complete pipeline architecture. It's tedious work that most projects skip. And it's precisely what separates a pilot that becomes a product from a pilot that becomes technical debt.

Getting to Production Without It Blowing Up

There's no universal checklist, but certain design decisions dramatically reduce transition risk. The first is building observability from day one, not from launch day. Even if during the pilot you only log question-answer pairs to a flat file, that already gives you real data to work with.

The second is explicitly defining the system's competence perimeter: which queries it should handle, which it should reject or escalate, and how it behaves at the boundaries. A system that knows when it doesn't know is far more reliable in production than one that answers everything with apparent confidence.
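A competence perimeter can begin as an explicit decision rule in front of the generation step. The sketch below assumes two inputs the pipeline would have to supply: a classified intent and the top retrieval similarity score. The intent names and the 0.5 threshold are invented for illustration and would need calibration against real traffic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate"
    REJECT = "reject"


@dataclass(frozen=True)
class Perimeter:
    allowed_intents: frozenset
    min_retrieval_score: float = 0.5  # assumed threshold, needs calibration

    def decide(self, intent: str, top_retrieval_score: Optional[float]) -> Action:
        # Outside the declared domain: decline rather than improvise.
        if intent not in self.allowed_intents:
            return Action.REJECT
        # In domain but weakly supported by retrieved sources: hand off to a human.
        if top_retrieval_score is None or top_retrieval_score < self.min_retrieval_score:
            return Action.ESCALATE
        return Action.ANSWER


perimeter = Perimeter(allowed_intents=frozenset({"returns", "shipping"}))
```

The rule is deliberately simple. Its value is that every rejection and escalation becomes a logged, countable event, which gives the signal that the system is operating near or beyond its boundaries.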

The third is separating the model lifecycle from the data lifecycle. The model may not change for months. The data the system draws from needs its own update, validation, and reindexing process — with a cadence that reflects how quickly the business information changes.

If you're about to take an AI system to production, or you've had a pilot that won't stabilize into a reliable product for months, at Room 714 we run architecture audits focused precisely on these failure points. Not to rebuild what works, but to identify where the system will crack before it cracks in front of your users.