The first time I shipped an AI feature, I treated the model like magic.
It worked beautifully in my test cases. Then production happened: longer inputs, messier formatting, users pasting entire documents, and the occasional “why did it reply with half a sentence?”
That was the moment I learned a boring truth:
Reliability isn’t a model choice. It’s a systems choice.
The problem is rarely “AI” (it’s unbounded inputs)
Most failures come from the same place: your feature has no limits.
- Input grows without anyone noticing.
- Prompt instructions silently expand over time.
- Retrieval returns “everything,” because “more context must be better.”
- Output size becomes unpredictable.
And then the feature starts failing in weird, expensive ways.
A mental model that actually holds up: budgets
Every LLM call has a budget:
- Input tokens (system + user + retrieved context)
- Output tokens (the answer)
- Safety margin (for the unexpected)
If you don’t allocate that budget intentionally, your product will allocate it randomly.
A practical rule I now follow:
- Decide what the minimum acceptable output is.
- Reserve output tokens for it.
- Spend the remaining budget on input—carefully.
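Here’s a minimal sketch of that allocation, assuming a 16k context window and a placeholder tokenizer (a real implementation would use your model’s actual limits and tokenizer, e.g. tiktoken). The point is that the output reservation and the safety margin come off the top before any context is admitted.

```python
# Assumed numbers for illustration; use your model's real limits.
CONTEXT_WINDOW = 16_000     # total tokens the model accepts
RESERVED_OUTPUT = 1_200     # enough for the minimum acceptable answer
SAFETY_MARGIN = 500         # slack for tokenizer drift and the unexpected

def count_tokens(text: str) -> int:
    # Placeholder: swap in your model's tokenizer (e.g. tiktoken).
    return max(1, len(text) // 4)

def build_input(system_prompt: str, user_input: str, retrieved_chunks: list[str]) -> str:
    budget = CONTEXT_WINDOW - RESERVED_OUTPUT - SAFETY_MARGIN
    budget -= count_tokens(system_prompt) + count_tokens(user_input)

    # Admit retrieved context only while it fits the remaining budget.
    kept = []
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost

    return "\n\n".join([system_prompt, *kept, user_input])
```

If the user input alone blows the budget, that’s the signal to reach for chunking rather than silently truncate.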
Contracts beat vibes
Before writing a prompt, define a contract. For example, a summary feature might require:
- A one-line title
- 3 bullet takeaways
- 1 next action
Everything else is optional.
This sounds obvious, but it’s the difference between “sometimes it fails” and “it always returns something useful.”
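One way to make a contract concrete is to write it down as code before you write the prompt. This is a sketch using a plain dataclass for the summary example above; a validation library like pydantic works just as well.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryContract:
    """Minimum acceptable output for the summary feature."""
    title: str                                   # a one-line title (required)
    takeaways: list[str]                         # exactly 3 bullet takeaways (required)
    next_action: str                             # 1 next action (required)
    extras: dict = field(default_factory=dict)   # everything else is optional

    def validate(self) -> None:
        if not self.title.strip():
            raise ValueError("title is required")
        if len(self.takeaways) != 3:
            raise ValueError("expected exactly 3 takeaways")
        if not self.next_action.strip():
            raise ValueError("next_action is required")
```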
Chunking is not a trick—it's architecture
Chunking isn’t “split into 2,000-token pieces.” Chunking is deciding what meaning survives when text is large.
Here are three patterns I keep reaching for:
- Structure-first chunking: split by headings/sections when possible. It preserves intent.
- Sliding window: useful when structure is unreliable. Add overlap so boundaries don’t delete meaning (sketched below).
- Map → Reduce: summarize each chunk to a fixed template, then synthesize. This is the most stable pattern for long inputs.
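The sliding window is the easiest of the three to get subtly wrong, so here’s a minimal sketch. Sizes are in words for simplicity (a production version would count model tokens); the window and overlap values are arbitrary examples.

```python
def sliding_window_chunks(text: str, window: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly `window` words."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = window - overlap
    for start in range(0, len(words), step):
        # Overlap keeps sentences that straddle a boundary from being lost.
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks
```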
If you only learn one pattern, learn map → reduce. It scales.
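Here’s the shape of map → reduce, with the model call stubbed out as a hypothetical call_model function. The prompts are illustrative; what matters is that each map call sees one bounded chunk and emits a fixed template, so the reduce step always works over small, predictable inputs no matter how long the original document was.

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client call.
    raise NotImplementedError

MAP_PROMPT = (
    "Summarize the following section as:\n"
    "TITLE: <one line>\n"
    "POINTS: <up to 3 bullets>\n\n"
    "Section:\n{chunk}"
)

REDUCE_PROMPT = (
    "You are given per-section summaries in a fixed format. Synthesize them "
    "into one summary with a one-line title, 3 bullet takeaways, and 1 next "
    "action.\n\n{summaries}"
)

def map_reduce_summarize(chunks: list[str]) -> str:
    # Map: summarize each chunk independently into the same template.
    partials = [call_model(MAP_PROMPT.format(chunk=c)) for c in chunks]
    # Reduce: synthesize the bounded, uniform partials into the final answer.
    return call_model(REDUCE_PROMPT.format(summaries="\n\n".join(partials)))
```

If the partial summaries themselves outgrow the input budget, apply the same reduce step again over batches of partials.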
Retrieval can quietly ruin everything
RAG is great—until retrieval becomes a firehose.
Common failure mode: you retrieve too much, and the model has to “guess” what matters.
A simple retrieval discipline:
- Cap retrieved tokens (hard limit)
- Dedupe near-duplicates
- Rerank for relevance
- Prefer fewer, higher-quality chunks over many mediocre ones
If you can’t explain why a chunk is included, don’t include it.
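That discipline fits in a small post-processing step over whatever your retriever returns. The near-duplicate check and the numbers here are crude placeholders; the hard token cap is the part that matters.

```python
def count_tokens(text: str) -> int:
    # Same placeholder tokenizer as in the budget sketch.
    return max(1, len(text) // 4)

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Crude word-overlap check; a real system might use embeddings or minhash.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(1, len(wa | wb)) >= threshold

def budget_retrieval(
    ranked_chunks: list[tuple[str, float]],   # (text, relevance score), best first
    max_tokens: int = 3_000,                  # hard cap on retrieved tokens
    min_score: float = 0.5,                   # below this, there's no reason to include a chunk
) -> list[str]:
    kept: list[str] = []
    spent = 0
    for text, score in ranked_chunks:
        if score < min_score:
            break                             # list is reranked, so nothing better follows
        if any(near_duplicate(text, k) for k in kept):
            continue                          # dedupe near-duplicates
        cost = count_tokens(text)
        if spent + cost > max_tokens:
            break                             # prefer fewer, higher-quality chunks
        kept.append(text)
        spent += cost
    return kept
```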
Structured output needs enforcement
When you need JSON or a schema, treat it like an API.
- Validate the output
- Retry with a short correction prompt
- Cap retries so cost doesn’t spiral
A model that “usually returns valid JSON” is not a reliable system. Validation makes it one.
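Treated like an API, that loop looks roughly like this. call_model and SummaryContract are the hypothetical pieces from the earlier sketches; the essential parts are the validation, the short correction prompt, and the retry cap.

```python
import json

MAX_RETRIES = 2  # cap retries so cost doesn't spiral

def get_structured_summary(prompt: str) -> SummaryContract:
    last_error = ""
    for attempt in range(1 + MAX_RETRIES):
        ask = prompt if attempt == 0 else (
            f"{prompt}\n\nYour previous reply was invalid ({last_error}). "
            "Return ONLY valid JSON with keys: title, takeaways, next_action."
        )
        raw = call_model(ask)
        try:
            data = json.loads(raw)
            result = SummaryContract(
                title=data["title"],
                takeaways=data["takeaways"],
                next_action=data["next_action"],
            )
            result.validate()   # enforce the contract, not just JSON syntax
            return result
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            last_error = str(exc)
    raise RuntimeError(f"invalid output after {MAX_RETRIES + 1} attempts: {last_error}")
```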
A checklist I use before shipping
- Define a minimal output contract (required fields)
- Hard-cap input tokens and output tokens
- Keep a safety margin (don’t run near max context)
- Choose a chunking strategy (structure-first when possible)
- Use map → reduce for long documents
- Validate structured output and cap retries
- Budget retrieval and dedupe aggressively
- Log tokens, latency, retries, and failure modes
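For the last item on that list, one structured log record per call is usually enough to start; the field names here are just one possible shape.

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")

def log_llm_call(feature: str, input_tokens: int, output_tokens: int,
                 retries: int, started_at: float, failure: str | None = None) -> None:
    # One record per call: tokens, latency, retries, and the failure mode (if any).
    logger.info(json.dumps({
        "feature": feature,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "retries": retries,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "failure": failure,   # e.g. "schema_invalid", "truncated_output", or None
    }))
```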
The point
When people say “LLMs are unpredictable,” they often mean their system is unbounded.
Once you add contracts, budgets, and guardrails, the feature starts behaving like software again.
Not magic.
Just engineering.