The Quiet Discipline of Shipping AI Features That Don’t Break


The first time I shipped an AI feature, I treated the model like magic.

It worked beautifully in my test cases. Then production happened: longer inputs, messier formatting, users pasting entire documents, and the occasional “why did it reply with half a sentence?”

That was the moment I learned a boring truth:

Reliability isn’t a model choice. It’s a systems choice.

The problem is rarely “AI” (it’s unbounded inputs)

Most failures come from the same place: your feature has no limits. No cap on input length, no reserved output budget, no bound on how much retrieval can pull in.

And then the feature starts failing in weird, expensive ways.

A mental model that actually holds up: budgets

Every LLM call has a budget: a fixed context window that input tokens, prompt overhead, and output tokens all have to share.

If you don’t allocate that budget intentionally, your product will allocate it randomly.

A practical rule I now follow:

  1. Decide what the minimum acceptable output is.
  2. Reserve output tokens for it.
  3. Spend the remaining budget on input—carefully.
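
The steps above can be sketched as a small helper. All numbers and names here are illustrative assumptions, not any provider's API:

```python
# Illustrative token-budget split for a single LLM call.
# Reserve output first; spend what's left on input.

def allocate_budget(context_window: int, min_output_tokens: int,
                    prompt_overhead: int = 500) -> dict:
    """Split a context window: output reserved first, input gets the rest."""
    if min_output_tokens + prompt_overhead >= context_window:
        raise ValueError("Context window too small for the required output")
    input_budget = context_window - min_output_tokens - prompt_overhead
    return {
        "max_input_tokens": input_budget,
        "max_output_tokens": min_output_tokens,
        "prompt_overhead": prompt_overhead,
    }

budget = allocate_budget(context_window=8192, min_output_tokens=1024)
print(budget["max_input_tokens"])  # 6668
```

The point of the explicit `ValueError` is step 1: if the minimum acceptable output doesn't fit, fail loudly before calling the model, not after.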

Contracts beat vibes

Before writing a prompt, define a contract. For example, a summary feature might require: a fixed maximum number of bullet points, every bullet grounded in the source text, and no truncated sentences.

Everything else is optional.

This sounds obvious, but it’s the difference between “sometimes it fails” and “it always returns something useful.”
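
A contract can be as small as a validator you run on every response. This is a minimal sketch for a hypothetical summary feature; the limits and checks are assumptions, not a standard:

```python
# A minimal output contract for a summary feature.
# "Minimum acceptable output": non-empty, bounded, no truncated lines.

def meets_contract(summary: str, max_bullets: int = 5) -> bool:
    bullets = [line for line in summary.splitlines() if line.strip()]
    if not bullets or len(bullets) > max_bullets:
        return False
    # Reject truncated output: every line must end with terminal punctuation.
    return all(b.strip().endswith((".", "!", "?")) for b in bullets)

assert meets_contract("First point.\nSecond point.")
assert not meets_contract("why did it reply with half a sen")
```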

Chunking is not a trick—it's architecture

Chunking isn’t “split into 2,000 tokens.” Chunking is deciding what meaning survives when text is large.

Here are three patterns I keep reaching for:

  1. Structure-first chunking
    Split by headings/sections when possible. It preserves intent.

  2. Sliding window
    Useful when structure is unreliable. Add overlap so boundaries don’t delete meaning.

  3. Map → Reduce
    Summarize each chunk to a fixed template, then synthesize. This is the most stable pattern for long inputs.
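
Pattern 2 fits in a few lines. Sizes here are in characters for simplicity; a real system would count tokens:

```python
# Sliding-window chunking with overlap, so chunk boundaries
# don't delete meaning that straddles them.

def sliding_window(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = sliding_window("x" * 5000, size=2000, overlap=200)
print(len(chunks))  # 3
```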

If you only learn one pattern, learn map → reduce. It scales.
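
A minimal map → reduce sketch, with `call_llm` as a placeholder for whatever client you actually use (the prompts and stub are illustrative assumptions):

```python
# Map -> reduce summarization over chunks.

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs as-is.
    return f"[summary of {len(prompt)} chars]"

def map_reduce_summarize(chunks: list[str]) -> str:
    # Map: summarize every chunk to the SAME fixed template.
    partials = [
        call_llm(f"Summarize in 3 bullets, identical format each time:\n{c}")
        for c in chunks
    ]
    # Reduce: synthesize the fixed-format partials into one answer.
    return call_llm("Combine these bullet summaries:\n" + "\n".join(partials))
```

The fixed template in the map step is what makes the reduce step stable: the synthesis prompt always sees inputs of the same shape, no matter how long the original document was.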

Retrieval can quietly ruin everything

RAG is great—until retrieval becomes a firehose.

Common failure mode: you retrieve too much, and the model has to “guess” what matters.

A simple retrieval discipline: cap top-k, set a relevance threshold, and attach a reason to every chunk you pass along.

If you can’t explain why a chunk is included, don’t include it.
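
That discipline is easy to make mechanical. A sketch, assuming you already have (chunk, similarity score) pairs from your retriever; the threshold and k are illustrative:

```python
# Retrieval filter: cap top-k, require a minimum score, and record
# an explicit reason for every chunk that makes it into the prompt.

def select_chunks(scored: list[tuple[str, float]], k: int = 5,
                  min_score: float = 0.75) -> list[dict]:
    kept = []
    for chunk, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if score < min_score or len(kept) >= k:
            break
        kept.append({
            "text": chunk,
            "reason": f"similarity {score:.2f} >= {min_score}",
        })
    return kept

hits = [("a", 0.9), ("b", 0.6), ("c", 0.8)]
print([h["text"] for h in select_chunks(hits)])  # ['a', 'c']
```

The `reason` field is the point: if you can't write one down, the chunk doesn't go in.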

Structured output needs enforcement

When you need JSON or a schema, treat it like an API.

A model that “usually returns valid JSON” is not a reliable system. Validation makes it one.
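
Enforcement can be a small wrapper: parse, check the keys you need, and retry with a stricter instruction on failure. `call_llm` and the key set are assumptions; the validation here is deliberately minimal:

```python
import json

# Treat structured output like an API: validate, retry once, then fail loudly.

def get_structured(call_llm, prompt: str, required_keys: set[str],
                   retries: int = 1) -> dict:
    for _attempt in range(retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            if required_keys <= data.keys():
                return data
        except json.JSONDecodeError:
            pass
        # Tighten the instruction before retrying.
        prompt += "\nReturn ONLY valid JSON with keys: " + ", ".join(sorted(required_keys))
    raise ValueError("Model never returned valid structured output")
```

Usage with a flaky stub, to show the retry path:

```python
calls = iter(['not json', '{"title": "x", "bullets": []}'])
result = get_structured(lambda p: next(calls), "Summarize...", {"title", "bullets"})
print(result["title"])  # x
```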

A checklist I use before shipping

  1. Is there a hard cap on input size, and a reserved output budget?
  2. Is the minimum acceptable output written down as a contract?
  3. Does chunking preserve structure, with overlap where structure is unreliable?
  4. Can I explain why every retrieved chunk is included?
  5. Is structured output validated, with a retry path when it isn't?

The point

When people say “LLMs are unpredictable,” they often mean their system is unbounded.

Once you add contracts, budgets, and guardrails, the feature starts behaving like software again.

Not magic.

Just engineering.