The first time I shipped an AI feature, I treated the model like magic.
It worked beautifully in my test cases. Then production happened: longer inputs, messier formatting, users pasting entire documents, and the occasional “why did it reply with half a sentence?”
That was the moment I learned a boring truth:
Reliability isn’t a model choice. It’s a systems choice.
The problem is rarely “AI” (it’s unbounded inputs)
Most failures come from the same place: your feature has no limits.
- Input grows without anyone noticing.
- Prompt instructions silently expand over time.
- Retrieval returns “everything,” because “more context must be better.”
- Output size becomes unpredictable.
And then the feature starts failing in weird, expensive ways.
A mental model that actually holds up: budgets
Every LLM call has a budget:
- Input tokens (system + user + retrieved context)
- Output tokens (the answer)
- Safety margin (for the unexpected)
If you don’t allocate that budget intentionally, your product will allocate it randomly.
A practical rule I now follow:
- Decide what the minimum acceptable output is.
- Reserve output tokens for it.
- Spend the remaining budget on input—carefully.
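Here’s a minimal sketch of that allocation, assuming a 16k context window and a placeholder tokenizer (a real implementation would use your model’s actual limits and tokenizer, e.g. tiktoken). The point is that the output reservation and the safety margin come off the top before any context is admitted.

```python
# Assumed numbers for illustration; use your model's real limits.
CONTEXT_WINDOW = 16_000     # total tokens the model accepts
RESERVED_OUTPUT = 1_200     # enough for the minimum acceptable answer
SAFETY_MARGIN = 500         # slack for tokenizer drift and the unexpected

def count_tokens(text: str) -> int:
    # Placeholder: swap in your model's tokenizer (e.g. tiktoken).
    return max(1, len(text) // 4)

def build_input(system_prompt: str, user_input: str, retrieved_chunks: list[str]) -> str:
    budget = CONTEXT_WINDOW - RESERVED_OUTPUT - SAFETY_MARGIN
    budget -= count_tokens(system_prompt) + count_tokens(user_input)

    # Admit retrieved context only while it fits the remaining budget.
    kept = []
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost

    return "\n\n".join([system_prompt, *kept, user_input])
```

If the user input alone blows the budget, that’s the signal to reach for chunking rather than silently truncate.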
Contracts beat vibes
Before writing a prompt, define a contract. For example, a summary feature might require:
- A one-line title
- 3 bullet takeaways
- 1 next action
Everything else is optional.
This sounds obvious, but it’s the difference between “sometimes it fails” and “it always returns something useful.”
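One way to make a contract concrete is to write it down as code before you write the prompt. This is a sketch using a plain dataclass for the summary example above; a validation library like pydantic works just as well.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryContract:
    """Minimum acceptable output for the summary feature."""
    title: str                                   # a one-line title (required)
    takeaways: list[str]                         # exactly 3 bullet takeaways (required)
    next_action: str                             # 1 next action (required)
    extras: dict = field(default_factory=dict)   # everything else is optional

    def validate(self) -> None:
        if not self.title.strip():
            raise ValueError("title is required")
        if len(self.takeaways) != 3:
            raise ValueError("expected exactly 3 takeaways")
        if not self.next_action.strip():
            raise ValueError("next_action is required")
```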
Chunking is not a trick—it's architecture
Chunking isn’t “split into 2,000-token pieces.” Chunking is deciding what meaning survives when text is large.
Here are three patterns I keep reaching for:
- Structure-first chunking: split by headings/sections when possible. It preserves intent.
- Sliding window: useful when structure is unreliable. Add overlap so boundaries don’t delete meaning (sketched below).
- Map → Reduce: summarize each chunk to a fixed template, then synthesize. This is the most stable pattern for long inputs.
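The sliding window is the easiest of the three to get subtly wrong, so here’s a minimal sketch. Sizes are in words for simplicity (a production version would count model tokens); the window and overlap values are arbitrary examples.

```python
def sliding_window_chunks(text: str, window: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly `window` words."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = window - overlap
    for start in range(0, len(words), step):
        # Overlap keeps sentences that straddle a boundary from being lost.
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks
```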
If you only learn one pattern, learn map → reduce. It scales.
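Here’s the shape of map → reduce, with the model call stubbed out as a hypothetical call_model function. The prompts are illustrative; what matters is that each map call sees one bounded chunk and emits a fixed template, so the reduce step always works over small, predictable inputs no matter how long the original document was.

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client call.
    raise NotImplementedError

MAP_PROMPT = (
    "Summarize the following section as:\n"
    "TITLE: <one line>\n"
    "POINTS: <up to 3 bullets>\n\n"
    "Section:\n{chunk}"
)

REDUCE_PROMPT = (
    "You are given per-section summaries in a fixed format. Synthesize them "
    "into one summary with a one-line title, 3 bullet takeaways, and 1 next "
    "action.\n\n{summaries}"
)

def map_reduce_summarize(chunks: list[str]) -> str:
    # Map: summarize each chunk independently into the same template.
    partials = [call_model(MAP_PROMPT.format(chunk=c)) for c in chunks]
    # Reduce: synthesize the bounded, uniform partials into the final answer.
    return call_model(REDUCE_PROMPT.format(summaries="\n\n".join(partials)))
```

If the partial summaries themselves outgrow the input budget, apply the same reduce step again over batches of partials.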
Retrieval can quietly ruin everything
RAG is great—until retrieval becomes a firehose.
Common failure mode: you retrieve too much, and the model has to “guess” what matters.
A simple retrieval discipline:
- Cap retrieved tokens (hard limit)
- Dedupe near-duplicates
- Rerank for relevance
- Prefer fewer, higher-quality chunks over many mediocre ones
If you can’t explain why a chunk is included, don’t include it.
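That discipline fits in a small post-processing step over whatever your retriever returns. The near-duplicate check and the numbers here are crude placeholders; the hard token cap is the part that matters.

```python
def count_tokens(text: str) -> int:
    # Same placeholder tokenizer as in the budget sketch.
    return max(1, len(text) // 4)

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Crude word-overlap check; a real system might use embeddings or minhash.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(1, len(wa | wb)) >= threshold

def budget_retrieval(
    ranked_chunks: list[tuple[str, float]],   # (text, relevance score), best first
    max_tokens: int = 3_000,                  # hard cap on retrieved tokens
    min_score: float = 0.5,                   # below this, there's no reason to include a chunk
) -> list[str]:
    kept: list[str] = []
    spent = 0
    for text, score in ranked_chunks:
        if score < min_score:
            break                             # list is reranked, so nothing better follows
        if any(near_duplicate(text, k) for k in kept):
            continue                          # dedupe near-duplicates
        cost = count_tokens(text)
        if spent + cost > max_tokens:
            break                             # prefer fewer, higher-quality chunks
        kept.append(text)
        spent += cost
    return kept
```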
Structured output needs enforcement
When you need JSON or a schema, treat it like an API.
- Validate the output
- Retry with a short correction prompt
- Cap retries so cost doesn’t spiral
A model that “usually returns valid JSON” is not a reliable system. Validation makes it one.
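Treated like an API, that loop looks roughly like this. call_model and SummaryContract are the hypothetical pieces from the earlier sketches; the essential parts are the validation, the short correction prompt, and the retry cap.

```python
import json

MAX_RETRIES = 2  # cap retries so cost doesn't spiral

def get_structured_summary(prompt: str) -> SummaryContract:
    last_error = ""
    for attempt in range(1 + MAX_RETRIES):
        ask = prompt if attempt == 0 else (
            f"{prompt}\n\nYour previous reply was invalid ({last_error}). "
            "Return ONLY valid JSON with keys: title, takeaways, next_action."
        )
        raw = call_model(ask)
        try:
            data = json.loads(raw)
            result = SummaryContract(
                title=data["title"],
                takeaways=data["takeaways"],
                next_action=data["next_action"],
            )
            result.validate()   # enforce the contract, not just JSON syntax
            return result
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            last_error = str(exc)
    raise RuntimeError(f"invalid output after {MAX_RETRIES + 1} attempts: {last_error}")
```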
A checklist I use before shipping
- Define a minimal output contract (required fields)
- Hard-cap input tokens and output tokens
- Keep a safety margin (don’t run near max context)
- Choose a chunking strategy (structure-first when possible)
- Use map → reduce for long documents
- Validate structured output and cap retries
- Budget retrieval and dedupe aggressively
- Log tokens, latency, retries, and failure modes
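For the last item on that list, one structured log record per call is usually enough to start; the field names here are just one possible shape.

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")

def log_llm_call(feature: str, input_tokens: int, output_tokens: int,
                 retries: int, started_at: float, failure: str | None = None) -> None:
    # One record per call: tokens, latency, retries, and the failure mode (if any).
    logger.info(json.dumps({
        "feature": feature,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "retries": retries,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "failure": failure,   # e.g. "schema_invalid", "truncated_output", or None
    }))
```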
The point
When people say “LLMs are unpredictable,” they often mean their system is unbounded.
Once you add contracts, budgets, and guardrails, the feature starts behaving like software again.
Not magic.
Just engineering.