The gap between demos and production
Everyone has seen the impressive ChatGPT demos. But deploying generative AI into production workflows — where reliability, cost, and latency matter — is a fundamentally different challenge.
Over the past two years, I’ve led the adoption of AI and generative AI solutions across our engineering organization. Here’s what I’ve learned.
Start with the workflow, not the model
The most common mistake I see teams make is starting with “let’s use GPT-4” and then looking for a problem to solve. The right approach is the opposite:
- Map the manual workflow — What are humans doing today? Where are the bottlenecks?
- Identify the repetitive, high-volume tasks — These are your best candidates
- Prototype with the simplest model that works — You’d be surprised how often a smaller model outperforms a larger one for narrow tasks
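To make that last point concrete, here is a minimal sketch of a harness for comparing candidate models on a narrow task before committing to the largest one. The `call_model` function is a placeholder, not a real provider API; swap in whatever SDK you actually use.

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real model call (hosted API, local model, etc.)."""
    return f"[{model}] response to: {prompt}"

def evaluate(model: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the output contains the expected answer.

    A substring check is the crudest possible grader; for real evals,
    use exact-match, rubric, or model-graded scoring as appropriate.
    """
    hits = 0
    for prompt, expected in cases:
        if expected.lower() in call_model(model, prompt).lower():
            hits += 1
    return hits / len(cases)

# A tiny eval set for a narrow classification task.
cases = [("Classify sentiment: 'great product'", "positive")]
for model in ["small-model", "large-model"]:
    print(model, evaluate(model, cases))
```

Even a ten-case eval like this, run against two or three model sizes, usually settles the "do we need the big model?" question in an afternoon.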
The agentic pipeline pattern
One of the most impactful architectures I’ve built is the agentic multi-model pipeline. Instead of one monolithic LLM call, you chain specialized steps:
- Prompt generation — A small model that transforms business requirements into optimized prompts
- Primary generation — The main model does the heavy lifting
- Automated QA — A vision model or classifier validates the output
- Retry loop — Failed QA triggers regeneration with feedback
This pattern reduces cost, improves quality, and gives you observability at every step.
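The four steps above can be sketched as a single function. All three model calls are stubbed with placeholders here; the QA step uses a trivial length check where a real pipeline would call a classifier or vision model.

```python
def generate_prompt(requirements: str) -> str:
    """Step 1: a small model would turn requirements into an optimized prompt.
    Stubbed for illustration."""
    return f"Write marketing copy that satisfies: {requirements}"

def generate(prompt: str) -> str:
    """Step 2: primary generation (stubbed)."""
    return f"DRAFT: {prompt}"

def passes_qa(output: str) -> tuple[bool, str]:
    """Step 3: automated QA. A length check stands in for a real
    classifier or vision model; the second value is feedback for retries."""
    if len(output) < 20:
        return False, "output too short"
    return True, ""

def pipeline(requirements: str, max_retries: int = 2) -> str:
    """Steps 1-4 chained, with QA feedback fed back into regeneration."""
    prompt = generate_prompt(requirements)
    feedback = ""
    for _ in range(max_retries + 1):
        output = generate(prompt + (f"\nFix: {feedback}" if feedback else ""))
        ok, feedback = passes_qa(output)
        if ok:
            return output
    raise RuntimeError(f"QA failed after {max_retries + 1} attempts: {feedback}")
```

Because each step is a plain function with its own inputs and outputs, you can log, cache, swap, or A/B-test any stage independently, which is where the observability and cost wins come from.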
What actually matters
After building several AI-powered products, I've found that a few things matter far more than model choice:
- Observability — Log every LLM call, every token count, every latency measurement. You can’t improve what you can’t measure.
- Graceful degradation — When the AI fails (and it will), what’s the fallback? Build it from day one.
- Human-in-the-loop — For high-stakes outputs, always have a review step. Automate the 80%, let humans handle the edge cases.
- Cost management — Monitor your API spend like you monitor your cloud costs. It can spiral fast.
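The first two points can be combined in a single wrapper: log every call's latency and rough token counts, and fall back to a non-AI default when the call fails. The whitespace-split token count is a deliberate crudeness; in practice, read the usage data your provider returns. `summarize` is a stand-in for a real model call.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def observed(fallback=None):
    """Log latency and rough token counts for every LLM call, and return
    `fallback` instead of raising when the underlying call fails."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(prompt: str, *args, **kwargs):
            start = time.perf_counter()
            try:
                out = fn(prompt, *args, **kwargs)
                log.info("%s ok in=%d out=%d tokens %.0fms", fn.__name__,
                         len(prompt.split()), len(out.split()),
                         (time.perf_counter() - start) * 1000)
                return out
            except Exception:
                log.exception("%s failed; using fallback", fn.__name__)
                if fallback is None:
                    raise
                return fallback
        return inner
    return wrap

@observed(fallback="We'll get back to you shortly.")
def summarize(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "Summary: " + prompt[:40]
```

Ship those structured logs to the same place as your other telemetry, and cost tracking falls out almost for free: token counts times your per-token price, aggregated per endpoint.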
Looking ahead
The field moves incredibly fast. What I’m most excited about right now:
- Smaller, specialized models — The trend toward efficient, task-specific models is the right direction for production use
- Better tooling — The observability and evaluation ecosystem is maturing rapidly
- Multimodal pipelines — Combining text, image, and code generation in single workflows
The key is to stay hands-on. Build POCs, test hypotheses, and form your own opinions. The gap between what’s hyped and what’s useful is still wide — and that’s where the real engineering happens.