LLM Selection for Production AI Products: A Practical Guide

Building MoveAI, an AI dispatcher for a moving company, we opened with GPT-4 and spent the first production weeks watching it work beautifully while bleeding money. The dispatcher workflow required parsing customer intake forms, generating job summaries, and routing requests to crews. GPT-4 handled the nuance well. It also cost far more per request than the task demanded, and latency was long enough that dispatchers were sitting in silence during live calls, waiting for the model to respond. We switched to GPT-3.5-turbo for the routing layer and kept GPT-4 for edge-case escalations. In the first three weeks post-launch, our production billing logs showed per-session inference cost drop from roughly $0.041 to $0.008, a reduction of about 80%. That layered decision, not the initial choice, is what made the system viable.

When we built TER.A Coffee, a Telegram loyalty bot, we made a different version of the same error — and caught it faster because of what MoveAI had cost us.

Define the Job Before You Pick the Model

The most common selection mistake is treating "which LLM is best" as a context-free question. It isn't. A model that excels at legal summarization may be mediocre at high-frequency classification tasks where you need sub-500ms latency.

Before opening a benchmark page, you need to settle four things — and the order matters, because each one constrains what follows.

What is the actual output format? Structured JSON, free prose, a single label, a ranked list? Models vary significantly in instruction-following reliability for strict schemas, and a model that produces fluent prose may fall apart when you need it to return a consistently valid JSON object on every single call at volume.

What is the acceptable latency? Real-time conversation has a different ceiling than an async background job. This question is worth answering precisely — not "fast enough" but an actual millisecond threshold — because it can rule out otherwise capable models entirely.

What is the expected call volume? A feature used 50 times a day and one used 50,000 times a day have completely different cost profiles at the same per-token price. The answer to this question is what turns a cost estimate from a rough sketch into an architecture decision.

What does failure look like? A hallucinated coffee reward balance in TER.A Coffee is annoying. A hallucinated crew assignment in MoveAI means a truck shows up at the wrong address. This question is the one that most directly determines how much you can trade reliability for speed or cost — and when you can't trade it at all.

These four questions interact. High failure cost combined with high volume means you can't simply use a cheap model and accept occasional errors; you need to route by request complexity. Low failure cost with very high volume is exactly the situation where a cheaper, faster model should win, even if it underperforms on benchmarks.
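One way to keep the four answers honest is to write them down as data before comparing any models. The sketch below is illustrative only: `JobSpec`, `suggest_strategy`, the strategy labels, and the 10,000-calls-per-day cutoff are hypothetical names and values, not thresholds taken from either project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Answers to the four questions for one LLM-backed feature."""
    output_format: str    # e.g. "json", "prose", "label", "ranked_list"
    max_latency_ms: int   # an actual threshold, not "fast enough"
    calls_per_day: int
    failure_cost: str     # "low" or "high"


# Assumed cutoff for "high volume"; pick your own from real traffic.
HIGH_VOLUME_CALLS_PER_DAY = 10_000


def suggest_strategy(spec: JobSpec) -> str:
    """Map the two interactions described above to a starting strategy."""
    high_volume = spec.calls_per_day >= HIGH_VOLUME_CALLS_PER_DAY
    if spec.failure_cost == "high" and high_volume:
        # Cheap-model errors are not acceptable at this volume.
        return "route_by_complexity"
    if spec.failure_cost == "low" and high_volume:
        # Occasional errors are tolerable; cost and speed dominate.
        return "cheap_fast_model"
    # Combinations the text does not cover: fall back to a normal evaluation.
    return "evaluate_candidates"
```

A dispatch-style job such as `JobSpec("json", 500, 50_000, "high")` maps to `"route_by_complexity"`, while the same volume with a low failure cost maps to `"cheap_fast_model"`.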

What to Measure Instead of Benchmarks

Benchmarks like MMLU or HumanEval measure capability in controlled conditions. They're useful for narrowing a shortlist, but production behavior depends on things controlled benchmarks don't capture — your specific prompts, your users' actual inputs, your deployment region's latency profile. In our experience, the three measurements that actually predict production quality are instruction fidelity, latency under your conditions, and cost at your call shape.

Instruction fidelity under real prompts. How consistently does the model follow the exact output format you need, across the actual inputs your users generate — not curated test cases? For TER.A Coffee, we needed the model to always return a valid JSON object with specific keys for the Telegram bot to parse downstream. We tested three models on 80 real user queries collected during the first week of beta. Claude 3 Haiku and GPT-4o both held above 97% schema-valid responses without changes to the base prompt. Gemini 1.5 Flash dropped to around 89% before we tightened the prompt, but its per-token cost at the volume we were targeting made the extra prompt engineering worth doing.
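A schema-validity rate like the ones quoted above can be computed with a few lines of standard-library Python. This is a minimal sketch, not the harness used for TER.A Coffee; the required keys `reply` and `points` are hypothetical placeholders for whatever your downstream parser expects.

```python
import json

# Hypothetical keys; replace with the keys your parser actually requires.
REQUIRED_KEYS = frozenset({"reply", "points"})


def is_schema_valid(raw, required_keys=REQUIRED_KEYS):
    """True if raw parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def schema_valid_rate(responses):
    """Fraction of raw model responses that pass the schema check."""
    if not responses:
        raise ValueError("no responses to score")
    valid = sum(1 for raw in responses if is_schema_valid(raw))
    return valid / len(responses)
```

Run it over the raw responses each candidate returns for the same set of real user queries, and compare the rates before and after any prompt changes.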

Latency under your deployment conditions. Latency numbers quoted by providers are typically medians measured under clean conditions. Measure p95 and p99 in your actual region, during peak hours, with your actual prompt sizes. For MoveAI's live dispatch flow, we ran a two-week logging period in January 2025 comparing GPT-3.5-turbo and Claude 3 Haiku on identical prompts routed in parallel. Haiku's p95 latency came in around 20% lower, but GPT-3.5-turbo produced schema-adherent output on about 94% of calls versus Haiku's 88% under our routing logic without additional prompt work. The consistency gap mattered more than the speed advantage, so we kept GPT-3.5-turbo for that layer and accepted the latency. The point isn't that GPT-3.5-turbo is universally better than Haiku; we use Haiku elsewhere. It's that you have to measure for your specific prompts and your specific schema requirements, because aggregate benchmarks won't show you that gap.
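For reference, p95 and p99 can be read straight from logged request durations. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; the function name and the assumption that latencies are logged in milliseconds are ours.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing so exact ranks stay exact for integer pct.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

With only a few hundred samples, p99 is determined by a handful of requests, so log for long enough (and across peak hours) that the tail is meaningful.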

Cost at your actual call shape. The price per million tokens listed on a provider's pricing page is only meaningful when combined with your average prompt length, completion length, and call frequency. A model with a lower input token price but higher output token price may be more expensive for a task that generates long responses. Build your cost estimate from actual production logs or realistic synthetic traffic before committing to an architecture — not from the pricing page alone.
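The arithmetic is simple enough to script. In this sketch the two price pairs are made-up numbers chosen to show the input/output asymmetry described above; they are not any provider's real prices.

```python
def monthly_cost_usd(calls_per_day, avg_input_tokens, avg_output_tokens,
                     input_price_per_mtok, output_price_per_mtok, days=30):
    """Estimated monthly spend from call shape and per-million-token prices."""
    per_call = (avg_input_tokens * input_price_per_mtok
                + avg_output_tokens * output_price_per_mtok) / 1_000_000
    return calls_per_day * days * per_call


# Hypothetical prices in USD per million tokens: (input, output).
MODEL_A = (0.25, 2.00)  # cheaper input, pricier output
MODEL_B = (0.50, 1.00)  # pricier input, cheaper output

# Same 800-token prompt, two different completion lengths.
short_a = monthly_cost_usd(1_000, 800, 50, *MODEL_A)
short_b = monthly_cost_usd(1_000, 800, 50, *MODEL_B)
long_a = monthly_cost_usd(1_000, 800, 600, *MODEL_A)
long_b = monthly_cost_usd(1_000, 800, 600, *MODEL_B)
```

With these assumed prices, model A is cheaper for the 50-token completions and model B is cheaper for the 600-token completions, which is why the call shape has to come from logs rather than from the pricing page.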

Context Window: When Size Actually Matters

There's a tendency to reach for the largest context window available as a kind of insurance policy. That's usually unnecessary and occasionally harmful.

Filling a larger context window costs more per request, because you pay for every input token you send. A large window also doesn't guarantee that the model uses distant context accurately: the "lost in the middle" study (Nelson Liu et al., 2023) found that, in several models, accuracy degrades when the relevant information sits in the middle of a long context. Stuffing a 100k-token context window because you can is not the same as the model reliably attending to all of it.

Use a large context window when you genuinely need it: long document analysis, multi-turn conversations that accumulate substantial history, or code review across large files. For discrete, short tasks — classification, extraction from a short input, single-turn Q&A — a model with a 4k or 8k window is fine and will be faster and cheaper.

For MoveAI, the average dispatch prompt was under 800 tokens, so context window size was never a constraint. For a separate engagement involving contract review, we used Claude 3.5 Sonnet specifically because its 200k-token context window handled full agreement documents without chunking, which simplified our pipeline considerably.

Where We've Made the Wrong Call and What We Learned

When we built TER.A Coffee — a Telegram loyalty bot now serving 325+ active users with zero infrastructure cost by routing everything through serverless functions and the Telegram Bot API — we initially used GPT-4o for all responses because we wanted the best user experience. The bot handles things like checking point balances, answering menu questions, and processing redemption requests.

Three weeks in, our inference costs were visibly high relative to the product's revenue contribution. More importantly, most of the interactions were simple: a user sends "/balance", the bot queries a Google Sheet via Apps Script, formats the result, and replies. The LLM's role was generating a friendly natural-language response around a data value we already had. GPT-4o was absurd overkill for that path.

We split the call routing. Simple, templated responses — balance checks, menu lookups — now use GPT-4o-mini. Complex or ambiguous queries (a user asking something outside the normal flow, complaints, multi-step redemption scenarios) escalate to GPT-4o. Comparing billing statements across the 30 days before and after the split, our monthly inference spend dropped by roughly 60%, with no measurable drop in user-reported satisfaction in the same window.

The lesson isn't "use cheap models." It's that most production AI systems have a bimodal request distribution: a high volume of simple, predictable requests and a low volume of genuinely complex ones. Routing those two populations to different models is almost always worth doing, and the split is usually less technically involved than it sounds — in TER.A Coffee's case, it came down to a classifier prompt that flagged requests as simple or complex before they hit the main handler.
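A split of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the TER.A Coffee code: `pick_model` and `command_classifier` are hypothetical names, the `/menu` command is invented for the example, and the stand-in classifier uses a command lookup where the real system used a classifier prompt. Sending anything the classifier does not clearly label "simple" to the stronger model is our assumption about a safe default.

```python
from typing import Callable

SIMPLE_MODEL = "gpt-4o-mini"
COMPLEX_MODEL = "gpt-4o"


def pick_model(message: str, classify: Callable[[str], str]) -> str:
    """Choose a model from the classifier's label for this message."""
    label = classify(message).strip().lower()
    if label == "simple":
        return SIMPLE_MODEL
    # "complex" or any unexpected label escalates to the stronger model.
    return COMPLEX_MODEL


def command_classifier(message: str) -> str:
    """Stand-in for a classifier prompt: known commands count as simple."""
    simple_commands = ("/balance", "/menu")  # "/menu" is hypothetical
    return "simple" if message.strip().startswith(simple_commands) else "complex"
```

Because the classifier is passed in as a plain callable, the stand-in can be swapped for a real LLM call without touching the routing logic, and the router can be unit-tested offline.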

Closed vs. Open-Weight Models: The Honest Tradeoff

The case for closed models from OpenAI, Anthropic, or Google is straightforward: no infrastructure to run, predictable latency SLAs, continuous model updates, and mature SDKs. You pay per token and ship faster.

The case for open-weight models — Llama 3, Mistral, Qwen — is also real: data privacy, no per-token cost at scale, full control over fine-tuning, and no dependency on a third-party's pricing decisions. But "no per-token cost" is not "free." You're paying for GPU compute, either on-demand through a provider like Together AI or Replicate, or in reserved capacity if you're running your own inference.

For most product teams at early to mid scale, closed models win on total cost of ownership when you factor in engineering time. The moment that flips is when you have either a compliance requirement that prohibits sending data to third parties, or a volume where the per-token math clearly favors running your own inference. That crossover point is higher than most teams expect — it's usually not worth doing until you're well into the millions of requests per month, and even then only if you have the ML infrastructure expertise in-house.
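A rough way to locate that crossover is a break-even calculation. The sketch below is deliberately simplified: it treats self-hosting as a fixed monthly cost (GPUs plus the engineering time to run them) with an optional marginal cost per request, and it ignores capacity limits. The function name and all figures in the usage note are hypothetical.

```python
def break_even_requests_per_month(self_host_fixed_monthly_usd,
                                  api_cost_per_request_usd,
                                  self_host_marginal_per_request_usd=0.0):
    """Monthly request volume above which self-hosting becomes cheaper."""
    saving_per_request = api_cost_per_request_usd - self_host_marginal_per_request_usd
    if saving_per_request <= 0:
        raise ValueError("self-hosting never breaks even at these costs")
    return self_host_fixed_monthly_usd / saving_per_request
```

For example, an assumed $6,000 per month in fixed self-hosting cost against an assumed $0.002 per API request breaks even at about three million requests per month. Below that volume the managed API is cheaper under this model.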

Forgemind has not yet shipped a product where open-weight self-hosting made sense over managed API endpoints. That may change, and we're watching the inference cost curves closely.

A Working Selection Process

Rather than a decision tree, here's the actual sequence we run when starting a new integration:

Start with a clear failure-mode analysis. Write down what a bad output looks like and how often you can tolerate it. That sets your floor for model quality.

Then run a prompt evaluation against two or three candidate models on 50–100 real or realistic inputs — not cherry-picked ones. Use your actual system prompt. Score outputs on your specific criteria, not general quality.

Estimate your monthly cost at realistic volume for each candidate. Include both input and output tokens, and use your actual average prompt and completion lengths, not the provider's example figures.

Measure p95 latency in your deployment region under load, not from a laptop running a single curl command.

Pick the cheapest model that clears your quality and latency thresholds. Not the most impressive one.
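The final step reduces to a filter followed by a minimum. In this sketch, `Candidate`, `select_cheapest`, and the field names are hypothetical; the quality score stands for whatever criterion you scored in the prompt evaluation, such as a schema-valid rate.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality_score: float      # your own eval metric, higher is better
    p95_latency_ms: float     # measured in your region under load
    monthly_cost_usd: float   # estimated from your actual call shape


def select_cheapest(candidates: Iterable[Candidate],
                    min_quality: float,
                    max_p95_ms: float) -> Optional[Candidate]:
    """Cheapest candidate that clears both thresholds, or None if none does."""
    passing = [c for c in candidates
               if c.quality_score >= min_quality and c.p95_latency_ms <= max_p95_ms]
    if not passing:
        return None
    return min(passing, key=lambda c: c.monthly_cost_usd)
```

A `None` result is informative in itself: it means the thresholds and the shortlist are incompatible, and one of them has to change before any model gets picked.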

The harder question — one we don't have a clean answer to yet — is how to handle model deprecation gracefully. Every time a provider sunsets a model version, there's an evaluation and migration cycle. Building abstraction layers with something like LangChain or LiteLLM helps, but it doesn't eliminate the cost. If you have a system that's been stable on a specific model version for 18 months and you've never built in the ability to swap models, that's a real risk worth addressing before you're forced to.
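The core of such an abstraction layer is small, whichever library provides it. The sketch below is a hand-rolled illustration and does not use the LangChain or LiteLLM APIs; `ModelRouter` and its methods are hypothetical names, and each backend is just a function from prompt to completion text.

```python
from typing import Callable, Dict

Backend = Callable[[str], str]  # prompt in, completion text out


class ModelRouter:
    """Maps application roles (e.g. "routing") to swappable model backends."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}
        self._roles: Dict[str, str] = {}

    def register(self, model_id: str, backend: Backend) -> None:
        """Make a backend available under a model identifier."""
        self._backends[model_id] = backend

    def assign(self, role: str, model_id: str) -> None:
        """Point a role at a registered model; call again to migrate."""
        if model_id not in self._backends:
            raise KeyError(f"unknown model: {model_id}")
        self._roles[role] = model_id

    def complete(self, role: str, prompt: str) -> str:
        """Run the prompt on whichever model currently serves the role."""
        return self._backends[self._roles[role]](prompt)
```

Application code calls `complete("routing", prompt)` and never names a model, so a deprecation becomes one `assign` call plus a rerun of the evaluation, instead of a search through the codebase.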