About three weeks into production on MoveAI — a dispatching assistant Forgemind built for a mid-sized moving company — our Claude API bill had already doubled from the launch baseline. The product was working fine; dispatchers were running somewhere between 40 and 60 AI-assisted queries per shift, and each one was sending the same roughly 2,400-token system prompt without any caching in place.

That usage pattern is what pushed us to look seriously at prompt caching.

The Actual Problem: Tokens You're Paying For Twice

The dispatcher workflow on MoveAI is fairly narrow: someone types something like "3-bedroom move, fourth floor, no elevator, Arlington VA to Bethesda MD, crew of two" and the system returns a structured job brief, crew recommendations, and a time estimate. The system prompt encodes all the business logic — pricing tiers, crew availability rules, geographic zones, escalation conditions, edge case handling. It runs to roughly 2,400 tokens.

Each API call sent that full context cold. With Claude's pricing at the time (claude-3-5-sonnet-20241022, input tokens at $3 per million), that sounds manageable until you're running around 1,200 calls a day and the majority of those tokens are identical across every request. In our case, the repeated system prompt accounted for a substantial share of monthly input token spend — we estimated somewhere above 60%, based on average user-turn length relative to the fixed prompt size, though we didn't have clean per-field logging yet at that point.

TER.A Coffee ran into the same structural issue at a different scale. That product is a Telegram loyalty bot Forgemind built for a specialty coffee chain: it tracks visits, handles reward redemptions, and answers customer questions through a conversational interface. The LLM component handles the natural language layer — interpreting messages like "how many stamps do I have" or "what do I get at 10 visits." The system prompt there includes the reward program rules, tone guidelines, and a structured FAQ block. Smaller than MoveAI's at around 900 tokens, but it fires on every user message. TER.A was processing somewhere in the range of 2,000 to 3,000 messages a month across more than 325 active users (Replit deployment, effectively zero infrastructure cost, so the LLM spend was the dominant variable).

Both products had the same underlying architecture problem: static instructions being retransmitted on every call.

What Prompt Caching Actually Does (and What It Doesn't)

Anthropic's prompt caching feature, available on Claude 3.5 and later models, lets you mark a prefix of your prompt with a cache_control block. If the same prefix is sent again within the cache TTL — 5 minutes by default, 1 hour with the extended option — Anthropic charges at a reduced rate for those cached input tokens, roughly 90% less than standard input pricing per their published documentation, with a small cache write surcharge on first use.

The key constraint: the cached prefix has to be identical and has to appear at the same position in the messages array. Any change to the cached block — even whitespace — invalidates it. This matters more than it sounds.

The implementation itself is straightforward. In Python with the Anthropic SDK, you add "cache_control": {"type": "ephemeral"} to the content block you want cached:

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=conversation_history
)

Getting the syntax right takes an afternoon. Getting consistent cache hits in production takes more thought.

Where We Broke It (Twice)

On MoveAI, our first caching attempt failed silently for close to a week. Cache hit rate was near zero. The reason: we were dynamically injecting the current date into the system prompt for scheduling context. "Today is Monday, June 23, 2025." — that line sat at the top of the system prompt, which meant the cached prefix changed every day, and within a day it changed whenever the string was constructed with a slightly different format.

The fix was structural. We moved all dynamic context — date, dispatcher name, shift information — out of the system prompt entirely and into the first human turn as a structured preamble. The system prompt became static. Cache hit rate climbed to roughly 85% within the first day, which tracks with Anthropic's guidance that consistent request patterns within the TTL window are what drive meaningful cache utilization.

TER.A Coffee had a different version of the same problem. Because the system prompt was shorter, we'd deprioritized the caching work — and we'd also been parameterizing the prompt with the location name and its specific reward tier structure for each of the four active coffee shop locations. Each location had its own prompt variant, and none of those variants could benefit from cache hits against each other.

The fix was to consolidate the reward logic into a single universal prompt and pass location-specific data as a structured JSON block in the human turn. It required refactoring around 60 lines of prompt construction code and retesting all four location variants, but it meant we went from four cold prompts to one consistently cached one.

What the Numbers Looked Like

After both fixes were stable for a full billing cycle, here's what we saw.

For MoveAI, monthly input token spend dropped by roughly 61% compared to the pre-caching baseline. The cache write surcharge added back a small amount — under 4% of the original bill in our case — so net savings landed in the 57–60% range depending on the month's traffic pattern. At production dispatch volume, that's a meaningful reduction.

For TER.A Coffee, the proportional savings were similar but smaller in absolute dollars given the lower message volume. The more operationally significant outcome was cost predictability: the bot runs on a single low-cost Replit deployment with no autoscaling, and knowing that LLM costs scale sub-linearly with message volume means a promotional week at one location won't produce a surprise bill.

One thing we didn't anticipate going in: cache misses during off-peak hours. MoveAI runs in defined shift windows, so when it goes quiet for a few hours between shifts, the default 5-minute TTL expires. The first request after that gap pays full price. For a shift-based product, this is predictable and acceptable. For a product with irregular usage patterns, you'd want to factor in the extended cache option, or reconsider whether prompt caching is the right primary lever.

The Tradeoffs Worth Knowing Before You Start

Prompt caching is not free optimization. There are real constraints that should inform whether and how you implement it.

Your system prompt has to be stable. If your prompts are highly dynamic — personalized per user, updated frequently, or dependent on real-time state — caching will either not apply or require significant prompt architecture changes to isolate the static portions. On MoveAI, the refactor took about two days of engineering time. At lower volume, that investment might not pencil out.

You're trading flexibility for cost. Injecting context into the system prompt is often the easiest way to influence model behavior. When you move that context into the human turn instead, you're relying on the model to weight it appropriately, and you're adding complexity to your message construction logic. In our experience across both products, Claude handles this well — but it's worth verifying for your specific use case before shipping.

Cache hit rate depends on traffic shape. Low-volume products with sporadic usage will see inconsistent cache performance. TER.A Coffee's cache hit rate is lower than MoveAI's because users don't send messages in a dense enough pattern to reliably stay within the TTL window. The savings are still real, but they're harder to project in advance.

This is Anthropic-specific right now. OpenAI has its own prompt caching mechanism with different semantics and pricing. If you're running multi-provider setups or considering a provider switch, the implementation isn't portable without rework.

What We'd Do Differently

We'd instrument cache hit rate from day one. Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens in the usage object. We weren't logging those fields initially, which is why it took nearly a week to notice the cache wasn't hitting on MoveAI. Those fields belong in your observability stack before you ship caching to production — without them, you have no way to confirm the optimization is functioning.

We'd also do a prompt architecture review before implementing caching, not after. The refactor on both products was manageable, but asking "what in this prompt is actually static?" early would have saved time on both. In both cases, the answer surfaced context that was being injected into the system prompt out of convenience rather than necessity.

If you're building with Claude and your product has fixed business logic, regulatory text, persona definition, or rule sets baked into the system prompt, prompt caching is probably the highest-leverage cost optimization available right now. The implementation is not the hard part. The hard part is auditing your prompt construction code honestly enough to know whether your architecture is set up to benefit from it — and doing that audit before you flip the switch, not after a week of silent cache misses.