6 min read

Why My Agent Was Paying Full Price on Every Turn


Today started with a number that should have been higher: zero. I was looking at the cached input tokens coming back from OpenAI and seeing almost no reuse across turns. Every request was paying full price for the same instructions. That did not make sense for a chat agent that sends the same system prompt hundreds of times a day, so I started digging.

What I found

The agent prompt had grown into a single large block that covered every mode and every policy. Persona rules, tool guidance, memory rules, recipe studio instructions, refinement planning. All of it, all the time, for every request. A simple pantry lookup carried 1,800 tokens of instructions it did not need.

Worse, draft content was embedded inside the developer context. During a recipe refinement flow, the draft changes on every turn. Since OpenAI caches based on exact prefix match, any change in the developer context invalidated the entire cache. The static instructions that should have been reused were getting rebuilt from scratch because the volatile content was mixed in too early.

Then I went looking for where all these prompts lived. The main file was obvious. But there were also what I started calling “prompt islands,” instruction strings scattered across the codebase. A pantry constraints fragment here, a persona variant there, inline f-strings in tool files and generation services. Each one was reasonable in isolation, but together they meant there was no single place to understand what the model was being told.

What changed

I spent the afternoon with Claude working through this. The work was not complicated, but it touched a lot of files.

1) A prompt registry with composable fragments

Every prompt fragment now lives in a central registry under dot-notation keys. No more hunting through tool files to find instructions.

from typing import ClassVar


class PromptRegistry:
    _prompts: ClassVar[dict[str, str]] = {}
    _templates: ClassVar[dict[str, str]] = {}

    @classmethod
    def get(cls, key: str) -> str:
        """Get a fragment: PromptRegistry.get("persona.core")"""

    @classmethod
    def compose(cls, *keys: str, separator: str = "\n\n") -> str:
        """Compose fragments: PromptRegistry.compose("persona.core", "chef.memory_rules")"""

    @classmethod
    def render(cls, key: str, **context: str) -> str:
        """Render a template: PromptRegistry.render("modifier.pantry.strict", ingredients=...)"""

The registry lazy-loads on first access and registers everything in one pass: persona prompts, tool guidance, memory rules, 45 refinement prompts, modifier templates. About 925 lines total. That sounds like a lot for a single file, and it is. The alternative was those same lines spread across a dozen modules with no shared structure, so this was the better tradeoff for now. The next step is to split registration into per-domain modules (refinement prompts in one file, persona in another) that each register into the shared registry at init. That keeps the central lookup without the monolithic file.
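I have not done that split yet, but the shape I have in mind looks roughly like the sketch below. The module name, fragment keys, import path, and register_many helper are all hypothetical; the point is only that each domain registers into the shared registry at init.

# prompts/refinement.py (hypothetical module): one domain's fragments,
# registered into the shared registry instead of living in the monolith.
from app.prompts.registry import PromptRegistry  # import path is an assumption

REFINEMENT_FRAGMENTS: dict[str, str] = {
    "refinement.plan_step": "...",  # fragment text elided
    "refinement.confirm": "...",
}

def register() -> None:
    # Called once from the registry's lazy-load pass; register_many is a small
    # classmethod the registry would grow to accept a batch of fragments.
    PromptRegistry.register_many(REFINEMENT_FRAGMENTS)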

2) Mode-aware composition

Instead of sending everything, the system prompt is now composed per mode:

def compose_system_prompt(mode: AgentMode, has_memory_tools: bool, tool_count: int) -> str:
    parts = [PromptRegistry.get("persona.core")]

    if mode == AgentMode.RECIPE_STUDIO:
        parts.append(PromptRegistry.get("chef.recipe_studio"))
    elif mode == AgentMode.FULL_CHAT:
        parts.append(PromptRegistry.get("chef.refinement_planning"))

    if tool_count > 15:
        parts.append(PromptRegistry.get("chef.tool_guidance"))

    if has_memory_tools and mode != AgentMode.GENERAL_CHAT:
        parts.append(PromptRegistry.get("chef.memory_rules"))

    return "\n\n".join(parts)

The result is that a general chat turn costs about 550 tokens of system prompt. Recipe studio is around 800. Full chat with a large tool set is about 1,850. Before this, every mode paid the full 1,850.

3) Three-tier prompt separation for caching

This was the key insight. OpenAI’s Responses API caches based on prefix match, so anything that changes between turns has to go at the end, not the middle.

from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPayload:
    system_prompt: str          # Static instructions, same for all users
    developer_context: str      # User-specific: allergies, preferences, pantry summary
    draft_context: str | None   # Volatile, changes during refinement flows
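For concreteness, a refinement turn's payload might be assembled like this. The registry keys, tool count, and inline draft text are illustrative, not the real ones:

payload = PromptPayload(
    system_prompt=compose_system_prompt(AgentMode.RECIPE_STUDIO, has_memory_tools=True, tool_count=9),
    developer_context=PromptRegistry.compose("context.allergies", "context.pantry_summary"),
    draft_context="Current draft (v3):\n- Swap butter for olive oil\n- Reduce salt to 1 tsp",
)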

The provider structures the request to match:

# 1. instructions = system_prompt (stable across all users, cached)
request_params["instructions"] = system_prompt

# 2. First input = developer_context (stable per user within a conversation)
input_items.insert(0, {"role": "developer", "content": developer_context})

# 3. Middle = conversation messages
input_items.extend(conversation_messages)

# 4. Last = draft_context (volatile, AFTER messages to preserve prefix)
if draft_context:
    input_items.append({"role": "developer", "content": draft_context})

Before this change, the draft was merged into the developer context at position 2. Every refinement turn broke the prefix from that point forward. Now the stable content stays stable, and the draft sits at the end where it cannot invalidate anything above it.

Results

After deploying, cached input tokens started showing up immediately. The OpenAI Responses API returns both input_tokens and cached_tokens in the usage object, so I added a cache_hit_pct field to the structured logs to track the ratio per request.
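The math behind that field is trivial. Here is a minimal sketch, assuming the SDK surfaces cached tokens under usage.input_tokens_details.cached_tokens; the getattr guards are just defensive, and the logging call in the comment is only an example:

def cache_hit_pct(usage) -> float:
    # Share of input tokens served from the prompt cache, as a percentage.
    input_tokens = getattr(usage, "input_tokens", 0) or 0
    details = getattr(usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return round(100 * cached / input_tokens, 1) if input_tokens else 0.0

# e.g. logger.info("llm_turn", extra={"cache_hit_pct": cache_hit_pct(response.usage)})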

The first turn of a conversation pays full price since nothing is cached yet. But from turn 2 onward, the system prompt and developer context are identical, so they hit every time. Subsequent turns cache 50-60% of input tokens. The only uncached portion is the new user message and any draft content at the tail. Averaged across all turns in a conversation, it lands around 40%.

General chat runs higher because there is no volatile draft at the tail, so more of each request sits inside the cached prefix. Recipe refinement runs a bit lower because the draft changes every turn, but the stable prefix above it still caches.

Lessons learned

  • Caching is mostly about prompt shape, not vendor settings. The API already supports it. You just have to stop breaking the prefix.
  • Small prompt fragments are easier to reason about than one large prompt. And they compose in ways that help caching without you having to think about it.
  • Prompt islands are sneaky. You do not notice them until you go looking. A quick grep for multiline strings and f-strings with instruction-like language found most of mine.
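My version of that scan was not a literal grep; it was a small Python pass over the source tree, roughly like the sketch below. The source directory, keyword list, and length threshold are all assumptions to tune for your own codebase.

import ast
import re
from pathlib import Path

# Heuristic hunt for "prompt islands": long string constants (including the
# literal chunks of f-strings) whose contents read like model instructions.
INSTRUCTION_HINTS = re.compile(r"\b(you are|always|never|do not|respond with|use the)\b", re.I)

for path in Path("app").rglob("*.py"):
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            text = node.value
            if len(text) > 200 and INSTRUCTION_HINTS.search(text):
                print(f"{path}:{node.lineno}: possible prompt island ({len(text)} chars)")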

Design tradeoffs

  • Fragmented prompts require more curation, but they make intent clearer and reduce wasted tokens per request.
  • A central registry is one more file to maintain, but it eliminates the question of “where does this instruction live?”
  • Stable prefixes improve caching, but you have to be deliberate about where volatile context gets injected. The ordering constraint is easy to forget.

What is next

  1. Track cache hit rates by mode over a full billing cycle to see where there is more to gain
  2. Check fragment coverage so no instruction gaps slip in during mode composition (a rough check is sketched after this list)
  3. Move the remaining tool-level prompt builders to the registry where it makes sense
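The coverage check in item 2 will probably start life as a test along these lines. A sketch only: the import path is assumed, and treating memory rules as mandatory for every non-general mode is my expectation, not a settled rule.

from app.agent.prompts import AgentMode, PromptRegistry, compose_system_prompt  # assumed path

import pytest

@pytest.mark.parametrize("mode", list(AgentMode))
def test_mode_composition_has_no_gaps(mode):
    # Composing must not raise on a missing registry key, and every mode
    # should carry the core persona fragment.
    prompt = compose_system_prompt(mode, has_memory_tools=True, tool_count=20)
    assert PromptRegistry.get("persona.core") in prompt
    if mode != AgentMode.GENERAL_CHAT:
        assert PromptRegistry.get("chef.memory_rules") in prompt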

This is not glamorous work, but it is the kind of change that pays for itself on every request.