
From LLM Wrapper to Agentic UX: Practical Notes From Building One


Most teams start by inserting an LLM call into an existing feature. That works, but it often becomes a wrapper that produces text while the user does the work. I ran into the same pattern building Simmr, a recipe platform with an AI chef assistant, and had to unlearn it. The shift to agentic UX is less about model choice and more about system design. This write-up captures the lessons that stuck.

Outputs vs outcomes

A wrapper returns answers. An agentic UX completes steps.

The first version of Simmr’s recipe generation was a wrapper. The user typed “make me a chicken dinner,” the backend sent a prompt to the model, and the model returned a recipe as structured JSON. The user got a result, but the system did not know anything about the user. It did not check what was in their pantry. It did not know they hated cilantro. It did not save the recipe anywhere. The user had to do all of that manually.
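
For reference, the wrapper path amounted to a single completion call. A minimal sketch, assuming the OpenAI Python SDK; the model name, prompt, and function are placeholders rather than Simmr’s actual code:

from openai import OpenAI

client = OpenAI()

def generate_recipe(user_request: str) -> str:
    # One prompt in, one JSON blob out -- no pantry, no preferences, no saving.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Return a recipe as structured JSON."},
            {"role": "user", "content": user_request},
        ],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content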

The agentic version works differently. The same request now triggers a sequence:

  1. The agent calls get_pantry_items to see what the user actually has
  2. It calls get_user_memory to check for dietary restrictions and preferences
  3. It calls create_recipe_draft with a recipe that accounts for both
  4. The user reviews, asks for tweaks (“make it spicier”), and the agent calls refine_recipe_draft
  5. When the user is satisfied, the agent calls finalize_recipe_draft to save it

The difference is not cosmetic. In the wrapper version, the model produced text. In the agentic version, the system gathered context, made decisions, executed actions, and changed state. The user’s pantry, preferences, and recipe library were all involved without the user having to manage any of it.

If the system cannot change state, it is still a wrapper. That sounds obvious, but it is easy to miss when early prototypes look good in demos.

Lessons learned

1) Consistency beats cleverness

Agent systems break when every feature has its own prompts, tools, and logs. Early on, Simmr had separate code paths for streaming chat responses and non-streaming recipe generation. They shared some logic but diverged in subtle ways — different guardrail checks, different error handling, different ways of tracking token usage. When I added a new tool, I had to wire it up in two places. When a guardrail changed, I had to remember to update both paths.

I replaced all of it with a single event-driven engine. The core loop is straightforward:

while True:
    # 1. Check guardrails
    can_continue, reason = ctx.check_can_continue()
    if not can_continue:
        yield ContentCompleteEvent(content=timeout_message(reason))
        break

    # 2. Call LLM
    response = await self._runtime.complete(ctx, request)

    # 3. Emit content
    if response.has_content:
        yield ContentCompleteEvent(content=response.content)

    # 4. If no tool calls, done
    if not response.has_tool_calls:
        break

    # 5. Execute tools, feed results back, repeat
    for tool_call in response.tool_calls:
        yield ToolStartedEvent(tool_name=tool_call.name)
        record = await self._tool_executor.execute_tool(ctx, tool_call)
        yield ToolCompletedEvent(record=record)

Streaming and non-streaming are now adapters over the same engine. They consume the same events, enforce the same guardrails, and log the same way. Adding a new tool means registering it once. Changing a guardrail means changing it in one place.

That choice was more boring than clever, which is exactly why it worked. It reduced drift and made new features safer by default.
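
For illustration, the non-streaming adapter can be a thin coroutine that drains the same event stream and keeps only the content. A sketch using the event types above; engine.run is an assumed entry point for the loop shown earlier:

async def run_to_completion(engine, ctx, request) -> str:
    # Consume the same events the streaming adapter forwards to the client,
    # but return only the accumulated content as a plain response.
    parts: list[str] = []
    async for event in engine.run(ctx, request):  # assumed entry point
        if isinstance(event, ContentCompleteEvent):
            parts.append(event.content)
    return "".join(parts)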

2) Guardrails are part of the UX

Budget limits, timeouts, and tool permissions are user-facing controls, not backend plumbing. I learned this the hard way. Early versions had a flat token limit and no tool round cap. The agent would occasionally enter a loop — calling a search tool, not finding what it wanted, calling it again with different parameters, burning through tokens. The user saw a long spinner and then an abrupt error.

The fix was a multi-tier budget system that classifies each user message by intent:

INTENT_TURN_BUDGETS = {
    IntentLevel.LIGHT:    TurnBudget(max_tokens=2000, max_tool_rounds=1),
    IntentLevel.STANDARD: TurnBudget(max_tokens=5000, max_tool_rounds=2),
    IntentLevel.HEAVY:    TurnBudget(max_tokens=8000, max_tool_rounds=3),
}

A simple question like “what’s in my pantry?” gets a light budget — one tool round, 2k tokens. A complex request like “plan me a week of dinners using what I have” gets a heavy budget — three tool rounds, 8k tokens. The classification happens before the agent loop starts, so the system never over-allocates.
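
The wiring for that classification step is small. A sketch, where classify_intent is a hypothetical helper (a cheap model call or a heuristic) and the fallback keeps a misclassification from blocking the turn:

async def allocate_budget(message: str) -> TurnBudget:
    # Classify before the agent loop starts, then look up the matching budget.
    try:
        level = await classify_intent(message)  # hypothetical helper
    except Exception:
        level = IntentLevel.STANDARD            # safe default on classifier failure
    return INTENT_TURN_BUDGETS[level]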

On top of that, every agent run checks three hard limits before each iteration (sketched below):

  • Token budget: total input + output tokens consumed so far
  • Tool rounds: number of tool-calling iterations completed
  • Wall-clock time: elapsed seconds since the run started
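
A minimal sketch of that check; the context and guardrail field names are assumptions:

import time

class AgentContext:
    def check_can_continue(self) -> tuple[bool, str]:
        # Field names are illustrative; the real context tracks equivalents.
        if self.tokens_used >= self.guardrails.max_tokens_budget:
            return False, "token budget exhausted"
        if self.tool_rounds >= self.guardrails.max_tool_rounds:
            return False, "tool round limit reached"
        if time.monotonic() - self.started_at >= self.guardrails.max_seconds:
            return False, "time limit reached"
        return True, ""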

When any limit is hit, the agent exits gracefully with a message that acknowledges the work done rather than apologizing for a failure:

if "token" in reason.lower():
    return "I've gathered a lot of information to help you. "
           "Let me know if you'd like me to dig deeper."
if "time" in reason.lower():
    return "I want to make sure I respond quickly. "
           "Based on what I've learned so far, how can I help?"

The user sees a helpful pause, not a crash. They can continue the conversation and the agent picks up where it left off. Guardrails turned out to be a product feature, not a safety net.

3) Tools matter more than prompts

A good prompt without tools is still a wrapper. The pattern that kept working: read tools for context, write tools for actions, and a clear boundary between them.

Simmr has around 19 tools, and they split cleanly into reads and writes:

from abc import ABC

class AgentTool(ABC):
    name: str
    description: str
    is_write: bool = False          # Read by default
    require_confirmation: bool = False
    args_model: type | None = None  # Pydantic model for typed arguments

Read tools (get_pantry_items, search_recipes, get_user_memory) are safe — they gather context without side effects. Write tools (create_recipe_draft, add_pantry_item, save_user_memory) change state and carry additional restrictions.
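
A concrete read tool under this base class is mostly metadata plus one method. A sketch of what get_pantry_items could look like; the execute signature and pantry_service are assumptions:

class GetPantryItemsTool(AgentTool):
    name = "get_pantry_items"
    description = "List the items currently in the user's pantry."
    is_write = False   # read-only: safe in every mode
    args_model = None  # no arguments needed

    async def execute(self, ctx, args) -> ToolResult:
        # Gather context only; no state changes.
        items = await ctx.pantry_service.list_items(ctx.user_id)  # assumed service
        return ToolResult.ok({"items": [item.name for item in items]})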

The tool executor enforces this at runtime, not through prompts:

# In the executor, before running any tool
if tool.is_write and not ctx.guardrails.allow_write_tools:
    return record.marked_error("Write tool not allowed")

if not ctx.guardrails.is_tool_allowed(tool.name):
    return record.marked_error("Tool not available in this mode")

This matters because different modes need different tool sets. General chat gets read-only tools — the user is exploring, not creating. Recipe studio mode gets the full set including draft creation and refinement. The tool registry handles this through named sets:

GENERAL_CHAT_TOOLS = frozenset({
    "get_pantry_items", "search_pantry",
    "get_saved_recipes", "search_recipes",
    "get_user_memory", "substitution", "nutrition",
})

RECIPE_STUDIO_TOOLS = frozenset({
    *GENERAL_CHAT_TOOLS,
    "create_recipe_draft", "update_recipe_draft",
    "refine_recipe_draft", "finalize_recipe_draft",
    "save_user_memory", "forget_user_memory",
})

This also saves tokens. Each tool definition costs prompt space, so sending 8 tools instead of 19 saves roughly 4,400 tokens per request. The model also makes better decisions when it has fewer, more relevant options.

One more detail that paid off: typed arguments via Pydantic models. Each tool declares an args_model, and the executor validates arguments before execution. Bad arguments from the model get caught with a clear error message instead of a runtime exception deep in business logic.
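
The validation itself is a few lines in the executor. A sketch assuming Pydantic v2, with a hypothetical args model for add_pantry_item:

from pydantic import BaseModel, Field, ValidationError

class AddPantryItemArgs(BaseModel):
    name: str = Field(min_length=1)
    quantity: float = Field(gt=0)
    unit: str = "unit"

# In the executor, before the tool runs
if tool.args_model is not None:
    try:
        args = tool.args_model.model_validate(tool_call.arguments)
    except ValidationError as exc:
        # The model gets a structured error it can act on, not a stack trace.
        return record.marked_error(f"Invalid arguments: {exc.errors()}")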

4) Memory should be structured

Free-form memory leads to stale or conflicting context. The first version of Simmr’s memory was a single text blob — a summary string that got appended to after each conversation. It worked for a few turns, but it degraded fast. Preferences contradicted each other. Old information never got pruned. The model had no way to distinguish between something the user said once in passing and something they stated explicitly.

I replaced it with three tiers:

Profile memory is the durable summary — strong likes, strong dislikes, dietary notes, and cooking patterns. It gets injected into every conversation as stable context.

Normalized memory items are individual preferences with provenance. Each one has a category, a trust level, and an evidence trail:

class MemoryItem:
    category: MemoryCategory    # "cuisine", "abstract_like", "dietary_goal"
    value: str                  # "Thai food"
    value_normalized: str       # "thai" (for deduplication)
    asserted_by: AssertionType  # "inferred" | "user_explicit" | "user_confirmed"
    confidence: float           # 0.0 - 1.0
    evidence_snippet: str       # "I love Thai food"

The trust hierarchy matters. An inferred preference (“the user ordered Thai twice”) has lower trust than an explicit statement (“I love Thai food”), which has lower trust than a confirmed preference. When memory is full and something needs to be evicted, lower-trust items go first.

Capacity is bounded: 50 items per user, 15 per category, and a maximum of 3 saves per agent turn to prevent the model from dumping a list of inferences all at once. When a category hits its cap, the oldest item with equal or lower trust gets replaced. If all items in the category have higher trust, the save fails gracefully.
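
The eviction rule is easier to see in code than in prose. A sketch of the category-level decision; created_at is an assumed field, and the trust ordering follows the AssertionType values above:

TRUST_ORDER = {"inferred": 0, "user_explicit": 1, "user_confirmed": 2}

def pick_eviction(new_item: MemoryItem, category_items: list[MemoryItem]) -> MemoryItem | None:
    # Return the item to replace when the category is full, or None to reject the save.
    new_trust = TRUST_ORDER[new_item.asserted_by]
    candidates = [i for i in category_items if TRUST_ORDER[i.asserted_by] <= new_trust]
    if not candidates:
        return None  # everything present is higher trust: fail gracefully
    return min(candidates, key=lambda i: i.created_at)  # oldest eligible item goes first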

Session memory is the current conversation — messages, metadata, and context like “the last recipe we were discussing.” This enables pronoun resolution (“make that spicier”) without loading the user’s full history.

The prompt rules for when to save are explicit:

When to Save (ONLY explicit first-person statements):
- "I love garlic" → save
- "I hate cilantro" → save

When NOT to Save:
- Third-person: "My friend is vegetarian"
- Uncertain: "Maybe I should try..."
- Already known: check USER CONTEXT first

Structured memory is easier to validate and easier to prune. The model can still reason, but it has a stable baseline.

5) Agentic UX is not just chat

Chat can be useful, but Simmr’s most effective interactions are not conversational. They are structured flows that happen to be powered by an agent.

Recipe refinement is a good example. Instead of asking the user to describe what they want changed, the UI offers one-click refinements: “make it healthier,” “make it quicker,” “scale to 6 servings.” Each option maps to a refinement type with three intensity levels:

RefinementType.MAKE_HEALTHIER: {
    LIGHT:  "Reduce saturated fats by 20-30%...",
    MEDIUM: "Reduce calories by 30-40%. Use leaner proteins...",
    HEAVY:  "Transform into very healthy version. Reduce 50%+...",
}

There are 15 refinement types and 3 intensities each — 45 total refinement prompts. The user picks an option, the agent applies it to their draft, and the result shows up as a diff. No open-ended conversation needed. The agent is doing real work, but the UX is a button, not a text box.
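
Under the hood, the button click selects a canned instruction and hands it to the same agent path a chat message would use. A sketch with assumed names: REFINEMENT_PROMPTS stands in for the mapping sketched above, Intensity for the LIGHT/MEDIUM/HEAVY levels, and runtime.chat for the facade entry point shown later:

async def apply_refinement(runtime, user_id: str, draft_id: str,
                           refinement: RefinementType, intensity: Intensity):
    # Resolve the (type, intensity) pair to its instruction text...
    instruction = REFINEMENT_PROMPTS[refinement][intensity]
    # ...then run it through the same engine; only the entry point differs from chat.
    return await runtime.chat(
        user_id=user_id,
        message=f"Refine draft {draft_id}: {instruction}",
    )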

Meal planning works similarly. The user sets parameters (days, servings, variety level, pantry constraints), and the agent generates a structured plan. The variety modifier controls how adventurous the results are:

Balanced: at least 4 cuisines, no protein repeated more than twice
World Tour: different cuisine for every meal, rotate cooking methods

The pantry constraint modifier controls how strict the ingredient matching is — from “prefer what’s available” to “use only these exact ingredients, do not assume staples.”

The interface is less important than the outcome. A chat message, a button click, and a form submission can all trigger the same agent engine. What matters is that the system gathers context, executes actions, and returns a completed step.

Hurdles and fixes

Hurdle: adding more prompts to fix behavior

When the agent made a mistake — saving a preference it should not have, or ignoring pantry contents — the instinct was to add more instructions to the system prompt. “Never save preferences from third-party statements.” “Always check the pantry first.” The prompt grew into a wall of text covering every mode and every edge case.

The problem was not instruction quality. It was that one prompt was doing too much. Every request carried instructions for every mode, which wasted tokens and gave the model too many directives to prioritize.

Fix: a prompt registry with composable fragments. Each mode composes only what it needs:

system_prompt = PromptRegistry.compose(
    "persona.core",
    "chef.tool_guidance",
    "chef.memory_rules",
)

The core persona stays stable and cacheable. Mode-specific rules get added on top. When a behavior needs fixing, the change goes into one fragment and affects only the relevant mode. This also improved caching — the static prefix stays identical across requests, and OpenAI’s Responses API caches based on exact prefix match.
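
The registry itself does not need to be clever. A minimal sketch, assuming fragments are registered once at startup and composed by key:

class PromptRegistry:
    _fragments: dict[str, str] = {}

    @classmethod
    def register(cls, key: str, text: str) -> None:
        cls._fragments[key] = text

    @classmethod
    def compose(cls, *keys: str) -> str:
        # Order matters: stable fragments first keeps the prefix cache-friendly.
        return "\n\n".join(cls._fragments[key] for key in keys)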

Hurdle: too many entry points

Early on, streaming chat, non-streaming generation, and single-shot utility calls (substitution lookups, nutrition info) each had their own path to the LLM. Guardrails were implemented slightly differently in each. A token limit change required updating three places. A new hook — say, content moderation before the LLM call — had to be wired up three times.

Fix: a runtime facade that every entry point goes through.

class RuntimeFacade:
    async def chat(self, user_id, message, ...):
        auth = await self._ai_gateway.authorize_chat(user_id, message)
        if not auth.can_proceed:
            return AgentResponse(content=auth.denial_reason)

        guardrails = AgentGuardrails(
            max_tokens_budget=auth.turn_budget.max_tokens,
            max_tool_rounds=auth.turn_budget.max_tool_rounds,
            allowed_tools=auth.turn_budget.allowed_tools,
        )

        return await self._agent_service.run(guardrails=guardrails, ...)

Chat, generation, and utility calls all go through this facade. Authorization, budget allocation, guardrail construction, and post-run accounting happen in one place. The facade pattern also made quota refunds easy to implement — if the LLM call fails after quota was incremented, the facade handles the refund.
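
The refund is a good example of why the single entry point pays off: the logic only has to exist once. A sketch of the shape; the gateway method name is an assumption:

# Inside the facade, after quota has been counted for this turn
try:
    result = await self._agent_service.run(guardrails=guardrails)
except Exception:
    # The user should not pay for a turn that never produced a response.
    await self._ai_gateway.refund_quota(user_id)  # assumed gateway method
    raise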

Hurdle: unclear write boundaries

Without clear rules, the agent writes too early. The most common problem was memory saves. The agent would infer preferences from vague statements and save them — “you seem to like spicy food” after one request for hot sauce. Users did not know their profile was being modified, and the saved preferences sometimes contradicted later statements.

Fix: a layered defense on the write path.

First, the tool itself validates: only explicit first-person statements trigger a save, and the evidence snippet is stored for auditability. Second, rate limiting caps saves at 3 per turn, so the agent cannot dump a batch of inferences. Third, trust-based eviction means low-confidence saves get replaced first when capacity is reached. Fourth, the user can disable memory learning entirely via a settings toggle, and the tool checks that flag before every save:

if user and not user.memory_learning_enabled:
    return ToolResult.ok({"skipped": True, "reason": "user_disabled"})

The same principle applies to other write tools. Recipe drafts require explicit finalization. Pantry modifications use typed arguments that the executor validates before execution. The agent cannot bypass these checks because they are in the tool executor, not in the prompt.

Hurdle: big rewrites

The temptation after identifying all these problems was to rewrite everything at once. Rip out the old streaming code, the old generation code, the old memory system, and replace it all with the new architecture.

Fix: keep the runtime thin, migrate features one by one. I started with the engine — getting the event-driven loop working with the existing tools and prompts. Then I migrated tools to the registry one at a time, adding typed arguments and permission checks as I went. Memory was last because it touched the most state.

Each migration was a single PR that could be tested independently. The old code paths stayed alive until the new ones were proven. No feature was broken during the transition because there was always a working path.

A practical migration path

If you are building something similar, here is the order that worked for me:

  1. Unify the runtime. Get one engine that handles the tool loop, token tracking, and event emission. Streaming and non-streaming are adapters, not separate implementations.

  2. Centralize the tool registry. Every tool declares its name, schema, read/write classification, and permission requirements. The executor validates and enforces. No tool bypasses the registry.

  3. Make one workflow end-to-end agentic. Pick the most impactful one. For Simmr, it was recipe creation — from pantry check to draft to refinement to save. Get that working well before expanding.

  4. Add guardrails as a product feature. Budget by intent, cap tool rounds, set wall-clock timeouts, and write graceful exit messages. Test what the user sees when limits are hit.

  5. Structure the memory system. Separate durable preferences from session context. Add provenance tracking and capacity limits. Make sure the user can see and control what the system remembers.

  6. Expand carefully. Each new agentic flow should reuse the engine, the registry, and the guardrails. If it requires a special case, that is a signal to generalize the infrastructure instead.

Each step should improve reliability, not just capability.

What “agentic” means in practice

An agentic UX:

  • gathers context (pantry, preferences, history)
  • selects tools (read for context, write for actions)
  • executes actions (drafts, saves, refinements)
  • validates results (typed arguments, permission checks)
  • updates structured preferences (with provenance and trust)
  • respects limits (budgets, timeouts, capacity)

The model is not the product. The workflow is.