TECHNICAL GUIDE

Caching LLM Responses
for RAG Agents

How CacheCore cuts synthesis latency from 5 seconds to 19 milliseconds, without changing your prompting logic.

10 min read · RAG · Caching · Agents

The Problem

Every time a RAG agent answers a question, it sends a request to an LLM. The model generates an answer, your user sees it, and you pay for the tokens. So far, so normal. But if a second user asks the same question five minutes later, pointing at the same document corpus, the entire round-trip happens again. Same prompt tokens in, same answer out, same bill.

This adds up fast. Consider a compliance bot used by 200 analysts at a bank, or a customer support agent handling 50k queries a day for a SaaS product. In environments like these, question overlap is not the exception. It is the norm. A large share of your LLM spend ends up paying for answers you have already generated, sometimes hundreds of times.

The latency cost is just as real. Users wait 4 to 7 seconds for each synthesis call while the LLM regenerates something it already said yesterday. For agents embedded in workflows or serving real-time UIs, that wait is noticeable and it compounds across multi-step pipelines.

What CacheCore Does

CacheCore is a caching proxy that sits between your agent and your LLM provider. It intercepts outgoing completion requests, checks whether it has already seen this question (or one close enough in meaning), and returns the cached response when it can. When it can't, it forwards the request upstream, caches the result, and gets out of the way.

Your agent code barely changes. You point your HTTP client at CacheCore instead of api.openai.com, add one header for dependency tracking, and the rest works as before. CacheCore speaks the standard OpenAI /v1/chat/completions schema, so there is no new SDK to learn.

Under the hood, CacheCore uses two layers of matching. The first layer, L1, catches the exact same question asked again: identical wording, identical context. When it finds a match the answer comes back almost instantly, straight from the cache store. The second layer, L2, is more interesting. It catches questions that mean the same thing even when the words are different. If a user asks "What are the SAR filing requirements?" and another later asks "What is the FinCEN deadline for filing a suspicious activity report?", L2 recognises these as semantically equivalent and serves the cached response. Both layers skip the LLM entirely, consuming zero tokens.

L1 · Exact match      19 ms     Same question replayed verbatim
L2 · Semantic match   ~350 ms   Rephrased question, same intent
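To make the two-layer lookup concrete, here is a toy sketch of the control flow: hash the question for L1, fall back to embedding similarity for L2, and miss otherwise. The bag-of-words embedder and the 0.6 threshold are stand-ins; CacheCore's actual embedding model and similarity threshold are internal to the product.

```python
import hashlib

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = sum(x * x for x in a.values()) ** 0.5
    norm_b = sum(x * x for x in b.values()) ** 0.5
    return dot / (norm_a * norm_b)

class TwoLayerCache:
    def __init__(self, threshold=0.6):
        self.exact = {}       # L1: question fingerprint -> answer
        self.semantic = []    # L2: (embedding, answer) pairs
        self.threshold = threshold

    def _fingerprint(self, question):
        return hashlib.sha256(question.encode()).hexdigest()

    def put(self, question, answer):
        self.exact[self._fingerprint(question)] = answer
        self.semantic.append((embed(question), answer))

    def get(self, question):
        fp = self._fingerprint(question)
        if fp in self.exact:                 # L1: verbatim replay
            return ("HIT_L1", self.exact[fp])
        q = embed(question)
        for emb, answer in self.semantic:    # L2: same meaning, different words
            if cosine(q, emb) >= self.threshold:
                return ("HIT_L2", answer)
        return None                          # MISS: forward upstream
```

The ordering is the important part: the cheap exact check runs first, and the embedding comparison only happens when it misses.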

For comparison, a cold request with no cache and a full round-trip to OpenAI took between 4.5 and 6.6 seconds in our smoke test. That puts the improvement at roughly 9× for the slowest semantic hit and around 340× for the fastest exact hit, with no loss of answer quality.

Status    Synthesis Time     What Happens
MISS      4,534 – 6,571 ms   Full round-trip to LLM provider
HIT_L1    19 – 20 ms         Exact match, returned instantly
HIT_L2    257 – 495 ms       Semantic match, meaning close enough

Integrating into Your Agent

A CacheCore integration follows a two-phase pattern. In Phase 1, your agent retrieves documents from its vector store exactly as it does today. Nothing changes here. In Phase 2, instead of calling the LLM directly, you route the request through CacheCore and attach two extra pieces of information: a tenant token (for namespace isolation) and a list of dependency keys (so the cache knows which source documents this answer depends on, and can invalidate it if any of them change).

Key integration points
# 1. Build the user message with the question
user_message = (
    f"Policy context:\n{context}\n\n"
    f"Question: {question}"
)

# 2. Configure CacheCore
config = AgentConfig(
    system_prompt = SYSTEM_PROMPT,
    tenant_jwt    = self._tenant_jwt,     # identifies your tenant
    dep_keys      = dep_keys,             # e.g. ['doc:AML-002', 'doc:KYC-001']
    model         = SYNTHESIS_MODEL,
)

# 3. Call CacheCore (same endpoint as OpenAI)
result = await runner.run(
    messages = [{"role": "user", "content": user_message}],
    config   = config,
)

Under the hood, the runner sends a standard POST to CacheCore's /v1/chat/completions endpoint. Two headers do the heavy lifting. x-cachecore-token carries the tenant JWT, from which CacheCore derives the cache namespace. x-cache-deps carries the document IDs that ground this answer. If any of those documents are updated or deleted later, every cached answer that depends on them gets invalidated automatically.
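The invalidation semantics can be modelled in a few lines. This is an illustrative data structure, not CacheCore's implementation: each cached answer registers the dep keys it was grounded on, and a change to any source document evicts every answer that referenced it.

```python
from collections import defaultdict

class DependencyIndex:
    """Illustrative model of dependency-driven invalidation."""

    def __init__(self):
        self.answers = {}                   # cache_key -> cached answer
        self.dependents = defaultdict(set)  # dep_key -> cache_keys grounded on it

    def put(self, cache_key, answer, dep_keys):
        self.answers[cache_key] = answer
        for dep in dep_keys:
            self.dependents[dep].add(cache_key)

    def on_document_changed(self, dep_key):
        """Invoked when a source document is updated or deleted.
        Returns the number of cached answers evicted."""
        evicted = 0
        for key in self.dependents.pop(dep_key, set()):
            if self.answers.pop(key, None) is not None:
                evicted += 1
        return evicted
```

The practical consequence: stale answers disappear the moment their sources change, so a cache hit is always grounded in current documents.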

The response comes back in the same OpenAI format your code already expects, with one addition: an X-Cache header telling you whether the response was served from L1, L2, or went to the upstream provider. You can log this for observability without changing any parsing logic.
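Logging that header might look like the sketch below. Header casing varies by HTTP client, so it is safest to match case-insensitively; the status values shown in the test are the ones that appear in the smoke-test output in this guide.

```python
def cache_status(response_headers):
    """Extract the X-Cache header from a CacheCore response for logging.
    Returns "UNKNOWN" if the header is absent (e.g. a non-CacheCore reply)."""
    for name, value in response_headers.items():
        if name.lower() == "x-cache":
            return value
    return "UNKNOWN"
```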

Keep the user message lean. This is the single most important design rule for getting good L2 hit rates. CacheCore uses the user message content to compute semantic similarity between requests. If you stuff large document bodies into the user turn alongside the question, the document text dominates the embedding and two identical questions will look different to the semantic layer. Put your retrieved context in the system prompt or in a clearly separated prefix, and keep the actual question as clean as possible.
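One way to apply the rule is to restructure the message builder from the integration snippet so the retrieved context moves into the system turn and the user turn carries only the question. A sketch, assuming your synthesis prompt tolerates context in the system message:

```python
SYSTEM_PROMPT = "Answer compliance questions using the provided policy context."

def build_messages(question, context):
    """Keep the user turn to the bare question so that two identical
    questions embed identically, regardless of which chunks were retrieved."""
    return [
        # Retrieved context rides along in the system turn...
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nPolicy context:\n{context}"},
        # ...leaving the user turn clean for semantic matching.
        {"role": "user", "content": question},
    ]
```

With this layout, two users asking the same question produce byte-identical user turns even when retrieval returns slightly different chunks.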

Proof: a Live Smoke Test

We ran a 10-case test suite against a live CacheCore instance. The agent behind it answers compliance questions by retrieving policy documents from a vector store and synthesising answers via OpenAI. The suite covers four scenarios, and they run in strict sequence because each one builds on the cache state created by the previous one.

Cold cache. Three compliance questions are asked for the first time. CacheCore has no stored data, so each request gets forwarded to OpenAI. The responses are cached and the dependency keys are registered. Synthesis latencies ranged from 4.5 to 6.6 seconds, reflecting the full upstream round-trip.

Cold miss, first query
  Q:        What are the requirements for filing a Suspicious Activity Report?
  status:   MISS   retrieve=327ms   synthesis=4,534ms

Exact replay. The same three questions are replayed verbatim, from the same tenant. CacheCore matches each request on the full message fingerprint and returns the stored answer without contacting OpenAI at all. Synthesis collapses to 19–20 ms across the board. That is a 238× improvement over the cold call for the SAR question.

L1 hit, identical question
  Q:        What are the requirements for filing a Suspicious Activity Report?
  status:   HIT_L1   retrieve=63ms   synthesis=19ms

Semantic replay. Now the interesting part. Each question is rephrased while keeping the same intent. "What are the requirements for filing a Suspicious Activity Report?" becomes "What is the FinCEN deadline for filing a suspicious activity report?". Different words, overlapping meaning. CacheCore's L2 layer embeds the new question, finds a match above the similarity threshold, and serves the cached answer. Synthesis times land between 257 and 495 ms, and zero tokens are consumed from the LLM provider.

L2 hit, rephrased question
  Q:        What is the FinCEN deadline for filing a suspicious activity report?
  status:   HIT_L2   retrieve=40ms   synthesis=495ms

Cross-tenant isolation. Finally, the exact same SAR question from the first scenario is sent again, but this time from a different tenant (alpha_fund instead of acme_bank). Even though the question is identical and the answer already exists in the cache, CacheCore treats this as a fresh request because it operates in a completely separate namespace. The result is a clean miss and a full upstream call.

Tenant isolation, same question, different tenant
  Q:        What are the requirements for filing a Suspicious Activity Report?
  tenant:   alpha_fund
  status:   MISS   retrieve=35ms   synthesis=4,604ms

Final results
  Cold cache             3/3 passed
  Exact replay           3/3 passed
  Semantic replay        3/3 passed
  Tenant isolation       1/1 passed

  Overall: 10/10 passed

Tenant Isolation

Every tenant's cached data lives in its own namespace, derived from the identity token attached to each request. Tenant A's queries, answers, and embeddings are invisible to Tenant B. They cannot be searched, matched, or returned, even for byte-identical questions. This is not a configuration option you enable; it is a property of every cache lookup by default.
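A minimal model of the lookup makes the guarantee obvious: when the namespace is part of every cache key, a tenant physically cannot retrieve another tenant's entries. The SHA-256 derivation below is a stand-in; how CacheCore actually derives namespaces from the JWT is internal to the product.

```python
import hashlib

def namespace_for(tenant_token):
    # Stand-in derivation; CacheCore's JWT-to-namespace mapping is internal.
    return hashlib.sha256(tenant_token.encode()).hexdigest()[:16]

class NamespacedCache:
    def __init__(self):
        self.store = {}  # (namespace, question) -> answer

    def put(self, tenant_token, question, answer):
        self.store[(namespace_for(tenant_token), question)] = answer

    def get(self, tenant_token, question):
        # The namespace is part of the key, so a lookup can only ever
        # see entries written under the same tenant identity.
        return self.store.get((namespace_for(tenant_token), question))
```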

For teams building multi-tenant SaaS products or running enterprise deployments with business-unit boundaries, this is table stakes. One customer's cached data never leaks to another, and you don't need to write any application-level logic to enforce it. The isolation test in the smoke suite exists specifically to verify this property on every run: same question, same documents, same model, different tenant, guaranteed miss.

When to Use CacheCore

High-traffic agents where users ask overlapping questions. Compliance bots, customer support assistants, internal knowledge bases. The more question overlap your workload has, the higher your cache hit rate, and the bigger the savings on both latency and token spend.

Multi-tenant platforms that need hard isolation between customers. CacheCore enforces it at the cache layer with no extra work on your side.

Cost-sensitive teams paying per-token to OpenAI, Anthropic, or Bedrock. Every cache hit is a request that consumes zero tokens. At a 60%+ combined hit rate, you are cutting both latency and LLM costs by more than half.
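A back-of-envelope check of that claim, using rough midpoint latencies from the smoke test in this guide. The 40/20 split between exact and semantic hits is an assumed workload mix, not a measured one:

```python
def expected_synthesis_ms(l1_rate, l2_rate, miss_ms=5500.0, l1_ms=20.0, l2_ms=350.0):
    """Expected per-request synthesis latency for a given hit-rate mix.
    Default latencies are rough midpoints from the smoke test."""
    miss_rate = 1.0 - l1_rate - l2_rate
    return l1_rate * l1_ms + l2_rate * l2_ms + miss_rate * miss_ms

# With a 60% combined hit rate split 40% exact / 20% semantic:
#   0.4 * 20 + 0.2 * 350 + 0.4 * 5500 = 8 + 70 + 2200 = 2278 ms,
# under half the ~5500 ms cold baseline, while token spend drops by the
# full 60%, since every hit consumes zero tokens.
```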

In our smoke test, cold requests took 4.5 to 6.6 seconds. Exact matches returned in 19 ms. Semantic matches returned in under 500 ms.