Infographic: ten violet document queries on the left feed into a glowing hexagonal HyperRAG capsule. Eight emerald lines exit right to checkmark nodes (cache hits). Two amber lines curve down to a small GPU chip (the rare compute path).

HyperRAG · Public beta

Your KV cache exists.
We make it better.

Every major inference server ships with a KV cache. But default caching leaves most of that capacity on the table. HyperRAG is a drop-in optimization layer that helps your existing cache capture far more hits, automatically, from the first request.

Cache hit rate: 94.2%

Up to 6x faster


Four amber server towers each independently reprocessing the same document, showing redundant GPU compute across identical requests.

The problem

Your cache is working. Just not hard enough.

Every major inference server ships with a built-in KV cache. But the default implementation doesn't recognize when queries share the same documents, doesn't know which cached entries are worth keeping, and doesn't stop the model from sitting idle while retrieval runs. Your GPU ends up reprocessing context it has already seen.

Duplicate work on every request: when different users reference the same documents, each request rebuilds that context from scratch. The overlap goes unrecognized and you pay for the same compute repeatedly.

The cache drops what matters most: entries are evicted by recency, not by how often they are reused. At peak load, your most-requested context is the first to go, forcing expensive recomputes exactly when you can least afford them.

Generation stalls waiting on retrieval: the model sits idle until documents are fetched, even when shared context is already in the cache and ready to use. Latency compounds at scale.
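To make the eviction problem concrete, here is a minimal sketch that replays the same request trace through a recency-based cache and a frequency-aware one. The `simulate` function, the trace, and the "fewest accesses" policy are illustrative assumptions, not HyperRAG's actual eviction logic.

```python
from collections import Counter, OrderedDict


def simulate(trace, capacity, policy):
    """Replay a trace of context keys and return the number of cache hits.

    policy="lru" evicts the least recently used key.
    policy="lfu" evicts the resident key with the fewest accesses so far.
    Both policies are toy stand-ins chosen for illustration.
    """
    cache = OrderedDict()  # resident keys, least recently used first
    counts = Counter()     # accesses seen per key, kept even after eviction
    hits = 0
    for key in trace:
        counts[key] += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        if len(cache) >= capacity:
            if policy == "lru":
                cache.popitem(last=False)
            else:
                victim = min(cache, key=lambda k: counts[k])
                del cache[victim]
        cache[key] = True
    return hits


# One popular document ("hot") interleaved with bursts of one-off documents.
trace = ["hot", "hot", "hot", "a", "b", "c", "hot", "d", "e", "f", "hot"]
lru_hits = simulate(trace, capacity=3, policy="lru")
lfu_hits = simulate(trace, capacity=3, policy="lfu")
print(lru_hits, lfu_hits)
```

On this trace the recency-based cache evicts the popular entry during each burst, while the frequency-aware cache keeps it resident and serves the later requests from memory.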

A flow diagram showing three incoming requests merging into a central HyperRAG decision node, then splitting into a fast emerald cache-hit path and a heavier teal compute path that both reconverge at a final output.

How it works

Smarter decisions on every request.

HyperRAG sits in front of your inference API. It identifies shared document context across incoming requests and routes them to cached computation your framework already has, rather than reprocessing the same context from scratch each time.
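A rough sketch of the routing idea, assuming requests are matched by an ordered list of document IDs. `context_key`, `Router`, and `fake_prefill` are hypothetical names for illustration; the page does not describe how HyperRAG identifies shared context internally.

```python
import hashlib


def context_key(doc_ids):
    """Stable key for an ordered list of document IDs.

    Order is kept on purpose: cached attention state depends on token order,
    so ["d1", "d2"] and ["d2", "d1"] are treated as different contexts here.
    """
    canonical = "\n".join(doc_ids)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Router:
    """Send a request down the cache path or the compute path."""

    def __init__(self):
        self.cache = {}  # context key -> cached computation (placeholder value)
        self.hits = 0
        self.misses = 0

    def route(self, doc_ids, compute):
        key = context_key(doc_ids)
        if key in self.cache:
            self.hits += 1
            return "cache", self.cache[key]
        self.misses += 1
        self.cache[key] = compute(doc_ids)
        return "compute", self.cache[key]


calls = []


def fake_prefill(doc_ids):
    # Stand-in for the expensive GPU prefill over the documents.
    calls.append(tuple(doc_ids))
    return f"kv-for-{len(doc_ids)}-docs"


router = Router()
paths = [router.route(ids, fake_prefill)[0]
         for ids in (["d1", "d2"], ["d1", "d2"], ["d3"])]
print(paths, router.hits, router.misses)
```

The second request shares its documents with the first, so it takes the cache path and the prefill stand-in runs only twice for three requests.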

A single luminous violet cache tower at the center, with twelve curved fiber lines radiating outward to small green and violet endpoint nodes, representing one source serving many concurrent requests.

What you get

More from the cache you already have.

Cache hits compound: repeated document context is recognized and reused across concurrent requests, not just sequential ones.

Popular entries stay resident: smarter eviction keeps the context your users hit most in memory when traffic spikes.

Generation starts sooner: prefill overlap cuts time-to-first-token. Your GPU stops idling while retrieval finishes.
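Reuse across concurrent requests can be sketched with a "single-flight" pattern: when several requests for the same context arrive while it is still being computed, they wait on one shared computation instead of each starting their own. The `SingleFlight` class and `slow_prefill` below are illustrative assumptions, not part of the HyperRAG API.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent requests for the same key into one computation."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task currently computing it
        self._done = {}      # key -> finished result

    async def get(self, key, compute):
        if key in self._done:
            return self._done[key]
        task = self._inflight.get(key)
        if task is None:
            # First request for this key starts the work; later ones join it.
            task = asyncio.create_task(self._run(key, compute))
            self._inflight[key] = task
        return await task

    async def _run(self, key, compute):
        try:
            result = await compute(key)
            self._done[key] = result
            return result
        finally:
            self._inflight.pop(key, None)


async def demo():
    calls = []

    async def slow_prefill(key):
        calls.append(key)
        await asyncio.sleep(0.01)  # stand-in for GPU prefill work
        return f"kv:{key}"

    sf = SingleFlight()
    results = await asyncio.gather(
        *(sf.get("doc-set-A", slow_prefill) for _ in range(5))
    )
    return results, calls


results, calls = asyncio.run(demo())
print(len(results), "responses from", len(calls), "prefill call")
```

Five concurrent requests for the same document set all receive the result of a single prefill call.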

Use cases

Works across every RAG workload.

High-volume APIs

Shared document context across thousands of requests per minute. HyperRAG turns overlap into speedup.

Enterprise search

Internal queries return to the same pages repeatedly. Popular content stays fast as traffic scales.

Agentic workflows

Multi-step agents revisit documents across reasoning steps. Each revisit hits the cache, cutting cumulative latency.

Long-context inference

Similar long prompts share substantial context. HyperRAG reuses cached computation and cuts prefill cost.
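For the long-context case, a simple way to estimate how much prefill could be skipped is to measure the prefix two prompts share. This sketch assumes reuse happens by matching token prefixes, a common form of KV reuse in serving frameworks; the page does not state how HyperRAG matches context, and the token lists below are made up.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_fraction(cached_tokens, new_tokens):
    """Share of the new prompt whose prefill could be served from cache."""
    if not new_tokens:
        return 0.0
    return shared_prefix_len(cached_tokens, new_tokens) / len(new_tokens)


# Two prompts over the same document that differ only in the question.
cached = ["<doc>", "policy", "text", "</doc>", "Q:", "refund", "window?"]
new = ["<doc>", "policy", "text", "</doc>", "Q:", "shipping", "cost", "?"]
fraction = reusable_fraction(cached, new)
print(shared_prefix_len(cached, new), fraction)
```

The longer the shared document is relative to the question, the closer this fraction gets to 1 and the less prefill work is left per request.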

Model support

Built for production models.

Works with popular open-weight model families out of the box, from 1B to 405B. Custom models and private deployments are supported. HyperRAG's cache optimization is model-agnostic: it works at the serving layer, not inside the model.

LLaMA 3.x: 1B–405B

Mistral: 7B, Nemo 12B

Gemma 2: 2B, 9B, 27B

Qwen 2.5: 7B, 14B, 72B

DeepSeek R1: 7B, 70B

Phi-3: Mini, Medium

Integration

On top of your stack. Not instead of it.

HyperRAG mounts in front of your existing inference API. No configuration, no cluster changes. Start getting cache hits on the first query.

View on PyPI

No model changes. No migration. No cluster reconfiguration.

Smarter caching starts on the first query. Gains visible immediately.

quickstart.py Python

from hyperrag import HyperRAG

hr = HyperRAG()
ctrl = hr.deploy()

# Query as usual - HyperRAG optimizes automatically
r = ctrl.query(
    text="What is transformer attention?",
    doc_ids=["d1", "d2"],
)

# Metrics available from first request
print(r.latency_ms, "ms")
print(ctrl.metrics())

Benchmarks

Up to 6x faster, measured.

1,000 queries per model across four RAG workload patterns. Tested on 14 production models from 2B to 120B parameters.

Models tested: 14 (2B to 120B params)

Avg TTFT reduction: 54% across all 14 models

Avg throughput gain: 2.4× (tokens per second)

6.27× faster than baseline (best workload)

HyperRAG vs. baseline · latency in ms (lower is better)

High-volume: 5.76× faster (9.8 ms vs. 56.4 ms)

Long docs: 5.46× faster (10.4 ms vs. 56.8 ms)

Multi-step: 5.49× faster (10.4 ms vs. 57.1 ms)

Complex pipes: 6.27× faster (9.1 ms vs. 57.1 ms)

4 representative models shown. Averages computed across all 14 tested models. Results will vary by workload and hardware.
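The speedup figures follow directly from the latency pairs: each one is the baseline latency divided by the HyperRAG latency. The short check below recomputes them, assuming the first number in each pair is the HyperRAG latency and the second is the baseline. It also converts the 54% average TTFT reduction into the equivalent speedup factor.

```python
# workload: (HyperRAG ms, baseline ms, speedup stated on the page)
results = {
    "High-volume": (9.8, 56.4, 5.76),
    "Long docs": (10.4, 56.8, 5.46),
    "Multi-step": (10.4, 57.1, 5.49),
    "Complex pipes": (9.1, 57.1, 6.27),
}


def speedup(optimized_ms, baseline_ms):
    """How many times faster the optimized path is than the baseline."""
    return baseline_ms / optimized_ms


computed = {
    name: round(speedup(hyper, base), 2)
    for name, (hyper, base, _) in results.items()
}

for name, (_, _, stated) in results.items():
    print(f"{name}: computed {computed[name]}x, stated {stated}x")

# A 54% reduction leaves 46% of the original time, i.e. 1 / 0.46 as a factor.
ttft_equivalent = round(1 / (1 - 0.54), 2)
print("54% TTFT reduction is about", ttft_equivalent, "x faster")
```

Note the difference between the two kinds of figure: the per-workload ratios are best-case numbers for the models shown, while the 54% average across all 14 models corresponds to a bit over 2× on time-to-first-token.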

Your stack stays. Your latency drops.

HyperRAG is available now on PyPI. Drop it in front of your inference API and start seeing the difference immediately.

Talk to us