HyperRAG:
KV cache optimization for RAG serving.
Reduce time-to-first-token by up to 2×. HyperRAG combines a prefix-trie KV cache, PGDSF eviction, speculative pipelining, and Pareto schedule search into one serving optimization layer.
Up to 2×
Faster TTFT
~180
Schedules on Pareto frontier
±5%
Latency prediction accuracy
Three steps from workload to production
HyperRAG sits between your retriever and your LLM. Plug in your existing pipeline, get optimized serving out.
Describe your workload
Pick a paradigm (hyperscale, long-context, iterative, or rewriter-reranker), specify your model and GPU budget. RAGOptimizeConfig validates and structures your constraints.
Run Pareto schedule search
rago.optimize() sweeps ~180 candidate configurations across GPU count, batch size, and cache hit rate targets. Returns the non-dominated schedule that minimizes TTFT for your QPS target.
Serve with KV caching
rago.build_controller() returns a RAGServeController ready to process queries with prefix reuse, PGDSF eviction, and speculative pipelining active.
Research-backed algorithms
Every core decision in HyperRAG is grounded in peer-reviewed systems research. The scheduling layer jointly optimizes latency and throughput across hardware configurations¹, while the caching layer predicts which knowledge fragments will be reused and holds them in memory before the next query arrives².
KnowledgeTree
Prefix trie. A multi-level trie that stores transformer KV attention tensors per document. When two queries share retrieved documents, the cached KV state for the shared prefix is reused instead of being recomputed. Nodes span GPU HBM (L1) and host DRAM (L2) tiers.
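As a rough illustration of the KnowledgeTree idea, the sketch below keys a trie on retrieved document IDs and returns the longest cached prefix a new query can reuse. The names (`TrieNode`, `kv_state`) are hypothetical, not the HyperRAG API, and a string stands in for the real KV tensors:

```python
# Illustrative prefix trie over retrieved document IDs (not the HyperRAG API).
class TrieNode:
    def __init__(self):
        self.children = {}    # doc_id -> TrieNode
        self.kv_state = None  # cached KV tensors for the prefix ending here

class KnowledgeTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, doc_ids, kv_state):
        """Cache KV state under the full document-ID sequence."""
        node = self.root
        for d in doc_ids:
            node = node.children.setdefault(d, TrieNode())
        node.kv_state = kv_state

    def longest_cached_prefix(self, doc_ids):
        """Return (#docs covered, kv_state) for the longest cached prefix."""
        node, best = self.root, (0, None)
        for i, d in enumerate(doc_ids):
            if d not in node.children:
                break
            node = node.children[d]
            if node.kv_state is not None:
                best = (i + 1, node.kv_state)
        return best

trie = KnowledgeTrie()
trie.insert(["d1", "d2"], kv_state="kv(d1,d2)")
# A new query retrieving [d1, d2, d3] reuses the cached state for [d1, d2]
# and only prefills d3 from scratch.
print(trie.longest_cached_prefix(["d1", "d2", "d3"]))  # (2, 'kv(d1,d2)')
```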
PGDSF Eviction
Priority = Clock + (Freq × Cost) / Size. Priority-based Greedy Dual Size Frequency policy. Each cached node is scored by recency (clock), access frequency, prefill recomputation cost, and KV tensor size. Lowest-priority leaf nodes are demoted to host DRAM or evicted first.
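The formula above can be sketched in a few lines. This is a minimal illustration of the scoring rule, not HyperRAG's implementation; the field names and units are assumptions:

```python
# PGDSF-style scoring sketch: higher priority = keep longer.
from dataclasses import dataclass

@dataclass
class CacheNode:
    freq: int        # access count
    cost_ms: float   # prefill recomputation cost if evicted
    size_mib: float  # KV tensor size

def pgdsf_priority(node: CacheNode, clock: float) -> float:
    # Priority = Clock + (Freq x Cost) / Size: frequently used nodes that are
    # expensive to recompute and small to keep score high.
    return clock + (node.freq * node.cost_ms) / node.size_mib

nodes = {
    "hot_small": CacheNode(freq=10, cost_ms=50.0, size_mib=64.0),
    "cold_big":  CacheNode(freq=1,  cost_ms=50.0, size_mib=512.0),
}
# The lowest-priority node is demoted or evicted first.
victim = min(nodes, key=lambda k: pgdsf_priority(nodes[k], clock=100.0))
print(victim)  # cold_big
```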
Speculative Pipelining
Retrieve + prefill overlap. Starts prefill on cached documents immediately, while retrieval of remaining documents runs in parallel. New documents are merged into the prefill pass on arrival. Estimated savings: min(t_retrieve, t_prefill × 0.8).
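A back-of-envelope view of that savings estimate, assuming t_prefill is the prefill time on already-cached documents (the helper name is illustrative, not the HyperRAG API):

```python
# Sequential serving pays retrieval + prefill in full; speculative pipelining
# hides up to min(t_retrieve, 0.8 * t_prefill) behind the overlap.
def pipelined_ttft(t_retrieve_ms: float, t_prefill_ms: float) -> float:
    sequential = t_retrieve_ms + t_prefill_ms
    saved = min(t_retrieve_ms, t_prefill_ms * 0.8)  # overlap estimate
    return sequential - saved

# Retrieval (40 ms) fully hides behind prefill on cached docs (60 ms):
print(pipelined_ttft(40.0, 60.0))   # 60.0
# Retrieval dominates (100 ms vs 50 ms prefill): only 40 ms is hidden.
print(pipelined_ttft(100.0, 50.0))  # 110.0
```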
Pareto Schedule Search
~180 candidates. Enumerates GPU counts (1, 2, 4), batch sizes (1, 2, 4, 8, 16), cache hit rates (0 to 0.9), and placements (collocated, disaggregated, hybrid). Returns a Pareto frontier of 14 to 18 non-dominated configurations; no schedule in the frontier dominates another on both TTFT and QPS.
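The non-dominated filter at the heart of this search is simple to state. The sketch below (illustrative candidate dicts, not HyperRAG's schedule objects) keeps a schedule unless some other schedule is at least as good on both TTFT and QPS and strictly better on one:

```python
# Pareto frontier over (minimize ttft_ms, maximize qps).
def pareto_frontier(candidates):
    frontier = []
    for c in candidates:
        dominated = any(
            o["ttft_ms"] <= c["ttft_ms"] and o["qps"] >= c["qps"]
            and (o["ttft_ms"] < c["ttft_ms"] or o["qps"] > c["qps"])
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return frontier

cands = [
    {"gpus": 1, "ttft_ms": 300.0, "qps": 20.0},  # dominated by gpus=2
    {"gpus": 2, "ttft_ms": 200.0, "qps": 35.0},  # best QPS: kept
    {"gpus": 4, "ttft_ms": 150.0, "qps": 30.0},  # best TTFT: kept
    {"gpus": 4, "ttft_ms": 250.0, "qps": 30.0},  # dominated by gpus=2
]
print([c["gpus"] for c in pareto_frontier(cands)])  # [2, 4]
```

Note that both survivors are legitimate answers; which one `optimize()` returns depends on the QPS target.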
Four serving paradigms
Each paradigm captures a different system bottleneck. HyperRAG selects the right cost model and default tuning for each.
Hyperscale
Standard single-hop RAG at scale. Bottleneck is FAISS index scan, not the LLM. Default model: LLaMA 3.1 8B. Measured 1.09x speedup on 4x A100 at 1000 queries (Zipfian workload, alpha 1.1).
Long Context
1M+ token context where retrieval is skipped. Bottleneck is LLM prefill. Default model: LLaMA 3.1 70B. High cache hit rate (94%) with 4-GPU tensor parallelism delivers 9.02x speedup, reducing TTFT from 30.9 ms to 3.4 ms.
Iterative
Multi-hop and agentic retrieval. FAISS is called up to 4 times per query. Bottleneck is cumulative retrieval latency. Default model: LLaMA 3.1 70B. Speculative pipelining provides the largest relative improvement here.
Rewriter-Reranker
Query rewriting plus cross-encoder reranking. Bottleneck is both encoder and rewriter LLM. Default model: LLaMA 3.1 70B. Combined scheduler and cache optimization reduces TTFT from 649.2 ms to 339.7 ms, a 1.91x speedup.
Fits into any RAG pipeline
HyperRAG wraps your existing retriever. Pass doc_ids and doc_tokens from your retriever output. Integrates with LangChain and LlamaIndex in three lines.
Configure and optimize
Set your paradigm, model preset, and GPU/host budget. Call optimize() to get the best schedule and cache allocation for your workload.
Build the serving controller
build_controller() wires up the KnowledgeTree, multi-tier cache, request reordering, and speculative pipelining into one ready-to-call object.
Process queries and read metrics
Each QueryResult reports TTFT, total latency, cached token count, and whether speculative pipelining fired. ctrl.metrics() gives aggregate hit rate and GPU cache usage.
```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

# Step 1: Configure
rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="hyperscale",
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,
    host_budget_gb=16.0,
))

# Step 2: Find optimal schedule
result = rago.optimize()
print(result.summary())
# TTFT=243.6ms QPS/chip=47.0 hit_rate=82.0%
# gpus=4 batch=1 pareto_size=18

# Step 3: Build controller
ctrl = rago.build_controller()

# Step 4: Process a query
r = ctrl.process(Query(
    query_id="q1",
    text="What is transformer attention?",
    doc_ids=["d1", "d2"],
    doc_tokens=[512, 256],
))
print(f"TTFT={r.ttft_s*1000:.1f}ms hit={r.cache_hit}")

# Step 5: Aggregate metrics
m = ctrl.metrics()
print(f"hit_rate={m.hit_rate:.1%} gpu_mib={m.gpu_used_mib:.0f}")
```

17 built-in model presets
Roofline cost model parameters are pre-calibrated for all major open-weight LLMs. Custom model specs are accepted via LLMModel.custom().
All presets include layer count, GQA head configuration, and head dimension for accurate KV cache size estimation.
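Those three fields are exactly what the standard per-token KV-size formula needs. A quick sketch of the arithmetic (the helper below is illustrative, not part of the HyperRAG API; the LLaMA 3.1 8B numbers are its published architecture values):

```python
# Per-token KV cache size: 2 tensors (K and V) per layer, one per KV head.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# LLaMA 3.1 8B: 32 layers, 8 GQA KV heads, head_dim 128, fp16 (2 bytes).
per_tok = kv_bytes_per_token(32, 8, 128)
print(per_tok)                 # 131072 bytes = 128 KiB per token
print(per_tok * 1024 / 2**20)  # 128.0 MiB for a 1024-token document
```

GQA matters here: with 8 KV heads instead of the model's 32 query heads, the cache is 4× smaller than a naive per-head estimate would suggest.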
Benchmark Results
1,000 queries across four RAG serving paradigms.
| Paradigm | Baseline TTFT | With HyperRAG | Speedup |
|---|---|---|---|
| Hyperscale | 68.3 ms | 53.9 ms | 1.27× |
| Long Context | 68.1 ms | 54.9 ms | 1.24× |
| Iterative | 68.8 ms | 53.6 ms | 1.28× |
| Rewriter-Reranker | 68.3 ms | 53.6 ms | 1.27× |
Results are for text-only models. Multimodal and vision-language models are not included in this benchmark set.
Start optimizing your RAG stack
HyperRAG is available now on PyPI. The GPU extras require vLLM 0.4.0 and Python 3.10 or higher. Run the optimizer in simulation mode on any machine.
Talk to us

¹ Alnaasan et al., "RAGO: Retrieval-Augmented Generation Optimization," ISCA 2025. arXiv:2503.14649.
² Jin et al., "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation," ACM TOCS 2025. arXiv:2404.12457.