HyperRAG - Now in beta.

HyperRAG: KV cache optimization for RAG serving.

Reduce time-to-first-token by up to 2x. HyperRAG combines a prefix-trie KV cache, PGDSF eviction, speculative pipelining, and Pareto schedule search into one serving optimization layer.

pip install dv-hyperrag

Up to 2×

Faster TTFT

~180

Schedules on Pareto frontier

±5%

Latency prediction accuracy

Three steps from workload to production

HyperRAG sits between your retriever and your LLM. Plug in your existing pipeline, get optimized serving out.

Step 1

Describe your workload

Pick a paradigm (hyperscale, long-context, iterative, or rewriter-reranker), specify your model and GPU budget. RAGOptimizeConfig validates and structures your constraints.

Step 2

Run Pareto schedule search

rago.optimize() sweeps ~180 candidate configurations across GPU count, batch size, and cache hit rate targets. Returns the non-dominated schedule that minimizes TTFT for your QPS target.

Step 3

Serve with KV caching

rago.build_controller() returns a RAGServeController ready to process queries with prefix reuse, PGDSF eviction, and speculative pipelining active.

Research-backed algorithms

Every core decision in HyperRAG is grounded in peer-reviewed systems research. The scheduling layer jointly optimizes latency and throughput across hardware configurations [1], while the caching layer predicts which knowledge fragments will be reused and holds them in memory before the next query arrives [2].

KnowledgeTree

prefix trie

Multi-level prefix trie that stores transformer KV attention tensors per document. When two queries share retrieved documents, the cached KV state for the shared prefix is reused instead of recomputed. Nodes span GPU HBM (L1) and host DRAM (L2) tiers.
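The lookup logic can be sketched as a plain trie keyed by ordered document IDs. This is a minimal illustration, not the library's data structure: the class and field names (`KnowledgeTrie`, `TrieNode`, `kv`) are hypothetical, and the `kv` payload stands in for the real KV attention tensors.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Hypothetical sketch: one node per retrieved document in the prefix.
    # `kv` is a placeholder for the cached KV tensors of the real system.
    doc_id: str
    kv: object = None
    children: dict = field(default_factory=dict)


class KnowledgeTrie:
    """Minimal prefix trie keyed by ordered document IDs (illustrative only)."""

    def __init__(self):
        self.root = TrieNode(doc_id="<root>")

    def insert(self, doc_ids, kv_states):
        # Walk/extend the trie along the document order, attaching KV state.
        node = self.root
        for doc_id, kv in zip(doc_ids, kv_states):
            node = node.children.setdefault(doc_id, TrieNode(doc_id))
            node.kv = kv

    def longest_cached_prefix(self, doc_ids):
        """Return how many leading documents already have cached KV state."""
        node, hits = self.root, 0
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.kv is None:
                break
            node, hits = child, hits + 1
        return hits
```

A second query retrieving `["d1", "d2", "d3"]` after `["d1", "d2"]` was cached would skip prefill for the first two documents and only prefill `d3`.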

PGDSF Eviction

Priority = Clock + (Freq x Cost) / Size

Priority-based Greedy Dual Size Frequency policy. Each cached node is scored by recency (clock), access frequency, prefill recomputation cost, and KV tensor size. Lowest-priority leaf nodes are demoted to host DRAM or evicted first.
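The priority formula above translates directly into code. A minimal sketch, with hypothetical function names and a dict-based leaf set standing in for the real trie bookkeeping:

```python
def pgdsf_priority(clock, freq, cost, size):
    # Priority = Clock + (Freq x Cost) / Size, per the PGDSF formula above.
    # clock: cache-wide recency value; freq: access count;
    # cost: estimated prefill recomputation cost; size: KV tensor size.
    return clock + (freq * cost) / size


def pick_eviction_victim(leaves):
    # leaves: dict of node_id -> (clock, freq, cost, size).
    # The lowest-priority leaf is demoted to host DRAM or evicted first.
    return min(leaves, key=lambda k: pgdsf_priority(*leaves[k]))
```

Note how a frequently accessed node (`freq=5`) outranks a cold one of equal size and cost, so the cold node is evicted first.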

Speculative Pipelining

retrieve + prefill overlap

Starts prefill on cached documents immediately, while retrieval of remaining documents runs in parallel. New documents are merged into the prefill pass on arrival. Estimated savings: min(t_retrieve, t_prefill x 0.8).
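The overlap can be sketched with a thread pool: retrieval of missing documents runs in the background while prefill starts on the cached ones, and a merge pass follows on arrival. The function names are illustrative, not HyperRAG's internals; `retrieve` and `prefill` are caller-supplied callables.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_pipeline(cached_docs, missing_doc_ids, retrieve, prefill):
    # Illustrative sketch: overlap retrieval of missing documents with a
    # speculative prefill over the already-cached ones, then merge.
    with ThreadPoolExecutor(max_workers=1) as pool:
        fetch = pool.submit(retrieve, missing_doc_ids)  # runs in parallel
        prefill(cached_docs)                            # speculative prefill
        new_docs = fetch.result()                       # retrieval completes
    return prefill(cached_docs + new_docs)              # merge on arrival


def estimated_savings(t_retrieve, t_prefill):
    # Rule of thumb quoted above: min(t_retrieve, t_prefill x 0.8).
    return min(t_retrieve, t_prefill * 0.8)
```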

Pareto Schedule Search

~180 candidates

Enumerates GPU counts (1, 2, 4), batch sizes (1, 2, 4, 8, 16), cache hit rates (0 to 0.9), and placements (collocated, disaggregated, hybrid). Returns a Pareto frontier of 14 to 18 non-dominated configurations. No schedule in the frontier dominates another on both TTFT and QPS.
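The non-dominated filter at the heart of the search is a few lines. This sketch assumes each candidate evaluates to a (TTFT, QPS) pair, where lower TTFT and higher QPS are both better; the actual cost model behind each evaluation is HyperRAG's, not shown here.

```python
def pareto_frontier(candidates):
    # candidates: list of (ttft_ms, qps) pairs. A candidate is dominated if
    # some other candidate is at least as good on both axes (and differs).
    return [
        c for c in candidates
        if not any(o[0] <= c[0] and o[1] >= c[1] and o != c for o in candidates)
    ]
```

Applied to the ~180 evaluated schedules, this yields the 14 to 18 survivors: each remaining schedule is strictly better than every other survivor on at least one of the two axes.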

Four serving paradigms

Each paradigm captures a different system bottleneck. HyperRAG selects the right cost model and default tuning for each.

Hyperscale

paradigm="hyperscale"

Standard single-hop RAG at scale. Bottleneck is FAISS index scan, not the LLM. Default model: LLaMA 3.1 8B. Measured 1.09x speedup on 4x A100 at 1000 queries (Zipfian workload, alpha 1.1).

FAISS-bottlenecked · 8B default · 243.6 ms TTFT

Long Context

paradigm="long_context"

1M+ token context where retrieval is skipped. Bottleneck is LLM prefill. Default model: LLaMA 3.1 70B. High cache hit rate (94%) with 4-GPU tensor parallelism delivers 9.02x speedup, reducing TTFT from 30.9 ms to 3.4 ms.

9.02× speedup · 70B default · 3.4 ms TTFT

Iterative

paradigm="iterative"

Multi-hop and agentic retrieval. FAISS is called up to 4 times per query. Bottleneck is cumulative retrieval latency. Default model: LLaMA 3.1 70B. Speculative pipelining provides the largest relative improvement here.

Multi-hop · 70B default · FAISS ×4

Rewriter-Reranker

paradigm="rewriter_reranker"

Query rewriting plus cross-encoder reranking. Bottleneck is both encoder and rewriter LLM. Default model: LLaMA 3.1 70B. Combined scheduler and cache optimization reduces TTFT from 649.2 ms to 339.7 ms, a 1.91x speedup.

1.91× speedup · 70B default · 339.7 ms TTFT

Fits into any RAG pipeline

HyperRAG wraps your existing retriever. Pass doc_ids and doc_tokens from your retriever output. Integrates with LangChain and LlamaIndex in three lines.
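A minimal adapter might look like the following. This is a hypothetical helper, not part of HyperRAG: it converts generic retriever output (`(doc_id, text)` pairs, as you would extract from a LangChain or LlamaIndex retriever) into the `doc_ids`/`doc_tokens` lists that `Query` expects, and the default ~4-characters-per-token estimate is a stand-in for your model's real tokenizer.

```python
def to_hyperrag_docs(retrieved, count_tokens=None):
    # Hypothetical adapter: map retriever output to Query's doc fields.
    # retrieved: list of (doc_id, text) pairs.
    # count_tokens: callable giving token counts; defaults to a rough
    # 4-chars-per-token estimate -- swap in your model's tokenizer.
    if count_tokens is None:
        count_tokens = lambda text: max(1, len(text) // 4)
    doc_ids = [doc_id for doc_id, _ in retrieved]
    doc_tokens = [count_tokens(text) for _, text in retrieved]
    return doc_ids, doc_tokens
```

The two returned lists plug straight into `Query(doc_ids=..., doc_tokens=...)` as shown in the quickstart below.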

1

Configure and optimize

Set your paradigm, model preset, and GPU/host budget. Call optimize() to get the best schedule and cache allocation for your workload.

2

Build the serving controller

build_controller() wires up the KnowledgeTree, multi-tier cache, request reordering, and speculative pipelining into one ready-to-call object.

3

Process queries and read metrics

Each QueryResult reports TTFT, total latency, cached token count, and whether speculative pipelining fired. ctrl.metrics() gives aggregate hit rate and GPU cache usage.

quickstart.py
from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

# Step 1: Configure
rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="hyperscale",
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,
    host_budget_gb=16.0,
))

# Step 2: Find optimal schedule
result = rago.optimize()
print(result.summary())
# TTFT=243.6ms  QPS/chip=47.0  hit_rate=82.0%
#   gpus=4  batch=1  pareto_size=18

# Step 3: Build controller
ctrl = rago.build_controller()

# Step 4: Process a query
r = ctrl.process(Query(
    query_id="q1",
    text="What is transformer attention?",
    doc_ids=["d1", "d2"],
    doc_tokens=[512, 256],
))
print(f"TTFT={r.ttft_s*1000:.1f}ms  hit={r.cache_hit}")

# Step 5: Aggregate metrics
m = ctrl.metrics()
print(f"hit_rate={m.hit_rate:.1%}  gpu_mib={m.gpu_used_mib:.0f}")

17 built-in model presets

Roofline cost model parameters are pre-calibrated for all major open-weight LLMs. Custom model specs are accepted via LLMModel.custom().

LLaMA 3.x     1B, 3B, 8B, 70B, 405B
Mistral       7B, Nemo 12B
Gemma 2       2B, 9B, 27B
Qwen 2.5      7B, 14B, 72B
DeepSeek R1   7B, 70B
Phi-3         Mini 3.8B, Medium 14B

All presets include layer count, GQA head configuration, and head dimension for accurate KV cache size estimation.
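The estimate these parameters feed is the standard KV cache arithmetic: 2 tensors (K and V) per layer, sized by KV-head count and head dimension. A sketch using LLaMA 3.1 8B's published shapes (32 layers, 8 KV heads under GQA, head dimension 128); the function name is illustrative, not the library's API.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # 2 (K and V) x layers x KV heads x head dim x tokens x element size
    # (2 bytes for fp16/bf16). GQA keeps n_kv_heads below the query-head
    # count, shrinking the cache proportionally.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# LLaMA 3.1 8B: 2 * 32 * 8 * 128 * 2 bytes = 128 KiB of KV cache per token.
per_token = kv_cache_bytes(32, 8, 128, 1)
```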

Benchmark Results

1,000 queries across four RAG serving paradigms.

Paradigm            Baseline TTFT   With HyperRAG   Speedup
Hyperscale          68.3 ms         53.9 ms         1.27×
Long Context        68.1 ms         54.9 ms         1.24×
Iterative           68.8 ms         53.6 ms         1.28×
Rewriter-Reranker   68.3 ms         53.6 ms         1.27×

Results are for text-only models. Multimodal and vision-language models are not included in this benchmark set.

Start optimizing your RAG stack

HyperRAG is available now on PyPI. The GPU extras require vLLM 0.4.0 or later and Python 3.10 or later. The optimizer runs in simulation mode on any machine, no GPU required.

Talk to us

[1] Alnaasan et al., "RAGO: Retrieval-Augmented Generation Optimization," ISCA 2025. arXiv:2503.14649.

[2] Jin et al., "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation," ACM TOCS 2025. arXiv:2404.12457.