Blog
From the lab.
Research notes, engineering deep-dives, and infrastructure insights from the Deep Variance team.

Engineering
How VMM Stitching Recovers 65% of Wasted GPU Memory
A technical walkthrough of how Optimemory uses CUDA Virtual Memory Management (VMM) to stitch fragmented VRAM into contiguous virtual address ranges, eliminating allocation overhead.
2 min read
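To make the stitching idea concrete, here is a toy Python model under stated assumptions: physical VRAM is treated as fixed 2 MiB chunks (a hypothetical granularity), and the "mapping" is plain in-process bookkeeping. On a real GPU the equivalent steps would go through CUDA driver calls such as `cuMemCreate`, `cuMemAddressReserve`, `cuMemMap` and `cuMemSetAccess`. The class and function names below are illustrative only and are not Optimemory's API.

```python
# Toy model of VMM-style stitching. ASSUMPTIONS: 2 MiB chunks and pure
# bookkeeping; a real implementation would call CUDA driver VMM functions
# (cuMemCreate, cuMemAddressReserve, cuMemMap, cuMemSetAccess).
GRANULARITY = 2 * 1024 * 1024  # hypothetical mapping granularity in bytes


class PhysicalPool:
    """Physical VRAM modeled as fixed-size chunks, some already in use."""

    def __init__(self, num_chunks: int, used: set[int]):
        self.num_chunks = num_chunks
        self.free = set(range(num_chunks)) - used

    def largest_contiguous_free(self) -> int:
        """Bytes in the longest run of physically adjacent free chunks."""
        best = run = 0
        for i in range(self.num_chunks):
            run = run + 1 if i in self.free else 0
            best = max(best, run)
        return best * GRANULARITY


class StitchedRange:
    """A contiguous virtual range backed by possibly scattered physical chunks."""

    def __init__(self, va_base: int, chunks: list[int]):
        self.va_base = va_base
        self.page_table = chunks  # virtual page i -> physical chunk index

    def translate(self, vaddr: int) -> int:
        """Map a virtual address in the range to its physical byte offset."""
        offset = vaddr - self.va_base
        chunk = self.page_table[offset // GRANULARITY]
        return chunk * GRANULARITY + offset % GRANULARITY


def stitch(pool: PhysicalPool, size: int, va_base: int) -> StitchedRange:
    """Map enough free chunks, wherever they sit, behind one virtual range."""
    pages = -(-size // GRANULARITY)  # ceiling division
    if len(pool.free) < pages:
        raise MemoryError("not enough free physical chunks")
    chunks = sorted(pool.free)[:pages]
    pool.free -= set(chunks)
    return StitchedRange(va_base, chunks)


# Demo: every other chunk is in use, so no 8 MiB physically contiguous block exists.
pool = PhysicalPool(num_chunks=8, used={1, 3, 5, 7})
region = stitch(pool, size=8 * 1024 * 1024, va_base=0x7F0000000000)
```

In the demo, 8 MiB is free in total but the largest physically contiguous run is a single 2 MiB chunk, so an allocator that needs one contiguous physical block would fail. The stitched range instead exposes chunks 0, 2, 4 and 6 at consecutive virtual addresses.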

Research
FP8 Training: Achieving Near-Zero Perplexity Loss at Half the Memory
Our research into dual-format FP8 precision shows that running forward passes in E4M3 and backward passes in E5M2 maintains 99.9% of baseline accuracy while cutting memory in half.
1 min read
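The split between the two formats comes down to precision versus range. The Python sketch below decodes all 256 bit patterns under each layout (E4M3: 4 exponent bits, 3 mantissa bits, bias 7, no infinities; E5M2: 5 exponent bits, 2 mantissa bits, bias 15, IEEE-style inf/NaN) and rounds values onto each grid. The helper names are hypothetical, and the rounding is a simplified saturating round-to-nearest, not a hardware-exact cast or Deep Variance's training code.

```python
import math

# (exponent bits, mantissa bits, exponent bias, IEEE-style inf/NaN encodings)
E4M3 = (4, 3, 7, False)
E5M2 = (5, 2, 15, True)


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int, ieee_specials: bool) -> float:
    """Decode one 8-bit pattern under the given FP8 layout."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    top_exp = (1 << exp_bits) - 1
    if ieee_specials and exp == top_exp:
        return sign * math.inf if man == 0 else math.nan  # E5M2: inf / NaN
    if not ieee_specials and exp == top_exp and man == (1 << man_bits) - 1:
        return math.nan  # E4M3: only S.1111.111 is NaN; there are no infinities
    if exp == 0:
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)  # subnormal
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def finite_values(fmt) -> list[float]:
    """All distinct finite values representable in the format, sorted."""
    values = (decode_fp8(b, *fmt) for b in range(256))
    return sorted({v for v in values if math.isfinite(v)})


def quantize(x: float, fmt) -> float:
    """Saturating round-to-nearest onto the FP8 grid (tie-breaking simplified)."""
    return min(finite_values(fmt), key=lambda v: abs(v - x))


for name, fmt in (("E4M3", E4M3), ("E5M2", E5M2)):
    grid = finite_values(fmt)
    print(name, "max:", grid[-1], "smallest positive:", min(v for v in grid if v > 0))
```

E4M3 tops out at 448 but has one more mantissa bit, so it tracks activations and weights more finely (0.27 rounds to 0.28125 rather than E5M2's 0.25). E5M2 reaches 57344 and down to 2^-16, so a small gradient such as 1e-5 survives in E5M2 but flushes to zero in E4M3. At one byte per value, either format also uses half the storage of a 16-bit format.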

Product
Introducing HyperRAG: KV Cache Optimization for RAG Serving
HyperRAG combines prefix-trie KV caching, PGDSF eviction, speculative pipelining, and Pareto schedule search to deliver up to 2x faster time-to-first-token for RAG workloads.
1 min read
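Prefix-trie KV caching can be sketched in a few lines of Python. This assumes, for illustration, that a RAG request is keyed by its ordered list of retrieved document IDs and that each trie node stores a hypothetical handle to the KV tensors computed up to that document. It models only the lookup, not HyperRAG's PGDSF eviction, speculative pipelining, or schedule search.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    kv_handle: Optional[str] = None  # hypothetical reference to cached KV tensors


class PrefixKVCache:
    """Caches KV state per ordered sequence of retrieved documents."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, doc_ids: List[str], kv_handles: List[str]) -> None:
        """Record the KV handle produced after each document in the sequence."""
        node = self.root
        for doc_id, handle in zip(doc_ids, kv_handles):
            node = node.children.setdefault(doc_id, TrieNode())
            node.kv_handle = handle

    def longest_prefix(self, doc_ids: List[str]) -> Tuple[int, Optional[str]]:
        """Return (number of documents whose KV is reusable, handle for that prefix)."""
        node, matched, handle = self.root, 0, None
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.kv_handle is None:
                break
            node, matched, handle = child, matched + 1, child.kv_handle
        return matched, handle


cache = PrefixKVCache()
cache.insert(["system", "doc_a", "doc_b"], ["kv0", "kv1", "kv2"])
matched, handle = cache.longest_prefix(["system", "doc_a", "doc_c"])
```

Here the new request reuses the KV state for `system` and `doc_a`, so only `doc_c` and the user query need prefill. A trie fits better than a flat per-document map because a document's KV tensors depend on every token before it: the cached state for `doc_b` is valid only when it follows exactly `system` then `doc_a`.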