Blog

From the lab.

Research notes, engineering deep-dives, and infrastructure insights from the Deep Variance team.

How VMM Stitching Recovers 65% of Wasted GPU Memory

A technical walkthrough of how Optimemory uses CUDA Virtual Memory Management to stitch fragmented VRAM into contiguous address spaces, eliminating allocation overhead.

Apr 10, 2026 2 min read
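
As a rough illustration of the stitching idea in the teaser above, here is a toy Python model, not Optimemory's implementation. It only simulates the bookkeeping: scattered free physical chunks are mapped back-to-back into one contiguous virtual range. On a real GPU the equivalent steps would go through CUDA driver calls such as `cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, and `cuMemSetAccess`. The `GRANULE` size and the `stitch` helper are assumptions made for this sketch.

```python
# Toy model of VMM stitching: no GPU calls, only the mapping arithmetic.
GRANULE = 2 * 1024 * 1024  # assumed allocation granularity (2 MiB), illustrative only


def stitch(free_chunks, request_bytes):
    """Map non-adjacent free physical chunks into one contiguous virtual range.

    free_chunks: list of (physical_offset, size) tuples, sizes multiples of GRANULE.
    Returns a list of (virtual_offset, physical_offset, size) mappings,
    or None if the total free memory cannot cover the request.
    """
    # Round the request up to a whole number of granules.
    needed = -(-request_bytes // GRANULE) * GRANULE
    mappings = []
    virt = 0
    for phys, size in free_chunks:
        if virt >= needed:
            break
        take = min(size, needed - virt)
        # Each physical chunk lands directly after the previous one in virtual space.
        mappings.append((virt, phys, take))
        virt += take
    return mappings if virt >= needed else None
```

In this model a 7 MiB request succeeds even though no single free chunk is larger than 4 MiB, which is the situation where a plain contiguous allocator would fail.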

Research

FP8 Training: Achieving Near-Zero Perplexity Loss at Half the Memory

Our research into dual-format FP8 precision shows that pairing E4M3 forward passes with E5M2 backward passes maintains 99.9% accuracy while cutting memory use in half.

Apr 7, 2026 1 min read
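
To make the E4M3/E5M2 split above concrete, the sketch below computes the representable range of each format, assuming the commonly used OCP FP8 definitions (E4M3 without infinities, E5M2 with IEEE-style infinities). It is an illustration of the number formats only, not the training recipe from the post. The `Fp8Format` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp8Format:
    name: str
    exp_bits: int
    man_bits: int
    ieee_inf: bool  # True if the top exponent code is reserved for inf/NaN

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    def max_finite(self) -> float:
        top_exp = 2 ** self.exp_bits - 1
        if self.ieee_inf:
            # Top exponent code is reserved, so use the one below with a full mantissa.
            exp = top_exp - 1
            frac = 1 + (2 ** self.man_bits - 1) / 2 ** self.man_bits
        else:
            # Only the all-ones mantissa at the top exponent encodes NaN.
            exp = top_exp
            frac = 1 + (2 ** self.man_bits - 2) / 2 ** self.man_bits
        return frac * 2.0 ** (exp - self.bias)

    def min_subnormal(self) -> float:
        return 2.0 ** (1 - self.bias - self.man_bits)


E4M3 = Fp8Format("E4M3", exp_bits=4, man_bits=3, ieee_inf=False)
E5M2 = Fp8Format("E5M2", exp_bits=5, man_bits=2, ieee_inf=True)

for fmt in (E4M3, E5M2):
    print(fmt.name, fmt.max_finite(), fmt.min_subnormal())
```

Under these assumptions E4M3 has one more mantissa bit (finer precision, useful for weights and activations), while E5M2 covers a much wider range (useful for gradients, which vary more in magnitude).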

Product

Introducing HyperRAG: KV Cache Optimization for RAG Serving

HyperRAG combines prefix-trie KV caching, PGDSF eviction, speculative pipelining, and Pareto schedule search to deliver up to 2x faster time-to-first-token for RAG workloads.

Apr 1, 2026 1 min read
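
The prefix-trie part of the description above can be sketched in a few lines of Python. This is a minimal illustration, not HyperRAG's code: it stores opaque cache handles keyed by token sequences and returns the longest cached prefix of a new request, so only the remaining tokens would need prefill. The class names and the string handles are hypothetical, and eviction (PGDSF in the post) is not modeled.

```python
class PrefixTrieNode:
    def __init__(self):
        self.children = {}     # token id -> PrefixTrieNode
        self.kv_handle = None  # opaque reference to cached KV state, if any


class PrefixKVCache:
    def __init__(self):
        self.root = PrefixTrieNode()

    def insert(self, tokens, kv_handle):
        """Record that KV state for exactly this token sequence is cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixTrieNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        node = self.root
        best_len, best = 0, None
        for i, t in enumerate(tokens, start=1):
            node = node.children.get(t)
            if node is None:
                break
            if node.kv_handle is not None:
                best_len, best = i, node.kv_handle
        return best_len, best
```

A request that shares retrieved documents with an earlier one would hit a non-zero `matched_length` and skip recomputing that part of the prompt.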
