Optimemory is now available. Try the Python package.

Optimemory:
Hardware-aware memory virtualization.

Scale models up to 2.5x larger on the same hardware with research-backed virtual memory stitching. Optimemory is a specialized VMM layer that optimizes memory allocation across fragmented hardware.

pip install deep-variance

2.5x

Model scale increase

-65%

Memory allocation overhead reduced

+42%

Training stability uplift

Built for Every AI Workload

Optimemory is not tied to any model architecture. If it runs on PyTorch and CUDA, it benefits from VMM-backed memory pooling.

Large Language Models

GPT, LLaMA, Mistral, Megatron. Pre-allocate batch buffers once and reuse them across tens of thousands of training steps.

Vision Transformers

ViT, CLIP, DINO, SigLIP. Image patch buffers stay resident in a fixed VMM pool through the full training run.

Diffusion Models

Stable Diffusion, FLUX, DiT. Noisy latent buffers across denoising timesteps share the same physical VRAM pool.

Inference Servers

ResNet, BERT, EfficientNet. Pre-allocate I/O buffers at startup and serve every request from reusable VMM slots with zero allocation on the hot path.
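
The slot-reuse pattern for inference can be sketched in plain Python. This is only an illustration of the idea, not Optimemory's API: `SlotPool`, `acquire`, `release`, `available`, and `handle_request` are hypothetical names, and `bytearray` stands in for GPU buffers that a real server would obtain from `vmm_empty_nd` at startup.

```python
import queue
from typing import Optional


class SlotPool:
    """Fixed set of pre-allocated buffers handed out per request (illustrative)."""

    def __init__(self, num_slots: int, slot_bytes: int) -> None:
        # All allocation happens here, once, at startup.
        self._free: "queue.Queue[bytearray]" = queue.Queue()
        for _ in range(num_slots):
            self._free.put(bytearray(slot_bytes))

    def acquire(self, timeout: Optional[float] = None) -> bytearray:
        # Blocks until a slot is free; nothing is allocated on the hot path.
        return self._free.get(timeout=timeout)

    def release(self, slot: bytearray) -> None:
        self._free.put(slot)

    def available(self) -> int:
        return self._free.qsize()


def handle_request(pool: SlotPool, payload: bytes) -> int:
    slot = pool.acquire()
    try:
        if len(payload) > len(slot):
            raise ValueError("payload larger than slot")
        slot[: len(payload)] = payload    # copy request data into the reused slot
        return sum(slot[: len(payload)])  # stand-in for running the model
    finally:
        pool.release(slot)                # slot goes back for the next request
```

Because the pool is bounded, a burst of requests waits for a free slot instead of triggering new allocations, which is the trade-off the fixed-slot design makes.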

Hardware-Aware Memory for Any Architecture

Every deep learning workload is memory-bound. Optimemory decouples physical hardware limitations from model capacity by operating below the framework level, invisible to the model and the optimizer.

VMM Stitching Layer

Research-backed virtual memory stitching that presents fragmented physical VRAM as a contiguous address space.
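
To make the stitching idea concrete, here is a toy address-translation model in plain Python. It is a sketch under stated assumptions, not Optimemory's implementation: `StitchedRange` and its fields are hypothetical, and the physical addresses are made up. The CUDA driver exposes virtual memory management calls such as `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap` for this kind of mapping; whether Optimemory uses them in exactly this way is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StitchedRange:
    """Toy model: one contiguous virtual range backed by scattered chunks."""

    chunk_size: int
    physical_bases: Tuple[int, ...]  # hypothetical physical chunk addresses

    @property
    def size(self) -> int:
        return self.chunk_size * len(self.physical_bases)

    def translate(self, virtual_offset: int) -> int:
        """Map an offset in the contiguous view to a physical address."""
        if not 0 <= virtual_offset < self.size:
            raise IndexError("offset outside the stitched range")
        chunk_index, within = divmod(virtual_offset, self.chunk_size)
        return self.physical_bases[chunk_index] + within


MIB = 1 << 20
# Three non-adjacent 2 MiB chunks appear to the caller as one 6 MiB range.
view = StitchedRange(chunk_size=2 * MIB,
                     physical_bases=(40 * MIB, 8 * MIB, 100 * MIB))
```

A caller indexing `view` from offset 0 to `view.size - 1` never sees the gaps between chunks; only `translate` knows where each byte physically lives.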

Physical Memory Pooling

Pool and reuse freed physical VRAM chunks across training steps, eliminating repeated allocation overhead without stalling compute kernels.
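
A minimal sketch of the pooling idea, assuming a simple size-keyed free list. `ChunkPool` and its methods are hypothetical, and `bytearray` stands in for a physical VRAM chunk. The keys returned by `stats` borrow the `free_chunks` and `bytes_in_pool` labels from the console output shown on this page, but the real return format of `cache_stats()` is not specified here.

```python
from collections import defaultdict


class ChunkPool:
    """Toy free-list pool: freed chunks are kept and handed back out."""

    def __init__(self) -> None:
        self._free = defaultdict(list)  # chunk size -> list of idle chunks
        self.fresh_allocations = 0

    def alloc(self, size: int) -> bytearray:
        if self._free[size]:
            return self._free[size].pop()  # reuse an idle chunk
        self.fresh_allocations += 1
        return bytearray(size)             # stand-in for a new physical allocation

    def free(self, chunk: bytearray) -> None:
        self._free[len(chunk)].append(chunk)  # keep it for a later step

    def stats(self) -> dict:
        chunks = [c for bucket in self._free.values() for c in bucket]
        return {"free_chunks": len(chunks),
                "bytes_in_pool": sum(len(c) for c in chunks)}


pool = ChunkPool()
for step in range(1000):
    buf = pool.alloc(4096)  # only the first iteration allocates
    pool.free(buf)
```

After the loop, `pool.fresh_allocations` is 1: every later step was served from the free list.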

Hardware-Aware Fragmenting

Queries the CUDA driver for hardware-specific allocation granularity and aligns chunk sizes accordingly. Works on any NVIDIA GPU with Compute Capability 6.0 or higher (Pascal through Hopper).
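
The alignment step amounts to rounding each request up to a multiple of the driver-reported granularity. The sketch below shows only that arithmetic. The CUDA driver reports the value through `cuMemGetAllocationGranularity`; the 2 MiB figure used here is an assumed example, and `align_up` is a hypothetical helper, not part of the package.

```python
def align_up(size_bytes: int, granularity: int) -> int:
    """Round size_bytes up to the next multiple of granularity."""
    if size_bytes < 0 or granularity <= 0:
        raise ValueError("size must be >= 0 and granularity must be > 0")
    # Ceiling division using integers only, then scale back up.
    return -(-size_bytes // granularity) * granularity


# Assumed example value; query the driver for the real one on your GPU.
ASSUMED_GRANULARITY = 2 * 1024 * 1024  # 2 MiB

padded = align_up(5_000_000, ASSUMED_GRANULARITY)  # rounds up to 3 chunks
```

With these numbers a 5,000,000-byte request is padded to 6,291,456 bytes (three 2 MiB chunks), so small requests can waste up to one granule each.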

Upcoming in v2.4

Native FP8 quantization and weight optimization kernels.

Research
Multi-GPU & NVLink Support

We are currently researching cross-GPU virtual address space stitching over high-speed NVLink interconnects.

Memory Efficiency

Malloc Master: Active

98.4%
Reclaimed

42.2 GB

Utilization

99.1%

Drop-in integration

Replace standard tensor allocation with a single call. No changes to your training loop required.

1. Install via pip: pip install deep-variance

2. Pre-allocate a reusable GPU buffer once with vmm_empty_nd, backed by physical CUDA memory pooling.

3. Copy into the buffer each step with zero allocation overhead. Inspect pool health anytime via cache_stats().

train.py
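import torch
from deep_variance import vmm_empty_nd, cache_stats

# batch_size and dataloader come from your existing training setup.
# Use drop_last=True so every batch matches the buffer shape.

# Pre-allocate a reusable GPU buffer once
img_buf = vmm_empty_nd(
    (batch_size, 3, 224, 224),
    dtype=torch.float32
)

# Reuse the same buffer on every training step
for imgs, labels in dataloader:
    # Copy host data straight into the buffer so no intermediate GPU
    # tensor is created (asynchronous when the DataLoader pins memory)
    img_buf.copy_(imgs, non_blocking=True)

print(cache_stats())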
Console Output
dev0: free_chunks: 12 | bytes_in_pool: 24.0 MB
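
For sizing, it helps to know how much VRAM a buffer of this shape occupies. The short calculation below works it out for float32 data; the batch size of 32 is an assumed example value, and `buffer_bytes` is a hypothetical helper rather than part of the package.

```python
import math


def buffer_bytes(shape, bytes_per_element: int) -> int:
    """Total size in bytes of a dense buffer with the given shape."""
    return math.prod(shape) * bytes_per_element


# float32 is 4 bytes per element; batch size 32 is an assumed example.
size = buffer_bytes((32, 3, 224, 224), 4)
print(f"{size} bytes = {size / 2**20:.3f} MiB")
# prints "19267584 bytes = 18.375 MiB"
```

The pool may hold more than this figure, since chunk sizes are aligned to the hardware allocation granularity.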

Works Everywhere PyTorch Runs

Designed to fit your infrastructure, not the other way around.

HPC and SLURM

Built-in module-load support. Validated on Perlmutter and Summit with deep-variance-check environment diagnostics.

Multi-GPU Training

Process-local by design. Each DDP rank manages its own VMM pool independently on its assigned device, mapping cleanly to PyTorch's multi-process data-parallel patterns.
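
In a multi-process launch, each rank first needs to pick its own device. The sketch below reads the `LOCAL_RANK` environment variable that `torchrun` sets for every worker. `local_device_index` is a hypothetical helper, and the commented usage assumes that `vmm_empty_nd` allocates on the current CUDA device, which this page does not confirm.

```python
import os
from typing import Mapping


def local_device_index(env: Mapping[str, str] = os.environ) -> int:
    """Return this process's GPU index from LOCAL_RANK (0 if unset)."""
    return int(env.get("LOCAL_RANK", "0"))


# Assumed usage inside a training script (needs PyTorch and a GPU):
#   torch.cuda.set_device(local_device_index())
#   buf = vmm_empty_nd(shape, dtype=torch.float32)  # this rank's own pool
```

Since pools are process-local, no cross-rank coordination is needed; each process simply binds to its device before allocating.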

Mixed Precision

VMM tensors participate fully in FP16 and BF16 autocast regions. Autograd and nn.Module work without modification.

Pre-Compiled Wheel

Ships as a pre-compiled Python wheel for CUDA 12.x and Linux x86_64. No compiler, no build tools. One pip install and you are running.

Run massive models on the hardware you already have.

Drop Optimemory into your training loop and reclaim VRAM you're already paying for.