Early Access: HPC Inference & Training

Stop guessing run configurations.

DeepTuner uses intermediate code analysis to predict energy-efficient run configurations before any code runs. Up to 50% less energy and up to 2x throughput on multi-head attention kernels, with no runtime profiling required.

Up to 50% less energy

Up to 2x throughput on MHA

Up to 70% search space saved

Request early access
How DeepTuner works

Static analysis. No runtime overhead.

Existing kernel tuners require exhaustive runtime profiling: O(2ⁿ) benchmark runs for n binary configuration parameters, and even more when parameters take several values. For a production cluster running continuous training, that profiling tax is paid again on every hardware migration and workload change.
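To make the exhaustive-profiling cost concrete, here is a minimal illustrative sketch (the parameter counts and per-run time are made-up numbers, not DeepTuner measurements):

```python
# Illustrative only: exhaustive tuning cost grows exponentially with the
# number of configuration parameters.
def exhaustive_runs(n_params: int, values_per_param: int = 2) -> int:
    """Benchmark runs an exhaustive tuner must execute."""
    return values_per_param ** n_params

# 10 binary parameters already mean 1,024 profiling runs; at a
# hypothetical 30 seconds per run that is over 8.5 hours of cluster time,
# repaid on every hardware migration or workload change.
runs = exhaustive_runs(10)
print(runs)                # 1024
print(runs * 30 / 3600)    # profiling hours at 30 s per run
```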

DeepTuner analyzes intermediate GPU code before any execution, extracting memory access patterns, control flow, and instruction mix to predict the energy-efficient run configuration and GPU power cap. A one-time microbenchmark per GPU generation is all it needs.

Validated on NVIDIA RTX 5000 Ada (Ada Lovelace) and RTX 3070 (Ampere) across multi-head attention, convolution, and matrix multiplication kernels.

Intermediate code analysis

Analyzes intermediate GPU code to extract memory locality scores, register pressure, warp divergence, and instruction mix ratios, without launching a single kernel.
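As a rough sketch of one such static feature, the snippet below counts the instruction mix of a PTX kernel purely from its text. The PTX fragment, instruction classes, and function name are illustrative assumptions, not DeepTuner's actual pipeline or API:

```python
import re
from collections import Counter

def instruction_mix(ptx: str) -> dict:
    """Hypothetical static feature: ratio of memory, arithmetic, and
    control instructions in PTX text, computed without launching a kernel."""
    classes = {
        "memory": re.compile(r"\b(ld|st)\.(global|shared|local)"),
        "arith": re.compile(r"\b(add|mul|mad|fma)\."),
        "control": re.compile(r"\b(bra|bar\.sync|ret)\b"),
    }
    counts = Counter()
    for line in ptx.splitlines():
        for name, pattern in classes.items():
            if pattern.search(line):
                counts[name] += 1
    total = sum(counts.values()) or 1
    return {name: counts[name] / total for name in classes}

# Tiny illustrative PTX fragment (not a complete kernel).
ptx = """
ld.global.f32 %f1, [%rd1];
fma.rn.f32 %f2, %f1, %f1, %f1;
st.global.f32 [%rd2], %f2;
bar.sync 0;
"""
print(instruction_mix(ptx))  # memory-heavy mix: 0.5 / 0.25 / 0.25
```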

Architecture-agnostic calibration

Calibrated once per GPU generation. Optimal kernel configs are predicted without re-profiling when you migrate from Ampere to Hopper or Blackwell.

Joint shape and power-cap tuning

Jointly tunes the run configuration and GPU power cap for minimum energy per token while keeping throughput above 95% of peak.
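The joint objective can be sketched as a constrained selection: among candidate (configuration, power cap) pairs, pick the lowest energy-per-token option whose throughput stays within 95% of the best observed throughput. The candidate names and numbers below are invented for illustration:

```python
def pick_config(candidates, throughput_floor=0.95):
    """Minimum energy-per-token config subject to a throughput floor.
    candidates: list of (name, tokens_per_second, joules_per_token)."""
    peak = max(tps for _, tps, _ in candidates)
    feasible = [c for c in candidates if c[1] >= throughput_floor * peak]
    return min(feasible, key=lambda c: c[2])

candidates = [
    ("block256_cap300W", 1000.0, 0.50),  # peak throughput, highest energy
    ("block128_cap250W", 970.0, 0.42),   # 97% of peak, lowest feasible energy
    ("block64_cap200W", 900.0, 0.38),    # cheapest, but below the 95% floor
]
print(pick_config(candidates)[0])  # block128_cap250W
```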

Validated on HPC NVIDIA systems and consumer-grade GPUs.

On the horizon

Expanding beyond NVIDIA

DeepTuner currently runs on NVIDIA GPUs. Work is underway to bring the same intermediate code analysis approach to other hardware targets.

AMD ROCm

In research

Porting the intermediate code analysis pipeline to AMD's ROCm stack and CDNA architecture. Coming soon.

Google TPUs

In research

Adapting energy-aware run configuration search to XLA's compilation model for TPU v4 and v5 workloads. Coming soon.

* DeepTuner is architecture-agnostic in principle. Production support is currently NVIDIA-only.

Join the DeepTuner beta

We're onboarding HPC teams with active training or inference infrastructure. Tell us your hardware setup and we'll scope a pilot.

Get early access