Early Access: HPC Inference & Training

Stop guessing run configurations.

DeepTuner uses intermediate code analysis to predict energy-efficient run configurations before any code runs. Up to 50% less energy and up to 2x throughput on multi-head attention kernels, with no runtime profiling required.

Up to 50% less energy

Up to 2x throughput on MHA

Up to 70% search space saved

Request early access
How DeepTuner works

Static analysis. No runtime overhead.

Existing kernel tuners require exhaustive runtime profiling: O(2ⁿ) benchmark runs for n binary configuration parameters, and even more when parameters take several values. For a production cluster running continuous training, that profiling tax is paid again on every hardware migration and workload change.
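To make the exhaustive-profiling cost concrete, here is a minimal illustrative sketch (the parameter counts and per-run time are made-up numbers, not DeepTuner measurements):

```python
# Illustrative only: exhaustive tuning cost grows exponentially with the
# number of configuration parameters.
def exhaustive_runs(n_params: int, values_per_param: int = 2) -> int:
    """Benchmark runs an exhaustive tuner must execute."""
    return values_per_param ** n_params

# 10 binary parameters already mean 1,024 profiling runs; at a
# hypothetical 30 seconds per run that is over 8.5 hours of cluster time,
# repaid on every hardware migration or workload change.
runs = exhaustive_runs(10)
print(runs)                # 1024
print(runs * 30 / 3600)    # profiling hours at 30 s per run
```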

DeepTuner analyzes intermediate GPU code before any execution, extracting memory access patterns, control flow, and instruction mix to predict the energy-efficient run configuration and GPU power cap. A one-time microbenchmark per GPU generation is all it needs.

Validated on NVIDIA RTX 5000 Ada (Ada Lovelace) and RTX 3070 (Ampere) across multi-head attention, convolution, and matrix multiplication kernels.

Intermediate code analysis

Analyzes intermediate GPU code to extract memory locality scores, register pressure, warp divergence, and instruction mix ratios, without launching a single kernel.
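As a rough sketch of one such static feature, the snippet below counts the instruction mix of a PTX kernel purely from its text. The PTX fragment, instruction classes, and function name are illustrative assumptions, not DeepTuner's actual pipeline or API:

```python
import re
from collections import Counter

def instruction_mix(ptx: str) -> dict:
    """Hypothetical static feature: ratio of memory, arithmetic, and
    control instructions in PTX text, computed without launching a kernel."""
    classes = {
        "memory": re.compile(r"\b(ld|st)\.(global|shared|local)"),
        "arith": re.compile(r"\b(add|mul|mad|fma)\."),
        "control": re.compile(r"\b(bra|bar\.sync|ret)\b"),
    }
    counts = Counter()
    for line in ptx.splitlines():
        for name, pattern in classes.items():
            if pattern.search(line):
                counts[name] += 1
    total = sum(counts.values()) or 1
    return {name: counts[name] / total for name in classes}

# Tiny illustrative PTX fragment (not a complete kernel).
ptx = """
ld.global.f32 %f1, [%rd1];
fma.rn.f32 %f2, %f1, %f1, %f1;
st.global.f32 [%rd2], %f2;
bar.sync 0;
"""
print(instruction_mix(ptx))  # memory-heavy mix: 0.5 / 0.25 / 0.25
```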

Architecture-agnostic calibration

Calibrated once per GPU generation. Optimal kernel configs are predicted without re-profiling when you migrate from Ampere to Hopper or Blackwell.

Joint shape and power-cap tuning

Jointly tunes the run configuration and GPU power cap for minimum energy per token while keeping throughput above 95% of peak.
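The joint objective can be sketched as a constrained selection: among candidate (configuration, power cap) pairs, pick the lowest energy-per-token option whose throughput stays within 95% of the best observed throughput. The candidate names and numbers below are invented for illustration:

```python
def pick_config(candidates, throughput_floor=0.95):
    """Minimum energy-per-token config subject to a throughput floor.
    candidates: list of (name, tokens_per_second, joules_per_token)."""
    peak = max(tps for _, tps, _ in candidates)
    feasible = [c for c in candidates if c[1] >= throughput_floor * peak]
    return min(feasible, key=lambda c: c[2])

candidates = [
    ("block256_cap300W", 1000.0, 0.50),  # peak throughput, highest energy
    ("block128_cap250W", 970.0, 0.42),   # 97% of peak, lowest feasible energy
    ("block64_cap200W", 900.0, 0.38),    # cheapest, but below the 95% floor
]
print(pick_config(candidates)[0])  # block128_cap250W
```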

Validated on HPC NVIDIA systems and consumer-grade GPUs.

On the horizon

Expanding beyond NVIDIA

DeepTuner currently runs on NVIDIA GPUs. Work is underway to bring the same intermediate code analysis approach to other hardware targets.

AMD ROCm

In research

Porting the intermediate code analysis pipeline to AMD's ROCm stack and CDNA architecture. Coming soon.

Google TPUs

In research

Adapting energy-aware run configuration search to XLA's compilation model for TPU v4 and v5 workloads. Coming soon.

* DeepTuner is architecture-agnostic in principle. Production support is currently NVIDIA-only.

Join the DeepTuner beta

We're onboarding HPC teams with active training or inference infrastructure. Tell us your hardware setup and we'll scope a pilot.

Get early access