LLM Tuner - in beta testing.

DeepTuner:
The FP8 engine.

State-of-the-art quantization techniques for large language models. Currently supporting FP8 with near-zero perplexity loss.

Active Release

Native FP8 Precision

Our research-backed FP8 kernels enable fine-tuning with 50% less memory than BF16 while retaining 99.9% of baseline accuracy.

  • Reduced VRAM requirements
  • Higher throughput kernels
  • Minimal perplexity shift

Upcoming Phase

Sparse-Aware Fine-Tuning

Exploiting activation sparsity during backward passes to further accelerate training on H100 hardware.

  • 2:4 Sparsity integration
  • Structured mask optimization
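As context for the 2:4 item above: 2:4 structured sparsity keeps the two largest-magnitude weights in every contiguous group of four, the pattern H100 sparse tensor cores accelerate. A minimal sketch of the pruning rule (a hypothetical helper, not the product API):

```python
# Toy 2:4 structured pruning: in each group of four weights, keep the
# two largest magnitudes and zero the other two.
def prune_2_4(weights):
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

print(prune_2_4([0.9, -0.1, 0.4, 0.05]))  # [0.9, 0.0, 0.4, 0.0]
```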

How DeepTuner Achieves Near-Zero Perplexity Loss

Three precision-aware techniques work in concert to maintain convergence quality while halving memory requirements.

Dual-Format FP8

Forward pass uses E4M3 (4-bit exponent, 3-bit mantissa) for high accuracy. Backward pass uses E5M2 (5-bit exponent, 2-bit mantissa) for wider dynamic range during gradient flow.
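The precision/range trade-off can be seen in a toy quantizer. This is a minimal pure-Python sketch, not DeepTuner's kernels; it models only normal numbers and ignores FP8 special encodings (NaN, E4M3's extended max of 448):

```python
import math

def quantize(x, exp_bits, man_bits):
    """Round x to the nearest value representable with the given
    exponent/mantissa widths (normal numbers only; toy model)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    bias = 2 ** (exp_bits - 1) - 1
    # clamp the exponent to the format's range (saturation at the top)
    e = max(min(math.floor(math.log2(abs(x))), bias), 1 - bias)
    frac = abs(x) / 2.0 ** e                 # mantissa, ideally in [1, 2)
    step = 2.0 ** -man_bits                  # mantissa resolution
    frac = min(round(frac / step) * step, 2.0 - step)
    return sign * frac * 2.0 ** e

print(quantize(0.1, 4, 3))   # 0.1015625 -- E4M3 resolves 0.1 more finely
print(quantize(0.1, 5, 2))   # 0.09375   -- E5M2 trades precision for range
print(quantize(1e6, 4, 3))   # 240.0     -- saturates at this toy E4M3 max
```

The same input lands on a nearer grid point under E4M3 but survives much larger magnitudes under E5M2, which is exactly why gradients (with their wide dynamic range) get the E5M2 format.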

Adaptive Loss Scaling

The auto-scaling mu policy monitors per-tensor saturation ratios at every step. When saturation exceeds 0.001%, mu is halved immediately; when training is stable, mu grows back toward its maximum over a 1,000-step window.
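The policy can be sketched in a few lines. Class and parameter names below are hypothetical illustrations, not the product API:

```python
class AdaptiveScaler:
    """Toy model of the auto-scaling mu policy described above:
    halve on saturation, regrow over a fixed stable-step window."""

    def __init__(self, mu=2.0**16, mu_max=2.0**16,
                 window=1000, threshold=1e-5):  # 0.001% as a fraction
        self.mu, self.mu_max = mu, mu_max
        self.window, self.threshold = window, threshold
        self.stable_steps = 0

    def update(self, saturated, total):
        """Feed the per-tensor saturation count once per step."""
        if total and saturated / total > self.threshold:
            self.mu /= 2                 # back off immediately
            self.stable_steps = 0
        elif self.mu < self.mu_max:
            self.stable_steps += 1
            if self.stable_steps >= self.window:
                self.mu = min(self.mu * 2, self.mu_max)  # grow back
                self.stable_steps = 0
        return self.mu
```

Halve-fast/regrow-slow mirrors classic dynamic loss scaling from mixed-precision training, applied here per tensor.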

Compressed Optimizer States

First-order momentum is stored in FP8, second-order variance in FP16, and master weights in FP16. This reduces optimizer memory from 16 bytes per parameter to under 7 bytes.
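The arithmetic behind that figure can be sketched as follows. The exact breakdown is an assumption (Adam-style states, FP16 gradients, scale factors amortized), not a published spec:

```python
# Plausible per-parameter byte accounting (assumption, not a spec).
baseline = {                    # classic mixed-precision Adam
    "fp32 master weight": 4,
    "fp32 momentum": 4,
    "fp32 variance": 4,
    "fp16 model weight": 2,
    "fp16 gradient": 2,
}
compressed = {                  # the layout described above
    "fp16 master weight": 2,    # can double as the model weight
    "fp8 momentum": 1,
    "fp16 variance": 2,
    "fp16 gradient": 2,
}
print(sum(baseline.values()), "->", sum(compressed.values()))  # 16 -> 7
```

Under this accounting the total lands at roughly 7 bytes per parameter; compressing gradients as well, or sharding scale metadata, pushes it below that.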

Three Levels of FP8 Optimization

O1

FP8 Gradient Communication

FP8 all-reduce for DDP gradient synchronization. Halves inter-GPU gradient traffic (up to 2x bandwidth reduction) with no changes to model or optimizer code.
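A toy simulation of the idea: each rank rounds its gradients onto an FP8-like grid before the reduce, and only the wire format changes. `to_fp8` and the list-based reduce are illustrative stand-ins for the real communication path, not DeepTuner's implementation:

```python
import math

def to_fp8(x, man_bits=3):
    """Round x onto an E4M3-like grid (toy; ignores range clamping)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - man_bits)        # grid spacing at this exponent
    return round(x / step) * step

def fp8_allreduce(per_rank_grads):
    """Simulated DDP all-reduce: ranks contribute FP8-rounded
    gradients and the reduction averages them elementwise."""
    n = len(per_rank_grads)
    length = len(per_rank_grads[0])
    return [sum(to_fp8(g[i]) for g in per_rank_grads) / n
            for i in range(length)]

print(fp8_allreduce([[0.101, -0.204], [0.099, -0.196]]))
# [0.1015625, -0.203125] -- close to the true means despite 1-byte payloads
```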

O2

FP8 Optimizer States

Includes O1 plus first-order momentum compression to FP8. Cuts peak optimizer memory by over 2x, enabling larger models or larger batch sizes on the same GPU.

O3

Full FP8 Pipeline with ZeRO

Includes O2 plus ZeRO-aware FP8 weight partitioning for multi-GPU setups. Enables the full distributed FP8 training pipeline with minimal precision loss across all ranks.
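ZeRO-style partitioning gives each rank an exclusive shard of the flattened FP8 state, so no state is replicated. A sketch of the index arithmetic (illustrative, not DeepTuner's actual partitioner):

```python
# Toy ZeRO-style shard assignment over a flattened parameter buffer:
# each rank owns a contiguous (start, size) slice, remainder spread
# across the lowest ranks.
def shard(n_params, world_size, rank):
    base, rem = divmod(n_params, world_size)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return start, size

print([shard(10, 3, r) for r in range(3)])  # [(0, 4), (4, 3), (7, 3)]
```

The shards tile the buffer exactly, so each rank updates only its slice and an all-gather reassembles the full FP8 weights when needed.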