NVIDIA just delivered a masterclass in AI training dominance, sweeping every benchmark in MLPerf Training v5.1 while being the only platform to compete across all seven tests. The chipmaker's new Blackwell Ultra architecture didn't just win - it set a new record, training Llama 3.1 405B in just 10 minutes with NVFP4, the first FP4 precision format ever used in the benchmark's history.
The results, published today, weren't close. NVIDIA swept all seven benchmarks, which cover large language models, image generation, recommender systems, computer vision, and graph neural networks.
What makes this sweep particularly striking is that NVIDIA was the only platform to submit results on every single test. While competitors cherry-picked their battles, NVIDIA showed up everywhere and won everything - a testament both to the versatility of its CUDA software stack and to the raw power of its Blackwell architecture.
The star of the show was NVIDIA's Blackwell Ultra GPU architecture, making its MLPerf Training debut in the GB300 NVL72 rack-scale system. The performance gains were staggering. Compared to the previous-generation Hopper architecture, Blackwell Ultra delivered more than 4x the performance on Llama 3.1 405B pretraining and nearly 5x faster Llama 2 70B LoRA fine-tuning using the same number of GPUs.
But the real breakthrough came from NVIDIA's introduction of NVFP4 precision - the first time FP4 calculations have been used in MLPerf Training. Lower precision normally trades away accuracy, so NVIDIA engineered its entire stack, from silicon to training recipes, to make FP4 math meet the benchmark's accuracy targets. The payoff is raw speed: Blackwell Ultra can perform NVFP4 calculations at 3x the rate of FP8, delivering substantially greater AI compute performance.
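To make the mechanism concrete, here is a minimal NumPy sketch of block-scaled 4-bit float quantization in the spirit of NVFP4. It is an illustration under stated assumptions, not NVIDIA's implementation: it uses the standard E2M1 value grid and one shared scale per 16-element block, while real NVFP4 additionally stores block scales in FP8 and applies a second per-tensor scale.

```python
import numpy as np

# Non-negative values representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Simulate block-scaled FP4: each group of `block` values shares one scale
    chosen so the group's largest magnitude lands on the top grid value (6.0).
    Scales stay in float32 here for simplicity; NVFP4 stores them in FP8."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # all-zero blocks need no scaling
    mag = np.abs(blocks / scale)
    # Round each scaled magnitude to the nearest representable FP4 value.
    nearest = FP4_GRID[np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)]
    return (np.sign(blocks) * nearest * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
w_q = quantize_fp4(w)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```

The round-to-nearest-grid-value step is where accuracy is normally lost, which is why scaling granularity matters: small per-block scales track local dynamic range far more tightly than a single per-tensor scale could.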
The scale achievements were equally impressive. NVIDIA set a new Llama 3.1 405B training record of just 10 minutes using more than 5,000 Blackwell GPUs working together. That result is 2.7x faster than the company's best Blackwell-based submission from the previous round, a gain that came from both efficient scaling to twice as many GPUs and the per-GPU performance boost from NVFP4 precision.
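The arithmetic behind that claim is worth spelling out. Even with perfectly linear scaling, doubling the GPU count can account for at most a 2x speedup, so the remainder has to come from per-GPU improvements such as NVFP4. A back-of-the-envelope sketch, using only the figures reported above:

```python
# Back-of-the-envelope decomposition of the round-over-round gain (illustrative).
total_speedup = 2.7   # reported gain over the prior Blackwell submission
gpu_ratio = 2.0       # roughly twice as many GPUs this round

# Perfect linear scaling caps the contribution of extra hardware at gpu_ratio,
# so whatever remains is per-GPU improvement (precision format plus software).
per_gpu_gain = total_speedup / gpu_ratio
print(f"implied per-GPU speedup: {per_gpu_gain:.2f}x")  # ~1.35x
```

In practice scaling to 5,000+ GPUs is never perfectly linear, so the true per-GPU gain is, if anything, somewhat larger than this simple division suggests.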