NVIDIA just delivered a masterclass in AI training dominance, sweeping every single benchmark in MLPerf Training v5.1 while being the only platform to submit results on all seven tests. The chipmaker's new Blackwell Ultra architecture didn't just win on its debut - it posted more than 4x Hopper's performance on Llama 3.1 405B pretraining, while NVIDIA's largest Blackwell cluster trained the same model in just 10 minutes using NVFP4 precision, a first in MLPerf Training history.
NVIDIA just redefined what's possible in AI training performance. The company didn't just participate in MLPerf Training v5.1 - it dominated every category while competitors couldn't even show up to compete across the board. This wasn't close. NVIDIA swept all seven benchmarks covering large language models, image generation, recommender systems, computer vision, and graph neural networks, according to results published today.
What makes this sweep particularly striking is that NVIDIA was the only platform to submit results on every single test. While competitors cherry-picked their battles, NVIDIA showed up everywhere and won everything - a testament to both the versatility of its CUDA software stack and the raw power of its Blackwell architecture.
The star of the show was NVIDIA's Blackwell Ultra GPU architecture, making its MLPerf Training debut in the GB300 NVL72 rack-scale system. The performance gains were staggering. Compared to the previous-generation Hopper architecture, and using the same number of GPUs, Blackwell Ultra delivered more than 4x the performance on Llama 3.1 405B pretraining and nearly 5x the performance on Llama 2 70B LoRA fine-tuning.
But the real breakthrough came from NVIDIA's introduction of NVFP4 precision calculations - a first in MLPerf Training history. While lower precision typically means sacrificing accuracy, NVIDIA's teams innovated across their entire stack to make FP4 calculations work without compromising results. The Blackwell Ultra architecture can perform these NVFP4 calculations at 3x the rate of FP8, delivering substantially greater AI compute performance.
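To make the accuracy trade-off concrete, here is a minimal sketch of block-scaled 4-bit floating-point quantization in plain Python. It is an illustration, not NVIDIA's implementation: the eight magnitudes are those of a standard 4-bit E2M1 float, while the block size of 16, the full-precision per-block scale, and all function names are assumptions made for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: number of elements sharing one scale factor


def quantize_block(values):
    """Round one block of floats to scaled E2M1 values; returns (scale, codes)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude in the block onto 6.0, the top of the E2M1 range.
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Pick the closest representable magnitude, then restore the sign.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a block's scale and its 4-bit codes."""
    return [scale * c for c in codes]


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize a flat list, block by block, to expose rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out
```

Comparing `fake_quantize(weights)` with the original `weights` shows how coarse the 4-bit grid is, and why per-block scaling matters: each small block gets its own scale, so one large outlier only degrades the values that share its block.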
The scale achievements were equally impressive. NVIDIA set a new Llama 3.1 405B training record of just 10 minutes using more than 5,000 Blackwell GPUs working together. This result was 2.7x faster than the company's best Blackwell-based submission from the previous round, achieved through both efficient scaling to twice the number of GPUs and the dramatic performance boost from NVFP4 precision.
To put the per-GPU improvements in perspective, NVIDIA submitted results using 2,560 Blackwell GPUs that achieved an 18.79-minute training time - 45% faster than its previous submission using 2,496 GPUs. With the GPU count up by less than 3%, almost all of that gain comes from each GPU doing more work rather than from a bigger cluster.
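The per-GPU comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (2,560 GPUs, 18.79 minutes, 2,496 GPUs, 45%). The previous round's training time is not stated here, so it is derived under two possible readings of "45% faster". Both readings, and all function names, are assumptions for illustration.

```python
def gpu_minutes(num_gpus, minutes):
    """Total GPU time consumed by a run, in GPU-minutes."""
    return num_gpus * minutes


def implied_previous_minutes(new_minutes, pct_faster, reading):
    """Back out the earlier run's time from a 'pct_faster' claim.

    reading="throughput": old_time / new_time == 1 + pct/100
    reading="time_saved": new_time == old_time * (1 - pct/100)
    """
    gain = pct_faster / 100
    if reading == "throughput":
        return new_minutes * (1 + gain)
    if reading == "time_saved":
        return new_minutes / (1 - gain)
    raise ValueError(f"unknown reading: {reading!r}")


def per_gpu_speedup(old_gpus, old_minutes, new_gpus, new_minutes):
    """Ratio of GPU-minutes spent: values above 1 mean each GPU got more done."""
    return gpu_minutes(old_gpus, old_minutes) / gpu_minutes(new_gpus, new_minutes)


if __name__ == "__main__":
    for reading in ("throughput", "time_saved"):
        old = implied_previous_minutes(18.79, 45, reading)
        ratio = per_gpu_speedup(2496, old, 2560, 18.79)
        print(f"{reading}: previous ~{old:.2f} min, per-GPU speedup ~{ratio:.2f}x")
```

Under either reading, the per-GPU ratio stays close to the headline figure, because the two runs differ by only 64 GPUs.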
The benchmarks themselves evolved this round, with two new tests replacing older standards. Llama 3.1 8B replaced the long-running BERT-large model, bringing a more modern, compact LLM into the benchmark suite. NVIDIA set the bar at 5.2 minutes to train using up to 512 Blackwell Ultra GPUs. Meanwhile, FLUX.1 - a state-of-the-art image generation model - replaced Stable Diffusion v2, and again only NVIDIA submitted results, setting a 12.5-minute record using 1,152 Blackwell GPUs.
The infrastructure supporting these achievements was equally advanced. NVIDIA's Quantum-X800 InfiniBand platform made its MLPerf debut, delivering the industry's first end-to-end 800 Gb/s scale-up networking and doubling scale-out networking bandwidth compared to the previous generation. When you're coordinating thousands of GPUs, that networking performance becomes critical.
The broader ecosystem participated extensively, with 15 organizations, including Dell Technologies, HPE, and Lenovo, submitting results using NVIDIA platforms. But the message was clear - if you want to compete in AI training at scale, you're using NVIDIA hardware.
This dominance comes as AI companies are racing to train increasingly sophisticated models with reasoning capabilities. The ability to reach MLPerf's quality target on a 405B-parameter model in 10 minutes instead of hours or days changes the entire economics of AI development. It enables rapid iteration, more experimentation, and ultimately faster progress toward more capable AI systems.
NVIDIA's MLPerf Training v5.1 sweep isn't just about winning benchmarks - it's about establishing an insurmountable lead in the infrastructure powering the next generation of AI. With breakthrough NVFP4 precision, massive scalability improvements, and an ecosystem that spans every major hardware partner, NVIDIA has created a training performance moat that competitors will struggle to cross. As AI companies race to build more capable reasoning models, this kind of training speed advantage could determine who leads the next phase of AI development.