NVIDIA's latest Blackwell Ultra architecture just rewrote the AI inference playbook. The GB300 NVL72 systems powered by these new chips crushed every benchmark in MLPerf Inference v5.1, delivering up to 1.4x better performance than the previous generation. This isn't just about bragging rights: it's about making AI factories dramatically more profitable.
The announcement is poised to reshape the AI infrastructure landscape: Blackwell Ultra is demolishing performance benchmarks, and the numbers are striking enough to make every data center operator take notice.
Less than six months after debuting at NVIDIA GTC, the GB300 NVL72 rack-scale systems powered by Blackwell Ultra have set new records across every single benchmark in the latest MLPerf Inference v5.1 suite. We're talking about 1.4x better DeepSeek-R1 inference throughput compared to the already impressive Blackwell-based GB200 systems.
But here's what makes this really interesting: it isn't just about raw speed. "Inference performance is critical, as it directly influences the economics of an AI factory," NVIDIA explains in its announcement. The higher the throughput, the more tokens these systems can pump out, which translates directly to increased revenue and lower total cost of ownership.
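The economics are simple to sketch: revenue scales linearly with token throughput, so at a fixed operating cost, a 1.4x throughput gain flows straight to margin. A back-of-the-envelope calculation (all throughput and pricing figures below are hypothetical, not from NVIDIA's announcement):

```python
# Back-of-the-envelope AI factory economics. The throughput and
# per-token price numbers are illustrative assumptions only.

def daily_revenue(tokens_per_second: float, price_per_million_tokens: float) -> float:
    """Revenue per day for a system serving tokens at a flat price."""
    tokens_per_day = tokens_per_second * 86_400  # seconds in a day
    return tokens_per_day / 1_000_000 * price_per_million_tokens

# Same price, same power bill -- only throughput changes (1.4x).
baseline = daily_revenue(tokens_per_second=100_000, price_per_million_tokens=2.0)
ultra = daily_revenue(tokens_per_second=140_000, price_per_million_tokens=2.0)

print(f"baseline: ${baseline:,.0f}/day, 1.4x system: ${ultra:,.0f}/day")
# The extra revenue comes at roughly the same fixed cost, which is why
# per-rack throughput is the headline metric for AI factories.
```

With these made-up numbers, the 1.4x system earns $24,192/day versus $17,280/day, and that gap compounds across every rack in the data center.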
The technical specs behind these results are genuinely impressive. Blackwell Ultra packs 1.5x more NVFP4 AI compute than its predecessor, along with 2x better attention-layer acceleration and up to 288GB of HBM3e memory per GPU. That's not incremental improvement; that's a generational leap.
NVIDIA swept the board on all the new data center benchmarks added to MLPerf v5.1, including DeepSeek-R1, Llama 3.1 405B Interactive, Llama 3.1 8B, and Whisper. They're also maintaining their stranglehold on per-GPU records across every MLPerf data center benchmark.
The secret sauce here isn't just better silicon. NVIDIA's full-stack approach is paying dividends through hardware acceleration for its proprietary NVFP4 data format, a 4-bit floating point format that delivers better accuracy than other FP4 formats while closely matching the accuracy of higher-precision alternatives. NVIDIA's TensorRT Model Optimizer software quantized major models like DeepSeek-R1 and Llama 3.1 405B to this format, squeezing out every bit of performance while meeting the suite's strict accuracy requirements.
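To see why a scale factor matters so much at 4 bits, here is an illustrative sketch of block-scaled FP4 fake-quantization in the spirit of NVFP4. This is not NVIDIA's implementation; the block size, scale handling, and helper names are assumptions for illustration. An E2M1 4-bit float can represent only eight magnitudes, so each small block of weights shares a scale that maps it onto that tiny grid:

```python
# Illustrative block-scaled 4-bit quantization sketch (hypothetical code,
# not TensorRT Model Optimizer). E2M1 FP4 represents only these magnitudes:
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, max_fp4=6.0):
    """Fake-quantize one block: scale into FP4 range, snap to grid, rescale."""
    scale = max(abs(x) for x in block) / max_fp4 or 1.0  # guard all-zero blocks
    out = []
    for x in block:
        mag = min(abs(x) / scale, max_fp4)
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to grid
        out.append(nearest * scale * (1 if x >= 0 else -1))
    return out

weights = [0.9, -0.31, 0.05, 0.6]
print(quantize_block(weights))  # each value snapped to a scaled FP4 level
```

The per-block scale is what keeps accuracy workable at 4 bits: outliers in one block don't crush the resolution of every other block, which is the same intuition behind block-scaled formats generally.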
One particularly clever optimization caught our attention: disaggregated serving. This technique splits large language model inference into separate context and generation tasks, allowing each to be optimized independently. The result? A nearly 50% performance boost per GPU on the Llama 3.1 405B Interactive benchmark when using GB200 NVL72 systems versus traditional serving approaches.
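The idea behind disaggregated serving can be sketched in a few lines. This toy (hypothetical code, not NVIDIA's serving stack) separates the compute-bound prefill (context) phase from the memory-bound decode (generation) phase, handing the request off between them, which is what lets each pool of GPUs be sized and tuned for its own bottleneck:

```python
# Toy sketch of disaggregated LLM serving: prefill and decode run as
# separate workers with a handoff in between. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    kv_cache: list = field(default_factory=list)  # stands in for transferred KV state
    output: list = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    """Context phase: process the whole prompt once, building the KV cache."""
    req.kv_cache = req.prompt.split()  # placeholder for real attention state
    return req

def decode_worker(req: Request, max_tokens: int) -> Request:
    """Generation phase: emit tokens one at a time from the handed-off cache."""
    for i in range(max_tokens):
        req.output.append(f"tok{i}")  # placeholder for real sampling
    return req

req = decode_worker(prefill_worker(Request("explain disaggregated serving")), max_tokens=3)
print(req.output)
```

In a real deployment, the KV-cache handoff happens over fast interconnects like NVLink, and the win comes from running many lightweight decode workers against fewer heavyweight prefill workers instead of forcing both phases to share one GPU's configuration.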