NVIDIA is pushing deeper into enterprise AI infrastructure with new cloud integrations for its Dynamo platform. The chip giant announced that Dynamo now works with managed Kubernetes services from Amazon Web Services, Google Cloud, and Oracle to handle multi-node AI inference at data center scale. The move positions NVIDIA to capitalize on the growing enterprise demand for distributed AI workloads that require coordination across dozens or hundreds of GPU nodes.
NVIDIA is making a calculated bet that the future of AI inference looks nothing like today's single-GPU setups. The company's latest push centers on Dynamo, its platform for orchestrating AI workloads across entire GPU clusters, now integrated with the Kubernetes services that run most enterprise infrastructure.
The timing isn't coincidental. As AI workloads grow more complex, with reasoning models and multi-agent workflows, they're hitting the limits of what single servers can handle efficiently. NVIDIA sees an opening to own the infrastructure layer that will manage these distributed workloads, much as it came to dominate AI training.
"AI inference must now scale across entire clusters to serve millions of concurrent users," NVIDIA wrote in its blog post. The company is pushing a technique called disaggregated serving, in which different stages of AI model processing are assigned to specialized GPU clusters optimized for specific tasks.
The approach splits AI inference into two phases: processing input prompts (prefill) and generating outputs (decode). Instead of running both on the same GPUs, disaggregated serving assigns each phase to independently optimized hardware. For complex reasoning models like DeepSeek-R1, this setup becomes essential rather than optional.
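The prefill/decode split described above can be sketched in a few lines. The following toy router is purely illustrative; the class and pool names are assumptions for this example, not NVIDIA Dynamo's actual API. It captures the core idea: each request's prompt-processing phase is placed on one GPU pool and its token-generation phase on another, so each pool can be sized and tuned independently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative sketch of disaggregated serving: prefill (prompt processing)
# and decode (token generation) run on separate, independently sized GPU
# pools. Names here are hypothetical, not Dynamo's real interface.

@dataclass
class GPUPool:
    name: str
    num_gpus: int
    assigned: List[str] = field(default_factory=list)

    def assign(self, request_id: str) -> str:
        """Place a request on this pool using simple round-robin."""
        self.assigned.append(request_id)
        gpu = len(self.assigned) % self.num_gpus
        return f"{self.name}/gpu{gpu}"

class DisaggregatedRouter:
    """Routes each inference phase to a pool tuned for that phase:
    prefill is compute-bound, decode is memory-bandwidth-bound."""

    def __init__(self, prefill_gpus: int, decode_gpus: int):
        self.prefill = GPUPool("prefill", prefill_gpus)
        self.decode = GPUPool("decode", decode_gpus)

    def handle(self, request_id: str) -> Tuple[str, str]:
        # Phase 1: process the input prompt on the prefill pool.
        prefill_loc = self.prefill.assign(request_id)
        # Phase 2: generate output tokens on the decode pool.
        # (A real system would also transfer the KV cache between pools.)
        decode_loc = self.decode.assign(request_id)
        return prefill_loc, decode_loc

router = DisaggregatedRouter(prefill_gpus=2, decode_gpus=4)
placements = [router.handle(f"req-{i}") for i in range(3)]
```

Because the two pools scale independently, an operator could add decode capacity for long-output workloads without over-provisioning prefill, which is the economic argument behind the approach.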
Baseten, an AI infrastructure company, has already documented notable results with NVIDIA's approach: 2x faster inference for long-context code generation and 1.6x higher throughput, without buying additional hardware. Those software-driven performance gains translate directly into cost reductions for AI providers.
Recent SemiAnalysis benchmarks showed that disaggregated serving with Dynamo on NVIDIA's GB200 NVL72 systems delivers the lowest cost per million tokens for mixture-of-experts reasoning models among tested platforms.