
NVIDIA introduces Dynamo to optimise AI reasoning models

NVIDIA has introduced NVIDIA Dynamo, open-source inference-serving software designed to optimise AI reasoning models across large GPU fleets, with the aim of improving inference performance and reducing costs in AI factories.

The software seeks to maximise the efficiency and cost-effectiveness of AI model inference, which is crucial for AI factories that must deliver competitive performance while keeping expenses in check.

As AI models increasingly adopt reasoning capabilities, Dynamo addresses the need to process efficiently the large number of tokens each prompt generates.

Jensen Huang, founder and CEO of NVIDIA, stated, "Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time. To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories."

The software enhances inference by orchestrating requests across GPUs and separating the prompt-processing and token-generation phases of large language models (LLMs) onto different GPUs, so that each phase can be optimised independently.

Using NVIDIA Dynamo, AI factories can notably improve their performance metrics. On NVIDIA Hopper platforms, for instance, Dynamo doubles the performance and revenue when deploying Llama models.

Additionally, it boosts the number of tokens generated per GPU by more than 30 times when serving the DeepSeek-R1 model on GB200 NVL72 racks.

Denis Yarats, Chief Technology Officer of Perplexity AI, expressed enthusiasm about Dynamo's capabilities: "To handle hundreds of millions of requests monthly, we rely on NVIDIA GPUs and inference software to deliver the performance, reliability and scale our business and users demand. We look forward to leveraging Dynamo, with its enhanced distributed serving capabilities, to drive even more inference-serving efficiencies and meet the compute demands of new AI reasoning models."

For Cohere, NVIDIA Dynamo presents an opportunity to enhance its Command series of models with sophisticated AI capabilities.

"Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage," said Saurabh Baji, Senior Vice President of Engineering at Cohere. "We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers."

NVIDIA Dynamo supports disaggregated serving, allowing different computational phases of LLMs to be managed by separate GPUs, ultimately enabling more effective resource allocation and faster response times for reasoning models.
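
To make the idea concrete, here is a minimal Python sketch of disaggregated serving, assuming toy PrefillWorker, DecodeWorker and KVCache classes invented purely for illustration; it shows the pattern only and is not NVIDIA Dynamo's actual API.

```python
# Illustrative sketch only, not the Dynamo API: prefill (prompt processing)
# and decode (token generation) run in separate workers so each phase can be
# scaled and tuned independently.
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stand-in for the key/value cache a prefill pass produces."""
    prompt: str
    tokens: list[str]


class PrefillWorker:
    """Would run on GPUs tuned for compute-heavy prompt processing."""

    def prefill(self, prompt: str) -> KVCache:
        # A real worker runs the model's forward pass over the prompt;
        # here we just tokenise by whitespace to keep the sketch runnable.
        return KVCache(prompt=prompt, tokens=prompt.split())


class DecodeWorker:
    """Would run on GPUs tuned for memory-bandwidth-bound generation."""

    def decode(self, cache: KVCache, max_new_tokens: int = 4) -> str:
        # A real worker generates tokens autoregressively from the
        # transferred KV cache; this stub just appends placeholder tokens.
        new_tokens = [f"<token{i}>" for i in range(max_new_tokens)]
        return " ".join(cache.tokens + new_tokens)


def serve(prompt: str) -> str:
    # Route the prompt to a prefill worker, then hand the resulting cache
    # to a decode worker that would sit on a different GPU.
    cache = PrefillWorker().prefill(prompt)
    return DecodeWorker().decode(cache)


if __name__ == "__main__":
    print(serve("Explain disaggregated serving"))
```

In a real deployment the prefill and decode pools run on separate GPUs and the cache must be moved between them over a fast interconnect, which is where a low-latency communication library becomes important.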

Together AI intends to use NVIDIA Dynamo to enhance the scalability of its inference workloads across GPU nodes.

Ce Zhang, Chief Technology Officer of Together AI, explained, "Scaling reasoning models cost effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing. Together AI provides industry-leading performance using our proprietary inference engine. The openness and modularity of NVIDIA Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimizing resource utilization — maximizing our accelerated computing investment. We're excited to leverage the platform's breakthrough capabilities to cost-effectively bring open-source reasoning models to our users."

NVIDIA Dynamo integrates several features: a "GPU Planner" that dynamically manages GPU capacity, a "Smart Router" that directs inference requests across the GPU fleet, a low-latency communication library that accelerates GPU-to-GPU data transfer, and a memory manager that handles where inference data is stored, all of which help reduce inference-serving costs.
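
The routing idea can be sketched in a few lines of Python. The AffinityRouter below, with its hash-of-prefix affinity scheme and gpu-node worker names, is entirely hypothetical and is not how Dynamo's Smart Router is implemented; it only illustrates why sending a request to a GPU that has already processed similar data avoids repeated work.

```python
# Illustrative sketch only, not Dynamo's Smart Router: route each request to
# the worker most likely to already hold cached data for its prompt prefix,
# falling back to the least-loaded worker otherwise.
import hashlib
from collections import defaultdict


class AffinityRouter:
    def __init__(self, workers: list[str]):
        self.workers = workers
        self.load = defaultdict(int)             # in-flight requests per worker
        self.cached_prefixes = defaultdict(set)  # prefixes each worker has seen

    def _prefix_key(self, prompt: str, length: int = 32) -> str:
        # Hash a fixed-length prefix as a cheap stand-in for cache identity.
        return hashlib.sha256(prompt[:length].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._prefix_key(prompt)
        # Prefer a worker that has already processed this prefix (cache hit).
        for worker in self.workers:
            if key in self.cached_prefixes[worker]:
                self.load[worker] += 1
                return worker
        # Otherwise pick the least-loaded worker and remember the prefix.
        worker = min(self.workers, key=lambda w: self.load[w])
        self.cached_prefixes[worker].add(key)
        self.load[worker] += 1
        return worker


router = AffinityRouter(["gpu-node-0", "gpu-node-1"])
print(router.route("Summarise this quarterly report ..."))  # least-loaded node
print(router.route("Summarise this quarterly report ..."))  # same node, reuse
```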

The company plans to include NVIDIA Dynamo in NVIDIA NIM microservices and support it in subsequent updates to the NVIDIA AI Enterprise platform, providing enhanced security and stability.

NVIDIA Dynamo represents a step forward in the efficiency and scalability of serving AI reasoning models.
