NVIDIA Blackwell Delivers Up to 20x More Agents per Megawatt in AgentPerf Benchmark

NVIDIA’s GB300 NVL72 platform, built on Blackwell Ultra, delivers up to 20 times more agentic AI performance per megawatt than the NVIDIA Hopper HGX H200 system, according to data from Artificial Analysis’s AgentPerf benchmark. This efficiency gain supports a shift from simple conversational AI to “agentic” workflows that execute complex, multi-step tasks autonomously.

Why is agentic AI different from conversational AI?

Conversational AI operates as a “sprint.” A user sends a prompt, and the large language model (LLM) returns a single response. Agentic AI, however, functions as a “relay.” According to Artificial Analysis, agents break a primary goal into numerous smaller steps, continuing the process until the task is complete.

These agents chain together multiple LLM calls and tool calls, such as database searches, web browsing, or code execution, to gather context, reason, and act. A single agentic task might involve dozens or hundreds of chained calls. Because each call’s output feeds the next step, the load compounds across the chain rather than simply adding up, placing significantly higher stress on computing systems than a single chat request.
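
The chained-call pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent framework works: `fake_llm`, `search_database`, `run_code`, and the scripted plan are all hypothetical stand-ins, and a real agent would call an actual model and real tools.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools; real agents would query a database or run a sandbox.
def search_database(query: str) -> str:
    return f"rows matching '{query}'"

def run_code(snippet: str) -> str:
    return f"executed: {snippet}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_database": search_database,
    "run_code": run_code,
}

def fake_llm(history: List[str]) -> Tuple[str, str]:
    """Stand-in for a model call: follows a fixed script based on
    how many tool results are already in the history."""
    script = [
        ("search_database", "failing tests"),
        ("run_code", "pytest -k parser"),
        ("finish", "patch applied"),
    ]
    tool_results = sum(1 for entry in history if entry.startswith("tool:"))
    return script[min(tool_results, len(script) - 1)]

def run_agent(goal: str, max_steps: int = 10) -> Tuple[str, int]:
    """Loop: ask the model for the next action, run the tool,
    feed the result back, and stop when the model says 'finish'."""
    history = [f"goal: {goal}"]
    llm_calls = 0
    for _ in range(max_steps):
        action, argument = fake_llm(history)
        llm_calls += 1
        if action == "finish":
            return argument, llm_calls
        result = TOOLS[action](argument)
        history.append(f"tool:{action} -> {result}")
    return "step budget exhausted", llm_calls

answer, calls = run_agent("fix the parser bug")
print(answer, calls)  # patch applied 3
```

Even this toy task needs three model calls and two tool calls, and the history grows with every step. That growth is what separates an agentic workload from a single chat request.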

Did you know? AgentPerf is the industry’s first benchmark specifically designed for agentic AI. It uses real coding trajectories from public repositories across 12+ programming languages to simulate how agents actually behave in production.

How does the NVIDIA GB300 NVL72 achieve 20x efficiency?

The GB300 NVL72 outperforms the HGX H200 by optimizing the entire hardware and software stack for mixture-of-experts (MoE) models, such as DeepSeek V4 Pro. NVIDIA reports that the system connects 72 GPUs into a single rack-scale system, allowing the model to distribute execution more efficiently.

Three specific technical drivers enable this performance:

CUDA Kernels: These overlap communication and compute, which absorbs the coordination cost across experts and reduces latency.
TensorRT LLM: This software separates input processing (prefill) from output generation (decode), allowing each to be optimized independently as concurrent sessions scale.
Rack-Scale Architecture: The GB300 NVL72 supports higher concurrent agent counts at service-level objectives of both 20 and 60 tokens per second per agent.
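
One way to read “concurrent agents at a service-level objective” is as the highest concurrency at which every agent still receives the target tokens per second. The sketch below assumes a sweep of measurements is already available. The function name and all numbers in `sweep` are hypothetical placeholders, not AgentPerf results.

```python
from typing import List, Tuple

def max_agents_at_slo(
    measurements: List[Tuple[int, float]], slo_tokens_per_s: float
) -> int:
    """Return the highest concurrency whose per-agent speed meets the SLO.

    `measurements` holds (concurrent_agents, tokens_per_s_per_agent) pairs.
    Returns 0 if no measured point satisfies the SLO.
    """
    passing = [
        agents for agents, speed in measurements if speed >= slo_tokens_per_s
    ]
    return max(passing, default=0)

# Hypothetical sweep: per-agent speed falls as concurrency rises.
sweep = [(8, 95.0), (32, 70.0), (128, 41.0), (512, 22.0), (1024, 12.0)]

print(max_agents_at_slo(sweep, 60))  # 32
print(max_agents_at_slo(sweep, 20))  # 512
```

A looser objective (20 tokens per second) admits far more concurrent agents than a stricter one (60 tokens per second), which is presumably why results are reported at both levels.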

Pro Tip: When evaluating AI infrastructure, look beyond “tokens per second.” For agents, the critical metric is concurrent agents per megawatt, as this determines the actual operational cost of deploying a digital workforce.
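
As a rough illustration of the metric, the arithmetic below normalizes concurrent agents by power draw. The function name and both systems’ figures are made-up placeholders chosen only to show the calculation. They are not measurements of any NVIDIA system.

```python
def agents_per_megawatt(concurrent_agents: int, power_kw: float) -> float:
    """Normalize a concurrency figure to one megawatt (1,000 kW) of power."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents * 1000.0 / power_kw

# Hypothetical systems: (agents sustained at the SLO, power draw in kW).
system_a = agents_per_megawatt(concurrent_agents=50, power_kw=10.0)
system_b = agents_per_megawatt(concurrent_agents=2400, power_kw=120.0)

print(system_a)             # 5000.0
print(system_b)             # 20000.0
print(system_b / system_a)  # 4.0
```

In this example system B draws twelve times the power of system A yet is four times more efficient per megawatt, which shows why raw throughput alone can mislead.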

What real-world applications are using Blackwell today?

Several inference providers have already deployed agentic workloads on NVIDIA Blackwell. Together AI uses the platform to power Cursor, an agentic coding tool. According to NVIDIA, Cursor’s agents can debug issues and execute refactors in real time while developers continue to work.

DeepInfra uses Blackwell to power Pam.ai, a workforce platform for car dealerships. These agents handle outbound sales campaigns, book service appointments, and manage phone calls. Baseten is also among the leading providers serving frontier models like DeepSeek V4 Pro on this architecture.

What happens next for AI infrastructure?

The transition to agentic AI suggests a future where infrastructure is measured by “productive work per dollar” rather than raw model size. Because agents require sustained, iterative processing, the bottleneck is shifting from memory capacity to power efficiency and interconnect speed.

NVIDIA has already moved the Vera Rubin architecture into full production. This next generation aims to meet the scaling demands of agentic AI, where the goal is to run thousands of simultaneous, autonomous agents without an exponential increase in energy consumption.

Metric	NVIDIA Hopper (H200)	NVIDIA Blackwell (GB300 NVL72)
Workload Type	Conversational/Single-Call	Agentic/Chained-Call
Efficiency	Baseline	20x more agents per megawatt
Optimization	Standard Inference	Rack-scale MoE distribution

Frequently Asked Questions

What is the AgentPerf benchmark?
AgentPerf is a benchmark from Artificial Analysis, described as the industry’s first designed specifically for agentic AI. It measures how many simultaneous agentic tasks a platform can support, using real-world coding trajectories.

Why does power efficiency matter for AI agents?
Agents can make dozens or hundreds of LLM calls per task. Without efficiency gains, the energy cost of running a fleet of autonomous agents would be prohibitively high for enterprises.

Which model was used to test the GB300 NVL72?
The benchmark used DeepSeek V4 Pro, a large mixture-of-experts (MoE) model representative of current frontier agentic AI.