Nvidia’s new Blackwell chips are changing how quickly artificial intelligence systems can be trained.

In the latest round of benchmarking results, released on Wednesday by MLCommons, a nonprofit group that tracks and compares the capabilities of AI chips, Nvidia's Blackwell architecture set records.

When tested with Meta's open-source Llama 3.1 405B model, one of the company's biggest and most complex AI models, training finished in just 27 minutes. The run needed only 2,496 Blackwell GPUs, a fraction of what the same job would have required on Nvidia's previous-generation Hopper chips.

Previous submissions used over three times as many Hopper GPUs to deliver equivalent performance. Per chip, Blackwell was more than twice as fast, a major jump in convergence efficiency. That kind of performance boost could translate into substantial time and cost savings for organizations training trillion-parameter models.
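To make the per-chip comparison concrete, here is a minimal Python sketch of the arithmetic. The Blackwell figures (2,496 GPUs, 27 minutes) come from the benchmark result above; the article does not report the Hopper run's duration, so the Hopper inputs below are hypothetical placeholders chosen only to illustrate the calculation.

```python
# Back-of-the-envelope estimate of Blackwell's per-chip speedup over Hopper.
# Blackwell figures come from the MLCommons result; the Hopper numbers are
# HYPOTHETICAL placeholders, since the Hopper run's duration isn't reported.

def per_chip_speedup(new_gpus, new_minutes, old_gpus, old_minutes):
    """Ratio of training work done per chip-minute, assuming both runs
    complete the same total amount of work."""
    new_rate = 1.0 / (new_gpus * new_minutes)  # normalized work per chip-minute
    old_rate = 1.0 / (old_gpus * old_minutes)
    return new_rate / old_rate

speedup = per_chip_speedup(
    new_gpus=2496, new_minutes=27,   # Blackwell run, from the article
    old_gpus=8192, old_minutes=22,   # assumed Hopper run: over 3x the chips
)
print(f"Estimated per-chip speedup: {speedup:.2f}x")  # ~2.67x with these inputs
```

Under these assumed inputs, the total chip-minutes consumed drop by a factor of roughly 2.7, consistent with the "more than twice as fast per chip" framing.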

These results are believed to be the first MLCommons benchmarks for training models at these extreme scales and provide a real-world measurement of how well chips handle the most demanding AI workloads.

CoreWeave, Nvidia drive smarter AI scaling

Not only were the results a victory for Nvidia, but they also highlighted the work of CoreWeave, the cloud infrastructure company that partnered on the tests. In a press conference, CoreWeave Chief Product Officer Chetan Kapoor pointed to a shift that increasingly makes sense across the industry: away from large, homogeneous blocks of tens of thousands of GPUs.

Rather than building a single, massive, monolithic computing system, companies are now looking at smaller, interconnected subsets that can handle enormous training jobs more efficiently and scale better.

Kapoor said that with this approach, developers can either continue scaling out or cut down the time required to train extremely large models with trillions of parameters.

The move to modular hardware deployment also looks necessary as AI models keep growing in size and complexity.
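To make the intuition behind this shift concrete, here is a toy Python model, not CoreWeave's actual methodology, in which each tightly coupled GPU group pays a synchronization cost that grows with its size. Every constant is invented for illustration.

```python
# A toy model (NOT CoreWeave's methodology) of why smaller, interconnected
# GPU subsets can outperform one monolithic block: synchronization overhead
# tends to grow with the size of a tightly coupled group, so partitioning a
# big cluster into subdomains preserves more useful throughput.
# All constants below are invented for illustration.

def effective_throughput(total_gpus: int, group_size: int,
                         comm_penalty: float = 1e-5) -> float:
    """Aggregate useful throughput after a simple linear penalty that
    grows with the size of each tightly coupled group."""
    efficiency = max(0.0, 1.0 - comm_penalty * group_size)
    return total_gpus * efficiency  # in arbitrary per-GPU units

CLUSTER = 32_768  # a hypothetical large deployment
for group in (32_768, 8_192, 2_048):
    print(f"group size {group:>6}: {effective_throughput(CLUSTER, group):,.0f} units")
```

Under these made-up numbers, carving the same 32,768 GPUs into 2,048-GPU subdomains retains roughly 98% of peak throughput, while running them as one monolithic group retains only about 67%.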

Blackwell puts Nvidia in the lead for AI model training

Though the focus lately has shifted to AI inference, in which models like ChatGPT answer user questions in real time, training is still the workhorse of AI development.

The training part gives these models their smarts, allowing them to understand language, tackle some of our most challenging problems, and even produce human-like prose. The computation is highly demanding and requires thousands of high-performance chips to operate for long periods, typically days, if not weeks or months.

That has changed with Nvidia's Blackwell architecture. By radically cutting both the number of chips and the time it takes to train gargantuan AI models, Blackwell gives Nvidia a stronger hand in a market where speed and efficiency rule the roost.

Training models such as Meta's Llama 3.1 405B, which has 405 billion parameters, has previously required huge clusters of GPUs and has been an expensive, energy-guzzling process.

Such performance gains are a significant leg up at a time when there is blistering demand for ever larger and more powerful AI models across many industries — from health care and finance to education and autonomous vehicles.

It also sends a clear message to Nvidia's rivals. Chip companies like AMD and Intel, which are developing their own AI-specific chips, now face greater pressure to keep pace.

AMD also made submissions to the MLCommons benchmark but did not show results for a model as large as Llama 3.1 405B. Nvidia was the only company to test at the high end of the benchmark, underscoring both the strength of its hardware and its willingness to take on the hardest challenges.
