Nvidia May Have Already Fixed Blackwell’s Cooling Issues

Earlier this week, a report from The Information said Nvidia’s Blackwell AI chips were delayed by an overheating issue that emerged when the chips were installed in server racks, but a third-party research firm claims the problem is overblown and was fixed months ago.

Blackwell, which is designed for businesses looking to build out their AI data centers, can be deployed in server racks that fit up to 72 GPUs. SemiAnalysis, a research firm that focuses on the semiconductor and AI industries, tells Business Insider that Nvidia’s suppliers reworked the server racks with “minor” changes to address the problem. According to the firm’s chief analyst, cooling may be a concern in the future, but the specific server issue in question has been fixed.

Nvidia said earlier this week that it is “working with leading cloud service providers as an integral part of our engineering team and process” regarding any potential issues and that “engineering iterations are normal and expected.”

Overheating GPUs can throttle performance and cause operational issues. A GPU’s immediate surroundings (such as the number of nearby fans, the type of case, or the rack design) directly affect its temperature. Some GPU models also run hotter on average because of their design.

But Georgia Tech Professor Bara Cola—who is also the founder of Carbice, which develops thermal computing solutions—argues that heat itself isn’t Blackwell’s biggest challenge.

“The real challenge is mechanical stress and not heat. I am confident that Nvidia will find a way to operate these chips for their customers. High-performance chips like this will always run hot, and it is just a matter of balancing how hot—smart engineers will solve this,” Cola tells PCMag via email. “But early failure happens when the interfaces cannot handle the thermal expansion stress that the heat brings. This is a hard materials science problem.”

Blackwell previously had a “design flaw” unrelated to the server overheating issue. Nvidia CEO Jensen Huang has said that flaw has since been resolved.

About Kate Irwin

Reporter

I’m a reporter for PCMag covering tech news early in the morning. Prior to joining PCMag, I was a producer and reporter at Decrypt and launched its gaming vertical, GG. I have previously written for Input, Game Rant, Dot Esports, and other places, covering a range of gaming, tech, crypto, and entertainment news.

