Azure-based 'AI supercomputer' to use Nvidia GPUs and software

Is it a super or a cloud service? Users would string together components as required


Microsoft and Nvidia say they are teaming up to build an "AI supercomputer" using Azure infrastructure combined with Nvidia's GPU accelerators, network kit, and its software stack.

The target market will be enterprises looking to train and deploy large state-of-the-art AI models at scale.

According to Nvidia, the multi-year project will aim to deliver "one of the most powerful AI supercomputers in the world," and will also make Azure the first public cloud to incorporate Nvidia's AI software stack.

To build this system, the pair will use Azure's GPU-based ND-series and NC-series virtual machines, but the project will involve bringing "tens of thousands" of Nvidia's A100 GPUs and its latest H100 GPUs to the platform.
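For a sense of what that looks like from the customer side, here is a minimal, hypothetical sketch of requesting one of Azure's existing GPU-backed ND A100 v4 instances with the Azure Python SDK (azure-mgmt-compute). The subscription, resource group, image, and network interface values are placeholders, and the Standard_ND96asr_v4 size shown is the current eight-A100 flavor rather than anything H100-specific.

```python
# Minimal, hypothetical sketch: requesting a single GPU-backed ND-series VM
# via the Azure Python SDK. Subscription, resource group, image and NIC
# values are placeholders; use SSH keys rather than passwords in practice.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, subscription_id="<subscription-id>")

poller = compute.virtual_machines.begin_create_or_update(
    resource_group_name="my-ai-rg",        # placeholder resource group
    vm_name="nd-a100-node-0",
    parameters={
        "location": "eastus",
        "hardware_profile": {
            # ND A100 v4 size: eight A100 GPUs per VM. An H100-backed size
            # would be selected the same way once available.
            "vm_size": "Standard_ND96asr_v4",
        },
        "storage_profile": {
            "image_reference": {               # placeholder Linux HPC image
                "publisher": "microsoft-dsvm",
                "offer": "ubuntu-hpc",
                "sku": "2004",
                "version": "latest",
            },
        },
        "os_profile": {
            "computer_name": "nd-a100-node-0",
            "admin_username": "azureuser",
            "admin_password": "<password-or-ssh-key>",
        },
        "network_profile": {
            "network_interfaces": [
                {"id": "/subscriptions/.../networkInterfaces/my-nic"},  # placeholder NIC
            ],
        },
    },
)
vm = poller.result()
print(vm.name, vm.provisioning_state)
```

Scaling out would then be a matter of repeating the request, or using VM scale sets, across however many nodes a job needs.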

Nvidia claims that Microsoft's Azure is the first public cloud to incorporate its Quantum-2 InfiniBand networking switches for its AI-optimized virtual machines. The current Azure instances feature 200Gb/s Quantum InfiniBand and A100 GPUs, but future ones will have 400Gb/s Quantum-2 InfiniBand and the newer H100 GPUs.
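Jobs that span many of those instances typically communicate through Nvidia's NCCL library, which rides on that InfiniBand fabric. Below is a minimal sketch, assuming PyTorch with the NCCL backend and a launcher such as torchrun setting the usual RANK, WORLD_SIZE, and LOCAL_RANK environment variables; it simply runs an all-reduce across every GPU in the job as a sanity check.

```python
# Illustrative sketch: multi-node, multi-GPU communication as used on
# InfiniBand-connected instances. Assumes PyTorch with the NCCL backend and
# a launcher (e.g. torchrun) that sets RANK, WORLD_SIZE and LOCAL_RANK.
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # NCCL discovers and uses the InfiniBand fabric automatically when
    # present; nothing InfiniBand-specific is needed in the job script.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


if __name__ == "__main__":
    local_rank = init_distributed()
    # All-reduce a buffer across every GPU in the job as a sanity check.
    tensor = torch.ones(1024 * 1024, device=f"cuda:{local_rank}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        print(f"all-reduce completed across {dist.get_world_size()} GPUs")
    dist.destroy_process_group()
```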

The two companies offered no time frame for the project to become operational, and it appears that this so-called AI supercomputer may not actually exist as a single, discrete system.

Reading between the lines, it appears that Microsoft is simply adding to Azure a bunch of AI-optimized instances with Nvidia's latest hardware and software, which customers will string together as required for their specific project.

We asked Nvidia for clarification, and a spokesperson told us: "All new capabilities are in Azure instances, but the setup is such that enterprises will be able to scale those instances all the way up to supercomputing status." So that's an AI cloud service, then.

Nvidia's spokesperson added: "Customers can acquire resources as they normally would with a real supercomputer. For both you have a software layer that reserves resources. Just now it is in the cloud and not on a dedicated supercomputer. The most important thing is scalability. The resource on Azure can be scaled up to supercomputer standards with the same AI software, network capability and compute nodes."

The partnership will also see Nvidia use Azure resources to conduct research into generative AI models, which the company describes as a "rapidly emerging area" of AI in which foundation models such as Megatron-Turing NLG 530B are the basis for new algorithms capable of synthesizing text, computer code, images, and even video.

The model in question was developed by a team from Microsoft and Nvidia, using Nvidia's Megatron-LM transformer model and Microsoft's DeepSpeed deep learning optimization library, the latter of which the pair will also work on optimizing.
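For illustration, here is a minimal, hypothetical sketch of the DeepSpeed side of that pairing: a toy model wrapped with deepspeed.initialize and a ZeRO configuration. The model and settings here are ours for demonstration and bear no relation to the actual Megatron-Turing NLG 530B training setup.

```python
# Minimal, hypothetical sketch: wrapping a model with DeepSpeed. The toy
# model and ZeRO settings are illustrative only and are not the
# Megatron-Turing NLG 530B configuration.
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    # ZeRO partitions optimizer state (and, at stage 2, gradients) across
    # data-parallel ranks so larger models fit in GPU memory.
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# Returns a DeepSpeed engine that handles loss scaling, gradient
# accumulation and the optimizer step.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

inputs = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
loss = engine(inputs).float().pow(2).mean()  # dummy loss for illustration
engine.backward(loss)
engine.step()
```

In practice a script like this would be launched with DeepSpeed's own launcher (or torchrun/MPI) across however many GPU instances the job requires.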

Scott Guthrie, Microsoft EVP for its Cloud + AI Group, said that AI will fuel the next wave of automation in enterprises, enabling organizations to do more with less.

"Our collaboration with Nvidia unlocks the world's most scalable supercomputer platform, which delivers state-of-the-art AI capabilities for every enterprise on Microsoft Azure."

Nvidia's VP of enterprise computing, Manuvir Das, said the partnership aimed to provide researchers and companies with infrastructure and software to exploit AI.

"The breakthrough of foundation models has triggered a tidal wave of research, fostered new startups and enabled new enterprise applications," he said.

To this end, the AI supercomputer/cloud service will support a broad range of applications and services, including Microsoft DeepSpeed and Nvidia's AI Enterprise software suite. The latter is already certified and supported on Azure instances with A100 GPUs, while support for Azure instances with the newer H100 GPUs will be added in a future software release, Nvidia said. ®
