Microsoft unveils enormous supercomputer to train large AI models

By IANS Last updated May 21, 2020

Microsoft has built a new powerful supercomputer in collaboration with Artificial Intelligence (AI) startup OpenAI, making new infrastructure available in Azure to train extremely large AI models. Microsoft announced a multi-year supercomputer partnership with OpenAI in 2019, including a $1 billion investment by the tech giant.

The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five.

- Advertisement -

“Built in collaboration with and exclusively for OpenAI, the supercomputer hosted in Azure was designed specifically to train that company’s AI models,” the company announced at its virtual ‘Build 2020’ conference.

Hosted in Azure, the supercomputer also benefits from all the capabilities of a robust modern cloud infrastructure, including rapid deployment, sustainable datacenters and access to Azure services.

“This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now,” explained Microsoft Chief Technical Officer Kevin Scott.

As part of its ‘AI at Scale’ initiative, Microsoft has developed its own family of large AI models, the Microsoft Turing models, which it has used to improve many different language understanding tasks across Bing, Office, Dynamics and other productivity products.

- Advertisement -

Earlier this year, it also released to researchers the largest publicly available AI language model in the world, the Microsoft Turing model for natural language generation. The goal, Microsoft said, is to make its large AI models, training optimization tools and supercomputing resources available through Azure AI services and GitHub so developers, data scientists and business customers can easily leverage the power of AI at Scale.

“As we’ve learned more and more about what we need and the different limits of all the components that make up a supercomputer, we were really able to say, ‘If we could design our dream system, what would it look like?’” said OpenAI CEO Sam Altman.

And then Microsoft was able to build it.

“We are seeing that larger-scale systems are an important component in training more powerful models,” Altman added.

Microsoft has also unveiled a new version of DeepSpeed, an open source deep learning library for PyTorch that reduces the amount of computing power needed for large distributed model training.

The update is significantly more efficient than the version released just three months ago and now allows people to train models more than 15 times larger and 10 times faster than they could without DeepSpeed on the same infrastructure.

If you have an interesting article / experience / case study to share, please get in touch with us at [email protected]