Training Cluster as a Service: Bridging the AI Compute Gap
Making GPU Clusters Accessible
At the recent GTC Paris conference, we witnessed a breakthrough in the accessibility of GPU clusters for research organizations globally. NVIDIA and Hugging Face have joined forces to introduce Training Cluster as a Service, aimed at democratizing access to powerful GPU clusters. As the demand for advanced AI research grows, this initiative seeks to level the playing field, ensuring that even "GPU-poor" researchers can tap into the abundant GPU resources available from hyperscalers and regional cloud providers.
Rapidly expanding compute capacity is crucial to addressing the growing disparities in AI research capabilities. With Hugging Face facilitating connections between GPU providers and researchers, the path to building innovative AI models is clearer than ever.
How It Works
For organizations looking to get started, the process is straightforward: researchers can request the GPU cluster size they need at hf.co/training-cluster. The service integrates vital components from NVIDIA and Hugging Face into a comprehensive solution that includes:
- Capacity Provisioning: NVIDIA Cloud Partners supply the latest NVIDIA accelerated computing platforms, such as NVIDIA Hopper and NVIDIA GB200, all centralized through NVIDIA DGX Cloud.
- Seamless Infrastructure Access: The newly unveiled NVIDIA DGX Cloud Lepton simplifies access to essential infrastructure, facilitating the scheduling and monitoring of training runs so developers can more easily manage their workloads.
- Open Source Developer Resources: Hugging Face provides a wealth of developer resources and libraries, ensuring that even those new to AI training can hit the ground running.
Once a request for a GPU cluster is accepted, Hugging Face collaborates with NVIDIA to customize the cluster according to size, geographic location, and duration, ensuring that researchers receive tailored support.
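As a back-of-envelope illustration of the kind of sizing decision behind a cluster request, the snippet below relates node count to effective batch size in synchronous data-parallel training. The helper function and all numbers are hypothetical examples, not part of the service itself:

```python
def global_batch_size(num_nodes: int, gpus_per_node: int,
                      per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Effective samples per optimizer step in synchronous data-parallel training."""
    return num_nodes * gpus_per_node * per_gpu_batch * grad_accum_steps

# For example, 4 nodes of 8 GPUs each, a per-GPU micro-batch of 16,
# and 2 gradient-accumulation steps:
print(global_batch_size(4, 8, 16, 2))  # -> 1024
```

Working backward from a target global batch size in this way is one simple way to estimate the cluster size to request.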
Clusters at Work
Advancing Rare Genetic Disease Research with TIGEM
The Telethon Institute of Genomics and Medicine (TIGEM) is committed to unraveling the complexities of rare genetic diseases. With Training Cluster as a Service, they can efficiently harness the power of AI to predict the effects of pathogenic variants and explore novel drug repositioning strategies.
“AI offers new ways to research the causes of rare genetic diseases and to develop treatments, but our domain requires training new models. Training Cluster as a Service made it easy to procure the GPU capacity we needed, at the right time.”
— Diego di Bernardo, Coordinator of the Genomic Medicine Program at TIGEM
Advancing AI for Mathematics with Numina
Numina, a non-profit organization, is striving to create open-source AI for mathematical reasoning, and won the 2024 AIMO Progress Prize. The project continues to push boundaries, but limited computing resources have been a significant hurdle.
“With Training Cluster as a Service, we will be able to reach our goal of building open alternatives to closed-source models like DeepMind’s AlphaProof!”
— Yann Fleureau, Co-founder of Project Numina
Advancing Material Science with Mirror Physics
The startup Mirror Physics is at the forefront of developing groundbreaking AI systems for chemistry and materials science. Its collaboration with the MACE team aims to push the limits of AI and produce high-fidelity chemical models at an unprecedented scale.
“This is going to be a significant step forward for the field!”
— Sam Walton Norwood, CEO and Founder at Mirror
Powering the Diversity of AI Research
The introduction of Training Cluster as a Service heralds a new era for AI researchers worldwide. As Clément Delangue, co-founder and CEO of Hugging Face, articulates:
“Access to large-scale, high-performance compute is essential for building the next generation of AI models across every domain and language. This service will remove barriers for researchers and companies, unlocking the ability to train the most advanced models.”
Similarly, Alexis Bjorlin, vice president of DGX Cloud at NVIDIA, emphasizes the significance of integrating DGX Cloud Lepton with Hugging Face’s services:
“This collaboration makes it easier for AI researchers and organizations to scale their AI training workloads while using familiar tools on Hugging Face.”
Enabling AI Builders with NVIDIA
The collaboration between Hugging Face and NVIDIA is a pivotal step towards providing high-performance compute resources to bolster the AI community’s collective efforts. Organizations can dive in and explore the possibilities of this powerful resource at hf.co/training-cluster.
As AI technology continues to evolve, the services introduced today are set to empower researchers and developers, paving the way for future innovations that push the boundaries of artificial intelligence.

