Figure: NVIDIA Isaac GR00T N1 used in object manipulation.
At its annual GTC conference, NVIDIA announced three open-source releases for physical AI: a new World Foundation Model (WFM) release called Cosmos Transfer, a Physical AI Dataset, and NVIDIA Isaac GR00T N1, the first open model for general humanoid reasoning. Together, these resources give developers models and data for robotics and autonomous vehicle development.
New World Foundation Model – Cosmos Transfer
Cosmos Transfer is the latest addition to NVIDIA’s Cosmos™ world foundation models (WFMs). The 7-billion-parameter model generates virtual world scenes under the control of several inputs at once ("multicontrols"). It produces high-fidelity output from various structural inputs while preserving precise spatial alignment and scene composition.
How it Works
Cosmos Transfer trains a separate ControlNet for each sensor modality used to capture the simulated world.
Input types include:
- 3D bounding box maps
- Trajectory maps
- Depth maps
- Segmentation maps
During inference, developers can guide the model’s output with a variety of structured visual or geometric data, such as edge maps, human motion keypoints, LiDAR scans, and HD maps. The control signals from each branch are weighted by adaptive spatiotemporal control maps and integrated into the transformer blocks of the base model. The result is a photorealistic video sequence that keeps the controlled layout and object placements.
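The way several control branches can be merged through per-location weights can be sketched in a few lines of plain Python. This is only an illustration of the idea, not NVIDIA's implementation: the function name `blend_controls`, the single-channel grids, and the rule that weights are rescaled when they sum to more than 1 at a location are all assumptions made for this example.

```python
from typing import Dict, List

# One control signal: a [time][height][width] grid of floats.
# (Real features have many channels; one channel keeps the sketch short.)
Grid = List[List[List[float]]]


def blend_controls(signals: Dict[str, Grid], weights: Dict[str, Grid]) -> Grid:
    """Weighted per-location sum of control-branch outputs.

    Assumption for this sketch: where the weights at a location sum to
    more than 1, they are rescaled to sum to 1; otherwise they are used
    unchanged.
    """
    names = list(signals)
    first = signals[names[0]]
    frames, height, width = len(first), len(first[0]), len(first[0][0])
    blended = [[[0.0] * width for _ in range(height)] for _ in range(frames)]
    for t in range(frames):
        for y in range(height):
            for x in range(width):
                total = sum(weights[n][t][y][x] for n in names)
                scale = 1.0 / total if total > 1.0 else 1.0
                blended[t][y][x] = sum(
                    scale * weights[n][t][y][x] * signals[n][t][y][x]
                    for n in names
                )
    return blended


# Hypothetical example: one frame, one row, two pixels, two control branches.
signals = {"depth": [[[2.0, 2.0]]], "segmentation": [[[4.0, 4.0]]]}
weights = {"depth": [[[1.0, 0.25]]], "segmentation": [[[1.0, 0.25]]]}
blended = blend_controls(signals, weights)
```

At the first pixel both weights are 1.0, so they are rescaled to 0.5 each and the branches are averaged. At the second pixel the weights sum to 0.5 and are applied as given, so both branches contribute only weakly there. Varying the weights over space and time is what lets one modality dominate in some regions or frames and another elsewhere.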
The Cosmos Transfer model is particularly effective for generating synthetic data tailored for robotics and autonomous vehicle development, especially when paired with the NVIDIA Omniverse platform. Developers can explore numerous examples on GitHub, including samples specifically designed for autonomous vehicles.
Open Physical AI Dataset
In addition to Cosmos Transfer, NVIDIA has launched the Physical AI Dataset, an open-source resource available on Hugging Face. The dataset contains 15 terabytes of data, with more than 320,000 trajectories for robotics training and up to 1,000 Universal Scene Description (OpenUSD) assets, including a collection ready for simulation.
The dataset is particularly useful for developers who are post-training foundation models such as Cosmos Predict, because it supplies the high-quality, diverse data that post-training depends on. It is commercial-grade and pre-validated, so developers can use it for rigorous training and testing.
Purpose-Built Model for Humanoids – NVIDIA Isaac GR00T N1
NVIDIA Isaac GR00T N1 is the first open foundation model designed specifically for generalized reasoning and skills in humanoid robots. It takes multimodal inputs, including language and images, and performs manipulation tasks across various environments. The Isaac GR00T-N1-2B model is available on Hugging Face.
Trained on a vast humanoid dataset that combines real-world captured data, synthetic data generated from the NVIDIA Isaac GR00T Blueprint, and internet-scale video data, Isaac GR00T N1 is adaptable for specific embodiments, tasks, and environments. This versatility is achieved using a single model and set of weights, enabling it to perform complex manipulation behaviors on different humanoid robots like the Fourier GR-1 and 1X Neo.
The model generalizes across a range of tasks, from grasping and manipulating objects to multi-step tasks that require sustained contextual understanding.
Dual-System Architecture
NVIDIA Isaac GR00T N1 features a dual-system architecture inspired by human cognitive processes, comprising:
- Vision-Language Model (System 2): Based on NVIDIA-Eagle with SmolLM-1.7B, this model interprets environmental cues through vision and language, allowing robots to reason and plan actions accordingly.
- Diffusion Transformer (System 1): This action model translates the planned actions from System 2 into precise movements, ensuring smooth and continuous robot operation.
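The hand-off between the two systems can be illustrated with a toy example. This sketch contains no learned networks and is not the GR00T N1 code: `system2_plan` and `system1_act` are hypothetical stand-ins, the conditioning vector is built from simple statistics, and the "denoiser" just pulls a noisy action chunk toward a target derived from that vector. It only shows the shape of the loop, in which a planner produces conditioning and an iterative refinement process turns noise into an action sequence.

```python
import random
from typing import List


def system2_plan(instruction: str, image: List[float]) -> List[float]:
    """Stand-in for the vision-language model (System 2).

    A real VLM would return learned embeddings; this toy version returns
    two hand-made features so the example stays self-contained.
    """
    return [len(instruction.split()) / 10.0, sum(image) / len(image)]


def system1_act(condition: List[float], horizon: int = 4,
                steps: int = 8, seed: int = 0) -> List[float]:
    """Stand-in for the diffusion transformer (System 1).

    Starts from Gaussian noise and refines an action chunk over several
    steps, conditioned on System 2's output.
    """
    rng = random.Random(seed)
    target = sum(condition) / len(condition)  # toy conditioning target
    actions = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap each step; the last step reaches the target.
        actions = [a + (target - a) / remaining for a in actions]
    return actions


# Hypothetical inputs: a text instruction and a flattened "image".
condition = system2_plan("pick up the red cube", [0.2, 0.4, 0.6])
actions = system1_act(condition)
```

In this toy version every action in the chunk converges to the same value, which a real action model would not do. The point is only the division of labor: System 2 is called once per instruction and observation, and System 1 runs a multi-step refinement to produce a chunk of actions from that output.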
Path Forward
Post-training is the step that specializes these foundation models for downstream physical AI tasks and refines autonomous systems. Developers can explore the Cosmos Predict and Cosmos Transfer inference scripts on GitHub, along with the research papers that describe the models.
The NVIDIA Isaac GR00T-N1-2B model is available on Hugging Face, together with sample datasets and PyTorch scripts for post-training on custom user datasets in the Hugging Face LeRobot format. The accompanying research paper describes the Isaac GR00T N1 model in detail.
To keep up with NVIDIA’s latest releases for physical AI, follow NVIDIA on Hugging Face.

