Authors: Nigel Nelson, Lukas Zbinden, Mostafa Toloui, Sean Huver
Healthcare AI has so far centered on perception: models that interpret signals to classify or segment pathologies and anatomy. But healthcare is fundamentally about doing, and static, perception-only datasets fall short because they capture no embodiment, contact dynamics, or closed-loop control. The field needs standardized robotic bodies, synchronized vision–force–kinematics data, sim-to-real pairing, and cross-embodiment benchmarks to build a solid foundation for Physical AI.
1. Open-H-Embodiment
Open-H-Embodiment is a collaborative, community-driven dataset initiative that aims to create the shared foundation needed to train and evaluate AI autonomy and world foundation models for surgical robotics and ultrasound. Led by a steering committee that includes Prof. Axel Krieger of Johns Hopkins, Prof. Nassir Navab of the Technical University of Munich, and Dr. Mahdi Azizian of NVIDIA, the initiative has grown to more than 35 participating organizations worldwide.
Together, these participants are building the first large-scale dataset aimed at advancing Physical AI in healthcare robotics.
Participants
Notable participants include:
- Balgrist
- CMR Surgical
- The Chinese University of Hong Kong
- Great Bay University
- Hong Kong Baptist University
- Hamlyn
- ImFusion
- Johns Hopkins University
- Leeds University
- Mohamed bin Zayed University of Artificial Intelligence
- Moon Surgical
- NVIDIA
- Northwell Health
- Obuda University
- The Hong Kong Polytechnic University
- Qilu Hospital of Shandong University
- Rob Surgical
- Sanoscience
- Surgical Data Science Collective
- Semaphor Surgical
- Stanford
- Dresden University of Technology
- Technical University of Munich
- Tuodao
- Turin
- University of British Columbia
- UC Berkeley
- UC San Diego
- University of Illinois Chicago
- University of Tennessee
- University of Texas
- Vanderbilt
- Virtual Incision
The Dataset
- Comprises 778 hours of CC-BY-4.0 healthcare robotics training data, primarily focused on surgical robotics, along with ultrasound and colonoscopy autonomy data (a minimal episode-loading sketch follows this list).
- Includes simulations, benchtop exercises (such as suturing), and actual clinical procedures.
- Utilizes both commercial robots (like CMR Surgical, Rob Surgical, and Tuodao) and research robots (including dVRK, Franka, and Kuka).
- Accompanied by the release of two new, permissively open-source models trained on this dataset.
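For teams planning to consume the data, here is a minimal sketch of loading one episode, assuming a hypothetical HDF5-style layout with time-aligned camera frames, kinematics, and actions. The file name and dataset keys are illustrative assumptions, not the published schema.

```python
# A minimal sketch of reading one episode, assuming a hypothetical HDF5 layout
# with synchronized camera frames and kinematic states. The keys
# ("observations/image", "observations/kinematics", "actions") and the file
# name are illustrative assumptions, not the released Open-H-Embodiment schema.
import h5py
import numpy as np

def load_episode(path: str):
    """Return synchronized (images, kinematics, actions) arrays for one episode."""
    with h5py.File(path, "r") as f:
        images = np.asarray(f["observations/image"])            # (T, H, W, 3) uint8 frames
        kinematics = np.asarray(f["observations/kinematics"])   # (T, D) joint / EEF states
        actions = np.asarray(f["actions"])                       # (T, A) commanded actions
    assert len(images) == len(kinematics) == len(actions), "streams must be time-aligned"
    return images, kinematics, actions

if __name__ == "__main__":
    imgs, kin, act = load_episode("suturing_episode_0001.hdf5")  # hypothetical file name
    print(f"{len(imgs)} synchronized timesteps, action dim = {act.shape[-1]}")
```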
2. GR00T-H: Vision Language Action Model for Surgical Robotics
One of the key outcomes of this initiative is GR00T-H, a derivative of NVIDIA's Isaac GR00T N series of Vision-Language-Action (VLA) models. Trained on approximately 600 hours of Open-H-Embodiment data, GR00T-H is the first policy model tailored to surgical robotics tasks.
Leveraging NVIDIA's open-source ecosystem, GR00T-H uses Cosmos Reason 2 2B as its Vision-Language Model (VLM) backbone.
Architectural Design Choices
Surgical robotics demands high precision, and specialized hardware such as cable-driven systems complicates imitation learning (IL). To address this, GR00T-H incorporates four pivotal design choices (a sketch illustrating these choices appears after the list):
- Unique Embodiment Projectors: A distinct, learnable MLP maps each robot’s specific kinematics to a uniform, normalized action space.
- State Dropout (100%): Proprioceptive input is fully dropped, leaving a learned per-embodiment bias term in its place, which improves real-world results.
- Relative EEF Actions: Training employs a common relative End-Effector (EEF) action space to mitigate kinematic inconsistencies.
- Metadata in Task Prompts: Directly injects instrument names and control index mapping into the VLM task prompt.
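To make these choices concrete, the sketch below shows a per-embodiment projector, full state dropout, relative EEF targets, and metadata-bearing prompts in PyTorch. Every dimension, module name, and the prompt template here is an illustrative assumption, not the actual GR00T-H implementation.

```python
# A minimal PyTorch sketch of the design choices listed above. Sizes, names,
# and the prompt template are illustrative assumptions only.
import torch
import torch.nn as nn

SHARED_ACTION_DIM = 32  # assumed size of the normalized, embodiment-agnostic action space

class EmbodimentProjector(nn.Module):
    """One learnable MLP per robot, mapping its native action space into the shared space."""
    def __init__(self, native_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(native_dim, hidden), nn.GELU(),
            nn.Linear(hidden, SHARED_ACTION_DIM),
        )
    def forward(self, native_action: torch.Tensor) -> torch.Tensor:
        return self.mlp(native_action)

class StateStub(nn.Module):
    """100% state dropout: proprioception is never read; a learned per-robot bias stands in."""
    def __init__(self, dim: int = SHARED_ACTION_DIM):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim))
    def forward(self, batch_size: int) -> torch.Tensor:
        return self.bias.expand(batch_size, -1)

def relative_eef_targets(eef_positions: torch.Tensor) -> torch.Tensor:
    """Convert absolute EEF positions (T, 3) into per-step deltas, i.e. relative action targets."""
    return eef_positions[1:] - eef_positions[:-1]

def build_prompt(task: str, instruments: list[str]) -> str:
    """Inject instrument names and a control-index mapping directly into the VLM task prompt."""
    mapping = ", ".join(f"{i}: {name}" for i, name in enumerate(instruments))
    return f"Task: {task}. Instruments (control index: name): {mapping}."

# Example wiring for one embodiment (an assumed 7-D native action space).
projector = EmbodimentProjector(native_dim=7)
state = StateStub()
print(build_prompt("throw a single suture", ["large needle driver", "forceps"]))
```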
A prototype of GR00T-H has executed a complete, end-to-end suture on the SutureBot benchmark, demonstrating robust long-horizon dexterity.

3. Cosmos-H-Surgical-Simulator
The second model released through this initiative is Cosmos-H-Surgical-Simulator, a World Foundation Model (WFM) for action-conditioned surgical robotics. Traditional simulators have struggled to capture real-world surgical conditions such as soft tissue, reflections, blood, and smoke.
Key Capabilities
- Overcoming the Sim-to-Real Gap: Fine-tuned from NVIDIA Cosmos Predict 2.5 2B, it generates physically plausible surgical video directly from kinematic actions.
- Efficiency Gains: 600 rollouts took only 40 minutes in simulation, versus the 2 days required on a real-world benchtop.
- WFM as a Physics Simulator: The model learns tissue deformation and tool interaction implicitly from data (see the rollout sketch after this list).
- Synthetic Data Generation: Capable of generating realistic synthetic video-action pairs to enhance underrepresented datasets.
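The rollout sketch below illustrates how an action-conditioned WFM can stand in for a physics simulator: a policy proposes kinematic actions and the world model predicts the next frames, entirely offline. The WorldModel and Policy classes are random stand-ins; the real Cosmos-H-Surgical-Simulator interface is not described in this post, so every name and signature here is an assumption.

```python
# A minimal sketch of closed-loop rollouts inside an action-conditioned world
# model, with random stand-ins for both the world model and the policy.
import numpy as np

class WorldModel:
    """Stand-in: predicts the next video frame given the current frame and an action."""
    def predict(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        noise = np.random.randint(0, 8, size=frame.shape, dtype=np.uint8)
        return np.clip(frame.astype(int) + noise - 4, 0, 255).astype(np.uint8)

class Policy:
    """Stand-in policy: emits a random kinematic action each step."""
    def act(self, frame: np.ndarray) -> np.ndarray:
        return np.random.uniform(-1.0, 1.0, size=7)  # assumed 7-D action vector

def rollout(world_model: WorldModel, policy: Policy, first_frame: np.ndarray, horizon: int = 50):
    """Run a closed-loop rollout entirely inside the world model: no hardware, no benchtop."""
    frames, actions, frame = [first_frame], [], first_frame
    for _ in range(horizon):
        action = policy.act(frame)
        frame = world_model.predict(frame, action)
        actions.append(action)
        frames.append(frame)
    return frames, actions

frames, actions = rollout(WorldModel(), Policy(), np.zeros((256, 256, 3), dtype=np.uint8))
print(f"simulated {len(actions)} action-conditioned steps")
```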
Fine-Tuning Details
The model was fine-tuned on the Open-H-Embodiment dataset (9 robot embodiments across 32 datasets) using 64 A100 GPUs for approximately 10,000 GPU-hours, with a unified 44-dimensional action space.
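To illustrate what a unified action space implies in practice, here is a minimal pad-and-mask sketch that packs heterogeneous native actions into a single 44-dimensional vector. The actual per-embodiment slot assignment used during fine-tuning is not specified here, so this layout is purely illustrative.

```python
# A minimal sketch of packing heterogeneous robot actions into one unified
# 44-dimensional vector. The pad-and-mask layout is an assumption, not the
# slot assignment used for the actual fine-tuning run.
import numpy as np

UNIFIED_DIM = 44

def pack_action(native_action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad a robot's native action into the unified space and return a validity mask."""
    if native_action.shape[0] > UNIFIED_DIM:
        raise ValueError("native action exceeds the unified action space")
    packed = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    packed[: native_action.shape[0]] = native_action
    mask[: native_action.shape[0]] = True
    return packed, mask

# Example: a 7-DoF arm and a 13-D dual-arm setup share the same 44-D interface.
for dim in (7, 13):
    packed, mask = pack_action(np.random.uniform(-1, 1, size=dim).astype(np.float32))
    print(dim, "->", packed.shape, "valid dims:", int(mask.sum()))
```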
4. What is Next: Towards Reasoning For Surgical Robotics
Looking ahead, the goal for version 2 of the Open-H-Embodiment initiative is to move beyond perceptual control toward reasoning-capable autonomy, a leap akin to a ChatGPT moment for surgical robotics, in which systems can explain, plan, and adapt throughout long procedures. Achieving this requires extending Open-H-Embodiment into reasoning-ready data: annotated task traces that capture intents, outcomes, and failure modes (one possible trace schema is sketched below). This effort depends on community engagement, and we invite you to participate. For more details, visit our Open-H GitHub Repo to help shape the future of healthcare robotics.
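As a rough illustration, the sketch below shows one possible shape for such a task trace, recording intent, timing, outcome, and failure mode per sub-task. The field names and structure are assumptions, not a finalized Open-H schema.

```python
# A minimal sketch of a hypothetical "reasoning-ready" task trace annotation.
# All field names and values are illustrative assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SubTaskTrace:
    intent: str                      # what the system was trying to accomplish
    start_s: float                   # sub-task start time within the episode (seconds)
    end_s: float                     # sub-task end time (seconds)
    outcome: str                     # e.g. "success", "partial", "failure"
    failure_mode: str | None = None  # populated only when the outcome is a failure

@dataclass
class EpisodeTrace:
    procedure: str
    subtasks: list[SubTaskTrace] = field(default_factory=list)

trace = EpisodeTrace(
    procedure="benchtop suturing",
    subtasks=[
        SubTaskTrace("grasp needle", 0.0, 4.2, "success"),
        SubTaskTrace("drive needle through tissue", 4.2, 11.8, "failure", "needle slip"),
    ],
)
print(json.dumps(asdict(trace), indent=2))
```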
5. Get started today
Ready to dive in? Visit the Open-H GitHub Repo to start working with the Open-H-Embodiment dataset and models.