Authors: Nigel Nelson, Lukas Zbinden, Mostafa Toloui, Sean Huver
Healthcare AI has so far centered on perception: models that interpret signals to classify or segment pathologies and anatomy. But healthcare is fundamentally about doing, and static, perception-only datasets fall short because they capture no embodiment, contact dynamics, or closed-loop control. The field needs standardized robotic bodies, synchronized vision–force–kinematics data, sim-to-real pairing, and cross-embodiment benchmarks to build a solid foundation for Physical AI.
1. Open-H-Embodiment
Open-H-Embodiment is a collaborative, community-driven dataset initiative that aims to create the shared foundation needed to train and evaluate AI autonomy and world foundation models for surgical robotics and ultrasound. Led by a steering committee that includes Prof. Axel Krieger of Johns Hopkins, Prof. Nassir Navab of the Technical University of Munich, and Dr. Mahdi Azizian of NVIDIA, the initiative has grown to more than 35 participating organizations worldwide.
Together, these participants are building the first large-scale dataset aimed at advancing Physical AI in healthcare robotics.
Participants
Notable participants include:
- Balgrist
- CMR Surgical
- The Chinese University of Hong Kong
- Great Bay University
- Hong Kong Baptist University
- Hamlyn
- ImFusion
- Johns Hopkins University
- Leeds University
- Mohamed bin Zayed University of Artificial Intelligence
- Moon Surgical
- NVIDIA
- Northwell Health
- Obuda University
- The Hong Kong Polytechnic University
- Qilu Hospital of Shandong University
- Rob Surgical
- Sanoscience
- Surgical Data Science Collective
- Semaphor Surgical
- Stanford
- Dresden University of Technology
- Technical University of Munich
- Tuodao
- Turin
- University of British Columbia
- UC Berkeley
- UC San Diego
- University of Illinois Chicago
- University of Tennessee
- University of Texas
- Vanderbilt
- Virtual Incision
The Dataset
- Comprises 778 hours of CC-BY-4.0 healthcare robotics training data, primarily focused on surgical robotics, along with ultrasound and colonoscopy autonomy data (a minimal episode-loading sketch follows this list).
- Includes simulations, benchtop exercises (such as suturing), and actual clinical procedures.
- Utilizes both commercial robots (like CMR Surgical, Rob Surgical, and Tuodao) and research robots (including dVRK, Franka, and Kuka).
- Accompanied by the release of two new, permissively open-source models trained on this dataset.
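For teams planning to consume the data, here is a minimal sketch of loading one episode, assuming a hypothetical HDF5-style layout with time-aligned camera frames, kinematics, and actions. The file name and dataset keys are illustrative assumptions, not the published schema.

```python
# A minimal sketch of reading one episode, assuming a hypothetical HDF5 layout
# with synchronized camera frames and kinematic states. The keys
# ("observations/image", "observations/kinematics", "actions") and the file
# name are illustrative assumptions, not the released Open-H-Embodiment schema.
import h5py
import numpy as np

def load_episode(path: str):
    """Return synchronized (images, kinematics, actions) arrays for one episode."""
    with h5py.File(path, "r") as f:
        images = np.asarray(f["observations/image"])            # (T, H, W, 3) uint8 frames
        kinematics = np.asarray(f["observations/kinematics"])   # (T, D) joint / EEF states
        actions = np.asarray(f["actions"])                       # (T, A) commanded actions
    assert len(images) == len(kinematics) == len(actions), "streams must be time-aligned"
    return images, kinematics, actions

if __name__ == "__main__":
    imgs, kin, act = load_episode("suturing_episode_0001.hdf5")  # hypothetical file name
    print(f"{len(imgs)} synchronized timesteps, action dim = {act.shape[-1]}")
```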
2. GR00T-H: Vision Language Action Model for Surgical Robotics
One of the key outcomes of this initiative is GR00T-H, a derivative of NVIDIA's Isaac GR00T N series of Vision-Language-Action (VLA) models. Trained on approximately 600 hours of Open-H-Embodiment data, GR00T-H is the first policy model tailored to surgical robotics tasks.
Leveraging NVIDIA's open-source ecosystem, GR00T-H uses Cosmos Reason 2 2B as its Vision-Language Model (VLM) backbone.
Architectural Design Choices
Surgical robotics demands high precision, and specialized hardware such as cable-driven systems complicates imitation learning (IL). To address this, GR00T-H incorporates four pivotal design choices (a sketch illustrating these choices appears after the list):
- Unique Embodiment Projectors: A distinct, learnable MLP maps each robot’s specific kinematics to a uniform, normalized action space.
- State Dropout (100%): Proprioceptive input is fully dropped, leaving a learned per-embodiment bias term in its place, which improves real-world results.
- Relative EEF Actions: Training employs a common relative End-Effector (EEF) action space to mitigate kinematic inconsistencies.
- Metadata in Task Prompts: Directly injects instrument names and control index mapping into the VLM task prompt.
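To make these choices concrete, the sketch below shows a per-embodiment projector, full state dropout, relative EEF targets, and metadata-bearing prompts in PyTorch. Every dimension, module name, and the prompt template here is an illustrative assumption, not the actual GR00T-H implementation.

```python
# A minimal PyTorch sketch of the design choices listed above. Sizes, names,
# and the prompt template are illustrative assumptions only.
import torch
import torch.nn as nn

SHARED_ACTION_DIM = 32  # assumed size of the normalized, embodiment-agnostic action space

class EmbodimentProjector(nn.Module):
    """One learnable MLP per robot, mapping its native action space into the shared space."""
    def __init__(self, native_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(native_dim, hidden), nn.GELU(),
            nn.Linear(hidden, SHARED_ACTION_DIM),
        )
    def forward(self, native_action: torch.Tensor) -> torch.Tensor:
        return self.mlp(native_action)

class StateStub(nn.Module):
    """100% state dropout: proprioception is never read; a learned per-robot bias stands in."""
    def __init__(self, dim: int = SHARED_ACTION_DIM):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim))
    def forward(self, batch_size: int) -> torch.Tensor:
        return self.bias.expand(batch_size, -1)

def relative_eef_targets(eef_positions: torch.Tensor) -> torch.Tensor:
    """Convert absolute EEF positions (T, 3) into per-step deltas, i.e. relative action targets."""
    return eef_positions[1:] - eef_positions[:-1]

def build_prompt(task: str, instruments: list[str]) -> str:
    """Inject instrument names and a control-index mapping directly into the VLM task prompt."""
    mapping = ", ".join(f"{i}: {name}" for i, name in enumerate(instruments))
    return f"Task: {task}. Instruments (control index: name): {mapping}."

# Example wiring for one embodiment (an assumed 7-D native action space).
projector = EmbodimentProjector(native_dim=7)
state = StateStub()
print(build_prompt("throw a single suture", ["large needle driver", "forceps"]))
```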
A prototype of GR00T-H has executed a complete, end-to-end suture on the SutureBot benchmark, demonstrating robust long-horizon dexterity.

3. Cosmos-H-Surgical-Simulator
The second model released through this initiative is Cosmos-H-Surgical-Simulator, a World Foundation Model (WFM) for action-conditioned surgical robotics. Traditional simulators have struggled to capture real-world surgical conditions such as soft tissue, reflections, blood, and smoke.
Key Capabilities
- Overcoming the Sim-to-Real Gap: Fine-tuned from NVIDIA Cosmos Predict 2.5 2B, it generates physically plausible surgical video directly from kinematic actions.
- Efficiency Gains: 600 rollouts took only 40 minutes in simulation, versus the 2 days required on a real-world benchtop.
- WFM as a Physics Simulator: The model learns tissue deformation and tool interaction implicitly from data (see the rollout sketch after this list).
- Synthetic Data Generation: Capable of generating realistic synthetic video-action pairs to enhance underrepresented datasets.
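The rollout sketch below illustrates how an action-conditioned WFM can stand in for a physics simulator: a policy proposes kinematic actions and the world model predicts the next frames, entirely offline. The WorldModel and Policy classes are random stand-ins; the real Cosmos-H-Surgical-Simulator interface is not described in this post, so every name and signature here is an assumption.

```python
# A minimal sketch of closed-loop rollouts inside an action-conditioned world
# model, with random stand-ins for both the world model and the policy.
import numpy as np

class WorldModel:
    """Stand-in: predicts the next video frame given the current frame and an action."""
    def predict(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        noise = np.random.randint(0, 8, size=frame.shape, dtype=np.uint8)
        return np.clip(frame.astype(int) + noise - 4, 0, 255).astype(np.uint8)

class Policy:
    """Stand-in policy: emits a random kinematic action each step."""
    def act(self, frame: np.ndarray) -> np.ndarray:
        return np.random.uniform(-1.0, 1.0, size=7)  # assumed 7-D action vector

def rollout(world_model: WorldModel, policy: Policy, first_frame: np.ndarray, horizon: int = 50):
    """Run a closed-loop rollout entirely inside the world model: no hardware, no benchtop."""
    frames, actions, frame = [first_frame], [], first_frame
    for _ in range(horizon):
        action = policy.act(frame)
        frame = world_model.predict(frame, action)
        actions.append(action)
        frames.append(frame)
    return frames, actions

frames, actions = rollout(WorldModel(), Policy(), np.zeros((256, 256, 3), dtype=np.uint8))
print(f"simulated {len(actions)} action-conditioned steps")
```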
Fine-Tuning Details
The model was fine-tuned on the Open-H-Embodiment dataset (9 robot embodiments across 32 datasets) using 64 A100 GPUs for approximately 10,000 GPU-hours, with a unified 44-dimensional action space.
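To illustrate what a unified action space implies in practice, here is a minimal pad-and-mask sketch that packs heterogeneous native actions into a single 44-dimensional vector. The actual per-embodiment slot assignment used during fine-tuning is not specified here, so this layout is purely illustrative.

```python
# A minimal sketch of packing heterogeneous robot actions into one unified
# 44-dimensional vector. The pad-and-mask layout is an assumption, not the
# slot assignment used for the actual fine-tuning run.
import numpy as np

UNIFIED_DIM = 44

def pack_action(native_action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad a robot's native action into the unified space and return a validity mask."""
    if native_action.shape[0] > UNIFIED_DIM:
        raise ValueError("native action exceeds the unified action space")
    packed = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    packed[: native_action.shape[0]] = native_action
    mask[: native_action.shape[0]] = True
    return packed, mask

# Example: a 7-DoF arm and a 13-D dual-arm setup share the same 44-D interface.
for dim in (7, 13):
    packed, mask = pack_action(np.random.uniform(-1, 1, size=dim).astype(np.float32))
    print(dim, "->", packed.shape, "valid dims:", int(mask.sum()))
```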
4. What is Next: Towards Reasoning For Surgical Robotics
Looking ahead, the goal for version 2 of the Open-H-Embodiment initiative is to move beyond perceptual control toward reasoning-capable autonomy, a leap akin to a ChatGPT moment for surgical robotics, in which systems can explain, plan, and adapt throughout long procedures. Achieving this requires extending Open-H-Embodiment into reasoning-ready data: annotated task traces that capture intents, outcomes, and failure modes (one possible trace schema is sketched below). This effort depends on community engagement, and we invite you to participate. For more details, visit our Open-H GitHub Repo to help shape the future of healthcare robotics.
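As a rough illustration, the sketch below shows one possible shape for such a task trace, recording intent, timing, outcome, and failure mode per sub-task. The field names and structure are assumptions, not a finalized Open-H schema.

```python
# A minimal sketch of a hypothetical "reasoning-ready" task trace annotation.
# All field names and values are illustrative assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SubTaskTrace:
    intent: str                      # what the system was trying to accomplish
    start_s: float                   # sub-task start time within the episode (seconds)
    end_s: float                     # sub-task end time (seconds)
    outcome: str                     # e.g. "success", "partial", "failure"
    failure_mode: str | None = None  # populated only when the outcome is a failure

@dataclass
class EpisodeTrace:
    procedure: str
    subtasks: list[SubTaskTrace] = field(default_factory=list)

trace = EpisodeTrace(
    procedure="benchtop suturing",
    subtasks=[
        SubTaskTrace("grasp needle", 0.0, 4.2, "success"),
        SubTaskTrace("drive needle through tissue", 4.2, 11.8, "failure", "needle slip"),
    ],
)
print(json.dumps(asdict(trace), indent=2))
```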
5. Get started today
Ready to dive in? Visit the Open-H GitHub Repo to start working with the Open-H-Embodiment dataset and models.