Introducing Holotron-12B: The Future of Multimodal Computing
We’re excited to announce the release of Holotron-12B, a cutting-edge multimodal computer-use model developed by H Company. Built by further training NVIDIA’s open Nemotron-Nano-2 VL model on our proprietary data mixture, Holotron-12B marks a significant milestone in our collaborative research efforts. Our team focused on creating a model optimized for the scalability and performance demands of production environments.
As a proud member of the NVIDIA Inception Program, H Company is committed to pushing the boundaries of technology, and Holotron-12B exemplifies this dedication.
Why Holotron-12B? A Unique Approach to Multimodal Modeling
Unlike most multimodal models, which primarily enhance static vision understanding or follow simple instructions, Holotron-12B is fundamentally different. Like our Holo2 model, its primary objective is to serve as a policy model for computer-use agents: a model that can accurately perceive, make decisions, and act in interactive environments.
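To make the policy-model framing concrete, here is a minimal sketch of one perceive-decide-act iteration, assuming an OpenAI-compatible endpoint serving the model. The served-model name and the screenshot/parse/execute helpers are hypothetical stand-ins, not part of any published H Company API.

```python
import base64
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) hosting the model;
# the helpers below are hypothetical stand-ins for real agent plumbing.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def screenshot() -> bytes:
    """Hypothetical: capture the current screen as PNG bytes."""
    raise NotImplementedError

def parse_action(text: str) -> dict:
    """Hypothetical: parse the model's reply into a structured action."""
    raise NotImplementedError

def execute(action: dict) -> None:
    """Hypothetical: dispatch the action (click, type, scroll, ...)."""
    raise NotImplementedError

def agent_step(history: list, task: str) -> dict:
    """One perceive -> decide -> act iteration of a computer-use agent."""
    image_b64 = base64.b64encode(screenshot()).decode()          # perceive
    response = client.chat.completions.create(
        model="holotron-12b",  # hypothetical served-model name
        messages=history + [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}. What is the next action?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    action = parse_action(response.choices[0].message.content)   # decide
    execute(action)                                              # act
    return action
```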
Our intention with Holotron-12B was to develop a model that manages long contexts with multiple images while still achieving high performance on agent benchmarks. Building on the robust foundation of the NVIDIA Nemotron model, Holotron-12B showcases the remarkable capabilities achieved through additional training and fine-tuning.
High Throughput Inference with Hybrid SSM Architecture
The significant advancement in Holotron-12B’s inference efficiency can be attributed to its underlying Nemotron architecture, which employs a hybrid State-Space Model (SSM) in combination with an attention mechanism. Unlike traditional transformer-based models, the SSM design excels in high-throughput performance, particularly for agentic tasks that involve extended context and multiple high-resolution images.
This design dramatically reduces the memory footprint at inference time. While conventional attention must store Key (K) and Value (V) activations for every token at every layer, an SSM operates as a linear recurrent model: it maintains only a constant-size state per layer for each generated sequence, irrespective of sequence length.
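To see why this matters at agent-scale context lengths, here is a back-of-the-envelope comparison. All layer counts and dimensions below are illustrative assumptions, not Nemotron’s published configuration.

```python
# Back-of-the-envelope decode-time memory per sequence. A KV cache grows
# linearly with context length; an SSM layer keeps a fixed-size state.
# All architecture numbers here are illustrative assumptions.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors stored for every token at every attention layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(layers, state_size, bytes_per_elem=2):
    # One constant-size recurrent state per layer, regardless of seq_len.
    return layers * state_size * bytes_per_elem

ctx = 128_000  # a long multimodal agent trajectory
print(f"attention KV cache: {kv_cache_bytes(40, 8, 128, ctx) / 1e9:.1f} GB")
print(f"SSM state:          {ssm_state_bytes(40, 256 * 128) / 1e6:.1f} MB")
```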
In evaluations on the WebVoyager benchmark, Holotron-12B showed outstanding performance under real-world multimodal workloads, handling long contexts and high request concurrency with remarkable efficiency. Running on a single H100 GPU with vLLM v0.14.1 and its latest SSM optimizations, Holotron-12B more than doubled the throughput of its predecessor, Holo2-8B.
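For context, a minimal offline-inference sketch with vLLM is shown below. The repository id, context length, and flags are assumptions rather than the published serving configuration; check the model card for the actual name and recommended settings.

```python
from vllm import LLM, SamplingParams

# "Hcompany/Holotron-12B" is a placeholder repo id; verify it against the
# actual Hugging Face listing before use.
llm = LLM(
    model="Hcompany/Holotron-12B",
    trust_remote_code=True,   # hybrid SSM checkpoints may ship custom code
    max_model_len=32768,      # room for long multi-image agent trajectories
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Plan the next browser action for: book a flight."], params)
print(out[0].outputs[0].text)
```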
Controlled experiments showed that Holotron-12B scales effectively as concurrency increases, reaching a total token throughput of 8.9k tokens/s at a maximum concurrency of 100. In stark contrast, Holo2-8B plateaued at 5.1k tokens/s, underscoring Holotron-12B’s smaller memory footprint and more efficient VRAM utilization, which allow larger effective batch sizes without compromising throughput.
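A concurrency sweep of this kind can be approximated with a simple asynchronous client harness. The sketch below targets an OpenAI-compatible vLLM endpoint and counts generated tokens only; the endpoint, model name, and prompt are placeholders, and this is not H Company’s benchmark harness.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder endpoint and served-model name.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="holotron-12b",  # hypothetical served-model name
        messages=[{"role": "user", "content": "Summarize this page."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def tokens_per_second(concurrency: int) -> float:
    # Fire `concurrency` requests at once and divide tokens by wall time.
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    return sum(tokens) / (time.perf_counter() - start)

print(asyncio.run(tokens_per_second(100)), "generated tokens/s")
```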
Training and Evaluating Holotron-12B
Holotron-12B’s training followed a two-stage process. We began with Nemotron-Nano-12B-v2-VL-BF16, a multimodal base model published by NVIDIA. We then performed supervised fine-tuning on H Company’s proprietary localization and navigation dataset, focusing on screen understanding, grounding, and UI-level interactions. In total, the final model was trained on roughly 14 billion tokens.
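For orientation, here is a minimal sketch of the starting point for the fine-tuning stage, assuming the base checkpoint is published under the name used above. The repo id and Auto classes are assumptions to verify against the model card.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Repo id mirrors the checkpoint named above; verify it against the actual
# Hugging Face listing, and note that the right Auto class can vary with
# the checkpoint's custom code.
base_id = "nvidia/Nemotron-Nano-12B-v2-VL-BF16"
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Supervised fine-tuning on the localization/navigation mixture would start
# here; H Company's data is proprietary and not reproduced in this post.
```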
Exceptional Performance on Agent Benchmarks
On computer-use and navigation benchmarks, Holotron-12B outperformed the Nemotron base model and showcased strong results compared to leading agent models. Its WebVoyager performance soared from 35.1% to an impressive 80.5%, surpassing the results achieved by Holo2-8B and validating the model’s capabilities in agentic environments.
Enhancements in Localization Benchmarks
Holotron-12B has also shown remarkable improvements in localization and grounding benchmarks, such as OS-World-G, GroundUI, and WebClick, indicating its robust understanding of spatial contexts and UI interactions.
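As a rough illustration of what a grounding query looks like, the sketch below asks the model for a click point on a screenshot. The prompt wording and any coordinate output format are assumptions; consult the model card for the schema the model was actually trained to emit.

```python
import base64
from openai import OpenAI

# Assumes a local OpenAI-compatible server hosting the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="holotron-12b",  # hypothetical served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": 'Return the (x, y) pixel to click for the "Sign in" button.'},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # assumed to contain a point, e.g. (412, 87)
```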
Holotron-12B illustrates how the NVIDIA Nemotron VL model can serve as a solid basis for real-world multimodal agents when paired with the appropriate training infrastructure. It delivers strong agent performance, greatly improved inference throughput, and a clear path for future work, particularly on high-resolution vision training.
We are eager to see how developers and organizations leverage Holotron-12B to build innovative applications. The model is available now on Hugging Face under the NVIDIA Open Model License, and we encourage everyone to explore its capabilities.
In a noteworthy related announcement, NVIDIA has unveiled Nemotron 3 Omni. By harnessing the advanced hybrid SSM-Attention and MoE architectures of the Nemotron 3 family, we anticipate future models that build on Holotron-12B with even greater reasoning capabilities and multimodal accuracy, setting the stage for expansive commercial applications.