Introducing Holotron-12B: The Future of Multimodal Computing
We’re excited to announce the release of Holotron-12B, a cutting-edge multimodal computer-use model developed by H Company. Built by further training NVIDIA’s open Nemotron-Nano-2 VL model on our proprietary data mixture, Holotron-12B marks a significant milestone in our collaborative research efforts. Our team focused on creating a model optimized for the scalability and performance demands of production environments.
As a proud member of the NVIDIA Inception Program, H Company is committed to pushing the boundaries of technology, and Holotron-12B exemplifies this dedication.
Why Holotron-12B? A Unique Approach to Multimodal Modeling
Unlike most multimodal models, which primarily enhance static vision understanding or follow simple instructions, Holotron-12B is fundamentally different. Like our Holo2 model, its primary objective is to serve as a policy model for computer-use agents: a model that can accurately perceive, make decisions, and act in interactive environments.
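To make the policy-model framing concrete, here is a minimal sketch of one perceive-decide-act iteration, assuming an OpenAI-compatible endpoint serving the model. The served-model name and the screenshot/parse/execute helpers are hypothetical stand-ins, not part of any published H Company API.

```python
import base64
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) hosting the model;
# the helpers below are hypothetical stand-ins for real agent plumbing.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def screenshot() -> bytes:
    """Hypothetical: capture the current screen as PNG bytes."""
    raise NotImplementedError

def parse_action(text: str) -> dict:
    """Hypothetical: parse the model's reply into a structured action."""
    raise NotImplementedError

def execute(action: dict) -> None:
    """Hypothetical: dispatch the action (click, type, scroll, ...)."""
    raise NotImplementedError

def agent_step(history: list, task: str) -> dict:
    """One perceive -> decide -> act iteration of a computer-use agent."""
    image_b64 = base64.b64encode(screenshot()).decode()          # perceive
    response = client.chat.completions.create(
        model="holotron-12b",  # hypothetical served-model name
        messages=history + [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}. What is the next action?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    action = parse_action(response.choices[0].message.content)   # decide
    execute(action)                                              # act
    return action
```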
Our intention with Holotron-12B was to develop a model that manages long contexts with multiple images while still achieving high performance on agent benchmarks. Building on the robust foundation of the NVIDIA Nemotron model, Holotron-12B showcases the remarkable capabilities achieved through additional training and fine-tuning.
High Throughput Inference with Hybrid SSM Architecture
The significant advancement in Holotron-12B’s inference efficiency can be attributed to its underlying Nemotron architecture, which employs a hybrid State-Space Model (SSM) in combination with an attention mechanism. Unlike traditional transformer-based models, the SSM design excels in high-throughput performance, particularly for agentic tasks that involve extended context and multiple high-resolution images.
This design dramatically reduces the memory footprint at inference time. While conventional attention must store Key (K) and Value (V) activations for every token at every layer, an SSM operates as a linear recurrent model: it maintains only a constant-size state per layer for each generated sequence, irrespective of sequence length.
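To see why this matters at agent-scale context lengths, here is a back-of-the-envelope comparison. All layer counts and dimensions below are illustrative assumptions, not Nemotron’s published configuration.

```python
# Back-of-the-envelope decode-time memory per sequence. A KV cache grows
# linearly with context length; an SSM layer keeps a fixed-size state.
# All architecture numbers here are illustrative assumptions.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors stored for every token at every attention layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(layers, state_size, bytes_per_elem=2):
    # One constant-size recurrent state per layer, regardless of seq_len.
    return layers * state_size * bytes_per_elem

ctx = 128_000  # a long multimodal agent trajectory
print(f"attention KV cache: {kv_cache_bytes(40, 8, 128, ctx) / 1e9:.1f} GB")
print(f"SSM state:          {ssm_state_bytes(40, 256 * 128) / 1e6:.1f} MB")
```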
In evaluations on the WebVoyager benchmark, Holotron-12B showed outstanding performance under real-world multimodal workloads, handling long contexts and high request concurrency with remarkable efficiency. Running on a single H100 GPU with vLLM v0.14.1 and its latest SSM optimizations, Holotron-12B more than doubled the throughput of its predecessor, Holo2-8B.
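For context, a minimal offline-inference sketch with vLLM is shown below. The repository id, context length, and flags are assumptions rather than the published serving configuration; check the model card for the actual name and recommended settings.

```python
from vllm import LLM, SamplingParams

# "Hcompany/Holotron-12B" is a placeholder repo id; verify it against the
# actual Hugging Face listing before use.
llm = LLM(
    model="Hcompany/Holotron-12B",
    trust_remote_code=True,   # hybrid SSM checkpoints may ship custom code
    max_model_len=32768,      # room for long multi-image agent trajectories
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Plan the next browser action for: book a flight."], params)
print(out[0].outputs[0].text)
```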
Controlled experiments showed that Holotron-12B scales effectively as concurrency increases, reaching a total token throughput of 8.9k tokens/s at a maximum concurrency of 100. In stark contrast, Holo2-8B plateaued at 5.1k tokens/s, underscoring Holotron-12B’s smaller memory footprint and more efficient VRAM utilization, which allow larger effective batch sizes without compromising throughput.
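A concurrency sweep of this kind can be approximated with a simple asynchronous client harness. The sketch below targets an OpenAI-compatible vLLM endpoint and counts generated tokens only; the endpoint, model name, and prompt are placeholders, and this is not H Company’s benchmark harness.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder endpoint and served-model name.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="holotron-12b",  # hypothetical served-model name
        messages=[{"role": "user", "content": "Summarize this page."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def tokens_per_second(concurrency: int) -> float:
    # Fire `concurrency` requests at once and divide tokens by wall time.
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    return sum(tokens) / (time.perf_counter() - start)

print(asyncio.run(tokens_per_second(100)), "generated tokens/s")
```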
Training and Evaluating Holotron-12B
Holotron-12B’s training followed a two-stage process. We began with Nemotron-Nano-12B-v2-VL-BF16, a multimodal base model published by NVIDIA. We then performed supervised fine-tuning on H Company’s proprietary localization and navigation dataset, focusing on screen understanding, grounding, and UI-level interactions. In total, the final model was trained on roughly 14 billion tokens.
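For orientation, here is a minimal sketch of the starting point for the fine-tuning stage, assuming the base checkpoint is published under the name used above. The repo id and Auto classes are assumptions to verify against the model card.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Repo id mirrors the checkpoint named above; verify it against the actual
# Hugging Face listing, and note that the right Auto class can vary with
# the checkpoint's custom code.
base_id = "nvidia/Nemotron-Nano-12B-v2-VL-BF16"
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Supervised fine-tuning on the localization/navigation mixture would start
# here; H Company's data is proprietary and not reproduced in this post.
```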
Exceptional Performance on Agent Benchmarks
On computer-use and navigation benchmarks, Holotron-12B outperformed the Nemotron base model and showcased strong results compared to leading agent models. Its WebVoyager performance soared from 35.1% to an impressive 80.5%, surpassing the results achieved by Holo2-8B and validating the model’s capabilities in agentic environments.
Enhancements in Localization Benchmarks
Holotron-12B has also shown remarkable improvements in localization and grounding benchmarks, such as OS-World-G, GroundUI, and WebClick, indicating its robust understanding of spatial contexts and UI interactions.
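As a rough illustration of what a grounding query looks like, the sketch below asks the model for a click point on a screenshot. The prompt wording and any coordinate output format are assumptions; consult the model card for the schema the model was actually trained to emit.

```python
import base64
from openai import OpenAI

# Assumes a local OpenAI-compatible server hosting the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="holotron-12b",  # hypothetical served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": 'Return the (x, y) pixel to click for the "Sign in" button.'},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # assumed to contain a point, e.g. (412, 87)
```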
Holotron-12B illustrates how the NVIDIA Nemotron VL model can serve as a solid basis for real-world multimodal agents when paired with the appropriate training infrastructure. It delivers strong agent performance, greatly improved inference throughput, and a clear path for future work, particularly on high-resolution vision training.
We are eager to see how developers and organizations leverage Holotron-12B to build innovative applications. The model is available now on Hugging Face under the NVIDIA Open Model License, and we encourage everyone to explore its capabilities.
In a noteworthy related announcement, NVIDIA has unveiled Nemotron 3 Omni. By harnessing the advanced hybrid SSM-Attention and MoE architectures of the Nemotron 3 family, we anticipate future models that build on Holotron-12B with even greater reasoning capabilities and multimodal accuracy, setting the stage for expansive commercial applications.