Introducing Nemotron 3 Super: A Day 0 Launch with SGLang
We are thrilled to announce that SGLang supports the groundbreaking NVIDIA Nemotron 3 Super on Day 0. This latest addition to the Nemotron 3 family is designed for sophisticated multi-agent interactions, enabling seamless collaboration between agents that plan, reason, and execute tasks together.
What Makes Nemotron 3 Super Stand Out?
Advanced Architecture
The Nemotron 3 Super employs a Mixture of Experts (MoE) structure combined with a hybrid Transformer-Mamba architecture. This design is engineered for efficiency, achieving up to 5x higher throughput than previous models such as Llama Nemotron Super 1.5. Additionally, its Multi-Token Prediction (MTP) capability lets the model propose multiple tokens per forward pass, dramatically speeding up long-form text generation.
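To see why predicting several tokens per step pays off, here is a toy sketch of draft-then-verify acceptance, the mechanism behind MTP-style speedups. This is an illustration only, not SGLang's or Nemotron's actual implementation; `verify` stands in for a single base-model check:

```python
def accept_longest_prefix(draft, verify):
    """Keep the longest prefix of drafted tokens that the verifier agrees with.

    `draft` is a list of tokens proposed in one multi-token step;
    `verify(prefix, tok)` stands in for one base-model check. Every accepted
    token beyond the first is a sequential decoding step saved.
    """
    accepted = []
    for tok in draft:
        if verify(accepted, tok):
            accepted.append(tok)
        else:
            break  # first mismatch invalidates the rest of the draft
    return accepted


# Toy verifier: the "base model" wants to produce exactly this sequence.
target = ["The", "quick", "brown", "fox"]
verify = lambda prefix, tok: len(prefix) < len(target) and target[len(prefix)] == tok

# A draft whose third token is wrong: only the first two tokens are kept,
# but both came out of a single draft step instead of two sequential ones.
print(accept_longest_prefix(["The", "quick", "red", "fox"], verify))  # → ['The', 'quick']
```

In the best case (a fully correct draft) the whole draft is accepted in one verification pass, which is where the long-form generation speedup comes from.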
Unmatched Accuracy
On the Artificial Analysis Intelligence Index, the Nemotron 3 Super boasts leading accuracy metrics within its size category. It achieves up to 2x higher accuracy than its predecessor through its innovative latent MoE feature, which enables the model to utilize four experts for the inference cost of just one.
Optimized Model Specifications
- Parameter Count: 120B total parameters, with only 12B active parameters during each inference run.
- Context Length: Capable of handling contexts up to 1M tokens, providing a broader scope for conversation and workflow management.
- Input/Output: Simple text input with text output, making it user-friendly for various applications.
- Supported Hardware: The model efficiently runs on top-tier GPUs including B200, H100, H200, DGX Spark, and RTX 6000.
Fully Open Model
As demonstrated in our accompanying chart on the Artificial Analysis Openness Index, Nemotron 3 Super sets itself apart with its fully open framework. It offers open weights, datasets, and configuration recipes, allowing developers the freedom to customize, optimize, and deploy as per their needs, ensuring maximum privacy and security.
Installation: Getting Started with SGLang and Nemotron 3 Super
If you’re looking to integrate Nemotron 3 Super into your pipeline, the first step is installing SGLang. For detailed guidance, you can consult our comprehensive getting started cookbook.
Run the following command to install the necessary dependencies:
```bash
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
```
After installation, serving the model is straightforward. The example below is tuned for a 4x H200 setup; detailed instructions are available in our cookbooks.
```bash
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --host 0.0.0.0 \
  --port 5000 \
  --trust-remote-code \
  --tp 4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_3
```
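Loading a 120B model takes a while after the launch command returns, so it can help to poll the server before sending traffic. A minimal readiness probe against the OpenAI-compatible `/v1/models` endpoint, assuming the host/port flags above (this is a generic HTTP check, not an SGLang-specific API):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=120, poll_s=2):
    """Poll the OpenAI-compatible /v1/models endpoint until the server answers.

    `base_url` should match the launch flags, e.g. "http://localhost:5000/v1".
    Returns True once the endpoint responds with HTTP 200, False on timeout.
    """
    url = base_url.rstrip("/") + "/models"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)  # server not up yet; retry
    return False
```

Call `wait_for_server("http://localhost:5000/v1")` before the first request to avoid connection errors during model load.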
Once your server is operational, you can begin prompting the model with simple code snippets as shown below:
```python
from openai import OpenAI

SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print("Reasoning:", resp.choices[0].message.reasoning_content, "\nContent:", resp.choices[0].message.content)
```
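Since the server is launched with a tool-call parser, the model can also emit OpenAI-style tool calls. A hedged sketch of the local side of that round trip, where `get_weather` is a hypothetical stand-in for your own functions:

```python
import json

# Hypothetical local tool, used only to illustrate the round trip.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup


# Tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def dispatch(tool_call):
    """Execute the local function named by a parsed tool call, return its output."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return REGISTRY[name](**args)
```

In practice you would pass `tools=TOOLS` to `client.chat.completions.create(...)` and feed each entry of `resp.choices[0].message.tool_calls` through `dispatch` (reading `.function.name` and `.function.arguments` from the SDK objects, or converting them to dicts as above).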
The Ideal Solution for Multi-Agent Workloads
Nemotron 3 Super shines particularly in scenarios requiring multi-agent capabilities and complex reasoning workloads.
Efficiency and Performance
As illustrated in the accompanying chart, the model excels not just in accuracy but also in efficiency, making it an attractive choice for multi-agent systems. The expansive 1M-token context lets agents retain full conversation histories, improving their ability to plan and execute tasks. It is also a natural fit for Retrieval-Augmented Generation (RAG), since large document sets can be ingested in a single pass, minimizing fragmentation and reducing the risk of goal drift across multi-step workflows.
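One concrete way the long context changes RAG: documents can be packed whole instead of chunked. A rough sketch of budget-aware packing, where the characters-to-tokens ratio is a crude heuristic rather than the model's real tokenizer:

```python
def pack_documents(docs, max_tokens=1_000_000, chars_per_token=4):
    """Greedily pack whole documents into one prompt under a token budget.

    The 1M default mirrors the model's context length. Documents are kept
    whole, which is the point of a long context: no chunking, so less
    fragmentation across retrieval boundaries.
    """
    packed, used = [], 0
    for doc in docs:
        cost = len(doc) // chars_per_token + 1  # rough token estimate
        if used + cost > max_tokens:
            break  # whole-document granularity: stop rather than split
        packed.append(doc)
        used += cost
    return "\n\n".join(packed)
```

With a real deployment you would estimate cost with the model's tokenizer instead of a character ratio, but the whole-document packing strategy is the same.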
Diverse Applications
The capabilities of Nemotron 3 Super extend across a range of applications—from code generation and debugging to research summarization, alert triage, and document analysis. Its design enables users to orchestrate multiple agents efficiently within a single node, making it a versatile tool in any developer’s arsenal.
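At its simplest, single-node multi-agent orchestration is just several roles sharing one served model. An illustrative skeleton (not an SGLang API), where `llm(system_prompt, user_msg) -> str` is any chat-completion callable, such as a wrapper around the OpenAI client shown earlier:

```python
def plan_and_execute(task, llm):
    """Minimal planner/executor loop over one model endpoint.

    A planner agent drafts a step list, then an executor agent completes each
    step. Both roles share the same served model, differing only in their
    system prompts.
    """
    plan = llm("You are a planner. Reply with one step per line.", task)
    steps = [s.strip() for s in plan.splitlines() if s.strip()]
    results = [llm("You are an executor. Complete the step.", step) for step in steps]
    return steps, results
```

Richer setups add tool use, shared memory, or parallel executors, but they follow this same shape: role-specific prompts multiplexed onto a single Nemotron 3 Super endpoint.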
Get Started Today!
With Nemotron 3 Super, developers are equipped to build scalable, cost-effective multi-agent AI systems without sacrificing accuracy. Its open-source framework gives you the flexibility to tailor your deployment, whether on local infrastructure or cloud environments.
Eager to revolutionize your multi-agent AI projects? Dive into the potential of Nemotron 3 Super today!
Acknowledgments
We extend our heartfelt thanks to everyone who contributed to implementing Nemotron 3 Super into SGLang. Special thanks go to the NVIDIA team—Nirmal Kumar Juluru, Anusha Pant, Max Xu, Daniel Afrimi, Shahar Mor, Roi Koren, and Ann Guan—along with the SGLang team and community members Baizhou Zhang, Jiajun Li, Ke Bao, Lingyan Hao, and Mingyi Lu for their invaluable efforts.
Inspired by: Source

