Exciting News: SGLang Now Supports DeepSeek-V3.2 on Day 0
We are thrilled to announce that SGLang supports DeepSeek-V3.2 on Day 0! As the latest DeepSeek tech report explains, DeepSeek-V3.2 builds on its predecessor, DeepSeek-V3.1-Terminus, by incorporating DeepSeek Sparse Attention (DSA) through continued training. This approach delivers substantial efficiency gains in both training and inference, especially in long-context scenarios.
Installation and Quick Start
Getting started with DeepSeek-V3.2 using SGLang is easy. Follow these steps to pull the container and launch the server:
For NVIDIA GPUs
```bash
docker pull lmsysorg/sglang:v0.5.3-cu129
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
```
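Once the server is running, it exposes an OpenAI-compatible HTTP API. Below is a minimal client sketch; the default port 30000 and the model name passed in the request body are assumptions here, so adjust `base_url` to match your deployment.

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=128):
    # Build an OpenAI-style chat-completions request body.
    # The model name mirrors the --model flag used at launch (assumed).
    return {
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:30000"):
    # POST to the OpenAI-compatible endpoint served by SGLang.
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

You can also point any OpenAI SDK at the same base URL instead of hand-rolling requests.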
For AMD (MI350X/MI355X)
```bash
docker pull lmsysorg/sglang:dsv32-rocm
SGLANG_NSA_FUSE_TOPK=false SGLANG_NSA_KV_CACHE_STORE_FP8=false SGLANG_NSA_USE_REAL_INDEXER=true SGLANG_NSA_USE_TILELANG_PREFILL=True python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --disable-cuda-graph --tp 8 --mem-fraction-static 0.85 --page-size 64 --nsa-prefill "tilelang" --nsa-decode "aiter"
```
For NPU
```bash
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --trust-remote-code --attention-backend ascend --mem-fraction-static 0.85 --chunked-prefill-size 32768 --disable-radix-cache --tp-size 16 --quantization w8a8_int8
```
Description of Key Features
DeepSeek Sparse Attention: Long-Context Efficiency Unlocked
At the core of DeepSeek-V3.2 lies the DeepSeek Sparse Attention (DSA) mechanism, designed for improved efficiency in processing long-context data.
DSA diverges from traditional methods by implementing:
- Lightning Indexer: An ultra-light FP8 scorer that efficiently identifies the most relevant tokens for each query.
- Top-k Token Selection: This focuses computation solely on the most impactful key-value entries.
By reducing the complexity of core attention from O(L²) to O(Lk), DeepSeek-V3.2 offers dramatic improvements in training and inference efficiency, extending to a context length of up to 128K with minimal compromise on model quality.
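The idea behind DSA can be illustrated with a toy sketch: a cheap indexer score ranks all keys, attention then runs only over the top-k of them. This is purely illustrative Python, not SGLang's kernels; the `index_scores` argument stands in for the Lightning Indexer's FP8 output, which in the real system is computed by a learned scorer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_attention(query, keys, values, index_scores, k):
    # Keep only the k keys the indexer ranks highest, rather than
    # attending over all L keys (O(L*k) instead of O(L^2) overall).
    topk = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Exact softmax attention restricted to the selected subset.
    logits = [sum(q * kk for q, kk in zip(query, keys[i])) for i in topk]
    weights = softmax(logits)
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, topk):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, topk
```

With k fixed (e.g. a few thousand tokens) the per-query cost stops growing with context length, which is where the long-context savings come from.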
Innovations Supporting DSA
To bolster the effectiveness of DSA, SGLang integrates a series of innovations including:
- Lightning Indexer Support: A dedicated `key & key_scale` cache in the memory pool, enabling ultra-fast token scoring.
- Native Sparse Attention (NSA) Backend: A new backend designed specifically for sparse workloads, including:
  - FlashMLA: A multi-query attention kernel optimized for DeepSeek.
  - FlashAttention-3 Sparse: Adapted for compatibility and maximum kernel reuse.
These enhancements ensure that DeepSeek-V3.2-Exp maintains state-of-the-art reasoning quality while significantly reducing memory overhead during deployment.
Future Work
Looking ahead, SGLang has a robust roadmap to further enhance DeepSeek capabilities, including:
- Multi-token prediction (MTP) support, which will speed up decoding, especially at smaller batch sizes.
- FP8 KV Cache: Compared to a traditional BF16 KV cache, this upgrade can nearly double the number of tokens held in cache while halving memory-access pressure.
- TileLang support, enabling more flexible development pathways.
Acknowledgments
We extend our heartfelt thanks to the DeepSeek team for their groundbreaking contributions to open-model research and the open-source community. Their advanced kernels are now seamlessly integrated into the SGLang inference engine.
Special recognition goes to Tom Chen, Ziyi Xu, Liangsheng Yin, Biao He, Baizhou Zhang, Henry Xiao, Hubert Lu, Wun-guo Huang, Zhengda Qin, and Fan Yin for their invaluable contributions to the DeepSeek-V3.2-Exp integration.
Furthermore, we acknowledge NVIDIA, AMD, and Nebius Cloud for sponsoring the GPU machines used during this development.
With these advancements in DeepSeek-V3.2, SGLang remains at the forefront of LLM inference, driving the future of efficient model deployment. Check our Roadmap for more details on upcoming features!

