Exciting News: SGLang Now Supports DeepSeek-V3.2 on Day 0
We are thrilled to announce that SGLang supports DeepSeek-V3.2 on Day 0! As the latest DeepSeek tech report explains, DeepSeek-V3.2 builds on its predecessor, DeepSeek-V3.1-Terminus, by incorporating DeepSeek Sparse Attention (DSA) through continued training. This approach delivers substantial efficiency gains in both training and inference, especially in long-context scenarios.
Installation and Quick Start
Getting started with DeepSeek-V3.2 using SGLang is easy. Follow these steps to pull the container and launch the server:
For NVIDIA GPUs
```bash
docker pull lmsysorg/sglang:v0.5.3-cu129
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
```
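Once the server is running, it exposes an OpenAI-compatible HTTP API. Below is a minimal client sketch; the default port 30000 and the model name passed in the request body are assumptions here, so adjust `base_url` to match your deployment.

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=128):
    # Build an OpenAI-style chat-completions request body.
    # The model name mirrors the --model flag used at launch (assumed).
    return {
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:30000"):
    # POST to the OpenAI-compatible endpoint served by SGLang.
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

You can also point any OpenAI SDK at the same base URL instead of hand-rolling requests.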
For AMD (MI350X/MI355X)
```bash
docker pull lmsysorg/sglang:dsv32-rocm
SGLANG_NSA_FUSE_TOPK=false SGLANG_NSA_KV_CACHE_STORE_FP8=false SGLANG_NSA_USE_REAL_INDEXER=true SGLANG_NSA_USE_TILELANG_PREFILL=True python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --disable-cuda-graph --tp 8 --mem-fraction-static 0.85 --page-size 64 --nsa-prefill "tilelang" --nsa-decode "aiter"
```
For NPU
```bash
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --trust-remote-code --attention-backend ascend --mem-fraction-static 0.85 --chunked-prefill-size 32768 --disable-radix-cache --tp-size 16 --quantization w8a8_int8
```
Description of Key Features
DeepSeek Sparse Attention: Long-Context Efficiency Unlocked
At the core of DeepSeek-V3.2 lies the DeepSeek Sparse Attention (DSA) mechanism, designed for improved efficiency in processing long-context data.
DSA diverges from traditional methods by implementing:
- Lightning Indexer: An ultra-light FP8 scorer that efficiently identifies the most relevant tokens for each query.
- Top-k Token Selection: This focuses computation solely on the most impactful key-value entries.
By reducing the complexity of core attention from O(L²) to O(Lk), DeepSeek-V3.2 offers dramatic improvements in training and inference efficiency, extending to a context length of up to 128K with minimal compromise on model quality.
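The idea behind DSA can be illustrated with a toy sketch: a cheap indexer score ranks all keys, attention then runs only over the top-k of them. This is purely illustrative Python, not SGLang's kernels; the `index_scores` argument stands in for the Lightning Indexer's FP8 output, which in the real system is computed by a learned scorer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_attention(query, keys, values, index_scores, k):
    # Keep only the k keys the indexer ranks highest, rather than
    # attending over all L keys (O(L*k) instead of O(L^2) overall).
    topk = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Exact softmax attention restricted to the selected subset.
    logits = [sum(q * kk for q, kk in zip(query, keys[i])) for i in topk]
    weights = softmax(logits)
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, topk):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, topk
```

With k fixed (e.g. a few thousand tokens) the per-query cost stops growing with context length, which is where the long-context savings come from.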
Innovations Supporting DSA
To bolster the effectiveness of DSA, SGLang integrates a series of innovations including:
- Lightning Indexer Support: A dedicated `key & key_scale` cache in the memory pool, enabling ultra-fast token scoring.
- Native Sparse Attention (NSA) Backend: A new backend designed specifically for sparse workloads, including:
  - FlashMLA: A multi-query attention kernel optimized for DeepSeek.
  - FlashAttention-3 Sparse: Adapted for compatibility and maximum kernel reuse.
These enhancements ensure that DeepSeek-V3.2-Exp maintains state-of-the-art reasoning quality while significantly reducing memory overhead during deployment.
Future Work
Looking ahead, SGLang has a robust roadmap to further enhance DeepSeek capabilities, including:
- Multi-token prediction (MTP) support, which will speed up decoding, especially at smaller batch sizes.
- FP8 KV Cache: Compared to a traditional BF16 KV cache, this upgrade can nearly double the number of tokens held in cache while halving memory-access pressure.
- TileLang support, enabling more flexible development pathways.
Acknowledgments
We extend our heartfelt thanks to the DeepSeek team for their groundbreaking contributions to open-model research and the open-source community. Their advanced kernels are now seamlessly integrated into the SGLang inference engine.
Special recognition goes to Tom Chen, Ziyi Xu, Liangsheng Yin, Biao He, Baizhou Zhang, Henry Xiao, Hubert Lu, Wun-guo Huang, Zhengda Qin, and Fan Yin for their invaluable contributions to the DeepSeek-V3.2-Exp integration.
Furthermore, we acknowledge NVIDIA, AMD, and Nebius Cloud for sponsoring the GPU machines used during this development.
With these advancements in DeepSeek-V3.2, SGLang remains at the forefront of LLM inference, driving the future of efficient model deployment. Check our Roadmap for more details on upcoming features!

