PyTorch has announced native quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it, produced through a collaboration between the TorchAO team and Unsloth. The models use int4 and float8 quantization to provide efficient inference on A100 and H100 GPUs as well as on mobile devices, with minimal to no degradation in model quality compared to their bfloat16 counterparts.
Key Highlights of the New Quantized Models
- We’ve launched pre-quantized models that are optimized for both server and mobile platforms, ideal for users looking to deploy faster models in their production environments.
- Complete, reproducible quantization recipes and guides are available, covering model quality evaluation and performance benchmarking, so users can apply PyTorch’s native quantization to their own models and datasets.
- Users can also finetune with Unsloth and then quantize the finetuned model with TorchAO.
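
To put the choice of int4 and float8 in perspective, the storage needed for weights alone can be estimated from the bit width of each dtype. The sketch below is back-of-the-envelope arithmetic, not a measurement of the released models: the 8-billion parameter count, the group size of 128, and the 16-bit per-group scale are illustrative assumptions, and activations, KV cache, and runtime overhead are ignored.

```python
from typing import Optional

# Bits used to store one weight; these widths follow from the dtype names.
BITS_PER_WEIGHT = {"bfloat16": 16, "float8": 8, "int4": 4}


def weight_bytes(num_params: int, dtype: str,
                 group_size: Optional[int] = None,
                 scale_bits: int = 16) -> float:
    """Approximate storage for the weights alone, in bytes.

    group_size and scale_bits model one scale stored per group of low-bit
    weights. Both are illustrative assumptions, not TorchAO defaults.
    """
    bits = num_params * BITS_PER_WEIGHT[dtype]
    if group_size is not None:
        bits += (num_params / group_size) * scale_bits
    return bits / 8


if __name__ == "__main__":
    n = 8_000_000_000  # hypothetical 8B-parameter model
    for dtype, group in [("bfloat16", None), ("float8", None), ("int4", 128)]:
        gib = weight_bytes(n, dtype, group) / 2**30
        print(f"{dtype:>8}: {gib:6.2f} GiB")
```

Under these assumptions, int4 weights take roughly a quarter of the bfloat16 footprint plus a small overhead for the per-group scales.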
Post Training Quantization: Models and Results
We are releasing several quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it. The table below lists the quantization methods, results, and corresponding models:
| Quantization methods | Results | Models |
|---|---|---|
| Int4 weight-only quantization with hqq algorithm and AWQ (for server H100 and A100 GPU) | | Phi-4-mini-instruct-INT4, Phi-4-mini-instruct-AWQ-INT4, Qwen3-8B-INT4, Qwen3-8B-AWQ-INT4 |
| Float8 dynamic activation and float8 weight quantization (for server H100 GPU) | | gemma-3-270m-it-torchao-FP8, Phi-4-mini-instruct-FP8, Qwen3-32B-FP8 |
| Int8 dynamic activation and int4 weight quantization (for mobile CPU) | | Phi-4-mini-instruct-INT8-INT4, Qwen3-4B-INT8-INT4, SmolLM3-3B-INT8-INT4 |
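
As a rough illustration of what "int4 weight-only" means, the following pure-Python sketch quantizes a row of weights in groups, storing a 4-bit code per weight plus a scale and minimum per group. It is a plain round-to-nearest baseline written for this post, not the hqq or AWQ algorithm and not TorchAO's kernel layout; the function names and the group size are hypothetical.

```python
from typing import List, Tuple

QMIN, QMAX = 0, 15  # an unsigned 4-bit integer has 16 levels


def quantize_group(values: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of weights to 4-bit codes plus a (scale, minimum) pair."""
    lo, hi = min(values), max(values)
    # Fall back to 1.0 so a constant group does not produce a zero scale.
    scale = (hi - lo) / (QMAX - QMIN) or 1.0
    codes = [min(QMAX, max(QMIN, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights from the codes and group parameters."""
    return [lo + c * scale for c in codes]


def quantize_row(row: List[float], group_size: int):
    """Quantize a row group by group (group_size is an illustrative choice)."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


if __name__ == "__main__":
    row = [-0.8, -0.1, 0.0, 0.4, 1.2, 1.3, 1.5, 2.0]
    for codes, scale, lo in quantize_row(row, group_size=4):
        print(codes, dequantize_group(codes, scale, lo))
```

Only the weights are stored in low precision here; activations stay in their original dtype, which is what distinguishes weight-only schemes from the dynamic activation rows of the table. Each reconstructed weight differs from the original by at most half a quantization step of its group.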
Each of the models above comes with a reproducible quantization recipe built on the TorchAO library, so users can quantize their own models in the same way.
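
The "dynamic activation" schemes in the table differ from weight-only quantization in that the activation scale is computed at inference time from the tensor being processed, rather than fixed ahead of time. The sketch below shows the idea for int8 with one scale per token (row). It is a simplified, symmetric variant written for illustration; the function names are hypothetical, and the actual TorchAO recipes may use different mappings and granularities.

```python
from typing import List, Tuple


def quantize_activation_row(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization with a scale derived from this row alone."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale


def int_dot(act_codes: List[int], act_scale: float,
            weight_codes: List[int], weight_scale: float) -> float:
    """Accumulate in integers, then apply both scales once at the end."""
    acc = sum(a * w for a, w in zip(act_codes, weight_codes))
    return acc * act_scale * weight_scale


if __name__ == "__main__":
    # Two tokens with very different ranges each get their own scale.
    tokens = [[0.5, -1.0, 0.25], [40.0, -3.0, 12.0]]
    weights = [0.2, 0.1, -0.4]
    # Weights are quantized once, offline (int8 here only to keep the sketch short).
    w_codes, w_scale = quantize_activation_row(weights)
    for token in tokens:
        a_codes, a_scale = quantize_activation_row(token)
        exact = sum(a * w for a, w in zip(token, weights))
        print(a_scale, int_dot(a_codes, a_scale, w_codes, w_scale), exact)
```

Because each row carries its own scale, a token with large values does not reduce the resolution available to a token with small values.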
Seamless Integrations Within the PyTorch Ecosystem
The new PyTorch native quantized models are designed to work within the broader PyTorch ecosystem and to cover a variety of deployment requirements.
We use tools from across the PyTorch stack for model quantization, finetuning, quality evaluation, latency testing, and deployment, so the released quantized models and their recipes work across the full cycle of model preparation and deployment.
Looking Ahead: Future Innovations
- New Features
  - MoE quantization for both inference and training.
  - Support for a new dtype: NVFP4.
  - More accuracy-preserving post-training quantization techniques, such as SmoothQuant, GPTQ, and SpinQuant.
- Collaborations
  - We will continue our partnership with Unsloth to make TorchAO available for finetuning, QAT, and releasing TorchAO quantized models.
  - We are also working with vLLM to improve end-to-end server inference performance, using optimized kernels from FBGEMM.
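
Of the accuracy-preserving techniques named above, SmoothQuant is the simplest to show in a few lines. The sketch below follows the published idea: a per-channel scale moves activation outliers into the weights without changing the layer output. It is a pure-Python illustration with hypothetical function names and toy values, not the implementation planned for TorchAO.

```python
from typing import List

Matrix = List[List[float]]
EPS = 1e-8  # guards against division by zero for all-zero channels


def smoothing_scales(acts: Matrix, weights: Matrix, alpha: float = 0.5) -> List[float]:
    """One scale per input channel: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j in range(len(weights)):
        act_max = max(max(abs(row[j]) for row in acts), EPS)
        w_max = max(max(abs(w) for w in weights[j]), EPS)
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(acts: Matrix, weights: Matrix, scales: List[float]):
    """Divide activation channels by s and multiply the matching weight rows by s."""
    new_acts = [[x / s for x, s in zip(row, scales)] for row in acts]
    new_weights = [[w * s for w in w_row] for w_row, s in zip(weights, scales)]
    return new_acts, new_weights


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    acts = [[0.1, 20.0], [-0.2, -18.0]]      # channel 1 holds the outliers
    weights = [[0.5, -0.3], [0.02, 0.01]]    # shape: in_channels x out_channels
    scales = smoothing_scales(acts, weights)
    smooth_acts, smooth_weights = apply_smoothing(acts, weights, scales)
    print(matmul(acts, weights))
    print(matmul(smooth_acts, smooth_weights))  # same product, up to rounding
    print(smooth_acts)                          # outlier channel is now small
```

The product is preserved because each input channel is divided by its scale on the activation side and multiplied by the same scale on the weight side; the smoothed activations then have a narrower range and lose less precision when quantized.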
We Want to Hear from You!
We invite you to try the new models and quantization recipes. Please share feedback by opening issues in TorchAO or by starting discussions on the released model pages. You can also reach us on our Discord channel. We would also like to learn how you currently quantize models, and to explore collaborating on future quantized model releases on HuggingFace.

