Discover TorchAO Quantized Models and Recipes on Hugging Face Hub for PyTorch

PyTorch has announced native quantized variants of popular models, including Phi4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it, through a collaboration between the TorchAO team and Unsloth. These models use int4 and float8 quantization to provide efficient inference on server GPUs such as A100 and H100, as well as on mobile devices. They do so with minimal to no degradation in model quality compared to their bfloat16 counterparts.

Key Highlights of the New Quantized Models

- We’ve launched pre-quantized models optimized for both server and mobile platforms, for users who want to deploy faster models in production.
- Complete and reproducible quantization recipes and guides are now available, covering model quality evaluation and performance benchmarking. They are intended for users applying PyTorch’s native quantization to their own models and datasets.
- Users can also finetune with Unsloth and then quantize the finetuned model with TorchAO.

Post Training Quantization: Models and Results

We proudly present several quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it. Below is a detailed breakdown of our quantization methods, results, and corresponding models:

Quantization method	Results	Models
Int4 weight-only quantization with the HQQ algorithm and AWQ (for server H100 and A100 GPUs)	1.1-1.2x speedup on A100 over the bfloat16 model and 1.75x on H100 at batch size 1. Small accuracy degradation from the bfloat16 model, e.g. Phi-4-mini-instruct-INT4 scored 53.28 vs. 55.35 for the bfloat16 baseline. For accuracy-critical tasks, Phi-4-mini-instruct-INT4 scored 36.98 on mmlu_pro, and careful calibration raised that to 43.13. 60% peak memory reduction.	Phi-4-mini-instruct-INT4, Phi-4-mini-instruct-AWQ-INT4, Qwen3-8B-INT4, Qwen3-8B-AWQ-INT4
Float8 dynamic activation and float8 weight quantization (for server H100 GPU)	1.7-2x speedup on H100 over bfloat16 at batch sizes 1 and 256. Little to no accuracy degradation, e.g. Phi-4-mini-instruct-FP8 averaged 55.11 vs. 55.35 for bfloat16. 30-40% peak memory reduction.	gemma-3-270m-it-torchao-FP8, Phi-4-mini-instruct-FP8, Qwen3-32B-FP8
Int8 dynamic activation and int4 weight quantization (for mobile CPU)	Small accuracy degradation compared to bfloat16. Runs on iOS and Android devices such as iPhone 15 Pro and Samsung Galaxy S22.	Phi-4-mini-instruct-INT8-INT4, Qwen3-4B-INT8-INT4, SmolLM3-3B-INT8-INT4
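
As a rough cross-check on the memory figures in the table, the arithmetic below estimates how many bits each weight occupies after quantization. This is a sketch under stated assumptions: the group size of 128 and the 16-bit scale and zero point stored per group are illustrative values, not settings taken from the released recipes, and the helper names are hypothetical.

```python
def bits_per_weight(weight_bits, group_size=None, scale_bits=16, zero_point_bits=16):
    """Average stored bits per weight.

    When group_size is given, each group of weights also stores one scale and
    one zero point, whose cost is spread across the weights in that group.
    """
    if group_size is None:
        return float(weight_bits)
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def weight_memory_reduction(quantized_bits, baseline_bits=16.0):
    """Fraction of weight storage saved relative to a bfloat16 (16-bit) baseline."""
    return 1.0 - quantized_bits / baseline_bits


# Assumed settings: int4 weights in groups of 128; float8 with negligible scale overhead.
int4_bits = bits_per_weight(4, group_size=128)
fp8_bits = bits_per_weight(8)

print(f"int4:   {int4_bits} bits/weight, {weight_memory_reduction(int4_bits):.1%} saved")
print(f"float8: {fp8_bits} bits/weight, {weight_memory_reduction(fp8_bits):.1%} saved")
```

Under these assumptions the weights alone shrink by roughly 73% for int4 and 50% for float8. The peak memory reductions reported in the table are smaller, which is what one would expect if peak memory also counts activations, the KV cache, and any tensors kept in higher precision.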

Each of these models comes with a reproducible quantization recipe that uses the TorchAO library, so users can apply the same steps to quantize their own models.
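
To make the idea behind int4 weight-only quantization concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization and dequantization. It is a simplified illustration only: it uses plain min/max rounding rather than the HQQ or AWQ algorithms named in the table, it does not call TorchAO, and the function names and the group size of 32 are assumptions chosen for the example.

```python
import random


def quantize_group_int4(values):
    """Map one group of floats to integer codes in [0, 15].

    Returns (codes, scale, minimum); the minimum acts as the group's offset.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 15 steps span 16 levels
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize_weights(weights, group_size):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize_weights(groups):
    """Reconstruct approximate float weights from the quantized groups."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(lo + code * scale for code in codes)
    return restored


# Synthetic weights with a fixed seed so the run is repeatable.
rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]

groups = quantize_weights(weights, group_size=32)
restored = dequantize_weights(groups)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max round-trip error: {max_error:.4f}")
```

Each group keeps its own scale and minimum, so the rounding error of any weight is bounded by half of that group's step size. Smaller groups track local weight ranges more closely at the cost of storing more scales.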

Seamless Integrations Within the PyTorch Ecosystem

The new PyTorch native quantized models are designed to work within the broader PyTorch ecosystem, giving users high-performance quantization options for a variety of deployment requirements.

We use tools from across the PyTorch stack for model quantization, finetuning, quality evaluation, latency testing, and deployment, so that the released quantized models and their recipes work throughout the lifecycle of model preparation and deployment.

Looking Ahead: Future Innovations

New Features
- MoE quantization for both inference and training.
- Support for a new dtype, NVFP4.
- Techniques for preserving accuracy during post-training quantization, such as SmoothQuant, GPTQ, and SpinQuant.
Collaborations
- We’re continuing our partnership with Unsloth to make TorchAO available for finetuning, QAT, and releasing TorchAO quantized models.
- We’re also working with vLLM to improve end-to-end server inference performance, using optimized kernels from FBGEMM.

We Want to Hear from You!

We invite you to try our new models and quantization recipes. Please share your feedback by opening issues in TorchAO or by starting a discussion on the released models page. You can also reach us on our Discord channel. We would also like to learn how you are currently quantizing models and to explore collaborating on future releases of quantized models on Hugging Face.
