The PyTorch Foundation is thrilled to announce that vLLM has joined as a hosted project. Originally developed at the University of California, Berkeley, vLLM is a high-throughput, memory-efficient inference and serving engine designed specifically for large language models (LLMs). vLLM is tightly integrated with PyTorch and uses it as a unified interface to support a diverse range of hardware backends, including NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This close relationship with PyTorch helps ensure compatibility and makes it easier to optimize performance across hardware platforms.
In a recent announcement, the PyTorch Foundation revealed its evolution into an umbrella foundation aimed at accelerating AI innovation. The inclusion of vLLM as one of its inaugural projects highlights the Foundation’s commitment to fostering cutting-edge developments in AI technology. Foundation-Hosted Projects are governed and managed under the PyTorch Foundation’s neutral and transparent governance model, ensuring accountability and community engagement.
What is vLLM?
As large language models grow increasingly complex, the challenge of running them efficiently becomes more pronounced. vLLM addresses this challenge head-on. Initially built around the pioneering PagedAttention algorithm, vLLM has evolved into a state-of-the-art inference engine that is continually enhanced by a vibrant community. This community is actively contributing new features and optimizations such as pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.
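To make the PagedAttention idea more concrete: its core memory-management trick is to store each sequence's key-value (KV) cache in fixed-size blocks that are allocated on demand and tracked in a per-sequence block table, much like pages in virtual memory. The following is a minimal sketch of that bookkeeping only, not vLLM's actual implementation. The class name `ToyPagedKVCache`, its methods, and the block size of 4 are all hypothetical choices made for illustration, and the attention computation itself is left out.

```python
class ToyPagedKVCache:
    """Toy block allocator illustrating paged KV-cache bookkeeping.

    Hypothetical names and sizes; this is not vLLM's API.
    """

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size          # tokens stored per block (assumed)
        self.free = list(range(num_blocks))   # ids of unused physical blocks
        self.block_tables = {}                # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # no block yet, or last block is full
            if not self.free:
                raise RuntimeError("no free KV blocks")
            table.append(self.free.pop())     # blocks need not be contiguous
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):                            # 5 tokens need 2 blocks of size 4
    cache.append_token("a")
blocks_used = len(cache.block_tables["a"])
blocks_free = len(cache.free)
```

Because memory is reserved one block at a time, a sequence wastes at most one partially filled block, and blocks freed by finished sequences are immediately reusable by others.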
Since its inception, vLLM has gained substantial traction in the developer community, with more than 46,500 stars on GitHub and over 1,000 contributors. These numbers reflect the project’s popularity and the strength of the ecosystem around it. Joining the Foundation marks a significant milestone for vLLM as it continues to give developers and researchers tools for efficient AI deployment.
Key Features of vLLM
- Extensive Model Support: vLLM supports over 100 LLM architectures and offers multi-modal capabilities for images and videos, alongside specialized architectures like sparse attention, Mamba, BERT, Whisper, embedding, and classification models.
- Comprehensive Hardware Compatibility: It runs on NVIDIA GPUs up to and including the Blackwell generation, with official support for AMD, Google TPU, AWS Neuron, Intel CPU/XPU/HPU, and ARM. Third-party accelerators like IBM Spyre and Huawei Ascend can be integrated via vLLM’s plugin system.
- Highly Extensible: vLLM allows for custom model implementations, hardware plugins, torch.compile optimizations, and configurable scheduling policies tailored to specific needs.
- Optimized for Response Speed: It minimizes latency through techniques such as speculative decoding, quantization, prefix caching, and CUDA graph acceleration.
- Engineered for Maximum Throughput: vLLM achieves peak performance with tensor/pipeline parallelism and specialized kernels designed for efficiency.
- Seamless RLHF Integration: It offers first-class support for reinforcement learning from human feedback and integrates well with common post-training frameworks.
- Enterprise-Scale Distributed Inference: vLLM enables cluster-wide scaling through KV cache offloading, intelligent routing, and prefill-decode disaggregation.
- Production-Hardened: With enterprise-grade security, comprehensive observability, and proven operational reliability, vLLM is built to withstand production demands.
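Of the techniques listed above, prefix caching is easy to illustrate in isolation. One common approach is to split a prompt into fixed-size token blocks and hash each block together with everything that precedes it, so that a later request sharing the same leading tokens (for example, a shared system prompt) can reuse the cached KV blocks instead of recomputing them. The sketch below shows only that lookup logic under stated assumptions: `block_hashes`, `ToyPrefixCache`, and the block size of 4 are hypothetical names and values, and no real KV tensors are stored. It is not taken from the vLLM codebase.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Hash each full block chained with the hash of all preceding blocks."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + b"|" + ",".join(map(str, block)).encode()
        prev = hashlib.sha256(payload).digest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    """Tracks which block hashes are cached (hypothetical, illustration only)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()

    def lookup_and_insert(self, token_ids):
        """Return the number of leading tokens already cached, then cache all blocks."""
        hashes = block_hashes(token_ids, self.block_size)
        hit_blocks = 0
        for h in hashes:
            if h not in self.cached:
                break                         # reuse stops at the first miss
            hit_blocks += 1
        self.cached.update(hashes)
        return hit_blocks * self.block_size


prefix_cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(8))                # stand-in token ids for a shared prompt
first = prefix_cache.lookup_and_insert(system_prompt + [100, 101, 102, 103, 104])
second = prefix_cache.lookup_and_insert(system_prompt + [200, 201, 202, 203])
```

Here the first request finds nothing cached, while the second reuses the two blocks covering the shared 8-token prefix. Chaining the hashes means a block only matches when the entire prefix before it matches as well.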
Accelerating Open Source AI Together
As part of the PyTorch Foundation, vLLM will collaborate closely with the PyTorch team on feature development. Plans include:
- Ensuring vLLM code runs on PyTorch nightly builds, with the PyTorch team monitoring the tests to catch regressions early.
- Enhancing support for torch.compile and FlexAttention in vLLM.
- Facilitating close collaboration with native libraries such as TorchTune, TorchAO, and FBGEMM.
This partnership presents significant advantages for both vLLM and the PyTorch core. vLLM gains a dedicated steward in the Foundation, ensuring long-term maintenance of the codebase, production stability, and a transparent governance structure. Simultaneously, PyTorch stands to benefit from vLLM’s capacity to broaden its adoption across various accelerator platforms while innovating features that enhance the entire ecosystem.