### Google’s LiteRT Accelerator Built on Qualcomm AI Engine Direct (QNN): Revolutionizing On-Device AI
Google recently unveiled a powerful new accelerator for LiteRT, built on Qualcomm AI Engine Direct (QNN) and designed to boost on-device AI performance on Qualcomm-powered Android devices with Snapdragon 8 series System on Chips (SoCs). The technology promises dramatic improvements in execution speed: up to **100 times faster than CPU execution** and an impressive **10 times faster than GPU** execution.
### Addressing Performance Bottlenecks in Mobile AI
Despite the GPU horsepower widely available in modern Android devices, relying solely on these processors for AI tasks often creates performance bottlenecks. Google software engineers Lu Wang, Weiyi Wang, and Andrew Wang pointed to a common scenario: running a compute-heavy text-to-image generation model while simultaneously processing real-time camera feeds can overwhelm even the most advanced mobile GPUs. The resulting overload leads to a jittery experience and dropped frames, significantly impairing user interaction.
In contrast, many modern devices come equipped with **Neural Processing Units (NPUs)**, custom AI accelerators designed to run AI workloads faster and more efficiently while drawing less power than GPUs. Offloading AI workloads to the NPU frees the GPU for other tasks and enables a smoother, higher-performance experience in AI applications.
### Introducing the QNN Accelerator
Developed in collaboration with Qualcomm, the QNN accelerator serves as a streamlined replacement for the previous TensorFlow Lite QNN delegate. By integrating the various SoC compilers and runtimes into a cohesive workflow, it presents a single, simplified API to developers. The accelerator supports **90 LiteRT operations**, with the goal of enabling **full model delegation**, a critical factor in achieving peak performance across applications.
Included within the QNN framework are specialized kernels and optimizations tailored to models such as the **Gemma** large language models and Apple’s **FastVLM**, significantly expanding the capabilities of on-device AI applications.
### Benchmarked Performance Gains
Google benchmarked QNN across **72 machine learning models**, **64 of which achieved full NPU delegation**. The assessments confirmed the headline numbers: gains of up to **100 times over CPU execution** and **10 times over GPU execution**.
A notable example comes from Qualcomm’s latest flagship SoC, the **Snapdragon 8 Elite Gen 5**. Here, the performance advantage is eye-opening: **56 of the models run in under 5 milliseconds** on the NPU, while only **13 models reach that mark on the CPU**. This advancement unlocks real-time AI experiences previously deemed unattainable.
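A quick back-of-the-envelope reading of those benchmark counts (a sketch; the model counts and the 5 ms threshold come from the article, the percentage framing is ours):

```python
# Share of benchmarked models that meet a 5 ms latency budget,
# using the counts reported in the article.
total_models = 72
npu_under_5ms = 56   # models under 5 ms with full NPU delegation
cpu_under_5ms = 13   # models under 5 ms on the CPU

npu_share = npu_under_5ms / total_models
cpu_share = cpu_under_5ms / total_models

print(f"NPU: {npu_share:.0%} of models under 5 ms")  # prints "NPU: 78% of models under 5 ms"
print(f"CPU: {cpu_share:.0%} of models under 5 ms")  # prints "CPU: 18% of models under 5 ms"
```

In other words, roughly four out of five benchmarked models fit a real-time frame budget on the NPU, versus fewer than one in five on the CPU.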
### Cutting-Edge Use Cases and Applications
Google engineers have even developed a proof-of-concept application that harnesses optimized versions of Apple’s FastVLM-0.5B vision-language model, interpreting scenes captured by the camera nearly instantaneously. On the Snapdragon 8 Elite Gen 5 NPU, it achieves a remarkable **time-to-first-token (TTFT) of just 0.12 seconds** on 1024×1024 images, along with throughput above **11,000 tokens per second** for prefill and more than **100 tokens per second** for decoding.
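The reported rates give a feel for how much work the NPU does before the first token appears. A rough sketch (the prefill token count and the 50-token caption length are inferred for illustration, not stated in the source):

```python
# Relating time-to-first-token (TTFT) to the reported prefill rate.
prefill_rate = 11_000   # tokens/second during prefill (from the article)
ttft = 0.12             # seconds to first token (from the article)

# Approximate number of tokens prefilled before the first output token:
prefill_tokens = prefill_rate * ttft   # ~1,320 tokens

# End-to-end latency for a hypothetical 50-token scene description,
# decoded at the reported rate:
decode_rate = 100       # tokens/second (from the article)
caption_tokens = 50     # illustrative caption length (our assumption)
total_latency = ttft + caption_tokens / decode_rate   # ~0.62 seconds
```

On these numbers, a full camera-to-caption round trip fits comfortably under a second, which is what makes the "nearly instantaneous" scene interpretation plausible.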
Optimizing Apple’s model involved **int8 weight quantization** and **int16 activation quantization**, the step that unlocks the NPU’s high-speed **int16 kernels**.
### Compatibility and Getting Started
Currently, QNN supports a select range of Android hardware, focusing primarily on devices powered by **Snapdragon 8** and **Snapdragon 8+** SoCs. Developers eager to try these capabilities can consult the NPU acceleration guide and download **LiteRT** from GitHub to get started.
### Summary of Insights
With QNN, Google is setting a compelling precedent in the world of mobile AI, not only maximizing processing efficiency but also paving the way for interactive, real-time applications that were previously constrained by hardware limitations. By leveraging NPUs and revolutionizing the way on-device AI is approached, QNN stands to redefine user experiences across Qualcomm-powered Android devices.

