Maximizing Data Processing Efficiency with Polars’ GPU-Accelerated Parquet Reader
When handling large datasets, the performance of your data processing tools is paramount. Enter Polars, an open-source library celebrated for its speed and efficiency in data manipulation. With a GPU-accelerated backend powered by cuDF, Polars offers a remarkable opportunity to enhance performance, especially when dealing with extensive data. However, to truly harness the capabilities of Polars’ GPU backend, optimizing the data loading process and effectively managing memory usage is crucial.
As development of the GPU backend has advanced, numerous techniques have emerged to maintain high performance, particularly when using the GPU Parquet reader. Earlier releases of the GPU backend (up to the 24.10 release) struggled to scale to larger dataset sizes, which motivated a new approach. This article examines how a chunked Parquet reader, combined with Unified Virtual Memory (UVM), can significantly outperform both non-chunked readers and traditional CPU-based methods.
Challenges with Scale Factors and Non-Chunked Readers
As dataset size increases, the limitations of a non-chunked GPU Parquet reader become evident. At scale factors beyond SF200, performance degrades markedly and Out of Memory (OOM) errors appear. On some queries the failures come much earlier: for Query 9, the non-chunked GPU reader fails before even reaching SF50. This drop-off stems from the memory pressure of loading large Parquet files entirely into GPU memory in a single pass. The gaps in the non-chunked Parquet reader’s performance graph correspond to these OOM failures at elevated scale factors.
Improving I/O and Peak Memory with Chunked Parquet Reading
To address these memory limitations, implementing a chunked Parquet Reader is essential. By processing the Parquet file in smaller, manageable chunks, the memory footprint is significantly reduced. This adjustment allows Polars GPU to handle larger datasets effectively. For example, using a chunked Parquet Reader with a 16 GB pass-read-limit enables a broader range of scale factors to be executed compared to a non-chunked reader. In the case of Query 9, adopting chunked reading with either 16 GB or 32 GB is critical for achieving better throughput.

Figure: Throughput of the chunked GPU reader (by pass_read_limit) across scale factors for Query 9.
Reading Even Larger Datasets with UVM
While chunked reading improves memory management, integrating Unified Virtual Memory (UVM) extends the reader’s reach further. UVM allows the GPU to access system memory directly, alleviating device-memory constraints and smoothing data transfer. In comparative runs, non-UVM chunked readers hit OOM errors before reaching SF100, while chunked readers with UVM successfully execute queries at higher scale factors, albeit with some cost to throughput.
Figure 3 illustrates this advantage clearly. A chunked Parquet Reader with UVM enabled shows successful execution across many more scale factors compared to a non-chunked Parquet Reader.
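One way to enable UVM for the GPU engine is to hand it an RMM managed-memory resource; the sketch below also wraps it in a prefetch adaptor so pages are migrated to the device ahead of access. The resource names follow the RMM Python API, but treat this exact combination as an assumption to check against your RMM and cudf-polars versions.

```python
import polars as pl
import rmm  # RAPIDS Memory Manager; requires a CUDA-capable environment

# ManagedMemoryResource allocates CUDA unified (managed) memory, so
# allocations can oversubscribe the GPU and spill to host RAM instead of
# failing with OOM. PrefetchResourceAdaptor migrates pages to the GPU
# before access to reduce page-fault overhead.
mr = rmm.mr.PrefetchResourceAdaptor(rmm.mr.ManagedMemoryResource())

engine = pl.GPUEngine(
    memory_resource=mr,
    parquet_options={"chunked": True, "pass_read_limit": 16 * 1024**3},
)

# df = pl.scan_parquet("lineitem.parquet").collect(engine=engine)
```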

Stability and Throughput
When determining the optimal pass_read_limit, balancing stability and throughput is crucial. Analysis of Figures 1-3 suggests that a 16 GB or 32 GB pass_read_limit strikes the best compromise between these two factors.
- 32 GB pass_read_limit: all queries succeeded except Query 9 and Query 19, which failed with OOM exceptions.
- 16 GB pass_read_limit: all queries succeeded without issues.
Chunked-GPU versus CPU
Throughput measurements consistently show chunked GPU Polars outperforming traditional CPU Polars, and chunking allows many queries to complete that would otherwise fail outright. A 16 GB (or possibly 32 GB) pass_read_limit appears optimal, enabling successful execution at higher scale factors than a non-chunked Parquet reader can reach.
Inspired by: Source

