Optimizing Parquet Storage: Enhancing Efficiency at Hugging Face
The Xet team at Hugging Face is spearheading an initiative to improve the efficiency of the Hub’s storage architecture. With Hugging Face hosting nearly 11PB of datasets—of which Parquet files alone account for over 2.2PB—optimizing the storage of these files is paramount. This article delves into the intricacies of Parquet storage, the challenges faced, and the innovative solutions being explored.
Understanding Parquet Files
Parquet is a columnar storage file format that offers efficient data compression and encoding schemes. It works by splitting a table into row groups, each containing a fixed number of rows (for instance, 1,000). Each column within these row groups is compressed and stored separately. This structure enhances read performance for analytical queries, making Parquet a popular choice for data scientists and engineers.
Challenges in Parquet Storage
One of the primary challenges in managing Parquet files is deduplication, especially when users frequently update their datasets. When datasets are regularly modified, the need for efficient storage becomes critical. Without effective deduplication, updating datasets can lead to substantial storage overhead, as users might have to re-upload entire datasets each time.
The default storage algorithm employed by Hugging Face utilizes byte-level Content-Defined Chunking (CDC). While this method generally works well for insertions and deletions, the inherent layout of Parquet files presents unique challenges. Let’s explore some experiments conducted to assess the performance of this deduplication strategy.
Experimenting with Parquet Modifications
Appending Data
In an initial test, 10,000 new rows were appended to a 2GB Parquet file containing 1,092,000 rows from the FineWeb dataset. The results were promising: the new file achieved a deduplication rate of 99.1%, requiring only 20MB of additional storage. This outcome aligns with expectations, as appending data should ideally not disrupt existing row groups.
Modifying Data
When a small modification was made to a specific row, the deduplication results were less favorable. Although most of the file was still deduplicated, many small, regularly spaced sections of new data emerged. This phenomenon occurs because modifications affect the Parquet column headers, which contain absolute file offsets. Consequently, even minor changes can necessitate rewriting all column headers, leading to a deduplication rate of only 89% and requiring an additional 230MB of storage.
Deleting Data
Deleting a row from the middle of the file triggered significant changes in the row group layout: because each group contains a fixed 1,000 rows, every row group after the deletion point shifts by one row. While the first half of the file retained its deduplicated status, the latter half consisted of entirely new blocks of data, since even a one-row shift changes the compressed bytes of every subsequent column chunk.
When compression was turned off, the deduplication improved significantly. However, this came at the cost of file size, which nearly doubled without compression. This raises a crucial question: can we achieve the benefits of both deduplication and compression?
Innovative Solutions: Content-Defined Row Groups
One potential solution lies in applying CDC not only at the byte level but also at the row level. By splitting row groups based on a hash of a designated “Key” column, we can dynamically determine the size of each row group. This approach allows for efficient deduplication even when rows are deleted, as highlighted in the results of an experimental demonstration.
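A pure-Python sketch of the idea follows; the key-column name, hash, and boundary threshold are illustrative choices, not an existing Parquet feature. A row group ends whenever the hash of the row's key matches a target pattern, so boundaries follow content: deleting a row shrinks only the group that contained it, while the groups before and after keep exactly the same rows and re-serialize identically.

```python
import hashlib

def row_group_sizes(keys, target=1_000):
    """Cut row groups at content-defined boundaries (~`target` rows each)."""
    sizes, count = [], 0
    for key in keys:
        count += 1
        h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:4], "big")
        if h % target == 0:  # boundary depends on the key, not the position
            sizes.append(count)
            count = 0
    if count:
        sizes.append(count)
    return sizes

keys = list(range(10_000))
sizes_before = row_group_sizes(keys)

del keys[5_000]  # delete a row from the middle
sizes_after = row_group_sizes(keys)

# Only the group that held the deleted key changes size;
# all other groups keep identical row membership.
print(sizes_before)
print(sizes_after)
```

Since unchanged groups contain the same rows, their compressed column chunks come out byte-identical, and byte-level CDC can deduplicate them even with compression left on.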
Future Directions for Parquet Storage
The experiments conducted by the Xet team have highlighted several avenues for improving the deduplication capabilities of Parquet files:
- Using Relative Offsets: Transitioning from absolute to relative offsets for file structure data could enhance position independence, streamlining deduplication processes. However, implementing this change would require significant modifications to the file format.
- Supporting Content-Defined Chunking on Row Groups: As the Parquet format allows for row groups of varying sizes, enhancing support for content-defined chunking could improve deduplication while maintaining compatibility with existing systems.
The Xet team is keen to collaborate with the Apache Arrow project to explore the feasibility of these enhancements within the Parquet and Arrow codebase.
Meanwhile, they continue to investigate the performance of the deduplication process across various file types. Users are encouraged to try out the deduplication estimator and share their findings, contributing to the ongoing improvement of data storage efficiency at Hugging Face.