Parquet Content-Defined Chunking Enables Efficient Deduplication with Hugging Face Xet

Sources: https://huggingface.co/blog/parquet-cdc, Hugging Face Blog

TL;DR

  • Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas and works with Hugging Face’s Xet storage layer to enable efficient deduplication.
  • CDC aligns data-page boundaries with the data’s content before serialization and compression, so Xet can deduplicate pages across repositories and files, reducing uploads and downloads.
  • Enable Parquet CDC by passing use_content_defined_chunking=True to the Parquet writer; hf:// URIs can be used directly in PyArrow for on-Hub reads and writes.
  • The combination of Parquet CDC and Xet significantly lowers data transfer and storage costs for large-scale Hugging Face datasets.

Context and background

Parquet is a widely used columnar storage format in data engineering, and Hugging Face hosts an enormous amount of Parquet data on the Hub. The Hub’s new Xet storage layer uses content-defined chunking to deduplicate chunks of data, reducing storage costs and speeding up uploads and downloads. Because Parquet’s layout (particularly its column chunks and data pages) can produce different byte-level representations for small data changes, deduplication can be suboptimal unless the data is written with chunk boundaries that track the data’s content. This is where CDC helps: it aligns chunk boundaries with the logical data values, enabling more effective deduplication before any serialization or compression. The blog uses a manageable subset of the OpenOrca dataset to illustrate the workflow. Since PyArrow 21.0.0, Hugging Face URIs can be used directly in PyArrow functions to read and write Parquet (and other formats) on the Hub via the hf:// URI scheme. This integration makes it possible to perform end-to-end Parquet operations on the Hub without leaving the environment, and to showcase how CDC interacts with the Xet storage layer to achieve deduplication across repositories and files. The broader goal is to minimize data transfer while preserving accessibility and compatibility with existing readers and writers. Hugging Face Blog.

Traditional file systems do not deduplicate files across repositories, so updates lead to full re-uploads and re-downloads. By contrast, a system that uses content-defined chunking can recognize identical content and avoid transferring unchanged data. The article demonstrates that identical tables uploaded to the Hub are transferred almost instantly when CDC is enabled, and that deduplication works across repositories as well. This cross-repository capability is a key feature of Xet, enabling efficient data sharing and collaboration. Hugging Face Blog.

The demonstrations and visuals in the blog show how Parquet CDC affects the size and transfer behavior of Parquet data as columns are added, removed, or have their types changed. Footer metadata, per-data-page changes, and per-column changes illustrate where data must be transferred versus where it is already known to the storage layer. A deduplication estimation tool and heatmaps visualize which data is new and which is unchanged. Hugging Face Blog.
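To make the hf:// read path concrete, here is a minimal sketch of reading a Parquet file hosted on the Hub. It assumes PyArrow 21.0.0+ with huggingface_hub installed, as described above; the repository and file name are hypothetical placeholders, not taken from the blog.

```python
import pyarrow.parquet as pq

# Read a Parquet file hosted on the Hub directly through the hf:// URI scheme.
# "username/my-dataset" and "data.parquet" are placeholders.
table = pq.read_table("hf://datasets/username/my-dataset/data.parquet")
print(table.num_rows, table.schema)
```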

What’s new

The core updates are twofold: a Parquet CDC feature now available in the main Parquet writers and a system that leverages Hugging Face’s Xet storage layer to deduplicate at the data-page level. The key points:

  • Parquet CDC is available in PyArrow and Pandas and works in concert with Xet for deduplication.
  • CDC is enabled by passing use_content_defined_chunking=True to the Parquet writer (both PyArrow and Pandas now support this approach).
  • The integration supports on-Hub reads and writes via hf:// URIs, enabling direct upload and download to the Hub with CDC in effect. Hugging Face Blog.
  • Deduplication benefits extend across repository boundaries, not just within a single file, highlighting Xet’s cross-repository data sharing capabilities. Hugging Face Blog.

The blog demonstrates the effect of CDC with several scenarios: uploading an original Parquet file and a modified one, then re-uploading to a different repository. In each case, only the changed data pages (for example, added columns or changed types) and the new footer metadata are transferred, while the existing data stays in place. This results in substantially reduced data transfer and faster operations when changes occur. Hugging Face Blog.

A number of practical considerations are discussed, including the default row-group size in PyArrow (1 Mi rows) and how dataset writers may adjust it to balance random-access performance and memory footprint. Changing row-group boundaries can influence how changes map to data pages, which in turn affects deduplication. The post also explores how the technique scales when datasets are split into multiple files for parallelism and access patterns; CDC remains effective across file boundaries, especially when combined with Xet. Hugging Face Blog.

To illustrate the mechanism, the post shows that enabling Parquet CDC yields considerably smaller transfers when uploading the original and the changed datasets, compared with the non-CDC case. The same benefits apply to downloads via huggingface_hub.hf_hub_download() and datasets.load_dataset(). Hugging Face Blog.
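A minimal sketch of the upload scenario described above, assuming a hypothetical dataset repository (username/my-dataset); the table contents, column names, and file paths are illustrative, not taken from the blog.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small stand-in table; the blog uses a subset of OpenOrca instead.
n = 100_000
table = pa.table({"id": list(range(n)), "text": ["example text"] * n})

# First upload with content-defined chunking enabled.
pq.write_table(
    table,
    "hf://datasets/username/my-dataset/data.parquet",
    use_content_defined_chunking=True,
)

# A modified copy (an added column) shares its unchanged data pages with the
# original, so Xet only needs to transfer the new column chunks and the
# updated footer metadata.
modified = table.append_column("source", pa.array(["augmented"] * n))
pq.write_table(
    modified,
    "hf://datasets/username/my-dataset/data-v2.parquet",
    use_content_defined_chunking=True,
)
```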

Why it matters (impact for developers/enterprises)

For developers and enterprises dealing with massive datasets, Parquet CDC on the Xet storage layer translates into tangible cost and time savings. The combination enables:

  • Dramatically reduced data transfer during uploads and downloads due to cross-file and cross-repository deduplication.
  • Lower storage costs by avoiding re-storing unchanged data pages and metadata when updates occur.
  • Better collaboration across teams since data can be updated and shared without re-uploading entire datasets.
  • Compatibility with existing Parquet-based workflows and readers, while gaining the efficiency benefits of CDC and Xet.

In practice, organizations that operate large Parquet datasets on the Hub can expect faster iteration cycles for data updates and lower bandwidth requirements for distribution, which is especially meaningful for distributed teams and pipelines. The approach aligns with the broader storage and data management goal of reducing wasteful data movement while preserving accessibility. Hugging Face Blog.

Technical details or Implementation

This section summarizes how to leverage Parquet CDC in typical workflows, emphasizing actionable steps and considerations:

  • Parquet CDC is supported in PyArrow 21.0.0+ and Pandas, enabling content-defined chunking at write time. The effect is to align chunking with data values prior to serialization and compression, improving deduplication with Xet.
  • To enable CDC, pass use_content_defined_chunking=True to the Parquet writer (pyarrow.parquet.write_table() in PyArrow, DataFrame.to_parquet() in Pandas). In PyArrow you can also operate on Hugging Face URIs directly by using hf:// paths for reads and writes. Hugging Face Blog.
  • The deduplication benefits are contextual and depend on how much of the data changes. If most pages are affected by a filter or update, the dedup ratio can drop; when changes are localized, the benefits are more pronounced. CDC operates at the Parquet data-page level, within each column chunk, before any serialization or compression. Hugging Face Blog.
  • Parquet writers use fixed-sized row-groups by default (PyArrow defaults to 1 Mi rows). Dataset writers may adjust row-group sizes to balance random access performance and memory usage. Changes that shift rows between row-groups can still influence data pages, so the CDC approach helps mitigate that impact by chunking based on content. Hugging Face Blog.
  • Across multi-file datasets, Parquet CDC combined with Xet can deduplicate data even when files are split at different boundaries, supporting efficient parallelism and access patterns. Hugging Face Blog.
  • For readers and tooling, the same CDC benefits apply to download paths such as hf_hub_download() and datasets.load_dataset(), ensuring end-to-end efficiency; the sketch below illustrates both the row-group setting and these download paths. Hugging Face Blog.

Implementation considerations include choosing the right row-group size for a given workload and understanding that the degree of deduplication depends on how changes are distributed across the dataset. The article notes that the Parquet CDC mechanism is applied at the data-page level, which is key to achieving consistent chunking and deduplication across updates. Hugging Face Blog.
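The sketch below ties the row-group and download notes together. The row-group size, repository name, and file names are hypothetical; the CDC flag and the download functions are the ones named above.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Writing: a smaller row-group size than PyArrow's 1 Mi default, combined
# with content-defined chunking so data pages stay aligned with the content.
table = pa.table({"id": list(range(10_000)), "label": ["a"] * 10_000})
pq.write_table(
    table,
    "data.parquet",
    row_group_size=256 * 1024,
    use_content_defined_chunking=True,
)

# Downloading: both paths below benefit from Xet deduplication when the
# Parquet files on the Hub were written with CDC enabled.
local_path = hf_hub_download(
    repo_id="username/my-dataset", filename="data.parquet", repo_type="dataset"
)
ds = load_dataset("username/my-dataset", split="train")
```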

Key takeaways

  • Parquet CDC brings content-defined chunking to Parquet writes, enabling better deduplication on the Xet storage layer.
  • Enabling CDC is straightforward with use_content_defined_chunking=True and works with hf:// URIs for Hub operations.
  • The combination reduces both data transfer and storage costs, improves cross-repository deduplication, and scales across multi-file datasets.
  • Row-group sizing and how data changes are distributed across the dataset influence dedup performance; CDC mitigates many of these challenges by chunking based on content.
  • Readers, writers, and download tools benefit from CDC-enabled workflows, maintaining compatibility with existing Parquet-based pipelines. Hugging Face Blog.

FAQ

  • What is Parquet Content-Defined Chunking (CDC)?

    CDC is a Parquet feature that chunks data pages based on content to improve deduplication effectiveness when writing Parquet files, especially on content-addressable storage like Xet. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).

  • How does CDC interact with Hugging Face’s Xet storage layer?

    CDC aligns chunk boundaries with actual data values, enabling more efficient deduplication across files and repositories within Xet. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).

  • How do I enable Parquet CDC in my workflow?

    Pass `use_content_defined_chunking=True` to the Parquet writer (`pyarrow.parquet.write_table()` in PyArrow, `DataFrame.to_parquet()` in Pandas); access via `hf://` URIs is supported for on-Hub I/O. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).
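    For example, a minimal pandas sketch (the repository path is a placeholder; pandas forwards the keyword to the PyArrow engine, and `hf://` paths require huggingface_hub to be installed):

    ```python
    import pandas as pd

    df = pd.DataFrame({"id": range(1_000), "label": ["a"] * 1_000})
    df.to_parquet(
        "hf://datasets/username/my-dataset/train.parquet",  # placeholder repo
        engine="pyarrow",
        use_content_defined_chunking=True,
    )
    ```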

  • Are there trade-offs or limitations to CDC?

    The deduplication benefit depends on how changes are distributed across data pages; changes that touch most pages reduce the dedup gains. The row-group size and how data is split across files also influence outcomes. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).

References

  • Parquet Content-Defined Chunking Enables Efficient Deduplication with Hugging Face Xet, Hugging Face Blog: https://huggingface.co/blog/parquet-cdc