Parquet Content-Defined Chunking Enables Efficient Deduplication with Hugging Face Xet
Sources: https://huggingface.co/blog/parquet-cdc, Hugging Face Blog
TL;DR
- Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas and works with Hugging Face’s Xet storage layer to enable efficient deduplication.
- CDC applies content-defined chunking at the data-page level, before serialization and compression, so unchanged data deduplicates and uploads/downloads shrink across repositories and files.
- Enable Parquet CDC by passing `use_content_defined_chunking=True` to the Parquet writer; PyArrow can read and write directly on the Hub via the `hf://` URI scheme.
- The combination of Parquet CDC and Xet significantly lowers data transfer and storage costs for large-scale Hugging Face datasets.
Context and background
Parquet is a widely used columnar storage format in data engineering, and Hugging Face hosts an enormous amount of Parquet data on the Hub. The Hub’s new Xet storage layer uses content-defined chunking to deduplicate chunks of data, reducing storage costs and speeding up uploads and downloads. Because Parquet’s layout—particularly column chunks and data pages—can produce different byte-level representations for small data changes, deduplication can be suboptimal unless the data is written with chunking that matches the data’s content. This is where CDC helps: it aligns chunk boundaries with the logical data values, enabling more effective deduplication before any serialization or compression.
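To make the mechanism concrete, here is a toy sketch of content-defined chunking in Python. It is purely illustrative: the rolling-sum boundary rule, window size, and mask are arbitrary choices for demonstration and are not the algorithm used by Xet or by Parquet CDC.

```python
import random

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Toy content-defined chunker: cut where a rolling sum over the last
    `window` bytes has its low bits equal to zero (illustrative only)."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        # The boundary decision depends only on nearby content, not on absolute offsets.
        if (rolling & mask) == 0 and i + 1 - start >= window:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(1 << 16))
edited = b"XYZ" + original  # small insertion near the start

shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} chunks are byte-identical despite the shifted offsets")
```

With fixed-size blocks, the 3-byte insertion would shift every later block and nothing downstream would match; with content-defined boundaries, most chunks after the edit are unchanged, which is what makes deduplication effective.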
The blog post uses a manageable subset of the OpenOrca dataset to illustrate the workflow. Since PyArrow 21.0.0, Hugging Face URIs can be used directly in PyArrow functions to read and write Parquet (and other formats) on the Hub via the `hf://` URI scheme. This integration is central to performing end-to-end Parquet operations on the Hub without leaving the environment, and to showcasing how CDC interacts with the Xet storage layer to achieve deduplication across repositories and files. The broader goal is to minimize data transfer while preserving accessibility and compatibility with existing readers and writers. Hugging Face Blog.
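As a minimal sketch of that integration, assuming the PyArrow 21.0.0+ `hf://` support described above and a configured Hugging Face token (the repository path is a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder dataset repository and filename.
uri = "hf://datasets/your-username/your-dataset/data.parquet"

table = pq.read_table(uri)                                      # read straight from the Hub
pq.write_table(table, uri, use_content_defined_chunking=True)   # write back with CDC enabled
```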
Traditional file systems do not deduplicate files across repositories, leading to full re-uploads and re-downloads. By contrast, a system that uses content-defined chunking can recognize identical content and avoid transferring unchanged data. The article demonstrates that identical tables uploaded to the Hub can be transferred instantly when CDC is enabled, and the deduplication works across repositories as well. This cross-repository capability is a key feature of Xet, enabling efficient data sharing and collaboration. Hugging Face Blog.
The demonstration scenarios and visuals in the blog show how Parquet CDC affects the size and behavior of Parquet data as columns are added, removed, or change type. Footer metadata, per-data-page changes, and per-column changes illustrate where data is transferred versus where it is already known to the storage layer. The deduplication estimation tool and heatmaps are used to visualize where data is new versus unchanged. Hugging Face Blog.
What’s new
The core updates are twofold: a Parquet CDC feature now available in the main Parquet writers, and integration with Hugging Face’s Xet storage layer that deduplicates at the data-page level. The key points:
- Parquet CDC is available in PyArrow and Pandas and works in concert with Xet for deduplication.
- CDC is enabled by passing `use_content_defined_chunking=True` to the Parquet writer; both PyArrow and Pandas support this approach (a minimal sketch follows this list).
- The integration supports on-Hub reads and writes via `hf://` URIs, enabling direct upload and download to the Hub with CDC in effect. Hugging Face Blog.
- Deduplication benefits extend across repository boundaries, not just within a single file, highlighting Xet’s cross-repository data sharing capabilities. Hugging Face Blog.
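A minimal sketch of enabling the flag in both writers; paths and column names are placeholders, and pandas is assumed to forward the keyword to its PyArrow engine:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": range(1_000), "text": ["hello"] * 1_000})

# Pandas path: the keyword is forwarded to the underlying PyArrow writer.
df.to_parquet(
    "hf://datasets/your-username/your-dataset/train.parquet",
    engine="pyarrow",
    use_content_defined_chunking=True,
)

# PyArrow path: the same flag on write_table.
pq.write_table(
    pa.Table.from_pandas(df),
    "hf://datasets/your-username/your-dataset/train.parquet",
    use_content_defined_chunking=True,
)
```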
The blog demonstrates the effect of CDC with several scenarios: uploading an original and a modified Parquet file, and then re-uploading to a different repository. In each case, the updated data (like added columns or changed types) triggers only the changed data pages and the new footer metadata to transfer, while the existing data stays in place. This results in substantially reduced data transfer and faster operations when changes occur. Hugging Face Blog.
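A hedged sketch of that update-and-reupload flow; the repository names and the derived column are illustrative, and the comments describe the expected behavior when CDC and Xet are in play:

```python
import pandas as pd

src = "hf://datasets/your-username/openorca-subset/train.parquet"  # placeholder repo

df = pd.read_parquet(src)
df["response_chars"] = df["response"].str.len()  # localized change: one derived column

# Re-upload in place: unchanged columns' data pages dedupe against what the Hub already has,
# so only the new column's pages and the rewritten footer metadata need to travel.
df.to_parquet(src, use_content_defined_chunking=True)

# Upload the same table to a different repository: deduplication is content-addressed,
# so the transfer is still mostly skipped.
df.to_parquet(
    "hf://datasets/your-username/openorca-subset-copy/train.parquet",
    use_content_defined_chunking=True,
)
```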
A number of practical considerations are discussed, including the default row-group size in PyArrow (1 Mi rows) and how dataset writers may adjust this to balance random access performance and memory footprint. Changing row-group boundaries can influence how changes map to data pages, which in turn affects deduplication. The post also explores how the technique scales when datasets are split into multiple files for parallelism and access patterns; CDC remains effective across file boundaries, especially when combined with Xet. Hugging Face Blog.
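For example, a writer might shrink row groups for finer random access while keeping CDC on; the sizes and path below are arbitrary illustrations, not recommendations from the post:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000)), "label": ["x"] * 1_000_000})

pq.write_table(
    table,
    "hf://datasets/your-username/your-dataset/data.parquet",  # placeholder path
    row_group_size=128 * 1024,            # smaller than PyArrow's 1 Mi-row default
    use_content_defined_chunking=True,    # keep content-defined data pages
)
```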
To illustrate the mechanism, the post shows that enabling Parquet CDC yields considerably smaller transfers when uploading the original and the changed datasets, compared with the non-CDC case. The same performance benefits apply to downloads via the `huggingface_hub.hf_hub_download()` and `datasets.load_dataset()` functions. Hugging Face Blog.
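Both download paths look the same as they always have; the deduplication happens underneath. The repository id below is a placeholder:

```python
from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Fetch a single Parquet file from a dataset repository...
local_path = hf_hub_download(
    repo_id="your-username/your-dataset",  # placeholder repo id
    filename="train.parquet",
    repo_type="dataset",
)

# ...or load the whole dataset; Xet-backed transfers skip chunks already stored locally.
ds = load_dataset("your-username/your-dataset", split="train")
```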
Why it matters (impact for developers/enterprises)
For developers and enterprises dealing with massive datasets, Parquet CDC on the Xet storage layer translates into tangible cost and time savings. The combination enables:
- Dramatically reduced data transfer during uploads and downloads due to cross-file and cross-repository deduplication.
- Lower storage costs by avoiding re-storing unchanged data pages and metadata when updates occur.
- Better collaboration across teams since data can be updated and shared without re-uploading entire datasets.
- Compatibility with existing Parquet-based workflows and readers, while gaining the efficiency benefits of CDC and Xet.

In practice, organizations that operate large Parquet datasets on the Hub can expect faster iteration cycles for data updates and lower bandwidth requirements for distribution, which is especially meaningful for distributed teams and pipelines. The approach aligns with broader storage and data management goals of reducing wasteful data movement while preserving accessibility. Hugging Face Blog.
Technical details and implementation
This section summarizes how to leverage Parquet CDC in typical workflows, emphasizing actionable steps and considerations:
- Parquet CDC is supported in PyArrow 21.0.0+ and Pandas, enabling content-defined chunking at write time. The effect is to align chunking with data values prior to serialization and compression, improving deduplication with Xet.
- To enable CDC, pass `use_content_defined_chunking=True` to the Parquet writer (`pyarrow.parquet.write_table` in PyArrow, `DataFrame.to_parquet` in Pandas). In PyArrow, you can also operate on Hugging Face URIs directly by using `hf://` paths for reads and writes. Hugging Face Blog.
- The deduplication benefits are contextual and depend on how much of the data changes. If most pages are affected by a filter or update, the dedup ratio can drop; when changes are localized, the benefits are more pronounced. CDC works at the Parquet data-page level, within each column chunk, before any serialization or compression. Hugging Face Blog.
- Parquet writers use fixed-size row groups by default (PyArrow defaults to 1 Mi rows). Dataset writers may adjust row-group sizes to balance random access performance and memory usage. Changes that shift rows between row groups can still influence data pages, so the CDC approach helps mitigate that impact by chunking based on content. Hugging Face Blog.
- Across multi-file datasets, Parquet CDC combined with Xet can deduplicate data even when files are split at different boundaries, supporting efficient parallelism and access patterns. Hugging Face Blog.
- For readers and tooling, the same CDC benefits apply to download paths like `hf_hub_download()` and `datasets.load_dataset()`, ensuring end-to-end efficiency. Hugging Face Blog.

Implementation considerations include choosing the right row-group size for a given workload and understanding that the degree of deduplication depends on the distribution of changes across the dataset. The article notes that the Parquet CDC mechanism is applied at the Parquet data-page level, which is key to achieving consistent chunking and dedup across updates. Hugging Face Blog.
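A sketch of a sharded write under those considerations; the shard count, sizes, and paths are illustrative assumptions rather than guidance from the post:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000))})
num_shards = 4
shard_rows = -(-table.num_rows // num_shards)  # ceiling division

for i in range(num_shards):
    shard = table.slice(i * shard_rows, shard_rows)  # slice clamps at the table end
    pq.write_table(
        shard,
        f"hf://datasets/your-username/your-dataset/part-{i:04d}.parquet",
        use_content_defined_chunking=True,  # CDC keeps pages stable across re-sharding
    )
```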
Key takeaways
- Parquet CDC brings content-defined chunking to Parquet writes, enabling better deduplication on the Xet storage layer.
- Enabling CDC is straightforward with `use_content_defined_chunking=True` and works with `hf://` URIs for Hub operations.
- The combination reduces both data transfer and storage costs, improves cross-repository deduplication, and scales across multi-file datasets.
- Row-group sizing and how data changes are distributed across the dataset influence dedup performance; CDC mitigates many of these challenges by chunking based on content.
- Readers, writers, and download tools benefit from CDC-enabled workflows, maintaining compatibility with existing Parquet-based pipelines. Hugging Face Blog.
FAQ
- **What is Parquet Content-Defined Chunking (CDC)?** CDC is a Parquet feature that chunks data pages based on content to improve deduplication effectiveness when writing Parquet files, especially on content-addressable storage like Xet. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).
- **How does CDC interact with Hugging Face’s Xet storage layer?** CDC aligns chunk boundaries with actual data values, enabling more efficient deduplication across files and repositories within Xet. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).
- **How do I enable Parquet CDC in my workflow?** Pass `use_content_defined_chunking=True` to the Parquet writer in PyArrow (Pandas supports the feature as well); access via `hf://` URIs is supported for on-Hub I/O. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).
- **Are there trade-offs or limitations to CDC?** The deduplication benefit depends on how changes are distributed across data pages; changes that touch many pages reduce dedup gains. The row-group size and how data is split across files also influence outcomes. [Hugging Face Blog](https://huggingface.co/blog/parquet-cdc).
References
- Hugging Face Blog, Parquet Content-Defined Chunking: https://huggingface.co/blog/parquet-cdc