Parquet Content-Defined Chunking with Xet Storage on Hugging Face Hub

Source: https://huggingface.co/blog/parquet-cdc (Hugging Face Blog)

Overview

Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling efficient deduplication of Parquet files on top of Hugging Face's Xet storage layer. CDC deduplicates data at the level of content-defined chunks, so when you upload or download Parquet data, only the changed data is transferred. This can dramatically reduce data transfer and storage costs, especially for large datasets stored on the Hub.

Xet is a storage layer designed to deduplicate chunks across repositories and files. Parquet's layout (column chunks and data pages, combined with compression) can produce very different byte-level representations after small logical changes, which defeats byte-oriented deduplication. CDC addresses this by chunking data based on its content rather than fixed byte boundaries, aligning the deduplication process with the logical data.

To enable CDC, pass use_content_defined_chunking=True to the PyArrow Parquet writer (pyarrow.parquet.write_table); Pandas supports the same option through DataFrame.to_parquet. With pyarrow>=21.0.0 installed, you can read and write Parquet data on the Hugging Face Hub directly via hf:// URIs. The Hub's storage statistics show that the uploaded data can be significantly smaller when CDC is enabled. The blog walks through scenarios such as adding or removing columns, changing column types, and appending rows, and shows deduplication across repositories and across files as well as the impact of different row-group sizes. The main takeaway is that Parquet CDC, in combination with Xet, enables more efficient and scalable data workflows on the Hub.

Key features

  • Content-defined chunking operates at the Parquet data page / column chunk level to improve deduplication granularity.
  • Works with PyArrow and Pandas, enabling CDC in Parquet writes.
  • Deduplication across repositories and across files, not just within a single file.
  • Integration with Hugging Face Xet storage layer to minimize transferred bytes and storage footprint.
  • Support for common Parquet edits (adding/removing columns, changing types) with reduced re-upload.
  • Compatibility with hf:// URIs, enabling direct read/write to the Hub when pyarrow is >= 21.0.0.
  • Dedup performance varies with data changes and reader/writer constraints; row-group size can influence outcomes.
  • Highlights that downloads also benefit from CDC when using the Hub APIs (hf_hub_download, datasets.load_dataset).
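
The download path benefits as well. As a minimal sketch of fetching a CDC-written file through the standard Hub APIs (the repository name "username/my-dataset" and the file "data.parquet" are placeholders, not taken from the blog):

# download a single Parquet file from a dataset repo
from huggingface_hub import hf_hub_download
local_path = hf_hub_download(
    repo_id="username/my-dataset",   # placeholder repo
    filename="data.parquet",
    repo_type="dataset",
)

# or load the whole dataset with the datasets library
from datasets import load_dataset
ds = load_dataset("username/my-dataset", split="train")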

Common use cases

  • Large Parquet datasets stored on Xet where data evolves over time and you want to avoid re-uploading unchanged data.
  • Collaborative data workflows across multiple repositories, sharing updated columns or rows with minimal transfer.
  • Schema evolution scenarios (adding/removing columns, changing data types) where only affected parts are transferred; see the sketch after this list.
  • Data pipelines that rely on Parquet’s columnar layout and want to optimize storage and bandwidth across Cloud-like storage backends.
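
For the schema-evolution use case, the sketch below (placeholder repository and column names, not taken from the blog) changes a column's type and rewrites the file with CDC enabled, so the untouched columns can deduplicate against the previous upload:

import pyarrow as pa
import pyarrow.parquet as pq

path = "hf://datasets/username/my-dataset/data.parquet"  # placeholder repo
tbl = pq.read_table(path)

# cast the "id" column from int64 to int32 and swap it into the table
idx = tbl.schema.get_field_index("id")
tbl = tbl.set_column(idx, "id", tbl["id"].cast(pa.int32()))

# rewrite with CDC; columns that did not change should dedup against the earlier version
pq.write_table(tbl, path, use_content_defined_chunking=True)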

Setup & installation

To use Parquet CDC, you need PyArrow 21.0.0 or newer; Pandas can be layered on top for the DataFrame API. Install the required packages, then run CDC-enabled writes.

# Install core dependencies with CDC support
pip install "pyarrow>=21.0.0" pandas
# Optional: additional Hugging Face tooling for Hub access and dataset loading
pip install huggingface_hub datasets

# PyArrow example: write a table with CDC enabled
import pyarrow as pa
import pyarrow.parquet as pq

# simple example table
tbl = pa.Table.from_pydict({"id": [1, 2, 3], "val": [10, 20, 30]})

# enable content-defined chunking; "username/my-dataset" is a placeholder,
# replace it with your own dataset repository on the Hub
pq.write_table(tbl, "hf://datasets/username/my-dataset/data.parquet",
               use_content_defined_chunking=True)

# Pandas example: write a DataFrame with CDC enabled
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
df.to_parquet("hf://datasets/username/my-dataset/data.parquet",
              use_content_defined_chunking=True)

Note: the use_content_defined_chunking option requires pyarrow>=21.0.0; reading and writing hf:// URIs for direct Hub I/O additionally requires the huggingface_hub package installed above.
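
A quick way to sanity-check the prerequisites before attempting Hub I/O (a minimal sketch; authentication is only needed for writes and for private or gated repositories):

import pyarrow
print(pyarrow.__version__)  # should print 21.0.0 or newer

# authenticate for writes or private repos, e.g. via an interactive login
from huggingface_hub import login
login()  # alternatively, set the HF_TOKEN environment variable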

Quick start

The following minimal workflow demonstrates writing Parquet with CDC, then reading it back, illustrating how the Hub can deduplicate data and deliver fast I/O.

import pyarrow as pa
import pyarrow.parquet as pq

# "username/my-dataset" is a placeholder; replace it with your own Hub repository
path = "hf://datasets/username/my-dataset/data.parquet"

# 1) Create and write initial data
tbl1 = pa.Table.from_pydict({"id": [1, 2, 3], "value": [100, 200, 300]})
pq.write_table(tbl1, path, use_content_defined_chunking=True)

# 2) Read back
tbl_read = pq.read_table(path)
print(tbl_read)

# 3) Modify the data (e.g., add a column)
tbl2 = tbl1.append_column("extra", pa.array([1, 1, 1]))
pq.write_table(tbl2, path, use_content_defined_chunking=True)

# 4) Read the updated data
print(pq.read_table(path))

This demonstrates writing with CDC and re-reading from the Hub. Because chunking is content-defined, the second write transfers only new or changed chunks rather than the entire file.
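
The overview also mentions appending rows. A minimal sketch of that scenario, reusing the placeholder path from the quick start:

import pyarrow as pa
import pyarrow.parquet as pq

path = "hf://datasets/username/my-dataset/data.parquet"  # placeholder repo
existing = pq.read_table(path)

# new rows must match the current schema (id, value, extra)
new_rows = pa.Table.from_pydict({"id": [4, 5], "value": [400, 500], "extra": [1, 1]})

# rewrite the combined table; unchanged chunks dedup against the previous upload
combined = pa.concat_tables([existing, new_rows])
pq.write_table(combined, path, use_content_defined_chunking=True)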

Pros and cons

  • Pros
      • Substantial reductions in upload/download size for Parquet datasets where changes are incremental.
      • Deduplication across multiple files and across repositories, enabling efficient data sharing.
      • Works with standard Parquet workflows via PyArrow and Pandas, and with direct Hub I/O via hf:// URIs.
      • Supports typical Parquet edits (adding/removing columns, type changes) with less data movement.
  • Cons
      • Deduplication performance depends on how data changes map to Parquet data pages; some edits reduce the gains.
      • Optimal results may require tuning row-group sizes per workload; see the sketch after this list.
      • Requires Parquet writers that support content-defined chunking (CDC-enabled writes).
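
On the row-group point, pyarrow.parquet.write_table accepts a row_group_size argument alongside the CDC flag. The sketch below uses a placeholder repository and an illustrative row-group size; the best value depends on your data and readers:

import pyarrow as pa
import pyarrow.parquet as pq

tbl = pa.Table.from_pydict({"id": list(range(1_000_000))})

# smaller row groups can localize changes; larger ones reduce metadata overhead
pq.write_table(
    tbl,
    "hf://datasets/username/my-dataset/data.parquet",  # placeholder repo
    row_group_size=128_000,          # illustrative value, tune per workload
    use_content_defined_chunking=True,
)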

Alternatives (brief comparisons)

| Approach | Dedup across repos | CDC support | Notes |
|---|---|---|---|
| Parquet without CDC on Xet | Limited to file-level dedup only | No | Re-uploads can transfer unchanged content. |
| Parquet CDC with Xet (Hub) | Yes | Yes | Reduces data transfer; relies on CDC-enabled writers. |
| Traditional cloud storage without Xet | No cross-repo dedup | No | Transfers entire data more often. |

Pricing or License

Pricing and licensing details are not specified in the referenced material.

References

https://huggingface.co/blog/parquet-cdc
