Parquet Content-Defined Chunking with Xet Storage on Hugging Face Hub
Sources: https://huggingface.co/blog/parquet-cdc, Hugging Face Blog
Overview
Parquet Content-Defined Chunking (CDC) is now available for PyArrow and Pandas, enabling efficient deduplication of Parquet files on top of Hugging Face's Xet storage layer. CDC deduplicates data at the level of content-defined pages or chunks, so when you upload or download Parquet data, only the changed data is transferred. This can dramatically reduce data transfer and storage costs, especially for large datasets stored on the Hub.
Xet is a storage layer designed to deduplicate chunks across repositories and files. Parquet's layout of column chunks and data pages, combined with compression, means that small logical changes can produce very different byte-level representations. CDC addresses this by choosing chunk boundaries based on content rather than fixed byte offsets, aligning deduplication with the logical data.
To enable CDC, pass use_content_defined_chunking=True to pyarrow.parquet.write_table (Pandas' DataFrame.to_parquet supports the same option). With pyarrow>=21.0.0 installed, you can read and write Parquet data on the Hugging Face Hub directly via hf:// URIs. The Hub's storage statistics show that uploads can be significantly smaller when CDC is enabled.
The blog walks through scenarios such as adding or removing columns, changing column types, and appending rows, and shows deduplication across repositories and across files as well as the impact of different row-group sizes. The main takeaway is that Parquet CDC, in combination with Xet, enables more efficient and scalable data workflows on the Hub.
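To make the chunking idea concrete, here is a toy, hypothetical content-defined chunker (not the actual Xet or Parquet implementation, just an illustration of the principle): boundaries are derived from the bytes themselves, so a small insertion near the start of a stream only disturbs nearby chunks instead of shifting every fixed-size block that follows.
import hashlib
import random
def toy_cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut a chunk wherever the hash of a small sliding window matches a bit pattern."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        digest = hashlib.blake2b(data[i - window:i], digest_size=8).digest()
        if int.from_bytes(digest, "big") & mask == 0:  # ~1/64 chance of a boundary
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks
random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(20_000))
edited = original[:100] + b"NEW BYTES" + original[100:]  # small edit near the front
a, b = toy_cdc_chunks(original), toy_cdc_chunks(edited)
print(f"{len(set(a) & set(b))} of {len(a)} original chunks are reused unchanged")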
Key features
- Content-defined chunking operates at the Parquet data page / column chunk level to improve deduplication granularity.
- Works with PyArrow and Pandas, enabling CDC in Parquet writes.
- Deduplication across repositories and across files, not just within a single file.
- Integration with Hugging Face Xet storage layer to minimize transferred bytes and storage footprint.
- Support for common Parquet edits (adding/removing columns, changing types) with reduced re-upload.
- Compatibility with hf:// URIs, enabling direct read/write to the Hub when pyarrow is >= 21.0.0.
- Dedup performance varies with data changes and reader/writer constraints; row-group size can influence outcomes.
- Downloads also benefit from CDC when using the Hub APIs such as hf_hub_download and datasets.load_dataset (see the download sketch after this list).
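As a sketch of the download side (the repository and file names "my-user/my-repo" and "data.parquet" are placeholders), both hf_hub_download and datasets.load_dataset fetch files from Xet-backed storage, so previously transferred chunks do not need to be downloaded again:
from huggingface_hub import hf_hub_download
from datasets import load_dataset
# Download a single Parquet file from a dataset repository.
local_path = hf_hub_download(
    repo_id="my-user/my-repo",
    filename="data.parquet",
    repo_type="dataset",
)
# Or load the whole dataset repository through the datasets library.
ds = load_dataset("my-user/my-repo", split="train")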
Common use cases
- Large Parquet datasets stored on Xet where data evolves over time and you want to avoid re-uploading unchanged data.
- Collaborative data workflows across multiple repositories, sharing updated columns or rows with minimal transfer.
- Schema evolution scenarios (adding/removing columns, changing data types) where only the affected parts are transferred (see the sketch after this list).
- Data pipelines that rely on Parquet’s columnar layout and want to optimize storage and bandwidth across Cloud-like storage backends.
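For the schema-evolution case, a minimal sketch (the repo path and the column names "extra" and "value" are placeholders, and the repository is assumed to already contain data.parquet): only the columns whose bytes actually change produce new chunks to upload.
import pyarrow as pa
import pyarrow.parquet as pq
tbl = pq.read_table("hf://datasets/my-user/my-repo/data.parquet")
# Drop one column and widen another column's type.
tbl = tbl.drop_columns(["extra"])
tbl = tbl.set_column(
    tbl.schema.get_field_index("value"),
    "value",
    tbl["value"].cast(pa.int64()),
)
# Rewrite with CDC; unchanged columns deduplicate against the previous upload.
pq.write_table(tbl, "hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)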
Setup & installation
To use Parquet CDC you need PyArrow 21.0.0 or newer; Pandas is only needed for the DataFrame-based API. Install the packages, then enable CDC on your Parquet writes.
# Install core dependencies with CDC support
pip install "pyarrow>=21.0.0" pandas
# Install Hugging Face tooling: huggingface_hub provides the hf:// filesystem; datasets is optional for loading data
pip install huggingface_hub datasets
# PyArrow example: write a table with CDC enabled
import pyarrow as pa
import pyarrow.parquet as pq
# simple example table
tbl = pa.Table.from_pydict({"id": [1, 2, 3], "val": [10, 20, 30]})
# enable content-defined chunking
pq.write_table(tbl, "hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)
# Pandas example: write a DataFrame with CDC enabled
import pandas as pd
df = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
df.to_parquet("hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)
Note: the use_content_defined_chunking option requires pyarrow>=21.0.0; reading and writing hf:// URIs additionally requires the huggingface_hub package, which provides the Hub filesystem implementation.
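Before the first hf:// write, you typically need to authenticate and create the target dataset repository; a minimal sketch, assuming a dataset repo named my-user/my-repo:
from huggingface_hub import login, create_repo
login()  # or set the HF_TOKEN environment variable / run huggingface-cli login
create_repo("my-user/my-repo", repo_type="dataset", exist_ok=True)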
Quick start
The following minimal workflow demonstrates writing Parquet with CDC, then reading it back, illustrating how the Hub can deduplicate data and deliver fast I/O.
import pyarrow as pa
import pyarrow.parquet as pq
# 1) Create and write initial data
tbl1 = pa.Table.from_pydict({"id": [1, 2, 3], "value": [100, 200, 300]})
pq.write_table(tbl1, "hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)
# 2) Read back
tbl_read = pq.read_table("hf://datasets/my-user/my-repo/data.parquet")
print(tbl_read)
# 3) Modify data (e.g., add a column)
tbl2 = tbl1.append_column("extra", pa.array([1, 1, 1]))
pq.write_table(tbl2, "hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)
# 4) Read updated data
print(pq.read_table("hf://datasets/my-user/my-repo/data.parquet"))
This demonstrates writing with CDC and re-reading from the Hub. Because the second write shares most of its data pages with the first, only the new or changed chunks are uploaded rather than the entire file.
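Appending rows, another scenario from the blog, follows the same pattern. A sketch continuing from the tables above (tbl2 and the placeholder repo path): the existing rows produce the same chunks, so mostly the new rows' pages need to be uploaded.
# 5) Append two rows and rewrite; schemas must match for concatenation
new_rows = pa.Table.from_pydict({"id": [4, 5], "value": [400, 500], "extra": [1, 1]})
tbl3 = pa.concat_tables([tbl2, new_rows])
pq.write_table(tbl3, "hf://datasets/my-user/my-repo/data.parquet", use_content_defined_chunking=True)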
Pros and cons
- Pros
- Substantial reductions in upload/download size for Parquet datasets where changes are incremental.
- Dedup across multiple files and across repositories, enabling efficient data sharing.
- Works with standard Parquet workflows via PyArrow and Pandas, and direct Hub I/O via hf:// URIs.
- Supports typical Parquet edits (adding/removing columns, type changes) with less data movement.
- Cons
- Dedup performance depends on how data changes map to Parquet data pages; some edits can reduce gains.
- Optimal results may require tuning row-group sizes per workload (see the sketch after this list).
- Requires Parquet writers to support content-defined chunking (CDC-enabled writes).
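As a sketch of the row-group knob (a local file and arbitrarily chosen sizes), row_group_size is a standard pyarrow.parquet.write_table argument that can be set alongside CDC:
import pyarrow as pa
import pyarrow.parquet as pq
tbl = pa.Table.from_pydict({"id": list(range(1_000_000)), "val": list(range(1_000_000))})
pq.write_table(
    tbl,
    "tuned.parquet",
    row_group_size=128 * 1024,  # max rows per row group; the best value is workload-dependent
    use_content_defined_chunking=True,
)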
Alternatives (brief comparisons)
| Approach | Dedup across repos | CDC support | Notes |
|---|---|---|---|
| Parquet without CDC on Xet | Limited to file-level dedup only | No | Re-uploads can transfer unchanged content. |
| Parquet CDC with Xet (Hub) | Yes | Yes | Reduces data transfer; relies on CDC-enabled writers. |
| Traditional cloud storage without Xet | No cross-repo dedup | No | Transfers entire data more often. |
Pricing or License
Pricing and licensing details are not specified in the referenced material.