Building csv-stream-diff: A Fast, Streaming CSV Comparison Tool for Very Large Files
Once CSV datasets move into the millions of rows, the usual comparison approaches start to fall apart. Loading both files into memory is expensive. Spreadsheet tools stop being useful. Even many ad hoc scripts become slow, fragile, or impossible to run reliably in production-like environments.
That is the problem I wanted to solve with csv-stream-diff.
The Problem
In real systems, CSV comparison is rarely just "diff these two files."
Usually the job looks more like this:
- The files are large enough that a full in-memory load is risky or impossible
- The key columns on the left and right files do not use the same names
- The comparison should only consider a selected subset of columns
- Duplicate keys may exist and need to be reported clearly
- Sometimes a full comparison is required, but sometimes a statistically useful sample is enough
- The output needs to be machine-readable so it can feed downstream validation or remediation workflows
I wanted a tool that could handle that cleanly, with minimal dependencies, and still be easy to package and run anywhere.
What csv-stream-diff Does
csv-stream-diff is a Python CLI tool for comparing very large CSV files using:
- streaming reads
- hash-based partitioning
- multiprocessing
- YAML-driven configuration
It produces structured output files for:
- rows only on the left
- rows only on the right
- rows with cell-level differences
- duplicate keys
- summary metadata
It is designed to be practical rather than clever.
The Core Design
The main design constraint was memory.
If a tool tries to build a single giant in-memory index for both files, it will eventually hit a limit. So instead of comparing the full files at once, csv-stream-diff uses a two-phase approach.
1. Partition both files into hashed buckets
Each row key is normalized and hashed into a bucket. The left and right files are streamed row by row and written into matching bucket files on disk.
That matters because rows with the same normalized key always land in the same bucket. Once that is true, each bucket can be compared independently.
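The bucketing idea can be sketched in a few lines. This is an illustration of the technique rather than the tool's actual code: the function names, the strip-and-lowercase normalization, and the bucket-file naming scheme are all assumptions.

```python
import csv
import hashlib

def bucket_for(key_values, num_buckets):
    """Map a normalized row key to a stable bucket index.

    Normalization here (strip + lowercase) is an assumption;
    the real tool may normalize differently.
    """
    normalized = "\x1f".join(v.strip().lower() for v in key_values)
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def partition(path, key_columns, num_buckets, out_prefix):
    """Stream a CSV row by row and append each row to its bucket file."""
    writers, files = {}, []
    with open(path, newline="") as src:
        reader = csv.DictReader(src)
        for row in reader:
            b = bucket_for([row[c] for c in key_columns], num_buckets)
            if b not in writers:
                f = open(f"{out_prefix}_{b:04d}.csv", "w", newline="")
                files.append(f)
                w = csv.DictWriter(f, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[b] = w
            writers[b].writerow(row)
    for f in files:
        f.close()
```

Because the hash is computed from the normalized key, identical keys from the left and right files always hash to the same bucket index, regardless of which file they came from.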
2. Compare buckets in parallel
After partitioning, the tool compares bucket pairs using multiple worker processes. Each worker only needs to index one bucket of the left file at a time, not the entire dataset.
This keeps memory bounded while still taking advantage of all available CPU cores.
The result is a design that scales much better for heavy loads than a naïve single-process implementation.
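The per-bucket comparison can be sketched as follows. Again, this is a simplified illustration under assumed names: only the left bucket is held in a dictionary, and the right bucket is streamed against it.

```python
import csv
from multiprocessing import Pool

def compare_bucket(args):
    """Compare one left/right bucket pair.

    Indexes only the left bucket in memory, then streams the right
    bucket against it. A sketch of the idea, not the tool's code.
    """
    left_path, right_path, key_cols, cmp_cols = args
    left_index = {}
    with open(left_path, newline="") as f:
        for row in csv.DictReader(f):
            key = tuple(row[c] for c in key_cols)
            left_index.setdefault(key, row)  # first occurrence wins
    only_right, diffs = [], []
    with open(right_path, newline="") as f:
        for row in csv.DictReader(f):
            key = tuple(row[c] for c in key_cols)
            match = left_index.pop(key, None)
            if match is None:
                only_right.append(key)
            else:
                changed = [c for c in cmp_cols if match[c] != row[c]]
                if changed:
                    diffs.append((key, changed))
    only_left = list(left_index)  # keys never matched by the right side
    return only_left, only_right, diffs

def compare_all(bucket_pairs, workers):
    """Fan bucket pairs out across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(compare_bucket, bucket_pairs)
```

Peak memory is bounded by the largest single left bucket, not by the size of either full file, which is what lets the worker count scale with available cores.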
Why YAML Configuration
I did not want the CLI to become a wall of flags.
The comparison usually needs several pieces of information:
- the left and right file paths
- the left and right key columns
- the left and right comparison columns
- CSV dialect options
- output paths
- sampling settings
- performance settings
That is much easier to manage in a YAML file than on the command line.
The CLI still supports overrides, but the YAML file is the primary contract. That makes runs reproducible and easier to version alongside data validation jobs.
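A config file covering those pieces might look something like this. The key names below are illustrative, not the tool's actual schema:

```yaml
# Hypothetical config sketch; field names are illustrative.
left:
  path: exports/accounts_old.csv
  key_columns: [account_id]
  compare_columns: [status, balance, owner_email]
right:
  path: exports/accounts_new.csv
  key_columns: [acct_id]
  compare_columns: [state, balance, email]
csv:
  delimiter: ","
output:
  dir: out/
sampling:
  size: 0        # 0 = compare everything
  seed: 42
performance:
  workers: 8
  buckets: 256
```

Checking a file like this into version control next to the validation job is what makes each run reproducible.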
Exact Sampling for Large Validation Runs
Sometimes you do not want to compare every row.
For example, if the source files contain tens of millions of records, you may want to run a fast validation pass against an exact random sample of keys before committing to a full comparison.
csv-stream-diff supports that:
- sampling.size: 0 means compare everything
- sampling.size > 0 means compare an exact random sample of left-side unique keys
- sampling.seed makes the sample reproducible
This gives you a useful middle ground between tiny spot checks and full heavy-load comparisons.
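The "exact sample" part matters: the selection is a fixed-size draw over the unique keys, not a per-row coin flip. A minimal sketch of that idea, assuming a `sample_keys` helper that does not exist under this name in the tool:

```python
import random

def sample_keys(unique_keys, size, seed):
    """Pick an exact random sample of left-side unique keys.

    size == 0 means "compare everything"; a fixed seed makes the
    pick reproducible across runs.
    """
    keys = list(unique_keys)
    if size == 0 or size >= len(keys):
        return set(keys)
    rng = random.Random(seed)
    return set(rng.sample(keys, size))
```

With the same seed, two validation runs compare exactly the same keys, so results are directly comparable between runs.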
Handling Duplicate Keys
Duplicate keys are one of the most annoying edge cases in file comparison work.
If a key appears multiple times, the comparison becomes ambiguous. Instead of failing silently or hiding the problem, the tool reports duplicate keys explicitly and continues using the first occurrence for the main comparison.
That behaviour is deliberate:
- you get a warning
- you get a separate duplicate-key artifact
- you still get a usable comparison result
This makes the tool better suited for messy real-world data.
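The first-occurrence policy described above is straightforward to express. This sketch uses hypothetical names; the point is that duplicates are recorded for reporting rather than silently overwritten:

```python
def index_first_occurrence(rows, key_cols):
    """Index rows by key, keeping the first occurrence and
    recording later duplicates separately for the duplicate-key report."""
    index = {}
    duplicates = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in index:
            duplicates.append((key, row))  # reported, not silently dropped
        else:
            index[key] = row
    return index, duplicates
```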
Keeping Dependencies Small
I wanted the runtime dependency footprint to stay minimal.
The tool is built mostly with the Python standard library. The only runtime dependency is PyYAML, which is used for configuration loading.
That keeps installation simple and reduces operational friction when the tool needs to run in different environments.
Outputs That Are Actually Useful
One important goal was to avoid producing a human-only report.
The tool writes separate output files for each class of result, which makes it easier to automate downstream processing:
- only_in_left.csv
- only_in_right.csv
- differences.csv
- duplicate_keys.csv
- summary.json
The differences.csv file is especially useful because it reports cell-level differences with both the left and right column names and values.
That means you can do more than say "this row changed." You can say exactly how it changed.
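To make that concrete, a differences.csv entry might look like the sketch below. The exact column layout is an assumption, but it illustrates the cell-level shape: one row per changed cell, carrying both sides' column names and values.

```
key,left_column,left_value,right_column,right_value
10042,status,active,state,suspended
10077,owner_email,a@old.example,email,a@new.example
```

A downstream remediation job can consume rows like these directly, without re-deriving what changed.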
Testing the Tool Properly
I also wanted the project to be easy to validate.
So the repository includes:
- unit tests with pytest
- BDD-style acceptance tests with behave
- a fixture generator that creates two baseline-identical CSV files and then introduces controlled differences
The generator makes it easy to create realistic comparison scenarios involving:
- changed values
- left-only rows
- right-only rows
- duplicate keys
That is useful both for development and for demonstrating the tool to others.
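The fixture-generator idea can be sketched like this. The function name, columns, and the specific mutations are illustrative, not the repository's actual generator:

```python
import csv

def write_fixtures(left_path, right_path, n_rows):
    """Generate two baseline-identical CSVs, then introduce
    one controlled instance of each difference class."""
    base = [{"id": str(i), "value": f"v{i}"} for i in range(n_rows)]
    left = [dict(r) for r in base]
    right = [dict(r) for r in base]
    right[1]["value"] = "changed"                  # a changed value
    left.append({"id": "L-only", "value": "x"})    # left-only row
    right.append({"id": "R-only", "value": "y"})   # right-only row
    left.append(dict(left[0]))                     # duplicate key on the left
    for path, rows in ((left_path, left), (right_path, right)):
        with open(path, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=["id", "value"])
            w.writeheader()
            w.writerows(rows)
```

Because every difference is planted deliberately, each run of the comparison tool has a known expected output to assert against.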
A Few Practical Lessons
Building this reinforced a few engineering lessons:
- For large-file tooling, streaming and partitioning beat clever in-memory shortcuts
- Exact sampling is worth implementing properly because it gives a fast validation mode without becoming a toy feature
- Duplicate handling should be explicit, not implicit
- Machine-readable outputs matter as much as console output
- Minimal dependencies make utility tools easier to adopt
Example Usage
With a config file in place, the tool is intentionally simple to run:
csv-stream-diff --config config.yaml
You can also override selected settings from the CLI:
csv-stream-diff --config config.yaml --sample-size 100000 --workers 8
Why I Built It
This project came from a practical need: compare large CSV datasets reliably, with clear outputs, and without depending on heavy frameworks or fragile one-off scripts.
The result is a tool that is meant to be packaged, published, and reused anywhere.
That was the bar from the start.
Closing
If you work with large exports, migration validation, reconciliation jobs, or data quality checks, CSV comparison becomes infrastructure very quickly.
csv-stream-diff is my attempt to make that infrastructure solid:
- reproducible
- scalable
- explicit
- easy to automate
If you want to explore the project, the repository includes the CLI, example configuration, test generator, and packaging setup needed to build and publish it.