Building csv-stream-diff: A Fast, Streaming CSV Comparison Tool for Very Large Files
Once CSV datasets move into the millions of rows, the usual comparison approaches start to fall apart. Loading both files into memory is expensive. Spreadsheet tools stop being useful. Even many ad hoc scripts become slow, fragile, or impossible to run reliably in production-like environments.
That is the problem I wanted to solve with csv-stream-diff.
The Problem
In real systems, CSV comparison is rarely just "diff these two files."
Usually the job looks more like this:
- The files are large enough that a full in-memory load is risky or impossible
- The key columns on the left and right files do not use the same names
- The comparison should only consider a selected subset of columns
- Duplicate keys may exist and need to be reported clearly
- Sometimes a full comparison is required, but sometimes a statistically useful sample is enough
- The output needs to be machine-readable so it can feed downstream validation or remediation workflows
I wanted a tool that could handle that cleanly, with minimal dependencies, and still be easy to package and run anywhere.
What csv-stream-diff Does
csv-stream-diff is a Python CLI tool for comparing very large CSV files using:
- streaming reads
- hash-based partitioning
- multiprocessing
- YAML-driven configuration
It produces structured output files for:
- rows only on the left
- rows only on the right
- rows with cell-level differences
- duplicate keys
- summary metadata
It is designed to be practical rather than clever.
The Core Design
The main design constraint was memory.
If a tool tries to build a single giant in-memory index for both files, it will eventually hit a limit. So instead of comparing the full files at once, csv-stream-diff uses a two-phase approach.
1. Partition both files into hashed buckets
Each row key is normalized and hashed into a bucket. The left and right files are streamed row by row and written into matching bucket files on disk.
That matters because rows with the same normalized key always land in the same bucket. Once that is true, each bucket can be compared independently.
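The bucketing idea can be sketched in a few lines. This is an illustration of the technique rather than the tool's actual code: the function names, the strip-and-lowercase normalization, and the bucket-file naming scheme are all assumptions.

```python
import csv
import hashlib

def bucket_for(key_values, num_buckets):
    """Map a normalized row key to a stable bucket index.

    Normalization here (strip + lowercase) is an assumption;
    the real tool may normalize differently.
    """
    normalized = "\x1f".join(v.strip().lower() for v in key_values)
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def partition(path, key_columns, num_buckets, out_prefix):
    """Stream a CSV row by row and append each row to its bucket file."""
    writers, files = {}, []
    with open(path, newline="") as src:
        reader = csv.DictReader(src)
        for row in reader:
            b = bucket_for([row[c] for c in key_columns], num_buckets)
            if b not in writers:
                f = open(f"{out_prefix}_{b:04d}.csv", "w", newline="")
                files.append(f)
                w = csv.DictWriter(f, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[b] = w
            writers[b].writerow(row)
    for f in files:
        f.close()
```

Because the hash is computed from the normalized key, identical keys from the left and right files always hash to the same bucket index, regardless of which file they came from.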
2. Compare buckets in parallel
After partitioning, the tool compares bucket pairs using multiple worker processes. Each worker only needs to index one bucket of the left file at a time, not the entire dataset.
This keeps memory bounded while still taking advantage of all available CPU cores.
The result is a design that scales much better for heavy loads than a naïve single-process implementation.
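The per-bucket comparison can be sketched as follows. Again, this is a simplified illustration under assumed names: only the left bucket is held in a dictionary, and the right bucket is streamed against it.

```python
import csv
from multiprocessing import Pool

def compare_bucket(args):
    """Compare one left/right bucket pair.

    Indexes only the left bucket in memory, then streams the right
    bucket against it. A sketch of the idea, not the tool's code.
    """
    left_path, right_path, key_cols, cmp_cols = args
    left_index = {}
    with open(left_path, newline="") as f:
        for row in csv.DictReader(f):
            key = tuple(row[c] for c in key_cols)
            left_index.setdefault(key, row)  # first occurrence wins
    only_right, diffs = [], []
    with open(right_path, newline="") as f:
        for row in csv.DictReader(f):
            key = tuple(row[c] for c in key_cols)
            match = left_index.pop(key, None)
            if match is None:
                only_right.append(key)
            else:
                changed = [c for c in cmp_cols if match[c] != row[c]]
                if changed:
                    diffs.append((key, changed))
    only_left = list(left_index)  # keys never matched by the right side
    return only_left, only_right, diffs

def compare_all(bucket_pairs, workers):
    """Fan bucket pairs out across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(compare_bucket, bucket_pairs)
```

Peak memory is bounded by the largest single left bucket, not by the size of either full file, which is what lets the worker count scale with available cores.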
Why YAML Configuration
I did not want the CLI to become a wall of flags.
The comparison usually needs several pieces of information:
- the left and right file paths
- the left and right key columns
- the left and right comparison columns
- CSV dialect options
- output paths
- sampling settings
- performance settings
That is much easier to manage in a YAML file than on the command line.
The CLI still supports overrides, but the YAML file is the primary contract. That makes runs reproducible and easier to version alongside data validation jobs.
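A config file covering those pieces might look something like this. The key names below are illustrative, not the tool's actual schema:

```yaml
# Hypothetical config sketch; field names are illustrative.
left:
  path: exports/accounts_old.csv
  key_columns: [account_id]
  compare_columns: [status, balance, owner_email]
right:
  path: exports/accounts_new.csv
  key_columns: [acct_id]
  compare_columns: [state, balance, email]
csv:
  delimiter: ","
output:
  dir: out/
sampling:
  size: 0        # 0 = compare everything
  seed: 42
performance:
  workers: 8
  buckets: 256
```

Checking a file like this into version control next to the validation job is what makes each run reproducible.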
Exact Sampling for Large Validation Runs
Sometimes you do not want to compare every row.
For example, if the source files contain tens of millions of records, you may want to run a fast validation pass against an exact random sample of keys before committing to a full comparison.
csv-stream-diff supports that:
- sampling.size: 0 means compare everything
- sampling.size > 0 means compare an exact random sample of left-side unique keys
- sampling.seed makes the sample reproducible
This gives you a useful middle ground between tiny spot checks and full heavy-load comparisons.
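The "exact sample" part matters: the selection is a fixed-size draw over the unique keys, not a per-row coin flip. A minimal sketch of that idea, assuming a `sample_keys` helper that does not exist under this name in the tool:

```python
import random

def sample_keys(unique_keys, size, seed):
    """Pick an exact random sample of left-side unique keys.

    size == 0 means "compare everything"; a fixed seed makes the
    pick reproducible across runs.
    """
    keys = list(unique_keys)
    if size == 0 or size >= len(keys):
        return set(keys)
    rng = random.Random(seed)
    return set(rng.sample(keys, size))
```

With the same seed, two validation runs compare exactly the same keys, so results are directly comparable between runs.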
Handling Duplicate Keys
Duplicate keys are one of the most annoying edge cases in file comparison work.
If a key appears multiple times, the comparison becomes ambiguous. Instead of failing silently or hiding the problem, the tool reports duplicate keys explicitly and continues using the first occurrence for the main comparison.
That behaviour is deliberate:
- you get a warning
- you get a separate duplicate-key artifact
- you still get a usable comparison result
This makes the tool better suited for messy real-world data.
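The first-occurrence policy described above is straightforward to express. This sketch uses hypothetical names; the point is that duplicates are recorded for reporting rather than silently overwritten:

```python
def index_first_occurrence(rows, key_cols):
    """Index rows by key, keeping the first occurrence and
    recording later duplicates separately for the duplicate-key report."""
    index = {}
    duplicates = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in index:
            duplicates.append((key, row))  # reported, not silently dropped
        else:
            index[key] = row
    return index, duplicates
```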
Keeping Dependencies Small
I wanted the runtime dependency footprint to stay minimal.
The tool is built mostly with the Python standard library. The only runtime dependency is PyYAML, which is used for configuration loading.
That keeps installation simple and reduces operational friction when the tool needs to run in different environments.
Outputs That Are Actually Useful
One important goal was to avoid producing a human-only report.
The tool writes separate output files for each class of result, which makes it easier to automate downstream processing:
- only_in_left.csv
- only_in_right.csv
- differences.csv
- duplicate_keys.csv
- summary.json
The differences.csv file is especially useful because it reports cell-level differences with both the left and right column names and values.
That means you can do more than say "this row changed." You can say exactly how it changed.
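To make that concrete, a differences.csv entry might look like the sketch below. The exact column layout is an assumption, but it illustrates the cell-level shape: one row per changed cell, carrying both sides' column names and values.

```
key,left_column,left_value,right_column,right_value
10042,status,active,state,suspended
10077,owner_email,a@old.example,email,a@new.example
```

A downstream remediation job can consume rows like these directly, without re-deriving what changed.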
Testing the Tool Properly
I also wanted the project to be easy to validate.
So the repository includes:
- unit tests with pytest
- BDD-style acceptance tests with behave
- a fixture generator that creates two baseline-identical CSV files and then introduces controlled differences
The generator makes it easy to create realistic comparison scenarios involving:
- changed values
- left-only rows
- right-only rows
- duplicate keys
That is useful both for development and for demonstrating the tool to others.
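The fixture-generator idea can be sketched like this. The function name, columns, and the specific mutations are illustrative, not the repository's actual generator:

```python
import csv

def write_fixtures(left_path, right_path, n_rows):
    """Generate two baseline-identical CSVs, then introduce
    one controlled instance of each difference class."""
    base = [{"id": str(i), "value": f"v{i}"} for i in range(n_rows)]
    left = [dict(r) for r in base]
    right = [dict(r) for r in base]
    right[1]["value"] = "changed"                  # a changed value
    left.append({"id": "L-only", "value": "x"})    # left-only row
    right.append({"id": "R-only", "value": "y"})   # right-only row
    left.append(dict(left[0]))                     # duplicate key on the left
    for path, rows in ((left_path, left), (right_path, right)):
        with open(path, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=["id", "value"])
            w.writeheader()
            w.writerows(rows)
```

Because every difference is planted deliberately, each run of the comparison tool has a known expected output to assert against.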
A Few Practical Lessons
Building this reinforced a few engineering lessons:
- For large-file tooling, streaming and partitioning beat clever in-memory shortcuts
- Exact sampling is worth implementing properly because it gives a fast validation mode without becoming a toy feature
- Duplicate handling should be explicit, not implicit
- Machine-readable outputs matter as much as console output
- Minimal dependencies make utility tools easier to adopt
Example Usage
With a config file in place, the tool is intentionally simple to run:
csv-stream-diff --config config.yaml
You can also override selected settings from the CLI:
csv-stream-diff --config config.yaml --sample-size 100000 --workers 8
Why I Built It
This project came from a practical need: compare large CSV datasets reliably, with clear outputs, and without depending on heavy frameworks or fragile one-off scripts.
The result is a tool that is meant to be packaged, published, and reused anywhere.
That was the bar from the start.
Closing
If you work with large exports, migration validation, reconciliation jobs, or data quality checks, CSV comparison becomes infrastructure very quickly.
csv-stream-diff is my attempt to make that infrastructure solid:
- reproducible
- scalable
- explicit
- easy to automate
If you want to explore the project, the repository includes the CLI, example configuration, test generator, and packaging setup needed to build and publish it.