simonw / csv-diff Goto Github PK

Python CLI tool and library for diffing CSV and JSON files

License: Apache License 2.0

Python 99.54% Dockerfile 0.46%

csv click diff git-scraping datasette-io datasette-tool csv-diff tsv-diff

csv-diff's Introduction

csv-diff

Tool for viewing the difference between two CSV, TSV or JSON files. See Generating a commit log for San Francisco’s official list of trees (and the sf-tree-history repo commit log) for background information on this project.

Installation

pip install csv-diff

Usage

Consider two CSV files:

one.csv

id,name,age
1,Cleo,4
2,Pancakes,2

two.csv

id,name,age
1,Cleo,5
3,Bailey,1

csv-diff can show a human-readable summary of differences between the files:

$ csv-diff one.csv two.csv --key=id
1 row changed, 1 row added, 1 row removed

1 row changed

  Row 1
    age: "4" => "5"

1 row added

  id: 3
  name: Bailey
  age: 1

1 row removed

  id: 2
  name: Pancakes
  age: 2

The --key=id option means that the id column should be treated as the unique key, to identify which records have changed.

The tool will automatically detect if your files are comma- or tab-separated. You can over-ride this automatic detection and force the tool to use a specific format using --format=tsv or --format=csv.

You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use --format=json if your input files are JSON.

Use --show-unchanged to include full details of the unchanged values for rows with at least one change in the diff output:

% csv-diff one.csv two.csv --key=id --show-unchanged
1 row changed

  id: 1
    age: "4" => "5"

    Unchanged:
      name: "Cleo"

You can use the --json option to get a machine-readable difference:

$ csv-diff one.csv two.csv --key=id --json
{
    "added": [
        {
            "id": "3",
            "name": "Bailey",
            "age": "1"
        }
    ],
    "removed": [
        {
            "id": "2",
            "name": "Pancakes",
            "age": "2"
        }
    ],
    "changed": [
        {
            "key": "1",
            "changes": {
                "age": [
                    "4",
                    "5"
                ]
            }
        }
    ],
    "columns_added": [],
    "columns_removed": []
}

As a Python library

You can also import the Python library into your own code like so:

from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)

diff will now contain the same data structure as the output in the --json example above.

If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.

As a Docker container

Build the image

$ docker build -t csvdiff .

Run the container

$ docker run --rm -v $(pwd):/files csvdiff

Suppose current directory contains two csv files : one.csv two.csv

$ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv

Alternatives

csvdiff is a "fast diff tool for comparing CSV files" - you may get better results from this than from csv-diff against larger files.

csv-diff's People

Contributors

Stargazers

Watchers

csv-diff's Issues

BUG: new line at end of file causes crash

Having a \n at the end of a .csv or .tsv file shouldn't cause a crash.

Github for example asks for all files to end in an empty line, otherwise it will show a red stop sign.

In any case, if you want to enforce no empty lines at end, then the error should be caught and handled, before telling the user what the problem is.

How to reproduce

printf "a,b,c\n1,2,3\n\n" >a.csv
csv-diff a.csv a.csv --key a

Traceback (most recent call last):
  File "/usr/local/Caskroom/mambaforge/base/bin/csv-diff", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/cli.py", line 60, in cli
    diff = compare(load(previous), load(current), show_unchanged)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/cli.py", line 56, in load
    return load_csv(
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/__init__.py", line 26, in load_csv
    return {keyfn(r): r for r in rows}
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/__init__.py", line 26, in <dictcomp>
    return {keyfn(r): r for r in rows}
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/__init__.py", line 21, in <lambda>
    keyfn = lambda r: r[key]
KeyError: 'a'

The error should be much better, saying which line the error is on, etc.

BUG: Duplicate key causes undefined behaviour, user not warned

When trying to figure out what happens in #30 I discovered that when there are rows with duplicated keys, the behaviour is odd.

It would be good to test for duplicate keys and raise an error or warning (maybe allowing user to pick how to deal with this).

What shouldn't happen is that incomplete rows cause some sort of undefined behaviour as here:

printf "a,b,c,d\n1,2,3,4\n1,2,3\n3,2,3,4" >a.csv
printf "a,b,c,d\n1,2,3,4\n1,2,3,4\n3,2,3,4" >b.csv
csv-diff b.csv a.csv --key a

The output makes no sense:

1 column removed

1 column removed

  d

Release notes for 1.0

I'm going to bump to 1.0. Here are the changes: 0.6...main

Q: What is default of `key`? What happens when no `key` passed?

I read through the README and --help but couldn't figure out what happens when I don't pass a --key.

Does it diff everything?

KeyError in load_csv

Hello,

I am getting a KeyError when comparing two CSV's with the following header rows, using key='''"Event ID"''':

current.csv
"","Event ID","Primary","EVENT DATE","Invoice #","5 Days Notice","Registration Links","Did the vendor provide the required staff level?","Tent, including anchoring device","Tables and chairs","Yoga mats/cots/stretchers/floor mats","Administrative supplies","Traffic control devices, if applicable","Food and fluids","Emergency supply bag (confirm with SO)","Fan, air conditioner","Mobile handwashing station","At least one sign more than 3 ft tall","Did all vendor staff arrive on time?","Was ready prior to start time?","3 EpiPens (at least 1 adult and 1 pediatric dose)","Did start on time?","Onsite entire time","Were there any avoidable patient care errors?","LNO Name","Have vendor representative enter their name","Vendor Comments","Admin Comments/Response"

update.csv
"","Event ID","Primary","EVENT DATE","Invoice #","5 Days Notice","Registration Links","Did the vendor provide the required staff level?","Tent, including anchoring device","Tables and chairs","Yoga mats/cots/stretchers/floor mats","Administrative supplies","Traffic control devices, if applicable","Food and fluids","Emergency supply bag (confirm with SO)","Fan, air conditioner","Mobile handwashing station","At least one sign more than 3 ft tall","Did all vendor staff arrive on time?","Was ready prior to start time?","3 EpiPens (at least 1 adult and 1 pediatric dose)","Did start on time?","Onsite entire time","Were there any avoidable patient care errors?","LNO Name","Have vendor representative enter their name","Vendor Comments","Admin Comments/Response"

As you can see, the 2 header rows contain the same exact columns, and '"Event ID"' is present in both. However, the load_csv function is still returning: KeyError: '"Event ID"'

Any help would be greatly appreciated!

BUG: incomplete last line causes unhandled error that's hard to debug

When I diffed two files I got csv-diff to give me this strange error:

ValueError: too many values to unpack (expected 2)

Turns out that the last line in on of the files being diffed is incomplete, it lacks a number of fields.

However, csv-diff doesn't handle the error as gracefully as it should. The user should be notified instead of getting a traceback.

Reproduce

printf "a,b,c,d\n1,2,3,4\n2,2,3,4\n3,2,3" >a.csv
printf "a,b,c,d\n1,2,3,4\n2,2,3,4\n3,2,3,4" >b.csv
csv-diff a.csv b.csv --key a

Traceback (most recent call last):
  File "/usr/local/Caskroom/mambaforge/base/bin/csv-diff", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/cli.py", line 60, in cli
    diff = compare(load(previous), load(current), show_unchanged)
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/__init__.py", line 86, in compare
    "changes": {
  File "/usr/local/Caskroom/mambaforge/base/lib/python3.9/site-packages/csv_diff/__init__.py", line 91, in <dictcomp>
    for _, field, (prev_value, current_value) in diffs
ValueError: not enough values to unpack (expected 2, got 1)

Support JSON inputs too

I find myself wanting to generate commit messages for JSON files.

This can reuse almost all of the logic from csv-diff - so for the moment I'm going to add that ability here rather than spin up a brand new tool.

A rename if this tool can come later!

Option to also print all fields in changed rows

When visually scanning the output, it would be nice to also have the un-changed fields. Often, critical context is provided by those fields.

Support for empty CSV files

Thanks for this lib!

I've noticed that the compare method doesn't work if one of the csv files is empty.

Steps to reproduce

a.csv

id,foo,bar

b.csv

id,foo,bar
1,hey,ho
2,lets,go

from csv_diff import compare
from csv_diff import load_csv

diff: dict = compare(
    load_csv(open("a.csv"), key="id"),
    load_csv(open("b.csv"), key="id"),
)

for added_row in added:
     print(added_row)

Expected result

{"id": 1, "foo": "hey", "bar": "ho"}
{"id": 2, "foo": "lets", "bar": "go"}

Actual result

Traceback (most recent call last):
  File "REDACTED.py", line 165, in <module>
    diff: dict = compare(
  File "REDACTED/python3.9/site-packages/
csv_diff/__init__.py", line 58, in compare
    previous_columns = set(next(iter(previous.values())).keys())
StopIteration

I'm using Python 3.9.2.

ENH: Add csv-diff and tsv-diff to topics so that it's shown via google

I searched for tsv-diff via Google and was led to this page where your tool doesn't actually show up. Maybe add csv-diff and tsv-diff as a topic to the repo metadata so it's more findable?

https://github.com/topics/tsv-diff

Allow all kind of separator for csv files for example ";"

right there is only option if you want to use , separated csv or tab separated csv. what if someone has separator as ; or : something else, please accept one cli arg for this as sep where user can choose separator!

Composite key

This looks like it will be really useful, thanks.

When we were working on CSV Schema Language we found it necessary to allow uniqueness to be defined over a composite set of columns (the unique column rule in the schema). I can see from the code structure that this wouldn't necessarily be entirely straightforward here, but I think it would be useful.

I store the csv of two tables with multiple primary keys. How do I pass in multiple values in the key of load_csv?

support for semicolon separated csv

csv-diff/csv_diff/__init__.py

Line 13 in 0256e49

dialect = csv.Sniffer().sniff(peek, delimiters=",\t")

Can you please add ";" in the list of supported delimiters to support CSV generated in countries that uses the "," as decimal point, like Italy?

the line can be modified in this way:

dialect = csv.Sniffer().sniff(peek, delimiters=",\t;")

error when no key precised

the tool works well when we have a id precised by user as the key. However I think when the user doesn't use the '--key' option, it's not working as expected.

for example:
----1.csv----
"""id,name,age
1,Cleo,5
2,Pancakes,2"""

----2.csv----
"""id,name,age
1,Cleo,5
2,Pancakes,3"""

and we launch with no key option :
csv-diff 1.csv 2.csv
we will get:
"2 rows added, 2 rows removed

2 rows added

id: 1
name: Cleo
age: 5

id: 2
name: Pancakes
age: 2

2 rows removed

id: 1
name: Cleo
age: 5

id: 2
name: Pancakes
age: 4

while i expect this:
"
1 row added, 1 row removed
1 row added:
id:2
name: Pancakes
age: 2
1 row removed:
id: 2
name: Pancakes
age: 4

the problem is caused because that the hash of the line is stored as a hash object, and even two lines are the same content, the memo locations are different. So they are different objects.

I have corrected this feature, if you like, I can upload it through a pull request.

Handle case where new keys are added to JSON objects (--format json)

Right now if the first JSON object is missing keys that are present in the second JSON object the script crashes.

KeyError executing in Google Colab

!csv-diff /content/antecedents.tsv /content/base/antecedents.tsv --key=id

Traceback (most recent call last):
File "/usr/local/bin/csv-diff", line 8, in
sys.exit(cli())
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/csv_diff/cli.py", line 60, in cli
diff = compare(load(previous), load(current), show_unchanged)
File "/usr/local/lib/python3.7/dist-packages/csv_diff/cli.py", line 57, in load
open(filename, newline=""), key=key, dialect=dialect.get(format)
File "/usr/local/lib/python3.7/dist-packages/csv_diff/init.py", line 26, in load_csv
return {keyfn(r): r for r in rows}
File "/usr/local/lib/python3.7/dist-packages/csv_diff/init.py", line 26, in
return {keyfn(r): r for r in rows}
File "/usr/local/lib/python3.7/dist-packages/csv_diff/init.py", line 21, in
keyfn = lambda r: r[key]
KeyError: 'id'

ERROR: CSV parse error

I'm trying to diff two CSV files and csv-diff just responds with:

ERROR: CSV parse error on line 2

So I do the same things using it as a python package (that is I write a Python script that loads my two files and runs csv--diff on them as per the README) and I get a different error:

KeyError: 'my_key'

Double check the key and it is there, as column 1 in the files which load fine in LibreOffice Calc and in Excel and look fine in a text editor.

So I look at the the file encoding and Python's magic library tells me:

'UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators'

so if I open the file with an encoding of "utf-8-sig" all works fine.

Seems to me, to be a file encoding issue, and one I have encountered in Python a lot so I wrote this:

def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python. Alas.
    
    A quick summary:
    
    1. I come across CSV files with a UTF-8 or UTF-16 encoding regularly enough.
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is an optional and irrelevant to UTF-8 field
    5. In fact Unicode standards recommend against including a BOM with UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python assumes it's not there
    7. Some CSV sources though write with a BOM
    8. The encoding must therefore be specified as:
        utf-16     for UTF-16 files
        utf-8       for UTF-8 files with no BOM
        utf-8-sig for UTF files with a BOM 
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-18
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.
    
    :param filepath: The path to a file
    '''
    m = magic.from_file(filepath)
    utf16 = m.find("UTF-16")>=0
    utf8 = m.find("UTF-8")>=0
    bom = m.find("(with BOM)")>=0
    
    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"

and then if I run:

with open(File1, "r", encoding=file_encoding(File1), newline='') as f1:
    csv1 = load_csv(f1, key=key)
    
with open(File2, "r", encoding=file_encoding(File2), newline='') as f2:
    csv2 = load_csv(f2, key=key)

diff = compare(csv1, csv2)

all is good and I get a reliable diff.

I can't work out how to debug the CLI interface in PyDev alas. I'm a tad green in this space it seems. But setup.py build just creates a build folder with a lib folder with __init__.py and cli.py in it. Yet my Windows box (man I hate Windows but I'm stuck there right now) runs a csvdiff.exe which was presumably installed by pip when I installed csv-diff (pip install csv-diff). But I can't see how to run the CLI interface from the source. Guess I could do some reading on click and setup-tools, but hey for the moment, I have it working via its Python package interface and can run with that.

If the CLI error is in fact related to this encoding issue (hard to know for sure), then it could of course be fixed by including an encoding check as above and opening the files with their appropriate encoding. Frankly it'd be nice if python's open() could better guess the encoding (the way magic can).

Support using row number as a key

load_csv functions nicely without a key, using a hash of the row data.

While nice, a useful alternative that also resolves uniquely but conveys useful information is the row number. Would suggest supporting a keyword for row number as the simplest approach to implementing this. The keyword may of course be a header and if so, dropping back to using the value in the column is sound.

A candidate key might be "row" or "row_number"

Error when column name contains `.` (dot)

$csv-diff --version
csv-diff, version 0.6

Consider the following example, which works correctly:

import csv_diff
from io import StringIO
csv_diff.compare(csv_diff.load_csv(StringIO("id,a,b,c c,d\n0,2,3,4,5"), key="id"),
                 csv_diff.load_csv(StringIO("id,a,b,c c,d\n0,2,4,5,5"), key="id"))

Output:

{'added': [],
 'removed': [],
 'changed': [{'key': '0', 'changes': {'b': ['3', '4'], 'c c': ['4', '5']}}],
 'columns_added': [],
 'columns_removed': []}

If I add a . inside the column name of "c c" -> "c. c" it breaks

import csv_diff
from io import StringIO
csv_diff.compare(csv_diff.load_csv(StringIO("id,a,b,c. c,d\n0,2,3,4,5"), key="id"),
                 csv_diff.load_csv(StringIO("id,a,b,c. c,d\n0,2,4,5,5"), key="id"))