Git Product home page Git Product logo

g2audit's Introduction

g2audit

Overview

The G2Audit.py utility compares two entity resolution result sets and computes the precision, recall and F1 scores between them. It can be used to compare different runs to a truth set to determine which one is best or to determine the full effect a configuration change had gainst prior run of the same data. There are many articles that describe this including:

This project is designed to be used along with:

Usage:

python3 G2Audit.py --help
usage: G2Audit.py [-h] [-n NEWERFILE] [-p PRIORFILE] [-o OUTPUTROOT] [-D]

optional arguments:
  -h, --help            show this help message and exit
  -n NEWERFILE, --newer_csv_file NEWERFILE
                        the latest entity map file
  -p PRIORFILE, --prior_csv_file PRIORFILE
                        the prior entity map file
  -o OUTPUTROOT, --output_file_root OUTPUTROOT
                        the ouputfile root name (both a .csv and a .json file
                        will be created)
  -D, --debug           print debug statements

Contents

  1. Prerequisites
  2. Installation
  3. Typical use
  4. Output files

Prerequisites

  • Python 3.6 or higher

Plenty of RAM! This process runs very fast as it loads each data set into memory. This is not a problem if your control or truth set is under a million records. But if you get into the 10s or 100s of million records, you will need to run this on a computer with enough RAM to load both sets into memory at the same time.

Installation

  1. Place the the following file in a directory of your choice:

Typical use

For comparing to a truthset to find the best result

python3 G2Audit.py -n /path/to/candidate1-result.csv -p /path/to/truthset.csv -o /path/to/audit1-result

python3 G2Audit.py -n /path/to/candidate2-result.csv -p /path/to/truthset.csv -o /path/to/audit2-result

You will find the precision, recall and F1 scores in the audit1-result1.json and audit-result2.json files along with examples of entities that have been split or merged.

For analyzing the full effect of a configuration change

python3 G2Audit.py -n /path/to/v1-config-test1-result.csv -p /path/to/v2-config-test1-result.csv -o /path/to/audit-result-cfg2-cfg1

This would determine the effect the version 2 config changes had on the test1 data set compared to the version 1 config result on the same data.

Configuration updates are usually made to reduce false positives or negatives on specific examples reported by users. Performing this kind of an audit can help ensure their examples were corrected without drastically affecting the overall precision and recall scores.

Output files

json statistics file

Alt text

csv statistics file

Alt text

  • AUDIT_ID groups the records involved in a split or merged entity.
  • AUDIT_CATEGORY is either "SPLIT" or "MERGED" or "SPLIT+MERGED" and applies to the group so is the same for every record in the AUDIT_ID.
  • AUDIT_RESULT is either "new_positive", "new_negative", or "same" and shows that particular record's role in the audit result.
  • DATA_SOURCE and RECORD_ID indicate the specific record.
  • PRIOR_ID indicates the unique entity or cluster ID in the prior or truth set.
  • PRIOR_SCORE indicates the reported score for this record in relation to the prior entity reported by the prior or truth set.
  • NEWER_ID indicates the unique entity or cluster ID in the newer or candidate result set.
  • NEWER_SCORE indicates the reported score for this record in relation to the newer entity reported by the newer or candidate result set.

The scores are usually only provided on data run through the Senzing software which even provides scores of records that were not matched. For instance in the screen shot above, line 8 shows that even though the entity was split, there was still a relationship created on name and date of birth.

g2audit's People

Contributors

docktermj avatar jbutcher21 avatar kernelsam avatar dependabot[bot] avatar sammacy avatar jaeadams avatar antaenc avatar

Watchers

James Cloos avatar  avatar  avatar Barry M. Caceres avatar  avatar

Forkers

terrorizer1980

g2audit's Issues

Improve performance on disparate data sets

Audit was very slow when auditing a small data set against a larger data set.
There should be an option for the tool to ignore records in the larger data set that are not in the smaller data set.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.