


splink: Probabilistic record linkage and deduplication at scale

splink implements the canonical Fellegi-Sunter model of record linkage in Apache Spark, including the expectation-maximisation (EM) algorithm used to estimate the parameters of the model.
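To make the idea of estimating model parameters with EM concrete, here is a minimal pure-Python sketch. It is not splink's code and does not use Spark: it assumes binary agree/disagree comparisons and conditional independence between fields, and the function names (`simulate_comparisons`, `em_estimate`) and all numeric values are hypothetical, chosen only for illustration.

```python
import random


def simulate_comparisons(n_pairs, match_rate, m_true, u_true, seed=0):
    """Draw binary comparison vectors (1 = field agrees) from known parameters."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_pairs):
        is_match = rng.random() < match_rate
        probs = m_true if is_match else u_true
        vectors.append([1 if rng.random() < p else 0 for p in probs])
    return vectors


def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(p, eps), 1.0 - eps)


def em_estimate(vectors, n_iter=100, match_rate=0.5, m_start=0.9, u_start=0.1):
    """Estimate the match rate and per-field m and u probabilities with EM."""
    n, k = len(vectors), len(vectors[0])
    lam = match_rate
    m = [m_start] * k
    u = [u_start] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match.
        post = []
        for g in vectors:
            p_match, p_non = lam, 1.0 - lam
            for j in range(k):
                p_match *= m[j] if g[j] else 1.0 - m[j]
                p_non *= u[j] if g[j] else 1.0 - u[j]
            post.append(p_match / (p_match + p_non))
        # M-step: re-estimate parameters as posterior-weighted averages.
        total = sum(post)
        lam = clamp(total / n)
        m = [clamp(sum(w * g[j] for w, g in zip(post, vectors)) / total)
             for j in range(k)]
        u = [clamp(sum((1.0 - w) * g[j] for w, g in zip(post, vectors)) / (n - total))
             for j in range(k)]
    return lam, m, u


if __name__ == "__main__":
    # Hypothetical "true" parameters used only to generate toy data.
    data = simulate_comparisons(2000, 0.2, [0.95, 0.9, 0.9, 0.85], [0.05, 0.1, 0.1, 0.2])
    lam, m, u = em_estimate(data)
    print("estimated match rate:", round(lam, 3))
    print("estimated m:", [round(x, 3) for x in m])
    print("estimated u:", [round(x, 3) for x in u])
```

Starting with m above u keeps the first latent class labelled as "match". A production implementation would also need a convergence check rather than a fixed number of iterations.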

The aims of splink are to:

  • Work at much greater scale than current open source implementations (100 million records +).

  • Get results faster than current open source implementations, with runtimes of less than an hour.

  • Have a highly transparent methodology, so the match scores can be easily explained both graphically and in words.

  • Have accuracy similar to some of the best alternatives.

Installation

splink is a Python package. It uses the Spark Python API to execute data linking jobs in a Spark cluster. It has been tested in Apache Spark 2.3 and 2.4.

Install splink using pip:

pip install splink
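After installing, a quick way to confirm which versions are visible to your Python environment is the standard library's `importlib.metadata` (Python 3.8+). This is only a sanity-check sketch; the helper name `installed_version` is hypothetical, and the script makes no assumption about which packages pip installs alongside splink.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on splink itself and on the Spark Python API it relies on.
for name in ("splink", "pyspark"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```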

Interactive demo

You can run demos of splink in an interactive Jupyter notebook on Binder.

Documentation

The best documentation is currently a series of demonstration notebooks in the splink_demos repo.

We also provide an interactive splink settings editor and example settings here. A tool to generate custom m and u probabilities can be found here.
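As a rough illustration of what m and u probabilities do, the sketch below turns them into match weights and a match probability. In the Fellegi-Sunter model, m is the probability that a field agrees given the records are a true match, and u is the probability that it agrees given they are not. The code is plain Python, not splink's API; the function names, field names, and numbers are hypothetical, and it assumes binary comparisons that are conditionally independent.

```python
import math


def match_weight(m, u):
    """log2 Bayes factor contributed by agreement on one field."""
    return math.log2(m / u)


def match_probability(prior, fields):
    """Combine a prior match probability with per-field evidence.

    fields is a list of (name, m, u, agrees) tuples.
    """
    log_odds = math.log2(prior / (1.0 - prior))
    for _name, m, u, agrees in fields:
        if agrees:
            log_odds += math.log2(m / u)
        else:
            log_odds += math.log2((1.0 - m) / (1.0 - u))
    odds = 2.0 ** log_odds
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical parameters for a single pair of records.
    evidence = [
        ("surname", 0.9, 0.01, True),
        ("date_of_birth", 0.95, 0.005, True),
        ("postcode", 0.8, 0.02, False),
    ]
    for name, m, u, agrees in evidence:
        weight = match_weight(m, u) if agrees else math.log2((1 - m) / (1 - u))
        print(f"{name}: {'agrees' if agrees else 'disagrees'}, weight {weight:+.2f}")
    print("match probability:", match_probability(0.001, evidence))
```

Because the weights add on the log-odds scale, each field's contribution can be read off separately, which is what makes a score straightforward to explain in words.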

The statistical model behind splink is the same as that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. This is the best place to start for users wanting to understand the theory behind splink.

You can read a short blog post about splink here.

Videos

You can find a short video introducing splink and running through an introductory demo here.

A 'best practices and performance tuning' tutorial can be found here.

Acknowledgements

We are very grateful to ADR UK (Administrative Data Research UK) for providing funding for this work as part of the Data First project.

We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work.

splink's People

Contributors

mamonu, robinl, samnlindsay
