Git Product home page Git Product logo

blocking's Introduction

R-CMD-check test-coverage Codecov test coverage

Overview

Warning!

The package is under heavily development so the API as well as functionalities may change.

Description

This R package is designed to block records for data deduplication and record linkage (also known as entity resolution) using approximate nearest neighbours algorithms (ANN) and graphs (via the igraph package).

It supports the following R packages that bind to specific ANN algorithms:

  • rnndescent (default, very powerful, supports sparse matrices),
  • RcppHNSW (powerful but does not support sparse matrices),
  • RcppAnnoy,
  • mlpack (see mlpack::lsh and mlpack::knn).

The package can be used with the reclin2 package via the blocking::pair_ann function.

Funding

Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.

Installation

Install the GitHub blocking package with:

# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")

Basic usage

Load packages for the examples

library(blocking)
library(reclin2)
#> Loading required package: data.table

Generate simple data with three groups (df_example) and reference data (df_base).

df_example <- data.frame(txt = c(
  "jankowalski",
  "kowalskijan",
  "kowalskimjan",
  "kowaljan",
  "montypython",
  "pythonmonty",
  "cyrkmontypython",
  "monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))

df_example
#>               txt
#> 1     jankowalski
#> 2     kowalskijan
#> 3    kowalskimjan
#> 4        kowaljan
#> 5     montypython
#> 6     pythonmonty
#> 7 cyrkmontypython
#> 8           monty

df_base
#>           txt
#> 1 montypython
#> 2 kowalskijan
#> 3       other

Deduplication using the blocking function. Output contains information:

  • the method used (where nnd which refers to the NN descent algorithm),
  • number of blocks created (here 2 blocks),
  • number of columns used for blocking, i.e. how many shingles were created by text2vec package (here 28),
  • reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.5714 which means blocking reduces comparison by over 57%).
blocking_result <- blocking(x = df_example$txt)
## data frame with indices and block 
blocking_result
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4 
#> 2

Table with blocking results contains:

  • row numbers from the original data,
  • block number (integers),
  • distance (from the ANN algorithm).
blocking_result$result
#>        x     y block       dist
#>    <int> <int> <num>      <num>
#> 1:     1     2     1 0.10000002
#> 2:     2     3     1 0.14188367
#> 3:     2     4     1 0.28286284
#> 4:     5     6     2 0.08333331
#> 5:     5     7     2 0.13397455
#> 6:     5     8     2 0.27831215

Deduplication using the pair_ann function for integration with the reclin2 package. Use the pipeline with the reclin2 package.

pair_ann(x = df_example, on = "txt") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 8 pairs
#> 
#> Key: <.y>
#>       .y    .x       txt.x           txt.y
#>    <int> <int>      <char>          <char>
#> 1:     2     1 jankowalski     kowalskijan
#> 2:     3     1 jankowalski    kowalskimjan
#> 3:     3     2 kowalskijan    kowalskimjan
#> 4:     4     1 jankowalski        kowaljan
#> 5:     4     2 kowalskijan        kowaljan
#> 6:     6     5 montypython     pythonmonty
#> 7:     7     5 montypython cyrkmontypython
#> 8:     8     5 montypython           monty

Linking records using the same function where df_base is the “register” and df_example is the reference (data).

pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 8 pairs
#> 
#> Key: <.y>
#>       .y    .x       txt.x           txt.y
#>    <int> <int>      <char>          <char>
#> 1:     1     2 kowalskijan     jankowalski
#> 2:     2     2 kowalskijan     kowalskijan
#> 3:     3     2 kowalskijan    kowalskimjan
#> 4:     4     2 kowalskijan        kowaljan
#> 5:     5     1 montypython     montypython
#> 6:     6     1 montypython     pythonmonty
#> 7:     7     1 montypython cyrkmontypython
#> 8:     8     1 montypython           monty

See also

See section Data Integration (Statistical Matching and Record Linkage) in the Official Statistics Task View.

Packages that allow blocking:

  • klsh – k-means locality sensitive hashing,
  • reclin2pair_blocking, pari_minsim functions,
  • fastLinkblockData function.

Other:

  • clevr – evaluation of clustering, helper functions.
  • exchanger – bayesian Entity Resolution with Exchangeable Random Partition Priors

blocking's People

Contributors

berenz avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

blocking's Issues

Release 0.2.0

Plans:

  • Support for rnndescent as it will be shipped to CRAN.

Full `rnndescent` support

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `RcppAnnoy` support

Support for RcppAnnoy in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

bug when `true_blocks` are provided

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

testing <- blocking(x = df_example$txt,
                    deduplication = T)

testing2 <- blocking(x = df_example$txt,
                    deduplication = T,
                    true_blocks = testing$result[c(1,4,6), .(x,y,block)])

Error message

Error in modularity.igraph(graph, membership) : 
  At vendor/cigraph/src/community/modularity.c:132 : Membership vector size differs from number of vertices. Invalid value
> traceback()
4: modularity.igraph(graph, membership)
3: modularity(graph, membership)
2: igraph::make_clusters(eval_g1, membership = eval_blocks$block.x)
1: blocking(x = df_example$txt, deduplication = T, true_blocks = testing$result[c(1, 
       4, 6), .(x, y, block)])

error in the `confusion` construction

Just an issue to remind that there is some error construction of confusion matrix

> confusion
          same_truth
same_block   FALSE    TRUE
     FALSE 1926684      11
     TRUE       11     960

cell (1, 2) is exactly the same as (2,1).

Full `RcppHNSW` support

Support for RcppHNSW in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

blocking by blocking variables

Allow user to specify block vector before ANN blocking. For instance, user may want to block records by gender / letter before applying ANN blocking.

remove duplicated pairs from the result

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))

result$result

The result contains pairs that refer to the same records (i.e. 1 - 2, 2-1)

> result$result
       x     y block       dist
   <int> <int> <num>      <num>
1:     1     2     1 0.09999990
2:     2     1     1 0.09999990
3:     2     3     1 0.14188361
4:     2     4     1 0.28286278
5:     5     6     2 0.08333343
6:     5     7     2 0.13397467
7:     6     5     2 0.08333343
8:     6     8     2 0.27831221

Improvement of performance

Ideas for improving performance:

  • if a large dataset is present for index then index should be created iteratively as converting sparse to dense matrix is a bottleneck
  • if a large query data is present the same procedure should be applied.

Consider using:

  • sparse matrix (Matrix) as an input for x, y
  • bigmemory::big.matrix as an input for x, y

Full `mlpack` support

Support for mlpack in:

  1. lsh functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index
  1. knn functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

`pair_ann` does not work with `data.table`

> pair_ann(x = df_example, on = "txt")
  First data set:  8 records
  Second data set: 8 records
  Total number of pairs: 10 pairs
  Blocking on: 'txt'

       .x    .y block
    <int> <int> <num>
 1:     1     2     1
 2:     1     3     1
 3:     1     4     1
 4:     2     3     1
 5:     2     4     1
 6:     5     6     2
 7:     5     7     2
 8:     5     8     2
 9:     6     7     2
10:     6     8     2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.

add quality metrics

Quality metrics about blocking. This would require specifying new argument: true_block

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.