
anonlink's Introduction


A Python (and optimised C++) implementation of anonymous linkage using cryptographic linkage keys as described by Rainer Schnell, Tobias Bachteler, and Jörg Reiher in A Novel Error-Tolerant Anonymous Linking Code.

anonlink computes similarity scores and/or best-guess matches between sets of cryptographic linkage keys (hashed entity records).

Use clkhash to create cryptographic linkage keys from personally identifiable data.
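The underlying comparison is the Sorensen-Dice coefficient over bit vectors. A minimal pure-Python sketch (for illustration only; anonlink's real implementation is the optimised C++ code):

```python
def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sorensen-Dice similarity of two equal-length bit vectors."""
    a = int.from_bytes(clk_a, "big")
    b = int.from_bytes(clk_b, "big")
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        return 0.0
    # Twice the number of shared set bits, over the total set bits.
    return 2 * bin(a & b).count("1") / total
```

Identical keys score 1.0, disjoint keys score 0.0, and partial overlap falls in between.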

Installation

Install a precompiled wheel from PyPI:

pip install anonlink

Or (if your system has a C++ compiler) you can locally install from source:

pip install -r requirements.txt
pip install -e .

Benchmark

You can run the benchmark with:

$ python -m anonlink.benchmark
Anonlink benchmark -- see README for explanation
------------------------------------------------

Threshold: 0.5, All results returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  (50.73%) |  0.762  (49.2% / 50.8%) |     2.669
  2000 |   2000 |    4e6  (51.04%) |  3.696  (42.6% / 57.4%) |     2.540
  3000 |   3000 |    9e6  (50.25%) |  8.121  (43.5% / 56.5%) |     2.548
  4000 |   4000 |   16e6  (50.71%) | 15.560  (41.1% / 58.9%) |     2.504

Threshold: 0.5, Top 100 matches per record returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 6.86%) |  0.170  (85.9% / 14.1%) |     6.846
  2000 |   2000 |    4e6  ( 3.22%) |  0.384  (82.9% / 17.1%) |    12.561
  3000 |   3000 |    9e6  ( 2.09%) |  0.612  (81.6% / 18.4%) |    18.016
  4000 |   4000 |   16e6  ( 1.52%) |  0.919  (78.7% / 21.3%) |    22.135
  5000 |   5000 |   25e6  ( 1.18%) |  1.163  (80.8% / 19.2%) |    26.592
  6000 |   6000 |   36e6  ( 0.97%) |  1.535  (75.4% / 24.6%) |    31.113
  7000 |   7000 |   49e6  ( 0.82%) |  1.791  (80.6% / 19.4%) |    33.951
  8000 |   8000 |   64e6  ( 0.71%) |  2.095  (81.5% / 18.5%) |    37.466
  9000 |   9000 |   81e6  ( 0.63%) |  2.766  (72.5% / 27.5%) |    40.389
 10000 |  10000 |  100e6  ( 0.56%) |  2.765  (81.7% / 18.3%) |    44.277
 20000 |  20000 |  400e6  ( 0.27%) |  7.062  (86.2% / 13.8%) |    65.711

Threshold: 0.7, All results returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   113.109
  2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.7% /  1.3%) |   124.076
  3000 |   3000 |    9e6  ( 0.01%) |  0.071  (99.1% /  0.9%) |   128.515
  4000 |   4000 |   16e6  ( 0.01%) |  0.123  (99.0% /  1.0%) |   131.654
  5000 |   5000 |   25e6  ( 0.01%) |  0.202  (99.1% /  0.9%) |   124.999
  6000 |   6000 |   36e6  ( 0.01%) |  0.277  (99.0% /  1.0%) |   131.403
  7000 |   7000 |   49e6  ( 0.01%) |  0.368  (98.9% /  1.1%) |   134.428
  8000 |   8000 |   64e6  ( 0.01%) |  0.490  (99.0% /  1.0%) |   131.891
  9000 |   9000 |   81e6  ( 0.01%) |  0.608  (99.0% /  1.0%) |   134.564
 10000 |  10000 |  100e6  ( 0.01%) |  0.753  (99.0% /  1.0%) |   134.105
 20000 |  20000 |  400e6  ( 0.01%) |  2.905  (98.8% /  1.2%) |   139.294

Threshold: 0.7, Top 100 matches per record returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   111.640
  2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.6% /  1.4%) |   122.060
  3000 |   3000 |    9e6  ( 0.01%) |  0.074  (99.1% /  0.9%) |   123.237
  4000 |   4000 |   16e6  ( 0.01%) |  0.124  (99.0% /  1.0%) |   130.204
  5000 |   5000 |   25e6  ( 0.01%) |  0.208  (99.1% /  0.9%) |   121.351
  6000 |   6000 |   36e6  ( 0.01%) |  0.275  (99.0% /  1.0%) |   132.186
  7000 |   7000 |   49e6  ( 0.01%) |  0.373  (99.0% /  1.0%) |   132.650
  8000 |   8000 |   64e6  ( 0.01%) |  0.496  (99.1% /  0.9%) |   130.125
  9000 |   9000 |   81e6  ( 0.01%) |  0.614  (99.0% /  1.0%) |   133.216
 10000 |  10000 |  100e6  ( 0.01%) |  0.775  (99.1% /  0.9%) |   130.230
 20000 |  20000 |  400e6  ( 0.01%) |  2.939  (98.9% /  1.1%) |   137.574

The tables are interpreted as follows. Each table measures the throughput of the Dice coefficient comparison function. The four tables correspond to the four combinations of two "matching threshold" values and two "result limiting" settings.

These parameters have been chosen to characterise two different performance scenarios. Since the data used for comparisons is randomly generated, the first threshold value (0.5) causes about 50% of the candidates to "match", while the second threshold value (0.7) causes only ~0.01% of the candidates to match (these values are reported in the "match %" column). Where the table heading includes "All results returned", all matches above the threshold are returned and passed to the solver. With the threshold of 0.5, the large number of matches means that much of the time is spent keeping the candidates in order. The next table limits the results to the top 100 matches per record, which must still be above the threshold.

In the final two tables, the threshold value of 0.7 very effectively filters down the number of candidate matches. Here the throughput is determined primarily by the comparison code itself; adding the top-100 filter has no major impact.

Finally, the Total Time column indicates the proportion of time spent calculating the (sparse) similarity matrix comparisons and the proportion spent matching in the greedy solver. The latter is determined by the size of the similarity matrix, which will be approximately #comparisons * match% / 100.
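That estimate can be made concrete with a hypothetical helper that simply restates the formula:

```python
def expected_entries(size1: int, size2: int, match_pct: float) -> float:
    """Approximate number of entries in the sparse similarity matrix."""
    return size1 * size2 * match_pct / 100

# e.g. the 5000 x 5000 run at threshold 0.5 with 1.18% matching
# leaves roughly 295,000 candidate pairs for the greedy solver.
```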

Tests

Run unit tests with `pytest`:

$ pytest
====================================== test session starts ======================================
platform linux -- Python 3.6.4, pytest-3.2.5, py-1.4.34, pluggy-0.4.0
rootdir: /home/hlaw/src/n1-anonlink, inifile:
collected 71 items

tests/test_benchmark.py ...
tests/test_bloommatcher.py ..............
tests/test_e2e.py .............ss....
tests/test_matcher.py ..x.....x......x....x..
tests/test_similarity.py .........
tests/test_util.py ...

======================== 65 passed, 2 skipped, 4 xfailed in 4.01 seconds ========================

To enable slightly larger tests, set the following environment variables:

  • INCLUDE_10K
  • INCLUDE_100K

Limitations

Discussion

If you run into bugs, you can file them in our issue tracker on GitHub.

There is also an anonlink mailing list for development discussion and release announcements.

Wherever we interact, we strive to follow the Python Community Code of Conduct.

Citing

Anonlink is designed, developed and supported by CSIRO's Data61. If you use any part of this library in your research, please cite it using the following BibTex entry:

@misc{Anonlink,
  author = {CSIRO's Data61},
  title = {Anonlink Private Record Linkage System},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/data61/anonlink}},
}

License

Copyright 2017 CSIRO (Data61)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

anonlink's People

Contributors

dependabot-preview[bot], dependabot[bot], djrobstep, gusmith, hardbyte, joyceyuu, nbgl, programmedbyjake, sjhardy, wilko77


anonlink's Issues

[CLOSED] anonlink tests are broken for Python 2

Issue by tho802
Wednesday Jul 13, 2016 at 01:15 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/13


Oops, one of my cffi updates has broken support for Python 2:

    return cffi_filter_similarity_k(filters1, filters2, k=k, threshold=threshold)
  File "/home/brian/dev/N1/AnonymousLinking/anonlink/entitymatch.py", line 44, in cffi_filter_similarity_k
    bytes([b for f in filters2 for b in f[0].tobytes()]))
IndexError: initializer str is too long for 'char[12800]' (got 87821 characters)

[CLOSED] [WIP] Pure Python Implementation of multibit trees

Issue by tho802
Friday Aug 05, 2016 at 01:42 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/pull/14


Follows the paper "Similarity Filtering with Multibit Trees for Record Linkage" with the following differences:

  • Uses one multibit tree per CLK popcount. E.g., every CLK with 512 bits set will be in one tree. The paper says they get quicker results using one multibit tree for all CLKs; however, it wasn't clear to me how to modify the calculation of the maximum bound on the Jaccard similarity, which is how you determine whether to terminate the search (Equation 5 in the paper). Perhaps @har991 you could point me in the right direction?
  • For the final similarity computation I reused our reasonably efficient cffi_filter_similarity_k function. This means in the third search phase instead of using jaccard similarity we return the top k matches using the Sorensen Dice index.
  • The paper uses ordered q-grams. I think this is entirely independent of the multibit tree data structure, so I will add an issue to implement that.

Not ready for merge yet. Just want to solicit feedback/help

Once it is doing the right thing I'll look at performance.


tho802 included the following code: https://github.csiro.au/magic/AnonymousLinking/pull/14/commits

Tests shouldn't be pinned to specific machines

The tests are pinned to specific machines in the jenkinsfile:

def configs = [
    [label: 'GPU 1', pythons: ['python3.4', 'python3.5', 'python3.6'], compilers: ['clang', 'gcc']],
    [label: 'McNode', pythons: ['python3.5'], compilers: ['clang', 'gcc']]
]

This should be replaced with more generic labels, e.g. osx for McNode.

The problem is that if one of those nodes goes down, the tests will fail, which renders the redundancy in the Jenkins nodes useless.

Avoid copying data back and forth between the Python runtime and the C++ library

According to the benchmark, copying to and from the C++ library currently takes 99% of the time.
Edit: Not true. Copying does take a lot of time in some places (see #79), but the copying mentioned below is not "99%" bad, more like "5%" bad.
Really the C++ library should just be passed an address to the memory it needs to look at directly, making the copy redundant. This may take advantage of a cleaner interface to the arrays provided by a resolution to issue #64.

This issue partly supersedes issue #29 where Brian says:

We also are dealing with "nice" python bitarrays which require some manipulation (1) before passing into native code. We might want to consider adding an accelerated interface that takes our custom bit packed data as plain python bytes.

1: [ffi.new("char[128]", bytes(f[0].tobytes())) for f in filters1]

I've started experimenting in

  • Branch feature-chunked-speedup for a C implementation of many x many comparisons.
  • Branch feature-direct-cffi builds on top of that with a look at accessing bitarray data from C without a memcopy. Only does a bitarray popcount for now.
# Assume ba is a bitarray
addr = ba.buffer_info()[0]
pntr = ffi.cast("char *", addr)
lib.popcount(pntr)

The comments in issue #18 might still be relevant.

Aha! Link: https://csiro.aha.io/features/ANONLINK-68

Accelerated api for processing n x m entities

At the moment in entitymatch.cffi_filter_similarity_k we pass in two lists of bloom filters, but we end up calling the native code n times. This issue is to move that iteration into C.

We also are dealing with "nice" python bitarrays which require some manipulation (1) before passing into native code. We might want to consider adding an accelerated interface that takes our custom bit packed data as plain python bytes.

1: [ffi.new("char[128]", bytes(f[0].tobytes())) for f in filters1]

I've started experimenting in

  • Branch feature-chunked-speedup for a C implementation of many x many comparisons.
  • Branch feature-direct-cffi builds on top of that with a look at accessing bitarray data from C without a memcopy. Only does a bitarray popcount for now.
# Assume ba is a bitarray
addr = ba.buffer_info()[0]
pntr = ffi.cast("char *", addr)
lib.popcount(pntr)

e2e test sometimes fails

Very occasionally the e2e test can fail with items that are very similar, but not the exact match. This is somewhat expected behaviour with probabilistic linking, so the tests should still pass.

A failing example:

[GPU 1-python3.6-clang] ======================================================================
[GPU 1-python3.6-clang] FAIL: test_cffi_k (tests.test_e2e.TestEntityMatchTopK)
[GPU 1-python3.6-clang] ----------------------------------------------------------------------
[GPU 1-python3.6-clang] Traceback (most recent call last):
[GPU 1-python3.6-clang]   File "/var/jenkins_home/workspace/nk_bug-remove-extra-clkhash-4U5Z47G3YO7PFYAEM2LZQ3G7Y5YBDBNU6TKWSRJCVTBFKS42WJQA@2/tests/test_e2e.py", line 157, in test_cffi_k
[GPU 1-python3.6-clang]     self.assertEqual(s1[indexA], s2[mapping[indexA]])
[GPU 1-python3.6-clang] AssertionError: Tuples differ: (285, 'Cheri Wahlquist', '1975/05/22', 'F') != (208, 'Mozell Wahlquist', '1975/01/05', 'F')
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] First differing element 0:
[GPU 1-python3.6-clang] 285
[GPU 1-python3.6-clang] 208
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] - (285, 'Cheri Wahlquist', '1975/05/22', 'F')
[GPU 1-python3.6-clang] ?    -   ^^ ^^                     ---
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] + (208, 'Mozell Wahlquist', '1975/01/05', 'F')
[GPU 1-python3.6-clang] ?   +    ^^^ ^^                    +++
[GPU 1-python3.6-clang] 

This issue is to create a better assertion for the test.

Calculate nodes' degree to improve greedy method's accuracy

Issue by tho802
Wednesday Jun 22, 2016 at 05:17 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/11


If we apply the threshold of similarity scores in C, and instead of returning the top K edges, we return the edges that meet the threshold (along with the count) it should be slightly quicker.

If we can also determine the degree of the nodes from organisation B we can make more accurate matches - e.g. defeating #10

Aha! Link: https://csiro.aha.io/features/ANONLINK-80

GCC and Clang don't produce equally performant code

As @hardbyte noted in #35, Clang produces much better code than GCC:

GCC

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.065s    |        15.430
  2000 |   2000 |      4000000 |    0.166s    |        24.043
  3000 |   3000 |      9000000 |    0.352s    |        25.532
  4000 |   4000 |     16000000 |    0.557s    |        28.727
  5000 |   5000 |     25000000 |    0.710s    |        35.235
  6000 |   6000 |     36000000 |    0.710s    |        50.699
  7000 |   7000 |     49000000 |    0.747s    |        65.631
  8000 |   8000 |     64000000 |    0.789s    |        81.089
  9000 |   9000 |     81000000 |    0.861s    |        94.081
 10000 |  10000 |    100000000 |    1.003s    |        99.673
Single Core:
  5000 |   5000 |     25000000 |    0.425s    |        58.800

Clang

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.055s    |        18.248
  2000 |   2000 |      4000000 |    0.135s    |        29.720
  3000 |   3000 |      9000000 |    0.275s    |        32.769
  4000 |   4000 |     16000000 |    0.424s    |        37.692
  5000 |   5000 |     25000000 |    0.514s    |        48.607
  6000 |   6000 |     36000000 |    0.537s    |        67.025
  7000 |   7000 |     49000000 |    0.557s    |        87.939
  8000 |   8000 |     64000000 |    0.596s    |       107.434
  9000 |   9000 |     81000000 |    0.636s    |       127.285
 10000 |  10000 |    100000000 |    0.774s    |       129.232
Single Core:
  5000 |   5000 |     25000000 |    0.260s    |        96.132

A quick visual of the generated assembly shows that GCC misses obvious opportunities for function inlining and loop unrolling that probably explain most of the difference. There is ample opportunity to clean up the C++ source to make it easier for GCC to produce better code. At worst we can just do the inlining and unrolling manually; probably worth it to get a 50% speedup.

Support multiple bloom filters

It may make sense to calculate multiple CLKs using different field sets for improved matching, blocking, matching with orgs who only have a subset of the fields, and most importantly for automated methods to determine the threshold.

Example

If the possible fields are Name, Phone, Address, Email we might calculate CLKs for:

  • Name, Phone, Address, Email
  • Name, Phone, Address
  • Name, Address, Email
  • Phone, Address, Email
  • Name, Phone
  • Name, Address
  • Name, Email
  • Phone, Address
  • Phone, Email
  • etc

Need to investigate further and determine what changes would be required in anonlink to support matching datasets which contain sets of different clks.

Aha! Link: https://csiro.aha.io/features/ANONLINK-78
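Enumerating those field subsets is mechanical. A sketch with itertools (the minimum subset size of two is an assumption for illustration):

```python
from itertools import combinations

fields = ["Name", "Phone", "Address", "Email"]

# All subsets with at least two fields, largest first;
# each subset could get its own CLK.
subsets = [list(c)
           for r in range(len(fields), 1, -1)
           for c in combinations(fields, r)]
```

For four fields this yields 1 + 4 + 6 = 11 candidate CLK configurations.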

[CLOSED] Similarity matrix

Issue by tho802
Monday Apr 11, 2016 at 16:48 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/1


When determining Bloom Filter similarity the score is computed in C for all NxM combinations, but only the single best match is returned. This somewhat limits the next step that solves pairwise matches for the entire network at once.

Currently the structure is:

  • the index in filters1
  • the similarity score between 0 and 1 of the best match
  • The original index in entity A
  • The original index in entity B
  • The index in filters2 of the best match

Instead of computing a tuple for each entity in entity A, we could explore the memory/accuracy trade off of instead computing a similarity matrix - recording the n-gram similarity score between every pair.

Release Process

Not sure if it would be better to get our private jenkins to publish to pypi, have travis or some other public CI do it, or continue cutting the releases myself and then uploading gpg signed releases to PyPi.

Opinions appreciated.

Note to self: have a look at cryptography's release process. They have a release script.

Aha! Link: https://csiro.aha.io/features/ANONLINK-74

Investigate Blocking in privacy preserving way

Issue by tho802
Tuesday Oct 11, 2016 at 03:31 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/19


Yanfeng is looking at various blocking techniques, currently implementing and comparing blocking algorithms in Python.

We need to consider if the server api has to change at all. Clustering occurs on the CLK on the server before matching.

Links

Different results when compiled using clang

Building on OSX using the default clang compiler appears to compile without warnings but produces different results than g++. The tests showing the error are lines 71-79 of test_bloomfilter.py:

    def test_dice_1_c(self):
        ba = self.generate_bitarray(1024)
        self.assertEqual(bm.dicecoeff(ba, ba), 1.0)

    def test_dice_2_c(self):
        ba = self.generate_bitarray(1024)
        bb = ba.copy()
        bb.invert()
        self.assertEqual(bm.dicecoeff(ba, bb), 0.0)

Test Results show that testing a bitarray against itself, and against its inverse, doesn't give similarity scores of 1 and 0 respectively.

See the compile output for osx vs linux.

The command used to build with clang is:

clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/private/var/jenkins_home/workspace/ink_feature-use-jenkinsfile-VGRPYD53GGGDDSBIJDLSUDYPJ34QR63ITGMC5VJNB56W6ID244AA/env/include -I/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c build/temp.macosx-10.12-x86_64-3.5/_entitymatcher.cpp -o build/temp.macosx-10.12-x86_64-3.5/build/temp.macosx-10.12-x86_64-3.5/_entitymatcher.o -std=c++11 -mssse3 -mpopcnt

The code is trying to use strictly sized types, but nonetheless there is an issue.

The Python part:

    e1array = ffi.new("char[]", e1.tobytes())
    e2array = ffi.new("char[]", e2.tobytes())

    return lib.dice_coeff_1024(e1array, e2array)

Which calls the C part:

double dice_coeff_1024(const char *e1, const char *e2) {
    int nbyte = 128; // 1024 bit key, 128 bytes
    int nuint64 = int(nbyte / sizeof(uint64_t)); // assume l is divisible by 64

    const uint64_t *comp1 = (const uint64_t *) e1;
    const uint64_t *comp2 = (const uint64_t *) e2;

    uint32_t count_both = 0;

    count_both += builtin_popcnt_unrolled_errata_manual(comp1, nuint64);
    count_both += builtin_popcnt_unrolled_errata_manual(comp2, nuint64);

    uint64_t* combined = new uint64_t[nuint64];
    for (int i=0 ; i < nuint64; i++ ) {
        combined[i] = comp1[i] & comp2[i];
    }

    uint32_t count_and = builtin_popcnt_unrolled_errata_manual(combined, nuint64);

    delete[] combined;

    return 2 * count_and / (double) (count_both);
}

Which calls the following assembler:

// Source: http://danluu.com/assembly-intrinsics/
// https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  uint64_t cnt[4];
  for (int i = 0; i < 4; ++i) {
    cnt[i] = 0;
  }

  for (int i = 0; i < len; i+=4) {
    __asm__(
        "popcnt %4, %4  \n\t"
        "add %4, %0     \n\t"
        "popcnt %5, %5  \n\t"
        "add %5, %1     \n\t"
        "popcnt %6, %6  \n\t"
        "add %6, %2     \n\t"
        "popcnt %7, %7  \n\t"
        "add %7, %3     \n\t" // +r means input/output, r means intput
        : "+r" (cnt[0]), "+r" (cnt[1]), "+r" (cnt[2]), "+r" (cnt[3])
        : "r"  (buf[i]), "r"  (buf[i+1]), "r"  (buf[i+2]), "r"  (buf[i+3]));
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}

My hunch is that the bug is in the C code. Is the casting coming back to haunt me? Does anyone know much about the differences between clang and g++?
cc: @mhsiah @maxott @sjhardy
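A pure-Python reference implementation can help isolate whether the discrepancy lies in the C code or in how the bytes are passed in. This is a cross-checking sketch, not part of anonlink:

```python
def dice_coeff_1024_ref(e1: bytes, e2: bytes) -> float:
    """Pure-Python reference for dice_coeff_1024 (1024-bit, 128-byte keys)."""
    assert len(e1) == 128 and len(e2) == 128
    a = int.from_bytes(e1, "big")
    b = int.from_bytes(e2, "big")
    count_both = bin(a).count("1") + bin(b).count("1")
    count_and = bin(a & b).count("1")
    return 2 * count_and / count_both
```

Running the failing bitarray inputs through both this function and the compiled `dice_coeff_1024` would show whether clang's output diverges from the straightforward computation.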

GPU integration

Optional interface with cuda.

Note we have a proof of concept for computing the DICE-Sorensen index, sorting and applying a threshold all on the GPU.

Need to consider whether to match the top-k filtering similar to our C implementation, or drop that.

Uniform Python interface to C++ library

The C++ library is currently accessed through the interface generated by Python's CFFI, which is clumsy and not idiomatic. It would be neater if we wrapped interaction with the C++ library completely in a more Pythonic way. This ought to complement issue #66, which is about avoiding copying back and forth from the Python runtime.

Aha! Link: https://csiro.aha.io/features/ANONLINK-69

Tests that currently trick greedy algorithm

Our greedy algorithm currently fails to match the following graph, where the connection between a and 1 looks likely but ultimately shouldn't be chosen.

(graph image)

The network methods should succeed, and for now the greedy algorithm should fail. Correct mapping is:
{a: 2, b: 3, c: 1}

A similar variant where the result will probably be different between the methods (same desired output):

(graph image)

This issue tracks these cases; they exist as tests marked expectedFailure.

Aha! Link: https://csiro.aha.io/features/ANONLINK-81

Improve priority queue use in C++ matching code

The use of std::priority_queue in the C++ matching code (match_one_against_many_dice_k_top) is the main bottleneck when k is large. There are a few deficiencies that may be negatively impacting performance. For example:

  1. Extensive use of priority_queue::pop() in loops as though it were O(1) when it is in fact O(log n).
  2. AoS memory layout of the queue elements (struct Node) when SoA would probably be better.
  3. Possibly unnecessary maintenance of the queue invariant when k is small (a qsort at the end might do).

Some potentially useful resources:

Aha! Link: https://csiro.aha.io/features/ANONLINK-70
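For intuition, the bounded-queue idea (never let the heap grow past k, so each update is O(log k) rather than O(log n)) can be sketched in Python with heapq. This illustrates the technique only; it is not the C++ code:

```python
import heapq

def top_k(scores, k):
    """Return the k largest scores, descending, using a size-bounded min-heap."""
    heap = []  # min-heap holding the current best k scores
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # Replace the smallest of the current best k in one O(log k) step.
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)
```

Because the heap is capped at k elements, the cost per candidate stays small even when the candidate stream is huge.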

[CLOSED] Parallel computation for greedy method

Issue by tho802
Monday Jun 20, 2016 at 02:49 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/8


At the moment the greedy solver is entirely CPU bound - and uses just one core.
It would be possible to break the task into parallel chunks.

First calculate plain bitcounts for all bloomfilters in filters1 and filters2.

Then solve chunks of filters1 in parallel:

  • Each thread/process has a chunk of bloomfilters from filters1 (e.g. 500 filters)
  • read only access to all of filters2 (in shared memory)
  • Each process computes the top k indices and scores for every filter from filters1

Then when all jobs complete the mapping is easily assembled:

        mappings = {}
        matched_entries_b = set()
        for possible_index_b, score in zip(indices, scores):
            if possible_index_b not in matched_entries_b and score > threshold:
                mappings[index_a] = possible_index_b
                matched_entries_b.add(possible_index_b)
                break
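The chunking scheme above can be sketched as follows. The names `toy_score` and `chunked_top_k` are hypothetical stand-ins; the real implementation would call the C similarity kernel and would use processes with shared memory rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(a: int, b: int) -> float:
    """Stand-in Dice score on integer bitmasks (illustrative only)."""
    both = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / both if both else 0.0

def chunked_top_k(filters1, filters2, k=2, chunk_size=500, workers=4):
    """Score chunks of filters1 against all of filters2 in parallel."""
    def process(chunk):
        out = []
        for f in chunk:
            # Rank every filters2 index by similarity to this filter.
            ranked = sorted(range(len(filters2)),
                            key=lambda j: toy_score(f, filters2[j]),
                            reverse=True)
            out.append(ranked[:k])
        return out

    chunks = [filters1[i:i + chunk_size]
              for i in range(0, len(filters1), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process, chunks)
    # Reassemble per-chunk results in order, ready for the greedy mapping step.
    return [row for part in results for row in part]
```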

Source distribution should include C source

Ideally we can create a source release of anonlink with:

python setup.py sdist

And another user (possibly on another version of Python) should be able to install using the created source distribution.

At the moment a .tar.gz gets created with the Python code but not the C/C++ sources.

bit shuffling

Issue by tho802
Tuesday Aug 16, 2016 at 03:43 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/18


The de-serialisation of bitarray instances into char * actually takes some time. We currently do it with:

    carr2 = ffi.new("char[{}]".format(128 * length_f2),
                        bytes([b for f in filters2 for b in f[0].tobytes()]))

    c_popcounts = ffi.new("uint32_t[{}]".format(length_f2), [f[2] for f in filters2])

It might be worth looking at the internal C representation of bitarray objects, and passing that. One of the features of the bitarray library is:

Packing and unpacking to other binary data formats, e.g. numpy.ndarray, is possible

So we might want to benchmark pack/unpack versus our current iteration approach. Or we could probably just pass the bitarray.buffer_info() directly to our C library.

Returns a tuple (address, size, endianness, unused, allocated) giving the current memory address, the size (in bytes) used to hold the bitarray’s contents, the bit endianness as a string, the number of unused bits (e.g. a bitarray of length 11 will have a buffer size of 2 bytes and 5 unused bits), and the size (in bytes) of the allocated memory.
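Independent of the zero-copy route, the byte-flattening itself can be done without the per-byte Python loop. A small sketch, with the `(bitarray, ...)` tuples of the snippet above stood in by plain bytes:

```python
def flatten_slow(filters):
    """Per-byte list comprehension, as in the current code."""
    return bytes([b for f in filters for b in f[0]])

def flatten_fast(filters):
    """join concatenates the underlying buffers directly."""
    return b"".join(f[0] for f in filters)
```

Both produce identical output; the join version avoids materialising every byte as a Python int first.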

Test suite does not declare expected failures correctly

Running the test suite currently raises a RuntimeWarning:

/usr/lib/python3.6/unittest/case.py:550: RuntimeWarning: TestResult has no addExpectedFailure method, reporting as passes RuntimeWarning)

This is probably because the suite uses the Python @unittest.expectedFailure decorator instead of the Nose @raises decorator in various places (reasoning from here).

Misleading or confusing aspects of anonlink.benchmark

Following the README, we run the benchmark like so:

$ python3 -m anonlink.benchmark
100000 x 1024 bit popcounts in 0.019525 seconds
Popcount speed: 625.18 MiB/s
Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.118s    |         8.478
  2000 |   2000 |      4000000 |    0.239s    |        16.707
  3000 |   3000 |      9000000 |    0.495s    |        18.171
  4000 |   4000 |     16000000 |    0.848s    |        18.874
  5000 |   5000 |     25000000 |    0.942s    |        26.528
  6000 |   6000 |     36000000 |    1.032s    |        34.898
  7000 |   7000 |     49000000 |    1.041s    |        47.049
  8000 |   8000 |     64000000 |    1.107s    |        57.789
  9000 |   9000 |     81000000 |    1.137s    |        71.229
 10000 |  10000 |    100000000 |    1.173s    |        85.245
 20000 |  20000 |    400000000 |    3.641s    |       109.867
Single Core:
  5000 |   5000 |     25000000 |    0.545s    |        45.878

There are three problems here:

  1. The reported popcount throughput is for the count() method in Python's bitarrays, not the C implementation (which, FWIW, I've independently measured to be in the order of 10GiB/s). We use the C implementation exclusively except for testing AFAICT.
  2. The table of comparisons per second is measuring calculations of similarity matrices via the function distributed_processing.calculate_filter_similarity with defaults k=10 and threshold=0.5 (these defaults and the actual work are delegated to entitymatch.calculate_filter_similarity). In contrast, the "Single Core" line uses the function entitymatch.calculate_mapping_greedy which calculates a single similarity matrix AND does a greedy solve with defaults k=5 and threshold=0.95. These two situations are obviously completely incomparable, but are presented in the same table without explanation.
  3. Following on from the previous point, the line for 5000x5000 in the main table is always way slower than the line for single core, which makes no sense if the main table is "distributed"; it must be a consequence of the fact that the k and threshold parameters have been changed.

[CLOSED] OSX build

Issue by tho802
Tuesday Oct 25, 2016 at 00:20 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/20


Currently Jenkins is failing to build anonlink on OSX. This could just be that the McNode on Jenkins isn't set up with Xcode etc. for compiling, or it could be that the build doesn't currently work for OSX. Either way we should get it working and document what is required to build on Mac.

Popcount function is under performing

In branch hlaw-fix-issue56 the raw popcount function performance is reported to be about 12 GiB/s (on my machine), however (essentially) the same code performs at 18 GiB/s when called from a basic wrapper function thus:

// -*- compile-command: "g++ -ggdb -Wall -Wextra -std=c++11 -O3 -mssse3 -mpopcnt -fvisibility=hidden -o bench bench.cc dice_one_against_many.cpp" -*-

#include <iostream>
#include <cstdint>
#include <cstdlib>   // for atol
using namespace std;

double popcount_1024_array(const char *many, int n, uint32_t *counts_many);

int main(int argc, char *argv[]) {
    static constexpr int KEYWORDS = 16;
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    // NB: buf is deliberately left uninitialised; its contents don't affect popcount timing
    uint64_t *buf = new uint64_t[KEYWORDS * n];
    uint32_t *counts = new uint32_t[n];
    long nbytes = n*KEYWORDS*sizeof(uint64_t);
    double t = popcount_1024_array(reinterpret_cast<const char *>(buf), n, counts);
    cout << "Time: " << t << "ms => " << (n/1e6) * (1e3/t) << " 1e6 popc/s => "
         <<  (nbytes / (double)(1 << 20)) * (1e3/t) << " MiB/s" << endl;
    delete[] counts;
    delete[] buf;
    return 0;
}

Aha! Link: https://csiro.aha.io/features/ANONLINK-71

Update test suite to use py.test.

The test suite has some vestiges of its former dependence on the nose testing framework (e.g. in the README file). These should be removed and the tests updated to use py.test idioms.

This issue should resolve #62.
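A minimal sketch of the migration: nose's helper-based assertions become plain `assert` statements, which py.test rewrites to give rich failure messages. The `popcount` helper below is hypothetical, included only so the example is self-contained.

```python
# Nose style (before):
#     from nose.tools import assert_equal
#     assert_equal(popcount("1010"), 2)

# py.test style (after): plain asserts in a test_* function,
# picked up automatically by test discovery.
def popcount(bits: str) -> int:
    # hypothetical helper, only so the example runs standalone
    return bits.count("1")

def test_popcount():
    assert popcount("1010") == 2
    assert popcount("0000") == 0
```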

[CLOSED] Positional q-grams

Issue by tho802
Friday Aug 05, 2016 at 02:02 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/15


Positional q-grams should give more accuracy for numerical identifiers such as zip code, house number, date of birth.

With positional q-grams the locations of the q-grams within the string are recorded as additional information. For example, when using the position of the first character of each bi-gram as locational label, the positional bigram set for SMITH is { S(0), SM(1), MI(2), IT(3), TH(4), H(5) }. Positional q-grams are considered equal iff both the q-grams and the locations are equal.

The only part that isn't clear to me from the reading is how to hash positional uni-grams (or bi-grams) into CLKs.
At the moment we transform 1987 into the unigram list [1, 9, 8, 7]; to encode the position I plan to
prepend the index: ["1 1", "2 9", "3 8", "4 7"]. This seems reasonable to me and would let us carry on using the same hashing.
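The plan above can be sketched in a few lines; `positional_unigrams` is a hypothetical name, not an existing anonlink/clkhash function:

```python
def positional_unigrams(value):
    """Prepend the 1-based position to each unigram,
    e.g. 1987 -> ["1 1", "2 9", "3 8", "4 7"]."""
    return ["{} {}".format(i, c) for i, c in enumerate(str(value), start=1)]
```

Each positional unigram is then just another token, so the existing hashing scheme can be reused unchanged.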

Remove dependency on clkhash

Currently a few e2e tests rely on clkhash for generating the Bloom filters.

While it is nice to have an integration test between the related libraries, we shouldn't introduce clkhash as a dependency just for the integration test.

The CLKs are also used in the benchmarking script; comment from @wilko77:

none of the benchmarks actually needs real CLKs, as all you measure is timing. You could easily fake CLKs by choosing them randomly. This would not require a clkhash dependency.
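A sketch of that suggestion (the function name and the bit-list representation are illustrative, not anonlink API; the real benchmark would pack these bits into a bitarray):

```python
import random

def random_clk(length=1024, density=0.5, seed=None):
    # Fake CLK: for timing purposes a random bit vector is as good as a
    # real one, so the clkhash dependency can be dropped from the benchmark.
    rng = random.Random(seed)
    return [1 if rng.random() < density else 0 for _ in range(length)]
```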

Similarity reported as nan for all zero arrays

>>> from bitarray import bitarray
>>> from anonlink import bloommatcher as bm
>>> print(bm.dicecoeff(bitarray('0'*1024), bitarray('0'*1024)))
nan

This behaviour is the same for both clang and gcc builds.
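The nan comes from the Dice denominator: the coefficient is 2|A∩B| / (|A|+|B|), which is 0/0 when both filters are all-zero. A guarded version might look like the following sketch (not anonlink's actual fix; the return value for the empty/empty case is a convention that needs to be chosen):

```python
def dice_coefficient(a, b):
    # a, b: equal-length sequences of 0/1 bits
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # convention choice: could equally be defined as 1.0
    inter = sum(x & y for x, y in zip(a, b))
    return 2.0 * inter / total
```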
