
anonlink's Introduction


A Python (and optimised C++) implementation of anonymous linkage using cryptographic linkage keys as described by Rainer Schnell, Tobias Bachteler, and Jörg Reiher in A Novel Error-Tolerant Anonymous Linking Code.

anonlink computes similarity scores and/or best-guess matches between sets of cryptographic linkage keys (hashed entity records).

Use clkhash to create cryptographic linkage keys from personally identifiable data.
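The underlying comparison is the Sorensen-Dice coefficient over bit vectors. A minimal pure-Python sketch (for illustration only; anonlink's real implementation is the optimised C++ code):

```python
def dice_coefficient(clk_a: bytes, clk_b: bytes) -> float:
    """Sorensen-Dice similarity of two equal-length bit vectors."""
    a = int.from_bytes(clk_a, "big")
    b = int.from_bytes(clk_b, "big")
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        return 0.0
    # Twice the number of shared set bits, over the total set bits.
    return 2 * bin(a & b).count("1") / total
```

Identical keys score 1.0, disjoint keys score 0.0, and partial overlap falls in between.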

Installation

Install a precompiled wheel from PyPI:

pip install anonlink

Or (if your system has a C++ compiler) you can locally install from source:

pip install -r requirements.txt
pip install -e .

Benchmark

You can run the benchmark with:

$ python -m anonlink.benchmark
Anonlink benchmark -- see README for explanation
------------------------------------------------

Threshold: 0.5, All results returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  (50.73%) |  0.762  (49.2% / 50.8%) |     2.669
  2000 |   2000 |    4e6  (51.04%) |  3.696  (42.6% / 57.4%) |     2.540
  3000 |   3000 |    9e6  (50.25%) |  8.121  (43.5% / 56.5%) |     2.548
  4000 |   4000 |   16e6  (50.71%) | 15.560  (41.1% / 58.9%) |     2.504

Threshold: 0.5, Top 100 matches per record returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 6.86%) |  0.170  (85.9% / 14.1%) |     6.846
  2000 |   2000 |    4e6  ( 3.22%) |  0.384  (82.9% / 17.1%) |    12.561
  3000 |   3000 |    9e6  ( 2.09%) |  0.612  (81.6% / 18.4%) |    18.016
  4000 |   4000 |   16e6  ( 1.52%) |  0.919  (78.7% / 21.3%) |    22.135
  5000 |   5000 |   25e6  ( 1.18%) |  1.163  (80.8% / 19.2%) |    26.592
  6000 |   6000 |   36e6  ( 0.97%) |  1.535  (75.4% / 24.6%) |    31.113
  7000 |   7000 |   49e6  ( 0.82%) |  1.791  (80.6% / 19.4%) |    33.951
  8000 |   8000 |   64e6  ( 0.71%) |  2.095  (81.5% / 18.5%) |    37.466
  9000 |   9000 |   81e6  ( 0.63%) |  2.766  (72.5% / 27.5%) |    40.389
 10000 |  10000 |  100e6  ( 0.56%) |  2.765  (81.7% / 18.3%) |    44.277
 20000 |  20000 |  400e6  ( 0.27%) |  7.062  (86.2% / 13.8%) |    65.711

Threshold: 0.7, All results returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   113.109
  2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.7% /  1.3%) |   124.076
  3000 |   3000 |    9e6  ( 0.01%) |  0.071  (99.1% /  0.9%) |   128.515
  4000 |   4000 |   16e6  ( 0.01%) |  0.123  (99.0% /  1.0%) |   131.654
  5000 |   5000 |   25e6  ( 0.01%) |  0.202  (99.1% /  0.9%) |   124.999
  6000 |   6000 |   36e6  ( 0.01%) |  0.277  (99.0% /  1.0%) |   131.403
  7000 |   7000 |   49e6  ( 0.01%) |  0.368  (98.9% /  1.1%) |   134.428
  8000 |   8000 |   64e6  ( 0.01%) |  0.490  (99.0% /  1.0%) |   131.891
  9000 |   9000 |   81e6  ( 0.01%) |  0.608  (99.0% /  1.0%) |   134.564
 10000 |  10000 |  100e6  ( 0.01%) |  0.753  (99.0% /  1.0%) |   134.105
 20000 |  20000 |  400e6  ( 0.01%) |  2.905  (98.8% /  1.2%) |   139.294

Threshold: 0.7, Top 100 matches per record returned
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   111.640
  2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.6% /  1.4%) |   122.060
  3000 |   3000 |    9e6  ( 0.01%) |  0.074  (99.1% /  0.9%) |   123.237
  4000 |   4000 |   16e6  ( 0.01%) |  0.124  (99.0% /  1.0%) |   130.204
  5000 |   5000 |   25e6  ( 0.01%) |  0.208  (99.1% /  0.9%) |   121.351
  6000 |   6000 |   36e6  ( 0.01%) |  0.275  (99.0% /  1.0%) |   132.186
  7000 |   7000 |   49e6  ( 0.01%) |  0.373  (99.0% /  1.0%) |   132.650
  8000 |   8000 |   64e6  ( 0.01%) |  0.496  (99.1% /  0.9%) |   130.125
  9000 |   9000 |   81e6  ( 0.01%) |  0.614  (99.0% /  1.0%) |   133.216
 10000 |  10000 |  100e6  ( 0.01%) |  0.775  (99.1% /  0.9%) |   130.230
 20000 |  20000 |  400e6  ( 0.01%) |  2.939  (98.9% /  1.1%) |   137.574

The tables are interpreted as follows. Each table measures the throughput of the Dice coefficient comparison function. The four tables correspond to the four combinations of two "matching threshold" values and two "result limiting" settings.

These parameters have been chosen to characterise two different performance scenarios. Since the data used for comparisons is randomly generated, the first threshold value (0.5) causes about 50% of the candidates to "match", while the second threshold value (0.7) causes only ~0.01% of the candidates to match (these values are reported in the "match %" column). Where the table heading includes "All results returned", all matches above the threshold are returned and passed to the solver. With the threshold of 0.5, the large number of matches means that much of the time is spent keeping the candidates in order. The next table limits the results to the top 100 matches per record, which must still be above the threshold.

In the final two tables, the threshold value of 0.7 very effectively filters down the number of candidate matches. Here the throughput is determined primarily by the comparison code itself; adding the top-100 filter has no major impact.

Finally, the Total Time column indicates the proportion of time spent calculating the (sparse) similarity matrix comparisons and the proportion spent matching in the greedy solver. The latter is determined by the size of the similarity matrix, which will be approximately #comparisons * match% / 100.
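That estimate can be made concrete with a hypothetical helper that simply restates the formula:

```python
def expected_entries(size1: int, size2: int, match_pct: float) -> float:
    """Approximate number of entries in the sparse similarity matrix."""
    return size1 * size2 * match_pct / 100

# e.g. the 5000 x 5000 run at threshold 0.5 with 1.18% matching
# leaves roughly 295,000 candidate pairs for the greedy solver.
```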

Tests

Run unit tests with `pytest`:

$ pytest
====================================== test session starts ======================================
platform linux -- Python 3.6.4, pytest-3.2.5, py-1.4.34, pluggy-0.4.0
rootdir: /home/hlaw/src/n1-anonlink, inifile:
collected 71 items

tests/test_benchmark.py ...
tests/test_bloommatcher.py ..............
tests/test_e2e.py .............ss....
tests/test_matcher.py ..x.....x......x....x..
tests/test_similarity.py .........
tests/test_util.py ...

======================== 65 passed, 2 skipped, 4 xfailed in 4.01 seconds ========================

To enable slightly larger tests, set the following environment variables:

  • INCLUDE_10K
  • INCLUDE_100K

Limitations

Discussion

If you run into bugs, you can file them in our issue tracker on GitHub.

There is also an anonlink mailing list for development discussion and release announcements.

Wherever we interact, we strive to follow the Python Community Code of Conduct.

Citing

Anonlink is designed, developed and supported by CSIRO's Data61. If you use any part of this library in your research, please cite it using the following BibTex entry:

@misc{Anonlink,
  author = {CSIRO's Data61},
  title = {Anonlink Private Record Linkage System},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/data61/anonlink}},
}

License

Copyright 2017 CSIRO (Data61)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

anonlink's People

Contributors

dependabot-preview[bot], dependabot[bot], djrobstep, gusmith, hardbyte, joyceyuu, nbgl, programmedbyjake, sjhardy, wilko77


anonlink's Issues

[CLOSED] anonlink tests are broken for Python 2

Issue by tho802
Wednesday Jul 13, 2016 at 01:15 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/13


Oops, one of my cffi updates has broken support for Python 2:

    return cffi_filter_similarity_k(filters1, filters2, k=k, threshold=threshold)
  File "/home/brian/dev/N1/AnonymousLinking/anonlink/entitymatch.py", line 44, in cffi_filter_similarity_k
    bytes([b for f in filters2 for b in f[0].tobytes()]))
IndexError: initializer str is too long for 'char[12800]' (got 87821 characters)

[CLOSED] [WIP] Pure Python Implementation of multibit trees

Issue by tho802
Friday Aug 05, 2016 at 01:42 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/pull/14


Follows the paper "Similarity Filtering with Multibit Trees for Record Linkage" with the following differences:

  • Uses one multibit tree per CLK popcount. E.g., every CLK with 512 bits set will be in one tree. The paper says they get quicker results using one multibit tree for all CLKs; however, it wasn't clear to me how to modify the calculation of the maximum bound on the Jaccard similarity, which is how you determine whether to terminate the search (Equation 5 in the paper). Perhaps @har991 you could point me in the right direction?
  • For the final similarity computation I reused our reasonably efficient cffi_filter_similarity_k function. This means in the third search phase instead of using jaccard similarity we return the top k matches using the Sorensen Dice index.
  • The paper uses ordered q-grams. I think this is entirely independent of the multibit tree data structure, so I will add an issue to implement that.

Not ready for merge yet. Just want to solicit feedback/help

Once it is doing the right thing I'll look at performance.


tho802 included the following code: https://github.csiro.au/magic/AnonymousLinking/pull/14/commits

Tests shouldn't be pinned to specific machines

The tests are pinned to specific machines in the jenkinsfile:

def configs = [
    [label: 'GPU 1', pythons: ['python3.4', 'python3.5', 'python3.6'], compilers: ['clang', 'gcc']],
    [label: 'McNode', pythons: ['python3.5'], compilers: ['clang', 'gcc']]
]

This should be replaced with more generic labels, e.g. osx for McNode.

The problem is that if one of those nodes goes down, the tests will fail, which renders the redundancy in the Jenkins nodes useless.

Avoid copying data back and forth between the Python runtime and the C++ library

According to the benchmark, copying to and from the C++ library currently takes 99% of the time.
Edit: Not true. Copying does take a lot of time in some places (see #79), but the copying mentioned below is not "99%" bad, more like "5%" bad.
Really the C++ library should just be passed an address to the memory it needs to look at directly, making the copy redundant. This may take advantage of a cleaner interface to the arrays provided by a resolution to issue #64.

This issue partly supersedes issue #29 where Brian says:

We also are dealing with "nice" python bitarrays which require some manipulation (1) before passing into native code. We might want to consider adding an accelerated interface that takes our custom bit packed data as plain python bytes.

1: [ffi.new("char[128]", bytes(f[0].tobytes())) for f in filters1]

I've started experimenting in

  • Branch feature-chunked-speedup for a C implementation of many x many comparisons.
  • Branch feature-direct-cffi builds on top of that with a look at accessing bitarray data from C without a memcopy. Only does a bitarray popcount for now.
# Assume ba is a bitarray
addr = ba.buffer_info()[0]
pntr = ffi.cast("char *", addr)
lib.popcount(pntr)

The comments in issue #18 might still be relevant.

Aha! Link: https://csiro.aha.io/features/ANONLINK-68

Accelerated api for processing n x m entities

At the moment in entitymatch.cffi_filter_similarity_k we pass in two lists of bloom filters, but we end up calling the native code n times. This issue is to move that iteration into C.

We also are dealing with "nice" python bitarrays which require some manipulation (1) before passing into native code. We might want to consider adding an accelerated interface that takes our custom bit packed data as plain python bytes.

1: [ffi.new("char[128]", bytes(f[0].tobytes())) for f in filters1]

I've started experimenting in

  • Branch feature-chunked-speedup for a C implementation of many x many comparisons.
  • Branch feature-direct-cffi builds on top of that with a look at accessing bitarray data from C without a memcopy. Only does a bitarray popcount for now.
# Assume ba is a bitarray
addr = ba.buffer_info()[0]
pntr = ffi.cast("char *", addr)
lib.popcount(pntr)

e2e test sometimes fails

Very occasionally the e2e test can fail with items that are very similar, but not the exact match. This is somewhat expected behaviour with probabilistic linking, so the tests should still pass.

A failing example:

[GPU 1-python3.6-clang] ======================================================================
[GPU 1-python3.6-clang] FAIL: test_cffi_k (tests.test_e2e.TestEntityMatchTopK)
[GPU 1-python3.6-clang] ----------------------------------------------------------------------
[GPU 1-python3.6-clang] Traceback (most recent call last):
[GPU 1-python3.6-clang]   File "/var/jenkins_home/workspace/nk_bug-remove-extra-clkhash-4U5Z47G3YO7PFYAEM2LZQ3G7Y5YBDBNU6TKWSRJCVTBFKS42WJQA@2/tests/test_e2e.py", line 157, in test_cffi_k
[GPU 1-python3.6-clang]     self.assertEqual(s1[indexA], s2[mapping[indexA]])
[GPU 1-python3.6-clang] AssertionError: Tuples differ: (285, 'Cheri Wahlquist', '1975/05/22', 'F') != (208, 'Mozell Wahlquist', '1975/01/05', 'F')
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] First differing element 0:
[GPU 1-python3.6-clang] 285
[GPU 1-python3.6-clang] 208
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] - (285, 'Cheri Wahlquist', '1975/05/22', 'F')
[GPU 1-python3.6-clang] ?    -   ^^ ^^                     ---
[GPU 1-python3.6-clang] 
[GPU 1-python3.6-clang] + (208, 'Mozell Wahlquist', '1975/01/05', 'F')
[GPU 1-python3.6-clang] ?   +    ^^^ ^^                    +++
[GPU 1-python3.6-clang] 

This issue is to create a better assertion for the test.

Calculate nodes' degree to improve greedy method's accuracy

Issue by tho802
Wednesday Jun 22, 2016 at 05:17 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/11


If we apply the threshold of similarity scores in C, and instead of returning the top K edges, we return the edges that meet the threshold (along with the count) it should be slightly quicker.

If we can also determine the degree of the nodes from organisation B we can make more accurate matches - e.g. defeating #10

Aha! Link: https://csiro.aha.io/features/ANONLINK-80

GCC and Clang don't produce equally performant code

As @hardbyte noted in #35, Clang produces much better code than GCC:

GCC

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.065s    |        15.430
  2000 |   2000 |      4000000 |    0.166s    |        24.043
  3000 |   3000 |      9000000 |    0.352s    |        25.532
  4000 |   4000 |     16000000 |    0.557s    |        28.727
  5000 |   5000 |     25000000 |    0.710s    |        35.235
  6000 |   6000 |     36000000 |    0.710s    |        50.699
  7000 |   7000 |     49000000 |    0.747s    |        65.631
  8000 |   8000 |     64000000 |    0.789s    |        81.089
  9000 |   9000 |     81000000 |    0.861s    |        94.081
 10000 |  10000 |    100000000 |    1.003s    |        99.673
Single Core:
  5000 |   5000 |     25000000 |    0.425s    |        58.800

Clang

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.055s    |        18.248
  2000 |   2000 |      4000000 |    0.135s    |        29.720
  3000 |   3000 |      9000000 |    0.275s    |        32.769
  4000 |   4000 |     16000000 |    0.424s    |        37.692
  5000 |   5000 |     25000000 |    0.514s    |        48.607
  6000 |   6000 |     36000000 |    0.537s    |        67.025
  7000 |   7000 |     49000000 |    0.557s    |        87.939
  8000 |   8000 |     64000000 |    0.596s    |       107.434
  9000 |   9000 |     81000000 |    0.636s    |       127.285
 10000 |  10000 |    100000000 |    0.774s    |       129.232
Single Core:
  5000 |   5000 |     25000000 |    0.260s    |        96.132

A quick visual of the generated assembly shows that GCC misses obvious opportunities for function inlining and loop unrolling that probably explain most of the difference. There is ample opportunity to clean up the C++ source to make it easier for GCC to produce better code. At worst we can just do the inlining and unrolling manually; probably worth it to get a 50% speedup.

Support multiple bloom filters

It may make sense to calculate multiple CLKs using different field sets for improved matching, blocking, matching with orgs who only have a subset of the fields, and most importantly for automated methods to determine the threshold.

Example

If the possible fields are Name, Phone, Address, Email we might calculate CLKs for:

  • Name, Phone, Address, Email
  • Name, Phone, Address
  • Name, Address, Email
  • Phone, Address, Email
  • Name, Phone
  • Name, Address
  • Name, Email
  • Phone, Address
  • Phone, Email
  • etc

Need to investigate further and determine what changes would be required in anonlink to support matching datasets which contain sets of different clks.

Aha! Link: https://csiro.aha.io/features/ANONLINK-78
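Enumerating those field subsets is mechanical. A sketch with itertools (the minimum subset size of two is an assumption for illustration):

```python
from itertools import combinations

fields = ["Name", "Phone", "Address", "Email"]

# All subsets with at least two fields, largest first;
# each subset could get its own CLK.
subsets = [list(c)
           for r in range(len(fields), 1, -1)
           for c in combinations(fields, r)]
```

For four fields this yields 1 + 4 + 6 = 11 candidate CLK configurations.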

[CLOSED] Similarity matrix

Issue by tho802
Monday Apr 11, 2016 at 16:48 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/1


When determining Bloom Filter similarity the score is computed in C for all NxM combinations, but only the single best match is returned. This somewhat limits the next step that solves pairwise matches for the entire network at once.

Currently the structure is:

  • the index in filters1
  • the similarity score between 0 and 1 of the best match
  • The original index in entity A
  • The original index in entity B
  • The index in filters2 of the best match

Instead of computing a tuple for each entity in entity A, we could explore the memory/accuracy trade off of instead computing a similarity matrix - recording the n-gram similarity score between every pair.

Release Process

Not sure if it would be better to get our private jenkins to publish to pypi, have travis or some other public CI do it, or continue cutting the releases myself and then uploading gpg signed releases to PyPi.

Opinions appreciated.

Note to self: have a look at cryptography's release process. They have a release script.

Aha! Link: https://csiro.aha.io/features/ANONLINK-74

Investigate Blocking in privacy preserving way

Issue by tho802
Tuesday Oct 11, 2016 at 03:31 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/19


Yanfeng is looking at various blocking techniques, currently implementing and comparing blocking algorithms in Python.

We need to consider if the server api has to change at all. Clustering occurs on the CLK on the server before matching.

Links

Different results when compiled using clang

Building on OSX using the default clang compiler appears to compile without warnings but produces different results than g++. The tests showing the error are lines 71-79 of test_bloomfilter.py:

    def test_dice_1_c(self):
        ba = self.generate_bitarray(1024)
        self.assertEqual(bm.dicecoeff(ba, ba), 1.0)

    def test_dice_2_c(self):
        ba = self.generate_bitarray(1024)
        bb = ba.copy()
        bb.invert()
        self.assertEqual(bm.dicecoeff(ba, bb), 0.0)

Test Results show that testing a bitarray against itself, and against its inverse, doesn't give similarity scores of 1 and 0 respectively.

See the compile output for osx vs linux.

The command used to build with clang is:

clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/private/var/jenkins_home/workspace/ink_feature-use-jenkinsfile-VGRPYD53GGGDDSBIJDLSUDYPJ34QR63ITGMC5VJNB56W6ID244AA/env/include -I/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c build/temp.macosx-10.12-x86_64-3.5/_entitymatcher.cpp -o build/temp.macosx-10.12-x86_64-3.5/build/temp.macosx-10.12-x86_64-3.5/_entitymatcher.o -std=c++11 -mssse3 -mpopcnt

The code is trying to use strictly sized types, but nonetheless there is an issue.

The Python part:

    e1array = ffi.new("char[]", e1.tobytes())
    e2array = ffi.new("char[]", e2.tobytes())

    return lib.dice_coeff_1024(e1array, e2array)

Which calls the C part:

double dice_coeff_1024(const char *e1, const char *e2) {
    int nbyte = 128; // 1024 bit key, 128 bytes
    int nuint64 = int(nbyte / sizeof(uint64_t)); // assume l is divisible by 64

    const uint64_t *comp1 = (const uint64_t *) e1;
    const uint64_t *comp2 = (const uint64_t *) e2;

    uint32_t count_both = 0;

    count_both += builtin_popcnt_unrolled_errata_manual(comp1, nuint64);
    count_both += builtin_popcnt_unrolled_errata_manual(comp2, nuint64);

    uint64_t* combined = new uint64_t[nuint64];
    for (int i=0 ; i < nuint64; i++ ) {
        combined[i] = comp1[i] & comp2[i];
    }

    uint32_t count_and = builtin_popcnt_unrolled_errata_manual(combined, nuint64);

    delete[] combined;

    return 2 * count_and / (double) (count_both);
}

Which calls the following assembler:

// Source: http://danluu.com/assembly-intrinsics/
// https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  uint64_t cnt[4];
  for (int i = 0; i < 4; ++i) {
    cnt[i] = 0;
  }

  for (int i = 0; i < len; i+=4) {
    __asm__(
        "popcnt %4, %4  \n\t"
        "add %4, %0     \n\t"
        "popcnt %5, %5  \n\t"
        "add %5, %1     \n\t"
        "popcnt %6, %6  \n\t"
        "add %6, %2     \n\t"
        "popcnt %7, %7  \n\t"
        "add %7, %3     \n\t" // +r means input/output, r means intput
        : "+r" (cnt[0]), "+r" (cnt[1]), "+r" (cnt[2]), "+r" (cnt[3])
        : "r"  (buf[i]), "r"  (buf[i+1]), "r"  (buf[i+2]), "r"  (buf[i+3]));
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}

My hunch is that the bug is in the C code. Is the casting coming back to haunt me? Does anyone know much about the differences between clang and g++?
cc: @mhsiah @maxott @sjhardy
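A pure-Python reference implementation can help isolate whether the discrepancy lies in the C code or in how the bytes are passed in. This is a cross-checking sketch, not part of anonlink:

```python
def dice_coeff_1024_ref(e1: bytes, e2: bytes) -> float:
    """Pure-Python reference for dice_coeff_1024 (1024-bit, 128-byte keys)."""
    assert len(e1) == 128 and len(e2) == 128
    a = int.from_bytes(e1, "big")
    b = int.from_bytes(e2, "big")
    count_both = bin(a).count("1") + bin(b).count("1")
    count_and = bin(a & b).count("1")
    return 2 * count_and / count_both
```

Running the failing bitarray inputs through both this function and the compiled `dice_coeff_1024` would show whether clang's output diverges from the straightforward computation.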

GPU integration

Optional interface with cuda.

Note we have a proof of concept for computing the DICE-Sorensen index, sorting and applying a threshold all on the GPU.

Need to consider whether to match the top-k filtering similar to our C implementation, or drop that.

Uniform Python interface to C++ library

The C++ library is currently accessed through the interface generated by Python's CFFI, which is clumsy and not idiomatic. It would be neater if we wrapped interaction with the C++ library completely in a more Pythonic way. This ought to complement issue #66, which is about avoiding copying back and forth from the Python runtime.

Aha! Link: https://csiro.aha.io/features/ANONLINK-69

Tests that currently trick greedy algorithm

Our greedy algorithm currently fails to match the following graph, where the connection between a and 1 looks likely but ultimately shouldn't be chosen.

(graph image)

The network methods should succeed, and for now the greedy algorithm should fail. Correct mapping is:
{a: 2, b: 3, c: 1}

A similar variant where the result will probably be different between the methods (same desired output):

(graph image)

This issue tracks these cases; they exist as tests marked expectedFailure.

Aha! Link: https://csiro.aha.io/features/ANONLINK-81

Improve priority queue use in C++ matching code

The use of std::priority_queue in the C++ matching code (match_one_against_many_dice_k_top) is the main bottleneck when k is large. There are a few deficiencies that may be negatively impacting performance. For example:

  1. Extensive use of priority_queue::pop() in loops as though it were O(1) when it is in fact O(log n).
  2. AoS memory layout of the queue elements (struct Node) when SoA would probably be better.
  3. Possibly unnecessary maintenance of the queue invariant when k is small (a qsort at the end might do).

Some potentially useful resources:

Aha! Link: https://csiro.aha.io/features/ANONLINK-70
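For intuition, the bounded-queue idea (never let the heap grow past k, so each update is O(log k) rather than O(log n)) can be sketched in Python with heapq. This illustrates the technique only; it is not the C++ code:

```python
import heapq

def top_k(scores, k):
    """Return the k largest scores, descending, using a size-bounded min-heap."""
    heap = []  # min-heap holding the current best k scores
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # Replace the smallest of the current best k in one O(log k) step.
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)
```

Because the heap is capped at k elements, the cost per candidate stays small even when the candidate stream is huge.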

[CLOSED] Parallel computation for greedy method

Issue by tho802
Monday Jun 20, 2016 at 02:49 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/8


At the moment the greedy solver is entirely CPU bound - and uses just one core.
It would be possible to break the task into parallel chunks.

First calculate plain bitcounts for all bloomfilters in filters1 and filters2.

Then solve chunks of filters1 in parallel:

  • Each thread/process has a chunk of bloomfilters from filters1 (e.g. 500 filters)
  • read only access to all of filters2 (in shared memory)
  • Each process computes the top k indices and scores for every filter from filters1

Then when all jobs complete the mapping is easily assembled:

        mappings = {}
        matched_entries_b = set()
        for possible_index_b, score in zip(indices, scores):
            if possible_index_b not in matched_entries_b and score > threshold:
                mappings[index_a] = possible_index_b
                matched_entries_b.add(possible_index_b)
                break
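The chunking scheme above can be sketched as follows. The names `toy_score` and `chunked_top_k` are hypothetical stand-ins; the real implementation would call the C similarity kernel and would use processes with shared memory rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(a: int, b: int) -> float:
    """Stand-in Dice score on integer bitmasks (illustrative only)."""
    both = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / both if both else 0.0

def chunked_top_k(filters1, filters2, k=2, chunk_size=500, workers=4):
    """Score chunks of filters1 against all of filters2 in parallel."""
    def process(chunk):
        out = []
        for f in chunk:
            # Rank every filters2 index by similarity to this filter.
            ranked = sorted(range(len(filters2)),
                            key=lambda j: toy_score(f, filters2[j]),
                            reverse=True)
            out.append(ranked[:k])
        return out

    chunks = [filters1[i:i + chunk_size]
              for i in range(0, len(filters1), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process, chunks)
    # Reassemble per-chunk results in order, ready for the greedy mapping step.
    return [row for part in results for row in part]
```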

Source distribution should include C source

Ideally we can create a source release of anonlink with:

python setup.py sdist

And another user (possibly on another version of Python) should be able to install using the created source distribution.

At the moment a .tar.gz gets created with the Python code but not the C/C++ sources.

bit shuffling

Issue by tho802
Tuesday Aug 16, 2016 at 03:43 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/18


The de-serialisation of bitarray instances into char * actually takes some time. We currently do it with:

    carr2 = ffi.new("char[{}]".format(128 * length_f2),
                        bytes([b for f in filters2 for b in f[0].tobytes()]))

    c_popcounts = ffi.new("uint32_t[{}]".format(length_f2), [f[2] for f in filters2])

It might be worth looking at the internal C representation of bitarray objects, and passing that. One of the features of the bitarray library is:

Packing and unpacking to other binary data formats, e.g. numpy.ndarray, is possible

So we might want to benchmark pack/unpack versus our current iteration approach. Or we could probably just pass the bitarray.buffer_info() directly to our C library.

Returns a tuple (address, size, endianness, unused, allocated) giving the current memory address, the size (in bytes) used to hold the bitarray’s contents, the bit endianness as a string, the number of unused bits (e.g. a bitarray of length 11 will have a buffer size of 2 bytes and 5 unused bits), and the size (in bytes) of the allocated memory.
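Independent of the zero-copy route, the byte-flattening itself can be done without the per-byte Python loop. A small sketch, with the `(bitarray, ...)` tuples of the snippet above stood in by plain bytes:

```python
def flatten_slow(filters):
    """Per-byte list comprehension, as in the current code."""
    return bytes([b for f in filters for b in f[0]])

def flatten_fast(filters):
    """join concatenates the underlying buffers directly."""
    return b"".join(f[0] for f in filters)
```

Both produce identical output; the join version avoids materialising every byte as a Python int first.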

Test suite does not declare expected failures correctly

Running the test suite currently raises a RuntimeWarning:

/usr/lib/python3.6/unittest/case.py:550: RuntimeWarning: TestResult has no addExpectedFailure method, reporting as passes RuntimeWarning)

This is probably because the suite uses the Python @unittest.expectedFailure decorator instead of the Nose @raises decorator in various places (reasoning from here).

Misleading or confusing aspects of anonlink.benchmark

Following the README, we run the benchmark like so:

$ python3 -m anonlink.benchmark
100000 x 1024 bit popcounts in 0.019525 seconds
Popcount speed: 625.18 MiB/s
Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.118s    |         8.478
  2000 |   2000 |      4000000 |    0.239s    |        16.707
  3000 |   3000 |      9000000 |    0.495s    |        18.171
  4000 |   4000 |     16000000 |    0.848s    |        18.874
  5000 |   5000 |     25000000 |    0.942s    |        26.528
  6000 |   6000 |     36000000 |    1.032s    |        34.898
  7000 |   7000 |     49000000 |    1.041s    |        47.049
  8000 |   8000 |     64000000 |    1.107s    |        57.789
  9000 |   9000 |     81000000 |    1.137s    |        71.229
 10000 |  10000 |    100000000 |    1.173s    |        85.245
 20000 |  20000 |    400000000 |    3.641s    |       109.867
Single Core:
  5000 |   5000 |     25000000 |    0.545s    |        45.878

There are three problems here:

  1. The reported popcount throughput is for the count() method in Python's bitarrays, not the C implementation (which, FWIW, I've independently measured to be in the order of 10GiB/s). We use the C implementation exclusively except for testing AFAICT.
  2. The table of comparisons per second is measuring calculations of similarity matrices via the function distributed_processing.calculate_filter_similarity with defaults k=10 and threshold=0.5 (these defaults and the actual work are delegated to entitymatch.calculate_filter_similarity). In contrast, the "Single Core" line uses the function entitymatch.calculate_mapping_greedy which calculates a single similarity matrix AND does a greedy solve with defaults k=5 and threshold=0.95. These two situations are obviously completely incomparable, but are presented in the same table without explanation.
  3. Following on from the previous point, the line for 5000x5000 in the main table is always way slower than the line for single core, which makes no sense if the main table is "distributed"; it must be a consequence of the fact that the k and threshold parameters have been changed.

[CLOSED] OSX build

Issue by tho802
Tuesday Oct 25, 2016 at 00:20 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/20


Currently Jenkins is failing to build anonlink on OSX. This could just be that the McNode on Jenkins isn't set up with Xcode etc. for compiling, or it could be that the build doesn't currently work for OSX. Either way we should get it working and document what is required to build on Mac.

Popcount function is under performing

In branch hlaw-fix-issue56 the raw popcount function performance is reported to be about 12 GiB/s (on my machine), however (essentially) the same code performs at 18 GiB/s when called from a basic wrapper function thus:

// -*- compile-command: "g++ -ggdb -Wall -Wextra -std=c++11 -O3 -mssse3 -mpopcnt -fvisibility=hidden -o bench bench.cc dice_one_against_many.cpp" -*-

#include <iostream>
#include <cstdint>
#include <cstdlib>   // for atol
using namespace std;

double popcount_1024_array(const char *many, int n, uint32_t *counts_many);

int main(int argc, char *argv[]) {
    static constexpr int KEYWORDS = 16;
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    // NB: buf is deliberately left uninitialised; its contents don't affect popcount timing
    uint64_t *buf = new uint64_t[KEYWORDS * n];
    uint32_t *counts = new uint32_t[n];
    long nbytes = n*KEYWORDS*sizeof(uint64_t);
    double t = popcount_1024_array(reinterpret_cast<const char *>(buf), n, counts);
    cout << "Time: " << t << "ms => " << (n/1e6) * (1e3/t) << " 1e6 popc/s => "
         <<  (nbytes / (double)(1 << 20)) * (1e3/t) << " MiB/s" << endl;
    delete[] counts;
    delete[] buf;
    return 0;
}

Aha! Link: https://csiro.aha.io/features/ANONLINK-71

Update test suite to use py.test.

The test suite has some vestiges of its former dependence on the nose testing framework (e.g. in the README file). These should be removed and the tests updated to use py.test idioms.

This issue should resolve #62.
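A minimal sketch of the migration: nose's helper-based assertions become plain `assert` statements, which py.test rewrites to give rich failure messages. The `popcount` helper below is hypothetical, included only so the example is self-contained.

```python
# Nose style (before):
#     from nose.tools import assert_equal
#     assert_equal(popcount("1010"), 2)

# py.test style (after): plain asserts in a test_* function,
# picked up automatically by test discovery.
def popcount(bits: str) -> int:
    # hypothetical helper, only so the example runs standalone
    return bits.count("1")

def test_popcount():
    assert popcount("1010") == 2
    assert popcount("0000") == 0
```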

[CLOSED] Positional q-grams

Issue by tho802
Friday Aug 05, 2016 at 02:02 GMT
Originally opened as https://github.csiro.au/magic/AnonymousLinking/issues/15


Positional q-grams should give more accuracy for numerical identifiers such as zip code, house number, date of birth.

With positional q-grams the locations of the q-grams within the string are recorded as additional information. For example, when using the position of the first character of each bi-gram as locational label, the positional bigram set for SMITH is { S(0), SM(1), MI(2), IT(3), TH(4), H(5) }. Positional q-grams are considered equal iff both the q-grams and the locations are equal.

The only part that isn't clear to me from the reading is how to hash positional uni-grams (or bi-grams) into CLKs.
At the moment we transform 1987 into the unigram list [1, 9, 8, 7]; to encode the position I plan to
prepend the index: ["1 1", "2 9", "3 8", "4 7"]. This seems reasonable to me and would let us carry on using the same hashing.
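The plan above can be sketched in a few lines; `positional_unigrams` is a hypothetical name, not an existing anonlink/clkhash function:

```python
def positional_unigrams(value):
    """Prepend the 1-based position to each unigram,
    e.g. 1987 -> ["1 1", "2 9", "3 8", "4 7"]."""
    return ["{} {}".format(i, c) for i, c in enumerate(str(value), start=1)]
```

Each positional unigram is then just another token, so the existing hashing scheme can be reused unchanged.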

Remove dependency on clkhash

Currently a few e2e tests rely on clkhash for generating the Bloom filters.

While it is nice to have an integration test between the related libraries, we shouldn't introduce clkhash as a dependency just for the integration test.

The CLKs are also used in the benchmarking script; comment from @wilko77:

none of the benchmarks actually needs real CLKs, as all you measure is timing. You could easily fake CLKs by choosing them randomly. This would not require a clkhash dependency.
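A sketch of that suggestion (the function name and the bit-list representation are illustrative, not anonlink API; the real benchmark would pack these bits into a bitarray):

```python
import random

def random_clk(length=1024, density=0.5, seed=None):
    # Fake CLK: for timing purposes a random bit vector is as good as a
    # real one, so the clkhash dependency can be dropped from the benchmark.
    rng = random.Random(seed)
    return [1 if rng.random() < density else 0 for _ in range(length)]
```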

Similarity reported as nan for all zero arrays

>>> from bitarray import bitarray
>>> from anonlink import bloommatcher as bm
>>> print(bm.dicecoeff(bitarray('0'*1024), bitarray('0'*1024)))
nan

This behaviour is the same for both clang and gcc builds.
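The nan comes from the Dice denominator: the coefficient is 2|A∩B| / (|A|+|B|), which is 0/0 when both filters are all-zero. A guarded version might look like the following sketch (not anonlink's actual fix; the return value for the empty/empty case is a convention that needs to be chosen):

```python
def dice_coefficient(a, b):
    # a, b: equal-length sequences of 0/1 bits
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # convention choice: could equally be defined as 1.0
    inter = sum(x & y for x, y in zip(a, b))
    return 2.0 * inter / total
```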
