Comments (11)
A couple of random posts on StackOverflow suggest that using different seeds results in independent hash functions (i.e. initialise your hash function with each seed, then hash the bigram as usual). The two parties in the protocol will have to agree on the sequence (set + ordering) of seeds. It's not clear whether the seeds need to be secret, or whether they can be chosen to be a fixed set of values (e.g. 0, ..., k-1).
from anonlink.
My first (brief) read through indicates that yes it is applicable. It looks like the slightly dubious double hash is actually introducing the weakness that they are able to exploit.
We need to look into this further. It referenced Cryptanalysis of Basic Bloom Filters Used for
Privacy Preserving Record Linkage and mentioned it might propose a solution.
from anonlink.
Some good news, one of the proposed solutions is to use multiple identifiers in the same bloom filter (which we already do):
Regarding the double hash the same paper strongly suggests using k independent hash functions:
If we use a single strong hash with a counter up to k used as salt would that meet that criteria?
I'm thinking of just using sha256
instead of the linear combination of sha1
and md5
(still HMAC'd). The code is now in clkhash
Thoughts? @wilko77 @mhsiah @gusmith @sjhardy
from anonlink.
Slightly different version of "Cryptanalysis of basic Bloom filters used for Privacy
Preserving Record Linkage" here
Question still stands.
from anonlink.
Warning I feel like I'm making up crypto here... would like help.
The problem is to hash a token m
into k
integer locations (% l
). We also have a shared key
to use.
Here is my current proposal:
for m in mlist:
hm = hmac.new(key.encode(), digestmod=sha256)
hm.update(m.encode())
for i in range(k):
hm.update(i.to_bytes(8, 'big'))
mi_index = int.from_bytes(hm.digest(), 'big') % l
bf[mi_index] = 1
from anonlink.
Notice: this is just thoughts, without any proofs.
We need to be sure that the used hash is at least known to be secure against known-plaintext attack.
In fact, with the current version, we will hash the following (represented in bytes):
${secret}0
${secret}01
${secret}012
- ...
So the attacker gets k
ciphertexts of plaintexts from which he knows the suffix, and he knows that the prefix is always the same.
If the hashing method is secure against chosen-plaintext attack, I think we wouldn't have any problem.
from anonlink.
I need to brush up on this stuff, but one potentially useful observation occurs to me: A hash h := SHA256(m) is 256 bits long, so for a filter of length L, perhaps we can use the first lg(L)~10 bits as the first hash function, the second lg(L) bits for the second hash function, and so on. I'm pretty sure substrings of hashes should be (linearly) independent of one another. This would allow us to get ~25 independent hash functions for the price of one actual call to SHA256.
from anonlink.
There is also the idea of using XOR-folding
from anonlink.
Weaknesses in our approach that can be exploited (as described in those attack papers):
- When creating the bi-grams, the first and last bi-gram are padded with a whitespace. This is a weakness, because it allows an attacker more easily to find the beginning and the end of a word.
=> We should investigate if dropping the padding decreases matching accuracy. - The double hashing scheme. One consequence of choosing this scheme is that all bits belonging to a particular n-gram will follow a sequence
b_{n+1} = b_{n} + hash_2 mod l
. This structure can be exploited when looking for "atoms" (the bloom filter generated by one n-gram).
If we use k independent hash functions instead, then there will be no particular structure in an atom.
Also, there is a flaw in the double hashing scheme: ifhash_2 mod l = 0
, then only one bit will be set in the bloom filter irrespective of the value k (the probability of this happening is 1/l = 1/1024 !!!).
from anonlink.
I think we might be able to wrap this issue up. Opening follow up issues to track any loose threads.
- I've created issue #54 to look into weakness with hashing whitespare at the start/end of features.
- Issue data61/clkhash#33 looks into the double hashing flaw
- clkhash has been modified to use KDF to ensure independant hashes are used.
- Issue data61/clkhash#32 discussing looking into xor-folding
- I'll delete my experimental branch with an alternative to the double hash scheme
@wilko77 can you look over the above and make sure I've captured everything before closing?
from anonlink.
There are two separate issues with independence:
- We reused the same keys for every identifier. This means, that an n-gram, let's say
bo
, in the identifierfirst name
gets mapped to the same bit position in the Bloom filter as the n-grambo
inlast name
. Remedy here is to generate different keys for each identifier. This has been addressed in data61/clkhash#23. - As pointed out in the 'Who is 1011...' paper, the double hash encoding scheme of Schnell et. al. can be exploited to mount an attack. There is a proposal to address this in data61/clkhash#40.
from anonlink.
Related Issues (20)
- codecov reports from Azure Dev-Ops are Untitled HOT 2
- Failed test test_cumul_number_matches_vs_threshold HOT 2
- Wheels not published for each platform HOT 4
- Build wheels for each version of Python HOT 1
- Flaky test: test_bytes_bitarray_agree HOT 1
- Flaky test: test_similarities_hist
- Use pypi version of cibuildwheel once macos support for binary wheels is fixed HOT 6
- Start to remove 2 party special case from anonlink
- Anonlink support on MacOS HOT 3
- Use upstream version of bitarray if/when they publish wheels
- Install from PyPi wheel failing with syntax errors for _multiparty_solving_inner.cpp HOT 3
- Upgrade to latest release of clkhash
- Remove clkhash dependency
- Remove duplicate candidate pairs from streams HOT 1
- Dependabot couldn't authenticate with https://pypi.python.org/simple/
- update ubuntu image in CI
- anonlink.benchmark documentation clarity HOT 2
- Broken link in readme
- linux/arm64 wheel errors out with type error: ValueError: Buffer dtype mismatch, expected 'const char' but got 'signed char'
- ValueError: Buffer dtype mismatch when running anonlink.candidate_generation.find_candidate_pairs on AWS Glue HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from anonlink.