Comments (7)
This is a bit off topic, and a bit related to the second point above; included FWIW. The distributed_processing dispatch makes a funny choice for the size of the chunks to send to the worker processes: `chunk_size = int(20000000 / len(filters2))`. This gives an ever-decreasing chunk size for larger and larger data; it's not clear to me why that should be a good choice. I have 12 cores available, and when I instead allocate 1/12 of the data to each process, i.e.
--- a/anonlink/distributed_processing.py
+++ b/anonlink/distributed_processing.py
@@ -35,6 +35,6 @@ def calculate_filter_similarity(filters1, filters2, k=10, threshold=0.5):
     results = []
-    chunk_size = int(20000000 / len(filters2))
+    chunk_size = int((len(filters2) + 11) / 12)
with concurrent.futures.ProcessPoolExecutor() as executor:
futures = []
(NB, ceiling(n, d) = (n + d - 1) / d with truncated division), then I get the best results I've seen by far:
Size 1 | Size 2 | Comparisons | Compute Time | Million Comparisons per second
1000 | 1000 | 1000000 | 0.082s | 12.231
2000 | 2000 | 4000000 | 0.151s | 26.489
3000 | 3000 | 9000000 | 0.221s | 40.653
4000 | 4000 | 16000000 | 0.307s | 52.118
5000 | 5000 | 25000000 | 0.372s | 67.261
6000 | 6000 | 36000000 | 0.419s | 86.008
7000 | 7000 | 49000000 | 0.525s | 93.272
8000 | 8000 | 64000000 | 0.618s | 103.493
9000 | 9000 | 81000000 | 0.692s | 117.085
10000 | 10000 | 100000000 | 0.817s | 122.385
20000 | 20000 | 400000000 | 2.310s | 173.144
Single Core:
5000 | 5000 | 25000000 | 0.325s | 77.007
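For concreteness, the ceiling-division chunking above can be sketched like this (the helper names are mine, not anonlink's):

```python
def ceil_div(n, d):
    # ceiling(n / d) via truncated integer division, as in the NB above
    return (n + d - 1) // d

def chunks(items, n_workers):
    """Split items into at most n_workers roughly equal chunks."""
    size = ceil_div(len(items), n_workers)
    return [items[i:i + size] for i in range(0, len(items), size)]

# 10 items over 3 workers -> chunk sizes 4, 4, 2
parts = chunks(list(range(10)), 3)
```

Unlike the constant-numerator formula, this keeps the number of chunks pinned to the worker count as the data grows.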
from anonlink.
Something to keep in mind: although the benchmark only shows runs up to 400M comparisons, what we really care about is the feasibility and speed of much larger comparisons. So the ability to saturate the number of comparisons per second that a system can do should be the primary concern.
All the points you've identified look like valid issues to fix. Is it really surprising that for a small matching job a single core is better than distributing the work across multiple cores? Although I agree the `k` and `threshold` should be the same! Regarding the smaller chunks for larger datasets: I agree, that seems a mistake.
Fair point about the cost associated with executing a small job across multiple cores potentially being dominated by distribution and synchronisation overhead. I think that's a good reason to make a clearer distinction in the benchmarking code between single-core performance and distributed performance, since there are heaps of tricky parameters that come into play in the latter scenario and we want to make sure we know when we're measuring them and when we aren't.
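A minimal sketch of that separation, using best-of-N timing to damp scheduling noise (the entry points in the commented lines are hypothetical; anonlink's real API may differ):

```python
import time

def _timed(fn, *args):
    # Wall-clock time of a single call.
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0

def bench(label, fn, *args, repeats=3):
    # Best-of-N reduces noise in both single-core and distributed modes.
    return label, min(_timed(fn, *args) for _ in range(repeats))

# Hypothetical usage -- the real anonlink entry points may be named differently:
# print(bench("single-core", similarity_scores, filters1, filters2))
# print(bench("distributed", calculate_filter_similarity, filters1, filters2))
```

Keeping the two modes behind the same harness but separate labels makes it obvious which numbers include distribution overhead.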
The benchmarking code should (perhaps optionally) also output the following information:
- CPU make and model
- OS name and version and kernel version
- Version numbers for:
- Python
- GCC or Clang used to compile the C++ library
- anonlink itself!
This information should be included to help when comparing benchmark results over time.
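For what it's worth, the portable standard-library route looks like this (a sketch; as the follow-up notes, the CPU and OS fields it yields are coarse):

```python
import platform
import sys

def benchmark_environment():
    # Coarse but portable environment report, standard library only.
    return {
        "cpu": platform.processor() or platform.machine(),
        "os": f"{platform.system()} {platform.release()} ({platform.version()})",
        "python": sys.version.split()[0],
    }
```

The compiler used to build the C++ extension is notably absent; nothing here exposes it.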
Unfortunately it is not at all clear how to extract most of this information from within the code.
- The portable way of obtaining the CPU and OS information is way too coarse.
- There does not appear to be a way to extract the compile command from Python's FFI module.
- (Getting Python and anonlink versions is easy though, of course.)
Perhaps this means that this information should be acquired and reported by a script that runs the benchmarking suite, rather than from the benchmark programme itself.
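A wrapper script along those lines might probe external tools and degrade gracefully when they're absent (a Linux-leaning sketch; `cc` and `lscpu` are assumptions about the host, not anonlink requirements):

```python
import shutil
import subprocess

def tool_version(cmd):
    """First line of `cmd`'s output, or None if the tool is unavailable."""
    if shutil.which(cmd[0]) is None:
        return None
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else None

def environment_report():
    # Gathered by the wrapper script, outside the benchmark process itself.
    return {
        "compiler": tool_version(["cc", "--version"]),
        "cpu": tool_version(["lscpu"]),
    }
```

Returning None rather than raising keeps the report usable on hosts missing a tool.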
So, I think benchmarking should avoid consideration of the values of `k` and `threshold` as much as possible. Reason being that those parameters change the performance of the matching in accordance with various statistical properties of the data, rather than through the inherent performance of the algorithm.
For example, as currently implemented, the benchmark is comparing random CLKs, for which the value of `threshold` either makes everything match or nothing match (except for a small range around 0.5 where it interpolates between the two cases). This in turn pushes all the computation either largely onto the priority-queue insertions or exclusively onto the Dice coefficient calculation. (The latter is what this benchmark should be testing, but it shouldn't be an accidental consequence of the choice of threshold; rather, it should be set up explicitly to test that.)
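A quick sketch illustrates why: random bit vectors (stand-ins for CLKs; this is not anonlink code) produce Dice scores clustered tightly around 0.5, so a threshold much away from 0.5 matches either everything or nothing:

```python
import random
import statistics

def popcount(x):
    return bin(x).count("1")

def dice(a, b):
    # Sorensen-Dice coefficient of two bit vectors stored as Python ints.
    return 2 * popcount(a & b) / (popcount(a) + popcount(b))

random.seed(0)
NBITS = 1024
clks = [random.getrandbits(NBITS) for _ in range(200)]
scores = [dice(clks[i], clks[i + 1]) for i in range(0, 200, 2)]
mean_score = statistics.mean(scores)  # clusters tightly around 0.5
```

Real CLKs derived from overlapping records would spread these scores out, which is why the statistical properties of the inputs matter.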
Testing how performance varies as `k` and `threshold` vary is (probably) important, but it should take explicit consideration of the statistical properties of the CLKs used.
Resolved by #58.