Comments (7)

unzvfu commented on June 14, 2024

This is a bit off topic and somewhat related to the second point above; included FWIW. The distributed_processing dispatch makes a funny choice for the size of the chunks it sends to the worker processes: chunk_size = int(20000000 / len(filters2)). This gives an ever-decreasing chunk size as the data gets larger, and it's not clear to me why that should be a good choice. I have 12 cores available, and when I allocate 1/12 of the data to each process,

--- a/anonlink/distributed_processing.py
+++ b/anonlink/distributed_processing.py
@@ -35,6 +35,7 @@ def calculate_filter_similarity(filters1, filters2, k=10, threshold=0.5):
     results = []
     chunk_size = int(20000000 / len(filters2))
 
+    chunk_size = int((len(filters2) + 11) / 12)
     with concurrent.futures.ProcessPoolExecutor() as executor:
         futures = []

(NB: ceil(n / d) = (n + d - 1) / d under truncated integer division), then I get the best results I've seen by far:

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.082s    |        12.231
  2000 |   2000 |      4000000 |    0.151s    |        26.489
  3000 |   3000 |      9000000 |    0.221s    |        40.653
  4000 |   4000 |     16000000 |    0.307s    |        52.118
  5000 |   5000 |     25000000 |    0.372s    |        67.261
  6000 |   6000 |     36000000 |    0.419s    |        86.008
  7000 |   7000 |     49000000 |    0.525s    |        93.272
  8000 |   8000 |     64000000 |    0.618s    |       103.493
  9000 |   9000 |     81000000 |    0.692s    |       117.085
 10000 |  10000 |    100000000 |    0.817s    |       122.385
 20000 |  20000 |    400000000 |    2.310s    |       173.144
Single Core:
  5000 |   5000 |     25000000 |    0.325s    |        77.007
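
The hard-coded 12 in the patch above is just my core count; the general form is one chunk per worker, rounded up. A minimal sketch, assuming os.cpu_count() is an acceptable stand-in for the actual worker-pool size:

import os

def chunk_size_for(n_rows, n_workers=None):
    """Smallest chunk size that covers n_rows in at most n_workers chunks."""
    n_workers = n_workers or os.cpu_count() or 1
    return -(-n_rows // n_workers)  # ceiling division: ceil(n_rows / n_workers)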


hardbyte commented on June 14, 2024

Something to keep in mind is that although the benchmark only shows results up to 400M comparisons, we really care about the feasibility and speed of much larger jobs. So the ability to saturate the number of comparisons per second that a system can do should be the main concern.

All the points you've identified look like valid issues to fix. Is it really surprising that, for a small matching job, a single core is better than distributing the work across multiple cores? Although I agree that k and threshold should be the same! Regarding the smaller chunks for larger datasets: I agree, that seems a mistake.


unzvfu commented on June 14, 2024

Fair point about the cost of executing a small job across multiple cores potentially being dominated by distribution and synchronisation overhead. I think that's a good reason to draw a clearer distinction in the benchmarking code between single-core performance and distributed performance: there are heaps of tricky parameters that come into play in the latter scenario, and we want to be sure we know when we're measuring them and when we aren't.
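
A minimal sketch of how the two measurements could be kept apart, so that distribution overhead never contaminates the single-core number. The distributed entry point and its signature come from the diff above; anonlink.entitymatch as the serial counterpart is an assumption about the package layout:

import time

from anonlink import distributed_processing
from anonlink import entitymatch  # hypothetical single-core entry point

def mcps(match_fn, filters1, filters2):
    """Million comparisons per second for a single matching call."""
    start = time.perf_counter()
    match_fn(filters1, filters2, k=10, threshold=0.5)
    return len(filters1) * len(filters2) / (time.perf_counter() - start) / 1e6

# Report the two numbers side by side rather than interleaved:
#   mcps(entitymatch.calculate_filter_similarity, f1, f2)             # serial
#   mcps(distributed_processing.calculate_filter_similarity, f1, f2)  # pooled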


unzvfu commented on June 14, 2024

The benchmarking code should (perhaps optionally) also output the following information:

  • CPU make and model
  • OS name and version and kernel version
  • Version numbers for:
    • Python
    • GCC or Clang used to compile the C++ library
    • anonlink itself!

Including this information would help when comparing benchmark results over time.
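
A minimal sketch of what the standard library exposes portably (assuming Python 3.8+ for importlib.metadata); as the next comment notes, the CPU field in particular tends to be too coarse:

import platform
import sys
from importlib import metadata

def benchmark_environment():
    """Collect the portably available environment details."""
    return {
        'cpu': platform.processor(),   # often coarse or empty
        'os': platform.platform(),     # OS name, version and kernel
        'python': sys.version.split()[0],
        'anonlink': metadata.version('anonlink'),
    }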


unzvfu commented on June 14, 2024

Unfortunately it is not at all clear how to extract most of this information from within the code.

  • The portable way of obtaining the CPU and OS information is way too coarse.
  • There does not appear to be a way to extract the compile command from Python's FFI module.
  • (Getting Python and anonlink versions is easy though, of course.)

Perhaps this information should instead be acquired and reported by a script that runs the benchmarking suite, rather than by the benchmark programme itself.
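
On Linux such a wrapper could read /proc/cpuinfo and shell out to uname for the detail that platform.processor() lacks; a sketch, with the benchmark entry point name as a placeholder:

import subprocess
import sys

def cpu_model():
    """Read the CPU model name from /proc/cpuinfo (Linux-specific)."""
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('model name'):
                return line.split(':', 1)[1].strip()
    return 'unknown'

if __name__ == '__main__':
    print('CPU:', cpu_model())
    print('Kernel:', subprocess.check_output(['uname', '-sr'], text=True).strip())
    # placeholder invocation; substitute however the suite is actually run
    subprocess.run([sys.executable, '-m', 'anonlink.benchmark'], check=True)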


unzvfu commented on June 14, 2024

So, I think benchmarking should avoid depending on the values of k and threshold as much as possible. The reason is that those parameters change the performance of the matching according to statistical properties of the data, rather than through the inherent performance of the algorithm.

For example, as currently implemented, the benchmark compares random CLKs, for which the value of threshold either makes everything match or nothing match (except for a small range around 0.5 where it interpolates between the two cases). This in turn pushes all the computation either largely onto the priority-queue insertions or exclusively onto the Dice coefficient calculation, respectively. (The latter is what this benchmark should be testing, but that shouldn't be an accidental consequence of the choice of threshold; it should be set up explicitly to test that.)

Testing how performance varies with k and threshold is (probably) important, but it should explicitly account for the statistical properties of the CLKs used.
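
For concreteness, here is a small standalone illustration (not anonlink's own code) of that step-function behaviour: Dice scores between independent uniform random CLKs concentrate tightly around 0.5, so any threshold well below 0.5 matches every pair and any threshold well above it matches none:

import random

def dice(a, b):
    """Dice coefficient of two CLKs represented as sets of set-bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))

random.seed(0)
n_bits = 1024
clks = [{i for i in range(n_bits) if random.random() < 0.5} for _ in range(100)]

scores = [dice(clks[i], clks[j])
          for i in range(len(clks)) for j in range(i + 1, len(clks))]
print(f'min={min(scores):.3f} max={max(scores):.3f}')  # both land near 0.5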


unzvfu commented on June 14, 2024

Resolved by #58.

