Comments (7)

unzvfu commented on June 14, 2024

This is a bit off topic and somewhat related to the second point above; included FWIW. The distributed_processing dispatch makes a funny choice for the size of the chunks it sends to the worker processes: chunk_size = int(20000000 / len(filters2)). This gives an ever-decreasing chunk size as the data gets larger, and it's not clear to me why that should be a good choice. I have 12 cores available, and when I allocate 1/12 of the data to each process,

--- a/anonlink/distributed_processing.py
+++ b/anonlink/distributed_processing.py
@@ -35,6 +35,7 @@ def calculate_filter_similarity(filters1, filters2, k=10, threshold=0.5):
     results = []
     chunk_size = int(20000000 / len(filters2))
 
+    chunk_size = int((len(filters2) + 11) / 12)
     with concurrent.futures.ProcessPoolExecutor() as executor:
         futures = []

(NB: ceil(n / d) = (n + d - 1) / d under truncated integer division), then I get the best results I've seen by far:

Size 1 | Size 2 | Comparisons  | Compute Time | Million Comparisons per second
  1000 |   1000 |      1000000 |    0.082s    |        12.231
  2000 |   2000 |      4000000 |    0.151s    |        26.489
  3000 |   3000 |      9000000 |    0.221s    |        40.653
  4000 |   4000 |     16000000 |    0.307s    |        52.118
  5000 |   5000 |     25000000 |    0.372s    |        67.261
  6000 |   6000 |     36000000 |    0.419s    |        86.008
  7000 |   7000 |     49000000 |    0.525s    |        93.272
  8000 |   8000 |     64000000 |    0.618s    |       103.493
  9000 |   9000 |     81000000 |    0.692s    |       117.085
 10000 |  10000 |    100000000 |    0.817s    |       122.385
 20000 |  20000 |    400000000 |    2.310s    |       173.144
Single Core:
  5000 |   5000 |     25000000 |    0.325s    |        77.007
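
The hard-coded 12 in the patch above is just my core count; the general form is one chunk per worker, rounded up. A minimal sketch, assuming os.cpu_count() is an acceptable stand-in for the actual worker-pool size:

import os

def chunk_size_for(n_rows, n_workers=None):
    """Smallest chunk size that covers n_rows in at most n_workers chunks."""
    n_workers = n_workers or os.cpu_count() or 1
    return -(-n_rows // n_workers)  # ceiling division: ceil(n_rows / n_workers)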


hardbyte commented on June 14, 2024

Something to keep in mind is that although the benchmark only shows results up to 400M comparisons, we really care about the feasibility and speed of much larger jobs. So the ability to saturate the number of comparisons per second that a system can do should be the main concern.

All the points you've identified look like valid issues to fix. Is it really surprising that, for a small matching job, a single core is better than distributing the work across multiple cores? Although I agree that k and threshold should be the same! Regarding the smaller chunks for larger datasets: I agree, that seems a mistake.


unzvfu commented on June 14, 2024

Fair point about the cost of executing a small job across multiple cores potentially being dominated by distribution and synchronisation overhead. I think that's a good reason to draw a clearer distinction in the benchmarking code between single-core performance and distributed performance: there are heaps of tricky parameters that come into play in the latter scenario, and we want to be sure we know when we're measuring them and when we aren't.
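
A minimal sketch of how the two measurements could be kept apart, so that distribution overhead never contaminates the single-core number. The distributed entry point and its signature come from the diff above; anonlink.entitymatch as the serial counterpart is an assumption about the package layout:

import time

from anonlink import distributed_processing
from anonlink import entitymatch  # hypothetical single-core entry point

def mcps(match_fn, filters1, filters2):
    """Million comparisons per second for a single matching call."""
    start = time.perf_counter()
    match_fn(filters1, filters2, k=10, threshold=0.5)
    return len(filters1) * len(filters2) / (time.perf_counter() - start) / 1e6

# Report the two numbers side by side rather than interleaved:
#   mcps(entitymatch.calculate_filter_similarity, f1, f2)             # serial
#   mcps(distributed_processing.calculate_filter_similarity, f1, f2)  # pooled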


unzvfu commented on June 14, 2024

The benchmarking code should (perhaps optionally) also output the following information:

  • CPU make and model
  • OS name and version and kernel version
  • Version numbers for:
    • Python
    • GCC or Clang used to compile the C++ library
    • anonlink itself!

Including this information would help when comparing benchmark results over time.
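
A minimal sketch of what the standard library exposes portably (assuming Python 3.8+ for importlib.metadata); as the next comment notes, the CPU field in particular tends to be too coarse:

import platform
import sys
from importlib import metadata

def benchmark_environment():
    """Collect the portably available environment details."""
    return {
        'cpu': platform.processor(),   # often coarse or empty
        'os': platform.platform(),     # OS name, version and kernel
        'python': sys.version.split()[0],
        'anonlink': metadata.version('anonlink'),
    }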


unzvfu commented on June 14, 2024

Unfortunately it is not at all clear how to extract most of this information from within the code.

  • The portable way of obtaining the CPU and OS information is way too coarse.
  • There does not appear to be a way to extract the compile command from Python's FFI module.
  • (Getting Python and anonlink versions is easy though, of course.)

Perhaps this information should instead be acquired and reported by a script that runs the benchmarking suite, rather than by the benchmark programme itself.
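
On Linux such a wrapper could read /proc/cpuinfo and shell out to uname for the detail that platform.processor() lacks; a sketch, with the benchmark entry point name as a placeholder:

import subprocess
import sys

def cpu_model():
    """Read the CPU model name from /proc/cpuinfo (Linux-specific)."""
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('model name'):
                return line.split(':', 1)[1].strip()
    return 'unknown'

if __name__ == '__main__':
    print('CPU:', cpu_model())
    print('Kernel:', subprocess.check_output(['uname', '-sr'], text=True).strip())
    # placeholder invocation; substitute however the suite is actually run
    subprocess.run([sys.executable, '-m', 'anonlink.benchmark'], check=True)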


unzvfu commented on June 14, 2024

So, I think benchmarking should avoid depending on the values of k and threshold as much as possible. The reason is that those parameters change the performance of the matching according to statistical properties of the data, rather than through the inherent performance of the algorithm.

For example, as currently implemented, the benchmark compares random CLKs, for which the value of threshold either makes everything match or nothing match (except for a small range around 0.5 where it interpolates between the two cases). This in turn pushes all the computation either largely onto the priority-queue insertions or exclusively onto the Dice coefficient calculation, respectively. (The latter is what this benchmark should be testing, but that shouldn't be an accidental consequence of the choice of threshold; it should be set up explicitly to test that.)

Testing how performance varies with k and threshold is (probably) important, but it should explicitly account for the statistical properties of the CLKs used.
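
For concreteness, here is a small standalone illustration (not anonlink's own code) of that step-function behaviour: Dice scores between independent uniform random CLKs concentrate tightly around 0.5, so any threshold well below 0.5 matches every pair and any threshold well above it matches none:

import random

def dice(a, b):
    """Dice coefficient of two CLKs represented as sets of set-bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))

random.seed(0)
n_bits = 1024
clks = [{i for i in range(n_bits) if random.random() < 0.5} for _ in range(100)]

scores = [dice(clks[i], clks[j])
          for i in range(len(clks)) for j in range(i + 1, len(clks))]
print(f'min={min(scores):.3f} max={max(scores):.3f}')  # both land near 0.5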


unzvfu commented on June 14, 2024

Resolved by #58.

