
Comments (3)

davidsantiago commented on August 25, 2024

So, the first two of these are things I originally thought were good ideas, but I eventually came to believe (before I finished writing perforate) that they were actually not useful toward the goal of providing an accurate performance picture. The third I had not thought of, but I think it is probably also problematic.

I think the problem with recording benchmark results is that it's very difficult to run tests in the same environment over time. System software gets updated, background processes run at random and consume resources that affect your process, and modern CPUs scale their clock speed up and down in response to load. Eventually, you will have to upgrade or replace the machine, rendering all your historical data incomparable. This actually happened to me: I unexpectedly needed to replace a machine I had been using to develop a library for which I'd been keeping informal historical performance numbers, and the numbers on the new machine were totally different. I came to the conclusion that the best policy for monitoring performance is to compare different versions of the software at the same time, to keep the environment as consistent as possible.

This is why perforate does all that stuff with environments: so you can compare the software at specific, known versions against each other, as close as possible to the same time. I find that these numbers can fluctuate a surprising amount, but unless the two versions are nearly identical in performance, their relative performance will be pretty consistent. That said, if you look in the perforate.core namespace, you should see that the main functions are public by design, so if you wish to write some harness to run the tests and append them to a CSV file or throw them in a database or whatnot, it should be about as easy to do as you suggest in your paragraph. It's really hard to think of a one-size-fits-all solution for this, even if you wish to do performance testing this way despite the pitfalls.
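For what it's worth, a minimal sketch of that kind of harness might look like the following. To avoid assuming anything about perforate's internals, it calls Criterium's `quick-benchmark*` directly; the file name, label, and CSV layout are placeholders, and it assumes Criterium's result map reports the mean as `[estimate [lower upper]]`:

```clojure
(ns bench.history
  (:require [criterium.core :as crit]
            [clojure.string :as str]))

;; Run a quick benchmark of a thunk and append one CSV row
;; (timestamp, label, mean seconds) to a history file.
(defn bench-to-csv!
  [csv-path label f]
  (let [result   (crit/quick-benchmark* f {})
        mean-sec (first (:mean result)) ; assumed shape: [estimate [lower upper]]
        row      (str/join "," [(str (java.time.Instant/now)) label mean-sec])]
    (spit csv-path (str row "\n") :append true)))

;; Example (hypothetical workload):
;; (bench-to-csv! "bench-history.csv" "sort-1k" #(sort (repeatedly 1000 rand)))
```

Whether rows like that stay comparable over months is exactly the environment problem described above, so treat the history as a rough trend at best.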

Similarly, reusing unit tests as benchmarks is something I considered, and I think there are problems with that as well. On the one hand, it would be nice to reuse the code for both pieces of functionality. And benchmarks are useless if they're not producing correct results, so they could be useful as unit tests as well. I think the problem is just that most unit tests don't make very good benchmarks, and vice versa. Unit tests try to capture small bits of functionality and show that they are correct, while benchmarks try to simulate realistic, high-level workloads and in many cases cannot check their results for correctness at all. A benchmark might also have to loop or create additional work for itself so the timing routines can be accurate. This is not something I'm opposed to; I am just skeptical that it would be that useful, and I have never really wanted to benchmark one of my tests. So I don't have the bandwidth to explore that functionality.
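To make that contrast concrete, here is a small illustration (the sorting example and input sizes are invented for illustration, not taken from the thread): the unit test asserts an exact answer on a tiny input, while the benchmark feeds a realistic workload to Criterium and checks nothing.

```clojure
(ns example.sort-test
  (:require [clojure.test :refer [deftest is]]
            [criterium.core :as crit]))

;; Unit test: small fixed input, asserts the exact result.
(deftest sort-is-correct
  (is (= [1 2 3] (sort [3 1 2]))))

;; Benchmark: realistic workload, no assertion; Criterium handles the
;; warmup and repeated execution needed for accurate timing.
(comment
  (let [data (vec (repeatedly 100000 rand))]
    (crit/quick-bench (sort data))))
```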

Finally, I think distributing the benchmarks onto cloud nodes would not be helpful either. The problem is that it takes the issues I mentioned earlier about comparing in the same environment and amplifies them. Cloud instances, despite being of the same type, are quite possibly running on different CPUs and different systems that the cloud provider has judged to be of roughly equivalent power. You can't count on getting an "identical" environment between runs, or even within the same run if you are distributing the tests. Even then, your I/O and CPU capability can be affected by how noisy the neighbors sharing the hardware with you are. Criterium and Perforate are intended to help you get a picture of whether some change you are considering to your code is 10% faster or slower, and my worry is that for all the work you'd have to do to distribute the tests that way, it might all be counterproductive in the end. So, again, this seems difficult, and I'm not sure it would even help.

However, the problem that "the benchmarks take a long time to run" is definitely a real one. I feel your pain there. Criterium is not helping in this regard, because it runs tests for specific time periods that can seem almost ridiculously long, even when run with the "quick" settings. @hugoduncan and I have talked about making Criterium use adaptive sampling, where it would run the benchmark with a very low sample rate, run the numbers, double the sampling rate, and check the numbers, and so on. Eventually, you would expect to see the per-iteration numbers converge to a reasonable degree and feel confident stopping the test. I think a system like this would actually do much more to make running the benchmarks less painful, since for most tests of fast functions, you could stop within a few seconds at most. Hugo takes a formal statistical approach to the way Criterium works, and I don't know if he's comfortable with the validity of this methodology yet. Nor do I really feel confident digging around and changing this code myself in Criterium, even if I had the time to do so. Sadly, it seems this sort of software is always a distraction from what you really want to be writing.
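As a rough sketch of that adaptive idea only (this is not how Criterium works today, and it deliberately ignores the JIT warmup and dead-code-elimination issues Criterium handles carefully): time a batch of calls, double the number of executions, and stop once successive per-iteration estimates agree within some tolerance.

```clojure
;; Illustrative only: the tolerance and starting counts are arbitrary, and
;; this naive loop does no warmup and makes no attempt to defeat the JIT.
(defn adaptive-estimate
  [f & {:keys [tolerance max-execs] :or {tolerance 0.02 max-execs 10000000}}]
  (letfn [(nanos-per-iter [n]
            (let [start (System/nanoTime)]
              (dotimes [_ n] (f))
              (/ (double (- (System/nanoTime) start)) n)))]
    (loop [n    200
           prev (nanos-per-iter 100)]
      (let [cur (nanos-per-iter n)
            rel (/ (Math/abs (- cur prev)) (max cur prev))]
        (if (or (< rel tolerance) (>= n max-execs))
          {:nanos-per-iteration cur :executions n}
          (recur (* 2 n) cur))))))

;; Example: (adaptive-estimate #(reduce + (range 1000)))
```

The appeal is exactly what's described above: for a fast function the estimates converge within seconds, while the statistical question is whether stopping at the first sign of convergence gives results you can trust.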


zmaril commented on August 25, 2024

Thank you for the detailed response! I'm digging through it right now, but it all seems straightforward and logical. Is the gist that "the further apart in time and space two benchmarks are performed, the less their comparison actually tells you"? If so, that's a disappointing reality. I'm still interested in distributing tests with spot instances, and I'll probably work on that instead.


davidsantiago commented on August 25, 2024

That's been my experience, yeah. Maybe there's some clever solution out there, but I certainly haven't come up with it.

