bencheeorg / benchee
Easy and extensible benchmarking in Elixir providing you with lots of statistics!
License: MIT License
In the upcoming plotly_js/HTML formatter I already happily reuse our newly created unit formatting functions (well, I don't scale to microseconds because Erlang doesn't like that - apparently it can't handle UTF-8 there - but we'll see about that) and then noticed that it'd be great to have the same formatting for percent units (e.g. the ratio of the standard deviation) as in the console formatter.
Right now it's a private function on the console formatter. We could make it public, but that feels wrong. My idea was to create a Percent (or Ratio, or some such) unit that encapsulates the formatting of this unit, much like Count and Duration do. The weird thing is that I don't see any scaling happening there; we'd just always scale to percent (for now).
I still think it sort of fits into the concept of a unit... @wasnotrice what do you think? Would love to get your input here!
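A minimal sketch of what such a unit could look like, assuming a format/1 function analogous to the ones in Count and Duration (module and function names here are assumptions, not the actual implementation):

```elixir
defmodule Benchee.Unit.Percent do
  # No scaling (for now): always format the ratio as a percentage.
  def format(ratio) when is_number(ratio) do
    "#{Float.round(ratio * 100.0, 2)}%"
  end
end

# Benchee.Unit.Percent.format(0.2357)
# #=> "23.57%"
```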
Background
As a beginner to Elixir, I wasn't quite sure what exactly the mix.exs file was, and couldn't immediately figure out that it is a file generally generated by the mix build tool. I knew for sure that I wanted to set up a benchmarking tool like benchee (I have previous experience setting up similar tools for Ruby) as soon as I got started with some small exercises to better understand Elixir, but I wasn't immediately able to leverage the insights I could get from benchee, as I didn't know what the mix.exs file was in the first place.
Update the Installation section of the current README, which currently reads:
Add benchee to your list of dependencies in mix.exs:
def deps do
[{:benchee, "~> 0.6", only: :dev}]
end
to something like:
Add benchee to your list of dependencies in mix.exs as shown below. In case you're new to Elixir and don't know what the mix.exs file is, you can read more about it here.
defp deps do
[{:benchee, "~> 0.6", only: :dev}]
end
Kindly note: it looks like deps is a private function and not a public one, from what I could make out from here and a few other places. I'm also proposing to correct that (i.e. use defp instead of def) in the PR.
Please let me know your thoughts on the above and I can submit a PR accordingly.
Thank you.
The current implementation of config has to do special handling of the print option, as whatever is configured needs to be merged in there. It'd be nice to just have a deep_merge implementation to do this, which Elixir doesn't provide at the moment, to the best of my knowledge.
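A minimal sketch of such a deep merge for nested maps (a hypothetical helper, not something Elixir or benchee ships; the option keys in the usage comment are just for illustration):

```elixir
defmodule DeepMerge do
  # Recursively merges nested maps; for non-map values the right side wins.
  def deep_merge(left, right) when is_map(left) and is_map(right) do
    Map.merge(left, right, fn
      _key, l, r when is_map(l) and is_map(r) -> deep_merge(l, r)
      _key, _l, r -> r
    end)
  end
end

# DeepMerge.deep_merge(%{print: %{comparison: true}}, %{print: %{fast_warning: false}})
# #=> %{print: %{comparison: true, fast_warning: false}}
```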
For example, display 1.23 Million instead of 1.23M.
This should be an option unit_label with possible values :short and :long, defaulting to :short. It can then be picked up by the different formatters - in the first version just the console formatter - to display long or short names.
edit (@PragTob): changed to be a general option and not just for the console
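A tiny sketch of how a unit module could expose both label styles (the module, function and atom names are assumptions for illustration):

```elixir
defmodule Labels do
  def label(:million, :short), do: "M"
  def label(:million, :long), do: "Million"
end

# :short (the default): "1.23" <> Labels.label(:million, :short)  #=> "1.23M"
# :long:                "1.23 " <> Labels.label(:million, :long)  #=> "1.23 Million"
```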
Test and make sure that benchee is also usable for more macro benchmarks, e.g. benchmarks where an individual execution takes seconds or even minutes. By usable I mean that the display still works and that it is not hard to configure.
Probably depends on #2 (and might also be solved by it)
In my recent benchmarks I noticed that it was sort of counterproductive to first record results and print them to the console, and then generate the CSV to make the graph. Of course, the results are slightly different then. One could use the more verbose API, save the statistics and then prepare both outputs, but that's not ideal. Hence I want:
a formatters: option that takes a list of functions, like formatters: [&Benchee.Formatters.Console.format/1, &Benchee.Formatters.CSV.format/1], and then runs all of them - or maybe just module names, when they all use predefined function names (see the sketch below).
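What using that could look like (a sketch; the config-first argument order follows benchee's current API as described later in this document, and the job itself is just an example):

```elixir
Benchee.run(
  %{
    formatters: [
      &Benchee.Formatters.Console.format/1,
      &Benchee.Formatters.CSV.format/1
    ]
  },
  %{"map" => fn -> Enum.map(1..1_000, &(&1 * 2)) end}
)
```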
For example, 1000000 might display as 1.00M. See #26 for the unit handling implementation.
For discussion: when removing Benchee.Time and replacing it with Benchee.Unit.Duration, I was reminded that Benchee.Duration might be nicer, and similarly Benchee.Count. Not sure how you feel about the hierarchy, @PragTob, or how you'd want to organize the files if we did flatten it out (whether you want the directories to mirror the modules).
Right now the auto scaling strategies are best, largest and smallest. So far we always use best (afaik), but it'd be nice if users could also use smallest/largest. Furthermore, if someone doesn't want any unit scaling to happen, that should also be possible via a :none option.
Right now, run_times and statistics are 2 separate entries in the benchee suite, which makes sense given that statistics needs run_times as an input and that oftentimes we don't need to worry about them afterwards.
For some plugins (like json and html), run_times are convenient, and it's a hassle to always grab the run times belonging to the statistics you currently want to display (or vice versa) - having a function that joins them together under something like measurements would be greatly beneficial.
So a suite that currently contains:
run_times: %{"My input" => [...]},
statistics: %{"My input" => %{...}}
would get another key, like:
measurements: %{
  "My input" => %{
    run_times: [...],
    statistics: %{...}
  }
}
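A minimal sketch of the joining step, assuming the suite is a map shaped as shown above (the module and function names are assumptions):

```elixir
defmodule Measurements do
  # Zips run_times and statistics together per input name.
  def add(suite) do
    merged =
      Map.merge(suite.run_times, suite.statistics, fn _input, times, stats ->
        %{run_times: times, statistics: stats}
      end)

    Map.put(suite, :measurements, merged)
  end
end
```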
For different use cases, like bencheeorg/benchee_html#10, it'd be great to have statistics about statistics - what I'd call "meta statistics" - although there's probably some better real statistics name for this :)
What should be in there (that I know of so far):
This should be added as a new key to the benchmarking suite (:meta_statistics). It could be added within the statistics module, but might be better off as a separate MetaStatistics step that is computed after the general statistics have been computed.
The fast execution warning is quite long and verbose - shorten it up and link to the wiki
I would expect the standard deviation to be 0 when we only have one sample or all samples are the same, but still someone got a standard deviation where it could have/should have taken only one sample:
Erlang/OTP 19 [erts-8.0] [64-bit] [smp:12:12] [async-threads:10]
Elixir 1.3.4
Benchmark suite executing with the following configuration:
warmup: 10.0s
time: 10.0s
parallel: 1
Estimated total run time: 120.0s
Benchmarking map TCO no reverse...
Benchmarking map simple without TCO...
Benchmarking map tail-recursive with ++...
Benchmarking map with TCO new arg order...
Benchmarking map with TCO reverse...
Benchmarking stdlib map...
Name ips average deviation median
map TCO no reverse 0.33 3.06 s ±23.57% 3.11 s
map with TCO reverse 0.26 3.84 s ±28.88% 3.84 s
map with TCO new arg order 0.26 3.91 s ±18.79% 3.91 s
map tail-recursive with ++ 0.0918 10.90 s ±12.83% 10.90 s
stdlib map 0.0910 10.99 s ±11.87% 10.99 s
map simple without TCO 0.0899 11.13 s ±13.20% 11.13 s
Comparison:
map TCO no reverse 0.33
map with TCO reverse 0.26 - 1.26x slower
map with TCO new arg order 0.26 - 1.28x slower
map tail-recursive with ++ 0.0918 - 3.56x slower
stdlib map 0.0910 - 3.59x slower
map simple without TCO 0.0899 - 3.63x slower
(time was 10 seconds and a single execution apparently took over 10 seconds on average for some of the jobs; the only way it could get a standard deviation, afaik, is if the first run was faster than 10 seconds)
It'd be nice to, for instance, show the average time in milliseconds if a benchmark is slower, or write something to the effect of "80.9 Million" iterations per second in the console output for a fast benchmark.
It'd be important to me that this auto scaling takes all results into account. I.e. I always find it harder to compare when results are reported in different units, so either the values of (for instance) all averages are scaled to the same unit/magnitude or none are.
I'm not sure how this works exactly, but there are measures in statistics for being confident of your results (standard deviation we have, and I think it is one of them). Given this, we could add an option that says "benchmark until this confidence level is reached" - with some timeout though, so it doesn't run forever if results naturally vary too much.
Something like "benchmarking with a warmup of n seconds and a benchmark time of m seconds. Estimated total run time ~ benchmarks * (m + n)"
Ideas for further plugins, open a new issue or share ideas here :)
Right now formatters are defined through functions: if you use a given plugin, you've got to give it the whole Benchee.Formatters.FooBar.output/1. Instead, it'd be nice if there were a Formatter behaviour that formatters could implement, and then in the list one could specify either functions or module names like Benchee.Formatters.FooBar :)
The functions of course would be format/1 and output/1, as it already is :)
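A sketch of what such a behaviour could look like (the callback types are guesses; the suite is assumed to be a map):

```elixir
defmodule Benchee.Formatter do
  # Pure: turns the suite into the formatter's representation.
  @callback format(map) :: any

  # Side effects: writes the results out (console, file, ...).
  @callback output(map) :: any
end
```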
Performance isn't only about execution speed but also about memory consumption. There seem to be some Erlang APIs to get memory consumption - the question is, how reliable are they?
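For reference, the kind of Erlang APIs meant here (these calls exist; whether they are reliable enough for benchmarking is exactly the open question):

```elixir
# Total memory used by the VM, in bytes
total = :erlang.memory(:total)

# Memory used by the current process, in bytes
{:memory, process_bytes} = Process.info(self(), :memory)

IO.inspect({total, process_bytes})
```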
So far benchee uses maps for the configuration options, while in Elixir it is more common to use keyword lists. I detailed some reasons/thoughts in this thread - people pointed out that it is probably still better to stick with the main convention in the language, and Jose suggested just converting the keyword list to a map for internal use.
That seems like a great idea.
However, if I recall correctly, options are usually the last argument (currently they are the first, following the mantra of "first I configure the benchmark, then I define the benchmarks"), which would be a rather big API change - one we could probably cleverly get out of with some pattern matching.
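That "clever pattern matching" might look roughly like this (a sketch, not benchee's code; do_run/2 is a placeholder):

```elixir
defmodule Run do
  # New style: jobs first, options as a keyword list last,
  # converted to a map for internal use (Jose's suggestion).
  def run(jobs, options) when is_map(jobs) and is_list(options) do
    do_run(jobs, Map.new(options))
  end

  # Old style keeps working: config map first, jobs second.
  def run(config, jobs) when is_map(config) and is_map(jobs) do
    do_run(jobs, config)
  end

  # Placeholder for the actual benchmarking pipeline.
  defp do_run(jobs, config), do: {jobs, config}
end
```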
Here is a little wish list:
Benchee.run - how does it look/feel?
Input/ideas welcome (@wasnotrice? :))
This needs an update - or, more precisely, the gists in it do, as they won't run anymore on Benchee 0.6.0.
It'd be amazing to be a good BEAM citizen and let other languages, specifically Erlang, also use benchee.
It would be great to check if/how benchee can be used from Erlang with rebar3 as the package manager, and then write down a sample benchmark for using benchee from Erlang in the README.
Of course, we should also get rid of all incompatibilities. E.g. I'm not sure, but right now benchee might crash when there is no Elixir version present.
Provide a statistic in the statistics module that shows the nth percentile of response times. E.g. for 99 it should show the time within which 99% of all samples finished. This should help remove outliers from garbage collection & friends.
For now I don't think the nth percentile needs to be adjustable; I'd settle for a predefined value. I think the 99th seems fine - we'll have to try with some real world examples to see how it performs.
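A naive sketch of such a percentile computation (nearest-rank style, purely for illustration; not the actual implementation):

```elixir
defmodule Percentile do
  # Returns the value below which `percent`% of the sorted samples fall.
  def percentile(samples, percent) when is_list(samples) do
    sorted = Enum.sort(samples)
    index = round(percent / 100 * (length(sorted) - 1))
    Enum.at(sorted, index)
  end
end

# Percentile.percentile([1, 2, 3, 100], 99) #=> 100
```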
Right now statistics are computed sequentially, just as formatters are executed sequentially. There's no real reason for this; it should be "stupidly easily parallelizable", as there are no dependencies between those computations, so it should be easily doable via Task.async/1 and Task.await/1.
There's even some good sense behind it, as statistics generation takes an increasing amount of time the more samples there are - sorting a million elements can take ~0.3s, and if we let a fast benchmark run even for a little while this number is not hard to reach; more benchmarks mean, of course, even more time.
Formatters are rather fast, but could also take longer, and they always have some IO going on.
Of course, statistics and formatters should probably be two different PRs :) (a sketch of the statistics part follows)
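A minimal sketch of parallelizing the per-job statistics with tasks:

```elixir
# run_times is assumed to be %{job_name => [run_time, ...]};
# compute_statistics/1 is a hypothetical stand-in for the real computation.
statistics =
  run_times
  |> Enum.map(fn {job, times} ->
    Task.async(fn -> {job, compute_statistics(times)} end)
  end)
  |> Enum.map(&Task.await/1)
  |> Map.new()
```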
Follow up for #55 - depends on #53
As correctly noted in #55, we can't just parallelize the formatters' output/1, as console output might come from multiple formatters (warnings and such), which might get in each other's way. But there is already the convention of having a pure format/1 function that just creates the structure, which is then written out in output/1 - so the writing out can be done sequentially.
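In code, the idea might look like this (a sketch; write/1 is a hypothetical name for the writing half of today's output/1):

```elixir
# Pure format/1 runs in parallel; writing happens sequentially afterwards
# so console output from different formatters can't interleave.
formatters
|> Enum.map(fn module -> Task.async(fn -> {module, module.format(suite)} end) end)
|> Enum.map(&Task.await/1)
|> Enum.each(fn {module, formatted} -> module.write(formatted) end)
```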
The mix.exs instructions need to be updated.
Add benchee_csv to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [{:benchee, "~> 0.3", only: :dev},
   {:benchee_csv, "~> 0.3"}]
end
```
Afterwards, run `mix deps.get` to install it.
Right now the names are just cut off. There should either be a warning or it should work by just using the second line for the name (splitting it onto n lines).
tobi@happy ~/github/benchee $ cat samples/long_name.exs
Benchee.run [{"some very long name that doesn't fit in the space", fn -> :timer.sleep(100) end}]
tobi@happy ~/github/benchee $ mix run samples/long_name.exs
Benchmarking some very long name that doesn't fit in the space...
Name ips average deviation median
some very long name that doesn 9.90 100994.16μs (±0.01%) 100994.00μs
Benchee is getting a lot of configuration options - which is great. But the more configuration options and formatters there are, the more people would probably prefer to configure some of them globally for their project.
Something like:
config :benchee, :options, %{} # fancy map or keyword list overriding default options
These options should then represent the new default options. So merge order would be something like: default_config <- app_config <- benchmark_config (<- meaning right overrides left)
This is relatively easy. Once upon a time there was Benchee.Time to convert from seconds to microseconds and the other way around (not sure if that direction is used at all). We now have a module Unit.Duration where similar functionality lives, plus more - so the idea is to replace calls to Time with equivalent calls to Duration :)
Right now unit_scaling is nested underneath the console formatter, where it can and should also remain (for now - we could warn later that it moved to the top level).
However, unit_scaling makes sense for more formatters than just the console formatter (everyone that displays units of some kind). Right now that doesn't make as much sense for CSV, for instance, as there is no unit field, but it would definitely make sense for the HTML report.
This way different formatters can support the unit_scaling option and it would be consistent among all formatters used.
edit: As there was a misconception - what is meant by a top level configuration option is the following:
unit_scaling: :best,
formatter_options: whatever
i.e. it's not attached to any specific formatter, but every formatter can look it up in the general configuration
To give an immediate view of which version combination produced the results :)
The comparison report makes sense when you're benchmarking one thing vs another thing, but it doesn't make sense in a land where you're just providing benchmarks for a project.
For example, if I build a library which has function1, function2, function3, etc., I might want to provide a single benchmark.exs which outputs the stats for all of them (i.e. benchmark the entire library). At this point the comparison doesn't make sense, as it's comparing completely different functions, so it'd be nice to have a flag to be able to turn it off :) (sketched below)
Hello @PragTob,
thanks for creating Benchee,
I am playing with it, and just wondering, since I am testing a lot of functions that execute really fast, whether there is a way to disable this warning:
Warning: The function you are trying to benchmark is super fast, making time measures unreliable!
If not, I think having an option to disable it would be very useful.
Cheers
Sometimes I wish I could run a function after every invocation of the benchmarking function - outside of the measurement of the benchmarked function, of course. Why would I want that?
Based on these use cases, the after function would need access to:
A good name (at least judging by benchfella) seems to be teardown, and as teardown alone would be lonely we can add a setup sibling that behaves similarly.
edit: updated, as this blog post also calls for some setup/teardown
edit2: also, before/after a scenario sounds sensible to do (specifically, I think I need it for benchmarking Hound, as I need to start Hound for our specific PID); a hypothetical sketch follows
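What such hooks could look like, purely hypothetically (the option names follow the setup/teardown naming discussed above; none of this is implemented):

```elixir
Benchee.run(
  %{
    setup: fn -> IO.puts("before each invocation") end,
    teardown: fn -> IO.puts("after each invocation, outside the measurement") end
  },
  %{"sleep" => fn -> :timer.sleep(10) end}
)
```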
It'd be great if benchee could warn the user if the benchmarks are run with settings in the Elixir runtime that potentially hamper performance.
I don't have a full list in mind yet but it includes:
Whenever that happens, we can change the listed applications to extra applications etc. as described in the release notes.
When running tests, some tests fail non-deterministically. They sometimes fail and sometimes pass, without any changes to the code.
Example test failure:
1) test variance does not skyrocket on very fast functions (Benchee.BenchmarkTest)
test/benchee/benchmark_test.exs:60
Assertion with < failed
code: std_dev_ratio < 1.2
lhs: 1.7548733092872468
rhs: 1.2
stacktrace:
(elixir) lib/enum.ex:651: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:651: Enum.each/2
(ex_unit) lib/ex_unit/capture_io.ex:146: ExUnit.CaptureIO.do_capture_io/2
(ex_unit) lib/ex_unit/capture_io.ex:119: ExUnit.CaptureIO.do_capture_io/3
test/benchee/benchmark_test.exs:61: (test)
It'd be great to have a platform independent way to show more basic data about the machine the benchmark is being run on such as:
Drawing boxplot diagrams eats a lot of resources on the browser side in github.com/PragTob/benchee_html.
So it'd be nice to provide the statistics to draw them right away (like this) - the median we already have; quartile 1/quartile 3, the interquartile range and others are still needed :)
Wait for the release of Elixir 1.4, then:
- Benchee.Formatters.Console.units/1
- keyword type declaration in Benchee.Formatter.Unit
- String.trim/1 instead of String.strip/1
- describe blocks in tests as needed

As pointed out by @michalmuskala on elixirforum, it is problematic that the standard way to run benchmarks is portrayed as just running code outside of a module, which gets problematic if more complex code is in the individual anonymous functions:
That said, one thing that bothers me each time I look at the README is that the example benchmark is not inside a module. Code that is not inside a module is not compiled but interpreted. This gives vastly different performance characteristics and makes benchmarks pretty much useless.
This is not a huge problem in the example, since the functions immediately call a module (so only the initial, anonymous function call is interpreted), but can lead to false results with more complex things.
Ideally the README should point this out and recommend either just testing functions that are defined in their own modules or, in case of doubt, writing a module containing the benchmark and then calling it in the run script:
defmodule MyBenchmark do
def benchmark do
Benchee.run(...)
end
end
MyBenchmark.benchmark()
Code in modules seems to always be compiled so that'd solve the problem mentioned above :)
Right now the internal structure for benchmarks is a list of tuples of benchmark name and function. I did this initially so that benchmark names could be duplicated. However, that doesn't really make much sense and should rather give a warning (otherwise you can't tell them apart in the output either).
So a structure like:
%{"benchmark name" => function, ...}
seems better suited.
To avoid breaking existing benchmarks, I'd like to preserve the old [{name, fun}, {name, fun}] way for now; the function should just convert it to a map then (a sketch follows).
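A sketch of that conversion (the module and function names are hypothetical):

```elixir
defmodule Jobs do
  # Legacy list-of-tuples form: convert to a map. (Duplicate names silently
  # collapse here; the real version should warn about them.)
  def normalize(jobs) when is_list(jobs), do: Map.new(jobs)

  # Already a map: nothing to do.
  def normalize(jobs) when is_map(jobs), do: jobs
end

# Jobs.normalize([{"sleep", fn -> :timer.sleep(100) end}])
# #=> %{"sleep" => #Function<...>}
```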
Internally, a lot of printing right now happens inside of Benchee.Benchmark - overall suite information as well as warnings etc. It'd be great to put all of that into its own module that can be injected and exchanged for testing purposes. That way we'd also avoid all the awkward capture_io calls in the Benchmark tests and be more pure/side-effect free.
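A sketch of the injection idea via a default parameter (the module and function names are made up for illustration):

```elixir
defmodule Benchmark do
  def measure(suite, printer \\ Printer) do
    printer.suite_information(suite)
    # ... run the benchmarks here ...
    suite
  end
end

defmodule Printer do
  def suite_information(suite), do: IO.inspect(suite)
end

# In tests, inject a silent printer instead of wrapping everything in capture_io:
defmodule NullPrinter do
  def suite_information(_suite), do: :ok
end
```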
Especially micro benchmarking can be affected by garbage collection, as single runs will be much slower than the others, leading to a skyrocketing standard deviation and unreliable measurements. Sadly, to the best of my knowledge, one can't turn off GC on the BEAM.
The best breadcrumb to achieve anything like this so far:
You can try to use erlang:spawn_opt http://erlang.org/doc/man/erlang.html#spawn_opt-2 setting fullsweep_after and min_heap_size to high values to reduce chances of garbage collection.
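In code, that breadcrumb might translate to something like this (the option values are arbitrary guesses, and run_benchmark/0 stands in for the actual benchmark code):

```elixir
# Spawn the benchmarking process with a large heap and rare full sweeps to
# reduce the chance of GC kicking in during measurements.
:erlang.spawn_opt(
  fn -> run_benchmark() end,
  fullsweep_after: 1_000_000,
  min_heap_size: 1_000_000
)
```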
This would then go into a new configuration option like avoid_gc: true/false.
It would also need testing with existing benchmarks to see the effect on standard deviation etc. - likely a large-ish operation :)
Right now, when you run a slightly more thorough benchmark with increased times or multiple inputs, the estimated time shows up as many seconds:
tobi@speedy ~/github/elixir_playground $ mix run bench/tco_blog_post.exs
Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.4.0
Benchmark suite executing with the following configuration:
warmup: 10.0s
time: 10.0s
parallel: 1
inputs: none specified
Estimated total run time: 120.0s
It'd be nice to use the automatic unit scaling to show 2 min or something similar in cases like this (of course even more so with cases like 360s).
Scaling can be found in Benchee.Conversion.Duration, while the printing is part of Benchee.Output.BenchmarkPrinter :)
An option for the console formatter like extended_statistics or something would be nice, to show statistics that we already collect but don't print yet. As of now this would be:
These values can be interesting for different reasons, e.g. what the worst case performance is here, or how many results this is based on. I don't really want to add them to the standard console formatter, as the output would probably get too wide.
The new/extra statistics should probably be displayed underneath the normal statistics, in a similar fashion as the normal statistics - meaning in the same order - and in a table-like format that goes:
name - minimum - maximum - sample size
See elixir-lang/ex_doc#578 :)
What further statistics do you need/wish for? Open new issues or add them here. I think we can learn some from criterion, which antipax pointed out to me.
Rerun the benchmark with different input sizes (aka 10 elements, 100 elements and 1000 elements), or something of the like, to get reports for all the different sizes with one benchmark run.
Noticed this in elixir-lang/elixir#5082, where it would have been nice to have multiple input sizes in one benchmark.
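A hypothetical shape such an API could take (the inputs option name and the input argument passed to the job are assumptions about what this could look like):

```elixir
Benchee.run(
  %{
    inputs: %{
      "10 elements" => Enum.to_list(1..10),
      "100 elements" => Enum.to_list(1..100),
      "1000 elements" => Enum.to_list(1..1_000)
    }
  },
  %{"map" => fn input -> Enum.map(input, &(&1 * 2)) end}
)
```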
Having all of benchee's public interface type-specced, with the typespecs checked in CI, would be great.
As to how to do it: my experience with typespecs is limited. I've used dialyxir with varying degrees of success. There's also dialyze, but it doesn't seem to be as actively maintained anymore.
Also, with the behaviours in the unit system we added quite a few typespecs - it would be great to complement those.
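For illustration, the kind of spec meant here (an arbitrary example, not benchee's actual interface):

```elixir
defmodule Stats do
  @spec average([number]) :: float
  def average(samples) do
    Enum.sum(samples) / length(samples)
  end
end
```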