
Comments (8)

eregon commented on August 20, 2024

Hello,

First I apologize for the long comment, I had too many thoughts this morning 😃
I have thought more about this and I think we need to tweak the benchmarking methodology.

IMO, it should measure the performance of the "normal" state of emulation: a screen is shown, a sound is playing, etc.

I see, that sounds right.

(A NES emulator must run at 60 fps, even immediately after start-up. A game player won't wait for a Ruby implementation's warm-up :-)

This is not very realistic, though. Even hardware consoles are typically slower at startup.
Some instructions are typically slower than others, and some only appear during startup.
I think a game player might very well wait a few extra seconds if they can get 60fps instead of 30fps after warmup :)

In any case, we really want to measure performance from 180 frames and after (since the player will hopefully play more than 3 seconds). For that, we need more frames (iterations) to get decent statistics and see the longer-term picture.
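To make the "from 180 frames and after" idea concrete, here is a toy sketch (not optcarrot code; the `frame_times` array, the `steady_state_fps` helper, and the per-frame durations are all made up for illustration):

```ruby
# Discard the first `warmup` frames, then report the mean fps over the rest.
WARMUP_FRAMES = 180

def steady_state_fps(frame_times, warmup = WARMUP_FRAMES)
  steady = frame_times.drop(warmup)   # keep only frames after the warmup cutoff
  return nil if steady.empty?
  fps = steady.map { |t| 1.0 / t }    # per-frame fps is the inverse of its duration (s)
  fps.sum / fps.size
end
```

With only a handful of frames past the cutoff, this mean is dominated by noise, which is why more iterations are needed for decent statistics.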

Also, I don't want to increase the default parameter because I want to keep optcarrot a handy benchmark by default.

Of course, for experimentation, it's fine to use a custom number of frames. If MRI shows stable performance at 180 frames, that's very handy. For JRuby and other implementations, a different threshold is likely better for experimenting and getting a relatively stable fps number.

I am arguing for the methodology used for producing the chart in the README, and I think that should also be consistent with --benchmark.


Here are some ideas based on compilation thresholds.

Recently, @noahgibbs tried Ruby OMR on optcarrot.
He also noticed that 180 frames is far from enough for benchmarking.
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/77521

In fact, it seems that OMR's default compilation threshold (how many times a method is invoked before it's compiled) is 3000 (and 1000 in the quickstart mode which seems non-default): https://github.com/eclipse/omr/blob/9fe8bc4c8b0b855155c4d609ac1063276d4a1a4e/compiler/control/OMROptions.hpp#L1234-L1238
https://github.com/eclipse/omr/blob/9fe8bc4c8b0b855155c4d609ac1063276d4a1a4e/compiler/control/OMROptions.cpp#L4028-L4042
Playing with OMR_JIT_OPTIONS="-Xjit:count=<COUNT>,verbose" shows that few methods are compiled with the default threshold, while with a threshold of 50 many more get compiled (but performance gets worse according to his measurements).

With the default settings (just OMR_JIT_OPTIONS=-Xjit:verbose), one can see that quite a few methods get compiled well beyond frame 180. For instance:

...
frame 198, fps: 22.382

 compiling .../optcarrot/lib/optcarrot/cpu.rb:627:_dec
+ (cold @ 0x7f8c08035d14 t=9891174ms  2.668ms) .../lib/optcarrot/cpu.rb:627:_dec
frame 199, fps: 22.335
frame 200, fps: 23.803
...
frame 2999, fps: 24.650

 compiling .../optcarrot/lib/optcarrot/nes.rb:40:step
+ (cold @ 0x7f8c0804a874 t=131882787ms  5.561ms) .../optcarrot/lib/optcarrot/nes.rb:40:step
 compiling .../optcarrot/lib/optcarrot/ppu.rb:252:setup_frame
+ (cold @ 0x7f8c0804b854 t=131885754ms  2.864ms) .../optcarrot/lib/optcarrot/ppu.rb:252:setup_frame
 compiling .../optcarrot/lib/optcarrot/cpu.rb:925:run
...

In comparison, Truffle's default threshold is 1000 method invocations.

The reason for these high default thresholds is that JITs need to make sure it's worth just-in-time compiling a method. Compiling a method takes time, so the method has to be called often enough to pay for it; otherwise the JIT is just wasting CPU time (or making more important methods wait longer to be compiled).

The compilation threshold relates to the number of frames (iterations): a method called at least once per frame will only have been scheduled for compilation once the frame count reaches the threshold.
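As a back-of-the-envelope model of that relationship (the helper is hypothetical, not taken from any JIT):

```ruby
# Toy model: a method invoked `calls_per_frame` times per frame crosses a
# JIT invocation threshold only after this many frames.
def frames_until_compiled(calls_per_frame, threshold = 3000)
  (threshold.to_f / calls_per_frame).ceil
end
```

With OMR's default threshold of 3000, a method called once per frame is only scheduled for compilation around frame 3000, so a 180-frame run (180 invocations) never gets there at all.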

I think we need to raise the number of frames used for benchmarking to at least 1500 frames (25s at 60fps).
The warmup graph above shows that performance is still not quite stable at 1000 frames, but seems much more stable after. This is also observed on JRuby+Truffle.
3000 frames (50s at 60fps) sounds safer. It's also what is used in this benchmarking paper.
Actually, Noah used 18000 "just to be sure". Too much is rarely a problem.


If we measure more frames, we are likely to observe better long-term performance.
However, the current methodology of taking the average of the last 10 frames is not robust.
Specifically, the average is not a robust estimator (nor is the standard deviation).
One issue: if a GC pause (or any other hiccup; for instance, JRuby+indy above shows a few regularly slower frames) lands within the last 10 frames, the average changes significantly, even though perhaps only 1 frame in a hundred is slower.
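To illustrate with hypothetical numbers (not optcarrot output): a single GC-lengthened frame among ten shifts the mean noticeably, while the median barely moves.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

def median(xs)
  s = xs.sort
  m = s.size / 2
  s.size.odd? ? s[m] : (s[m - 1] + s[m]) / 2.0
end

# Nine normal ~60fps frames (16.7ms) plus one GC-lengthened frame (80ms).
frame_ms = [16.7] * 9 + [80.0]
mean(frame_ms)    # pulled up to ~23ms by the single outlier
median(frame_ms)  # stays at 16.7ms
```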

In any case, a single number is hardly enough to characterize performance, as I showed on the warmup graph above. And since I guess MRI cares about all aspects of performance (startup, warmup, and peak performance), the final chart should show all of that.
Therefore, I propose to use a boxplot over all frame times.
For the same data as above it looks like this:
[boxplot of frame times for each implementation]

It's not perfect, but it shows the long-term performance; it shows that JRuby+indy takes more frames to warm up than JRuby but ends up faster, that MRI has a couple of fast frames at the beginning (38 and 36 fps for the 3rd and 4th frames) and stabilizes quickly as expected, etc.
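For reference, the five-number summary a boxplot is drawn from can be computed like this (nearest-rank quartiles for simplicity; plotting libraries typically interpolate):

```ruby
# Five-number summary of a sample: min, lower quartile, median, upper
# quartile, max. Quartiles use the nearest-rank position for simplicity.
def five_number_summary(xs)
  s = xs.sort
  q = ->(p) { s[(p * (s.size - 1)).round] }
  { min: s.first, q1: q.(0.25), median: q.(0.5), q3: q.(0.75), max: s.last }
end
```

Reporting this summary over all frame times captures startup, warmup, and peak behavior in one compact figure, which is exactly what the boxplot above visualizes.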


What do you think?
Is it reasonable to increase the number of frames for benchmarking to consider longer-term performance than 3 seconds?
Do you have other ideas how to show the data concisely yet with all important aspects of performance?

from optcarrot.

mame commented on August 20, 2024

Sorry for the late reply! I would like to thank you for your interest in optcarrot.

I am wondering why the number of 180 frames was chosen for the benchmark.

The number 180 just means three seconds of emulation time. I just thought that one second might be too short (some ROMs might still be booting), but I have no special reason for picking "three".

This issue is just to raise awareness of this problem,

Thank you, I didn't notice the problem.

and ask what kind of performance the benchmark wants to measure.

IMO, it should measure the performance of the "normal" state of emulation: a screen is shown, a sound is playing, etc. The 180 frames are waiting for that state (i.e., the ROM has booted), not for the warm-up of the Ruby implementation. (A NES emulator must run at 60 fps, even immediately after start-up. A game player won't wait for a Ruby implementation's warm-up :-)

Also, I don't want to increase the default parameter because I want to keep optcarrot a handy benchmark by default. When trying to improve a Ruby implementation using the optcarrot benchmark, we run it many times. That would be painful if each run took too long; I want to keep each run below 10 seconds.

Of course, you are free to use optcarrot with your favorite parameters according to your use case. You know, you can tune the number of frames with the "-f" option.


mame commented on August 20, 2024

I understand that "180 frames" are inconvenient for JRuby/OMR. But, do you still want me to change the default setting? You can use your favorite settings for your benchmark, can't you?

Do you want me to update the chart in the "Benchmark example" section of README.md? If so, there is another problem: I am not going to update that chart :-( In fact, I want to do so, but it takes more than a day to run the full benchmark, and to get an accurate result, I cannot use my laptop during the measurement! I want "continuous integration" for measuring the benchmark, but it requires a bare-metal machine. (I don't want to use a VPS for accurate measurement.) If you kindly provide a result with your favorite settings, I will add a link to it in README.md.

Off topic: my personal belief is that short start-up time is much more important than peak performance for Ruby. I rarely write a long-running program in Ruby; I often write a program that takes less than a second and run it many times. I believe that is the "correct" usage of Ruby. When peak performance really matters, we have many choices (Rust, Go, Scala, C/C++, etc.), but we have fewer languages with short start-up times. As a casual scripting language, and as a marketing strategy, Ruby should focus on the latter.


chrisseaton commented on August 20, 2024

Surely the solution is to run in multiple different configurations? 180 frames is one interesting and worthwhile data point to have, as would be 1,800 and 18,000. We can benchmark at each and use that to compare cold, warm and hot performance. People interested in cold performance can look at that graph, and people interested in hot performance can look at that graph.

And since the number of frames is already configurable, we don't need to modify the default, as @mame says.


eregon commented on August 20, 2024

EDIT: Sorry, I got confused and looked at my own version of tools/run-benchmark.rb.

I understand that "180 frames" are inconvenient for JRuby/OMR. But, do you still want me to change the default setting? You can use your favorite settings for your benchmark, can't you?

I think the number of frames in the benchmark mode (-b, --benchmark) should be consistent with what is used in tools/run-benchmark.rb, but it's mostly a nice-to-have.

I think the number of frames in tools/run-benchmark.rb should be adapted, but I can do that in my fork.
I will do a PR to mention this in the docs at least.

Do you want me to update the chart in Benchmark example section of README.md?

Yes, I think it would be great to keep it up-to-date.

In fact, I want to do so, but it takes more than a day to run the full benchmark.

Why does it take so long? Is it due to testing different configurations?
I think the various manual optimizations are mostly relevant for MRI and experimentation, but not necessary for the main chart.
A run of MRI23 with ruby tools/run-benchmark.rb ruby23 only takes 45 seconds for me.

Releases most likely have stable results over time, so they only need to be benchmarked a few times in the same conditions to ensure results are consistent with each other.

If you kindly provide a result with your favorite settings, I will add a link to that in README.md.

I will do that, thank you for proposing it!

Thank you for this benchmark and all the setup, it's really great! 😃


mame commented on August 20, 2024

Why does it take so long?

Mainly because Opal is super slow :-), and because I run each benchmark ten times (for average calculation) with three modes (default / no-opt / opt). ruby tools/run-benchmark.rb ruby23 -m all -c 10 will take 10 to 20 minutes.

BTW, the number of frames until termination is currently configurable with --frames, but I noticed that the number of frames measured was hard-coded as 10. I think it should also be configurable, but I have no idea for a good option name. Do you have any?
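For illustration, the hard-coded window could be exposed through OptionParser like this; the option name --measure-frames is only a hypothetical suggestion, and parse_measure_frames is not optcarrot code:

```ruby
require 'optparse'

# Parse a hypothetical --measure-frames option, defaulting to the current
# hard-coded window of 10 trailing frames.
def parse_measure_frames(argv, default = 10)
  opts = { measure_frames: default }
  OptionParser.new do |o|
    o.on('--measure-frames N', Integer,
         'number of trailing frames to measure (default: 10)') do |n|
      opts[:measure_frames] = n
    end
  end.parse!(argv)
  opts
end
```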


mgaudet commented on August 20, 2024

(Small side note: the default threshold is changed to 1000 for Ruby+OMR. This is slightly confusing because Ruby's compilation control is distinct from the compilation control in OMR -- in some sense, it predates it. Super confusing, I agree!)

<most bias> I tend to agree with @eregon that a methodology that supports warmup is preferable </most bias> though I'm also on record saying methodology drives change in a particular direction. It's perfectly valid to interpret the benchmark as written today as optimizing for "We are aiming for 60 fps within the first 180 frames, reliably". (As @mame was saying: startup is important.)

Not entirely sure about the box plot, though: it will hide a lot of important details for a benchmark like this -- e.g., cyclic slowdowns due to GC pauses won't show up in a box plot.


mgaudet commented on August 20, 2024

(PS: Has anyone asked if there might be capacity with RubyBench.org for ongoing measurement?)

