A NES emulator written in Ruby

License: MIT License

optcarrot's Introduction

Optcarrot: A NES Emulator for Ruby Benchmark

Project Goals

This project aims to provide an "enjoyable" benchmark for Ruby implementations, to help drive "Ruby3x3: Ruby 3 will be 3 times faster".

The specific target is a NES (Nintendo Entertainment System) emulator that runs at 20 fps in Ruby 2.0. An original NES runs at 60 fps. If Ruby3x3 succeeds, we can enjoy a NES game with Ruby!

NOTE: We do not aim to create a practical NES emulator. There are already many great emulators available. We recommend using another emulator if you just want to play a game.

Basic usage

SDL2 is required.

$ git clone http://github.com/mame/optcarrot.git
$ cd optcarrot
$ bin/optcarrot examples/Lan_Master.nes

key	button
arrow	D-pad
`Z`	A button
`X`	B button
space	Start button
return	Select button

See doc/bonus.md for advanced usage.

Benchmark example

Here is FPS after 3 seconds in the game's clock.

Here is FPS after 50 seconds in the game's clock. (Only fast implementations are listed.)

See doc/benchmark.md for the measurement condition and some other charts.

See also Ruby Releases Benchmarks and Ruby Commits Benchmarks for the continuous benchmark results.

You may also want to read @eregon's great post on TruffleRuby's potential performance after warm-up.

Optimized mode

It may run faster with the option --opt.

$ bin/optcarrot --opt examples/Lan_Master.nes

This option generates optimized (and super-dirty) Ruby code internally and replaces some bottleneck methods with it. See doc/internal.md for details.
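As a rough illustration of the general technique (building Ruby source as a string and swapping it in for an existing method), here is a minimal sketch. The `Counter` class and its `step` method are hypothetical and are not taken from optcarrot; doc/internal.md describes what the real generator does.

```ruby
# Hypothetical class, not part of optcarrot.
class Counter
  attr_reader :total

  def initialize
    @total = 0
  end

  # Straightforward version: loops over the increments one by one.
  def step
    [1, 2, 3].each { |n| @total += n }
  end
end

# Generate an unrolled method body as a source string...
unrolled = [1, 2, 3].map { |n| "@total += #{n}" }.join("\n")
source = "def step\n#{unrolled}\nend"

# ...and replace the original method with the generated one.
Counter.class_eval(source)

c = Counter.new
c.step
puts c.total
```

The generated method behaves like the original but avoids the block call per element, at the cost of readability, which is the trade-off the "super-dirty" remark hints at.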

Acknowledgement

We appreciate all the people who devoted effort to NES analysis. If it had not been for the NESdev Wiki, we could not have created this program. We also read the source code of Nestopia, NESICIDE, and others. We used the test ROMs from NESICIDE.

optcarrot's Issues

Benchmarking methodology and number of frames

Hello,

I am wondering why 180 frames was chosen for the benchmark.

As a note, it seems the benchmark is repeating the same sequence from frame ~30 (the menu animation of Lan_Master.nes), so that the code executed after that frame is repeatedly the same and should eventually reach predictable performance.

Some implementations might take longer than 180 frames to fully warm up and reach peak performance, which I think is what the setup intends to capture by measuring the last 10 frames.
MRI is stable after a couple of iterations, but implementations that optimize at runtime using a JIT and deoptimization might take longer to reach stable performance.
For example, in the picture below, which measures performance over 3000 frames,
we can observe that JRuby with invokedynamic is not fully warmed up by 180 frames.
In fact, there are even some variations a bit before 1000 iterations.
Other implementations, such as JRuby+Truffle, show similar warm-up times.

Implementations which are highly optimizing might trade high peak-performance for longer warmup.
The warm-up of VMs is unfortunately very hard to predict, as for example this paper from the PyPy community shows.
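To make the concern concrete, here is a minimal sketch of computing fps over only the final frames of a run. The helper `fps_over_last` and the synthetic timings are assumptions made for illustration; they are not optcarrot's actual measurement code.

```ruby
# Hypothetical helper: average fps over the last `window` frames,
# given the timestamp (in seconds) at which each frame finished.
def fps_over_last(timestamps, window)
  raise ArgumentError, "need more than #{window} timestamps" if timestamps.size <= window
  elapsed = timestamps[-1] - timestamps[-1 - window]
  window / elapsed
end

# Synthetic data: 170 slow frames (0.1 s each), then 10 fast frames (0.02 s each),
# mimicking a VM that only reaches peak speed late in the run.
t = 0.0
stamps = [t]
170.times { stamps << (t += 0.1) }
10.times  { stamps << (t += 0.02) }

overall = (stamps.size - 1) / (stamps[-1] - stamps[0])
tail = fps_over_last(stamps, 10)
puts format("overall: %.1f fps, last 10 frames: %.1f fps", overall, tail)
```

With data like this the tail figure is far higher than the whole-run average, so the reported number depends heavily on whether warm-up has finished before the measured window starts.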

This issue is just to raise awareness of this problem, and ask what kind of performance the benchmark wants to measure.

Code does not support MRI 1.8.7

The documentation (benchmark.md) and the plots suggest that this benchmark can be run on MRI 1.8.7 (or JRuby in MRI 1.8.7 compatibility mode).
I am, however, unable to run it without extensive modifications. Further details:

- The modern hash syntax foo: :bar must be backported to :foo => :bar.
- The splat operator can only be used on the last argument; that is, an argument cannot have a dynamic length unless it comes last. Many places in your codebase do not satisfy that restriction. (Example)
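The two points above can be sketched as follows. The method `sum3` is hypothetical and only serves to show a call site; the "legacy" forms are the ones Ruby 1.8.7 also accepts, and the snippet as a whole runs on a current Ruby.

```ruby
# Ruby 1.9+ hash literal shorthand...
modern = { foo: :bar }
# ...and the hash-rocket form that Ruby 1.8.7 also accepts.
legacy = { :foo => :bar }

# Hypothetical method used only for illustration.
def sum3(a, b, c)
  a + b + c
end

args = [1, 2]
# Ruby 1.9+ allows a splat before other arguments.
modern_call = sum3(*args, 3)
# 1.8.7-compatible rewrite: build the full list first, then splat it last.
legacy_call = sum3(*(args + [3]))

puts modern == legacy
puts modern_call == legacy_call
```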

All the best,
Michael

Sprite collapsed with optimization

When I run optcarrot with --opt (and without --sprite-limit), sprites are rendered incorrectly if more than 8 of them lie on the same line.

Sprites are displayed correctly without --opt, or with the --sprite-limit option. So I suspect the optimization breaks something.

GPU pixel caching?

Hi @mame,

I've read your presentation and I have a question about slide 16

Are the invalidated pixels still tracked? I could not find it in the source code. Could you point me to it?
Did you investigate how this optimisation affects benchmarking? It could make the results vary depending on how much of the screen is changing.

License for project?

Hello there!

Thanks for optcarrot --- it's a really interesting codebase to explore and play with in trying to get Ruby to go faster.

I noticed that there seems to be no license attached to the project, and would like to suggest that one be added. I personally favour LGPLv3+ but you may want to select an even less complex license, such as MIT or Apache.

Thanks 🙏 👍

JRuby performance in benchmark results tells a partial story

I'm confused by some of the numbers I've seen for JRuby versus MRI in the benchmark results.

Your numbers show MRI 2.3 being the fastest in non-optimized mode with 23FPS. JRuby 1.7.24 is the fastest production-usable implementation at 19FPS. JRuby 9k (9.0.5.0) has lower perf than 1.7.24.

My own numbers confirm some parts of this and not others.

Invokedynamic not used

First, JRuby (stock) versus Ruby 2.3.0:

jruby 1.7.24 (1.9.3p551) 2016-01-20 bd68d85 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 +jit [darwin-x86_64]
fps: 13.333334275233959
checksum: 59662

ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-darwin14]
fps: 16.10170636312445
checksum: 59662

So JRuby is a bit slower, matching your results. But this is without invokedynamic, normally turned off for JRuby due to longer startup/warmup time.

jruby 1.7.24 (1.9.3p551) 2016-01-20 bd68d85 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 +indy +jit [darwin-x86_64]
fps: 16.100178451396733
checksum: 59662

This puts JRuby 1.7 at about the speed of MRI 2.3.0 on the normal version of this code.

JRuby 9k

Your results show a perf degradation from 1.7.24 to 9.0.5, which we have confirmed and made a number of fixes for in JRuby 9.1.

Note that we have not yet made homogeneous case/when O(1).
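For readers unfamiliar with the term: a "homogeneous" case/when is one whose branches are all literals of the same kind, so the matching branch could in principle be picked with a single table lookup instead of testing each `when` in turn. The sketch below shows the two equivalent shapes using a small hypothetical opcode table, not optcarrot's own code.

```ruby
# Hypothetical opcode-to-name mapping, for illustration only.
def name_by_case(opcode)
  case opcode
  when 0x00 then :brk
  when 0x4c then :jmp
  when 0xea then :nop
  else :unknown
  end
end

# The same mapping expressed as an explicit lookup table.
NAMES = { 0x00 => :brk, 0x4c => :jmp, 0xea => :nop }.freeze

def name_by_table(opcode)
  NAMES.fetch(opcode, :unknown)
end

[0x00, 0x4c, 0xea, 0xff].each do |op|
  puts format("%02x %s %s", op, name_by_case(op), name_by_table(op))
end
```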

jruby 9.0.5.0 (2.2.3) 2016-01-26 7bee00d Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +indy +jit [darwin-x86_64]
fps: 12.010045906189871
checksum: 59662

jruby 9.1.0.0-SNAPSHOT (2.3.0) 2016-03-24 d851678 Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +indy +jit [darwin-x86_64]
fps: 30.92613156164269
checksum: 59662

This puts JRuby 9.1 nearly 2x faster than MRI 2.3.0 on the normal code.

It's also interesting to note that 9.1 does not appear to require any of the compatibility stubs.

Optimized code

I can confirm that JRuby does not like the optimized code, most likely because very large bodies of code usually do not JIT in JRuby, or if they do JIT to JVM bytecode the JVM itself may not do further optimization on them. However, I thought I'd try it in JRuby anyway.

jruby 9.1.0.0-SNAPSHOT (2.3.0) 2016-03-24 d851678 Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +indy +jit [darwin-x86_64]
fps: 1.649363068304025
checksum: 59662

This is a 20x performance reduction by using the same "optimized" code that speeds up MRI by almost 3x.

If I turn on JIT logging (-Xjit.logging) I can see it fails to compile two generated pieces of code for PPU.run and CPU.run:

...
2016-03-31T21:49:06.096-05:00: JITCompiler: done jitting: DMC Optcarrot::APU::DMC.sample at /Users/headius/projects/optcarrot/lib/optcarrot/apu.rb:770
2016-03-31T21:49:06.411-05:00: JITCompiler: Could not compile; passes run: []: <anon class> Optcarrot::PPU.run at (generated PPU core):0 because of: "Could not compile org.jruby.internal.runtime.methods.MixedModeIRMethod@6b72f764; instruction count 204813 exceeds threshold of 2000"
2016-03-31T21:49:07.168-05:00: JITCompiler: done jitting: <block> poke_2007_CLOSURE_1.poke_2007_CLOSURE_1 at /Users/headius/projects/optcarrot/lib/optcarrot/ppu.rb:481
2016-03-31T21:49:07.181-05:00: JITCompiler: done jitting: PPU Optcarrot::PPU.update_scroll_address_line at /Users/headius/projects/optcarrot/lib/optcarrot/ppu.rb:297
...

This is producing a method that requires 204813 instructions in our IR. Most of our IR instructions compile to many JVM bytecodes. Combine these facts with the JVM's method size limit of 64k bytes of JVM bytecode...even if we tried to force JRuby to JIT this code, it would be far too large for the JVM to accept it without us breaking it into smaller pieces.
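The arithmetic behind that claim can be checked quickly. The figures come from the log and text above; the one-byte-per-instruction figure is a deliberately optimistic assumption for illustration, since the text says most IR instructions compile to many bytecodes.

```ruby
# Figures taken from the JIT log and the text above.
ir_instructions = 204_813
jit_threshold = 2_000
jvm_method_limit_bytes = 64 * 1024 # "64k bytes" of bytecode per method

# How far over JRuby's JIT threshold the generated method is.
over_threshold = ir_instructions.fdiv(jit_threshold)

# Assumption for illustration: a lower bound of one bytecode byte per IR instruction.
min_bytes = ir_instructions * 1
fits = min_bytes <= jvm_method_limit_bytes

puts format("%.1fx over the JIT threshold", over_threshold)
puts "fits in one JVM method even at 1 byte per instruction? #{fits}"
```

Even under that lower bound the method would be several times too large, which is why it would have to be split into smaller pieces.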

Our interpreter-only mode is only slightly slower than the JIT when running with --opt. For the moment, this optimization does not fit JRuby's model of execution.

--opt option collapsed.

It seems that the --opt option started raising an error after commit 47e0c7c.

$ bin/optcarrot-bench
fps: 40.6957710763183
$ bin/optcarrot-bench --opt
fps: 130.21812983686382
/home/optcarrot/lib/optcarrot/ppu.rb:894:in `dispose': undefined method `resume' for nil:NilClass (NoMethodError)
        from /home/monochrome/optcarrot/lib/optcarrot/nes.rb:77:in `dispose'
        from /home/monochrome/optcarrot/lib/optcarrot/nes.rb:96:in `run'
        from bin/optcarrot-bench:9:in `<main>'

Problem in --print-video-checksum

In my environment, the --print-video-checksum option does not seem to work.

❯ ruby bin/optcarrot -b --print-video-checksum examples/Lan_Master.nes
fps: 40.230421976713764
❯ ruby tools/run-benchmark.rb ruby30
+--------------+
| build ruby30 |
+--------------+
*** omitted ***
+----------------------------------+
| measure ruby30 / default (1 / 1) |
+----------------------------------+
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-linux]
fps: 39.59678157409444

FAILED.

❯

I think that, in the code below, which was modified in the latest commit, ".class" is not necessary.
--print-video-checksum (or tools/run-benchmark.rb) worked correctly after I deleted ".class".
What do you think?

optcarrot/lib/optcarrot/nes.rb

Line 70 in 5d3ca11

if @conf.print_video_checksum && @video.class.instance_of?(Video)
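A minimal sketch of why the extra ".class" makes the condition false. The empty `Video` class here is a hypothetical stand-in for the real one; only the shape of the check matters.

```ruby
# Hypothetical stand-in for the real class.
class Video; end

video = Video.new

# `video.class` is the Class object `Video`, and a Class object is an
# instance of Class, never of Video, so this check is always false.
with_class = video.class.instance_of?(Video)

# Asking the object itself gives the intended answer.
without_class = video.instance_of?(Video)

puts with_class
puts without_class
```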