Comments (27)
Agreed.
Also, it would have been great to re-apply the changes made to Musashi that allow it to run from multiple threads, which would speed up the overall test time.
Also for testing performance I think this may be helpful https://github.com/cmr/hprof
from r68k.
Yes, that crate seems useful! And multithreading Musashi would be great, but a lot of effort, given that running the tests in parallel (using ./pqc) probably achieves at least 3/4 of the performance boost already. I'd rather make each test quicker at this point.
from r68k.
Yeah, I guess that is good enough. Using the hprof crate should pin-point where most time is spent, at least.
from r68k.
Did a perf capture (in Instruments)
Running Time   Self (ms)  Symbol Name
726.0ms  4.4%    0.0  _pthread_body
  726.0ms  4.4%    0.0  sys::thread::_$LT$impl$GT$::new::thread_start::h6c3076098329c459IZw
    726.0ms  4.4%    0.0  fnfn
      726.0ms  4.4%    0.0  r68k::sys_common::unwind::try<closure>
        726.0ms  4.4%    0.0  __rust_try
          726.0ms  4.4%    0.0  fnfn
            726.0ms  4.4%    0.0  r68k::musashi::tests::hammer_cores_even_addresses
              726.0ms  4.4%    0.0  r68k::musashi::tests::hammer_cores_with
                720.0ms  4.4%    0.0  r68k::cpu::Core.Clone::clone
                  718.0ms  4.4%    0.0  r68k::cpu::ops::instruction_set
                    718.0ms  4.4%  208.0  r68k::cpu::ops::handlers::generate
                      486.0ms  3.0%  261.0  r68k::iter::ops::Range<A>.Iterator::next
                        113.0ms  0.6%   80.0  r68k::mem::swap<u32>
                          20.0ms  0.1%   20.0  r68k::mem::uninitialized<u32>
                          13.0ms  0.0%   13.0  r68k::mem::forget<u32>
                        61.0ms  0.3%   28.0  r68k::ops::&'b u32.Add<&'a u32>::add
                          33.0ms  0.2%   33.0  r68k::ops::u32.Add::add
                        19.0ms  0.1%   19.0  r68k::cmp::impls::u32.PartialOrd::lt
                        18.0ms  0.1%    0.0  <Unknown Address>
                        14.0ms  0.0%   14.0  r68k::num::u32.One::one
                      20.0ms  0.1%    0.0  <Unknown Address>
                        12.0ms  0.0%   12.0  r68k::ops::&'b u32.Add<&'a u32>::add
                         8.0ms  0.0%    8.0  r68k::cmp::impls::u32.PartialOrd::lt
                       2.0ms  0.0%    0.0  r68k::iter::ops::Range<A>.Iterator::next
                         2.0ms  0.0%    1.0  r68k::mem::swap<i32>
                           1.0ms  0.0%    1.0  r68k::mem::uninitialized<i32>
                       2.0ms  0.0%    0.0  r68k::vec::Vec<T>::push
                         2.0ms  0.0%    0.0  r68k::vec::Vec<T>.ops::DerefMut::deref_mut
                           1.0ms  0.0%    1.0  r68k::ptr::*mut T::is_null
                           1.0ms  0.0%    1.0  r68k::slice::from_raw_parts_mut<fn(&mut r68k::cpu::Core) -> core::result::Result<r68k::cpu::Cycles, r68k::cpu::Exception>>
                  2.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>.AddressBus::copy_from
                4.0ms  0.0%    0.0  r68k::cpu::Core::new_mem
                2.0ms  0.0%    0.0  r68k::musashi::reset_and_execute1
                  2.0ms  0.0%    0.0  r68k::musashi::initialize_musashi
                    2.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>.AddressBus::read_byte
                      1.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>::read_u8
                      1.0ms  0.0%    0.0  core..cell..BorrowRefMut::drop.47409::h71287f79163fcfd2
from r68k.
Doing the clone here takes a fair amount of time
from r68k.
It's not even a proper clone, so we should probably rename that method. As Musashi's memory-access functions are global, we need to reset memory before testing each opcode. There's a fair amount of cheating to avoid having to re-set all 16 MB. This is an area where a modified Musashi without globals (or one that passes along an "instance token" or something) would help a lot!
from r68k.
I wonder then.. Would it be possible to use the MAME version instead? It has this fixed already, along with many other fixes that would be nice.
from r68k.
Possibly the latest version is too much work, as it's in C++? I worry about name mangling and similar issues, though perhaps you can mark the communication functions as extern "C" to avoid that? But I think the "non-globalization" fix was done before the ceeplusplusification, so perhaps that would be a simpler target?
from r68k.
The way I do this is something like this:

extern "C" void some_function_to_call()
{
    my_cpp_function();
}

But that really comes down to how many functions from the C++ version are needed. Otherwise, going for the pre-C++ version might indeed be easier.
from r68k.
I've done a stab at integrating hprof, and the output confirms that when executing the "hammer_cores" fn, the main culprit is the cloning, which both copies the first page (1 KB) of memory associated with the core (in order to capture at least the portion of memory containing the injected opcode) and builds the opcode handler table.
hammer - 1 * 807.4ms = 807.4ms (100.0%)
  create Core - 1 * 4.5ms = 4.5ms (0.6%)
  clone Core - 1 * 799.0ms = 799.0ms (99.0%)
  musashi.reset_and_execute1 - 1 * 3.8ms = 3.8ms (0.5%)
  r68k.execute1 - 1 * 13.2us = 13.2us (0.0%)
  assert_cores_equal - 1 * 12.9us = 12.9us (0.0%)
This is just for a single call to "hammer_cores", because I couldn't get the reporting to work when I put the hprof stuff in the QC-per-opcode loop (probably due to qc-tests being run in a different thread).
The numbers seem to correspond to your perf capture, and it seems that almost all of the cloning time is spent in ops::handlers::generate, building a new opcode handler table. This is obviously wasteful and unnecessary to do for each execution, and perhaps it's possible to use lazy_static! here as well.
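As a rough illustration of the caching idea discussed above (not r68k's actual code), here is a minimal sketch using today's std::sync::OnceLock; generate_table is a made-up stand-in for ops::handlers::generate:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for building r68k's 64K-entry opcode handler table.
fn generate_table() -> Vec<u32> {
    (0..0x10000).collect()
}

fn handler_table() -> &'static Vec<u32> {
    static TABLE: OnceLock<Vec<u32>> = OnceLock::new();
    // generate_table() runs only on the first call; later calls just load a pointer.
    TABLE.get_or_init(generate_table)
}

fn main() {
    assert_eq!(handler_table().len(), 0x10000);
    // Subsequent calls return the very same cached table.
    assert!(std::ptr::eq(handler_table(), handler_table()));
}
```

The first caller pays the full generation cost; every later caller gets the cached table for the price of an atomic load.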
I'm still not comfortable using that, however. Statics seemed so straightforward and unproblematic back when using C/C++ ;)
from r68k.
It's still possible to use statics for this in Rust, but in this case they are likely considered unsafe (due to multi-threading, which makes sense). It's possible to access the static from within a small unsafe block, so in this case it might make sense to do so.
from r68k.
Something like this:

static mut INIT_CPU_CORE: bool = false;

fn cpu_execute() {
    unsafe {
        if !INIT_CPU_CORE {
            println!("Init core!");
            INIT_CPU_CORE = true;
        }
    }
    println!("Run instruction!");
}

fn main() {
    cpu_execute();
    cpu_execute();
    cpu_execute();
}
from r68k.
Actually this may be better https://doc.rust-lang.org/std/sync/struct.Once.html
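A minimal runnable sketch of std::sync::Once (written against today's Once::new() API); the atomic counter is only there to demonstrate that the closure runs exactly once, however many times cpu_execute is called:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

static INIT: Once = Once::new();
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

fn cpu_execute() {
    // call_once runs the closure exactly once, even across threads.
    INIT.call_once(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        println!("Init core!");
    });
    println!("Run instruction!");
}

fn main() {
    cpu_execute();
    cpu_execute();
    cpu_execute();
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
}
```

Unlike the static mut version above, this needs no unsafe block and is safe to call from multiple threads.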
from r68k.
If you look at https://github.com/marhel/r68k/blob/master/src/musashi.rs#L305 we already use the lazy_static macro (which appears to use Once under the covers) to create global mutexes for serializing the QuickCheck and Musashi operations. I meant that I don't have a good intuition of how "bad" this is, or of the performance penalty of accessing the static later.
from r68k.
Should be way better than it is now :)
from r68k.
And yeah, the lazy_static macro is doing lots of funky things under the hood. I think just a Once plus a static would be fine, as long as we know we initialize it from a single thread first.
from r68k.
Wow, doing one-time initialization via the lazy_static! macro resulted in a major speedup.
The first call, which pays the initialization cost, is the same as before, about 800ms:
Timing information for root profiler:
hammer - 1 * 795.1ms = 795.1ms (100.0%)
  create Core - 1 * 4.6ms = 4.6ms (0.6%)
  clone Core - 1 * 786.0ms = 786.0ms (98.8%)
  musashi.reset_and_execute1 - 1 * 4.5ms = 4.5ms (0.6%)
  r68k.execute1 - 1 * 21.9us = 21.9us (0.0%)
  assert_cores_equal - 1 * 21.9us = 21.9us (0.0%)
But subsequent calls are 100-200 times faster, at around 5ms each:
Timing information for root profiler:
hammer - 1 * 4.3ms = 4.3ms (99.6%)
  create Core - 1 * 267.7us = 267.7us (6.3%)
  clone Core - 1 * 2.2ms = 2.2ms (51.7%)
  musashi.reset_and_execute1 - 1 * 1.8ms = 1.8ms (41.2%)
  r68k.execute1 - 1 * 7.8us = 7.8us (0.2%)
  assert_cores_equal - 1 * 7.6us = 7.6us (0.2%)
A test run of a MASK_OUT_X_Y instruction, matching 64 opcodes, that (at 10 qc-tests per opcode) used to take at least 64 * 0.8s * 10, or about 8.5 minutes, will now take more like 64 * 0.005s * 10 + 0.8s, or about 4 seconds.
I just need to clean up the code a bit before committing, but it looks very promising!
from r68k.
Nice :) I think I will take a peek at why the setup is so slow. It's not a major issue, as not calling it every time is of course faster, but I'm a bit curious :)
from r68k.
I think it's due to the following loop:

for op in optable {
    for opcode in op.matching..0x10000 {
        if (opcode & op.mask) == op.matching {
            handler[opcode as usize] = op.handler;
        }
    }
}
With around 1K ops, looping (up to) 64K times for each op to find the matching opcodes adds up to a lot of iterations. I checked, and it currently adds up to around 32M iterations. We start looping at the appropriate value, but don't know when to stop.
from r68k.
I see
from r68k.
The following quick fix breaks out of the iteration early (as soon as we've found the maximum number of matching opcodes), reducing the iteration count about 16 times, to 1.8M turns; it takes about 50-60ms instead of 800 :)

let mut turns = 0;
for op in optable {
    let max_count = 1 << (op.mask as u16).count_zeros();
    let mut matching = 0;
    for opcode in op.matching..0x10000 {
        if (opcode & op.mask) == op.matching {
            // println!("{:16b}: {}", opcode, op.name);
            handler[opcode as usize] = op.handler;
            matching += 1;
            if matching >= max_count { break; }
        }
        turns += 1;
    }
}
However, a better matching algorithm would need only about 29K iterations (that's how many opcodes we've implemented now). Building lookup tables from MASK_OUT_X_Y to the 64 offsets to add to the base value (matching) would be one example. But that's bordering on micro-optimizing stuff that isn't going to run often (nowadays).
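To make the lookup-table idea above concrete, here is a hedged sketch (matching_opcodes is a hypothetical helper, not r68k's API, and the bit patterns are chosen for illustration) that enumerates exactly the matching opcodes by spreading a counter over the mask's zero bits, so a MASK_OUT_X_Y-style op costs exactly 64 iterations:

```rust
// Instead of scanning all 64K opcode values per op, visit only the opcodes
// matching (opcode & mask) == base by distributing a counter over the
// zero ("free") bits of the mask.
fn matching_opcodes(mask: u16, base: u16) -> Vec<u16> {
    // Positions where the mask is 0 are the free bits (e.g. the x and y regs).
    let free_bits: Vec<u16> = (0..16u16).filter(|&b| mask & (1 << b) == 0).collect();
    let count = 1usize << free_bits.len();
    (0..count)
        .map(|i| {
            let mut opcode = base;
            // Distribute the bits of `i` over the free bit positions.
            for (j, &bit) in free_bits.iter().enumerate() {
                if i & (1 << j) != 0 {
                    opcode |= 1 << bit;
                }
            }
            opcode
        })
        .collect()
}

fn main() {
    // A MASK_OUT_X_Y-style mask: bits 0-2 (y) and 9-11 (x) are free.
    let mask: u16 = 0b1111_0001_1111_1000;
    let base: u16 = 0b1101_0000_0100_0000;
    let ops = matching_opcodes(mask, base);
    assert_eq!(ops.len(), 64); // 2^6 free-bit combinations
    assert!(ops.iter().all(|&op| op & mask == base));
}
```

Each op then costs exactly 2^(zero bits of mask) iterations, which for the roughly 1K-entry optable comes out near the ~29K total mentioned above.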
from r68k.
Cool :) I think this quick fix is pretty good for now though as 50-60 ms is way better than 800 :)
from r68k.
Ok, that was fun! We're now down to ~10 ms without lazy_static! No extra iterations, thanks to a lookup-table for MASK_OUT_X and MASK_OUT_X_Y. Adding lazy_static! caused a bit of noise everywhere the handlers were set, so I need to refactor that a bit (for example, the default is currently the test implementations of handlers that I added early on to verify that execution via handlers works at all; this should of course be changed to default to the real handlers).
from r68k.
Awesome :)
from r68k.
Also, this made me find two opcodes that had the wrong mask, and I added a test to detect this condition: if the mask has zeroes where the op-constant has ones, we won't generate any handlers, AND it wouldn't be caught in the QC-test, as the test would match zero opcodes and just finish really quickly.
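The described sanity check could look something like this sketch (mask_is_consistent is a hypothetical name, and the bit patterns are made up for illustration):

```rust
// If the op-constant (`matching`) has a 1 bit in a position the mask zeroes
// out, then (opcode & mask) == matching can never hold, so no handler would
// ever be generated for that op.
fn mask_is_consistent(mask: u16, matching: u16) -> bool {
    matching & !mask == 0
}

fn main() {
    // Well-formed: every 1 bit of `matching` is covered by the mask.
    assert!(mask_is_consistent(0b1111_0001_1111_1000, 0b1101_0000_0100_0000));
    // Broken: bit 1 is set in `matching` but masked out, so it matches nothing.
    assert!(!mask_is_consistent(0b1111_0001_1111_1000, 0b1101_0000_0100_0010));
}
```

Running this check over the whole optable would have flagged the two bad masks immediately instead of letting them silently match zero opcodes.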
from r68k.
Nice!
from r68k.
After living with the current implementation for a while, I think this is good enough for now. There's more important stuff to focus on to reach completeness.
from r68k.