Comments (27)
Agreed.
Also, it would have been great to re-apply the changes made to Musashi that allow it to run from multiple threads, which would speed up the overall test time.
Also for testing performance I think this may be helpful https://github.com/cmr/hprof
from r68k.
Yes, that crate seems useful! And multithreading Musashi would be great, but a lot of effort, given that running the tests in parallel (using ./pqc) probably achieves at least 3/4 of the performance boost already. I'd rather make each test quicker at this point.
from r68k.
Yeah, I guess that is good enough. Using the hprof crate should pin-point where most time is spent, at least.
from r68k.
Did a perf capture (in Instruments)
Running Time   Self (ms)  Symbol Name
726.0ms  4.4%    0.0  _pthread_body
  726.0ms  4.4%    0.0  sys::thread::_$LT$impl$GT$::new::thread_start::h6c3076098329c459IZw
    726.0ms  4.4%    0.0  fnfn
      726.0ms  4.4%    0.0  r68k::sys_common::unwind::try<closure>
        726.0ms  4.4%    0.0  __rust_try
          726.0ms  4.4%    0.0  fnfn
            726.0ms  4.4%    0.0  r68k::musashi::tests::hammer_cores_even_addresses
              726.0ms  4.4%    0.0  r68k::musashi::tests::hammer_cores_with
                720.0ms  4.4%    0.0  r68k::cpu::Core.Clone::clone
                  718.0ms  4.4%    0.0  r68k::cpu::ops::instruction_set
                    718.0ms  4.4%  208.0  r68k::cpu::ops::handlers::generate
                      486.0ms  3.0%  261.0  r68k::iter::ops::Range<A>.Iterator::next
                        113.0ms  0.6%   80.0  r68k::mem::swap<u32>
                          20.0ms  0.1%   20.0  r68k::mem::uninitialized<u32>
                          13.0ms  0.0%   13.0  r68k::mem::forget<u32>
                        61.0ms  0.3%   28.0  r68k::ops::&'b u32.Add<&'a u32>::add
                          33.0ms  0.2%   33.0  r68k::ops::u32.Add::add
                        19.0ms  0.1%   19.0  r68k::cmp::impls::u32.PartialOrd::lt
                        18.0ms  0.1%    0.0  <Unknown Address>
                        14.0ms  0.0%   14.0  r68k::num::u32.One::one
                      20.0ms  0.1%    0.0  <Unknown Address>
                        12.0ms  0.0%   12.0  r68k::ops::&'b u32.Add<&'a u32>::add
                         8.0ms  0.0%    8.0  r68k::cmp::impls::u32.PartialOrd::lt
                       2.0ms  0.0%    0.0  r68k::iter::ops::Range<A>.Iterator::next
                         2.0ms  0.0%    1.0  r68k::mem::swap<i32>
                           1.0ms  0.0%    1.0  r68k::mem::uninitialized<i32>
                       2.0ms  0.0%    0.0  r68k::vec::Vec<T>::push
                         2.0ms  0.0%    0.0  r68k::vec::Vec<T>.ops::DerefMut::deref_mut
                           1.0ms  0.0%    1.0  r68k::ptr::*mut T::is_null
                           1.0ms  0.0%    1.0  r68k::slice::from_raw_parts_mut<fn(&mut r68k::cpu::Core) -> core::result::Result<r68k::cpu::Cycles, r68k::cpu::Exception>>
                  2.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>.AddressBus::copy_from
                4.0ms  0.0%    0.0  r68k::cpu::Core::new_mem
                2.0ms  0.0%    0.0  r68k::musashi::reset_and_execute1
                  2.0ms  0.0%    0.0  r68k::musashi::initialize_musashi
                    2.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>.AddressBus::read_byte
                      1.0ms  0.0%    0.0  r68k::ram::LoggingMem<T>::read_u8
                      1.0ms  0.0%    0.0  core..cell..BorrowRefMut::drop.47409::h71287f79163fcfd2
from r68k.
Doing the clone here takes a fair amount of time
from r68k.
It's not even a proper clone, so we should probably rename that method. As Musashi's memory-access functions are global, we need to reset memory before testing each opcode. There's a fair amount of cheating to avoid having to re-set all 16 MB. This is an area where a modified Musashi without globals (or one that passes along an "instance token" or something) would help a lot!
from r68k.
I wonder then.. Would it be possible to use the MAME version instead? It has this fixed already, along with many other fixes that would be nice.
from r68k.
Possibly the latest version is too much work, as it's in C++? I worry about name mangling and similar issues, though perhaps you can mark the communication functions as extern "C" to avoid that? But I think the "non-globalization" fix was done before the ceeplusplusification, so perhaps that would be a simpler target?
from r68k.
The way I do this is something like this:

extern "C" void some_function_to_call()
{
    my_cpp_function();
}

But that really comes down to how many functions from the C++ version are needed. Otherwise, going for the pre-C++ version might indeed be easier.
from r68k.
I've done a stab at integrating hprof, and the output confirms that when executing the "hammer_cores" fn, the main culprit is the cloning, which both copies the first page (1 KB) of memory associated with the core (in order to capture at least the portion of memory containing the injected opcode) and builds the opcode handler table.
hammer - 1 * 807.4ms = 807.4ms (100.0%)
  create Core - 1 * 4.5ms = 4.5ms (0.6%)
  clone Core - 1 * 799.0ms = 799.0ms (99.0%)
  musashi.reset_and_execute1 - 1 * 3.8ms = 3.8ms (0.5%)
  r68k.execute1 - 1 * 13.2us = 13.2us (0.0%)
  assert_cores_equal - 1 * 12.9us = 12.9us (0.0%)
This is just for a single call to "hammer_cores", because I couldn't get the reporting to work when I put the hprof stuff in the QC-per-opcode loop (probably due to qc-tests being run in a different thread).
The numbers seem to correspond to your perf capture, and it seems that almost all of the cloning time is spent in ops::handlers::generate, building a new opcode handler table. This is obviously wasteful and unnecessary to do for each execution, and perhaps it's possible to use lazy_static! here as well.
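As a rough illustration of the caching idea discussed above (not r68k's actual code), here is a minimal sketch using today's std::sync::OnceLock; generate_table is a made-up stand-in for ops::handlers::generate:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for building r68k's 64K-entry opcode handler table.
fn generate_table() -> Vec<u32> {
    (0..0x10000).collect()
}

fn handler_table() -> &'static Vec<u32> {
    static TABLE: OnceLock<Vec<u32>> = OnceLock::new();
    // generate_table() runs only on the first call; later calls just load a pointer.
    TABLE.get_or_init(generate_table)
}

fn main() {
    assert_eq!(handler_table().len(), 0x10000);
    // Subsequent calls return the very same cached table.
    assert!(std::ptr::eq(handler_table(), handler_table()));
}
```

The first caller pays the full generation cost; every later caller gets the cached table for the price of an atomic load.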
I'm still not comfortable using that, however. Statics seemed so straightforward and unproblematic back when using C/C++ ;)
from r68k.
It's still possible to use statics for this in Rust, but in this case they are likely considered unsafe (due to multi-threading, which makes sense). It's possible to access the static from within a small unsafe block, so in this case it might make sense to do so.
from r68k.
Something like this:

static mut INIT_CPU_CORE: bool = false;

fn cpu_execute() {
    unsafe {
        if !INIT_CPU_CORE {
            println!("Init core!");
            INIT_CPU_CORE = true;
        }
    }
    println!("Run instruction!");
}

fn main() {
    cpu_execute();
    cpu_execute();
    cpu_execute();
}
from r68k.
Actually this may be better https://doc.rust-lang.org/std/sync/struct.Once.html
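A minimal runnable sketch of std::sync::Once (written against today's Once::new() API); the atomic counter is only there to demonstrate that the closure runs exactly once, however many times cpu_execute is called:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

static INIT: Once = Once::new();
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

fn cpu_execute() {
    // call_once runs the closure exactly once, even across threads.
    INIT.call_once(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        println!("Init core!");
    });
    println!("Run instruction!");
}

fn main() {
    cpu_execute();
    cpu_execute();
    cpu_execute();
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
}
```

Unlike the static mut version above, this needs no unsafe block and is safe to call from multiple threads.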
from r68k.
If you look at https://github.com/marhel/r68k/blob/master/src/musashi.rs#L305 we already use the lazy_static macro (which appears to use Once under the covers) to create global mutexes for serializing the QuickCheck and Musashi operations. I meant that I don't have a good intuition of how "bad" this is, or of the performance penalty of accessing the static later.
from r68k.
Should be way better than it is now :)
from r68k.
And yeah, the lazy_static macro is doing lots of funky things under the hood. I think just a Once plus a static would be fine, as long as we know we initialize it from a single thread first.
from r68k.
Wow, doing one-time initialization via the lazy_static! macro resulted in a major speedup.
The first call, which pays the initialization cost, is the same as before, about 800ms:
Timing information for root profiler:
hammer - 1 * 795.1ms = 795.1ms (100.0%)
  create Core - 1 * 4.6ms = 4.6ms (0.6%)
  clone Core - 1 * 786.0ms = 786.0ms (98.8%)
  musashi.reset_and_execute1 - 1 * 4.5ms = 4.5ms (0.6%)
  r68k.execute1 - 1 * 21.9us = 21.9us (0.0%)
  assert_cores_equal - 1 * 21.9us = 21.9us (0.0%)
But subsequent calls are 100-200 times faster, at around 5ms each:
Timing information for root profiler:
hammer - 1 * 4.3ms = 4.3ms (99.6%)
  create Core - 1 * 267.7us = 267.7us (6.3%)
  clone Core - 1 * 2.2ms = 2.2ms (51.7%)
  musashi.reset_and_execute1 - 1 * 1.8ms = 1.8ms (41.2%)
  r68k.execute1 - 1 * 7.8us = 7.8us (0.2%)
  assert_cores_equal - 1 * 7.6us = 7.6us (0.2%)
A test run of a MASK_OUT_X_Y instruction, matching 64 opcodes, that (at 10 qc-tests per opcode) used to take at least 64 * 0.8s * 10, or about 8.5 minutes, will now take more like 64 * 0.005s * 10 + 0.8s, or about 4 seconds.
I just need to clean up the code a bit before committing, but it looks very promising!
from r68k.
Nice :) I think I will take a peek at why the setup is so slow. It's not a major issue, as not calling it every time is of course faster, but I'm a bit curious :)
from r68k.
I think it's due to the following loop:

for op in optable {
    for opcode in op.matching..0x10000 {
        if (opcode & op.mask) == op.matching {
            handler[opcode as usize] = op.handler;
        }
    }
}
With around 1K ops, looping (up to) 64K times for each op to find the matching opcodes adds up to a lot of iterations. I checked, and it currently adds up to around 32M iterations. We start looping at the appropriate value, but don't know when to stop.
from r68k.
I see
from r68k.
The following quick fix breaks out of the iteration early (as soon as we've found the maximum number of matching opcodes), reducing the iteration count about 16 times, to 1.8M turns; it takes about 50-60ms instead of 800 :)

let mut turns = 0;
for op in optable {
    let max_count = 1 << (op.mask as u16).count_zeros();
    let mut matching = 0;
    for opcode in op.matching..0x10000 {
        if (opcode & op.mask) == op.matching {
            // println!("{:16b}: {}", opcode, op.name);
            handler[opcode as usize] = op.handler;
            matching += 1;
            if matching >= max_count { break; }
        }
        turns += 1;
    }
}
However, a better matching algorithm would need only about 29K iterations (that's how many opcodes we've implemented now). Building lookup tables from MASK_OUT_X_Y to the 64 offsets to add to the base value (matching) would be one example. But that's bordering on micro-optimizing stuff that isn't going to run often (nowadays).
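To make the lookup-table idea above concrete, here is a hedged sketch (matching_opcodes is a hypothetical helper, not r68k's API, and the bit patterns are chosen for illustration) that enumerates exactly the matching opcodes by spreading a counter over the mask's zero bits, so a MASK_OUT_X_Y-style op costs exactly 64 iterations:

```rust
// Instead of scanning all 64K opcode values per op, visit only the opcodes
// matching (opcode & mask) == base by distributing a counter over the
// zero ("free") bits of the mask.
fn matching_opcodes(mask: u16, base: u16) -> Vec<u16> {
    // Positions where the mask is 0 are the free bits (e.g. the x and y regs).
    let free_bits: Vec<u16> = (0..16u16).filter(|&b| mask & (1 << b) == 0).collect();
    let count = 1usize << free_bits.len();
    (0..count)
        .map(|i| {
            let mut opcode = base;
            // Distribute the bits of `i` over the free bit positions.
            for (j, &bit) in free_bits.iter().enumerate() {
                if i & (1 << j) != 0 {
                    opcode |= 1 << bit;
                }
            }
            opcode
        })
        .collect()
}

fn main() {
    // A MASK_OUT_X_Y-style mask: bits 0-2 (y) and 9-11 (x) are free.
    let mask: u16 = 0b1111_0001_1111_1000;
    let base: u16 = 0b1101_0000_0100_0000;
    let ops = matching_opcodes(mask, base);
    assert_eq!(ops.len(), 64); // 2^6 free-bit combinations
    assert!(ops.iter().all(|&op| op & mask == base));
}
```

Each op then costs exactly 2^(zero bits of mask) iterations, which for the roughly 1K-entry optable comes out near the ~29K total mentioned above.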
from r68k.
Cool :) I think this quick fix is pretty good for now though as 50-60 ms is way better than 800 :)
from r68k.
Ok, that was fun! We're now down to ~10 ms without lazy_static! No extra iterations, thanks to a lookup-table for MASK_OUT_X and MASK_OUT_X_Y. Adding lazy_static! caused a bit of noise everywhere the handlers were set, so I need to refactor that a bit (for example, the default is currently the test implementations of handlers that I added early on to verify that execution via handlers works at all; this should of course be changed to default to the real handlers).
from r68k.
Awesome :)
from r68k.
Also, this made me find two opcodes that had the wrong mask, and I added a test to detect this condition: if the mask has zeroes where the op-constant has ones, we won't generate any handlers, AND it wouldn't be caught in the QC-test, as the test would match zero opcodes and just finish really quickly.
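The described sanity check could look something like this sketch (mask_is_consistent is a hypothetical name, and the bit patterns are made up for illustration):

```rust
// If the op-constant (`matching`) has a 1 bit in a position the mask zeroes
// out, then (opcode & mask) == matching can never hold, so no handler would
// ever be generated for that op.
fn mask_is_consistent(mask: u16, matching: u16) -> bool {
    matching & !mask == 0
}

fn main() {
    // Well-formed: every 1 bit of `matching` is covered by the mask.
    assert!(mask_is_consistent(0b1111_0001_1111_1000, 0b1101_0000_0100_0000));
    // Broken: bit 1 is set in `matching` but masked out, so it matches nothing.
    assert!(!mask_is_consistent(0b1111_0001_1111_1000, 0b1101_0000_0100_0010));
}
```

Running this check over the whole optable would have flagged the two bad masks immediately instead of letting them silently match zero opcodes.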
from r68k.
Nice!
from r68k.
After living with the current implementation for a while, I think this is good enough for now. There's more important stuff to focus on to reach completeness.
from r68k.