andreas-abel / nanoBench
A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
Home Page: http://www.uops.info
License: GNU Affero General Public License v3.0
Hi.
I have tried to measure the performance counters related to the decoder parts, i.e., uops dispatched from the legacy x86 decoder (DeDisUopsFromDecoder.DecoderDispatched) or from the micro-op cache (DeDisUopsFromDecoder.OpCacheDispatched).
I have tested with a simple code snippet consisting of 8 multi-byte nops (each 4 bytes long) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder, not the micro-op cache.
Command:
sudo ./kernel-nanoBench.sh -basic_mode -unroll_count 1 -loop_count 100000 -cpu 1 -asm "nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax" -config configs/cfg_Zen_all.txt | grep -i "dedisuops"
Results (I slightly modified the source code to dump the absolute measured counters):
DeDisUopsFromDecoder.DecoderDispatched: 10.00 (1000019)
DeDisUopsFromDecoder.OpCacheDispatched: 0.00 (0)
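As a sanity check on the per-iteration figure of 10.00: a quick sketch, assuming (this assumption is mine, not from the nanoBench docs) that the -loop_count wrapper adds roughly two uops per iteration for a counter decrement and a taken branch.

```python
# Back-of-the-envelope check of the per-iteration DecoderDispatched count.
# Assumption (not from the source): the -loop_count wrapper adds roughly
# 2 uops per iteration (a counter decrement plus a taken branch).
nops = 8            # the 8 multi-byte nops in the benchmark body
loop_overhead = 2   # assumed dec + jnz of the loop wrapper
per_iteration_uops = nops + loop_overhead
print(per_iteration_uops)  # 10, matching the 10.00 reported above
```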
I cannot understand why every instruction is decoded by the legacy x86 decoder.
I also checked with a simple test program consisting of the same code pattern (see below).
test.s (build: nasm -f elf64 test.s -o test.o; ld test.o -o test):
global _start
_start:
mov rdi, 100000
call test_uop_cache_hit
mov rax, 60
mov rdi, 0
syscall
test_uop_cache_hit:
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
dec rdi
jnz test_uop_cache_hit
ret
Then, I checked the performance counters with the perf tool.
$perf stat -e cycles,instructions,r01AA,r02AA,r03AA ./test
Performance counter stats for './test':
298349 cycles
1037949 instructions # 3.48 insn per cycle
86233 r01AA
999280 r02AA
1085721 r03AA
0.000433346 seconds time elapsed
The results show that most uops are delivered by the micro-op cache (r01AA = uops from the legacy x86 decoder; r02AA = uops from the micro-op cache; r03AA = all uops).
Why do nanoBench and perf show different results?
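For what it's worth, the perf counts can be reduced to a single fraction; a quick sketch using the raw numbers from the perf run above:

```python
# Fraction of dispatched uops coming from the op cache, using the raw
# counts from the perf run above (r01AA = legacy decoder, r02AA = op
# cache, r03AA = all dispatched uops).
decoder = 86233
opcache = 999280
total = 1085721
opcache_pct = round(opcache / total * 100, 1)
print(opcache_pct)  # 92.0 -> the vast majority came from the op cache
```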
Sincerely.
Joonsung Kim.
Please consider submitting the module to the Linux kernel.
It can be a lot of work to clean up initially, and it can make iterating on it harder later on; but on the other hand, it won't be broken by changes in internal kernel APIs, it will be seen and used by more people, and others will likely maintain it if you decide not to at some point.
Thank you!
When I run nanoBench/tools/CacheAnalyzer/cacheInfo.py, I get the error "rdmsr: CPU 0 cannot read MSR 0x00000396". How can I solve this? Thanks
I am trying to setup nanoBench and use the kernel space implementation on a Haswell machine.
I built the code for the driver and installed it. However, when I run the example the process gets killed.
sudo ./kernel-nanoBench.sh -asm "ADD RAX, RBX; add RBX, RAX" -config configs/cfg_Haswell_common.txt
./kernel-nanoBench.sh: line 122: 21457 Killed $taskset cat /proc/nanoBench
Could you add timings for rdfsbase, rdgsbase, wrfsbase, and wrgsbase to the uops.info tables?
I've seen someone use these to use FS and GS as additional address registers in complicated numerical code, so they might be relevant for performance.
The kernel module fails to build on Ubuntu 16.04 with kernel 4.4:
cd kernel; make
make[1]: Entering directory '/home/travis/dev/nanoBench/kernel'
make -C /lib/modules/4.4.0-170-generic/build M=/home/travis/dev/nanoBench/kernel modules
make[2]: Entering directory '/usr/src/linux-headers-4.4.0-170-generic'
CC [M] /home/travis/dev/nanoBench/kernel/nb_km.o
/home/travis/dev/nanoBench/kernel/nb_km.c:18:10: fatal error: linux/set_memory.h: No such file or directory
#include <linux/set_memory.h>
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
I think the set_memory functions were instead in asm/cacheflush.h on earlier kernels (I don't know exactly when the cutover happened).
Maybe there is a way to use the correct header, e.g., depending on which header is available?
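A minimal sketch of such a version switch, assuming the header moved around kernel 4.12 (the exact cutover version is a guess and should be checked against the kernel history):

```c
#include <linux/version.h>

/* linux/set_memory.h only exists on newer kernels; older ones exposed
 * the set_memory_* functions via asm/cacheflush.h (cutover assumed ~4.12). */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 12, 0)
#include <linux/set_memory.h>
#else
#include <asm/cacheflush.h>
#endif
```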
This is a report regarding the uops.info table, specifically latency figures for in-place zero extension.
There are separate experiments for mov r32, <other> (latency 0) and mov r32, <same> (latency 1) on Intel CPUs starting from Ivy Bridge, but excluding Alder Lake. It appears that on Alder Lake the behavior is unchanged: in-place zero extension is not move-eliminated.
https://uops.info/html-instr/MOV_8B_R32_R32.html
https://uops.info/html-instr/MOV_89_R32_R32.html
My experiments indicate that AMD Zen 2 successfully eliminates in-place zero-extension, for example, the following runs at one cycle per iteration:
.loop:
mov eax, eax
inc rax
dec ecx
jnz .loop
Many thanks for making and maintaining this compendium.
You measure many latency stats for gathers, which is awesome (and a very important formalization of the way we think about latency), but I think you are missing the most important one.
That is the 2 -> 1 (address) latency, but through the vector index register rather than the base register. That's probably the most common latency chain you'll see in practice, because it generalizes the notion of pointer chasing. That is, a loop like:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm14,ymm0,ymm0
On my SKL machine, I measure the same latency (22) for this as for the 3 -> 1 latency.
https://uops.info/html-lat/SKX/PINSRW_XMM_R32_I8-Measurements.html#lat1-%3E1
The 1 -> 1 experiments only use pinsrw xmm, r32, imm alone, or pinsrw with an XMM->XMM dep chain created by shufpd or pshufd.
But pinsrw itself is 2 uops for port 5 on Intel, presumably a movd-equivalent uop feeding a 2-input shuffle. One would expect that the GP->XMM (movd) uop could run early if there were a free port, leaving the critical-path latency from 1 -> 1 at only 1 cycle.
But resource conflicts with the dep chain prevent this from being demonstrated. Perhaps pand xmm0, xmm0 or orps xmm0, xmm0 would be a better choice for at least one of the experiments. (I guess shufpd and pshufd are meant to look for bypass latency between integer and FP shuffles?)
On Zen 4, summary of vpternlogd latency experiments is given as
Latency operand 1 → 1: 1
Latency operand 2 → 1: 2
Latency operand 3 → 1: 1
https://uops.info/html-lat/ZEN4/VPTERNLOGD_ZMM_ZMM_ZMM_I8-Measurements.html
but I don't see a substantial difference between the 3 → 1 and 2 → 1 experiments, or any difference w.r.t. its vpternlogq sibling, for which all latencies are listed as 1. Shouldn't both the dword and qword variants be listed with latency 2 for operands 2 and 3? What am I missing?
If I'm reading Agner's testing harness right, his latency experiment times
vpternlogd zmm0, zmm1, zmm2
vpternlogd zmm2, zmm1, zmm0
repeated 50 times. He lists the latency of ternlog on Zen 4 as 1 cycle in all cases (but if the latency from the second operand is indeed 2, his experiment wouldn't uncover that).
(unfortunately I do not have access to a Zen 4 machine to run more experiments)
Hi, I am trying to read MSR_PKG_ENERGY_STATUS using user-mode nanoBench (as declared in configs/msr_RAPL.txt).
However, it fails with the error 'invalid configuration: msr_611'.
I inspected the code and I suspect that it comes from the following line:
Line 199 in faf7523
where it doesn't find any '.' in the configuration line.
Was an architectural decision made to only support MSR_3F6H, MSR_PF, MSR_RSP0, and MSR_RSP1 in user space? Is MSR_PKG_ENERGY_STATUS supported in kernel space?
Thanks a lot!
I'm not sure if this the place to file this issue: I didn't find a github page for uops.info specifically, but if there's a better place let me know.
There is something weird with port reporting for gather ops.
Consider VPGATHERDD, for example. It is reported as 1*p0+3*p23+1*p5, but this page and other pages clearly show that it sends 8 uops total to p23.
Also, this:
With blocking instructions for ports {2, 3}:
Code:
0: c4 c1 7a 6f 56 40 vmovdqu xmm2,XMMWORD PTR [r14+0x40]
6: c4 c1 7a 6f 5e 40 vmovdqu xmm3,XMMWORD PTR [r14+0x40]
c: c4 c1 7a 6f 66 40 vmovdqu xmm4,XMMWORD PTR [r14+0x40]
12: c4 c1 7a 6f 6e 40 vmovdqu xmm5,XMMWORD PTR [r14+0x40]
18: c4 c1 7a 6f 76 40 vmovdqu xmm6,XMMWORD PTR [r14+0x40]
1e: c4 c1 7a 6f 7e 40 vmovdqu xmm7,XMMWORD PTR [r14+0x40]
24: c4 41 7a 6f 46 40 vmovdqu xmm8,XMMWORD PTR [r14+0x40]
2a: c4 41 7a 6f 4e 40 vmovdqu xmm9,XMMWORD PTR [r14+0x40]
30: c4 41 7a 6f 56 40 vmovdqu xmm10,XMMWORD PTR [r14+0x40]
36: c4 41 7a 6f 5e 40 vmovdqu xmm11,XMMWORD PTR [r14+0x40]
3c: c4 c1 7a 6f 56 40 vmovdqu xmm2,XMMWORD PTR [r14+0x40]
42: c4 c1 7a 6f 5e 40 vmovdqu xmm3,XMMWORD PTR [r14+0x40]
48: c4 c1 7a 6f 66 40 vmovdqu xmm4,XMMWORD PTR [r14+0x40]
4e: c4 c1 7a 6f 6e 40 vmovdqu xmm5,XMMWORD PTR [r14+0x40]
54: c4 c1 7a 6f 76 40 vmovdqu xmm6,XMMWORD PTR [r14+0x40]
5a: c4 c1 7a 6f 7e 40 vmovdqu xmm7,XMMWORD PTR [r14+0x40]
60: c4 41 7a 6f 46 40 vmovdqu xmm8,XMMWORD PTR [r14+0x40]
66: c4 41 7a 6f 4e 40 vmovdqu xmm9,XMMWORD PTR [r14+0x40]
6c: c4 41 7a 6f 56 40 vmovdqu xmm10,XMMWORD PTR [r14+0x40]
72: c4 41 7a 6f 5e 40 vmovdqu xmm11,XMMWORD PTR [r14+0x40]
78: c4 82 75 90 04 36 vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
Init:
VZEROALL;
VPGATHERDD YMM0, [R14+YMM14], YMM1;
VXORPS YMM14, YMM14, YMM14;
VPGATHERDD YMM1, [R14+YMM14], YMM0
warm_up_count: 100
Show nanoBench command
Results:
Instructions retired: 21.00
Core cycles: 15.00
Reference cycles: 13.68
UOPS_PORT2: 14.00
UOPS_PORT3: 14.00
⇨ 3 μops that can only use ports {2, 3}
I don't understand the conclusion. I think the idea is that you have 20 instructions which send 1 uop each to p23, which would nominally execute in 10 cycles (20/2), and then you see how much adding the instruction under test increases the runtime, assuming the bottleneck is port pressure. Here, you get to 15 cycles, a difference of 5 cycles. How does that equal 3 uops?
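One way to reconcile the numbers is to back the gather's p23 uops out of the measured port counters directly; a sketch, assuming the 20 blocking vmovdqu instructions contribute exactly 1 p23 uop each:

```python
# UOPS_PORT2 + UOPS_PORT3 from the results above, minus the 20 uops of
# the blocking loads, gives the gather's own p23 uop count.
blocking_uops = 20
uops_port2 = 14.0
uops_port3 = 14.0
gather_p23_uops = uops_port2 + uops_port3 - blocking_uops
print(gather_p23_uops)  # 8.0, matching the 8 uops mentioned above
```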
Thanks again for uops.info, it is great :).
If you look at 3 fused domain uop instructions (no memory operands) in Haswell, many have 1.0/2.0 for expected/measured throughput. Most of these are 1.0/1.0 on Skylake, as below:
Did the instruction throughput actually improve so much in Skylake for these instructions? I don't think so!
The effect comes from a combination of factors. One is that nearly all of these tests run out of the MITE (legacy) decoder (as reported by the perf counters). This is mostly because the uops are "dense" enough that they exceed the 18-uops-per-32-bytes limit of the uop cache.
Then, decoding limitations on Haswell kick in. Haswell can't decode in a 3-1 pattern (but Skylake can), so tests that interleave dependency-breaking instructions with the payload instruction, like:
0: 48 31 c0 xor rax,rax
3: 41 f7 e0 mul r8d
end up taking 2 cycles to decode the two instructions. That's why throughputs of 2.0 appear all over the place in the Haswell results. Most of the other test variants can't crack 2.0 cycles because of dependency chains.
One approach to getting close to the true throughput would be to avoid breaking out of the uop cache, e.g., by using an occasional large nop to space out the instructions. For cases where you have unrolled 4 uops with 4 dependency-breaking instructions, you could group the dependency-breaking (1 uop) and payload (complex) instructions for better decoding. E.g.:
xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
xadd eax, eax
xadd ebx, ebx
xadd ecx, ecx
xadd edx, edx
will decode more efficiently (5 cycles) than with full interleaving (8 cycles).
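The 8-vs-5 cycle figures follow from a rough decode model; a sketch, assuming (my assumption) a 4-wide Haswell front end where only the first decoder handles multi-uop instructions:

```python
# Rough decode-cycle model for the two orderings above.
# Interleaved xor/xadd: Haswell can't decode a 3-1 pattern, so each
# (xor, xadd) pair costs a full decode cycle per instruction group.
interleaved_cycles = 4 * 2   # 4 pairs, 2 cycles each
# Grouped: the 4 single-uop xors decode together in 1 cycle; each
# multi-uop xadd then needs the complex decoder, 1 per cycle.
grouped_cycles = 1 + 4
print(interleaved_cycles, grouped_cycles)  # 8 5
```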
Hi
I was wondering if there is any pipeline support for handling high-level source code (preferably hooks, like IACA's, that can be inserted in the source) to get the stats of a region or of functions.
If yes, any help with this would be really appreciated!
Thanks!
Hi, I'm trying to use the cache analyzer tool. However, the process gets killed due to errors in the kernel module, and the PC usually slowly dies and needs a restart. Here is a segment of the dmesg output after running 'sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"':
[ 122.924677] nb: module verification failed: signature and/or required key missing - tainting kernel
[ 122.925359] Initializing nanoBench kernel module...
[ 123.037080] Vendor ID: GenuineIntel
[ 123.037089] Brand: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
[ 123.037092] DisplayFamily_DisplayModel: 06_9EH
[ 123.037095] Stepping ID: 9
[ 123.037097] Performance monitoring version: 4
[ 123.037099] Number of fixed-function performance counters: 3
[ 123.037101] Number of general-purpose performance counters: 4
[ 123.037102] Bit widths of fixed-function performance counters: 48
[ 123.037104] Bit widths of general-purpose performance counters: 48
[ 133.965640] No physically contiguous memory area of the requested size found.
[ 133.965644] Try rebooting your computer.
[ 246.783643] msr_str: 0xe01
[ 246.783646] msr_str: 0x700
[ 246.783648] msr_str: 0xe01
[ 246.783649] msr_str: 0x710
[ 246.783650] msr_str: 0xe01
[ 246.783651] msr_str: 0x720
[ 246.783652] msr_str: 0xe01
[ 246.783653] msr_str: 0x730
[ 246.941670] BUG: unable to handle page fault for address: ffffafa028927e71
[ 246.941674] #PF: supervisor instruction fetch in kernel mode
[ 246.941676] #PF: error_code(0x0010) - not-present page
[ 246.941677] PGD 100000067 P4D 100000067 PUD 0
[ 246.941680] Oops: 0010 [#1] SMP PTI
[ 246.941682] CPU: 4 PID: 2321 Comm: python3 Tainted: G OE 5.15.0-53-generic #59~20.04.1-Ubuntu
[ 246.941685] Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
[ 246.941686] RIP: 0010:0xffffafa028927e71
[ 246.941689] Code: Unable to access opcode bytes at RIP 0xffffafa028927e47.
I've attempted this with an Intel i7-9750H, i9-12900K, and now an i7-7700. Using the i7-7700, I'm testing on a fresh install of Ubuntu 20, kernel version 5.15. The set-R14-size.sh script almost always fails (even after a reboot) when using 'sudo ./set-R14-size.sh 1G'. However, if I request more memory, the allocation sometimes succeeds. Before the dmesg above, I tried 1G, then around 1200M. This seems a bit strange; could it be the issue? Or is there anything else that I'm obviously missing? Here is an example command sequence that I'm using after boot:
cd nanoBench
make kernel
sudo insmod kernel/nb.ko
sudo ./set-R14-size.sh 1200M
cd tools/CacheAnalyzer
sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"
What is the definition of latency that you want to use, exactly?
In particular, consider a hypothetical operation foo arg1, arg2, arg3 which is 3p0. This instruction will have a throughput of 3 due to p0 pressure. Can this op have any latency less than 3? I think yes.
For example, the op might have only a 1-cycle delay from arg2 -> arg1, because the first uops only use arg3, and then the last uop uses arg2 and arg3.
However, testing back-to-back foo ops will never show this because of the throughput limit. I think you are probably well aware of this, since I notice lots of filler uops in tests, like:
0: c4 42 38 f2 ca andn r9d,r8d,r10d
5: 4d 63 c1 movsxd r8,r9d
8: 4d 63 c8 movsxd r9,r8d
b: 4d 63 c1 movsxd r8,r9d
All the movsxd instructions give enough breathing room to avoid many problems of this type.
However, consider gathers. For 1->1 latency testing this is used:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
No breathing room, so all these results just end up reporting the throughput number (5 in this case).
The following test:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
also runs in 5 cycles, so we see the true 1->1 latency is 1 cycle.
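The padding argument can be stated as simple arithmetic; a sketch, taking the usual 1-cycle vpor latency on SKL as an assumption:

```python
# If the gather alone is throughput-limited at 5 cycles, and inserting
# 4 dependent vpor ops (1 cycle latency each) into the chain does not
# increase the runtime, the gather's 1->1 latency is at most 5 - 4 = 1.
gather_throughput = 5   # cycles per iteration, gather alone
filler_count = 4        # dependent vpor instructions added
filler_latency = 1      # assumed cycles each
max_gather_latency = gather_throughput - filler_count * filler_latency
print(max_gather_latency)  # 1
```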
I noticed today that cmov is missing the latency from operand 1 -> 1.
For cmov, the first operand is read-write, as it is for, say, add.
https://github.com/andreas-abel/nanoBench/blob/4e7954f/tools/CacheAnalyzer/replPolicy.py#L61
The code suggests that if I leave -sets empty, all sets will be tested.
However, running it that way results in this error:
https://github.com/andreas-abel/nanoBench/blob/d70535f/tools/CacheAnalyzer/cacheSim.py#L309