andreas-abel / nanoBench
A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
Home Page: http://www.uops.info
License: GNU Affero General Public License v3.0
Hi.
I have tried to measure the performance counters related to the decoder parts, i.e., uops dispatched from the legacy x86 decoder (DeDisUopsFromDecoder.DecoderDispatched) or from the micro-op cache (DeDisUopsFromDecoder.OpCacheDispatched).
I have tested with a simple code snippet consisting of 8 multi-byte nops (each 4 bytes long) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder, not the micro-op cache.
Command:
sudo ./kernel-nanoBench.sh -basic_mode -unroll_count 1 -loop_count 100000 -cpu 1 -asm "nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax" -config configs/cfg_Zen_all.txt | grep -i "dedisuops"
Results (I slightly modified the source code to dump the absolute measured counters):
DeDisUopsFromDecoder.DecoderDispatched: 10.00 (1000019)
DeDisUopsFromDecoder.OpCacheDispatched: 0.00 (0)
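As a sanity check on the per-iteration figure of 10.00: a quick sketch, assuming (this assumption is mine, not from the nanoBench docs) that the -loop_count wrapper adds roughly two uops per iteration for a counter decrement and a taken branch.

```python
# Back-of-the-envelope check of the per-iteration DecoderDispatched count.
# Assumption (not from the source): the -loop_count wrapper adds roughly
# 2 uops per iteration (a counter decrement plus a taken branch).
nops = 8            # the 8 multi-byte nops in the benchmark body
loop_overhead = 2   # assumed dec + jnz of the loop wrapper
per_iteration_uops = nops + loop_overhead
print(per_iteration_uops)  # 10, matching the 10.00 reported above
```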
I cannot understand why every instruction is decoded by the legacy x86 decoder.
I also checked with a simple test program consisting of the same code pattern (see below).
test.s (build: nasm -f elf64 test.s -o test.o; ld test.o -o test):
global _start
_start:
mov rdi, 100000
call test_uop_cache_hit
mov rax, 60
mov rdi, 0
syscall
test_uop_cache_hit:
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
nop ax
dec rdi
jnz test_uop_cache_hit
ret
Then, I checked the performance counters with the perf tool.
$perf stat -e cycles,instructions,r01AA,r02AA,r03AA ./test
Performance counter stats for './test':
298349 cycles
1037949 instructions # 3.48 insn per cycle
86233 r01AA
999280 r02AA
1085721 r03AA
0.000433346 seconds time elapsed
The results show that most uops are delivered by the micro-op cache (r01AA = uops from the legacy x86 decoder; r02AA = uops from the micro-op cache; r03AA = all uops).
Why do nanoBench and perf show different results?
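For what it's worth, the perf counts can be reduced to a single fraction; a quick sketch using the raw numbers from the perf run above:

```python
# Fraction of dispatched uops coming from the op cache, using the raw
# counts from the perf run above (r01AA = legacy decoder, r02AA = op
# cache, r03AA = all dispatched uops).
decoder = 86233
opcache = 999280
total = 1085721
opcache_pct = round(opcache / total * 100, 1)
print(opcache_pct)  # 92.0 -> the vast majority came from the op cache
```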
Sincerely.
Joonsung Kim.
Please consider submitting the module to the Linux kernel.
It can be a lot of work to clean up initially, and it can make iterating on it harder later on; but on the other hand, it won't be broken by changes in internal kernel APIs, it will be seen and used by more people, and others will likely maintain it if you decide not to at some point.
Thank you!
When I run nanoBench/tools/CacheAnalyzer/cacheInfo.py, I get the error "rdmsr: CPU 0 cannot read MSR 0x00000396". How can I solve this? Thanks
I am trying to setup nanoBench and use the kernel space implementation on a Haswell machine.
I built the code for the driver and installed it. However, when I run the example the process gets killed.
sudo ./kernel-nanoBench.sh -asm "ADD RAX, RBX; add RBX, RAX" -config configs/cfg_Haswell_common.txt
./kernel-nanoBench.sh: line 122: 21457 Killed $taskset cat /proc/nanoBench
Could you add timings for rdfsbase, rdgsbase, wrfsbase, and wrgsbase to the uops.info tables?
I've seen someone use these to use FS and GS as additional address registers in complicated numerical code, so they might be relevant for performance.
The kernel module fails to build on Ubuntu 16.04 with kernel 4.4:
cd kernel; make
make[1]: Entering directory '/home/travis/dev/nanoBench/kernel'
make -C /lib/modules/4.4.0-170-generic/build M=/home/travis/dev/nanoBench/kernel modules
make[2]: Entering directory '/usr/src/linux-headers-4.4.0-170-generic'
CC [M] /home/travis/dev/nanoBench/kernel/nb_km.o
/home/travis/dev/nanoBench/kernel/nb_km.c:18:10: fatal error: linux/set_memory.h: No such file or directory
#include <linux/set_memory.h>
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
I think the set_memory functions were instead in asm/cacheflush.h on earlier kernels (I don't know exactly when the cutover happened).
Maybe there is a way to use the correct header, e.g., depending on which header is available?
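A minimal sketch of such a version switch, assuming the header moved around kernel 4.12 (the exact cutover version is a guess and should be checked against the kernel history):

```c
#include <linux/version.h>

/* linux/set_memory.h only exists on newer kernels; older ones exposed
 * the set_memory_* functions via asm/cacheflush.h (cutover assumed ~4.12). */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 12, 0)
#include <linux/set_memory.h>
#else
#include <asm/cacheflush.h>
#endif
```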
This is a report regarding the uops.info table, specifically latency figures for in-place zero extension.
There are separate experiments for mov r32, <other> (latency 0) and mov r32, <same> (latency 1) on Intel CPUs starting from Ivy Bridge, but excluding Alder Lake. It appears that on Alder Lake the behavior is unchanged: in-place zero extension is not move-eliminated.
https://uops.info/html-instr/MOV_8B_R32_R32.html
https://uops.info/html-instr/MOV_89_R32_R32.html
My experiments indicate that AMD Zen 2 successfully eliminates in-place zero-extension, for example, the following runs at one cycle per iteration:
.loop:
mov eax, eax
inc rax
dec ecx
jnz .loop
Many thanks for making and maintaining this compendium.
You measure many latency stats for gathers, which is awesome (and a very important formalization of the way we think about latency), but I think you are missing the most important one.
That is the 2 -> 1 (address) latency, but through the vector index register rather than the base register. That's probably the most common latency chain you'll see in practice, because it generalizes the notion of pointer chasing. That is, a loop like:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm14,ymm0,ymm0
On my SKL machine, I measure the same latency (22) for this as for the 3 -> 1 latency.
https://uops.info/html-lat/SKX/PINSRW_XMM_R32_I8-Measurements.html#lat1-%3E1
The 1 -> 1 experiments only use pinsrw xmm, r32, imm alone, or pinsrw with an XMM->XMM dep chain created by shufpd or pshufd.
But pinsrw itself is 2 uops for port 5 on Intel, presumably a movd-equivalent uop feeding a 2-input shuffle. One would expect that the GP->XMM (movd) uop could run early if there were a free port, leaving the critical-path latency from 1 -> 1 at only 1 cycle.
But resource conflicts with the dep chain prevent this from being demonstrated. Perhaps pand xmm0, xmm0 or orps xmm0, xmm0 would be a better choice for at least one of the experiments. (I guess shufpd and pshufd are meant to look for bypass latency between integer and FP shuffles?)
On Zen 4, summary of vpternlogd latency experiments is given as
Latency operand 1 → 1: 1
Latency operand 2 → 1: 2
Latency operand 3 → 1: 1
https://uops.info/html-lat/ZEN4/VPTERNLOGD_ZMM_ZMM_ZMM_I8-Measurements.html
but I don't see a substantial difference between the 3 → 1 and 2 → 1 experiments, or any difference w.r.t. its vpternlogq sibling, for which all latencies are listed as 1. Shouldn't both the dword and qword variants be listed with latency 2 for operands 2 and 3? What am I missing?
If I'm reading Agner's testing harness right, his latency experiment times
vpternlogd zmm0, zmm1, zmm2
vpternlogd zmm2, zmm1, zmm0
repeated 50 times. He lists the latency of ternlog on Zen 4 as 1 cycle in all cases (but if the latency from the second operand is indeed 2, his experiment wouldn't uncover that).
(unfortunately I do not have access to a Zen 4 machine to run more experiments)
Hi, I am trying to read MSR_PKG_ENERGY_STATUS using user-mode nanoBench (as declared in configs/msr_RAPL.txt).
However, it fails with the error 'invalid configuration: msr_611'.
I inspected the code and I suspect that it comes from the following line:
Line 199 in faf7523
where it doesn't find any '.' in the configuration line.
Was an architectural decision made to only support MSR_3F6H, MSR_PF, MSR_RSP0, and MSR_RSP1 in user space? Is MSR_PKG_ENERGY_STATUS supported in kernel space?
Thanks a lot!
I'm not sure if this the place to file this issue: I didn't find a github page for uops.info specifically, but if there's a better place let me know.
There is something weird with port reporting for gather ops.
Consider VPGATHERDD, for example. It is reported as 1*p0+3*p23+1*p5, but this page and other pages clearly show that it sends 8 uops total to p23.
Also, this:
With blocking instructions for ports {2, 3}:
Code:
0: c4 c1 7a 6f 56 40 vmovdqu xmm2,XMMWORD PTR [r14+0x40]
6: c4 c1 7a 6f 5e 40 vmovdqu xmm3,XMMWORD PTR [r14+0x40]
c: c4 c1 7a 6f 66 40 vmovdqu xmm4,XMMWORD PTR [r14+0x40]
12: c4 c1 7a 6f 6e 40 vmovdqu xmm5,XMMWORD PTR [r14+0x40]
18: c4 c1 7a 6f 76 40 vmovdqu xmm6,XMMWORD PTR [r14+0x40]
1e: c4 c1 7a 6f 7e 40 vmovdqu xmm7,XMMWORD PTR [r14+0x40]
24: c4 41 7a 6f 46 40 vmovdqu xmm8,XMMWORD PTR [r14+0x40]
2a: c4 41 7a 6f 4e 40 vmovdqu xmm9,XMMWORD PTR [r14+0x40]
30: c4 41 7a 6f 56 40 vmovdqu xmm10,XMMWORD PTR [r14+0x40]
36: c4 41 7a 6f 5e 40 vmovdqu xmm11,XMMWORD PTR [r14+0x40]
3c: c4 c1 7a 6f 56 40 vmovdqu xmm2,XMMWORD PTR [r14+0x40]
42: c4 c1 7a 6f 5e 40 vmovdqu xmm3,XMMWORD PTR [r14+0x40]
48: c4 c1 7a 6f 66 40 vmovdqu xmm4,XMMWORD PTR [r14+0x40]
4e: c4 c1 7a 6f 6e 40 vmovdqu xmm5,XMMWORD PTR [r14+0x40]
54: c4 c1 7a 6f 76 40 vmovdqu xmm6,XMMWORD PTR [r14+0x40]
5a: c4 c1 7a 6f 7e 40 vmovdqu xmm7,XMMWORD PTR [r14+0x40]
60: c4 41 7a 6f 46 40 vmovdqu xmm8,XMMWORD PTR [r14+0x40]
66: c4 41 7a 6f 4e 40 vmovdqu xmm9,XMMWORD PTR [r14+0x40]
6c: c4 41 7a 6f 56 40 vmovdqu xmm10,XMMWORD PTR [r14+0x40]
72: c4 41 7a 6f 5e 40 vmovdqu xmm11,XMMWORD PTR [r14+0x40]
78: c4 82 75 90 04 36 vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
Init:
VZEROALL;
VPGATHERDD YMM0, [R14+YMM14], YMM1;
VXORPS YMM14, YMM14, YMM14;
VPGATHERDD YMM1, [R14+YMM14], YMM0
warm_up_count: 100
Show nanoBench command
Results:
Instructions retired: 21.00
Core cycles: 15.00
Reference cycles: 13.68
UOPS_PORT2: 14.00
UOPS_PORT3: 14.00
⇨ 3 μops that can only use ports {2, 3}
I don't understand the conclusion. I think the idea is that you have 20 instructions which send 1 uop each to p23, which would nominally execute in 10 cycles (20/2), and then you see how much adding the instruction under test increases the runtime, assuming the bottleneck is port pressure. Here, you get to 15 cycles, a difference of 5 cycles. How does that equal 3 uops?
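One way to reconcile the numbers is to back the gather's p23 uops out of the measured port counters directly; a sketch, assuming the 20 blocking vmovdqu instructions contribute exactly 1 p23 uop each:

```python
# UOPS_PORT2 + UOPS_PORT3 from the results above, minus the 20 uops of
# the blocking loads, gives the gather's own p23 uop count.
blocking_uops = 20
uops_port2 = 14.0
uops_port3 = 14.0
gather_p23_uops = uops_port2 + uops_port3 - blocking_uops
print(gather_p23_uops)  # 8.0, matching the 8 uops mentioned above
```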
Thanks again for uops.info, it is great :).
If you look at 3 fused domain uop instructions (no memory operands) in Haswell, many have 1.0/2.0 for expected/measured throughput. Most of these are 1.0/1.0 on Skylake, as below:
Did the instruction throughput actually improve so much in Skylake for these instructions? I don't think so!
The effect comes from a combination of factors. One is that nearly all of these tests run out of the MITE (legacy) decoder (as reported by the perf counters). This is mostly because the uops are "dense" enough that they exceed the 18-uops-per-32-bytes limit of the uop cache.
Then, decoding limitations on Haswell kick in. Haswell can't decode in a 3-1 pattern (but Skylake can), so tests that interleave dependency-breaking instructions with the payload instruction, like:
0: 48 31 c0 xor rax,rax
3: 41 f7 e0 mul r8d
end up taking 2 cycles to decode the two instructions. That's why throughputs of 2.0 appear all over the place in the Haswell results. Most of the other test variants can't crack 2.0 cycles because of dependency chains.
One approach to getting close to the true throughput would be to avoid breaking out of the uop cache, e.g., by using an occasional large nop to space out the instructions. For cases where you have unrolled 4 uops with 4 dependency-breaking instructions, you could group the dependency-breaking (1 uop) and payload (complex) instructions for better decoding. E.g.:
xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
xadd eax, eax
xadd ebx, ebx
xadd ecx, ecx
xadd edx, edx
will decode more efficiently (5 cycles) than with full interleaving (8 cycles).
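The 8-vs-5 cycle figures follow from a rough decode model; a sketch, assuming (my assumption) a 4-wide Haswell front end where only the first decoder handles multi-uop instructions:

```python
# Rough decode-cycle model for the two orderings above.
# Interleaved xor/xadd: Haswell can't decode a 3-1 pattern, so each
# (xor, xadd) pair costs a full decode cycle per instruction group.
interleaved_cycles = 4 * 2   # 4 pairs, 2 cycles each
# Grouped: the 4 single-uop xors decode together in 1 cycle; each
# multi-uop xadd then needs the complex decoder, 1 per cycle.
grouped_cycles = 1 + 4
print(interleaved_cycles, grouped_cycles)  # 8 5
```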
Hi
I was wondering if there is any pipeline support for handling high-level source code (preferably hooks, like IACA's, that can be inserted in the source) to get the stats of a region or of functions.
If yes, any help with this would be really appreciated!
Thanks!
Hi, I'm trying to use the cache analyzer tool. However, the process gets killed due to errors in the kernel module, and the PC usually slowly dies and needs a restart. Here is a segment of the dmesg output after running 'sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"':
[ 122.924677] nb: module verification failed: signature and/or required key missing - tainting kernel
[ 122.925359] Initializing nanoBench kernel module...
[ 123.037080] Vendor ID: GenuineIntel
[ 123.037089] Brand: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
[ 123.037092] DisplayFamily_DisplayModel: 06_9EH
[ 123.037095] Stepping ID: 9
[ 123.037097] Performance monitoring version: 4
[ 123.037099] Number of fixed-function performance counters: 3
[ 123.037101] Number of general-purpose performance counters: 4
[ 123.037102] Bit widths of fixed-function performance counters: 48
[ 123.037104] Bit widths of general-purpose performance counters: 48
[ 133.965640] No physically contiguous memory area of the requested size found.
[ 133.965644] Try rebooting your computer.
[ 246.783643] msr_str: 0xe01
[ 246.783646] msr_str: 0x700
[ 246.783648] msr_str: 0xe01
[ 246.783649] msr_str: 0x710
[ 246.783650] msr_str: 0xe01
[ 246.783651] msr_str: 0x720
[ 246.783652] msr_str: 0xe01
[ 246.783653] msr_str: 0x730
[ 246.941670] BUG: unable to handle page fault for address: ffffafa028927e71
[ 246.941674] #PF: supervisor instruction fetch in kernel mode
[ 246.941676] #PF: error_code(0x0010) - not-present page
[ 246.941677] PGD 100000067 P4D 100000067 PUD 0
[ 246.941680] Oops: 0010 [#1] SMP PTI
[ 246.941682] CPU: 4 PID: 2321 Comm: python3 Tainted: G OE 5.15.0-53-generic #59~20.04.1-Ubuntu
[ 246.941685] Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
[ 246.941686] RIP: 0010:0xffffafa028927e71
[ 246.941689] Code: Unable to access opcode bytes at RIP 0xffffafa028927e47.
I've attempted this with an Intel i7-9750H, i9-12900K, and now an i7-7700. Using the i7-7700, I'm testing on a fresh install of Ubuntu 20, kernel version 5.15. The set-R14-size.sh script almost always fails (even after a reboot) when using 'sudo ./set-R14-size.sh 1G'. However, if I request more memory, the allocation sometimes succeeds. Before the dmesg above, I tried 1G, then around 1200M. This seems a bit strange; could it be the issue? Or is there anything else that I'm obviously missing? Here is an example command sequence that I'm using after boot:
cd nanoBench
make kernel
sudo insmod kernel/nb.ko
sudo ./set-R14-size.sh 1200M
cd tools/CacheAnalyzer
sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"
What is the definition of latency that you want to use, exactly?
In particular, consider a hypothetical operation foo arg1, arg2, arg3 which is 3p0. This instruction will have a throughput of 3 due to p0 pressure. Can this op have any latency less than 3? I think yes.
For example, the op might have only a 1-cycle delay from arg2 -> arg1, because the first uops only use arg3, and then the last uop uses arg2 and arg3.
However, testing back-to-back foo ops will never show this because of the throughput limit. I think you are probably well aware of this, since I notice lots of filler uops in tests, like:
0: c4 42 38 f2 ca andn r9d,r8d,r10d
5: 4d 63 c1 movsxd r8,r9d
8: 4d 63 c8 movsxd r9,r8d
b: 4d 63 c1 movsxd r8,r9d
All the movsxd instructions give enough breathing room to avoid many problems of this type.
However, consider gathers. For 1->1 latency testing this is used:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
No breathing room, so all these results just end up reporting the throughput number (5 in this case).
The following test:
vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
also runs in 5 cycles, so we see the true 1->1 latency is 1 cycle.
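The padding argument can be stated as simple arithmetic; a sketch, taking the usual 1-cycle vpor latency on SKL as an assumption:

```python
# If the gather alone is throughput-limited at 5 cycles, and inserting
# 4 dependent vpor ops (1 cycle latency each) into the chain does not
# increase the runtime, the gather's 1->1 latency is at most 5 - 4 = 1.
gather_throughput = 5   # cycles per iteration, gather alone
filler_count = 4        # dependent vpor instructions added
filler_latency = 1      # assumed cycles each
max_gather_latency = gather_throughput - filler_count * filler_latency
print(max_gather_latency)  # 1
```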
I noticed today that cmov is missing the latency from operand 1 -> 1.
For cmov, the first operand is read-write, as it is for, say, add.
https://github.com/andreas-abel/nanoBench/blob/4e7954f/tools/CacheAnalyzer/replPolicy.py#L61
The code suggests that if I leave -sets empty, all sets will be tested.
However, running it that way results in this error:
https://github.com/andreas-abel/nanoBench/blob/d70535f/tools/CacheAnalyzer/cacheSim.py#L309