
Comments (9)

embano1 commented on May 20, 2024

Hi @codesuki

Sorry for my delayed answer (PTO and other stuff). I did some benchmarking (mostly CPU-bound workloads like calculating prime numbers or locks) but did not have the time to write the blog post, which is still planned.

Anyway, here are some data points from runs on a 16-core cloud box.

[Chart: prime benchmark run time under different CPU cgroup settings]
[Chart: sync.Map benchmark under different CPU cgroup settings]

The tests compare different CPU cgroup settings (i.e. CFS quota) and their effect on benchmark run time.

The first diagram summarises a prime benchmark I wrote for the tests (https://github.com/embano1/gotutorials/tree/master/concprime). It spawns many active Goroutines to find prime numbers. The benchmark execution time is compared across runs with CPU CFS quota off/1/2/4 vs. different GOMAXPROCS settings (1-16) on a 16-core box. As an example, look at the orange bars comparing the case for GOMAXPROCS=16. The first orange bar shows the run w/out CPU CFS quota, i.e. the fastest of all runs (as expected). The second orange bar is where the container is constrained to 1 CPU (CFS quota 100ms, period 100ms). It's the worst result, meaning you should tune GOMAXPROCS to match the CFS quota, especially on large boxes (8+ CPUs).
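For illustration, here is a minimal Go sketch of that alignment, assuming cgroup v1 paths (the automaxprocs library does this for you and handles far more edge cases):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup file.
func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// cgroup v1 CFS quota/period in microseconds; quota is -1 when unset.
	quota, err1 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, err2 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err1 == nil && err2 == nil && quota > 0 && period > 0 {
		procs := int(quota / period) // round down to whole CPUs
		if procs < 1 {
			procs = 1 // never go below one P
		}
		runtime.GOMAXPROCS(procs)
	}
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```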

The second diagram is a mutex lock contention benchmark comparing Go's sync.Map vs. a map with a mutex (https://medium.com/@deckarep/the-new-kid-in-town-gos-sync-map-de24a6bf7c2c). Here you can see that it becomes really critical for performance when many mutexes are in play. Compare the orange line (map w/ r/w mutex) and the yellow line (map w/ r/w mutex and CFS quota == 1 CPU). GOMAXPROCS is shown on the horizontal axis. Everything is fine for GOMAXPROCS=1, but it gets much worse with CFS quota applied and GOMAXPROCS=16 (the default on that machine). The chart is cut off; you can see the values for both cases in the table below it. For the map w/ r/w mutex and CFS quota == 1 CPU we got two orders of magnitude slower performance when GOMAXPROCS is not tuned (151 vs. 15,000 ns/op)!
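For reference, a minimal benchmark sketch in the spirit of that comparison (my own simplification, not the exact code from the linked post):

```go
package maps_test

import (
	"sync"
	"testing"
)

// BenchmarkMutexMap exercises a plain map guarded by an RWMutex
// under parallel writers and readers.
func BenchmarkMutexMap(b *testing.B) {
	var mu sync.RWMutex
	m := make(map[int]int)
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			mu.Lock()
			m[i%1024] = i
			mu.Unlock()
			mu.RLock()
			_ = m[i%1024]
			mu.RUnlock()
			i++
		}
	})
}

// BenchmarkSyncMap exercises sync.Map under the same mixed load.
func BenchmarkSyncMap(b *testing.B) {
	var m sync.Map
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Store(i%1024, i)
			m.Load(i % 1024)
			i++
		}
	})
}
```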

I ack that these are synthetic benchmarks, but they prove the point: if there's misalignment between the CFS quota and the language runtime tuning (in this case GOMAXPROCS), and the workload is mostly CPU-bound (e.g. spawning a lot of active goroutines, calculations, etc.), this can cause performance degradation.

I think it's not that hard to write custom benchmarks to validate the impact of misaligned GOMAXPROCS vs. CFS quota for your specific application. In fact, I recommend making benchmarking/stress-testing part of CI to establish a baseline and compare against production. I discussed this intensively in a talk at KubeCon (https://www.youtube.com/watch?v=8-apJyr2gi0).
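Go's tooling makes that straightforward: the -cpu flag of go test re-runs each benchmark under different GOMAXPROCS values. A sketch (the isPrime helper is mine, just to have a CPU-bound load):

```go
// Run inside and outside a CFS-constrained container with:
//
//	go test -bench=. -cpu=1,2,4,16
//
// to compare the same benchmark across GOMAXPROCS settings.
package prime_test

import "testing"

// isPrime is a deliberately naive, CPU-bound primality check.
func isPrime(n int) bool {
	for d := 2; d*d <= n; d++ {
		if n%d == 0 {
			return false
		}
	}
	return n > 1
}

func BenchmarkPrimes(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		n := 0
		for pb.Next() {
			isPrime(1_000_003 + n)
			n++
		}
	})
}
```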

Hope that helps.


gaocegege commented on May 20, 2024

Just FYI.

I ran the benchmark https://github.com/embano1/gotutorials/tree/master/concprime natively and with docker --cpus 4, docker --cpus 2, docker --cpus 1, docker --cpuset-cpus 0,1,2,3, and Kubernetes resources.limits.cpu=4, =2, and =1.

Test env:

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-7560U CPU @ 2.40GHz
Stepping:            9
CPU MHz:             1011.469
CPU max MHz:         3800.0000
CPU min MHz:         400.0000
BogoMIPS:            4800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```

Here are the results:

[Chart: concprime benchmark duration vs. GOMAXPROCS across environments]

Raw results:

| Environment | GOMAXPROCS | Duration (ms) |
| --- | ---: | ---: |
| Native 4 CPU | 1 | 200 |
| Native 4 CPU | 2 | 134 |
| Native 4 CPU | 4 | 123 |
| docker --cpuset-cpus 0,1,2,3 | 1 | 291 |
| docker --cpuset-cpus 0,1,2,3 | 2 | 162 |
| docker --cpuset-cpus 0,1,2,3 | 4 | 116 |
| docker --cpus 1 | 1 | 311 |
| docker --cpus 1 | 2 | 320 |
| docker --cpus 1 | 4 | 495 |
| docker --cpus 2 | 1 | 302 |
| docker --cpus 2 | 2 | 179 |
| docker --cpus 2 | 4 | 191 |
| docker --cpus 4 | 1 | 272 |
| docker --cpus 4 | 2 | 169 |
| docker --cpus 4 | 4 | 134 |
| K8s resources.limits.cpu=1 | 1 | 243 |
| K8s resources.limits.cpu=1 | 2 | 277 |
| K8s resources.limits.cpu=1 | 4 | 424 |
| K8s resources.limits.cpu=2 | 1 | 260 |
| K8s resources.limits.cpu=2 | 2 | 154 |
| K8s resources.limits.cpu=2 | 4 | 192 |
| K8s resources.limits.cpu=4 | 1 | 248 |
| K8s resources.limits.cpu=4 | 2 | 154 |
| K8s resources.limits.cpu=4 | 4 | 125 |
The chart was generated with this R script:

```r
library(ggplot2)

# Tab-separated results: V1 = environment, V2 = GOMAXPROCS, V3 = duration (ms)
maxprocs <- read.delim("maxprocs.txt", header = FALSE, sep = "\t", dec = ".")

# Treat GOMAXPROCS as a discrete factor so each value gets its own bar
maxprocs$V2 <- factor(maxprocs$V2, c("1", "2", "4"))

p <- ggplot(maxprocs, aes(x = V1, y = V3, fill = V2)) +
  geom_col(position = "dodge2", width = 0.7) +
  coord_flip() +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "GOMAXPROCS", title.position = "left")) +
  labs(y = "Duration (ms) (lower is better)", x = "")

p
```


prashantv commented on May 20, 2024

I'll add some more data measured from our internal load balancer at Uber.

We ran the load balancer with a 200% CPU quota (i.e., 2 cores) and used yab to benchmark it.

| GOMAXPROCS | RPS | P50 (ms) | P99.9 (ms) |
| --- | ---: | ---: | ---: |
| 1 | 28,893.18 | 1.46 | 19.70 |
| 2 (equal to quota) | 44,715.07 | 0.84 | 26.38 |
| 3 | 44,212.93 | 0.66 | 30.07 |
| 4 | 41,071.15 | 0.57 | 42.94 |
| 8 | 33,111.69 | 0.43 | 64.32 |
| Default (24) | 22,191.40 | 0.45 | 76.19 |

When GOMAXPROCS is increased above the CPU quota, we see P50 latency decrease slightly but significant increases to P99.9. The total RPS handled also decreases.

When GOMAXPROCS is higher than the CPU quota allocated, we also saw significant throttling:

```
$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/[...]/cpu.stat
nr_periods 42227334
nr_throttled 131923
throttled_time 88613212216618
```

Once GOMAXPROCS was reduced to match the CPU quota, we saw no CPU throttling.
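This is exactly what automaxprocs automates. A minimal usage sketch:

```go
package main

import (
	"fmt"
	"runtime"

	// The blank import adjusts GOMAXPROCS at startup to match the
	// container's CPU quota.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 queries without changing
}
```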


codesuki commented on May 20, 2024

Thanks again!
This aligns with what I expected but couldn't put in words.
Makes total sense now.
I'll have a look at the links, thanks for those, too.


embano1 commented on May 20, 2024

@jeromefroe I'm working on a blog post covering exactly this (incl. benchmarks). Hope to have it done by January. Will ping you again...


codesuki commented on May 20, 2024

@embano1 Did you finish the blog post? :)


codesuki commented on May 20, 2024

Thanks for following up, and for the write-up! Very informative. I'll play with the benchmark a bit.

One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1, because even for the map w/ r/w mutex ns/op goes up.
I didn't think deeply about the reason, but might that be because of goroutine overhead?

Great talk BTW!


embano1 commented on May 20, 2024

Thank you (also on the talk!), and "you're welcome" :)

> One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1, because even for the map w/ r/w mutex ns/op goes up.
> I didn't think deeply about the reason, but might that be because of goroutine overhead?

The problem is lock/CPU cache contention when there is more than one OS thread (roughly speaking, GOMAXPROCS) active for the program. So when looking solely at locking mechanisms and contention, one thread will always beat >1 OS threads. Details here: https://www.youtube.com/watch?v=C1EtfDnsdDs
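As a tiny illustration (GOMAXPROCS bounds how many OS threads execute Go code simultaneously, independent of how many logical CPUs the machine exposes):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:", runtime.NumCPU())          // logical CPUs visible to the process
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 queries without changing
}
```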

Now, should we advise always setting GOMAXPROCS=1? Of course not, since that would hurt performance and throw away the benefits of multi-core machines. I'm only looking at a very specific computing problem here. Even with higher GOMAXPROCS (aligned to the CPU CFS quota), and thus potential lock contention, most programs perform better by leveraging multi-threading and Goroutines. That's why I alluded to running benchmarks and stress tests against the limits imposed on the program, i.e. CPU CFS quota in this specific case.

I would say Dave Cheney can be considered an authoritative Go source :) so I'll link to his great material on performance tuning: https://github.com/davecheney/high-performance-go-workshop

I'm also pleased to hear that there will be changes coming to Go runtime memory management with regards to memory limits (https://blog.golang.org/ismmkeynote), but that's not related to our discussion here :)


abhinav commented on May 20, 2024

Fixed by #52. Thanks to @SaveTheRbtz for the PR.
