
Comments (9)

embano1 commented on May 20, 2024

Hi @codesuki

Sorry for my delayed answer (PTO and other stuff). I did some benchmarking (mostly CPU-bound workloads like calculating prime numbers or locks) but did not have the time to write the blog post, which is still planned.

Anyway, here are some data points from runs on a 16-core cloud box.

[Chart: prime benchmark run time under different CPU cgroup settings]
[Chart: sync.Map benchmark under different CPU cgroup settings]

The tests compare different CPU cgroup settings (i.e. CFS quota) and their effect on benchmark run time.

The first diagram summarises a prime benchmark I wrote for the tests (https://github.com/embano1/gotutorials/tree/master/concprime). It spawns many active Goroutines to find prime numbers. The benchmark execution time is compared across runs with CPU CFS quota off/1/2/4 vs. different GOMAXPROCS settings (1-16) on a 16-core box. As an example, look at the orange bars comparing the case for GOMAXPROCS=16. The first orange bar shows the run w/out CPU CFS quota, i.e. the fastest of all runs (as expected). The second orange bar is where the container is constrained to 1 CPU (CFS quota 100ms, period 100ms). It's the worst result, meaning you should tune GOMAXPROCS to match the CFS quota, especially on large boxes (8+ CPUs).
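For illustration, here is a minimal Go sketch of that alignment, assuming cgroup v1 paths (the automaxprocs library does this for you and handles far more edge cases):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup file.
func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// cgroup v1 CFS quota/period in microseconds; quota is -1 when unset.
	quota, err1 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, err2 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err1 == nil && err2 == nil && quota > 0 && period > 0 {
		procs := int(quota / period) // round down to whole CPUs
		if procs < 1 {
			procs = 1 // never go below one P
		}
		runtime.GOMAXPROCS(procs)
	}
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```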

The second diagram is a mutex lock contention benchmark comparing Go's sync.Map vs. a map with a mutex (https://medium.com/@deckarep/the-new-kid-in-town-gos-sync-map-de24a6bf7c2c). Here you can see that it becomes really critical for performance when many mutexes are in play. Compare the orange line (map w/ r/w mutex) and the yellow line (map w/ r/w mutex and CFS quota == 1 CPU). GOMAXPROCS is shown on the horizontal axis. Everything is fine for GOMAXPROCS=1, but it gets much worse with CFS quota applied and GOMAXPROCS=16 (the default on that machine). The chart is cut off; you can see the values for both cases in the table below it. For the map w/ r/w mutex and CFS quota == 1 CPU we got two orders of magnitude slower performance when GOMAXPROCS is not tuned (151 vs. 15,000 ns/op)!
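For reference, a minimal benchmark sketch in the spirit of that comparison (my own simplification, not the exact code from the linked post):

```go
package maps_test

import (
	"sync"
	"testing"
)

// BenchmarkMutexMap exercises a plain map guarded by an RWMutex
// under parallel writers and readers.
func BenchmarkMutexMap(b *testing.B) {
	var mu sync.RWMutex
	m := make(map[int]int)
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			mu.Lock()
			m[i%1024] = i
			mu.Unlock()
			mu.RLock()
			_ = m[i%1024]
			mu.RUnlock()
			i++
		}
	})
}

// BenchmarkSyncMap exercises sync.Map under the same mixed load.
func BenchmarkSyncMap(b *testing.B) {
	var m sync.Map
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Store(i%1024, i)
			m.Load(i % 1024)
			i++
		}
	})
}
```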

I ack that these are synthetic benchmarks, but they prove the point: if there's misalignment between the CFS quota and the language runtime tuning (in this case GOMAXPROCS), and the workload is mostly CPU-bound (e.g. spawning a lot of active goroutines, calculations, etc.), this can cause performance degradation.

I think it's not that hard to write custom benchmarks to validate the impact of misaligned GOMAXPROCS vs. CFS quota for your specific application. In fact, I recommend making benchmarking/stress-testing part of CI to establish a baseline and compare against production. I discussed this intensively in a talk at KubeCon (https://www.youtube.com/watch?v=8-apJyr2gi0).
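Go's tooling makes that straightforward: the -cpu flag of go test re-runs each benchmark under different GOMAXPROCS values. A sketch (the isPrime helper is mine, just to have a CPU-bound load):

```go
// Run inside and outside a CFS-constrained container with:
//
//	go test -bench=. -cpu=1,2,4,16
//
// to compare the same benchmark across GOMAXPROCS settings.
package prime_test

import "testing"

// isPrime is a deliberately naive, CPU-bound primality check.
func isPrime(n int) bool {
	for d := 2; d*d <= n; d++ {
		if n%d == 0 {
			return false
		}
	}
	return n > 1
}

func BenchmarkPrimes(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		n := 0
		for pb.Next() {
			isPrime(1_000_003 + n)
			n++
		}
	})
}
```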

Hope that helps.


gaocegege commented on May 20, 2024

Just FYI.

I ran the benchmark https://github.com/embano1/gotutorials/tree/master/concprime natively and with docker --cpus 4, docker --cpus 2, docker --cpus 1, docker --cpuset-cpus 0,1,2,3, and Kubernetes resources.limits.cpu=4, =2, and =1.

Test env:

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-7560U CPU @ 2.40GHz
Stepping:            9
CPU MHz:             1011.469
CPU max MHz:         3800.0000
CPU min MHz:         400.0000
BogoMIPS:            4800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```

Here are the results:

[Chart: concprime benchmark duration vs. GOMAXPROCS across environments]

Raw results:

| Environment | GOMAXPROCS | Duration (ms) |
| --- | ---: | ---: |
| Native 4 CPU | 1 | 200 |
| Native 4 CPU | 2 | 134 |
| Native 4 CPU | 4 | 123 |
| docker --cpuset-cpus 0,1,2,3 | 1 | 291 |
| docker --cpuset-cpus 0,1,2,3 | 2 | 162 |
| docker --cpuset-cpus 0,1,2,3 | 4 | 116 |
| docker --cpus 1 | 1 | 311 |
| docker --cpus 1 | 2 | 320 |
| docker --cpus 1 | 4 | 495 |
| docker --cpus 2 | 1 | 302 |
| docker --cpus 2 | 2 | 179 |
| docker --cpus 2 | 4 | 191 |
| docker --cpus 4 | 1 | 272 |
| docker --cpus 4 | 2 | 169 |
| docker --cpus 4 | 4 | 134 |
| K8s resources.limits.cpu=1 | 1 | 243 |
| K8s resources.limits.cpu=1 | 2 | 277 |
| K8s resources.limits.cpu=1 | 4 | 424 |
| K8s resources.limits.cpu=2 | 1 | 260 |
| K8s resources.limits.cpu=2 | 2 | 154 |
| K8s resources.limits.cpu=2 | 4 | 192 |
| K8s resources.limits.cpu=4 | 1 | 248 |
| K8s resources.limits.cpu=4 | 2 | 154 |
| K8s resources.limits.cpu=4 | 4 | 125 |
The chart was generated with this R script:

```r
library(ggplot2)

# Tab-separated results: V1 = environment, V2 = GOMAXPROCS, V3 = duration (ms)
maxprocs <- read.delim("maxprocs.txt", header = FALSE, sep = "\t", dec = ".")

# Treat GOMAXPROCS as a discrete factor so each value gets its own bar
maxprocs$V2 <- factor(maxprocs$V2, c("1", "2", "4"))

p <- ggplot(maxprocs, aes(x = V1, y = V3, fill = V2)) +
  geom_col(position = "dodge2", width = 0.7) +
  coord_flip() +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "GOMAXPROCS", title.position = "left")) +
  labs(y = "Duration (ms) (lower is better)", x = "")

p
```


prashantv commented on May 20, 2024

I'll add some more data measured from our internal load balancer at Uber.

We ran the load balancer with a 200% CPU quota (i.e., 2 cores) and used yab to benchmark it.

| GOMAXPROCS | RPS | P50 (ms) | P99.9 (ms) |
| --- | ---: | ---: | ---: |
| 1 | 28,893.18 | 1.46 | 19.70 |
| 2 (equal to quota) | 44,715.07 | 0.84 | 26.38 |
| 3 | 44,212.93 | 0.66 | 30.07 |
| 4 | 41,071.15 | 0.57 | 42.94 |
| 8 | 33,111.69 | 0.43 | 64.32 |
| Default (24) | 22,191.40 | 0.45 | 76.19 |

When GOMAXPROCS is increased above the CPU quota, we see P50 latency decrease slightly but significant increases to P99.9. The total RPS handled also decreases.

When GOMAXPROCS is higher than the CPU quota allocated, we also saw significant throttling:

```
$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/[...]/cpu.stat
nr_periods 42227334
nr_throttled 131923
throttled_time 88613212216618
```

Once GOMAXPROCS was reduced to match the CPU quota, we saw no CPU throttling.
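This is exactly what automaxprocs automates. A minimal usage sketch:

```go
package main

import (
	"fmt"
	"runtime"

	// The blank import adjusts GOMAXPROCS at startup to match the
	// container's CPU quota.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 queries without changing
}
```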


codesuki commented on May 20, 2024

Thanks again!
This aligns with what I expected but couldn't put in words.
Makes total sense now.
I'll have a look at the links, thanks for those, too.


embano1 commented on May 20, 2024

@jeromefroe I'm working on a blog post covering exactly this (incl. benchmarks). Hope to have it done by January. Will ping you again...


codesuki commented on May 20, 2024

@embano1 Did you finish the blog post? :)


codesuki commented on May 20, 2024

Thanks for following up, and for the write-up! Very informative. I'll play with the benchmark a bit.

One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1, because even for the map w/ r/w mutex ns/op goes up.
I didn't think deeply about the reason, but might that be because of goroutine overhead?

Great talk BTW!


embano1 commented on May 20, 2024

Thank you (also on the talk!), and "you're welcome" :)

> One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1, because even for the map w/ r/w mutex ns/op goes up.
> I didn't think deeply about the reason, but might that be because of goroutine overhead?

The problem is lock/CPU cache contention when there is more than one OS thread (roughly speaking, GOMAXPROCS) active for the program. So when looking solely at locking mechanisms and contention, one thread will always beat >1 OS threads. Details here: https://www.youtube.com/watch?v=C1EtfDnsdDs
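As a tiny illustration (GOMAXPROCS bounds how many OS threads execute Go code simultaneously, independent of how many logical CPUs the machine exposes):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:", runtime.NumCPU())          // logical CPUs visible to the process
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 queries without changing
}
```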

Now, should we advise always setting GOMAXPROCS=1? Of course not, since that would hurt performance and throw away the benefits of multi-core machines. I'm only looking at a very specific computing problem here. Even with higher GOMAXPROCS (aligned to the CPU CFS quota), and thus potential lock contention, most programs perform better by leveraging multi-threading and Goroutines. That's why I alluded to running benchmarks and stress tests against the limits imposed on the program, i.e. CPU CFS quota in this specific case.

I would say Dave Cheney can be considered an authoritative Go source :) so I'll link to his great material on performance tuning: https://github.com/davecheney/high-performance-go-workshop

I'm also pleased to hear that there will be changes coming to Go runtime memory management with regards to memory limits (https://blog.golang.org/ismmkeynote), but that's not related to our discussion here :)


abhinav commented on May 20, 2024

Fixed by #52. Thanks to @SaveTheRbtz for the PR.
