
Comments (8)

sshipkov commented on September 23, 2024

Under high load, every 30 seconds, when the driver synchronizes with the MongoDB cluster, many connections start taking longer than a second.
FailFast partially solves this problem.
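
For reference, FailFast can be enabled through mgo.DialInfo; a minimal sketch (the connection string and timeout below are placeholders):

    package main

    import (
        "time"

        "github.com/globalsign/mgo"
    )

    func dial() (*mgo.Session, error) {
        // Placeholder connection string; adjust for your replica set.
        info, err := mgo.ParseURL("mongodb://mongo-1,mongo-2,mongo-3/mydb")
        if err != nil {
            return nil, err
        }
        info.Timeout = 10 * time.Second

        // FailFast makes connection and query attempts fail quickly when a
        // server is unreachable instead of retrying until the timeout expires.
        info.FailFast = true

        return mgo.DialWithInfo(info)
    }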

domodwyer commented on September 23, 2024

Hi @sshipkov

This change was added by PR #52 to help mitigate a different issue - can you confirm it's definitely this change that causes your issue? As part of the release process we tested this change against a sharded, replicated MongoDB cluster but we didn't see anything unusual.

What exactly is happening? Can you reproduce the issue with debug logging enabled and post the relevant output?
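
For anyone reproducing this, mgo's debug output can be enabled with the package-level logger; a minimal sketch (logging to stderr is just an example):

    package main

    import (
        "log"
        "os"

        "github.com/globalsign/mgo"
    )

    func enableMgoDebug() {
        // Send mgo's internal log output to stderr and turn on verbose debug
        // tracing (cluster synchronisation, socket activity, and so on).
        mgo.SetLogger(log.New(os.Stderr, "[mgo] ", log.LstdFlags))
        mgo.SetDebug(true)
    }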

Dom

sshipkov commented on September 23, 2024

[Graph: MongoDB, 99.9th percentile latency, ~3,500 rps of read requests]

sshipkov commented on September 23, 2024

[Graph: network traffic - the MongoDB servers' tx is smoother]

sshipkov commented on September 23, 2024

I'm sorry, I won't be able to enable debug logging.

domodwyer commented on September 23, 2024

Hi @sshipkov

Thanks for taking the time to put together some annotated graphs; however, they're not as helpful as you might think - they show that something was working and then it wasn't.

Anything from the application code through to the OS, CPU load, network card, drivers, switches, network utilisation, etc. could be causing what I assume is a throughput dropout (not a latency drop as your graphs show, which would be a good thing) - what has led you to believe #52 is the cause?

I'd suggest running a version of mgo without the socket change (instructions below) on N-1 machines, and leaving a single machine as a canary deployment to help isolate the problem to the changes introduced in #52. Avoid making any other changes, as it will make the results harder to reason about.

Can you provide any details? Some helpful information to reproduce the issue, such as:

  • MongoDB version, storage engine, number of shards/replicas, failover frequency, etc
  • Query workload (read only, 10%r/90%w, from cache or "random" reads, etc)
  • Approximate dataset size, document size/complexity
  • Network (typical latency, loss percentage, max throughput, background throughput, WAN/LAN, etc)
  • Environment (OS, VM/physical machines, number of nodes, NUMA, transparent huge pages, etc)
  • Go version, mgo usage patterns (I assume you do Copy() and Close() rather than using a single session - see the sketch after this list)
  • Debug log
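
To illustrate the Copy()/Close() usage pattern mentioned above, a minimal sketch of per-request session handling (the database, collection, and handler names are placeholders):

    package main

    import (
        "net/http"

        "github.com/globalsign/mgo"
        "github.com/globalsign/mgo/bson"
    )

    // newHandler copies the root session per request and closes the copy when
    // done, rather than sharing a single session across all goroutines.
    func newHandler(root *mgo.Session) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            s := root.Copy() // take a socket from the pool with fresh state
            defer s.Close()  // return it to the pool when the request finishes

            var doc bson.M
            // Placeholder database/collection names.
            err := s.DB("mydb").C("items").FindId(r.URL.Query().Get("id")).One(&doc)
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.Write([]byte("found\n"))
        }
    }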

On a side note, during our last testing run we had ~35,000 op/s (mixed workload) and our 99.9th percentile was 19.1ms - we have not experienced any problems. Your first graph shows the 99.9th percentile latency varying between 100ms and 1 second - if you're within a datacenter (not over a WAN link) this is a hugely worrying sign, and it's likely there are problems elsewhere in your infrastructure. If that latency graph appeared in our environment, the right-hand side would be considered normal, and the left would be the worry.

I'm happy to revert the change if it is causing an issue, but not without proof it is the cause. The above questions will help reproduce the issue in our testing environment and diagnose further.

Thanks,
Dom


To build your application without #52:

cd $GOPATH/src/github.com/globalsign/mgo
git pull
git revert 5be15cc6a44937b6937a9487e1633f8a5fb25e4a

Then build your application as normal. If you vendor mgo (you should) then check your dependency management tool's documentation.

sshipkov commented on September 23, 2024
  • MongoDB 3.4.9 Community, WiredTiger, 1 replica
  • 95% read, from cache, random reads
  • ~3 MB dataset, 88 B documents, 32,000 documents
  • ~0 latency, 0% loss, ~10 Gb max throughput, LAN (own data center, high-performance network)
  • MongoDB runs on 2x physical machines (10 cores, 380 GB of RAM), Debian, LXC, CPU load < 2%, load average < 1
  • Go 1.9.2, GOMAXPROCS 16, mgo Copy()/Close() (poolLimit 8), apps in a Kubernetes cluster, 6 pods on different physical nodes (pod CPU load < 0.2m)

Many measurements were taken before changing mgo.
The mgo change was deployed at night, so the influence of other factors is minimal.
Now all requests complete with a reported latency of 0 ms.

Unfortunately, I do not have time to investigate mgo thoroughly.
I will try testing the mgo update on other services later; if the result is the same, we will just use the 10gen/mgo fork ;)

domodwyer commented on September 23, 2024

Hi @sshipkov

Sorry for the delay, but it's the new year holidays! ;)

I might have misunderstood your graphs above - is the left side with the "move socket.Unlock()" change from #52 and the right side without #52? Or the other way around?

Either way, I replicated your setup as closely as I reasonably could, and I can see no latency regression from #52 - only a reduction in latency variance (a good thing). Note that the 99th percentile (p99) is the same for both, unlike in the graphs above:

Values in milliseconds:

x => r2017.10.11-select-zipfian-latency.log 
y => r2017.11.06-select-zipfian-latency.log 

          n min   max median   average    stddev  p99
x 200000009 0.5 240.2    0.5 0.5159448 0.1934402 1.05
y 200000009 0.5 240.2    0.5 0.5163596 0.1721280 1.05

0.3                0.4                 0.5                0.6                   
+|------------------+-------------------+------------------+-------------------|
                                        +                                       
1                                       |                                       
                                        +                                       
                                        +                                       
2                                       |                                       
                                        +                                       
Legend: 1=data$x, 2=data$y

At 95% probability:
===> average is statistically significant     (p=0.000000, diff ~0.000415)
===> variance is statistically significant    (p=0.000000)

While the change in average latency is statistically significant, it's tiny (0.0004148 ms) so it can be ignored. From the above I see no change in average latency between c4a7121 and 5be15cc. The variance (spread of measurements) is reduced - this is a good thing, and means fewer requests take a relatively long time.

The throughput however has changed in an interesting way:

Values are reads per second:

x => r2017.10.11-select-zipfian-throughput.log 
y => r2017.11.06-select-zipfian-throughput.log 

     n   min   max median  average   stddev      p99
x 2865 56970 72599  67608 67584.57 2382.520 72044.52
y 2865 57506 74826  69783 69797.62 2435.384 74466.88

                      65000                    70000                    75000   
 |----------------------+------------------------+------------------------+----|
                             +-------+-------+                                  
1    ------------------------|       |       |-----------------                 
                             +-------+-------+                                  
                                        +-------+-------+                       
2                   --------------------|       |       |-----------------      
                                        +-------+-------+                       
Legend: 1=data$x, 2=data$y

At 95% probability:
===> average is statistically significant     (p=0.000000, diff ~2213.048866)
===> variance not statistically significant   (p=0.240275)

Because of what I imagine is a reduced lock duration, the average throughput has increased by ~2213 reads per second.
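
As a general illustration only (not the actual mgo socket code) of why moving an Unlock() earlier can raise throughput: the mutex is held only while shared state is updated, not during the blocking network write, so other goroutines spend less time waiting for the lock.

    package main

    import (
        "net"
        "sync"
    )

    // Illustrative only - not the actual mgo implementation.
    type conn struct {
        mu      sync.Mutex
        nc      net.Conn
        pending map[uint32]func([]byte) // shared state guarded by mu
    }

    func (c *conn) send(id uint32, payload []byte, reply func([]byte)) error {
        c.mu.Lock()
        c.pending[id] = reply // register the reply handler under the lock...
        c.mu.Unlock()         // ...then release before the blocking network write

        _, err := c.nc.Write(payload) // slow I/O outside the critical section
        return err
    }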

It wasn't until after I ran the tests that I saw your poolLimit 8 comment - that is a really small number of connections to allow, and to me it is a really bad idea - are you sure this is the right thing to do for your workload? Once those 8 connections are busy, mgo blocks waiting for a connection to become free - this could easily be the source of your high 99th percentile measurements. It is not a good idea to use SetPoolLimit() as a concurrency throttle - see this post by @niemeyer (the original mgo author) for a clear explanation of why.
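
If throttling is genuinely needed, one common alternative (a sketch, not a prescription) is to bound concurrency in the application with a semaphore and leave the pool limit at its default, so mgo itself never blocks waiting for one of only 8 sockets; the limit of 64 and the database/collection names below are arbitrary examples:

    package main

    import (
        "github.com/globalsign/mgo"
        "github.com/globalsign/mgo/bson"
    )

    // sem bounds how many MongoDB operations run concurrently. The value 64
    // is an arbitrary example; tune it for your workload.
    var sem = make(chan struct{}, 64)

    func lookup(root *mgo.Session, id string) (bson.M, error) {
        sem <- struct{}{}        // acquire a slot
        defer func() { <-sem }() // release it when done

        s := root.Copy()
        defer s.Close()

        var doc bson.M
        // Placeholder database/collection names.
        err := s.DB("mydb").C("items").FindId(id).One(&doc)
        return doc, err
    }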

I run a lot of benchmarks, but I'm not a statistician - so take all the analysis above with a pinch of salt! If you would like to look at the latency measurements yourself, there's grouped data and a histogram (on the second sheet) here.
