
Comments (8)

sshipkov commented on September 23, 2024

Under high load, every 30 seconds, when the driver synchronizes with the MongoDB cluster, many connections start taking longer than a second.
FailFast partially solves this problem.
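
For reference, FailFast can be enabled through mgo.DialInfo; a minimal sketch (the connection string and timeout below are placeholders):

    package main

    import (
        "time"

        "github.com/globalsign/mgo"
    )

    func dial() (*mgo.Session, error) {
        // Placeholder connection string; adjust for your replica set.
        info, err := mgo.ParseURL("mongodb://mongo-1,mongo-2,mongo-3/mydb")
        if err != nil {
            return nil, err
        }
        info.Timeout = 10 * time.Second

        // FailFast makes connection and query attempts fail quickly when a
        // server is unreachable instead of retrying until the timeout expires.
        info.FailFast = true

        return mgo.DialWithInfo(info)
    }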

domodwyer commented on September 23, 2024

Hi @sshipkov

This change was added by PR #52 to help mitigate a different issue - can you confirm it's definitely this change that causes your issue? As part of the release process we tested this change against a sharded, replicated MongoDB cluster but we didn't see anything unusual.

What exactly is happening? Can you reproduce the issue with debug logging enabled and post the relevant output?
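
For anyone reproducing this, mgo's debug output can be enabled with the package-level logger; a minimal sketch (logging to stderr is just an example):

    package main

    import (
        "log"
        "os"

        "github.com/globalsign/mgo"
    )

    func enableMgoDebug() {
        // Send mgo's internal log output to stderr and turn on verbose debug
        // tracing (cluster synchronisation, socket activity, and so on).
        mgo.SetLogger(log.New(os.Stderr, "[mgo] ", log.LstdFlags))
        mgo.SetDebug(true)
    }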

Dom

sshipkov commented on September 23, 2024

[Graph: MongoDB, 99.9th percentile latency, ~3,500 rps of read requests]

sshipkov commented on September 23, 2024

[Graph: network traffic - the MongoDB servers' tx is smoother]

sshipkov commented on September 23, 2024

I'm sorry, I won't be able to enable debug logging.

domodwyer commented on September 23, 2024

Hi @sshipkov

Thanks for taking the time to put together some annotated graphs; however, they're not as helpful as you might think - they show that something was working and then it wasn't.

Anything from the application code through to the OS, CPU load, network card, drivers, switches, network utilisation, etc. could be causing what I assume is a throughput dropout (not a latency drop as your graphs show, which would be a good thing) - what has led you to believe #52 is the cause?

I'd suggest running a version of mgo without the socket change (instructions below) on N-1 machines, and leaving a single machine as a canary deployment to help isolate the problem to the changes introduced in #52. Avoid making any other changes, as it will make the results harder to reason about.

Can you provide any details? Some helpful information to reproduce the issue, such as:

  • MongoDB version, storage engine, number of shards/replicas, failover frequency, etc
  • Query workload (read only, 10%r/90%w, from cache or "random" reads, etc)
  • Approximate dataset size, document size/complexity
  • Network (typical latency, loss percentage, max throughput, background throughput, WAN/LAN, etc)
  • Environment (OS, VM/physical machines, number of nodes, NUMA, transparent huge pages, etc)
  • Go version, mgo usage patterns (I assume you do Copy() and Close() rather than using a single session - see the sketch after this list)
  • Debug log
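
To illustrate the Copy()/Close() usage pattern mentioned above, a minimal sketch of per-request session handling (the database, collection, and handler names are placeholders):

    package main

    import (
        "net/http"

        "github.com/globalsign/mgo"
        "github.com/globalsign/mgo/bson"
    )

    // newHandler copies the root session per request and closes the copy when
    // done, rather than sharing a single session across all goroutines.
    func newHandler(root *mgo.Session) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            s := root.Copy() // take a socket from the pool with fresh state
            defer s.Close()  // return it to the pool when the request finishes

            var doc bson.M
            // Placeholder database/collection names.
            err := s.DB("mydb").C("items").FindId(r.URL.Query().Get("id")).One(&doc)
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.Write([]byte("found\n"))
        }
    }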

On a side note, during our last testing run we had ~35,000 op/s (mixed workload) and our 99.9th percentile was 19.1ms - we have not experienced any problems. Your first graph shows the 99.9th percentile latency varying between 100ms and 1 second - if you're within a datacenter (not over a WAN link) this is a hugely worrying sign, and it's likely there are problems elsewhere in your infrastructure. If that latency graph appeared in our environment, the right-hand side would be considered normal, and the left would be the worry.

I'm happy to revert the change if it is causing an issue, but not without proof it is the cause. The above questions will help reproduce the issue in our testing environment and diagnose further.

Thanks,
Dom


To build your application without #52:

cd $GOPATH/src/github.com/globalsign/mgo
git pull
git revert 5be15cc6a44937b6937a9487e1633f8a5fb25e4a

Then build your application as normal. If you vendor mgo (you should) then check your dependency management tool's documentation.

sshipkov commented on September 23, 2024
  • MongoDB 3.4.9 Community, WiredTiger, 1 replica
  • 95% read, from cache, random reads
  • ~3 MB dataset, 88 B documents, 32,000 documents
  • ~0 latency, 0% loss, ~10 Gb max throughput, LAN (own data center, high-performance network)
  • MongoDB runs on 2x physical machines (10 cores, 380 GB of RAM), Debian, LXC, CPU load < 2%, load average < 1
  • Go 1.9.2, GOMAXPROCS 16, mgo Copy()/Close() (poolLimit 8), apps in a Kubernetes cluster, 6 pods on different physical nodes (pod CPU load < 0.2m)

Many measurements were taken before changing mgo.
The mgo change was deployed at night, so the influence of other factors is minimal.
Now all requests complete with a reported latency of 0 ms.

Unfortunately, I do not have time to investigate mgo thoroughly.
I will try testing the mgo update on other services later; if the result is the same, we will just use the 10gen/mgo fork ;)

domodwyer commented on September 23, 2024

Hi @sshipkov

Sorry for the delay, but it's the new year holidays! ;)

I might have misunderstood your graphs above - is the left side with the "move socket.Unlock()" change from #52 and the right side without #52? Or the other way around?

Either way, I replicated your setup as closely as I reasonably could, and I can see no latency regression from #52 - only a reduction in latency variance (a good thing). Note that the 99th percentile (p99) is the same for both, unlike in the graphs above:

Values in milliseconds:

x => r2017.10.11-select-zipfian-latency.log 
y => r2017.11.06-select-zipfian-latency.log 

          n min   max median   average    stddev  p99
x 200000009 0.5 240.2    0.5 0.5159448 0.1934402 1.05
y 200000009 0.5 240.2    0.5 0.5163596 0.1721280 1.05

0.3                0.4                 0.5                0.6                   
+|------------------+-------------------+------------------+-------------------|
                                        +                                       
1                                       |                                       
                                        +                                       
                                        +                                       
2                                       |                                       
                                        +                                       
Legend: 1=data$x, 2=data$y

At 95% probability:
===> average is statistically significant     (p=0.000000, diff ~0.000415)
===> variance is statistically significant    (p=0.000000)

While the change in average latency is statistically significant, it's tiny (0.0004148 ms) so it can be ignored. From the above I see no change in average latency between c4a7121 and 5be15cc. The variance (spread of measurements) is reduced - this is a good thing, and means fewer requests take a relatively long time.

The throughput however has changed in an interesting way:

Values are reads per second:

x => r2017.10.11-select-zipfian-throughput.log 
y => r2017.11.06-select-zipfian-throughput.log 

     n   min   max median  average   stddev      p99
x 2865 56970 72599  67608 67584.57 2382.520 72044.52
y 2865 57506 74826  69783 69797.62 2435.384 74466.88

                      65000                    70000                    75000   
 |----------------------+------------------------+------------------------+----|
                             +-------+-------+                                  
1    ------------------------|       |       |-----------------                 
                             +-------+-------+                                  
                                        +-------+-------+                       
2                   --------------------|       |       |-----------------      
                                        +-------+-------+                       
Legend: 1=data$x, 2=data$y

At 95% probability:
===> average is statistically significant     (p=0.000000, diff ~2213.048866)
===> variance not statistically significant   (p=0.240275)

Because of what I imagine is a reduced lock duration, the average throughput has increased by ~2213 reads per second.
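
As a general illustration only (not the actual mgo socket code) of why moving an Unlock() earlier can raise throughput: the mutex is held only while shared state is updated, not during the blocking network write, so other goroutines spend less time waiting for the lock.

    package main

    import (
        "net"
        "sync"
    )

    // Illustrative only - not the actual mgo implementation.
    type conn struct {
        mu      sync.Mutex
        nc      net.Conn
        pending map[uint32]func([]byte) // shared state guarded by mu
    }

    func (c *conn) send(id uint32, payload []byte, reply func([]byte)) error {
        c.mu.Lock()
        c.pending[id] = reply // register the reply handler under the lock...
        c.mu.Unlock()         // ...then release before the blocking network write

        _, err := c.nc.Write(payload) // slow I/O outside the critical section
        return err
    }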

It wasn't until after I ran the tests that I saw your poolLimit 8 comment - that is a really small number of connections to allow, and to me it is a really bad idea - are you sure this is the right thing to do for your workload? Once those 8 connections are busy, mgo blocks waiting for a connection to become free - this could easily be the source of your high 99th percentile measurements. It is not a good idea to use SetPoolLimit() as a concurrency throttle - see this post by @niemeyer (the original mgo author) for a clear explanation of why.
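
If throttling is genuinely needed, one common alternative (a sketch, not a prescription) is to bound concurrency in the application with a semaphore and leave the pool limit at its default, so mgo itself never blocks waiting for one of only 8 sockets; the limit of 64 and the database/collection names below are arbitrary examples:

    package main

    import (
        "github.com/globalsign/mgo"
        "github.com/globalsign/mgo/bson"
    )

    // sem bounds how many MongoDB operations run concurrently. The value 64
    // is an arbitrary example; tune it for your workload.
    var sem = make(chan struct{}, 64)

    func lookup(root *mgo.Session, id string) (bson.M, error) {
        sem <- struct{}{}        // acquire a slot
        defer func() { <-sem }() // release it when done

        s := root.Copy()
        defer s.Close()

        var doc bson.M
        // Placeholder database/collection names.
        err := s.DB("mydb").C("items").FindId(id).One(&doc)
        return doc, err
    }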

I run a lot of benchmarks, but I'm not a statistician - so take all the analysis above with a pinch of salt! If you would like to look at the latency measurements yourself, there's grouped data and a histogram (on the second sheet) here.
