Git Product home page Git Product logo

Comments (14)

domodwyer avatar domodwyer commented on June 21, 2024 1

This is great news - thanks @dvic for reporting and @KJTsanaktsidis for such a comprehensive analysis and fix! Open source communities are alive and well! πŸ‘

I will close this after the hotfix - thanks a lot!

Dom

from mgo.

dvic avatar dvic commented on June 21, 2024

Could it be a problem with the -race flag? We removed the -race flag and up to this point the tests have stopped failing.

from mgo.

KJTsanaktsidis avatar KJTsanaktsidis commented on June 21, 2024

I got a similar (but not identical) deadlock & backtrace when running TestConnectCloseConcurrency. I think the main source of this problem are these two stacks:

goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
	/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
	/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
	/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
	/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175

and

goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
	/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
	/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
	/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3

As near as I can tell....

  • Goroutine 11 has the syncServersLoop, which loops every few hundred ms and checks the topology of the cluster.
  • syncServersLoop calls syncServersIteration to do its actual work on every pump of the loop
  • syncServersIteration spawns a new goroutine 30 and blocks goroutine 11 waiting for 30 on a sync.waitGroup
  • The anonymous function in syncServersIteration calls cluster.syncServer() to probe it and add it to the cluster.masters and cluster.servers slices.
  • cluster.syncServer explicitly opens a socket to this particular server with a call to server.AcquireSocket (as opposed to opening a socket to any server in the cluster)
  • cluster.syncServer calls server.isMaster() with this socket, to ask if the server is a replset master
  • isMaster creates a new session and explicitly assigns the passed-in socket to it. It prepares a command and then attempts to execute it with session.Run
  • This eventually falls in to Database.Run(), which calls session.acquireSocket()
  • acquireSocket() should be a no-op, since the isMaster call a few frames above explicitly set s.setSocket. However, it apparently fails the checks that s.masterSocket != nil && s.masterSocket.dead == nil or s.slaveSocket != nil && s.slaveSocket.dead == nil && s.slaveOk && slaveOk && (s.masterSocket == nil || s.consistency != PrimaryPreferred && s.consistency != Monotonic), and thus falls into s.cluster().AcquireSocket(). THIS is I believe the bug; the code higher up the stack is trying to call isMaster on a particular server, but this is going to get a connection to any arbitrary server matching the tags.
  • AcquireSocket looks for a server in its understanding of the topology by checking cluster.masters.Len() and cluster.servers.Len(). However, the cluster discovery hasn't actually run yet - syncServersIteration (further up our call stack in this goroutine) is supposed to populate those collections with a call to cluster.addServer(), but it needs to finish its call to syncServer/isMaster first.
  • Since the cluster topology isn't populated yet, AcquireSocket attempts to poke the syncServers loop on goroutine 11 by calling cluster.syncServers which just writes to a channel. This is actually a total no-op because both sides of the channel are read/written to nonblocking and the data is just a signal, but this is a different bug and not the actual issue.
  • AcquireSocket then waits on the condition variable cluster.serverSynced.Wait().
  • BUT, that condition variable is broadcast from three places:
    • syncServersLoop, which is not iterating at the moment because goroutine 11 is blocked on the waitgroup in syncServersIteration
    • addServer and syncServer, both of which are only called from syncServersIteration, which we are blocking on goroutine 30
  • Thus, we have a deadlock.

phew. That was fun.

I'm pretty sure the bug is that isMaster is using session.setSocket to ensure that the command with Run is run against the right server, but if something is wrong with the socket, instead of passing an error up to isMaster, Run calls acquireSocket which just attempts to make a new socket to any random server in the cluster. The deadlock is not a code path that should ever be made to work, I think.

Thoughts?

from mgo.

domodwyer avatar domodwyer commented on June 21, 2024

Hi @dvic and @KJTsanaktsidis

First off - @dvic thanks for the solid report, and @KJTsanaktsidis thanks for diving deeper into mgo than is good for your sanity!

We'll take a look at this - we've never seen any deadlocks ourselves but the possibility is definitely there - there's an amazing amount of interplay with the locks (as @KJTsanaktsidis can clearly attest!) Do either of you have any reproducing code we can look at?

Dom

from mgo.

KJTsanaktsidis avatar KJTsanaktsidis commented on June 21, 2024

I’ll have a look and see if I can find a solid reproduction next week - maybe a β€œmongo” server that accepts then closes all connections might trigger this code path?

from mgo.

KJTsanaktsidis avatar KJTsanaktsidis commented on June 21, 2024

@domodwyer I think I've managed to provide a repro in #121 - the test in the first commit fails about 20% of the time when i run it with go test -check.v -check.f "S.TestNoDeadlockOnClose" -timeout 25s on my machine.

from mgo.

domodwyer avatar domodwyer commented on June 21, 2024

Hi @dvic

We're going to merge #121 into development ASAP (thanks to @KJTsanaktsidis !) and cut a hotfix to master once it's tested. In the meantime would you be able to run your tests using the development mgo branch to check if it resolves this issue?

Dom

from mgo.

dvic avatar dvic commented on June 21, 2024

Hi @domodwyer, sure no problem. Thanks! Will try it now and get back to you.

from mgo.

domodwyer avatar domodwyer commented on June 21, 2024

Hey @dvic

It's not merged just yet - I'll post here when it's done πŸ‘

Dom

from mgo.

dvic avatar dvic commented on June 21, 2024

No problem, for now I just used https://github.com/zendesk/mgo/tree/fix_dial_deadlock directly, TravisCI is running.. 🀞

from mgo.

dvic avatar dvic commented on June 21, 2024

Good news: I ran the test suite three times now, each passed without problems πŸ‘ I'll keep them running just to be sure and I can also run it a few times on the dev branch once you're ready.

from mgo.

dvic avatar dvic commented on June 21, 2024

@domodwyer Tests keep passing, #121 definitely seems to solve the problem (for me at least). Let me know if you want me to perform additional test runs on the dev branch.

from mgo.

KJTsanaktsidis avatar KJTsanaktsidis commented on June 21, 2024

Really happy to help - having this library be actively maintained helps everyone!

from mgo.

domodwyer avatar domodwyer commented on June 21, 2024

Hi @dvic, @KJTsanaktsidis

Sorry for disappearing, I was out the country! It looks like this has been fixed (thanks!) but with a direct push to development so this didn't close (I'll also find out how that happened - it should be PR only) so closing now.

I will cut a hotfix release after a test run - thanks again!

Dom

from mgo.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.