Comments (14)
This is great news - thanks @dvic for reporting and @KJTsanaktsidis for such a comprehensive analysis and fix! Open source communities are alive and well! π
I will close this after the hotfix - thanks a lot!
Dom
from mgo.
Could it be a problem with the -race
flag? We removed the -race
flag and up to this point the tests have stopped failing.
from mgo.
I got a similar (but not identical) deadlock & backtrace when running TestConnectCloseConcurrency
. I think the main source of this problem are these two stacks:
goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175
and
goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3
As near as I can tell....
- Goroutine 11 has the
syncServersLoop
, which loops every few hundred ms and checks the topology of the cluster. syncServersLoop
callssyncServersIteration
to do its actual work on every pump of the loopsyncServersIteration
spawns a new goroutine 30 and blocks goroutine 11 waiting for 30 on async.waitGroup
- The anonymous function in
syncServersIteration
callscluster.syncServer()
to probe it and add it to thecluster.masters
andcluster.servers
slices. cluster.syncServer
explicitly opens a socket to this particular server with a call toserver.AcquireSocket
(as opposed to opening a socket to any server in the cluster)cluster.syncServer
callsserver.isMaster()
with this socket, to ask if the server is a replset masterisMaster
creates a new session and explicitly assigns the passed-in socket to it. It prepares a command and then attempts to execute it withsession.Run
- This eventually falls in to
Database.Run()
, which callssession.acquireSocket()
acquireSocket()
should be a no-op, since theisMaster
call a few frames above explicitly sets.setSocket
. However, it apparently fails the checks thats.masterSocket != nil && s.masterSocket.dead == nil
ors.slaveSocket != nil && s.slaveSocket.dead == nil && s.slaveOk && slaveOk && (s.masterSocket == nil || s.consistency != PrimaryPreferred && s.consistency != Monotonic)
, and thus falls intos.cluster().AcquireSocket()
. THIS is I believe the bug; the code higher up the stack is trying to callisMaster
on a particular server, but this is going to get a connection to any arbitrary server matching the tags.AcquireSocket
looks for a server in its understanding of the topology by checkingcluster.masters.Len()
andcluster.servers.Len()
. However, the cluster discovery hasn't actually run yet -syncServersIteration
(further up our call stack in this goroutine) is supposed to populate those collections with a call tocluster.addServer()
, but it needs to finish its call tosyncServer
/isMaster
first.- Since the cluster topology isn't populated yet,
AcquireSocket
attempts to poke thesyncServers
loop on goroutine 11 by callingcluster.syncServers
which just writes to a channel. This is actually a total no-op because both sides of the channel are read/written to nonblocking and the data is just a signal, but this is a different bug and not the actual issue. AcquireSocket
then waits on the condition variablecluster.serverSynced.Wait()
.- BUT, that condition variable is broadcast from three places:
syncServersLoop
, which is not iterating at the moment because goroutine 11 is blocked on the waitgroup insyncServersIteration
addServer
andsyncServer
, both of which are only called fromsyncServersIteration
, which we are blocking on goroutine 30
- Thus, we have a deadlock.
phew. That was fun.
I'm pretty sure the bug is that isMaster
is using session.setSocket
to ensure that the command with Run
is run against the right server, but if something is wrong with the socket, instead of passing an error up to isMaster
, Run
calls acquireSocket
which just attempts to make a new socket to any random server in the cluster. The deadlock is not a code path that should ever be made to work, I think.
Thoughts?
from mgo.
Hi @dvic and @KJTsanaktsidis
First off - @dvic thanks for the solid report, and @KJTsanaktsidis thanks for diving deeper into mgo than is good for your sanity!
We'll take a look at this - we've never seen any deadlocks ourselves but the possibility is definitely there - there's an amazing amount of interplay with the locks (as @KJTsanaktsidis can clearly attest!) Do either of you have any reproducing code we can look at?
Dom
from mgo.
Iβll have a look and see if I can find a solid reproduction next week - maybe a βmongoβ server that accepts then closes all connections might trigger this code path?
from mgo.
@domodwyer I think I've managed to provide a repro in #121 - the test in the first commit fails about 20% of the time when i run it with go test -check.v -check.f "S.TestNoDeadlockOnClose" -timeout 25s
on my machine.
from mgo.
Hi @dvic
We're going to merge #121 into development ASAP (thanks to @KJTsanaktsidis !) and cut a hotfix to master once it's tested. In the meantime would you be able to run your tests using the development mgo branch to check if it resolves this issue?
Dom
from mgo.
Hi @domodwyer, sure no problem. Thanks! Will try it now and get back to you.
from mgo.
Hey @dvic
It's not merged just yet - I'll post here when it's done π
Dom
from mgo.
No problem, for now I just used https://github.com/zendesk/mgo/tree/fix_dial_deadlock directly, TravisCI is running.. π€
from mgo.
Good news: I ran the test suite three times now, each passed without problems π I'll keep them running just to be sure and I can also run it a few times on the dev branch once you're ready.
from mgo.
@domodwyer Tests keep passing, #121 definitely seems to solve the problem (for me at least). Let me know if you want me to perform additional test runs on the dev branch.
from mgo.
Really happy to help - having this library be actively maintained helps everyone!
from mgo.
Hi @dvic, @KJTsanaktsidis
Sorry for disappearing, I was out the country! It looks like this has been fixed (thanks!) but with a direct push to development so this didn't close (I'll also find out how that happened - it should be PR only) so closing now.
I will cut a hotfix release after a test run - thanks again!
Dom
from mgo.
Related Issues (20)
- Query blocks with ChangeStream HOT 2
- Upsert doesn't work well with query where value is []byte format
- Expose dialInfo through a getter?
- High concurrency session copy does not release the connection and poolLimit is invalid
- Request: BSON stabilize ordering of map keys (instead of GoLang's random ordering) HOT 1
- Unwanted allocs in mgo
- Auto type assertion not working when fetching data
- Clean actions attribute of Bulk struct
- time.Parse HOT 2
- can any one tell me how to use $near to calculate lat,lng distance
- MongoDB index not being used when querying through mgo on mongoDB views
- }
- DropDatabase returns not master when seesion mode is Secondary
- mgo: FindAndModify create too many sockets
- error="SASL support not enabled during build (-tags sasl)" HOT 8
- Daemontools
- Panic With ObjectIdHex(s string)
- current
- BSON unmarshal unable to parse $date
- error="no reachable servers" HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from mgo.