
Comments (5)

panjf2000 commented on August 25, 2024

(quoting @rschone's report, reproduced in full below)

I'll keep investigating this. In the meantime, please inform us by updating this issue thread if any helpful info pops into your head, thanks!


mknyszek commented on August 25, 2024

CC @neild


panjf2000 commented on August 25, 2024

This issue may have something to do with 212d385 and 854a2f8.

As we dug deeper, we found a place in net/http/transport.go that could cause the issue. While readLoop is waiting to read the response body and pc.roundTrip is waiting on pcClosed/cancelChan/ctxDoneChan, it can happen that the select in readLoop handles rc.req.Cancel or rc.req.Context().Done() first: it calls cancelRequest, which invokes and removes the cancel function from transport.reqCanceler. After that, the pcClosed case in pc.roundTrip can never see canceled=true, and pc.t.replaceReqCanceler() can never return true either, because transport.reqCanceler no longer contains the cancelKey. A simplified sketch of this interleaving follows the list below.

The prerequisites for this to happen are:

  • the request timed out or was canceled
  • rc.req.Cancel or rc.req.Context().Done() is handled first by readLoop
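
To make the suspected interleaving easier to follow, here is a minimal, self-contained model of it. This is only a sketch of the hypothesis above, not the actual net/http implementation: fakeTransport and its methods merely mimic the behaviour attributed to Transport.reqCanceler, cancelRequest, and replaceReqCanceler in the analysis.

// Simplified model of the hypothesized race (not the real net/http code).
package main

import (
	"errors"
	"fmt"
	"sync"
)

type fakeTransport struct {
	mu          sync.Mutex
	reqCanceler map[string]func(error) // stands in for Transport.reqCanceler
}

// cancelRequest mimics Transport.cancelRequest: it invokes and removes the
// cancel function for key, reporting whether one was still registered.
func (t *fakeTransport) cancelRequest(key string, err error) bool {
	t.mu.Lock()
	cancel, ok := t.reqCanceler[key]
	delete(t.reqCanceler, key)
	t.mu.Unlock()
	if ok && cancel != nil {
		cancel(err)
	}
	return ok
}

// replaceReqCanceler mimics Transport.replaceReqCanceler(key, nil): it only
// succeeds while the key is still present.
func (t *fakeTransport) replaceReqCanceler(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.reqCanceler[key]
	if ok {
		t.reqCanceler[key] = nil
	}
	return ok
}

func main() {
	t := &fakeTransport{reqCanceler: map[string]func(error){
		"req-1": func(error) { /* would abort the in-flight request */ },
	}}

	// Step 1: readLoop handles rc.req.Context().Done() first and removes the key.
	fmt.Println(t.cancelRequest("req-1", errors.New("context deadline exceeded"))) // true

	// Step 2: roundTrip handles ctxDoneChan afterwards; the key is already gone,
	// so its local canceled flag stays false.
	canceled := t.cancelRequest("req-1", errors.New("context deadline exceeded"))
	fmt.Println(canceled) // false

	// Step 3: when <-pcClosed later fires, the guard used by roundTrip,
	// canceled || replaceReqCanceler(key), is false on both sides, so under the
	// hypothesis roundTrip would not return and keeps waiting in its select.
	fmt.Println(canceled || t.replaceReqCanceler("req-1")) // false
}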

Nonetheless, I'm not so convinced by this analysis, because if the client has made its way here:

go/src/net/http/transport.go

Lines 2272 to 2277 in 065c5d2

case <-rc.req.Cancel:
	alive = false
	pc.t.cancelRequest(rc.cancelKey, errRequestCanceled)
case <-rc.req.Context().Done():
	alive = false
	pc.t.cancelRequest(rc.cancelKey, rc.req.Context().Err())

, a response should already have been sent out, which would free the roundTrip method from blocking in the select{...}:

go/src/net/http/transport.go

Lines 2708 to 2718 in 065c5d2

case re := <-resc:
	if (re.res == nil) == (re.err == nil) {
		panic(fmt.Sprintf("internal error: exactly one of res or err should be set; nil=%v", re.res == nil))
	}
	if debugRoundTrip {
		req.logf("resc recv: %p, %T/%#v", re.res, re.err, re.err)
	}
	if re.err != nil {
		return nil, pc.mapRoundTripError(req, startBytesWritten, re.err)
	}
	return re.res, nil

Any chance you can write a test that can reproduce this issue? Thanks!
@rschone


rschone commented on August 25, 2024

That was our best guess :( Hmm, you're right that the case re := <-resc should free the roundTrip method once readLoop reaches its last select, so there must be something else.

The problem is that the stuck goroutines are reproducible only in production. A gateway service (Go, HTTP client) communicates with a Java service (the HTTP server here) via PUT calls. Both are HA, initially 2->2 pods before scaling. When the gateway is bombarded with requests generated by dozens of pods (a simulated DDoS, hundreds of requests per second from each of them), the Java service becomes slow and latencies rise, and the number of concurrently processed requests in the gateway grows from roughly tens to thousands. We run the load test at that rate for ~15-30 seconds and then turn it off (to reproduce the issue while staying under the k8s memory limits, i.e. without killing the pods and before the services scale up and spread the load). After the load test, a metric tracking the number of currently processed requests settles in the low hundreds and never returns to the normal level of about tens. The same number of stuck goroutines is visible in the pprof dump.

So the only information we have is the pprof dump pointing us at the roundTrip method. We are still far from being able to write a test and are still searching for the exact cause :(
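
For context, a goroutine dump like the one described can be captured either through the net/http/pprof endpoints or by writing the goroutine profile directly. This is a generic sketch, not the gateway's actual instrumentation; the port and file name are arbitrary.

// Two common ways to capture a goroutine dump showing where goroutines are parked.
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"os"
	"runtime/pprof"
)

func main() {
	// Option 1: expose the pprof endpoints, then fetch
	// http://localhost:6060/debug/pprof/goroutine?debug=2 for full stacks.
	go func() { _ = http.ListenAndServe("localhost:6060", nil) }()

	// Option 2: write the goroutine profile to a file on demand.
	f, err := os.Create("goroutine.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// debug=2 prints every goroutine with its full stack, which is where
	// goroutines stuck in (*persistConn).roundTrip would show up.
	_ = pprof.Lookup("goroutine").WriteTo(f, 2)
}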

We set ResponseHeaderTimeout, there is a deadline on the context, and we tried it without the trace round tripper, with keep-alives disabled, and with different Go versions (1.21, 1.22), but none of this stopped the goroutines from getting stuck.
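
For reference, here is a rough sketch of the client setup described above (ResponseHeaderTimeout, a deadline on the request context, keep-alives disabled). The timeout values, URL, package and function names are illustrative, not taken from the thread.

// Sketch of the described client configuration (illustrative values only).
package gateway

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"time"
)

// client uses ResponseHeaderTimeout and DisableKeepAlives, as mentioned above;
// the concrete timeout values are hypothetical.
var client = &http.Client{
	Transport: &http.Transport{
		ResponseHeaderTimeout: 5 * time.Second,
		DisableKeepAlives:     true,
	},
}

// callJavaService issues the PUT call with a deadline on the request context,
// roughly as the gateway does.
func callJavaService(payload []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"https://java-service.example/endpoint", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain the body before closing
	return err
}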

Any ideas on how to proceed further to localize the bug? Thank you!


gopherbot commented on August 25, 2024

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

