Comments (16)
This is not the recommended way to restart your servers (though your application needs to be resilient to server crashes, of course). Ideally you would catch SIGINT/etc. and do a GracefulStop + some grace period + Stop instead. If you have long-running streams, you would also signal them at that time (after GracefulStop) to end their RPCs ASAP.
I would recommend something like this:
var stream pb.FooClientStream
for {
    var err error
    if stream, err = client.Foo(ctxWithoutDeadline, req); err == nil {
        break
    }
    if clientConn.GetState() != connectivity.TransientFailure {
        return err
    }
    if !clientConn.WaitForStateChange(ctxWithDeadline, connectivity.TransientFailure) {
        return ctxWithDeadline.Err()
    }
}
// use stream
Using wait-for-ready is probably not what you want here, and with the option you're proposing, the RPC could fail in other ways besides "couldn't connect in time" (e.g. connection loss after RPC started), so you'll always need a loop around stream establishment.
from grpc-go.
Wait-for-ready is often not a good idea, particularly with long-lived streaming RPCs (i.e. no deadline), and even more so if you need the stream to start working in a short amount of time. Why are you using WFR?
Another option would be to stop using WFR and retry the stream a few times manually if it fails before your 3s stream establishment deadline.
from grpc-go.
Retrying without WFR is exactly what we're doing now. Maybe I should've listed it as a considered alternative. But to me WFR would seem like a better approach, if its timeout worked as described in this feature request. Without it we're trying to send the RPC in a retry loop (with exponential backoff between the requests) but this doesn't really make any sense while the connection is not ready, does it?
And the 3s timeout is just an example, of course. The use case that motivated trying out the wait for ready option is that the server is simply restarted. Normally that only takes a few seconds and all its clients should reconnect as soon as possible. But on the other hand if there's a misconfiguration (e.g. the client tries to connect to a wrong address), we should see that error after a reasonable timeout, instead of waiting forever.
from grpc-go.
this doesn't really make any sense while the connection is not ready, does it?
The problem with WFR is that the name is misleading.
ALL RPCs wait until the channel is ready while it is connecting. The difference with WFR is that it will survive if the channel goes into a failure mode. This is rarely important to anyone. Note that even if the server restarts, the client does not go into failure unless it takes too long for the restart and the reconnection attempt times out or if it gets I/O errors when reconnecting.
from grpc-go.
But on the other hand if there's a misconfiguration (e.g. the client tries to connect to a wrong address), we should see that error after a reasonable timeout, instead of waiting forever.
In this case, what are you doing in the event of a timeout? Logging? Changing health status? Etc?
from grpc-go.
ALL RPCs wait until the channel is ready while it is connecting. The difference with WFR is that it will survive if the channel goes into a failure mode. This is rarely important to anyone. Note that even if the server restarts, the client does not go into failure unless it takes too long for the restart and the reconnection attempt times out or if it gets I/O errors when reconnecting.
When the server restarts, without wait for ready the client RPC returns with this error instantly:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp <ip>:9090: connect: connection refused"
In this case, what are you doing in the event of a timeout? Logging? Changing health status? Etc?
This depends on the particular client. Logging is always a good idea. Sending an alert, restarting the client are other options.
from grpc-go.
Btw I'm playing with the service config, and the retry policy doesn't seem to work either. Given this config:
{
  "methodConfig": [{
    "name": [{"service": "%s"}],
    "retryPolicy": {
      "MaxAttempts": 30,
      "InitialBackoff": ".01s",
      "MaxBackoff": "1s",
      "BackoffMultiplier": 2,
      "RetryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
the stream RPC call returns instantly with the above Unavailable error.
from grpc-go.
without wait for ready the client RPC returns with this error instantly
Hmm, yes, in that case if Dial fails, then the channel will transition into TRANSIENT_FAILURE.
How are you restarting your servers? Are you just doing a hard restart, or shutting them down with a GracefulStop, a timeout, then a hard Stop?
If an RPC has started on the server, then it is normally not retryable through the service config retry.
from grpc-go.
How are you restarting your servers?
In production the server is running in a Kubernetes pod and a "restart" means that the pod is restarted. But a very similar behaviour can be seen locally if you simply kill (Ctrl+C), or don't even start the server.
from grpc-go.
Thank you @dfawley 🙂 This is very close to what we're already using, with the additional WaitForStateChange, which I quite like 👍
Actually, it's pretty much exactly what this proposal proposed: a waitForReady with the requested "readiness timeout". The (small) downside of this solution is that this code needs to be copy-pasted to wrap every single stream RPC call (vs. the single default service config).
(Btw I changed one small but important thing: I replaced the last ctxWithDeadline.Err() return with a wrapped error that also contains the last err.)
from grpc-go.
Sounds good. I think the problem here that is preventing you from doing the simple thing is that you're getting RPC errors when your server restarts, which should be avoidable IIUC by shutting the server down gracefully. If those errors weren't occurring, this would be simply:
stream, err := client.Foo(ctxWithoutDeadline, req)
if err != nil { // this is only ever a connection error or an internal/programmer error
    return err
}
With service config retries on UNAVAILABLE, I believe this should also retry in all the scenarios we're talking about. If you aren't seeing that, then I'd be interested in a repro case if you have one.
I'm fairly sure the case I was talking about that isn't retriable isn't possible at this stage. It would only occur after this returns, and a subsequent stream read might return a non-retriable UNAVAILABLE error.
from grpc-go.
With service config retries on UNAVAILABLE, I believe this should also retry in all the scenarios we're talking about. If you aren't seeing that, then I'd be interested in a repro case if you have one.
// Nothing's running on localhost:9090!
conn, err := grpc.Dial("localhost:9090",
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithDefaultServiceConfig(fmt.Sprintf(`{
        "methodConfig": [{
            "name": [{"service": "%s"}],
            "retryPolicy": {
                "MaxAttempts": 5,
                "InitialBackoff": "1s",
                "MaxBackoff": "1s",
                "BackoffMultiplier": 1,
                "RetryableStatusCodes": ["UNAVAILABLE"]
            }
        }]
    }`, grpc_testing.SearchService_ServiceDesc.ServiceName)),
)
if err != nil {
    panic(err)
}
client := grpc_testing.NewSearchServiceClient(conn)
start := time.Now()
_, err = client.Search(context.Background(), &grpc_testing.SearchRequest{})
// Based on the config I would expect the elapsed time to be at least 5 seconds,
// but in reality it's around 1s.
fmt.Println("Elapsed: ", time.Since(start))
fmt.Println("Error: ", err)
fmt.Println("Status: ", conn.GetState().String())
Also, I added logging to the above code and it confirmed that only a single call is made.
from grpc-go.
If you're logging in an interceptor, you won't see multiple calls, because interceptors are invoked per channel operation, not per attempt operation.
Also, you're not seeing 5s elapse because retries happen with a delay of rand(0, backoff):
https://github.com/grpc/proposal/blob/master/A6-client-retries.md#integration-with-service-config
When I added your test to try to repro this, and logging in gRPC, I see the appropriate behavior:
https://github.com/dfawley/grpc-go/tree/testretry
tlogger.go:111: ERROR stream.go:623 [core] Should retry? rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:9090: connect: connection refused" (t=+1.785625ms)
tlogger.go:111: ERROR stream.go:709 [core] RP=&{MaxAttempts:5 InitialBackoff:1s MaxBackoff:1s BackoffMultiplier:1 RetryableStatusCodes:map[Unavailable:true]}; fact=1, cur=1e+09 (t=+1.831392ms)
tlogger.go:111: ERROR stream.go:716 [core] Timer with duration 181.966465ms (t=+1.850658ms)
tlogger.go:111: ERROR stream.go:721 [core] time to retry (t=+183.947434ms)
tlogger.go:111: ERROR stream.go:623 [core] Should retry? rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:9090: connect: connection refused" (t=+184.003701ms)
tlogger.go:111: ERROR stream.go:709 [core] RP=&{MaxAttempts:5 InitialBackoff:1s MaxBackoff:1s BackoffMultiplier:1 RetryableStatusCodes:map[Unavailable:true]}; fact=1, cur=1e+09 (t=+184.057532ms)
tlogger.go:111: ERROR stream.go:716 [core] Timer with duration 777.045481ms (t=+184.08327ms)
tlogger.go:111: ERROR stream.go:721 [core] time to retry (t=+962.094056ms)
(etc)
from grpc-go.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
from grpc-go.
@dfawley Thank you for the clarification and for your help :)
I can live with the workaround you suggested (loop + wait for change), but in my opinion it's less convenient than what I'm suggesting in this feature request. If you disagree, feel free to close this issue.
from grpc-go.
I don't think this proposal is likely to be accepted (for the reasons mentioned above), but if you do want to pursue it, it would be best to file an issue in the grpc/grpc repo, since it would need to be accepted cross-language first. I'll close this, but feel free to reference it if you do file one.
from grpc-go.