Comments (16)
This is not the recommended way to restart your servers (though your application needs to be resilient to server crashes, of course). Ideally you would catch SIGINT/etc. and do a GracefulStop + some grace period + Stop instead. If you have long-running streams, you would also signal them at that time (after GracefulStop) to end their RPCs ASAP.
I would recommend something like this:
var stream pb.FooClientStream
for {
    var err error
    if stream, err = client.Foo(ctxWithoutDeadline, req); err == nil {
        break
    }
    if clientConn.GetState() != connectivity.TransientFailure {
        return err
    }
    if !clientConn.WaitForStateChange(ctxWithDeadline, connectivity.TransientFailure) {
        return ctxWithDeadline.Err()
    }
}
// use stream
Using wait-for-ready is probably not what you want here, and with the option you're proposing, the RPC could fail in other ways besides "couldn't connect in time" (e.g. connection loss after RPC started), so you'll always need a loop around stream establishment.
from grpc-go.
Wait-for-ready is often not a good idea, particularly with long-lived streaming RPCs (i.e. no deadline), and even more so if you need the stream to start working in a short amount of time. Why are you using WFR?
Another option would be to stop using WFR and retry the stream a few times manually if it fails before your 3s stream establishment deadline.
from grpc-go.
Retrying without WFR is exactly what we're doing now. Maybe I should've listed it as a considered alternative. But to me WFR would seem like a better approach, if its timeout worked as described in this feature request. Without it we're trying to send the RPC in a retry loop (with exponential backoff between the requests) but this doesn't really make any sense while the connection is not ready, does it?
And the 3s timeout is just an example, of course. The use case that motivated trying out the wait for ready option is that the server is simply restarted. Normally that only takes a few seconds and all its clients should reconnect as soon as possible. But on the other hand if there's a misconfiguration (e.g. the client tries to connect to a wrong address), we should see that error after a reasonable timeout, instead of waiting forever.
from grpc-go.
this doesn't really make any sense while the connection is not ready, does it?
The problem with WFR is that the name is misleading.
ALL RPCs wait until the channel is ready while it is connecting. The difference with WFR is that it will survive if the channel goes into a failure mode. This is rarely important to anyone. Note that even if the server restarts, the client does not go into failure unless it takes too long for the restart and the reconnection attempt times out or if it gets I/O errors when reconnecting.
from grpc-go.
But on the other hand if there's a misconfiguration (e.g. the client tries to connect to a wrong address), we should see that error after a reasonable timeout, instead of waiting forever.
In this case, what are you doing in the event of a timeout? Logging? Changing health status? Etc?
from grpc-go.
ALL RPCs wait until the channel is ready while it is connecting. The difference with WFR is that it will survive if the channel goes into a failure mode. This is rarely important to anyone. Note that even if the server restarts, the client does not go into failure unless it takes too long for the restart and the reconnection attempt times out or if it gets I/O errors when reconnecting.
When the server restarts, without wait for ready the client RPC returns with this error instantly:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp <ip>:9090: connect: connection refused"
In this case, what are you doing in the event of a timeout? Logging? Changing health status? Etc?
This depends on the particular client. Logging is always a good idea. Sending an alert, restarting the client are other options.
from grpc-go.
Btw I'm playing with the service config, and the retry policy doesn't seem to work either. Given this config:
{
  "methodConfig": [{
    "name": [{"service": "%s"}],
    "retryPolicy": {
      "MaxAttempts": 30,
      "InitialBackoff": ".01s",
      "MaxBackoff": "1s",
      "BackoffMultiplier": 2,
      "RetryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
the stream RPC call returns instantly with the above Unavailable error.
from grpc-go.
without wait for ready the client RPC returns with this error instantly
Hmm, yes, in that case if Dial fails, then the channel will transition into TRANSIENT_FAILURE.
How are you restarting your servers? Are you just doing a hard restart, or shutting them down with a GracefulStop, a timeout, then a hard Stop?
If an RPC has started on the server, then it is normally not retryable through the service config retry.
from grpc-go.
How are you restarting your servers?
In production the server is running in a Kubernetes pod and a "restart" means that the pod is restarted. But a very similar behaviour can be seen locally if you simply kill (Ctrl+C), or don't even start the server.
from grpc-go.
Thank you @dfawley 🙂 This is very close to what we're already using, with the additional WaitForStateChange, which I quite like 👍
Actually, it's pretty much exactly what this proposal proposed: a waitForReady with the requested "readiness timeout". The (small) downside of this solution is that this code needs to be copy-pasted to wrap every single stream RPC call (vs. the single default service config).
(Btw I changed one small but important thing: I replaced the last ctxWithDeadline.Err() return with a wrapped error that also contains the last err.)
from grpc-go.
Sounds good. I think the problem here that is preventing you from doing the simple thing is that you're getting RPC errors when your server restarts, which should be avoidable IIUC by shutting the server down gracefully. If those errors weren't occurring, this would be simply:
stream, err := client.Foo(ctxWithoutDeadline, req)
if err != nil { // this is only ever a connection error or an internal/programmer error
    return err
}
With service config retries on UNAVAILABLE, I believe this should also retry in all the scenarios we're talking about. If you aren't seeing that, then I'd be interested in a repro case if you have one.
I'm fairly sure the case I was talking about that isn't retriable isn't possible at this stage. It would only occur after this returns, and a subsequent stream read might return a non-retriable UNAVAILABLE error.
from grpc-go.
With service config retries on UNAVAILABLE, I believe this should also retry in all the scenarios we're talking about. If you aren't seeing that, then I'd be interested in a repro case if you have one.
// Nothing's running on localhost:9090!
conn, err := grpc.Dial("localhost:9090",
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithDefaultServiceConfig(fmt.Sprintf(`{
        "methodConfig": [{
            "name": [{"service": "%s"}],
            "retryPolicy": {
                "MaxAttempts": 5,
                "InitialBackoff": "1s",
                "MaxBackoff": "1s",
                "BackoffMultiplier": 1,
                "RetryableStatusCodes": ["UNAVAILABLE"]
            }
        }]
    }`, grpc_testing.SearchService_ServiceDesc.ServiceName)),
)
if err != nil {
    panic(err)
}
client := grpc_testing.NewSearchServiceClient(conn)
start := time.Now()
_, err = client.Search(context.Background(), &grpc_testing.SearchRequest{})
// Based on the config I would expect the elapsed time to be at least 5 seconds,
// but in reality it's around 1s.
fmt.Println("Elapsed: ", time.Since(start))
fmt.Println("Error: ", err)
fmt.Println("Status: ", conn.GetState().String())
Also, I added logging to the above code and it confirmed that only a single call is made.
from grpc-go.
If you're logging in an interceptor, you won't see multiple calls, because interceptors are invoked per channel operation, not per attempt operation.
Also, you're not seeing 5s elapse because retries happen with a delay of rand(0, backoff):
https://github.com/grpc/proposal/blob/master/A6-client-retries.md#integration-with-service-config
When I added your test to try to repro this, and logging in gRPC, I see the appropriate behavior:
https://github.com/dfawley/grpc-go/tree/testretry
tlogger.go:111: ERROR stream.go:623 [core] Should retry? rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:9090: connect: connection refused" (t=+1.785625ms)
tlogger.go:111: ERROR stream.go:709 [core] RP=&{MaxAttempts:5 InitialBackoff:1s MaxBackoff:1s BackoffMultiplier:1 RetryableStatusCodes:map[Unavailable:true]}; fact=1, cur=1e+09 (t=+1.831392ms)
tlogger.go:111: ERROR stream.go:716 [core] Timer with duration 181.966465ms (t=+1.850658ms)
tlogger.go:111: ERROR stream.go:721 [core] time to retry (t=+183.947434ms)
tlogger.go:111: ERROR stream.go:623 [core] Should retry? rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:9090: connect: connection refused" (t=+184.003701ms)
tlogger.go:111: ERROR stream.go:709 [core] RP=&{MaxAttempts:5 InitialBackoff:1s MaxBackoff:1s BackoffMultiplier:1 RetryableStatusCodes:map[Unavailable:true]}; fact=1, cur=1e+09 (t=+184.057532ms)
tlogger.go:111: ERROR stream.go:716 [core] Timer with duration 777.045481ms (t=+184.08327ms)
tlogger.go:111: ERROR stream.go:721 [core] time to retry (t=+962.094056ms)
(etc)
from grpc-go.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
from grpc-go.
@dfawley Thank you for the clarification and for your help :)
I can live with the workaround you suggested (loop + wait for change), but in my opinion it's less convenient than what I'm suggesting in this feature request. If you disagree, feel free to close this issue.
from grpc-go.
I don't think this proposal is likely to be accepted (for the reasons mentioned above), but if you do want to pursue it, it would be best to file an issue in the grpc/grpc repo, since it would need to be accepted cross-language first. I'll close this, but feel free to reference it if you do file one.
from grpc-go.