Comments (6)
I ran into the same problem. From the source code, the DNS resolver waits at least 30s before doing the next DNS lookup, which is consistent with my test results.
After reading through the related issues, the suggestion there is to set the MaxConnectionAge config on the gRPC server side to get server-side load balancing, and to use the regular service name as the endpoint on the client side. It works, but it's not the best way.
Another way is to implement a custom DNS resolver. We can decrease MinResolutionRate so that DNS queries can happen more frequently. But I don't know when ResolveNow will be called, which affects whether the custom MinResolutionRate value is reasonable. Any suggestions?
from grpc-go.
We continued our investigation and I think now we understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting the requests to one per sec, to avoid overwhelming the DNS server) we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server will potentially serve old IPs for at most 30 seconds. Having the delay on the client-side "solves this issue" for the price that during this TTL period the service is unavailable.
from grpc-go.
Yeah, it seems like the 30 second kube-dns update is the limiting time here. That algorithm is called exponential backoff (https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md), and it is what slows things down. Spurious RPC failures are expected when you switch over like this: we start a DNS request that could still return the old address, or RPCs could be made before the new DNS request completes with the new address, so this seems to be WAI (working as intended). Unfortunately, more "intelligent" re-resolution, as outlined in the linked issue, is a tricky slope to navigate; in basically all cases it creates too many other issues.
from grpc-go.
We continued our investigation and I think now we understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting the requests to one per sec, to avoid overwhelming the DNS server) we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server will potentially serve old IPs for at most 30 seconds. Having the delay on the client-side "solves this issue" for the price that during this TTL period the service is unavailable.
According to my tests (we use CoreDNS in k8s), it seems that CoreDNS either does not have a cache or clears the cache when the service is being deployed. We implemented a custom DNS resolver and set MinResolutionRate to 0.1s, and the resolver gets the new addresses after the connection receives a "connection refused" error. The following is part of the gRPC log:
2024-02-22T01:20:14.013091720Z 2024/02/22 01:20:14 INFO: [core] Creating new client transport to "{Addr: \"a.b.c.d:50051\", ServerName: \"lynx-block-discovery-dev-ep-va1-web:50051\", }": connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013211442Z 2024/02/22 01:20:14 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "a.b.c.d:50051", ServerName: "lynx-block-discovery-dev-ep-va1-web:50051", }. Err: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013224182Z 2024/02/22 01:20:14 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013229682Z 2024/02/22 01:20:14 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc00095af90, TRANSIENT_FAILURE
2024-02-22T01:20:14.112361725Z do look up!!!!!
2024-02-22T01:20:14.122264803Z 2024/02/22 01:20:14 INFO: [core] [Channel #1] Resolver state updated: {
2024-02-22T01:20:14.122301124Z "Addresses": [
2024-02-22T01:20:14.122307714Z {
2024-02-22T01:20:14.122315774Z "Addr": "a1.b1.c1.d1:50051",
2024-02-22T01:20:14.122321754Z "ServerName": "",
2024-02-22T01:20:14.122326344Z "Attributes": null,
2024-02-22T01:20:14.122331014Z "BalancerAttributes": null,
2024-02-22T01:20:14.122335064Z "Metadata": null
2024-02-22T01:20:14.122339024Z },
2024-02-22T01:20:14.122342814Z {
2024-02-22T01:20:14.122348264Z "Addr": "a2.b2.c2.d2:50051",
2024-02-22T01:20:14.122362854Z "ServerName": "",
2024-02-22T01:20:14.122366344Z "Attributes": null,
2024-02-22T01:20:14.122369714Z "BalancerAttributes": null,
2024-02-22T01:20:14.122372734Z "Metadata": null
2024-02-22T01:20:14.122375694Z }
2024-02-22T01:20:14.122379294Z ],
from grpc-go.
@zasweq What are the bad cases if I implement a custom DNS resolver and set MinResolutionRate to 0.1s?
from grpc-go.
@NeoyeElf @zasweq Yes, we came up with a very similar solution / workaround: k8s cache with low (1s) TTL and custom resolver with instant lookup. I would also love to know about the bad cases with this setup.
Btw I also found this related PR: #6962, which pretty much solves this issue.
from grpc-go.