
Comments (6)

NeoyeElf commented on May 24, 2024

I ran into the same problem. According to the source code, the DNS resolver waits at least 30s before doing the next DNS lookup, which is consistent with my test results.

After reading through the related issues, I see that they suggest setting the MaxConnectionAge config on the gRPC server side to get server-side load balancing. On the client side, you just use the regular service name as the endpoint. It works, but it is not the best way.

Another way is to implement a custom DNS resolver. We can decrease MinResolutionRate so that DNS queries can happen more frequently. But I don't know when ResolveNow will be called, and that determines whether a custom MinResolutionRate value is reasonable. Any suggestions?

from grpc-go.

gnvk commented on May 24, 2024

We continued our investigation and I think now we understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting the requests to one per sec, to avoid overwhelming the DNS server) we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server will potentially serve old IPs for at most 30 seconds. Having the delay on the client-side "solves this issue" for the price that during this TTL period the service is unavailable.


zasweq commented on May 24, 2024

Yeah, it seems like the 30-second kube-dns update is the limiting factor here. The algorithm in question is called exponential backoff (https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md), and it slows down the retries. Spurious RPC failures are expected in this case where you switch over: we start a DNS request that could still return the old address, or RPCs could be made before the new DNS request completes with the new address, so this seems to be WAI (working as intended). Unfortunately, more "intelligent" re-resolution as outlined in the linked issue is a tricky slope to navigate; in basically all cases it creates too many other issues.


NeoyeElf commented on May 24, 2024

> We continued our investigation and I think now we understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting the requests to one per sec, to avoid overwhelming the DNS server) we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server will potentially serve old IPs for at most 30 seconds. Having the delay on the client-side "solves this issue" for the price that during this TTL period the service is unavailable.

We use CoreDNS in k8s, and according to my tests it seems either not to have a cache or to clear the cache when a service is being deployed. We implemented a custom DNS resolver and set MinResolutionRate to 0.1s, and the resolver gets the new addresses after the connection receives a "connection refused" error. The following is part of the gRPC log:

2024-02-22T01:20:14.013091720Z 2024/02/22 01:20:14 INFO: [core] Creating new client transport to "{Addr: \"a.b.c.d:50051\", ServerName: \"lynx-block-discovery-dev-ep-va1-web:50051\", }": connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013211442Z 2024/02/22 01:20:14 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "a.b.c.d:50051", ServerName: "lynx-block-discovery-dev-ep-va1-web:50051", }. Err: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013224182Z 2024/02/22 01:20:14 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013229682Z 2024/02/22 01:20:14 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc00095af90, TRANSIENT_FAILURE

2024-02-22T01:20:14.112361725Z do look up!!!!!
2024-02-22T01:20:14.122264803Z 2024/02/22 01:20:14 INFO: [core] [Channel #1] Resolver state updated: {
2024-02-22T01:20:14.122301124Z   "Addresses": [
2024-02-22T01:20:14.122307714Z     {
2024-02-22T01:20:14.122315774Z       "Addr": "a1.b1.c1.d1:50051",
2024-02-22T01:20:14.122321754Z       "ServerName": "",
2024-02-22T01:20:14.122326344Z       "Attributes": null,
2024-02-22T01:20:14.122331014Z       "BalancerAttributes": null,
2024-02-22T01:20:14.122335064Z       "Metadata": null
2024-02-22T01:20:14.122339024Z     },
2024-02-22T01:20:14.122342814Z     {
2024-02-22T01:20:14.122348264Z       "Addr": "a2.b2.c2.d2:50051",
2024-02-22T01:20:14.122362854Z       "ServerName": "",
2024-02-22T01:20:14.122366344Z       "Attributes": null,
2024-02-22T01:20:14.122369714Z       "BalancerAttributes": null,
2024-02-22T01:20:14.122372734Z       "Metadata": null
2024-02-22T01:20:14.122375694Z     }
2024-02-22T01:20:14.122379294Z   ],


NeoyeElf commented on May 24, 2024

@zasweq What are the bad cases if I implement a custom DNS resolver and set MinResolutionRate to 0.1s?


gnvk commented on May 24, 2024

@NeoyeElf @zasweq Yes, we came up with a very similar solution / workaround: k8s cache with low (1s) TTL and custom resolver with instant lookup. I would also love to know about the bad cases with this setup.

Btw I also found this related PR: #6962, which pretty much solves this issue.

