Comments (5)
@sebastianjonasson : You are correct in your assessment about when the timer is created in the DNS resolver's watcher goroutine. I was also going to fix it the same way you have attempted to fix it. But I wanted to have a test for it, which is why I asked you if you had an easy reproduction. Since you don't have an easy repro, I will work on a unit test for this scenario. Thank you for reporting this issue and all the information you have provided along the way.
from grpc-go.
@easwars could you verify if the the patch from PR #6668 can cause this delay?
from grpc-go.
I can see how #6668 could be the problem maker. I understand the problem, and the fix required as well.
Is there a way to easily reproduce the problem though (without any complicated k8s setup)? That would be helpful to verify the fix. Thanks.
from grpc-go.
Thanks for the response @easwars!
Unfortunately I've not been able to reproduce this locally. All debugging have been done in our cloud based dev environment.
I've done some additional testing and found that the problem still remains on master. However, I've been able to get this to work by checking out master and reverting the watcher
function to use a timer
instead of time.After
. Example here: https://github.com/sebastianjonasson/grpc-go/pull/1/files.
It seems like the issue is that the current implementation assigns the duration instead of starting the timer as the previous (<=v1.59.0
) implementation did. This means that the current implementation will always wait 30s after receiving the <-rn
message, while in the previous implementation the timer would have expired when receiving the <-rn
message yielding an immediate resolve.
grpc-go/internal/resolver/dns/dns_resolver.go
Lines 212 to 233 in e22436a
To try to prove my case I've added the following debug logs for master and my test branch to highlight the behavior:
- debug: start timer
- debug: wait for timer
- debug: timer done
As you can see in the logs for master we wait 30s upon receiving the <-rn
message, while the logs for the test branch shows immediate resolution since the timer has already expired.
Logs master:
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel switches to new LB policy "round_robin"
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [dns] debug: start timer
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.10.40:8000", ServerName: "", }}]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 09:32:05 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel deleted
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel deleted
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [dns] debug: wait for timer
2024/05/17 09:37:16 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:37:16 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:17 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:21 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:25 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:46 INFO: [dns] debug: timer done
2024/05/17 09:37:46 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
Logs test branch:
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [dns] debug: start timer
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.20.50:8000", ServerName: "", }}]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 11:37:15 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel deleted
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel deleted
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [dns] debug: wait for timer
2024/05/17 11:46:18 INFO: [dns] debug: timer done
2024/05/17 11:46:18 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:46:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.20.50:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 11:46:18 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #4]Subchannel created
from grpc-go.
Great! Thanks for looking into this!
from grpc-go.
Related Issues (20)
- Flaky test: Test/AggregateCluster_BadDNS_GoodEDS HOT 7
- Setting HTTPS proxy without setting HTTPS_PROXY env variable HOT 2
- channelz has a startup sequencing bug HOT 2
- ring_hash stuck in TransientFailure despite having available endpoints HOT 1
- Flaky Test/AuthorityRevive in google.golang.org/grpc/xds/internal/xdsclient/tests HOT 3
- Implement Custom Types for gRPC-Go Framer HOT 1
- xds: implement ADS stream flow control mechanism
- xds/csds: unskip test after ADS stream flow control changes are merged
- 'use of closed network connection', what's the best practice to deal with this? HOT 7
- Infinite loop in bufWriter.Write() HOT 7
- metadata: MD.String should only show metadata known to not contain tokens or other PII HOT 1
- Over 300+ groutines after only a few minutes of testing HOT 6
- Allow client-side access to `grpc-status` and `grpc-message` HOT 6
- scripts: fixes/improvements to regenerate.sh / install_protoc.sh
- Surface response body when receiving an unexpected status code and content-type HOT 2
- Default fallback scheme changed from `passthrough` to `dns` HOT 2
- Support for custom code check HOT 6
- grpc: error unmarshalling request: failed to unmarshal, message is X, want proto.Message HOT 1
- Is it ok to call grpc.UnaryInvoker multiple times inside an interceptor? HOT 8
- Flaky Test: AllMetricsOneFunction HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from grpc-go.