Git Product home page Git Product logo

Comments (11)

atollena avatar atollena commented on May 25, 2024

I want to add another problem that I see in the current behavior of grpc-go (and probably other languages?) for channel transition:

READY -> IDLE -> CONNECTING

If the data plane is disconnected from the management server when this transition happens, when the management server comes back online, it has no way to know that the client has discarded the corresponding API listener. I'm not sure if this qualifies as a bug or just an oddity of the protocol.

Consider the following events:

  1. channel is in READY state, client has acked API listener resource at version A
  2. management server goes offline. channel continues to work fine with its current version of the listener
  3. channel goes IDLE, and discards the listener resource. Current version for this type URL is not changed, it is still A (the code for unsubscription is here)
  4. channel goes CONNECTING. xDS resolver tries to obtain the API listener resource with a discovery request with version A.

If the management server comes back online after step 3 or step 4, it has no way to know that the client does not already have a valid API listeners for the subscribed resource, as the version matches the current resources version. The data plane has discarded the resource, the resource has not changed so the version has not changed, and the management server was not exposed to the unsubscription+re-subscription events because it was offline when it occurred. The channel will only recovers from that state when at least one listener resource changes, triggering a full update response.

That seems like an inherent problem of the current implementation, and the only fix I could think of would be for the data plane to discard the version anytime it unsubscribed from a resource and has not yet gotten a response with the corresponding nonce for this URL type. This would force the control plane to send all subscribed resources.

And reading the protocol spec carefully, this problem is acknowledged:

Note that the version for a resource type is not a property of an individual xDS stream but rather a property of the resources themselves. If the stream becomes broken and the client creates a new stream, the client’s initial request on the new stream should indicate the most recent version seen by the client on the previous stream. Servers may decide to optimize by not resending resources that the client had already seen on the previous stream, but only if they know that the client is not subscribing to a new resource that it was not previously subscribed to. For example, it is generally safe for servers to do this optimization for LDS and CDS when the only subscription is a wildcard subscription, and it is safe to do in environments where the clients will always subscribe to exactly the same set of resources.

For context, we use a hash of a sorted list of <resource_name, resource_content> as the resource version for each resource type, for all subscribed resources, with a special value for the content if the resource does not exist. And we originally thought that by being clever and including the list of subscriptions inside the version info, we could safely resume subscription without triggering a full update. @valerian-roche is in the process of making https://github.com/envoyproxy/go-control-plane play better with gRPC by making state of the world stateful, and implemented this trick.

So it seems like when the client is a gRPC application, presumably with dynamic subscription per channel, the server should never do this optimization. This bears the question of why gRPC sends a version string on new ADS streams, even though we know that in most cases, it is unsafe for the management server to take this version info into consideration. Wouldn't it have been less error prone to just not include the version?

Note that this problem wouldn't happen with incremental, because un-subscription + re-subscriptions are explicit and resource version are per resource (via the https://github.com/envoyproxy/data-plane-api/blob/main/envoy/api/v2/discovery.proto#L171). It'd be nice to see progress on incremental transport for gRPC at some point, mostly because it is simpler.

from grpc-go.

dfawley avatar dfawley commented on May 25, 2024

I think upon reconnection, for CDS and LDS, if there is no subscription then there should be no request sent.

Yes, this seems like the proper behavior to me. I'll look into this further.

I want to add another problem that I see in the current behavior of grpc-go (and probably other languages?) for channel transition:

This is very interesting. We'll talk about this a bit more internally, but one possible workaround would be to not ever send uncached resources in the initial request, and then add subscriptions for them after that.

Another option: don't set the version with the initial request if we have any uncached resources being requested.

from grpc-go.

dfawley avatar dfawley commented on May 25, 2024

@atollena I sent #7026 to fix the initial bug report. Is it possible for you to confirm that this fixes the issue for you in practice?

For the second issue, we have an open discussion item for deciding how to handle it, but it likely affects all languages.

from grpc-go.

dfawley avatar dfawley commented on May 25, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

from grpc-go.

atollena avatar atollena commented on May 25, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

I was able to try your fix today, and it does fix the issue of wildcard. Thank you!

from grpc-go.

atollena avatar atollena commented on May 25, 2024

Should there be another issue for the protocol problem I pointed out in #7013 (comment)?

from grpc-go.

atollena avatar atollena commented on May 25, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

I don't need a patch release, we're not using this in production yet.

from grpc-go.

dfawley avatar dfawley commented on May 25, 2024

Should there be another issue for the protocol problem I pointed out in #7013 (comment)?

I am tracking this internally. We need to discuss this cross-language to determine a solution.

from grpc-go.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.