Git Product home page Git Product logo

Comments (15)

jgeewax avatar jgeewax commented on September 27, 2024 1

Currently, our internal APIs allow a configurable field of type repeated string that can be populated with any unreachable locations.

This would make the list response look something like the following:

message ListBooksResponse {
  repeated Book books = 1;
  string next_page_token = 2;
  repeated string unreachable_locations = 3;
}

Then in our RPC annotations we have something along the lines of

service BookService {
  rpc ListBooks(ListBooksRequest) returns (ListBooksResponse) {
    option (internal-thing-here).list_resources = {
      unreachable_field: "unreachable_locations"
    };
  }
}

I think that given our internal tooling, we are now left only to decide on the naming of the field and start suggesting it (via "should") for all List RPCs.

from google.aip.dev.

lukesneeringer avatar lukesneeringer commented on September 27, 2024

This is sort of my view as well. I think I mentioned this to you on a call a week or two ago, but I think this ship effectively sailed when a bunch of APIs went beta on that tool.

from google.aip.dev.

jgeewax avatar jgeewax commented on September 27, 2024

Yep. I'm going to start writing this one.

from google.aip.dev.

whaught avatar whaught commented on September 27, 2024

Cloud Run is doing global list in a similar but slightly differently. For regional details we have:

region_details {
    'us-central1' {
	      error_details {
         code: 429
         reason: 'Too many requests. Quota limit for project was reached.'
	      }
    }

We were concerned that a flat list of strings leaves no room for current or future details about the regions. Specifically we wanted a way to surface errors per-region, especially if they are user-recoverable or would assist in debugging regional outages from customer reports.

from google.aip.dev.

jgeewax avatar jgeewax commented on September 27, 2024

I'm a bit worried about that. The goal of listing resources is to show the items found, and if we couldn't get a hold of a location for some reason, we can note that and move on. Any subsequent diagnosis can be done by a more specific list operation (e.g., listing across a single location) which can then return what you have here as an actual error response.

In other words:

  • GET locations/-/books might return unavailable locations = 'us-central1'
  • User might say ... this is weird, let me dig into what's going on
  • GET locations/us-central1/books might return 429: Too many requests. Quota limit for project was reached

We're already stretching things with the List method here -- unless there's an amazing reason, I'm concerned that we're trying to stretch it even further.

from google.aip.dev.

whaught avatar whaught commented on September 27, 2024

One of our original ideas with Global list was to make it a repeated of regional list responses instead of a flat list. One weirdness with aggregating them all is that you lose all of the regional information by making the response looks the same as a regional response (like a Global virtual region). It's nice for clients because they can treat global the same, but the downside is it leads clients to rely on the global implementation as if it were a a regular region - our fear is that clients might never implement the to-the-side information about failed regions and implement clients that aren't aware of the hidden failures.

We went this direction as a compromise so that it would look more like existing solutions (there wasn't strong guidance at the time) but I do think the regional information is important so that we still have ways to understand regional failure. If the above guidances are accepted we may wind up having both fields for compat, but we thought we'd offer our thoughts.

from google.aip.dev.

lukesneeringer avatar lukesneeringer commented on September 27, 2024

Worth noting: I scheduled a meeting on this for Tuesday (but please keep discussing here -- this is great, because it is public and we have history).

My feeling is somewhat in the middle. I actually do see value in repeated errors rather than repeated strings -- in particular, because the error might be transient, the example @jgeewax gave above may yield a success value on the third request, and you never really know what went wrong (although maybe you do not care?).

On the other hand, I do agree that we do not want to add new things to standard methods unless there is a really good reason. Also, there are some new user challenges: for example, deserializing the error details (anything involving a protobuf Any is a pain in the butt for gRPC / client library users). In the workflow @jgeewax describes, that is less of an issue because the error handling goes through the usual error handling workflow.

from google.aip.dev.

jgeewax avatar jgeewax commented on September 27, 2024

Let's remember that standard methods are intended to be atomic, so conveying partial success (and therefore partial failure) is not an explicit goal of these methods.

Put slightly different, the purpose of a multi-parent list operation (GET parents/-/children) is to aggregate together children from multiple parents. Since some of those parents might be unavailable, we want to distinguish between "parents/1 has no children to list" and "parents/1 is not included in this listing due to some unusual circumstances" -- the unavailable_locations field is how we convey which parents were excluded from the list for some reason, but it is not our goal to help diagnose the underlying issue with one of the many parents involved in the list. I think it should be left to the client to decide what to do with the knowledge that a specific set of parents is excluded from the resource listing -- including listing one of the parents specifically to understand what the issue is (whether it's quota, an outage, or something else).

I think we also should recognize that this is sort of like a very specific version of a BatchList method -- so we might want to consider our choices on the other Batch methods.

from google.aip.dev.

whaught avatar whaught commented on September 27, 2024

If it is strongly a goal of the standard methods to not convey partial success/failure information, then we might want to consider discouraging multi-parent standard methods entirely. It would be impossible to comply with reliability guidance.

That said I'm not sure if I agree that it "is not our goal to help diagnose the underlying issue": by providing unavailable_locations we've already started down the road of showing unusual circumstances. Pushing diagnosis down to the clients seems like a user-hostile way of keeping the interface marginally cleaner? I guess I'm not sold - but let's discuss as a group Tuesday. From side discussions it looks like there are a few groups interested in this topic too.

Even if the standard were simply a repeated message with a string for unavailable regions, it would give implementors a place to extend without making the guidance much more complex:

repeated UnreachableLocation unreachable_locations = 3;
message UnreachableLocation {
   string name = 1;
}

from google.aip.dev.

jgeewax avatar jgeewax commented on September 27, 2024

by providing unavailable_locations we've already started down the road of showing unusual circumstances

I hear you -- and yes, we have extended the method beyond the basics. However we specifically are taking a simplistic attitude towards the issue -- we only say "resources belonging to the following parents are not included in this list response." Nothing more than a string representing which parent's children are omitted.

The goal of showing the parents that were unavailable is to convey the difference between "parent A had no children" and "we couldn't list anything under parent A for some reason -- to get more details, List resources on parent A".

from google.aip.dev.

jgeewax avatar jgeewax commented on September 27, 2024

A few updates after our meeting today:

  1. We noted that Listing resources can yield individual resources in the list that we know exist but were unable to return the content of the resource for some reason. This can happen regardless of whether this is a cross-parent List request or a standard (single parent) list request).

  2. We discuss that this is a "parent availability" scenario, not necessarily specific to locations. If we do GET /shelves/-/books, and a shelf is unavailable or a particular shelf's books can't be listed for some reason, we need a way to specifying that.

  3. One major concern the debug-ability of a request. Why provide a cross-parent List request if the response might say that one of those parents' children are excluded without saying why? The counter argument was that in the general case, this won't happen. If it does, you can diagnose by making a parent-specific List request to figure out what's going on.

More info to come

from google.aip.dev.

whaught avatar whaught commented on September 27, 2024

We also noted that asking the client to make a parent-specific List request to figure out what's going on may be a non-starter. If clients were to build that logic, it would obviate the need for having a multi-parent list call at all (which may not be in keeping with reliability guidelines anyway, so that might be the answer). Although in the general case omitted parents won't happen, it leads clients to fail to implement partial-success handling at all and have stealthy failures - a situation we have already seen.
The major concern for us is not just debug-ability, but a design that leads clients to ensure they are not ignoring the multi-parent-ness of the data so we never have a mystery situation for users if there are regional outages. This concern trumps ease-of-use of the API.

One counterpoint that was made was that showing per-parent failure details might cause clients to retry in a way that multiplies load (since successful/available parents would be retried as well), although this may already be the case with the unavailable_locations field.

from google.aip.dev.

whaught avatar whaught commented on September 27, 2024

Today we discussed further. There are some issues with gRPC Status messages which can be difficult to parse due to the use of protobuf.any. AIP greatly prefers to leave the string and have customers call failed parents to retrieve further information. One drawback may be transient failures where the call might succeed on the region after failing on '-' which will be a [rare] mystery to the user but ultimately successful anyway. This does avoid a wall-of-text issue if there are failures in many regions, but does require many RPCs to handle failure properly.

One interesting idea from @jgeewax would be showing a non-200 status on partial failure for the whole call to force the client to pay attention to error detail (which could include partial success information). We probably won't implement this, but it might solve the concerns about guiding clients to not make assumptions about global / multi-parent results.

We'll be waiting for finalized guidance in terms of naming for the partial success fields from AIP before completing our implementation.

from google.aip.dev.

lukesneeringer avatar lukesneeringer commented on September 27, 2024

Note: This should probably be amended into AIP-159.

from google.aip.dev.

lukesneeringer avatar lukesneeringer commented on September 27, 2024

AIP-217 is approved, so I am closing this now. :)

from google.aip.dev.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.