
Comments (6)

twmb commented on June 24, 2024

The client internally retries retriable errors and does not bubble those up. There are a few potential errors that can be returned from a poll, none of which the client can handle and all of which are essentially unrecoverable:

  • not authorized
  • unsupported compression (you should not receive this)
  • unsupported message version (you should also not receive this)
  • unknown error

I think in these cases, the partition is repeatedly fetched, so these errors should crop up continuously.

If the partition requires loading offsets or listing the broker epoch (for data loss detection), and these fail with non-retriable errors (auth), the client does not try fetching the partition again.

It's an oversight at the moment to continue retrying on auth failures in the fetch response itself.


For step 2, the records should always be processed -- any successfully processed record in the client advances the client's offset. So, skipping records is not necessary and could actually lead to missed records.

Primarily, you should not get any errors from PollFetches, so any that you do receive should be investigated.
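
For illustration, here is a minimal sketch of such a consume loop (not from the issue itself; the broker, group, and topic names are placeholders, and process is a hypothetical application handler):

package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),  // placeholder broker
		kgo.ConsumerGroup("example-group"), // placeholder group
		kgo.ConsumeTopics("example-topic"), // placeholder topic
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(context.Background())
		// Any errors surfaced here are the non-retriable kind described
		// above (auth, unsupported compression, ...); retriable errors
		// are handled internally and never reach this point.
		for _, fe := range fetches.Errors() {
			log.Printf("fetch error on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
		}
		// Process every record; skipping records is unnecessary and risks
		// missing data, since processed records advance the client's offsets.
		fetches.EachRecord(func(r *kgo.Record) {
			process(r) // hypothetical application handler
		})
	}
}

func process(r *kgo.Record) { /* ... */ }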

Let me know if this clarifies anything at all, I think this is pointing to my documentation needing to be touched up. I've also made a note to address the fetch loop returning the same auth errors, rather than returning the error once and not fetching anymore.


skoo25 commented on June 24, 2024

@twmb Thanks for the clarification. It sounds like the only errors the consumer will get are non-retriable errors, and there isn't much the consumer can do other than log the errors and wait for manual intervention.


twmb commented on June 24, 2024

Yep that's the goal!

Here is the place where errors are inspected when consuming records. Note that only the default case does not clear the error; all other cases are retriable errors, and the client clears the error:

franz-go/pkg/kgo/source.go, lines 673 to 752 in 8178f2c:

switch fp.Err {
default:
	// - bad auth
	// - unsupported compression
	// - unsupported message version
	// - unknown error
	// - or, no error
case kerr.UnknownTopicOrPartition,
	kerr.NotLeaderForPartition,
	kerr.ReplicaNotAvailable,
	kerr.KafkaStorageError,
	kerr.UnknownLeaderEpoch, // our meta is newer than broker we fetched from
	kerr.OffsetNotAvailable: // fetched from out of sync replica or a behind in-sync one (KIP-392: case 1 and case 2)
	fp.Err = nil // recoverable with client backoff; hide the error
case kerr.OffsetOutOfRange:
	fp.Err = nil
	// If we are out of range, we reset to what we can.
	// With Kafka >= 2.1.0, we should only get offset out
	// of range if we fetch before the start, but a user
	// could start past the end and want to reset to
	// the end. We respect that.
	//
	// KIP-392 (case 3) specifies that if we are consuming
	// from a follower, then if our offset request is before
	// the low watermark, we list offsets from the follower.
	//
	// KIP-392 (case 4) specifies that if we are consuming
	// a follower and our request is larger than the high
	// watermark, then we should first check for truncation
	// from the leader and then if we still get out of
	// range, reset with list offsets.
	//
	// It further goes on to say that "out of range errors
	// due to ISR propagation delays should be extremely
	// rare". Rather than falling back to listing offsets,
	// we stay in a cycle of validating the leader epoch
	// until the follower has caught up.
	if s.nodeID == partOffset.from.leader { // non KIP-392 case
		reloadOffsets.addLoad(topic, partition, loadTypeList, offsetLoad{
			replica: -1,
			Offset:  s.cl.cfg.resetOffset,
		})
	} else if partOffset.offset < fp.LogStartOffset { // KIP-392 case 3
		reloadOffsets.addLoad(topic, partition, loadTypeList, offsetLoad{
			replica: s.nodeID,
			Offset:  s.cl.cfg.resetOffset,
		})
	} else { // partOffset.offset > fp.HighWatermark, KIP-392 case 4
		reloadOffsets.addLoad(topic, partition, loadTypeEpoch, offsetLoad{
			replica: -1,
			Offset: Offset{
				at:    partOffset.offset,
				epoch: partOffset.lastConsumedEpoch,
			},
		})
	}
case kerr.FencedLeaderEpoch:
	fp.Err = nil
	// With fenced leader epoch, we notify an error only
	// if necessary after we find out if loss occurred.
	// If we have consumed nothing, then we got unlucky
	// by being fenced right after we grabbed metadata.
	// We just refresh metadata and try again.
	if partOffset.lastConsumedEpoch >= 0 {
		reloadOffsets.addLoad(topic, partition, loadTypeEpoch, offsetLoad{
			replica: -1,
			Offset: Offset{
				at:    partOffset.offset,
				epoch: partOffset.lastConsumedEpoch,
			},
		})
	}
}
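
For context, the resetOffset used above comes from the client's configured ConsumeResetOffset option; a minimal sketch of setting it when constructing the client, reusing the imports from the earlier sketch (broker and topic names are placeholders):

cl, err := kgo.NewClient(
	kgo.SeedBrokers("localhost:9092"),  // placeholder broker
	kgo.ConsumeTopics("example-topic"), // placeholder topic
	// On kerr.OffsetOutOfRange, the client resets to this offset.
	// NewOffset().AtStart() resets to the log start; AtEnd() resets to the end.
	kgo.ConsumeResetOffset(kgo.NewOffset().AtStart()),
)
if err != nil {
	log.Fatal(err)
}
defer cl.Close()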

This is the place where ListOffsets or OffsetForLeaderEpoch can have a fatal error:

switch load.err.(type) {
case *ErrDataLoss:
	s.c.addFakeReadyForDraining(load.topic, load.partition, load.err) // signal we lost data, but set the cursor to what we can
	use()
case nil:
	use()
default: // from ErrorCode in a response
	if !kerr.IsRetriable(load.err) { // non-retriable response error; signal such in a response
		s.c.addFakeReadyForDraining(load.topic, load.partition, load.err)
		continue
	}
	reloads.addLoad(load.topic, load.partition, loaded.loadType, load.request)
}
Here, if data loss is detected, an error is injected but the partition continues to be used; only non-retriable errors stop the partition from being used.
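
On the consumer side, an injected data-loss error surfaces as a fetch error whose type is *kgo.ErrDataLoss, so it can be separated from truly fatal errors; a hedged sketch (handleFetchErrors is a hypothetical helper, reusing the imports from the earlier sketch plus the standard errors package):

// handleFetchErrors separates injected data-loss notifications from
// non-retriable errors that need investigation.
func handleFetchErrors(fetches kgo.Fetches) {
	for _, fe := range fetches.Errors() {
		var dl *kgo.ErrDataLoss
		if errors.As(fe.Err, &dl) {
			// Data loss was detected; the partition keeps being consumed
			// from the reset point, so this is informational.
			log.Printf("data loss on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
			continue
		}
		// Anything else surfaced here is non-retriable and worth investigating.
		log.Printf("fatal fetch error on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
	}
}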

As mentioned in my prior comment, I'm going to change the source.go code to stop using a partition if there is an auth error, and I'll also add a configuration option to force continuing to use the partition on auth errors (and make non-retriable errors from list offsets behave the same as the source.go errors).


skoo25 commented on June 24, 2024

Thanks. I'll reopen the issue if I have any further questions regarding this.


twmb commented on June 24, 2024

I'm going to reopen this issue to keep track of this open inconsistency; I plan to address this soon. There's also a need for the client to retry on auth failure, allowing an operator to start a client before it is authorized, or to rotate credentials.


twmb commented on June 24, 2024

I closed this with the most recent commit (which will be in the next tagged release in a week or so-ish). The new behavior continues to retry loading a partition's offset or epoch even in the face of non-retriable errors: we will back off 1s and retry, rather than permanently failing the partition.

This is a bit different from my prior comment's goals, which were to stop consuming if a non-retriable error occurred. I think it's safer and probably better to just always continue retrying.

