
Comments (6)

twmb commented on June 24, 2024

The client internally retries retriable errors and does not bubble those up. There are a few potential errors that can be returned from a poll, none of which the client can handle and all of which are essentially unrecoverable:

  • not authorized
  • unsupported compression (you should not receive this)
  • unsupported message version (you should also not receive this)
  • unknown error

I think in these cases, the partition is repeatedly fetched, so these errors should crop up continuously.

If the partition requires loading offsets or listing the broker epoch (for data loss detection), and these fail with non-retriable errors (auth), the client does not try fetching the partition again.

It's an oversight at the moment to continue retrying on auth failures in the fetch response itself.


For step 2, the records should always be processed -- any successfully processed record in the client advances the client's offset. So, skipping records is not necessary and could actually lead to missed records.

Primarily, you should not get any errors from PollFetches, so any that you do receive should be investigated.
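
For illustration, here is a minimal sketch of such a consume loop (not from the issue itself; the broker, group, and topic names are placeholders, and process is a hypothetical application handler):

package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),  // placeholder broker
		kgo.ConsumerGroup("example-group"), // placeholder group
		kgo.ConsumeTopics("example-topic"), // placeholder topic
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(context.Background())
		// Any errors surfaced here are the non-retriable kind described
		// above (auth, unsupported compression, ...); retriable errors
		// are handled internally and never reach this point.
		for _, fe := range fetches.Errors() {
			log.Printf("fetch error on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
		}
		// Process every record; skipping records is unnecessary and risks
		// missing data, since processed records advance the client's offsets.
		fetches.EachRecord(func(r *kgo.Record) {
			process(r) // hypothetical application handler
		})
	}
}

func process(r *kgo.Record) { /* ... */ }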

Let me know if this clarifies anything at all, I think this is pointing to my documentation needing to be touched up. I've also made a note to address the fetch loop returning the same auth errors, rather than returning the error once and not fetching anymore.


skoo25 commented on June 24, 2024

@twmb Thanks for the clarification. It sounds like the only errors the consumer will get are non-retriable errors, and there isn't much the consumer can do other than log the errors and wait for manual intervention.


twmb commented on June 24, 2024

Yep that's the goal!

Here is the place where errors are inspected when consuming records. Note that only the default case does not clear the error; all other cases are retriable errors, and the client clears the error:

franz-go/pkg/kgo/source.go, lines 673 to 752 in 8178f2c:

switch fp.Err {
default:
	// - bad auth
	// - unsupported compression
	// - unsupported message version
	// - unknown error
	// - or, no error
case kerr.UnknownTopicOrPartition,
	kerr.NotLeaderForPartition,
	kerr.ReplicaNotAvailable,
	kerr.KafkaStorageError,
	kerr.UnknownLeaderEpoch, // our meta is newer than broker we fetched from
	kerr.OffsetNotAvailable: // fetched from out of sync replica or a behind in-sync one (KIP-392: case 1 and case 2)
	fp.Err = nil // recoverable with client backoff; hide the error
case kerr.OffsetOutOfRange:
	fp.Err = nil
	// If we are out of range, we reset to what we can.
	// With Kafka >= 2.1.0, we should only get offset out
	// of range if we fetch before the start, but a user
	// could start past the end and want to reset to
	// the end. We respect that.
	//
	// KIP-392 (case 3) specifies that if we are consuming
	// from a follower, then if our offset request is before
	// the low watermark, we list offsets from the follower.
	//
	// KIP-392 (case 4) specifies that if we are consuming
	// a follower and our request is larger than the high
	// watermark, then we should first check for truncation
	// from the leader and then if we still get out of
	// range, reset with list offsets.
	//
	// It further goes on to say that "out of range errors
	// due to ISR propagation delays should be extremely
	// rare". Rather than falling back to listing offsets,
	// we stay in a cycle of validating the leader epoch
	// until the follower has caught up.
	if s.nodeID == partOffset.from.leader { // non KIP-392 case
		reloadOffsets.addLoad(topic, partition, loadTypeList, offsetLoad{
			replica: -1,
			Offset:  s.cl.cfg.resetOffset,
		})
	} else if partOffset.offset < fp.LogStartOffset { // KIP-392 case 3
		reloadOffsets.addLoad(topic, partition, loadTypeList, offsetLoad{
			replica: s.nodeID,
			Offset:  s.cl.cfg.resetOffset,
		})
	} else { // partOffset.offset > fp.HighWatermark, KIP-392 case 4
		reloadOffsets.addLoad(topic, partition, loadTypeEpoch, offsetLoad{
			replica: -1,
			Offset: Offset{
				at:    partOffset.offset,
				epoch: partOffset.lastConsumedEpoch,
			},
		})
	}
case kerr.FencedLeaderEpoch:
	fp.Err = nil
	// With fenced leader epoch, we notify an error only
	// if necessary after we find out if loss occurred.
	// If we have consumed nothing, then we got unlucky
	// by being fenced right after we grabbed metadata.
	// We just refresh metadata and try again.
	if partOffset.lastConsumedEpoch >= 0 {
		reloadOffsets.addLoad(topic, partition, loadTypeEpoch, offsetLoad{
			replica: -1,
			Offset: Offset{
				at:    partOffset.offset,
				epoch: partOffset.lastConsumedEpoch,
			},
		})
	}
}
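
For context, the resetOffset used above comes from the client's configured ConsumeResetOffset option; a minimal sketch of setting it when constructing the client, reusing the imports from the earlier sketch (broker and topic names are placeholders):

cl, err := kgo.NewClient(
	kgo.SeedBrokers("localhost:9092"),  // placeholder broker
	kgo.ConsumeTopics("example-topic"), // placeholder topic
	// On kerr.OffsetOutOfRange, the client resets to this offset.
	// NewOffset().AtStart() resets to the log start; AtEnd() resets to the end.
	kgo.ConsumeResetOffset(kgo.NewOffset().AtStart()),
)
if err != nil {
	log.Fatal(err)
}
defer cl.Close()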

This is the place where ListOffsets or OffsetForLeaderEpoch can have a fatal error:

switch load.err.(type) {
case *ErrDataLoss:
	s.c.addFakeReadyForDraining(load.topic, load.partition, load.err) // signal we lost data, but set the cursor to what we can
	use()
case nil:
	use()
default: // from ErrorCode in a response
	if !kerr.IsRetriable(load.err) { // non-retriable response error; signal such in a response
		s.c.addFakeReadyForDraining(load.topic, load.partition, load.err)
		continue
	}
	reloads.addLoad(load.topic, load.partition, loaded.loadType, load.request)
}
Here, if data loss is detected, an error is injected but the partition continues to be used; only non-retriable errors stop the partition from being used.
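
On the consumer side, an injected data-loss error surfaces as a fetch error whose type is *kgo.ErrDataLoss, so it can be separated from truly fatal errors; a hedged sketch (handleFetchErrors is a hypothetical helper, reusing the imports from the earlier sketch plus the standard errors package):

// handleFetchErrors separates injected data-loss notifications from
// non-retriable errors that need investigation.
func handleFetchErrors(fetches kgo.Fetches) {
	for _, fe := range fetches.Errors() {
		var dl *kgo.ErrDataLoss
		if errors.As(fe.Err, &dl) {
			// Data loss was detected; the partition keeps being consumed
			// from the reset point, so this is informational.
			log.Printf("data loss on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
			continue
		}
		// Anything else surfaced here is non-retriable and worth investigating.
		log.Printf("fatal fetch error on %s[%d]: %v", fe.Topic, fe.Partition, fe.Err)
	}
}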

As mentioned in my prior comment, I'm going to change the source.go code to stop using a partition if there is an auth error, and I'll also add a configuration option to force continuing to use the partition on auth errors (and make non-retriable errors from list offsets behave the same as the source.go errors).


skoo25 commented on June 24, 2024

Thanks. I'll reopen the issue if I have any further questions regarding this.


twmb commented on June 24, 2024

I'm going to reopen this issue to keep track of this open inconsistency; I plan to address this soon. There's also a need for the client to retry on auth failure, allowing an operator to start a client before it is authorized, or to rotate credentials.


twmb commented on June 24, 2024

I closed this with the most recent commit (which will be in the next tagged release in a week or so-ish). The new behavior continues to retry loading a partition's offset or epoch even in the face of non-retriable errors: we will back off 1s and retry, rather than permanently failing the partition.

This is a bit different from my prior comment's goals, which were to stop consuming if a non-retriable error occurred. I think it's safer and probably better to just always continue retrying.

