
Comments (7)

tymeshifter commented on September 26, 2024

Seems to be related to updateMetadataLoop and how it handles error conditions.


twmb commented on September 26, 2024

I think you're right that this would be coming from updateMetadataLoop, but I'm not sure there is a better approach here. The current code backs off using the RetryBackoff function, and the default backoff is exponential up to a max of 1s.
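
For context, that backoff is configurable when constructing the client. A minimal sketch, assuming the option is exposed as RetryBackoffFn (its name in newer kgo versions; older releases named it differently):

package main

import (
        "time"

        "github.com/twmb/franz-go/pkg/kgo"
)

func newClient(brokers ...string) (*kgo.Client, error) {
        // Cap the per-try backoff below the default 1s maximum so metadata
        // retries surface failures in logs more quickly. The option name
        // RetryBackoffFn is an assumption based on newer kgo versions.
        backoff := func(tries int) time.Duration {
                d := time.Duration(tries) * 100 * time.Millisecond
                if d > 500*time.Millisecond {
                        d = 500 * time.Millisecond
                }
                return d
        }
        return kgo.NewClient(
                kgo.SeedBrokers(brokers...),
                kgo.RetryBackoffFn(backoff),
        )
}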

Even if that error looks fatal, a local resolver could eventually map the address to something viable, so the dial can succeed at some point in the future. I also figured that for some errors it's better to retry, so that any monitoring system keeps getting pinged, rather than one very important signal being lost in time.

If this error is not retried in the metadata loop, the client becomes functionally dead, because metadata will never update again.

What do you think?


tymeshifter commented on September 26, 2024

Your reasons for this behaviour are valid. I guess I was coming from the position of a caller in its own retry loops, which expects the function call (client.Produce) to return at some point in the future with the proper reason for the failure. The main source of this tradeoff is lazy loading of the connection. Before reading the documentation I even thought that the client would establish a connection when it's instantiated. That way you know connection issues are present when the client is created, so you don't proceed to the sending/consuming part.

With the current situation, the only viable approach is to use a context with a timeout. I'm just not clear on how to properly be notified about the original error; hopefully the configured logger will catch it.
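
For illustration, one way to bound the wait is to put the timeout around the produce promise rather than relying on the Produce context (which, as clarified below, only gates buffering). A minimal sketch with a hypothetical produceWithTimeout helper; note that a record that times out this way is still buffered inside the client:

package main

import (
        "context"
        "errors"
        "time"

        "github.com/twmb/franz-go/pkg/kgo"
)

// produceWithTimeout produces one record and waits up to timeout for its
// promise to fire with the original delivery error. This is a sketch, not
// part of the franz-go API.
func produceWithTimeout(cl *kgo.Client, r *kgo.Record, timeout time.Duration) error {
        done := make(chan error, 1)
        cl.Produce(context.Background(), r, func(_ *kgo.Record, err error) {
                done <- err
        })
        select {
        case err := <-done:
                return err // the original failure reason, surfaced to the caller
        case <-time.After(timeout):
                return errors.New("record not acknowledged within timeout; it remains buffered")
        }
}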


twmb commented on September 26, 2024

to return at some point in the future with proper reason for the failure

This is something that I see doesn't happen now that should happen.

Right now, when you produce, the record is buffered into an unknown topic, and when a metadata update finishes, any topics still buffered as unknown are notified if there was an error while loading (an error in the actual request itself). If, after enough notifications, the topic has still not loaded successfully, any buffered record will be failed with the latest error.

However, this is contingent on you disabling idempotence, because otherwise for the most part, having any potential to fail records can introduce sequence number problems. Edit: and looking into the code, retries are required to be unlimited if using idempotence. I might be able to split what is required to have unlimited retries for idempotence and what isn't, although that may make the options a little bit more confusing.
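
For reference, opting out of idempotence happens at client construction; a minimal sketch, assuming the option name DisableIdempotentWrite used by newer kgo versions (the exact name may differ in older releases):

package main

import "github.com/twmb/franz-go/pkg/kgo"

// newNonIdempotentClient disables idempotent writes so that records can be
// failed after limited retries without risking sequence number gaps. The
// option name is an assumption based on newer kgo versions.
func newNonIdempotentClient(brokers ...string) (*kgo.Client, error) {
        return kgo.NewClient(
                kgo.SeedBrokers(brokers...),
                kgo.DisableIdempotentWrite(),
        )
}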

I can easily bubble the error back.

I might be able to stop loading metadata if a topic was never loaded successfully: then we fail all buffered records for it, and there is no reason to load metadata anymore. That'd be a good bit harder, because the internal topics map is add-only, so even if all buffered records are failed, there is still a topic to be loaded, and the update metadata loop still wants to issue requests and get a successful response.


tymeshifter commented on September 26, 2024

If, after enough notifications, the topic has still not loaded successfully, any buffered record will be failed with the latest error.

How do you configure "enough notifications"? In my tests this never ends; it effectively becomes an infinite loop.

having any potential to fail records can introduce sequence number problems

When this happens, I assume the only solution is to recreate the client? Will cancelling a context cause sequence number problems? How does the client behave when this happens? Will it error on every call?

Also, would it be possible to add some method which would try to establish a connection explicitly? Publishing to Kafka is an important step in most applications, so if there is a connection failure or the brokers are misconfigured, the application can't do its thing and app initialisation should fail. Same as with databases, where it's recommended to ping on startup: if the database isn't there, there is nothing to do until it comes online. It is an application-level concern to retry on connection failures if needed.
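
As an aside, newer franz-go releases do appear to expose something like this. A minimal startup-check sketch, assuming the (*kgo.Client).Ping method, which postdates this thread:

package main

import (
        "context"
        "log"
        "time"

        "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
        cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092"))
        if err != nil {
                log.Fatal(err)
        }
        defer cl.Close()

        // Fail application startup if the cluster is unreachable, analogous to
        // pinging a database on boot. Ping is assumed here from newer kgo
        // versions; it did not exist when this thread was written.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := cl.Ping(ctx); err != nil {
                log.Fatalf("kafka not reachable at startup: %v", err)
        }
}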

In my opinion the user should be in control of the client, not limited by it. The client is also ephemeral in nature: things crash and get restarted, so it can never guarantee total delivery unless it includes its own storage. Going into an infinite loop to meet some unachievable guarantee doesn't seem like a viable solution to me.


twmb commented on September 26, 2024

How do you configure "enough notifications"?

This is based off

// RequestRetries sets the number of tries that retriable requests are allowed,
// overriding the unlimited default. This option does not apply to produce
// requests.
func RequestRetries(n int) Opt {
        return clientOpt{func(cfg *cfg) { cfg.retries = int64(n) }}
}

and the check is here:

case err, ok = <-unknown.wait:
        if !ok {
                cl.cfg.logger.Log(LogLevelInfo, "done waiting for unknown topic, metadata was successful", "topic", topic)
                return // metadata was successful!
        }
        cl.cfg.logger.Log(LogLevelInfo, "unknown topic wait failed, retrying wait", "topic", topic, "err", err)
        tries++
        if int64(tries) >= cl.cfg.retries {
                err = fmt.Errorf("no partitions available after refreshing metadata %d times, last err: %w", tries, err)
        }

(the retrying in that loop was lost in a recent commit and will be added back shortly)

When this happens I assume the only solution for the client is to be recreated?

Yes, which is why the default is to avoid the potential.

Will canceling context cause sequence number problems?

The context passed to Produce is only used if you are trying to produce and the client already has MaxBufferedRecords. It's not used past that:

// The context is used if the client currently has the max amount of buffered
// records. If so, the client waits for some records to complete or for the
// context or client to quit. If the context / client quits, this returns an
// error.

A sequence number problem causes the client to enter a failed state if the client is either transactional (although in some cases the client can recover), or if StopOnDataLoss is configured.

Also, would it be possible to add some method which would try to establish connection explicitly?

What do you mean by this? There is a Dialer option to configure the dialer that is used. In the worst case, you could wrap a net.Dialer with a function that, if the dial fails, quits the program. I wouldn't recommend this because there are a lot of temporary dial errors (although I guess you could check net.Error.Temporary()).
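
A rough sketch of that worst-case wrapper, assuming the kgo.Dialer option accepts a func(ctx, network, host) (net.Conn, error):

package main

import (
        "context"
        "log"
        "net"
        "time"

        "github.com/twmb/franz-go/pkg/kgo"
)

// newClientWithFatalDialer aborts the program on any dial failure, as
// described above. Not generally recommended: many dial errors are temporary.
func newClientWithFatalDialer(brokers ...string) (*kgo.Client, error) {
        base := &net.Dialer{Timeout: 10 * time.Second}
        fatalDial := func(ctx context.Context, network, host string) (net.Conn, error) {
                conn, err := base.DialContext(ctx, network, host)
                if err != nil {
                        log.Fatalf("dial %s %s failed: %v", network, host, err)
                }
                return conn, err
        }
        return kgo.NewClient(
                kgo.SeedBrokers(brokers...),
                kgo.Dialer(fatalDial),
        )
}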

In my opinion user should be in the control of the client not limited by it.

Yes agreed, which is why there are quite a lot of options, and why I chose functional options rather than a struct -- it's easy to add new configuration. However, a big part of interacting with Kafka is to ensure there are no accidental duplicates nor data loss, and providing this guarantee means by-default retrying. Bubbling up an error past a certain point to let the user decide what to do actually poses risk for duplicates. I'd rather err on retrying with a hook so that a user can notify themselves if retrying has happened more than they want and so that a user can explicitly opt in to potentially unsafe behavior.

A crashing program is pretty well understood to have problems around restarts, and these can be planned for so that manual verification can happen after the fact. Restarts can have a safe, slow shutdown. Most importantly, the default state of a server is to have some form of connectivity, so retrying until success seems much more viable than assuming that once things fail for a few minutes we should just give up. Manual intervention to save the network can recover a client that is still retrying.

That being said, there is likely a little bit that I can do to allow failing records while retrying when producing while also ensuring sequence numbers do not get broken. I can look into this.


twmb commented on September 26, 2024

I've released v0.6.13, which allows using record timeouts and setting limited produce retries even with idempotency enabled, and changes where errors are processed when updating metadata (so that the error you originally reported can trigger the produce retry limit). This new release should address this issue.

Please note that the two important options for producing are RecordTimeout and ProduceRetries.

The RequestRetries option mentioned in the first comment refers to non-produce requests (to allow finer grained control over record production). If metadata never loads a topic within RequestRetries, any buffered records for that topic are canceled, but that's about it. RequestTimeout refers to timeouts on issuing a request in general, not timeouts on a record.
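
To make that concrete, a minimal sketch using those option names as they were spelled around v0.6.13 (later kgo releases appear to rename them, e.g. to RecordDeliveryTimeout and RecordRetries, so check the current docs):

package main

import (
        "time"

        "github.com/twmb/franz-go/pkg/kgo"
)

// newBoundedProducer bounds how long and how often a record is retried.
// Option names follow the v0.6.13-era API referenced in this comment.
func newBoundedProducer(brokers ...string) (*kgo.Client, error) {
        return kgo.NewClient(
                kgo.SeedBrokers(brokers...),
                kgo.RecordTimeout(10*time.Second), // fail a record if it is not delivered in time
                kgo.ProduceRetries(5),             // limit per-record produce retries
        )
}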

