
Comments (12)

michaelklishin commented on May 18, 2024

Thank you for providing background on this.

Certain things related to flow control can be made tunable. These will be advanced options but I see the need for this for some users.

We'll discuss what can be made tunable and whether disabling flow control entirely is a feature we'd feel comfortable shipping, even as an option.

On 26/2/2015, at 17:09, Peter Dyson [email protected] wrote:

We have seen flow control kick in when the server itself is under no real stress. It has plenty of resources available, yet RabbitMQ has throttled the publish rate.

The workload pattern we're servicing generates a large backlog (tens of millions of messages) in a somewhat prolonged burst, which is then processed slowly over time; the work needs to be queued quickly and reliably.

We want the ability to do this at a decent flow rate so the backlog can be processed over time as consumers get through it. Our servers can easily take a larger backlog at full speed, but RabbitMQ throttles the incoming publish rate down with flow control.

It would be good to have better ways to control this behaviour: either delay/extend the time or queue sizes before flow control kicks in, have it throttle gradually/incrementally, or, in the extreme case, disable flow control altogether and rely only on the high watermark to prevent server death and protect stability.

Our testing shows flow control triggering when there is no serious load on memory, CPU, or disk I/O at the time.

What are some thoughts around this?


from rabbitmq-server.

simonmacmullen commented on May 18, 2024

Flow control is not the problem here. Not saying there isn't a problem, but thinking about it in terms of flow control is unlikely to be helpful.

Flow control is just saying "the broker is unable to accept messages as fast as publishers would like to publish them". It's more important to focus on why that is happening. Every complaint about flow control in the past - and there have been a few - has always turned out to be about something else performing badly.

(Analogy: if there is too much traffic in your city, the solution is not to remove all the traffic lights.)

So what actually is the bottleneck? This blog post: http://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/ talks about how to interpret flow control state to find bottlenecks, but if I had to take a guess from the symptoms described, I would think that queues have gone over their paging ratio and have started to page messages out to disk. Everything I write from here on assumes that's the case.

(Aside: RabbitMQ 3.5.0 will expose information about how much disk activity queues are doing, so this should be a lot easier to spot. See "per queue paging stats" at http://imgur.com/a/tLwQs#vU8pQSj)

So first of all, I would suggest, @geekpete, that you look at http://www.rabbitmq.com/memory.html#memsup-paging and consider tweaking the values of vm_memory_high_watermark_paging_ratio or vm_memory_high_watermark.
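For reference, both settings live in the broker configuration. A minimal sketch in the classic rabbitmq.config (Erlang terms) format; the values are illustrative examples, not recommendations:

```erlang
%% Illustrative values only. Queues start paging messages to disk once
%% memory use reaches vm_memory_high_watermark *
%% vm_memory_high_watermark_paging_ratio of system RAM; publishers are
%% blocked by the memory alarm at the watermark itself.
[
  {rabbit, [
    {vm_memory_high_watermark, 0.4},
    {vm_memory_high_watermark_paging_ratio, 0.75}
  ]}
].
```

Raising the paging ratio above the default of 0.5 delays paging at the cost of holding more messages in RAM before the alarm fires.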

But there is a real issue here: while queues manage their workload to balance the work done accepting messages vs delivering messages, they don't attempt to balance paging in the same way; it's at the mercy of the scheduler. It's possible that once a queue decides it has to page it will devote 90% of its time to doing so (until enough messages are paged out, anyway!)

This is a possible area of future improvement, but not going to make it into 3.5.0 I'm afraid.


geekpete commented on May 18, 2024

Thanks for your response.

The paging behaviour sounds like it's triggering the flow control.

A test scenario we've used to replicate the behaviour: five queues with 2-5 million messages in each, then try to publish more messages to the server. Flow control kicks in, dropping the publish rate from 5k/sec to 300 msgs/sec. Purging one queue at a time sees the speed recover incrementally, first jumping back up to 1700/sec, then back to 5k/sec once enough messages have been purged.

So from my server's point of view, it can totally handle that paging: it has enough I/O, being a physical server with SSDs and the deadline scheduler (the noop scheduler is another option, but it's usually similar to deadline in raw speed). We also know that the server can handle very large volumes of messages in a backlog.

I do realise that wanting a backlog on purpose might be an edge case.

I think the analogy for my case is closer to a very large car park next to a loading dock than to a city with traffic. I want to unload all the cars from all the ships as fast as I can so the ships can depart for the next port; the cars can be collected whenever. Making the ships unload more slowly once half the car park is full isn't helping.

It feels a bit like having to bump up the max open files limit or similar OS settings: the default is a general average-case setting that keeps most people out of trouble, but in high-performance scenarios it becomes an artificial limit that has to be raised when using much larger/faster hardware.

So while I can tweak vm_memory_high_watermark_paging_ratio in combination with vm_memory_high_watermark, I'm guessing that once that ratio is hit, flow control will kick in. That's great if I can get a server with an enormous amount of RAM, but otherwise adjusting this on my current rig will only delay the flow control. It would be great to have it page out to disk to avoid using RAM; it can hit the I/O hard and I'll give it enough fast disk to deal with it.
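To make the interaction of the two settings concrete, here's a back-of-the-envelope sketch (my own illustrative numbers, not figures from this thread): queues start paging at watermark × paging ratio of total RAM, and the memory alarm blocks publishers at the watermark itself.

```python
def thresholds_gb(total_ram_gb, watermark=0.4, paging_ratio=0.5):
    """Return (paging_starts_gb, publishers_blocked_gb).

    Defaults match RabbitMQ's documented defaults:
    vm_memory_high_watermark = 0.4, paging_ratio = 0.5.
    """
    paging_starts = total_ram_gb * watermark * paging_ratio
    publishers_blocked = total_ram_gb * watermark
    return paging_starts, publishers_blocked

# A hypothetical 64 GB server: paging begins at 12.8 GB of broker memory
# use, long before the 25.6 GB memory alarm blocks publishers.
print(thresholds_gb(64))  # → (12.8, 25.6)
```

This is why raising the ratio only delays the effect: the paging threshold scales with RAM, so on a fixed-size box a big enough backlog always crosses it eventually.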

I'm also using durable queues with persistent messages, so would this write the messages to disk twice, in a way? Once for persistence and again when paging them out of RAM?

I'd love to be able to disable the memory-paging flow control trigger, keep hitting RabbitMQ with messages, and have it pour them out to disk; as long as I don't go near the high watermark and it's coping, it should continue at full publish speed.

I'll do some more testing today with different high water mark settings and see how I go.
I'll also investigate using servers with a great deal more ram.


patmanh commented on May 18, 2024

@geekpete where did you land with this? I'm in a similar boat, i.e. seeing flow control kick in when I'd rather the messages go straight to RAM or disk (plenty of both left), although my server (r3.large on EC2) is maxed out on CPU.

Any learnings you discovered would be appreciated!


michaelklishin commented on May 18, 2024

@patmanh #143, #227, #351.


michaelklishin commented on May 18, 2024

I believe #143 largely addresses this, as do related improvements, in particular #227 and friends, and #351. Collecting metrics about flow control and making it more fine-grained are still left to be done, but those are generic improvements not related to tuning.


geekpete commented on May 18, 2024

Hey Team,

Thanks for the follow up with these improvement references.

We'll test out these new features in the new version.


patmanh commented on May 18, 2024

As a follow-up here: we tried playing with the flow control credit settings but couldn't find a configuration that did more good than harm to overall cluster performance when undergoing heavy publish loads without enough time to add more consumers (~millions of messages in <3 minutes).

In the end, we set all of our queues to "lazy" mode. Despite all messages going to disk being slower (vs. transient queues), our overall throughput is high enough (easily tens of thousands of messages per second for each of publish/deliver/ack on decent EC2 hardware, especially after bumping our provisioned IOPS on EBS a little). This lets everything perform reliably whether there are zero messages in our queues or millions.
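For anyone following along, the lazy mode described above can be applied cluster-wide via a policy; a sketch of the rabbitmqctl invocation (the policy name "lazy-all" and the match-everything "^" pattern are my own choices):

```shell
# Apply queue-mode=lazy (available from RabbitMQ 3.6.0) to all queues.
rabbitmqctl set_policy lazy-all "^" '{"queue-mode":"lazy"}' --apply-to queues
```

Using a policy rather than per-queue declaration arguments means the mode can be changed or removed later without redeclaring queues.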

When more guidance is published re: flow control credits, we'd be up for giving things another try, but we're very happy with lazy queues for now!


videlalvaro commented on May 18, 2024

@patmanh this is pretty cool, thanks for the feedback


geekpete commented on May 18, 2024

Hey Team,

I'd just like to say thanks for the lazy queue feature.

We don't hit any flow control in initial testing now: 100+ million messages in the queue, still flowing, no back pressure. For all intents and purposes, with some caveats (disks and filesystem cache must be fast enough for the read/write workload), this solves our flow control problems at the queue level.

From the look of it, the disk will have to hit 100% utilisation before we see any problems.

And this performs quite well even on SATA RAID 10, so SSD should cover it fully.

Can anyone think of or has seen any other issues/caveats with using lazy policy for all queues?


michaelklishin commented on May 18, 2024

@geekpete the combination of lazy queues with queue length limits is the only issue we've seen reported. Well, that and #514, which will ship in 3.6.1.


geekpete commented on May 18, 2024

Excellent, thanks.

