Comments (29)
@dmitry-vychikov I've checked on my side with just the index order applied, and we no longer see that unexpected degradation. The query executes ten times faster. Please merge into master, we are ready to update ASAP on all our environments.
We are continuing to monitor whether the issue reproduces, but here is the latency before and after applying the index order:
I hope we will not have degradation anymore.
@dmitry-vychikov, @azygis thank you guys for the fast solution!
from hangfire.postgresql.
Lovely! Then we don't need any conditionals, I suppose?
Yes, we don't. I tried to test two cases with regards to ordering:
- Reversed order by queue name when query has to pick first jobs from the bottom of the queue
- Evenly distributed jobs based on queue name.
I didn't see any significant difference between these two. Both cases run faster with updated index. At least, we did not make it worse.
Would you be able to look at other indexes? If order is needed for any of them?
I might take a look next weekend.
Right now I'm not even sure if other indexes for jobqueue are needed at all. I had a brief look at past commits, and there were performance improvements that created new indexes in that table, but the old ones were not touched.
Maybe it's time to do some clean ups.
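One way to judge which jobqueue indexes are still earning their keep is to look at PostgreSQL's own usage statistics. The following is only a sketch, assuming the default "hangfire" schema seen in the query elsewhere in this thread; an index with a low idx_scan count on a busy server is a candidate for a closer look, not proof that it can be dropped.

```sql
-- Sketch: list every index on hangfire.jobqueue with its scan count and size.
-- Assumes the default schema name "hangfire"; adjust if yours differs.
SELECT s.indexrelname                                 AS index_name,
       s.idx_scan                                     AS scans_since_stats_reset,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       i.indexdef
FROM pg_stat_user_indexes AS s
JOIN pg_indexes AS i
  ON  i.schemaname = s.schemaname
  AND i.tablename  = s.relname
  AND i.indexname  = s.indexrelname
WHERE s.schemaname = 'hangfire'
  AND s.relname    = 'jobqueue'
ORDER BY s.idx_scan;
```

The counters are cumulative since the last statistics reset, so they are best read after the servers have been processing jobs for a while.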
from hangfire.postgresql.
Published as 1.20.8.
from hangfire.postgresql.
Hi,
Please share additional info:
- hangfire and hangfire postgres versions
- Hangfire configuration code
- How many workers and queues do you have
- How many hangfire processes do you run
- Dotnet version
- Is it deployed on cloud?
- what's CPU and mem usage on the postgres server?
from hangfire.postgresql.
Hi, Sure
- dotnet 7
- Hangfire.Core - 1.8.6
- Hangfire.PostgreSql - 1.20.4.0
services.AddHangfire(globalConfiguration =>
{
    globalConfiguration.UsePostgreSqlStorage(Configuration.GetConnectionString(HangfireExtensions.HangfireDatabase));
    globalConfiguration.UseSerializerSettings(HangfireExtensions.HangfireJsonSerializerSettings);
    globalConfiguration.UseTagsWithPostgreSql();
});
services.AddHangfireServer(options =>
{
    options.WorkerCount = agentSettings.GetValue("WorkerCount", 15);
    options.Queues = targetQueues;
});
We have a few servers in GCP. Each server can have its own list of queues. Hangfire uses an independent database, run as a service in GCP. Database CPU is at 100%, memory at around 50%.
We can add additional instances quickly. However, we have this issue even with 1 server.
13 secs latency for an index scan, and it's slowly degrading. We have 60k jobs in the queue and Hangfire cannot start processing more than 5-10 jobs. Right now we are not even creating jobs, just trying to process the existing ones. So, technically, we cannot add servers for horizontal scaling, because the DB is dying. Do you have any ideas about what we are doing wrong?
from hangfire.postgresql.
It seems like the database is feeling better when the queue is smaller. Latency gets better and better as the queue shrinks.
Any suggestions? :)
from hangfire.postgresql.
Try to increase the polling interval for workers. There must be a setting for this inside the AddHangfireServer or UsePostgreSqlStorage calls.
If you have lots of workers that constantly poll the db, but never get any jobs (e.g because their own queue is empty), this can create unnecessary stress.
from hangfire.postgresql.
That is not the desired behavior. We have small jobs and all servers could process those. So, technically increasing the polling interval (15 seconds now) will mean that workers are just doing nothing sometimes. Or am I wrong?
from hangfire.postgresql.
And here are the transactions per second for the database. Take a look at the performance: as the number of jobs in the queue decreases, performance grows. In the end we have no jobs left. We are talking about 10k-50k jobs in the queue. Hangfire takes so long to process them because of issues with fetching jobs for processing. Job duration is 5-50ms.
from hangfire.postgresql.
That is not the desired behavior. We have small jobs and all servers could process those. So, technically increasing the polling interval (15 seconds now) will mean that workers are just doing nothing sometimes. Or am I wrong?
Technically, yes; in practice, depending on the server which enqueues the job, it can be picked up by a worker in the same server instance immediately because of AutoResetEvent.
Partial screenshot of the execution plan has little value. Can you provide the actual query plan from PG in JSON format instead? I can't open any of these images in full size, personally (404).
EDIT: Also, what is behind UseTagsWithPostgreSql()?
from hangfire.postgresql.
Generated a million rows in the table. Here are the results. Even with a million rows, the query runs in sub-1s.
from hangfire.postgresql.
Generated a million rows in the table. Here are the results. Even with a million rows, the query runs in sub-1s.
That is my expectation. :) Will try to share the JSON plan later. I cannot extract it from GCP that easily. UseTagsWithPostgreSql enables the Tags functionality by creating an independent table. It doesn't affect the initial table or selecting from jobqueue. OK, we will try to increase the polling interval.
from hangfire.postgresql.
I'd suggest trying manual vacuum on the table, if GCP allows that in some way. We've had some issues with autovacuum not doing the job so well in our business apps.
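For reference, a manual check and vacuum could look roughly like this. It is a sketch that assumes the default "hangfire" schema and a role with sufficient privileges on the table; whether a managed service such as GCP permits it is not something this thread confirms.

```sql
-- Sketch: see how many dead tuples the queue table carries and when it was
-- last vacuumed (manually or by autovacuum).
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'hangfire'
  AND relname    = 'jobqueue';

-- Reclaim dead tuples and refresh planner statistics in one pass.
-- VACUUM cannot run inside a transaction block.
VACUUM (VERBOSE, ANALYZE) "hangfire"."jobqueue";
```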
from hangfire.postgresql.
I'd suggest trying manual vacuum on the table, if GCP allows that in some way. We've had some issues with autovacuum not doing the job so well in our business apps.
Tried, didn't help :(
from hangfire.postgresql.
So, the reason is 100% related to the number of records in jobqueue. This subquery is the issue:
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE
"queue" = ANY (ARRAY['kafka-consumption'])
AND ("fetchedat" IS NULL
OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY
"fetchedat" NULLS FIRST,
"queue",
"jobid"
for update skip locked
LIMIT 1;
Here is the plan for the query https://explain.dalibo.com/plan/ga35182eddag2f6h#
As I understand it, we have the problem because of for update skip locked. Without for update skip locked, the plan is: https://explain.dalibo.com/plan/76gad6c1eec5bd4e But honestly, I don't understand why the plan is so different.
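For anyone who wants to capture such a plan as JSON straight from PostgreSQL, a sketch along these lines should work from psql or any SQL client. It reuses the query posted above; the transaction wrapper is there because EXPLAIN ANALYZE really executes the statement, so the row lock taken by FOR UPDATE is released again on ROLLBACK.

```sql
-- Sketch: capture the actual (not just estimated) plan in JSON format.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY (ARRAY['kafka-consumption'])
  AND ("fetchedat" IS NULL
       OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY "fetchedat" NULLS FIRST,
         "queue",
         "jobid"
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Release the row lock taken by the analyzed statement.
ROLLBACK;
```

Running the same block with the FOR UPDATE SKIP LOCKED line removed gives the second plan for a side-by-side comparison.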
from hangfire.postgresql.
https://postgrespro.com/list/thread-id/2505440 Here a similar problem is discussed.
I quickly scanned through, and one recommendation is to introduce jitter lag to polling threads to decrease contention.
Maybe there is something better, need to read until the end.
from hangfire.postgresql.
https://postgrespro.com/list/thread-id/2505440 Here a similar problem is discussed.
I quickly scanned through, and one recommendation is to introduce jitter lag to polling threads to decrease contention.
Maybe there is something better, need to read until the end.
Checked that thread. Interesting, but I don’t understand what we can do. Will increasing the polling interval help us? Honestly, I don’t understand how to use Postgres in production now. We are doing everything according to best practices, yet additional workers completely destroy job processing performance. Why does nobody else have this issue? Or is nobody using Postgres in production?
from hangfire.postgresql.
Plenty of production applications use both PostgreSQL and this package. Each environment has different capabilities, bandwidth, hardware and so on, so each environment can behave differently.
I personally have no experience with GCP so I can't really say anything about it.
from hangfire.postgresql.
Plenty of production applications use both PostgreSQL and this package. Each environment has different capabilities, bandwidth, hardware and so on, so each environment can behave differently.
I personally have no experience with GCP so I can't really say anything about it.
Yeah, it's just that the topic above is not related to Hangfire or the current package, but to Postgres. What I don’t understand is why we have this issue but nobody else does. What are we doing differently? :) I do see issues related to processing “for update skip locked” in many threads. Any ideas what we can do about that?
from hangfire.postgresql.
Oh, yeah, it's not about Hangfire.PostgreSql storage provider, but it is about PostgreSQL itself, which the provider does use.
I have to admit I've not dug into the performance/indexing/etc mechanisms in PG as I had no need for it. If GCP has a limited set of operations that you can perform on the database directly (again, no idea how GCP works), my suggestion is to perhaps try Redis provider. At least according to Hangfire docs it has a huge throughput increase compared to SQL (makes sense, since Redis is in-memory cache).
from hangfire.postgresql.
Oh, yeah, it's not about Hangfire.PostgreSql storage provider, but it is about PostgreSQL itself, which the provider does use.
I have to admit I've not dug into the performance/indexing/etc mechanisms in PG as I had no need for it. If GCP has a limited set of operations that you can perform on the database directly (again, no idea how GCP works), my suggestion is to perhaps try Redis provider. At least according to Hangfire docs it has a huge throughput increase compared to SQL (makes sense, since Redis is in-memory cache).
I believe you don’t understand the severity of this issue. The topic above is not related to GCP but, yes, to something specific inside Postgres. Until we understand what it is and how to mitigate it, everyone who uses Hangfire with the current package is at risk of unpredictable performance degradation. It’s a time bomb. It could happen once you try to add additional workers, or maybe not. Until we know what it is, the production risks are too high for everyone.
from hangfire.postgresql.
I agree; to an extent.
That could happen with or without this package, as the linked mailing list shows. Without a reproduction there's virtually no way to see what can or can not happen.
I understand how you feel, being the only one who has reported the issue in this repo (I'm not saying you're the only one impacted; just the only one reporting). Unfortunately, you're the one who has a reproduction and the means to see whether any changes to the queries are valid or not. Since it's an OSS package, you can submit a pull request if there's a possible fix.
from hangfire.postgresql.
@vadim-kupovykh
Can you provide a minimal runnable sample app that reproduces your issue say within a docker environment?
I may have some free time this weekend to try to come up with something.
I have only 2 ideas so far:
- Introduce jitter delay within Hangfire worker, as recommended by the article above
- Try to change the DB query in a way that sacrifices job ordering in exchange for performance gains, and expose it as a configuration option.
from hangfire.postgresql.
From what I understand, it's really about many workers trying to handle the same table, as in essence SKIP LOCKED should simply skip the locked rows (read: locked by other workers), but it feels like PostgreSQL can choke a little with many rows and many concurrent updates.
I'm not convinced there's a good way to just drop the ordering, even if configurable. All jobs must be picked up in order they were created.
But this gave me an idea - maybe we lack the ordered index for this particular query?
from hangfire.postgresql.
But this gave me an idea - maybe we lack the ordered index for this particular query?
Probably not. At least, I see these 3 indexes that I think should work:
create index ix_hangfire_jobqueue_jobidandqueue
on jobqueue (jobid, queue);
create index ix_hangfire_jobqueue_queueandfetchedat
on jobqueue (queue, fetchedat);
create index jobqueue_queue_fetchat_jobid
on jobqueue (queue, fetchedat, jobid);
Requirements
All jobs must be picked up in order they were created
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I just checked the MSSQL server implementation, and they do NOT order the jobs on dequeue (see https://github.com/HangfireIO/Hangfire/blob/main/src/Hangfire.SqlServer/SqlServerJobQueue.cs#L191-L201)
Here's their SQL query:
set nocount on;set xact_abort on;set tran isolation level read committed;
update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [{_storage.SchemaName}].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in @queues and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));
Ordering question has been discussed here HangfireIO/Hangfire#398 , and it seems that the correct way to maintain job dependencies is using job continuation.
Usage scenarios
I think that it really depends on the use case of each consumer whether ordering is important or not.
Some users (just as myself) suffered from enforced ordering pattern. We had 4 queues, and wanted them to be processed almost evenly, but it was not possible because workers would pickup jobs using order by queue names.
Introducing an option to avoid sorting can be considered a feature that gives users even more control over their processing pipeline.
Performance
Ensuring ordering is really putting much stress on performance.
I did a very stupid investigation, but results can be seen easily.
I generated 4 queues with 30_000 jobs each, and ran hangfire server two times:
- w/ ordering
  - consumed 1600% of my CPU (all cores at full load)
  - only 20,000 jobs were processed within approx. 3 minutes
- w/o ordering
  - consumed about 600% (almost 3 times less)
  - all 120,000 jobs were processed within 3 minutes
See the container CPU usage chart from docker.
My setup
- Windows 11, x64
- i7-10700K (8 cores, 16 threads)
- 48GB RAM
- latest postgres, running in docker
- 1 hangfire server process, 60 workers
- Empty jobs that do nothing: public async Task Delayed() {}
Suggested next steps
@azygis I stand firm on my suggestion with a config flag that removes ordering in exchange for more performance.
If you approve, I can try to submit a PR sometime next week.
@vadim-kupovykh do you think this tradeoff would be applicable for your case, at least?
Though, it may require additional tweaks on Dashboard and other parts of the library. On the dashboard, I saw strange artifacts when page with queue jobs would show "No jobs in queue" occasionally during active processing.
Useful reading
I also read this article. It is basically saying that FOR UPDATE SKIP LOCKED is the best way to do queues in Postgres, but performance will suffer with multiple workers.
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
Source code of testing app
using Hangfire;
using Hangfire.PostgreSql;

string[] queues = { "alpha", "beta", "gamma", "test" };
bool generateJobs = true; // Toggle to switch between creating new jobs and processing them

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHangfire(configuration => {
    configuration.UsePostgreSqlStorage(options => {
        options.UseNpgsqlConnection("User ID=postgres;Password=mysecretpassword;Host=localhost;Port=5432;Database=hangfire-postgres");
    });
});

if (!generateJobs)
{
    // Processing mode: start a server with 60 workers on all four queues
    builder.Services.AddHangfireServer(options => {
        options.Queues = queues;
        options.WorkerCount = 60;
    });
}

var app = builder.Build();

if (generateJobs)
{
    // Generation mode: enqueue 30,000 empty jobs per queue, one task per queue
    var client = app.Services.GetRequiredService<IBackgroundJobClientV2>();
    foreach (string queue in queues)
    {
        Task.Run(() => {
            for (int i = 0; i < 30_000; i++)
            {
                client.Enqueue<Jobs>(queue, jobs => jobs.Delayed());
            }
        });
    }
}

app.UseHangfireDashboard("");
app.Run();

public class Jobs
{
    public async Task Delayed()
    {
    }
}
Toggling job ordering
I compiled Hangfire.PostgreSql locally, and basically removed the ORDER BY ... statement from here:
from hangfire.postgresql.
What if you specify index order? Currently it's a default, which is ASC NULLS LAST. Query does a complete 180 with NULLS FIRST.
create index jobqueue_queue_fetchat_jobid
on jobqueue (fetchedat NULLS FIRST, queue, jobid);
Something like that?
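To see why the index order matters, here is a small experiment that can be run on any scratch database. It is only a sketch: jobqueue_demo and both index names are hypothetical, and the table is a simplified stand-in rather than the real hangfire.jobqueue schema. The idea is to compare the two EXPLAIN outputs, since an index with the default ASC NULLS LAST order cannot return rows in fetchedat NULLS FIRST order without an extra sort step.

```sql
-- Hypothetical scratch table, dropped automatically at the end of the session.
CREATE TEMP TABLE jobqueue_demo (
    id        bigserial PRIMARY KEY,
    jobid     bigint NOT NULL,
    queue     text   NOT NULL,
    fetchedat timestamptz          -- NULL means "not fetched yet"
);

-- 100,000 unfetched jobs spread evenly over four queues.
INSERT INTO jobqueue_demo (jobid, queue)
SELECT n, (ARRAY['alpha', 'beta', 'gamma', 'test'])[1 + n % 4]
FROM generate_series(1, 100000) AS n;

-- Index with the default ordering (every column ASC NULLS LAST).
CREATE INDEX jobqueue_demo_default
    ON jobqueue_demo (queue, fetchedat, jobid);
ANALYZE jobqueue_demo;

EXPLAIN
SELECT id
FROM jobqueue_demo
WHERE queue = ANY (ARRAY['alpha'])
  AND (fetchedat IS NULL OR fetchedat < NOW() + INTERVAL '1 YEAR')
ORDER BY fetchedat NULLS FIRST, queue, jobid
LIMIT 1;

-- Index whose column order and NULLS placement match the ORDER BY clause.
CREATE INDEX jobqueue_demo_ordered
    ON jobqueue_demo (fetchedat NULLS FIRST, queue, jobid);
ANALYZE jobqueue_demo;

EXPLAIN
SELECT id
FROM jobqueue_demo
WHERE queue = ANY (ARRAY['alpha'])
  AND (fetchedat IS NULL OR fetchedat < NOW() + INTERVAL '1 YEAR')
ORDER BY fetchedat NULLS FIRST, queue, jobid
LIMIT 1;
```

The exact plans depend on the PostgreSQL version and table statistics, so treat the output as something to compare rather than a fixed result.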
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I cannot. It's a mistake done by many, and since the package has been with this ordering for quite some time, it can become a breaking change for some people. Of course, with a default of "keep ordering" it's not a breaking change though.
from hangfire.postgresql.
What if you specify index order? Currently it's a default, which is ASC NULLS LAST. Query does a complete 180 with NULLS FIRST.
@azygis brilliant idea!
On my test data it works as fast as w/o ordering.
from hangfire.postgresql.
Guys, you are amazing.:) going to test with ordered index! Will update you later.
from hangfire.postgresql.
Lovely! Then we don't need any conditionals, I suppose?
Would you be able to look at other indexes? If order is needed for any of them?
Edit: @vadim-kupovykh yeah it's kinda safe to do live. When a new version is deployed, the index will just get recreated.
from hangfire.postgresql.