Comments (13)
I might be able to change it over the weekend, but the chances are slim. My suggestion is for you to create a pull request; I can definitely find five minutes to push the package to NuGet (I still haven't had time to automate that).
from hangfire.postgresql.
Pushed, will soon be visible in NuGet.
I agree, commit should be called per batch. Seems like a bug to me.
There is even a log message within the loop saying jobs were removed, when in fact they are not.
@azygis what do you think?
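A minimal sketch of the per-batch-commit idea (not the actual ExpirationManager code, which issues the delete from C#): wrap each batch delete in its own transaction, so batches that finish stay deleted even if a later one fails.

```sql
-- Sketch only: each batch commits on its own, so completed batches
-- survive a later failure. Repeat until the DELETE affects 0 rows.
BEGIN;
DELETE FROM "hangfire"."job"
WHERE "id" IN (
  SELECT "id"
  FROM "hangfire"."job"
  WHERE "expireat" < NOW()
  LIMIT 50000);
COMMIT;
```

With this shape, the "Removed 50000 outdated record(s)" log line would only be emitted for work that is actually durable.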
Sorry, I opened the tab and then closed it with a batch of others.
I agree the transaction should commit in batches, but it's kinda muddy in my eyes. Do we know that the transaction itself is timing out, or is the read itself having problems? @vicentegnz, do you see the logs from the expiration manager (even if misleading, as we see)?
Hi, first of all, thanks for the quick reply.
As you said, @dmitry-vychikov and @azygis, I enabled the information logs, and I do see the trace saying the jobs have been removed from the job table, even though we know that's not true.
We have never been able to see the delete process finish. After several minutes, sometimes between 15 and 20, an exception occurs in the job that looks like a timeout, and the only trace we see on the PostgreSQL side is the one below.
ERROR: canceling statement due to user request
statement:
DELETE FROM "hangfire"."job"
WHERE "id" IN (
SELECT "id"
FROM "hangfire"."job"
WHERE "expireat" < NOW()
LIMIT 50000)
Leaving aside the transaction commit happening outside the batch loop, is it possible that we have something misconfigured regarding the Hangfire job timeouts? Our configuration is quite standard: we increased DistributedLockTimeout and InvisibilityTimeout to avoid locks between Hangfire servers, but we have not seen any other relevant timeout we could increase for this case.
Thanks again,
Just to make it clear: do you have multiple "Removed 50000 outdated record(s) from 'job' table" rows in the logs, or just that one? And when the exception occurs, is it roughly 15 minutes after this "false" log?
It would be interesting to see the exact logging when this happens. Can you provide the relevant parts of the logs?
Sure, let me leave the history of the last execution here. I cleaned out some traces that were not relevant; with debug logs activated you can see when the execution stops.
Without this log level it seemed to take between 15 and 20 minutes, but with it the process appears to stop after 5 minutes, with the exception on the PostgreSQL side.
2024-02-29 13:01:35.602 (INFO) "Removed 1 outdated record(s) from 'aggregatedcounter' table."
2024-02-29 13:02:04.542 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:02:34.866 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:05.691 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:36.024 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:06.508 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:35.694 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:06.250 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:33.834 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:06:04.927 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:11:33.321 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:11:57.557 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:21.890 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:45.359 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:10.496 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:38.750 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:02.421 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:29.227 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:55.796 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:21.472 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:45.479 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:12.319 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:42.197 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:10.090 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:41.097 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:23:07.290 (INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:23:36.232 (INFO) "Removed 50000 outdated record(s) from 'job' table."
The exception message is Exception while reading from stream
and the stack trace of the exception is...
   at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|233_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery()
   at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action`2 paramReader) in /_/Dapper/SqlMapper.cs:line 2938
   at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 648
   at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass11_1.<Execute>b__0(IDbConnection connection)
   at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass15_0.<UseConnectionDistributedLock>b__0(DbConnection connection)
   at Hangfire.PostgreSql.PostgreSqlStorage.<>c__DisplayClass30_0.<UseConnection>b__0(DbConnection connection)
   at Hangfire.PostgreSql.PostgreSqlStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
   at Hangfire.PostgreSql.ExpirationManager.UseConnectionDistributedLock(PostgreSqlStorage storage, Action`1 action)
   at Hangfire.PostgreSql.ExpirationManager.Execute(CancellationToken cancellationToken)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
Are you using a load balancer, like pgbouncer or similar? It's pretty much 5 minutes each time, which means something kills the connection after 5 minutes. Sounds like a hard timeout somewhere.
In that case, committing in batches would not help, as the connection itself would still be kept alive. Closing and reopening connections would not make a lot of sense within a single processing unit, which is the expiration manager.
I'm a bit stumped here. Not entirely sure how to proceed, other than first checking where the timeout is set (whether in the connection string, the load balancer, or PostgreSQL itself). We can probably proceed from there.
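If it helps, the server-side candidates can be inspected directly in a psql session; these are standard PostgreSQL settings (pgbouncer keeps its own limits, such as query_timeout, in pgbouncer.ini):

```sql
-- Standard PostgreSQL timeout settings worth checking; any of these
-- can cancel or kill a long-running cleanup statement.
SHOW statement_timeout;                    -- cancels statements that run too long
SHOW idle_in_transaction_session_timeout;  -- kills sessions idle inside a transaction
SHOW tcp_keepalives_idle;                  -- keepalive probing of quiet connections
```

A value of 0 means the setting is disabled, which would point the finger at the load balancer instead.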
Closing and reopening the connections would not make a lot of sense in a single processing unit, which is expiration manager.
If we commit in batches, we might never hit this 5-minute timeout in the first place. And even if we do, it's not a big problem: part of the cleanup will already be committed, and we will continue on the next iteration.
On the other hand, I don't think it is that bad to open a fresh connection per batch either...
I see we already kinda do that; each table gets a new connection. In that case I agree, we can try a connection per batch.
There's no reason to keep the transaction if we go that way, though; a single statement in a transaction is unnecessary.
Are you using loadbalancer, like pgbouncer or similar? It's pretty much 5min each time, which means something kills the connection after 5min. Sounds like a hard timeout somewhere.
Yes, we have pgbouncer; that's something we're looking at. We probably have something at the PostgreSQL level that interrupts the transaction.
As @dmitry-vychikov said, if we commit in batches, after several iterations we will probably no longer see that timeout; at the very least, the log messages indicating that 'x' records have been removed will be true.
@vicentegnz even though the issue is closed, it would be nice to hear whether it helped in your case, if you can.
@azygis after upgrading to the new version we solved our problem. We did hit an issue with the migration script that runs an update on the jobs table (Install.v20.sql): due to the number of records in the table, the script never finished executing.
We worked around it by waiting for the records to be deleted first; we were lucky that the new version's code was compatible with the old schema.
So if anyone has the same problem, I recommend checking which scripts will be applied before updating the version. In our case the jobs table was locked and we had to kill the query, since nothing was being processed.
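For anyone doing that pre-upgrade check: Hangfire.PostgreSql records the applied schema version in a small table, which you can compare against the Install.vXX.sql scripts shipped in the new package. The queries below assume the default "hangfire" schema; adjust if you use a custom one.

```sql
-- Assumes the default "hangfire" schema name.
SELECT "version" FROM "hangfire"."schema";  -- currently applied schema version
SELECT count(*) FROM "hangfire"."job";      -- rough size of the table a migration will touch
```

A large row count on "job" is a hint to clean up (or schedule a maintenance window) before letting a table-rewriting migration run.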