Comments (13)

azygis commented on August 12, 2024

I might be able to change it during the weekend, but the chances are slim. My suggestion is for you to create a pull request; I can definitely find 5 minutes to push the package to NuGet (I still haven't had time to automate that).

from hangfire.postgresql.

azygis commented on August 12, 2024

Pushed; it will soon be visible on NuGet.

dmitry-vychikov commented on August 12, 2024

I agree, commit should be called per batch. Seems like a bug to me.

There is even a log message within the loop saying jobs were removed while in fact it's not true.

@azygis what do you think?
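To make the issue concrete, here is a simplified Python model of the structure being described (a sketch, not the actual C# ExpirationManager code): batches are deleted and logged inside the loop, but nothing becomes durable until a single commit after all batches.

```python
class FakeTransaction:
    """Minimal stand-in for a DB transaction: deletions only become
    durable when commit() is called."""

    def __init__(self, rows):
        self.rows = rows           # committed (durable) state
        self.pending = list(rows)  # uncommitted working state

    def delete_batch(self, batch_size):
        removed = min(batch_size, len(self.pending))
        del self.pending[:removed]
        return removed

    def commit(self):
        self.rows[:] = self.pending


def expire_all_at_once(tx, batch_size):
    logs = []
    while (removed := tx.delete_batch(batch_size)) > 0:
        # This log line is emitted even though nothing is committed yet,
        # which is why the "Removed ..." messages can be misleading.
        logs.append(f"Removed {removed} outdated record(s) from 'job' table.")
    tx.commit()  # a timeout anywhere above rolls back *all* batches
    return logs


tx = FakeTransaction(rows=list(range(120)))
logs = expire_all_at_once(tx, batch_size=50)
# Three "Removed ..." log lines, but the rows only actually disappear
# when the final commit succeeds; if the connection dies mid-run,
# every logged batch is rolled back.
```

If the transaction is killed before the final `commit()`, all of the logged deletions are undone, matching the behavior reported above.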

azygis commented on August 12, 2024

Sorry, opened the tab and closed with a batch of others.

I agree the transaction should commit batches, but it's kinda muddy in my eyes. Do we know that the transaction itself is timing out? Or is the read itself having problems? @vicentegnz do you see the logs from expiration manager (even if misleading as we see)?

vicentegnz commented on August 12, 2024

Hi, first of all, thanks for the quick reply,

As you said @dmitry-vychikov, @azygis: I enabled the information logs, and I do see the trace saying the jobs have been removed from the job table, even though we know it's not true.

We have never been able to see the end of the delete process. After several minutes, sometimes between 15 and 20, an exception that looks like a timeout occurs in the job, and the only trace we see in PostgreSQL is the one below.

ERROR: canceling statement due to user request
statement:
DELETE FROM "hangfire"."job"
WHERE "id" IN (
    SELECT "id"
    FROM "hangfire"."job"
    WHERE "expireat" < NOW()
    LIMIT 50000)

Setting aside the commit of the transaction happening outside of the batch loop, is it possible that we have something misconfigured regarding the Hangfire job timeouts? Our configuration is quite standard; we increased DistributedLockTimeout and InvisibilityTimeout to avoid locks between Hangfire servers, but we have not seen any other relevant timeout we could increase to avoid this case.

Thanks again,

azygis commented on August 12, 2024

Just to make it clear: do you have multiple "Removed 50000 outdated record(s) from 'job' table" rows in the logs, or just that one? And when the exception occurs, is it roughly 15 minutes after this "false" log?

It would be interesting to see the exact logging when this happens. Can you provide the relevant parts of the logs?

vicentegnz commented on August 12, 2024

Sure, let me leave the history of the last execution here. I removed some traces that were not relevant; with the debug logs activated you can see exactly when the execution stops.

Without this log level, it seemed to take between 15 and 20 minutes, but with it enabled it looks like the process stops after 5 minutes with the exception in PostgreSQL.

2024-02-29 13:01:35.602	(INFO) "Removed 1 outdated record(s) from 'aggregatedcounter' table."
2024-02-29 13:02:04.542	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:02:34.866	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:05.691	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:36.024	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:06.508	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:35.694	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:06.250	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:33.834	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:06:04.927 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:11:33.321	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:11:57.557	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:21.890	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:45.359	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:10.496	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:38.750	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:02.421	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:29.227	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:55.796	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:21.472	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:45.479	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:12.319	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:42.197	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:10.090	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:41.097 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:23:07.290	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:23:36.232	(INFO) "Removed 50000 outdated record(s) from 'job' table."

The exception message is Exception while reading from stream and the stack trace of the exception is...

at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|233_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
at Npgsql.NpgsqlDataReader.NextResult()
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery()
at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action`2 paramReader) in /_/Dapper/SqlMapper.cs:line 2938
at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 648
at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass11_1.<Execute>b__0(IDbConnection connection)
at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass15_0.<UseConnectionDistributedLock>b__0(DbConnection connection)
at Hangfire.PostgreSql.PostgreSqlStorage.<>c__DisplayClass30_0.<UseConnection>b__0(DbConnection connection)
at Hangfire.PostgreSql.PostgreSqlStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
at Hangfire.PostgreSql.ExpirationManager.UseConnectionDistributedLock(PostgreSqlStorage storage, Action`1 action)
at Hangfire.PostgreSql.ExpirationManager.Execute(CancellationToken cancellationToken)
at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

azygis commented on August 12, 2024

Are you using a load balancer, like pgbouncer or similar? It's pretty much 5 minutes each time, which means something kills the connection after 5 minutes. Sounds like a hard timeout somewhere.

In that case, committing batches would not help, as the connection itself would still be kept alive. Closing and reopening connections would not make a lot of sense within a single processing unit, which is what the expiration manager is.

I'm a bit stumped here. Not entirely sure how to proceed, other than checking where the timeout is set (whether in the connection string, loadbalancer or pg itself) first. We probably can proceed from there.
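For reference, these are the standard places such a hard limit usually lives (the PostgreSQL settings below are real server parameters; which one, if any, applies depends on your setup):

```sql
-- Server-side timeouts that can cancel a long-running statement or transaction:
SHOW statement_timeout;
SHOW idle_in_transaction_session_timeout;

-- In pgbouncer.ini, these settings cancel or kill long queries/transactions:
--   query_timeout = 300
--   idle_transaction_timeout = 300
```

A pgbouncer `query_timeout` in particular cancels the statement server-side, which matches the "canceling statement due to user request" error above.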

dmitry-vychikov commented on August 12, 2024

Closing and reopening the connections would not make a lot of sense in a single processing unit, which is expiration manager.

If we commit in batches, we might never hit this 5-minute timeout in the first place. But even if we do, that's not a big problem, because part of the cleanup will already be completed and we will continue on the next iteration.

On the other hand, I don't think it is that bad to open a fresh connection per batch...
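The approach being proposed can be sketched like this (a Python model with hypothetical helper names, not the actual Hangfire.PostgreSql implementation): each batch gets a fresh connection and is committed independently, so a timeout mid-run only loses the current batch.

```python
class FakeConnection:
    """Minimal stand-in for a DB connection; deletions become durable
    only on commit, and anything uncommitted is rolled back on close."""

    def __init__(self, store):
        self.store = store
        self.uncommitted = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.uncommitted = 0  # roll back anything not committed
        return False

    def delete_expired(self, batch_size):
        self.uncommitted = min(batch_size, len(self.store))
        return self.uncommitted

    def commit(self):
        del self.store[:self.uncommitted]
        self.uncommitted = 0


def expire_per_batch(store, batch_size):
    total = 0
    while True:
        with FakeConnection(store) as conn:  # fresh connection per batch
            removed = conn.delete_expired(batch_size)
            if removed == 0:
                return total
            conn.commit()  # this batch stays durable even if a later one fails
            total += removed


jobs = list(range(120))
print(expire_per_batch(jobs, batch_size=50))  # 120; jobs is now empty
```

With this structure the "Removed N record(s)" log line would finally be accurate, since each logged batch is already committed.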

azygis commented on August 12, 2024

I see we already kinda do that; each table gets a new connection. In that case I agree, we can try a connection per batch.

There's no reason to use the transaction if we go that way though. A single statement in a transaction is unnecessary.

vicentegnz commented on August 12, 2024

Are you using loadbalancer, like pgbouncer or similar? It's pretty much 5min each time, which means something kills the connection after 5min. Sounds like a hard timeout somewhere.

Yes, we have pgbouncer; that's something we're looking at. We probably have something at the PostgreSQL level that interrupts the transaction.

As @dmitry-vychikov said, if we commit in batches, after several iterations we will probably no longer see that timeout; at the very least, the log messages indicating that 'x' records have been removed will be true.

azygis commented on August 12, 2024

@vicentegnz even though the issue is closed, it would be nice to hear whether it helped in your case, if you can.

vicentegnz commented on August 12, 2024

@azygis after deploying the new version, our problem was solved. We did, however, hit a problem with the migration script that updates the job table (Install.v20.sql): due to the number of records in the table, the script did not finish its execution.

We solved it by waiting for the records to be deleted first; we were lucky that the new version's code was compatible with the old schema.

So I recommend that anyone with the same problem check, before updating the version, which scripts are going to be applied. In our case the job table was locked and we had to kill the query, since we were not processing anything.

from hangfire.postgresql.
