Comments (13)

azygis commented on August 12, 2024

I might be able to change it during the weekend, but the chances are slim. My suggestion is for you to create a pull request; I can definitely find 5 minutes to push the package to NuGet (I still haven't had time to automate that).

from hangfire.postgresql.

azygis commented on August 12, 2024

Pushed; it will soon be visible on NuGet.

dmitry-vychikov commented on August 12, 2024

I agree, commit should be called per batch. Seems like a bug to me.

There is even a log message within the loop saying jobs were removed while in fact it's not true.

@azygis what do you think?
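To make the issue concrete, here is a simplified Python model of the structure being described (a sketch, not the actual C# ExpirationManager code): batches are deleted and logged inside the loop, but nothing becomes durable until a single commit after all batches.

```python
class FakeTransaction:
    """Minimal stand-in for a DB transaction: deletions only become
    durable when commit() is called."""

    def __init__(self, rows):
        self.rows = rows           # committed (durable) state
        self.pending = list(rows)  # uncommitted working state

    def delete_batch(self, batch_size):
        removed = min(batch_size, len(self.pending))
        del self.pending[:removed]
        return removed

    def commit(self):
        self.rows[:] = self.pending


def expire_all_at_once(tx, batch_size):
    logs = []
    while (removed := tx.delete_batch(batch_size)) > 0:
        # This log line is emitted even though nothing is committed yet,
        # which is why the "Removed ..." messages can be misleading.
        logs.append(f"Removed {removed} outdated record(s) from 'job' table.")
    tx.commit()  # a timeout anywhere above rolls back *all* batches
    return logs


tx = FakeTransaction(rows=list(range(120)))
logs = expire_all_at_once(tx, batch_size=50)
# Three "Removed ..." log lines, but the rows only actually disappear
# when the final commit succeeds; if the connection dies mid-run,
# every logged batch is rolled back.
```

If the transaction is killed before the final `commit()`, all of the logged deletions are undone, matching the behavior reported above.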

azygis commented on August 12, 2024

Sorry, opened the tab and closed with a batch of others.

I agree the transaction should commit batches, but it's kinda muddy in my eyes. Do we know that the transaction itself is timing out? Or is the read itself having problems? @vicentegnz do you see the logs from expiration manager (even if misleading as we see)?

vicentegnz commented on August 12, 2024

Hi, first of all, thanks for the quick reply,

As you said @dmitry-vychikov, @azygis: I enabled the information logs, and I do see the trace saying the jobs have been removed from the job table, even though we know it's not true.

We have never been able to see the end of the delete process. After several minutes, sometimes between 15 and 20, an exception that looks like a timeout occurs in the job, and the only trace we see in PostgreSQL is the one below.

ERROR: canceling statement due to user request
statement:
DELETE FROM "hangfire"."job"
WHERE "id" IN (
    SELECT "id"
    FROM "hangfire"."job"
    WHERE "expireat" < NOW()
    LIMIT 50000)

Setting aside the commit of the transaction happening outside of the batch loop, is it possible that we have something misconfigured regarding the Hangfire job timeouts? Our configuration is quite standard; we increased DistributedLockTimeout and InvisibilityTimeout to avoid locks between Hangfire servers, but we have not seen any other relevant timeout we could increase to avoid this case.

Thanks again,

azygis commented on August 12, 2024

Just to make it clear: do you have multiple "Removed 50000 outdated record(s) from 'job' table" rows in the logs, or just that one? And when the exception occurs, is it roughly 15 minutes after this "false" log?

It would be interesting to see the exact logging when this happens. Can you provide the relevant parts of the logs?

vicentegnz commented on August 12, 2024

Sure, let me leave the history of the last execution here. I removed some traces that were not relevant; with the debug logs activated you can see exactly when the execution stops.

Without this log level, it seemed to take between 15 and 20 minutes, but with it enabled it looks like the process stops after 5 minutes with the exception in PostgreSQL.

2024-02-29 13:01:35.602	(INFO) "Removed 1 outdated record(s) from 'aggregatedcounter' table."
2024-02-29 13:02:04.542	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:02:34.866	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:05.691	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:03:36.024	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:06.508	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:04:35.694	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:06.250	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:05:33.834	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:06:04.927 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:11:33.321	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:11:57.557	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:21.890	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:12:45.359	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:10.496	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:13:38.750	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:02.421	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:29.227	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:14:55.796	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:21.472	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:15:45.479	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:12.319	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:16:42.197	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:10.090	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:17:41.097 (DEBUG) "Execution loop ExpirationManager:1c392727 caught an exception and will be retried in 00:05:00" (Exception while reading from stream - Npgsql)
2024-02-29 13:23:07.290	(INFO) "Removed 50000 outdated record(s) from 'job' table."
2024-02-29 13:23:36.232	(INFO) "Removed 50000 outdated record(s) from 'job' table."

The exception message is Exception while reading from stream and the stack trace of the exception is...

at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|233_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
at Npgsql.NpgsqlDataReader.NextResult()
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery()
at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action`2 paramReader) in /_/Dapper/SqlMapper.cs:line 2938
at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 648
at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass11_1.<Execute>b__0(IDbConnection connection)
at Hangfire.PostgreSql.ExpirationManager.<>c__DisplayClass15_0.<UseConnectionDistributedLock>b__0(DbConnection connection)
at Hangfire.PostgreSql.PostgreSqlStorage.<>c__DisplayClass30_0.<UseConnection>b__0(DbConnection connection)
at Hangfire.PostgreSql.PostgreSqlStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
at Hangfire.PostgreSql.ExpirationManager.UseConnectionDistributedLock(PostgreSqlStorage storage, Action`1 action)
at Hangfire.PostgreSql.ExpirationManager.Execute(CancellationToken cancellationToken)
at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

azygis commented on August 12, 2024

Are you using a load balancer, like pgbouncer or similar? It's pretty much 5 minutes each time, which means something kills the connection after 5 minutes. Sounds like a hard timeout somewhere.

In that case, committing batches would not help, as the connection itself would still be kept alive. Closing and reopening connections would not make a lot of sense within a single processing unit, which is what the expiration manager is.

I'm a bit stumped here. Not entirely sure how to proceed, other than checking where the timeout is set (whether in the connection string, loadbalancer or pg itself) first. We probably can proceed from there.
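For reference, these are the standard places such a hard limit usually lives (the PostgreSQL settings below are real server parameters; which one, if any, applies depends on your setup):

```sql
-- Server-side timeouts that can cancel a long-running statement or transaction:
SHOW statement_timeout;
SHOW idle_in_transaction_session_timeout;

-- In pgbouncer.ini, these settings cancel or kill long queries/transactions:
--   query_timeout = 300
--   idle_transaction_timeout = 300
```

A pgbouncer `query_timeout` in particular cancels the statement server-side, which matches the "canceling statement due to user request" error above.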

dmitry-vychikov commented on August 12, 2024

Closing and reopening the connections would not make a lot of sense in a single processing unit, which is expiration manager.

If we commit in batches, we might never hit this 5-minute timeout in the first place. But even if we do, that's not a big problem, because part of the cleanup will already be completed and we will continue on the next iteration.

On the other hand, I don't think it is that bad to open a fresh connection per batch...
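The approach being proposed can be sketched like this (a Python model with hypothetical helper names, not the actual Hangfire.PostgreSql implementation): each batch gets a fresh connection and is committed independently, so a timeout mid-run only loses the current batch.

```python
class FakeConnection:
    """Minimal stand-in for a DB connection; deletions become durable
    only on commit, and anything uncommitted is rolled back on close."""

    def __init__(self, store):
        self.store = store
        self.uncommitted = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.uncommitted = 0  # roll back anything not committed
        return False

    def delete_expired(self, batch_size):
        self.uncommitted = min(batch_size, len(self.store))
        return self.uncommitted

    def commit(self):
        del self.store[:self.uncommitted]
        self.uncommitted = 0


def expire_per_batch(store, batch_size):
    total = 0
    while True:
        with FakeConnection(store) as conn:  # fresh connection per batch
            removed = conn.delete_expired(batch_size)
            if removed == 0:
                return total
            conn.commit()  # this batch stays durable even if a later one fails
            total += removed


jobs = list(range(120))
print(expire_per_batch(jobs, batch_size=50))  # 120; jobs is now empty
```

With this structure the "Removed N record(s)" log line would finally be accurate, since each logged batch is already committed.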

azygis commented on August 12, 2024

I see we already kinda do that; each table gets a new connection. In that case I agree, we can try a connection per batch.

There's no reason to use the transaction if we go that way though. A single statement in a transaction is unnecessary.

vicentegnz commented on August 12, 2024

Are you using loadbalancer, like pgbouncer or similar? It's pretty much 5min each time, which means something kills the connection after 5min. Sounds like a hard timeout somewhere.

Yes, we have pgbouncer; that's something we're looking at. We probably have something at the PostgreSQL level that interrupts the transaction.

As @dmitry-vychikov said, if we commit in batches, after several iterations we will probably no longer see that timeout; at the very least, the log messages indicating that 'x' records have been removed will be true.

azygis commented on August 12, 2024

@vicentegnz even though the issue is closed, it would be nice to hear whether it helped in your case, if you can.

vicentegnz commented on August 12, 2024

@azygis after deploying the new version, our problem was solved. We did, however, hit a problem with the migration script that updates the job table (Install.v20.sql): due to the number of records in the table, the script did not finish its execution.

We solved it by waiting for the records to be deleted first; we were lucky that the new version's code was compatible with the old schema.

So I recommend that anyone with the same problem check, before updating the version, which scripts are going to be applied. In our case the job table was locked and we had to kill the query, since we were not processing anything.

from hangfire.postgresql.
