
Comments (3)

x4m commented on July 30, 2024

Hi!
Yes, wal-g returns a non-zero exit code on failure, and Postgres will then retry archiving the file. Please check:

  1. what is in pg_stat_archive
  2. change AWS endpoint and run your archive manually. Does it return a non-zero exit code?
  3. your restore_command. The error in the log that you posted is not from wal-fetch, it's from streaming replication
  4. contents of pg_wal/archive_status. There are .ready files for segments that are ready to be archived and .done files for segments that have already been archived.
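
For point 4, one quick way to see how many segments are still waiting is to count the status files. This is only a minimal sketch: `archive_status_summary` is a hypothetical helper name, and the directory you pass in is an assumption about your layout (pg_xlog/archive_status under the data directory on 9.6, pg_wal/archive_status on newer versions).

```shell
# archive_status_summary DIR: count the .ready and .done files in DIR.
# Hypothetical helper; DIR is whatever archive_status directory your cluster uses.
archive_status_summary() (
  dir=$1
  # $(( ... )) strips the padding that some wc implementations add
  ready=$(( $(find "$dir" -maxdepth 1 -type f -name '*.ready' | wc -l) ))
  finished=$(( $(find "$dir" -maxdepth 1 -type f -name '*.done' | wc -l) ))
  printf 'ready=%d done=%d\n' "$ready" "$finished"
)

# Example call (path taken from the configuration posted in this thread):
# archive_status_summary /srv/postgresql/9.6/main/pg_xlog/archive_status
```

A steadily growing ready count would suggest the archive command keeps failing, while a segment with neither file has already been recycled.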

from wal-g.

M1ha-Shvn commented on July 30, 2024

Hi, thanks for quick reply.

  1. what is in pg_stat_archive

Seems there are no errors?

archived_count |    last_archived_wal     |      last_archived_time       | failed_count | last_failed_wal | last_failed_time |          stats_reset
----------------+--------------------------+-------------------------------+--------------+-----------------+------------------+-------------------------------
          21949 | 000000080000B7EB00000004 | 2023-10-12 09:54:30.988245+00 |            0 |                 |                  | 2023-10-10 12:05:28.317102+00

  2. change AWS endpoint and run your archive manually. Does it return a non-zero exit code?

That was an accidental timeout error. Lots of WALs before and after the missing one were archived successfully. I've tried sending another existing WAL to a non-existent host, and I've tried sending a non-existent WAL. In both cases I got exit code 1, which seems correct. But I'm not sure how to reproduce the very same connection timeout in order to check that exact situation.
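
For anyone repeating this manual check, a tiny wrapper makes the exit code explicit. It is a sketch only: `report_exit_code` is a made-up helper name, and the wal-g invocation in the comment is an assumption about how the archive command is set up, with a placeholder path.

```shell
# report_exit_code CMD [ARGS...]: run CMD, print its exit code, and return it.
report_exit_code() {
  report_rc=0
  "$@" || report_rc=$?
  printf 'exit code: %d\n' "$report_rc"
  return "$report_rc"
}

# Example call (assumed command line, placeholder segment path):
# report_exit_code /usr/local/bin/wal-g wal-push /path/to/segment
```

Postgres only treats a zero exit code as a successful archive, so anything else printed here should lead to a retry of that segment.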

  3. your restore_command. The error in log that you posted is not from wal-fetch, it's from streaming replication

The problem here is not with restore. The file was never uploaded to S3 by the wal-push command; I've already checked the storage. Nevertheless, here are the restore commands:

# recovery.conf
restore_command = '/usr/local/scripts/walg-fetch.sh "%f" "%p"'
archive_cleanup_command = 'pg_archivecleanup /srv/postgresql/9.6/main/pg_xlog "%r"'

# /usr/local/scripts/walg-fetch.sh
set -o noclobber  # Avoid overlay files (echo "hi" > foo)
set -o errexit    # Used to exit upon error, avoiding cascading errors
set -o nounset    # Exposes unset variables
set -o pipefail   # Unveils hidden failures

set -o allexport
source /etc/default/wal-g
set +o allexport

# Quote the arguments so the file name and path reach wal-g unchanged
/usr/local/bin/wal-g wal-fetch "$1" "$2"

  4. contents of pg_wal/archive_status. There are .ready files for segments that are ready to be archived and .done files for segments that have already been archived.

The database is highly loaded. I've also checked that the file is no longer on the master server, in either pg_xlog or pg_xlog/archive_status: there is neither a .ready nor a .done file for it. The bug in wal-push occurred yesterday; since it was not expected, it went unnoticed. I found it only today, while restoring from backup. But the fact that the file has been removed from both pg_xlog and archive_status may mean that Postgres "thinks" it was uploaded successfully.
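
Since the missing segment was only noticed during a restore, a periodic gap check over the archived segment names might catch this earlier. The following is a rough sketch, not something wal-g provides: `wal_pos`, `wal_name` and `find_wal_gaps` are hypothetical helpers, and it assumes plain 24-hex-digit segment names (no suffixes, no .history or .backup files) and the default 16 MB segment size, i.e. 256 segments per log id.

```shell
# wal_pos NAME: numeric position of a segment within its timeline.
wal_pos() (
  log=$(printf '%s' "$1" | cut -c9-16)
  seg=$(printf '%s' "$1" | cut -c17-24)
  echo $(( 0x$log * 256 + 0x$seg ))
)

# wal_name TIMELINE POS: inverse of wal_pos, prints the 24-character name.
wal_name() {
  printf '%s%08X%08X\n' "$1" $(( $2 / 256 )) $(( $2 % 256 ))
}

# find_wal_gaps: read sorted segment names on stdin, print each missing name.
# Pairs that span a timeline switch are skipped rather than guessed at.
find_wal_gaps() (
  prev=""
  while IFS= read -r cur; do
    if [ -n "$prev" ]; then
      prev_tli=$(printf '%s' "$prev" | cut -c1-8)
      cur_tli=$(printf '%s' "$cur" | cut -c1-8)
      if [ "$prev_tli" = "$cur_tli" ]; then
        pos=$(( $(wal_pos "$prev") + 1 ))
        end=$(wal_pos "$cur")
        while [ "$pos" -lt "$end" ]; do
          wal_name "$cur_tli" "$pos"
          pos=$(( pos + 1 ))
        done
      fi
    fi
    prev=$cur
  done
)
```

Feeding it a sorted listing of the names in the bucket (with any compression suffix stripped first) prints every segment that lies between two archived neighbours but is itself absent.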


M1ha-Shvn commented on July 30, 2024

Got a very similar error while doing a backup with the backup-push command. It is weird: there were no retries, although I have WALG_S3_MAX_RETRIES=5.
Log:

ERROR: 2023/10/13 07:15:30.005112 failed to upload 'shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4' to bucket 'db-backup': MultipartUpload: upload multipart failed
        upload id: fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60
caused by: RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4?partNumber=10&uploadId=fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60": write tcp 172.16.200.144:49530->52.218.108.216:443: write: connection timed out
ERROR: 2023/10/13 07:15:30.005127 upload: could not upload 'base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4'
ERROR: 2023/10/13 07:15:30.005144 failed to upload 'shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4' to bucket 'db-backup': MultipartUpload: upload multipart failed
        upload id: fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60
caused by: RequestError: send request failed
caused by: Put "https://db-backup.s3.dualstack.eu-west-1.amazonaws.com/shard_2/basebackups_005/base_000000080000B806000000A8_D_000000080000B7A700000047/tar_partitions/part_090.tar.lz4?partNumber=10&uploadId=fiZmvFty6a_GcOq4q4iQ81KBySt3IvMSw_LOa9bjbGodNitPA1Hizb2EAubMRTZgT.puzOTzJBTV5G8ZMJL4fnlvejD0_9yMb1j5DwEGg0irJG257jKTlkhZpiV2jo60": write tcp 172.16.200.144:49530->52.218.108.216:443: write: connection timed out
ERROR: 2023/10/13 07:15:30.005154 Unable to continue the backup process because of the loss of a part 90.
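
Whether WALG_S3_MAX_RETRIES is supposed to cover a timed-out multipart part is not settled in this thread. As a stopgap, an outer retry loop around the whole command could look like the sketch below. `retry` is a hypothetical helper, not part of wal-g, and the attempt count and delay are arbitrary. Note that re-running backup-push this way starts the backup over rather than resuming the failed part.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Run CMD until it succeeds or MAX_ATTEMPTS is reached; return the last exit code.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while true; do
    "$@" && return 0
    retry_rc=$?
    if [ "$retry_n" -ge "$retry_max" ]; then
      return "$retry_rc"
    fi
    echo "attempt $retry_n failed with exit code $retry_rc; retrying in ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_n=$(( retry_n + 1 ))
  done
}

# Example call (assumed command line, placeholder data directory):
# retry 3 60 /usr/local/bin/wal-g backup-push /path/to/pgdata
```

Because the final exit code is passed through, a caller such as cron or a monitoring script still sees the failure once all attempts are exhausted.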
