
wal-g's Introduction

WAL-G


This documentation is also available at wal-g.readthedocs.io

WAL-G is an archival restoration tool for PostgreSQL, MySQL/MariaDB, and MS SQL Server (beta for MongoDB and Redis).

WAL-G is the successor of WAL-E with a number of key differences. WAL-G uses LZ4, LZMA, ZSTD, or Brotli compression, multiple processors, and non-exclusive base backups for Postgres. More information on the original design and implementation of WAL-G can be found on the Citus Data blog post "Introducing WAL-G by Citus: Faster Disaster Recovery for Postgres".

Table of Contents

Installation

A precompiled binary of the latest WAL-G version for Linux amd64 can be obtained under the Releases tab.

The binary name has the following format: wal-g-DBNAME-OSNAME, where DBNAME is the name of the database (for example, pg or mysql) and OSNAME is the name of the operating system the binary was built for.

To decompress the binary, use:

tar -zxvf wal-g-DBNAME-OSNAME-amd64.tar.gz
mv wal-g-DBNAME-OSNAME-amd64 /usr/local/bin/wal-g

For example, for Postgres and Ubuntu 18.04:

tar -zxvf wal-g-pg-ubuntu-18.04-amd64.tar.gz
mv wal-g-pg-ubuntu-18.04-amd64 /usr/local/bin/wal-g

For other systems, please consult the Development section for more information.

WAL-G supports bash and zsh autocompletion. Run wal-g help completion for more info.
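
After installation, a quick sanity check (assuming the binary was placed on your PATH as shown above):

wal-g --version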

Configuration

There are two ways to configure WAL-G:

  1. Using environment variables

  2. Using a config file

    The --config /path flag can be used to specify the path to the config file.

    We support every format that the viper package supports: JSON, YAML, envfile and others.

Every configuration variable mentioned in the following documentation can be specified either as an environment variable or a field in the config file.
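
For example, a minimal sketch of a YAML config file (the storage and region settings below are placeholders; see the Storages section for what your storage actually requires):

cat > /etc/wal-g/config.yaml <<EOF
WALG_S3_PREFIX: "s3://my-bucket/wal-g"
AWS_REGION: "eu-west-1"
WALG_COMPRESSION_METHOD: "zstd"
EOF

wal-g backup-list --config /etc/wal-g/config.yaml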

Storage

To configure where WAL-G stores backups, please consult the Storages section.

Compression

  • WALG_COMPRESSION_METHOD

To configure the compression method used for backups. Possible options are lz4, lzma, zstd, and brotli. The default method is lz4. LZ4 is the fastest method, but it has the worst compression ratio. LZMA is much slower; however, it compresses backups about 6 times better than LZ4. Brotli and zstd are a good trade-off between speed and compression ratio, compressing about 3 times better than LZ4.
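
For example, a minimal sketch of switching to zstd (any of the four methods above is valid):

export WALG_COMPRESSION_METHOD=zstd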

Encryption

  • YC_CSE_KMS_KEY_ID

To configure the Yandex Cloud KMS key for client-side encryption and decryption. By default, no encryption is used.

  • YC_SERVICE_ACCOUNT_KEY_FILE

To configure the name of a file containing the private key of a Yandex Cloud service account. If not set, a token from the metadata service (http://169.254.169.254) will be used to make API calls to Yandex Cloud KMS.

  • WALG_LIBSODIUM_KEY

To configure encryption and decryption with libsodium. WAL-G uses an algorithm that only requires a secret key. libsodium keys are fixed-size, 32-byte keys. For optimal cryptographic security, it is recommended to use a random 32-byte key. To generate one, you can use something like openssl rand -hex 32 (set WALG_LIBSODIUM_KEY_TRANSFORM to hex) or openssl rand -base64 32 (set WALG_LIBSODIUM_KEY_TRANSFORM to base64); see the sketch after the libsodium settings below.

  • WALG_LIBSODIUM_KEY_PATH

Similar to WALG_LIBSODIUM_KEY, but the value is the path to the key on the file system. The file content will be trimmed of whitespace characters.

  • WALG_LIBSODIUM_KEY_TRANSFORM

The transform that will be applied to WALG_LIBSODIUM_KEY to get the required 32-byte key. Supported transformations are base64, hex, or none (the default). The none option exists for backwards compatibility: the user input is converted to 32 bytes either by truncation or by zero-padding.
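
A minimal sketch combining the libsodium settings above (file paths are arbitrary examples):

# Keep the key in the environment, hex-encoded
export WALG_LIBSODIUM_KEY="$(openssl rand -hex 32)"
export WALG_LIBSODIUM_KEY_TRANSFORM=hex

# Or keep the key in a file instead, base64-encoded
openssl rand -base64 32 > /etc/wal-g/libsodium.key
export WALG_LIBSODIUM_KEY_PATH=/etc/wal-g/libsodium.key
export WALG_LIBSODIUM_KEY_TRANSFORM=base64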

  • WALG_GPG_KEY_ID (alternative form WALE_GPG_KEY_ID) ⚠️ DEPRECATED

To configure the GPG key for encryption and decryption. By default, no encryption is used. The public keyring is cached in the file "/.walg_key_cache".

  • WALG_PGP_KEY

To configure encryption and decryption with the OpenPGP standard. You can join a multiline key into one line using \n symbols (mostly used with daemontools and envdir). Set the private key value when you need to execute the wal-fetch or backup-fetch command. Set the public key value when you need to execute the wal-push or backup-push command. Keep in mind that the private key also contains the public key. See the sketch after the PGP settings below.

  • WALG_PGP_KEY_PATH

Similar to WALG_PGP_KEY, but the value is the path to the key on the file system.

  • WALG_PGP_KEY_PASSPHRASE

If your private key is encrypted with a passphrase, set that passphrase here for decryption.
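
A minimal sketch of the PGP settings above ("backup@example.com" is a hypothetical key ID):

# On the host running wal-push / backup-push, the public key is sufficient
export WALG_PGP_KEY="$(gpg --export --armor backup@example.com)"

# On the host running wal-fetch / backup-fetch, the private key is required
# (it also contains the public part)
export WALG_PGP_KEY="$(gpg --export-secret-keys --armor backup@example.com)"
export WALG_PGP_KEY_PASSPHRASE='passphrase-if-the-private-key-is-protected'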

  • WALG_ENVELOPE_PGP_KEY

To configure encryption and decryption with an envelope PGP key stored in a key management system. This option allows you to manage your PGP keys securely by storing them in the KMS. The key passed must be encrypted with KMS and encoded with base64. Both the private and public parts must be present in the key, because the envelope key is injected into the backup metadata and used later by wal-fetch/backup-fetch.

Please note that currently only Yandex Cloud Key Management Service (KMS) is supported. Ensure that you have set up and configured Yandex Cloud KMS (see the settings below) before attempting to use this feature.

  • WALG_ENVELOPE_CACHE_EXPIRATION

This setting controls the expiration of KMS responses. The default value is 0, which stores keys permanently in memory. Note that if the key cannot be re-decrypted in KMS after expiration, the previous response will be used.

  • WALG_ENVELOPE_PGP_YC_ENDPOINT

Endpoint is an API endpoint of Yandex.Cloud against which the SDK is used. Most users won't need to explicitly set it.

  • WALG_ENVELOPE_PGP_YC_CSE_KMS_KEY_ID

Similar to YC_CSE_KMS_KEY_ID, but only used for envelope PGP keys.

  • WALG_ENVELOPE_PGP_YC_SERVICE_ACCOUNT_KEY_FILE

Similar to YC_SERVICE_ACCOUNT_KEY_FILE, but only used for envelope PGP keys.

  • WALG_ENVELOPE_PGP_KEY_PATH

Similar to WALG_ENVELOPE_PGP_KEY, but the value is the path to the key on the file system.

Monitoring

  • WALG_STATSD_ADDRESS

To enable metrics publishing to statsd or statsd_exporter. Metrics will be sent on a best-effort basis via UDP. The default port for statsd is 8125.

  • WALG_STATSD_EXTRA_TAGS

Use this setting to add static tags (host, operation, database, etc.) to the metrics WAL-G publishes to statsd.

If you want to run a demo for testing purposes, you can use the graphite service from the docker-compose file.
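
A minimal sketch of enabling metrics publishing (the address assumes a statsd or statsd_exporter running locally on the default port):

export WALG_STATSD_ADDRESS=localhost:8125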

Profiling

Profiling is useful for identifying bottlenecks within WAL-G.

  • PROFILE_SAMPLING_RATIO

A float value between 0 and 1 that defines the likelihood of the profiler being enabled. When set to 1, it will always run. This allows probabilistic sampling of invocations: since WAL-G processes may be created several times per second (e.g., by wal-g wal-push), we do not want to profile all of them.

  • PROFILE_MODE

The type of pprof profiler to use. Can be one of cpu, mem, mutex, block, threadcreation, trace, goroutine. See the runtime/pprof docs for more information. Defaults to cpu.

  • PROFILE_PATH

The directory to store profiles in. Defaults to $TMPDIR.
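
A sketch of a sampling setup (the ratio and directory are arbitrary choices); collected profiles can then be inspected with the standard Go tooling, e.g. go tool pprof:

# Profile roughly 1% of invocations and collect CPU profiles into a dedicated directory
export PROFILE_SAMPLING_RATIO=0.01
export PROFILE_MODE=cpu
export PROFILE_PATH=/tmp/wal-g-profiles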

Rate limiting

  • WALG_NETWORK_RATE_LIMIT

Network traffic rate limit during the backup-push/backup-fetch operations in bytes per second.
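
For example, to cap traffic at roughly 10 MiB/s (the value is plain bytes per second):

export WALG_NETWORK_RATE_LIMIT=10485760   # 10 * 1024 * 1024 bytes per second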

Database-specific options

More options are available for the chosen database; see them in the Databases section.

Usage

WAL-G currently supports these commands for all types of databases:

backup-list

Lists names and creation time of available backups.

The --pretty flag prints the list as a table

The --json flag prints the list in JSON format, pretty-printed if combined with --pretty

The --detail flag prints extra backup details, pretty-printed if combined with --pretty and JSON-encoded if combined with --json
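
A few usage sketches combining the flags above:

wal-g backup-list                    # plain list of names and creation times
wal-g backup-list --pretty           # the same list as a table
wal-g backup-list --detail --pretty  # table with extra backup details
wal-g backup-list --detail --json    # JSON-encoded details, convenient for scripting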

delete

Used to delete backups and the WALs before them. By default, delete performs a dry run. To actually execute the deletion, add the --confirm flag at the end of the command. Backups marked as permanent will not be deleted.

delete can operate in four modes: retain, before, everything and target.

retain [FULL|FIND_FULL] %number% [--after %name|time%]

If FULL is specified, keep %number% full backups and everything in the middle. If the --after flag is used, keep the %number% most recent backups plus all backups made after %name|time% (inclusive).

before [FIND_FULL] %name%

If FIND_FULL is specified, WAL-G will calculate the minimum backup needed to keep all deltas alive. If FIND_FULL is not specified and the call would produce orphaned deltas, the call will fail and list them.

everything [FORCE]

target [FIND_FULL] %name% | --target-user-data %data% will delete the backup specified by name or user data. Unlike other delete commands, this command does not delete any archived WALs.

(Only in Postgres & MySQL) By default, if a delta backup is given as the target, WAL-G will also delete all dependent delta backups. If FIND_FULL is specified, WAL-G will delete all backups with the same base backup as the target.

Examples

everything all backups will be deleted (if there are no permanent backups)

everything FORCE all backups, including permanent ones, will be deleted

retain 5 will fail if the 5th backup is a delta

retain FULL 5 will keep 5 full backups and all of their deltas

retain FIND_FULL 5 will find the full backup needed for the 5th backup and keep everything after it

retain 5 --after 2019-12-12T12:12:12 will keep the 5 most recent backups plus all backups made after 2019-12-12 12:12:12

before base_000010000123123123 will fail if base_000010000123123123 is a delta

before FIND_FULL base_000010000123123123 will keep everything after the base of base_000010000123123123

target base_0000000100000000000000C9 deletes the base backup and all dependent delta backups

target --target-user-data "{ \"x\": [3], \"y\": 4 }" deletes the backup specified by user data

target base_0000000100000000000000C9_D_0000000100000000000000C4 deletes the delta backup and all dependent delta backups

target FIND_FULL base_0000000100000000000000C9_D_0000000100000000000000C4 deletes the delta backup and all delta backups with the same base backup
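
A typical retention workflow, sketched with the retain mode described above:

# Dry run: show which backups (and the WALs before them) would be removed
wal-g delete retain FULL 5

# Execute the deletion
wal-g delete retain FULL 5 --confirm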

More commands are available for the chosen database engine; see them in the Databases section.

Storage tools

The wal-g st command series allows direct interaction with the configured storage. See the Storage tools documentation.

Databases

PostgreSQL

Information about installing, configuration and usage

MySQL/MariaDB

Information about installing, configuration and usage

SQLServer

Information about installing, configuration and usage

Mongo [Beta]

Information about installing, configuration and usage

FoundationDB [Work in progress]

Information about installing, configuration and usage

Redis [Beta]

Information about installing, configuration and usage

Greenplum [Work in progress]

Information about installing, configuration and usage

ETCD [Work in progress]

Information about installing, configuration and usage

Development

The following steps describe how to build WAL-G for PostgreSQL, but the process is the same for other databases. For example, to build WAL-G for MySQL, use make mysql_build instead of make pg_build.

Optional:

  • To build with brotli compressor and decompressor, set the USE_BROTLI environment variable.
  • To build with libsodium, set the USE_LIBSODIUM environment variable.
  • To build with lzo decompressor, set the USE_LZO environment variable.

Installing

Ubuntu

# Install latest Go compiler
sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt update
sudo apt install golang-go

# Install lib dependencies
sudo apt install libbrotli-dev liblzo2-dev libsodium-dev curl cmake

# Fetch project and build
# Go 1.15 and below
go get github.com/wal-g/wal-g
# Go 1.16+ - just clone repository to $GOPATH
# if you want to save space add --depth=1 or --single-branch
git clone https://github.com/wal-g/wal-g $(go env GOPATH)/src/github.com/wal-g/wal-g

cd $(go env GOPATH)/src/github.com/wal-g/wal-g

# optional exports (see above)
export USE_BROTLI=1
export USE_LIBSODIUM=1
export USE_LZO=1

make deps
make pg_build
main/pg/wal-g --version

Users can also install WAL-G using make pg_install. Setting the GOBIN environment variable before installing lets the user choose the installation location. By default, make pg_install puts the compiled binary in the root directory (/).

export USE_BROTLI=1
export USE_LIBSODIUM=1
export USE_LZO=1
make pg_clean
make deps
GOBIN=/usr/local/bin make pg_install

macOS

# brew command is Homebrew for Mac OS
brew install cmake

# Fetch project and build
# Go 1.15 and below
go get github.com/wal-g/wal-g
# Go 1.16+ - just clone repository to $GOPATH
# if you want to save space add --depth=1 or --single-branch
git clone https://github.com/wal-g/wal-g $(go env GOPATH)/src/github.com/wal-g/wal-g

cd $(go env GOPATH)/src/github.com/wal-g/wal-g

export USE_BROTLI=1
export USE_LIBSODIUM="true" # since we're linking libsodium later
./link_brotli.sh
./link_libsodium.sh
make install_and_build_pg

# if you need to install
GOBIN=/usr/local/bin make pg_install

To build on ARM64, set the corresponding GOOS/GOARCH environment variables:

env GOOS=darwin GOARCH=arm64 make install_and_build_pg

The compiled binary to run is main/pg/wal-g

Testing

WAL-G relies heavily on unit tests. These tests do not require S3 configuration as the upload/download parts are tested using mocked objects. Unit tests can be run using

export USE_BROTLI=1
make unittest

For more information on testing, please consult the test, testtools, and unittest sections in the Makefile.

WAL-G will perform a round-trip compression/decompression test that generates a directory for data (e.g., data...), compressed files (e.g., compressed), and extracted files (e.g., extracted). These directories will only get cleaned up if the files in the original data directory match the files in the extracted one.

Test coverage can be obtained using:

export USE_BROTLI=1
make coverage

This command generates a coverage.out file and opens an HTML representation of the coverage.

Development on Windows

Information about installing and usage

Troubleshooting

A good way to start troubleshooting problems is by setting one or both of these environment variables:

  • WALG_LOG_LEVEL=DEVEL

Prints the configuration used by WAL-G and detailed logs of the executed command.

  • S3_LOG_LEVEL=DEVEL

If your commands seem to be stuck, it could be that S3 is unreachable, that there are certificate problems, or other S3-related issues. With this environment variable set, you can see the requests to and responses from S3.
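
For example, a one-off troubleshooting run with both log levels raised (backup-list is simply a convenient read-only command for testing connectivity):

WALG_LOG_LEVEL=DEVEL S3_LOG_LEVEL=DEVEL wal-g backup-list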

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the Apache License, Version 2.0, but the lzo support is licensed under GPL 3.0+. Please refer to the LICENSE.md file for more details.

Acknowledgments

WAL-G would not have happened without the support of Citus Data

WAL-G came into existence as a result of the collaboration between a summer engineering intern at Citus, Katie Li, and Daniel Farina, the original author of WAL-E, who currently serves as a principal engineer on the Citus Cloud team. Citus Data also has an open-source extension to Postgres that distributes database queries horizontally to deliver scale and performance.

WAL-G development is supported by Yandex Cloud

Chat

We have a Slack group and a Telegram chat to discuss WAL-G usage and development. To join the PostgreSQL Slack, use the invite app.

wal-g's People

Contributors

apelsin234, debebantur, dependabot[bot], fdr, ferhatelmas, fizic, g0djan, incubusrk, kadukm, katie31, khurtindn, legec, leoltron, mialinx, munakoiso, ostinru, perekalov, proggga, pushrbx, rdjjke, reshke, savichev-igor, sebasmannem, sergey-arefev, serprex, tinsane, tri0l, usernamedt, vgoshev, x4m


wal-g's Issues

Issues with parallel backup-push

I've been testing out the parallel backup-push that is currently in master, and have run into a few breaking scenarios.

First, and foremost, if the backup doesn't crash it never finishes. It gets through the last files, and seems to freeze while waiting for the last few to upload. Example:

...
/pg_xlog
/recovery.done
/server.crt
/server.key
Finished writing part 78.

Secondly, I've run into a couple of crashes, details of which can be found in this gist. I've hit the "Wait Group" exception the vast majority of the time, and the "concurrent map iteration and map write" a handful of times.

I've not yet been able to track down the cause of any of these issues, but I haven't spent a lot of time looking.

WAL prefetch

@fdr , I'll create an issue here to keep the discussion on implementation details open.

Brief description

I'm working now on the WAL-prefetch feature. Postgres invokes the wal-fetch command only when it has nothing else to do. This is not a very performant design: we can download WALs that will probably be needed while Postgres is busy replaying what it already has.

Details

In the email conversation, @fdr stated that we should prefetch 8 files ahead of what was asked by Postgres.

I have a few more questions:

  1. Should we trigger prefetch after fetch, or start prefetch along with fetch? I propose triggering prefetch just before returning.
  2. The current wal-fetch command does the following before starting:
    a. Check if there is a .lzo from WAL-E
    b. Check if there is a .lz4 from WAL-G
    c. Start downloading whatever is there
    I propose changing this behavior towards a more optimistic path:
    a. Start downloading the .lz4
    b. If that fails, start downloading the .lzo
    c. If that fails, return failure

archive file has wrong size

Hi, we are currently testing wal-g with a small database that writes an entry every minute. When we try to restore the DB, we sometimes get this error:

2018-02-10 05:48:59.478 EST [25962] FATAL:  archive file "000000010000000000000028" has wrong size: 8388608 instead of 16777216

If we start over (rm -rf the data dir, backup-fetch, then recovery), we sometimes manage to fully restore the db, sometimes we get another similar error:

2018-02-10 06:01:29.671 EST [27419] FATAL:  archive file "000000010000000000000027" has wrong size: 8388608 instead of 16777216

The relevant part of the postgresql.conf:

wal_level = logical
archive_mode = on
archive_command = 'envdir /etc/wal-g.d/env /usr/local/bin/wal-g wal-push %p'
archive_timeout = 60

The recovery.conf file when we try to restore the db on a secondary cluster:

restore_command = 'envdir /etc/wal-g.d/env /usr/local/bin/wal-g wal-fetch "%f" "%p"'

We are not using GPG and we only declare basic environment variables:

ls -l /etc/wal-g.d/env/
total 12
-rwxr-x--- 1 root postgres 13 Feb  9 04:32 AWS_REGION
-rwxr-x--- 1 root postgres 16 Feb  9 04:53 PGHOST
-rwxr-x--- 1 root postgres 39 Feb  9 04:32 WALE_S3_PREFIX

wal-g crashes when datadir is symlink.

Hi,
wal-g crashes when it's trying to push a backup and the datadir is a symlink.
It's not a critical issue, but it seems to me it's better to return an error (or walk to the destination dir) instead of crashing.
wal-g version 0.1.3

postgres@pg0:~$ ls -l /var/lib/postgresql/9.6/main
lrwxrwxrwx 1 postgres postgres 16 Nov 21 07:19 /var/lib/postgresql/9.6/main -> /data/postgresql
postgres@pg0:~$ wal-g backup-push /var/lib/postgresql/9.6/main
BUCKET: production-backups
SERVER: 
Walking ...
Starting part 1 ...

Finished writing part 1.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x89766b]

goroutine 1 [running]:
github.com/wal-g/wal-g.(*Bundle).HandleSentinel(0xc4201579e0, 0x0, 0x0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/upload.go:281 +0x3b
github.com/wal-g/wal-g.HandleBackupPush(0x7ffe8fd6be04, 0x1c, 0xc42011cfc0, 0xc420188660)
        /home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:555 +0x796
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107 +0x6db
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ wal-g backup-push /data/postgresql/
BUCKET: production-backups
SERVER: 
Walking ...
Starting part 1 ...

PG_VERSION
backup_label.old
base
base/1
....

create restore checkpoint right before every backup

Hi.

I was wondering if it makes sense to create a checkpoint right before each backup, so that when using a backup combined with recovery.conf we can set recovery_target_name so it doesn't restore beyond the checkpoint for the backup, while still being able to fetch all the WAL logs necessary to reach the point of recovery via restore_command.

It should be as easy as calling pg_create_restore_point('base_<wal-id>') once we compose the name of wal log for the backup.

This way when you list all backups with backup-list, you know you can use the name of the backup as recovery_target_name to perform PITR.

Let me know if this makes sense, I can send a PR for it.

DecompressLzo: write to pipe failed

Versions

CentOS 7.3
wal-g v0.1.2
wal-e 1.0.3 (creator of source basebackup)

Problem

Two attempts to backup-fetch a ~1TB basebackup have resulted in wal-g failing with the following stack trace:

base/16417/12983_vm
base/16417/27620292
base/16417/10323582
base/16417/10324516
base/16417/33825612_fsm
2017/08/29 20:07:43 DecompressLzo: write to pipe failed
github.com/wal-g/wal-g.DecompressLzo
        /home/travis/gopath/src/github.com/wal-g/wal-g/decompress.go:126
github.com/wal-g/wal-g.tarHandler
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:66
github.com/wal-g/wal-g.ExtractAll.func2.2
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:138
runtime.goexit
        /home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197
ExtractAll: lzo decompress failed
github.com/wal-g/wal-g.tarHandler
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:68
github.com/wal-g/wal-g.ExtractAll.func2.2
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:138
runtime.goexit
        /home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197

In both cases, wal-g appeared to be near the end of the restore (over 1TB of data was written to the restore directory) and failed with the same trace. After inspecting the restore and attempting to start postgres, I can confirm that the restore is indeed incomplete.

The basebackup was taken with wal-e 1.0.3, which was also able to restore the same backup without any issues.

Permission denied for function pg_start_backup

I am trying to get backup-push working but keep hitting a permissions problem.

$ sudo -u postgres /usr/local/bin/wal-g-wrapper backup-push /var/lib/postgresql/9.6/main
BUCKET: mybucket
SERVER: db
2017/09/20 13:14:50 ERROR: permission denied for function pg_start_backup (SQLSTATE 42501)
QueryFile: start backup failed
github.com/wal-g/wal-g.StartBackup
	/home/travis/gopath/src/github.com/wal-g/wal-g/connect.go:36
main.main
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:280
runtime.main
	/home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/proc.go:185
runtime.goexit
	/home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197

The file wal-g-wrapper sets up the env variables and then calls wal-g:

$ cat /usr/local/bin/wal-g-wrapper
#!/bin/bash
#
# Passes all arguments through to wal-g with correct env variables.

export WALE_S3_PREFIX=s3://mybucket/db

export AWS_ACCESS_KEY_ID=<redacted>
export AWS_SECRET_ACCESS_KEY=<redacted>
export AWS_REGION=eu-west-2

export PGUSER=myuser
export PGPASSWORD=thepassword
export PGDATABASE=mydatabase

/usr/local/bin/wal-g "$@"

The myuser postgresql user/role has permission for the mydatabase database. Initially it didn't have any role attributes so I gave it replication thinking that would solve the permission problem. It didn't.

So then I gave it the superuser attribute and backup-push was able to run successfully.

Should the replication attribute alone be sufficient for backup-push? If so, how can I get it to work?

This is with WAL-G v0.1.2 and postgresql 9.6 on Ubuntu 16.04.

Thanks!

Deletion Failures

Seeing errors such as this when deleting a large backlog of WALs (this is the first time I've tested multi-page WAL deletion, i.e. more than 1000 objects):

2017/12/05 22:32:45 Unable to delete WALS before base_000000010000006A00000073MalformedXML: The XML you provided was not well-formed or did not validate against our published schema
	status code: 400, request id: E783694080458AC0, host id: p7AI/ZTMbb9aeRdsouupEC0ziU4w6Gy2H3AK6EWdJbpwo8tFnabJr81OGfbaf1frDUbvdgYYqog=

Surprising UX followed by panic

I have just added wal-g to one of my machines, but I don't entirely understand how to use it. Here is a transcript from a first time user:

[root@hydra:~]# wal-g
Please choose a command:
  backup-fetch  fetch a backup from S3
  backup-push   starts and uploads a finished backup to S3
  wal-fetch     fetch a WAL file from S3
  wal-push      upload a WAL file to S3

[root@hydra:~]# wal-g backup-push
Please choose a command:
  backup-fetch  fetch a backup from S3
  backup-push   starts and uploads a finished backup to S3
  wal-fetch     fetch a WAL file from S3
  wal-push      upload a WAL file to S3

[root@hydra:~]# wal-g backup-push --help
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/wal-g/wal-g.Configure(0x10, 0xc42016e540, 0x0, 0x18)
        /tmp/nix-build-wal-g-0.1.2.drv-0/go/src/github.com/wal-g/wal-g/upload.go:77 +0xa6e
main.main()
        /tmp/nix-build-wal-g-0.1.2.drv-0/go/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:83 +0x1f6

I didn't expect the second command (wal-g backup-push) to work, but I did expect it to inform me about the path. It didn't, so I called with --help which caused some kind of panic and crash.

AWS_REGION should be optional

Right now, the user has to specify WALE_S3_PREFIX and AWS_REGION. However, WALE_S3_PREFIX indicates an S3 bucket, S3 buckets have globally unique names, and the s3:GetBucketLocation API can take a S3 bucket name and return its region. AWS_REGION should therefore be optional, with wal-g able to determine the bucket's region automatically.

Need to strip more from the archive path

I was giving wal-g backup+restore a try and I think I spotted a difference in behavior from WAL-E:

While it is true most tar archives box all archive contents in a directory (e.g. postgres.tar.gz would untar to the directory postgres), not so for Postgres backups, because there is no sensible/idiomatic directory name in which a database directory is contained.

Thus the member data/pg_hba.conf may be better as merely pg_hba.conf.

If not for this, I think the whole round trip from archive to restore is working!

Crash when fetching

Unexpected EOF

github.com/katie31/wal-g.(*FileTarInterpreter).Interpret
	/home/mapi/var/go/src/github.com/katie31/wal-g/tar.go:55
github.com/katie31/wal-g.extractOne
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:37
github.com/katie31/wal-g.ExtractAll.func1.2
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:106
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:2197
extractOne: Interpret failed
github.com/katie31/wal-g.extractOne
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:39
github.com/katie31/wal-g.ExtractAll.func1.2
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:106
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:2197
/home/mapi/lib/util.rb:4:in `r': unhandled exception

Missing WAL

Hi,

I'm running into an issue with a missing WAL file. Currently I have postgres archiving with both wal-e and wal-g

archive_command = '/usr/bin/envdir /etc/wal-e.d/writer /usr/local/bin/wal-e wal-push %p && /usr/bin/envdir /etc/wal-g.d/writer /usr/local/bin/wal-g wal-push %p'

However when restoring from wal-g, postgres reports an error:

2018/02/05 22:18:06 WAL-prefetch file:  000000010000B6A90000008E
2018/02/05 22:18:06 Archive '000000010000B6A90000008E' does not exist.
2018-02-05 22:18:06.792 UTC [15164] LOG:  restored log file "000000010000B6A90000008D" from archive
2018-02-05 22:18:07.580 UTC [15164] FATAL:  WAL ends before end of online backup
2018-02-05 22:18:07.580 UTC [15164] HINT:  All WAL generated while online backup was taken must be available at recovery.

In the wal-g log it does in fact seem to skip right over 000000010000B6A90000008E

BUCKET: database
SERVER: wal-g/db3
WAL PATH: wal-g/db3/wal_005/000000010000B6A90000008D.lz4

BUCKET: database
SERVER: wal-g/db3
WAL PATH: wal-g/db3/wal_005/000000010000B6A90000008F.lz4

The wal-e log does show this segment:

Feb  5 04:12:41 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008D" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:41.865887-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008D state=begin

Feb  5 04:12:41 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008E" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:41.869823-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008E state=begin

Feb  5 04:12:42 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo" complete at 14886.7KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:42.557404-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo prefix=wal-e/db3/ rate=14886.7 seg=000000010000B6A90000008D state=complete

Feb  5 04:12:42 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo" complete at 11440.3KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:42.792245-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo prefix=wal-e/db3/ rate=11440.3 seg=000000010000B6A90000008E state=complete

Feb  5 04:12:44 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008F" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:44.996389-00 pid=15545 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008F state=begin

Feb  5 04:12:46 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo" complete at 5754.38KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:46.528526-00 pid=15545 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo prefix=wal-e/db3/ rate=5754.38 seg=000000010000B6A90000008F state=complete

I do not see any errors reported in the log.

Is there anything I can provide that would help figure out why the segment is missing?

Thanks!

backup-push to size-limited bucket

@istarling recently reported an interesting problem: when there is not enough space in the bucket, WAL-G retries every file many times.
I think the best solution would be to remove the existing retry infrastructure in favor of the AWS SDK's built-in retries.
This may interfere with #74. @tamalsaha, what do you think: if I do that in a week or two, will it cause problems for your implementation of #74?

S3 only

Will the scope of WAL-G remain S3 only, or should it grow support for OpenStack Swift and Azure like WAL-E?

Logs for upload retry errors

It's a bit hard to debug what the issue is with failed uploads; the following error doesn't tell much:

postgres_1  | 2017/08/19 11:54:20 upload: failed to upload 'pg_xlog/000000010000000000000091'. Restarting in 1.75 seconds

A more useful output would be:

2017/08/19 11:56:27 upload: failed to upload 'pg_xlog/000000010000000000000091': RequestError: send request failed
postgres_1  | caused by: Put https://backups.s3.amazonaws.com/wal-g/wal_005/000000010000000000000091.lz4: x509: certificate signed by unknown authority. Restarting in 2.02 seconds
service_1   | Insert

upload.go

if multierr, ok := e.(s3manager.MultiUploadFailure); ok {
	log.Printf("upload: failed to upload '%s' with UploadID '%s'. Restarting in %0.2f seconds", path, multierr.UploadID(), et.wait)
} else {
	log.Printf("upload: failed to upload '%s': %s. Restarting in %0.2f seconds", path, e.Error(), et.wait)
}

runtime error: slice bounds out of range

Hello,

I'm interested in trying out wal-g. Currently I'm trying to restore an existing backup that was created by wal-e and I'm getting an error:

envdir /etc/wal-e.d/reader /usr/local/bin/wal-g backup-fetch /var/lib/postgresql/9.2/main LATEST
BUCKET: database-backups
SERVER: path/to/backup
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.main()
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:136 +0x1890

These are the environment variables in /etc/wal-e.d/reader/:

ls -1 /etc/wal-e.d/reader/
AWS_ACCESS_KEY_ID
AWS_REGION
AWS_SECRET_ACCESS_KEY
WALE_S3_PREFIX

Any idea what I'm doing wrong?

Thanks!

Refactoring

Hi!
There are a few places that I want to refactor, e.g.:

etc.

What do you think: is it feasible to create such PRs? Will they be reviewed, or do they interfere with your plans for product development?

Tablespace Support

Does WAL-G support tablespace backups? It seems WAL-G didn't pick up the data from the tablespace, and during restoration it didn't ask for any tablespace details (RESTORE SPEC) like --restore-spec, which WAL-E asks for when using the backup-fetch command.

Upon starting the restored cluster, it throws an error about the missing tablespace.

Are we missing anything here?

Any pointers in this direction will be appreciated.

Recreate folder structure during backup-fetch

Seems like this commit is incomplete.
I've encountered a problem with absent pg_logical/snapshots and pg_logical/mappings.
Getting errors
[ 2017-12-04 15:29:14.792 MSK ,,,402290,58P01 ]:ERROR: could not open directory "pg_logical/snapshots": No such file or directory
until a manual mkdir.

There are two possible options: add empty marker files to the tar, or save empty folders to the JSON sentinel.

backup-push runtime error

Hi,

I'm getting the following error when attempting a backup-push. Any ideas on what might be wrong? Please let me know if I can provide any more information.

Thanks!

/usr/bin/envdir /etc/wal-g.d/writer /usr/local/bin/wal-g backup-push /var/lib/postgresql/9.2/main
BUCKET: database
SERVER: wal-g/db3-00:25:90:f5:33:aa
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/wal-g/wal-g.ParseLsn(0x0, 0x0, 0x3, 0x3, 0xc38620)
        /home/travis/gopath/src/github.com/wal-g/wal-g/timeline.go:29 +0x223
github.com/wal-g/wal-g.(*Bundle).StartBackup(0xc4201519e0, 0xc420012700, 0xc4201d6ed0, 0x26, 0x26, 0x433b2e, 0xc4200001a0, 0x200000003, 0xc4200001a0, 0xc420020000)
        /home/travis/gopath/src/github.com/wal-g/wal-g/connect.go:76 +0x361
github.com/wal-g/wal-g.HandleBackupPush(0xc42017c900, 0x1c, 0xc420116e70, 0xc42017c6c0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:549 +0x318
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107 +0x6db

wal-g should be less smart about AWS credentials

wal-g requires the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in order to prepare an aws.Config that uses them. This allows wal-g to return an error message early if it's not configured, but this approach breaks three other methods of supplying credentials which I actually use:

The AWS SDK for Go enables all of these by default, in addition to allowing configuration via AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Removing the environment check and leaving Credentials unspecified will preserve the current behavior and re-enable all the other ways that typical AWS tools search for credentials.

Crash with disappearing base backup segment

Hi,

I ran into a crash recently during a backup-push. The server happened to be running pg_repack at the same time, and I'm guessing that's the cause, since it creates and drops a lot of temporary tables.

I'm not sure if this is something wal-g could/should handle, but I thought I'd report it to see if it could recover from this particular error.

Thanks!

/base/16400/94197891.8
2018/02/11 03:17:12 lstat /var/lib/postgresql/10/main/base/16400/94197891.9: no such file or directory
TarWalker: walk failed
github.com/wal-g/wal-g.(*Bundle).TarWalker
	/home/travis/gopath/src/github.com/wal-g/wal-g/walk.go:33
github.com/wal-g/wal-g.(*Bundle).TarWalker-fm
	/home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:570
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:372
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:376
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:376
path/filepath.Walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:398
github.com/wal-g/wal-g.HandleBackupPush
	/home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:570
main.main
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107
runtime.main
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/runtime/proc.go:185
runtime.goexit
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/runtime/asm_amd64.s:2197

AWS_SECURITY_TOKEN is not really optional

The README states that AWS_SECURITY_TOKEN is optional, but I keep getting:

postgres_1  | 2017/08/18 20:44:54 FATAL: Did not set the following environment variables:
postgres_1  | AWS_SECURITY_TOKEN

On WAL-PUSH, WAL-G sometimes doesn't get the same log files as WAL-E?

Ok, this is a strange one. I am trying to migrate from WAL-E to WAL-G, but for a number of reasons, we have to maintain both for now. So we are dual-writing backups and WAL logs to two different S3 paths.

To do that, we run two backups each night (one at 8AM UTC, one at 12AM UTC). We also set our archive_command to:

archive_command = '/mnt/postgres-scripts/wale/wale_archive_command.sh "%p" && /usr/local/bin/wal-g.sh wal-push %p'

The strange behavior we see is that our wale_archive_command.sh script runs and seemingly gets one WAL file passed in, but then uploads two WAL files during its execution:

+ WAL=pg_xlog/000000010000CC0A000000D2
+++ dirname /mnt/postgres-scripts/wale/wale_archive_command.sh
++ cd /mnt/postgres-scripts/wale
++ pwd  
+ ENV=/mnt/postgres-scripts/wale/env.sh
+ . /mnt/postgres-scripts/wale/env.sh
++ set +x
AWS Credentials Sourced: /mnt/postgres-scripts/wale/aws.sh
Configured WAL-E Bucket: s3://company.com/us1/company @ us-west-2
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-push".
        STRUCTURED: time=2018-01-22T03:03:12.757110-00 pid=82794
wal_e.worker.upload INFO     MSG: begin archiving a file
        DETAIL: Uploading "pg_xlog/000000010000CC0A000000D2" to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo".
        STRUCTURED: time=2018-01-22T03:03:12.800269-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo prefix=us1/company/ seg=000000010000CC0A000000D2 state=begin
wal_e.worker.upload INFO     MSG: begin archiving a file
        DETAIL: Uploading "pg_xlog/000000010000CC0A000000D3" to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo".
        STRUCTURED: time=2018-01-22T03:03:12.818554-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo prefix=us1/company/ seg=000000010000CC0A000000D3 state=begin
wal_e.worker.upload INFO     MSG: completed archiving to a file
        DETAIL: Archiving to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo" complete at 16356.5KiB/s.
        STRUCTURED: time=2018-01-22T03:03:13.513893-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo prefix=us1/company/ rate=16356.5 seg=000000010000CC0A000000D2 state=complete
wal_e.worker.upload INFO     MSG: completed archiving to a file
        DETAIL: Archiving to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo" complete at 18033.6KiB/s.
        STRUCTURED: time=2018-01-22T03:03:13.528637-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo prefix=us1/company/ rate=18033.6 seg=000000010000CC0A000000D3 state=complete

Meanwhile, our wal-g.sh script will only upload the first file:

BUCKET: company.com
SERVER: us1/company/wal-g    
WAL PATH: us1/company/wal-g/wal_005/000000010000CC0A000000D2.lz4

I don't understand the behavior at all. It feels like the wale_archive_command.sh script is being invoked with two WAL files... but if it were, the extended output we have should show that in the WAL=... line. However, it's pretty clear that the script runs once, and yet wal-e sees two files and uploads them in order.

Meanwhile, the wal-g.sh script seems so simple that I can't imagine I'm doing anything wrong there.

Any thoughts on what could be happening? Here are our scripts, just for your reference:

wale_archive_command.sh

#!/bin/bash -x
WAL=$1
ENV="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/env.sh"
. $ENV || fail "Could not find ${ENV}. Exiting."
wal-e wal-push "$WAL"

wal-g.sh

envdir /etc/wal-g /usr/local/bin/wal-g $*

S3 Object Listing Does Not Paginate

Calls such as GetBackups and GetWals do not paginate across S3's maximum object return list size of 1000. These parts of the code base—especially the GetWals call—should be refactored to call ListObjectsV2Pages rather than ListObjectsV2. The consequence of this today is that deleting old backups may only clear up to the oldest 1000 WAL segments, but none further.

Some LZOP archives don't work

I'm still taking this apart but it can be related to #22.

I have this backup that has hundreds of archives, but one of those archives causes a systematic crash, whereas lzop seems fine with it. There's not much to do but for me to go through it with a fine-tooth comb, but, FYI.

cc @x4m

File encryption

Will wal-g support GPG encryption as wal-e does? It's a key feature when storing sensitive data on public clouds.

wal-push failed, but exited with a 0 code?

We got into a nasty situation yesterday when a wal-g wal-push failed, threw some log output (below), and then seems to have exited with a 0 code which allowed Postgres to clean up the failed WAL log.

Host Details:

  • Ubuntu 14.04
  • Postgres 9.6.6
  • WAL-G 0.1.7

archive_command = 'envdir /etc/wal-g /usr/local/bin/wal-g wal-push %p'

Log:

WAL PATH: us1/xxx/wal_005/000000020000D96A00000039.lz4
BUCKET: xxx.com
SERVER: us1/xxxx=
2018/04/04 20:54:21 upload: failed to upload 'pg_xlog/000000020000D96A0000003A': SerializationError: failed to decode S3 XML error response
	status code: 400, request id: 342A75FE7FA0F3D6, host id: 77+fUhrRM9zyLPA/OFYJxqEHOeryiTI4zwVzLOrz7U0LtU4eazY8uw+dLSo2gnocSAnj5Q3Dbng=
caused by: unexpected EOF. Restarting in 1.02 seconds
WAL PATH: us1/xxx/wal_005/000000020000D96A0000003A.lz4
BUCKET: xxx.com
SERVER: us1/xxx

Finally, here's a snapshot of the file as it was uploaded to S3 .. note, its 0 bytes:

$ aws s3 ls s3://xxx.com/us1/xxx/wal_005/000000020000D96A0000003A.lz4
2018-04-04 13:54:23          0 000000020000D96A0000003A.lz4

Should detect when `archive_command` is not set right..

We are doing some development of new puppet code for managing databases, and in my testing I hadn't yet set archive_command to anything on our dbmaster. Obviously we need to do that .. but if you miss it, and you try to execute a wal-g backup-push, it just hangs near the end:

/pg_subtrans
/pg_tblspc
/pg_twophase
/pg_xlog
/postgresql.auto.conf
/postgresql.conf
Finished writing part 1.
Starting part 2 ...
/global/pg_control
Finished writing part 2.
<hang is here>

Digging into the Postgres logs, we saw:

2018-01-05 08:00:02.232 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,3,"SELECT",2018-01-05 08:00:02 UTC,3/1674,0,LOG,00000,"duration: 124.497 ms  execute <unnamed>: SELECT case when pg_is_in_recovery() then '' else (pg_xlogfile_name_offset(lsn)).file_name end, lsn::text, pg_is_in_recovery() FROM pg_start_backup($1, true, false) lsn","parameters: $1 = '2018-01-05 08:00:02.106987155 +0000 UTC'",,,,,,,,""
2018-01-05 08:01:02.103 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,4,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:02:02.171 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,5,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:04:02.305 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,6,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (240 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:08:02.570 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,7,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (480 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:16:03.100 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,8,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (960 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:32:04.156 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,9,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 09:04:06.275 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,10,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (3840 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 10:08:10.511 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,11,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 12:16:18.992 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,12,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (15360 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""

Adding archive_command = "/bin/true" and HUPing postgres solved the issue. However, it seems that wal-g should have some way to detect when it's in this hung state and get out of it with a useful error.

Postgres10 support

When I try to run wal-g backup-push I get this error:

ERROR:  function pg_xlogfile_name_offset(pg_lsn) does not exist at character 23
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
STATEMENT:  SELECT file_name FROM pg_xlogfile_name_offset(pg_start_backup($1, true, false))

wal-g version: 0.1.2
PostgreSQL version: 10.1

Looks like it's something similar to this issue here: wal-e/wal-e#339

Support backups located in the root of a bucket

It looks like backups are expected to be located in subfolders inside an S3 bucket, but our WAL-E backups are located in the root - it'd be great if this were supported too! Currently, a lot of panic: runtime error: index out of range messages are thrown when trying to hack around this by changing the locations in the configuration.

backup-fetch fails with "Interpret: copy failed"

Two servers failed to boot up and restore their databases last night. These two servers were booting in different AWS Regions, restoring from the same S3 backup. They both failed on exactly the same file, which indicates that we pushed a corrupt backup in some way.

The backup-push from the DB master had been run by hand, and there were no errors in the log output.

Master Server Details:

  • Ubuntu 14.04
  • Postgres 9.6.6
  • WAL-G 0.1.7

DB Size: ~1.8TB

Server 1 Failure

Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655160
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655162
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655164
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 2018/04/05 10:56:19 unexpected EOF
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: Interpret: copy failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.(*FileTarInterpreter).Interpret
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/tar.go:86
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:51
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: extractOne: Interpret failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:53
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Info: Class[Wal_g::Db_restore]: Unscheduling all events on Class[Wal_g::Db_restore]

Server 2 Failure

Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655162
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655164
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 2018/04/05 06:56:25 unexpected EOF
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: Interpret: copy failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.(*FileTarInterpreter).Interpret
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/tar.go:86
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:51
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: extractOne: Interpret failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:53
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Info: Class[Wal_g::Db_restore]: Unscheduling all events on Class[Wal_g::Db_restore]
STDERR> Error: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: change from 'notrun' to ['0'] failed: '/usr/local/bin/wal-g-restore.sh' returned 1 instead of one of [0]

backup-list doesn't show backups

If backups are stored in the root directory of the bucket, the 'backup-list' command shows "No backups found". Moving the backups to a subdirectory solves the problem.

postgres@pg0:~$ aws s3 ls s3://production-backups/
                           PRE basebackups_005/
                           PRE wal_005/
                           PRE walg/
postgres@pg0:~$ aws s3 ls s3://production-backups/walg/
                           PRE basebackups_005/
                           PRE wal_005/
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ export WALE_S3_PREFIX=s3://production-backups/
postgres@pg0:~$ wal-g backup-list
BUCKET: production-backups
SERVER: 
2018/01/16 13:12:48 No backups found
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ export WALE_S3_PREFIX=s3://production-backups/walg
postgres@pg0:~$ wal-g backup-list
BUCKET: production-backups
SERVER: walg
name    last_modified   wal_segment_backup_start
base_0000000100000001000000D4   2018-01-16T13:06:28Z    0000000100000001000000D4

Deleting backups leaves sentinel file

Apologies if I'm misunderstanding what the correct behaviour actually is, but when I run this:

$ wal-g delete retain FULL 1 --confirm
2018/03/25 11:45:20 base_000000010000000000000008 skipped         
2018/03/25 11:45:20 base_000000010000000000000004 will be deleted 

The backup base_000000010000000000000004 correctly disappears from my S3 bucket. However, the sentinel file base_000000010000000000000004_backup_stop_sentinel.json remains, which means the deleted backup still shows up in wal-g backup-list even though it no longer exists:

$ wal-g backup-list    
name    last_modified   wal_segment_backup_start                                                       
base_000000010000000000000004   2018-03-25T11:44:39Z    000000010000000000000004                       
base_000000010000000000000008   2018-03-25T11:45:02Z    000000010000000000000008                       

Manually deleting the sentinel file stops the actually-deleted backup from showing up.

Is this intended behaviour? I found it a bit misleading, so thought it was worth checking. Thanks!
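
For reference, a hedged sketch of the manual cleanup mentioned above, using aws-sdk-go. The bucket name is a placeholder, and it assumes the sentinel object lives under the basebackups_005 prefix shown in the bucket listings elsewhere in this document:

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Backup name taken from the listing above; bucket and key prefix are placeholders.
    backup := "base_000000010000000000000004"
    key := fmt.Sprintf("basebackups_005/%s_backup_stop_sentinel.json", backup)

    _, err := svc.DeleteObject(&s3.DeleteObjectInput{
        Bucket: aws.String("my-backup-bucket"),
        Key:    aws.String(key),
    })
    if err != nil {
        log.Fatalf("failed to delete sentinel %s: %v", key, err)
    }
    log.Printf("deleted %s", key)
}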

prefetch failed

Hi,
I've restored a backup, and during recovery WAL segments are copied as usual, but I also see messages like "WAL-prefetch failed: no such file or directory". What is the cause?

recovery.conf

standby_mode = 'on'
restore_command = '. /etc/wal-g/env && wal-g wal-fetch "%f" "%p"'
recovery_target = 'immediate'

Google Summer of Code 2018 with PostgreSQL

Hi!
If you are a student and wish to contribute to WAL-G, I've created a GSoC project under the umbrella of PostgreSQL.
Feel free to contact me on the matter.

Also, if you are doing a graduation project, I have some ideas for you too. And of course, you can combine a graduation project with a GSoC project (on different but related topics).

In case it matters, I'm an Associate Professor at Ural Federal University, Russia. I have about 8 years of experience advising student researchers in the field of Computer Science and Software Engineering. I hold a Ph.D. in Theoretical Informatics (but the projects are going to be 100% practical).

Support other cloud providers

We are using wal-g in our project https://github.com/kubedb . We would like to add Google Cloud Storage, Azure Blob Storage, and OpenStack Swift as supported backends. This issue is intended to discuss the general design for this.

We have contributed this type of change to other tools, for example restic: https://github.com/restic/restic/pulls?utf8=%E2%9C%93&q=is%3Apr+diptadas . The general pattern was to extract a Backend interface and then plug in a cloud-provider-specific implementation selected by some flag. We are willing to sponsor cloud provider accounts that can be used to run e2e tests via Travis.
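
For illustration, a minimal sketch of what such a Backend interface might look like. The names and method set are assumptions for discussion, not WAL-G's actual API:

package storage

import "io"

// Backend is a hypothetical abstraction each cloud provider would implement.
type Backend interface {
    // Upload stores the contents of r under the given key.
    Upload(key string, r io.Reader) error
    // Download returns a reader for the object stored under key.
    Download(key string) (io.ReadCloser, error)
    // List returns all keys that start with prefix.
    List(prefix string) ([]string, error)
    // Delete removes the object stored under key.
    Delete(key string) error
}

An S3, GCS, Azure, or Swift implementation could then be chosen by a flag or by the scheme of the storage prefix.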

If this sounds good, this is the process I propose:

  • Refactor the S3 implementation to extract the interface and merge that PR.
  • Create a separate PR for each of the 3 other providers.

What do you think? @fdr @x4m

cc: @aerokite @diptadas

Decompress1X panic crash

This happened while fetching a backup (not a WAL segment). Commit: 96e46ad

panic: Decompress1X

goroutine 14 [running]:
github.com/katie31/extract.Decompress(0x935060, 0xc42000e088, 0x9354e0, 0xc42029c200)
	/home/mapi/var/go/src/github.com/katie31/extract/lzo.go:96 +0x747
github.com/katie31/extract.ExtractAll.func1.1(0x934fa0, 0xc420015ca0, 0xc42000e088)
	/home/mapi/var/go/src/github.com/katie31/extract/decomex.go:66 +0xbf
created by github.com/katie31/extract.ExtractAll.func1
	/home/mapi/var/go/src/github.com/katie31/extract/decomex.go:68 +0x113
/home/mapi/lib/util.rb:4:in `r': unhandled exception
	from /home/mapi/lib/postgres_installer.rb:104:in `fetchdb'

Postscript:

I think these two branches should be swapped, i.e. the error should be printed before doing the output-length check. By typical Go convention, when a multi-value return includes an error, the other result values may still be returned in an undefined state.

https://github.com/katie31/extract/blob/96e46adc0722e20462be8e0e1ac96bd84f1792b5/lzo.go#L95-L100
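
For illustration, a hedged sketch of the convention being described, not the actual lzo.go code: inspect the returned error before trusting the byte count, since the count may be undefined when the error is non-nil.

package example

import (
    "fmt"
    "io"
)

// readExact shows the suggested ordering: check err first,
// then validate the length, because n may be undefined on error.
func readExact(r io.Reader, buf []byte, want int) error {
    n, err := r.Read(buf)
    if err != nil && err != io.EOF {
        return fmt.Errorf("read failed: %v", err) // report the error first
    }
    if n != want {
        return fmt.Errorf("short read: got %d bytes, want %d", n, want)
    }
    return nil
}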

PSPS:

I swapped the order and gave things another try:

...
base/13290/2605
base/13290/2610_vm
base/13290/2601_fsm
base/13290/2685
base/13290/2618
base/13290/3164
base/13290/3079
base/13290/13136
base/13290/3079_vm
base/13290/3118_vm
base/13290/1255_vm
panic: EOF

Support alternative S3 implementations

Additionally, in order to support unofficial S3 implementations such as https://minio.io/, we would need to be able to set up a few custom settings:

Endpoint:         aws.String(os.Getenv("AWS_ENDPOINT")),
S3ForcePathStyle: aws.Bool(os.Getenv("AWS_S3_FORCE_PATH_STYLE") == "true"),

Example:

WALE_S3_PREFIX: s3://backups/wal-g
AWS_ENDPOINT: http://minio:9000

S3ForcePathStyle is required so the AWS Go SDK doesn't try to call http://backups.s3.amazonaws.com/wal-g and instead reaches the Minio server using path-style addressing, as in http://minio:9000/backups/wal-g.

Source https://github.com/minio/cookbook/blob/master/docs/aws-sdk-for-go-with-minio.md
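
For illustration, a hedged sketch of wiring those two settings into an aws-sdk-go session. The AWS_ENDPOINT and AWS_S3_FORCE_PATH_STYLE variables are part of the proposal above, not existing WAL-G options, and the bucket name comes from the example prefix s3://backups/wal-g:

package main

import (
    "log"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    // Point the SDK at a custom S3-compatible endpoint such as Minio and
    // force path-style addressing so bucket names are not used as subdomains.
    sess, err := session.NewSession(&aws.Config{
        Region:           aws.String(os.Getenv("AWS_REGION")),
        Endpoint:         aws.String(os.Getenv("AWS_ENDPOINT")),
        S3ForcePathStyle: aws.Bool(os.Getenv("AWS_S3_FORCE_PATH_STYLE") == "true"),
    })
    if err != nil {
        log.Fatalf("failed to create session: %v", err)
    }

    svc := s3.New(sess)
    out, err := svc.ListObjects(&s3.ListObjectsInput{Bucket: aws.String("backups")})
    if err != nil {
        log.Fatalf("list failed: %v", err)
    }
    log.Printf("found %d objects", len(out.Contents))
}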

Environment Variables for Connection with S3

Hi,

We are using SoftLayer S3 storage, which works on the Swift API. I have successfully set up WAL-E with this storage by defining WALE_S3_PREFIX, SWIFT_AUTH_VERSION, SWIFT_PASSWORD, SWIFT_USER, and SWIFT_AUTHURL under /etc/wal.e.d/env.

I have installed wal-g and moved it to /usr/bin, but when I execute wal-push it gives the error "FATAL: Did not set the following environment variables:WALE_S3_PREFIX".
I tried setting this environment variable, which I had set up for the WAL-E installation, in my bashrc file, but after that I get the error below:
goroutine 1 [running]:
github.com/wal-g/wal-g.Configure(0x10, 0xc420156620, 0x0, 0x18)
/home/travis/gopath/src/github.com/wal-g/wal-g/upload.go:77 +0xa6e
main.main()
/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:83 +0x1f6

Please let me know if I missed any step, or if I need to change my environment variables for WAL-G to connect to S3 storage on SoftLayer.

basebackups_005 and wal_005

What is the reason for these folders being named this way? Is there an intention to have these folders dynamically named based on part of the WAL prefix, in order to better organize the objects rather than dumping them all into the same logical location? If so, I will rephrase this issue's title to suggest implementing this behavior.
