Git Product home page Git Product logo

amazon-rds-postgresql-bulk-loading-blog's Introduction

Optimized bulk loading in Amazon RDS for PostgreSQL

The artifacts in this repository support the published blog: Optimized bulk loading in Amazon RDS for PostgreSQL. Refer to the blog for detailed instructions on setup and configuration for benchmarking.

Setting up your testbench

To replicate the results presented in the blog you will need to complete the following steps:

  • Create an EC2 instance to serve as a bastion host from which to execute psql commands.
  • Create a multi-AZ Amazon RDS for PostgreSQL database instance with IAM authentication enabled.
  • Follow this guide to create an IAM policy, an IAM role, and attach the role to Amazon RDS for PostgreSQL instance.
  • Follow this guide to install the aws_s3 extension, create an IAM role and policy that provides Amazon S3 permissions to execute the s3import function.
  • Setup a pgpass file with the proper credentials to allow the bash scripts to execute against the database instance.
  • Create the imdb database, staging schema, and imdb tables using the provided DDL scripts.
  • Download and stage the IMDb public dataset in an Amazon S3 bucket.
  • Git clone this repo to the bastion host.

Optimized bulk load configuration

The folloiwng cli commands and json are provided in the cli folder.

1. Setup optimized database parameter group

First create a database parameter group and set parameters that are optimized for bulk loading.

aws rds create-db-parameter-group \
   --db-parameter-group-name rds-postgres14-bulkload \
   --db-parameter-group-family postgres14 \
   --description "Optimized database parameters for bulk loading into Amazon RDS for PostgreSQL"

Create json file rds-postgresql14-bulkload.json with optimized parameters.

{
    "DBParameterGroupName": "rds-postgresql14-bulkload",
    "Parameters": [
		{
			"ParameterName": "maintenance_work_mem",
			"ParameterValue": "1048576",
			"ApplyMethod": "immediate"
		},
		{
			"ParameterName": "max_wal_size",
			"ParameterValue": "4096",
			"ApplyMethod": "immediate"
		},
		{
			"ParameterName": "checkpoint_timeout",
			"ParameterValue": "1800",
			"ApplyMethod": "immediate"
		}
	]
}

Modify the json file with parameter values to the optimized database parameter group.

aws rds modify-db-parameter-group \
   --db-parameter-group-name rds-postgres14-bulkload \
   --cli-input-json file://rds-postgresql14-bulkload.json

2. Apply optimized configurations prior to bulk loading

Apply database parameter group optimized for bulk loading

aws rds modify-db-instance \
   --db-instance-identifier [DB_INSTANCE_IDENTIFIER] \
   --db-parameter-group rds-postgres14-bulkload

3. Return to normal configuration after bulk loading completes

Apply normal database parameter group

aws rds modify-db-instance \
   --db-instance-identifier [DB_INSTANCE_IDENTIFIER] \
   --db-parameter-group default.postgres14

Performing bulk loading

The bash scripts provided execute psql commands against the Amazon RDS for PostgreSQL database. The script named copy_bulk_sequential.sh executes each psql command after the previous one finishes. The script named copy_bulk_parallel.sh executes all commands in parallel without waiting for the previous commands to complete. There are also a "pre" and "post" bulk loading scripts. The pre_bulkloading.sh script disables autovacuum at the table level. The post_bulkloading.sh script executes a checkpoint command, runs vacuum analyze on each of the target tables, then reenables autovacuum on each table.

Make scripts executable

cd psql
chmod +x pre_bulkloading.sh
chmod +x post_bulkloading.sh
chmod +x copy_bulk_sequential.sh
chmod +x copy_bulk_parallel.sh

Executing bulk load sequentially and report loading time

./pre_bulkloading.sh
time ./copy_bulk_sequential.sh
./post_bulkloading.sh

Executing bulk load in parallel and report loading time

./pre_bulkloading.sh
time ./copy_bulk_parallel.sh
./post_bulkloading.sh

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.