amazon-aurora-postgresql-fast-failover-demo's Introduction

Amazon Aurora PostgreSQL Fast Failover Demo

A simple demonstration of in-region (HA) and cross-region (DR) failover automation using Amazon RDS Aurora PostgreSQL Global Database, RDS Proxy and Route53

Cross-Region Recovery

  • The CloudFormation template sets up a two-region database cluster for Disaster Recovery and statically stable application routing to the local database cluster.
  • A Lambda canary process in the secondary region continually scans the primary region (every 10 seconds) and enacts failover if it observes 30 seconds of consecutive failures.
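
The canary's decision rule can be sketched as a small state machine. This is an illustration only, not the repository's Lambda code: `CanaryState` and `record_probe` are hypothetical names, and treating "30 seconds of consecutive failures" as three failed 10-second probes in a row is an assumption.

```python
from dataclasses import dataclass

PROBE_INTERVAL_SECONDS = 10  # probe cadence described above
FAILURE_WINDOW_SECONDS = 30  # failure window described above
# Assumption: the window is interpreted as 3 failed probes in a row.
FAILURES_BEFORE_FAILOVER = FAILURE_WINDOW_SECONDS // PROBE_INTERVAL_SECONDS


@dataclass
class CanaryState:
    """Hypothetical state carried between canary invocations."""
    consecutive_failures: int = 0
    failed_over: bool = False


def record_probe(state: CanaryState, probe_succeeded: bool) -> bool:
    """Record one probe result; return True only when failover should start."""
    if state.failed_over:
        return False  # failover was already enacted
    if probe_succeeded:
        state.consecutive_failures = 0  # any success resets the streak
        return False
    state.consecutive_failures += 1
    if state.consecutive_failures >= FAILURES_BEFORE_FAILOVER:
        state.failed_over = True
        return True
    return False
```

Resetting the counter on every success means isolated probe failures never add up to a failover; only an unbroken run of failures does.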

architecture for cross-region failover

In-Region Recovery

  • Each region also contains an RDS Proxy to provide High Availability through transparent failover to the failed-over writer in the secondary Availability Zone.
  • By queueing read and write queries until the reader instance comes back as a writer, RDS Proxy turns what would be SQL errors into a few extra seconds of latency for the application.
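
The effect described above can be illustrated with a toy model. It is a simplified simulation, not a description of RDS Proxy internals: the `simulate` function, its parameters, and the fixed outage window are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def simulate(
    request_times: List[float],
    outage_start: float,
    outage_end: float,
    queue_requests: bool,
) -> List[Tuple[float, Optional[float]]]:
    """Return (request_time, added_latency) pairs.

    A latency of None stands for an error returned to the client.
    """
    results: List[Tuple[float, Optional[float]]] = []
    for t in request_times:
        in_outage = outage_start <= t < outage_end
        if not in_outage:
            results.append((t, 0.0))  # writer available, no extra wait
        elif queue_requests:
            # Held until the new writer is ready, then answered.
            results.append((t, outage_end - t))
        else:
            results.append((t, None))  # no queueing: the client sees an error
    return results
```

With queueing enabled, every request in the outage window completes with some added latency; without it, the same requests surface as errors.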

architecture for in-region failover

Demos

  • The repository also contains a test application that generates load, logs successful calls and errors in a simple UI, and times the failover.
  • In-region failover with and without RDS proxy
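
One way to time a failover from a log of request outcomes is sketched below. The `(timestamp, succeeded)` sample format and the `outage_windows` helper are assumptions for illustration; the demo's own UI may measure the outage differently.

```python
from typing import List, Tuple


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Find runs of failed requests in time-ordered (timestamp, succeeded) samples.

    Each window runs from the first failure to the first success after it.
    A failure run that never recovers within the samples is not reported.
    """
    windows: List[Tuple[float, float]] = []
    outage_start = None
    for timestamp, succeeded in samples:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure opens a window
        elif succeeded and outage_start is not None:
            windows.append((outage_start, timestamp))  # first success closes it
            outage_start = None
    return windows
```

The outage duration for each window is simply `end - start`.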

Pre-Requisites

  • A Public Hosted Zone registered with AWS Route53
    • If you don't have one, please follow these instructions to create a new one
    • You can find the Public Hosted Zone ID in the AWS Console under Route53 -> Hosted Zones -> Hosted zone ID
    • You can find the Public Service FQDN in the AWS Console under Route53 -> Hosted Zones -> Hosted zone name
  • Pick a unique-looking stack name (suggest all-caps with whole words describing what the stack does). This will become the prefix for all the resources the stack creates.
  • Pick a database username and password for the Aurora PostgreSQL databases the demo creates. The password must be longer than 8 characters, and the username can't be a reserved word (like "admin").
  • Pick two AWS regions (example: us-east-1 and us-east-2) where you'll be testing
  • Ensure that your Lambda concurrency is increased to at least 1000 in both test regions. In the AWS Console, navigate to: Service Quotas -> AWS services -> AWS Lambda -> Concurrent executions and click Request quota increase, selecting 1000. Do this in each test region.
  • Optionally, you can specify the CIDR blocks (e.g. 10.10.0.0/21) for 2 VPCs and their subnets the solution will create
    • The template creates one VPC in the primary region and one in the secondary region
    • Each of the 2 VPCs the template creates will contain:
      • 2 public subnets (for the Internet-facing API we'll be calling to test the region's application)
      • 2 private subnets (for the application's middle tier lambda)
      • 2 more private subnets (for the database instances the middle tier will write to)
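
To check that a chosen CIDR block has room for the six subnets listed above, a quick calculation with Python's standard `ipaddress` module can help. The /24 subnet size and the role names here are assumptions for illustration; the template may carve up the VPC range differently.

```python
import ipaddress
from typing import Dict

# Hypothetical labels for the six subnets described above.
SUBNET_ROLES = [
    "public-1", "public-2",            # Internet-facing API
    "app-private-1", "app-private-2",  # middle-tier Lambda
    "db-private-1", "db-private-2",    # database instances
]


def plan_subnets(vpc_cidr: str, new_prefix: int = 24) -> Dict[str, str]:
    """Split a VPC CIDR into equal subnets and assign one to each role."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = list(network.subnets(new_prefix=new_prefix))
    if len(subnets) < len(SUBNET_ROLES):
        raise ValueError(f"{vpc_cidr} is too small for {len(SUBNET_ROLES)} /{new_prefix} subnets")
    return {role: str(subnet) for role, subnet in zip(SUBNET_ROLES, subnets)}


if __name__ == "__main__":
    for role, cidr in plan_subnets("10.10.0.0/21").items():
        print(f"{role}: {cidr}")
```

A /21 block yields eight /24 subnets, so the example range from the list above leaves two to spare under this sizing assumption.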

Deploying This Solution

  • Gather the values listed in the Pre-Requisites section above; the next step will prompt you for them.

  • This solution can be deployed using a single main CloudFormation template located here. Both main.yml and main.json are functionally identical. This template takes roughly 60 minutes to deploy.

  • During deployment, this template will launch several additional multi-region CloudFormation StackSets to fully deploy the required resources. While you don't need to launch or modify these StackSets directly, the underlying templates have been included in this repo for your reference.

  • The "Primary-Databases" substack step waits for the RDS databases to come up in the failover region, so it can take more than 20 minutes.

  • Once deployed, the primary stack you launch will contain the following outputs:

    • InRegionFailoverDemoUrl - The dashboard you can use to simulate an in-region failover.
    • CrossRegionFailoverDemoUrl - The dashboard you can use to simulate a cross-region failover.
  • This template implements several mechanisms that may be useful to you outside of this use case. These mechanisms can be extracted from this repo and used elsewhere in your environment. These mechanisms include using CloudFormation to:

    • Empty an Amazon S3 bucket prior to deletion by CloudFormation.
    • Download code from a public repo URL and deploy it to Amazon S3.
    • Retrieve CloudFormation Exports from another AWS Region and use them as variables inside the invoking template.
    • Create a cross-region VPC Peering Connection and configure the required VPC routes on both sides.
    • Create a custom AWS Lambda Layer for Python runtimes. This tooling takes, as input, the PyPI packages you want included in the layer, as well as any custom functions, then returns the resulting Layer Version ARN for use elsewhere in your template(s).
    • Execute DDL statements against a remote database in need of configuration. In the case of this solution, this tooling is used to initialize an RDS Aurora PostgreSQL database. However, using the custom Lambda Layer creation tooling mentioned above, it could be easily modified to target additional DB engines (e.g., MySQL, MariaDB).
    • Delete some or all DNS records within an Amazon Route53 Hosted Zone during a CloudFormation Stack deletion. By default, CloudFormation's AWS::Route53::RecordSet[Group] resource handlers will NOT delete DNS records if they have a value different than that which was used at their creation. If your application modifies DNS records created by CloudFormation, CloudFormation will not delete them in an effort to protect customers from inadvertently deleting records that are still needed. The mechanism included in this solution is indiscriminate when it comes to deletion. It takes, as input, the FQDN(s) you would like deleted from the specified Hosted Zone and deletes those records, regardless of their current values.
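
The record-deletion idea in the last item can be sketched as follows. Route53 requires a DELETE change to match the record's current values, so the sketch reads the record sets as they exist now and echoes them back in the change batch. `build_delete_batch` is a hypothetical helper, not the repository's implementation; the dictionary shapes follow the boto3 `list_resource_record_sets` and `change_resource_record_sets` calls.

```python
from typing import Dict, Iterable, List


def build_delete_batch(record_sets: List[Dict], fqdns: Iterable[str]) -> Dict:
    """Build a Route53 ChangeBatch deleting every record set named in fqdns.

    record_sets is the "ResourceRecordSets" list as currently returned by
    Route53, so each DELETE carries the record's present values.
    """
    # Route53 returns names with a trailing dot; compare without it.
    wanted = {name.rstrip(".").lower() for name in fqdns}
    changes = [
        {"Action": "DELETE", "ResourceRecordSet": record_set}
        for record_set in record_sets
        if record_set["Name"].rstrip(".").lower() in wanted
    ]
    return {"Changes": changes}


# Usage sketch (requires boto3 and AWS credentials; zone_id is a placeholder):
#   route53 = boto3.client("route53")
#   page = route53.list_resource_record_sets(HostedZoneId=zone_id)
#   batch = build_delete_batch(page["ResourceRecordSets"], ["api.example.com"])
#   if batch["Changes"]:
#       route53.change_resource_record_sets(HostedZoneId=zone_id, ChangeBatch=batch)
```

Because the batch is built from the live records rather than from the values used at creation time, records the application has since modified are still removed.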

Running This Solution

  • Once the above CloudFormation stack has completed:
  • Test in-region failover using Amazon RDS Proxy:
    • Find the URLs of your test dashboard: Go to CloudFormation -> Stacks -> The stack you just created -> Outputs:
      • CrossRegionFailoverDemoUrl: https://dashboard./cross-region-failover.html?apiHost=api.&primaryRegion=&failoverRegion=
      • InRegionFailoverDemoUrl: https://dashboard./in-region-failover.html?apiHost=api.&primaryRegion=&failoverRegion=
    • Use the control panel in the top-left corner to run the test:
      • in-region control panel
      • 1: Bypass (or enable) RDS Proxy to see the difference in data loss with and without it
      • 2: Click "Generate Client Traffic" to begin the test.
      • Note the blue line, which measures successful requests
      • 3: Click "Send Failover Request" to shut down the active Aurora writer instance, causing an outage until the Aurora reader instance in the secondary availability zone restarts as a writer.
      • Note the red line that measures errors returned to the client and the outage duration with and without RDS Proxy
      • 4: Reset the test environment before switching RDS Proxy on or off or re-running the test.
  • Test cross-region failover using Amazon Aurora Global Database:
    • Navigate to https://dashboard./cross-region-failover.html (the CrossRegionFailoverDemoUrl stack output)
      • cross-region control panel
      • 1: Click "Generate Client Traffic" to begin the test.
      • Note the blue line, which measures successful requests
      • 2: Click "Simulate LSE" to shut down all network connectivity to the primary region's Aurora databases, causing an outage of the whole region's test applications in this VPC.
      • Note the red line that measures errors returned to the client and the outage duration.
      • Wait until the Canary Lambda notices the region is unhealthy, the Aurora reader instance in the secondary region restarts as a writer, and the Route53 CNAME is redirected to the secondary region.
      • 3: Reset the test environment before re-running the test.
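
As an alternative to looking up the dashboard URLs in the console, the stack outputs can be read programmatically. The sketch below only reshapes the response of the CloudFormation `describe_stacks` call; `outputs_to_dict` is a hypothetical helper and the stack name in the usage comment is a placeholder.

```python
from typing import Dict


def outputs_to_dict(describe_stacks_response: Dict) -> Dict[str, str]:
    """Map OutputKey to OutputValue for the first stack in the response."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


# Usage sketch (requires boto3 and AWS credentials):
#   cfn = boto3.client("cloudformation")
#   outputs = outputs_to_dict(cfn.describe_stacks(StackName="YOUR-STACK-NAME"))
#   print(outputs["InRegionFailoverDemoUrl"])
#   print(outputs["CrossRegionFailoverDemoUrl"])
```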

Cleaning Up This Solution

  • To clean up / undeploy this solution, simply delete the primary CloudFormation Stack you initially launched. The cleanup will take roughly 45 minutes.
  • If you see a delete failure, retry it, without skipping any failed deletions.
  • Sometimes ENIs take longer than expected to delete, in which case you can delete them and the VPC manually, then retry the stack deletion.
  • After the stack deletion is complete, you need to delete the IAM roles used to do the deletion. Navigate to IAM -> Roles and find the roles that have your stack name as the prefix.
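
Finding the leftover roles can also be scripted. This sketch filters the pages returned by the IAM `list_roles` paginator by name prefix; `roles_with_prefix` is a hypothetical helper, and it only lists candidates so they can be reviewed before anything is deleted.

```python
from typing import Dict, Iterable, List


def roles_with_prefix(pages: Iterable[Dict], stack_name: str) -> List[str]:
    """Return sorted role names that start with the stack name."""
    matches = [
        role["RoleName"]
        for page in pages
        for role in page.get("Roles", [])
        if role["RoleName"].startswith(stack_name)
    ]
    return sorted(matches)


# Usage sketch (requires boto3 and AWS credentials):
#   iam = boto3.client("iam")
#   pages = iam.get_paginator("list_roles").paginate()
#   for name in roles_with_prefix(pages, "YOUR-STACK-NAME"):
#       print(name)
```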

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Note that the Python code depends on, but does not include, the LGPL-3.0 licensed Psycopg PostgreSQL adapter.

amazon-aurora-postgresql-fast-failover-demo's People

Contributors

amazon-auto, cartermeyers, hyandell, winmaxim

amazon-aurora-postgresql-fast-failover-demo's Issues

Issue when deploying Cloudformation Template.

Cloudformation Stack deployment error:
CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [7590d9b8-a15f-4aff-b844-e8a07e5fd529]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.

Lambda Error (AURORAFAILOVER-CfnExportRetriever error)

Function Logs
START RequestId: 167c0cf9-4d88-4deb-8049-301fc7b2f811 Version: $LATEST
[ERROR] Runtime.ImportModuleError: Unable to import module 'index': cannot import name 'DEFAULT_CIPHERS' from 'urllib3.util.ssl_' (/tmp/urllib3/util/ssl_.py)
Traceback (most recent call last):
END RequestId: 167c0cf9-4d88-4deb-8049-301fc7b2f811
