Achieving Operational Excellence Using Automated Playbook and Runbook

ℹī¸ You will run this lab in your own AWS account. Please follow directions at the end of the lab to remove resources to avoid future costs.

Introduction

This lab was derived directly from one of the Operational Excellence labs, Automating operations with Playbooks and Runbooks, in the AWS Well-Architected Labs.

Manually running your runbooks and playbooks for operational activities has a number of drawbacks:

  • Activities are prone to errors & difficult to trace.
  • Manual activities do not allow your operational practice to scale in line with your business requirements.

In contrast, implementing automation in these activities has the following benefits:

  • Improved reliability by preventing the introduction of errors through manual processes.
  • Increased scalability by allowing non-linear resource investment to operate your workload.
  • Increased traceability on your operation through log collection of the automation activity.
  • Improved incident response by reducing idle time and automatically triggering activity based on known events.
What are runbooks and playbooks?

At a glance, both runbooks and playbooks appear to be similar documents that technical users can use to perform operational activities. However, there is an essential difference between them:

  • A playbook documents processes that guide you through activities to investigate an issue. For example, gathering applicable information, identifying potential sources of failure, isolating faults, or determining the root cause of issues. Playbooks can follow multiple paths and yield more than one outcome.

  • A runbook contains the procedures necessary to achieve a specific outcome. For example, creating a user, rolling back configuration, or scaling resources to resolve the issue identified.

This hands-on lab will guide you through the steps to automate your operational activities using runbooks and playbooks built with AWS tools.

We will show how you can build automated runbooks and playbooks to investigate and remediate application issues using the following AWS services:

Prerequisites:

Costs

NOTE: If you complete this lab, you will be billed for any applicable AWS resources used that are not covered by the AWS Free Tier.

This lab walks you through creating a CI/CD workflow for serverless applications.

Content

Step 1. Deploy the sample application environment

In this section, you will prepare a sample application. The application is an API hosted inside a Docker container, using Amazon Elastic Container Service (ECS). The container is accessed via an Application Load Balancer.

The API is a private microservice within your Amazon Virtual Private Cloud (VPC). Communication to the API can only be done privately through routes within the VPC subnet. In our lab example, the business owner has agreed to run the API over HTTP protocol to simplify the implementation.

The API has two actions available which encrypt and decrypt information. This is triggered by doing a REST POST call to the /encrypt / /decrypt methods as appropriate.

  • The encrypt action will allow you to pass a secret message along with a 'Name' key as the identifier and it will return a 'Secret Key Id' that you can use later to decrypt your message.
  • The decrypt action allows you to then decrypt the secret message passing along the 'Name' key and 'Secret Key Id' you obtained before to get your secret message.

Both actions will make a write and read call to the application database hosted in Amazon Relational Database Service (RDS), where the encrypted messages are stored.

The following step-by-step instructions will provision the application that you will use with your runbooks and playbooks.

Explore the contents of the CloudFormation script to learn more about the environment and application.

You will use this sample application as a sandbox to simulate an application performance issue, and then start your runbooks and playbooks to investigate and remediate it autonomously.

Action items in this section:

  1. You will prepare the Cloud9 workspace launched with a new VPC.
  2. You will run the application build script from the Cloud9 console to build the sample application as shown in the diagram below.

Section1 App Arch

1.0 Prepare Cloud9 workspace.

In this first step you will provision a CloudFormation stack that builds a Cloud9 workspace along with the VPC for the sample application. This Cloud9 workspace will be used to run the provisioning script of the sample application. You can choose to deploy the stack in one of the regions below.

  1. Click on the link below to deploy the stack. This will take you to the CloudFormation console in your account. Use walab-ops-base-resources as the stack name, and take the default values for all options.

    • us-west-2 : here
    • ap-southeast-2 : here
    • ap-southeast-1 : here
  2. Once the template is deployed, wait until the CloudFormation Stack reaches the CREATE_COMPLETE state.

Section1

1.1 Run the build application script.

Next, run the build script to build and deploy your application environment from the Cloud9 workspace as follows:

  1. From the main console, access the Cloud9 service.

  2. Click the Environments section on the left menu, locate the environment named WellArchitectedOps-walab-ops-base-resources as shown below, then click Open.

    Section 2 Cloud9 IDE Welcome Screen

  3. Your environment will bootstrap the lab repository. You should see terminal output like the following:

    Section 2

    When the bootstrap script finishes you will see a folder called aws-well-architected-labs.

  4. In the IDE terminal console, change directory to the working folder where the build script is located:

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
    
  5. Copy and paste the command below, replacing the two [email protected] placeholders with the email addresses you would like the application to notify. The first address should represent the system operators team and the second should represent the business owner.

    bash build_application.sh walab-ops-base-resources [email protected] [email protected]
    

The build_application.sh script will build and deploy your sample application, along with the architecture that hosts it. The application architecture can notify system operators and owners through Amazon Simple Notification Service. You can use the same email address for both values if you need to, but ensure that both are specified.

If you have deployed Amazon ECS before in your account, you may encounter an InvalidInput error with the message "AWSServiceRoleForECS has been taken" while running the build_application.sh script. You can safely ignore this message, as the script will continue despite the error.

  1. The above command runs the build and provisioning of the application stack. The script should take about 20 mins to finish.

    Section 2 Cloud9 IDE Welcome Screen

The build_application.sh script will build the application Docker image and push it to Amazon ECR, where it is used by Amazon ECS. Once the build script completes, another CloudFormation stack containing the application resources (ECS, RDS, ALB, and others) will be deployed.

  1. In the CloudFormation console, you should see a new stack being deployed called walab-ops-sample-application. Wait until the stack reaches CREATE_COMPLETE state and proceed to the next step.

    Section 2 CreateComplete

1.2. Confirm the application status.

Once the application is successfully deployed, go to your CloudFormation console and locate the stack named walab-ops-sample-application.

  1. Confirm that the stack is in a 'CREATE_COMPLETE' state.

  2. Record the following output details as it will be required later:

  3. Take note of the DNS value specified under OutputApplicationEndpoint of the Outputs.

    The screenshot below shows the output from the CloudFormation stack:

    Section2 DNS Output

  4. Check for an email sent to the system operator and owner addresses you specified for the build_application.sh script. These addresses are also visible in the CloudFormation stack parameters SystemOpsNotificationEmail and SystemOwnerNotificationEmail.

  5. Click confirm subscription on the email links to subscribe.

    Section2 DNS Output

Two emails will be sent to your address. Please make sure you confirm the subscription in both of them.

1.3. Test the application.

In this section, you will be testing the encrypt API action from the deployed application.

The application will take a JSON payload with Name as the identifier and Text key as the value of the secret message.

The application will encrypt the value under Text key with a designated KMS key and store the encrypted text in the RDS database with Name as the primary key.

Note: For simplicity, the sample application re-uses the same KMS key for each record generated.

Follow the steps below to test:
  1. In the Cloud9 terminal, run the command below, replacing the ApplicationEndpoint with the OutputApplicationEndpoint from previous step. This command will run curl to send a POST request with the secret message payload {"Name":"Bob","Text":"Run your operations as code"} to the API.

    ALBEndpoint="ApplicationEndpoint"
    
    curl --header "Content-Type: application/json" --request POST --data '{"Name":"Bob","Text":"Run your operations as code"}' $ALBEndpoint/encrypt
    
  2. Once you run this command, you should see output as follows:

    {"Message":"Data encrypted and stored, keep your key save","Key":"EncryptKey"}
    
  3. Take note of the encrypt key value under Key.

  4. Run the command below, pasting the encrypt key you took note of previously under the Key section to test the decrypt API.

    curl --header "Content-Type: application/json" --request GET --data '{"Name":"Bob","Key":"EncryptKey"}' $ALBEndpoint/decrypt
    
    
  5. Once you run the command you should see the following output:

    {"Text":"Run your operations as code"}
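The two curl calls above can also be expressed in Python. The following is a minimal sketch using only the standard library; the helper names build_request and call_api are made up for illustration, and "ApplicationEndpoint" remains a placeholder for your OutputApplicationEndpoint value. The requests are only constructed here; call_api would send them from a host that can reach the endpoint.

```python
import json
import urllib.request


def build_request(endpoint, action, payload, method):
    """Build (but do not send) a JSON request for the /encrypt or /decrypt path."""
    url = f"http://{endpoint}/{action}"  # the lab API is served over plain HTTP
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def call_api(request, timeout=10):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# "ApplicationEndpoint" and "EncryptKey" are placeholders, as in the curl commands.
encrypt_request = build_request(
    "ApplicationEndpoint", "encrypt",
    {"Name": "Bob", "Text": "Run your operations as code"}, "POST",
)
decrypt_request = build_request(
    "ApplicationEndpoint", "decrypt",
    {"Name": "Bob", "Key": "EncryptKey"}, "GET",
)
```

Calling call_api(encrypt_request) from the Cloud9 workspace should return a dictionary containing the Key value, which you would then pass in the decrypt payload.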
    

Congratulations!

You have now completed the first section of the Lab.

You should have a sample application API which we will use for the remainder of the lab.

Step 2. Simulate an Application Issue

Understanding the health of your workload is an essential component of Operational Excellence. Defining metrics and thresholds, together with appropriate alerts will ensure that issues can be acknowledged and remediated within an appropriate timeframe.

In this section of the lab, you will simulate a performance issue within the API. Using Amazon CloudWatch Synthetics, a canary monitors your API and continuously checks its response time to detect an issue.

In this example, should the API take longer than 6 seconds to respond, an alert will be created, triggering a notification email.

Action items in this section:

  1. You will run a script that will send a large amount of traffic to the API.
  2. You will observe and confirm the issue through AWS monitoring tools.

The following resources have been deployed to perform these actions.

Section3 Base Architecture

2.0 Sending traffic to the application

In this section, you will send multiple concurrent requests to the application, simulating a large surge of incoming traffic. This will overwhelm the API and gradually increase the response time of the application. As a result, the canary's measured duration exceeds the set threshold, triggering the CloudWatch Alarm to send a notification.

Follow the steps below to continue:

  1. From the Cloud9 terminal, run the command shown below to change directory to the working script folder:

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
    
  2. Confirm that you have the test.json in the folder and it contains the following text:

    {"Name":"Test User","Text":"This Message is a Test!"}
    
  3. Go to the CloudFormation console and take note of the OutputApplicationEndpoint value under the Outputs tab of the walab-ops-sample-application stack. This is the DNS endpoint of the Application Load Balancer.

    Section3 Succces Screenshot

  4. Make sure you have tested the application previously, so that ALBEndpoint is still set in your terminal session. If so, execute the command below:

    bash simulate_request.sh $ALBEndpoint
    

    This script uses Apache Benchmark to send 60,000,000 requests, 3,000 concurrent requests at a time.

    When you run the command you will see the output gradually change from a consistently successful 200 response to include 504 time-out responses.

    The requests generated by the script are overwhelming the application API and result in occasional timeouts by your load balancer.

    Keep the command running in the background as you proceed through the lab.

    Section3 Succces Screenshot

    Section3 Failure Screenshot
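The contents of simulate_request.sh are not reproduced in this document. As a rough sketch, assuming the script wraps the Apache Benchmark ab command and posts test.json to the /encrypt path (both assumptions), the invocation could be assembled as follows. The -n, -c, -p and -T flags are standard ab options for total requests, concurrency, POST body file and content type; build_ab_command is a hypothetical helper.

```python
import shlex


def build_ab_command(endpoint, total_requests=60_000_000, concurrency=3000,
                     payload_file="test.json", path="/encrypt"):
    """Assemble an ab command line; the target path is an assumption."""
    return [
        "ab",
        "-n", str(total_requests),   # total number of requests to send
        "-c", str(concurrency),      # requests in flight at any one time
        "-p", payload_file,          # file containing the POST body
        "-T", "application/json",    # Content-Type header for the POST body
        f"http://{endpoint}{path}",
    ]


# "ApplicationEndpoint" is a placeholder for the load balancer DNS name.
command = build_ab_command("ApplicationEndpoint")
print(shlex.join(command))
```

Lowering total_requests and concurrency is a simple way to experiment with how much load is needed before 504 responses start to appear.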

2.1 Observing the alarm being triggered.

  1. After approximately 6 minutes, you will see an alarm which is triggered as a response to the generated activity. This will trigger an email indicating that the CloudWatch alarm has been triggered.

    Section3 Email

  2. Check and confirm the alarm by going to the CloudWatch console.

  3. Click on the Alarms section on the left menu.

  4. Click on the alarm called mysecretword-canary-duration-alarm, which should be in an alarm state.

    Section3 Failure Screenshot

  5. Click on the alarm to display the CloudWatch metrics that the alarm is based on.

  6. The alarm is based on the Duration metric data emitted by the mysecretword-canary CloudWatch synthetic canary monitor. The Duration metric measures how long it takes for the canary requests to receive a response from the application.

  7. The alarm is triggered whenever the value of the Duration metric is above 6 seconds within a 1 minute duration. The latest threshold will be 5000 for 3 datapoints within 6 minutes.

    Section3 Failure Screenshot

  8. On the left menu, click on Application monitoring and then Synthetics Canaries, and locate the canary monitor named mysecretword-canary.

    Section3 Canary

  9. Click on the canary and then select the Configuration tab.

  10. From here you will see the canary configuration and a snippet of the canary script.

  11. In the canary script section, scroll down to the section that contains let requestOptionStep1 as shown in the screenshot below. This is the configuration that controls the destination of the request (hostname, path and payload body).

    Section3 Canary

  12. Click on the Monitoring tab.

  13. From here you will see the visualization of the metrics that the canary monitor generates.

  14. Locate the 'Duration' metric that is being used to trigger the CloudWatch alarm.

  15. You will see the average duration value of the canary request representing the time to complete. A value above 6000ms signifies that the request has taken more than 6 seconds to receive a response from the application, indicating a performance issue in the API.

    Section3 Canary
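To make the alarm behaviour described above more concrete, here is a small sketch of an "M out of N datapoints" evaluation. It assumes one Duration datapoint per minute, a 6000 ms threshold, and 3 breaching datapoints within the last 6, which is one reading of the description in this section. The real alarm settings live in the lab's CloudFormation template and may differ; is_in_alarm and the sample values are illustrative only.

```python
def is_in_alarm(durations_ms, threshold_ms=6000, datapoints_to_alarm=3,
                evaluation_periods=6):
    """Return True when enough of the most recent datapoints breach the threshold."""
    recent = durations_ms[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold_ms)
    return breaching >= datapoints_to_alarm


# Hypothetical canary durations in milliseconds, one per minute.
healthy = [850, 900, 1200, 950, 1100, 1000]
degraded = [900, 2500, 6400, 7100, 5900, 8200]

print(is_in_alarm(healthy))   # no datapoint above 6000 ms
print(is_in_alarm(degraded))  # three datapoints above 6000 ms
```

Requiring several breaching datapoints rather than one keeps a single slow response from paging the operator.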

You have now completed the second section of the lab.

You should still have simulate_request.sh running in the background, simulating a large influx of traffic to your API. This causes the application to respond slowly and time out periodically. The CloudWatch Alarm will be triggered, and performance issue notifications will be sent to your System Operator to prompt them into action.

This concludes Section 2 of this lab. In the next section, we will build an automated playbook to assist investigation of the issue.

Step 3. Build and Run an Investigative Playbook

The efficiency of issue resolution within an Operations team is directly linked to their tenure and experience. Where an Operator has prior knowledge of a particular issue, they will have a head start in reaching resolution, because they already understand the logs and metrics that were used in previous situations. Whilst this constitutes value to an Operations group, it also represents a single point of failure and a scalability challenge.

This is where playbooks become important. Playbooks are a documented set of predefined steps, which are run to identify an issue. The result of each step can be used to either call more steps to run, or alternatively to trigger manual intervention.

Automating playbook activities wherever possible is critical to reducing the time to respond to an incident.

The AWS Cloud offers multiple services you can use to build an automated playbook, one of which is AWS Systems Manager.

AWS Systems Manager offers an automation document capability (known within Systems Manager as runbooks), which allows for the creation of a series of executable steps to orchestrate your investigation and remediation. AWS Systems Manager Automation Documents allow a user to run custom scripts, call AWS service APIs, or even run remote commands on cloud or on-premise compute instances.

In this section, you will focus on creating an automated playbook to assist your investigation as a Systems Operator.

Action items in this section:

  1. You will build a playbook to gather information about the workload and query the relevant metrics and logs.
  2. You will run the automation document to investigate your issue.

3.0 Prepare Automation Document IAM Role

The Systems Manager Automation Document you are building will require assumed permissions to run the investigation and remediation steps. You will need to create the IAM role that will assume the permissions to perform the playbook activities. To simplify the deployment process, a CloudFormation template has been provided that you can deploy via the console or AWS CLI. Please choose one of the two following deployment steps:

CloudFormation Console deployment steps
  1. Download the template here.
  2. Follow this guide for information on how to deploy the CloudFormation template.
  3. Use waopslab-automation-role as the Stack Name, as this is referenced by other stacks later in the lab.
CloudFormation CLI deployment steps (preferred)

Note: To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials.

  1. From the Cloud9 terminal, change to the appropriate folder as shown:

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates

  2. Then run the command listed below:

    aws cloudformation create-stack --stack-name waopslab-automation-role \
                                    --capabilities CAPABILITY_NAMED_IAM \
                                    --template-body file://automation_role.yml

  3. Confirm that the stack has installed correctly. You can do this by running the describe-stacks command:

    aws cloudformation describe-stacks --stack-name waopslab-automation-role

    Locate the StackStatus and confirm it is set to CREATE_COMPLETE.
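If you prefer to check the status programmatically rather than reading the full describe-stacks output, the sketch below pulls StackStatus out of a response. It works on the response structure returned by describe-stacks (the same shape that the boto3 CloudFormation client returns); stack_status is a hypothetical helper and the sample response is trimmed and made up for illustration.

```python
def stack_status(describe_stacks_response, stack_name):
    """Return the StackStatus for the named stack, or None if it is not listed."""
    for stack in describe_stacks_response.get("Stacks", []):
        if stack.get("StackName") == stack_name:
            return stack.get("StackStatus")
    return None


# Trimmed, hypothetical describe-stacks response.
sample_response = {
    "Stacks": [
        {
            "StackName": "waopslab-automation-role",
            "StackStatus": "CREATE_COMPLETE",
        }
    ]
}

status = stack_status(sample_response, "waopslab-automation-role")
print(status)
```

The same helper can be reused for the other stacks deployed later in this lab by changing the stack name.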

  1. Once you have deployed the CloudFormation stack above, go to the IAM Console.

  2. On the side menu, click on Roles and locate the IAM role named AutomationRole.

  3. Take note of the ARN of the role, as we will need it later in the lab.

Section3

3.1 Building the "Gather-Resources" Playbook.

In preparation for the investigation, you need to know all services and resources associated with the issue. The email notification does not contain any information about these resources. To gather this necessary information, we will build a playbook that acquires all related resources using our CloudWatch alarm ARN as a reference.

Codifying your playbook with AWS Systems Manager allows for maximum code reusability. This reduces the overhead of rewriting code that has identical objectives.

Section4

Note: Follow these steps to build and run the playbook. Select a guide to deploy using either the AWS console, the AWS CLI, or a CloudFormation template deployment.

CloudFormation Console deployment steps

Download the template here.

If you decide to deploy the stack from the console, ensure that you follow the requirements and steps below:

  1. Follow this guide for information on how to deploy the CloudFormation template.
  2. Use waopslab-playbook-gather-resources as the Stack Name, as this is referenced by other stacks later in the lab.
CloudFormation CLI deployment steps (preferred)

Note: To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials.

  1. From the Cloud9 terminal, run the command below to change to the working folder:

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates

  2. Then run the command below, replacing AutomationRoleArn with the ARN of the AutomationRole you noted in step 3.0.

    aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
                                    --template-body file://playbook_gather_resources.yml

    Example:

    aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
                                    --template-body file://playbook_gather_resources.yml

    Note: If you use named profiles with the AWS CLI, adjust the command accordingly.

  3. Confirm that the stack has installed correctly. You can do this by running the describe-stacks command below. Locate the StackStatus and confirm it is set to CREATE_COMPLETE.

    aws cloudformation describe-stacks --stack-name waopslab-playbook-gather-resources
Console step-by-step
  1. Go to the AWS Systems Manager console. Click Documents under Shared Resources on the left menu. Then click Create Automation as shown in the screenshot below:

Section4

  1. Enter Playbook-Gather-Resources in the Name field and copy the notes shown below into the Document description field.
# What does this **playbook** do?

Query the CloudWatch Synthetics Canary and look for all resources related to the application based on its Application Tag. This **playbook** takes an input of the CloudWatch Alarm ARN triggered by the canary.

Note : Application resources must be deployed using CloudFormation and properly tagged accordingly.

## Actions taken in this playbook.
1. Describe CloudWatch Alarm ARN and identify the Canary resource.
2. Describe the Canary resource to gather the value of 'Application' tag
3. Gather CloudFormation Stack with the same value of 'Application' tag.
4. List all resources in CloudFormation Stack.
5. Parse list of resources into String Output.
  1. In the Assume role field, enter the IAM role ARN we created in the previous section 3.0 Prepare Automation Document IAM Role.

Section4

  1. Expand the Input Parameters section and enter AlarmARN as the Parameter name. Set the type as String and Required as Yes. This will define a Parameter within our playbook, so that the value of the CloudWatch Alarm ARN can be passed into the playbook to run the action.

Section4

  1. Under the Step 1 section, specify Gather_Resources_For_Alarm as the Step name and select aws:executeScript as the Action type.

  2. Under Inputs, set Python3.6 as the Runtime and specify script_handler as the Handler.

  3. Paste the Python code below into the Script section.

Section4

  import json

  import boto3


  def arn_deconstruct(arn):
    # Alarm ARN format: arn:aws:cloudwatch:<region>:<account-id>:alarm:<alarm-name>
    # Split at most 6 times so that an alarm name containing ':' stays intact.
    arnlist = arn.split(":", 6)
    return {
      "Service": arnlist[2],
      "Region": arnlist[3],
      "AccountId": arnlist[4],
      "Type": arnlist[5],
      "Name": arnlist[6]
    }


  def locate_alarm_source(alarm):
    # Identify the resource (here, a Synthetics canary) that emits the alarm's metric.
    cwclient = boto3.client('cloudwatch', region_name=alarm['Region'])
    alarm_source = {}
    alarm_detail = cwclient.describe_alarms(AlarmNames=[alarm['Name']])

    if len(alarm_detail['MetricAlarms']) > 0:
      metric_alarm = alarm_detail['MetricAlarms'][0]
      namespace = metric_alarm['Namespace']

      # Only alarms in the CloudWatch Synthetics namespace are handled
      if namespace == 'CloudWatchSynthetics':
        for dimension in metric_alarm.get('Dimensions', []):
          if dimension['Name'] == 'CanaryName':
            alarm_source['Type'] = namespace
            alarm_source['Name'] = dimension['Value']
            alarm_source['Region'] = alarm['Region']
            alarm_source['AccountId'] = alarm['AccountId']

    # Returned outside the 'if' so that callers always receive a dict
    return alarm_source


  def locate_canary_endpoint(canaryname, region):
    # Read the 'TargetEndpoint' tag from the canary, if present.
    result = None
    synclient = boto3.client('synthetics', region_name=region)
    res = synclient.get_canary(Name=canaryname)
    canary = res['Canary']
    if 'Tags' in canary:
      if 'TargetEndpoint' in canary['Tags']:
        result = canary['Tags']['TargetEndpoint']
    return result


  def locate_app_tag_value(resource):
    # Read the 'Application' tag from the canary that raised the alarm.
    result = None
    if resource.get('Type') == 'CloudWatchSynthetics':
      synclient = boto3.client('synthetics', region_name=resource['Region'])
      res = synclient.get_canary(Name=resource['Name'])
      canary = res['Canary']
      if 'Tags' in canary:
        if 'Application' in canary['Tags']:
          result = canary['Tags']['Application']
    return result


  def locate_app_resources_by_tag(tag, region):
    result = None

    # Search CloudFormation stacks for the matching 'Application' tag
    cfnclient = boto3.client('cloudformation', region_name=region)
    stacks = cfnclient.list_stacks(StackStatusFilter=['CREATE_COMPLETE','ROLLBACK_COMPLETE','UPDATE_COMPLETE','UPDATE_ROLLBACK_COMPLETE','IMPORT_COMPLETE','IMPORT_ROLLBACK_COMPLETE'])
    for stack in stacks['StackSummaries']:
      app_resources_list = []
      stack_name = stack['StackName']
      stack_details = cfnclient.describe_stacks(StackName=stack_name)
      stack_info = stack_details['Stacks'][0]
      if 'Tags' in stack_info:
        for t in stack_info['Tags']:
          if t['Key'] == 'Application' and t['Value'] == tag:
            app_stack_name = stack_info['StackName']
            app_resources = cfnclient.describe_stack_resources(StackName=app_stack_name)
            for resource in app_resources['StackResources']:
              app_resources_list.append(
                {
                  'PhysicalResourceId': resource['PhysicalResourceId'],
                  'Type': resource['ResourceType']
                }
              )
            result = app_resources_list

    return result


  def script_handler(event, context):
    result = {}
    arn = event['CloudWatchAlarmARN']
    alarm = arn_deconstruct(arn)

    # Locate the tag from the CloudWatch Alarm
    alarm_source = locate_alarm_source(alarm)  # Identify alarm source
    tag_value = locate_app_tag_value(alarm_source)  # Identify tag from source

    if alarm_source.get('Type') == 'CloudWatchSynthetics':
      endpoint = locate_canary_endpoint(alarm_source['Name'], alarm_source['Region'])
      result['CanaryEndpoint'] = endpoint

    # Locate the CloudFormation stack with the same tag
    resources = locate_app_resources_by_tag(tag_value, alarm['Region'])
    result['ApplicationStackResources'] = json.dumps(resources)

    return result
  1. Under Additional inputs specify the input value to the step, passing in the parameter we created previously. To do this, specify below values:

    • InputPayload as the Input name
    • CloudWatchAlarmARN: '{{AlarmARN}}' as the Input Value.
  2. Under Outputs specify below values:

    • Resources as Name
    • $.Payload.ApplicationStackResources as Selector
    • String as Type
  3. Once your settings match the screenshot below, click on Create Automation

Section4
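For reference, the console settings above correspond roughly to an automation document like the one sketched below, expressed as a Python dictionary that can be serialized to JSON. This is an approximation assembled from the values entered in the steps above, not the exact document the console generates; build_gather_resources_document is a hypothetical helper, the description strings are placeholders, and script_source stands in for the Python code pasted into the Script section.

```python
import json


def build_gather_resources_document(assume_role_arn, script_source):
    """Assemble an approximate Automation document for the Gather-Resources playbook."""
    return {
        "schemaVersion": "0.3",  # schema version used by Automation documents
        "description": "Gather application resources for a CloudWatch alarm.",
        "assumeRole": assume_role_arn,
        "parameters": {
            "AlarmARN": {
                "type": "String",
                "description": "ARN of the CloudWatch alarm triggered by the canary.",
            }
        },
        "mainSteps": [
            {
                "name": "Gather_Resources_For_Alarm",
                "action": "aws:executeScript",
                "inputs": {
                    "Runtime": "python3.6",
                    "Handler": "script_handler",
                    "InputPayload": {"CloudWatchAlarmARN": "{{AlarmARN}}"},
                    "Script": script_source,
                },
                "outputs": [
                    {
                        "Name": "Resources",
                        "Selector": "$.Payload.ApplicationStackResources",
                        "Type": "String",
                    }
                ],
            }
        ],
    }


# The role ARN uses the placeholder account id from the examples in this lab.
document = build_gather_resources_document(
    "arn:aws:iam::000000000000:role/AutomationRole",
    "def script_handler(event, context):\n    return {}\n",
)
print(json.dumps(document, indent=2))
```

Keeping the document as data like this makes it easier to review in version control than a sequence of console clicks.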

Once the automation document is created, you can now give it a test.

  1. You can then find the newly created document under the Owned by me tab of the Document section in Systems Manager Console.

Section3

  1. Click on the playbook called Playbook-Gather-Resources and click on Execute Automation to run your playbook.
  2. Paste in the CloudWatch Alarm ARN ( You can find this ARN in the email notification in section 2.1 Observing the alarm being triggered ) and click on Execute to test the playbook.

Section3
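The same run can be started from code. The sketch below wraps the Systems Manager start_automation_execution call, which takes a DocumentName and a Parameters mapping of parameter names to lists of strings. To keep the example runnable without AWS credentials, it is exercised here with a stand-in client; in practice you would pass boto3.client("ssm"). The function name, the FakeSsmClient class, the region and account id in the ARN, and the execution id are all illustrative assumptions.

```python
def start_gather_resources(ssm_client, alarm_arn,
                           document_name="Playbook-Gather-Resources"):
    """Start the playbook and return the automation execution id."""
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        Parameters={"AlarmARN": [alarm_arn]},  # parameter values are lists of strings
    )
    return response["AutomationExecutionId"]


class FakeSsmClient:
    """Stand-in for boto3.client('ssm') that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def start_automation_execution(self, **kwargs):
        self.calls.append(kwargs)
        return {"AutomationExecutionId": "example-execution-id"}


fake_client = FakeSsmClient()
execution_id = start_gather_resources(
    fake_client,
    "arn:aws:cloudwatch:us-west-2:000000000000:alarm:mysecretword-canary-duration-alarm",
)
print(execution_id)
```

Passing the client in as an argument keeps the function easy to test and lets the caller decide which region and credentials to use.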

  1. Once the playbook run has completed successfully, click on the Step Id to see the final message and output of the step. You should see output listing all the resources of the application.

Section3

  1. Copy the Resources list output from the section highlighted in the screenshot below. This list consists of all the resources defined in the CloudFormation stack related to our application. This information includes the Elastic Load Balancer, ECS and RDS resource IDs that we can now use to further our investigation of the underlying issue.

Section3

  1. Paste the output into a temporary location, such as a text editor, for now. You will need this value for the next step.
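The Resources output is a JSON string containing a list of objects with PhysicalResourceId and Type keys, as produced by the script above. The sketch below shows one way to pick out the load balancer, ECS and RDS entries from it. The CloudFormation resource types used for matching are an assumption about how the sample application is defined, and the sample identifiers are invented for illustration.

```python
import json

# Assumed CloudFormation resource types for the services of interest.
SERVICE_TYPES = {
    "ELB": "AWS::ElasticLoadBalancingV2::LoadBalancer",
    "ECS": "AWS::ECS::Service",
    "RDS": "AWS::RDS::DBInstance",
}


def resources_by_service(resources_json):
    """Group physical resource ids from the playbook output by service."""
    resources = json.loads(resources_json)
    grouped = {service: [] for service in SERVICE_TYPES}
    for resource in resources:
        for service, resource_type in SERVICE_TYPES.items():
            if resource["Type"] == resource_type:
                grouped[service].append(resource["PhysicalResourceId"])
    return grouped


# Hypothetical playbook output, shortened for readability.
sample_output = json.dumps([
    {"PhysicalResourceId": "example-alb-id", "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer"},
    {"PhysicalResourceId": "example-ecs-service", "Type": "AWS::ECS::Service"},
    {"PhysicalResourceId": "example-db-instance", "Type": "AWS::RDS::DBInstance"},
    {"PhysicalResourceId": "example-log-group", "Type": "AWS::Logs::LogGroup"},
])

grouped = resources_by_service(sample_output)
print(grouped)
```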

3.2 Building the "Investigate-Application-Resources" Playbook.

In the previous step, you have created a playbook that finds all related AWS resources in the application. In this step you will create a playbook that will interrogate resources, capture recent metrics and logs, to look for insights and better understand the root cause of the issue.

In practice, there can be various actions that the playbook can take to investigate, depending on the scenario presented by the issue. The purpose of this lab is to showcase how you can use a playbook to aid investigation, rather than advise on a specific action path.

Therefore, in this lab we will assume an example scenario. The playbook will look at the metrics and logs of the ELB, ECS and RDS services in the resource list. The playbook will then highlight the metrics and logs that are considered outside normal operational thresholds.
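As a simplified illustration of the "highlight what is outside normal thresholds" idea, the sketch below compares observed metric values with per-metric limits and returns only the ones that exceed them. The metric names are real Application Load Balancer metrics, but the observed values and the thresholds are invented for this example, and the logic used by the actual playbook may differ.

```python
def flag_outliers(metrics, thresholds):
    """Return the metrics whose observed value exceeds the configured threshold."""
    flagged = {}
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            flagged[name] = {"Value": value, "Threshold": limit}
    return flagged


# Hypothetical observations; TargetResponseTime is expressed in seconds.
observed = {
    "TargetResponseTime": 7.2,
    "HTTPCode_ELB_504_Count": 125,
    "UnHealthyHostCount": 0,
}
limits = {
    "TargetResponseTime": 6.0,
    "HTTPCode_ELB_504_Count": 0,
    "UnHealthyHostCount": 0,
}

flagged = flag_outliers(observed, limits)
print(sorted(flagged))
```

Metrics without a configured threshold are ignored rather than flagged, so the output stays focused on signals an operator has decided matter.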

Section3

Please follow the below instructions to build this playbook:

Note: We will deploy this playbook via a CloudFormation template to simplify deployment. Please follow the steps below to deploy the CloudFormation template via the CLI or the console.

CloudFormation Console deployment steps

Download the template here.

If you decide to deploy the stack from the console, ensure that you follow the requirements and steps below:

  1. Please follow this guide for information on how to deploy the CloudFormation template.
  2. Use waopslab-playbook-investigate-resources as the Stack Name, as this is referenced by other stacks later in the lab.
Click here for CloudFormation CLI deployment step (Preferred way)
  1. From the Cloud9 terminal, change to the required folder as shown:
cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
  1. Run the command below, replacing the 'AutomationRoleArn' with the Arn of AutomationRole you took note in previous step 3.0 Prepare Automation Document IAM Role.
aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \
                                --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
                                --template-body file://playbook_investigate_application_resources.yml 

Example:

aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \
                                --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-playbook-role \
                                --template-body file://playbook_investigate_application_resources.yml 
  3. Confirm that the stack was created correctly. You can do this by running the describe-stacks command as follows:
aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-resources
  4. Locate the StackStatus and confirm it is set to CREATE_COMPLETE.
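If you prefer to check the stack status from a script instead of reading the describe-stacks output by eye, the following is a minimal sketch using boto3. The helper names (stack_status, wait_for_stack) and the polling interval are assumptions made for illustration and are not part of the lab code.

```python
import time


def stack_status(describe_stacks_response):
    """Return the StackStatus of the first stack in a describe_stacks response."""
    return describe_stacks_response["Stacks"][0]["StackStatus"]


def wait_for_stack(stack_name, poll_seconds=10, max_polls=60):
    """Poll CloudFormation until the stack leaves an *_IN_PROGRESS state.

    Hypothetical helper: the polling values are arbitrary defaults.
    """
    import boto3  # imported lazily so stack_status() works without AWS access

    cfn = boto3.client("cloudformation")
    for _ in range(max_polls):
        status = stack_status(cfn.describe_stacks(StackName=stack_name))
        if not status.endswith("_IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} did not settle in time")
```

Calling wait_for_stack("waopslab-playbook-investigate-resources") should return CREATE_COMPLETE once the deployment succeeds. Any other final status means the stack events are worth inspecting.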

When the document is created, you can go ahead and run a quick test.

You can find the newly created document under the Owned by me tab of the Document resource in the Systems Manager console.

  1. Click on the playbook called Playbook-Investigate-Application-Resources and click on Execute Automation to run our playbook.

  2. Under Resources, paste in the resources list you noted from the output of the previous playbook (refer to section 3.1 Building the "Gather-Resources" Playbook), then click on Execute.

    Section3

  3. Under Executed Steps you should be able to see each of the steps in the playbook. If you view the content of the document you will be able to see the code and find out what each step does.

    Section3

    For simplicity, we have created a list of output and description for each step. Expand the list below to view.

    Output list

    Step Name: Gather_ELB_Statistics
    Description: Go through the resource list and locate the ELB. Query data from the ELB CloudWatch metrics, looking at metrics from the last 60 minutes.
    Output list:
      • TargetResponseTime (Average)
      • HTTPCode_Target_2XX_Count (Sum)
      • HTTPCode_Target_3XX_Count (Sum)
      • HTTPCode_Target_4XX_Count (Sum)
      • HTTPCode_Target_5XX_Count (Sum)
      • TargetConnectionErrorCount (Sum)
      • UnHealthyHostCount (Average)
      • ActiveConnectionCount (Sum)
      • HTTPCode_ELB_3XX_Count (Sum)
      • HTTPCode_ELB_4XX_Count (Sum)
      • HTTPCode_ELB_5XX_Count (Sum)
      • HTTPCode_ELB_500_Count (Sum)
      • HTTPCode_ELB_502_Count (Sum)
      • HTTPCode_ELB_503_Count (Sum)
      • HTTPCode_ELB_504_Count (Sum)

    Step Name: Gather_RDS_Statistics
    Description: Go through the resource list and locate the RDS resource. Query data from the RDS CloudWatch metrics, looking at metrics from the last 60 minutes.
    Output list:
      • BinLogDiskUsage (Sum)
      • BurstBalance (Average)
      • CPUUtilization (Average)
      • CPUCreditUsage (Sum)
      • CPUCreditBalance (Maximum)
      • DatabaseConnections (Sum)
      • DiskQueueDepth (Maximum)
      • FailedSQLServerAgentJobsCount (Average)
      • FreeableMemory (Maximum)
      • MaximumUsedTransactionIDs (Maximum)
      • NetworkReceiveThroughput (Average)
      • OldestReplicationSlotLag (Average)
      • ReadIOPS (Average)
      • ReadLatency (Average)
      • ReadThroughput (Maximum)
      • ReplicaLag (Average)
      • ReplicationSlotDiskUsage (Maximum)
      • SwapUsage (Maximum)
      • TransactionLogsDiskUsage (Maximum)
      • TransactionLogsGeneration (Average)
      • WriteIOPS (Average)
      • WriteLatency (Average)
      • WriteThroughput (Average)

    Step Name: Gather_ECS_Statistics
    Description: Go through the resource list and locate the ECS resource. Query data from the ECS CloudWatch metrics, looking at metrics from the last 6 minutes.
    Output list:
      • CPUUtilization (Maximum)
      • MemoryUtilization (Maximum)

    Step Name: Gather_ECS_Error_Logs
    Description: Go through the resource list and locate the ECS Service. Search in CloudWatch logs for any Error occurrence.

    Step Name: Gather_ECS_Config
    Description: Go through the resource list and locate the ECS resource. Describe the ECS service configuration.

    Step Name: Gather_RDS_Config
    Description: Go through the resource list and locate the RDS resource. Describe RDS Instance Config & Parameters.

    Step Name: Inspect_Playbook_Results
    Description: Go through the output of the above steps, inspect the results and check whether they are above the threshold.
    Output list:
      • TargetResponseTime = 5 (ELB)
      • TargetConnectionErrorCount = 0 (ELB)
      • UnHealthyHostCount = 0 (ELB)
      • ELB5XXCount = 0 (ELB)
      • ELB500Count = 0 (ELB)
      • ELB502Count = 0 (ELB)
      • ELB503Count = 0 (ELB)
      • ELB504Count = 0 (ELB)
      • Target4XXCount = 0 (ELB)
      • Target5XXCount = 0 (ELB)
      • CPUUtilization = 80 (ECS)
  4. Wait until all steps are completed successfully.
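To make the Inspect_Playbook_Results step more concrete, here is a minimal sketch of a threshold check. It is not the code inside the playbook document. The threshold values are copied from the output list above, while the function name find_breaches and the shape of the observed dictionary are assumptions for illustration.

```python
# Thresholds copied from the Inspect_Playbook_Results output list.
THRESHOLDS = {
    "ELB": {
        "TargetResponseTime": 5,
        "TargetConnectionErrorCount": 0,
        "UnHealthyHostCount": 0,
        "ELB5XXCount": 0,
        "ELB500Count": 0,
        "ELB502Count": 0,
        "ELB503Count": 0,
        "ELB504Count": 0,
        "Target4XXCount": 0,
        "Target5XXCount": 0,
    },
    "ECS": {"CPUUtilization": 80},
}


def find_breaches(observed):
    """Return {service: {metric: value}} for every value above its threshold.

    `observed` is a hypothetical structure: {service: {metric: value}}.
    Metrics without a configured threshold are ignored.
    """
    breaches = {}
    for service, metrics in observed.items():
        for name, value in metrics.items():
            limit = THRESHOLDS.get(service, {}).get(name)
            if limit is not None and value > limit:
                breaches.setdefault(service, {})[name] = value
    return breaches
```

For example, passing an ELB TargetResponseTime of 7.2 and an ECS CPUUtilization of 95 (made-up numbers) would flag both metrics, while a Target4XXCount of 0 would not be reported.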

3.3 Building the "Investigate-Application-From-Alarm" Playbook.

So far we have 2 separate playbooks. The first playbook gathers the list of resources associated with the application. The second playbook queries the relevant resources and investigates the appropriate logs and metrics.

In this step we will automate our playbooks further by creating a parent playbook that orchestrates the 2 Investigative playbooks. We will add another step to send notification to our Developers and System Owners.

Section4

Follow the instructions below to build the parent Playbook.

Note: Select a step-by-step guide below to build the parent playbook using either the AWS console or a CloudFormation template.

Click here for CloudFormation Console deployment step

Download the template here.

If you decide to deploy the stack from the console, follow these steps:

  1. Please follow this guide for information on how to deploy the CloudFormation template.
  2. Use waopslab-playbook-investigate-application as the Stack Name, as this is referenced by other stacks later in the lab.
  3. In the parameter input screen, enter the ARN of the playbook IAM role (defined in the previous step) under PlaybookIAMRole, and enter your designated email address for playbook notifications under NotificationEmail.
Click here for CloudFormation CLI deployment step (Preferred way)
  1. From the Cloud9 terminal, change to the required folder as shown:
cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
  2. Then run the command below:
aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
                                --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
                                --template-body file://playbook_investigate_application.yml 

Example:

aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
                                --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-playbook-role \
                                --template-body file://playbook_investigate_application.yml 

Note: If you use named profiles with the AWS CLI, adjust the command accordingly (for example, by adding the --profile option).

Confirm that the stack has installed correctly. You can do this by running the describe-stacks command as follows:

aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-application 

Locate the StackStatus and confirm it is set to CREATE_COMPLETE

Click here for Console step-by-step guide
  1. From the AWS Systems Manager console, click on documents as shown below. Once you are there, click on Create Automation

Section4

  1. Next, enter Playbook-Investigate-Application-From-Alarm in the Name field and paste the notes shown below into the Description box. This provides a description of the playbook. Systems Manager supports notes written in markdown, so feel free to format as required.
# What does this **playbook** do?

This **playbook** will run **Playbook-Gather-Resources** to gather Application resources monitored by Canary.

Then subsequently run **Playbook-Investigate-Application-Resources** to Investigate the resources for issues. 

Outputs of the investigation will be sent to SNS Topic Subscriber
  
  1. Under Assume role field, enter in the ARN of the IAM role we created in the previous step.

  2. Under Input Parameters field, enter AlarmARN as the Parameter name. Set the type as String and Required as Yes. This will define a Parameter into our playbook, which allows the value of the CloudWatch Alarm to be passed to the main step that will run the action.

  3. Add another parameter by clicking on the Add a parameter link. Enter SNSTopicARN as the Parameter name. Set the type as String and Required as Yes. This will define another Parameter into our playbook, so that we can send notification to the Owner and Developer.

Section4

  1. Click Add Step and create the first step of aws:executeAutomation Action type with StepName PlaybookGatherAppResourcesCanaryCloudWatchAlarm

  2. Specify Playbook-Gather-Resources as the Document name under Inputs, and under Additional inputs specify RuntimeParameters with {"AlarmARN":'{{AlarmARN}}'} as its value (refer to the screenshot below). This step will run the Gather-Resources playbook which we created previously.

Section4

  1. Once this step is defined, add another step by clicking on Add Step at the bottom of the section.

  2. For this second step, specify the Step name as PlaybookInvestigateAppResourcesELBECSRDS and an action type of aws:executeAutomation.

  3. Specify Playbook-Investigate-Application-Resources as the Document name and RuntimeParameters as Resources: '{{PlaybookGatherAppResourcesCanaryCloudWatchAlarm.Output}}'. This will take the output of the first step and pass it to the second playbook to run the investigation of the associated resources.

Section4

  1. For the last step, take the output investigation from the second step and send that to the SNS topic where our owner, developers and admin are subscribed.

  2. Specify the Step name as AWSPublishSNSNotification and the action type as aws:executeAutomation.

  3. Specify AWS-PublishSNSNotification as the Document name and RuntimeParameters as shown below. This passes the output of the second step, which contains summary data of the investigation, to AWS-PublishSNSNotification, which publishes it to the SNS topic specified in the parameters so that subscribers receive it by email.

TopicArn: '{{SNSTopicARN}}'
Message: '{{ PlaybookInvestigateAppResourcesELBECSRDS.Output }}'

Section4

  1. Our playbook will run investigative tasks and send the result to an SNS topic that our systems administrators and engineers subscribe to. To do this, we need to create an SNS topic for the playbook to send notifications to. Please follow the instructions specified in this link, create a Standard SNS topic and name it PlaybookNotificationSNSTopic.

  2. Once you've created the topic, go ahead and subscribe your email address by following the instructions here.
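The three steps configured in the console guide above can be summarized as a data structure. The sketch below builds an Automation document body (schema version 0.3) as a Python dictionary that mirrors the step names, document names and runtime parameters entered in the console. It is an illustration only: the function name build_parent_playbook is hypothetical, and the document that the lab's CloudFormation template deploys may differ in detail.

```python
import json


def build_parent_playbook(assume_role_arn):
    """Return a dict mirroring the parent playbook steps described above."""

    def run_document(step_name, document_name, runtime_parameters):
        # Every step in this playbook delegates to another Automation document.
        return {
            "name": step_name,
            "action": "aws:executeAutomation",
            "inputs": {
                "DocumentName": document_name,
                "RuntimeParameters": runtime_parameters,
            },
        }

    return {
        "schemaVersion": "0.3",
        "assumeRole": assume_role_arn,
        "parameters": {
            "AlarmARN": {"type": "String"},
            "SNSTopicARN": {"type": "String"},
        },
        "mainSteps": [
            run_document(
                "PlaybookGatherAppResourcesCanaryCloudWatchAlarm",
                "Playbook-Gather-Resources",
                {"AlarmARN": "{{AlarmARN}}"},
            ),
            run_document(
                "PlaybookInvestigateAppResourcesELBECSRDS",
                "Playbook-Investigate-Application-Resources",
                {"Resources": "{{PlaybookGatherAppResourcesCanaryCloudWatchAlarm.Output}}"},
            ),
            run_document(
                "AWSPublishSNSNotification",
                "AWS-PublishSNSNotification",
                {
                    "TopicArn": "{{SNSTopicARN}}",
                    "Message": "{{ PlaybookInvestigateAppResourcesELBECSRDS.Output }}",
                },
            ),
        ],
    }


if __name__ == "__main__":
    # The role ARN below is a placeholder, not a real role.
    print(json.dumps(build_parent_playbook("arn:aws:iam::000000000000:role/xxxx-playbook-role"), indent=2))
```

Reading the structure top to bottom shows the data flow: the alarm ARN goes into the first step, its output feeds the second step, and the second step's output becomes the notification message.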

3.4 Executing investigation Playbook.

You can now run the playbook to discover the result of the investigation.

  1. Go to the Output section of the deployed CloudFormation stack walab-ops-sample-application and take note of below output values.

  2. Go to the Systems Manager Automation document we just created in the previous step, Playbook-Investigate-Application-From-Alarm.

  3. Run the playbook, passing the alarm ARN as the AlarmARN input value, along with the SNSTopicARN.

    • You can get the AlarmARN from the email that you received from the CloudWatch Alarm, as described in step 3.1 Building the "Gather-Resources" Playbook.
    • To get the value for SNSTopicARN, go to the CloudFormation console output of the walab-ops-sample-application stack and copy the value of OutputSystemEventTopicArn.
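The same execution can also be started programmatically. The sketch below assumes the playbook parameters are named AlarmARN and SNSTopicARN, as defined in section 3.3. The helper names are hypothetical, and the ARNs you pass in come from your own account.

```python
def build_execution_request(alarm_arn, sns_topic_arn):
    """Build keyword arguments for the SSM StartAutomationExecution API."""
    return {
        "DocumentName": "Playbook-Investigate-Application-From-Alarm",
        # Automation parameter values are always passed as lists of strings.
        "Parameters": {
            "AlarmARN": [alarm_arn],
            "SNSTopicARN": [sns_topic_arn],
        },
    }


def start_playbook(alarm_arn, sns_topic_arn):
    """Start the playbook and return the automation execution ID."""
    import boto3  # imported lazily so the builder above works without AWS access

    ssm = boto3.client("ssm")
    response = ssm.start_automation_execution(
        **build_execution_request(alarm_arn, sns_topic_arn)
    )
    return response["AutomationExecutionId"]
```

The returned execution ID can be used to look up the run in the Systems Manager Automation console.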

Section3

  1. When the playbook completes, an email will be sent to you containing a summary of the investigation completed by the playbook, as shown.

Section3

  1. Copy the message section and paste it into a JSON linter such as jsonlint.com to make its structure easier to read. The result from the playbook investigation might vary slightly, but the overall findings should be similar to the screenshot below.
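If you would rather not paste the message into a website, the Python standard library can do the same formatting locally. This is a small sketch. The function name pretty is arbitrary, and it assumes the message body you copied is valid JSON.

```python
import json


def pretty(message):
    """Re-indent a JSON string so nested findings are easier to read."""
    return json.dumps(json.loads(message), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample; paste the message section of your email here instead.
    print(pretty('{"ECS": {"CPUUtilization": 95}, "ELB": {"ELB504Count": 120}}'))
```

If the message is not valid JSON, json.loads raises an error that points at the offending position, which is often enough to spot a truncated copy.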

Section3

  1. From the generated report you should see a large number of ELB504Count errors and a high TargetResponseTime from the load balancer. This explains the delay we are seeing from our canary alarm.

    If you then look at the ECS summary, you will notice that there is only 1 ECS TaskRunningCount, with a relatively high CPUUtilization average. The script calculates the average of the maximum values on the ECS service over the last 6 minutes. If you do not see a CPUUtilization value in the JSON, you can confirm this by going to the ECS service console and clicking on the Metrics tab.

    Section3

    Therefore, it is likely that the immediate cause of the latency is a resource constraint at the application API level running in ECS. If we increase the number of tasks in the ECS service, the application should be relieved of some of its CPU utilization pressure.

    With all of this information provided by our playbook findings, we should be able to determine the next course of action to remediate the issue.

This concludes Section 3 of this lab, click on the link below to move on to the next section to build the remediation runbook.

Step 4. Build and Run Remediation Runbook

In contrast to playbooks, runbooks are procedures that accomplish specific tasks to achieve an outcome. In the previous section, you have identified an issue with CPU utilization, which occurs because there is only 1 ECS task running in the cluster. This could be remediated through the use of auto-scaling.

However, implementing this requires preparation and planning. When an incident occurs, operations teams should have a defined escalation path for the issue. Depending on the criticality of the system they should also be equipped to do what is necessary to ensure system availability is protected while the escalation occurs.

In this section, you will build an automated runbook to remediate the CPU utilization issue by increasing the number of tasks in the ECS cluster. Your automated runbook will notify the owner of the workload and give them the option to intercept the scale-up action should they choose not to proceed.

Action items in this section:

  1. You will build a runbook to scale up the ECS cluster, with an approval mechanism.
  2. You will execute the runbook and observe the recovery of your application.

4.0 Building the "Approval-Gate" Runbooks.

In this section you will build a reusable runbook, which gives the owner the ability to deny or approve remediation actions within a defined waiting period. If the wait time is exceeded and a decision has not been made, the runbook will automatically approve the action as shown.

Section5

We will achieve this through the use of a Systems Manager Automation document, which we will build using the following steps:

  1. The Approval-Gate runbook executes a separate document called the Approval-Timer.

  2. The Approval-Timer runbook will then wait for a preconfigured amount of time and send an approve signal to the Approval-Gate runbook.

  3. Meanwhile, the Approval-Gate runbook sends an approval request to the workload owner via a designated SNS topic.

  • If the owner chooses to approve, the Approval-Gate runbook will continue to the next step.
  • If the owner declines the approval, the runbook will fail, blocking further steps.
  • However, if the owner does not respond within the preconfigured wait time, the Approval-Timer runbook will automatically approve the request.
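The three outcomes above can be captured in a few lines of plain Python. This is only a model of the decision logic, not code that runs in Systems Manager. The function name gate_outcome and the use of seconds for timing are assumptions for illustration.

```python
def gate_outcome(timer_seconds, owner_decision=None, owner_response_seconds=None):
    """Model the Approval-Gate result.

    owner_decision: "Approve", "Deny", or None if the owner never responds.
    owner_response_seconds: how long the owner took to respond, if they did.
    Returns "continue" when the next step may run, or "fail" when it is blocked.
    """
    responded_in_time = (
        owner_decision is not None
        and owner_response_seconds is not None
        and owner_response_seconds < timer_seconds
    )
    if responded_in_time:
        return "continue" if owner_decision == "Approve" else "fail"
    # No decision before the timer lapsed: the timer runbook approves automatically.
    return "continue"
```

Note the consequence of this design: silence counts as consent. The only way to block the remediation is an explicit deny before the timer lapses.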

Follow the instructions below to build the runbook:

Note: Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.

Click here for Console step by step
  1. Go to the AWS Systems Manager console. Click Documents under Shared Resources on the left menu. Then click Create Automation as shown in the screenshot below:

    Section5

  2. Enter Approval-Timer in the Name field and copy the notes shown below into the Document description field.

      # What does this automation do?
    
      Automatically trigger 'Approval' Signal to an execution, after a timer lapse
    
      ## Steps 
    
      1. Sleep for X time specified on the parameter input
      2. Automatically signal 'Approval' to the Execution specified in parameter input
    
  3. In the Assume role field, enter the IAM role ARN we created in the previous section 3.0 Prepare Automation Document IAM Role.

  4. Expand the Input Parameters section and enter Timer as the Parameter name. Set the type as String and Required as Yes.

  5. Then add another parameter this time called AutomationExecutionId, of type String and set Required to Yes. Once you are done, your configuration should look like the screenshot below.

    Section4

  6. In the Step 1 section, specify SleepTimer as the Step name and select aws:sleep as the Action type.

  7. Expand the Inputs section of the step, and specify {{Timer}} as the Duration

    Section4

  8. Click on Add step, specify ApproveExecution as the Step name and select aws:executeAwsApi as the Action type.

  9. Expand the Inputs section of the step, and specify ssm in the Service field and SendAutomationSignal in the API field.

  10. Under Additional inputs specify below values.

    • Approve as the SignalType
    • {{AutomationExecutionId}} as the AutomationExecutionId.

    Once you are done, your configuration should look like the screenshot below.

    Section5

    Section5

  11. Click on Create automation once you are done.
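For reference, the two steps of the Approval-Timer document (sleep, then signal) correspond roughly to the following boto3 sketch. The helper names are hypothetical, and the sketch takes the wait in seconds for simplicity, whereas the aws:sleep step in the document takes its Duration from the Timer parameter.

```python
import time


def build_approve_signal(execution_id):
    """Build keyword arguments for the SSM SendAutomationSignal API."""
    return {
        "AutomationExecutionId": execution_id,
        "SignalType": "Approve",
    }


def approve_after(wait_seconds, execution_id):
    """Sleep, then send an Approve signal to the given automation execution."""
    import boto3  # imported lazily so the builder above works without AWS access

    time.sleep(wait_seconds)
    boto3.client("ssm").send_automation_signal(**build_approve_signal(execution_id))
```

The execution ID passed in is the ID of the execution waiting for approval, which is why the Approval-Timer document takes AutomationExecutionId as an input parameter.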

Next, you will create the Approval-Gate runbook responsible for running the Approval-Timer runbook asynchronously. Follow the steps below to complete the configuration:

  1. From the AWS Systems Manager console, select Documents under Shared Resources on the left menu. Then click Create Automation as shown in the screenshot below:

    Section5

  2. Next, enter Approval-Gate in the Name field and add the notes shown below to the Document description field.

      # What does this automation do?
    
      Place a gate before your desired step to create an approval mechanism.
      Automation will trigger an asynchronous timer that will automatically approve once the time has lapsed.
      Automation will then send an approval / deny request to the designated SNS Topic.
      When deny is triggered by the approver, the step will fail and block the following step from executing.
    
      Note: Please ensure to have onFailure set to abort in your automation document.
    
      ## Steps 
    
      1. Trigger an asynchronous timer that will automatically approve once the time has lapsed.
      2. Send an approval / deny request to the designated SNS Topic.
    
    
  3. In the Assume role field, enter the IAM role ARN we created in the previous section 3.0 Prepare Automation Document IAM Role.

  4. Expand the Input Parameters section and enter the following:

    • Timer as the Parameter name, set the type as String and Required as Yes.
    • NotificationMessage as the Parameter name, set the type as String and Required as Yes.
    • NotificationTopicArn as the Parameter name, set the type as String and Required as Yes.
    • ApproverRoleArn as the Parameter name, set the type as String and Required as Yes.
  5. Expand Step 1 and create a step named executeAutoApproveTimer with action type aws:executeScript.

  6. Expand Inputs, then set the Runtime as Python3.6 and paste the code below into the script section. Note that this code snippet starts the Approval-Timer runbook you created asynchronously.

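    import boto3
    def script_handler(event, context):
      # Start the Approval-Timer runbook as a separate execution.
      # start_automation_execution returns as soon as the execution is started,
      # so this step does not wait for the timer to finish.
      client = boto3.client('ssm')
      response = client.start_automation_execution(
          DocumentName='Approval-Timer',
          Parameters={
              # Automation parameter values are passed as lists of strings.
              'Timer': [ event['Timer'] ],
              'AutomationExecutionId' : [ event['AutomationExecutionId'] ]
          }
      )
      # No output is needed from this step.
      return None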
    
  7. Expand Additional Inputs, then select InputPayload under Input Name, and add the text shown below to Input Value:

    AutomationExecutionId: '{{automation:EXECUTION_ID}}'
    Timer: '{{Timer}}'
    

    Once you have completed this step, your Step 1 configuration should look like below screenshot.

    Section4

  8. Click Add step to create Step 2

  9. Create a step named ApproveOrDeny and action type aws:approve.

  10. Expand Inputs and specify below values under Approvers, replacing the AutomationRoleArn with the Arn of AutomationRole you took note of in section 3.0 Prepare Automation Document IAM Role.

    [ '{{ApproverRoleArn}}', 'AutomationRoleArn' ]
    

    Example:

    [ '{{ApproverRoleArn}}', 'arn:aws:iam::xxxxx:role/AutomationRole' ]
    
  11. Expand Additional Inputs and specify the following values:

    • NotificationArn as the Input name, and {{NotificationTopicArn}} as the Input value
    • Message as the Input name, and {{NotificationMessage}} as the Input value
    • MinRequiredApprovals as the Input name, and 1 as the Input value
  12. Expand Common properties and change the following properties to below values (keep the remaining as it is):

    • Continue for On failure
    • false for Is critical

    Once you have completed this step, your Step 2 configuration should look like below screenshot.

    Section4

  13. Click Add step to create Step 3

  14. Create a step named getApprovalStatus and action type aws:executeAwsApi

  15. Expand Inputs and specify ssm in the Service field, and DescribeAutomationStepExecutions in the API field.

  16. Expand Additional Inputs and specify below values:

    • AutomationExecutionId as the Input Name, and {{automation:EXECUTION_ID}} as the Input value

    • Filters as the Input Name, and copy below values as the Input value

        - Key: StepName
          Values:
            - ApproveOrDeny
      
  17. Expand Outputs and specify below values:

    • approvalStatusVariable as the Name
    • $.StepExecutions[0].Outputs.ApprovalStatus[0] as the Selector
    • String as the Type

    Once you have completed this step, your Step 3 configuration should look like below screenshot.

    Section4

  18. Click on Create automation to complete the configuration.
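The Selector configured in the getApprovalStatus step is a JSONPath expression evaluated against the DescribeAutomationStepExecutions response. The sketch below shows the equivalent lookup in Python against a hand-written sample response. The sample contains only the fields the selector touches, and its values are made up.

```python
def approval_status(step_executions_response):
    """Equivalent of the selector $.StepExecutions[0].Outputs.ApprovalStatus[0]."""
    first_step = step_executions_response["StepExecutions"][0]
    # Step outputs map each output name to a list of string values.
    return first_step["Outputs"]["ApprovalStatus"][0]


# Hypothetical, trimmed-down response used only to illustrate the lookup.
SAMPLE_RESPONSE = {
    "StepExecutions": [
        {"Outputs": {"ApprovalStatus": ["Approved"]}},
    ]
}
```

Because the selector always reads the first element of StepExecutions, the Filters input matters: it has to narrow the response down to the approval step so that index 0 is the step you expect.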

Click here for CloudFormation deployment steps

Download the template here.

If you decide to deploy the stack from the console, make sure you follow these requirements and steps:

  1. Please follow this guide for information on how to deploy the CloudFormation template.
  2. Use waopslab-runbook-approval-gate as the Stack Name, as this is referenced by other stacks later in the lab.
Click here for CloudFormation CLI deployment step (Preferred way)
  1. From the Cloud9 terminal, change to the templates folder as shown:

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
    
  2. Run the below commands, replacing the AutomationRoleArn with the Arn of AutomationRole you took note of in section 3.0 Prepare Automation Document IAM Role.

    aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
                                    --template-body file://runbook_approval_gate.yml 
    

    With your AutomationRole Arn in place your command will look similar to the following example:

    aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-runbook-role \
                                    --template-body file://runbook_approval_gate.yml 
    
  3. Confirm that the stack was created correctly. You can do this by running the describe-stacks command below. Locate the StackStatus and confirm it is set to CREATE_COMPLETE.

aws cloudformation describe-stacks --stack-name waopslab-runbook-approval-gate

4.1 Building the "ECS-Scale-Up" runbook.

Section5

Next, you are going to build the ECS-Scale-Up runbook which will complete the following:

  1. Run the Approval-Gate runbook which you created previously.
  2. Wait for the Approval-Gate runbook to complete.
  3. Once the Approval-Gate runbook completes successfully, the runbook will increase the number of ECS tasks in the cluster.

Please follow the steps below to build the runbook.

Note: Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.

Click here for Console step by step
  1. Go to the AWS Systems Manager console. Click Documents under Shared Resources on the left menu. Then click Create Automation as shown in the screenshot below.

    Section5

  2. Next, enter Runbook-ECS-Scale-Up in the Name field and add the notes shown below to the Document description field:

      # What does this automation do?
    
      Scale up a given ECS service task desired count to certain number, with approval process.
      The automation will trigger Approval-Gate runbook, before executing.
    
      ## Steps 
    
      1. Trigger Approval-Gate
      2. Scale ECS Service by number of service
    
  3. In the Assume role field, enter the IAM role ARN we created in the previous section 3.0 Prepare Automation Document IAM Role.

  4. Expand the Input Parameters section and enter the following.

    • ECSDesiredCount as the Parameter name, set the type as Integer and Required as Yes.
    • ECSClusterName as the Parameter name, set the type as String and Required as Yes.
    • ECSServiceName as the Parameter name, set the type as String and Required as Yes.
    • NotificationTopicArn as the Parameter name, set the type as String and Required as Yes.
    • NotificationMessage as the Parameter name, set the type as String and Required as Yes.
    • ApproverRoleArn as the Parameter name, set the type as String and Required as Yes.
    • Timer as the Parameter name, set the type as String and Required as Yes.
  5. Expand Step 1 and create a step named executeApprovalGate with action type aws:executeAutomation.

  6. Expand Inputs, then set the Document name as Approval-Gate.

  7. Expand Additional inputs and select RuntimeParameters as the Input Name

  8. Paste in the following as the Input Value:

{
"Timer":'{{Timer}}',
"NotificationMessage":'{{NotificationMessage}}',
"NotificationTopicArn":'{{NotificationTopicArn}}',
"ApproverRoleArn":'{{ApproverRoleArn}}'
}
  9. Click Add Step to create the second step.

  10. Specify updateECSServiceDesiredCount as Step Name and select aws:executeAwsApi as Action type.

  11. Expand Inputs and configure the following values:

    • ecs as Service
    • UpdateService as Api
  12. Expand Additional inputs and configure the following values:

    • forceNewDeployment as the Input Name and true as Input Value
    • desiredCount as the Input Name and {{ECSDesiredCount}} as Input Value
    • service as the Input Name and {{ECSServiceName}} as Input Value
    • cluster as the Input Name and {{ECSClusterName}} as Input Value

  13. Click on Create automation once complete.
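The updateECSServiceDesiredCount step calls the ECS UpdateService API with the four inputs listed above. For comparison, this is a minimal boto3 sketch of the same call. The helper names are hypothetical, and the cluster and service names you pass in come from your own stack outputs.

```python
def build_update_service_request(cluster, service, desired_count):
    """Build keyword arguments for the ECS UpdateService API."""
    return {
        "cluster": cluster,
        "service": service,
        "desiredCount": int(desired_count),
        "forceNewDeployment": True,
    }


def scale_service(cluster, service, desired_count):
    """Scale an ECS service and return the desired count reported by ECS."""
    import boto3  # imported lazily so the builder above works without AWS access

    ecs = boto3.client("ecs")
    response = ecs.update_service(
        **build_update_service_request(cluster, service, desired_count)
    )
    return response["service"]["desiredCount"]
```

The call returns as soon as ECS accepts the new desired count. The tasks themselves start afterwards, so the running count catches up over the following minutes.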

Click here for CloudFormation Console deployment step

Download the template here.

If you decide to deploy the stack from the console, ensure that you complete the following steps:

  1. Please follow this guide for information on how to deploy the CloudFormation template.
  2. Use waopslab-runbook-scale-ecs-service as the Stack Name, as this is referenced by other stacks later in the lab.
Click here for CloudFormation CLI deployment step (Preferred way)
  1. From the Cloud9 terminal, run the command to get into the working script folder.

    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
    
  2. Then run the commands below, replacing AutomationRoleArn with the ARN of the AutomationRole you noted in step 3.0 Prepare Automation Document IAM Role.

    aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
                                    --template-body file://runbook_scale_ecs_service.yml 
    

    Example:

    aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
                                    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
                                    --template-body file://runbook_scale_ecs_service.yml 
    
  3. Confirm that the stack was created correctly. You can do this by running the describe-stacks command below. Locate the StackStatus and confirm it is set to CREATE_COMPLETE.

aws cloudformation describe-stacks --stack-name waopslab-runbook-scale-ecs-service

4.2 Executing remediation Runbook.

Now, let's run the runbook you created above to remediate the issue.

  1. Go to the AWS CloudFormation console.

  2. Click on the stack named walab-ops-sample-application.

  3. Click on the Output tab, and take note of the following output values. You will need these values to execute the runbook.

    • OutputECSCluster
    • OutputECSService
    • OutputSystemOwnersTopicArn

 Section4

  4. If you are currently using an IAM user or role to log into your AWS Console, take note of its ARN. You will need this ARN when executing the runbook to restrict who can approve or deny the request.

    To find your current IAM user ARN, go to the IAM console and click Users on the left side menu, then click on your User name. For an IAM role, go to the IAM console and click Roles on the left side menu, then click on the name of the role you are using.

    You will see something similar to the example below. Take note of the ARN value, and proceed to the next step.

 Section4

  5. Go to the Systems Manager Automation console, click on Documents under Shared Resources, then locate and click the automation document called Runbook-ECS-Scale-Up.

  6. Then click Execute automation.

  7. Fill in the Input parameters with the values below.

     Section4

    • For ECSServiceName, enter the value of OutputECSService you noted in step 3.

    • For ECSClusterName, enter the value of OutputECSCluster you noted in step 3.

    • For ApproverRoleArn, enter the ARN value you noted in step 4.

    • For ECSDesiredCount, enter 100 to increase the number of tasks to 100.

    • For NotificationMessage, place in any message that can help the approver make an informed decision when approving or denying the requested action.

      For example:

      Hello, your mysecretword app is experiencing performance degradation. To maintain quality customer experience we will manually scale up the supporting cluster. This action will run approximately 10 minutes after this message is generated unless you deny it within that period.
      
    • For NotificationTopicArn, enter the value of OutputSystemOwnersTopicArn you noted in step 3.

    • For Timer, specify PT5M (5 minutes) or another value in ISO 8601 duration format.

  8. Click Execute to run the runbook.

  9. Once the runbook is running, you will receive an email with instructions to approve or deny the request at the email address subscribed to the owners SNS topic. Follow the link in the email while signed in as the user or role whose ARN you entered for ApproverRoleArn in the Input parameters. The link will take you to the SSM Console where you can approve or deny the request.

     Section4

    If you approve, the runbook continues. If you ignore the email, the request will be approved automatically once the Timer set in the runbook expires. If you deny, the runbook will fail and no action will be taken.

  10. Once the runbook completes, you can see that the ECS task count has increased to the value specified.

  11. Go to the ECS console, click on Clusters and select mysecretword-cluster.

  12. Click on the mysecretword-service Service, and you will see the number of running tasks increase to 100 and the average CPUUtilization decrease.

     Section4

     Section4

  13. Subsequently, you will see the API response time return to normal and the CloudWatch Alarm return to an OK state.

     Section4

    You can check both using your CloudWatch Console, following the steps you ran in section 2.1 Observing the alarm being triggered.
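The Timer input used in this section takes an ISO 8601 duration such as PT5M. If you want to sanity-check a value before starting the runbook, the sketch below converts a common subset of the format (days, hours, minutes and whole seconds) into seconds. It is an illustrative helper, not part of the lab, and it deliberately rejects years, months, weeks and fractional values.

```python
import re

# Matches durations such as PT5M, PT1H30M or P1DT2S (a subset of ISO 8601).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert a simple ISO 8601 duration string into a number of seconds."""
    match = _DURATION.match(value)
    # Reject strings with no components at all, such as "P", "PT" or "P1DT".
    if not match or value.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

For the value suggested above, duration_to_seconds("PT5M") gives 300, meaning the request is approved automatically five minutes after the runbook starts unless someone denies it first.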

Congratulations !

You have now completed the Automating operations with Playbooks and Runbooks lab. Click on the link below to clean up the lab resources.

Teardown

In this section you will delete all resources related to the lab environment.

  1. Run the following command to navigate to the script folder:

     cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/

  2. Run the teardown_resources.sh script to delete all resources related to the lab:

     bash teardown_resources.sh
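
CloudFormation stack deletions can fail part-way and leave stacks in the DELETE_FAILED state, so it may be worth checking for leftovers once the script finishes. The following is a minimal sketch, assuming boto3 is installed and credentials and a region are configured. The function names are hypothetical and not part of the lab scripts. It lists every stack in DELETE_FAILED in the current account and region, not only the stacks created by this lab.

```python
from typing import Any, List


def stacks_in_delete_failed(cfn_client: Any) -> List[str]:
    """Return the names of all stacks currently in the DELETE_FAILED state."""
    names: List[str] = []
    paginator = cfn_client.get_paginator("list_stacks")
    # StackStatusFilter restricts the listing to the given statuses.
    for page in paginator.paginate(StackStatusFilter=["DELETE_FAILED"]):
        for summary in page.get("StackSummaries", []):
            names.append(summary["StackName"])
    return sorted(names)


def main() -> None:
    # Assumption: boto3 is available and points at the account used for the lab.
    import boto3

    leftover = stacks_in_delete_failed(boto3.client("cloudformation"))
    if leftover:
        print("Stacks that still need manual deletion:")
        for name in leftover:
            print("  " + name)
    else:
        print("No stacks in DELETE_FAILED.")
```

Calling `main()` prints any stacks that still need attention. Resources that are retained on purpose when a stack is deleted, such as an S3 bucket, do not show up in this list and have to be checked separately.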

Summary

In this lab you learned how to:

  • Build and run automated playbooks to support your investigations
  • Build and run automated runbooks to remediate specific faults
  • Enable traceability of operations activities in your environment

Contributors

amazon-auto, awswa, sssalim-aws

achieving-operational-excellence-using-automated-playbook-and-runbook's Issues

Teardown step does not successfully destroy all resources

Hi, I went through the whole lab. Thanks for creating it.

I ran bash teardown_resources.sh, but some resources failed to delete, which left some of the CloudFormation stacks in the DELETE_FAILED state. I had to go in and delete those stacks manually.

I also did not know that the S3 bucket was retained; I thought it would be deleted automatically by the script.

It might be worth documenting this behaviour and its side effects so that people can double-check whether resources are still living in their AWS account.
