covid-alert-metrics-terraform's Introduction

The French version follows.

Covid Alert Metrics Terraform

COVID Alert is now retired: For more information, visit the Government of Canada COVID Alert home page.

This is the Terraform and Terragrunt repository used to manage the Metrics Application Programming Interface (API) infrastructure in the Amazon Web Services (AWS) cloud platform.

The Covid Alert App sends anonymous usage data to the Metrics API /save-metrics endpoint. The Covid Alert Metrics ETL project then converts this data into comma-separated values (CSV) files.

Infrastructure

This project creates and manages the following AWS resources in the Staging and Production environments:

  • API Gateway which provides the /save-metrics endpoint.
  • Lambda functions to process metrics data payloads from the Covid Alert App.
  • DynamoDB tables to store raw and aggregated metrics data.
  • Elastic Container Registries (ECR) and an Elastic Container Service (ECS) Fargate cluster to store the Covid Alert Metrics ETL Docker images and process the metrics data.
  • S3 buckets to store the generated metrics CSV files.
  • Route53 Domain Name System (DNS) record for the metrics subdomain.
  • Virtual Private Cloud (VPC) configuration.
  • Identity and Access Management (IAM) roles and policies.
  • CloudWatch log groups and alarms.

Setup

  1. Install Docker.
  2. Install VS Code and the Remote Containers extension.
  3. Using the Remote Containers status bar, open the project folder in a container. VS Code has a tutorial on using devcontainers if you get stuck.
  4. Once the devcontainer finishes building, you will have a Terraform and Terragrunt development environment.
  5. Export your AWS credentials.
  6. Change directories into the env/staging or env/production folder, depending on which environment you are working on.
  7. Run terragrunt run-all plan to see infrastructure changes.
  8. Run terragrunt run-all apply to apply infrastructure changes.

Note

For env/production changes, you also need to specify the tagged version of the code you want to run:

export INFRASTRUCTURE_VERSION=2.0.3
terragrunt run-all plan
terragrunt run-all apply
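
The production note above can be turned into a small guard that refuses to run without a tagged version. This is only a sketch: the helper name `terragrunt_env` is hypothetical, and the `X.Y.Z` tag pattern is an assumption based on the `2.0.3` example, not a rule stated by the project.

```python
import os
import re
from typing import Dict, Optional


def terragrunt_env(environment: str, version: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the process environment for a Terragrunt run.

    For production, INFRASTRUCTURE_VERSION must be supplied. The X.Y.Z
    pattern is an assumption taken from the 2.0.3 example.
    """
    env = dict(os.environ)
    if environment == "production":
        if not version or not re.fullmatch(r"\d+\.\d+\.\d+", version):
            raise ValueError("production runs need a tagged version such as 2.0.3")
        env["INFRASTRUCTURE_VERSION"] = version
    return env
```

The returned mapping could then be passed to something like `subprocess.run(["terragrunt", "run-all", "plan"], cwd="env/production", env=env)`.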

Référentiel Terraform pour les mesures de performance d’Alerte COVID (Terraform repository for COVID Alert performance metrics)

COVID Alert has been retired: For more information, visit the Government of Canada COVID Alert home page.

This is the Terraform and Terragrunt repository used to manage the performance metrics API infrastructure on the Amazon Web Services (AWS) cloud platform.

The COVID Alert app sends anonymous usage data to the /save-metrics endpoint of the performance metrics API. The COVID Alert metrics Extract, Transform, Load (ETL) project then converts this data into comma-separated values (CSV) files.

Infrastructure

This project creates and manages the following AWS resources in the staging and production environments:

  • API Gateway that provides the /save-metrics endpoint.
  • Lambda functions to process the performance metrics data sent by the COVID Alert app.
  • DynamoDB tables to store raw and aggregated metrics data.
  • Elastic Container Registries (ECR) and an Elastic Container Service (ECS) Fargate cluster to store the COVID Alert metrics ETL project Docker images and process the metrics data.
  • S3 buckets to store the generated metrics CSV files.
  • Route53 Domain Name System (DNS) record for the metrics subdomain.
  • Virtual Private Cloud (VPC) configuration.
  • Identity and Access Management (IAM) roles and policies.
  • CloudWatch log groups and alarms.

Setup

  1. Install Docker.
  2. Install VS Code and the Remote Containers extension.
  3. Using the Remote Containers status bar, open the project folder in a container. VS Code has a tutorial on using devcontainers if you run into difficulties.
  4. Once the devcontainer has been built, the Terraform and Terragrunt development environment is ready to use.
  5. Export your AWS credentials.
  6. Change directories into the env/staging or env/production folder, depending on which environment you want to work in.
  7. Run terragrunt run-all plan to see the infrastructure changes.
  8. Run terragrunt run-all apply to apply the infrastructure changes.

Note

For changes in the env/production folder, you must also specify the tagged version of the code you want to run:

export INFRASTRUCTURE_VERSION=2.0.3
terragrunt run-all plan
terragrunt run-all apply

covid-alert-metrics-terraform's People

Contributors

calvinrodo, mohdnr, patheard, sre-read-write[bot]

covid-alert-metrics-terraform's Issues

Bug: fix issue where an alarm keeps changing...

# aws_cloudwatch_metric_alarm.backoff_retry_average_duration will be updated in-place
~ resource "aws_cloudwatch_metric_alarm" "backoff_retry_average_duration" {
    ~ alarm_description = "This metric monitors average duration for the save_metrics lambda" -> "This metric monitors average duration for the backoff_retry lambda"
    ~ dimensions        = {
        ~ "FunctionName" = "save-metrics" -> "backoff_retry"
      }
      id                = "save-metrics-average-duration"
      tags              = {}
      # (16 unchanged attributes hidden)
  }

Aggregate new "Error" metrics

Summary

The app will be sending new error metric payloads that need to be aggregated. Proposed payload is in cds-snc/covid-alert-metrics-payloads#5.

This metric includes a code and message:

Code  Message
500   Bad URL from QR code
500   Problem decoding or parsing QR hash data
400   Submission keys: bad certificate
400   Error while downloading key file

Todo

  • Determine how to aggregate code/message (unique identifier per error, new field or use existing field like state).
  • Update covid-alert-metrics to generate new CSVs based on error metrics.
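
One possible way to approach the first todo item is to count payloads per (code, message) pair. The sketch below is only an illustration: the payload field names `code` and `message` and the function name `aggregate_errors` are assumptions, since the proposed payload format lives in cds-snc/covid-alert-metrics-payloads#5 and is not reproduced here.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple


def aggregate_errors(payloads: Iterable[Mapping[str, object]]) -> Counter:
    """Count error payloads per (code, message) pair.

    The "code" and "message" keys are assumed field names.
    """
    counts: Counter = Counter()
    for payload in payloads:
        key: Tuple[int, str] = (int(payload["code"]), str(payload["message"]))
        counts[key] += 1
    return counts
```

Keying on the pair keeps the two 500 errors and the two 400 errors in the table distinct, which a count by code alone would not.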

Incident: investigate 2021-06-27 API Gateway `save-metrics` traffic spike

Summary

  • At 00:00:00 EST on 2021-06-27 the metrics API gateway 5xx alarm triggered.
  • This was caused by a large traffic spike to the endpoint, which caused AWS to throttle 111 invocations of the save-metrics Lambda.
  • Incident doc

Todo

  • Determine if there was a reason for this traffic spike, or if it was a one-off.
  • Determine if the metrics API gateway 5xx alarm threshold should be increased (is this the new normal traffic pattern?).

Bug: accidental deploys to production are possible

To reproduce:

  1. Add a code change that does not rely on a tagged version (e.g. terragrunt.hcl or Lambda code change).
  2. Leave the affected /env/production/$MODULE out of the pull-request workflow matrix. As a result, no plan runs for the changes.
  3. Merge the PR. This runs terragrunt run-all apply, which applies the changes from step 1.

We saw this happen in #185.

A possible workaround would be to have the Apply Production workflow operate only on tagged versions of the repo (rather than using main for non-module changes).

Increase API gateway 5xx alarm threshold

The new metrics traffic pattern now causes throttled Lambda invocations at midnight EST. The current metrics 5xx alarm threshold of 100 in 1 minute is no longer sufficient.

Related #122

Fix TF plan on network due to default NACL

This shouldn't show up each time we do a PR.

# aws_default_network_acl.default will be updated in-place
~ resource "aws_default_network_acl" "default" {
      id                     = "acl-0a293e49ea86cc7ce"
    ~ subnet_ids             = [
        - "subnet-066a4cc52f615255c",
        - "subnet-07f256fc1e9b65732",
      ]
      tags                   = {
          "Name" = "metricsstaging_default_nacl"
      }
      # (5 unchanged attributes hidden)


      # (3 unchanged blocks hidden)
  }

Add critical and warning alarms for error-500-qr-parse

As per our Sev 1 Discussion this morning we need to add an alarm for the above metric.

If it exceeds 3 in a day, we should trigger a critical alarm.
If it reaches 1 in a day, we should trigger a warning.

  • Publish a custom metric from the aggregate Lambda
  • Create a CloudWatch alarm based on the above thresholds
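
The two thresholds above can be expressed as a small classification function. This is a sketch rather than the project's alarm configuration: the function name, parameter names, and the "ok" state are assumptions, and in practice the thresholds would live in CloudWatch alarm definitions.

```python
def qr_parse_severity(daily_count: int, warning_at: int = 1, critical_above: int = 3) -> str:
    """Classify the daily error-500-qr-parse count.

    More than `critical_above` errors is critical; at least `warning_at`
    errors is a warning; anything lower is treated as ok.
    """
    if daily_count > critical_above:
        return "critical"
    if daily_count >= warning_at:
        return "warning"
    return "ok"
```

With these defaults, a count of 3 is still a warning, because the issue asks for a critical alarm only when the count goes above 3.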

Lower 4xx alerting threshold

The current 4xx CloudWatch alarm threshold is too high, which caused a spike in 4xx errors to go unnoticed.

Related: incident-en-4xx-high

Bug: Fix a bug where policies are being managed by two tf states

The following repos are both managing the same policies:

The following policies are duplicated between repos:

aws_iam_policy.aggregate_metrics_update
aws_iam_policy.vpc_networking
aws_iam_policy.write_logs

Solution is to create a new backoff_retry policy and remove the above resources from remote state.

  • Remove from staging
  • Remove from production

Implement Runtime security in Amazon ECS using CNCF Falco

Falco is the CNCF open source project for intrusion and abnormality detection for containers and cloud-native apps. Falco will generate security events when it finds abnormal behaviours, which are defined by a customizable set of rules.

Migrate metrics from Terraform Repo to Terragrunt Repo

Now that we have a nice clean Terragrunt repo for metrics, we need to migrate all of the metrics resources over from the keyserver repo:

  • aggregate_metrics table
  • raw_metrics table
  • dead letter queue
  • backoff retry lambda
  • aggregate_metrics lambda
  • raw_metrics lambda
  • api_gateway
  • cloudwatch alarms
  • cloudwatch dashboard

New alarm: drop in traffic from previous day

Add a new alarm that checks for a significant drop in traffic from one day to the next:

  1. Aggregate previous day's traffic (24 hour period).
  2. If traffic of current day has dropped by a significant percentage (5%, 10%?), trigger alarm.

This will catch cases where an app release has caused the metrics to drop unexpectedly.
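
The check described above could look roughly like the following. This is a minimal sketch: the function names and the 10% default are assumptions (the issue leaves the percentage open), and the daily totals are assumed to be already aggregated over 24-hour periods.

```python
def traffic_drop_percent(previous_day: int, current_day: int) -> float:
    """Return the percentage drop from the previous day, or 0.0 if traffic did not drop."""
    if previous_day <= 0:
        # No baseline to compare against, so report no drop.
        return 0.0
    drop = (previous_day - current_day) * 100.0 / previous_day
    return max(0.0, drop)


def should_alarm(previous_day: int, current_day: int, threshold_percent: float = 10.0) -> bool:
    """Trigger when traffic dropped by at least `threshold_percent`."""
    return traffic_drop_percent(previous_day, current_day) >= threshold_percent
```

Returning 0.0 when there is no baseline avoids a division by zero on the first day of data, at the cost of never alarming on that day.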
