
terraform-aws-alternat's Introduction

alterNAT

NAT Gateways are dead. Long live NAT instances!

Built and released with 💚 by Chime Engineering


Background

On AWS, NAT devices are required for accessing the Internet from private VPC subnets. Usually, the best option is a NAT gateway, a fully managed NAT service. The pricing structure of NAT gateway includes charges of $0.045 per hour per NAT Gateway, plus $0.045 per GB processed. The former charge is reasonable at about $32.40 per month. However, the latter charge can be extremely expensive for larger traffic volumes.

In addition to the direct NAT Gateway charges, there are also Data Transfer charges for outbound traffic leaving AWS (known as egress traffic). The cost varies depending on destination and volume, ranging from $0.09/GB down to $0.01/GB (after a free tier of 100GB). That’s right: traffic traversing the NAT Gateway is first charged for processing, then charged again for egress to the Internet.

Consider, for instance, sending 1PB to and from the Internet through a NAT Gateway - not an unusual amount for some use cases. The cost comes to $75,604. Many customers may be dealing with far less than 1PB, but the cost can be high even at relatively lower traffic volumes. This drawback of NAT Gateway is widely lamented among AWS users.

Plug the numbers into the AWS Pricing Calculator and you may well be flabbergasted. Rather than 1PB, which may be less relatable for some users, let’s choose a nice, relatively low round number as an example: say, 10TB. The cost of sending 10TB over the Internet (5TB ingress, 5TB egress) through a NAT Gateway works out to $954 per month, or $11,448 per year.
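
To see where that number comes from, here is a rough back-of-the-envelope breakdown (a sketch assuming a single NAT Gateway billed for 720 hours, $0.045/GB processing, $0.09/GB egress, and ignoring the 100GB free tier):

# Prints hourly=$32.40 processing=$460.80 egress=$460.80 total=$954.00
awk 'BEGIN {
  hourly     = 0.045 * 720        # NAT Gateway hourly charge for one month
  processing = 10 * 1024 * 0.045  # all 10TB is charged for data processing
  egress     = 5 * 1024 * 0.09    # only the 5TB leaving AWS is charged for egress
  printf "hourly=$%.2f processing=$%.2f egress=$%.2f total=$%.2f\n", hourly, processing, egress, hourly + processing + egress
}'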

Unlike NAT Gateways, NAT instances do not suffer from data processing charges. With NAT instances, you pay for:

  1. The cost of the EC2 instances
  2. Data transfer out of AWS (the same as NAT Gateway)
  3. The operational expense of maintaining EC2 instances

Of these, at scale, data transfer is the most significant. NAT instances are subject to the same data transfer sliding scale as NAT Gateways. Inbound data transfer is free, and most importantly, there is no $0.045 per GB data processing charge.

Consider the cost of transferring that same 5TB inbound and 5TB outbound through a NAT instance. Using the EC2 Data Transfer sliding scale for egress traffic and a c6gn.large NAT instance (optimized for networking), the cost comes to about $526. This is a $428 per month savings (~45%) compared to the NAT Gateway. The more data processed - especially on the ingress side - the higher the savings.

NAT instances aren't for everyone. You might benefit from alterNAT if NAT Gateway data processing costs are a significant item on your AWS bill. If the hourly cost of the NAT instances and/or the NAT Gateways is a material line item on your bill, alterNAT is probably not for you. As a rule of thumb, assuming a roughly equal volume of ingress/egress traffic, and considering the slight overhead of operating NAT instances, you might save money using this solution if you are processing more than 10TB per month with NAT Gateway.

Features:

  • Self-provisioned NAT instances in Auto Scaling Groups
  • Standby NAT Gateways with health checks and automated failover, facilitated by a Lambda function
  • Vanilla Amazon Linux 2 AMI (no AMI management requirement)
  • Optional use of SSM for connecting to the NAT instances
  • Max instance lifetimes (no long-lived instances!) with automated failover
  • A Terraform module to set everything up
  • Compatibility with the default naming convention used by the open source terraform-aws-vpc Terraform module

Read on to learn more about alterNAT.

Architecture Overview

Architecture diagram

The two main elements of the NAT instance solution are:

  1. The NAT instance Auto Scaling Groups, one per zone, with a corresponding standby NAT Gateway
  2. The replace-route Lambda function

Both are deployed by the Terraform module.

NAT Instance Auto Scaling Group and Standby NAT Gateway

The solution deploys an Auto Scaling Group (ASG) for each provided public subnet. Each ASG contains a single instance. When the instance boots, the user data initializes the instance to do the NAT stuff.
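
For context, the core of what a Linux NAT instance's boot-time configuration has to do looks roughly like the following. This is a generic sketch of the standard approach rather than the module's actual script, and INSTANCE_ID is a placeholder:

# Enable IP forwarding and masquerade traffic leaving the primary interface.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# A NAT instance must also have EC2 source/destination checking disabled.
aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --no-source-dest-check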

By default, the ASGs are configured with a maximum instance lifetime. This is to facilitate periodic replacement of the instance to automate patching. When the maximum instance lifetime is reached (14 days by default), the following occurs:

  1. The instance is terminated by the Auto Scaling service.
  2. A Terminating:Wait lifecycle hook fires to an SNS topic.
  3. The replace-route function updates the route table of the corresponding private subnet to instead route through a standby NAT Gateway.
  4. When the new instance boots, its user data automatically reclaims the Elastic IP address and updates the route table to route through itself.

The standby NAT Gateway is a safety measure. It is only used if the NAT instance is actively being replaced, either due to the maximum instance lifetime or due to some other failure scenario.

replace-route Lambda Function

The purpose of the replace-route Lambda function is to update the route tables of the private subnets to route through the standby NAT Gateway. It is invoked in response to two events:

  1. The lifecycle hook (via the SNS topic) when the ASG terminates a NAT instance (such as when the max instance lifetime is reached), and
  2. a CloudWatch Event rule that fires once per minute for every private subnet.

When a NAT instance in any of the zonal ASGs is terminated, the lifecycle hook publishes an event to an SNS topic to which the Lambda function is subscribed. The Lambda then performs the necessary steps to identify which zone is affected and updates the respective private route table to point at its standby NAT gateway.

The replace-route function also acts as a health check. Every minute, in the private subnet of each availability zone, the function checks that connectivity to the Internet works by requesting https://www.example.com and, if that fails, https://www.google.com. If the request succeeds, the function exits. If both requests fail, the NAT instance is presumably borked, and the function updates the route to point at the standby NAT gateway.
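
In shell terms, the per-zone check is roughly equivalent to the sketch below (the real function is written in Python; the log message matches the one referenced under Other Considerations):

# Approximate logic of the connectivity check; the actual route replacement is elided.
if ! curl -sf --max-time 5 https://www.example.com >/dev/null \
   && ! curl -sf --max-time 5 https://www.google.com >/dev/null; then
  echo "Failed connectivity tests! Replacing route"
fi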

In the event that a NAT instance is unavailable, the function would have no route to the AWS EC2 API to perform the necessary steps to update the route table. This is mitigated by the use of an interface VPC endpoint to EC2.

Drawbacks

No solution is without its downsides. To understand the primary drawback of this design, a brief discussion about how NAT works is warranted.

NAT stands for Network Address Translation. NAT devices act as proxies, allowing hosts in private networks to communicate over the Internet without public, Internet-routable addresses. They have a network presence in both the private network and on the Internet. NAT devices accept connections from hosts on the private network, mark the connection in a translation table, then open a corresponding connection to the destination using their public-facing Internet connection.

NAT Translation Table

The table, typically stored in memory on the NAT device, tracks the state of open connections. If the state is lost or changes abruptly, the connections will be unexpectedly closed. Processes on clients in the private network with open connections to the Internet will need to open new connections.

In the design described above, NAT instances are intentionally terminated for automated patching. The route is updated to use the NAT Gateway, then back to the newly launched, freshly patched NAT instance. During these changes the NAT table is lost. Established TCP connections present at the time of the change will still appear to be open on both ends of the connection (client and server) because no TCP FIN or RST has been sent, but will in fact be closed because the table is lost and the public IP address of the NAT has changed.

Importantly, connectivity to the Internet is never lost. A route to the Internet is available at all times.

For our use case, and for many others, this limitation is acceptable. Many clients will open new connections. Other clients may use primarily short-lived connections that retry after a failure.

For some use cases - for example, file transfers, or other operations that are unable to recover from failures - this drawback may be unacceptable. In this case, the max instance lifetime can be disabled, and route changes would only occur in the unlikely event that a NAT instance failed for another reason, in which case the connectivity checker automatically redirects through the NAT Gateway.

The Internet is unreliable, so failure modes such as connection loss should be a consideration in any resilient system.

Edge Cases

As described above, alterNAT uses the ReplaceRoute API (among others) to switch the route in the event of a NAT instance failure or Auto Scaling termination event. One possible failure scenario could occur where the EC2 control plane is for some reason not functional (e.g. an outage within AWS) and a NAT instance fails at the same time. The replace-route function may be unable to automatically switch the route to the NAT Gateway because the control plane is down. One mitigation would be to attempt to manually replace the route for the impacted subnet(s) using the CLI or console. However, if the control plane is in fact down and no APIs are working, waiting until the issue is resolved may be the only option.
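
For reference, manually failing a single subnet over to its standby NAT Gateway (assuming the API is reachable) looks something like this; the IDs are placeholders:

aws ec2 replace-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0123456789abcdef0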

Usage and Considerations

There are two ways to deploy alterNAT:

  • By building a Docker image and using AWS Lambda support for containers
  • By using AWS Lambda runtime for Python directly

Use this project directly, as provided, or draw inspiration from it and use only the parts you need. We cut releases following the Semantic Versioning method. We recommend pinning to our tagged releases or using the short commit SHA if you decide to use this repo directly.

Building and Pushing the Container Image

Build and push the container image using the Dockerfile.

We do not provide a public image, so you'll need to build an image and push it to the registry and repo of your choice. Amazon ECR is the obvious choice.

docker build . -t <your_registry_url>/<your_repo>:<release tag or short git commit sha>
docker push <your_registry_url>/<your_repo>:<release tag or short git commit sha>
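
If you use ECR, the full sequence typically looks like the following (account ID, region, repository name, and tag are placeholders; the account ID matches the example module configuration below):

aws ecr create-repository --repository-name alternat-functions-lambda
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 0123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build . -t 0123456789012.dkr.ecr.us-east-1.amazonaws.com/alternat-functions-lambda:v0.3.3
docker push 0123456789012.dkr.ecr.us-east-1.amazonaws.com/alternat-functions-lambda:v0.3.3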

Use the Terraform Module

Start by reviewing the available input variables.

Example usage of the Terraform module:

locals {
  vpc_az_maps = [
    for index, rt in module.vpc.private_route_table_ids : {
      az                 = data.aws_subnet.subnet[index].availability_zone
      route_table_ids    = [rt]
      public_subnet_id   = module.vpc.public_subnets[index]
      private_subnet_ids = [module.vpc.private_subnets[index]]
    }
  ]
}

data "aws_subnet" "subnet" {
  count = length(module.vpc.private_subnets)
  id    = module.vpc.private_subnets[count.index]
}

module "alternat_instances" {
  source  = "chime/alternat/aws"
  # It's recommended to pin every module to a specific version
  # version = "x.x.x"

  alternat_image_uri = "0123456789012.dkr.ecr.us-east-1.amazonaws.com/alternat-functions-lambda"
  alternat_image_tag = "v0.3.3"

  ingress_security_group_ids = var.ingress_security_group_ids

  lambda_package_type = "Image"

  # Optional EBS volume settings. If omitted, the AMI defaults will be used.
  nat_instance_block_devices = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        encrypted   = true
        volume_type = "gp3"
        volume_size = 20
      }
    }
  }

  tags = var.tags

  vpc_id      = module.vpc.vpc_id
  vpc_az_maps = local.vpc_az_maps
}

To use the AWS Lambda runtime for Python instead, remove the alternat_image_* inputs and set lambda_package_type to Zip, e.g.:

module "alternat_instances" {
  ...
  lambda_package_type = "Zip"
  ...
}

The nat_instance_user_data_post_install variable allows you to supply an additional script that runs after the main NAT configuration has been installed.

module "alternat_instances" {
  ...
    nat_instance_user_data_post_install = templatefile("${path.root}/post_install.tpl", {
      VERSION_ENV = var.third_party_version
    })
  ...
}

Feel free to submit a pull request or create an issue if you need an input or output that isn't available.

Can I use my own NAT Gateways?

Yes, but with caveats. You can set create_nat_gateways = false and alterNAT will not create NAT Gateways or EIPs for the NAT Gateways. However, alterNAT needs to manage the route to the Internet (0.0.0.0/0) for the private route tables. You have to ensure that you do not have an aws_route resource that points to the NAT Gateway from the route tables that you want to route through the alterNAT instances.

If you are using the open source terraform-aws-vpc module, you can set nat_gateway_destination_cidr_block to a value that is unlikely to affect your network. For instance, you could set nat_gateway_destination_cidr_block=192.0.2.0/24, an example CIDR range as discussed in RFC5735. This way the terraform-aws-vpc module will create and manage the NAT Gateways and their EIPs, but will not set the route to the Internet.

AlterNATively, you can remove the NAT Gateways and their EIPs from your existing configuration and then terraform import them to allow alterNAT to manage them.
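
A hypothetical import for a single zone could look like the following. The aws_eip.nat_gateway_eips address is keyed by availability zone in the module, but the NAT Gateway resource address shown here is a guess; confirm the exact addresses with terraform state list or the module source before importing:

# IDs are placeholders; the aws_nat_gateway address below is hypothetical.
terraform import 'module.alternat_instances.aws_eip.nat_gateway_eips["us-east-1a"]' eipalloc-0123456789abcdef0
terraform import 'module.alternat_instances.aws_nat_gateway.main["us-east-1a"]' nat-0123456789abcdef0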

Other Considerations

  • Read the Amazon EC2 instance network bandwidth page carefully. In particular:

    To other Regions, an internet gateway, Direct Connect, or local gateways (LGW) – Traffic can utilize up to 50% of the network bandwidth available to a current generation instance with a minimum of 32 vCPUs. Bandwidth for a current generation instance with less than 32 vCPUs is limited to 5 Gbps.

  • Hence, if you need more than 5Gbps, make sure to use an instance type with at least 32 vCPUs, and divide the advertised bandwidth in half. For example, the c6gn.8xlarge, which offers 50Gbps of guaranteed bandwidth, will have 25Gbps available for egress to other regions, an internet gateway, etc.

  • It's wise to start by overprovisioning, observing patterns, and resizing if necessary. Don't be surprised by the network I/O credit mechanism explained in the AWS EC2 docs thusly:

    Typically, instances with 16 vCPUs or fewer (size 4xlarge and smaller) are documented as having "up to" a specified bandwidth; for example, "up to 10 Gbps". These instances have a baseline bandwidth. To meet additional demand, they can use a network I/O credit mechanism to burst beyond their baseline bandwidth. Instances can use burst bandwidth for a limited time, typically from 5 to 60 minutes, depending on the instance size.

  • SSM Session Manager is enabled by default. To view NAT connections on an instance, use Session Manager to connect, then run sudo cat /proc/net/nf_conntrack (see the example after this list). Disable SSM by setting enable_ssm=false.

  • We intentionally use most_recent=true for the Amazon Linux 2 AMI. This helps to ensure that the latest AMI is used in the ASG launch template. If a new AMI is available when you run terraform apply, the launch template will be updated with the latest AMI. The new AMI will be launched automatically when the maximum instance lifetime is reached.

  • Most of the time, except when the instance is actively being replaced, NAT traffic should be routed through the NAT instance and NOT through the NAT Gateway. You should monitor your logs for the text "Failed connectivity tests! Replacing route" and alert when this occurs, as you may need to manually intervene to resolve a problem with the NAT instances.

  • One Elastic IP address is allocated for each NAT instance and one for each NAT Gateway (four of each in a four-zone deployment). Be sure to add all of these addresses to any external allow lists if necessary.

  • If you plan on running this in a dual stack network (IPv4 and IPv6), you may notice that it takes ~10 minutes for an alternat node to start. In that case, you can use the nat_instance_user_data_pre_install variable to prefer IPv4 over IPv6 before running any user data.

      nat_instance_user_data_pre_install = <<-EOF
        # Prefer IPv4 over IPv6
        echo 'precedence ::ffff:0:0/96 100' >> /etc/gai.conf
      EOF
  • If you see errors like: error connecting to https://www.google.com/: <urlopen error [Errno 97] Address family not supported by protocol> in the connectivity tester logs, you can set lambda_has_ipv6 = false. This will cause the lambda to request IPv4 addresses only in DNS lookups.

  • If you want to use just a single NAT Gateway for fallback, you can create it externally and provide its ID through the nat_gateway_id variable. Note that you will incur cross AZ traffic charges of $0.01/GB.

      create_nat_gateways = false
      nat_gateway_id      = "nat-..."
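
Following up on the SSM bullet above, a minimal inspection session looks roughly like this (requires the Session Manager plugin for the AWS CLI; the instance ID is a placeholder):

aws ssm start-session --target i-0123456789abcdef0
# Then, inside the session:
sudo cat /proc/net/nf_conntrack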

Contributing

Issues and pull requests are most welcome!

alterNAT is intended to be a safe, welcoming space for collaboration. Contributors are expected to adhere to the Contributor Covenant code of conduct.

Local Testing

Terraform module testing

The test/ directory uses the Terratest library to run integration tests on the Terraform module. The test uses the example located in examples/ to set up Alternat, runs validations, then destroys the resources. Unfortunately, because of how the Lambda Hyperplane ENI deletion process works, this takes a very long time (about 35 minutes) to run.
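
A typical invocation, assuming Go and AWS credentials are already configured (note the generous timeout given the roughly 35-minute runtime), might look like:

cd test
go test -v -timeout 60m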

Lambda function testing

To test locally, install the AWS SAM CLI client:

brew tap aws/tap
brew install aws-sam-cli

Build with SAM and invoke the functions:

sam build
sam local invoke <FUNCTION NAME> -e <event_filename>.json

Example:

cd functions/replace-route
sam local invoke AutoScalingTerminationFunction -e sns-event.json
sam local invoke ConnectivityTestFunction -e cloudwatch-event.json

Testing with SAM

In the first terminal

cd functions/replace-route
sam build && sam local start-lambda # This will start up a docker container running locally

In a second terminal, invoke the functions running in the first terminal:

cd functions/replace-route
aws lambda invoke --function-name "AutoScalingTerminationFunction" --endpoint-url "http://127.0.0.1:3001" --region us-east-1 --cli-binary-format raw-in-base64-out --payload file://./sns-event.json --no-verify-ssl out.txt
aws lambda invoke --function-name "ConnectivityTestFunction" --endpoint-url "http://127.0.0.1:3001" --region us-east-1 --cli-binary-format raw-in-base64-out --payload file://./cloudwatch-event.json --no-verify-ssl out.txt

terraform-aws-alternat's People

Contributors

armona, asafadar, ascandella, brainsdevops, bwhaley, creitz, dependabot[bot], dtrejo, eddycek, inga, irasnyd, korenyoni, kristian-lesko, nickpetrovic, nitrocode, oponomarov-tu, pv93, trpalmer


terraform-aws-alternat's Issues

New terraform outputs for network interfaces/nat instances

Hi, first off I'd like to say that this repository is awesome and I love the savings it offers! Just started deploying this project today and we can already see the benefits of this solution.

One thing I was wondering about is whether we could add outputs to the Terraform module such as:

  • EC2 NAT instance IDs mapped per-az
  • Network interface IDs mapped per-az

Here's an example of what it would look like as an output:

nat_instances = [
  {
     az = us-east-1a,
     eni_id = eni-1234,
     instance_id = i-1232132,
  },
  {
     az = us-east-1b,
     eni_id = eni-4567,
     instance_id = i-2432094,
  },
  ....
]

This would be particularly helpful since we have our route tables created by Terraform per AZ, so we'd like to have our 0.0.0.0/0 routes set to the corresponding EC2 instance/network interface IDs directly in our codebase, rather than have these routes set to the managed NAT Gateways and risk someone overriding the 0.0.0.0/0 routes by mistake and our costs going through the roof because of managed NAT usage.

We are totally open to contributing to this repository, just looking for thoughts first and then we could open a PR.
Thank you :)

install third-party software on EC2 alternat instances

Hello, first of all I would like to thank you for your work on Alternat. Well done!

Now to the problem I'm solving. I need to install third party software on an EC2 instance of Alternat. Specifically Datadog monitoring through which I want to monitor enhanced networking metrics and other system metrics.

By default, things like this are inserted on EC2 via a user data script that you use yourself. But as a Terraform module user I have no way to add my own script. In our fork we temporarily solved this by adding a variable, something like:

variable "nat_instance_user_data_post_install" {
  type        = string
  description = "Additional nat instance user data scripts"
  default     = ""
}

The variable then appears in the cloudinit data config (1debit/alternat/modules/terraform-aws-alternat/main.tf#L172):

data "cloudinit_config" "config" {
  for_each = { for obj in var.vpc_az_maps : obj.az => obj.route_table_ids }

  gzip          = true
  base64_encode = true
  part {
    content_type = "text/x-shellscript"
    content = templatefile("${path.module}/alternat.conf.tftpl", {
      eip_allocation_ids_csv = join(",", local.nat_instance_eip_ids),
      route_table_ids_csv    = join(",", each.value)
    })
  }
  part {
    content_type = "text/x-shellscript"
    content      = file("${path.module}/../../scripts/alternat.sh")
  }
  part {
    content_type = "text/x-shellscript"
    content = var.nat_instance_user_data_post_install
  }
}

When defining the module, we then use a similar definition:

nat_instance_user_data_post_install = templatefile("${path.module}/templates/user_data_nat_instance.tpl", {
    DD_SITE       = var.datadog_site
    DD_API_KEY    = var.datadog_api_key
  })

user_data_nat_instance.tpl contains a separate installation of the Datadog agent; it is a simple bash script with a few commands. This is how any third-party software for monitoring or other user needs could be installed.
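
For illustration, a post-install template of that kind might boil down to little more than Datadog's documented install one-liner, driven by the DD_SITE and DD_API_KEY template variables above. This is a sketch, not the exact script, and the install script URL should be checked against Datadog's current documentation:

#!/bin/bash
# DD_SITE and DD_API_KEY are filled in by templatefile(); the URL is Datadog's Agent 7 install script.
DD_SITE="${DD_SITE}" DD_API_KEY="${DD_API_KEY}" \
  bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"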

Do you plan to add similar functionality in the future? Do you find our solution reasonable, or do you have another idea how to install third-party software?

Thanks in advance for your time!

Q: How to deal with Terraform drift for existing routes?

In our environment we create routes to 0.0.0.0/0 to be managed by NAT gateways by default, like this:

resource "aws_route" "private_route_per_az" {
    destination_cidr_block = "0.0.0.0/0"
    id                     = "r-rtb-<id>"
    nat_gateway_id         = "nat-<id>"
    origin                 = "CreateRoute"
    route_table_id         = "rtb-<id>"
    state                  = "active"
}

Once AlterNAT is deployed and updates the route, we see drift the next time terraform plan is run:

  # module.infra.aws_route.private_route_per_az["nat-<id>"] will be updated in-place
  ~ resource "aws_route" "private_route_per_az" {
        id                     = "r-rtb-<id>"
      + nat_gateway_id         = "nat-<id>"
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

So far I've managed to silence the noise by applying ignore_changes on the drifted resource, e.g.:

  lifecycle {
    ignore_changes = all
  }

My question: is there a better way around this? Thanks!

Request: Automated releases, Repo Hygiene

This is not a feature request in the technical sense, but having to do with the GitHub repo itself.

This project is clearly picking up since its inception, but the GitHub repo itself is missing a few things relating to GitHub repo best practices and release automation. The biggest thing is that the Terraform module is at v0.1.0, and the code works, which is great. But what happens when someone makes a pull request for a hotfix or a new feature? Cutting releases manually is generally prone to human error and leads to arbitrary semantic version bumps. So I propose a few things in this repo that can make it easier to manage in the long term:

Here are examples of popular Terraform module repos using what I've described above:

I can easily add some of these in (as can another volunteer), but wondering what your thoughts are.

Allow all ingress traffic to NAT instance by default

Currently, we need to input security group IDs one by one to allow ingress traffic to the NAT instance. This can be time-consuming and may not be efficient for some use cases. I propose that we explore the possibility of allowing all ingress traffic to the NAT instance by default, while still providing the option to specify security group IDs for more granular control when needed.

Please consider this feature request and let me know if any additional information is required.

Keep the same EIP on the instances

It seems like it's possible, with some Lambda hackery, to keep the same set of static EIPs assigned to the instances in the Auto Scaling group.

For example, if the asg needs to rotate an instance due to max refresh

  • asg scales from 3 to 4 instances
  • wait for new ec2 to be healthy
  • lambda associates instance-to-be-deleted's ip to the new ec2
  • asg then destroys the older instance

Could this work?

Would this also avoid needing to update the route table every time?

Would this still drop existing connections?

References

ARM Lambda Option

Hi team,

First off, thanks for this excellent project!

I have been working on setting up alterNAT to replace our NAT gateways for our more active AWS accounts. I noticed there isn't an option for running the Lambdas in ARM architecture. It would be a neat little cost optimization given how often the connectivity Lambda runs alone.

Admittedly, I am not familiar with how Python works with ARM CPU architecture - so I can imagine this making it tricky.

Thanks in advance!

Question: how to find out NAT gateway data transfer usage?

This is more a question related to evaluating migration to alternat. I'm not an AWS expert, so please go easy on me. :)

Is there a way to figure out how much data has been transferred through a NAT gateway in a month without enabling VPC flow logs? If VPC flow logs queried with Athena are the only option, are there any tools that provide some sort of automation, a Terraform module perhaps? I've been struggling to find anything that's not 4+ years old.

I think putting this information or links to some resources into the readme would really help with the adoption.

Thanks for a great tool!

Discussion: constructing `vpc_az_maps` deterministically

Long time no see!

I've recently started using the AWS-maintained Terraform module for provisioning VPCs, mainly due to its better support for AWS IPAM. While trying to construct the vpc_az_maps from the VPC module's outputs and pass it down to the AlterNAT module, I've discovered that Terraform can't plan the changes. Note: using targeted applies, applying the VPC module first and the AlterNAT module afterwards, works without issues.

I'm looking forward to any ideas on how this could be solved in a painless manner. I would love to contribute the solution to the examples if we manage to find one.

💡 CloudPosse has written a nice article about values that can't be determined until apply; leaving it here for those who might come across this issue.


The following Terraform (slightly simplified for demonstration purposes):

module "vpc_from_cidr" {
  count = 1

  source  = "aws-ia/vpc/aws"
  version = "~> 4.4.2"

  name     = "example-vpc"
  az_count = 3

  cidr_block = "172.16.44.0/22"

  subnets = {
    private = { netmask = 26 }
    public  = { netmask = 26 }
  }
}

locals {
  private_subnet_attributes_by_az      = module.vpc_from_cidr[0].private_subnet_attributes_by_az
  private_route_table_attributes_by_az = module.vpc_from_cidr[0].rt_attributes_by_type_by_az.private
  public_subnet_attributes_by_az       = module.vpc_from_cidr[0].public_subnet_attributes_by_az

  vpc_config_for_alternat = [
    for name, attributes in local.private_subnet_attributes_by_az : {
      az                 = attributes.availability_zone
      private_subnet_ids = [attributes.id]
      public_subnet_id   = local.public_subnet_attributes_by_az[attributes.availability_zone].id
      route_table_ids = [
        local.private_route_table_attributes_by_az[name].id
      ]
    }
  ]
}

module "alternat_instances" {
  source  = "chime/alternat/aws"
  version = "0.6.0"

  nat_instance_type   = "t4g.medium"
  lambda_package_type = "Zip"

  vpc_id      = module.vpc_from_cidr[0].vpc_attributes.id
  vpc_az_maps = local.vpc_config_for_alternat
}

.. results in:

│ Error: Invalid for_each argument
│
│   on .terraform/modules/cell.alternat_instances/lambda.tf line 129, in resource "aws_lambda_function" "alternat_connectivity_tester":
│  129:   for_each = { for obj in var.vpc_az_maps : obj.az => obj }
│     ├────────────────
│     │ var.vpc_az_maps is a list of object, known only after apply
│
│ The "for_each" map includes keys derived from resource attributes that cannot be determined until apply, and so Terraform cannot determine the full set of keys that will identify the instances of this resource.
│
│ When working with unknown values in for_each, it's better to define the map keys statically in your configuration and place apply-time results only in the map values.
│
│ Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.
╵
╷
│ Error: Invalid count argument
│
│   on .terraform/modules/cell.alternat_instances/main.tf line 55, in resource "aws_eip" "nat_instance_eips":
│   55:   count = local.reuse_nat_instance_eips ? 0 : length(var.vpc_az_maps)
│
│ The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the resources that the count depends on.
╵
╷
│ Error: Invalid for_each argument
│
│   on .terraform/modules/cell.alternat_instances/main.tf line 425, in resource "aws_eip" "nat_gateway_eips":
│  425:   for_each = {
│  426:     for obj in var.vpc_az_maps
│  427:     : obj.az => obj.public_subnet_id
│  428:     if var.create_nat_gateways
│  429:   }
│     ├────────────────
│     │ var.create_nat_gateways is true
│     │ var.vpc_az_maps is a list of object, known only after apply
│
│ The "for_each" map includes keys derived from resource attributes that cannot be determined until apply, and so Terraform cannot determine the full set of keys that will identify the instances of this resource.
│
│ When working with unknown values in for_each, it's better to define the map keys statically in your configuration and place apply-time results only in the map values.
│
│ Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.
╵

Suggestion: Add Structured logging

Hi team!

We are looking to enhance the log outputs of alterNAT Lambdas in the form of structured logging. This would tremendously improve the searchability and analysis of logs when errors surface.

For those not privy to structured logging, check out this article.

Would the maintainers of this repo be open to a PR that featured structured logging? We'd be happy to implement it!

Implementation questions:

  • Would you prefer this as something that users opt in/opt out of? Or would you prefer to implement structured logging as the only logging solution?
  • Do you have any preference on what library/libraries that would be used for this? I was looking to implement structlog (https://www.structlog.org/en/stable/ & https://github.com/hynek/structlog) as a potential library. It has a fair amount of stars, and is in active development.

I have read through the Contributing and Code of Conduct documentation for the repo, though I am also open to suggestions or guidelines you may have that haven't yet been covered in the documentation.

Thanks!

AWS Gateway Load Balancer

AWS have an offering called the Gateway Load Balancer. Pricing is based on the GWLB itself and GWLB endpoints (GWLBe). Both have per-hour and per-GB costs.

A single GWLB can be deployed to multiple AZs (just like an LB) and GWLBes are AZ-specific. So for a single AZ it would cost $0.0225/hr. Per-GB it would be an extra $0.0075 - this is assuming costs are dominated by bandwidth and not the other GLCU units (active connections and new connections).

As per tweet the main benefits would be high availability within each AZ (via multiple instances behind a GWLBe) and the ability for connection draining.

There is even a recent blog post demonstrating exactly how this could be achieved. It is the "2-arm" mode that would most closely mirror how alterNAT works today.

"Reset" to NAT instances after failover

I'm putting this here to see if there's any interest in adding in the ability to "fall back" to the NAT instances after a failover due to curl failure. Or am I missing something that will set it back automatically?

I'm working on the code anyway, so I'm happy to make a PR if you think it's useful.

Right now, my first thought is to update the connectivity check Lambdas so that, the first time through, they check the route table and, if it's set to a NAT Gateway, change it to a NAT instance just before the first check; that way, if it's still down, it will immediately be changed back. Effective, but this will cause a connectivity blip every minute while failed over to the NAT Gateway.

Option 2 is to have a separate Lambda on a separate schedule (maybe every 15 minutes by default, or only on demand?) that, if the route tables are using NAT Gateways, runs an "Instance Refresh" on the ASG, forcing it to re-create the instances. In theory, we could terminate the instances and the ASG would do its thing as well.
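
For reference, option 2's refresh can be triggered directly with the CLI (the Auto Scaling group name is a placeholder):

aws autoscaling start-instance-refresh --auto-scaling-group-name alternat-us-east-1a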

Thoughts?

AlterNAT at scale Questions

Our Dev/Staging environments quite frankly suck: low traffic at best, and in testing we encountered no issues at all (this is good).

When we tried to simulate a production environment with AlterNAT, we noticed the instance would start to drop traffic when we reached higher Lambda execution volumes. We hit the instance's PPS limit and packets were dropped.
Increasing the NAT instance class seemed to help.

Our Production environment is a different beast.
The vast majority of our NAT traffic is from Lambda executions. (Occasionally bursting past 300,000 executions per minute)

I'm concerned about hitting a PPS Limit and having drops in production.

Since you stated you send several PB of traffic, I'm going to guess your traffic is a lot more than ours (it would make sense)

Our short-lived Lambdas (for SSO, DynamoDB lookups, small API requests) are all quick in and out, but our long-running Lambdas can run upwards of 6 minutes (data continues to flow to/from the browser during this, so the PPS does not stop).

Without going into specifics:

  • Do you notice any dropped packets / throttling with a c6gn.8xlarge?
  • I'm going to guess that your NAT Instances are in every private subnet and thus you can spread out the load across multiple subnets.

All of our Lambdas use a single subnet with a NAT Gateway in that subnet. Unfortunately, I cannot spread the load across subnets, as re-engineering the architecture is not feasible until winter 2025.

(This has been transferred from an email with @bwhaley for public visibility and comments)

Clarify the need for lambda VPC endpoint in the documentation

Forgive me if this has already been asked but I haven't found the answer in the documentation or in other issues.

I understand the need for the EC2 VPC endpoint to replace the routes in the route table when there are connectivity issues. But I don't see that the Lambda VPC endpoint is needed for this. If I'm correct, the Lambda endpoint is used for testing purposes only. Could you please confirm whether this is the case?

I would be happy to open a PR to update the documentation if that's the case. Maybe it's not a big saving, but having an extra VPC endpoint in 4 AZs could represent around $30/month.

Thank you

Error: local-exec provisioner error

Hi,
I've faced an error

Error: local-exec provisioner error
with module.alternat_instances.null_resource.prepare_artifact[0],
  on .terraform/modules/alternat_instances/modules/terraform-aws-alternat/lambda.tf line 8, in resource "null_resource" "prepare_artifact":
    8:   provisioner "local-exec" {

 Error running command 'pip install -r
.terraform/modules/alternat_instances/modules/terraform-aws-alternat/../../functions/replace-route/requirements.txt
 -t
.terraform/modules/alternat_instances/modules/terraform-aws-alternat/../../functions/replace-route':
 exit status 127. Output: /bin/sh: pip: not found

my terraform plan:

module "alternat_instances" {
  source = "git::https://github.com/1debit/alternat.git//modules/terraform-aws-alternat?ref=v0.3.3"

  alternat_image_uri = ""
  alternat_image_tag = ""
  providers      = {
    aws = aws.my_aws
  }

  lambda_package_type = "Zip"

  tags = {
    ... 
    }

  vpc_id      = "vpc-90d219f6"
  vpc_az_maps = local.vpc_az_maps
}

I keep alternat_image_uri and alternat_image_tag because they are required, but they are empty because lambda_package_type = "Zip".

Enable performance monitoring using Elastic Network Adapter (ENA)

tl;dr - Can alterNAT instances publish additional network metrics to CloudWatch?

By default, AWS CloudWatch provides just four network metrics for EC2 instances: Network{In,Out} and NetworkPackets{In,Out}. In particular, there's no metric indicating the available network credits or whether an EC2 instance is being throttled.† (Applicable to instances with burstable network performance only.)

The good news is that the Elastic Network Adapter (ENA) provides additional network metrics which can be published to CloudWatch using the CloudWatch agent. Some useful metrics include:

  • bw_in_allowance_exceeded: The number of packets queued or dropped because the inbound aggregate bandwidth exceeded the maximum for the instance. (All instance types)
  • bw_out_allowance_exceeded: The number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance. (All instance types)
  • pps_allowance_exceeded: The number of packets queued or dropped because the bidirectional PPS exceeded the maximum for the instance. (All instance types)

These metrics indicate when an alterNAT instance's network performance is throttled, which is very useful to know. Alarms can be created and configured to send notifications when this occurs.

To publish these metrics, some relatively simple changes are required to an already deployed alterNAT setup (written in CDKTF).

Firstly, an inline policy is required to allow the instance to publish metrics to CloudWatch:

const alterNat = new alternat.Alternat(this, "alternat", {
  additionalInstancePolicies: [
    {
      policy_name: "publish-cloudwatch-custom-metrics",
      policy_json: Fn.jsonencode({
        Version: "2012-10-17",
        Statement: [
          {
            Action: "cloudwatch:PutMetricData",
            Effect: "Allow",
            Resource: "*",
          },
        ],
      }),
    },
  ],
  // ...
})

And a user data script is required to install and configure the CloudWatch agent:

const userdata = new TerraformAsset(this, "userdata", {
  path: "src/userdata/install-cw-agent.sh",
  type: AssetType.FILE,
})

const alterNat = new alternat.Alternat(this, "alternat", {
  // ...
  natInstanceUserDataPostInstall: Fn.file(userdata.path),
  // ...
})

The referenced install-cw-agent.sh script contains:

#!/bin/bash

CW_AGENT_CONFIG_PATH='/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json'

sudo yum -y install amazon-cloudwatch-agent

sudo touch "$CW_AGENT_CONFIG_PATH"
sudo tee "$CW_AGENT_CONFIG_PATH" > /dev/null << EOF
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "\${aws:InstanceId}"
    },
    "metrics_collected": {
      "ethtool": {
        "metrics_include": [
          "bw_in_allowance_exceeded",
          "bw_out_allowance_exceeded",
          "pps_allowance_exceeded"
        ]
      }
    }
  }
}
EOF

# load config file and restart agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c "file:$CW_AGENT_CONFIG_PATH"

That's it! Now, enhanced network metrics are published to CloudWatch by each alterNAT instance. As mentioned, these metrics are useful for indicating when an instance is throttled because the (hidden) network credits have been exhausted. They can also be used to optimise the instance size, as you can test different sizes and you'll know when an instance has insufficient network performance.

I'm sharing this because I think it's useful and I'd like others to benefit. Could this feature be included with alterNAT?

† I've contacted AWS Premium Support and they've confirmed they're actively working on a feature which exposes this information.

Deprecated `hashicorp/template` provider

The last release introduced data "template_cloudinit_config", which relies on the hashicorp/template provider, marked by HashiCorp as deprecated. The provider is no longer built and published for various architectures, such as ARM. People trying to execute Terraform on an M1 Mac will see:

│ Error: Incompatible provider version
│
│ Provider registry.terraform.io/hashicorp/template v2.2.0 does not have a package available for your current platform, darwin_arm64.
│
│ Provider releases are separate from Terraform CLI releases, so not all providers are available for all platforms. Other versions of
│ this provider may have different platforms supported.

Connectivity checks fail when URL resolves to IPv6 address in IPv4-only VPC

Occasionally, urllib will resolve the provided check URL to an IPv6 address. If the VPC in which the Lambda function is running isn't configured to support IPv6, the Lambda function will throw the following error:

error connecting to https://www.google.com/: <urlopen error [Errno 97] Address family not supported by protocol>

Some simple Googling reveals that this is often attributed to the host not supporting IPv6 (example). Unfortunately, I haven't been able to find a trivial way to force the built-in urllib package to use IPv4-only resolution. One alternative would be to add requests or urllib3 as a dependency (and Lambda layer), and use the following to accomplish this:

# requests
requests.packages.urllib3.util.connection.HAS_IPV6 = False
# urllib3
urllib3.util.connection.HAS_IPV6 = False

Runtime Python dependencies review

It looks like only requests are required for the Lambda to actually run. Would it make sense to remove requests in favour of built-in urllib3 for the sake of simplicity? It would also make packaging of alternat easier when using Lambda's python runtime instead of containers (addressed in #44).

Reuse Existing NAT Gateways

The module takes ownership of the NAT Gateways created in the provided subnets. However, it's possible that each VPC has already provisioned its own NAT Gateways. Could we supply a list of existing NAT Gateways instead of creating new ones?

Upstream terraform module to upstream registry

Please consider adding the terraform module to the upstream registry

https://registry.terraform.io/search/modules?q=Alternat

https://github.com/opentofu/registry/?tab=readme-ov-file

In order to submit the module, it may be necessary to move the Terraform code to a separate repo such as 1debit/terraform-aws-alternat, with a README generated by terraform-docs.

Once submitted and accepted, the module will be findable using the registry search allowing more users to access the product.

Thank you.

Any plan to extend this to other major clouds?

I'm really impressed after reading the README and watching the demo. I wonder whether the maintainers have any plans to extend the same idea to other major clouds, such as Azure and GCP. Obviously each cloud needs a different Terraform implementation due to different cloud APIs. However, it seems to me that all the basic components used in this repo are already available in GCP and Azure. I could be wrong though. Would like to know if there's any known blocker in clouds other than AWS. Thanks!

Adding lambda layers support

Adding support for accepting lambda layers for functions.

I tried to do it myself but couldn't find where I can sign to contribute code. I have the PR ready:

index 43f348b..95723f3 100644
--- a/modules/terraform-aws-alternat/lambda.tf
+++ b/modules/terraform-aws-alternat/lambda.tf
@@ -15,6 +15,8 @@ resource "aws_lambda_function" "alternat_autoscaling_hook" {
   timeout       = var.lambda_timeout
   role          = aws_iam_role.nat_lambda_role.arn
 
+  layers        = var.lambda_layer_arns
+
   image_uri = var.lambda_package_type == "Image" ? "${var.alternat_image_uri}:${var.alternat_image_tag}" : null
 
   runtime          = var.lambda_package_type == "Zip" ? "python3.8" : null
@@ -127,6 +129,8 @@ resource "aws_lambda_function" "alternat_connectivity_tester" {
   timeout       = var.lambda_timeout
   role          = aws_iam_role.nat_lambda_role.arn
 
+  layers        = var.lambda_layer_arns
+
   image_uri = var.lambda_package_type == "Image" ? "${var.alternat_image_uri}:${var.alternat_image_tag}" : null
 
   runtime          = var.lambda_package_type == "Zip" ? "python3.8" : null
diff --git a/modules/terraform-aws-alternat/variables.tf b/modules/terraform-aws-alternat/variables.tf
index cf5144f..6ab16f2 100644
--- a/modules/terraform-aws-alternat/variables.tf
+++ b/modules/terraform-aws-alternat/variables.tf
@@ -235,3 +235,9 @@ variable "lambda_function_architectures" {
   type        = list(string)
   default     = ["x86_64"]
 }
+
+variable "lambda_layer_arns" {
+  type = list(string)
+  description = "List of Lambda layers ARN that will be added to functions"
+  default = null
+}

Sync connection table using conntrackd or similar

Hey,
Just a quick idea, not sure if it's even doable, but what about using conntrackd for syncing the conntrack table to another node, which could be a passive instance?

According to this https://docs.aws.amazon.com/whitepapers/latest/real-time-communication-on-aws/floating-ip-pattern-for-ha-between-activestandby-stateful-servers.html#applicability-in-rtc-solutions Keepalived should work.

Or it could even be a newly created instance into which the conntrack table is imported when a reboot would be required. Optional locking can be done via AWS SSM.

Thanks

Debugging stuck connections

Hello,
we are trying to use Alternat as a replacement for our Managed NAT Gateway. In our use case, data from various sources is uploaded/downloaded from the internet through NAT.

We are facing random stuck connections when downloading data from MySQL via Alternat. We do not see these errors happening with the same configurations when using the Managed NAT Gateway. It's important to note that there were no Alternat failovers (route changes) when the connection got stuck. We have confirmed that the issue is definitely related to routing via Alternat. Unfortunately, we are not able to simulate this issue in our testing environment in isolation.

We have implemented monitoring provided by ENA and we do not see that we are hitting any limits during the times of errors, and we should not be close to the limits. Our instance is currently m6g.4xlarge, according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html.

When the connection gets stuck, we performed the following checks:

cpu=0           found=0 invalid=166 ignore=59199 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=115
cpu=1           found=0 invalid=160 ignore=58152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=125
cpu=2           found=0 invalid=161 ignore=60999 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=3           found=1 invalid=187 ignore=89662 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=113
cpu=4           found=0 invalid=170 ignore=73766 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=91
cpu=5           found=0 invalid=186 ignore=75964 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=6           found=0 invalid=172 ignore=77936 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=116
cpu=7           found=0 invalid=165 ignore=96004 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=104
cpu=8           found=0 invalid=0 ignore=64714 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=7
cpu=9           found=0 invalid=174 ignore=48375 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=84
cpu=10          found=0 invalid=161 ignore=83150 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=126
cpu=11          found=0 invalid=391 ignore=58921 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=239
cpu=12          found=0 invalid=176 ignore=77967 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=108
cpu=13          found=0 invalid=150 ignore=122890 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=128
cpu=14          found=0 invalid=154 ignore=62822 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=89
cpu=15          found=0 invalid=168 ignore=65315 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=113

We have seen established connections in conntrack -L on both the source nodes and the alternat nodes.

And limited activity for these connections in tcpdump:

tcpdump host x.x.x.x
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
05:54:00.294341 IP ip-10-10-29-139.ec2.internal.56482 > x.x.x.x.6011: Flags [F.], seq 3003739363, ack 1787942545, win 1463, options [nop,nop,TS val 2024721780 ecr 933756998], length 0
05:54:00.294345 IP ip-10-10-29-139.ec2.internal.56482 > x.x.x.x.6011: Flags [F.], seq 0, ack 1, win 1463, options [nop,nop,TS val 2024721780 ecr 933756998], length 0
05:54:02.002371 IP ip-10-10-29-139.ec2.internal.60756 > x.x.x.x.6011: Flags [F.], seq 1433822988, ack 3418363755, win 1365, options [nop,nop,TS val 3357797591 ecr 933756998], length 0
05:54:02.002377 IP ip-10-10-29-139.ec2.internal.60756 > x.x.x.x.6011: Flags [F.], seq 0, ack 1, win 1365, options [nop,nop,TS val 3357797591 ecr 933756998], length 0
05:54:07.178172 IP ip-10-10-29-139.ec2.internal.56084 > x.x.x.x.6012: Flags [F.], seq 3978364809, ack 1664524663, win 1210, options [nop,nop,TS val 16024769 ecr 933756998], length 0
05:54:07.178178 IP ip-10-10-29-139.ec2.internal.56084 > x.x.x.x.6012: Flags [F.], seq 0, ack 1, win 1210, options [nop,nop,TS val 16024769 ecr 933756998], length 0

I understand that it might be difficult to determine the root cause based on the information provided. We would appreciate any ideas on where to look and what tools to use for debugging. Have we missed any other limits or useful metrics? Particularly, where can the behavior be different from Managed NATs?

One of the differences that we noticed was the 350-second idle timeout for Managed NATs. We have added sysctl net.netfilter.nf_conntrack_tcp_timeout_established=350 for Alternat nodes.

Request for Enhancements: Improved Monitoring for AlterNAT Routing Transitions

Our team had been using alterNAT for some time and, due to its seamless transition to NAT Gateway, we did not immediately recognize that it had switched the route from NAT Instance to NAT Gateway (potentially an external issue). As a result, we incurred unexpected expenses.

To address this concern and enhance our monitoring capabilities, we would like to propose a new feature: the ability to receive alerts whenever any routing entry is replaced, preferably through an SNS Topic.

By implementing this feature, we can proactively monitor routing transitions and promptly respond to any changes, ensuring that we optimize our usage of resources and minimize unnecessary costs. We understand the value of this feature for our team and believe it could benefit other users as well.

We are enthusiastic about contributing to this enhancement and would be more than willing to create a pull request to implement the suggested feature if it aligns with your development roadmap.
