terraform-aws-metaflow's Introduction

Metaflow Terraform module

Terraform module that provisions AWS resources to run Metaflow in production.

This module consists of submodules that can be used separately as well:

modules diagram

You can use either this high-level module or the submodules individually. See each submodule's corresponding README.md for more details.
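
For example, a submodule can be sourced directly using Terraform's double-slash subdirectory syntax. A minimal sketch; the inputs shown are illustrative placeholders, so consult the submodule's README.md and variables for the actual required arguments:

module "metaflow_datastore" {
  # Double-slash addresses a subdirectory of the registry module
  source  = "outerbounds/metaflow/aws//modules/datastore"
  version = "0.3.0"

  # Illustrative placeholders only
  resource_prefix = "metaflow"
  resource_suffix = "dev"
}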

Here's a minimal end-to-end example of using this module together with a VPC:

# Random suffix for this deployment
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

locals {
  resource_prefix = "metaflow"
  resource_suffix = random_string.suffix.result
}

data "aws_availability_zones" "available" {
}

# VPC infra using https://github.com/terraform-aws-modules/terraform-aws-vpc
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "3.13.0"

  name = "${local.resource_prefix}-${local.resource_suffix}"
  cidr = "10.10.0.0/16"

  azs             = data.aws_availability_zones.available.names
  private_subnets = ["10.10.8.0/21", "10.10.16.0/21", "10.10.24.0/21"]
  public_subnets  = ["10.10.128.0/21", "10.10.136.0/21", "10.10.144.0/21"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}


module "metaflow" {
  source = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = false
  subnet1_id            = module.vpc.public_subnets[0]
  subnet2_id            = module.vpc.public_subnets[1]
  vpc_cidr_blocks       = module.vpc.vpc_cidr_blocks
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = true

  tags = {
      "managedBy" = "terraform"
  }
}

# Export all outputs from the metaflow module
output "metaflow" {
  value = module.metaflow
}

# The module generates a Metaflow config in JSON format; write it to a file
resource "local_file" "metaflow_config" {
  content  = module.metaflow.metaflow_profile_json
  filename = "./metaflow_profile.json"
}

Note: You can find a more complete example that uses this module and also sets up SageMaker notebooks and other non-Metaflow-specific infrastructure in this repo.

Modules

| Name | Source | Version |
|------|--------|---------|
| metaflow-common | ./modules/common | n/a |
| metaflow-computation | ./modules/computation | n/a |
| metaflow-datastore | ./modules/datastore | n/a |
| metaflow-metadata-service | ./modules/metadata-service | n/a |
| metaflow-step-functions | ./modules/step-functions | n/a |
| metaflow-ui | ./modules/ui | n/a |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| access_list_cidr_blocks | List of CIDRs we want to grant access to our Metaflow Metadata Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
| batch_type | AWS Batch compute type ('ec2' or 'fargate') | string | "ec2" | no |
| compute_environment_desired_vcpus | Desired starting vCPUs for the Batch compute environment [0-16] (EC2 only; ignored for Fargate) | number | 8 | no |
| compute_environment_egress_cidr_blocks | CIDR blocks to which egress is allowed from the Batch compute environment's security group | list(string) | ["0.0.0.0/0"] | no |
| compute_environment_instance_types | The instance types for the compute environment | list(string) | ["c4.large", "c4.xlarge", "c4.2xlarge", "c4.4xlarge", "c4.8xlarge"] | no |
| compute_environment_max_vcpus | Maximum vCPUs for the Batch compute environment [16-96] | number | 64 | no |
| compute_environment_min_vcpus | Minimum vCPUs for the Batch compute environment [0-16] (EC2 only; ignored for Fargate) | number | 8 | no |
| db_engine_version | n/a | string | "11" | no |
| db_instance_type | RDS instance type to launch for the PostgreSQL database | string | "db.t2.small" | no |
| db_migrate_lambda_zip_file | Output path for the zip file containing the DB migrate lambda | string | null | no |
| enable_custom_batch_container_registry | Provisions infrastructure for a custom Amazon ECR container registry if enabled | bool | false | no |
| enable_key_rotation | Enable key rotation for KMS keys | bool | false | no |
| enable_step_functions | Provisions infrastructure for Step Functions if enabled | bool | n/a | yes |
| extra_ui_backend_env_vars | Additional environment variables for the UI backend container | map(string) | {} | no |
| extra_ui_static_env_vars | Additional environment variables for the UI static app | map(string) | {} | no |
| force_destroy_s3_bucket | Empty the S3 bucket before destroying it via terraform destroy | bool | false | no |
| iam_partition | IAM partition (select aws-us-gov for AWS GovCloud, otherwise leave as is) | string | "aws" | no |
| launch_template_http_endpoint | Whether the instance metadata service is available. Can be 'enabled' or 'disabled' | string | "enabled" | no |
| launch_template_http_put_response_hop_limit | The desired HTTP PUT response hop limit for instance metadata requests. Can be an integer from 1 to 64 | number | 2 | no |
| launch_template_http_tokens | Whether the instance metadata service requires session tokens, also referred to as Instance Metadata Service Version 2 (IMDSv2). Can be 'optional' or 'required' | string | "optional" | no |
| metadata_service_container_image | Container image for the metadata service | string | "" | no |
| metadata_service_enable_api_basic_auth | Enable basic auth for API Gateway? (requires key export) | bool | true | no |
| metadata_service_enable_api_gateway | Enable API Gateway for a public metadata service endpoint | bool | true | no |
| resource_prefix | String prefix for all resources | string | "metaflow" | no |
| resource_suffix | String suffix for all resources | string | "" | no |
| subnet1_id | First subnet used for availability zone redundancy | string | n/a | yes |
| subnet2_id | Second subnet used for availability zone redundancy | string | n/a | yes |
| tags | AWS tags | map(string) | n/a | yes |
| ui_alb_internal | Defines whether the ALB for the UI is internal | bool | false | no |
| ui_allow_list | List of CIDRs we want to grant access to our Metaflow UI Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
| ui_certificate_arn | SSL certificate for the UI. If set to an empty string, the UI is disabled. | string | "" | no |
| ui_static_container_image | Container image for the UI frontend app | string | "" | no |
| vpc_cidr_blocks | The VPC CIDR blocks that we'll access-list on our Metadata Service API to allow all internal communications | list(string) | n/a | yes |
| vpc_id | The ID of the single VPC we stood up for all Metaflow resources to exist in | string | n/a | yes |
| with_public_ip | Enable public IP assignment for the Metadata Service. If the subnets specified for subnet1_id and subnet2_id are public subnets, you will NEED to set this to true to allow pulling container images from public registries. Otherwise this should be set to false. | bool | n/a | yes |
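
As an illustration of the optional inputs above, here is a sketch that switches the compute environment to Fargate, narrows egress, and enables KMS key rotation (values are examples, not recommendations):

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = "metaflow"
  resource_suffix = "prod"

  enable_step_functions = true
  subnet1_id            = module.vpc.private_subnets[0]
  subnet2_id            = module.vpc.private_subnets[1]
  vpc_cidr_blocks       = [module.vpc.vpc_cidr_block]
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = false # private subnets, per the with_public_ip description

  # Optional overrides from the table above
  batch_type                             = "fargate"
  compute_environment_egress_cidr_blocks = ["10.10.0.0/16"]
  enable_key_rotation                    = true

  tags = {
    "managedBy" = "terraform"
  }
}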

Outputs

| Name | Description |
|------|-------------|
| METAFLOW_BATCH_JOB_QUEUE | AWS Batch job queue ARN for Metaflow |
| METAFLOW_DATASTORE_SYSROOT_S3 | Amazon S3 URL for the Metaflow datastore |
| METAFLOW_DATATOOLS_S3ROOT | Amazon S3 URL for Metaflow DataTools |
| METAFLOW_ECS_S3_ACCESS_IAM_ROLE | Role for AWS Batch to access Amazon S3 |
| METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE | IAM role for Amazon EventBridge to access AWS Step Functions |
| METAFLOW_SERVICE_INTERNAL_URL | URL for the Metadata Service (accessible in the VPC) |
| METAFLOW_SERVICE_URL | URL for the Metadata Service (accessible in the VPC) |
| METAFLOW_SFN_DYNAMO_DB_TABLE | AWS DynamoDB table name for tracking AWS Step Functions execution metadata |
| METAFLOW_SFN_IAM_ROLE | IAM role for AWS Step Functions to access AWS resources (AWS Batch, AWS DynamoDB) |
| api_gateway_rest_api_id_key_id | API Gateway key ID for the Metadata Service. Fetch the key value from the AWS Console [METAFLOW_SERVICE_AUTH_KEY] |
| batch_compute_environment_security_group_id | The ID of the security group attached to the Batch compute environment |
| datastore_s3_bucket_kms_key_arn | The ARN of the KMS key used to encrypt the Metaflow datastore S3 bucket |
| metadata_svc_ecs_task_role_arn | n/a |
| metaflow_api_gateway_rest_api_id | The ID of the API Gateway REST API we'll use to accept Metadata Service requests and forward them to the Fargate API instance |
| metaflow_batch_container_image | The ECR repo containing the Metaflow Batch image |
| metaflow_profile_json | Metaflow profile JSON object that can be used to communicate with this Metaflow stack. Store it in ~/.metaflow/config_[stack-name] and select it with export METAFLOW_PROFILE=[stack-name] |
| metaflow_s3_bucket_arn | The ARN of the bucket we'll be using as blob storage |
| metaflow_s3_bucket_name | The name of the bucket we'll be using as blob storage |
| migration_function_arn | ARN of the DB migration function |
| ui_alb_arn | UI ALB ARN |
| ui_alb_dns_name | UI ALB DNS name |
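
As one example of consuming these outputs, api_gateway_rest_api_id_key_id can be fed to the AWS provider's aws_api_gateway_api_key data source to retrieve METAFLOW_SERVICE_AUTH_KEY without visiting the console. A sketch, assuming basic auth is enabled:

data "aws_api_gateway_api_key" "metadata" {
  id = module.metaflow.api_gateway_rest_api_id_key_id
}

output "metaflow_service_auth_key" {
  value     = data.aws_api_gateway_api_key.metadata.value
  sensitive = true
}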

terraform-aws-metaflow's People

Contributors

benchoncy, ctrombley, erin-boehmer, greghilstonhop, isaac4real, jackie-ob, josephsirak, kldavis4, limess, oavdeev, ob-uk, olivermeyer, pgasior, ruial, ryewilson, savingoyal, tkbky, valaydave, wrbooth, yanp

terraform-aws-metaflow's Issues

Overly permissive policy in batch_execution_role

The batch_execution_role is allowed to do autoscaling:DeleteLaunchConfiguration on all resources (see here). This is a concern, since it could seriously mess with production systems running in EKS.

Is there a strong reason for allowing this, or is it possible to restrict the resources which are covered by the policy?

METAFLOW_SERVICE_URL is empty in the metaflow_profile.json

I used this minimal Terraform to spin up infrastructure with an existing VPC. I set metadata_service_enable_api_gateway = false. Unfortunately, the resulting metaflow_profile.json file had METAFLOW_SERVICE_URL set to "".

Is this expected? METAFLOW_SERVICE_INTERNAL_URL was filled with the internal NLB URL, so that's okay.

Unsurprisingly, I could not communicate with the metadata service. To fix it, I had to fill in the URL manually.

locals {
  subnet_ids = ["foo","bar"]
}


data "aws_vpc" "main" {
  tags = {
    Name = "my-super-precious-vpc"
  }
}


module "metaflow" {
  source = "outerbounds/metaflow/aws"
  version = "0.9.2"

  resource_prefix = "metaflow"
  resource_suffix = "random"

  enable_step_functions = true
  subnet1_id            = local.subnet_ids[0]
  subnet2_id            = local.subnet_ids[1]
  vpc_cidr_blocks       = [data.aws_vpc.main.cidr_block]
  vpc_id                = data.aws_vpc.main.id
  with_public_ip        = true

  metadata_service_enable_api_gateway = false
  batch_type                          = "fargate"

  tags = {}
}

Issues with terraform apply

Hello!
I tried running terraform apply after terraform init (I'm following this guide: https://outerbounds.com/engineering/deployment/aws-k8s/deployment/#initialize-your-terraform-workspace). I'm using the latest version of this repo.

terraform apply initially returned the following error:

╷
│ Warning: Argument is deprecated
│ 
│   with module.metaflow-datastore.aws_s3_bucket.this,
│   on .terraform/modules/metaflow-datastore/modules/datastore/s3.tf line 3, in resource "aws_s3_bucket" "this":
│    3:   acl           = "private"
│ 
│ Use the aws_s3_bucket_acl resource instead
╵
╷
│ Error: Unsupported argument
│ 
│   on .terraform/modules/metaflow-datastore/modules/datastore/rds.tf line 109, in resource "aws_db_instance" "this":
│  109:   name                      = var.db_name                                                  # unique id for CLI commands (name of DB table which is why we're not adding the prefix as no conflicts will occur and the API expects this table name)
│ 
│ An argument named "name" is not expected here.

I manually commented out this argument, but afterwards I still got more errors:

╷
│ Warning: Argument is deprecated
│ 
│   with module.metaflow-datastore.aws_s3_bucket.this,
│   on .terraform/modules/metaflow-datastore/modules/datastore/s3.tf line 3, in resource "aws_s3_bucket" "this":
│    3:   acl           = "private"
│ 
│ Use the aws_s3_bucket_acl resource instead
╵
╷
│ Error: Invalid provider configuration
│ 
│ Provider "registry.terraform.io/hashicorp/aws" requires explicit configuration. Add a provider block to the root module and configure the provider's required arguments as described in the provider
│ documentation.
│ 
╵
╷
│ Error: validating provider credentials: retrieving caller identity from STS: operation error STS: GetCallerIdentity, https response error StatusCode: 403, RequestID: af356c1f-c5e6-4565-abf2-a2d0bbfef591, api error InvalidClientTokenId: The security token included in the request is invalid.
│ 
│   with provider["registry.terraform.io/hashicorp/aws"],
│   on <empty> line 0:
│   (source code not available)
│ 

Could you help me resolve this, or explain how to get rid of these errors?
Thank you in advance!

Task is stuck on RUNNABLE

Hi all,

As the title suggests, I am getting the infamous "Task stuck on RUNNABLE" line when I try to run this simple flow:

from metaflow import FlowSpec, step
import os
global_value = 5


class ProcessDemoFlow(FlowSpec):
    @step
    def start(self):
        global global_value
        global_value = 9
        print('process ID is', os.getpid())
        print('global_value is', global_value)
        self.next(self.end)

    @step
    def end(self):
        print('process ID is', os.getpid())
        print('global_value is', global_value)


if __name__ == '__main__':
    ProcessDemoFlow()

For provisioning the infrastructure, I used the minimal Terraform AWS template in the README. However, I had to make a few adjustments to remove some errors (I can't have been the only one...):

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "5.5.3"
...

In the vpc module I had to modify the version number to the latest version: 5.5.3.

module "metaflow" {
  source = "outerbounds/metaflow/aws"
  version = "0.12.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = false
  subnet1_id            = module.vpc.public_subnets[0]
  subnet2_id            = module.vpc.public_subnets[1]
  vpc_cidr_blocks       = [module.vpc.vpc_cidr_block]
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = true
  db_engine_version     = 16
  db_instance_type      = "db.t3.small"

  tags = {
      "managedBy" = "terraform"
  }
}

In the metaflow module I had to change module.vpc.vpc_cidr_blocks to [module.vpc.vpc_cidr_block], because I was getting an error saying that module.vpc.vpc_cidr_blocks didn't exist. I confirmed that no such output exists in the vpc module (no idea why this is in the template...). I also had to update the version number to 0.12.0. Finally, I got an error stating that the combination of Postgres, a db_engine_version of "11", and a db_instance_type of "db.t2.small" (the default values) is not allowed by AWS, so I updated db_engine_version to 16 and db_instance_type to "db.t3.small".

I have done basic checks on Batch, ECS, and EC2, and everything appears to be connected, valid, etc. What could be going wrong? I keep seeing suggestions that my compute environment might be too limited for the task, but given that this is a very basic task, I don't think that is the issue. Is one of the modifications I made to the minimal template wrong?

Variable for ephemeral_storage in aws_ecs_task_definition

Hello,

I would like to increase the ephemeral_storage of a Metaflow ECS task so that it can load larger Docker images (PyTorch GPU Docker images are 10 GB+ compressed). Currently, the instance runs out of disk space and shows the following error:

CannotPullContainerError: no space left on device

The ephemeral_storage is currently not defined in the metadata-service submodule, so it is set to its default value of 21 GB. Would it be possible to add a variable so that we can define this parameter?
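
For illustration, here is a sketch of the change being requested: a new variable feeding the ephemeral_storage block of aws_ecs_task_definition. All names are illustrative, not existing module inputs:

variable "ephemeral_storage" {
  description = "Hypothetical input: task ephemeral storage in GiB"
  type        = number
  default     = 21 # the implicit Fargate default mentioned above
}

resource "aws_ecs_task_definition" "example" {
  family                   = "metaflow-metadata-service" # illustrative
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  container_definitions    = jsonencode([]) # placeholder

  ephemeral_storage {
    size_in_gib = var.ephemeral_storage
  }
}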

I'd be happy to work on this and share a PR.

Many thanks!

API Gateway should be optional

The API Gateway is only useful when external traffic is expected. In cases where all traffic will be internal to the VPC in which Metaflow is hosted, the API Gateway only adds value if it's used for additional access control. It's also a liability because the only way to deny all incoming traffic is to misuse the access_list_cidr_blocks variable to make the API Gateway's resource policy allow traffic only from an impossible IP range.

I see two solutions here:

  • If the API Gateway is useful even for all-private traffic (e.g. to allow other forms of access control), then it should be possible to make it private
  • If the API Gateway is not useful for all-private traffic, then it should be possible to disable it in the module

I think the first solution is preferable in the long run, but the second is simpler to implement. I'm happy to open a PR but I'm not sure which way to go.
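
For reference, the Inputs table above lists a metadata_service_enable_api_gateway flag (added in later module versions) that implements the second option. A sketch with placeholder values:

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.9.2"

  resource_prefix       = "metaflow"
  resource_suffix       = "internal"
  enable_step_functions = false
  subnet1_id            = "subnet-aaaa1111" # placeholder
  subnet2_id            = "subnet-bbbb2222" # placeholder
  vpc_cidr_blocks       = ["10.10.0.0/16"]
  vpc_id                = "vpc-cccc3333" # placeholder
  with_public_ip        = false
  tags                  = {}

  # Keep all metadata service traffic internal to the VPC
  metadata_service_enable_api_gateway = false
}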

Allow optional security_groups to be passed to batch instance

Personally, I would benefit from being able to supply optional security groups in addition to the one generated by the module.

My org has existing SGs that we typically attach to new infrastructure to allow access to common services (e.g. RDS, Presto clusters, etc.).

Upgrade problems after deploying T2 small to T4G.medium with old Postgres version

I'm trying to upgrade from Postgres v11.x to v15.x and change the DB instance to t4g.medium, but I get blocked when doing it through Terraform.

Doing it manually is possible, but it takes a while, and it defeats the point of IaC :)

Hope this can be fixed so that new installs start on Postgres 15.x and a t4g instance.

RDS database T2 instance is being retired

[Just a service message]

The current default RDS instance type for the metadata service is db.t2.small (the db_instance_type default listed above). However, this instance type will soon no longer be available.

AWS wrote this to me (shortened):

You are receiving this notification because you have one or more Amazon RDS for PostgreSQL database instances running on db.m4, db.r4, or db.t2 instance types. We plan to retire all M4, R4, and T2 instances by April 2024. [...] Reserved instances for db.m4, db.r4, and db.t2 will no longer be available for purchase beginning on June 30, 2023.

Depending on the instance type you choose to upgrade to, you may also need to upgrade your PostgreSQL database version. We recommend that you upgrade your database engine to PostgreSQL 13.4 or higher. For more details on supported RDS DB instance classes for each engine version please refer to the Amazon RDS user guide.

UI submodule expects default security group to be present

The UI submodule assumes the AWS default security group is present.

I receive the following error when using the UI submodule.

Error: no matching SecurityGroup found

  with module.metaflow_ui.data.aws_security_group.vpc_default,
  on .terraform/modules/metaflow_ui/modules/ui/data.tf line 5, in data "aws_security_group" "vpc_default":
   5: data "aws_security_group" "vpc_default" {

ui/data.tf line 5:

data "aws_security_group" "vpc_default" {
  name   = "default"
  vpc_id = var.metaflow_vpc_id
}

I believe the module should create its own security group rather than expect the AWS default to exist.
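
As a sketch, the data lookup could be replaced with a security group owned by the submodule (names are illustrative):

resource "aws_security_group" "ui" {
  name_prefix = "metaflow-ui-" # illustrative
  description = "Owned by the UI submodule instead of assuming the VPC default SG exists"
  vpc_id      = var.metaflow_vpc_id

  # Ingress/egress rules would mirror whatever the submodule
  # currently relies on from the default security group.
}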

Example of Developer Role

I'm working on deploying a Metaflow instance to AWS and would find it useful to see an example of what a developer role would look like.

To be honest, I couldn't find much about what IAM policies need to be in place for a Metaflow developer to carry out various tasks.

For example, to reproduce production issues locally, I'm assuming the user would need S3 read access to the production artifact store.

When configuring AWS via Metaflow, the only instruction is to configure the AWS CLI:

Metaflow relies on AWS access credentials present on your computer to access resources on AWS.
Before proceeding further, please confirm that you have already configured these access credentials on this computer. [Y/n]: n
There are many ways to setup your AWS access credentials. You can get started by following this guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

--

Edit: After spending some more time working with this repository, it seems that this file is a starting point. My approach is to also create aws_iam_policy resources that can be leveraged by both the AWS Batch job role and a developer role.

I would still find more documentation on the AWS IAM permissions model incredibly useful -- there are a lot of roles at play in this implementation.

Public subnets mandatory for tasks

In the UI module we have to provide public subnets, contrary to what the documentation says; otherwise, access through the ALB doesn't work.

Also, we have only 2 variables for passing subnets. Since we have to pass public ones for the ALB to work correctly, the tasks end up running in public subnets with public IPs. It would be useful to be able to provide private subnets and not assign public IPs in this case.

VPC module in README is very out of date

Version 5.1.1 is now available; updating to it resolves an unexpected-attribute error, but the outputs have changed.

module "vpc" {
source  = "terraform-aws-modules/vpc/aws"
version = "5.1.1"

The output:
module.vpc.vpc_cidr_blocks no longer exists and should be replaced with [module.vpc.vpc_cidr_block]

Terraform Apply Surfaces Tags Not In Terraform Plan

When using the minimal Terraform example in the README that leverages the Metaflow Terraform module, terraform apply surfaced tags that were not in the terraform plan.

Here's an example output:

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-metadata-service.aws_ecs_cluster.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "Metaflow" has appeared.

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-metadata-service.aws_ecs_cluster.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "Name" has appeared.

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-metadata-service.aws_ecs_cluster.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "managedBy" has appeared.

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-datastore.aws_db_subnet_group.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "Metaflow" has appeared.

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-datastore.aws_db_subnet_group.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "Name" has appeared.

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for module.metaflow.module.metaflow-datastore.aws_db_subnet_group.this to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/aws" produced an invalid new value for .tags_all: new element "managedBy" has appeared.

Make public IP optional for UI Backend and UI Static

Everything I deploy has to be within a VPC without any public endpoints, which means I can't use the Metaflow UI module out of the box right now. Can we make the "assign public IP addresses" setting in the network configuration of the aws_ecs_service resources optional?
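
In aws_ecs_service terms, the request amounts to parameterizing assign_public_ip inside network_configuration. A minimal sketch with a hypothetical variable and placeholder identifiers:

variable "ui_assign_public_ip" {
  description = "Hypothetical toggle for assigning public IPs to UI tasks"
  type        = bool
  default     = false
}

resource "aws_ecs_service" "ui_backend" {
  name            = "metaflow-ui-backend"   # illustrative
  cluster         = "metaflow-ui"           # illustrative cluster name
  task_definition = "metaflow-ui-backend:1" # illustrative family:revision
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = ["subnet-aaaa1111"] # placeholder
    assign_public_ip = var.ui_assign_public_ip
  }
}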

Minimum Terraform version?

Hi,

I read in a doc (this one: aws-tf) that the minimum version is 0.14, but trying with version 1.0, I get this error:
An argument named "nullable" is not expected here.

Indeed, it seems the "nullable" variable feature was added in Terraform 1.1.
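
For context, this is the variable syntax in question; the nullable argument is accepted only by Terraform >= 1.1 (the variable name is illustrative):

variable "example" {
  type     = string
  default  = "value"
  nullable = false # Terraform 1.0 reports: An argument named "nullable" is not expected here.
}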

I think the doc is misleading. So, what is the minimum possible Terraform version?

Thank you

Doesn't work with provided examples (examples/minimal)

Steps to reproduce:

  1. Clone the repository and cd terraform-aws-metaflow/examples/minimal
  2. Set locals.resource_prefix = "test-metaflow" in minimal_example.tf
  3. Run terraform apply and wait until it finishes
  4. Run aws apigateway get-api-key --api-key <api-key> --include-value | grep value and paste the result into the metaflow_profile.json file
  5. Import Metaflow configuration: metaflow configure import metaflow_profile.json
  6. Run python mftest.py run
mftest.py
from metaflow import FlowSpec, step, batch, resources


class MfTest(FlowSpec):
    @step
    def start(self):
        print("Started")
        self.next(self.run_batch)

    @batch
    @resources(cpu=1, memory=1_000)
    @step
    def run_batch(self):
        print("Hello from @batch")
        self.next(self.end)

    @step
    def end(self):
        print("Finished")

if __name__ == '__main__':
    MfTest()

The running task never finishes; the created AWS Batch job in the AWS Batch job queue remains in status RUNNABLE.

I also tried with outerbounds/metaflow/aws version = 0.10.1 and terraform-aws-modules/vpc/aws version = 5.1.2.

Generated metaflow_profile.json
{
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:<region>:<account>:job-queue/test-metaflow-<random>",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://test-metaflow-s3-<random>/metaflow",
  "METAFLOW_DATATOOLS_S3ROOT": "s3://test-metaflow-s3-<random>/data",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DEFAULT_METADATA": "service",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::<account>:role/test-metaflow-batch_s3_task_role-<random>",
  "METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE": "",
  "METAFLOW_SERVICE_AUTH_KEY": <get-api-key-result>,
  "METAFLOW_SERVICE_INTERNAL_URL": "http://test-metaflow-nlb-<random>-<random>.elb.<region>.amazonaws.com/",
  "METAFLOW_SERVICE_URL": "https://<random>.execute-api.<region>.amazonaws.com/api/",
  "METAFLOW_SFN_DYNAMO_DB_TABLE": "",
  "METAFLOW_SFN_IAM_ROLE": "",
  "METAFLOW_SFN_STATE_MACHINE_PREFIX": "test-metaflow-<random>"
}

ACL is a deprecated argument for aws_s3_bucket

Warning: Argument is deprecated

  with module.metaflow_ml.module.metaflow-datastore.aws_s3_bucket.this,
  on .terraform/modules/metaflow_ml.metaflow-datastore/modules/datastore/s3.tf line 3, in resource "aws_s3_bucket" "this":
   3:   acl           = "private"

 Use the aws_s3_bucket_acl resource instead

Allowing multiple batch queues

Hi team, would it be possible for the "computation" submodule (batch.tf) to return the compute environment ARN as an additional output of the "metaflow" module? That ARN could then be used to provision another Batch queue linked to the existing compute environment. Alternatively, the module could accept a list input to generate multiple queues (a sketch follows below). That way, when running a state machine, one could do python flow.py --with batch:queue=whereyouwantittorun, allowing concurrently running queues against the same compute environment. The policy attached to the role that Step Functions assumes would have to allow submitting jobs to those Batch queues; currently it only allows submitting jobs to the one default queue. This policy sits under the "step-functions" module in the file iam-step-functions.tf, and it is here that the policy would need to allow a list of Batch queue ARNs.

Cheers
Byron
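
A sketch of the list-input approach described above. The variable and the module output are hypothetical, and newer AWS provider versions link queues to compute environments via compute_environment_order rather than compute_environments:

variable "additional_batch_queues" {
  description = "Hypothetical list of extra job queue names"
  type        = list(string)
  default     = []
}

resource "aws_batch_job_queue" "additional" {
  for_each = toset(var.additional_batch_queues)

  name     = each.value
  state    = "ENABLED"
  priority = 1

  # Reuses the existing compute environment; its ARN would need to be
  # exposed as a module output, as requested above.
  compute_environments = [module.metaflow.batch_compute_environment_arn] # hypothetical output
}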

Database backup not auto-enabled and can't enable it

Looking at the database, automated backups and snapshots are not enabled; that is something I need to enable manually. It would be nice if the module enabled this automatically, or at least made it possible to enable.
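
For illustration, automated backups on the underlying aws_db_instance are controlled by backup_retention_period (0 disables them). A sketch of a standalone instance with backups enabled; all values are illustrative, not the module's actual configuration:

resource "aws_db_instance" "example" {
  identifier        = "metaflow-metadata" # illustrative
  engine            = "postgres"
  engine_version    = "15"
  instance_class    = "db.t3.small"
  allocated_storage = 20
  db_name           = "metaflow"
  username          = "metaflow"

  manage_master_user_password = true # requires a recent AWS provider

  backup_retention_period = 7             # keep 7 days of automated backups
  backup_window           = "03:00-04:00" # preferred window, UTC
  copy_tags_to_snapshot   = true

  skip_final_snapshot = true # illustrative only
}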

Need additional parameter for RDS security group

For teams leveraging the module, there's currently no way to pass in additional CIDRs to the created RDS security group for things like VPN access for maintenance, handling migrations of metaflow services to kubernetes clusters without redeploying / importing the RDS, etc.

The current workaround is to define an aws_security_group_rule resource and attach it to the RDS security group; however, a Terraform bug creates a scenario where this rule requires a double apply (the first apply attaches it, the second detaches it, the third re-attaches it). This causes intermittent issues for anything other than the Metaflow metadata service that tries to connect to the RDS backend.
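
The workaround described above looks roughly like this; the security group lookup is hypothetical, since the module does not output the RDS security group ID directly:

# Hypothetical lookup; adjust the filter to however the SG is actually named.
data "aws_security_group" "rds" {
  vpc_id = module.vpc.vpc_id
  filter {
    name   = "group-name"
    values = ["metaflow-rds-*"] # illustrative name pattern
  }
}

resource "aws_security_group_rule" "vpn_to_rds" {
  type              = "ingress"
  from_port         = 5432 # PostgreSQL
  to_port           = 5432
  protocol          = "tcp"
  cidr_blocks       = ["10.100.0.0/16"] # e.g. a VPN CIDR
  security_group_id = data.aws_security_group.rds.id
}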

Minimal terraform example leads to internal server error

When I use https://github.com/outerbounds/terraform-aws-metaflow/blob/master/examples/minimal/minimal_example.tf (from v0.8.0, which is the current master) to deploy to AWS (eu-central-1) and afterwards update METAFLOW_SERVICE_AUTH_KEY in the generated config.json, a simple flow leads to this error:

Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
    Metaflow service error:
    Metadata request (/flows/TestGPUFlow) failed (code 500): {"message": "Internal server error"}

Since with_public_ip is not defined in the minimal example, I tried both true and false, but I get the same error message in both cases.

Currently, I am using version 0.5.1, and there everything seems to work (also with GPUs).

PS: the minimal example in the README still uses version 0.3.0.

Problem with variables when instantiating Metaflow with EKS on AWS

Hello

Thank you for these great resources. I am trying to deploy Metaflow on AWS with Kubernetes, following these instructions:
https://github.com/outerbounds/terraform-aws-metaflow/tree/master/examples/eks_argo

When I generate my config with terraform init, the deprecated name argument for the DB is still present in
.terraform/modules/metaflow-datastore/modules/datastore/rds.tf
as fixed by: #65

I also get errors about unsupported arguments:

enable_classiclink             = var.enable_classiclink
enable_classiclink_dns_support = var.enable_classiclink_dns_support
enable_classiclink             = var.default_vpc_enable_classiclink

in the generated file .terraform/modules/vpc/main.tf

Has anyone else encountered this issue?

I modified these generated files directly, commenting out those lines and changing name to db_name (very bad practice, I know), and it deployed my Metaflow stack correctly.

Thank you

Compute environment for GPU

I used the module (v0.5.0 and master) to deploy the infrastructure (thanks for this great module!). But my impression is that I currently cannot use @batch(gpu=1) for a step: the corresponding task gets stuck in RUNNABLE. Metaflow instantiates a p2 instance if I request one, but tasks decorated with @batch(gpu=1) never leave the RUNNABLE state. (Note that when I terminate such a task manually, other pure CPU tasks, i.e. those not decorated with @batch(gpu=1), do get executed on that p2 instance.)
My best guess so far is that the Terraform module simply does not create a GPU compute environment. There is a very old commit 91a8f26 in this repo, but the GPU-related parts disappeared two commits later, and now only one CPU AMI and CPU compute environment are created.

Are there any plans to integrate GPU compute environments back into the module? Or is there an obvious way to enable a GPU compute environment that I missed?

PostgreSQL 11 is EOL and out of standard support for RDS, causing additional cost

The PostgreSQL version used in the datastore module is currently 11, as seen in:

However, this version is now EOL and incurs extra cost for extended support on AWS ($0.10/vCPU/hour); see:

https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLReleaseNotes/postgresql-release-calendar.html
https://aws.amazon.com/rds/postgresql/pricing/

Could this be upgraded?
