vrivellino / spoptimize

Spoptimize: Replace AWS AutoScaling instances with spot instances

License: Mozilla Public License 2.0

Python 94.68% Shell 5.32%

aws autoscaling autoscaling-groups spot-instances ec2-spot ec2-spot-instances aws-step-functions python cloudformation serverless-application-model

spoptimize's Issues

Support launch templates

Launch Templates with auto-scaling is a thing.

ECS Support

Hello!

Can spoptimize be used for EC2 instances with ECS?

Thanks!

Update deployment documentation

Split out getting-started/quick-deploy from advanced deployment topics.

Advanced deployment docs would include details on how to override defaults via environment variables.

Terminate spot instances via termination notices

Wire up a lambda to subscribe to termination notices from CloudWatch Events and terminate Spoptimize-launched spot instances via the autoscaling API.
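A minimal sketch of what such a handler might look like, assuming the event is the standard "EC2 Spot Instance Interruption Warning" from CloudWatch Events. The `is_spoptimize_instance` callback is hypothetical (how Spoptimize-launched instances are identified is not specified here), and `autoscaling` is expected to be a boto3 autoscaling client, passed in so the logic can be tested without AWS access.

```python
def handle_interruption(event, autoscaling, is_spoptimize_instance):
    """Terminate the instance named in a spot interruption warning.

    event: CloudWatch Events payload for a spot interruption warning.
    autoscaling: object with boto3's autoscaling client interface.
    is_spoptimize_instance: hypothetical predicate, instance_id -> bool.
    Returns the terminated instance id, or None if nothing was done.
    """
    if event.get('detail-type') != 'EC2 Spot Instance Interruption Warning':
        return None
    instance_id = event.get('detail', {}).get('instance-id')
    if not instance_id or not is_spoptimize_instance(instance_id):
        return None
    # Keep desired capacity unchanged so the ASG launches a replacement.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
    return instance_id
```

Leaving desired capacity untouched is a design assumption: it lets autoscaling launch a fresh on-demand instance, which Spoptimize could then replace again.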

Cancel spot requests for terminated instances

Perhaps wire up a lambda to ASG termination notices to cancel spot requests. EC2 will eventually close them on its own, but open spot requests associated with terminated instances count against the account-holder's limit.
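One way this could work, sketched under the assumption that the terminated instance is still visible to `describe_instances` when the notice arrives. The function name is hypothetical, and `ec2` is expected to be a boto3 EC2 client passed in by the caller.

```python
def cancel_spot_requests_for(instance_id, ec2):
    """Cancel any spot request associated with the given instance.

    ec2: object with boto3's EC2 client interface.
    Returns the list of spot request ids that were cancelled.
    """
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    request_ids = [
        inst['SpotInstanceRequestId']
        for reservation in resp.get('Reservations', [])
        for inst in reservation.get('Instances', [])
        # On-demand instances carry no spot request id.
        if inst.get('SpotInstanceRequestId')
    ]
    if request_ids:
        ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=request_ids)
    return request_ids
```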

Simplify spot instance attachment; Add locking

When testing the initial implementation, I found that with an autoscaling group with desired-capacity of 1, spoptimize and autoscaling get into a loop:

An instance is launched by the ASG
Spoptimize attaches a spot instance in same AZ as launched on-demand instance
ASG launches a new instance in another AZ, attempting to rebalance
The original instance and the spot instance get nuked
Process repeats in the other AZ

Rather than worry about seamless attachments and terminations, spoptimize should instead:

terminate on-demand and attach spot in same step
provide a lock-out mechanism to prevent parallel executions from attaching & terminating in the same ASG

With locking implemented, there won't be any service downtime as long as the autoscaling group has more than one instance running. And an autoscaling group of 1 implies that some service downtime is acceptable.

Tag spot instance requests

Support minimum number of on-demand instances

Allow a configuration override to prevent spoptimize from replacing all on-demand instances.
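A rough sketch of how such an override could be evaluated, assuming it is supplied as an autoscaling group tag. The tag key `spoptimize:min_on_demand` and both function names are hypothetical; the tag list follows the `Key`/`Value` shape returned by `describe_auto_scaling_groups`.

```python
MIN_ON_DEMAND_TAG = 'spoptimize:min_on_demand'  # hypothetical tag key


def min_on_demand_from_tags(asg_tags, default=0):
    """Read the minimum on-demand count from ASG tags, falling back to default."""
    for tag in asg_tags:
        if tag.get('Key') == MIN_ON_DEMAND_TAG:
            try:
                return max(0, int(tag.get('Value', default)))
            except (TypeError, ValueError):
                # Unparseable values are ignored rather than failing the run.
                return default
    return default


def may_replace_on_demand(on_demand_count, min_on_demand=0):
    """True if replacing one on-demand instance keeps the minimum intact."""
    return on_demand_count - 1 >= min_on_demand
```

For example, with three on-demand instances and a minimum of two, one replacement is allowed; with two on-demand instances it is not.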

Refactor handler.py & stepfns.py

handler.py has zero test coverage, and it contains some logic that probably belongs in stepfns.py.

It'd be great to get test coverage for handler.py and keep as much logic as possible in stepfns.py.

All lambda return values should be defined somewhere (perhaps in a standalone module) so that a test can compare those strings against the ones in sam.yml.

Parameterize and/or restrict iam:PassRole

IAM policy currently allows the lambdas to pass any IAM role. This should be restricted.

Perhaps a parameter that allows the user to list ARNs?
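To illustrate the idea, the restricted statement could be generated from a user-supplied list of role ARNs. This is only a sketch: the function name is hypothetical and the ARN in the usage example is a placeholder, not a value from this project.

```python
import json


def pass_role_statement(role_arns):
    """Build an IAM policy statement allowing iam:PassRole only for the given roles."""
    arns = [arn for arn in role_arns if arn]
    if not arns:
        # Refuse to fall back to a wildcard resource.
        raise ValueError('at least one role ARN is required')
    return {
        'Effect': 'Allow',
        'Action': 'iam:PassRole',
        'Resource': arns,
    }


if __name__ == '__main__':
    # Placeholder ARN for illustration only.
    stmt = pass_role_statement(['arn:aws:iam::123456789012:role/example-instance-role'])
    print(json.dumps(stmt, indent=2))
```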

Update readme to note MaxSpotInstanceCountExceeded

During testing, I came across this error from the request-spot lambda:

An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded: ClientError
Traceback (most recent call last):
  File "/var/task/handler.py", line 64, in handler
    event['launch_subnet_id'], client_token)
  File "/var/task/spoptimize/stepfns.py", line 106, in request_spot_instance
    return spot_helper.request_spot_instance(launch_config, az, subnet_id, client_token)
  File "/var/task/spoptimize/spot_helper.py", line 65, in request_spot_instance
    Type='one-time', ClientToken=client_token)
  File "/var/runtime/botocore/client.py", line 317, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 615, in _make_api_call
    raise error_class(parsed_response, operation_name)
ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded

The solution was to request service limit increase via AWS Support.
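Until the limit is raised, calling code could at least recognize this error and log a clearer message. The sketch below assumes the exception is a botocore `ClientError`, which exposes the parsed error under `exc.response['Error']['Code']`; the helper name is hypothetical and is not part of spoptimize.

```python
SPOT_LIMIT_CODE = 'MaxSpotInstanceCountExceeded'


def is_spot_limit_error(exc):
    """True if exc looks like a botocore ClientError for the spot instance limit."""
    response = getattr(exc, 'response', None) or {}
    return response.get('Error', {}).get('Code') == SPOT_LIMIT_CODE
```

A caller might wrap the spot request in `try`/`except` and, when `is_spot_limit_error(e)` is true, report that a service limit increase is needed instead of surfacing the raw traceback.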

Cancel spot request after failure

Make sure the spot request is cancelled after a failure (Terminate Spot and Unrecoverable Spot Instance Failure).

Update readme to note protected & standby instances

Note in the documentation that spoptimize does not replace protected or standby instances: execution stops if spoptimize detects that the launched instance is protected from scale-in or is marked as standby.
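The check described above might look roughly like this. It is a sketch over one entry of the `AutoScalingInstances` list returned by `describe_auto_scaling_instances`; the function name is hypothetical, and treating `EnteringStandby` the same as `Standby` is an assumption rather than documented spoptimize behavior.

```python
# Lifecycle states treated as "standby" here (EnteringStandby is an assumption).
STANDBY_STATES = {'Standby', 'EnteringStandby'}


def is_replaceable(asg_instance):
    """True if the instance is neither scale-in protected nor in standby.

    asg_instance: dict with 'ProtectedFromScaleIn' and 'LifecycleState' keys.
    """
    if asg_instance.get('ProtectedFromScaleIn'):
        return False
    return asg_instance.get('LifecycleState') not in STANDBY_STATES
```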

Revisit locking retry/back-off

Exclusive locking is implemented via step functions' retry semantics:

spoptimize/sam.yml

Lines 322 to 326 in 4aa555c

 "Retry": [{
   "ErrorEquals": [ "GroupLocked" ],
   "IntervalSeconds": 5,
   "MaxAttempts": 20,
   "BackoffRate": 1.5

I'm not sure this is the right long-term solution. For larger autoscaling groups, it may take hours for all instances to be replaced after a deploy or mass update.

Perhaps allow for more than one instance to be replaced by Spoptimize at a time (configurable via tag)? Or just have a static interval between retries?
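To put numbers on the concern, the retry schedule implied by IntervalSeconds 5, MaxAttempts 20 and BackoffRate 1.5 can be worked out directly. The sketch assumes the usual step functions semantics: the first retry waits IntervalSeconds, and each later wait is the previous one multiplied by BackoffRate. The function name is made up for this example.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry, in order."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(5, 20, 1.5)
total = sum(delays)

print('first wait: %.1f s' % delays[0])
print('last wait:  %.2f h' % (delays[-1] / 3600))
print('total wait: %.2f h' % (total / 3600))
```

Under these assumptions the final wait alone is roughly three hours, and an execution that exhausts all 20 retries has waited a little over nine hours in total. A static interval, or a smaller BackoffRate, would keep the per-retry wait bounded.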

Wait for cloudformation

If an auto-scaling group is managed by cloudformation and the associated stack's status is IN_PROGRESS (for example, UPDATE_IN_PROGRESS), wait for the stack status to settle before proceeding with execution.

This will prevent spoptimize from doing anything during stack updates.
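A possible shape for this check, sketched with two hypothetical helpers. It assumes the stack can be found via the `aws:cloudformation:stack-name` tag that cloudformation applies to resources it creates, and that every transitional stack status ends in `_IN_PROGRESS`.

```python
CFN_STACK_TAG = 'aws:cloudformation:stack-name'


def cfn_stack_name(asg_tags):
    """Return the managing stack's name from ASG tags, or None if unmanaged."""
    for tag in asg_tags:
        if tag.get('Key') == CFN_STACK_TAG:
            return tag.get('Value')
    return None


def stack_in_progress(stack_status):
    """True while the stack is still transitioning, e.g. UPDATE_IN_PROGRESS."""
    return stack_status.endswith('_IN_PROGRESS')
```

The status string would come from the `StackStatus` field of a `describe_stacks` call; while `stack_in_progress` returns true, the state machine could wait and check again, much like the existing GroupLocked retry.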
