eformat / sno-for-100

24.0 24.0 10.0 93 KB

Single Node OpenShift on AWS Spot

Shell 100.00%

sno-for-100's People

walidshaari syanpriyajot akash-sethi-redhat dendod96 rafaeltuelho noushi whoiscnu jpaulrajredhat chennu shettyjm

sno-for-100's Issues

Call reboot-instances instead of stop/start in fix-instance-id.sh

In fix-instance-id.sh, we could call the EC2 reboot-instances command instead of stop/start. Stopping the host while it is under a persistent spot request can cause a race condition, with AWS trying to restart it for you.

Also, if the host is set to "terminate on stop" outside of our script's happy path, then that stop command will inadvertently terminate it.
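
A minimal sketch of what the reboot variant could look like. The function name `reboot_instance` is hypothetical and not taken from fix-instance-id.sh; `reboot-instances` and the `instance-status-ok` waiter are standard AWS CLI commands. The waiter can return right away if the status checks have not yet noticed the reboot, so treat it as a rough guard only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reboot in place instead of stop/start.
reboot_instance() {
    local instance_id="$1"
    # A reboot leaves the instance in the "running" state, so neither a
    # persistent spot request nor a stop-triggered termination is involved.
    aws ec2 reboot-instances --instance-ids "$instance_id" || return 1
    # Block until the EC2 status checks report OK again.
    aws ec2 wait instance-status-ok --instance-ids "$instance_id"
}
```

Usage would be something like `reboot_instance "$INSTANCE_ID"`, where `INSTANCE_ID` is an assumed variable name.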

Add: --delete-ami to default path

It would be nice to clean up by default (ec2-spot-converter has a --delete-ami option for this). However, I got this error with that flag enabled:

[INFO] 2022-11-24 08:44:03,949 ec2-spot-converter - [STEP 26/26] Deregister image... 
Traceback (most recent call last):
  File "/home/scuppett/bin/ec2-spot-converter", line 1621, in <module>
    sys.exit(main(sys.argv))
             ^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1587, in main
    return_code, reason, keys = step["Function"]()
                                ^^^^^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1212, in deregister_image
    snap_ids = [blk["Ebs"]["SnapshotId"] for blk in img["BlockDeviceMappings"]]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1212, in <listcomp>
    snap_ids = [blk["Ebs"]["SnapshotId"] for blk in img["BlockDeviceMappings"]]
                ~~~^^^^^^^
KeyError: 'Ebs'

Add the ability to use Lets encrypt on SNO

Something like what is described here: https://ksingh7.medium.com/lets-automate-let-s-encrypt-tls-certs-for-openshift-4-211d6c081875

Fails to remove LoadBalancer target groups

The conversion script removes the API load balancers correctly. However, it fails to remove the target groups that belonged to those original load balancers. I only noticed them while cleaning up some things manually.

This is a minor issue, given that:

- Uninstall removes them just fine
- They don't really cost anything
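
If the cleanup were added to the conversion script, it might look roughly like the sketch below. Both function names are hypothetical. The sketch treats any target group with an empty `LoadBalancerArns` list as orphaned, which matches every unattached target group in the account and region, not only the cluster's; a real version would likely also filter on the cluster name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print ARNs of target groups attached to no load balancer.
list_orphan_target_groups() {
    aws elbv2 describe-target-groups \
        --query 'TargetGroups[?length(LoadBalancerArns)==`0`].TargetGroupArn' \
        --output text
}

# Hypothetical helper: delete every orphaned target group found above.
# ASSUMPTION: all orphaned groups in this account/region are safe to delete.
delete_orphan_target_groups() {
    local arn
    # Unquoted on purpose: text output separates ARNs with tabs.
    for arn in $(list_orphan_target_groups); do
        echo "deleting target group $arn"
        aws elbv2 delete-target-group --target-group-arn "$arn" || return 1
    done
}
```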

Feature: Add support for an all-in-one script - failure modes InsufficientInstanceCapacity

As a user, I just want to run one script... puhlease!!

Wait for API/router availability?

I'm getting faster at these automated steps.... :)

The scripts exit right after the restart. I wonder if they could wait for the API to become available, or for the instance to register as active in the ELB? (Otherwise, moving straight to the next step usually fails because the API isn't available yet.)
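
One possible shape for such a wait is a small polling loop. This is only a sketch: `wait_for_url` is a hypothetical helper, and the `/readyz` path, the default timeout and the poll interval are assumptions rather than anything the scripts define.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it answers successfully or a timeout passes.
# Arguments: url [timeout_seconds] [interval_seconds]
wait_for_url() {
    local url="$1" timeout="${2:-600}" interval="${3:-10}" waited=0
    # -k: accept the cluster's self-signed certificate
    # -f: treat HTTP errors (>= 400) as failures
    until curl -ksf -o /dev/null --max-time 5 "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $url" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

A script could end with something like `wait_for_url "https://api.${CLUSTER_NAME}.${BASE_DOMAIN}:6443/readyz"`, where both variable names are placeholders.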

EIP still fails to release properly on first attempt, despite explicit wait

}
 -> delete_nat_gateways [ nat-0cfc03e3ca74ee554 ] OK
 -> wait_for_nat_gateway_delete [ nat-082277122928075ac	nat-07e3a5c341f04cabc	nat-0cfc03e3ca74ee554 ] OK

An error occurred (AuthFailure) when calling the ReleaseAddress operation: You do not have permission to access the specified resource.
🕱Failed - could not release eip 3.13.164.181 eipalloc-073a4438a763e61a5 ?

This happens even after waiting for the NAT gateway delete to finish.
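
Since the release still fails on the first attempt, a retry around `release-address` may be a pragmatic workaround. The sketch below assumes the failure is transient; `release_eip_with_retry` is a hypothetical name, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry releasing an Elastic IP a few times.
# Arguments: allocation_id [attempts] [delay_seconds]
release_eip_with_retry() {
    local allocation_id="$1" attempts="${2:-10}" delay="${3:-15}" i=1
    while [ "$i" -le "$attempts" ]; do
        if aws ec2 release-address --allocation-id "$allocation_id" 2>/dev/null; then
            return 0
        fi
        echo "release of $allocation_id failed (attempt $i/$attempts), retrying" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

Retrying hides the AuthFailure text on intermediate attempts, so a real version might want to log the last error before giving up.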

AWS CLI v2 Issue with Router ELB

This step didn't work (maybe with CLI v2 only?):

🌴 RouterLoadBalancer set to aa2a9ddd000e94b718487e804e0e3d24	a9b27c28add644da48d5e14113aedf24
Note: AWS CLI version 2, the latest major version of the AWS CLI, is now stable and recommended for general use. For more information, see the AWS CLI version 2 installation instructions at: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help

Unknown options: a9b27c28add644da48d5e14113aedf24
🕱Failed - could not associate router lb  aa2a9ddd000e94b718487e804e0e3d24	a9b27c28add644da48d5e14113aedf24 with instance i-0244d5a182f015f11 ?
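
The "Unknown options" message suggests that two tab-separated load balancer names ended up in one variable and the second was passed to the CLI as a stray argument. A sketch of handling each name separately follows. `associate_router_lbs` is a hypothetical helper, and the use of `aws elb register-instances-with-load-balancer` assumes the router load balancers are Classic ELBs; the output above does not show which command the script runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: register one instance with each router load balancer.
# Arguments: instance_id "whitespace-separated load balancer names"
# ASSUMPTION: the router load balancers are Classic ELBs (hence `aws elb`).
associate_router_lbs() {
    local instance_id="$1" lb_names="$2" lb
    # Unquoted on purpose: split the tab/space separated names into words.
    for lb in $lb_names; do
        aws elb register-instances-with-load-balancer \
            --load-balancer-name "$lb" \
            --instances "$instance_id" || return 1
    done
}
```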

LoadBalancers array needs to be emptied on master machine

Noticed this Event go by on the log:

sno-1-6jxkf-master-0
(combined from similar events): sno-1-6jxkf-master-0: reconciler failed to Update machine: failed to updated update load balancers: LoadBalancerNotFound: Load balancers '[sno-1-6jxkf-int, sno-1-6jxkf-ext]' not found status code: 400, request id: 32db87a0-867d-454c-ae42-5e9680bb7836

In the Machine for the master, there's an array similar to this:

  loadBalancers:
    - name: sno-1-6jxkf-int
      type: network
    - name: sno-1-6jxkf-ext
      type: network

Those load balancers are deleted during conversion, so the array needs to be emptied like this:

    loadBalancers: []
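
One way to apply that change from a script is a JSON patch with `oc`. This is a sketch under assumptions: `clear_machine_load_balancers` is a hypothetical helper, the Machine is assumed to live in the `openshift-machine-api` namespace, and the array is assumed to sit at `spec.providerSpec.value.loadBalancers` as in the AWS provider spec.

```shell
#!/usr/bin/env bash
# Hypothetical helper: empty the loadBalancers array on a Machine.
# ASSUMPTION: AWS provider spec layout, namespace openshift-machine-api.
clear_machine_load_balancers() {
    local machine="$1"
    oc patch machine "$machine" \
        -n openshift-machine-api \
        --type=json \
        -p '[{"op":"replace","path":"/spec/providerSpec/value/loadBalancers","value":[]}]'
}
```

For the event above, the call would be `clear_machine_load_balancers sno-1-6jxkf-master-0`.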

Bring "Adjust AWS Objects" before "Convert to Spot"

If you take #3, then the script in https://github.com/eformat/sno-for-100/blob/main/adjust-single-node.sh can lose the "find instance id" part of the script and get simpler (just use the environment variable).

Also, by doing the AWS adjustment first, you allow everybody to save "some" money on the infrastructure pieces (LoadBalancers and NAT Gateway) and stop right there, without necessarily having to convert to spot or do the instance surgery for the updated instance ID...

I noticed this could be beneficial on my account's bill. Internally we have some savings plans which cancel out the entire instance cost (burn committed spend), but converting to spot actually increases cost to the business and could leave committed spend underutilized.... smh.

Elastic IPs attempted to be removed too soon

We probably need to wait for the NAT GW to delete completely (openshift-install destroy cluster has a similar mechanism).

Got this error running the script:

An error occurred (AuthFailure) when calling the ReleaseAddress operation: You do not have permission to access the specified resource.
🕱Failed - could not release eip 3.15.101.139 eipalloc-0018453cac7b2f9f6 ?

I think the permission message is a red herring; my user does have permission. I believe the script didn't wait long enough for the NAT GW to be completely gone/deleted. I deleted the EIPs manually after a bit and re-ran the script to completion.

May need to remove taint proactively

I only saw this once during early prototyping, but I'm capturing it here in case we need it.

A taint appeared when converting the host, before providerID was being set correctly. This command was needed to get all the pods scheduling on the SNO correctly again:

kubectl taint nodes <<your node name>> node-role.kubernetes.io/master:NoSchedule-

We may want to add that to the fix-instance-id.sh script preemptively since it is unlikely to cause harm and only adds protection against this scenario.
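
A preemptive version has one wrinkle: `kubectl taint` exits non-zero when the taint is not present, which would abort a script running with `set -e`. A minimal sketch that tolerates this, with `remove_master_taint` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove the master NoSchedule taint if it is present.
remove_master_taint() {
    local node="$1"
    # The trailing "-" removes the taint. kubectl fails when the taint is
    # absent, so the error is swallowed to keep the step idempotent.
    kubectl taint nodes "$node" node-role.kubernetes.io/master:NoSchedule- 2>/dev/null || true
}
```

Swallowing the exit code also hides genuine failures such as an unreachable API, so this trades diagnostics for idempotence.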

EIP tagging still isn't quite right

I can see where the script attempts to tag the EIP here:

https://github.com/eformat/sno-for-100/blob/main/adjust-single-node.sh#L140-L153

However, I'm still getting this warning (and no tags):

💀Warning - tag_eip - could not find any tags for new eip ?

I tried to debug it a little bit here:

[scuppett@x1-carbon sno-for-100]$ read -d '' -r -a lines < <(aws ec2 describe-tags --filters "Name=resource-id,Values=i-080f66d46f78f5a4c" --output text)
[scuppett@x1-carbon sno-for-100]$ echo $lines
TAGS
[scuppett@x1-carbon sno-for-100]$ aws ec2 describe-tags --filters "Name=resource-id,Values=i-080f66d46f78f5a4c" --output text
TAGS    Name    i-080f66d46f78f5a4c     instance        sno-2-98rbv-master-0
TAGS    kubernetes.io/cluster/sno-2-98rbv       i-080f66d46f78f5a4c     instance        owned

So I think it's getting chopped prematurely.
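
Part of what the debug session shows is standard shell behaviour: `read -a` splits the input on whitespace into words rather than lines, and `echo $lines` expands only the first array element, which is the word `TAGS`. A sketch of reading the output row by row instead is below. `print_tag_pairs` is a hypothetical helper, and the column order (marker, key, resource id, resource type, value) is taken from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn `describe-tags --output text` rows on stdin
# into Key=...,Value=... pairs, one per line.
print_tag_pairs() {
    local kind key resource_id resource_type value
    # Split on tabs only, so tag values containing spaces stay intact.
    while IFS="$(printf '\t')" read -r kind key resource_id resource_type value; do
        [ "$kind" = "TAGS" ] || continue
        printf 'Key=%s,Value=%s\n' "$key" "$value"
    done
}
```

It could be fed with `aws ec2 describe-tags --filters "Name=resource-id,Values=$INSTANCE_ID" --output text | print_tag_pairs`, where `INSTANCE_ID` is an assumed variable name.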

openshift-install destroy cluster misses the EIP

After running this, you can easily run openshift-install destroy cluster and the instance + hosted zones + vpc get cleaned up. However, the new EIP from the conversion gets left behind. This is likely because there's only the sno-100 tag.

If we copy the cluster tags from the instance and put them on the new EIP, I bet the openshift installer will help scrub that up when it tears the cluster down.
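
A rough sketch of the tag copy, assuming the relevant tags are the ones whose key starts with `kubernetes.io/cluster/` (as seen on the instance in the describe-tags output earlier on this page). `copy_cluster_tags` is a hypothetical helper, and whether the installer then picks up the EIP is the untested bet described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy kubernetes.io/cluster/* tags from an instance to an EIP.
# Arguments: instance_id eip_allocation_id
copy_cluster_tags() {
    local instance_id="$1" allocation_id="$2" key value
    aws ec2 describe-tags \
        --filters "Name=resource-id,Values=$instance_id" \
        --query "Tags[?starts_with(Key, 'kubernetes.io/cluster/')].[Key,Value]" \
        --output text |
    while IFS="$(printf '\t')" read -r key value; do
        # Skip blank rows, e.g. when no tag matched.
        [ -n "$key" ] || continue
        aws ec2 create-tags --resources "$allocation_id" --tags "Key=$key,Value=$value"
    done
}
```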