eformat / sno-for-100
Single Node OpenShift on AWS Spot
In fix-instance-id.sh, we could call the EC2 reboot-instances command instead of stop/start. Stopping the host while it is under a persistent spot request can cause a race condition with AWS trying to restart it for you.
Also, if the host is set to "terminate on stop" outside of our script's happy path, then that stop command will inadvertently terminate it.
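A minimal sketch of the reboot variant, assuming the instance ID is already known. The helper name reboot_instance is hypothetical and does not come from fix-instance-id.sh; the only AWS call used is aws ec2 reboot-instances, which leaves the instance in the running state rather than passing through stopped.

```bash
#!/usr/bin/env bash
# Sketch only: reboot_instance is a hypothetical helper, not part of the repo.
reboot_instance() {
    local instance_id="$1"
    if [ -z "$instance_id" ]; then
        echo "usage: reboot_instance <instance-id>" >&2
        return 2
    fi
    # reboot-instances queues an asynchronous reboot; no stop/start cycle is involved.
    aws ec2 reboot-instances --instance-ids "$instance_id"
}
```

Called as reboot_instance "$INSTANCE_ID" (the variable name is an assumption), this would replace the stop/start pair in one step.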
It would be nice to clean up the AMI afterwards (ec2-spot-converter has a --delete-ami option for this). However, I got this error with that flag enabled:
[INFO] 2022-11-24 08:44:03,949 ec2-spot-converter - [STEP 26/26] Deregister image...
Traceback (most recent call last):
  File "/home/scuppett/bin/ec2-spot-converter", line 1621, in <module>
    sys.exit(main(sys.argv))
             ^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1587, in main
    return_code, reason, keys = step["Function"]()
                                ^^^^^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1212, in deregister_image
    snap_ids = [blk["Ebs"]["SnapshotId"] for blk in img["BlockDeviceMappings"]]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scuppett/bin/ec2-spot-converter", line 1212, in <listcomp>
    snap_ids = [blk["Ebs"]["SnapshotId"] for blk in img["BlockDeviceMappings"]]
                ~~~^^^^^^^
KeyError: 'Ebs'
Something like what is described here: https://ksingh7.medium.com/lets-automate-let-s-encrypt-tls-certs-for-openshift-4-211d6c081875
The conversion script removes the API load balancers correctly. However, it fails to remove the target groups that belonged to those original load balancers. I noticed them while cleaning up some resources manually.
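One possible cleanup, sketched under assumptions: the leftover target groups share a name prefix with the cluster (the prefix argument is hypothetical) and are no longer attached to any load balancer. The helper name delete_orphan_target_groups is made up for illustration; the calls are aws elbv2 describe-target-groups and aws elbv2 delete-target-group.

```bash
#!/usr/bin/env bash
# Sketch only: deletes target groups that match a name prefix and have no load balancer.
delete_orphan_target_groups() {
    local prefix="$1" arn
    # Backticks delimit a JMESPath literal, so they are escaped inside double quotes.
    aws elbv2 describe-target-groups \
        --query "TargetGroups[?starts_with(TargetGroupName, '${prefix}') && length(LoadBalancerArns) == \`0\`].TargetGroupArn" \
        --output text | tr '\t' '\n' | while read -r arn; do
        # Skip blank lines and the "None" placeholder the text output can produce.
        [ -n "$arn" ] || continue
        [ "$arn" = "None" ] && continue
        aws elbv2 delete-target-group --target-group-arn "$arn"
    done
}
```

Running the describe-target-groups query on its own first is a cheap way to check what would be deleted before calling the helper.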
Minor issue given:
as a user, i just want to run one script .. puhlease !!
I'm getting faster at these automated steps.... :)
The scripts exit right after the restart. Could they wait for the API to become available, or for the instance to register as active in the ELB? (If they don't, moving straight to the next step usually fails because the API is not available yet.)
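A sketch of what such a wait could look like, assuming the API server answers unauthenticated requests on /readyz and that curl is installed. Both helper names (retry_until, api_ready) are hypothetical and not taken from the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch only: generic retry loop plus a readiness probe for the cluster API.
retry_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        # Run the remaining arguments as a command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

api_ready() {
    # -k skips TLS verification, -s is silent, -f turns HTTP errors into a non-zero exit.
    curl -ksf --max-time 5 "$1/readyz" > /dev/null
}
```

For example, retry_until 60 10 api_ready "https://api.<cluster domain>:6443" would poll for up to ten minutes before giving up.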
-> delete_nat_gateways [ nat-0cfc03e3ca74ee554 ] OK
-> wait_for_nat_gateway_delete [ nat-082277122928075ac nat-07e3a5c341f04cabc nat-0cfc03e3ca74ee554 ] OK
An error occurred (AuthFailure) when calling the ReleaseAddress operation: You do not have permission to access the specified resource.
🕱Failed - could not release eip 3.13.164.181 eipalloc-073a4438a763e61a5 ?
even after waiting for delete ...
This step didn't work (maybe with CLI v2 only?):
🌴 RouterLoadBalancer set to aa2a9ddd000e94b718487e804e0e3d24 a9b27c28add644da48d5e14113aedf24
Note: AWS CLI version 2, the latest major version of the AWS CLI, is now stable and recommended for general use. For more information, see the AWS CLI version 2 installation instructions at: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:
aws help
aws <command> help
aws <command> <subcommand> help
Unknown options: a9b27c28add644da48d5e14113aedf24
🕱Failed - could not associate router lb aa2a9ddd000e94b718487e804e0e3d24 a9b27c28add644da48d5e14113aedf24 with instance i-0244d5a182f015f11 ?
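The "Unknown options" message suggests that both load balancer names ended up in a single command line. A sketch of registering the instance with each name separately, assuming the router load balancers are Classic ELBs and that aws elb register-instances-with-load-balancer is the intended call; the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: one register call per load balancer name.
register_with_router_lbs() {
    local instance_id="$1" lb
    shift
    for lb in "$@"; do
        aws elb register-instances-with-load-balancer \
            --load-balancer-name "$lb" \
            --instances "$instance_id" || return 1
    done
}
```

If the names are held in one space-separated variable, as in the RouterLoadBalancer line above, passing it unquoted (register_with_router_lbs "$INSTANCE_ID" $RouterLoadBalancer) lets the shell split it into one argument per name.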
Noticed this event go by in the log:
sno-1-6jxkf-master-0
(combined from similar events): sno-1-6jxkf-master-0: reconciler failed to Update machine: failed to updated update load balancers: LoadBalancerNotFound: Load balancers '[sno-1-6jxkf-int, sno-1-6jxkf-ext]' not found status code: 400, request id: 32db87a0-867d-454c-ae42-5e9680bb7836
In the Machine for the master, there's an array similar to this:
loadBalancers:
- name: sno-1-6jxkf-int
  type: network
- name: sno-1-6jxkf-ext
  type: network
Those load balancers get deleted during conversion, so the array needs to be emptied like this:
loadBalancers: []
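A sketch of applying that change with oc, assuming the array lives at spec.providerSpec.value.loadBalancers of the Machine in the openshift-machine-api namespace. The helper name clear_machine_load_balancers is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: empties the loadBalancers array on a Machine with a JSON merge patch.
clear_machine_load_balancers() {
    local machine="$1"
    # A merge patch replaces the whole array, so [] removes every entry.
    oc patch machines.machine.openshift.io "$machine" \
        -n openshift-machine-api --type merge \
        -p '{"spec":{"providerSpec":{"value":{"loadBalancers":[]}}}}'
}
```

With the machine from the event above, the call would be clear_machine_load_balancers sno-1-6jxkf-master-0.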
If you take #3, then the script in https://github.com/eformat/sno-for-100/blob/main/adjust-single-node.sh can lose the "find instance id" part of the script and get simpler (just use the environment variable).
Also, by doing the AWS adjustment first, you allow everybody to save "some" money on the infrastructure pieces (LoadBalancers and NAT Gateway) and stop right there, without necessarily having to convert to spot or do the instance surgery on the updated instance ID...
I noticed this could be beneficial on my account's bill. Internally we have some savings plans which cancel out the entire instance cost (burn committed spend), but converting to spot actually increases cost to the business and could leave committed spend underutilized.... smh.
We probably need to wait for the NAT GW to delete completely (openshift-install destroy cluster has a similar mechanism).
Got this error running the script:
An error occurred (AuthFailure) when calling the ReleaseAddress operation: You do not have permission to access the specified resource.
🕱Failed - could not release eip 3.15.101.139 eipalloc-0018453cac7b2f9f6 ?
I think the permission message is a red herring: my user does have that permission. I believe the script didn't wait long enough for the NAT GW to be completely deleted. I deleted the NAT gateways manually after a bit and re-ran the script to completion.
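If the AuthFailure really is a symptom of the NAT gateway still holding the address, retrying the release for a while may be enough. A sketch, with a hypothetical helper name and default retry counts picked only for illustration:

```bash
#!/usr/bin/env bash
# Sketch only: retries release-address until it succeeds or the attempts run out.
release_eip_with_retry() {
    local allocation_id="$1" attempts="${2:-30}" delay="${3:-10}" i=1
    while [ "$i" -le "$attempts" ]; do
        if aws ec2 release-address --allocation-id "$allocation_id"; then
            return 0
        fi
        echo "could not release $allocation_id yet (attempt $i/$attempts)" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For the address above this would be release_eip_with_retry eipalloc-0018453cac7b2f9f6, which waits up to five minutes with the defaults.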
Only saw this once during early prototyping, but capturing it here in case we need it.
A taint appeared on the node when converting the host, before providerID was being set correctly. This command was needed to get all the pods scheduling on the SNO correctly again:
kubectl taint nodes <your node name> node-role.kubernetes.io/master:NoSchedule-
We may want to add that to the fix-instance-id.sh script preemptively since it is unlikely to cause harm and only adds protection against this scenario.
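A sketch of a preemptive version, assuming the node name is known. kubectl exits non-zero when the taint is not present, so the helper (hypothetical name remove_master_taint) reports that and carries on instead of aborting the script.

```bash
#!/usr/bin/env bash
# Sketch only: removes the master NoSchedule taint if present, never fails the caller.
remove_master_taint() {
    local node="$1"
    # The trailing "-" asks kubectl to remove the taint rather than add it.
    kubectl taint nodes "$node" node-role.kubernetes.io/master:NoSchedule- \
        || echo "taint not present or not removable on $node, continuing" >&2
}
```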
I can see where the EIP is attempted to be tagged here:
https://github.com/eformat/sno-for-100/blob/main/adjust-single-node.sh#L140-L153
However, I'm still getting this warning (and no tags):
💀Warning - tag_eip - could not find any tags for new eip ?
I tried to debug it a little bit here:
[scuppett@x1-carbon sno-for-100]$ read -d '' -r -a lines < <(aws ec2 describe-tags --filters "Name=resource-id,Values=i-080f66d46f78f5a4c" --output text)
[scuppett@x1-carbon sno-for-100]$ echo $lines
TAGS
[scuppett@x1-carbon sno-for-100]$ aws ec2 describe-tags --filters "Name=resource-id,Values=i-080f66d46f78f5a4c" --output text
TAGS Name i-080f66d46f78f5a4c instance sno-2-98rbv-master-0
TAGS kubernetes.io/cluster/sno-2-98rbv i-080f66d46f78f5a4c instance owned
So I think it's getting chopped prematurely.
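Part of what the debug session shows is plain bash behaviour: read -a splits on every whitespace character, and $lines without an index expands to the first element only. The sketch below reproduces that with sample text shaped like the describe-tags output above (the sample string and variable names are for illustration) and contrasts it with mapfile, which keeps one element per line.

```bash
#!/usr/bin/env bash
# Sample text shaped like the describe-tags output (tab-separated columns, two rows).
sample=$'TAGS\tName\ti-080f66d46f78f5a4c\tinstance\tsno-2-98rbv-master-0\nTAGS\tkubernetes.io/cluster/sno-2-98rbv\ti-080f66d46f78f5a4c\tinstance\towned'

# read -a splits on tabs, spaces and newlines alike, so every word is its own element.
read -d '' -r -a words <<< "$sample"
first_word="$words"            # $words is shorthand for ${words[0]}
word_count="${#words[@]}"

# mapfile -t splits on newlines only, giving one element per output row.
mapfile -t lines <<< "$sample"
line_count="${#lines[@]}"

echo "first element via \$words: $first_word"
echo "words: $word_count, lines: $line_count"
```

So the array may hold all the data after all, just word by word; whether that is what trips up tag_eip in the script is not confirmed here.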
After running this, you can easily run openshift-install destroy cluster, and the instance, hosted zones, and VPC get cleaned up. However, the new EIP from the conversion gets left behind. This is likely because it only carries the sno-100 tag.
If we copy the cluster tags from the instance and apply them to the new EIP, I bet the openshift installer will help scrub that up when it tears the cluster down.
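A sketch of the tag copy, assuming the relevant cluster tags are the ones whose key starts with kubernetes.io/cluster/ (as in the describe-tags output earlier) and that the new EIP is addressed by its allocation ID. The helper name copy_cluster_tags is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: copies kubernetes.io/cluster/* tags from an instance to an EIP allocation.
copy_cluster_tags() {
    local instance_id="$1" allocation_id="$2" key value
    # The query emits one "key<TAB>value" row per tag in text output.
    aws ec2 describe-tags \
        --filters "Name=resource-id,Values=${instance_id}" \
        --query 'Tags[].[Key,Value]' --output text |
    while IFS=$'\t' read -r key value; do
        case "$key" in
            kubernetes.io/cluster/*)
                aws ec2 create-tags --resources "$allocation_id" \
                    --tags "Key=${key},Value=${value}"
                ;;
        esac
    done
}
```

Reading row by row with IFS set to a tab avoids the word-splitting problem that shows up when the whole output is read into one array.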