TL;DR:
Failures that occur outside of the handler of a Custom Resource result in long periods of inactivity when invoking CDK commands, and they're never raised as actual failures either. This is a PITA.
So recently whilst working on HLS I was making some Custom Resources.
I'd wrapped my logic in beautiful try/except blocks and I'd handled the CFN callbacks so that my Custom Resource called back to the mothership to tell CFN what was happening. These callbacks are how a Custom Resource tells CloudFormation (and CDK) whether a resource's creation/update/delete has been successful or not.
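For anyone who hasn't seen one of these callbacks: it boils down to an HTTP PUT of a small JSON document to the pre-signed ResponseURL that CloudFormation includes in the event. Here's a minimal sketch of that using only the standard library. The names build_response and send_response are made up for illustration, and the empty Data payload is an assumption; the real cfnresponse helper does roughly the same job.

```python
import json
import urllib.request


def build_response(event, context, status, reason=""):
    """Build the JSON body CloudFormation expects back from a Custom Resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream: {context.log_stream_name}",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": {},  # assumption: nothing to hand back to the stack
    }


def send_response(event, context, status, reason=""):
    """PUT the response to the pre-signed URL supplied in the event."""
    body = json.dumps(build_response(event, context, status, reason)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # Empty Content-Type so urllib doesn't add a form-encoded default.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

If that PUT never happens, CloudFormation has no way of knowing the resource is done, successfully or otherwise.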
BUT
When I ran cdk deploy, my deployment was seemingly stuck on creating the Custom Resource forever. Upon further inspection, I could see in the logs of the handler that it was erroring out as soon as it was invoked. Strange: this should be caught, and CloudFormation should be informed of the failure and begin the rollback logic.
So, I've got a Stack stuck deploying, my first thought? Delete the thing. So I deleted the stack... and it got stuck deleting the Custom Resource forever.
u wot
So, this was confusing at first but then I took a look at the error messages in CloudWatch. Let's say I had an index.py like:
    import cfnresponse
    import my_cool_module


    def handler(event, context):
        try:
            my_cool_module.do_something()
            cfnresponse.send_success()  # This isn't real but you get the idea
        except my_cool_module.a_not_so_cool_exception as ex:
            print(ex)
            cfnresponse.send_failure(ex)
My importing of my_cool_module was erroring, not anything in my handler function. Because of this, I was never reaching any of my callback code, which meant that as far as CDK/CloudFormation were concerned, my Custom Resource was doing its thing and it'd hear from it eventually.
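To make that concrete, here's a small self-contained sketch of why nothing inside handler can help. It's a simplification of what any runtime has to do (import the module, then look up the handler), not the actual Lambda loader, and the module names broken_index and my_cool_module_that_is_missing are invented for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A stand-in for index.py whose top-level import fails (hypothetical module name).
INDEX_SOURCE = """
import my_cool_module_that_is_missing

def handler(event, context):
    print("never reached")
"""


def load_handler(source):
    """Import the module from a temp dir, then look up its `handler` attribute."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "broken_index.py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("broken_index")
            return module.handler
        finally:
            # Clean up so the demo leaves no trace on the import system.
            sys.path.remove(tmp)
            sys.modules.pop("broken_index", None)


try:
    load_handler(INDEX_SOURCE)
    outcome = "handler loaded"
except ImportError as ex:
    # The failure happens at import time, so the handler (and its callback code) never runs.
    outcome = f"import failed before handler existed: {ex}"

print(outcome)
```

The try/except inside handler is never even defined by the time things blow up, so it can't catch anything.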
Because CloudFormation waits for one of these callbacks on every create/update/delete, the missing callbacks result in seemingly endless (1+ hours) deploys/updates/destroys, which really wastes time.
You might ask, did you not test your code locally @ciaranevans?! Well, I did. It worked beautifully because of how the import statement was being resolved on my machine... not so correct when the code is actually on its own in a Lambda.
So what should we do?
I suppose the easiest and grossest way could be:
    import cfnresponse

    try:
        import my_cool_module
    except:
        cfnresponse.send_failure()


    def handler(event, context):
        try:
            my_cool_module.do_something()
            cfnresponse.send_success()  # This isn't real but you get the idea
        except my_cool_module.a_not_so_cool_exception as ex:
            print(ex)
            cfnresponse.send_failure(ex)
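One wrinkle with the above: at import time there's no event yet, so there's nothing to call back to. A variant that gets around this is to remember the import error and report it once the handler is actually invoked. This is only a sketch: make_handler and the send(event, context, status, reason) callable are hypothetical names, with send standing in for whatever actually performs the callback.

```python
# Try the import at module load, but remember the failure instead of crashing.
# There is no event at this point, so the failure can only be reported later.
try:
    import my_cool_module
    IMPORT_ERROR = None
except Exception as ex:  # deliberately broad: any import-time failure gets reported
    my_cool_module = None
    IMPORT_ERROR = ex


def make_handler(send):
    """Return a handler that reports every failure, including import errors.

    `send(event, context, status, reason)` is a hypothetical callback sender.
    """

    def handler(event, context):
        try:
            if IMPORT_ERROR is not None:
                raise IMPORT_ERROR
            my_cool_module.do_something()
            send(event, context, "SUCCESS", "")
        except Exception as ex:  # catch-all so CloudFormation always hears back
            print(ex)
            send(event, context, "FAILED", str(ex))

    return handler
```

In index.py you'd then expose handler = make_handler(my_sender), where my_sender is your own callback function. It still won't save you if index.py itself can't be parsed, but it does cover broken imports and any exception type you forgot to catch.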
I don't like try/except-wrapped or conditional imports. So if someone has a better idea or knows how we could gracefully handle this kind of issue, I'm all ears!
I imagine most languages suffer from this kind of situation, or at least make it possible. Is there a way for CDK to treat any failure that's not explicitly handled as a failure for CloudFormation?
cc. @developmentseed/earthdata-infrastructure