shinesolutions / aem-aws-stack-builder
Adobe Experience Manager (AEM) infrastructure builder on AWS using CloudFormation stacks
License: Apache License 2.0
Some users would like to auto-scale publish and publish-dispatcher pairs based on their peak schedule during the day or week.
Need to introduce scheduled auto-scaling (http://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html) on the publish-dispatcher ASG. Each new publish-dispatcher instance will then trigger the creation of a new publish instance.
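As a rough illustration of scheduled auto-scaling, the schedule could be attached to the ASG with the AWS CLI along these lines. This is a hypothetical sketch: the ASG name, action names, schedules, and sizes below are all assumptions, not values from this repo.

```shell
# Attach a scheduled scale-out action to the publish-dispatcher ASG.
# A matching scale-in action would be added for the evening.
schedule_scale_out() {
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name "$1" \
    --scheduled-action-name "weekday-morning-scale-out" \
    --recurrence "0 8 * * MON-FRI" \
    --min-size 2 \
    --max-size 4 \
    --desired-capacity 4
}
```

The same could be expressed declaratively via the ScheduledActions / AWS::AutoScaling::ScheduledAction resource in the CloudFormation templates.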
Generated stack-specific system users' passwords are currently stored in S3 and encrypted only at rest.
This has to be improved, either by storing encrypted passwords or by moving them entirely to a more secure service.
Consider tools like https://forge.puppet.com/scalefactory/kms and https://github.com/mozilla/sops for encryption/decryption.
They also need permission to describe the instances (autoscaling:DescribeAutoScalingInstances) in order to check the lifecycleState.
See scripts:
https://github.com/shinesolutions/aem-aws-stack-provisioner/blob/master/files/aem-tools/enter-standby.sh
and
https://github.com/shinesolutions/aem-aws-stack-provisioner/blob/master/files/aem-tools/exit-standby.sh
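The lifecycleState check those scripts need could look roughly like this, assuming the instance role has been granted autoscaling:DescribeAutoScalingInstances (the instance ID below is a placeholder):

```shell
# Return the ASG lifecycle state (e.g. InService, Standby) of an instance.
lifecycle_state() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].LifecycleState' \
    --output text
}

# Example: only exit standby when the instance is actually in Standby.
# [ "$(lifecycle_state i-0123456789abcdef0)" = "Standby" ] && ./exit-standby.sh
```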
Since InSpec superseded Serverspec's feature set, we need to replace Serverspec with InSpec (or any other better tool, really).
One benefit of moving to InSpec is the ability to publish an AEM InSpec profile to the Chef marketplace. This allows multiple projects to use the same AEM spec checks, and allows the automation code to perform deep AEM inspection rather than just relying on the limited checks that Serverspec provides.
E.g. checking whether a replication agent exists or not.
Stack initialisation sleeps for 30 seconds before running the tests (which used to be Serverspec, now InSpec). This sleep has to be removed, and proper checks and waits need to be implemented at the puppet-aem-curator level.
Sleeping for a fixed duration only works when one instance hosts a single service; 30 seconds already fails for an instance that hosts 2 AEM instances and 1 Apache httpd.
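A proper wait could poll each service's health endpoint until it responds, instead of sleeping for a fixed duration. A minimal sketch (the URLs, ports, and retry/interval values are assumptions for illustration):

```shell
# Poll an HTTP endpoint until it answers successfully, with a retry cap.
wait_for_http() {
  url="$1"; retries="${2:-60}"
  until curl -fs -o /dev/null "$url"; do
    retries=$((retries - 1))
    [ "$retries" -le 0 ] && return 1
    sleep 5
  done
}

# e.g. on a consolidated instance with 2 AEM instances and 1 Apache httpd:
# wait_for_http http://localhost:4502/ && \
# wait_for_http http://localhost:4503/ && \
# wait_for_http http://localhost:80/
```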
Currently 2 AZs are supported (in all Auto Scaling groups):
https://github.com/shinesolutions/aem-aws-stack-builder/blob/master/ansible/playbooks/apps/publish.yaml#L21
We should have the ability to specify more or fewer AZs.
During the effort to unify v1.x and v2.0.0 feature sets, we lost the original v2.0.0 app architecture.
Now that master has the structure to support multiple architectures, v2.0.0 app architecture should be introduced as a different type of prerequisites stack.
The prerequisites for v2.0.0 involve:
templates currently support 2 AZs and 2 subnets; it would be nice to have templates that create a 3 AZ, 3 subnet network
no template for creating the bastion host security groups.
In order to simplify stack dependencies for a consolidated architecture (instance profiles, security groups, etc.), we need to introduce a prerequisites stack for consolidated.
It will be simpler for users to understand the need to build the prerequisites stack and the compute stack, then pair them (whether one-to-many or one-to-one).
Simian Army installation creates a SimpleDB domain which is not associated with the CloudFormation stack.
Stack deletion should also delete this SimpleDB domain in the Chaos Monkey playbook https://github.com/shinesolutions/aem-aws-stack-builder/blob/master/ansible/playbooks/apps/chaos-monkey.yaml .
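The clean-up step in the playbook could shell out to something like the following. This is a sketch; the domain name is an assumption (Simian Army's default domain name is SIMIAN_ARMY, but it may be overridden in configuration):

```shell
# Delete the SimpleDB domain that Simian Army created, so stack deletion
# doesn't leave the domain orphaned.
delete_chaos_monkey_domain() {
  aws sdb delete-domain --domain-name "${1:-SIMIAN_ARMY}"
}
```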
The stack provisioner version should be injected from inventory down to user data, and should be used to retrieve the versioned stack provisioner artifact.
In order to provide better visibility on the stack's activity on AuthorDispatcher, Publish, and PublishDispatcher layers, we need to enable metric collections on those layers' AutoScalingGroups.
References:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-metricscollection.html#cfn-as-metricscollection-granularity
http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-monitoring.html
This would help users when they need to find out when instances were launched and terminated, and whether there was a period of time during which an ASG kept scaling, etc.
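Per the referenced docs, metrics collection can be enabled on an ASG at 1-minute granularity (omitting a metrics list enables all group metrics). The ASG name below is a placeholder:

```shell
# Enable all group-level CloudWatch metrics on an Auto Scaling group.
enable_asg_metrics() {
  aws autoscaling enable-metrics-collection \
    --auto-scaling-group-name "$1" \
    --granularity "1Minute"
}

# e.g. run once per layer:
# enable_asg_metrics example-author-dispatcher-asg
# enable_asg_metrics example-publish-asg
# enable_asg_metrics example-publish-dispatcher-asg
```

In CloudFormation this corresponds to the MetricsCollection property on AWS::AutoScaling::AutoScalingGroup.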
Consider replacing the existing ELBs with ALBs where it makes sense.
One benefit that many users often want is the relatively new support for multiple TLS certs: https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/ .
Current CloudFormation templates use BlockDeviceMappings, which doesn't provide a way to tag the volumes (e.g. with StackPrefix and Component tags).
Need to figure out a way to tag the volumes.
Past discussions suggested looking at perhaps replacing BlockDeviceMappings with Volumes and/or VolumeAttachments.
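One possible workaround (an assumption, not a decision from the discussions above) is a post-launch step that looks up the volumes attached to an instance and tags them, mirroring the StackPrefix/Component tags mentioned above:

```shell
# Tag all EBS volumes attached to an instance, since BlockDeviceMappings
# can't tag the volumes at creation time.
tag_instance_volumes() {
  instance_id="$1"; stack_prefix="$2"; component="$3"
  volume_ids=$(aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=${instance_id}" \
    --query 'Volumes[].VolumeId' --output text)
  # volume_ids is intentionally unquoted: one --resources arg per volume.
  aws ec2 create-tags --resources $volume_ids \
    --tags "Key=StackPrefix,Value=${stack_prefix}" "Key=Component,Value=${component}"
}
```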
Currently the stack prefix shows up twice in EC2 instance names, in the format <stack_prefix>-name (stack_prefix).
It should simply be <stack_prefix>-name.
Stack builder is secure by default and generates a random password for each environment creation.
However, during development users might want to use simple passwords for the system users. When this flag is enabled, each password should be set to the same value as the username, i.e. admin/admin, orchestrator/orchestrator.
The following error shows up in the simianarmy log, which causes Chaos Monkey not to run.
2018-01-23 11:49:06.331 - ERROR MonkeyRunner - [MonkeyRunner.java:234] monkeyFactory error, cannot make monkey from com.netflix.simianarmy.basic.chaos.BasicChaosMonkey with com.netflix.simianarmy.basic.BasicChaosMonkeyContext
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at com.netflix.simianarmy.basic.BasicSimianArmyContext.<init>(BasicSimianArmyContext.java:147)
at com.netflix.simianarmy.basic.BasicChaosMonkeyContext.<init>(BasicChaosMonkeyContext.java:54)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at com.netflix.simianarmy.MonkeyRunner.factory(MonkeyRunner.java:229)
at com.netflix.simianarmy.MonkeyRunner.replaceMonkey(MonkeyRunner.java:145)
at com.netflix.simianarmy.basic.BasicMonkeyServer.addMonkeysToRun(BasicMonkeyServer.java:53)
at com.netflix.simianarmy.basic.BasicMonkeyServer.init(BasicMonkeyServer.java:78)
at javax.servlet.GenericServlet.init(GenericServlet.java:158)
at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1269)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1182)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1072)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5368)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5660)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:899)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:875)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1260)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:2002)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
There are still some references to *SubnetA and *SubnetB in the stack builder CloudFormation templates.
This introduces a hard coupling between application and network, and it also limits the use of stack builder to networks which use only 2 AZs.
All remaining references to SubnetA and SubnetB should be replaced with SubnetList.
Currently the author-standby component sits on its own as a standalone instance; it needs an auto-recovery process.
Users have to resort to blue-green (creating a new stack) in order to recover from an author-standby failure. We need to remove this from user space and let the infrastructure auto-recover.
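One possible auto-recovery mechanism (an assumption, not necessarily the approach this project will take) is a CloudWatch alarm on the author-standby instance that triggers the EC2 automatic recovery action when the system status check fails:

```shell
# Create an alarm that invokes EC2 auto-recovery after 5 minutes of
# failed system status checks on the author-standby instance.
create_recovery_alarm() {
  instance_id="$1"; region="${2:-ap-southeast-2}"
  aws cloudwatch put-metric-alarm \
    --alarm-name "author-standby-auto-recover" \
    --namespace "AWS/EC2" \
    --metric-name "StatusCheckFailed_System" \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Minimum --period 60 --evaluation-periods 5 \
    --comparison-operator GreaterThanThreshold --threshold 0 \
    --alarm-actions "arn:aws:automate:${region}:ec2:recover"
}
```

Note EC2 auto-recovery only covers infrastructure failure; an application-level failure (e.g. a wedged standby sync) would still need a separate health check.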
author-publish-dispatcher needs to support InboundFromBastionHostSecurityGroupParameter just like the full AEM set.
The idea is to allow users to easily configure a security group that acts as the origin of inbound connections into author-publish-dispatcher's EC2 instance.
All log files across all components need to be configured with CloudWatch.
All log files should also have a log rotation mechanism; this will differ between applications. Some applications might already have built-in log rotation logic, while others need logrotate to be set up via Puppet.
Randomise the Author-Primary and Author-Standby Availability Zone and subnet to spread IP usage across subnets.
The current CloudFormation template (along with config and playbooks) needs to be added to aem-aws-stack-builder. This is mostly a migration effort from the Stack Manager Cloud repo, which currently contains the CF template(s).
This is an effort to simplify the user's stack building configuration and provisioning.
At the same time, it's also an effort towards consistency: the application repos (stack manager, orchestrator, etc.) produce the application code, while AWS CloudFormation templates live in aem-aws-stack-builder.
stack-init.sh checks for the existence of a custom stack provisioner using aws s3api head-object ..., but this displays the error message An error occurred (404) when calling the HeadObject operation: Not Found when the custom stack provisioner file doesn't exist.
This is a problem for users whose log messages are scanned for the word 'error', where it becomes a false positive. Beyond that, users who saw the error message got confused and went looking for an error to fix.
Ideally we should find a way to perform the existence check without displaying an error message.
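One way to keep the check quiet is to rely on head-object's exit code and discard its stderr, roughly as below. The bucket and key names are hypothetical placeholders:

```shell
# Return success iff the object exists; the 404 message that head-object
# prints to stderr for a missing key is discarded.
s3_object_exists() {
  aws s3api head-object --bucket "$1" --key "$2" >/dev/null 2>&1
}

# if s3_object_exists example-data-bucket stack-provisioner/custom.tar.gz; then
#   echo "custom stack provisioner found"
# fi
```

The trade-off is that discarding stderr also hides genuine failures (e.g. 403s), so a stricter version might capture stderr and only suppress the 404 case.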
In order to provide the flexibility to place author-dispatcher ELB on a different subnet to the authors (in full-set architecture), we need to introduce a configurable author-dispatcher ELB subnet.
Author Publish Dispatcher inbound CIDR should be configurable, consistent with the other components' inbound CIDRs.
This provides the user with the ability to restrict incoming requests to the given component in a Consolidated architecture.
Planning upgrades for a number of stacks, and will need support for latest versions.
At the moment stack builder is creating a stack-facts.txt file (which will be available as Facter facts) containing configuration values that need to be passed from stack builder config to stack provisioners (aem-aws-stack-provisioner and also any custom stack provisioner).
The goal is to allow users to configure those values from aem-aws-stack-builder, so they don't have to configure anything else.
Need to revisit whether passing these values via Facter facts can be improved.
Previous discussions suggested that if aem-aws-stack-provisioner is refactored into a proper module, then these facts can be replaced by configuration for that module.
As the first line of defence, the dispatcher sitting in front of the publisher must be able to scale horizontally when under heavy load. Even though the architecture might include a CDN in front of the dispatcher, we need to consider the scenario where the CDN is not inside AWS, and hence there would be a disaster scenario if that external CDN became unavailable.
The add-global-tags script currently doesn't have any knowledge of the author-publish-dispatcher component, which means the EC2 instance running on that stack wouldn't have user-specific tags.
The script configuration needs to include author-publish-dispatcher configuration.
The author ELB allows login over HTTPS but not HTTP.
Not sure if this is intended or not, given the stack does allow access via HTTP.
If it is intended, it would be useful to have a redirect to HTTPS in the Apache configuration.
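Such a redirect could be dropped in as a small vhost snippet, for example via the sketch below (assuming mod_rewrite is enabled; the output path and vhost details are placeholders, and in practice this would be templated via Puppet):

```shell
# Write an Apache vhost snippet that 301-redirects plain HTTP to HTTPS.
conf="${1:-/tmp/author-https-redirect.conf}"
cat > "$conf" <<'EOF'
<VirtualHost *:80>
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
</VirtualHost>
EOF
```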
As part of the clean up when a stack is terminated, the SimpleDB that was created for Chaos-Monkey needs to be deleted as well.
Stack init in v2 performs a gem install of rspec, which breaks user environments that strictly mandate zero Internet access during the stack creation process and don't have an internal Ruby package manager mirror.
This requirement comes from the fact that those users don't want the stack creation and recovery processes to rely on the availability of external package managers. A common scenario they'd like to avoid is an external package manager experiencing downtime during production stack creation or an automated recovery process.
Need a way to identify whether a stack is healthy or not; this should include various statuses:
v2 introduced support for TLS certs generated with AWS Certificate Manager. However, it lost the feature that allowed users to specify a custom cert which doesn't get dynamically generated.
This is necessary for users who are mandated to use a specific cert which can be uploaded to AWS Certificate Manager but not generated each time.
stack-data playbook should generate a custom fact file that contains version numbers of all dependencies stored inside data bucket. The values can be retrieved from inventory config.
These facts should then be used by stack provisioner to download the dependency artifacts from data bucket.
Benefits:
The author-standby role needs the PutMetricData action permission; consider adding it to all roles.
403 Client Error: Forbidden for url: https://monitoring.ap-southeast-2.amazonaws.com/?Action=PutMetricData&MetricData.member.
The permission is used for putting the sync delay metric.
During recovery, the publish instance's state is restored on the new pair; however, the cache is empty, and this might trigger a cache rebuild, which could consume significant resources for a large content set.
If the dispatcher's docroot is stored on a separate volume, it can be used as the source to copy the dispatcher's state from.
If both author instances come into service at the same time, we'll be in a world of hurt. We should add an alarm that triggers if there's more than one author in-service.
The Chaos Monkey component currently uses the default location of the Simian Army artifact (defined in puppet-simianarmy), which downloads the artifact from Bintray.
In order to remove this direct dependency on an external service (i.e. not in AWS), we need to add Simian Army to the S3 library location, consistent with the oak-run jar file.
The risk of having an external dependency is that system recovery can fail when the external service (external to AWS) is unavailable, which is unacceptable in a number of organisations.
When stack-data is initialised with the Ansible verbose flag -v set, the generated credentials are displayed, e.g.
TASK [Generate random credentials for system users] ****************************
ok: [127.0.0.1] => {"changed": false, "meta": {"admin": "UMKIzR11kilAC8wyvADz8mivtvoUUfjFruWEBAqfwef13RZvOipzoTff4SAsm3hAtMkO
YahiEKwCdG2JU47Z8qz3ioTMVtOBDqcB", "deployer": "swd6g6NbhDsJHT3e9PUSw1Era5hPxQMwXla3ZZ2GGZLg8lRYmerKKvdB34eVgzFCvD8h265STn2TP
WQxOjqGkAoJhMRidwWNYYSf", "exporter": "rOJAIL5HgsKbbmb0HscJufVdEidUjmLmtJ9BohJwUqVt2MMzWj60Wd1Rf2iVEvFziQW8bv3bic5z6SD5F7HqKl
SwHRbnnmcZI2Yx", "importer": "pGg19IfclNKTQfDfAP482VktdevMPSM0QnhEsyg7U52ujYv5CeyuB159SwTPJ9ZxO36w2NG9gVW2oD3X6XdsaGgsqBpdtLK
3ohdh", "orchestrator": "tJXEQXeKoVTq8yj9qSJ9syBnZBx62hUxHsLGOgUYvlY8C0a3Sb7yC76gzBBUtd4KWLR6USw42spS0jlR9vOwOVVCBAEl2cBfc7wr
", "replicator": "LhXqiZnZpxAxI2mdZu22gC6SYP1MgzMFFFcDDNaQamAwkzRTP6M9ZmSXuyFyhVcsxNSHduxJMbRmtzedGdOARKQE0UtRSoxgU1Fq"}}
These credentials shouldn't be logged even when the verbose flag is set (Ansible tasks support no_log: true for exactly this purpose).
The messaging stack contains an SQS queue for handling ASG events, which could potentially flood when the stack keeps recovering/scaling in a cycle (the ASG keeps terminating and launching instances), e.g. due to a provisioning failure caused by a failing outbound proxy.
Need to create a CloudWatch alarm on the messaging stack to accompany the SQS queue, with the condition set to trigger when there are 10 or more messages in the queue.
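A sketch of the proposed alarm, firing when the queue holds 10 or more visible messages (the queue name and SNS topic ARN below are placeholders):

```shell
# Alarm when the ASG-event queue depth reaches 10 or more messages.
create_queue_depth_alarm() {
  queue_name="$1"; sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${queue_name}-depth" \
    --namespace "AWS/SQS" \
    --metric-name "ApproximateNumberOfMessagesVisible" \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --statistic Average --period 300 --evaluation-periods 1 \
    --comparison-operator GreaterThanOrEqualToThreshold --threshold 10 \
    --alarm-actions "$sns_topic_arn"
}
```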
Currently JVM opts can be configured in packer-aem as part of AMI baking.
aem-aws-stack-builder supports the ability to customise instance types; however, the JVM opts can't be customised at stack level. This creates a dependency on specific AMIs that use the expected JVM opts for a given instance type.
We should decouple the JVM opts required for AMI baking in packer-aem from those required while running the instances in aem-aws-stack-builder.
The custom JVM opts should be configurable as an Ansible variable and passed to the Puppet config for puppet-aem-curator to consume in config_author and config_publish.
The author-publish-dispatcher component CloudFormation template is currently missing a Route53 record.
Keep the naming convention consistent with the other components.
In the scenario where the orchestrator application is no longer processing messages from the queue (maybe the EC2 instance loses access to the queue, or the application crashes, etc.), there needs to be a mechanism to terminate the instance so a new healthy orchestrator starts.
Stack builder is secure by default; however, during development users might want to enable CRXDE in order to inspect the repository.
We need to introduce a flag for enabling CRXDE.
After the nightly and hourly snapshot processes on the publish instance, the snapshot/volume ID needs to be persisted in the launch configuration, to be consumed by new publish instances launched during an autoscaling event.
This is to ensure that in the event of losing all publish instances (i.e. there's no healthy publish instance to take a snapshot from), the new instances will restore the latest nightly or hourly snapshot.
Note that there's a risk of a corrupted repository in the hourly snapshot, hence we need to make this option configurable (whether to use the hourly or nightly snapshot). Different users will accept different levels of risk of encountering a corrupted repository.
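Looking up the snapshot a new publish instance should restore could be done along these lines; this is a sketch, and the tag keys/values used to filter are assumptions:

```shell
# Return the ID of the most recent completed publish snapshot for a stack.
latest_publish_snapshot() {
  stack_prefix="$1"
  aws ec2 describe-snapshots \
    --filters "Name=tag:StackPrefix,Values=${stack_prefix}" \
              "Name=tag:Component,Values=publish" \
              "Name=status,Values=completed" \
    --query 'sort_by(Snapshots,&StartTime)[-1].SnapshotId' \
    --output text
}
```

Persisting the result in the launch configuration (or somewhere the user data can read, so new instances pick it up) is the part the issue asks for; the hourly-vs-nightly choice would be an extra filter driven by the proposed configuration option.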
While implementing the customisation of the JVM memory option for the AEM Author and Publish instances, I forgot to configure the Hiera YAML to set this variable for the author-standby instance as well.
EBS volume sizes in all CloudFormation templates need to be configurable due to the variety of content sizes across AEM projects.