shinesolutions / aem-aws-stack-builder
Adobe Experience Manager (AEM) infrastructure builder on AWS using CloudFormation stacks
License: Apache License 2.0
Some users would like to auto-scale publish and publish-dispatcher pairs based on their peak schedule during the day or week.
Need to introduce scheduled auto-scaling (http://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html) on the publish-dispatcher ASG. Each new publish-dispatcher instance will then trigger the creation of a new publish instance.
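As a rough illustration of scheduled auto-scaling, the schedule could be attached to the ASG with the AWS CLI along these lines. This is a hypothetical sketch: the ASG name, action names, schedules, and sizes below are all assumptions, not values from this repo.

```shell
# Attach a scheduled scale-out action to the publish-dispatcher ASG.
# A matching scale-in action would be added for the evening.
schedule_scale_out() {
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name "$1" \
    --scheduled-action-name "weekday-morning-scale-out" \
    --recurrence "0 8 * * MON-FRI" \
    --min-size 2 \
    --max-size 4 \
    --desired-capacity 4
}
```

The same could be expressed declaratively via the ScheduledActions / AWS::AutoScaling::ScheduledAction resource in the CloudFormation templates.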
Generated stack-specific system users' passwords are currently stored in S3 and encrypted only at rest.
This has to be improved, either by storing encrypted passwords or by moving them entirely to a more secure service.
Consider tools like https://forge.puppet.com/scalefactory/kms and https://github.com/mozilla/sops for encryption/decryption.
They also need permission to describe the instances (autoscaling:DescribeAutoScalingInstances) in order to check the lifecycleState.
See scripts:
https://github.com/shinesolutions/aem-aws-stack-provisioner/blob/master/files/aem-tools/enter-standby.sh
and
https://github.com/shinesolutions/aem-aws-stack-provisioner/blob/master/files/aem-tools/exit-standby.sh
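The lifecycleState check those scripts need could look roughly like this, assuming the instance role has been granted autoscaling:DescribeAutoScalingInstances (the instance ID below is a placeholder):

```shell
# Return the ASG lifecycle state (e.g. InService, Standby) of an instance.
lifecycle_state() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].LifecycleState' \
    --output text
}

# Example: only exit standby when the instance is actually in Standby.
# [ "$(lifecycle_state i-0123456789abcdef0)" = "Standby" ] && ./exit-standby.sh
```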
Since InSpec superseded Serverspec's feature set, we need to replace Serverspec with InSpec (or any other better tool, really).
One benefit of moving to InSpec is the ability to publish an AEM InSpec profile to the Chef marketplace. This allows multiple projects to use the same AEM spec checks, and allows the automation code to perform deep AEM inspection rather than just relying on the limited checks that Serverspec provides.
E.g. checking whether a replication agent exists or not.
Stack initialisation sleeps for 30 seconds before running the tests (which used to be Serverspec, now InSpec). This sleep has to be removed, and proper checks and waits need to be implemented at the puppet-aem-curator level.
Sleeping for a fixed duration only works when one instance hosts a single service; 30 seconds already fails for an instance that hosts 2 AEM instances and 1 Apache httpd.
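A proper wait could poll each service's health endpoint until it responds, instead of sleeping for a fixed duration. A minimal sketch (the URLs, ports, and retry/interval values are assumptions for illustration):

```shell
# Poll an HTTP endpoint until it answers successfully, with a retry cap.
wait_for_http() {
  url="$1"; retries="${2:-60}"
  until curl -fs -o /dev/null "$url"; do
    retries=$((retries - 1))
    [ "$retries" -le 0 ] && return 1
    sleep 5
  done
}

# e.g. on a consolidated instance with 2 AEM instances and 1 Apache httpd:
# wait_for_http http://localhost:4502/ && \
# wait_for_http http://localhost:4503/ && \
# wait_for_http http://localhost:80/
```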
Currently 2 AZs are supported (in all Auto Scaling groups):
https://github.com/shinesolutions/aem-aws-stack-builder/blob/master/ansible/playbooks/apps/publish.yaml#L21
We should have the ability to specify more or fewer AZs.
During the effort to unify v1.x and v2.0.0 feature sets, we lost the original v2.0.0 app architecture.
Now that master has the structure to support multiple architectures, v2.0.0 app architecture should be introduced as a different type of prerequisites stack.
The prerequisites for v2.0.0 involve:
templates currently support 2 AZs and 2 subnets; it would be nice to have templates that create a 3 AZ, 3 subnet network
no template for creating the bastion host security groups.
In order to simplify stack dependencies for a consolidated architecture (instance profiles, security groups, etc.), we need to introduce a prerequisites stack for consolidated.
It will be simpler for users to understand the need to build the prerequisites stack and the compute stack, then pair them (whether one-to-many or one-to-one).
Simian Army installation creates a SimpleDB domain which is not associated with the CloudFormation stack.
Stack deletion should also delete this SimpleDB domain in the Chaos Monkey playbook https://github.com/shinesolutions/aem-aws-stack-builder/blob/master/ansible/playbooks/apps/chaos-monkey.yaml .
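The clean-up step in the playbook could shell out to something like the following. This is a sketch; the domain name is an assumption (Simian Army's default domain name is SIMIAN_ARMY, but it may be overridden in configuration):

```shell
# Delete the SimpleDB domain that Simian Army created, so stack deletion
# doesn't leave the domain orphaned.
delete_chaos_monkey_domain() {
  aws sdb delete-domain --domain-name "${1:-SIMIAN_ARMY}"
}
```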
The stack provisioner version should be injected from inventory down to user data, and should be used to retrieve the versioned stack provisioner artifact.
In order to provide better visibility on the stack's activity on AuthorDispatcher, Publish, and PublishDispatcher layers, we need to enable metric collections on those layers' AutoScalingGroups.
References:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-metricscollection.html#cfn-as-metricscollection-granularity
http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-monitoring.html
This would help users when they need to find out when instances were launched and terminated, and whether there was a period of time during which an ASG kept scaling, etc.
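Per the referenced docs, metrics collection can be enabled on an ASG at 1-minute granularity (omitting a metrics list enables all group metrics). The ASG name below is a placeholder:

```shell
# Enable all group-level CloudWatch metrics on an Auto Scaling group.
enable_asg_metrics() {
  aws autoscaling enable-metrics-collection \
    --auto-scaling-group-name "$1" \
    --granularity "1Minute"
}

# e.g. run once per layer:
# enable_asg_metrics example-author-dispatcher-asg
# enable_asg_metrics example-publish-asg
# enable_asg_metrics example-publish-dispatcher-asg
```

In CloudFormation this corresponds to the MetricsCollection property on AWS::AutoScaling::AutoScalingGroup.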
Consider replacing the existing ELBs with ALBs where it makes sense.
One benefit that many users often want is the relatively new support for multiple TLS certs: https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/ .
Current CloudFormation templates use BlockDeviceMappings, which doesn't provide a way to tag the volumes (e.g. with StackPrefix and Component tags).
Need to figure out a way to tag the volumes.
Past discussions suggested looking at perhaps replacing BlockDeviceMappings with Volumes and/or VolumeAttachments.
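One possible workaround (an assumption, not a decision from the discussions above) is a post-launch step that looks up the volumes attached to an instance and tags them, mirroring the StackPrefix/Component tags mentioned above:

```shell
# Tag all EBS volumes attached to an instance, since BlockDeviceMappings
# can't tag the volumes at creation time.
tag_instance_volumes() {
  instance_id="$1"; stack_prefix="$2"; component="$3"
  volume_ids=$(aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=${instance_id}" \
    --query 'Volumes[].VolumeId' --output text)
  # volume_ids is intentionally unquoted: one --resources arg per volume.
  aws ec2 create-tags --resources $volume_ids \
    --tags "Key=StackPrefix,Value=${stack_prefix}" "Key=Component,Value=${component}"
}
```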
Currently the stack prefix shows up twice in EC2 instance names, in the format <stack_prefix>-name (stack_prefix).
It should simply be <stack_prefix>-name.
Stack builder is secure by default and generates a random password for each environment creation.
However, during development users might want to use simple passwords for the system users. When this flag is enabled, each password should be set to the same value as the username, i.e. admin/admin, orchestrator/orchestrator.
The following error shows up in the simianarmy log, which causes Chaos Monkey not to run.
2018-01-23 11:49:06.331 - ERROR MonkeyRunner - [MonkeyRunner.java:234] monkeyFactory error, cannot make monkey from com.netflix.simianarmy.basic.chaos.BasicChaosMonkey with com.netflix.simianarmy.basic.BasicChaosMonkeyContext
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at com.netflix.simianarmy.basic.BasicSimianArmyContext.<init>(BasicSimianArmyContext.java:147)
at com.netflix.simianarmy.basic.BasicChaosMonkeyContext.<init>(BasicChaosMonkeyContext.java:54)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at com.netflix.simianarmy.MonkeyRunner.factory(MonkeyRunner.java:229)
at com.netflix.simianarmy.MonkeyRunner.replaceMonkey(MonkeyRunner.java:145)
at com.netflix.simianarmy.basic.BasicMonkeyServer.addMonkeysToRun(BasicMonkeyServer.java:53)
at com.netflix.simianarmy.basic.BasicMonkeyServer.init(BasicMonkeyServer.java:78)
at javax.servlet.GenericServlet.init(GenericServlet.java:158)
at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1269)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1182)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1072)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5368)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5660)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:899)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:875)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1260)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:2002)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
There are still some references to *SubnetA and *SubnetB in the stack builder CloudFormation templates.
This introduces a hard coupling between application and network, and it also limits the use of stack builder to networks which use only 2 AZs.
All remaining references to SubnetA and SubnetB should be replaced with SubnetList.
Currently the author-standby component sits on its own as a standalone instance; it needs an auto-recovery process.
Users have to resort to blue-green (creating a new stack) in order to recover from an author-standby failure. We need to remove this from user space and let the infrastructure auto-recover.
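One possible auto-recovery mechanism (an assumption, not necessarily the approach this project will take) is a CloudWatch alarm on the author-standby instance that triggers the EC2 automatic recovery action when the system status check fails:

```shell
# Create an alarm that invokes EC2 auto-recovery after 5 minutes of
# failed system status checks on the author-standby instance.
create_recovery_alarm() {
  instance_id="$1"; region="${2:-ap-southeast-2}"
  aws cloudwatch put-metric-alarm \
    --alarm-name "author-standby-auto-recover" \
    --namespace "AWS/EC2" \
    --metric-name "StatusCheckFailed_System" \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Minimum --period 60 --evaluation-periods 5 \
    --comparison-operator GreaterThanThreshold --threshold 0 \
    --alarm-actions "arn:aws:automate:${region}:ec2:recover"
}
```

Note EC2 auto-recovery only covers infrastructure failure; an application-level failure (e.g. a wedged standby sync) would still need a separate health check.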
author-publish-dispatcher needs to support InboundFromBastionHostSecurityGroupParameter just like the full AEM set.
The idea is to allow users to easily configure a security group that acts as the origin of inbound connections into author-publish-dispatcher's EC2 instance.
All log files across all components need to be configured with CloudWatch.
All log files should also have a log rotation mechanism; this will differ between applications. Some applications might already have built-in log rotation logic, while others need logrotate to be set up via Puppet.
Randomise the Author-Primary and Author-Standby Availability Zone and subnet to spread IP usage across subnets.
The current CloudFormation template (along with config and playbooks) needs to be added to aem-aws-stack-builder. This is mostly a migration effort from the Stack Manager Cloud repo, which currently contains the CF template(s).
This is an effort to simplify the user's stack building configuration and provisioning.
At the same time, it's also an effort towards consistency: the application repos (stack manager, orchestrator, etc.) produce the application code, while AWS CloudFormation templates live in aem-aws-stack-builder.
stack-init.sh checks for the existence of a custom stack provisioner using aws s3api head-object ..., but this displays the error message An error occurred (404) when calling the HeadObject operation: Not Found when the custom stack provisioner file doesn't exist.
This is a problem for users whose log messages are scanned for the word 'error', where it becomes a false positive. Beyond that, users who saw the error message got confused and went looking for an error to fix.
Ideally we should find a way to perform the existence check without displaying an error message.
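One way to keep the check quiet is to rely on head-object's exit code and discard its stderr, roughly as below. The bucket and key names are hypothetical placeholders:

```shell
# Return success iff the object exists; the 404 message that head-object
# prints to stderr for a missing key is discarded.
s3_object_exists() {
  aws s3api head-object --bucket "$1" --key "$2" >/dev/null 2>&1
}

# if s3_object_exists example-data-bucket stack-provisioner/custom.tar.gz; then
#   echo "custom stack provisioner found"
# fi
```

The trade-off is that discarding stderr also hides genuine failures (e.g. 403s), so a stricter version might capture stderr and only suppress the 404 case.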
In order to provide the flexibility to place author-dispatcher ELB on a different subnet to the authors (in full-set architecture), we need to introduce a configurable author-dispatcher ELB subnet.
Author Publish Dispatcher inbound CIDR should be configurable, consistent with the other components' inbound CIDRs.
This provides the user with the ability to restrict incoming requests to the given component in a Consolidated architecture.
Planning upgrades for a number of stacks, and will need support for latest versions.
At the moment stack builder is creating a stack-facts.txt file (which will be available as Facter facts) containing configuration values that need to be passed from stack builder config to stack provisioners (aem-aws-stack-provisioner and also any custom stack provisioner).
The goal is to allow users to configure those values from aem-aws-stack-builder, so they don't have to configure anything else.
Need to revisit whether passing these values via Facter facts can be improved.
Previous discussions suggested that if aem-aws-stack-provisioner is refactored into a proper module, then these facts can be replaced by configuration for that module.
As the first line of defence, the dispatcher sitting in front of the publisher must be able to scale horizontally when under heavy load. Even though the architecture might include a CDN in front of the dispatcher, we need to consider the scenario where the CDN is not inside AWS, and hence there would be a disaster scenario if that external CDN became unavailable.
The add-global-tags script currently doesn't have any knowledge of the author-publish-dispatcher component, which means the EC2 instance running on that stack wouldn't have user-specific tags.
The script configuration needs to include author-publish-dispatcher configuration.
The author ELB allows login over HTTPS but not HTTP.
Not sure if this is intended or not, given the stack does allow access via HTTP.
If it is intended, it would be useful to have a redirect to HTTPS in the Apache configuration.
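Such a redirect could be dropped in as a small vhost snippet, for example via the sketch below (assuming mod_rewrite is enabled; the output path and vhost details are placeholders, and in practice this would be templated via Puppet):

```shell
# Write an Apache vhost snippet that 301-redirects plain HTTP to HTTPS.
conf="${1:-/tmp/author-https-redirect.conf}"
cat > "$conf" <<'EOF'
<VirtualHost *:80>
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
</VirtualHost>
EOF
```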
As part of the clean up when a stack is terminated, the SimpleDB that was created for Chaos-Monkey needs to be deleted as well.
Stack init in v2 performs a gem install of rspec, which breaks user environments that strictly mandate zero Internet access during the stack creation process and don't have an internal Ruby package manager mirror.
This requirement comes from the fact that those users don't want the stack creation and recovery processes to rely on the availability of external package managers. A common scenario they'd like to avoid is an external package manager experiencing downtime during production stack creation or an automated recovery process.
Need a way to identify whether a stack is healthy or not; this should include various statuses:
v2 introduced support for TLS certs generated with AWS Certificate Manager. However, it lost the feature that allowed users to specify a custom cert which doesn't get dynamically generated.
This is necessary for users who are mandated to use a specific cert which can be uploaded to AWS Certificate Manager but not generated each time.
stack-data playbook should generate a custom fact file that contains version numbers of all dependencies stored inside data bucket. The values can be retrieved from inventory config.
These facts should then be used by stack provisioner to download the dependency artifacts from data bucket.
Benefits:
The author-standby role needs the PutMetricData action permission; consider adding it to all roles.
403 Client Error: Forbidden for url: https://monitoring.ap-southeast-2.amazonaws.com/?Action=PutMetricData&MetricData.member.
The permission is used for putting the sync delay metric.
During recovery, the publish instance's state is restored on the new pair; however, the cache is empty, and this might trigger a cache rebuild, which could consume significant resources for a large content set.
If the dispatcher's docroot is stored on a separate volume, it can be used as the source to copy the dispatcher's state from.
If both author instances come into service at the same time, we'll be in a world of hurt. We should add an alarm that triggers if there's more than one author in-service.
The Chaos Monkey component currently uses the default location of the Simian Army artifact (defined in puppet-simianarmy), which downloads the artifact from Bintray.
In order to remove this direct dependency on an external service (i.e. not in AWS), we need to add Simian Army to the S3 library location, consistent with the oak-run jar file.
The risk of having an external dependency is that system recovery can fail when the external service (external to AWS) is unavailable, which is unacceptable in a number of organisations.
When stack-data is initialised with the Ansible verbose flag -v set, the generated credentials are displayed, e.g.
TASK [Generate random credentials for system users] ****************************
ok: [127.0.0.1] => {"changed": false, "meta": {"admin": "UMKIzR11kilAC8wyvADz8mivtvoUUfjFruWEBAqfwef13RZvOipzoTff4SAsm3hAtMkO
YahiEKwCdG2JU47Z8qz3ioTMVtOBDqcB", "deployer": "swd6g6NbhDsJHT3e9PUSw1Era5hPxQMwXla3ZZ2GGZLg8lRYmerKKvdB34eVgzFCvD8h265STn2TP
WQxOjqGkAoJhMRidwWNYYSf", "exporter": "rOJAIL5HgsKbbmb0HscJufVdEidUjmLmtJ9BohJwUqVt2MMzWj60Wd1Rf2iVEvFziQW8bv3bic5z6SD5F7HqKl
SwHRbnnmcZI2Yx", "importer": "pGg19IfclNKTQfDfAP482VktdevMPSM0QnhEsyg7U52ujYv5CeyuB159SwTPJ9ZxO36w2NG9gVW2oD3X6XdsaGgsqBpdtLK
3ohdh", "orchestrator": "tJXEQXeKoVTq8yj9qSJ9syBnZBx62hUxHsLGOgUYvlY8C0a3Sb7yC76gzBBUtd4KWLR6USw42spS0jlR9vOwOVVCBAEl2cBfc7wr
", "replicator": "LhXqiZnZpxAxI2mdZu22gC6SYP1MgzMFFFcDDNaQamAwkzRTP6M9ZmSXuyFyhVcsxNSHduxJMbRmtzedGdOARKQE0UtRSoxgU1Fq"}}
These credentials shouldn't be logged even when the verbose flag is set (Ansible tasks support no_log: true for exactly this purpose).
The messaging stack contains an SQS queue for handling ASG events, which could potentially flood when the stack keeps recovering/scaling in a cycle (the ASG keeps terminating and launching instances), e.g. due to a provisioning failure caused by a failing outbound proxy.
Need to create a CloudWatch alarm on the messaging stack to accompany the SQS queue, with the condition set to trigger when there are 10 or more messages in the queue.
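A sketch of the proposed alarm, firing when the queue holds 10 or more visible messages (the queue name and SNS topic ARN below are placeholders):

```shell
# Alarm when the ASG-event queue depth reaches 10 or more messages.
create_queue_depth_alarm() {
  queue_name="$1"; sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${queue_name}-depth" \
    --namespace "AWS/SQS" \
    --metric-name "ApproximateNumberOfMessagesVisible" \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --statistic Average --period 300 --evaluation-periods 1 \
    --comparison-operator GreaterThanOrEqualToThreshold --threshold 10 \
    --alarm-actions "$sns_topic_arn"
}
```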
Currently JVM opts can be configured in packer-aem as part of AMI baking.
aem-aws-stack-builder supports the ability to customise instance types; however, the JVM opts can't be customised at stack level. This creates a dependency on specific AMIs that use the expected JVM opts for a given instance type.
We should decouple the JVM opts required for AMI baking in packer-aem from those required while running the instances in aem-aws-stack-builder.
The custom JVM opts should be configurable as an Ansible variable and passed to the Puppet config for puppet-aem-curator to consume in config_author and config_publish.
The author-publish-dispatcher component CloudFormation template is currently missing a Route53 record.
Keep the naming convention consistent with the other components.
In the scenario where the orchestrator application is no longer processing messages from the queue (maybe the EC2 instance loses access to the queue, or the application crashes, etc.), there needs to be a mechanism to terminate the instance so a new healthy orchestrator starts.
Stack builder is secure by default; however, during development users might want to enable CRXDE in order to inspect the repository.
We need to introduce a flag for enabling CRXDE.
After the nightly and hourly snapshot processes on the publish instance, the snapshot/volume ID needs to be persisted in the launch configuration, to be consumed by new publish instances launched during an autoscaling event.
This is to ensure that in the event of losing all publish instances (i.e. there's no healthy publish instance to take a snapshot from), the new instances will restore the latest nightly or hourly snapshot.
Note that there's a risk of a corrupted repository in the hourly snapshot, hence we need to make this option configurable (whether to use the hourly or nightly snapshot). Different users will accept different levels of risk of encountering a corrupted repository.
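Looking up the snapshot a new publish instance should restore could be done along these lines; this is a sketch, and the tag keys/values used to filter are assumptions:

```shell
# Return the ID of the most recent completed publish snapshot for a stack.
latest_publish_snapshot() {
  stack_prefix="$1"
  aws ec2 describe-snapshots \
    --filters "Name=tag:StackPrefix,Values=${stack_prefix}" \
              "Name=tag:Component,Values=publish" \
              "Name=status,Values=completed" \
    --query 'sort_by(Snapshots,&StartTime)[-1].SnapshotId' \
    --output text
}
```

Persisting the result in the launch configuration (or somewhere the user data can read, so new instances pick it up) is the part the issue asks for; the hourly-vs-nightly choice would be an extra filter driven by the proposed configuration option.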
While implementing the customisation of the JVM memory option for the AEM Author and Publish instances, I forgot to configure the Hiera YAML to set this variable for the author-standby instance as well.
EBS volume sizes in all CloudFormation templates need to be configurable due to the variety of content sizes across AEM projects.