aem-aws-stack-builder's People

Contributors

cliffano, dependabot[bot], engshine, epoxboy, hoomaan-kh, kaveensingh31, lenuhc, mattd-mb, matthew-d, mbloch1986, melbit-nishantsharma, michaeldiender-shinesolutions, nerdy-dav, nletts, ovlords, phillipi-shinesolutions, pradkhandelwal, pranavmalaviya, priya-cr, pzurzolo, rjunx, shineworks, siebes, sregort, veldotshine, viveknair93

aem-aws-stack-builder's Issues

Replace serverspec with Inspec

Since InSpec has superseded serverspec's feature set, we need to replace serverspec with InSpec (or any better tool).

One benefit of moving to InSpec is the ability to publish an AEM InSpec profile to the Chef marketplace. This allows multiple projects to use the same AEM spec checks, and it allows the automation code to perform deep AEM inspection rather than relying on the limited checks that serverspec provides.
E.g. check whether a replication agent exists or not.

Stack init shouldn't need to sleep

Stack initialisation sleeps for 30 seconds before running the tests (used to be ServerSpec, now InSpec). This sleep has to be removed, and proper checks and waits need to be implemented at the puppet-aem-curator level.

Sleeping for a fixed duration only works when an instance runs a single service; 30 seconds is already insufficient for an instance that runs 2 AEM instances and 1 Apache httpd.
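The difference between a fixed sleep and a proper wait can be sketched as a polling loop with a deadline. This is only an illustration in Python, not the puppet-aem-curator implementation; the name wait_until and its parameters are hypothetical, and the readiness check itself is left to the caller.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 300.0,
               interval: float = 5.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as the check passes, False if the deadline is hit.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        # Wait a short interval between attempts instead of one long sleep.
        time.sleep(interval)
```

A caller would run one such wait per service on the instance (e.g. each AEM instance and Apache httpd) and only start the tests once all of them pass.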

New architecture for permission type a (v2.0.0 original app architecture)

During the effort to unify v1.x and v2.0.0 feature sets, we lost the original v2.0.0 app architecture.
Now that master has the structure to support multiple architectures, v2.0.0 app architecture should be introduced as a different type of prerequisites stack.

The prerequisites for v2.0.0 involve:

  • r53 zone
  • wildcard cert
  • messaging
  • instance profiles

Prerequisites for consolidated architecture

In order to simplify stack dependencies for a consolidated architecture (instance profiles, security groups, etc), we need to introduce a prerequisites stack for consolidated.

It will be simpler for users to understand the need to build the prerequisites and the compute stack, then pair them (whether it's one to many, or one to one).

Enable metrics collection on all ASGs

In order to provide better visibility on the stack's activity on AuthorDispatcher, Publish, and PublishDispatcher layers, we need to enable metric collections on those layers' AutoScalingGroups.

References:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-metricscollection.html#cfn-as-metricscollection-granularity
http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-monitoring.html

This would help users find out when instances were launched and terminated, whether there was a period of time where an ASG kept scaling, etc.
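As a rough sketch of the change, the snippet below adds a MetricsCollection property to every AutoScalingGroup resource in a CloudFormation template held as a Python dict. The helper name enable_metrics_collection is hypothetical, and the real templates may be edited by hand instead. Per the CloudFormation reference linked above, specifying only Granularity enables all group metrics.

```python
import copy

ASG_TYPE = "AWS::AutoScaling::AutoScalingGroup"


def enable_metrics_collection(template: dict) -> dict:
    """Return a copy of `template` with metrics collection on every ASG."""
    result = copy.deepcopy(template)
    for resource in result.get("Resources", {}).values():
        if resource.get("Type") == ASG_TYPE:
            props = resource.setdefault("Properties", {})
            # Keep any existing setting; otherwise enable all metrics
            # at 1-minute granularity.
            props.setdefault("MetricsCollection", [{"Granularity": "1Minute"}])
    return result
```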

Add volume tagging support

The CloudFormation templates currently use BlockDeviceMappings, which doesn't provide a way to tag the volume (e.g. with StackPrefix and Component tags).

We need to figure out a way to tag the volumes.
Past discussions suggested looking at replacing BlockDeviceMappings with Volumes and/or VolumeAttachments.

Flag for enabling default password

Stack builder is secure by default and generates a random password for each environment creation.
However, during development, users might want to use simple passwords for each of the system users. When this flag is enabled, set each password to be the same as the username, i.e. admin/admin, orchestrator/orchestrator.
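The intended behaviour of the flag can be sketched as follows. This is an illustration only: the function name, the flag name enable_default_passwords and the password length are assumptions, and the actual stack builder generates credentials in an Ansible task.

```python
import secrets
import string


def generate_credentials(users, enable_default_passwords=False, length=100):
    """Map each system user to a password.

    Secure by default: a random password per user. With the (hypothetical)
    flag enabled, the password equals the username, e.g. admin/admin.
    """
    if enable_default_passwords:
        return {user: user for user in users}
    alphabet = string.ascii_letters + string.digits
    return {
        user: "".join(secrets.choice(alphabet) for _ in range(length))
        for user in users
    }
```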

NumberFormatException on Chaos Monkey component initialisation

The following error shows up in simianarmy log, which causes Chaos Monkey to not run.

2018-01-23 11:49:06.331 - ERROR MonkeyRunner - [MonkeyRunner.java:234] monkeyFactory error, cannot make monkey from com.netflix.simianarmy.basic.chaos.BasicChaosMonkey with com.netflix.simianarmy.basic.BasicChaosMonkeyContext
java.lang.NumberFormatException: For input string: ""
       at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
       at java.lang.Integer.parseInt(Integer.java:592)
       at java.lang.Integer.parseInt(Integer.java:615)
       at com.netflix.simianarmy.basic.BasicSimianArmyContext.<init>(BasicSimianArmyContext.java:147)
       at com.netflix.simianarmy.basic.BasicChaosMonkeyContext.<init>(BasicChaosMonkeyContext.java:54)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
       at java.lang.Class.newInstance(Class.java:442)
       at com.netflix.simianarmy.MonkeyRunner.factory(MonkeyRunner.java:229)
       at com.netflix.simianarmy.MonkeyRunner.replaceMonkey(MonkeyRunner.java:145)
       at com.netflix.simianarmy.basic.BasicMonkeyServer.addMonkeysToRun(BasicMonkeyServer.java:53)
       at com.netflix.simianarmy.basic.BasicMonkeyServer.init(BasicMonkeyServer.java:78)
       at javax.servlet.GenericServlet.init(GenericServlet.java:158)
       at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1269)
       at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1182)
       at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1072)
       at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5368)
       at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5660)
       at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
       at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:899)
       at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:875)
       at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
       at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1260)
       at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:2002)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

Remove 2 subnets restriction in application CloudFormation templates

There are still some references to *SubnetA and *SubnetB in the stack builder CloudFormation.
This introduces a hard coupling between application and network, and it also limits stack builder to networks that use only 2 AZs.

All remaining references to SubnetA and SubnetB should be replaced with SubnetList.

put author standby in autoscaling group to handle instance failure.

Currently the author-standby component sits on its own as a standalone instance, so it needs an auto recovery process.

Users have to resort to a blue-green approach, creating a new stack in order to recover from an author-standby failure. We need to remove this from user space and let the infrastructure auto recover.

Add bastion secgroup configuration in author-publish-dispatcher

author-publish-dispatcher needs to support InboundFromBastionHostSecurityGroupParameter just like the full aem-set.

The idea is to allow users to easily configure a secgroup that acts as the origin of inbound connections into author-publish-dispatcher's EC2 instance.

Clean up overall logging management

All log files across all components need to be configured with CloudWatch.

All log files should also have a log rotation mechanism, which will differ between applications. Some applications might already have built-in log rotation logic, while others need logrotate to be set up via Puppet.

Introduce Stack Manager CloudFormation template

The current CloudFormation template (along with config and playbooks) needs to be added to aem-aws-stack-builder. This is mostly a migration effort from the Stack Manager Cloud repo, which currently contains the CF template(s).

This is an effort to simplify users' stack building configuration and provisioning.
At the same time, it's also an effort to make things consistent: the application repos (stack manager, orchestrator, etc) produce the application code, while AWS CloudFormation templates live in aem-aws-stack-builder.

Custom stack provisioner check shouldn't log error message when file doesn't exist

stack-init.sh checks for the existence of a custom stack provisioner using aws s3api head-object ..., but when the custom stack provisioner file doesn't exist this displays the error message "An error occurred (404) when calling the HeadObject operation: Not Found".

This is a problem for users whose log messages are scanned for the word 'error', where it becomes a false positive. In addition, users who saw the error message were confused and went looking for an error to fix.
Ideally we should find a way to perform the existence check without displaying an error message.
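One possible approach is to discard the command's output and decide on the exit code alone. The sketch below illustrates that idea in Python rather than in stack-init.sh; the function name s3_object_exists and the injectable run parameter are hypothetical. Note the assumption that any non-zero exit code means "not found", which would also hide permission or network failures.

```python
import subprocess


def s3_object_exists(bucket, key, run=subprocess.run):
    """Return True if `aws s3api head-object` succeeds for the object.

    stdout and stderr are discarded so a missing object does not put the
    word 'error' into the logs. Assumption: any failure counts as missing.
    """
    result = run(
        ["aws", "s3api", "head-object", "--bucket", bucket, "--key", key],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

The caller can then log a neutral message such as "no custom stack provisioner found, skipping" when the function returns False.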

Introduce AuthorDispatcher ELB subnet config

In order to provide the flexibility to place author-dispatcher ELB on a different subnet to the authors (in full-set architecture), we need to introduce a configurable author-dispatcher ELB subnet.

Simplify configuration/facts that will be passed to provisioner

At the moment stack builder is creating a stack-facts.txt file (which will be available as Facter facts) containing configuration values that need to be passed from stack builder config to stack provisioners (aem-aws-stack-provisioner and also any custom stack provisioner).

The goal is to allow users to configure those values from aem-aws-stack-builder, so they don't have to configure anything else.

Need to revisit whether passing these values via Facter facts can be improved or not.
Previous discussions suggested that if aem-aws-stack-provisioner will be refactored into a proper module, then these facts can be replaced by a configuration for this aem-aws-stack-provisioner.

Scaling policies when instances have high cpu usage.

As the first line of defence, dispatcher sitting in front of publisher must be able to scale horizontally when under heavy load. Even though the architecture might include a CDN in front of the dispatcher, we need to consider the scenario when the CDN is not inside AWS and hence there would be a disaster scenario when that external CDN is unavailable.

Add global tags support to author-publish-dispatcher component

The add global tags script currently doesn't have any knowledge of the author-publish-dispatcher component, which means the EC2 instance running on that stack wouldn't have user-specific tags.

The script configuration needs to include author-publish-dispatcher configuration.

author elb allows login on https but not http.

The author ELB allows login over HTTPS but not over HTTP.

Not sure if this is intended or not, given the stack does allow access via HTTP.

If it is intended, it would be useful to have a redirect to HTTPS in the Apache configuration.

SimpleDB deletion

As part of the clean up when a stack is terminated, the SimpleDB that was created for Chaos-Monkey needs to be deleted as well.

Remove external package installation during component initialisation

Stack init in v2 performs a gem install of rspec. This breaks user environments that strictly mandate zero access to the Internet during the stack creation process and that don't have an internal Ruby package manager mirror.

This requirement comes from the fact that those users don't want stack creation and recovery process to rely on the availability of external package managers. A common scenario they'd like to avoid is an external package manager experiencing downtime during production stack creation or automated recovery process.

Stack level health check

We need a way to identify whether a stack is healthy or not. This should include various statuses:

  • cloud init finishes without any error
  • all instances are wired by orchestrator
  • content is accessible from the publish-dispatcher ELB
  • authors can log in to AEM admin via author-dispatcher ELB
  • author standby is not lagging behind author primary
  • orchestrator queues don't have too many messages
  • aem-healthcheck deep healthcheck is responding with 200
  • aem-healthcheck security is responding with 200
  • publish-dispatcher ELB and author-dispatcher ELB should have at least 2 inservice instances, author ELB should have exactly 1 inservice instance
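The statuses above could be combined into a single verdict along these lines. This is a minimal sketch: the function name stack_health and the check names are hypothetical, and each individual check (ELB instance counts, queue depth, aem-healthcheck responses) would still need to be implemented.

```python
from typing import Callable, Dict, Tuple


def stack_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, Dict[str, bool]]:
    """Run every named check; the stack is healthy only if all of them pass.

    A check that raises is recorded as failed rather than aborting the run,
    so the report always lists every status.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results
```

Returning the per-check results alongside the overall flag lets users see which status caused a stack to be reported as unhealthy.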

Support for predefined certs

v2 introduced support for using TLS certs generated with AWS Certificate Manager. However, it lost the feature that allowed users to specify a custom cert which doesn't get dynamically generated.

This is necessary for users who are mandated to use a specific cert, which can be uploaded to AWS Certificate Manager but not generated each time.

Inject dependency versions as facts

The stack-data playbook should generate a custom fact file that contains the version numbers of all dependencies stored inside the data bucket. The values can be retrieved from the inventory config.

These facts should then be used by the stack provisioner to download the dependency artifacts from the data bucket.

Benefits:

  • single location for configuring dependency versions in aem-aws-stack-builder (currently duplicated in stack provisioner)
  • data bucket can store versioned artifacts
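A custom fact file of this kind could be rendered as plain key=value lines, which is the format Facter accepts for external facts in .txt files. The sketch below is illustrative only: the function name and the "<dependency>_version" fact naming scheme are assumptions, and the playbook would more likely do this with an Ansible template.

```python
def render_dependency_facts(versions: dict) -> str:
    """Render dependency versions as key=value lines for a Facter fact file.

    Fact names follow a hypothetical "<dependency>_version" convention and
    are sorted so the generated file is stable between runs.
    """
    lines = [f"{name}_version={version}" for name, version in sorted(versions.items())]
    return "\n".join(lines) + "\n"
```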

Dispatcher state needs to be persisted

During recovery, the publish instance's state is restored on the new pair. However, the dispatcher cache is empty, and this might trigger a rebuild of the cache, which could consume a lot of resources for a large amount of content.

If dispatcher's docroot is stored in a different volume, it can be used as the source to copy the dispatcher's state from.

Configure Simian Army source to be an S3 URI

The Chaos Monkey component currently uses the default location of the Simian Army artifact (defined in puppet-simianarmy), which downloads the artifact from bintray.

In order to remove this direct dependency on an external service (i.e. not in AWS), we need to add Simian Army to the S3 library location, consistent with the oak run jar file.

The risk of having an external dependency is that your system recovery can fail when the external service (external to AWS) is unavailable and this is unacceptable in a number of organisations.

System user credentials get logged when verbose flag is set

When stack-data is initialised with Ansible verbose flag -v set, the generated credentials are displayed, e.g.

TASK [Generate random credentials for system users] ****************************

ok: [127.0.0.1] => {"changed": false, "meta": {"admin": "UMKIzR11kilAC8wyvADz8mivtvoUUfjFruWEBAqfwef13RZvOipzoTff4SAsm3hAtMkO
YahiEKwCdG2JU47Z8qz3ioTMVtOBDqcB", "deployer": "swd6g6NbhDsJHT3e9PUSw1Era5hPxQMwXla3ZZ2GGZLg8lRYmerKKvdB34eVgzFCvD8h265STn2TP
WQxOjqGkAoJhMRidwWNYYSf", "exporter": "rOJAIL5HgsKbbmb0HscJufVdEidUjmLmtJ9BohJwUqVt2MMzWj60Wd1Rf2iVEvFziQW8bv3bic5z6SD5F7HqKl
SwHRbnnmcZI2Yx", "importer": "pGg19IfclNKTQfDfAP482VktdevMPSM0QnhEsyg7U52ujYv5CeyuB159SwTPJ9ZxO36w2NG9gVW2oD3X6XdsaGgsqBpdtLK
3ohdh", "orchestrator": "tJXEQXeKoVTq8yj9qSJ9syBnZBx62hUxHsLGOgUYvlY8C0a3Sb7yC76gzBBUtd4KWLR6USw42spS0jlR9vOwOVVCBAEl2cBfc7wr
", "replicator": "LhXqiZnZpxAxI2mdZu22gC6SYP1MgzMFFFcDDNaQamAwkzRTP6M9ZmSXuyFyhVcsxNSHduxJMbRmtzedGdOARKQE0UtRSoxgU1Fq"}}

These credentials shouldn't be logged even if verbose flag is set.

Create CloudWatch alarm for messaging queue size

Messaging stack contains an SQS queue for handling ASG events, which could potentially flood when the stack keeps recovering/scaling (ASG keeps terminating and launching instances) in a cycle, e.g. due to provisioning failure caused by failing outbound proxy.

We need to create a CloudWatch alarm on the messaging stack to accompany the SQS queue, with the condition set to 'when there are 10 or more messages in the queue'.
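The alarm condition can be sketched as a CloudFormation resource, built here as a Python dict that could be dumped to JSON. The function name, the choice of the ApproximateNumberOfMessagesVisible metric, the Maximum statistic and the 5-minute period are assumptions; only the threshold of 10 comes from this issue.

```python
def queue_size_alarm(queue_name: str, threshold: int = 10) -> dict:
    """Build an AWS::CloudWatch::Alarm resource for an SQS queue's size."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"{queue_name} has {threshold} or more messages",
            "Namespace": "AWS/SQS",
            # Assumed metric: messages available for retrieval from the queue.
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Statistic": "Maximum",
            "Period": 300,
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            # "10 or more" maps to greater-than-or-equal.
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        },
    }
```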

Stack-level JVM opts customisation support

Currently JVM opts can be configured in packer-aem as part of AMI baking.
aem-aws-stack-builder supports the ability to customise instance types; however, the JVM opts can't be customised at stack level. This causes a dependency on specific AMIs that use the expected JVM opts for a given instance type.

We should decouple the JVM opts between what's required for AMI baking in packer-aem and what's required while running the instances in aem-aws-stack-builder.

The custom JVM opts should be configurable as an Ansible variable and passed to the Puppet config for puppet-aem-curator to consume in config_author and config_publish.

Auto recovery from a dysfunctional orchestrator

Consider the scenario where the orchestrator application is no longer processing messages from the queue, e.g. because the EC2 instance loses access to the queue or the application crashes. There needs to be a mechanism to terminate the instance so that a new healthy orchestrator starts.

Flag for enabling crxde

Stack builder is secure by default, however, during development users might want to enable crxde in order to inspect the repository.
We need to introduce a flag for enabling crxde.

Persist backup snapshot ID in launch config

After each nightly and hourly snapshot of a publish instance, the snapshot/volume ID needs to be persisted in the launch config, to be consumed by new publish instances launched during an autoscaling event.

This is to ensure that in the event of losing all publish instances (i.e. there's no healthy publish instance to take a snapshot from), the new instances will restore the latest nightly or hourly snapshot.

Note that there's a risk of a corrupted repository in the hourly snapshot, hence we need to make this option configurable (whether to use the hourly or nightly snapshot). Different users will accept different levels of risk of encountering a corrupted repository.

Parameterise EBS volume size

EBS volume size in all CloudFormation templates needs to be configurable due to the variety of content sizes on various AEM projects.