A real-time, dockerized Kafka event processor pipeline (utilizing an Apache Storm topology). The project runs in AWS, with automated infrastructure deployment and execution via Ansible.
This project requires Ansible, Java 8, Apache Maven, and an AWS account. In addition, although not required, having Docker, redis-cli, and Apache Kafka installed locally is helpful for further exploring various parts of this project.
Tools/Services Used
Java
Ansible
Docker
Apache Zookeeper
Apache Kafka
Apache Storm
AWS
EC2
RDS
ElastiCache (Redis)
S3
Process Description
The events for this pipeline (in the format CustID,Balance) are generated by KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). KafkaStormProducer.jar publishes to a Kafka topic from the local machine. The topic acts as a source (a Spout, in Storm terms) for the StormLookup topology, which for each event (a tuple, in Storm terms) does the following:
LookupBolt
Extracts the CustID from the tuple
Looks up the SSN for that CustID in the Redis cluster
Passes the SSN along with the Balance to both RDSInserterBolt and S3WriterBolt
RDSInserterBolt
For each tuple, performs an upsert into the RDS PostgreSQL Balances table
S3WriterBolt
Accumulates the received tuples until either a specified count or a specified time delta is reached (whichever happens first)
Generates a file from the accumulated tuples and writes it to S3
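The per-tuple flow above can be sketched in plain Java. This is a hypothetical, self-contained illustration, not the project's actual bolt classes: a Map stands in for the Redis cluster, and the Batcher models the count/time flush the S3WriterBolt is meant to perform (returning the batch instead of writing to S3).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the core per-tuple logic described above.
class PipelineSketch {

    // LookupBolt's transform: "CustID,Balance" -> { SSN, Balance }.
    static String[] resolve(String event, Map<String, String> ssnByCustId) {
        String[] parts = event.split(",", 2);    // CustID,Balance
        String ssn = ssnByCustId.get(parts[0]);  // a Redis GET in the real bolt
        return new String[] { ssn, parts[1] };   // passed to RDSInserterBolt and S3WriterBolt
    }

    // S3WriterBolt's batching: flush on count or time delta, whichever comes first.
    static class Batcher {
        private final int maxCount;
        private final long maxAgeMillis;
        private final List<String> buffer = new ArrayList<>();
        private long firstArrival;

        Batcher(int maxCount, long maxAgeMillis) {
            this.maxCount = maxCount;
            this.maxAgeMillis = maxAgeMillis;
        }

        // Returns the batch to write as one S3 file when a threshold is crossed, else null.
        List<String> add(String tuple, long nowMillis) {
            if (buffer.isEmpty()) firstArrival = nowMillis;
            buffer.add(tuple);
            if (buffer.size() >= maxCount || nowMillis - firstArrival >= maxAgeMillis) {
                List<String> flushed = new ArrayList<>(buffer);
                buffer.clear();
                return flushed;
            }
            return null;
        }
    }
}
```

In the actual topology these steps would live inside Storm bolts (receiving Tuple objects and emitting via a collector); the sketch only isolates the transformations themselves.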
Execution
To execute, issue ansible-playbook infrastructure.yml using your own AWS user. Once the Ansible run is complete, run one or more instances of KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). At this point the pipeline will start populating RDS and writing files to S3.
Execution Process Description
ansible-playbook infrastructure.yml
Creates a dedicated VPC for this project
Creates two subnets inside that VPC, sets up an Internet Gateway and defines all the necessary routes
Creates a SecurityGroup to be assigned to different resources throughout this project
Spins up EC2 instances and runs Zookeeper, Kafka and Storm Docker containers on them
Creates an ElastiCache (Redis) cluster and populates it with lookup data
Creates an RDS Postgres instance to be populated by the pipeline
Deploys the storm topology
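The first of the playbook steps above might look roughly like the following fragment. This is a hypothetical illustration of the shape of such a task (the module comes from the standard Ansible AWS collection; the name, CIDR block, and variable are assumed and the project's actual infrastructure.yml may differ):

```yaml
# Hypothetical fragment; not the project's actual infrastructure.yml.
- name: Create a dedicated VPC for the project
  amazon.aws.ec2_vpc_net:
    name: kafka-storm-vpc        # assumed VPC name
    cidr_block: 10.0.0.0/16      # assumed CIDR range
    region: "{{ aws_region }}"   # assumed variable
  register: vpc
```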
KafkaStormProducer
Produces events to the Kafka topic, to be consumed by the pipeline
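The payload the producer builds can be sketched as follows. This is a hypothetical, self-contained illustration of generating the CustID,Balance event strings; the real KafkaStormProducer would hand each string to a Kafka producer client, which is omitted here.

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical sketch of event generation in KafkaStormProducer.
// Only the "CustID,Balance" payload construction is shown; sending
// the string to the Kafka topic is left out.
class EventSketch {

    // Build a single "CustID,Balance" event for a random customer.
    static String nextEvent(Random rng, int customerCount) {
        int custId = rng.nextInt(customerCount);          // assumed ID range: 0..customerCount-1
        double balance = rng.nextInt(1_000_000) / 100.0;  // assumed balance range: 0.00..9999.99
        // Locale.ROOT keeps the decimal point a '.', matching the comma-delimited format.
        return custId + "," + String.format(Locale.ROOT, "%.2f", balance);
    }
}
```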
To Do
Make KafkaStormProducer parameter-based
Split CustID_Balance to 2 separate files
Create a config.yml.tplt template and push it instead of the original Ansible config
Implement time/count based batching for S3WriterBolt
Add the Storm logviewer to the supervisor Docker containers
Implement production-level logging and exception handling
Observations
If possible, always use Terraform for infrastructure creation. Ansible is good for working with already provisioned resources (e.g., spinning up a Kafka container on an EC2 instance); however, Terraform is much more intuitive for the EC2 instance creation itself.
Warnings
The current configuration of this project uses AWS services that are beyond the Free Tier!