tigstep / kafka_storm_pipeline

kafka_storm_pipeline

Diagram

(architecture diagram)

Requirements

This project requires Ansible, Java 8, Apache Maven, and an AWS account. Although not required, having Docker, redis-cli, and Apache Kafka installed locally is useful for further exploring various parts of this project.

Tools/Services Used

  • Java
  • Ansible
  • Docker
  • Apache Zookeeper
  • Apache Kafka
  • Apache Storm
  • AWS
    • EC2
    • RDS
    • ElastiCache (Redis)
    • S3

Short Description

A real-time, dockerized Kafka event-processing pipeline built on an Apache Storm topology. The project runs in AWS, with infrastructure deployment and execution automated via Ansible.

Process Description

The events for this pipeline (in the format CustID,Balance) are generated by KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). KafkaStormProducer.jar publishes to a Kafka topic from the local machine. The topic acts as a source (a Spout, in Storm's terms) for the StormLookup topology, which for each event (a tuple, in Storm's terms) does the following:
  • LookupBolt
    • Extracts the CustID from the tuple
    • Looks up the SSN for that CustID in the Redis cluster
    • Passes the SSN along with the Balance to both RDSInserterBolt and S3WriterBolt
  • RDSInserterBolt
    • Upserts each tuple into the Balances table in the RDS Postgres instance
  • S3WriterBolt
    • Accumulates received tuples until either a specified count or a specified time delta is reached (whichever happens first)
    • Generates a file from the accumulated tuples and writes it to S3
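The count-or-time flush rule used by S3WriterBolt can be sketched in plain Java. Class and method names here are illustrative, not taken from the repo; a real Storm bolt would typically drive the time trigger with tick tuples rather than checking only on arrival:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of S3WriterBolt's batching rule: flush when either
// maxCount tuples have accumulated or maxAgeMillis have elapsed since the
// batch started, whichever happens first.
class TupleBatcher {
    private final int maxCount;
    private final long maxAgeMillis;
    private final List<String> buffer = new ArrayList<>();
    private long batchStart;

    TupleBatcher(int maxCount, long maxAgeMillis) {
        this.maxCount = maxCount;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Adds a tuple; returns the flushed batch when a threshold is hit, else null. */
    List<String> add(String tuple, long nowMillis) {
        if (buffer.isEmpty()) {
            batchStart = nowMillis; // a new batch starts with its first tuple
        }
        buffer.add(tuple);
        if (buffer.size() >= maxCount || nowMillis - batchStart >= maxAgeMillis) {
            List<String> batch = new ArrayList<>(buffer);
            buffer.clear();
            return batch; // the bolt would write this batch as one S3 object
        }
        return null;
    }
}
```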

Execution

To execute, issue ansible-playbook infrastructure.yml using your own AWS credentials. Once the Ansible run is complete, run one or more instances of KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). At this point the pipeline will start populating RDS and writing files to S3.
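For reference, the CustID,Balance event lines that KafkaStormProducer publishes could be generated as below. The id and balance ranges are assumptions for illustration, and the actual publishing step (via a Kafka producer client) is omitted:

```java
import java.util.Random;

// Illustrative only: builds "CustID,Balance" event lines in the format the
// pipeline consumes. Field ranges are guesses, not taken from the repo.
class EventGenerator {
    private final Random random = new Random();

    String nextEvent() {
        int custId = 1000 + random.nextInt(9000);        // hypothetical 4-digit customer id
        double balance = random.nextInt(100000) / 100.0; // hypothetical balance, 0.00-999.99
        return custId + "," + balance;                   // e.g. "4821,12.34"
    }
}
```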

Execution Process Description

  • ansible-playbook infrastructure.yml
    • Creates a dedicated VPC for this project
    • Creates two subnets inside that VPC, sets up an Internet Gateway and defines all the necessary routes
    • Creates a SecurityGroup to be assigned to different resources throughout this project
    • Spins up EC2 instances and runs Zookeeper, Kafka and Storm Docker containers on them
    • Creates an ElastiCache (Redis) cluster and populates it with lookup data
    • Creates an RDS Postgres instance to be populated by the pipeline
    • Deploys the storm topology
  • KafkaStormProducer
    • Produces events to the Kafka topic consumed by the pipeline
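The per-tuple upsert that RDSInserterBolt performs against the Postgres Balances table could be expressed with Postgres's INSERT ... ON CONFLICT. The ssn/balance column names below are guesses based on the process description, not taken from the repo:

```java
// Sketch of an upsert statement for the Balances table, assuming an
// `ssn` key column and a `balance` value column (column names are
// assumptions, not taken from the repo).
class BalanceUpsert {
    static final String UPSERT_SQL =
        "INSERT INTO Balances (ssn, balance) VALUES (?, ?) " +
        "ON CONFLICT (ssn) DO UPDATE SET balance = EXCLUDED.balance";
    // With JDBC, the bolt would prepare this once and bind the SSN and
    // Balance extracted from each tuple before executing it.
}
```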

To Do

  • Make KafkaStormProducer parameter-based
  • Split CustID_Balance into 2 separate files
  • Create a config.yml.tplt and commit that instead of the original Ansible config
  • Implement time/count based batching for S3WriterBolt
  • Add storm logviewer to supervisor dockers
  • Implement a prod level logging and exception handling

Observations

  • If possible, use Terraform for infrastructure creation. Ansible works well with already-provisioned resources (e.g. spinning up a Kafka container on an EC2 instance), but Terraform is much more intuitive for creating the EC2 instance itself.

Warnings

  • The current configuration of this project uses AWS services that are beyond the Free Tier!
