
spark-flume-stream

A Spark Streaming program that processes events from a Flume agent and stores the output in text files.

  • The Flume Event Producer reads a CSV file and sends US consumer-complaint events to a locally running Flume agent.
  • The Spark program sources the events from the Flume agent (via an Avro sink), transforms them into a processable format, maintains a running/rolling list of event counts per product and state, and appends the running counts to a text file specific to each product and state (for every streaming batch window of 2 seconds).

For each project, please download the latest version of Maven to run mvn commands from the command line, or import it as a Maven project in your IDE (provided the Maven plug-in is present). If you're running from the command line, run mvn clean install and mvn eclipse:eclipse, then import the project into your IDE.
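For example, from a project's root directory (assuming Maven is already on your PATH):

```shell
# Build the project and install it to the local Maven repository
mvn clean install
# Generate Eclipse project files so the project can be imported into an IDE
mvn eclipse:eclipse
```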

The following setup is needed on your local machine to run these projects:

  • Install Apache Flume - Visit Flume

  • Install Spark - Visit Spark (for Mac users, it's also available via brew)

  • Install Python - Visit Python. Python is required to run the Python-specific examples provided in the Spark distribution.

  • Flume agent config - The Spark program runs as a Flume Avro sink for a configured Flume agent. The Avro source for the Flume agent is the Flume event producer. The following Flume agent configuration can be stored at FLUME_HOME/conf/flume-spark-conf.properties:

      agent.sources = javaavrorpc
      agent.channels = memoryChannel
      agent.sinks = sparkstreaming

      agent.sources.javaavrorpc.type = avro
      agent.sources.javaavrorpc.bind = localhost
      agent.sources.javaavrorpc.port = 42222
      agent.sources.javaavrorpc.channels = memoryChannel

      agent.sinks.sparkstreaming.type = avro
      agent.sinks.sparkstreaming.hostname = localhost
      agent.sinks.sparkstreaming.port = 43333
      agent.sinks.sparkstreaming.channel = memoryChannel

      agent.channels.memoryChannel.type = memory
      agent.channels.memoryChannel.capacity = 10000
      agent.channels.memoryChannel.transactionCapacity = 1000
  • Flume agent startup - From FLUME_HOME/bin, start the Flume agent instance by running flume-ng agent -n agent -c ../conf -f ../conf/flume-spark-conf.properties. Check that it's running as a Java process.
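The startup step as a command sequence (assumes FLUME_HOME is set; using jps to check for the Java process is a suggestion, not part of the original instructions):

```shell
cd "$FLUME_HOME/bin"
# Start the agent named "agent" with the configuration above
./flume-ng agent -n agent -c ../conf -f ../conf/flume-spark-conf.properties
# From another terminal, check that the agent is running as a Java process
jps -l
```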

  • Spark standalone cluster startup - From SPARK_HOME/sbin, start the cluster master as a background process by running start-master.sh. Check that it's running as a Java process. Go to http://localhost:8080 and note the URL for the master. Then, from the same command line, start one worker daemon with spark-class org.apache.spark.deploy.worker.Worker <spark master URL> -c 1 -m 2G -d <a directory to store logs>. Then, from a different command line, start a second worker daemon with the same command. There should now be a total of four Java processes running: the Flume agent, the cluster master, and two worker daemons.
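The cluster startup can be sketched as follows (the spark://localhost:7077 master URL and the log directory are placeholder assumptions; read the actual master URL from http://localhost:8080):

```shell
cd "$SPARK_HOME/sbin"
# Start the cluster master in the background; its UI appears at http://localhost:8080
./start-master.sh

# Start the first worker (1 core, 2G memory); run the same command in a
# second terminal for the second worker. Replace the URL and directory.
"$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker \
  spark://localhost:7077 -c 1 -m 2G -d /path/to/worker-logs
```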

  • Spark program submission - Once you've run mvn clean install and generated the project uber-JAR (including all dependencies), submit the program from the command line by running spark-submit --class <main class fully qualified name> --master <spark master URL> <complete path to uber JAR> <complete path to a local checkpoint directory> <complete path to a local output directory>.

    • e.g. --class com.sapient.stream.process.ConsCompEventStream
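A filled-in submission command (the master URL and all paths are placeholders; the main class name is from the project):

```shell
spark-submit \
  --class com.sapient.stream.process.ConsCompEventStream \
  --master spark://localhost:7077 \
  /path/to/spark-flume-stream-uber.jar \
  /path/to/checkpoint-dir \
  /path/to/output-dir
```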
  • Event submission to Flume agent - From your IDE, in the Flume Event Producer project, run the main class FlumeEventSender. It will submit all records in the CSV file as events to the Flume agent. There are 3 files available in resources, and the file name can be changed in class ConsCompFlumeClient to send fewer/more events. The Spark program will start processing the Flume event stream and creating output files in the local output directory.

  • Program monitoring - In your browser, go to http://localhost:8080. It should show the program as a running application. Now you can play around with it.

spark-flume-stream's People

Contributors

abhinavg6, mathyourlife

