Spark Playground

This is a playground for experiments with Spark! Who's got two thumbs and is excited...!?

Getting Started

On a Mac, run brew update and brew install apache-spark sbt to install the latest version of Spark and the Scala sbt build system.

To get started playing, run spark-shell and follow the Spark Quick Start guide for some examples of simple interactive processing.
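
For example, following the Quick Start, a first interactive session might count lines in a local text file (the file path is just an illustration):

scala> val lines = sc.textFile("README.md")          // spark-shell provides the SparkContext as sc
scala> lines.count()                                 // number of lines in the file
scala> lines.filter(_.contains("Spark")).count()     // lines mentioning Spark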

The Scala code for the experiments lives in the standard sbt directory layout under src/main/scala.

To build the various experiments into a jar for submitting to Spark, run sbt assembly. For example:

$ sbt assembly
$ spark-submit --class playground.HardFeelings --master local[4] target/scala-2.10/spark-playground-assembly-*.jar
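
The assembly task comes from the sbt-assembly plugin; this repo's build presumably already declares it, but for reference a typical project/plugins.sbt entry looks like the following (the version shown is illustrative):

// project/plugins.sbt -- version is illustrative, not taken from this repo
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")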

Live Tweets

To process live tweets, copy src/main/resources/twitter.conf.example to src/main/resources/twitter.conf and edit the config to specify your Twitter API credentials. If you don't have Twitter API credentials, see http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html for directions on obtaining them.
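
For reference, Spark's Twitter receiver (via Twitter4J) can pick up OAuth credentials from system properties; a minimal sketch of wiring a config file into those properties (the config key names and loading code are assumptions, not necessarily what this repo does):

import com.typesafe.config.ConfigFactory

// Hypothetical loader: the key names below are assumptions; check twitter.conf.example for the real ones
val twitterConf = ConfigFactory.parseResources("twitter.conf")
Seq("consumerKey", "consumerSecret", "accessToken", "accessTokenSecret").foreach { key =>
  System.setProperty(s"twitter4j.oauth.$key", twitterConf.getString(s"twitter.$key"))
}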

To see live tweets printed out:

$ spark-submit --class playground.connector.Twitter --master local[4] target/scala-2.10/spark-playground-assembly-*.jar

Elasticsearch

To enable Elasticsearch support in the playground, make sure Elasticsearch is running locally (its REST API should be reachable at http://localhost:9200), then add --conf spark.playground.es.enabled=true to the spark-submit command arguments. See the examples below.
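
Under the hood, writing RDDs to Elasticsearch is usually done with the elasticsearch-hadoop (elasticsearch-spark) connector; a minimal sketch, assuming that connector is on the classpath (the index name and document fields are illustrative, not this repo's schema):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                       // adds saveToEs to RDDs

val sparkConf = new SparkConf()
  .setAppName("es-sketch")
  .set("es.nodes", "localhost:9200")                   // local Elasticsearch REST endpoint
val sc = new SparkContext(sparkConf)

// Each Map becomes one document in the given index/type
val docs = sc.makeRDD(Seq(Map("text" -> "merica", "score" -> 1)))
docs.saveToEs("playground/tweets")                     // illustrative index/type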

Kafka

To enable Kafka in the playground, make sure Kafka is running locally. The Kafka broker is expected to be on localhost:9092 and ZooKeeper is expected to be on localhost:2181.

See http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/ for a good guide on Spark/Kafka integration.
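
For reference, the receiver-based Kafka input used in that guide boils down to something like the following sketch (assuming spark-streaming-kafka is on the classpath; the group id and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-sketch"), Seconds(5))

// ZooKeeper quorum, consumer group id, and a Map of topic -> number of receiver threads
val messages = KafkaUtils.createStream(ssc, "localhost:2181", "playground-group", Map("MericaTweets" -> 1))
messages.map(_._2).print()                             // print the message values each batch

ssc.start()
ssc.awaitTermination()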

To listen in externally to the MericaTweets topic produced by MericaStreaming, optionally from the beginning of the topic:

$ kafka-console-consumer.sh --zookeeper localhost:2181 --topic MericaTweets [--from-beginning]

HDFS

To output to HDFS (or simulated HDFS via local files), add --conf spark.playground.hdfs.enabled=true to the spark-submit command arguments.

Note that spark-submit uses the HADOOP_CONF_DIR environment variable to find HDFS. If HDFS is not properly set up, make sure this environment variable is unset; Spark will then write to local files rather than HDFS.
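
This works because Spark's save methods accept either a fully qualified HDFS URI or a plain path that resolves against the default filesystem; a small sketch of the idea (the path and data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hdfs-sketch"))

// With HADOOP_CONF_DIR pointing at a working Hadoop config, a scheme-less path resolves to HDFS;
// without it, the same call writes to the local filesystem.
val tweets = sc.parallelize(Seq("tweet one", "tweet two"))   // illustrative data
tweets.saveAsTextFile("tweets/batch-1")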

Streaming Tweet Processing

To run streaming tweet processing, you can enable or disable various external datastores, for example:

$ spark-submit --class playground.MericaStreaming --master local[4] target/scala-2.10/spark-playground-assembly-*.jar
$ spark-submit --conf spark.playground.kafka.enabled=true --class playground.MericaStreaming --master local[4] target/scala-2.10/spark-playground-assembly-*.jar
$ spark-submit --conf spark.playground.kafka.enabled=true --conf spark.playground.es.enabled=true --class playground.MericaStreaming --master local[4] target/scala-2.10/spark-playground-assembly-*.jar

To save the tweets for offline batch processing (see below), run with --conf spark.playground.hdfs.enabled=true, e.g.:

$ spark-submit --conf spark.playground.hdfs.enabled=true --class playground.MericaStreaming --master local[4] target/scala-2.10/spark-playground-assembly-*.jar

Note that any combination of the external datastore --conf flags above is supported.

The Merica and MericaStreaming sentiment analysis jobs can perform sentiment calculations the "easy way" (compute sentiment in memory) or the "hard way" (distributed sentiment calculation via Spark).

To change the calculation mode, pass --conf spark.playground.easy=false (it defaults to true). This works for both streaming and batch jobs.
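
The distinction amounts to where the arithmetic happens; a rough sketch of the two styles (the scoring function and method names are purely illustrative, not this repo's implementation):

import org.apache.spark.rdd.RDD

// Illustrative scorer: +1 per "good", -1 per "bad"
def score(text: String): Int =
  text.split("\\s+").map(w => if (w == "good") 1 else if (w == "bad") -1 else 0).sum

// "Easy way": pull the tweets to the driver and compute the average sentiment in local memory
def easySentiment(tweets: RDD[String]): Double = {
  val local = tweets.collect()
  local.map(score).sum.toDouble / local.length
}

// "Hard way": keep the data distributed and let Spark do the aggregation
def hardSentiment(tweets: RDD[String]): Double = {
  val scores = tweets.map(t => score(t).toDouble)
  scores.reduce(_ + _) / scores.count()
}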

Batch Tweet Processing

To do batch sentiment analysis on the tweets saved from the MericaStreaming processing described above, run any of the following:

$ spark-submit --class playground.Merica --master local[4] target/scala-2.10/spark-playground-assembly-*.jar
$ spark-submit --conf spark.playground.es.enabled=true --class playground.Merica --master local[4] target/scala-2.10/spark-playground-assembly-*.jar

Development

As with any other sbt project, to develop in Eclipse (e.g. the Scala IDE), initialize or update the Eclipse project files by running:

$ sbt eclipse
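
The eclipse task comes from the sbteclipse plugin; if your checkout does not already declare it, a typical project/plugins.sbt entry looks like this (the version shown is illustrative):

// project/plugins.sbt -- version is illustrative
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")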

Testing

As with any other sbt project, run the unit tests with:

$ sbt test
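
With the standard sbt layout, tests live under src/test/scala; a minimal spec just to show the shape sbt test picks up (assuming a ScalaTest-style framework; the class and assertion are illustrative, not an actual test from this repo):

import org.scalatest.{FlatSpec, Matchers}

// Illustrative spec, not part of this repo
class PlaygroundSpec extends FlatSpec with Matchers {
  "A word count" should "be zero for empty text" in {
    "".split("\\s+").count(_.nonEmpty) shouldBe 0
  }
}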
