spark-mllib-scala-play's Introduction

Twitter Sentiment Analysis

Typesafe Activator tutorial for Apache Spark, MLlib, Scala, Akka and Play Framework

With this tutorial template we show how to automatically classify the sentiment of Twitter messages using the Typesafe Stack and Apache Spark. Messages are classified as either positive or negative with respect to a query term. Users who want to research the sentiment around products before a purchase, or companies that want to monitor public sentiment toward their brands, can make use of this kind of application.

The Activator template consists of backend components using Scala, Spark, Akka and the Play Framework in their most recent versions, and Polymer for the UI. The main focus of this template is the orchestration of these technologies, using the example of classifying the sentiment of Twitter messages with MLlib. If you want to see this template in action, please refer to http://sentiment.openforce.com (you will need a Twitter account to log in).

The fundamental idea of sentiment classification used in this template is based on the paper by Alec Go et al. and its related implementation Sentiment140.

Setup Instructions

Make sure that you have Java 8, either sbt or Typesafe Activator, and Node.js installed on your machine. You should have at least two cores available on this machine, since Spark Streaming (used by the OnlineTrainer) will occupy one core. Hence, to be able to process the data, the application needs at least one more core.

  1. Clone this repository: git clone git@github.com:openforce/spark-mllib-scala-play.git
  2. Change into the newly created directory: cd spark-mllib-scala-play
  3. Insert your Twitter access token and consumer key/secret pairs in application.conf. For generating a token, please refer to dev.twitter.com. By default the application runs in single-user mode, which means the access tokens configured in your application.conf (or local.conf) are also used for querying Twitter by keyword. This is fine when you run the application locally and just want to check out the tutorial. Note: if you want to run the application in production mode, you have to turn single-user mode off so that OAuth per user is used instead. To do so, change the line in your conf/application.conf to twitter.single-user-mode = no. Also make sure to provide an application secret.
  4. Launch with sbt: sbt run, or with Activator: ./activator ui (if you want to start the application as a Typesafe Activator template)
  5. Navigate your browser to: http://localhost:9000
  6. If necessary, change twitter.redirect.url in application.conf to the URL the application actually uses.
  7. If necessary (if Twitter changes the URL of its tweet-fetching service), change twitter.fetch.url in application.conf to the new one. Ensure that the last URL parameter is the query string, because the application appends the keyword to the end of the URL.
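
Step 7 implies that the keyword is simply appended to the configured fetch URL. The following is a minimal sketch of that idea, assuming the keyword is URL-encoded before it is appended. The object name, the helper name and the example URL are hypothetical and are not taken from the template's code.

```scala
import java.net.URLEncoder

object FetchUrlSketch {
  // Hypothetical value for twitter.fetch.url. Note that the URL ends with the
  // query parameter ("q="), so the keyword can be appended directly.
  val exampleFetchUrl: String =
    "https://api.twitter.com/1.1/search/tweets.json?count=100&q="

  // Appends the URL-encoded keyword to the end of the configured URL.
  def buildFetchUrl(fetchUrl: String, keyword: String): String =
    fetchUrl + URLEncoder.encode(keyword, "UTF-8")
}
```

If the query parameter were not the last one, the appended keyword would end up inside the value of a different parameter, which is why the order matters.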

If starting the application takes a very long time or even times out, it may be due to a known Activator issue. In that case, do the following before starting with sbt run:

  1. Delete the project/sbt-fork-run.sbt file
  2. Remove the line fork in run := true (added automatically when you start activator) from the bottom of build.sbt

Without the fork option, which is needed by Activator, the application should start within a few seconds.

The Classification Workflow

The following diagram shows what the actor communication workflow for classification looks like:

[Figure: The Classification Workflow]

The Application controller serves HTTP requests from the client/browser and obtains ActorRefs for EventServer, StatisticsServer and Director.

The Director is the root of the actor hierarchy and creates all other durable (long-lived) actors except StatisticsServer and EventServer. Besides supervising its child actors, it builds the bridge between the Play Framework and Akka by handing over the Classifier ActorRefs to the controller. Moreover, when training of the estimators within BatchTrainer and OnlineTrainer has finished, this actor passes the latest machine learning models to the StatisticsServer (see the figure below). For the OnlineTrainer, statistics generation is scheduled every 5 seconds.

The Classifier creates a FetchResponseHandler actor and sends the TwitterHandler a Fetch message (containing the ActorRef of the FetchResponseHandler) to get the latest Tweets for a given keyword or query.

Once the TwitterHandler has fetched some Tweets, the FetchResponse is sent to the FetchResponseHandler.

The FetchResponseHandler creates a TrainingModelResponseHandler actor and tells the BatchTrainer and OnlineTrainer to pass the latest model to the TrainingModelResponseHandler. It registers itself as a monitor for the TrainingModelResponseHandler and, when that actor terminates, stops itself as well.

The TrainingModelResponseHandler collects the models and vectorized Tweets, makes predictions and sends the results to the original sender (the Application controller). The original sender is passed through the ephemeral (short-lived) actors, indicated by the yellow dotted line in the figure above.

Model Training and Statistics

The following diagram shows the actors involved in training the machine learning estimators and serving statistics about their predictive performance:

[Figure: Model Training and Statistics]

The BatchTrainer receives a Train message as soon as a corpus (a collection of labeled Tweets) has been initialized. This corpus is initialized by the CorpusInitializer and can either be created on the fly via Spark's TwitterUtils.createStream (with automatic labeling based on the emoticons ":)" and ":(") or be loaded from a static corpus provided by Sentiment140, which is read from a CSV file. Which one to use can be configured via ml.corpus.initialization.streamed in application.conf. For batch training we use the high-level org.apache.spark.ml API. We use grid search with cross-validation to find the best hyperparameters for our LogisticRegression model.
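
The automatic labeling by emoticons can be pictured with a small plain-Scala sketch. It does not use Spark and is not the template's CorpusInitializer. The object and method names, the label values (1.0 for positive, 0.0 for negative), the handling of ambiguous Tweets and the removal of emoticons from the text are all assumptions made for illustration.

```scala
object EmoticonLabeling {
  // Assumed label convention: 1.0 = positive, 0.0 = negative.
  val Positive = 1.0
  val Negative = 0.0

  // Returns a label only when exactly one of the two emoticons occurs.
  // Tweets containing both or neither are treated as ambiguous and skipped.
  def label(text: String): Option[Double] = {
    val happy = text.contains(":)")
    val sad = text.contains(":(")
    if (happy && !sad) Some(Positive)
    else if (sad && !happy) Some(Negative)
    else None
  }

  // Keeps only the Tweets that could be labeled and strips the emoticons,
  // so that a classifier cannot simply read the label off the text.
  def labeledCorpus(tweets: Seq[String]): Seq[(String, Double)] =
    tweets.flatMap { tweet =>
      label(tweet).map { l =>
        (tweet.replace(":)", "").replace(":(", "").trim, l)
      }
    }
}
```

Calling EmoticonLabeling.labeledCorpus on a batch of raw Tweet texts yields (text, label) pairs that could then be vectorized and used as training data.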

The OnlineTrainer receives a Train message with a corpus (an RDD[Tweet]) upon successful initialization just like the BatchTrainer. For the online learning approach we use the experimental StreamingLogisticRegressionWithSGD estimator which, as the name implies, uses Stochastic Gradient Descent to update the model continually on each Mini-Batch (RDD) of the DStream created via TwitterUtils.createStream.
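
To illustrate the online learning idea, here is a minimal plain-Scala sketch of logistic regression that is updated with one gradient step per mini-batch. It is not MLlib's StreamingLogisticRegressionWithSGD and does not use Spark. The names SgdSketch, Example, step and train, as well as the way the step size is applied, are assumptions made for illustration.

```scala
object SgdSketch {
  // A labeled example: feature vector and label (1.0 = positive, 0.0 = negative).
  final case class Example(features: Array[Double], label: Double)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Predicted probability of the positive class.
  def predict(weights: Array[Double], features: Array[Double]): Double =
    sigmoid(dot(weights, features))

  // One gradient step on the average log-loss of a single mini-batch.
  // An empty batch leaves the weights unchanged.
  def step(weights: Array[Double], batch: Seq[Example], stepSize: Double): Array[Double] =
    if (batch.isEmpty) weights
    else {
      val gradient = new Array[Double](weights.length)
      for (ex <- batch) {
        val error = predict(weights, ex.features) - ex.label
        for (i <- weights.indices) gradient(i) += error * ex.features(i)
      }
      weights.indices.map(i => weights(i) - stepSize * gradient(i) / batch.size).toArray
    }

  // Folds a sequence of mini-batches into the model, one step per batch.
  // In a streaming setting each element would correspond to one RDD of the DStream.
  def train(initial: Array[Double], batches: Seq[Seq[Example]], stepSize: Double): Array[Double] =
    batches.foldLeft(initial)((w, batch) => step(w, batch, stepSize))
}
```

The point of the fold in train is that the model is never retrained from scratch: every incoming mini-batch only nudges the current weights.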

The StatisticsServer receives {Online,Batch}TrainerModel messages and creates performance metrics such as accuracy, area under the ROC curve and so forth, which in turn are forwarded to the subscribed EventListeners and finally sent to the client (browser) via Web Socket.
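
The two metrics named above can be sketched in plain Scala as follows. This is an illustration of the definitions only, not the StatisticsServer's code. The object and method names are hypothetical, labels are assumed to be 1.0 (positive) or 0.0 (negative), and the input is assumed to be non-empty and to contain both classes.

```scala
object MetricsSketch {
  // Fraction of examples whose thresholded score matches the label.
  def accuracy(scoresAndLabels: Seq[(Double, Double)], threshold: Double = 0.5): Double = {
    val correct = scoresAndLabels.count { case (score, label) =>
      val predicted = if (score >= threshold) 1.0 else 0.0
      predicted == label
    }
    correct.toDouble / scoresAndLabels.size
  }

  // Area under the ROC curve, computed as the probability that a randomly
  // chosen positive example receives a higher score than a randomly chosen
  // negative one (ties count as one half). Quadratic in the input size,
  // which is fine for a sketch but not for large data sets.
  def areaUnderRoc(scoresAndLabels: Seq[(Double, Double)]): Double = {
    val positives = scoresAndLabels.collect { case (s, l) if l == 1.0 => s }
    val negatives = scoresAndLabels.collect { case (s, l) if l == 0.0 => s }
    val wins = for (p <- positives; n <- negatives) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (positives.size.toLong * negatives.size)
  }
}
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve only depends on how the scores rank positives against negatives, which is why it is useful to report both.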

The EventListeners are created for each client via the Play Framework's built-in WebSocket.acceptWithActor. EventListeners subscribe to EventServer and StatisticsServer. When a connection terminates (e.g. the browser window is closed), the respective EventListener shuts down and unsubscribes from EventServer and/or StatisticsServer via postStop().

The EventServer is created by the Application controller and forwards event messages (progress of corpus initialization) to the client (also via Web Socket).

spark-mllib-scala-play's People

Contributors

cmacher, nano4711, reneblaim, shokuninsan

spark-mllib-scala-play's Issues

application secret not set

When starting the application I get the exception:

Caused by: play.api.PlayException: Configuration error[Application secret not set]
        at play.api.libs.CryptoConfigParser.get$lzycompute(Crypto.scala:236) ~[com.typesafe.play.play_2.11-2.4.4.jar:2.4.4]
        at play.api.libs.CryptoConfigParser.get(Crypto.scala:203) ~[com.typesafe.play.play_2.11-2.4.4.jar:2.4.4]

We need to add to the documentation that the user needs to set an application secret.

Validate activator performance

Starting the activator ui extends the sbt build process (mainly adding jvm fork configs). This increases the startup time of the application in some cases or even halts the process.

Make UI more descriptive

Ideas:

  • Rename Batch Result and Online Result
  • Add info tooltips or descriptive texts
    • What do the results mean?
    • What do we compare?
    • Meaning of metrics

Extend statistics

The StatisticsServer should remember the pushed statistics. This is needed so the batch statistics aren't lost after the batch trainer is done and also to fill the live chart of the online trainer with statistics that occurred before the user opened the frontend.

SparkException: A master URL must be set in your configuration

After a fresh git clone and switching to the develop branch, I get the following error when I try to access http://localhost:9000/

2015-11-11 10:04:08,019 - [warn] o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-11-11 10:04:08,075 - [error] o.a.s.SparkContext - Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:394) [spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112) [spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext$$FastClassByGuice$$119109b9.newInstance(<generated>) [guice-4.0.jar:1.5.1]
  at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) [guice-4.0.jar:na]
  at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:61) [guice-4.0.jar:na]
2015-11-11 10:04:08,098 - [error] application -

! @6o4ml9c9d - Internal server error, for (GET) [/] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112)
  while locating org.apache.spark.SparkContext
    for parameter 1 at controllers.Application.<init>(Application.scala:23)
  at controllers.Application.class(Application.scala:23)
  while locating controllers.Application

1 error]]
  at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:265) ~[play_2.11-2.4.2.jar:2.4.2]
  at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:191) ~[play_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$class.logExceptionAndGetResult$1(Server.scala:50) [play-server_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$$anonfun$getHandlerFor$4.apply(Server.scala:59) [play-server_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$$anonfun$getHandlerFor$4.apply(Server.scala:57) [play-server_2.11-2.4.2.jar:2.4.2]
Caused by: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112)
  while locating org.apache.spark.SparkContext
    for parameter 1 at controllers.Application.<init>(Application.scala:23)
  at controllers.Application.class(Application.scala:23)
  while locating controllers.Application

1 error
  at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025) ~[guice-4.0.jar:na]
  at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051) ~[guice-4.0.jar:na]
  at play.api.inject.guice.GuiceInjector.instanceOf(GuiceInjectorBuilder.scala:321) ~[play_2.11-2.4.2.jar:2.4.2]
  at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$1$$anonfun$apply$11.apply(Routes.scala:179) ~[na:na]
  at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$1$$anonfun$apply$11.apply(Routes.scala:179) ~[na:na]
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:394) ~[spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112) ~[spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext$$FastClassByGuice$$119109b9.newInstance(<generated>) ~[guice-4.0.jar:1.5.1]
  at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) ~[guice-4.0.jar:na]
  at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:61) ~[guice-4.0.jar:na]

Not able to perform OAuth login

With the current develop branch I cannot log in via OAuth. I get the following error:

2015-12-08 19:26:02,541 - [warn] o.a.h.i.c.DefaultHttpClient - Authentication error: Unable to respond to any of these challenges: {oauth=www-authenticate: OAuth realm="https://api.twitter.com"}
2015-12-08 19:26:02,543 - [error] application -

! @6oce0039p - Internal server error, for (GET) [/authenticate] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[OAuthNotAuthorizedException: Authorization failed (server replied with a 401). This can happen if the consumer key was not correct or the signatures did not match.]]
  at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:265) ~[play_2.11-2.4.4.jar:2.4.4]
  at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:191) ~[play_2.11-2.4.4.jar:2.4.4]
  at play.api.GlobalSettings$class.onError(GlobalSettings.scala:179) [play_2.11-2.4.4.jar:2.4.4]
  at play.api.DefaultGlobal$.onError(GlobalSettings.scala:212) [play_2.11-2.4.4.jar:2.4.4]
  at play.api.http.GlobalSettingsHttpErrorHandler.onServerError(HttpErrorHandler.scala:94) [play_2.11-2.4.4.jar:2.4.4]
Caused by: oauth.signpost.exception.OAuthNotAuthorizedException: Authorization failed (server replied with a 401). This can happen if the consumer key was not correct or the signatures did not match.
  at oauth.signpost.AbstractOAuthProvider.handleUnexpectedResponse(AbstractOAuthProvider.java:243) ~[signpost-core-1.2.1.2.jar:na]
  at oauth.signpost.AbstractOAuthProvider.retrieveToken(AbstractOAuthProvider.java:193) ~[signpost-core-1.2.1.2.jar:na]
  at oauth.signpost.AbstractOAuthProvider.retrieveRequestToken(AbstractOAuthProvider.java:74) ~[signpost-core-1.2.1.2.jar:na]
  at play.api.libs.oauth.OAuth.retrieveRequestToken(OAuth.scala:36) ~[play-ws_2.11-2.4.4.jar:2.4.4]
  at controllers.Twitter$$anonfun$authenticate$1$$anonfun$apply$2.apply(Twitter.scala:38) ~[classes/:na]

Browser testing

Run app through BrowserStack. Application must work in standard browsers.

Compiling the app with Traceur fails on Windows 10

Launching sbt run on Windows (10) fails with:

[info] Compiling with Traceur
Warning: node.js detection failed, sbt will use the Rhino based Trireme JavaScript engine instead to run JavaScript assets compilation, which in some cases may be orders of magnitude slower than using node.js.
[ERROR] [12/11/2015 13:04:57.805] [sbt-web-akka.actor.default-dispatcher-5] [akka://sbt-web/user/$a/process] null
akka.actor.ActorInitializationException: exception during creation
        at akka.actor.ActorInitializationException$.apply(Actor.scala:166)
        at akka.actor.ActorCell.create(ActorCell.scala:596)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at akka.util.Reflect$.instantiate(Reflect.scala:66)
        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
        at akka.actor.Props.newActor(Props.scala:252)
        at akka.actor.ActorCell.newActor(ActorCell.scala:552)
        at akka.actor.ActorCell.create(ActorCell.scala:578)
        ... 7 more
Caused by: java.io.IOException: Cannot run program "node": CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at akka.contrib.process.BlockingProcess.<init>(BlockingProcess.scala:55)
        ... 16 more
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessImpl.create(Native Method)
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
        at java.lang.ProcessImpl.start(ProcessImpl.java:137)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 17 more

OAuth user tokens must not be shared via Director actor

The OAuth mechanism has to be changed so that the user-specific OAuth tokens are not sent to the Director actor, as it is a single-instance actor used by all users. Each user query has to fetch the OAuth tokens stored in the cookie and use them for Twitter authentication.

Add a single user mode to the config to avoid OAuth

I suggest adding a default config for a single-user mode which uses the already required tokens in application.conf for querying Twitter instead of OAuth. That would simplify the scenario where someone just wants to check out the application locally. It would also mean no callback URL has to be configured with Twitter, which would solve issue #21.
For our deployment to a public server we could still set this single user mode flag to false and require OAuth for each user of the application.

sbt run - provisioning exception

Hi,
I just tried to get started with this template. Running sbt run results in the following error:

Execution exception
[ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, java.lang.ExceptionInInitializerError
  at controllers.Twitter.<init>(Twitter.scala:13)
  while locating controllers.Twitter
    for parameter 2 at controllers.Application.<init>(Application.scala:24)
  at controllers.Application.class(Application.scala:24)
  while locating controllers.Application

1 error]
