openforce / spark-mllib-scala-play

Twitter sentiment analysis based on Apache Spark, MLlib, Scala and Akka.

Home Page: http://sentiment.openforce.com

License: Apache License 2.0

Scala 25.22% HTML 16.22% JavaScript 6.73% CSS 3.85% Jupyter Notebook 47.98%

spark-mllib-scala-play's Introduction

Twitter Sentiment Analysis

Typesafe Activator tutorial for Apache Spark, MLlib, Scala, Akka and Play Framework

With this tutorial template we show how to automatically classify the sentiment of Twitter messages leveraging the Typesafe Stack and Apache Spark. These messages are classified as either positive or negative with respect to a query term. Users who want to research the sentiment of products before purchase, or companies that want to monitor the public sentiment of their brands can make use of this kind of application.

The Activator template consists of backend components using Scala, Spark, Akka and the Play Framework in their most recent versions and Polymer for the UI. The main focus of this template is the orchestration of these technologies, illustrated by an example that uses MLlib to classify the sentiment of Twitter messages. If you want to see this template in action, please refer to http://sentiment.openforce.com (you will need a Twitter account to log in).

The fundamental idea of sentiment classification used in this template is based on the paper by Alec Go et al. and its related implementation Sentiment140.

Setup Instructions

Make sure that you have Java 8, either Sbt or Typesafe Activator, and Node.js already installed on your machine. You should have at least two cores available on this machine, since Spark Streaming (used by the OnlineTrainer) will occupy one core. Hence, to be able to process the data, the application needs at least one more core.

Clone this repository: git clone git@github.com:openforce/spark-mllib-scala-play.git
Change into the newly created directory: cd spark-mllib-scala-play
Insert your Twitter access token and consumer key/secret pairs in application.conf. To generate a token, please refer to dev.twitter.com. By default the application runs in single-user mode, which means the access tokens configured in your application.conf (or local.conf) are also used for querying Twitter by keyword. This is fine when you run the application locally and just want to check out the tutorial. Note: to run the application in production mode, turn single-user mode off so that per-user OAuth is used instead. To do so, set twitter.single-user-mode = no in your conf/application.conf. Also make sure to provide an application secret.
Launch with SBT: sbt run, or with Activator: ./activator ui (if you want to start the application as a Typesafe Activator template)
Navigate your browser to: http://localhost:9000
If necessary, change twitter.redirect.url in application.conf to the URL the application actually uses
If necessary (for example, if Twitter changes the URL of its tweet-fetching service), change twitter.fetch.url in application.conf to the new one. Ensure that the last URL parameter is the query string, because the application appends the keyword to the end of the URL.
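The requirement that the query parameter comes last follows from plain string concatenation. The sketch below illustrates this under some assumptions: the example URL value, the object name FetchUrl and the helper withKeyword are hypothetical, and whether the application URL-encodes the keyword is also an assumption rather than something the README states.

```scala
import java.net.URLEncoder

object FetchUrl {
  // Assumed example value for twitter.fetch.url; the real value lives in application.conf.
  val fetchUrl: String = "https://api.twitter.com/1.1/search/tweets.json?q="

  // Appends the keyword to the end of the configured URL. Because the keyword
  // is simply appended, the query parameter has to be the last one in the URL.
  // URL-encoding is an assumption here; it keeps multi-word keywords valid.
  def withKeyword(baseUrl: String, keyword: String): String =
    baseUrl + URLEncoder.encode(keyword, "UTF-8")
}
```

For example, FetchUrl.withKeyword("http://example.org/search?lang=en&q=", "star wars") yields a URL ending in q=star+wars, whereas a base URL ending in "q=&lang=en" would produce a malformed query.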

If starting the application takes a very long time or even times out, it may be due to a known Activator issue. In that case, do the following before starting with sbt run:

Delete the project/sbt-fork-run.sbt file
Remove the line fork in run := true (added automatically when you start activator) from the bottom of build.sbt

Without the fork option, which is needed only by Activator, the application should start within a few seconds.

The Classification Workflow

The following diagram shows the actor communication workflow for classification:

The Application controller serves HTTP requests from the client/browser and obtains ActorRefs for EventServer, StatisticsServer and Director.

The Director is the root of the Actor hierarchy and creates all other durable (long-lived) actors except StatisticsServer and EventServer. Besides supervising its child actors, it builds the bridge between the Play Framework and Akka by handing over the Classifier ActorRefs to the controller. Moreover, when training of the estimators within BatchTrainer and OnlineTrainer has finished, this actor passes the latest machine learning models to the StatisticsServer (see Figure below). For the OnlineTrainer, statistics generation is scheduled every 5 seconds.

The Classifier creates a FetchResponseHandler actor and sends the TwitterHandler a Fetch message (together with the ActorRef of the FetchResponseHandler) to get the latest Tweets for a given keyword or query.

Once the TwitterHandler has fetched some Tweets, the FetchResponse is sent to the FetchResponseHandler.

The FetchResponseHandler creates a TrainingModelResponseHandler actor and tells the BatchTrainer and OnlineTrainer to pass their latest models to the TrainingModelResponseHandler. It registers itself as a monitor for the TrainingModelResponseHandler, and when that actor terminates it stops itself as well.

The TrainingModelResponseHandler collects the models and the vectorized Tweets, makes predictions and sends the results to the original sender (the Application controller). The original sender is passed through the ephemeral (short-lived) actors, indicated by the yellow dotted line in the figure above.
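The collecting step can be pictured without Akka as a small immutable state that only yields predictions once both models have arrived. This is purely an illustrative sketch: in the template this logic lives inside an actor, and the names ModelCollector, Model, State, withBatch, withOnline and predictions are hypothetical.

```scala
object ModelCollector {
  // Hypothetical stand-in for a trained model: feature vector in, predicted label out.
  type Model = Array[Double] => Double

  // State gathered by the response handler: the vectorized tweets plus
  // whichever models have been received so far.
  final case class State(tweets: Seq[Array[Double]],
                         batch: Option[Model] = None,
                         online: Option[Model] = None) {

    def withBatch(m: Model): State = copy(batch = Some(m))

    def withOnline(m: Model): State = copy(online = Some(m))

    // Defined only once both models are present; each pair holds the
    // (batch, online) label predicted for one tweet.
    def predictions: Option[Seq[(Double, Double)]] =
      for (b <- batch; o <- online) yield tweets.map(t => (b(t), o(t)))
  }
}
```

An actor-based version would update this state in its receive block for each incoming model message and reply to the original sender as soon as predictions becomes defined.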

Model Training and Statistics

The following diagram shows the actors involved in training the machine learning estimators and serving statistics about their predictive performance:

The BatchTrainer receives a Train message as soon as a corpus (a collection of labeled Tweets) has been initialized. This corpus is initialized by the CorpusInitializer and can either be created on the fly via Spark's TwitterUtils.createStream (with automatic labeling based on the emoticons ":)" and ":(") or be a static corpus provided by Sentiment140, which is read from a CSV file. Which one to use can be configured via ml.corpus.initialization.streamed in application.conf. For batch training we use the high-level org.apache.spark.ml API. We use grid search with cross-validation to find the best hyperparameters for our LogisticRegression model.
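The automatic labeling by emoticons can be sketched in plain Scala as follows. This is a minimal illustration, not the template's code: the LabeledTweet type, the label encoding (1.0 positive, 0.0 negative), the removal of the emoticon from the text and the decision to skip tweets with mixed emoticons are all assumptions.

```scala
object EmoticonLabeler {
  // Hypothetical minimal tweet type; the template's own Tweet class may differ.
  final case class LabeledTweet(text: String, label: Double)

  private val Positive = ":)"
  private val Negative = ":("

  // Returns a labeled tweet when exactly one kind of emoticon is present,
  // and None when the tweet has no emoticon or mixes both kinds.
  def label(text: String): Option[LabeledTweet] = {
    val pos = text.contains(Positive)
    val neg = text.contains(Negative)
    if (pos == neg) None
    else {
      // Remove the emoticon so a classifier cannot simply learn the label token.
      val cleaned = text.replace(Positive, "").replace(Negative, "").trim
      Some(LabeledTweet(cleaned, if (pos) 1.0 else 0.0))
    }
  }
}
```

Applied to a stream of raw tweet texts, a flatMap over label would keep only the tweets that carry an unambiguous emoticon.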

The OnlineTrainer receives a Train message with a corpus (an RDD[Tweet]) upon successful initialization just like the BatchTrainer. For the online learning approach we use the experimental StreamingLogisticRegressionWithSGD estimator which, as the name implies, uses Stochastic Gradient Descent to update the model continually on each Mini-Batch (RDD) of the DStream created via TwitterUtils.createStream.
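To make the idea of updating a model on each mini-batch concrete, here is a plain-Scala sketch of one gradient step for logistic regression. It is not MLlib's implementation: the object name OnlineLogReg, the averaged gradient and the fixed step size are assumptions chosen for illustration.

```scala
object OnlineLogReg {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One gradient step on a mini-batch of (features, label) pairs with labels in {0, 1}.
  // Returns new weights; an empty batch leaves the weights unchanged.
  def update(weights: Array[Double],
             batch: Seq[(Array[Double], Double)],
             stepSize: Double): Array[Double] =
    if (batch.isEmpty) weights
    else {
      val grad = new Array[Double](weights.length)
      for ((x, y) <- batch) {
        val err = sigmoid(dot(weights, x)) - y
        for (i <- weights.indices) grad(i) += err * x(i)
      }
      // Average the gradient over the batch before applying the step.
      weights.indices.map(i => weights(i) - stepSize * grad(i) / batch.size).toArray
    }

  // Predicts 1.0 when the estimated probability is at least 0.5, else 0.0.
  def predict(weights: Array[Double], x: Array[Double]): Double =
    if (sigmoid(dot(weights, x)) >= 0.5) 1.0 else 0.0
}
```

In a streaming setting, update would be folded over the incoming mini-batches, so the current weights always reflect all data seen so far.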

The StatisticsServer receives {Online,Batch}TrainerModel messages and creates performance metrics like Accuracy, Area under the ROC Curve and so forth which in turn are forwarded to the subscribed EventListeners and finally sent to the client (browser) via Web Socket.
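The two metrics named above can be written down in a few lines of plain Scala. This sketch only illustrates what the numbers mean; it does not show how the StatisticsServer computes them (Spark ships its own evaluation classes, such as BinaryClassificationMetrics). The object and method names are hypothetical, and the pairwise AUC computation is quadratic, so it is only suitable for small samples.

```scala
object Metrics {
  // Fraction of (prediction, label) pairs in which the prediction equals the label.
  def accuracy(predictionsAndLabels: Seq[(Double, Double)]): Double =
    if (predictionsAndLabels.isEmpty) Double.NaN
    else predictionsAndLabels.count { case (p, l) => p == l }.toDouble / predictionsAndLabels.size

  // Area under the ROC curve, computed as the probability that a randomly chosen
  // positive example receives a higher score than a randomly chosen negative one.
  // Ties count as one half. Labels are expected to be 1.0 (positive) or 0.0 (negative).
  def areaUnderRoc(scoresAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoresAndLabels.filter(_._2 == 1.0).map(_._1)
    val neg = scoresAndLabels.filter(_._2 == 0.0).map(_._1)
    if (pos.isEmpty || neg.isEmpty) Double.NaN
    else {
      val wins = for (p <- pos; n <- neg)
        yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
      wins.sum / (pos.size.toDouble * neg.size)
    }
  }
}
```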

The EventListeners are created for each client via the Play Framework's built-in WebSocket.acceptWithActor. EventListeners subscribe to EventServer and StatisticsServer. When the connection terminates (e.g. the browser window is closed), the respective EventListener shuts down and unsubscribes from EventServer and/or StatisticsServer via postStop().

The EventServer is created by the Application controller and forwards event messages (progress of corpus initialization) to the client (also via Web Socket).

spark-mllib-scala-play's Issues

Add more tests

netlib wrapper causes application to halt on non-os x machines

On Windows (10) and some Linux derivatives the application gets stuck when the dependencies on libblas and liblapack are not met. Since OS X ships those with the vecLib framework, there are no issues there.

OAuth callback URL cannot be localhost which conflicts with the default startup host

Activator and Play Framework start the application on localhost:9000 by default. The OAuth configuration for a Twitter app requires a callback URL, which cannot be localhost and has to be 127.0.0.1 instead. This causes a problem when we want to access cookies that are stored under a different host.

Add emoji support in Transformable trait

Add emoji unicodes to emoRepl mapping.

Change components diagram for online learning

The provided diagram is incomplete and somewhat misleading.

Prune obsolete feature branches

Review Akka supervision hierarchy and strategies

Further investigation is required since we just use the defaults until now.

application secret not set

When starting the application I get the exception

Caused by: play.api.PlayException: Configuration error[Application secret not set]
        at play.api.libs.CryptoConfigParser.get$lzycompute(Crypto.scala:236) ~[com.typesafe.play.play_2.11-2.4.4.jar:2.4.4]
        at play.api.libs.CryptoConfigParser.get(Crypto.scala:203) ~[com.typesafe.play.play_2.11-2.4.4.jar:2.4.4]

We need to add to the documentation that the user needs to set an application secret.

Added Github description and link to website

The GitHub project description is missing, which makes the project harder for people to find. Admin rights are needed to add it.

Write tutorials

sbt dist does not include test corpus needed for batch learning

After an sbt dist and installation on a fresh server, all you see is a swarm of exceptions. The test corpus needs to be included in the distribution.

Improve predictive performance of online learning

Validate activator performance

Starting the Activator UI extends the sbt build (mainly by adding JVM fork settings). In some cases this increases the startup time of the application or even halts the process.

Make UI more descriptive

Ideas:

Rename Batch Result and Online Result
Add info tooltips or descriptive texts
- What do the results mean?
- What do we compare?
- Meaning of metrics

Change favicon

Use openForce favicon instead of default.

Cleanup data folder

Remove obsolete files and directories.

Extend statistics

The StatisticsServer should remember the pushed statistics. This is needed so that the batch statistics aren't lost after the batch trainer is done, and also to fill the live chart of the online trainer with statistics that occurred before the user opened the frontend.

SparkException: A master URL must be set in your configuration

After a fresh git clone and switching to the develop branch, I get the following error when I try to access http://localhost:9000/

2015-11-11 10:04:08,019 - [warn] o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-11-11 10:04:08,075 - [error] o.a.s.SparkContext - Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:394) [spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112) [spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext$$FastClassByGuice$$119109b9.newInstance(<generated>) [guice-4.0.jar:1.5.1]
  at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) [guice-4.0.jar:na]
  at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:61) [guice-4.0.jar:na]
2015-11-11 10:04:08,098 - [error] application -

! @6o4ml9c9d - Internal server error, for (GET) [/] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112)
  while locating org.apache.spark.SparkContext
    for parameter 1 at controllers.Application.<init>(Application.scala:23)
  at controllers.Application.class(Application.scala:23)
  while locating controllers.Application

1 error]]
  at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:265) ~[play_2.11-2.4.2.jar:2.4.2]
  at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:191) ~[play_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$class.logExceptionAndGetResult$1(Server.scala:50) [play-server_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$$anonfun$getHandlerFor$4.apply(Server.scala:59) [play-server_2.11-2.4.2.jar:2.4.2]
  at play.core.server.Server$$anonfun$getHandlerFor$4.apply(Server.scala:57) [play-server_2.11-2.4.2.jar:2.4.2]
Caused by: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112)
  while locating org.apache.spark.SparkContext
    for parameter 1 at controllers.Application.<init>(Application.scala:23)
  at controllers.Application.class(Application.scala:23)
  while locating controllers.Application

1 error
  at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025) ~[guice-4.0.jar:na]
  at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051) ~[guice-4.0.jar:na]
  at play.api.inject.guice.GuiceInjector.instanceOf(GuiceInjectorBuilder.scala:321) ~[play_2.11-2.4.2.jar:2.4.2]
  at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$1$$anonfun$apply$11.apply(Routes.scala:179) ~[na:na]
  at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$1$$anonfun$apply$11.apply(Routes.scala:179) ~[na:na]
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:394) ~[spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:112) ~[spark-core_2.11-1.5.1.jar:1.5.1]
  at org.apache.spark.SparkContext$$FastClassByGuice$$119109b9.newInstance(<generated>) ~[guice-4.0.jar:1.5.1]
  at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) ~[guice-4.0.jar:na]
  at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:61) ~[guice-4.0.jar:na]

Extend setup instructions in README

The setup instructions in the README do not point out the required OAuth settings on apps.twitter.com.

Not able to perform OAuth login

With the current develop branch I cannot log in via OAuth. I get the following error:

2015-12-08 19:26:02,541 - [warn] o.a.h.i.c.DefaultHttpClient - Authentication error: Unable to respond to any of these challenges: {oauth=www-authenticate: OAuth realm="https://api.twitter.com"}
2015-12-08 19:26:02,543 - [error] application -

! @6oce0039p - Internal server error, for (GET) [/authenticate] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[OAuthNotAuthorizedException: Authorization failed (server replied with a 401). This can happen if the consumer key was not correct or the signatures did not match.]]
  at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:265) ~[play_2.11-2.4.4.jar:2.4.4]
  at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:191) ~[play_2.11-2.4.4.jar:2.4.4]
  at play.api.GlobalSettings$class.onError(GlobalSettings.scala:179) [play_2.11-2.4.4.jar:2.4.4]
  at play.api.DefaultGlobal$.onError(GlobalSettings.scala:212) [play_2.11-2.4.4.jar:2.4.4]
  at play.api.http.GlobalSettingsHttpErrorHandler.onServerError(HttpErrorHandler.scala:94) [play_2.11-2.4.4.jar:2.4.4]
Caused by: oauth.signpost.exception.OAuthNotAuthorizedException: Authorization failed (server replied with a 401). This can happen if the consumer key was not correct or the signatures did not match.
  at oauth.signpost.AbstractOAuthProvider.handleUnexpectedResponse(AbstractOAuthProvider.java:243) ~[signpost-core-1.2.1.2.jar:na]
  at oauth.signpost.AbstractOAuthProvider.retrieveToken(AbstractOAuthProvider.java:193) ~[signpost-core-1.2.1.2.jar:na]
  at oauth.signpost.AbstractOAuthProvider.retrieveRequestToken(AbstractOAuthProvider.java:74) ~[signpost-core-1.2.1.2.jar:na]
  at play.api.libs.oauth.OAuth.retrieveRequestToken(OAuth.scala:36) ~[play-ws_2.11-2.4.4.jar:2.4.4]
  at controllers.Twitter$$anonfun$authenticate$1$$anonfun$apply$2.apply(Twitter.scala:38) ~[classes/:na]

Refactor Transformable trait to be more composable

Make functions like tokenizeSentence and unigramsAndBigrams more composable.
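One way to read "more composable" is to expose each step as a function value so that steps can be chained with andThen. The sketch below assumes signatures for tokenizeSentence and unigramsAndBigrams (only the names come from this issue), uses a deliberately naive whitespace tokenizer, and introduces the hypothetical names TextPipeline and features.

```scala
object TextPipeline {
  // Assumed signature: lower-cases a sentence and splits it on whitespace.
  val tokenizeSentence: String => Seq[String] =
    sentence => sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Assumed signature: returns all tokens followed by all adjacent token pairs.
  val unigramsAndBigrams: Seq[String] => Seq[String] =
    tokens => tokens ++ tokens.sliding(2).collect { case Seq(a, b) => s"$a $b" }

  // Because both steps are plain functions, they compose into a single feature extractor.
  val features: String => Seq[String] = tokenizeSentence andThen unigramsAndBigrams
}
```

With this shape, further steps such as emoticon replacement could be inserted into the chain without touching the existing functions.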

Browser testing

Run app through BrowserStack. Application must work in standard browsers.

Add OAuth for Twitter

Generating keys/tokens for this application is tedious.

Prune public/javascripts

Public assets folder contains unused components.

Incompatible path for local maven repository in build.sbt

The path for the local Maven repository is not compatible with Windows machines.

Multiple query terms lead to timeouts

Querying for a multi-word term, for instance "star wars", returns no results for either the online or the batch model.

Deploy app at sentiment.openforce.com

@nano4711 discovered some issues with the deployment and startup of this app on Debian Jessie. We need to investigate this further.

Compiling the app with Traceur fails on Windows 10

Launching sbt run on Windows (10) fails with:

[info] Compiling with Traceur
Warning: node.js detection failed, sbt will use the Rhino based Trireme JavaScript engine instead to run JavaScript assets compilation, which in some cases may be orders of magnitude slower than using node.js.
[ERROR] [12/11/2015 13:04:57.805] [sbt-web-akka.actor.default-dispatcher-5] [akka://sbt-web/user/$a/process] null
akka.actor.ActorInitializationException: exception during creation
        at akka.actor.ActorInitializationException$.apply(Actor.scala:166)
        at akka.actor.ActorCell.create(ActorCell.scala:596)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at akka.util.Reflect$.instantiate(Reflect.scala:66)
        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
        at akka.actor.Props.newActor(Props.scala:252)
        at akka.actor.ActorCell.newActor(ActorCell.scala:552)
        at akka.actor.ActorCell.create(ActorCell.scala:578)
        ... 7 more
Caused by: java.io.IOException: Cannot run program "node": CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at akka.contrib.process.BlockingProcess.<init>(BlockingProcess.scala:55)
        ... 16 more
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessImpl.create(Native Method)
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
        at java.lang.ProcessImpl.start(ProcessImpl.java:137)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 17 more

OAuth user tokens must not be shared via Director actor

The OAuth mechanism has to be changed so that the user-specific OAuth tokens are not sent to the Director actor, as it is a single-instance actor shared by all users. Each user query has to fetch the OAuth tokens stored in the cookie and use them for Twitter authentication.

Add a single user mode to the config to avoid OAuth

I would suggest adding a default config for a single-user mode which uses the tokens already required in application.conf for querying Twitter instead of OAuth. That would simplify the scenario where someone just wants to check out the application locally. It would also mean that no callback URL has to be configured with Twitter, which would solve issue #21.
For our deployment to a public server we could still set this single-user mode flag to false and require OAuth for each user of the application.

sbt run - provisioning exception

Hi,
I just tried to get started with this template. Running sbt run results in the following error:

Execution exception
[ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, java.lang.ExceptionInInitializerError
  at controllers.Twitter.<init>(Twitter.scala:13)
  while locating controllers.Twitter
    for parameter 2 at controllers.Application.<init>(Application.scala:24)
  at controllers.Application.class(Application.scala:24)
  while locating controllers.Application

1 error]

Review of the activator tutorial

It would assist us greatly to get feedback on the tutorials from people who were not involved with them.