
spark-recommendation-engine's Introduction

Recommendation on Google Cloud Platform

This tutorial shows how to run the code explained in the solution paper Recommendation Engine on Google Cloud Platform. To run this example end to end, you will need several Cloud Platform components.

Disclaimer: This is not an official Google product.

Setting up

Before you begin

This tutorial assumes that you have a Cloud Platform project. To set up a project:

  1. In the Cloud Platform Console, go to the Projects page.
  2. Select a project, or click Create Project to create a new Cloud Platform Console project.
  3. In the dialog, name your project. Make a note of your generated project ID.
  4. Click Create to create a new project.

The main steps involved in this example are:

  1. Set up a Spark cluster.
  2. Set up a simple Google App Engine website.
  3. Create a Cloud SQL database with an accommodation table, a rating table and a recommendation table.
  4. Run a Python script on the Spark cluster to find the best model.
  5. Run a Python script that makes predictions using the best model.
  6. Save the predictions into Cloud SQL so the user can see them when displaying the welcome page.

The recommended approach to running Spark on Google Cloud Platform is to use Google Cloud Dataproc. Dataproc is a managed service that handles common tasks such as cluster deployment, job submission, cluster scaling, and node monitoring. You can interact with Dataproc through a UI, the CLI, or an API.

Setup using Cloud Dataproc

Set up a cluster with the default parameters, as explained in the Cloud Dataproc documentation on how to create a cluster. Cloud Dataproc does not require you to set up the JDBC connector.
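If you prefer the CLI over the Cloud Console, a default cluster can be created with a command along the lines of the following (a minimal sketch: the cluster name and region are placeholders, and the available flags may vary with your gcloud version):

$ gcloud dataproc clusters create <YOUR_DATAPROC_CLUSTER_NAME> \
  --region <YOUR_REGION> \
  --num-workers 2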

Create and Configure Google Cloud SQL (First Generation) Access

Follow these instructions to create a Cloud SQL instance. We will use a Cloud SQL First Generation instance in this example. To make sure that your Spark cluster can access your Cloud SQL database, you must:

  • Whitelist the IPs of the nodes as explained in the Cloud SQL documentation. You can find the instances' IPs by going to Compute Engine -> VM Instances in the Cloud Console. There you should see a number of instances (depending on your cluster size) with names like cluster-m and cluster-w-i, where cluster is the name of your cluster and i is a worker index.
  • Create an IPv4 address so the Cloud SQL instance can be accessed through the network.
  • Create a non-root user account. Make sure that this user account can connect from the IPs corresponding to the Dataproc cluster, not just localhost. See the sketch after this list.
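If you want to script these steps, the following gcloud commands show the general idea (a sketch only: the exact flags, especially for First Generation instances, may differ across gcloud versions, so treat this as an outline rather than the documented procedure):

# Authorize the cluster nodes' external IPs (comma-separated list of the IPs found above).
$ gcloud sql instances patch <YOUR_CLOUDSQL_INSTANCE_NAME> \
  --authorized-networks <CLUSTER_M_IP>,<CLUSTER_W_0_IP>,<CLUSTER_W_1_IP>

# Create a non-root user allowed to connect from any host, not just localhost.
$ gcloud sql users create <YOUR_CLOUDSQL_USER> \
  --instance <YOUR_CLOUDSQL_INSTANCE_NAME> \
  --password <YOUR_CLOUDSQL_PASSWORD> \
  --host %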

Example data

Cloud SQL Data

After you create and connect to an instance, you need to create some tables and load data into some of them by following these steps:

  1. Connect to your project Cloud SQL instance through the Cloud Console.
  2. Create the database and tables as explained here, using the provided SQL script. In the Cloud Storage file input, enter solutions-public-assets/recommendation-spark/sql/table_creation.sql. If the UI says that you have no access, you can instead copy the file to your own bucket: in a Cloud Shell window or in a terminal, type gsutil cp gs://solutions-public-assets/recommendation-spark/sql/table_creation.sql gs://<your-bucket>/recommendation-spark/sql/table_creation.sql. Then, in the Cloud SQL import window, provide <your-bucket>/recommendation-spark/sql/table_creation.sql (that is, the path to your copy of the file on Cloud Storage, without the gs:// prefix).
  3. In the same way, populate the Accommodation and Rating tables using the provided accommodations.csv and ratings.csv. A CLI alternative is sketched below.

Accommodation import screenshot
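As a CLI alternative to the Cloud Console import window, commands along these lines should achieve the same result (a sketch only: it assumes you have copied the files to your own bucket as described above, the bucket paths are placeholders, and gcloud sql import flags may vary by version):

$ gcloud sql import sql <YOUR_CLOUDSQL_INSTANCE_NAME> \
  gs://<your-bucket>/recommendation-spark/sql/table_creation.sql \
  --database recommendation_spark

$ gcloud sql import csv <YOUR_CLOUDSQL_INSTANCE_NAME> \
  gs://<your-bucket>/recommendation-spark/data/accommodations.csv \
  --database recommendation_spark --table Accommodation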

Renting Website

The appengine folder contains a simple HTML website built with Python on App Engine using Angular Material. Deploying this website is not required, but it can give you an idea of what a recommendation display could look like in a production environment.

From the appengine folder you can run

gcloud app deploy --project [YOUR-PROJECT-ID]

Make sure to update the database values in the main.py file to match your setup. If you kept the values from the .sql script, _DB_NAME = 'recommendation_spark'. The rest will be specific to your setup.
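For reference, the database configuration in main.py looks roughly like the following (a sketch only: apart from _DB_NAME, which comes from the provided .sql script, the variable names and values below are illustrative placeholders and must be matched against the actual file):

# Illustrative settings; check main.py for the real variable names.
_INSTANCE_NAME = '<YOUR_PROJECT_ID>:<YOUR_CLOUDSQL_INSTANCE_NAME>'  # placeholder
_DB_NAME = 'recommendation_spark'  # matches the provided table_creation.sql
_USER = '<YOUR_CLOUDSQL_USER>'  # placeholder
_PASSWD = '<YOUR_CLOUDSQL_PASSWORD>'  # placeholder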

You can find some accommodation images here. Upload the individual files to your own bucket and change their ACL to public so they can be served. Remember to replace <YOUR_IMAGE_BUCKET> in the appengine/app/templates/welcome.html page with your bucket.
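Assuming you have downloaded the images to a local folder (the folder name below is just a placeholder), the upload and the ACL change can be done with gsutil:

$ gsutil -m cp ./accommodation-images/* gs://<YOUR_IMAGE_BUCKET>/
$ gsutil -m acl ch -u AllUsers:R gs://<YOUR_IMAGE_BUCKET>/*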

Recommendation scripts

The main part of this solution paper is explained on the Cloud Platform solution page. In the pyspark folder, you will find the scripts mentioned in the solution paper: find_model_collaborative.py and app_collaborative.py.

Both scripts should be run on a Spark cluster. This can be done by using Cloud Dataproc.

Find the best model

Dataproc already has the MySQL connector enabled, so there is no need to set it up.

The easiest way is to use the Cloud Console and run the script directly from a remote location (Cloud Storage for example). See the documentation.

Submit Find Model

It is also possible to run the following command from a local computer:

$ gcloud dataproc jobs submit pyspark \
  --cluster <YOUR_DATAPROC_CLUSTER_NAME> \
  find_model_collaborative.py \
  -- <YOUR_CLOUDSQL_INSTANCE_IP> \
  <YOUR_CLOUDSQL_DB_NAME> \
  <YOUR_CLOUDSQL_USER> \
  <YOUR_CLOUDSQL_PASSWORD>

Use the best found parameters

The script above returns the combination of the best parameters for the ALS training, as explained in the Training the model part of the solution article. The result is displayed in the console in the following format, where Dist represents how far the predicted values are from the known values. The result might not feel satisfying, but remember that the training dataset was quite small.

Submit Find Model

After you have those values, you can reuse them when calling the recommendation script.

# Build our model with the best found values
model = ALS.train(rddTraining, BEST_RANK, BEST_ITERATION, BEST_REGULATION)

Where, in our current case, BEST_RANK=15, BEST_ITERATION=20, BEST_REGULATION=0.1.
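To put those values in context, the training step looks roughly like the following (a minimal sketch rather than the repository's exact code: the JDBC placeholders and the Rating column names userId, accoId, and rating are assumptions based on this tutorial's schema):

# Sketch of training an ALS model with the best found parameters.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.recommendation import ALS

BEST_RANK, BEST_ITERATION, BEST_REGULATION = 15, 20, 0.1

sc = SparkContext(appName="app_collaborative")
sqlContext = SQLContext(sc)

# JDBC URL built from the script arguments (placeholders here).
jdbcUrl = ('jdbc:mysql://<YOUR_CLOUDSQL_INSTANCE_IP>:3306/<YOUR_CLOUDSQL_DB_NAME>'
           '?user=<YOUR_CLOUDSQL_USER>&password=<YOUR_CLOUDSQL_PASSWORD>')

# Read the Rating table and keep (user, item, rating) triples for training.
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')
rddTraining = dfRates.rdd.map(lambda r: (int(r.userId), int(r.accoId), float(r.rating)))

# Build our model with the best found values
model = ALS.train(rddTraining, BEST_RANK, BEST_ITERATION, BEST_REGULATION)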

Make the prediction

Run the app_collaborative.py file with the updated values as you did before. The code makes predictions and saves the top 5 expected ratings in Cloud SQL, so you can look at the results later.

You can use the Cloud Console, as explained before, which is equivalent to running the following command from your local computer.

$ gcloud dataproc jobs submit pyspark \
  --cluster <YOUR_DATAPROC_CLUSTER_NAME> \
  app_collaborative.py \
  -- <YOUR_CLOUDSQL_INSTANCE_IP> \
  <YOUR_CLOUDSQL_DB_NAME> \
  <YOUR_CLOUDSQL_USER> \
  <YOUR_CLOUDSQL_PASSWORD> \
  <YOUR_BEST_RANK> \
  <YOUR_BEST_ITERATION> \
  <YOUR_BEST_REGULATION>

Results

See in the console

The code posted in GitHub prints the top 5 predictions. You should see something similar to a list of tuples, including userId, accoId, and prediction:

[('0', '75', 4.6428704512729375), ('0', '76', 4.54325166163637), ('0', '86', 4.529177571208829), ('0', '66', 4.52387350189572), ('0', '99', 4.44705391172443)]

Display the top recommendations saved in the Database

To access a MySQL CLI easily, you can use Cloud Shell: type the following command, then enter your database password.

gcloud sql connect <YOUR_CLOUDSQL_INSTANCE_NAME>  --user=<YOUR_CLOUDSQL_USER>

Running the following SQL query on the database will return the predictions saved in the Recommendation table by app_collaborative.py:

SELECT
  id, title, type, r.prediction
FROM
  Accommodation a
INNER JOIN
  Recommendation r
ON
  r.accoId = a.id
WHERE
  r.userId = <USER_ID>
ORDER BY
  r.prediction DESC

spark-recommendation-engine's People

Contributors

jimtravis, jphalip, m-mayran, schneiderl, shaguftamethwani, vadimberezniker


spark-recommendation-engine's Issues

get_recommendations error

I get a 500 error on get_recommendations. Apparently I am missing something but don't have a friggin clue what it is. Any thoughts that could help point me in the right direction?

find_model_collaborative.py job failed

Initially, the job run on Dataproc failed on line 60 of find_model_collaborative.py because the 'load' call was deprecated. Instead, I changed that line to:
dfRates = sqlContext.read.load(source='jdbc', driver=jdbcDriver, url=jdbcUrl, dbtable='Rating')

I ran the job again, and it came back with another error, "java.util.NoSuchElementException: key not found: path", in the call stack:
16/04/05 19:30:01 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
16/04/05 19:30:01 INFO Remoting: Starting remoting
16/04/05 19:30:01 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:33193]
16/04/05 19:30:01 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/04/05 19:30:01 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:4040
16/04/05 19:30:02 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at producthunter-cluster-1-m/xx.xxx.x.x:8032
16/04/05 19:30:05 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1459820580219_0002
Traceback (most recent call last):
File "/tmp/0df71cfb-bf81-4fff-9838-2d5c8b87db47/find_model_collaborative.py", line 60, in
dfRates = sqlContext.read.load(source='jdbc', driver=jdbcDriver, url=jdbcUrl, dbtable='Rating')
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 139, in load
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.load.
: java.util.NoSuchElementException: key not found: path
at scala.collection.MapLike$class.default(MapLike.scala:228)
at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:150)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:150)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:168)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$10.apply(ResolvedDataSource.scala:168)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:168)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

16/04/05 19:30:10 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/05 19:30:10 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

Unable to connect to Cloud SQL [com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure]

Following these steps: https://github.com/GoogleCloudPlatform/spark-recommendation-engine/blob/master/README.md

but stuck at connecting Dataproc cluster to Cloud SQL

  • Cloud SQL (2nd generation)
  • Dataproc
  • Both are within the same Google Cloud project, therefore I believe there is no need to whitelist the Dataproc instance IPs in the Cloud SQL authorized networks (?!), but I still did that...

Any clue will be highly appreciated!

Cheers!

Here's the trace:

18/05/21 01:53:12 INFO org.spark_project.jetty.util.log: Logging initialized @5714ms
18/05/21 01:53:12 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/05/21 01:53:12 INFO org.spark_project.jetty.server.Server: Started @5792ms
18/05/21 01:53:12 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@5343094c{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/05/21 01:53:12 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.5-hadoop2
18/05/21 01:53:13 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at vj-test-cluster-m/10.164.0.3:8032
18/05/21 01:53:17 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1526866161871_0001
Traceback (most recent call last):
File "/tmp/job-19c3c49f/find_model_collaborative.py", line 59, in
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 475, in jdbc
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.jdbc.
: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

Links to Cloud SQL docs need to be updated

For "Whitelist the IPs of the nodes as explained in the Cloud SQL documentation." and "Create an IPv4 address so the Cloud SQL instance can be accessed through the network.", replace

/sql/docs/access-control#appaccess

with

/sql/docs/external#appaccessIP

Unable to connect with Cloud SQL instance

Can anyone suggest any solution?

dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')

com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

app_collaborative.py doesn't recommend anything

Hi,
I've followed the guidelines in the readme file and completed the steps below:

  1. Executed find_model_collaborative.py, which returned the following results:
Rank 10
Regul 1.000000
Iter 20
Dist 7.302938
  2. Then I executed the app_collaborative.py file and gave the following inputs, but it doesn't generate a recommendation. Instead, it prints out an empty DataFrame. What should I do to fix that?

Input Arguments:

35.187.44.98
Livy
root
xxxxxxx
20
20
1.0

Output:

18/04/24 14:00:20 INFO org.spark_project.jetty.util.log: Logging initialized @3946ms
18/04/24 14:00:20 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/04/24 14:00:20 INFO org.spark_project.jetty.server.Server: Started @4133ms
18/04/24 14:00:21 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@c151b06{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/04/24 14:00:21 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.4-hadoop2
18/04/24 14:00:23 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at livy-bigdata-cluster-m/10.132.0.4:8032
18/04/24 14:00:26 WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:973)
	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:624)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801)
18/04/24 14:00:26 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1524149156133_0043
18/04/24 14:00:34 WARN org.apache.spark.SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory 'checkpoint/' appears to be on the local filesystem.
Tue Apr 24 14:00:36 UTC 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Tue Apr 24 14:00:37 UTC 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
[]
Start predicting

[Stage 21:>                                                         (0 + 1) / 2]
[Stage 21:=============================>                            (1 + 1) / 2]
                                                                                

[Stage 27:>                                                         (0 + 0) / 2]
                                                                                
18/04/24 14:00:54 WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:973)
	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:624)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801)
Model Created
Predictions made
Scheme is Ready
+------+---------+----------+
|userId|companyId|prediction|
+------+---------+----------+
+------+---------+----------+

Tue Apr 24 14:00:58 UTC 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Tue Apr 24 14:00:59 UTC 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Written to DB
18/04/24 14:00:59 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@c151b06{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Job output is complete

com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

Hi,
I'm following each step to configure the Spark recommendation engine on Google Cloud. The steps below are completed:

  • Created a Cloud Platform project.
  • Created a cluster in Dataproc.
  • Created an instance in Google Cloud SQL (First Generation).
  • Created the database and tables in Cloud SQL (MySQL).

When I try to find the best model through Dataproc, I get a JDBC connection error:

File "/tmp/433d1223-b139-4b58-8749-30bbf193eed6/find_model_collaborative.py", line 59, in
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 475, in jdbc
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
**py4j.protocol.Py4JJavaError: An error occurred while calling o44.jdbc.

: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure**

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:632)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1016)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
at com.mysql.jdbc.ConnectionImpl.(ConnectionImpl.java:779)
at com.mysql.jdbc.JDBC4Connection.(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:567)
... 33 more

Please help me figure out how to fix this problem.

Facing Issues

Hi

While trying to execute the program app_collaborative.py, we have been facing an issue. We are executing the program through the command line using gcloud. The following are the command-line parameters:

901C502984LM:pyspark raosa$ gcloud beta dataproc jobs submit pyspark --cluster cluster-1 app_collaborative.py 173.194.229.149 recommendation_spark recommender password 15 20 0.1.

These were also the parameters given in the console.

The code errors out with the following message:

Traceback (most recent call last):
File "/tmp/07e42a82-714f-4200-bddf-976bbdc467f3/app_collaborative.py", line 83, in
topPredictions = predictions.takeOrdered(5, key=lambda x: -x[2])

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.StackOverflowError

The configuration of the cluster is:

Name: cluster-1
Zone: asia-east1-b
Master node:
  Machine type: n1-highcpu-4 (4 vCPU, 3.60 GB memory)
  Primary disk size: 500 GB
Worker nodes: 2
  Machine type: n1-standard-2 (2 vCPU, 7.50 GB memory)
  Primary disk size: 500 GB

Could you help get this resolved?

Thanks & Regards

'SQLContext' object has no attribute 'load'

I followed the Dataproc setup and got all the way to the job submission. It looks like the code didn't work. I would appreciate some advice on where to go from here.

16/12/05 21:17:27 INFO org.spark_project.jetty.util.log: Logging initialized @4624ms
16/12/05 21:17:27 INFO org.spark_project.jetty.server.Server: jetty-9.2.z-SNAPSHOT
16/12/05 21:17:27 INFO org.spark_project.jetty.server.ServerConnector: Started ServerConnector@4a0faf37{HTTP/1.1}{0.0.0.0:4040}
16/12/05 21:17:27 INFO org.spark_project.jetty.server.Server: Started @4829ms
16/12/05 21:17:28 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.5.5-hadoop2
16/12/05 21:17:29 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.128.0.8:8032
16/12/05 21:17:32 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1480971590242_0001
Traceback (most recent call last):
  File "/tmp/f5f5259b-be77-4ede-8b73-93ad71957870/find_model_collaborative.py", line 60, in <module>
    dfRates = sqlContext.load(source='jdbc', driver=jdbcDriver, url=jdbcUrl, dbtable='Rating')
AttributeError: 'SQLContext' object has no attribute 'load'
16/12/05 21:17:43 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@4a0faf37{HTTP/1.1}{0.0.0.0:4040}

Configuration:
Cluster: cluster-1
Job type: PySpark
Python file: gs://sps-test-bucket/find_model_collaborative.py
Additional python files:
Jar files:
Arguments:
173.194.82.106
mysqlinstance
test
test

py4j.protocol.Py4JJavaError:

Hi,
I'm following the tutorial mentioned at https://github.com/GoogleCloudPlatform/spark-recommendation-engine.
During the "Find the best model" step (using bdutil), I'm getting the following error. I'm able to connect to Cloud SQL and have created all the necessary tables mentioned in the link above.

gcloud beta sql connect unilogreco --user=root
I was able to connect using the above command

/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/bin/spark-submit --driver-class-path /home/jeevitesh_ms/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar --jars /home/jeevitesh_ms/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar find_model_collaborative.py 172.18.0.1 unilogreco root unilog123

When I check, the error is: py4j.protocol.Py4JJavaError: An error occurred while calling o29.jdbc.
: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

I'm not finding a relevant solution on Google, so kindly help me resolve the issue.

Snippet of error log
17/10/24 20:06:19 INFO SharedState: Warehouse path is 'file:/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/spark-warehouse/'.
17/10/24 20:06:21 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/find_model_collaborative.py", line 59, in
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 475, in jdbc
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o29.jdbc.
: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Complete Log

jeevitesh_ms@testmlrecommendation:~/spark-recommendation-engine-master/pyspark$ /home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/bin/spark-submit --driver-class-path /home/jeevitesh_ms/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar --jars /home/jeevitesh_ms/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar find_model_collaborative.py 35.202.220.185 unilogreco root unilog123
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/24 20:57:48 INFO SparkContext: Running Spark version 2.2.0
17/10/24 20:57:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/24 20:57:48 INFO SparkContext: Submitted application: app_collaborative
17/10/24 20:57:48 INFO SecurityManager: Changing view acls to: jeevitesh_ms
17/10/24 20:57:48 INFO SecurityManager: Changing modify acls to: jeevitesh_ms
17/10/24 20:57:48 INFO SecurityManager: Changing view acls groups to:
17/10/24 20:57:48 INFO SecurityManager: Changing modify acls groups to:
17/10/24 20:57:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jeevitesh_ms); groups with view permissions: Set(); users with modify permissions: Set(jeevitesh_ms); groups with modify permissions: Set()
17/10/24 20:57:49 INFO Utils: Successfully started service 'sparkDriver' on port 56068.
17/10/24 20:57:49 INFO SparkEnv: Registering MapOutputTracker
17/10/24 20:57:49 INFO SparkEnv: Registering BlockManagerMaster
17/10/24 20:57:49 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/10/24 20:57:49 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/10/24 20:57:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aff00f6d-085c-4b22-b853-8a122f59dab1
17/10/24 20:57:49 INFO MemoryStore: MemoryStore started with capacity 413.9 MB
17/10/24 20:57:49 INFO SparkEnv: Registering OutputCommitCoordinator
17/10/24 20:57:49 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/24 20:57:49 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.17.0.2:4040
17/10/24 20:57:49 INFO SparkContext: Added JAR file:/home/jeevitesh_ms/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar at spark://172.17.0.2:56068/jars/mysql-connector-java-5.1.44-bin.jar with timestamp 1508858869966
17/10/24 20:57:50 INFO SparkContext: Added file file:/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/find_model_collaborative.py at spark://172.17.0.2:56068/files/find_model_collaborative.py with timestamp 1508858870271
17/10/24 20:57:50 INFO Utils: Copying /home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/find_model_collaborative.py to /tmp/spark-b4e50762-1ad8-40bd-afcc-7506d7001573/userFiles-33652ea8-5749-42da-b9b5-ff92f2951c3b/find_model_collaborative.py
17/10/24 20:57:50 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.17.0.2:7077...
17/10/24 20:57:50 INFO TransportClientFactory: Successfully created connection to /172.17.0.2:7077 after 28 ms (0 ms spent in bootstraps)
17/10/24 20:57:50 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171024205750-0007
17/10/24 20:57:50 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52228.
17/10/24 20:57:50 INFO NettyBlockTransferService: Server created on 172.17.0.2:52228
17/10/24 20:57:50 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/10/24 20:57:50 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.17.0.2, 52228, None)
17/10/24 20:57:50 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.2:52228 with 413.9 MB RAM, BlockManagerId(driver, 172.17.0.2, 52228, None)
17/10/24 20:57:50 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.17.0.2, 52228, None)
17/10/24 20:57:50 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.17.0.2, 52228, None)
17/10/24 20:57:51 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/app-20171024205750-0007
17/10/24 20:57:51 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/10/24 20:57:51 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/spark-warehouse/').
17/10/24 20:57:51 INFO SharedState: Warehouse path is 'file:/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/spark-warehouse/'.
17/10/24 20:57:52 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "/home/jeevitesh_ms/spark-recommendation-engine-master/pyspark/find_model_collaborative.py", line 59, in
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table='Rating')
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 475, in jdbc
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/jeevitesh_ms/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o29.jdbc.
: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
    at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:341)
    at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2189)
    at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2222)
    at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2017)
    at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
    at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
    at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
