goldshtn / spark-workshop
Labs and data files for a full-day Spark workshop
License: MIT License
There's no reason to suffer through all the aggregations when DataFrames are becoming more and more prevalent. At least point out that the DataFrame version is available, and have attendees experiment with it -- even if it's before we had a chance to teach SparkSQL.
topFive = sorted(enumerate(similarities.collect()), key=lambda (k, v): -v)[0:5]
Near this particular line, PySpark reports a 5063 error.
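Separately from the reported error, note that the `lambda (k, v): ...` form is Python 2 only; Python 3 removed tuple parameters. Below is a minimal sketch of an equivalent sort that runs on Python 3. The `scores` list is made-up data standing in for `similarities.collect()`, and whether this syntax issue is related to the reported error is not established here.

```python
# Hypothetical similarity scores standing in for similarities.collect().
scores = [0.12, 0.87, 0.45, 0.99, 0.30, 0.64, 0.05]

# Python 3 does not allow `lambda (k, v): -v`, so index into the (index, value) pair.
top_five = sorted(enumerate(scores), key=lambda pair: -pair[1])[:5]

for index, value in top_five:
    print(index, value)
```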
Rework the course structure so that it puts the emphasis on the DataFrame / Dataset API first, and covers RDDs as an implementation detail.
Add an intro that highlights key concepts in ML and a few comments on some algorithms that Spark has to offer. Then we can do labs on:
Ideally a simple bash script that installs dependencies, downloads and extracts Spark, Zeppelin, and the course data files, and sets everything up so that workshop attendees can start being productive right away. If necessary, it can assume Ubuntu 14.04 as the base image.
NOTE: Some Python labs depend on external modules that need to be brought in through easy_install.
The SparkSQL examples can be adjusted to also cover the DataFrame API without writing SQL statements. The same applies to UDF. For example:
compensation = udf(lambda delay: 0 if delay < 15 else delay * 10)
df.select(..., compensation(df['ArrDelay'])).groupBy("carrier")...
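A slightly fuller sketch of the same idea, for reference. It assumes a DataFrame with numeric `ArrDelay` and `carrier` columns. The helper names `compensation_for` and `average_compensation_by_carrier`, the `None` handling, and the choice of `avg` as the aggregation are illustrative assumptions, not part of the lab. Declaring `IntegerType()` is also an addition: without an explicit return type, `udf` defaults to a string column.

```python
def compensation_for(delay):
    """Rule from the example: nothing under 15 minutes, otherwise 10 per minute."""
    if delay is None:  # NULL column values reach a Python UDF as None
        return None
    return 0 if delay < 15 else int(delay) * 10


def average_compensation_by_carrier(df):
    """Apply the rule via the DataFrame API, with no SQL statement.

    `df` is assumed to have `carrier` and numeric `ArrDelay` columns.
    """
    # Imported here so the plain-Python rule above is usable without Spark.
    from pyspark.sql.functions import avg, udf
    from pyspark.sql.types import IntegerType

    compensation = udf(compensation_for, IntegerType())
    return (df
            .select(df["carrier"],
                    compensation(df["ArrDelay"]).alias("compensation"))
            .groupBy("carrier")
            .agg(avg("compensation").alias("avg_compensation")))
```

Keeping the rule as an ordinary function means attendees can check it in a plain Python shell before wrapping it in `udf`.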
Simply run pyspark --packages com.databricks:spark-csv_2.11:1.4.0
to have it detect and download the dependency automatically. Also update the slides to mention Spark Packages and this specific spark-csv package.
The docs for spark-csv are quite good as well. Should explain that SQLContext.read supports pluggable formats, and maybe also mention that there's round-tripping support through DataFrame.write.
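A rough sketch of what the pluggable-format read and the round trip back to CSV could look like in PySpark once the package is loaded. The helper names `read_csv` and `write_csv` are hypothetical, the paths are left to the caller, and the `header` / `inferSchema` options are the ones documented by spark-csv; treat this as an illustration rather than the lab's actual code.

```python
# Format name used by the spark-csv package.
CSV_FORMAT = "com.databricks.spark.csv"


def read_csv(sql_context, path):
    """Load a CSV file through the pluggable reader on SQLContext.read."""
    return (sql_context.read
            .format(CSV_FORMAT)
            .option("header", "true")       # first line holds column names
            .option("inferSchema", "true")  # guess column types instead of all strings
            .load(path))


def write_csv(df, path):
    """Write a DataFrame back out as CSV through DataFrame.write."""
    (df.write
       .format(CSV_FORMAT)
       .option("header", "true")
       .save(path))
```

In a `pyspark` shell started with the `--packages` flag above, this would be called as `read_csv(sqlContext, ...)` with the path to one of the course data files.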
For each lab, add side-by-side Scala instructions. For the first couple of labs, attendees should be using spark-shell
directly, and there should be at least one example of taking a stand-alone .scala file and submitting it. The rest of the labs can use Zeppelin.