
spark-workshop's People

Contributors

eyalbenivri, goldshtn, noikaslev

spark-workshop's Issues

Use the spark-csv package with --packages instead of the 3rd party one

Simply run pyspark --packages com.databricks:spark-csv_2.11:1.4.0 to have it detect and download the dependency automatically. Also update the slides to mention Spark Packages and this specific spark-csv package.

The docs for spark-csv are quite good as well. We should explain that SQLContext.read supports pluggable formats, and maybe also mention that there's round-tripping support through DataFrame.write.
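A minimal sketch of the round trip described above, assuming the shell was started with the --packages flag shown earlier so that the com.databricks.spark.csv format is on the classpath. The function names and the path arguments are hypothetical; "header" and "inferSchema" are options documented by spark-csv.

```python
CSV_FORMAT = "com.databricks.spark.csv"  # format name registered by the spark-csv package


def read_csv(sql_context, path):
    """Load a CSV file into a DataFrame through the pluggable reader."""
    return (sql_context.read
            .format(CSV_FORMAT)
            .option("header", "true")        # first line holds the column names
            .option("inferSchema", "true")   # extra pass over the data to guess column types
            .load(path))


def write_csv(df, path):
    """Write a DataFrame back out as CSV through the matching writer."""
    (df.write
       .format(CSV_FORMAT)
       .option("header", "true")
       .save(path))
```

In a pyspark shell this would be called as read_csv(sqlContext, "some/file.csv"), where the path is a placeholder for one of the course data files.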

Add setup script

Ideally a simple bash script that installs dependencies, downloads and extracts Spark, Zeppelin, and the course data files, and sets everything up so that workshop attendees can start being productive right away. If necessary, it can assume Ubuntu 14.04 as the base image.

NOTE: Some Python labs depend on external modules that need to be brought in through easy_install.

topFive part shows error

topFive = sorted(enumerate(similarities.collect()), key=lambda (k, v): -v)[0:5]
Near this particular line, pyspark shows a 5063 error.
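Separately from the reported error, the lambda on that line uses tuple-parameter unpacking, which is Python 2 only syntax (it was removed in Python 3 by PEP 3113). The sketch below shows an equivalent sort key that works on both versions. The plain list is made-up sample data standing in for the result of similarities.collect(); whether this change has any bearing on the 5063 error is not established by the report.

```python
# Made-up sample values standing in for similarities.collect()
collected = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3]

# Index into the (position, value) pair instead of unpacking it in the
# lambda signature, so the line parses on Python 2 and Python 3 alike.
top_five = sorted(enumerate(collected), key=lambda kv: -kv[1])[0:5]

print(top_five)
```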

DataFrame versions of the RDD labs

There's no reason to suffer through all the aggregations when DataFrames are becoming more and more prevalent. At least point out that the DataFrame version is available, and have attendees experiment with it -- even if it's before we've had a chance to teach SparkSQL.
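To illustrate the difference in effort, here is a hedged side-by-side sketch of one aggregation (average arrival delay per carrier) written against both APIs. The function names, the shape of the input RDD, and the column names "carrier" and "ArrDelay" are assumptions for illustration, not taken from the actual lab files.

```python
def average_delay_rdd(pairs):
    """RDD version. `pairs` is assumed to be an RDD of (carrier, delay) tuples."""
    totals = (pairs
              .mapValues(lambda delay: (delay, 1))                      # (sum, count) seed
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))    # add sums and counts
    return totals.mapValues(lambda t: float(t[0]) / t[1])               # sum / count


def average_delay_df(df):
    """DataFrame version. `df` is assumed to have 'carrier' and 'ArrDelay' columns."""
    return df.groupBy("carrier").avg("ArrDelay")
```

The RDD version has to carry the running sum and count by hand, while the DataFrame version states the aggregation directly.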

Scala versions of the labs

For each lab, add side-by-side Scala instructions. For the first couple of labs, attendees should be using spark-shell directly, and there should be at least one example of taking a stand-alone .scala file and submitting it. The rest of the labs can use Zeppelin.

  • Lab 0
  • Lab 1
  • Lab 2
  • Lab 3
  • Lab 4
  • Lab 5
  • Lab 6
  • Lab 7

Spark 2.0 content

Rework the course structure so that it puts the emphasis on the DataFrame / Dataset API first, and covers RDDs as an implementation detail.

DataFrame API example without SQL

The SparkSQL examples can be adjusted to also cover the DataFrame API without writing SQL statements. The same applies to UDFs. For example:

compensation = udf(lambda delay: 0 if delay < 15 else delay * 10)
df.select(..., compensation(df['ArrDelay'])).groupBy("carrier")...
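A fuller sketch of the same idea, under some assumptions: the DataFrame has a numeric "ArrDelay" column and a "carrier" column, and the goal is total compensation per carrier. The names compensation_amount, add_compensation, and the "Compensation" column are hypothetical. Note that udf returns strings unless a return type is given, so the sketch passes IntegerType explicitly.

```python
def compensation_amount(delay):
    """Plain Python rule: nothing under 15 minutes, otherwise 10 per minute."""
    if delay is None:        # null delays stay null instead of raising on comparison
        return None
    return 0 if delay < 15 else delay * 10


def add_compensation(df):
    """Apply the rule as a UDF and total it per carrier, with no SQL text."""
    # Imported here so the plain function above can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    compensation = udf(compensation_amount, IntegerType())
    return (df
            .withColumn("Compensation", compensation(df["ArrDelay"]))
            .groupBy("carrier")
            .sum("Compensation"))
```

Keeping the rule in an ordinary function means it can be checked in isolation, for example compensation_amount(20), before it is wrapped as a UDF.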

MLlib labs and content

Add an intro that highlights key concepts in ML and a few comments on some algorithms that Spark has to offer. Then we can do labs on:

  • TF-IDF example -- plagiarism detection
  • Binary classification -- census data, predict >$50K income using Naive Bayes or Decision Tree (Forest)
  • Clustering -- TBD
  • Collaborative filtering -- MovieLens dataset, adapted from Databricks training
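For the TF-IDF lab, the underlying idea can be sketched in plain Python before moving to Spark. This is not MLlib code and not the lab's solution: it is a small stand-in that weights terms with a smoothed IDF and compares documents by cosine similarity, where a score close to 1 between two documents would flag them for a plagiarism check. The function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log((n + 1.0) / (doc_freq[term] + 1.0))  # smoothed IDF
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values())) *
            math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Toy corpus: the first two documents are identical, the third shares no terms.
docs = [["a", "b", "c"], ["a", "b", "c"], ["x", "y", "z"]]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```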
