
ds30_5's Introduction

Data Science in 30 Minutes: Spark Streaming and Basic Analysis

Installation Instructions

You can easily install all of the Python requirements with Continuum Analytics' conda. If you haven't heard of it yet, we'd highly recommend taking a look!

Once you have conda installed, the easiest way to install all of these packages is:

conda env create --name ds30 --file environment.yml

More importantly, you'll need a working PySpark install (with pyspark on your PATH). You can download Spark from the Apache Spark downloads page.
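
To confirm that pyspark is reachable before opening the notebooks, a minimal check along these lines may help. It uses only the Python standard library; the helper name `find_pyspark` is ours for illustration and is not part of the talk's materials.

```python
import shutil


def find_pyspark(command="pyspark"):
    """Return the full path of the launcher if it is on PATH, else None."""
    return shutil.which(command)


if __name__ == "__main__":
    location = find_pyspark()
    if location is None:
        print("pyspark was not found on PATH; add Spark's bin directory to PATH")
    else:
        print("pyspark found at", location)
```

If the check reports nothing found, the usual cause is that Spark's bin directory has not been added to PATH in the shell that launches Jupyter.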

The presentation uses Jupyter notebooks, with a Scala/Spark kernel for ingesting data and a Python kernel for analysis.

The following will help you duplicate our (admittedly aged) kernel setup. We'll assume that you have already installed a Python environment, IPython, and Jupyter through either Anaconda or some other method.

  1. You will need a working Java installation with $JAVA_HOME set. On Ubuntu, for example, you can run sudo apt-get install default-jdk.
  2. Download the kernel archive: wget https://oss.sonatype.org/content/repositories/snapshots/sh/jove/jove-spark-cli_1.3_2.10/0.1.1-1-SNAPSHOT/jove-spark-cli_1.3_2.10-0.1.1-1-SNAPSHOT.tar.gz
  3. Unpack with tar xvf jove-spark-cli....tar.gz
  4. mv jove-spark-cli...SNAPSHOT jove-spark for convenience
  5. Run ./jove-spark/bin/jove-spark-1.3 --kernel-spec
  6. Check your installed kernels with jupyter kernelspec list. You should see the Spark kernel installed.
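
The check in the last step can also be done from Python. The sketch below assumes jupyter is on your PATH and relies on the `--json` flag of `jupyter kernelspec list`, which returns an object with a top-level `kernelspecs` mapping keyed by kernel name. The function names are ours.

```python
import json
import subprocess


def kernel_names(listing_json):
    """Extract sorted kernel names from `jupyter kernelspec list --json` output."""
    return sorted(json.loads(listing_json).get("kernelspecs", {}))


def installed_kernels():
    """Ask Jupyter for its installed kernels; requires jupyter on PATH."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return kernel_names(result.stdout)
```

Calling `installed_kernels()` returns a list of names. The exact name under which the Spark kernel registers itself is not stated here, so inspect the list rather than testing for a specific string.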

If you choose to use a different (newer) kernel, the setup may vary. The three dependencies you'll need in the Scala/Spark kernel are:

  • org.apache.spark %% spark-streaming % 1.3.1
  • org.apache.spark %% spark-streaming-twitter % 1.3.1
  • com.google.code.gson % gson % 2.4
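
The list above is written in sbt notation, where `%%` appends the Scala binary version to the artifact name and `%` uses the name as written. Some tools expect Maven-style `group:artifact:version` coordinates instead. The following sketch does that conversion. It assumes Scala 2.10, inferred from the `jove-spark-cli_1.3_2.10` artifact name above; adjust it if your kernel was built against a different Scala version. The function and constant names are ours.

```python
# Assumption: Scala 2.10, taken from the jove-spark-cli_1.3_2.10 artifact name.
SCALA_BINARY_VERSION = "2.10"

DEPENDENCIES = [
    "org.apache.spark %% spark-streaming % 1.3.1",
    "org.apache.spark %% spark-streaming-twitter % 1.3.1",
    "com.google.code.gson % gson % 2.4",
]


def to_maven_coordinate(sbt_dependency, scala_version=SCALA_BINARY_VERSION):
    """Convert 'group %[%] artifact % version' into 'group:artifact:version'."""
    group, operator, artifact, _, version = sbt_dependency.split()
    if operator == "%%":
        # %% means the artifact is cross-built per Scala binary version.
        artifact = f"{artifact}_{scala_version}"
    return f"{group}:{artifact}:{version}"


if __name__ == "__main__":
    for dependency in DEPENDENCIES:
        print(to_maven_coordinate(dependency))
```

How a given kernel accepts these coordinates (a magic command, a config file, or a build definition) depends on the kernel, so check its documentation.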

Lastly, for Twitter data, you'll need to register an application and enter your credentials in the twitter4j.properties file.
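
Before starting the stream, it can be worth checking that the credentials file is complete. The sketch below looks for the four OAuth keys that twitter4j reads from a properties file. It assumes simple `key=value` lines, and that this repository's twitter4j.properties uses these standard key names. The function name is ours.

```python
# The OAuth property names twitter4j looks for in twitter4j.properties.
REQUIRED_KEYS = (
    "oauth.consumerKey",
    "oauth.consumerSecret",
    "oauth.accessToken",
    "oauth.accessTokenSecret",
)


def missing_credentials(properties_text):
    """Return the required keys that are absent or empty in the given text."""
    values = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_credentials(open("twitter4j.properties").read())` returns an empty list when all four values are filled in. It does not check that the credentials are valid, only that they are present.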

Credits

This talk was created by Ariel M'ndange-Pfupfu, a Data Scientist at The Data Incubator. He has worked on a variety of data science, software engineering, and curriculum development projects and is also a current Bleeker Fellow. He earned his Master’s degree at Stanford and his Ph.D. in Materials Science & Engineering from Northwestern.

Databricks, one of the largest contributors to the Apache Spark project, has been instrumental in developing and supporting Spark education. Its reference applications book was very useful for building this talk.

About The Data Incubator

[The Data Incubator](https://www.thedataincubator.com/) is a data science education company based in NYC, DC, and SF with both corporate training and recruiting services. For [data science corporate training](https://www.thedataincubator.com/training.html), we offer customized, in-house training solutions in data and analytics. For [data science hiring](https://www.thedataincubator.com/hiring.html), we run a [free 8-week fellowship](https://www.thedataincubator.com/fellowship.html) training PhDs to become data scientists. The fellowship selects 2% of its 2000+ quarterly applicants and is free for Fellows. Hiring companies (including eBay, Capital One, and Pfizer) pay a recruiting fee only if they successfully hire. You can read about us in [Harvard Business Review](https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/), [VentureBeat](http://venturebeat.com/2014/04/15/ny-gets-new-bootcamp-for-data-scientists-its-free-but-harder-to-get-into-than-harvard/), or [The Next Web](http://thenextweb.com/insider/2015/07/02/data-incubator-opens-a-west-coast-campus-to-groom-the-next-generation-of-data-scientists/), or read about our alumni at [LinkedIn](http://blog.thedataincubator.com/2016/05/alumni-spotlight-xia-hong/), [Palantir](http://blog.thedataincubator.com/2015/02/moving-to-palantir-from-mathematics-alumni-spotlight-on-justin-bush/), or the [NYTimes](http://blog.thedataincubator.com/2015/02/alumni-spotlight-dorian-goldman-using-a-pure-math-background-to-solve-problems-for-the-new-york-times/).

For information on upcoming events, visit our [Eventbrite](http://www.eventbrite.com/o/the-data-incubator-8342209540).
