edx_berkeley_big_data_spark's Introduction

EDX: Big Data Analysis with Spark (Berkeley)

https://www.edx.org/course/big-data-analysis-spark-uc-berkeleyx-cs110x

This project collects my work for a course that I took on the edX platform (the course has since been archived).

  • Learning objectives
    • How to use Apache Spark to perform data analysis
    • First steps with MLlib, Spark's machine learning library
    • How to use parallel programming and distributed file systems to explore and analyze massive data sets
    • Use Spark to apply log mining, textual entity recognition and collaborative filtering techniques to real-world data questions
    • Start learning about recommendation systems (movie recommendation)
    • Use Emacs, TRAMP and Sunrise to manage and edit remote files
    • Get familiar with Vagrant (a manager for virtual development environments)

See mooc-setup-master/README.md

Spark notes

Spark is a general-purpose cluster computing framework that provides efficient in-memory computation on large data sets by distributing the work across multiple computers.

  • Workflow
    • Create or load an RDD (Resilient Distributed Dataset)
    • Invoke operations on the RDD by passing function closures that are applied to each element
      • When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure.
      • Shared variables: broadcast variables (read-only) are distributed to every node in the cluster
    • Use the resulting RDD with actions: count, collect (only when the result is small) and save
  • Tips
    • Think in terms of key-value pairs as much as possible
    • Do not copy all elements of a large RDD to the driver (avoid collect() on big data), because the driver's memory is limited
      • Use take() or takeSample()
      • Or use filter() to shrink the data first
    • Avoid groupByKey() when possible. groupByKey() shuffles more data than reduceByKey(). Keep in mind the network traffic and the cost of moving data in massive data sets
      • Instead of groupByKey(), prefer reduceByKey(), combineByKey() or foldByKey()
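
The shuffle tip above is easier to see with numbers. The following is only a plain-Python sketch, not Spark itself: it models partitions as lists and counts how many records would have to cross the network in each strategy. The helper names (shuffle_group_by_key, shuffle_reduce_by_key) and the sample data are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model groupByKey: every (key, value) record is sent across the network."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(partitions, func):
    """Model reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key leaves this partition.
        shuffled.extend(local.items())
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals, len(shuffled)


# Hypothetical word-count pairs spread over two partitions.
partitions = [
    [("spark", 1), ("rdd", 1), ("spark", 1), ("spark", 1)],
    [("rdd", 1), ("spark", 1), ("rdd", 1)],
]

grouped_totals, grouped_moved = shuffle_group_by_key(partitions)
reduced_totals, reduced_moved = shuffle_reduce_by_key(partitions, lambda a, b: a + b)

print(grouped_totals == reduced_totals)  # prints True: same answer either way
print(grouped_moved, reduced_moved)      # prints 7 4: fewer records shuffled
```

In PySpark the two strategies would roughly correspond to rdd.groupByKey().mapValues(sum) and rdd.reduceByKey(lambda a, b: a + b). The gap between the two counts grows with the number of repeated keys per partition.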

For the lab exercises, see the IPython Notebooks labs

Managing Virtual Development Environments with Vagrant and Emacs

Vagrant is a tool that helps you manage virtual environments. It makes work environments easy to configure, reproducible and portable.

Emacs is a powerful editor that provides modes to manage Vagrant virtual machines and to access remote files and folders transparently, which makes it easy to develop and work with Vagrant. For this project we worked with 2 modes: vagrant-tramp and sunrise (see also: pragmatic_emacs)

The screenshot below shows the Sunrise file manager while I am managing remote files (Vagrant virtual machine) on the left and local files on the right

figures/Screenshot%20from%202016-02-19%2022:43:44.png

        cd mooc-setup-master
        vagrant up --provider=virtualbox

Contributors

leandroohf
