SciPy 2013 Data Processing Tutorial

## Recap

Thank you to everyone for coming to our tutorial. We hope you all learned something new and useful, and we encourage everyone to continue the lively discussions from the sessions throughout this week and beyond. To help with that, here is a quick rundown of the topics we covered, along with some additional resources for those interested in learning more.

  • The tutorial GitHub repo contains the slides and exercises, and should stay up for a while.
  • Your demo accounts on Wakari.io are not permanent, but it's super easy to sign up for a free account. Wakari is in active development, so if there's a feature you want or an annoyance you don't, feel free to give us a shout!

Pandas

Data Exploration

(Unsupervised machine learning)

  • Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) find the axes with highest variance.
    • These high variance axes represent the "important" variables.
    • Implementations in NumPy/SciPy and scikit-learn
  • K-means clustering tries to group "similar" data points together.
    • The number of clusters K is an input parameter. This is good or bad depending on the problem.
    • Methods like the Bayesian Information Criterion (BIC) can help determine K from the data if it is unknown.
  • Exercises: PCA and K-means.
  • Recommended reading/examples:
    • Paper on PCA
    • Jake Vanderplas's Git repo
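
As a rough illustration of the two techniques above, here is a minimal NumPy sketch, not the tutorial's exercise code. It computes PCA through the SVD of the centered data and then runs a plain Lloyd's-algorithm K-means on the projection. The function names `pca_svd` and `kmeans`, the synthetic two-blob data, and all parameter values are assumptions made up for this example.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X (n_samples x n_features) onto its top principal axes."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    variance = (S ** 2) / (X.shape[0] - 1)  # variance along each axis
    return Xc @ components.T, components, variance[:n_components]


def assign_labels(X, centers):
    """Index of the nearest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; K is an input, as noted above."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_labels(X, centers)
        # Move each center to the mean of its points (keep it if it has none).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign_labels(X, centers), centers


# Synthetic data (an assumption for this sketch): two well-separated 3-D blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0, 0], scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=[10, 10, 0], scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

projected, components, variance = pca_svd(X, n_components=2)
labels, centers = kmeans(projected, k=2)
```

On data like this, the first principal axis should point roughly along the line joining the two blobs, and K-means with K=2 should recover the blobs. For real work, the scikit-learn implementations mentioned above are the better choice.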

IPCluster

  • IPCluster clients talk to a central controller, which in turn wrangles remote nodes, each running one or more engines.

  • An engine is a worker that executes your code, much like a thread, although each engine runs as its own Python process. You can run an engine on the same node as a controller, and nodes can run more than one engine.

  • Configuration is flexible, but somewhat poorly documented. For development, run ipcluster start -n 3 to start three engines, and connect to them from IPython with

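      # In IPython 4.0 and later, use: from ipyparallel import Client
      from IPython.parallel import Client
      # Connect to the controller started by ipcluster
      client = Client()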
    
  • Execute commands with view methods, e.g. direct.execute('foo()') where direct = client[:] is a view on all engines, not client.execute('foo()').

  • IPCluster is ideal for embarrassingly parallel workloads that are CPU/GPU/RAM-heavy and light on data transfer.

  • Exercise: IPCluster Basics

  • Exercise: Bayesian Estimation w/ MCMC and IPCluster (view/clone this notebook with Wakari).

  • Recommended notebook: Introduction to Parallel Python with IPCluster and Wakari, Ian Stokes-Rees.

  • Recommended text: Doing Bayesian Data Analysis, John K Kruschke.
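
The client, view, and engine workflow described in the IPCluster notes above might look like the following sketch. It assumes a cluster is already running (for example via ipcluster start -n 3) and that either `ipyparallel` or the older `IPython.parallel` package is installed. The helper names `run_on_engines`, `square`, and `demo` are hypothetical and exist only for this illustration.

```python
def square(x):
    """Toy task: each call is independent, so the work is embarrassingly parallel."""
    return x * x


def run_on_engines(view, func, items):
    """Apply func to every item through a view and collect the results in order."""
    return list(view.map_sync(func, items))


def demo():
    """Connect to a running cluster and square a few numbers on its engines.

    Assumption: a controller and engines were started beforehand.
    """
    try:
        from ipyparallel import Client  # IPython 4.0 and later
    except ImportError:
        from IPython.parallel import Client  # older IPython, as in the tutorial
    client = Client()
    direct = client[:]  # a DirectView covering all engines
    print("engine ids:", client.ids)
    print(run_on_engines(direct, square, range(8)))
```

Calling `demo()` from an IPython session should print the engine ids and the squared values. Note that the work goes through the view (`direct`), not through the client object itself.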

MapReduce
