mooc-setup's Introduction

mooc-setup

Information for setting up for the Spark MOOC, and lab assignments for the course.

mooc-setup's Issues

Minor typos/grammar errors in ML_lab3_linear_reg_student.ipynb

Three changes for your consideration, where underscores mark insertions and dashes mark strikethroughs:

  • line 361: 'task involves split_ting_ it into training, validation and test sets'
  • line 490: 'Calculates the -the- squared error'
  • line 815: 'gradient` s_h_ould be a '

Module 4 lab CTR Data URL No longer exists!

Hi Felix!

I love your pyspark course thus far!

I am going through your "Scalable Machine Learning" course and noticed that the link to the dataset in the Module 4 lab, "Click-Through Rate Prediction", no longer works. Do you have any advice on how to import the dataset for the Module 4 lab so that I can finish the module?

Thank you so much for your help,

Austen

cs120_lab2_linear_regression_df.py - randomSplit results changed in Databricks, which may cause inconsistent test cases

A change in Databricks has caused randomSplit to return different results since yesterday (02/08/2016).

Test cases that passed yesterday failed when I ran them again today, because the result of randomSplit on line 352 changed.

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L384
should be
Test.assertEquals(round(float(n_train) / float(n_train + n_val + n_test), 1), .8, 'unexpected value for nTrain')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L385
should be
Test.assertEquals(round(float(n_val) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nVal')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L386
should be
Test.assertEquals(round(float(n_test) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nTest')

Seed problem

Hello

I'm trying to go through the week 3 lab, but there seems to be a problem with the proportions in which the data is partitioned into training, validation and test sets. I'm using the supplied seed along with the defined weights, and I get a different number of examples in each set. As a result, the tests that follow are bound to fail.

snippet:

weights = [.8, .1, .1]
seed = 42
raw_train_df, raw_validation_df, raw_test_df = raw_df.randomSplit(weights, seed)

n_train = raw_train_df.cache().count()
n_val = raw_validation_df.cache().count()
n_test = raw_test_df.cache().count()
print n_train, n_val, n_test, n_train + n_val + n_test
raw_df.show(1)

output:

80115 9955 9930 100000
+--------------------+
|                text|
+--------------------+
|0,1,1,5,0,1382,4,...|
+--------------------+
only showing top 1 row

The same thing happens in lab 2 (linear regression).
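
One way to make such a check less sensitive to the exact counts is to compare each split's share of the total instead, in the spirit of the rounded assertions suggested in the randomSplit issue above. The following is a minimal sketch in plain Python, not the lab's actual test code: the helper name split_proportions is hypothetical, and the counts are the ones printed in the output above.

```python
def split_proportions(n_train, n_val, n_test, ndigits=1):
    """Return each split's share of the total, rounded to `ndigits` places.

    Hypothetical helper for illustration; it is not part of the lab code.
    """
    total = float(n_train + n_val + n_test)
    return tuple(round(n / total, ndigits) for n in (n_train, n_val, n_test))


# Counts taken from the output reported in this issue.
proportions = split_proportions(80115, 9955, 9930)

# The exact counts may vary between environments, but the rounded
# shares still match the requested weights [.8, .1, .1].
assert proportions == (0.8, 0.1, 0.1)
```

Rounding to one decimal place is a coarse tolerance: it accepts any split whose shares round to the requested weights, so it would not catch a small but real deviation.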

ssh timeout error

I am trying to use your VM, which is similar to the one created here, but I am getting an ssh timeout error. Do you know why this happens?

Labs incompatibilities in certain circumstances

I realize that the course VM is a closed environment that is not friendly to change, but judging from Piazza, some students have hit the same obstacles. If I'm wrong, please close this issue.

  1. Relative file import paths in the labs produce an error when the IPython working directory is changed from the user's home directory to another one. To make the shared folder convenient to use, I changed the notebook profile to c.NotebookApp.notebook_dir = '/vagrant', and the labs then fail with:

: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/vagrant/Scalable-Machine-Learning/labs-progress/data/cs190/neuro.txt

     So maybe the notebooks should use an explicit path to the current user's home directory? Something like:
from os.path import expanduser
home = expanduser("~")
  2. The labs are incompatible with numpy 1.9.2. Is it worth making them forward compatible?
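
The home-directory suggestion in the first point could be extended along these lines. This is only a sketch under assumptions: the helper name resolve_data_path is hypothetical, and it assumes the lab data directory sits directly under the user's home directory, which may not match the actual VM layout.

```python
import os.path


def resolve_data_path(relative_path, home=None):
    """Join a lab-relative data path onto the user's home directory.

    Hypothetical helper; `home` defaults to the current user's home
    directory, so the result does not depend on the working directory.
    """
    if home is None:
        home = os.path.expanduser("~")
    return os.path.join(home, relative_path)


# Assumed location of the data file relative to the home directory.
input_path = resolve_data_path(os.path.join("data", "cs190", "neuro.txt"))
print(input_path)
```

Because the path is anchored at the home directory rather than at the notebook's working directory, changing c.NotebookApp.notebook_dir would no longer affect where the file is looked up.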

Need to update to Spark 1.6.0

Hi

The Spark version is 1.3.1 in this VM:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/

I need to update to 1.6.0. How is Spark installed inside the VM, and are there instructions for updating it? Or do you plan to push an update soon?
