spark-mooc / mooc-setup

Information for setting up for the BerkeleyX Spark Intro MOOC, and lab assignments for the course

Jupyter Notebook 14.88% Python 74.44% Java 10.67%

mooc-setup's Issues

A small typo issue

"which is recommended when they key doesn't change"

Minor typos/grammar errors in ML_lab3_linear_reg_student.ipynb

Three changes for your consideration, where underscores mark insertions and dashes mark strikethroughs:

line 361: 'task involves split_ting_ it into training, validation and test sets'
line 490: 'Calculates the -the- squared error'
line 815: 'gradient` s_h_ould be a '

Incorrect assertion in cs120_lab3_ctr_df.dbc (5f)

Concerns 8252845

Following the discussion at https://piazza.com/class/iqfbu516yuj5t3?cid=653
it seems that "expected_test_baseline = 0.530363901139" used in the assertion comes from hash_test_df instead of hash_train_df

Module 4 lab CTR Data URL No longer exists!

Hi Felix!

I love your pyspark course thus far!

I am going through your "Scalable Machine Learning" course and noticed that the link to the dataset in the Module 4 lab, "Click-Through Rate Prediction", no longer works. Do you have any advice on how to import the dataset for the Module 4 lab so that I can finish the module?

Thank you so much for your help,

Austen

Login user name and password

Hi, I just downloaded sparkvm via Vagrant but can't log in. What are the user name and password?

cs120_lab2_linear_regression_df.py - changed randomSplit results in Databricks may cause inconsistent test cases

A change in Databricks has caused randomSplit to return different results since yesterday (02/08/2016).

Test cases that passed yesterday failed when I ran them again today, because the result of randomSplit in line 352 changed.

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L384
should be
Test.assertEquals(round(float(n_train) / float(n_train + n_val + n_test), 1), .8, 'unexpected value for nTrain')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L385
should be
Test.assertEquals(round(float(n_val) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nVal')

https://github.com/spark-mooc/mooc-setup/blob/master/cs120_lab2_linear_regression_df.py#L386
should be
Test.assertEquals(round(float(n_test) / float(n_train + n_val + n_test), 1), .1, 'unexpected value for nTest')
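
The proposed assertions compare each split's share of the total instead of exact row counts, so they tolerate small changes in how randomSplit assigns rows. A minimal sketch of that idea in plain Python, with no Spark dependency, might look like the following. The helper name `split_fractions` is hypothetical and not part of the lab code, and the counts are illustrative values only.

```python
def split_fractions(n_train, n_val, n_test):
    """Return each split's share of the total, rounded to one decimal place."""
    total = float(n_train + n_val + n_test)
    return tuple(round(n / total, 1) for n in (n_train, n_val, n_test))


# Illustrative counts; the exact values depend on the randomSplit implementation.
fractions = split_fractions(80115, 9955, 9930)
print(fractions)
```

Rounding to one decimal place is a coarse check: it accepts any split whose proportions land near 0.8, 0.1 and 0.1, which is the trade-off the proposed assertions make.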

Seed problem

Hello

I'm trying to go through the week 3 lab, but there seems to be a problem with the proportions in which the data is partitioned into training, validation and test sets. I'm using the supplied seed along with the defined weights, and I get a different number of examples in each set. As a result, the tests that follow are bound to fail.

snippet:

weights = [.8, .1, .1]
seed = 42
raw_train_df, raw_validation_df, raw_test_df = raw_df.randomSplit(weights, seed)

n_train = raw_train_df.cache().count()
n_val = raw_validation_df.cache().count()
n_test = raw_test_df.cache().count()
print n_train, n_val, n_test, n_train + n_val + n_test
raw_df.show(1)

output:

80115 9955 9930 100000
+--------------------+
|                text|
+--------------------+
|0,1,1,5,0,1382,4,...|
+--------------------+
only showing top 1 row

The same thing happens in lab 2 (linear regression).

ssh timeout error

I am trying to use your VM, which is similar to the one created here, but I am getting an ssh timeout error. Do you know why this might be happening?

Lab incompatibilities in certain circumstances

I realize that the course VM is a closed environment that is not meant to be changed, but a search of Piazza shows that some students have hit the same obstacles. If I'm wrong, please close this issue.

: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/vagrant/Scalable-Machine-Learning/labs-progress/data/cs190/neuro.txt

The relative file import path in the labs produces an error when the IPython working directory is changed to something other than the user's home directory. To make the shared folder convenient to use, I changed the notebook profile to c.NotebookApp.notebook_dir = '/vagrant'. Maybe the notebooks should use an explicit path to the current user's home directory? Something like:

from os.path import expanduser
home = expanduser("~")
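
Extending the two lines above, one possible way to build an absolute data path could look like this. It is only a sketch: the helper name `lab_data_path` is hypothetical, and it assumes the lab data directory sits under the user's home directory, which the issue does not confirm. The relative part, data/cs190/neuro.txt, is taken from the error message.

```python
import os
from os.path import expanduser


def lab_data_path(*parts):
    """Join path components onto the current user's home directory."""
    return os.path.join(expanduser("~"), *parts)


# Assumed layout: the 'data' directory lives directly under the home directory.
input_path = lab_data_path("data", "cs190", "neuro.txt")
print(input_path)
```

Because the result no longer depends on the notebook's working directory, changing notebook_dir should not affect where the lab looks for its input.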

The labs are also incompatible with numpy 1.9.2. Is it worth making them forward compatible?

Need to update to Spark 1.6.0

The Spark version is 1.3.1 in this VM:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/

I need to update to 1.6.0. How is Spark installed inside the VM, and are there instructions for updating it? Or do you plan to push an update soon?

Small issue with cs120_lab2_linear_regression_df.py

I think that in this line parsed_points_df should be parsed_data_df, because we should plot the data with shifted labels. With parsed_points_df we can't see the labels on the x-axis because they are out of range.

Thank you very much.