License: Other

single-variable-regression-lab's Introduction

Comedy Show Lab

Imagine that you are the producer for a comedy show at your school. We need you to use your knowledge of linear regression to predict how successful the show will be.

Working through a linear regression

The comedy show is trying to figure out how much money to spend on advertising in the student newspaper. The newspaper tells the show that:

For every two dollars spent on advertising, three students attend the show.
If no money is spent on advertising, no one will attend the show.

Write a linear regression function called attendance that shows the relationship between advertising and attendance expressed by the newspaper.

def attendance(advertising):
    pass

attendance(100) # 150

attendance(50) # 75
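One way to read the newspaper's two claims is as a line with slope $m = 3/2$ and y-intercept $b = 0$. The sketch below is only one possible way to fill in the stub under that reading, not the lab's official solution.

```python
def attendance(advertising):
    # Three students per two dollars gives a slope of 3 / 2 = 1.5.
    # Zero dollars means zero students, so the y-intercept is 0.
    return 1.5 * advertising


print(attendance(100))  # 150.0
print(attendance(50))   # 75.0
```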

As the old adage goes, "Don't ask the barber if you need a haircut!" Likewise, despite what the student newspaper says, the comedy show knows from experience that they'll still have a crowd even without an advertising budget. Some of the comedians in the show have friends (believe it or not), and twenty of those friends will show up. Write a function called attendance_with_friends that models the following:

When the advertising budget is zero, 20 friends still attend
Three additional people attend the show for every two dollars spent on advertising

def attendance_with_friends(advertising):
    pass

attendance_with_friends(100) # 170

attendance_with_friends(50) # 95
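The twenty friends act as the y-intercept, while the slope stays at three people per two dollars. A minimal sketch, assuming the relationship is exactly linear as described:

```python
def attendance_with_friends(advertising):
    # Slope m = 1.5 (three people per two dollars); intercept b = 20 friends.
    return 1.5 * advertising + 20


print(attendance_with_friends(100))  # 170.0
print(attendance_with_friends(50))   # 95.0
```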

Plot it

Let's help plot this line so you can get a sense of what your $m$ and $b$ values look like in graph form.

Our x values can be a list called initial_sample_budgets that holds our budgets. We can then use the outputs of our attendance_with_friends function to build attendance_values, the list of attendances at each of those x values.

initial_sample_budgets = [0, 50, 100]
attendance_values = [20, 95, 170]

First, we import the plotly library and its graph_objs module, and set up plotly so that it can be used without uploading our plots to its website.

Then we plot our regression line. Our x values will be the budgets. Our y values will be the attendances that our attendance_with_friends function returns for each of those x values.

import plotly
from plotly import graph_objs
plotly.offline.init_notebook_mode(connected=True)

trace_of_attendance_with_friends = graph_objs.Scatter(
    x=initial_sample_budgets,
    y=attendance_values,
)

plotly.offline.iplot([trace_of_attendance_with_friends])

trace_of_attendance_with_friends

Now let's write a couple of functions that we can use going forward. We'll write a function called m_b_data that, given the slope of a line, $m$, a y-intercept, $b$, and a list of x_values, returns a dictionary with a key of x pointing to the list of x_values and a key of y pointing to a list of y_values. Each $y$ value should be the output of the regression line with the provided $m$ and $b$ values at the corresponding x value.

def m_b_data(m, b, x_values):
    pass

m_b_data(1.5, 20, [0, 50, 100]) # {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

Now let's write a function called m_b_trace that uses our m_b_data function to return a dictionary that includes keys of name and mode in addition to x and y. The values of mode and name are provided as arguments. When the mode argument is not provided, it defaults to 'lines', and when name is not provided, it defaults to 'line function'.

def m_b_trace(m, b, x_values, mode = 'lines', name = 'line function'):
    pass

m_b_trace(1.5, 20, [0, 50, 100])
# {'mode': 'lines', 'name': 'line function', 'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
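The two helpers described above can be sketched as follows. This is one possible shape for them, assuming plain dictionaries are enough to describe a trace. The key point is that m_b_trace copies the x and y lists out of the m_b_data result rather than nesting the whole dictionary.

```python
def m_b_data(m, b, x_values):
    # Evaluate y = m * x + b at every x value.
    y_values = [m * x + b for x in x_values]
    return {'x': x_values, 'y': y_values}


def m_b_trace(m, b, x_values, mode='lines', name='line function'):
    # Reuse m_b_data, then add the two display-related keys.
    data = m_b_data(m, b, x_values)
    return {'x': data['x'], 'y': data['y'], 'mode': mode, 'name': name}


print(m_b_data(1.5, 20, [0, 50, 100]))
# {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
print(m_b_trace(1.5, 20, [0, 50, 100])['mode'])  # lines
```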

Calculating lines

The comedy show decides to advertise for two different shows. The attendance looks like the following.

Budgets (dollars)	Attendance
200	400
400	700

In code, we represent this as the following:

first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}

Write a function called marginal_return_on_budget that returns the expected increase in attendance for every additional dollar of budget.

The function should use the formula for the slope of a line through two points.

def marginal_return_on_budget(first_show, second_show):
    pass

marginal_return_on_budget(first_show, second_show) # 1.5
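The slope formula for two points is $m = (y_2 - y_1) / (x_2 - x_1)$, with budget on the x-axis and attendance on the y-axis. A minimal sketch of the function under that assumption, using the two shows from the table above:

```python
def marginal_return_on_budget(first_show, second_show):
    # Rise: change in attendance. Run: change in budget.
    rise = second_show['attendance'] - first_show['attendance']
    run = second_show['budget'] - first_show['budget']
    return rise / run


first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}
print(marginal_return_on_budget(first_show, second_show))  # 1.5
```

Note that this sketch raises ZeroDivisionError if both shows have the same budget, since the slope is undefined in that case.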

first_show

Just to check, let's use some different data to make sure our marginal_return_on_budget function calculates the slope properly.

imaginary_third_show = {'budget': 300, 'attendance': 500}
imaginary_fourth_show = {'budget': 600, 'attendance': 900}
marginal_return_on_budget(imaginary_third_show, imaginary_fourth_show) # 1.3333333333333333

Great! Now we'll begin to write functions that we can use going forward. The functions will calculate attributes of lines in general and can be used to predict the attendance of the comedy show.

Take the following data. The comedy show spends zero dollars on advertising for the next show. The attendance chart now looks like the following:

Budgets (dollars)	Attendance
0	100
200	400
400	700

budgets = [0, 200, 400]
attendance_numbers = [100, 400, 700]

To get you started, we'll provide a function called sorted_points that accepts a list of x values and a list of y values and returns a list of point coordinates sorted by their x values. The return value is a list of sorted tuples.

def sorted_points(x_values, y_values):
    values = list(zip(x_values, y_values))
    sorted_values = sorted(values, key=lambda value: value[0])
    return sorted_values

sorted_points([4, 1, 6], [4, 6, 7]) # [(1, 6), (4, 4), (6, 7)]

build_starting_line

In this section, we'll write a function called build_starting_line. The function that we end up building simply draws a line between our points with the highest and lowest x values. We are selecting these points as an arbitrary "starting" point for our regression line.

As John von Neumann said, "truth … is much too complicated to allow anything but approximations." All models are inherently wrong, but some are useful. In future lessons, we will learn how to build a regression line that accurately matches our dataset. For now, we will focus on building a useful "starting" line using the first and last points along the x-axis.

First, write a function called slope that, given a list of x values and a list of y values, will use the points with the lowest and highest x values to calculate the slope of a line.

def slope(x_values, y_values):
    pass

slope([200, 400], [400, 700]) # 1.5
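One way to sketch slope is to sort the points by x, take the first and last, and apply the two-point slope formula. The block repeats sorted_points from earlier in the lab so that it runs on its own. The local variable names are illustrative only.

```python
def sorted_points(x_values, y_values):
    # Same helper as provided by the lab: points sorted by x value.
    values = list(zip(x_values, y_values))
    return sorted(values, key=lambda value: value[0])


def slope(x_values, y_values):
    points = sorted_points(x_values, y_values)
    x_low, y_low = points[0]      # point with the lowest x value
    x_high, y_high = points[-1]   # point with the highest x value
    return (y_high - y_low) / (x_high - x_low)


print(slope([200, 400], [400, 700]))  # 1.5
```

Because the points are sorted first, the order in which the x and y values are passed in does not change the result.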

Now write a function called y_intercept. Use the slope function to calculate the slope if it isn't provided as an argument. Then use the slope and the coordinates of the point with the highest x value to return the y-intercept, $b = y - mx$.

def y_intercept(x_values, y_values, m = None):
    pass

y_intercept([200, 400], [400, 700]) # 100

y_intercept([0, 200, 400], [10, 400, 700]) # 10

Now write a function called build_starting_line that, given a list of x_values and a list of y_values, returns a dictionary with keys m and b holding the slope and y-intercept of the calculated line. Use the slope and y_intercept functions to calculate the line.

def build_starting_line(x_values, y_values):
    pass

build_starting_line([0, 200, 400], [10, 400, 700]) # {'b': 10.0, 'm': 1.725}
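The sketch below shows how y_intercept and build_starting_line might fit together. It assumes slope takes the full lists of x and y values, as described in this section, and it repeats the helpers so that the block runs on its own. It is an illustration, not the lab's official solution.

```python
def sorted_points(x_values, y_values):
    return sorted(zip(x_values, y_values), key=lambda value: value[0])


def slope(x_values, y_values):
    points = sorted_points(x_values, y_values)
    (x_low, y_low), (x_high, y_high) = points[0], points[-1]
    return (y_high - y_low) / (x_high - x_low)


def y_intercept(x_values, y_values, m=None):
    # Fall back to the lowest-to-highest slope when none is supplied.
    if m is None:
        m = slope(x_values, y_values)
    x_high, y_high = sorted_points(x_values, y_values)[-1]
    # Rearranging y = m * x + b gives b = y - m * x.
    return y_high - m * x_high


def build_starting_line(x_values, y_values):
    m = slope(x_values, y_values)
    b = y_intercept(x_values, y_values, m)
    return {'m': m, 'b': b}


print(y_intercept([200, 400], [400, 700]))  # 100.0
print(build_starting_line([0, 200, 400], [10, 400, 700]))
```

For the three-point example, the slope between (0, 10) and (400, 700) is 690 / 400 = 1.725, so the intercept works out to 700 - 1.725 * 400 = 10.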

Finally, let's write a function called expected_value_for_line that returns the expected attendance given $m$, $b$, and an x value.

first_show = {'budget': 300, 'attendance': 700}
second_show = {'budget': 400, 'attendance': 900}

shows = [first_show, second_show]

def expected_value_for_line(m, b, x_value):
    pass

expected_value_for_line(1.5, 100, 100) # 250
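Since a line is $y = mx + b$, the expected value is a direct evaluation of that formula. A minimal sketch:

```python
def expected_value_for_line(m, b, x_value):
    # Evaluate the line y = m * x + b at a single x value.
    return m * x_value + b


print(expected_value_for_line(1.5, 100, 100))  # 250.0
```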

Using our functions

Now that we have built these functions, we can use them on our dataset. Uncomment and run the lines below to see how we can use our functions going forward.

first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}
third_show = {'budget': 300, 'attendance': 500}
fourth_show = {'budget': 600, 'attendance': 900}

comedy_shows = [first_show, second_show, third_show, fourth_show]

show_x_values = list(map(lambda show: show['budget'], comedy_shows))
show_y_values = list(map(lambda show: show['attendance'], comedy_shows))

def trace_values(x_values, y_values, mode = 'markers', name="data"):
    return {'x': x_values, 'y': y_values, 'mode': mode, 'name': name}

def plot(traces):
    plotly.offline.iplot(traces)

comedy_show_trace = trace_values(show_x_values, show_y_values, name = 'comedy show data')
comedy_show_trace

show_starting_line = build_starting_line(show_x_values, show_y_values)
show_starting_line

trace_show_line = m_b_trace(show_starting_line['m'], show_starting_line['b'], show_x_values, name = 'starting line')

trace_show_line

plot([comedy_show_trace, trace_show_line])

As we can see above, we built a "starting" regression line out of the points with the lowest and highest x values. We will learn in future lessons how to improve our line so that it becomes the "best fit" given all of our dataset, not just the first and last points. For now, this approach sufficed since our goal was to practice working with and plotting line functions.

single-variable-regression-lab's People

lcorr8 xiupan volynsal s-b1 ksis1st

single-variable-regression-lab's Issues

Instructions don't match function definition provided in Single Variable Regression Lab

Not sure if this is an oversight or not, but the description of what the lab wants us to do (function name, number and types of inputs) does not match the function definition provided.

See screenshot for details:

y - intercept explanation

I think the y-intercept functions need a bit more explanation.

I'm not sure why we need the 'y-intercept provided' function. We have a function called sorted_points that lists our coordinates in order by the lowest x-value. So, if our y-intercept is included in our data set and our x-values are never below zero, the y-intercept will always be the second element of index 0 of our sorted_points list.

When writing the slope function, an explanation of why we are using the lowest and the highest x-coordinates would be helpful. I think in a previous lesson, you explain that slope can be calculated given any two points. In this lab though, we do need the highest and lowest x coordinates and not just any two points to make sure that we are including all of our budget range.

Maybe there is a mathematical reason that we need that function, but I'm not sure why exactly.

Also, it would be helpful to plot another graph at the end of the lesson that did have a y-intercept greater than zero since we were writing functions to account for that in the lesson.

No place to write "regression_line_two_points" or expected answer

regression_line_two_points doesn't have a pre-created spot to write it out/test it

there is no solution available for this lab

clicking the solutions button does not bring up any solution

minor typo in comment

code snippet

m_b_trace(1.5, 20, [0, 50, 100]) 
# {'mode': 'line', 'name': 'line function', 'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

should be

m_b_trace(1.5, 20, [0, 50, 100]) 
# {'mode': 'lines', 'name': 'line function', 'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

expected value for slope from build_starting_line variable appears incorrect.

build_starting_line([0, 200, 400], [10, 400, 700]) # {'b': 10.0, 'm': 1.725}
the actual slope of this line is 1.95.

(y2 - y1) / (x2 - x1) = (400 - 10) / (200 - 0) = 390 / 200 = 1.95
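For comparison, the lab's instructions define the starting line through the points with the lowest and highest x values, not the first two points. A quick check of both calculations on the same data (variable names here are illustrative) shows where the two figures come from:

```python
x_values = [0, 200, 400]
y_values = [10, 400, 700]

# Slope between the first two points, as computed in this issue.
slope_first_two = (y_values[1] - y_values[0]) / (x_values[1] - x_values[0])

# Slope between the lowest and highest x values, as the lab describes.
slope_low_high = (y_values[-1] - y_values[0]) / (x_values[-1] - x_values[0])

print(slope_first_two)  # 1.95
print(slope_low_high)   # 1.725
```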

Data Science Bootcamp Prep: Labs Refresh Issue

Hello,

Whenever I'm working on the labs, they automatically refresh periodically. Sometimes it's seconds in between; other times it could be minutes. Is there a way for me to fix this?

I'm connected to github per my account settings and I've disconnected and reconnected, but I still have the issue.

y_intercept function explanation

"returns a dictionary with the keys of m and b to return the values of m and b."

Looks like it really just wants a return of the b value, that return is for the next function

Error running pre-written plotly code at end of lesson

When attempting to run the pre-written code plot([comedy_show_trace, trace_show_line]) on Chrome 70.0.3538.110 on Windows 10 I get the following error, and am unable to produce the graph with plotly:

ValueErrorTraceback (most recent call last)
in ()
----> 1 plot([comedy_show_trace, trace_show_line])

in plot(traces)
1 def plot(traces):
----> 2 plotly.offline.iplot(traces)

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/offline/offline.py in iplot(figure_or_data, show_link, link_text, validate, image, filename, image_width, image_height, config)
334 config.setdefault('linkText', link_text)
335
--> 336 figure = tools.return_figure_from_figure_or_data(figure_or_data, validate)
337
338 # Though it can add quite a bit to the display-bundle size, we include

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/tools.py in return_figure_from_figure_or_data(figure_or_data, validate_figure)
1467
1468 try:
-> 1469 figure = Figure(**figure).to_dict()
1470 except exceptions.PlotlyError as err:
1471 raise exceptions.PlotlyError("Invalid 'figure_or_data' argument. "

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/graph_objs/_figure.py in init(self, data, layout, frames)
312 respective traces in the data attribute
313 """
--> 314 super(Figure, self).init(data, layout, frames)
315
316 def add_area(

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/basedatatypes.py in init(self, data, layout_plotly, frames)
114
115 # ### Import traces ###
--> 116 data = self._data_validator.validate_coerce(data)
117
118 # ### Save tuple of trace objects ###

/opt/conda/envs/learn-env/lib/python3.6/site-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v)
1953 invalid_els.append(v_el)
1954 else:
-> 1955 trace = self.class_maptrace_type
1956 res.append(trace)
1957 else:

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/graph_objs/_scatter.py in init(self, arg, cliponaxis, connectgaps, customdata, customdatasrc, dx, dy, error_x, error_y, fill, fillcolor, hoverinfo, hoverinfosrc, hoverlabel, hoveron, hovertext, hovertextsrc, ids, idssrc, legendgroup, line, marker, mode, name, opacity, r, rsrc, selected, selectedpoints, showlegend, stream, t, text, textfont, textposition, textpositionsrc, textsrc, tsrc, uid, unselected, visible, x, x0, xaxis, xcalendar, xsrc, y, y0, yaxis, ycalendar, ysrc, **kwargs)
2131 self.xsrc = xsrc if xsrc is not None else _v
2132 _v = arg.pop('y', None)
-> 2133 self.y = y if y is not None else _v
2134 _v = arg.pop('y0', None)
2135 self.y0 = y0 if y0 is not None else _v

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/basedatatypes.py in setattr(self, prop, value)
2691 prop in self._validators):
2692 # Let known properties and private properties through
-> 2693 super(BasePlotlyType, self).setattr(prop, value)
2694 else:
2695 # Raise error on unknown public properties

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/graph_objs/_scatter.py in y(self, val)
1409 @y.setter
1410 def y(self, val):
-> 1411 self['y'] = val
1412
1413 # y0

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/basedatatypes.py in setitem(self, prop, value)
2663 # ### Handle simple property ###
2664 else:
-> 2665 self._set_prop(prop, value)
2666
2667 # Handle non-scalar case

/opt/conda/envs/learn-env/lib/python3.6/site-packages/plotly/basedatatypes.py in _set_prop(self, prop, val)
2893 # ------------
2894 validator = self._validators.get(prop)
-> 2895 val = validator.validate_coerce(val)
2896
2897 # val is None

/opt/conda/envs/learn-env/lib/python3.6/site-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v)
311 v = to_scalar_or_list(v)
312 else:
--> 313 self.raise_invalid_val(v)
314 return v
315

/opt/conda/envs/learn-env/lib/python3.6/site-packages/_plotly_utils/basevalidators.py in raise_invalid_val(self, v)
214 typ=type_str(v),
215 v=repr(v),
--> 216 valid_clr_desc=self.description()))
217
218 def raise_invalid_elements(self, invalid_els):

ValueError:
Invalid value of type 'builtins.dict' received for the 'y' property of scatter
Received value: {'x': [200, 400, 300, 600], 'y': [400.0, 650.0, 525.0, 900.0]}

The 'y' property is an array that may be specified as a tuple,
list, numpy array, or pandas Series
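The "Received value" line suggests that the y key of the line trace held an entire {'x': ..., 'y': ...} dictionary instead of a list. One plausible cause, offered here only as a guess, is an m_b_trace implementation that stores the whole m_b_data result under y. The sketch below reproduces that shape without plotly. The function names m_b_trace_nested and m_b_trace_flat are hypothetical, and m = 1.25, b = 150 are the values implied by the y list in the error message.

```python
def m_b_data(m, b, x_values):
    return {'x': x_values, 'y': [m * x + b for x in x_values]}


def m_b_trace_nested(m, b, x_values, mode='lines', name='line function'):
    # Hypothetical buggy version: 'y' holds the whole dictionary.
    return {'x': x_values, 'y': m_b_data(m, b, x_values),
            'mode': mode, 'name': name}


def m_b_trace_flat(m, b, x_values, mode='lines', name='line function'):
    # 'y' holds only the list of y values, which is what a scatter trace expects.
    data = m_b_data(m, b, x_values)
    return {'x': data['x'], 'y': data['y'], 'mode': mode, 'name': name}


budgets = [200, 400, 300, 600]
nested = m_b_trace_nested(1.25, 150, budgets)
flat = m_b_trace_flat(1.25, 150, budgets)

print(nested['y'])  # {'x': [200, 400, 300, 600], 'y': [400.0, 650.0, 525.0, 900.0]}
print(flat['y'])    # [400.0, 650.0, 525.0, 900.0]
```

If the nested shape matches what m_b_trace returns in a given notebook, returning the flat shape instead should give plotly a list for y.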

y_intercept() function bug

I found a bug in the code for the function y_intercept(). The code in the Solutions guide is the following:

def y_intercept(x_values, y_values, m = None):
    sorted_values = sorted_points(x_values, y_values)
    highest = sorted_values[-1]
    if m == None:
        m = slope(x_values, y_values)
    offset = highest[1] - m*highest[0]
    return offset

The problem with this code is that when you use it for larger data sets, the return is wildly incorrect. For example, if you use this with a simple two-point data set such as
y_intercept([200, 400], [400, 700]) you will return the correct answer 100 . Once you add a third point, there is a problem.

y_intercept([0, 200, 400], [10, 400, 700]) returns -11333 (I think, I already deleted the incorrect code, however, it was a large negative number like that one). The answer is supposed to be 10.

The correct code for y_intercept is as follows: I did change some of the variables for coherence.

def y_intercept(x_values, y_values, m = None):
    sorted_values = sorted_points(x_values, y_values)
    highest = sorted_values[-1]
    if m == None:
        m = slope(sorted_values[0], sorted_values[-1])
    y_int = highest[1] - m*highest[0]
    return y_int

Hope this helps someone!

learn-co-curriculum / single-variable-regression-lab