Git Product home page Git Product logo

ds-train-test-split-nyc-ds-062518's Introduction

Train Test Split

Introduction

Now that you've seen some basic linear regression models it's time to discuss further how to better tune these models. As you saw, we usually begin with an error or loss function for which we'll apply an optimization algorithm such as gradient descent. We then apply this optimization algorithm to the error function we're trying to minimize and voila, we have an optimized solution! Unfortunately, things aren't quite that simple.

Overfitting and Underfitting

Most importantly is the issue of generalization. This is often examined by discussing underfitting and overfitting.

Recall our main goal when performing regression: we're attempting to find relationships that can generalize to new cases. Generally, the more data that we have the better off we'll be as we can observe more patterns and relationships within that data. However, some of these patterns and relationships may not generalize well to other cases.

Let's intentionally overfit some data to see this in demonstration.

1. Import the data and define X and Y.

#Import the Data here.
path = '/data' #The subdirectory where the file is stored
filename = 'movie_data_detailed.xlsx' #The filename
full_path = path + filename #Alternative shortcut

df = #Your code here

#Subset the Data into appropriate X and Y features. (X should be multiple features!)
X = #Your code here
Y = #Your code here

2. For each feature in X, create several new columns that are powers of that feature. For example, you could take the $budget$ column and produce another column $budget2$, a third column $budget3$, a fourth column $budget**4$ and so on. Do this until you have more columns then rows.

#Your code here.
#Create additional features using powers until you have more columns then rows.

3. Use all of your new features for X. Then train a regression model using RSS as your error function and gradient descent to tune your weights.

#Your code here

4. Plot the model and the actual data on the Budget/Gross Domestic Product plane. (Remember this is just a slice of your n-dimensional space!)

#Your code here

5. What do you notice?

#Your response here

Note: This box (like all the questions and headers) is formatted in Markdown. See a brief cheat sheet of markdown syntax here!

Train Test Split

Here lies the theoretical underpinnings for train test split. Essentially, we are trying to gauge the generalization error of our currently tuned model to future cases. (After all, that's the value of predictive models; to predict fturue states or occurences! By initially dividing our data into one set that we will optimize and train our models on, and a second hold out set that we later verify our models on but never tune them against, we can better judge how well our models will generalize to future cases outside of the scope of current observations.

6. Split your data (including all of those feature engineered columns) into two sets; train and test. In other words, instead of simply X and respective Y datasets, you will now have 4 subsets: X_train, y_train, X_test, and y_test.

X_train, X_test, y_train, y_test = #Your code here

7. Train your model on the train set. [As before use RSS and gradient descent, but only use the training data.]

#Your code here

8. Evaluate your model on the test set.

#Your code here

Bonus:

Iterate over training size sets from 5%-95% of the total sample size and calculate both the training error (minimized rss) and the test error (rss) for each of these splits. Plot these two curves (train error vs. training size and test error vs. training size) on a graph.

View Residuals and Train Test Split on Learn.co and start learning to code for free.

ds-train-test-split-nyc-ds-062518's People

Contributors

mathymitchell avatar

Watchers

James Cloos avatar Joseph Szpigiel avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.