Generating Data - Lab

Introduction

In this lab, we will practice some of the data generation techniques from the previous lesson to generate datasets for regression and classification. We will run a couple of simple simulations that produce different datasets by controlling the noise and variance parameters of the data generation process. We will also look at statistical indicators and visual output to see how these parameters affect the accuracy of an algorithm.

Objectives

In this lab you will:

  • Generate datasets for classification problems
  • Generate datasets for regression problems

Generate data for classification

Use make_blobs() to create a binary classification dataset with 100 samples, 2 features, and 2 centers (where each center corresponds to a different class label). Set random_state = 42 for reproducibility.

Hint: Here's a link to the documentation for make_blobs().

# Your code here 
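As a reference point, here is a minimal sketch of one way to approach this step, assuming scikit-learn is installed. The variable names X and y are illustrative choices, not names the lab prescribes.

```python
from sklearn.datasets import make_blobs

# 100 samples, 2 features, 2 centers (one center per class label).
# A fixed random_state makes the generated data reproducible.
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=42)

print(X.shape)  # (100, 2)
print(y.shape)  # (100,)
```

X holds the two feature columns and y holds the class label (0 or 1) of each sample.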

Place the data in a pandas DataFrame called df, and inspect the first five rows of the data.

Hint: Your dataframe should have three columns in total, two for the features and one for the class label.

# Your code here 
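One possible way to build the DataFrame is sketched below. The column names "x", "y", and "label" are assumptions made for illustration; any three descriptive names would do.

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=42)

# Two feature columns plus one class-label column.
# The column names are illustrative, not prescribed by the lab.
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label": y})

# Inspect the first five rows
print(df.head())
```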

Create a scatter plot of the data, while color-coding the different classes.

Hint: You may find this dictionary mapping class labels to colors useful: colors = {0: 'red', 1: 'blue'}

# Your code here 
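A hedged sketch of the color-coded scatter plot, assuming matplotlib is available and reusing the illustrative column names "x", "y", and "label" (these names are not given by the lab). Each class is drawn separately so that it gets its own color and legend entry.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=42)
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label": y})

# Dictionary from the hint: class label -> color
colors = {0: "red", 1: "blue"}

fig, ax = plt.subplots()
# Plot one group per class so each class gets its own color and legend entry
for label, group in df.groupby("label"):
    ax.scatter(group["x"], group["y"], color=colors[label], label=f"class {label}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a plain script, call plt.show() afterwards.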

Repeat this exercise twice, once with cluster_std = 0.5 and once with cluster_std = 2.

Keep all other parameters passed to make_blobs() equal.

That is:

  • Create a classification dataset with 100 samples, 2 features, and 2 centers using make_blobs()
    • Set random_state = 42 for reproducibility, and pass the appropriate value for cluster_std
  • Place the data in a pandas DataFrame called df
  • Plot the values on a scatter plot, while color-coding the different classes

What is the effect of changing cluster_std based on your plots?

# Your code here: 
# cluster_std = 0.5
# Your code here: 
# cluster_std = 2
# Your comments here
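The comparison can be sketched with a small helper that draws both settings side by side. The helper name plot_blobs and the column names are hypothetical, introduced only for this illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs

colors = {0: "red", 1: "blue"}

def plot_blobs(cluster_std, ax):
    """Hypothetical helper: generate blobs with the given spread and plot them on ax."""
    X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                      cluster_std=cluster_std, random_state=42)
    df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label": y})
    # Map each class label to its color
    ax.scatter(df["x"], df["y"], c=[colors[label] for label in df["label"]])
    ax.set_title(f"cluster_std = {cluster_std}")
    return df

# Shared axes make the difference in spread directly comparable
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
df_tight = plot_blobs(0.5, axes[0])
df_wide = plot_blobs(2, axes[1])

# Per-class standard deviation of each feature, as a numeric check on the plots
print(df_tight.groupby("label")[["x", "y"]].std())
print(df_wide.groupby("label")[["x", "y"]].std())
```

With the seed held fixed, the plots should show points spreading further around each center as cluster_std grows, so the classes become harder to separate.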

Generate data for regression

Create a function reg_simulation() that runs a regression simulation, generating one dataset per call with the make_regression() data generation function. Perform the following tasks:

  • Create reg_simulation() with n (noise) and random_state as input parameters

    • Make a regression dataset (X, y) with 100 samples using a given noise value and random state
    • Plot the data as a scatter plot
    • Calculate and plot a regression line on the plot and calculate $R^2$ (you can do this in statsmodels or sklearn)
    • Label the plot with the noise value and the calculated $R^2$
  • Pass a fixed random state and values from [10, 25, 40, 50, 100, 200] as noise values iteratively to the function above

  • Inspect and comment on the output

# Import necessary libraries


def reg_simulation(n, random_state):
    
    # Generate X and y

    # Use X,y to draw a scatter plot
    # Fit a linear regression model to X , y and calculate r2
    # label and plot the regression line 
    pass


random_state = np.random.RandomState(42)

for n in [10, 25, 40, 50, 100, 200]:
    reg_simulation(n, random_state)
# Your comments here
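One possible implementation of reg_simulation() is sketched below using sklearn. It makes two assumptions the lab does not state: the dataset has a single feature (n_features=1) so that it can be drawn in two dimensions, and the function returns $R^2$ so the values can be compared afterwards.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def reg_simulation(n, random_state):
    # Generate X and y; n_features=1 is an assumption made for 2D plotting
    X, y = make_regression(n_samples=100, n_features=1, noise=n,
                           random_state=random_state)

    # Fit a linear regression model and compute R^2 on the same data
    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)

    # Scatter plot of the data with the fitted regression line on top
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], y, color="red", s=10, label="data")
    ax.plot(X[:, 0], model.predict(X), color="black", label="regression line")
    ax.set_title(f"noise = {n}, R^2 = {r2:.2f}")
    ax.legend()
    return r2

# An integer seed regenerates the same underlying X for every noise level
scores = {n: reg_simulation(n, random_state=42) for n in [10, 25, 40, 50, 100, 200]}
for n, r2 in scores.items():
    print(f"noise={n}: R^2={r2:.3f}")
```

Passing an integer seed keeps the underlying samples identical across calls, so only the noise level changes between plots. A shared np.random.RandomState instance, by contrast, advances with each call and yields a different dataset every time. In general, $R^2$ is expected to fall as the noise grows, since the points scatter further from the regression line.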

Summary

In this lab, we learned how to generate random datasets for classification and regression problems. We ran simulations and fitted simple models to see, both visually and through metrics such as $R^2$, how data parameters such as noise level and standard deviation affect model performance. These skills will come in handy when testing model performance and robustness in the future.
