Data Cleaning in Pandas

Course: Data Science
Mod: Module 1
Topic: Pandas 3 - Data Cleaning
Amount of time: 90 minutes
Author: Miles Erickson
Ported from the ds-lesson-starters-repo here.

Lesson Summary:

Topic:

Pandas 3 - Data Cleaning in Pandas

Learn.co material:

(link to github)

Prerequisite knowledge/Prework:

Pandas lessons 1 and 2

Learning goals for this lesson:

Students will be able to:

1. Describe a clean dataset
- Explain Hadley Wickham's concept of "tidy" data
- Describe the concept of an Analytics Base Table (ABT)
2. Handle missing and invalid values in a dataset
- Describe when it is appropriate to discard vs. impute
- Add an indicator feature before imputing missing values
3. Create a repeatable Python function to load and prepare raw data

Misconceptions:

As a data scientist, I can expect to begin with a clean dataset.
When we have missing or invalid data, we can just drop it.
It is acceptable to modify the original raw data in the data cleaning process.
Data cleaning is a one-time step at the beginning of a project. It does not need to be repeatable.

Materials

Instructor jupyter notebook files (TODO link here)
King County Real Estate dataset
- Real Property Sales
- Residential Buildings
Student jupyter notebook files (TODO link here)

Lesson Outline:

Step: Introduction and outline of lesson
Time: 5 min

Goal/Scenario:
As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

Learning Goals in sequence:

1. Describe a clean dataset

Explain Hadley Wickham's concept of "tidy" data
Describe the concept of an Analytics Base Table (ABT)

2. Handle missing and invalid values in a dataset

Describe when it is appropriate to discard vs. impute
Add an indicator feature before imputing missing values

3. Create a repeatable Python function to load and prepare raw data

Describe how to convert categorical fields into binary values
Explain why pd.get_dummies is not repeatable on new data
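
The repeatability problem with pd.get_dummies can be illustrated with a minimal sketch. The `heat_source` column, its values, and the `encode_heat_source` helper are hypothetical and are not taken from the King County data. The sketch shows that the output columns depend on whichever values happen to be present, and one possible workaround (fixing the category list up front), which is not necessarily the approach used in the lesson notebook.

```python
import pandas as pd

# Hypothetical training data and new data with a different set of values.
train = pd.DataFrame({"heat_source": ["gas", "electric", "oil"]})
new = pd.DataFrame({"heat_source": ["gas", "solar"]})

# get_dummies derives its columns from the values it sees,
# so the two calls produce different column sets.
train_cols = list(pd.get_dummies(train["heat_source"]).columns)
new_cols = list(pd.get_dummies(new["heat_source"]).columns)

# One workaround (an assumption, not the only option): declare the
# categories once, so every call yields the same columns.
CATEGORIES = ["electric", "gas", "oil"]

def encode_heat_source(df):
    """Encode heat_source with a fixed set of dummy columns."""
    dtype = pd.CategoricalDtype(categories=CATEGORIES)
    fixed = df["heat_source"].astype(dtype)  # unseen values become NaN
    return pd.get_dummies(fixed, prefix="heat", dtype=int)

encoded_new = encode_heat_source(new)
```

With a fixed category list, a value that was never seen in training (here, "solar") is encoded as a row of zeros rather than as a new column.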

Step: Activation
Time: 10 min

Discussion prompts:

What happens if you try to open this dataset in Google Sheets or Excel? (Note: as of Office ~2017, Excel is unable to load the Real Property Sales file.)
Can Pandas load the data? (Yes.)
Where can we find the square footage of a house in this dataset? (A: in the Residential Buildings file)
Where can we find the sale prices of houses sold in 2019? (A: in the Real Property Sales file)

Step: Learning Goal 1: Describe a clean dataset
Time: 20 min

Demonstrate: 10 min

Slides: Clean and Tidy Data, definition and context (6 min) (TODO)
Show what the dataset will look like after preparation -- 3 numeric columns: year sold, square footage, sale price (4 min)

Application: 5 min

Think, pair, share: what are the key differences between the raw dataset and the clean dataset?

Informal assessment: 5 min

Slides: 4-5 slides showing examples of a few rows of data (TODO).
- Is it tidy? (entire class)
- If not, what could we do to make it tidy? (give all students time to think, cold call)
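
As one possible example for the "make it tidy" discussion, the sketch below reshapes a small, made-up "wide" table (one column per sale year) into a tidy one, where each row is a single observation and each column is a single variable. The table and its column names are invented for illustration and are not part of the lesson slides.

```python
import pandas as pd

# Hypothetical untidy table: the year is stored in the column names.
wide = pd.DataFrame({
    "house_id": [1, 2],
    "price_2018": [500000, 650000],
    "price_2019": [520000, 700000],
})

# Melt so each row is one (house, year) observation.
tidy = wide.melt(id_vars="house_id", var_name="year", value_name="sale_price")

# Recover the year as a number from the old column names.
tidy["year"] = tidy["year"].str.replace("price_", "", regex=False).astype(int)
tidy = tidy.sort_values(["house_id", "year"]).reset_index(drop=True)
```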

Step: Learning Goal 2: Handle missing and invalid values in a dataset
Time: 20 min

Demonstrate: 10 min

Slides: Handling Missing Values (TODO)

Instructor demo: Students are asked to close their laptops, follow along, interrupt, ask questions.

Load the dataset (two dataframes)
Join the two dataframes (review from Pandas 2)
Subset rows to sales from 2019
Subset columns to square footage, sale price
Identify suspicious values for sale price ($1, etc.)
Identify missing values for square footage
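
The demo steps above might look roughly like the following sketch. Small inline dataframes stand in for the two files, which would normally be loaded with `pd.read_csv`. The column names (`Major`, `Minor`, `DocumentDate`, `SalePrice`, `SqFtTotLiving`), the date format, and the $1,000 threshold for "suspicious" prices are assumptions made for illustration and should be checked against the actual files.

```python
import numpy as np
import pandas as pd

# Stand-ins for the Real Property Sales and Residential Buildings files.
sales = pd.DataFrame({
    "Major": [1, 1, 2, 3],
    "Minor": [10, 20, 10, 10],
    "DocumentDate": ["03/15/2019", "07/01/2018", "11/20/2019", "05/05/2019"],
    "SalePrice": [550000, 480000, 1, 720000],
})
buildings = pd.DataFrame({
    "Major": [1, 1, 2, 3],
    "Minor": [10, 20, 10, 10],
    "SqFtTotLiving": [1500.0, 1100.0, 2200.0, np.nan],
})

# Join the two dataframes on the (assumed) parcel key columns.
joined = sales.merge(buildings, on=["Major", "Minor"], how="inner")

# Subset rows to sales from 2019.
joined["DocumentDate"] = pd.to_datetime(joined["DocumentDate"], format="%m/%d/%Y")
sales_2019 = joined[joined["DocumentDate"].dt.year == 2019]

# Subset columns to square footage and sale price.
subset = sales_2019[["SqFtTotLiving", "SalePrice"]]

# Flag suspiciously low sale prices and count missing square footage.
suspicious = subset[subset["SalePrice"] < 1000]
n_missing_sqft = subset["SqFtTotLiving"].isna().sum()
```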

Application: 5 min

Think-pair-share: in this context, what's the best thing to do with the suspicious sale price values?

Informal assessment: 5 min

You have a million rows of data, but 200 rows are missing the target value that you want to predict with your model. What's the best thing to do?
You have a dataset with many columns, and nearly every row is missing data in one or two columns. What's a reasonable approach in this case?
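
For the second prompt, one approach students may propose is to impute the missing values after adding an indicator feature, so the model can still tell which values were originally missing. The sketch below is a minimal illustration on made-up data. The `sqft` column and the choice of the median as the fill value are assumptions, not a prescribed answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing square footage.
df = pd.DataFrame({
    "sqft": [1200.0, np.nan, 2000.0, np.nan],
    "sale_price": [400000, 450000, 700000, 520000],
})

# Record which rows were missing BEFORE filling them in.
df["sqft_missing"] = df["sqft"].isna().astype(int)

# Impute with the median of the observed values (one choice among several).
median_sqft = df["sqft"].median()
df["sqft"] = df["sqft"].fillna(median_sqft)
```

The order matters: if the fill happens first, the information about which rows were missing is lost.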

Step: Learning Goal 3: Create a repeatable Python function to load and prepare raw data
Time: 15 min

Demonstrate: 10 min

Point: A key difference between a data analyst and a data scientist is that data scientists do reproducible work. An analyst might or might not document data preparation steps, whereas data scientists document their work in code.

Remind students that they are not expected to follow along and "keep up" at this point.

Demonstrate wrapping the code from above into a Python function that takes the raw dataframe as an argument, and returns the clean data.
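
A minimal sketch of such a function is shown below. The function name `prepare_sales`, its parameters, and the column names are hypothetical, and the cleaning choices (dropping prices below a threshold, dropping rows with missing square footage) are just one option for the demo. It works on a copy, so the raw dataframe is left unmodified.

```python
import numpy as np
import pandas as pd

def prepare_sales(raw, year=2019, min_price=1000):
    """Return a clean table: year sold, square footage, sale price."""
    df = raw.copy()  # never modify the raw data in place
    df["DocumentDate"] = pd.to_datetime(df["DocumentDate"], format="%m/%d/%Y")
    df["YearSold"] = df["DocumentDate"].dt.year
    df = df[df["YearSold"] == year]            # keep sales from the target year
    df = df[df["SalePrice"] >= min_price]      # discard suspicious prices
    df = df.dropna(subset=["SqFtTotLiving"])   # discard missing square footage
    return df[["YearSold", "SqFtTotLiving", "SalePrice"]].reset_index(drop=True)

# Usage on a small, made-up joined dataframe.
raw = pd.DataFrame({
    "DocumentDate": ["03/15/2019", "07/01/2018", "11/20/2019", "05/05/2019"],
    "SalePrice": [550000, 480000, 1, 720000],
    "SqFtTotLiving": [1500.0, 1100.0, 2200.0, np.nan],
})
clean = prepare_sales(raw)
```

Because every step lives in one function, the same preparation can be rerun on a fresh download of the raw data.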

Application & Informal Assessment: 5 min

Why is it important for data scientists to document their data preparation steps in code?
What role do (potentially familiar) graphical tools like Microsoft Excel play in this type of workflow?

Step: Assessment
Time: 5 min

Introduce lab exercise w/ scaffolding (load/join complete): students pair on data preparation.

#TODO discuss with group: is this an appropriate use of this space?

Step: Reflection
Time: 5 min

Review objectives
