
Implementing Statistics with Functions - Lab

Introduction

In this lab you'll dive deep into calculating the measures of central tendency and dispersion introduced in previous lessons. You will code the formulas for these functions in Python which will require you to use the programming skills that you have gained in the other sections of this module. Let's get started!

Objectives

You will be able to:

  • Create functions to model measures of central tendency and dispersion
  • Perform basic statistical analysis of given data using measures of central tendency and dispersion

Dataset

For this lab, we'll use the NHIS dataset, which contains weights, heights, and some other attributes for a number of surveyed individuals. The context of this survey is outside the scope of this lab, so we'll just go ahead and load the heights column as a list for us to run some simple statistical experiments. We'll use the pandas library to import the data into our Python environment. This process will be covered in detail in the next section. For now, we'll do this part for you to give you a head start.

Run the cell below to import the data.

import pandas as pd
df = pd.read_csv('nhis.csv')
height = list(df['height'])
df.head()

We are only interested in the height column, so we saved it as a list in the variable height in the cell above.

In the cell below:

  • Print out the number of items in height
  • Slice and print out the first 10 items from height

# Expected Output: 4785
# Expected Output: [74, 70, 61, 68, 66, 98, 99, 70, 65, 64]
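
As a minimal sketch of the two tasks above, the snippet below uses a short, hypothetical stand-in list (sample_height is not part of the lab data) so that it runs without nhis.csv. With the real height list, the same two expressions apply.

```python
# Hypothetical stand-in for the real height list (illustration only)
sample_height = [74, 70, 61, 68, 66, 98, 99, 70, 65, 64, 72, 69]

# len() returns the number of items in a list
print(len(sample_height))

# A slice [:10] returns a new list with the first 10 items
print(sample_height[:10])
```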

So, we have around 4,800 records of height. That's great. Next, we'll try plotting some basic histograms for these records.

Plotting Histograms

We'll begin by importing the pyplot module from the library matplotlib and setting an alias of plt for it (so that we only have to type plt. instead of matplotlib.pyplot. each time we want to use it). Note that plt is the standard alias for matplotlib.pyplot.

Run the cell below to import matplotlib so that we can use it to create a histogram of our height data with 8 bins.

# Import matplotlib and plot histogram for height data
import matplotlib.pyplot as plt
%matplotlib inline  
# ^^This is a 'magic command' built into jupyter notebooks. We use it so that the visualization displays 
# in the notebook directly, instead of in a separate window.  

Next, we'll use matplotlib to create a histogram by passing in our data, as well as the parameter bins=8.

# Expected output below
plt.hist(height, bins=8)

Do you spot anything unusual above? Some outliers, maybe?

Measures of Central Tendency: Calculating the Mean

We're just beginning to dig into the data stored in height. We'll begin by writing a function to calculate the mean of the data. Recall the formula for calculating mean:

$$ \Large \bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_i $$

Using the Python skills you have learned so far, create a function get_mean() to perform the following tasks:

  • Input a list of numbers (like the height list we have above)
  • Calculate the sum of numbers and length of the list
  • Calculate the mean from the values above, round it to 2 decimal places, and return it.

def get_mean(data):

    mean = None
    
    return round(mean,2)

test1 = [5, 4, 1, 3, 2]
test2 = [4, 2, 3, 1]

print(get_mean(test1)) # 3.0
print(get_mean(test2)) # 2.5
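
One possible way to fill in the skeleton above is sketched below. It assumes a non-empty list of numbers and is only one of several reasonable solutions.

```python
def get_mean(data):
    # Sum of the values divided by the number of values
    # (assumes data is a non-empty list of numbers)
    mean = sum(data) / len(data)
    return round(mean, 2)

test1 = [5, 4, 1, 3, 2]
test2 = [4, 2, 3, 1]

print(get_mean(test1))  # 3.0
print(get_mean(test2))  # 2.5
```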

Now, we'll test the function by passing in the height list.

# After creating the function, pass the height list to the function 
mean = get_mean(height)

print("Sample Mean:", mean)

# Sample Mean: 69.58

So, we have our mean height, 69.58 inches, and this confirms our observations from the histogram. But we also have some outliers in our data above, and we know outliers affect the mean calculation by pulling the mean value in their direction. So, let's remove these outliers and create a new list to see whether our mean shifts or stays the same. We'll use a threshold of 80 inches, i.e. keep only values less than 80.

Perform the following tasks:

  • Create a function filter_list() that takes a list as input
  • Use a for loop to check each value and append it to a new list if it is less than 80
  • Return the new list

def filter_list(listA):
    
    listB = []
    
    # Perform filtering here in a for loop
    
    return listB

test = [60, 70, 80, 90]
filter_list(test) # [60, 70]
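
The filtering step described above could be sketched as follows. The threshold of 80 is hard-coded to match the lab text, and the comparison is strict, so a value of exactly 80 is dropped.

```python
def filter_list(listA):
    listB = []
    # Keep only the values strictly below the 80-inch threshold
    for value in listA:
        if value < 80:
            listB.append(value)
    return listB

test = [60, 70, 80, 90]
print(filter_list(test))  # [60, 70]
```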

Great, now we can use filter_list() to filter our height list and plot a new histogram to see if things change considerably.

# Filter the height list using the above function
filtered_height = None

Now that we have filtered the outliers out of our data, let's recreate our histogram with 8 bins using our filtered data.

NOTE: You do not need to reimport matplotlib -- once it's been imported, it's stored in memory and can be accessed whenever we like in other cells.

Since we've filtered our data to remove outliers, we should also recalculate the mean. Do this now in the cell below.

# Get the mean of the new filtered_height list using our get_mean() function


# 66.85

Does the mean height of our filtered data match up with what we see in our histogram of our filtered data?

Note that in some analytical situations we may not be able to exclude the outliers in such a naive manner. So, let's go ahead and calculate other measures of central tendency as well. We'll start by calculating the median value for our original (unfiltered) height data.

Calculating the Median

The median is the value directly in the middle of the dataset. In statistical terms, it is the second quartile (the 50th percentile). If the dataset were sorted from lowest value to highest value, the median would be the value that is larger than the lower 50% of the data and smaller than the upper 50%.

If the dataset has an odd number of values, then the median is the middle number. If the dataset has an even number of values, then we take the mean of the middle two numbers.

In the cell below, write a function that takes in a list of numbers and returns the median value for that dataset. Make sure you first check whether the number of data points is even or odd and perform the computation accordingly. A straightforward approach to calculating the median is as follows:

  1. Sort the data
  2. Check if the data has an even or odd number of data points
  3. Calculate the median of the sorted data now that you know if the data is even or odd.

(Hint: Recall that you can use the modulo operator % in Python to check if a value is even or odd -- odd numbers % 2 will equal 1, while even numbers % 2 will equal 0!)

def get_median(data):

    data_sorted = None
    
    # Check for even/odd and perform calculations accordingly - use if-else 
   
    pass

test1 = [5, 4, 1, 3, 2]
test2 = [4, 2, 3, 1]

print(get_median(test1)) # 3
print(get_median(test2)) # 2.5
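
A minimal sketch of the three steps above, assuming a non-empty list of numbers. It uses sorted() so that the caller's list is not modified.

```python
def get_median(data):
    # Step 1: sort a copy of the data
    data_sorted = sorted(data)
    n = len(data_sorted)
    mid = n // 2

    # Steps 2 and 3: odd length -> middle value,
    # even length -> mean of the two middle values
    if n % 2 == 1:
        return data_sorted[mid]
    return (data_sorted[mid - 1] + data_sorted[mid]) / 2

test1 = [5, 4, 1, 3, 2]
test2 = [4, 2, 3, 1]

print(get_median(test1))  # 3
print(get_median(test2))  # 2.5
```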

Great, now we can pass in our original height list to this function to check the median.

# Calculate the median of our original height list


# 67

So, we have 67, which is much closer to the mean of the filtered list (66.85) than to the mean we calculated with the original list (69.58). So, the median in this case seems to be a much better indicator of the central tendency found in the dataset. This makes sense because we've already learned that medians are less sensitive to outliers than mean values are!
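
To illustrate that point on a small scale, the sketch below uses Python's built-in statistics module and two made-up lists (not taken from the NHIS data). Adding a single large value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical heights, for illustration only
typical = [64, 65, 66, 67, 68]
with_outlier = typical + [99]

print(mean(typical), median(typical))            # 66 66
print(mean(with_outlier), median(with_outlier))  # 71.5 66.5
```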

Next, we'll calculate the mode. This could give us better insight into the typical values in the dataset based on how frequent a value is.

Calculating the Mode

The mode is the value that shows up most often in a dataset. A dataset can have zero or more modes. If no value shows up more than once, the dataset is considered to have no mode. If two values tie for the highest frequency, the dataset is considered bimodal. Datasets where more than two values tie for the highest frequency are considered multimodal.

In the cell below, write a function that takes in a list of numbers and returns another list containing the mode value(s). In the case of only one mode, the list would have a single element.

Hint: Building a Frequency Distribution table using dictionaries is probably the easiest way to approach this problem. Use each unique element from the height list as a key, and the frequency of this element as the value and build a dictionary. You can then simply identify the keys (heights) with maximum values.

def get_mode(data):

    # Create and populate frequency distribution
    frequency_dict = {}
    
    # For all elements in the list:
    # If an element is not in the dictionary, add it with value 1
    # If an element is already in the dictionary, +1 the value
    
    
    # Create a list for mode values
    modes = []
    
    # From the dictionary, add the element(s) with the maximum frequency to the modes list

    # Return the mode list 
    return modes

test1 = [1, 2, 3, 5, 5, 4]
test2 = [1, 1, 1, 2, 3, 4, 5, 5, 5]

print(get_mode(test1)) # [5]
print(get_mode(test2)) # [1, 5]
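
The frequency-dictionary approach from the hint could be sketched like this. Note one simplification: if every value appears exactly once, this version returns all of them rather than reporting "no mode".

```python
def get_mode(data):
    # Build the frequency distribution: value -> number of occurrences
    frequency_dict = {}
    for value in data:
        frequency_dict[value] = frequency_dict.get(value, 0) + 1

    # An empty input has no modes
    if not frequency_dict:
        return []

    # Collect every value whose count equals the highest count
    highest = max(frequency_dict.values())
    modes = [value for value, count in frequency_dict.items() if count == highest]
    return modes

test1 = [1, 2, 3, 5, 5, 4]
test2 = [1, 1, 1, 2, 3, 4, 5, 5, 5]

print(get_mode(test1))  # [5]
print(get_mode(test2))  # [1, 5]
```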

That's done. Now you can use the above function to calculate the mode of the original height list to compare it with our mean and median values.

# Calculate the mode of our original height list
get_mode(height)

So, the mode value is much lower than our mean and median calculated earlier. What do you make of this? The answer is somewhat subjective and depends on the problem. For example, if your problem is to identify the garment sizes that would sell the most, you cannot disregard the mode. However, if you want an idea of the general or typical height of individuals, you can probably still get that from the median and the mean.

To get an even clearer picture, we know we need to see how much the values deviate from the central values we have identified. We have seen variance and standard deviation before as measures of such dispersion. Let's have a go at these to strengthen our understanding of this data.

Calculating the Variance

The formula for variance is:

$$ \Large s^2 = \frac{1}{n - 1} \sum_{i=1}^{n}(x_i - \bar{x})^2 $$

Note that this formula is for the sample variance. It differs slightly from the formula for population variance, which divides by n rather than n - 1. In the cell below, write a function that takes a list of numbers as input and returns the variance (rounded to two decimal places) of the sample as output.

def get_variance(sample):

    # First, calculate the sample mean using get_mean()
    sample_mean = None
    
    # Now, subtract the sample mean from each point and square the result 
    
    # Divide the total by (number of items in the sample - 1) to calculate the sample variance
    variance = None
    
    return variance

test1 = [1, 2, 3, 5, 5, 4]
test2 = [1, 1, 1, 2, 3, 4, 5, 5, 5]
print(get_variance(test1)) # 2.67
print(get_mean(test1)) # 3.33
print(get_variance(test2)) # 3.25
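
A sketch of the sample variance under the formula above, dividing by n - 1. It repeats a small get_mean() helper so that the snippet runs on its own, and it assumes the sample has at least two values.

```python
def get_mean(data):
    # Mean rounded to 2 decimals, as in the earlier exercise
    return round(sum(data) / len(data), 2)

def get_variance(sample):
    sample_mean = get_mean(sample)

    # Sum of squared deviations from the sample mean
    squared_deviations = sum((x - sample_mean) ** 2 for x in sample)

    # Sample variance divides by n - 1 (requires at least two values)
    variance = squared_deviations / (len(sample) - 1)
    return round(variance, 2)

test1 = [1, 2, 3, 5, 5, 4]
test2 = [1, 1, 1, 2, 3, 4, 5, 5, 5]

print(get_variance(test1))  # 2.67
print(get_variance(test2))  # 3.25
```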

Now we can test the variance of our list height with our new get_variance() function.

# Calculate the variance of our original height list

# 87.74

So this value, as we learned earlier, tells us a bit about the deviation, but not in the units of the underlying data, because the deviations are squared. Standard deviation addresses this issue by taking the square root of the variance, which brings the result back to the original units. So that would probably be a bit more revealing.

Calculating the Standard Deviation

In the cell below, write a function that takes a list of numbers as input and returns the standard deviation of that sample as output.

Recall that the formula for Standard Deviation is:

$$ \Large s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2} $$

You will need the sqrt() function from the math library to calculate the square root.

Note: Another approach to finding the square root of a number is to raise that number to the power of 0.5. For example, if we wanted to find the square root of 100, we could raise 100 to the power of 0.5, which would give us the number 10.0. Either approach will work.

100**0.5
# 10.0

from math import sqrt
sqrt(100)
# 10.0

from math import sqrt

def get_stddev(data):  # parameter renamed so it does not shadow the built-in list

    mean = None
    
    stddev = None
    
    return round(stddev, 2)

test = [120,112,131,211,312,90]

print(get_stddev(test))
# 84.03
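
The standard deviation formula above could be sketched as follows. This version computes the mean inline (unrounded) instead of reusing get_mean(), and it assumes the sample has at least two values.

```python
from math import sqrt

def get_stddev(data):
    n = len(data)
    mean = sum(data) / n

    # Sample variance: squared deviations divided by n - 1
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)

    # Standard deviation is the square root of the variance
    stddev = sqrt(variance)
    return round(stddev, 2)

test = [120, 112, 131, 211, 312, 90]

print(get_stddev(test))  # 84.03
```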

So now we can finally calculate the standard deviation for our height list and inspect the results.

# Calculate the standard deviation of our original height list

# 9.37

So the standard deviation of our dataset is 9.37 inches. As we are still including the outlier values, this figure might be slightly inflated, but it is expressed in the same units as the data, which makes it easier to interpret than the variance.

Finally, we will build a boxplot for the height data and see if it agrees with the understanding of this data that we have developed up to this point. Use matplotlib's boxplot() function with the height data and comment on the output.

# boxplot here

Summary

In this lab, we performed a basic, yet detailed, statistical analysis of the center and spread of a given dataset. We built a number of functions to calculate different measures and also used some statistical visualizations to strengthen our intuitions about the dataset. Later, we'll see how the numpy and pandas libraries simplify this process and reduce the amount of code needed to calculate basic statistics.
