dsc-pandas-groupby's Introduction

Pandas Groupby

Introduction

In this lab, you'll learn how to use the .groupby() method in Pandas to summarize datasets.

Objectives

You will be able to:

  • Use groupby methods to aggregate different groups in a dataframe

Using .groupby()

Let's bring in the Titanic data set.

import pandas as pd

df = pd.read_csv('titanic.csv')
df = df.drop(columns=['Name','Ticket','Embarked', 'Cabin'])
df = df.dropna()
df.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare
0          1.0       0.0       3    male  22.0    1.0    0.0   7.2500
1          2.0       1.0       1  female  38.0    1.0    0.0  71.2833
2          3.0       1.0       3  female  26.0    0.0    0.0   7.9250
3          4.0       1.0       1  female  35.0    1.0    0.0  53.1000
4          5.0       0.0       3    male  35.0    0.0    0.0   8.0500

During the Exploratory Data Analysis phase, one of the most common tasks is splitting the dataset into subgroups and comparing them to see if you can notice any trends. For instance, you may want to group the passengers together by gender or age. You can do this by using the .groupby() method built into pandas DataFrames.

To group passengers by gender, you would type:

df.groupby('Sex')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbea9576100>
# This line of code is equivalent to the one above
df.groupby(df['Sex'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbea95525e0>

Note that this alone will not display a result. Although you have split the dataset into groups, there is nothing meaningful to display until you chain an aggregation function onto the groupby, which computes summary statistics for each group.
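To make the grouped object less abstract, here is a minimal sketch of how you might peek inside one before aggregating. The `sample` DataFrame is a hypothetical stand-in with invented values, not the Titanic file itself.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Survived': [0, 1, 1, 0, 1],
})

grouped_sample = sample.groupby('Sex')

# Number of distinct groups (one per unique value of 'Sex')
print(grouped_sample.ngroups)

# Iterating yields (group label, sub-DataFrame) pairs
for label, subset in grouped_sample:
    print(label, len(subset))

# Pull out a single group as an ordinary DataFrame
females = grouped_sample.get_group('female')
print(females)
```

Each group is just a subset of the original rows, which is why nothing useful is displayed until an aggregation reduces each subset to a summary.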

You can quickly use an aggregation function by chaining the call to the end of the .groupby() method.

# numeric_only=True keeps non-numeric columns such as Pclass out of the sum
df.groupby('Sex').sum(numeric_only=True)
        PassengerId  Survived       Age  SibSp  Parch        Fare
Sex
female     267590.0     284.0  12812.85  838.0  765.0  19208.2047
male       384203.0     191.0  23133.01  997.0  775.0  21465.1410

Aggregation functions help you quickly compare subsets of your data. For example, the aggregate statistics displayed above show that there were more female survivors overall (284) than male survivors (191).
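Raw totals can be hard to compare when groups differ in size, so it may help to look at the mean of a 0/1 column as well, since that gives the proportion of each group with the value 1. The sketch below uses a hypothetical `toy` DataFrame with invented values to contrast the two.

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 0],
})

by_sex = toy.groupby('Sex')['Survived']

survivor_counts = by_sex.sum()   # number of survivors per group
survival_rates = by_sex.mean()   # fraction of each group that survived

print(survivor_counts)
print(survival_rates)
```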

Aggregation functions

There are many built-in aggregate methods provided for you in the pandas package, and you can even write and apply your own. Some of the most common aggregate methods you may want to use are:

  • .min(): returns the minimum value for each column by group
  • .max(): returns the maximum value for each column by group
  • .mean(): returns the average value for each column by group
  • .median(): returns the median value for each column by group
  • .count(): returns the number of non-null values in each column by group
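As a rough illustration of the methods listed above, plus a user-defined aggregation, consider the following sketch. The `sample` data and the `value_range` helper are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 8.05, 53.10],
})

grouped_sample = sample.groupby('Sex')

# Built-in aggregations return one row per group
print(grouped_sample.min())
print(grouped_sample.median())
print(grouped_sample.count())

# A user-defined aggregation: the spread between the largest and smallest value
def value_range(series):
    return series.max() - series.min()

age_range = grouped_sample['Age'].agg(value_range)
print(age_range)
```

Passing a function to .agg() is one way to apply your own aggregation; any function that reduces a Series to a single value should work in the same position.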

You can also see a list of all of the built-in aggregation methods by creating a grouped object and then using tab completion to inspect the available methods:

grouped_df = df.groupby('Sex')
# For the following line of code, remove the `#` and then press Tab after the period.
#grouped_df.

Tab completion lists every built-in method available on grouped objects. Note that some are aggregation methods, while others, such as grouped_df.fillna(), fill missing values within each group independently.
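One commonly used way to fill missing values group by group is to combine .transform() with an ordinary .fillna(). Note that this is a different technique from calling .fillna() on the grouped object directly. The sketch below assumes a hypothetical `sample` DataFrame with two missing ages.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing ages (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 36.0, 26.0, np.nan],
})

# Mean age of each row's own group, aligned to the original index
group_mean_age = sample.groupby('Sex')['Age'].transform('mean')

# Fill each missing age with the mean age of that passenger's group
sample['Age_filled'] = sample['Age'].fillna(group_mean_age)
print(sample)
```

Unlike an aggregation, .transform() returns a result with the same length as the original data, which is what lets it line up with the rows that need filling.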

Multiple Aggregations

A grouped object can also run several different aggregations at once if you call .agg() instead of a single aggregation method. You can pass .agg() a Python dictionary in which the keys are the names of the columns you want to aggregate and the values are the names, as strings, of the aggregation methods to apply.

df.groupby('Sex').agg({'PassengerId':'count',
                       'Survived':'sum',
                       'Age':'mean'})
        PassengerId  Survived        Age
Sex
female          443     284.0  28.922912
male            766     191.0  30.199752

In the cell above, we returned three different aggregations on three separate columns. We counted the number of individuals using 'PassengerId':'count', looked at the number of people who survived via 'Survived':'sum', and returned the mean age via 'Age':'mean', all grouped by Sex.
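Two variations on the dictionary form may be worth knowing: a list of method names applies several aggregations to a single column, and keyword arguments of the form `output_name=(column, method)` let you name the result columns. The sketch below uses a hypothetical `sample` DataFrame, and the output names (`passengers`, `survivors`, `mean_age`) are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Survived': [0, 1, 1, 0, 1],
})

# A list of method names applies several aggregations to one column
age_stats = sample.groupby('Sex').agg({'Age': ['min', 'max', 'mean']})
print(age_stats)

# Named aggregation: output_name=(column, method) gives flat, readable column names
summary = sample.groupby('Sex').agg(
    passengers=('Survived', 'count'),
    survivors=('Survived', 'sum'),
    mean_age=('Age', 'mean'),
)
print(summary)
```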

Multiple groups

You can also split data into multiple levels of groups by passing in a list containing the name of every column you want to group by -- for instance, by every combination of both Sex and Pclass.

df.groupby(['Sex', 'Pclass']).mean()
               PassengerId  Survived        Age     SibSp     Parch       Fare
Sex    Pclass
female 1        594.965812  0.811966  34.098291  1.521368  1.538462  84.552209
       2        602.647059  0.722689  26.338992  1.605042  1.596639  26.989777
       3        550.912162  0.466216  25.677973  1.858108  1.810811  21.144596
       ?        758.118644  0.576271  32.011356  3.288136  2.152542  50.413771
male   1        601.886792  0.415094  38.287799  1.440252  1.490566  56.046671
       2        587.170068  0.258503  31.630340  1.414966  1.122449  29.693905
       3        377.919060  0.151436  25.757624  0.973890  0.506527  15.446343
       ?        746.051948  0.376623  32.862597  2.428571  2.324675  29.516452
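Grouping by more than one column produces a result with a multi-level row index, which can be awkward to read or pass along. As a sketch, here are two ways you might reshape such a result. The `sample` DataFrame is hypothetical, and it stores Pclass as strings on the assumption that the lesson's data does the same.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': ['3', '1', '3', '1', '1', '3'],
    'Survived': [0, 1, 1, 0, 1, 0],
    'Fare': [7.25, 71.28, 7.93, 53.10, 51.86, 8.05],
})

# Grouping by two columns produces a two-level (MultiIndex) row index
by_sex_class = sample.groupby(['Sex', 'Pclass']).mean()
print(by_sex_class.index.names)

# reset_index() turns the group keys back into ordinary columns
flat = by_sex_class.reset_index()
print(flat)

# unstack() pivots the inner level (Pclass) into columns for side-by-side comparison
survival_table = by_sex_class['Survived'].unstack()
print(survival_table)
```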

Selecting information from grouped objects

A grouped object supports column selection in the same way a DataFrame does, so you can select just the columns you're interested in before aggregating. The same selection also works on the DataFrame that an aggregation returns.

The example below demonstrates the syntax for returning the mean of the Survived column for every combination of Sex and Pclass:

df.groupby(['Sex', 'Pclass'])['Survived'].mean()
Sex     Pclass
female  1         0.811966
        2         0.722689
        3         0.466216
        ?         0.576271
male    1         0.415094
        2         0.258503
        3         0.151436
        ?         0.376623
Name: Survived, dtype: float64

The above example slices by column, but you can also slice by index. Take a look:

grouped = df.groupby(['Sex', 'Pclass'])['Survived'].mean()
grouped['female']
Pclass
1    0.811966
2    0.722689
3    0.466216
?    0.576271
Name: Survived, dtype: float64
# Using string index label
grouped['female']['1']
0.811965811965812
# Same result using zero-based positional indexing
grouped['female'].iloc[0]
0.811965811965812

Note that supplying only the value 'female' as the index returns every group in which the passenger is female, regardless of the Pclass value. The second example narrows this down to female passengers with a 1st-class ticket.
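Chained indexing is not the only way to reach a value in a multi-level result. The sketch below shows two alternatives, .loc with a tuple and .xs() for a cross-section on the inner level. It uses a hypothetical `sample` DataFrame with Pclass stored as strings, and its numbers are unrelated to the Titanic results above.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': ['3', '1', '3', '1', '1', '3'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

rates = sample.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Select one (Sex, Pclass) combination in a single step with a tuple
female_first = rates.loc[('female', '1')]
print(female_first)

# Cross-section on the inner level: first-class rates for every value of Sex
first_class = rates.xs('1', level='Pclass')
print(first_class)
```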

Summary

In this lab, you learned about how to split a DataFrame into subgroups using the .groupby() method. You also learned to generate aggregate views of these groups by applying built-in methods to a groupby object.

Contributors

loredirick, cheffrey2000, mike-kane, bpurdy-ds, danielburdeno, peterbell, sumedh10, mathymitchell
