Data Visualization with Pandas

Introduction

In this lesson, we will look at data visualization using Pandas and Matplotlib, two modules that we have already seen and used. Pandas uses Matplotlib under the hood for data visualization and provides convenient functions for plotting data directly from DataFrames.

Objectives

You will be able to:

  • Understand the relation between pandas and matplotlib plots and their attributes
  • Plot data from single variables using scatter plots, histograms, line plots, boxplots and KDE plots in pandas
  • Plot multidimensional data using scatter matrix and parallel coordinate plots.

Styling a Plot

Before we dive into data visualization in Pandas, it is worth taking a quick look at Matplotlib's style package. Matplotlib comes with a number of predefined styles for customizing plots. These styles generally change the look of plots by changing color maps, line styles, backgrounds, etc. Because Pandas is built on Matplotlib for visualizations, choosing a style changes the look of our Pandas graphs as well, as we'll see below.

We can use plt.style.available to see a list of the predefined styles available in Matplotlib. The %matplotlib notebook magic below renders plots as interactive figures inside Jupyter notebooks.

import matplotlib.pyplot as plt
%matplotlib notebook
plt.style.available
['seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-whitegrid',
 'classic',
 '_classic_test',
 'fast',
 'seaborn-talk',
 'seaborn-dark-palette',
 'seaborn-bright',
 'seaborn-pastel',
 'grayscale',
 'seaborn-notebook',
 'ggplot',
 'seaborn-colorblind',
 'seaborn-muted',
 'seaborn',
 'Solarize_Light2',
 'seaborn-paper',
 'bmh',
 'seaborn-white',
 'dark_background',
 'seaborn-poster',
 'seaborn-deep']

This gives us a list of the available styles. To use a style, we simply call plt.style.use(<style name>). Let's use ggplot for now and see how it changes the default style. Feel free to try other styles and see how they impact the look and feel of the plots!

plt.style.use('ggplot')
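
plt.style.use changes the style for the rest of the session. If you only want to try a style on a single figure, Matplotlib also offers the plt.style.context context manager. The sketch below is a minimal illustration under a few assumptions: it uses the headless Agg backend so that it runs outside a notebook, and the helper pick_style is a hypothetical convenience, not part of Matplotlib. The helper exists because style names vary between releases (for example, newer versions rename the seaborn-* styles to seaborn-v0_8-*), so a name from the list above may not exist in your install.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs outside Jupyter
import matplotlib.pyplot as plt


def pick_style(preferred, fallback="default"):
    """Hypothetical helper: use `preferred` if this install ships it, else `fallback`."""
    return preferred if preferred in plt.style.available else fallback


style = pick_style("dark_background")

face_before = plt.rcParams["axes.facecolor"]
with plt.style.context(style):
    # The style is active only inside this block
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    face_inside = plt.rcParams["axes.facecolor"]
face_after = plt.rcParams["axes.facecolor"]  # restored once the block exits

print(style, face_before, face_inside, face_after)
plt.close(fig)
```

The axes background color changes inside the with block and returns to its previous value afterwards, which makes this a convenient way to compare styles without disturbing the one set above.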

Create a dataset for visualization

Pandas offers excellent built-in visualization features. They are particularly useful for exploratory analysis of data stored as a Pandas Series or DataFrame.

Let's build a synthetic temporal DataFrame with the following steps:

  • Create a DataFrame with three columns: A, B and C
  • For each column, use a random number generator, np.random.randn(), to generate 365 numbers (one for each day of the year).
  • Using numpy's cumsum (cumulative sum) method, cumulatively sum the generated random numbers in each column.
  • Offset column B by +25 and column C by -25, leaving column A unchanged.
  • Using pd.date_range, set the index to be every day in 2018 (starting from the 1st of January).

We'll also set a seed for controlling the randomization, allowing us to reproduce the data.

It is always a good idea to set a random seed when dealing with probabilistic outputs.
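
To make the effect of seeding concrete, here is a minimal sketch. The variable names are illustrative only. Resetting the seed restarts the same pseudo-random sequence, while drawing again without reseeding simply continues it.

```python
import numpy as np

np.random.seed(777)
first = np.random.randn(5)

np.random.seed(777)          # resetting the seed restarts the same sequence
second = np.random.randn(5)

third = np.random.randn(5)   # no reseed: the generator continues where it left off

print(np.array_equal(first, second))  # True: same seed, same numbers
print(np.array_equal(first, third))
```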

Let's give this a go:

import pandas as pd
import numpy as np

np.random.seed(777)

data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))
data.head()
                   A          B          C
2018-01-01 -0.468209  25.435990 -22.997943
2018-01-02 -1.291034  26.479220 -22.673404
2018-01-03 -1.356414  25.832356 -21.669027
2018-01-04 -2.069776  26.456703 -21.408310
2018-01-05 -1.163425  25.864281 -22.685208

Now we have a dataset with three columns, each of which we can treat as a time series. Let's inspect our data visually. To plot this data, we can simply call the .plot() method on the DataFrame.

data.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a208d2cf8>

So we didn't have to define our canvas, axes, labels, etc. This is where Pandas really shines. The DataFrame.plot() method is a simple wrapper around plt.plot() that draws line plots by default. So when we call data.plot(), we get a line graph of all the columns in the DataFrame, with a legend labeling each column.
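
Because the plot is drawn by Matplotlib, the object that data.plot() returns is an ordinary Matplotlib Axes, so the usual Axes methods can be used to refine it. The following is a minimal sketch, assuming the headless Agg backend so it also runs outside a notebook. The title and axis label are made-up examples.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild the lesson's dataset so the example is self-contained
np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

ax = data.plot()                                # returns a Matplotlib Axes
ax.set_title('Cumulative random walks, 2018')   # illustrative title
ax.set_ylabel('Value')                          # illustrative label

# The legend entries come from the DataFrame's column names
labels = ax.get_legend_handles_labels()[1]
print(type(ax).__name__, labels)
plt.close(ax.figure)
```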

Also, notice how this plot differs from Matplotlib's default look and feel. This is because of the style we set earlier. Additionally, %matplotlib notebook makes the plots interactive. Try clicking, dragging and zooming on the plot above to see how this works.

Try changing to a different style and see which one you prefer.

Scatter Plots

DataFrame.plot() allows us to draw a number of different kinds of plots. We select the kind of plot we want by passing it to the kind parameter. Here is the complete list from the documentation:

kind : str

‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot
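
As a quick illustration of the kind parameter, the sketch below draws a histogram and a boxplot of the same dataset side by side. It is only a sketch under stated assumptions: the Agg backend is used so it runs outside a notebook, and the bin count, transparency and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild the lesson's dataset so the example is self-contained
np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

# One Matplotlib figure with two axes; pandas draws into the axes we pass via `ax`
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
hist_ax = data.plot(kind='hist', bins=20, alpha=0.5, ax=axes[0], title='hist')
box_ax = data.plot(kind='box', ax=axes[1], title='box')

fig.tight_layout()
plt.close(fig)
```

Passing ax lets several pandas plots share one Matplotlib figure, which is handy when comparing plot kinds.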

Let's create a scatter plot of the A and B columns of data. We pass "scatter" to the kind parameter to change the plot type. Also note that putting a semicolon at the end of a plotting call suppresses the extra text output in the notebook.

data.plot(x='A', y='B', kind='scatter');

We can also choose the plot kind by calling a method of the form DataFrame.plot.<kind> instead of passing the kind argument, as we'll see below. Let's now create another scatter plot with points varying in color and size. We'll perform the following steps:

  • Use data.plot.scatter and pass in columns A and C.
  • Set the color c and size s of the data points to change based on the value of column B.
  • Choose the color palette by passing a string into the parameter colormap.

A complete list of colormaps is available in the official Matplotlib documentation.

Let's see this in action:

data.plot.scatter('A', 'C',
                  c='B',
                  s=data['B'],
                  colormap='viridis');
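
Passing data['B'] directly as s works here because column B stays positive, and Matplotlib interprets s as a marker area in points squared. A column such as C, which is negative, would not make sense as a size. One possible workaround, sketched below under stated assumptions, is to rescale the column into a positive range first. The names min_size, max_size and sizes and the range of 10 to 200 are illustrative choices, not part of pandas, and the Agg backend is assumed so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild the lesson's dataset so the example is self-contained
np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

# Min-max rescale column C into [min_size, max_size] so every marker area is positive
min_size, max_size = 10, 200  # illustrative range, in points squared
col = data['C']
sizes = min_size + (max_size - min_size) * (col - col.min()) / (col.max() - col.min())

# Color and size now both follow column C
ax = data.plot.scatter('A', 'B', c='C', s=sizes, colormap='viridis')
plt.close(ax.figure)
```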