- Introduction
- Last Week
- Data Science
- Python
- Data Types
- Methods
- Packages
- Visualization
- Common Graphs
- Packages
- Last Week
- Getting Started
- Choosing Colors
- Adding Labels
- Summary
Welcome to week 2! Last week we covered a lot. We started with an overview of data science, from both a career and a workflow perspective, and then went on to cover the basic data types of Python. Today we are going to pick up where we left off and take a further look at creating some stunning visuals!
We also saw how to load packages into python in order to use additional functions and methods stored within them.
Recall:
- plotly
- matplotlib
- pandas (Primarily a spreadsheet [Excel-like] package, but has some built-in visualizations using matplotlib)
- folium
- seaborn (Makes everything prettier!)
import pandas
travel_df = pandas.read_excel('./cities.xlsx')
cities = travel_df.to_dict('records')
cities[0]
{'City': 'Solta', 'Country': 'Croatia', 'Population': 1700, 'Area': 59}
Let's pull in some data!
Technically we already did this (we imported the pandas package up above) but let's do it once again just as a reminder.
This time we'll also alias the pandas package as pd. It may not seem like much, but saving those 4 letters of typing adds up if we're using it frequently.
import pandas as pd
Here are the official blurbs about modules and importing from the Python documentation:
"If you quit from the Python interpreter and enter it again, the definitions you have made (functions and variables) are lost. Therefore, if you want to write a somewhat longer program, you are better off using a text editor to prepare the input for the interpreter and running it with that file as input instead. This is known as creating a script. As your program gets longer, you may want to split it into several files for easier maintenance. You may also want to use a handy function that you’ve written in several programs without copying its definition into each program.
To support this, Python has a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a module; definitions from a module can be imported into other modules or into the main module (the collection of variables that you have access to in a script executed at the top level and in calculator mode)."
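To make the module idea concrete, here is a minimal sketch using two standard-library modules, `json` and `math`, chosen purely for illustration. It imports a module by name at runtime with `importlib.import_module()` and checks for the `__path__` attribute, which the Python import reference uses to tell packages apart from plain modules.

```python
import importlib
import json   # a package: a directory containing several modules
import math   # a plain module

# Importing by name at runtime returns the same module object
# that the import statement already loaded.
json_again = importlib.import_module("json")
same_object = json_again is json

# Packages carry a __path__ attribute; plain modules do not.
json_is_package = hasattr(json, "__path__")
math_is_package = hasattr(math, "__path__")

print(same_object, json_is_package, math_is_package)
```

The same check works on third-party packages such as pandas once they are imported.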
Packages (collections of Modules) https://docs.python.org/3/reference/import.html
"It’s important to keep in mind that all packages are modules, but not all modules are packages. Or put another way, packages are just a special kind of module. Specifically, any module that contains a __path__ attribute is considered a package."
"Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way. Functions such as importlib.import_module() and built-in __import__() can also be used to invoke the import machinery."
df = pd.read_excel('cities.xlsx')
df.head() #Preview the first 5 rows of the dataframe
| | City | Country | Population | Area |
|---|---|---|---|---|
| 0 | Solta | Croatia | 1700 | 59 |
| 1 | Greenville | USA | 84554 | 68 |
| 2 | Buenos Aires | Argentina | 13591863 | 4758 |
| 3 | Los Cabos | Mexico | 287651 | 3750 |
| 4 | Walla Walla Valley | USA | 32237 | 33 |
- df.head() #Preview the first 5 rows of the dataframe
- df.head(10) #Preview the first 10 rows
- df.tail() #Preview the last 5 rows
- df.columns #Returns the column names (as a pandas Index). Notice that this is an attribute, not a method/function; there are no parentheses
- df.info() #Return column names, length of dataframe and storage size info
- df[col] #Return a particular column of the dataframe where col is the name of the column
- df[col].value_counts() #Returns a frequency count of entries within the column in descending order
- df[col].unique() #Returns an array of the unique entries within the column
- df[col].nunique() #Returns the number of unique entries within the column as an integer
- df[[cols]] #Returns the dataframe with only those columns indicated
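A few of the calls listed above can be tried without the spreadsheet. This is a small sketch that retypes the five rows from the df.head() preview by hand; the name `sample_df` is made up for this example so it does not overwrite the real `df`.

```python
import pandas as pd

# The five rows shown in the df.head() preview, typed in by hand
sample_df = pd.DataFrame({
    "City": ["Solta", "Greenville", "Buenos Aires", "Los Cabos", "Walla Walla Valley"],
    "Country": ["Croatia", "USA", "Argentina", "Mexico", "USA"],
    "Population": [1700, 84554, 13591863, 287651, 32237],
    "Area": [59, 68, 4758, 3750, 33],
})

column_names = list(sample_df.columns)                # attribute, so no parentheses
country_counts = sample_df["Country"].value_counts()  # most frequent entry first
n_countries = sample_df["Country"].nunique()          # number of distinct countries
city_and_area = sample_df[["City", "Area"]]           # a list of names inside the brackets

print(column_names)
print(country_counts)
print(n_countries)
print(city_and_area.shape)
```

Note the double brackets in the last selection: the inner pair is an ordinary Python list of column names.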
%matplotlib inline
Let's make a bar chart of cities and their population.
df['Population'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x11b925b00>
Hmmm, it would sure be nice to have the actual names of the cities on our graph! To do this, we have to tell Pandas what feature we want to use as the index for the dataframe. The index is shown on the left edge and can be thought of as the row names.
df.set_index('City')['Population'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x112f9ab00>
Better. Let's also change this to a horizontal bar chart so that the cities are easier to read.
df.set_index('City')['Population'].plot(kind='barh') #Notice barh instead of bar
<matplotlib.axes._subplots.AxesSubplot at 0x11bbc0f60>
Great! I want to make my chart all orange though!
Here are a few helpful resources for getting started:
http://colorbrewer2.org/#type=sequential&scheme=YlOrRd&n=3
https://matplotlib.org/api/colors_api.html
df.set_index('City')['Population'].plot(kind='barh', color='Orange')
<matplotlib.axes._subplots.AxesSubplot at 0x11c35f6a0>
We can do way better than that though!
Check out what you can do with the seaborn package and its color palettes!
import seaborn as sns
sns.set_style("darkgrid") #Load in some snazzier visual settings to spice things up
See https://seaborn.pydata.org/tutorial/aesthetics.html for all too many options
df.set_index('City')['Population'].plot(kind='barh') #Same code, prettier graph thanks to Seaborn!
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e21fda0>
sns.palplot(sns.light_palette((210, 90, 60), input="husl")) #Previewing a color scheme
sns.palplot(sns.dark_palette("muted purple", input="xkcd")) #Another color scheme!
sns.palplot(sns.color_palette("Paired")) #And another
Those purples were amazing! Let's incorporate them into our graph.
dark_purples = sns.dark_palette("muted purple", input="xkcd")
df.set_index('City')['Population'].plot(kind='barh', color = dark_purples)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e4e1438>
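One detail worth knowing: a seaborn palette behaves like a plain list of colors, and dark_palette produces six of them unless told otherwise. The sketch below asks for twelve shades through the `n_colors` parameter, on the assumption that one shade per bar is wanted (the chart later in this lesson has twelve cities). The name `twelve_purples` is made up for this example.

```python
import seaborn as sns

# Twelve shades instead of the default six, one per city
twelve_purples = sns.dark_palette("muted purple", n_colors=12, input="xkcd")

print(len(twelve_purples))
print(twelve_purples[0])   # each entry holds red, green and blue values between 0 and 1
```

Passing `color=twelve_purples` to the plot call should then give each bar its own shade.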
We need another module for this one.
import matplotlib.pyplot as plt
#Same Initial Code
dark_purples = sns.dark_palette("muted purple", input="xkcd")
df.set_index('City')['Population'].plot(kind='barh', color = dark_purples)
#Now Add a title
plt.title('Cities by Population')
#Label the X-Axis
plt.xlabel('Population in Tens of Millions')
Text(0.5,0,'Population in Tens of Millions')
This time let's make everything bigger!
#Same Initial Code
dark_purples = sns.dark_palette("muted purple", input="xkcd")
df.set_index('City')['Population'].plot(kind='barh', color = dark_purples, figsize=(15,12)) #Bigger graph
#Now Add a title
plt.title('Cities by Population', fontsize=22)
#Label the X-Axis
plt.xlabel('Population in Tens of Millions', fontsize=16)
#Enlarge the Y-Axis Label
plt.ylabel('City', fontsize=16)
#Enlarge the City Names themselves
plt.yticks(fontsize=14)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
<a list of 12 Text yticklabel objects>)
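The AxesSubplot object printed under the earlier charts offers another route to the same labels. As a hedged alternative to the plt calls, the sketch below keeps the Axes that .plot() returns and sets the title and labels on it directly. It uses five hand-typed rows and the made-up name `sample_df`; the `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter
import pandas as pd

sample_df = pd.DataFrame({
    "City": ["Solta", "Greenville", "Buenos Aires", "Los Cabos", "Walla Walla Valley"],
    "Population": [1700, 84554, 13591863, 287651, 32237],
})

# .plot() returns the matplotlib Axes it drew on
ax = sample_df.set_index("City")["Population"].plot(kind="barh", figsize=(15, 12))

ax.set_title("Cities by Population", fontsize=22)
ax.set_xlabel("Population in Tens of Millions", fontsize=16)
ax.set_ylabel("City", fontsize=16)
ax.tick_params(axis="y", labelsize=14)   # enlarge the city names
```

Working on the Axes directly tends to be easier to follow once a notebook holds more than one chart, since each label call names the chart it belongs to.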
So there you have it: quick and easy visuals, plus how to import data from .xlsx or .csv files. Bon voyage!
Remember:
#Import the module
import pandas as pd
#Load a spreadsheet as a DataFrame using Pandas
df = pd.read_excel(filename)
#or
df = pd.read_csv(filename)
# Make sure that graphs show up in Jupyter Notebook:
%matplotlib inline
df.set_index(x_col)[y_col].plot(kind='barh') #Create a horizontal bar chart!
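This lesson only loaded an .xlsx file, so here is a small sketch of the .csv route for comparison. It writes three of the rows to a temporary file and reads them back with pd.read_csv. The file name `cities_sample.csv` and the variable names are made up for this example.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    "City": ["Solta", "Greenville", "Buenos Aires"],
    "Population": [1700, 84554, 13591863],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cities_sample.csv")  # hypothetical file name
    sample_df.to_csv(csv_path, index=False)   # leave the row labels out of the file
    reloaded = pd.read_csv(csv_path)          # load it back as a DataFrame

print(reloaded)
```

Without `index=False`, the row labels would be saved as an extra unnamed column and show up again on reload.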