This repository collects datasets for Python practice. The following provides basic information on each dataset.
I sincerely thank Professor Kim J. Ruhl for his Python lecture in Fall 2021 at the University of Wisconsin-Madison.
The file 'osk.csv' contains daily closing prices for Oshkosh Corp. and the S&P 500.
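A typical first exercise with this kind of file is to compute daily returns and the correlation between the stock and the index. The sketch below uses made-up prices and assumed column names ('osk', 'sp500'), since the real file's layout may differ.

```python
import pandas as pd

# Toy stand-in for 'osk.csv'; the real column names and values may differ.
prices = pd.DataFrame(
    {"osk": [100.0, 102.0, 101.0, 103.0],
     "sp500": [4000.0, 4040.0, 4020.0, 4060.0]},
    index=pd.to_datetime(["2021-09-01", "2021-09-02", "2021-09-03", "2021-09-07"]),
)

# Daily percentage returns from closing prices.
returns = prices.pct_change().dropna()

# Correlation between the stock's returns and the index's returns.
corr = returns["osk"].corr(returns["sp500"])
print(round(corr, 3))
```

With the real file you would replace the toy DataFrame with `pd.read_csv('osk.csv', index_col=0, parse_dates=True)`.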
The file 'vix.csv' contains daily end-of-trading values of the VIX, a measure of expected market volatility as implied by S&P 500 options. Business news like to refer to it as the "fear index". The idea is that expected volatility rises when people are worried about the future.
These two data files are from Pro Football Reference and cover the Green Bay Packers' roster. They are "matched" on player names, but some of those names contain errors.
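An outer merge with `indicator=True` is one way to find the names that fail to match. The example below uses hypothetical rosters (the deliberate typo "Davante Adamss" simulates the kind of error described above); the real files' columns are not specified here.

```python
import pandas as pd

# Hypothetical stand-ins for the two roster files; real columns may differ.
roster = pd.DataFrame({"player": ["Aaron Rodgers", "Davante Adams", "Aaron Jones"],
                       "position": ["QB", "WR", "RB"]})
stats = pd.DataFrame({"player": ["Aaron Rodgers", "Davante Adamss", "Aaron Jones"],
                      "games": [16, 16, 14]})

# An outer merge with indicator=True exposes names that appear in only one file.
merged = roster.merge(stats, on="player", how="outer", indicator=True)
mismatches = merged.loc[merged["_merge"] != "both", "player"]
print(sorted(mismatches))  # the misspelled pair shows up on both sides
```

Rows flagged `left_only` or `right_only` are the names to inspect and correct by hand.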
The data is from the Current Population Survey, which surveys about 60,000 households each month.
The file 'two_digit_by_port.csv' contains the U.S. dollar value of imports by two-digit commodity code and port of entry for December 2013. The data were retrieved from the Census trade API. For example, imports into port number 3703 (Green Bay, WI) of commodity 72 (Iron and Steel) were $9,208,917 in December 2013. You can learn more about port codes here.
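Looking up the Green Bay example above amounts to a boolean filter on the port and commodity codes. The sketch below uses toy rows and assumed column names ('port', 'commodity', 'value'); the real file's headers may differ.

```python
import pandas as pd

# Toy rows mimicking 'two_digit_by_port.csv'; column names are assumptions.
imports = pd.DataFrame({
    "port": [3703, 3703, 1703],
    "commodity": [72, 84, 72],
    "value": [9_208_917, 1_000_000, 5_000_000],
})

# Value of Iron and Steel (code 72) imports through Green Bay (port 3703).
gb_steel = imports.loc[
    (imports["port"] == 3703) & (imports["commodity"] == 72), "value"
].sum()
print(gb_steel)  # 9208917
```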
The iris data set is one of the most famous datasets and classifier examples. It even has its own Wikipedia page! An iris is a flower, made up of sepals and petals. There are three types of irises in the data set: Iris-setosa, Iris-versicolor, and Iris-virginica. For each flower we observe four characteristics: sepal length, sepal width, petal length, and petal width.
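A usual first look at iris is to average each characteristic by species, since the three types separate well on these measurements. The sketch below uses a few made-up rows and assumed column names; only two of the four characteristics are shown for brevity.

```python
import pandas as pd

# A few toy rows in the shape of the iris data; the measurements are made up.
iris = pd.DataFrame({
    "species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 6.5],
    "petal_width": [0.2, 0.2, 2.5, 2.0],
})

# Average each characteristic by species.
means = iris.groupby("species").mean()
print(round(means.loc["Iris-setosa", "sepal_length"], 2))
```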
The data file records characteristics and quality rankings of Portuguese wine from the Wine Quality Data Set from the UCI Machine Learning Repository.
In this data file, the variables include
- GEOID: geographical ID.
- Description: name of city and state (only WI and MN).
- income: in thousands of USD, for 2018.
- population: number of persons, for 2018.
- ALAND: area of the county in square meters.
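Since ALAND is in square meters, a natural derived variable is population density in persons per square kilometer. The sketch below builds hypothetical rows matching the variable list above; the numbers are made up.

```python
import pandas as pd

# Hypothetical rows in the shape of the variable list above; values are made up.
cities = pd.DataFrame({
    "GEOID": ["55025", "27053"],
    "Description": ["Madison, WI", "Minneapolis, MN"],
    "income": [70.0, 75.0],           # thousands of USD, 2018
    "population": [260_000, 430_000],  # persons, 2018
    "ALAND": [2.0e9, 1.5e9],           # square meters
})

# Convert ALAND to square kilometers, then compute persons per sq km.
cities["density"] = cities["population"] / (cities["ALAND"] / 1e6)
print(cities["density"].round(0).tolist())
```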
This dataset contains information on the number of banks and banking offices in the US from 1934 to 2017.
This file collects data on US annual GDP and its subcomponents from 1929 to 2019.
This dataset collects the daily VIX index from January 2, 1990 to October 10, 2019.
This dataset contains data on the number of commercial banking institutions, branches, and offices in the United States at the end of each year between 1934 and 2017. The data are from Table CB01, which is maintained by the Federal Deposit Insurance Corporation (FDIC). FDIC data can be downloaded from https://www.fdic.gov/open/datatools.html.
This dataset contains annual real GDP of the United States from 1929 to 2021, with some missing data.
This dataset was cleaned from the MovieLens "ml-latest-small" dataset, which was released by GroupLens. It is meant to help build recommendation algorithms, like the ones you see on Netflix or Spotify. The GroupLens organization has other ratings datasets, too, on music, jokes, and books.
The data are taken from the Airline Origin & Destination Survey (DB1B) but have been substantially cleaned by Dennis McWeeny, a Senior Economist at Bates White Economic Consulting. The dataset contains information on a sample of airline itineraries for flights departing from one of seven airports in the San Francisco Bay region and arriving at one of the other large cities in the United States in the second quarter of 2017. Each observation contains information on the origin airport, destination airport, airline, nonstop or connecting itinerary type, average one-way fare in dollars, and distance between the origin and destination (in miles).
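With one row per itinerary, a natural summary is the average fare by itinerary type. The sketch below uses made-up itineraries and assumed column names ('nonstop', 'fare'); the real file's headers may differ.

```python
import pandas as pd

# Made-up itineraries in the shape described above; column names are assumed.
itins = pd.DataFrame({
    "origin": ["SFO", "SFO", "OAK", "SJC"],
    "dest": ["ORD", "ORD", "ORD", "JFK"],
    "nonstop": [True, False, True, False],
    "fare": [250.0, 180.0, 230.0, 210.0],
})

# Average one-way fare for nonstop versus connecting itineraries.
avg_fare = itins.groupby("nonstop")["fare"].mean()
print(avg_fare.loc[True], avg_fare.loc[False])
```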
This data set corresponds to Problem 3 in Chapter 3 of Wooldridge's Introductory Econometrics (7th edition) and it's already cleaned. Professor Kim J. Ruhl contemplated adding some junk to the files to make it more interesting.
This data set corresponds to Problem C2 in Chapter 6 of Wooldridge's Introductory Econometrics (7th edition).
This file contains data about Vegas betting. The complete variable list is here. For example, favwin is equal to 1 if the favored team won and zero otherwise, and spread holds the betting spread. In this context, a spread is the number of points that the favored team must beat the unfavored team by in order to be counted as a win by the favored team.
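One way to use these two variables is to check how often the favorite wins at each spread. The sketch below uses toy games; the variable names 'favwin' and 'spread' come from the description above, but the rest is made up.

```python
import pandas as pd

# Toy games; 'favwin' and 'spread' follow the description above, values are made up.
games = pd.DataFrame({
    "favwin": [1, 1, 0, 1, 0, 1],
    "spread": [3.5, 7.0, 3.5, 10.0, 7.0, 10.0],
})

# Fraction of games the favored team won, by size of the spread.
win_rate = games.groupby("spread")["favwin"].mean()
print(win_rate.loc[3.5])  # 0.5 in this toy sample
```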
The data dictionary for this file can be found here. The variable ecolbs is purchases of eco-friendly apples.
In the dataset 'dogs.csv', there are different dimensions: variables (walks, snacks); dogs (Buster, Su); and time. The column 'value' holds the data associated with each dog-variable-time triplet.
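Data stored one row per triplet like this is in "long" form, and a common exercise is to pivot it to "wide" form with one column per variable. The sketch below uses a toy long-form table with assumed column names ('dog', 'var', 'time', 'value'); the real file may label the columns differently.

```python
import pandas as pd

# A long-form toy version of 'dogs.csv': one row per dog-variable-time triplet.
dogs = pd.DataFrame({
    "dog": ["Buster", "Buster", "Su", "Su"],
    "var": ["walks", "snacks", "walks", "snacks"],
    "time": [2021, 2021, 2021, 2021],
    "value": [2, 3, 1, 5],
})

# Pivot to wide form: one row per dog-time pair, one column per variable.
wide = dogs.pivot_table(index=["dog", "time"], columns="var", values="value")
print(wide.loc[("Buster", 2021), "snacks"])
```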
The file 'WEOOct2019all.csv' is from the IMF's World Economic Outlook, which contains historical data and the IMF's forecasts for many countries and variables.
This file contains data on automobile characteristics in the European market. The unit of observation is an automobile model at a point in time. The data include prices, quantities sold, and characteristics of each model.
This is a dataset (in four different forms) with some information on Wisconsin cities. The data include the location of the cities and their populations. There are 20 Wisconsin cities that rank among the 1,000 largest U.S. cities. We can use '1000-largest-us-cities-by-population-with-geographic-coordinates.shp' with geopandas in Python to plot maps for Wisconsin cities.
To plot voting patterns in the 2016 presidential election, I downloaded result data from the Wisconsin Elections Commission. The raw file is a mess, so I saved a cleaned-up version to 'results.csv'.