Git Product home page Git Product logo

upenn_project2's Introduction

UPenn_Project2

This project was done by Hao Dong, Ana Stefanski, Michael McAneny-Binetsky, Vysh Nalluri, and Meredith Hopkins as part of the University of Pennsylvania Data Analytics Bootcamp. Data on bird sightings and data on ufo sightings were collected from Cornell's eBird API 2.0 and Kaggle, respectively. The data was cleaned and procesed with Pandas in Jupyter Lab before being loaded into both PostgreSQL and MongoDB for storage.

UFO data:

The first step in this project was to extract data. The group decided to find data on UFO sightings and data on bird sightings. The UFO sightings data was sourced from Kaggle and downloaded as a CSV. Obtaining the bird sightings data was a more complex process. Due to the volume of data available through the eBird API, at this stage in the project the group decided to restrict the data to a single US state - California - and the years 2003 to 2013. This API only allows date specific data to be drawn in two ways; 1) By selecting up to the last 30 days of data or; 2) Selecting data from a single date. Because we wanted data from a specific timeframe from 2003 to 2013 we created a nested loop that runs through all the dates in a given year, checks if there is any data, and stores the data if there is any in a list, and create a dataframe out of that. Then we transpose the dataframe, change the column names, and save it to a csv. This is done for each year of our data. From there, we merge all our data together, split our Date column into a date and time column, and convert our date into a datetime data type before saving that new dataframe into a csv.

The next step was transforming the data. For the UFO data, we dropped some columns that were not needed for our purposes, split the datetime column to get a year column and a date column, filtered to only keep data from the state of California from 2003 through 2013. Imported SQLAlchemy and Pandas as dependency. Read in our csv file - ../cleaned_data/cali_ufo.csv". To load into Postgres we used the dependency getpass to store our password. We made the connection to Postgres with ‘create engine’ and passed in the connection string. Loaded the dataframe with the table name of “Bird Data” and used the arguments if_exists = append, index = True. Index = True - This will be our arbitrary primary key. Bird Data: Reviewed data and decided to drop the following columns: 'location ID', 'observation valid', 'observation reviewed', 'location private', ‘Sub ID’. Only called data from California to match data in UFO. The dataframe before cleaning had rows and columns switched, therefore transposed data changing rows to columns and downloaded as csv file. Due to the large amount of data called from Bird data, data was saved under multiple csvs, one for each year. Merged all csvs for years 2003 through 2013. Split datetime column to get a year column, then filtered to keep only data from 2003 to 2013 and created a new dataframe with the filtered data. Imported SQLAlchemy as dependency. Due to the large csv file of Bird Data, zipping the file was necessary to manage data size. We discovered that PD.read_csv will read a zipped csv file. To load into Postgres we used the dependency getpass to store our password. We made the connection to Postgres with ‘create engine’ and passed in the connection string. We then loaded the dataframe with the table name of “Bird Data” and used the arguments if_exists = fail, index = False. if_exists was used so that rows are not duplicated when data is loaded at a later time. Index = false was used to remove the index from the data frame because it is irrelevant.

Screenshots:

Bird data before cleaning: image info

Bird data after cleaning: image info

UFO data before cleaning: image info

UFO data after cleaning: image info

Database diagram: image info

Bird to pgAdmin: image info

pgAdmin search by date: image info

UFO to pgAdmin: image info

Source mapping: image info

upenn_project2's People

Contributors

haodong191 avatar mikebinetsky avatar vyshnalluri avatar twolightsabovethesea avatar anastefanski avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.