A collaborative project to build an ETL pipeline using Python, Pandas, and PostgreSQL for extracting, transforming, and loading crowdfunding data from Excel files into a relational database.
ETL (Extract, Transform and Load) data processing is an automated procedure that extracts relevant information from raw data, converts it into a format that fulfills business requirements and loads it into a target system.
The instructions for this mini project are divided into the following subsections:
- Create the Category and Subcategory DataFrames
- Create the Campaign DataFrame
- Create the Contacts DataFrame
- Create the Crowdfunding Database
- Utilize Python and Pandas to:
  - Extract and transform crowdfunding and contact data from Excel files
  - Create and export Category, Subcategory, Campaign, and Contacts DataFrames as CSV files
- Utilize PostgreSQL to:
  - Design an ERD and table schema for the database
  - Create and populate PostgreSQL database tables
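The Category and Subcategory extraction can be sketched with pandas. This is a minimal illustration, not the project's notebook code: the combined `"category & sub-category"` column name, the `/` separator, and the `cat`/`subcat` id prefixes are assumptions about the raw crowdfunding data, and the inline DataFrame stands in for `crowdfunding.xlsx`.

```python
import pandas as pd

# Placeholder for the data read from crowdfunding.xlsx; the combined
# "category & sub-category" column is an assumption about the raw file.
crowdfunding_df = pd.DataFrame(
    {"category & sub-category": ["food/food trucks", "music/rock", "food/food trucks"]}
)

# Split the combined column into separate category and subcategory columns.
crowdfunding_df[["category", "subcategory"]] = (
    crowdfunding_df["category & sub-category"].str.split("/", n=1, expand=True)
)

# Build lookup DataFrames from the unique values, with sequential ids
# (the "cat"/"subcat" id format is a hypothetical convention).
categories = crowdfunding_df["category"].unique()
category_df = pd.DataFrame(
    {"category_id": [f"cat{i}" for i in range(1, len(categories) + 1)],
     "category": categories}
)

subcategories = crowdfunding_df["subcategory"].unique()
subcategory_df = pd.DataFrame(
    {"subcategory_id": [f"subcat{i}" for i in range(1, len(subcategories) + 1)],
     "subcategory": subcategories}
)
```

The same split-then-deduplicate pattern produces both lookup tables, which the Campaign DataFrame later references by id.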
For our analysis, we extracted and transformed data from the following datasets, available in the `Resources` folder:

- contacts.xlsx
- crowdfunding.xlsx

We then loaded the transformed data into the `crowdfunding_db` PostgreSQL database from the following CSV files:

- category.csv
- subcategory.csv
- contacts.csv
- campaign.csv
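Exporting a transformed DataFrame to one of these CSV files can be sketched as follows; the DataFrame contents are placeholders, and a temporary directory stands in for the project's `Resources` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder lookup table; in the project this comes from the transform step.
category_df = pd.DataFrame(
    {"category_id": ["cat1", "cat2"], "category": ["food", "music"]}
)

# A temporary directory stands in for the Resources output folder.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "category.csv"

# index=False keeps category_id as the first column in the file,
# matching the table schema it will be imported into.
category_df.to_csv(csv_path, index=False)

# Reading the file back confirms a clean round trip.
round_trip = pd.read_csv(csv_path)
```

Writing with `index=False` matters here: an extra unnamed index column would not match the PostgreSQL table definitions during import.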
- Clone the repository.
- Install PostgreSQL.
- Install the required Python packages: `pandas`, `numpy`, and `openpyxl`.
- To run our analysis, first select the Python Environment (Python 3.9.13) in Jupyter Notebook.
- Select "Run All" in the `ETL_Mini_Project.ipynb` file, which extracts and transforms the data and creates the CSV files.
  - All CSV files will be located in the `Resources` folder.
- Set up a PostgreSQL server and create a new database called `crowdfunding_db`.
- In the `Crowdfunding DB` folder, use the provided `crowdfunding_db_schema.sql` file to create the tables in the PostgreSQL `crowdfunding_db` database.
- Import each CSV file into its corresponding SQL table in the following order: category, subcategory, contacts, campaign.
- Query the database to verify the data loaded correctly by running a `SELECT` statement for each table.
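The verification step can be sketched by generating one row-count `SELECT` per table, in the same order the tables were loaded; the statements below are meant to be run against `crowdfunding_db` in pgAdmin or `psql`, and building them in Python here is purely illustrative.

```python
# Tables in load order (category and subcategory first, since campaign
# references them by foreign key).
tables = ["category", "subcategory", "contacts", "campaign"]

# One simple row-count check per table; run each statement against
# crowdfunding_db and compare the count to the rows in the source CSV.
checks = {table: f"SELECT COUNT(*) FROM {table};" for table in tables}

for query in checks.values():
    print(query)
```

Comparing each count to the row count of the corresponding CSV file is a quick sanity check that every record was imported.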
In this project, we demonstrated an ETL pipeline using Python, Pandas, and a combination of Python dictionary and regular-expression techniques for data extraction and transformation. We successfully generated four CSV files, used the data in those files to design an ERD and table schema, and then loaded the data from the CSV files into a PostgreSQL database via DataFrames.
| Programming Languages | Database | Libraries | Software | Modules |
|---|---|---|---|---|
| Python | PostgreSQL | Pandas | QuickDBD | datetime as dt |
| | | numpy | Jupyter Notebook | pprint |
| | | json | | |
Helena Fedorenko
Jason Barbagallo
Xing Ying Chen
Nancy Santiago
Anthony Parry