Git Product home page Git Product logo

udacity_data_engineering_capstone_project's Introduction

Udacity Data Engineering Capstone Project

US IMMIGRATION DATA MODEL

Data Engineering Capstone Project

Project Summary

This is the Uacity Data Engineering Capstone Project, which presents a usecase where both big data engineering concepts and technologies are utilized to bring useful insights out of large and diverse data sources. In this project, a data model is built to enable analysts and business users to answer different types of questions related to the immigration trends to the US.

Questions like is there a peak season for immigration? Do immigrants from warm countries prefer certain States versus those who are coming from cold countries? What is the percentage of immigrants with business visas versus those with tourist visas? All these questions and many more can be easily answered once our data model is built and filled with data.

The project follows the following steps:

  • Step 1: Scope the Project and Gather Data
  • Step 2: Explore and Assess the Data
  • Step 3: Define the Data Model
  • Step 4: Run ETL to Model the Data
  • Step 5: Complete Project Write Up

Our data comes from different sources and in different formats described as follows:

  • I94 Immigration Data: This data comes from the US National Tourism and Trade Office. The data comes in .sas format, and it has information about entries made to the US in 2016. This dataset is relatively big as it contains several million records. The data also comes with labels descriptions file which provides additional information about the main dataset. More about this dataset can be found here.
  • World Temperature Data: This dataset came from Kaggle, and it keeps track of the global weather information. The data is provided as a .csv file. More about this dataset can be found here.
  • U.S. City Demographic Data: This data is offered by OpenSoft, and it provides basic information about different city demographics. The data is also provided in .csv format. You can read more about it here.
  • Airport Code Table: This is a simple table of airport codes and corresponding cities. The data is also provided in .csv format. It comes from here.

Star schema and pyspark have been used for the implementation of our data model. The implementation details can be found in the Capstone Project Template.ipynb file.

The code in the Jupiter note is organized in a logical order where each code block represents a task or a number of related tasks. The ETL code is divided into a set of functions in the etl.py file. Necessary imports and configurations are done at the begining of the code; no further imports or installs are needed.

To execute this ETL, it can be run in Terminal by executing "python3 etl.py" command, where etl.py is the name of our ETL code file.

Some data quality checks are performed after processing and saving data. Processed data are saved in .parquet format in a separate folder in this project, namely output. Some partitioning is applied to the saved data to achieve better performance during the data read operations.

The ETL has been tested and completed successfully before submission.

udacity_data_engineering_capstone_project's People

Contributors

qusay-elewy avatar

Watchers

 avatar

Forkers

suonbo

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.