gandalf1819 / nycopendata-profiling-analysis

Open Data Profiling, Quality and Analysis on NYC OpenData dataset with semantic profiling using fuzzy ratio, Levenshtein distance and regex

License: MIT License

Jupyter Notebook 93.80% Python 6.20%
data-profiling nyc-opendata nyc-311-dataset big-data pyspark hdfs modin dask dask-distributed pandas

NYCOpenData-Profiling-Analysis

Data profiling, quality assessment, and analysis of public datasets on NYC OpenData.

Dataset

Dataset: NYC Open Data

Task 1 : Generic Profiling

Open data often comes with little or no metadata. You will profile a large collection of open data sets and derive metadata that can be used for data discovery, querying, and identification of data quality problems.

For each column in the dataset collection, you will extract the following metadata:

  1. Number of non-empty cells
  2. Number of empty cells (i.e., cells with no data)
  3. Number of distinct values
  4. Top-5 most frequent value(s)
  5. Data types (a column may contain values belonging to multiple types)
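The first four items above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function name `profile_column`, the choice of empty markers, and the sample values are all assumptions made for this example.

```python
from collections import Counter

def profile_column(values, empty_markers=("", None)):
    """Return basic metadata for one column given as a list of raw cells.

    Which cells count as "empty" is an assumption; adjust empty_markers
    to match how the dataset encodes missing data.
    """
    non_empty = [v for v in values if v not in empty_markers]
    counts = Counter(non_empty)
    return {
        "non_empty_cells": len(non_empty),
        "empty_cells": len(values) - len(non_empty),
        "distinct_values": len(counts),
        # most_common(5) returns (value, count) pairs, most frequent first
        "top5_frequent": [value for value, _ in counts.most_common(5)],
    }

# Made-up sample column for illustration only
sample = ["BROOKLYN", "QUEENS", "", "BROOKLYN", None, "BRONX"]
profile = profile_column(sample)
```

On a large collection the same counts would more likely be computed with a distributed engine; the logic per column stays the same.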

Identify the data types for each distinct column value as one of the following:

  • INTEGER (LONG)
  • REAL
  • DATE/TIME
  • TEXT
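One possible way to assign these labels is to try progressively looser parsers and fall back to TEXT. The sketch below assumes every cell arrives as a string; the function name `infer_type` and the list of date formats are hypothetical and would need to be extended for real data. Note that Python's `float` also accepts strings such as "nan" and "inf", which a stricter profiler might want to treat as TEXT.

```python
from datetime import datetime

# Assumed date formats; real columns will likely need more patterns.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%Y %I:%M:%S %p")

def infer_type(value):
    """Classify one string cell as INTEGER (LONG), REAL, DATE/TIME, or TEXT."""
    text = value.strip()
    try:
        int(text)
        return "INTEGER (LONG)"
    except ValueError:
        pass
    try:
        float(text)
        return "REAL"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "DATE/TIME"
        except ValueError:
            continue
    return "TEXT"
```

The order matters: integers are tested before reals because every integer string also parses as a float.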

For each column, count the total number of values as well as the number of distinct values for each of the above data types.
For columns that contain at least one value of type INTEGER / REAL report:

  • Maximum value
  • Minimum value
  • Mean
  • Standard Deviation
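A hedged sketch of these four statistics using only the standard library follows. The task text does not say whether the population or sample standard deviation is intended; this example assumes the population form (`statistics.pstdev`). The function name `numeric_stats` and the sample cells are made up.

```python
import statistics

def numeric_stats(values):
    """Compute max, min, mean, and standard deviation over the cells
    that parse as numbers; non-numeric cells are skipped."""
    numbers = []
    for v in values:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            continue
    if not numbers:
        return None
    return {
        "max": max(numbers),
        "min": min(numbers),
        "mean": statistics.mean(numbers),
        # Assumption: population standard deviation
        "stddev": statistics.pstdev(numbers),
    }

# Made-up mixed column: eight numeric cells and one text cell
stats = numeric_stats(["2", "4", "4", "4", "5", "5", "7", "9", "n/a"])
```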

For columns that contain at least one value of type DATE/TIME report:

  • Maximum value
  • Minimum value

For columns that contain at least one value of type TEXT report:

  • Top-5 Shortest value(s) (the values with shortest length)
  • Top-5 Longest value(s) (the values with longest length)
  • Average value length

Viz-1

Task 2 : Semantic Profiling

For each column, identify and summarize semantic types present in the column. These can be generic types (e.g., city, state) or collection-specific types (NYU school names, NYC agency).
For each semantic type T identified, enumerate all the values encountered for T in all columns present in the collection.
You will look for the following types and add one or more semantic type labels to the column metadata together with their frequency in the column:

  • Person name (Last name, First name, Middle name, Full name)
  • Business name
  • Phone Number
  • Address
  • Street name
  • City
  • Neighborhood
  • LAT/LON coordinates
  • Zip code
  • Borough
  • School name (Abbreviations and full names)
  • Color
  • Car make
  • City agency (Abbreviations and full names)
  • Areas of study (e.g., Architecture, Animal Science, Communications)
  • Subjects in school (e.g., MATH A, MATH B, US HISTORY)
  • School Levels (K-2, ELEMENTARY, ELEMENTARY SCHOOL, MIDDLE)
  • College/University names
  • Websites (e.g., ASESCHOLARS.ORG)
  • Building Classification (e.g., R0-CONDOMINIUM, R2-WALK-UP)
  • Vehicle Type (e.g., AMBULANCE, VAN, TAXI, BUS)
  • Type of location (e.g., ABANDONED BUILDING, AIRPORT TERMINAL, BANK, CHURCH, CLOTHING/BOUTIQUE)
  • Parks/Playgrounds (e.g., CLOVE LAKES PARK, GREENE PLAYGROUND)
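The project description mentions regex and Levenshtein distance for semantic profiling. The sketch below shows how the two could be combined for three of the types listed above; it is an illustration under stated assumptions, not the repository's code. The regular expressions, the distance threshold of 2, and the names `label_value` and `label_column` are all hypothetical.

```python
import re
from collections import Counter

BOROUGHS = ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"]
# Assumed patterns: US ZIP / ZIP+4 and 10-digit US phone numbers
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def label_value(value, max_distance=2):
    """Assign one semantic label to a cell; regex first, fuzzy match second."""
    text = value.strip().upper()
    if ZIP_RE.match(text):
        return "zip_code"
    if PHONE_RE.match(text):
        return "phone_number"
    # Tolerate small misspellings of borough names (threshold is an assumption)
    if any(levenshtein(text, b) <= max_distance for b in BOROUGHS):
        return "borough"
    return "other"

def label_column(values):
    """Frequency of each semantic label within one column."""
    return Counter(label_value(v) for v in values)
```

Exact patterns suit rigidly formatted types such as zip codes, while a distance threshold helps with free-text types where misspellings are common. A fixed threshold can over-match very short strings, so a length-relative ratio may be preferable in practice.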

Viz-2

Viz-2.1

Task 3 : Data Analysis

  • Identify the three most frequent 311 complaint types by borough.
  • Are the same complaint types frequent in all five boroughs of the City?
  • How might you explain the differences?
  • How does the distribution of complaints change over time for certain neighborhoods and how could this be explained?
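The first question can be answered with a grouped frequency count. The sketch below assumes the 311 records have already been reduced to (borough, complaint type) pairs; the function name and the sample records are made up for illustration and are not results from the dataset.

```python
from collections import Counter, defaultdict

def top_complaints_by_borough(records, k=3):
    """Return the k most frequent complaint types for each borough.

    records: iterable of (borough, complaint_type) pairs.
    """
    per_borough = defaultdict(Counter)
    for borough, complaint_type in records:
        per_borough[borough][complaint_type] += 1
    return {b: counts.most_common(k) for b, counts in per_borough.items()}

# Made-up records for illustration only
records = [
    ("BROOKLYN", "Noise"), ("BROOKLYN", "Noise"), ("BROOKLYN", "Noise"),
    ("BROOKLYN", "Heating"), ("BROOKLYN", "Heating"),
    ("BROOKLYN", "Illegal Parking"),
    ("QUEENS", "Illegal Parking"), ("QUEENS", "Illegal Parking"),
    ("QUEENS", "Noise"),
]
top = top_complaints_by_borough(records)
```

Comparing the resulting lists across boroughs gives a starting point for the second and third questions.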

Data Visualizations

Types of complaints across the different boroughs

Viz-3.1

Distribution of "closed-dates" across the different boroughs

Viz-3.2

Heat Map Representing Status of Complaints Across The Different Boroughs

Viz-3.3

Heat Map Representing Count Of Complaints Across The Different Boroughs

Viz-3.4

Distribution of Complaint Types and their resolution dates

Viz-3.5

Types of complaints across different locations

Viz-3.6

Heat Map representing the Types of complaints that are open in the Brooklyn region

Viz-3.7

Team

Support

If you found this useful, please consider starring (★) the repo so that it can reach a broader audience.

License

This project is licensed under the MIT License. Feel free to open a Pull Request to add implementations or suggest new ideas that make the analysis more insightful.

Contributors

gandalf1819, hemanthtejay
