Git Product home page Git Product logo

coms6111-proj3's Introduction

COMS6111-proj3

  1. Group name: Project 3 Group 23 Qi Wang (qw2197), Yongjia Huo (yh2796)

  2. Submitted files:

    1. main.py -- the main program, implement the a-priori algorithm to extract association rules
    2. clean.py -- the program used to clean the data set
    3. INTEGRATED-DATASET.csv -- the cleaned data set we use to generate rules
    4. READEME.md
    5. example-run.txt -- the output of the interesting sample run
  3. The detailed description of data set

    1. Which NYC Open Data data set(s) you used to generate the INTEGRATED-DATASET file: We choose the "Bus Breakdown and Delays" data set to generate the INTEGRATED-DATASET file

    2. What (high-level) procedure you used to map the original NYC Open Data data set(s) into your INTEGRATED-DATASET file: We use the following filters when downloading data:

      • School_Year: 2016-2017
      • Number_of_Students_On_the_Bus > 0
      • Has_Contractor_Notified_Schools: Yes
      • Has_Contractor_Notified_Parents: Yes
      • Have_You_Alerted_OPT: Yes
      • School_Age_or_PreK: School-Age

    Then we extract the hour part from the "Occurred_On" column and reformat the "How_Long_Delayed" column. And we remove some columns which are unlikely to generate rules, keeping only the following columns in our final data set: * Route_Number * Reason * Occurred_On * Boro * Bus_Company_Name * How_Long_Delayed * Number_Of_Students_On_The_Bus * Breakdown_or_Running_Late

    Note: There are some empty columns within the original file, we fill the empty slot in "Boro" column with "Others" and fill the empty slot in "How_Long_Delayed" columns with "-1" to indicate the data is unavailable.

    1. What makes your choice of INTEGRATED-DATASET file interesting: We choose this data set because we want to know when the bus breakdowns and delays happens, which is a good indication of the transportation condition in NYC. We can generate rules to know where and when the accidences occur most often, which routine is more likely to delay and so on. Besides, we can give students' parents suggestions like when should they send children to school to avoid heavy traffic.

    2. Use the following command to run our program:

      python main.py <min_supp> <min_conf>

      For example,

      python main.py INTEGRATED-DATASET.csv 0.25 0.8

    3. Description of Internal Design: We follow exactly the implementation of a-priori algorithm described in Section 2.1 of the Agrawal and Srikant paper in VLDB 1994. The main steps are:

      1. Counts item occurrences to determine the large 1-itemsets.
      2. Iteratively generate the large k-itemsets(denoted as L_k) from the large (k-1)-itemsets until (k-1)-itemsets is empty.
        • join L_k-1 with L_k-1 to return the set of all large k-itemsets, denoted as Ck
        • prune Ck to delete those itemsets not in L_k-1
        • scan the database and count the support of candidates in Ck
        • L_k is the itemsets in Ck whose support is larger than the required minimum support
    4. An interesting sample run

    5. The command line specification

      python main.py INTEGRATED-DATASET.csv 0.25 0.8

    6. Explanations Some interesting rules:

      1. [07 O'Clock, Heavy Traffic] => [Running Late] -- (Conf: 100.0 % Supp: 49.3277927157 %) [07 O'Clock] => [Running Late] -- (Conf: 97.5080385852 % Supp: 59.3009044243 %) [07 O'Clock, Running Late] => [Heavy Traffic] -- (Conf: 83.1821929101 % Supp: 49.3277927157 %) [07 O'Clock] => [Heavy Traffic] -- (Conf: 81.1093247588 % Supp: 49.3277927157 %)

        7 O'Clock is the busiest time in a day and bus delays are more likely to happen at this time

      2. [20 MIN] => [Running Late] -- (Conf: 100.0 % Supp: 26.5949645563 %)

        most delays are 20-minute late

      3. [Bronx] => [Running Late] -- (Conf: 97.9981801638 % Supp: 26.3260816426 %) [Manhattan] => [Running Late] -- (Conf: 96.4511041009 % Supp: 29.8948912246 %)

        traffic in Bronx and Manhattan is worse than other boroughs.

      4. [RELIANT TRANS, INC. , Running Late] => [Heavy Traffic] -- (Conf: 95.0184501845 % Supp: 25.177218284 %) [RELIANT TRANS, INC. ] => [Running Late] -- (Conf: 91.1690496215 % Supp: 26.4971889514 %) [RELIANT TRANS, INC. ] => [Heavy Traffic] -- (Conf: 86.6274179983 % Supp: 25.177218284 %)

        buses from RELIANT TRANS, INC. are more likely to delay because of heavy traffic. From Google reviews we also find that this company only gets a rating of 2.6, and people complain that they don't come on time.

coms6111-proj3's People

Contributors

lilyhuo avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.