YelpAssociationMining

Association mining project on the Yelp Dataset.

The goal of this project was to perform an association mining case study on Yelp business categories and to compare the performance of different mining algorithms.

The case study on Yelp businesses has two goals: identify the most frequent business categories and identify the most popular business categories. Both analyses are run first on all businesses, and then on the Montreal and Philadelphia markets specifically. Popularity is based on user reviews of 3 stars or more.
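The popularity criterion can be illustrated with a minimal sketch. It assumes reviews are available as (business_id, stars) pairs and that each business maps to a list of category strings. These structures, the sample data and the function name are hypothetical and are not taken from the project's scripts. It also assumes that every qualifying review counts once, which the project may handle differently.

```python
from collections import Counter

# Hypothetical sample data; the real Yelp files carry many more fields.
business_categories = {
    "b1": ["Restaurants", "Pizza"],
    "b2": ["Restaurants", "Sushi Bars"],
    "b3": ["Shopping"],
}
reviews = [("b1", 5), ("b1", 2), ("b2", 3), ("b3", 1), ("b2", 4)]

MIN_STARS = 3  # popularity threshold: reviews of 3 stars or more


def popular_category_counts(reviews, business_categories, min_stars=MIN_STARS):
    """Count category occurrences over reviews that reach min_stars."""
    counts = Counter()
    for business_id, stars in reviews:
        if stars >= min_stars:
            # Assumption: each qualifying review contributes its business's categories once.
            counts.update(business_categories.get(business_id, []))
    return counts


counts = popular_category_counts(reviews, business_categories)
print(counts.most_common())
```

Restricting the input to one market (for example, businesses located in Montreal) before calling the function would give the per-market variant.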

For the comparative study on algorithmic performance, three association mining algorithms are evaluated: Apriori [Agrawal and Srikant, 1994], FP-Growth [Han et al., 2000] and Eclat [Zaki, 2000]. For Apriori and FP-Growth, we use the mlxtend implementation. For Eclat, we use the PyECLAT implementation. The comparison metric is execution time, measured for different values of the minimum support threshold (minsup).
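The timing procedure can be sketched as a small harness that runs a mining function once per minsup value and records the elapsed wall-clock time. To keep the sketch self-contained, it uses a simplified level-wise (Apriori-style) miner written in plain Python as a stand-in. This is not the mlxtend or PyECLAT implementation, and the names `frequent_itemsets` and `time_miner` are hypothetical.

```python
import time
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Simplified level-wise search; returns {itemset: relative support}."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]
    items = sorted({i for t in tsets for i in t})
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        # Keep the candidates whose relative support reaches minsup.
        level = {}
        for cand in current:
            support = sum(1 for t in tsets if cand <= t) / n
            if support >= minsup:
                level[cand] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        keys = list(level)
        current = list(
            {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        )
    return result


def time_miner(miner, transactions, minsups):
    """Return {minsup: elapsed seconds} for one mining function."""
    timings = {}
    for minsup in minsups:
        start = time.perf_counter()
        miner(transactions, minsup)
        timings[minsup] = time.perf_counter() - start
    return timings


# Hypothetical toy transactions: one list of categories per business.
transactions = [
    ["Restaurants", "Pizza"],
    ["Restaurants", "Sushi Bars"],
    ["Restaurants", "Pizza", "Italian"],
    ["Shopping"],
]
result = frequent_itemsets(transactions, 0.5)
timings = time_miner(frequent_itemsets, transactions, [0.25, 0.5])
print(result)
print(timings)
```

Any function with the same `(transactions, minsup)` signature can be passed to `time_miner`, so a thin wrapper around each library implementation would let the three algorithms be timed under identical conditions.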

Scripts:

Note: the files in data/ and results/ folders are just sample files to illustrate the structure of the project. In order to use the scripts below, you have to download and extract the Yelp dataset into the data/ folder.

  • extract_data.py: convert the JSON files to CSV format
  • preprocess.py: build the DataFrames used for association mining (remove empty lines, join DataFrames, keep only reviews with 3 stars or more, etc.)
  • catacteristics.py: summarize frequent-itemset characteristics of the Yelp dataset (e.g. max/mean transaction size, frequent itemset curves for different values of minsup)
  • frequent_categories.py: identify frequent categories/associations in all businesses
  • frequent_restaurant_categories.py: identify frequent categories/associations for restaurant businesses
  • frequent_user_categories.py: identify frequent categories/associations for user-preferred businesses (based on reviews)
  • measure_exec_time.py: measure execution time on the Yelp dataset for all 3 association mining algorithms
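The JSON-to-CSV step listed first can be sketched with the standard library alone. The sketch assumes the input holds one JSON object per line and that the fields of interest are named `business_id` and `categories`. The file layout, the field names and the function name `json_lines_to_csv` are assumptions for illustration and do not describe what extract_data.py actually does.

```python
import csv
import io
import json


def json_lines_to_csv(src, dst, fields):
    """Write the chosen fields of each JSON line in src as CSV rows in dst."""
    # Keys outside `fields` are ignored; missing keys become empty cells.
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    rows = 0
    for line in src:
        line = line.strip()
        if not line:  # skip empty lines
            continue
        writer.writerow(json.loads(line))
        rows += 1
    return rows


# Hypothetical in-memory sample standing in for a file under data/.
sample = io.StringIO(
    '{"business_id": "b1", "categories": "Restaurants, Pizza", "stars": 4.5}\n'
    "\n"
    '{"business_id": "b2", "categories": "Shopping"}\n'
)
out = io.StringIO()
n_rows = json_lines_to_csv(sample, out, ["business_id", "categories"])
print(n_rows)
print(out.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `open(..., newline="")` for the CSV side, as the `csv` module documentation recommends.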

References:

  • Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pp. 487–499. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  • Han, J., Pei, J. and Yin, Y. (2000). Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2), 1–12. http://dx.doi.org/10.1145/335191.335372
  • Zaki, M. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372–390. http://dx.doi.org/10.1109/69.846291

Contributors:

  • cadotte
