restaurant-recommender

Overview & Motivation

This repository aims to build a recommendation system that can recommend restaurants to a given user. The model used is either DLRM or Contextual Sequence Learning with a Transformer.

Acknowledgments

This repository is inspired by

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

The project is built using Python and PyTorch. We use Poetry for dependency management.

First, you will have to clone the repository locally.

git clone https://github.com/ChrisTho23/restaurant-recommender
cd restaurant-recommender

Then, install dependencies using Poetry:

poetry install

All following scripts must be run from the ./src folder so that the relative paths defined in ./src/config.py resolve correctly. Change into the ./src folder like so:

cd src/

In this repository, we use a subset of Yelp's businesses, reviews, and user data, included in this Yelp Dataset that has been uploaded to Kaggle. Thus, we need to access Kaggle's public API to download the dataset. For this, you need to authenticate with Kaggle using an API token. If you have not done so already, follow these steps:

  1. If not done already, create an account on kaggle.com
  2. Go to the 'Account' tab of your user profile on the Kaggle website. Click on 'Create New API Token'. This triggers the download of kaggle.json, a file containing your API credentials.
  3. Make the credentials in the kaggle.json file accessible to your application. This can look like this:
mkdir ~/.kaggle
echo '{"username":"your_username","key":"your_api_key"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
  4. For more details and troubleshooting, visit the official Kaggle API documentation.
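
With the credentials in place, the dataset can also be downloaded programmatically via the official kaggle Python package. A minimal sketch (the dataset slug and target path are assumptions, not taken from this repository):

from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticates using the credentials in ~/.kaggle/kaggle.json.
api = KaggleApi()
api.authenticate()

# Download and unzip the Yelp dataset into ../data (slug and path assumed).
api.dataset_download_files("yelp-dataset/yelp-dataset", path="../data", unzip=True)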

Finally, you will have to run the ./src/setup.py script to download the data into the ./data folder and create a train and a test dataset.

poetry run python setup.py
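
Roughly, the script loads the data and splits it into train and test sets. A hypothetical sketch of that step (the file names and the 80/20 split ratio are assumptions, not taken from the repo):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the preprocessed dataset (file name assumed).
df = pd.read_csv("../data/preprocessed_yelp.csv")

# Split into train and test sets (ratio and seed assumed).
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("../data/train.csv", index=False)
test_df.to_csv("../data/test.csv", index=False)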

Results

Data preprocessing

Before we get to the (more interesting) model training, a significant amount of data preprocessing is needed. This task is not made easier by the fact that the data used for this exercise is quite large (initial dataset >20 GB). As mentioned before, we use Kaggle's Yelp Dataset, which contains a subset of Yelp's businesses, reviews, and user data across 8 metropolitan areas in the USA and Canada. Overall, Kaggle provides six different files:

  • Dataset_User_Agreement.pdf: PDF document that governs the terms under which you may access and use the Yelp data. It specifies that the data may be used solely for academic or non-commercial purposes.
  • yelp_academic_dataset_business.json: Contains business data including location data, attributes, and categories.
  • yelp_academic_dataset_review.json: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
  • yelp_academic_dataset_user.json: User data including the user's friend mapping and all the metadata associated with the user.
  • yelp_academic_dataset_checkin.json: Checkins on a business.
  • yelp_academic_dataset_tip.json: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.

For simplicity, in this exercise we only use the business, review, and user datasets. Find below how the data is preprocessed to obtain a clean dataset which we can use for training our recommender system (see the sketch after this list). Note that we use Dataiku on a Google Cloud virtual machine (VM) for the preprocessing, as it requires quite a bit of RAM. If you do not want to run the data preprocessing yourself, you can download the preprocessed file from my Google Cloud bucket as described here.
  1. Load the user, business, and review datasets into Dataiku (manually select the JSON data format for the user dataset)
  2. Filter the 'categories' column in the business dataset for gastronomy businesses via the keywords "Bakeries", "Bar", "Bars", "Bistros", "Cafes", "Patisserie", "Restaurants", "Tea" (150,346 lines reduced to 61,562 lines)
  3. Merge the business dataset containing the gastronomy businesses with the review dataset using an inner join on the key "business_id". We retain the columns business_id, name, address, city, state, postal_code, latitude, longitude, stars, review_count, and categories from the business dataset and the columns review_id, review_user_id, review_stars, review_useful, review_funny, review_cool, review_text, and review_date from the review dataset. (out of 6,990,280 reviews, 5,062,772 lines of data remain)
  4. Last but not least, enrich each review of each business with the corresponding user data using another inner join, this time on the key "user_id". We retain all columns of the business_review dataset and add the user information user_name, user_review_count, user_yelping_since, user_useful, user_funny, user_cool, user_elite, user_friends, user_fans, and user_average_stars to each row. Out of the 5,062,772 lines of reviews, we can find user data for X lines. The final dataset contains the following columns:

| Column              | Description | Type   |
|---------------------|-------------|--------|
| gastro_business_id  |             | string |
| gastro_name         |             | string |
| gastro_address      |             | string |
| gastro_city         |             | string |
| gastro_state        |             | string |
| gastro_postal_code  |             | string |
| gastro_latitude     |             | double |
| gastro_longitude    |             | double |
| gastro_stars        |             | double |
| gastro_review_count |             | bigint |
| gastro_categories   |             | string |
| review_id           |             | string |
| review_user_id      |             | string |
| review_stars        |             | string |
| review_useful       |             | string |
| review_funny        |             | string |
| review_cool         |             | string |
| review_text         |             | string |
| review_date         |             | string |
| user_name           |             | string |
| user_review_count   |             | string |
| user_yelping_since  |             | string |
| user_useful         |             | string |
| user_funny          |             | string |
| user_cool           |             | string |
| user_elite          |             | string |
| user_friends        |             | string |
| user_fans           |             | string |
| user_average_stars  |             | string |
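
As a reference, here is a minimal pandas sketch of steps 2-4. It is illustrative only: the actual preprocessing was done with Dataiku recipes, the file paths are assumptions, and running this in memory on the full >20 GB dataset would require a machine with substantial RAM.

import pandas as pd

# Keywords that identify gastronomy businesses (step 2).
keywords = ["Bakeries", "Bar", "Bars", "Bistros", "Cafes",
            "Patisserie", "Restaurants", "Tea"]

# Yelp files are JSON Lines, hence lines=True (paths assumed).
business = pd.read_json("../data/yelp_academic_dataset_business.json", lines=True)
review = pd.read_json("../data/yelp_academic_dataset_review.json", lines=True)
user = pd.read_json("../data/yelp_academic_dataset_user.json", lines=True)

# Step 2: keep businesses whose 'categories' string contains any keyword
# (a loose substring match; e.g. "Bar" also matches "Barbers").
gastro = business[business["categories"].fillna("").str.contains("|".join(keywords))]

# Step 3: inner join of gastronomy businesses with their reviews.
business_review = gastro.merge(review, on="business_id", how="inner",
                               suffixes=("_gastro", "_review"))

# Step 4: enrich each review with the data of the user who wrote it.
full = business_review.merge(user, on="user_id", how="inner",
                             suffixes=("", "_user"))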

Data engineering

Training

Usage

Training

To train a model, run the src/train.py script.

poetry run python train.py

Note: After every run of train.py, the model is saved in the ./model folder. All pre-defined models have already been trained and can be found in this folder; re-running training for one of them will overwrite the corresponding file.

Inference

Dependencies

Dependencies are managed with Poetry. To add or update dependencies, use Poetry's dependency management commands.
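
For example, to add a new package or update the existing ones (the package name here is just an illustration):

poetry add pandas
poetry update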

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
