hotel-review-analysis's Introduction

Sentiment Analysis and Aspect classification for Hotel Reviews

This is the source code of MonkeyLearn's series of posts related to analyzing sentiment and aspects from hotel reviews using machine learning models.

Code organization

The project itself is a Scrapy project used to gather training and testing data from sites such as TripAdvisor and Booking. It also includes a set of Python scripts and Jupyter notebooks that process, classify, and index the scraped reviews.

The TripAdvisor spider (hotel_sentiment/spider/tripadvisor_spider.py) is used to gather data to train a sentiment analysis classifier in MonkeyLearn. Review texts are used as the sample content and review star ratings are used as the category (1 and 2 stars = Negative, 4 and 5 stars = Positive).
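
The star-to-category mapping described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `stars_to_label` is hypothetical, and skipping 3-star reviews is an assumption, since the text assigns them no category.

```python
from typing import Optional


def stars_to_label(stars: int) -> Optional[str]:
    """Map a review's star rating to a sentiment category."""
    if stars in (1, 2):
        return "Negative"
    if stars in (4, 5):
        return "Positive"
    # Assumption: ratings with no stated category (e.g. 3 stars) are skipped.
    return None


# Made-up example reviews as (text, stars) pairs.
reviews = [
    ("Dirty room and rude staff.", 1),
    ("It was fine.", 3),
    ("Loved it!", 5),
]

# Keep only the reviews that received a category.
samples = [
    (text, stars_to_label(stars))
    for text, stars in reviews
    if stars_to_label(stars) is not None
]
```

Here `samples` holds only the labeled reviews, which is the shape a training set for a two-class sentiment classifier would need.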

To crawl ~15000 items from TripAdvisor, use:

scrapy crawl tripadvisor -o itemsTripadvisor.csv -s CLOSESPIDER_ITEMCOUNT=15000

You can check out the generated machine learning sentiment analysis model here.

The Booking spider (hotel_sentiment/spider/booking_spider.py) is used to gather data to train an aspect classifier in MonkeyLearn. The data obtained with this spider can be manually tagged with each aspect (e.g. cleanliness, comfort & facilities, food, internet, location, staff, value for money) using MonkeyLearn's Sample tab or an external crowdsourcing service like Mechanical Turk.

To crawl from Booking, use:

scrapy crawl booking -o itemsBooking.csv

You first have to add the URL of a starting city. To crawl a single hotel on Booking, use:

scrapy crawl booking_singlehotel -o <hotel name>.csv

  • opinionTokenizer.py is a simple script to obtain the "opinion units" from each review.
  • classify_and_plot_reviews.ipynb is a simple notebook that uses the generated model to classify new reviews and then plots the results in a graph using Plotly.
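
To make the idea of "opinion units" concrete, here is a minimal sketch of splitting a review into smaller fragments that each carry one opinion. It is not the logic of opinionTokenizer.py: the splitting rule (sentence punctuation plus the connectors "but" and "however") and the function name `opinion_units` are assumptions made for illustration.

```python
import re
from typing import List

# Assumption: an opinion unit ends at sentence punctuation or at a contrastive
# connector such as "but"; the real script may use different rules.
_SPLIT = re.compile(r"[.!?;]+|\bbut\b|\bhowever\b", flags=re.IGNORECASE)


def opinion_units(review: str) -> List[str]:
    """Split a review into short fragments, dropping empty pieces."""
    parts = (part.strip(" ,") for part in _SPLIT.split(review))
    return [part for part in parts if part]


units = opinion_units("The room was clean, but the wifi was slow. Great staff!")
```

Splitting this way lets each fragment be classified separately, so a single review can contribute a positive opinion about one aspect and a negative opinion about another.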

You can check out the generated machine learning aspect classifier here.

To crawl from TripAdvisor, use:

scrapy crawl tripadvisor_more -a start_url="http://some_url" -o <hotel_name>.csv -s CLOSESPIDER_ITEMCOUNT=20000

Set start_url to the URL of a starting city to crawl from, such as https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html.

The scripts and notebooks necessary to replicate the post are in the classify_elastic folder:

  • classify_elastic/generate_files_for_indexing.py will take the CSV file produced by Scrapy and generate two files that other scripts will use.
  • classify_elastic/classify_pipe.py will open the opinion_units file and classify it with MonkeyLearn according to topic and sentiment, and save the results to a new CSV file.
  • classify_elastic/index_definition.json contains the mapping definitions used in Elasticsearch.
  • classify_elastic/index_reviews.py will index into your Elasticsearch instance the reviews generated by generate_files_for_indexing.py.
  • classify_elastic/index_opinion_units.py will index into your Elasticsearch instance the classified opinion units.
  • classify_elastic/Extract keywords.ipynb shows how to extract keywords from the indexed data.
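
As a rough illustration of the indexing step, the sketch below builds a request body in the newline-delimited JSON format that the Elasticsearch bulk API accepts (an action line followed by a document line, each terminated by a newline). It uses only the standard library and does not reproduce the repository's scripts: the index name `opinion_units`, the document fields, and the helper `bulk_body` are all hypothetical.

```python
import json
from typing import Dict, Iterable


def bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Serialize documents as bulk-API NDJSON: one action line per document."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch to index the next line into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical classified opinion units; the field names are assumptions.
docs = [
    {"text": "the wifi was slow", "topic": "internet", "sentiment": "Negative"},
    {"text": "Great staff", "topic": "staff", "sentiment": "Positive"},
]
body = bulk_body("opinion_units", docs)
```

The resulting `body` string could then be sent to the `_bulk` endpoint of an Elasticsearch instance, for example with an HTTP client and the `application/x-ndjson` content type.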

Finally, the queries folder contains some queries that were used to power the Kibana visualization.
