Shaw Scraper

A web scraper built to collect movie and seat-purchase data from Shaw Theatres' website in order to understand movie-goers' behavioural patterns. The data was used to build interesting visualisations, which can be found on the PopcornData website. More details on how we obtained and cleaned the data can be found in this Medium article.

Data Collected

Raw Data

The complete raw data collected can be found here.

Cleaned Data

The processed data can be found here.

Built With

Getting Started

The scraper was built to run on Heroku. The following instructions are to deploy it on Heroku.

Prerequisites

  • Heroku

    • Account - Create a free account on Heroku
    • Heroku CLI - Follow these instructions to download and install the Heroku CLI
  • MongoDB Atlas account

    • Create a free MongoDB Atlas account
    • Create a database in MongoDB named "shaw_data" with a collection inside it called "movie_data". You can use different names for the database and collection, but you must update the Shaw_scraper.py file accordingly.
    • Add 0.0.0.0/0 (i.e. all addresses) to your MongoDB Atlas whitelist
    • Get the database connection string, which has the format:
    mongodb://[username:password@]host1[:port1][,...hostN[:portN]][/[defaultauthdb][?options]]
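
Before handing the connection string to the MongoDB driver, it can be sanity-checked with the standard library. This helper is an illustration only and is not part of the repo:

```python
from urllib.parse import urlsplit

def validate_mongo_url(url):
    """Return True if the string looks like a MongoDB connection URI."""
    parts = urlsplit(url)
    return parts.scheme in ("mongodb", "mongodb+srv") and bool(parts.hostname)

print(validate_mongo_url("mongodb://user:pass@host1:27017/shaw_data"))  # → True
```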
    

Installation

  1. Clone the repo and navigate to the cloned folder
git clone https://github.com/PopcornData/shaw-scraper.git
cd shaw-scraper
  2. Open your Heroku CLI and log in to Heroku
heroku login
  3. Create a new project on Heroku
heroku create <project-name>
  4. Add the remote
heroku git:remote -a <project-name>
  5. Add the buildpacks necessary for Selenium and ChromeDriver
heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-python.git

heroku buildpacks:add --index 2 https://github.com/heroku/heroku-buildpack-chromedriver

heroku buildpacks:add --index 3 https://github.com/heroku/heroku-buildpack-google-chrome
  6. Set the required config vars in the Heroku configuration
heroku config:set GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome

heroku config:set CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver

heroku config:set MONGODB_URL=<your-MongoDB-connection-string>
  7. Deploy to Heroku (make sure you are in the cloned folder before deploying)
git push heroku master
  8. Run the following command to start the scraper
heroku ps:scale clock=1
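
As a hypothetical sanity check (not code from the repo), the three config vars set above could be verified at startup before the scraper runs:

```python
# Names match the config vars set in step 6; in practice you would pass
# os.environ rather than a literal dict.
REQUIRED_VARS = ("GOOGLE_CHROME_BIN", "CHROMEDRIVER_PATH", "MONGODB_URL")

def missing_config(env):
    """Return the names of any required config vars absent from `env`."""
    return [v for v in REQUIRED_VARS if v not in env]

print(missing_config({"MONGODB_URL": "mongodb://..."}))
# → ['GOOGLE_CHROME_BIN', 'CHROMEDRIVER_PATH']
```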

Usage

The scraper has two functions, which run separately:

  1. get_movie_data() - This function scrapes the movie details from all the theatres for the given day and stores the JSON data in the DB. The data has the following format:
{
 "theatre":"Nex",
 "hall":"nex Hall 5",
 "movie":"Jumanji: The Next Level",
 "date":"18 Jan 2020",
 "time":"1:00 PM+",
 "session_code":"P00000000000000000200104"
}
  2. get_seat_data() - This function scrapes the seat details for movie sessions, including which seats were bought and the time at which they were bought. It scrapes data from the previous day so that all the seat data (ticket sales) is up to date. It should be run only after running the get_movie_data() function, as it updates the JSON in the DB by adding the seat data to it. The updated data has the following format:
 {
     "theatre":"Nex",
     "hall":"nex Hall 5",
     "movie":"Jumanji: The Next Level",
     "date":"18 Jan 2020",
     "time":"1:00 PM+",
     "session_code":"P00000000000000000200104",
     "seats":[
         {   
           "seat_status":"AV",
           "last_update_time":"2020-01-20 14:34:53.704117",
           "seat_buy_time":"1900-01-01T00:00:00",
           "seat_number":"I15",
           "seat_sold_by":""
         },
         ...,
         {  
           "seat_status":"SO",
           "last_update_time":"2020-01-20 14:34:53.705116",
           "seat_buy_time":"2020-01-18T13:12:34.193",
           "seat_number":"F6",
           "seat_sold_by":""
         }
      ]
 }
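
As an illustration of working with the schema above (this helper is ours, not code from the scraper), the seat_status codes "AV" (available) and "SO" (sold) from the sample data can be used to tally ticket sales for a session:

```python
def count_sold(seats):
    """Count seats whose seat_status is "SO" (sold), per the schema above."""
    return sum(1 for s in seats if s.get("seat_status") == "SO")

seats = [
    {"seat_number": "I15", "seat_status": "AV"},
    {"seat_number": "F6", "seat_status": "SO"},
]
print(count_sold(seats))  # → 1
```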

A full sample updated document in the database can be viewed here.

The functions are scheduled to run daily at the times specified in clock.py. The timings and frequencies of the scraper can be changed by editing the clock.py file.
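
The exact schedule lives in clock.py; Heroku clock processes are commonly backed by a scheduling library such as APScheduler, though we make no claim about what this repo uses. As an illustration only, the "next daily run" computation such a scheduler performs can be sketched as:

```python
from datetime import datetime, timedelta

def next_run(now, hour, minute=0):
    """Return the next occurrence of hour:minute strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed; run tomorrow
    return candidate

print(next_run(datetime(2020, 1, 18, 13, 0), 12))  # → 2020-01-19 12:00:00
```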

License

Distributed under the MIT License. See LICENSE for more information.

Team

Disclaimer

This scraper was made as a project to analyse cinema seat patterns. We are in no way affiliated with Shaw Theatres and are not responsible for the accuracy of the data scraped using this scraper. The scraper was developed to scrape data from the website in January 2020 and was functional as of June 2020; it may no longer work as expected, since the structure of the website may have changed since then.


