
gridapp-web-scraper's Introduction

Mwangazi Solar Lights web scraper and database


To run this app on your local machine

  • Install Ruby 2.3.3
  • In a terminal, run git clone https://github.com/luigilake/GridApp-web-scraper.git
  • Navigate to the project's root directory with cd GridApp-web-scraper
  • Run bundle install && rake db:setup
  • In a terminal, run rails s
  • Visit http://localhost:3000/ in your browser.

Web Scraper

The web scraper currently resides in application_controller.rb as the scrape_mangoo function, where its effectiveness in scraping the two websites can be tested. The websites being scraped are Mangoo.org and Lighting Global.

The scraper starts by accessing a selected product page on Mangoo.org, for example the Solar Lantern S20. It obtains the relevant information from the Mangoo page, which includes a link to the same product's Lighting Global webpage. It then follows that link and obtains the relevant information from the Lighting Global page, which should include the product spec PDFs.
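The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual scrape_mangoo code: the helper names (find_link, scrape_product), the injected fetch callable, and the example.com-style URLs are all assumptions, and a regular expression stands in for a proper HTML parser to keep the sketch dependency-free.

```ruby
require "uri"

# Hypothetical helper: returns the first href in the HTML whose host
# matches host_pattern, or nil if there is none.
def find_link(html, host_pattern)
  hrefs = html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten
  hrefs.find do |href|
    host = begin
      URI.parse(href).host
    rescue URI::InvalidURIError
      nil
    end
    host && host =~ host_pattern
  end
end

# Hypothetical two-step flow: read the Mangoo product page, follow its
# Lighting Global link, then collect the PDF links found on that page.
# `fetch` is any callable that takes a URL and returns the page HTML,
# so the flow can be exercised without network access.
def scrape_product(mangoo_url, fetch)
  mangoo_html = fetch.call(mangoo_url)
  lg_url = find_link(mangoo_html, /lightingglobal/i)
  if lg_url.nil?
    return { mangoo_url: mangoo_url, lighting_global_url: nil, spec_pdfs: [] }
  end

  lg_html = fetch.call(lg_url)
  pdfs = lg_html.scan(/href\s*=\s*["']([^"']+\.pdf)["']/i).flatten.uniq
  { mangoo_url: mangoo_url, lighting_global_url: lg_url, spec_pdfs: pdfs }
end

# Usage with stubbed pages (placeholder URLs, not the real sites).
pages = {
  "https://mangoo.example/s20" =>
    '<a href="https://www.lightingglobal.example/s20">Lighting Global</a>',
  "https://www.lightingglobal.example/s20" =>
    '<a href="/files/s20-spec.pdf">Spec sheet</a>'
}
result = scrape_product("https://mangoo.example/s20", ->(url) { pages.fetch(url) })
puts result[:spec_pdfs].inspect
```

Passing the fetcher in as an argument makes it easy to check the flow against random products with saved HTML before pointing it at the live sites.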

While the scraper currently works, a couple of things still need to be done. Here is the general flow of information, with the outstanding tasks listed under the relevant steps:

  • THE WEB SCRAPER OBTAINS ALL INFORMATION FROM MANGOO AND LIGHTING GLOBAL
    • Double-check, by scraping random Mangoo.org products, that the scraper dynamically and successfully obtains the proper information from both the Lighting Global and Mangoo websites.
    • The format of the PDFs changed recently. Fix the currently commented-out PDF scraper so that the information in the PDFs is also scraped properly.
  • SAVE ALL INFO OBTAINED BY THE WEB SCRAPER INTO THE CSV FILES (inside the public folder)
  • THE MWANGAZI TEAM REVIEWS ALL INFORMATION IN THE CSV FILES
    • The CSVs will have a column entitled 'VERIFIED', with the value 'YES' or 'NO'. The Mwangazi team will change this to 'YES' once they have verified the information in that row.
  • IF INFORMATION IS VERIFIED, PULL ALL INFO FROM THE CSVs, THEN ADD IT TO THE DATABASE.
  • Done!
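
The CSV review step in the flow above could look roughly like this. It is a sketch under assumptions: only the 'VERIFIED' column and its 'YES'/'NO' values come from this README, while the other column names (NAME, SOURCE), the sample rows, and the helper names are hypothetical.

```ruby
require "csv"

# Hypothetical column layout; only VERIFIED is taken from the README.
HEADERS = %w[NAME SOURCE VERIFIED].freeze

# Serialize scraped rows for review, marking every row as unverified.
def to_review_csv(rows)
  CSV.generate do |csv|
    csv << HEADERS
    rows.each { |row| csv << [row[:name], row[:source], "NO"] }
  end
end

# Return only the rows the team has marked 'YES', as plain hashes
# ready to be turned into database records.
def verified_rows(csv_text)
  CSV.parse(csv_text, headers: true)
     .select { |row| row["VERIFIED"].to_s.strip.upcase == "YES" }
     .map(&:to_h)
end

# Usage: write the review file contents, simulate a reviewer approving
# one row, then pick up only the approved rows.
csv_text = to_review_csv([
  { name: "Solar Lantern S20", source: "mangoo" },
  { name: "Example Lamp", source: "mangoo" }
])
reviewed = csv_text.sub("Solar Lantern S20,mangoo,NO", "Solar Lantern S20,mangoo,YES")
puts verified_rows(reviewed).inspect
```

Defaulting every scraped row to 'NO' means nothing reaches the database until someone on the team has explicitly approved it.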

Database

This database has three main tables: Products, Locations, and Distributors, each of which has an associated join table. Two other relevant tables are the Manufacturer table, which has a one-to-many relationship with the Products table, and the Prices table, which functions as the join table between Products and Distributors.

Please refer to the schema.rb file to see all the columns within said tables.
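
To make the relationships concrete, here is a small plain-Ruby sketch of how a Prices join record links a product to a distributor, and how a manufacturer relates to many products. The field names (id, name, manufacturer_id, amount), the sample data, and the helper names are assumptions for illustration only. The real columns are in schema.rb, and in the app these would be Rails models rather than Structs.

```ruby
# Hypothetical, simplified stand-ins for the tables described above.
Manufacturer = Struct.new(:id, :name)
Product      = Struct.new(:id, :name, :manufacturer_id)
Distributor  = Struct.new(:id, :name)
# Join record: one row per (product, distributor) pair, carrying a price.
Price        = Struct.new(:product_id, :distributor_id, :amount)

# One-to-many: all products that belong to a manufacturer.
def products_by(manufacturer, products)
  products.select { |p| p.manufacturer_id == manufacturer.id }
end

# Many-to-many through Prices: distributor name => price for one product.
def prices_for(product, prices, distributors)
  by_id = distributors.map { |d| [d.id, d] }.to_h
  prices.select { |pr| pr.product_id == product.id }
        .map { |pr| [by_id.fetch(pr.distributor_id).name, pr.amount] }
        .to_h
end

# Usage with placeholder data.
maker        = Manufacturer.new(1, "Example Maker")
lantern      = Product.new(10, "Solar Lantern S20", maker.id)
distributors = [Distributor.new(100, "Distributor A"), Distributor.new(101, "Distributor B")]
prices       = [Price.new(10, 100, 20), Price.new(10, 101, 22)]

puts products_by(maker, [lantern]).map(&:name).inspect
puts prices_for(lantern, prices, distributors).inspect
```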
