

Amazon Basics Web Scraper

Description:

This script scrapes product data from the AmazonBasics store page and collects item information. The project serves as an exercise in web scraping techniques using Puppeteer.js.



Preview

Screen.Recording.2023-04-11.at.2.55.57.PM.mp4

Table of Contents
  1. Getting Started
  2. Preventing Endless Execution with the Timeout Option
  3. Scrolling Behavior
  4. Demo
  5. Acknowledgments

Getting Started

Instructions for getting a copy of the project up and running on your local machine for development and testing purposes.

Built With

  • Puppeteer.js

Prerequisites

The project requires Node.js and npm to be installed.


Installing and Usage

To install dependencies, run the following command:

 npm install 

To run the script, use the following command:

 npm start 

Configuration

The script was configured with the following options:

  • headless: false - runs the browser with its user interface visible instead of in headless mode.
  • userDataDir: './tmp' - the directory where the browser instance stores its user data.

To modify these options, edit the object passed to puppeteer.launch() in index.js.
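As a minimal sketch of how these two options could be passed to puppeteer.launch(), assuming Puppeteer is installed through npm install: the names DEFAULT_LAUNCH_OPTIONS, buildLaunchOptions, and openBrowser are hypothetical helpers introduced for illustration and are not taken from index.js.

```javascript
// Defaults mirroring the two options described above.
const DEFAULT_LAUNCH_OPTIONS = {
  headless: false,      // show the browser window while scraping
  userDataDir: "./tmp", // directory for the browser's user data
};

// Hypothetical helper: merge caller overrides over the defaults.
function buildLaunchOptions(overrides = {}) {
  return { ...DEFAULT_LAUNCH_OPTIONS, ...overrides };
}

// Hypothetical helper: load Puppeteer only when a browser is actually needed.
// Assumes the package's default export is the Puppeteer instance.
async function openBrowser(overrides) {
  const { default: puppeteer } = await import("puppeteer");
  return puppeteer.launch(buildLaunchOptions(overrides));
}
```

Calling openBrowser({ headless: true }) would then start the browser without a visible window while keeping the same user data directory.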



Preventing Endless Execution with the Timeout Option

Timeout

The script includes a timeout option that determines how long Puppeteer will wait for the product items to load. If the scraper does not find 100 items within the specified time, it stops and outputs the number of items it found. By default, the timeout is set to 30 seconds.

To modify the timeout, edit the timeout variable in script2.js.

Note that increasing the timeout can lengthen the time the script takes to complete, while decreasing it raises the risk that the scraper will not find all 100 items. Set the timeout based on the performance of the website being scraped and the speed of your internet connection.
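One possible way to make the timeout adjustable without editing the source is sketched below. This is an assumption for illustration, not how the script is written: the SCRAPE_TIMEOUT_MS environment variable and the resolveTimeout and describeOutcome helpers are hypothetical names.

```javascript
// The default documented above: 30 seconds, expressed in milliseconds.
const DEFAULT_TIMEOUT_MS = 30000;

// Hypothetical helper: use `raw` when it is a positive number,
// otherwise fall back to the default.
function resolveTimeout(raw, fallback = DEFAULT_TIMEOUT_MS) {
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical helper: build the message printed when scraping ends.
function describeOutcome(itemsLoaded, target = 100) {
  return itemsLoaded >= target
    ? `Found all ${target} items.`
    : `Timed out after finding ${itemsLoaded} of ${target} items.`;
}

// SCRAPE_TIMEOUT_MS is an assumed variable name, not part of the project.
const timeout = resolveTimeout(process.env.SCRAPE_TIMEOUT_MS);
```

With this in place, a slow connection could be accommodated by running the script with a larger SCRAPE_TIMEOUT_MS value, while an unset or invalid value keeps the 30-second default.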



Scrolling Behavior

Viewport

The Amazon Basics store page loads more items as you scroll down, rather than requiring a click to go to the next page. Because this page layout may depend on the viewport size, we set the viewport to a fixed size using the following code:

 await page.setViewport({ width: 1280, height: 720 });

By setting the viewport to a fixed width and height, we keep the page layout consistent across machines, so the same scroll-based scraping method works regardless of where the script runs.


While Loop

To ensure that the script finds all 100 product items on the Amazon Basics store page, we use the following while loop. The loop scrolls down the page until 100 items have been loaded, or until the specified timeout has been reached.

 while (itemsLoaded < 100 && Date.now() - start < timeout) {
     // Scroll down by one viewport height to trigger loading of more items.
     await page.evaluate(() => {
         window.scrollBy(0, window.innerHeight);
     });
     await page.waitForTimeout(1000); // wait 1 second for new items to load

     // Count the product images currently present on the page.
     itemsLoaded = await page.$$eval(".ProductGridItem__image__ih70n", (items) => items.length);
 }
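The loop above can also be wrapped in a reusable function, sketched here under a few assumptions. scrollUntilLoaded and sleep are hypothetical names, the option defaults simply repeat the values mentioned in this README, and sleep stands in for page.waitForTimeout, which is no longer available in newer Puppeteer releases.

```javascript
// Resolve after `ms` milliseconds; stands in for page.waitForTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: scroll until `target` items exist or `timeout` ms pass.
// `page` is expected to provide Puppeteer's evaluate() and $$eval() methods.
async function scrollUntilLoaded(page, options = {}) {
  const {
    selector = ".ProductGridItem__image__ih70n",
    target = 100,
    timeout = 30000,
    delayMs = 1000,
  } = options;

  const start = Date.now();
  let itemsLoaded = await page.$$eval(selector, (items) => items.length);

  while (itemsLoaded < target && Date.now() - start < timeout) {
    // Runs in the browser: scroll down by one viewport height.
    await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
    });
    await sleep(delayMs); // give newly requested items time to render
    itemsLoaded = await page.$$eval(selector, (items) => items.length);
  }

  return {
    itemsLoaded,
    elapsedMs: Date.now() - start,
    complete: itemsLoaded >= target,
  };
}
```

Returning the count together with a complete flag lets the caller distinguish a full result from one cut short by the timeout, which matches the behavior described in the Timeout section.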


Demo

Movie

Screen.Recording.2023-04-11.at.2.55.57.PM.mp4

Results

Sample:

Screen Shot 2023-04-11 at 4 07 50 PM


Acknowledgments

