web_layout_crawler_plus_plus's Introduction

Web Layout Crawler

This project uses the Playwright library to crawl a specified webpage in Chrome and Firefox, with WebAssembly both enabled and disabled. Downloaded page files are saved to the JSOutput folder, and screenshots are saved to the Screenshots folder.
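
As a rough illustration of the four crawl combinations (two browsers, each with WebAssembly on and off), the sketch below builds one set of launch options per combination. The type and function names are hypothetical, and this README does not say how the project actually disables WebAssembly. The sketch assumes a V8 flag passed through Chromium's `--js-flags` and the Firefox preference `javascript.options.wasm`, shaped to fit the `args` and `firefoxUserPrefs` options of Playwright's `browserType.launch()`.

```typescript
// Hypothetical names throughout; the project's real configuration is not shown in this README.
type BrowserName = "chromium" | "firefox";

interface LaunchOptionsSketch {
  args?: string[];                             // extra command-line switches (Chromium)
  firefoxUserPrefs?: Record<string, boolean>;  // about:config preferences (Firefox)
}

interface CrawlVariant {
  browser: BrowserName;
  wasmEnabled: boolean;
  launchOptions: LaunchOptionsSketch;
}

function launchOptionsFor(browser: BrowserName, wasmEnabled: boolean): LaunchOptionsSketch {
  if (wasmEnabled) {
    return {}; // browser defaults already expose WebAssembly
  }
  if (browser === "chromium") {
    // Assumption: hide the WebAssembly global by forwarding a V8 flag.
    return { args: ["--js-flags=--noexpose_wasm"] };
  }
  // Assumption: Firefox preference that switches WebAssembly off.
  return { firefoxUserPrefs: { "javascript.options.wasm": false } };
}

// Enumerate every browser/WebAssembly combination the crawler would visit.
function buildCrawlVariants(): CrawlVariant[] {
  const browsers: BrowserName[] = ["chromium", "firefox"];
  const result: CrawlVariant[] = [];
  for (const browser of browsers) {
    for (const wasmEnabled of [true, false]) {
      result.push({ browser, wasmEnabled, launchOptions: launchOptionsFor(browser, wasmEnabled) });
    }
  }
  return result;
}

for (const v of buildCrawlVariants()) {
  console.log(v.browser, "wasm=" + v.wasmEnabled, JSON.stringify(v.launchOptions));
}
```

Each `launchOptions` object could then be handed to the matching Playwright browser type when launching, so that the same page is visited once per variant.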

How to set up

Prerequisites

  • Node.js
  • MySQL

Installation

  1. Run found_page_schema.sql, located in the Database folder, against your MySQL server to set up the schema and table used for metadata logging.
  2. Run npm install in the root directory of this project (the directory containing this README).
  3. Run npm run build to compile the TypeScript sources in the src folder into JavaScript files in the build folder.
  4. Optionally, modify the scripts under src or adjust the scan parameters in src/config.json, then rebuild by repeating Step 3.

Usage

  1. Run node ./build/index.js --url <url_to_scan> to scan <url_to_scan> and all of its first-level subpages. For example, try node ./build/index.js --url https://jkumara.github.io/pong-wasm/ since this site contains WebAssembly.
  2. To scan a list of URLs, run node ./build/index.js --file <file_path>, which reads the URLs from the file at <file_path>. For example, to use the included file sites.txt, run node ./build/index.js --file sites.txt
  3. By default, both commands download only the WebAssembly files they find. To download all files, add the flag --full true. For example, to extend the example in Usage 2, run node ./build/index.js --file sites.txt --full true.
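
To make the flag behavior concrete, here is a minimal sketch of how the `--url`, `--file`, and `--full` flags might be interpreted and how the download filter could work. It is not the project's actual code: `parseCliArgs`, `shouldDownload`, and `CliOptions` are hypothetical names, and the rule for recognizing a WebAssembly file (an `application/wasm` content type or a `.wasm` path suffix) is an assumption.

```typescript
// Hypothetical sketch; the real argument handling lives in the project's src folder.
interface CliOptions {
  url?: string;   // single page to scan (--url)
  file?: string;  // path to a list of URLs (--file)
  full: boolean;  // download everything instead of WebAssembly only (--full true)
}

// Parse "--flag value" pairs such as those passed after node ./build/index.js.
function parseCliArgs(argv: string[]): CliOptions {
  const options: CliOptions = { full: false };
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (value === undefined) {
      throw new Error("Missing value for " + flag);
    }
    switch (flag) {
      case "--url":
        options.url = value;
        break;
      case "--file":
        options.file = value;
        break;
      case "--full":
        options.full = value === "true";
        break;
      default:
        throw new Error("Unknown flag: " + flag);
    }
  }
  if (!options.url && !options.file) {
    throw new Error("Provide either --url or --file");
  }
  return options;
}

// Decide whether a fetched resource should be written to disk.
function shouldDownload(resourceUrl: string, contentType: string, full: boolean): boolean {
  if (full) {
    return true; // --full true keeps every file
  }
  // Assumption: WebAssembly is recognized by MIME type or by a .wasm path suffix.
  const path = resourceUrl.split(/[?#]/)[0] ?? resourceUrl;
  return /^application\/wasm\b/i.test(contentType) || /\.wasm$/i.test(path);
}

const example = parseCliArgs(["--file", "sites.txt", "--full", "true"]);
console.log(example, shouldDownload("https://example.com/app.js", "text/javascript", example.full));
```

Keeping the filter in a single pure function makes the default (WebAssembly only) and the `--full true` override easy to check without launching a browser.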

web_layout_crawler_plus_plus's People

Contributors

alanrom
