Node JS Web Crawlr



Made with Node JS



Prerequisite

  1. Node.js version 8.5 or higher is required.

  2. If Node.js is not installed on your machine, go to this link to install it.
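The version requirement can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the names MIN_NODE and meetsMinimum are hypothetical, and only process.versions.node is a real Node.js property.

```javascript
// Minimum version taken from the prerequisite above (8.5, read here as 8.5.0).
const MIN_NODE = [8, 5, 0];

// Hypothetical helper: returns true when `version` (e.g. "8.5.0" or "v10.1.0")
// is greater than or equal to `minimum` (an array of numbers).
function meetsMinimum(version, minimum) {
  const parts = version.replace(/^v/, '').split('.').map(Number);
  for (let i = 0; i < minimum.length; i++) {
    const part = parts[i] || 0; // missing or non-numeric parts count as 0
    if (part > minimum[i]) return true;
    if (part < minimum[i]) return false;
  }
  return true; // every compared part is equal
}

// process.versions.node holds the running Node.js version without a leading "v".
if (!meetsMinimum(process.versions.node, MIN_NODE)) {
  console.error(
    'Node.js ' + MIN_NODE.join('.') + ' or higher is required; found ' +
    process.versions.node
  );
  process.exitCode = 1;
}
```

Running node --version on the command line gives the same information without a script.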

How to use

Node Web Crawlr requires Node.js v8.5.0 or later.

Clone this repository first:

    git clone https://github.com/aximilli1212/nodejs-web-crawlr.git

Then follow these steps:

  1. Go to the cloned directory (e.g. cd nodejs-web-crawlr).

  2. Run npm install.

  3. Run npm run start.

  4. The server starts on localhost:3000.

  5. Make a POST request to localhost:3000/api.

  6. The request should include the parameters hostname, regexes, and numLevels:

    hostname = URL string
    regexes = comma-separated string of regexes
    numLevels = integer

    sample: {hostname: http://****, regexes: ai,facebook\.com%2F([^-]+)-,instagram, numLevels: 3}
    NB: every regex runs with the global flag (/g) by default, so the regex string above becomes /ai/g, /facebook\.com%2F([^-]+)-/g, /instagram/g

    Download generated files

  7. All generated NDJSON output is exported to /document/match.ndjson.

  8. Download and inspect the results with a GET request to localhost:3000/document/match.ndjson.
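
Steps 5 to 8 above can be sketched as a small Node.js client. This is a hedged illustration under stated assumptions, not the project's own client code. The README does not say how the parameters are encoded, so a JSON body is assumed here. The function names buildCrawlBody, postCrawl, downloadMatches and parseNdjson are hypothetical. Only the built-in http module is used, so the sketch should run on Node 8.5.

```javascript
const http = require('http');

// Builds the request body from the three documented parameters.
// ASSUMPTION: the server accepts a JSON body; the README does not specify this.
function buildCrawlBody(hostname, regexes, numLevels) {
  return JSON.stringify({
    hostname: hostname,
    regexes: regexes.join(','), // documented as one comma-separated string
    numLevels: numLevels,
  });
}

// Reads a whole HTTP response into a string.
function readBody(res, done) {
  let text = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => { text += chunk; });
  res.on('end', () => done(text));
}

// Step 5: POST the body to localhost:3000/api.
function postCrawl(body) {
  return new Promise((resolve, reject) => {
    const req = http.request({
      host: 'localhost',
      port: 3000,
      path: '/api',
      method: 'POST',
      headers: {
        'Content-Type': 'application/json', // assumption, see above
        'Content-Length': Buffer.byteLength(body),
      },
    }, (res) => readBody(res, (text) => resolve({ status: res.statusCode, text: text })));
    req.on('error', reject);
    req.end(body);
  });
}

// Step 8: GET localhost:3000/document/match.ndjson.
function downloadMatches() {
  return new Promise((resolve, reject) => {
    http.get({ host: 'localhost', port: 3000, path: '/document/match.ndjson' },
      (res) => readBody(res, resolve)).on('error', reject);
  });
}

// NDJSON holds one JSON value per line; blank lines are skipped.
function parseNdjson(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Usage (requires the server from step 3 to be running):
// postCrawl(buildCrawlBody('https://expressjs.com', ['ai', 'instagram'], 3))
//   .then(() => downloadMatches())
//   .then((text) => console.log(parseNdjson(text)));
```

Whether the crawl has finished by the time the POST response arrives is not stated in this README, so the download step may need to be retried. If the server expects form-encoded parameters instead of JSON, only buildCrawlBody and the Content-Type header need to change.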


`The crawler works best on sites whose HTML is already rendered when it is served. Support for browser-rendered sites (React, Angular, Vue.js) will be made available soon.`

Tests

The app has been tested against https://knust.edu.gh, https://google.com, dff.qbelimited.com, https://expressjs.com, https://ucc.edu.gh/, etc.

Regexes tested include a, as, instagram, (?:twitter.com)?, ar, and facebook.com%2F([^-]+)-. More regexes are being tested.
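
To try patterns like these before sending them to the crawler, the comma-separated regexes string can be compiled locally with the global flag, as the NB in the usage steps describes. This is a sketch of that idea only. The helpers compileRegexes and findMatches are hypothetical names, and the crawler's actual matching code may differ.

```javascript
// Splits the comma-separated `regexes` parameter and compiles each pattern
// with the global flag, e.g. "ai,instagram" -> [/ai/g, /instagram/g].
function compileRegexes(regexes) {
  return regexes
    .split(',')
    .map((source) => source.trim()) // "instagram " -> "instagram"
    .filter((source) => source !== '')
    .map((source) => new RegExp(source, 'g'));
}

// Collects every non-empty match of every pattern in `text`.
function findMatches(text, patterns) {
  const found = [];
  patterns.forEach((pattern) => {
    pattern.lastIndex = 0; // global regexes keep state between exec() calls
    let m;
    while ((m = pattern.exec(text)) !== null) {
      if (m[0] === '') {
        // Patterns such as (?:twitter.com)? can match the empty string;
        // advance manually so the loop cannot run forever.
        pattern.lastIndex++;
        continue;
      }
      found.push({ pattern: String(pattern), match: m[0] });
    }
  });
  return found;
}

// Example:
// findMatches('see instagram.com', compileRegexes('ai,instagram'));
```

Because the string is split on commas, a pattern that itself contains a comma (for example a quantifier such as {2,3}) cannot be expressed in this format.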

Contribute

Feel free to contribute as Crawlr still needs more updates and fixes.

Contributors

aximilli1212
