Webster

Overview

Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.

What sets Webster apart from other crawling frameworks is that it can scrape content rendered in the browser by client-side JavaScript and Ajax requests.

Requirements

  • Node.js 10.x+
  • Works on Linux and macOS

Alternatively, you can deploy with Docker.

Install

npm install webster

Single spider example

const { spider } = require('webster');

class MySpider extends spider {
    get defUserAgent() {
        return 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36';
    }
    get defDeviceType() {
        return 'pc';
    }
    async parseHtml(html) {
        return true;
    }
}

(async () => {
    const spider = new MySpider({
        actions: [
            {
                type: 'waitForSelector',
                selector: 'div.js-details-container',
            }
        ],
        targets: [
            {
                selector: 'div.Box-row[role=row]',
                type: 'text',
                field: 'sugs'
            }
        ],
    });
    const url = `https://github.com/zhuyingda/webster`;
    let crawlResult = await spider.startRequest(url);
    console.log(crawlResult);
})();
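
Each entry in `targets` pairs a CSS selector with an extraction type and an output field name. As a minimal sketch, a target list could be checked for obvious mistakes before it is handed to a spider. The `validateTargets` helper below is hypothetical and not part of Webster, and it only knows the two types that appear in this README (`text` and `attr`); Webster may accept others.

```javascript
// Hypothetical helper, not part of Webster: checks the shape of target
// descriptors as they appear in the examples in this README.
const KNOWN_TYPES = new Set(['text', 'attr']); // assumption: only the types shown here

function validateTargets(targets) {
    if (!Array.isArray(targets)) {
        return ['targets must be an array'];
    }
    const errors = [];
    targets.forEach((target, index) => {
        if (target === null || typeof target !== 'object') {
            errors.push(`targets[${index}]: must be an object`);
            return;
        }
        if (typeof target.selector !== 'string' || target.selector === '') {
            errors.push(`targets[${index}]: missing selector`);
        }
        if (typeof target.field !== 'string' || target.field === '') {
            errors.push(`targets[${index}]: missing field`);
        }
        if (!KNOWN_TYPES.has(target.type)) {
            errors.push(`targets[${index}]: unknown type "${target.type}"`);
        }
        // In the README examples, 'attr' targets also carry an attrName.
        if (target.type === 'attr' && typeof target.attrName !== 'string') {
            errors.push(`targets[${index}]: type "attr" needs attrName`);
        }
    });
    return errors;
}

const problems = validateTargets([
    { selector: 'div.Box-row[role=row]', type: 'text', field: 'sugs' },
    { selector: 'li.next > a', type: 'attr', field: 'link' } // attrName omitted on purpose
]);
console.log(problems); // [ 'targets[1]: type "attr" needs attrName' ]
```

An empty array means the list passed these basic checks. It does not guarantee that the selectors match anything on the page.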

Docker cluster example

Pull the example docker image:

docker pull zhuyingda/webster-demo
docker run -it zhuyingda/webster-demo

In this docker image, there is a simple cluster-able example:

// producer
const Webster = require('webster');
const Producer = Webster.producer;
const Task = Webster.task;

let task = new Task({
    spiderType: 'browser',
    engineType: 'playwright',
    browserType: 'chromium',
    url: 'http://quotes.toscrape.com/tag/humor/',
    targets: [
        {
            selector: 'span.text',
            type: 'text',
            field: 'quote'
        },
        {
            selector: 'li.next > a',
            type: 'attr',
            attrName: 'href',
            field: 'link'
        }
    ],
    actions: [
        {
            type: 'waitAfterPageLoading',
            value: 500
        }
    ],
    referInfo: {
        para1: 'this is a refer field 1',
        para2: 'this is a refer field 2'
    }
});

let myProducer = new Producer({
    channel: 'demo_channel1',
    dbConf: {
        redis: {
            host: 'redis-12419.c44.us-east-1-2.ec2.cloud.redislabs.com',
            port: 12419,
            password: 'X2AcjziaOOYPppWFOPiP4rmzZ9RFLViv'
        }
    }
});
myProducer.generateTask(task).then(() => {
    console.log('done');
    process.exit();
});

// consumer
const Webster = require('webster');
const Consumer = Webster.consumer;

class MyConsumer extends Consumer {
    constructor(option) {
        super(option);
    }
    afterCrawlRequest(result) {
        console.log('your scrape result:', result);
    }
}

let myConsumer = new MyConsumer({
    channel: 'demo_channel1',
    sleepTime: 5000,
    deviceType: 'pc',
    dbConf: {
        redis: {
            host: 'redis-12419.c44.us-east-1-2.ec2.cloud.redislabs.com',
            port: 12419,
            password: 'X2AcjziaOOYPppWFOPiP4rmzZ9RFLViv'
        }
    }
});
myConsumer.startConsume();

Run the producer, then the consumer:

node demo_producer.js
env MOD=debug node demo_consumer.js

You can organize your crawler cluster from Producers and Consumers in this way: producers push tasks onto a channel, and consumers listening on the same channel crawl them.
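
As a rough mental model of that arrangement, the sketch below replaces the Redis-backed channel with an in-memory queue and shows two consumers taking tasks from one shared channel. Everything here (`InMemoryChannel`, `SketchProducer`, `SketchConsumer`, the example URLs) is hypothetical and only illustrates the queueing pattern. It is not how Webster is implemented, and it performs no real crawling.

```javascript
// In-memory stand-in for the Redis-backed channel; illustration only.
class InMemoryChannel {
    constructor() {
        this.queue = [];
    }
    push(task) {
        this.queue.push(task);
    }
    pop() {
        return this.queue.shift(); // undefined when the channel is empty
    }
}

// Hypothetical producer: only enqueues task descriptions.
class SketchProducer {
    constructor(channel) {
        this.channel = channel;
    }
    generateTask(task) {
        this.channel.push(task);
    }
}

// Hypothetical consumer: takes one task at a time from the shared channel.
class SketchConsumer {
    constructor(name, channel) {
        this.name = name;
        this.channel = channel;
    }
    consumeOne() {
        const task = this.channel.pop();
        if (task === undefined) {
            return null; // nothing to do; a real consumer would sleep and retry
        }
        // A real consumer would open a browser and crawl task.url here.
        return { consumer: this.name, url: task.url };
    }
}

const channel = new InMemoryChannel();
const producer = new SketchProducer(channel);
producer.generateTask({ url: 'https://example.com/page/1' });
producer.generateTask({ url: 'https://example.com/page/2' });

const consumerA = new SketchConsumer('A', channel);
const consumerB = new SketchConsumer('B', channel);
const handled = [consumerA.consumeOne(), consumerB.consumeOne(), consumerA.consumeOne()];
console.log(handled);
```

Because each task is removed from the queue when it is taken, no two consumers handle the same task, which is what lets more consumers be added to share the load.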

Usage on Raspbian Platform

sudo apt install chromium-browser chromium-codecs-ffmpeg
env MOD=debug EXE_PATH=/usr/bin/chromium-browser node demo_consumer.js
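
The command above points `EXE_PATH` at the system Chromium binary. A small pre-flight check, sketched below, can confirm that the variable is set and that a file exists at that path before the consumer starts. The `resolveBrowserPath` function is a hypothetical helper. How Webster itself reads `EXE_PATH` is not shown in this README.

```javascript
const fs = require('fs');

// Hypothetical pre-flight check; Webster itself may handle EXE_PATH differently.
function resolveBrowserPath(env = process.env) {
    const exePath = env.EXE_PATH;
    if (!exePath) {
        return { ok: false, reason: 'EXE_PATH is not set' };
    }
    if (!fs.existsSync(exePath)) {
        return { ok: false, reason: `no file at ${exePath}` };
    }
    return { ok: true, path: exePath };
}

console.log(resolveBrowserPath());
```

Run with the same `EXE_PATH` value as the consumer command, this prints either the resolved path or the reason the check failed.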

Documentation

See the project documentation for more details.

Code Contributors

This project exists thanks to all the people who contribute. [Contribute].

Financial Contributors

Become a financial contributor and help us sustain our community. [Contribute]

Individuals

Organizations

Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]

License

GPL-V3

Copyright (c) 2017-present, Yingda (Sugar) Zhu
