
crawlab-team / crawlab

Distributed web crawler admin platform for spider management, regardless of language and framework.

Home Page: https://www.crawlab.cn

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 2.86% Shell 15.63% Go 80.96% Python 0.56%
webcrawler scrapy crawlab spiders-management go scrapyd-ui spider crawler webspider web-crawler

crawlab's Introduction

Crawlab

Chinese | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting a variety of languages including Python, NodeJS, Go, Java and PHP, and a variety of web crawler frameworks including Scrapy, Puppeteer and Selenium.

Demo | Documentation

Installation

You can follow the installation guide.

Quick Start

Please open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/examples
cd examples/docker/basic
docker-compose up -d

Next, you can look into the docker-compose.yml (which contains detailed config params) and the Documentation for further information.

Run

Docker

Please use docker-compose for one-click start-up. By doing so, you don't even have to configure the MongoDB database yourself. Create a file named docker-compose.yml and enter the content below.

version: '3.3'
services:
  # master node: serves the frontend/API and schedules tasks
  master:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_master
    environment:
      CRAWLAB_NODE_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
    volumes:
      - "./.crawlab/master:/root/.crawlab"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo

  # worker nodes: execute crawling tasks assigned by the master
  worker01:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker01
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker01:/root/.crawlab"
    depends_on:
      - master

  worker02: 
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker02
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker02:/root/.crawlab"
    depends_on:
      - master

  # operational database
  mongo:
    image: mongo:4.2
    container_name: crawlab_example_mongo
    restart: always

Then execute the command below, and the Crawlab master node, worker nodes and MongoDB will all start up. Open your browser and go to http://localhost:8080 to see the UI.

docker-compose up -d

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Spider List

Spider Overview

Spider Files

Task Log

Task Results

Cron Job

Architecture

The architecture of Crawlab consists of a master node, worker nodes, SeaweedFS (a distributed file system) and a MongoDB database.

The frontend app interacts with the master node, which communicates with the other components: MongoDB, SeaweedFS and the worker nodes. The master node and worker nodes communicate with each other via gRPC (an RPC framework). Tasks are scheduled by the task scheduler module in the master node and received by the task handler module in the worker nodes, which execute them in task runners. Task runners are processes running spider or crawler programs; they can also send data through gRPC (integrated in the SDK) to other data sources, e.g. MongoDB.

Master Node

The Master Node is the core of the Crawlab architecture and its central control system.

The Master Node provides the following services:

  1. Task Scheduling;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (you can regard the Master Node as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to the Worker Nodes. In the meantime, the Master Node uploads (deploys) spiders to the distributed file system SeaweedFS, from which the worker nodes synchronize them.

Worker Node

The main responsibilities of the Worker Nodes are to execute crawling tasks, store results and logs, and communicate with the Master Node through gRPC. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The task queue is also stored in MongoDB.
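If you want to see what Crawlab keeps there, you can inspect the collections directly with pymongo. Below is a minimal sketch; it assumes MongoDB is reachable from your machine (e.g. you added a ports mapping such as "27017:27017" to the mongo service above), and the database name is only a placeholder to adjust for your deployment.

# inspect_crawlab_mongo.py -- list the collections Crawlab has created (sketch)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes the MongoDB port is exposed
db = client["crawlab_db"]  # hypothetical database name; adjust to your deployment

# Print each collection (nodes, spiders, tasks, schedules, ...) and its document count
for name in db.list_collection_names():
    print(name, db[name].estimated_document_count())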

SeaweedFS

SeaweedFS is an open-source distributed file system authored by Chris Lu. It can robustly store and share files across a distributed system. In Crawlab, SeaweedFS mainly serves as the file synchronization system and as the store for task log files.

Frontend

The frontend app is built upon Element-Plus, a popular Vue 3-based UI framework. It interacts with the API hosted on the Master Node and indirectly controls the Worker Nodes.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.

ITEM_PIPELINES = {
    'crawlab.scrapy.pipelines.CrawlabPipeline': 888,  # pipeline priority (0-1000, lower runs earlier)
}

Then, start the Scrapy spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.
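For context, the pipeline above saves whatever items your spider yields. A minimal Scrapy spider that would feed it might look like the sketch below; the spider name, start URL and selectors are purely illustrative.

import scrapy

class QuoteSpider(scrapy.Spider):
    # hypothetical example spider; any items it yields are saved by CrawlabPipeline
    name = 'quote_spider'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict becomes one result record
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }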

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named CRAWLAB_TASK_ID, so that scraped data can be associated with its task.
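As an illustration, a plain Python script (without the Crawlab SDK) could read that environment variable and tag each record it stores with the task ID. The connection string, collection and field names below are hypothetical; for Python, the save_item helper shown above already handles this for you.

import os
from pymongo import MongoClient

# Crawlab injects the task ID into the process environment
task_id = os.environ.get('CRAWLAB_TASK_ID', '')

# Hypothetical data store; adjust the connection string and names for your setup
client = MongoClient('mongodb://localhost:27017')
collection = client['my_database']['my_results']

# Tag each record with its task so results can be related back to the task
collection.insert_one({'name': 'crawlab', 'task_id': task_id})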

Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?

The reason is that most existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to accommodate spiders in any language and any framework. It also has a beautiful frontend interface that makes managing spiders much easier.

Framework: Crawlab
Technology: Golang + Vue
Pros: Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Natively supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc.
Cons: Does not yet support spider versioning.

Framework: ScrapydWeb
Technology: Python Flask + Vue
Pros: Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications and mobile. A full-featured spider management platform.
Cons: Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend.

Framework: Gerapy
Technology: Python Django + Vue
Pros: Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc.
Cons: Again, does not support spiders other than Scrapy. Many bugs reported by users in v1.0; improvements are expected in v2.0.

Framework: SpiderKeeper
Technology: Python Flask
Pros: Open-source Scrapyhub. Concise and simple UI. Supports cron jobs.
Cons: Perhaps too simplified; no pagination, no node management, and no support for spiders other than Scrapy.

Contributors

Supported by JetBrains

Community

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account with the note "Crawlab" to join the discussion group.

crawlab's People

Contributors

0xflotus, appleboy, bestgopher, chncaption, codingendless, cxapython, darrenxyli, dependabot[bot], gemerz, gs80140, haivp3010, hantmac, hyyzzz111, jonnnh, luzihang123, ma-pony, marvzhang, maybgit, seven2nine, tikazyq, wahyd4, wo10378931, wu526, xiaoxiaolvqi, yann0917, zerafachris, zhangweiii, zkqiang


crawlab's Issues

One-click service start-up

Installing and starting the crawlab service is currently cumbersome and requires running several commands; a one-click start-up script is needed.

Cannot load celery.commands extension 'flower.command:FlowerCommand': ModuleNotFoundError("No module named 'flower.command'; 'flower' is not a package")

Environment: Alibaba Cloud, CentOS 7, Python 3.7
nohup python3 flower.py
Error: 'stats' inspect method failed
'active_queues' inspect method failed
...
nohup python3 worker.py
Error:
/usr/local/python3/lib/python3.7/site-packages/celery/utils/imports.py:167: UserWarning: Cannot load celery.commands extension 'flower.command:FlowerCommand': ModuleNotFoundError("No module named 'flower.command'; 'flower' is not a package")
namespace, class_name, exc))
/usr/local/python3/lib/python3.7/site-packages/celery/platforms.py:801: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0

Log management system

Log management currently just displays each task's log files on the frontend; logs now need to be managed centrally, with filtering, aggregation and parsing.

Nodes never go offline and cannot be deleted


Deployed on a k8s platform (as a Deployment), with the node IP set to the service's ClusterIP. When this node (master or worker) is updated, the web platform shows two nodes, even though one of them no longer exists.

Problem during deployment: API error

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/spiders/5c9f2095428f2c217cc474e6/deploy_file?node_id=celery@26GXGQ2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000096A3A20>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.',))

API cross-origin (CORS) issue: localhost appears in requests

Error
vue.runtime.esm.js?2b0e:619 [Vue warn]: Avoid mutating a prop directly since the value will be overwritten whenever the parent component re-renders. Instead, use a data or computed property based on the prop's value. Prop being mutated: "pageNum"

found in

---> at src/components/TableView/GeneralTableView.vue


at src/views/task/TaskDetail.vue
at src/views/layout/components/AppMain.vue
at src/views/layout/Layout.vue
at src/App.vue

Spider cannot be deployed

The following error is reported when deploying the spider:

/usr/local/lib/python3.6/dist-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

I don't quite understand it.

Running python3 ./bin/run_worker.py

python3 ./bin/run_worker.py
Traceback (most recent call last):
File "./bin/run_worker.py", line 9, in
from tasks.celery import celery_app
ModuleNotFoundError: No module named 'tasks.celery'

Running python3 ./bin/run_flower.py reports the following error

[I 190409 17:22:40 command:147] Registered tasks:

['celery.accumulate',
 'celery.backend_cleanup',
 'celery.chain',
 'celery.chord',
 'celery.chord_unlock',
 'celery.chunks',
 'celery.group',
 'celery.map',
 'celery.starmap']

[I 190409 17:22:40 mixins:229] Connected to redis://127.0.0.1:6379/0
[W 190409 17:22:45 control:44] 'stats' inspect method failed
[W 190409 17:22:45 control:44] 'active_queues' inspect method failed
[W 190409 17:22:45 control:44] 'registered' inspect method failed
[W 190409 17:22:45 control:44] 'scheduled' inspect method failed
[W 190409 17:22:45 control:44] 'active' inspect method failed
[W 190409 17:22:45 control:44] 'reserved' inspect method failed
[W 190409 17:22:45 control:44] 'revoked' inspect method failed
[W 190409 17:22:45 control:44] 'conf' inspect method failed

Error when running the spider

[2019-04-01 11:40:33,715: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)',)
Traceback (most recent call last):
File "C:\Users\xiaojiahao\Envs\jaho\lib\site-packages\billiard\pool.py", line 358, in workloop
result = (True, prepare_result(fun(*args, **kwargs)))
File "C:\Users\xiaojiahao\Envs\jaho\lib\site-packages\celery\app\trace.py", line 537, in _fast_trace_task
tasks, accept, hostname = _loc
ValueError: not enough values to unpack (expected 3, got 0)

Batch node deployment

Deploy nodes in batches, so that all nodes can be deployed in one go instead of one by one.

Uploaded zip file is not synchronized to the worker (child) nodes

docker run -d --name crawlab_w1 -e CRAWLAB_REDIS_ADDRESS=redis -e CRAWLAB_MONGO_HOST=mongo -e CRAWLAB_SERVER_MASTER=N -e CRAWLAB_API_ADDRESS=192.168.2.222:8000 -e CRAWLAB_SPIDER_PATH=/app/spiders -v /var/logs/crawlab:/var/logs/crawlab --link mongo:mongo --link redis:redis --privileged=true tikazyq/crawlab

The Docker container shows as online.
Uploaded a zip file containing only a single test.js with console.log("test");
Only the master node can run it. I have tried uploading several times today, but the child nodes still don't get it. The child nodes report an error, and inside their Docker containers no directory has been created, only the example directories.

Redis port and password problem

The port cannot be specified when running docker run: CRAWLAB_REDIS_ADDRESS=127.0.0.1:6378 has no effect, and the connection still goes to 127.0.0.1:6378:6379. A password cannot be specified either; the same applies to Mongo.

When connecting to MongoDB with a MongoClient configured with a username and password: pymongo.errors.OperationFailure: auth failed

mongo = MongoClient(host=MONGO_HOST,
                    port=MONGO_PORT,
                    username=MONGO_USERNAME,
                    password=MONGO_PASSWORD,
                    authSource=MONGO_DB,
                    connect=False)
For the MongoClient initialization code above, the authSource field must be set to the name of the database on which the username account actually has privileges; otherwise authentication fails. The specific reason:
MongoDB permissions are usually granted per database, so authentication can only succeed against that database; otherwise the driver authenticates against admin and auth fails. This setting is needed for MongoDB instances with access control enabled; if you use the root account and password, or no credentials at all, this problem does not occur.
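A quick way to verify the fix described above is to force an authenticated operation right after creating the client; the sketch below uses placeholder credentials and database names.

from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Placeholder credentials -- replace with your own
client = MongoClient(host='localhost', port=27017,
                     username='crawlab_user', password='secret',
                     authSource='crawlab_db',  # the db where this user was created / has privileges
                     connect=False)

try:
    # listing collections requires authentication, so this exposes auth problems immediately
    client['crawlab_db'].list_collection_names()
    print('auth ok')
except OperationFailure as exc:
    print('auth failed:', exc)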

Spider detail editing problem

The first time you click the "Website" input, when there is no data yet, a white modal stays displayed and covers the Run and Save buttons.
