
crawlab-team / crawlab

Distributed web crawler admin platform for spider management, regardless of language and framework.

Home Page: https://www.crawlab.cn

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 2.86% Shell 15.63% Go 80.96% Python 0.56%
webcrawler scrapy crawlab spiders-management go scrapyd-ui spider crawler webspider web-crawler

crawlab's Introduction

Crawlab

Chinese | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting a variety of languages including Python, NodeJS, Go, Java and PHP, and a variety of web crawler frameworks including Scrapy, Puppeteer and Selenium.

Demo | Documentation

Installation

You can follow the installation guide.

Quick Start

Please open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/examples
cd examples/docker/basic
docker-compose up -d

Next, you can look into the docker-compose.yml (which contains detailed config params) and the Documentation for further information.

Run

Docker

Please use docker-compose for one-click start-up. By doing so, you don't even have to configure the MongoDB database yourself. Create a file named docker-compose.yml and enter the content below.

version: '3.3'
services:
  # master node: serves the frontend/API and schedules tasks
  master:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_master
    environment:
      CRAWLAB_NODE_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
    volumes:
      - "./.crawlab/master:/root/.crawlab"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo

  # worker nodes: execute crawling tasks assigned by the master
  worker01:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker01
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker01:/root/.crawlab"
    depends_on:
      - master

  worker02: 
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker02
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker02:/root/.crawlab"
    depends_on:
      - master

  # operational database
  mongo:
    image: mongo:4.2
    container_name: crawlab_example_mongo
    restart: always

Then execute the command below, and the Crawlab master node, worker nodes and MongoDB will all start up. Open your browser and go to http://localhost:8080 to see the UI.

docker-compose up -d

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Spider List

Spider Overview

Spider Files

Task Log

Task Results

Cron Job

Architecture

The architecture of Crawlab consists of a master node, worker nodes, SeaweedFS (a distributed file system) and a MongoDB database.

The frontend app interacts with the master node, which communicates with the other components: MongoDB, SeaweedFS and the worker nodes. The master node and worker nodes communicate with each other via gRPC (an RPC framework). Tasks are scheduled by the task scheduler module in the master node and received by the task handler module in the worker nodes, which execute them in task runners. Task runners are processes running spider or crawler programs; they can also send data through gRPC (integrated in the SDK) to other data sources, e.g. MongoDB.

Master Node

The Master Node is the core of the Crawlab architecture and its central control system.

The Master Node provides the following services:

  1. Task Scheduling;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (you can regard the Master Node as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to the Worker Nodes. In the meantime, the Master Node uploads (deploys) spiders to the distributed file system SeaweedFS, from which the worker nodes synchronize them.

Worker Node

The main responsibilities of the Worker Nodes are to execute crawling tasks, store results and logs, and communicate with the Master Node through gRPC. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The task queue is also stored in MongoDB.
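If you want to see what Crawlab keeps there, you can inspect the collections directly with pymongo. Below is a minimal sketch; it assumes MongoDB is reachable from your machine (e.g. you added a ports mapping such as "27017:27017" to the mongo service above), and the database name is only a placeholder to adjust for your deployment.

# inspect_crawlab_mongo.py -- list the collections Crawlab has created (sketch)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes the MongoDB port is exposed
db = client["crawlab_db"]  # hypothetical database name; adjust to your deployment

# Print each collection (nodes, spiders, tasks, schedules, ...) and its document count
for name in db.list_collection_names():
    print(name, db[name].estimated_document_count())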

SeaweedFS

SeaweedFS is an open-source distributed file system authored by Chris Lu. It can robustly store and share files across a distributed system. In Crawlab, SeaweedFS mainly serves as the file synchronization system and as the store for task log files.

Frontend

The frontend app is built upon Element-Plus, a popular Vue 3-based UI framework. It interacts with the API hosted on the Master Node and indirectly controls the Worker Nodes.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.

ITEM_PIPELINES = {
    'crawlab.scrapy.pipelines.CrawlabPipeline': 888,  # pipeline priority (0-1000, lower runs earlier)
}

Then, start the Scrapy spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.
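For context, the pipeline above saves whatever items your spider yields. A minimal Scrapy spider that would feed it might look like the sketch below; the spider name, start URL and selectors are purely illustrative.

import scrapy

class QuoteSpider(scrapy.Spider):
    # hypothetical example spider; any items it yields are saved by CrawlabPipeline
    name = 'quote_spider'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict becomes one result record
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }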

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named CRAWLAB_TASK_ID, so that scraped data can be associated with its task.
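As an illustration, a plain Python script (without the Crawlab SDK) could read that environment variable and tag each record it stores with the task ID. The connection string, collection and field names below are hypothetical; for Python, the save_item helper shown above already handles this for you.

import os
from pymongo import MongoClient

# Crawlab injects the task ID into the process environment
task_id = os.environ.get('CRAWLAB_TASK_ID', '')

# Hypothetical data store; adjust the connection string and names for your setup
client = MongoClient('mongodb://localhost:27017')
collection = client['my_database']['my_results']

# Tag each record with its task so results can be related back to the task
collection.insert_one({'name': 'crawlab', 'task_id': task_id})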

Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?

The reason is that most existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to accommodate spiders in any language and any framework. It also has a beautiful frontend interface that makes managing spiders much easier.

Framework: Crawlab
Technology: Golang + Vue
Pros: Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Natively supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc.
Cons: Does not yet support spider versioning.

Framework: ScrapydWeb
Technology: Python Flask + Vue
Pros: Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications and mobile. A full-featured spider management platform.
Cons: Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend.

Framework: Gerapy
Technology: Python Django + Vue
Pros: Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc.
Cons: Again, does not support spiders other than Scrapy. Many bugs reported by users in v1.0; improvements are expected in v2.0.

Framework: SpiderKeeper
Technology: Python Flask
Pros: Open-source Scrapyhub. Concise and simple UI. Supports cron jobs.
Cons: Perhaps too simplified; no pagination, no node management, and no support for spiders other than Scrapy.

Contributors

Supported by JetBrains

Community

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account with the note "Crawlab" to join the discussion group.

crawlab's People

Contributors

0xflotus, appleboy, bestgopher, chncaption, codingendless, cxapython, darrenxyli, dependabot[bot], gemerz, gs80140, haivp3010, hantmac, hyyzzz111, jonnnh, luzihang123, ma-pony, marvzhang, maybgit, seven2nine, tikazyq, wahyd4, wo10378931, wu526, xiaoxiaolvqi, yann0917, zerafachris, zhangweiii, zkqiang


crawlab's Issues

One-click service start-up

Installing and starting the crawlab service is currently cumbersome and requires running several commands; a one-click start-up script is needed.

Cannot load celery.commands extension 'flower.command:FlowerCommand': ModuleNotFoundError("No module named 'flower.command'; 'flower' is not a package")

Environment: Alibaba Cloud, CentOS 7, Python 3.7
nohup python3 flower.py
Error: 'stats' inspect method failed
'active_queues' inspect method failed
...
nohup python3 worker.py
Error:
/usr/local/python3/lib/python3.7/site-packages/celery/utils/imports.py:167: UserWarning: Cannot load celery.commands extension 'flower.command:FlowerCommand': ModuleNotFoundError("No module named 'flower.command'; 'flower' is not a package")
namespace, class_name, exc))
/usr/local/python3/lib/python3.7/site-packages/celery/platforms.py:801: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0

Log management system

Log management currently just displays each task's log files on the frontend; logs now need to be managed centrally, with filtering, aggregation and parsing.

Nodes never go offline and cannot be deleted


Deployed on a k8s platform (as a Deployment), with the node IP set to the service's ClusterIP. When this node (master or worker) is updated, the web platform shows two nodes, even though one of them no longer exists.

Problem during deployment: API error

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/spiders/5c9f2095428f2c217cc474e6/deploy_file?node_id=celery@26GXGQ2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000096A3A20>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.',))

API cross-origin (CORS) issue: localhost appears in requests

Error
vue.runtime.esm.js?2b0e:619 [Vue warn]: Avoid mutating a prop directly since the value will be overwritten whenever the parent component re-renders. Instead, use a data or computed property based on the prop's value. Prop being mutated: "pageNum"

found in

---> at src/components/TableView/GeneralTableView.vue


at src/views/task/TaskDetail.vue
at src/views/layout/components/AppMain.vue
at src/views/layout/Layout.vue
at src/App.vue

Spider cannot be deployed

The following error is reported when deploying the spider:

/usr/local/lib/python3.6/dist-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

I don't quite understand it.

Running python3 ./bin/run_worker.py

python3 ./bin/run_worker.py
Traceback (most recent call last):
File "./bin/run_worker.py", line 9, in
from tasks.celery import celery_app
ModuleNotFoundError: No module named 'tasks.celery'

Running python3 ./bin/run_flower.py reports the following error

[I 190409 17:22:40 command:147] Registered tasks:

['celery.accumulate',
 'celery.backend_cleanup',
 'celery.chain',
 'celery.chord',
 'celery.chord_unlock',
 'celery.chunks',
 'celery.group',
 'celery.map',
 'celery.starmap']

[I 190409 17:22:40 mixins:229] Connected to redis://127.0.0.1:6379/0
[W 190409 17:22:45 control:44] 'stats' inspect method failed
[W 190409 17:22:45 control:44] 'active_queues' inspect method failed
[W 190409 17:22:45 control:44] 'registered' inspect method failed
[W 190409 17:22:45 control:44] 'scheduled' inspect method failed
[W 190409 17:22:45 control:44] 'active' inspect method failed
[W 190409 17:22:45 control:44] 'reserved' inspect method failed
[W 190409 17:22:45 control:44] 'revoked' inspect method failed
[W 190409 17:22:45 control:44] 'conf' inspect method failed

Error when running the spider

[2019-04-01 11:40:33,715: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)',)
Traceback (most recent call last):
File "C:\Users\xiaojiahao\Envs\jaho\lib\site-packages\billiard\pool.py", line 358, in workloop
result = (True, prepare_result(fun(*args, **kwargs)))
File "C:\Users\xiaojiahao\Envs\jaho\lib\site-packages\celery\app\trace.py", line 537, in _fast_trace_task
tasks, accept, hostname = _loc
ValueError: not enough values to unpack (expected 3, got 0)

Batch node deployment

Deploy nodes in batches, so that all nodes can be deployed in one go instead of one by one.

Uploaded zip file is not synchronized to the worker (child) nodes

docker run -d --name crawlab_w1 -e CRAWLAB_REDIS_ADDRESS=redis -e CRAWLAB_MONGO_HOST=mongo -e CRAWLAB_SERVER_MASTER=N -e CRAWLAB_API_ADDRESS=192.168.2.222:8000 -e CRAWLAB_SPIDER_PATH=/app/spiders -v /var/logs/crawlab:/var/logs/crawlab --link mongo:mongo --link redis:redis --privileged=true tikazyq/crawlab

The Docker container shows as online.
Uploaded a zip file containing only a single test.js with console.log("test");
Only the master node can run it. I have tried uploading several times today, but the child nodes still don't get it. The child nodes report an error, and inside their Docker containers no directory has been created, only the example directories.

Redis port and password problem

The port cannot be specified when running docker run: CRAWLAB_REDIS_ADDRESS=127.0.0.1:6378 has no effect, and the connection still goes to 127.0.0.1:6378:6379. A password cannot be specified either; the same applies to Mongo.

When connecting to MongoDB with a MongoClient configured with a username and password: pymongo.errors.OperationFailure: auth failed

mongo = MongoClient(host=MONGO_HOST,
                    port=MONGO_PORT,
                    username=MONGO_USERNAME,
                    password=MONGO_PASSWORD,
                    authSource=MONGO_DB,
                    connect=False)
For the MongoClient initialization code above, the authSource field must be set to the name of the database on which the username account actually has privileges; otherwise authentication fails. The specific reason:
MongoDB permissions are usually granted per database, so authentication can only succeed against that database; otherwise the driver authenticates against admin and auth fails. This setting is needed for MongoDB instances with access control enabled; if you use the root account and password, or no credentials at all, this problem does not occur.
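A quick way to verify the fix described above is to force an authenticated operation right after creating the client; the sketch below uses placeholder credentials and database names.

from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Placeholder credentials -- replace with your own
client = MongoClient(host='localhost', port=27017,
                     username='crawlab_user', password='secret',
                     authSource='crawlab_db',  # the db where this user was created / has privileges
                     connect=False)

try:
    # listing collections requires authentication, so this exposes auth problems immediately
    client['crawlab_db'].list_collection_names()
    print('auth ok')
except OperationFailure as exc:
    print('auth failed:', exc)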

Spider detail editing problem

The first time you click the "Website" input, when there is no data yet, a white modal stays displayed and covers the Run and Save buttons.
