
yugiyx / mini_spider


A basic crawler framework written with requests + pyquery + MongoDB


mini_spider's Introduction

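Lightweight crawler framework

0x01 Why a lightweight crawler

If the simplest approach (a combination of small libraries) meets the requirement, there is no need for a more complex one. You do not need an ox cleaver (scrapy) to kill a chicken.

Features:

Clear code structure that is easy to understand and convenient for technical exchange
Few dependencies, easy to deploy
Uses fire and PyInstaller to build a command-line tool and package it as a very small executable for non-programmers

Drawbacks:

Poor generality: adapting the crawler may require changing several modules at once, not just the parser module
Cannot crawl massive amounts of data in a short time; performance is limited
Poor fault tolerance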

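0x02 Lightweight crawler architecture

Architecture diagram

Crawler scheduler

Coordinates the work of the other four modules

URL manager

Manages URL links and keeps track of which URLs have been crawled and which have not
Provides an interface for fetching a new URL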
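
As a rough illustration of these two responsibilities, the sketch below keeps two sets, one for uncrawled and one for crawled URLs. The class name mirrors URLManager.py, but the method names (`add_new_url`, `add_new_urls`, `has_new_url`, `get_new_url`) are assumptions made for this example and may differ from the actual code.

```python
class URLManager:
    """Hypothetical sketch: tracks uncrawled and crawled URLs in two sets."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def add_new_url(self, url):
        # Ignore empty values and anything already known.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the uncrawled set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Using sets makes the duplicate check cheap, at the cost of an unspecified crawl order.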

HTML downloader

Gets uncrawled URLs from the URL manager
Downloads the content at each URL
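
A download step could look roughly like the sketch below. The project itself uses requests; this example swaps in the standard library's `urllib.request` so that it runs without third-party packages. The function name `download`, the `opener` parameter, and the User-Agent value are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def download(url, opener=None, timeout=10):
    """Return the decoded page body for url, or None if the request fails."""
    if opener is None:
        opener = urllib.request.build_opener()
    try:
        # Request() raises ValueError for strings that are not valid URLs.
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with opener.open(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Returning None on failure lets the scheduler skip a bad page and move on to the next URL.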

HTML parser

Gets downloaded pages from the HTML downloader, parses out new URLs, and returns them to the URL manager
Passes valid data to the data store
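
The sketch below illustrates the two outputs of the parsing step: a set of new URLs and a record of extracted data. The project uses pyquery for parsing; this example uses the standard library's `html.parser` instead so that it stays dependency-free. The names `LinkAndTitleParser` and `parse`, and the choice of the page title as the "valid data", are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute link targets and the page title."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.add(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(page_url, html):
    """Return (new_urls, data) for one downloaded page."""
    parser = LinkAndTitleParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.links, {"url": page_url, "title": parser.title.strip()}
```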

Data store

Stores valid data in whatever form is required
Stores the crawl history needed for incremental crawling, used for deduplication
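
A minimal sketch of both duties follows, assuming a plain-text history file with one URL per line. Records are written as JSON lines here purely as a stand-in; the project lists pymongo among its dependencies, and the real storage format may differ. The class name mirrors DataOutput.py, while the methods `is_seen` and `store` are hypothetical.

```python
import json
import os


class DataOutput:
    """Hypothetical sketch: stores records and a plain-text crawl history."""

    def __init__(self, data_path, log_path):
        self.data_path = data_path
        self.log_path = log_path
        self.seen = set()
        # Load the history left by earlier runs, one URL per line.
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as f:
                self.seen = {line.strip() for line in f if line.strip()}

    def is_seen(self, url):
        return url in self.seen

    def store(self, url, record):
        """Append the record unless url is already in the history."""
        if self.is_seen(url):
            return False
        with open(self.data_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(url + "\n")
        self.seen.add(url)
        return True
```

Because the history is reloaded on start-up, a second run skips URLs that an earlier run already stored.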

Crawler run flow
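
Based on the module responsibilities described above, the run flow can be sketched as a simple loop. The function name `crawl` and its parameters are hypothetical; the three callables stand in for the downloader, parser, and data store, and a small queue plays the role of the URL manager.

```python
from collections import deque


def crawl(start_url, download, parse, store, max_pages=10):
    """Crawl from start_url and return the number of pages stored."""
    pending = deque([start_url])  # uncrawled URLs
    seen = {start_url}            # every URL ever queued
    stored = 0
    while pending and stored < max_pages:
        url = pending.popleft()
        page = download(url)
        if page is None:
            continue  # download failed, move on to the next URL
        new_urls, data = parse(url, page)
        store(url, data)
        stored += 1
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
    return stored
```

Passing the modules in as callables keeps the loop easy to exercise with in-memory fakes before pointing it at a real site.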

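0x03 Environment

Python and related third-party libraries

Python 3.7.2 interpreter
requests 2.20.1 HTTP request library
pyquery 1.4.0 parsing library
pymongo 3.7.2 MongoDB library
fire 0.1.3 command-line interface library
PyInstaller 3.4 executable packaging library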

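0x04 Program components

SpiderMan.py crawler scheduler
URLManager.py URL manager
Downloader.py HTML downloader
Parser.py HTML parser
DataOutput.py data store
Download_log.txt plain-text download history (generated automatically by the program); a database is deliberately not used for it, to reduce complexity.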

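0x05 Configuration

See the comments in the code for details.
Day-to-day use mainly involves modifying the SpiderMan.py and Parser.py modules.
Other modules can be modified to add functionality, for example using selenium as the request library or BS4 as the parsing library, or defining your own data structures, commands, and storage and retrieval methods.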

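0x06 Next steps

CR: Rework the code to use multithreading and increase download speed.
CR: Add a separate configuration file so that multiple programs can share variables and pass file names and other important information.
CR: Optimize the component code to make the program more general.
Fix: none yet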
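
One possible direction for the multithreading item is a thread pool from the standard library. This is only a sketch of the idea, not the planned implementation; `download_all` and its parameters are hypothetical, and `download` is assumed to be any function that takes a URL and returns its content.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download, max_workers=4):
    """Download several URLs concurrently and return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        pages = list(pool.map(download, urls))
    return dict(zip(urls, pages))
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.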

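0x07 Version history (CHANGELOG)

Version 1.0 2019-1-30 Uses the albums in the masters section of 蜂鸟 (Fengniao) as an example to demonstrate the basic crawler framework.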
