weiboredisspider's Introduction

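A distributed Weibo crawler (about 300,000 records crawled per day)

Note:

This crawler assumes you already know the basics of Scrapy, MongoDB, and Redis.

Experienced users can modify the source code to customize which data is crawled.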

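A distributed Weibo crawler project built on Scrapy and Scrapy-redis.

It uses the IP proxy pool library Scylla and switches to a new proxy IP every 30 requests, to avoid CAPTCHAs and "requests too frequent" warnings.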
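
The rotation rule described above (a new proxy every 30 requests) can be sketched as a Scrapy-style downloader middleware. This is a minimal illustration, not the project's actual code: the class name `RotatingProxyMiddleware` and the injected `fetch_proxy` callable are hypothetical, and in a real setup `fetch_proxy` would ask the Scylla proxy pool for an address. The sketch relies only on Scrapy's convention that a proxy URL stored in `request.meta["proxy"]` is used for that request.

```python
class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: pick a new proxy every N requests."""

    def __init__(self, fetch_proxy, rotate_every=30):
        # fetch_proxy: any callable returning a proxy URL such as
        # "http://1.2.3.4:8080" (assumed to be backed by the proxy pool).
        self.fetch_proxy = fetch_proxy
        self.rotate_every = rotate_every
        self.request_count = 0
        self.current_proxy = None

    def process_request(self, request, spider):
        # Fetch a fresh proxy on the first request and then on every
        # rotate_every-th request after that.
        if self.request_count % self.rotate_every == 0:
            self.current_proxy = self.fetch_proxy()
        self.request_count += 1
        # Scrapy's built-in proxy handling reads the proxy from request.meta.
        request.meta["proxy"] = self.current_proxy
        # Returning None tells Scrapy to continue processing the request.
        return None
```

Fetching a proxy only once per batch of requests, rather than once per request, keeps the time spent waiting on the proxy pool small.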

Required libraries:

Scylla, Scrapy, Scrapy-redis, pymongo

MongoDB and Redis must be set up on the server machine; please consult their own documentation for details.
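
For orientation, a Scrapy-redis project usually points Scrapy's scheduler and duplicate filter at a shared Redis instance, so that all crawling machines draw from one request queue. The fragment below is a sketch of such settings. The first four setting names are standard Scrapy-redis ones; the server address and the `MONGO_URI` / `MONGO_DATABASE` names are placeholders chosen for illustration, not values taken from this project.

```python
# settings.py fragment (sketch; addresses and database names are placeholders)

# Schedule requests through a shared Redis queue instead of in-process memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests across all crawler machines via Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and fingerprint set in Redis between runs.
SCHEDULER_PERSIST = True
# Address of the Redis server that every crawler machine connects to.
REDIS_URL = "redis://192.168.1.10:6379"

# Hypothetical names for the MongoDB connection, to be read by your own
# item pipeline when it stores crawled data with pymongo.
MONGO_URI = "mongodb://192.168.1.10:27017"
MONGO_DATABASE = "weibo"
```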

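The data this distributed crawler can collect is shown in the figure below:

Because a crawl delay is configured (to avoid being blocked), the crawl took about 24 hours. The amount of data collected is as follows: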

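You can customize the crawler to your needs in the weiboRedisSpider/settings.py configuration file:

For example:

User-Agent configuration
A date limit for posts: posts older than this date are not crawled
Proxy refresh interval: how many requests pass before a new proxy IP is fetched; this controls how often proxies are fetched and reduces the time spent waiting for one
Name of the host running the crawler, used to tell which machine crawled which data when several machines crawl at the same time

See settings.py for the full configuration.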
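
The options listed above might look roughly like the following sketch. Only `USER_AGENT` and `DOWNLOAD_DELAY` are standard Scrapy setting names; `MIN_POST_DATE`, `PROXY_ROTATE_EVERY`, `HOST_NAME`, and the two helper functions are hypothetical names invented here to illustrate the idea, and all values are placeholders. The real names and values are the ones in settings.py.

```python
from datetime import date

# Standard Scrapy settings (values are placeholders).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
DOWNLOAD_DELAY = 2  # seconds between requests, to reduce the risk of blocking

# Hypothetical custom settings mirroring the options described above.
MIN_POST_DATE = date(2018, 1, 1)  # posts older than this are skipped
PROXY_ROTATE_EVERY = 30           # fetch a new proxy after this many requests
HOST_NAME = "crawler-01"          # identifies the machine that crawled an item


def should_crawl(post_date, cutoff=MIN_POST_DATE):
    """Return True if a post is recent enough to be crawled."""
    return post_date >= cutoff


def tag_item(item, host_name=HOST_NAME):
    """Return a copy of the item labelled with the crawling machine's name."""
    tagged = dict(item)
    tagged["crawled_by"] = host_name
    return tagged
```

With several machines writing to the same database, a field such as `crawled_by` makes it possible to see afterwards which machine produced each record.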

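If you run into any problems while using it, feel free to contact me; I am happy to discuss and work through them together.

QQ number (base64-encoded, decode before use): NTk1Njk2OTYw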
