
weiboredisspider's Introduction

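Distributed Weibo crawler (about 300,000 records crawled per day)

Notes:

  • This crawler assumes you already know the basics of Scrapy, MongoDB, and Redis.
  • Experienced users can modify the source code to customize which data is crawled.

A distributed Weibo crawler built on Scrapy and Scrapy-redis.

It uses the IP proxy pool library Scylla and switches to a new proxy IP every 30 requests, to avoid CAPTCHAs and "too many requests" warnings.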
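One way the every-30-requests rotation could be wired up is as a Scrapy downloader middleware. The sketch below is not taken from this project's source: the class name `RotatingProxyMiddleware`, the `fetch_proxy` hook, and the Scylla endpoint URL (its JSON API on the default port 8899, assumed to run on the same machine) are all assumptions. It relies only on Scrapy using `request.meta["proxy"]` as the proxy for a request.

```python
import json
import random
from urllib.request import urlopen

# Assumption: Scylla runs locally and serves its JSON API on the default port.
SCYLLA_API = "http://localhost:8899/api/v1/proxies"


def fetch_proxy_from_scylla(api_url=SCYLLA_API):
    """Ask Scylla for its known proxies and return one as 'http://ip:port'."""
    with urlopen(api_url, timeout=10) as resp:
        proxies = json.load(resp)["proxies"]
    if not proxies:
        raise RuntimeError("Scylla returned no proxies")
    chosen = random.choice(proxies)
    return "http://{}:{}".format(chosen["ip"], chosen["port"])


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: a new proxy every `interval` requests."""

    def __init__(self, interval=30, fetch_proxy=fetch_proxy_from_scylla):
        self.interval = interval
        self.fetch_proxy = fetch_proxy
        self.count = 0
        self.proxy = None

    def process_request(self, request, spider):
        # Fetch a proxy on the first request, then again after every `interval` requests.
        if self.count % self.interval == 0:
            self.proxy = self.fetch_proxy()
        self.count += 1
        request.meta["proxy"] = self.proxy
        return None  # let Scrapy continue handling the request
```

A class like this would be enabled through Scrapy's `DOWNLOADER_MIDDLEWARES` setting. Passing `fetch_proxy` in as a callable keeps the counting logic independent of where the proxies come from, which also makes it easy to exercise without a running Scylla instance.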

Libraries to install:

Scylla, Scrapy, Scrapy-redis, pymongo

MongoDB and Redis must be set up on the server machine; please refer to their own documentation for details.
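As a rough illustration of how the crawler machines might be pointed at the shared server, here is a sketch of connection settings for a Scrapy settings file. The `SCHEDULER`, `DUPEFILTER_CLASS`, `SCHEDULER_PERSIST`, and `REDIS_URL` keys are standard Scrapy-redis settings; the `MONGO_URI` and `MONGO_DATABASE` names and all addresses are placeholders assumed for this example, and the project's actual settings.py may use different ones.

```python
# Sketch only: addresses are placeholders and the MONGO_* names are assumptions.

# Scrapy-redis: share the request queue and the duplicate filter through Redis,
# so every crawler machine pulls work from the same place.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue in Redis between runs
REDIS_URL = "redis://192.0.2.10:6379"  # placeholder address of the Redis server

# MongoDB: hypothetical keys that an item pipeline could read to store results.
MONGO_URI = "mongodb://192.0.2.10:27017"
MONGO_DATABASE = "weibo"
```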

The data this distributed crawler can collect is shown in the screenshots below:

image-20181122152739357

image-20181122152759130

Because a crawl delay is configured (to avoid being blocked), the run took about 24 hours. The amount of data collected is shown below:

image-20181122152931165

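You can adapt the crawler to your needs in the weiboRedisSpider/settings.py configuration file:

For example:

  • The User-Agent
  • A date limit for posts: posts older than this date are not crawled
  • The proxy refresh interval: how many requests pass between fetching a new proxy IP. This controls how often proxies are fetched and reduces the time spent waiting for them
  • The name of the host running the crawler, used to tell which machine crawled which data when several machines crawl at once

See settings.py for the exact options.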
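To make the options listed above concrete, here is a minimal sketch of how they might look and be used. Only `USER_AGENT` is a standard Scrapy setting; `OLDEST_POST_DATE`, `PROXY_REFRESH_INTERVAL`, `CRAWLER_HOST_NAME`, the `crawled_by` field, and both helper functions are hypothetical names chosen for this example, so the real settings.py should be treated as the reference.

```python
import socket
from datetime import datetime

# USER_AGENT is a standard Scrapy setting; the other names are hypothetical.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0 Safari/537.36"
)
OLDEST_POST_DATE = "2018-01-01"           # posts older than this are skipped
PROXY_REFRESH_INTERVAL = 30               # requests served before a new proxy is fetched
CRAWLER_HOST_NAME = socket.gethostname()  # label for the machine running this crawler


def is_too_old(created_at, cutoff=OLDEST_POST_DATE):
    """Return True when a post dated `created_at` (YYYY-MM-DD) predates the cutoff."""
    fmt = "%Y-%m-%d"
    return datetime.strptime(created_at, fmt) < datetime.strptime(cutoff, fmt)


def tag_with_host(item, host=CRAWLER_HOST_NAME):
    """Return a copy of the item with the crawling machine's name attached."""
    tagged = dict(item)
    tagged["crawled_by"] = host
    return tagged
```

A spider could call `is_too_old` to stop paging through a user's timeline once it reaches posts before the cutoff, and a pipeline could call `tag_with_host` just before writing each record to MongoDB.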

If you run into any problems, feel free to contact me; I am happy to discuss and solve them together.

QQ number (Base64-encoded, decode to read): NTk1Njk2OTYw

weiboredisspider's People

Contributors

opengit
