
zhihucrawler's Introduction

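Zhihu Crawler

Introduction

ZhihuCrawler is an efficient, event-driven Zhihu crawler written in C++. Its goal is to collect data such as the most upvoted answers and the most followed questions. It runs on platforms that support epoll.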
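
For readers new to epoll, the following is a minimal sketch of the readiness-based pattern that an event-driven crawler builds on. It is not taken from ZhihuCrawler's source: read_when_ready is a hypothetical helper that registers a single file descriptor, waits for it to become readable, and reads what is available. A real crawler would register many non-blocking sockets and loop over the ready events.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <string>

// Hypothetical helper (not part of ZhihuCrawler): waits until `fd` is readable
// or `timeout_ms` elapses, then reads the available bytes.
// Returns an empty string on timeout or error.
std::string read_when_ready(int fd, int timeout_ms) {
    int epfd = epoll_create1(0);
    if (epfd < 0) return "";

    epoll_event ev{};
    ev.events = EPOLLIN;   // notify when data can be read
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return "";
    }

    std::string out;
    epoll_event ready{};
    int n = epoll_wait(epfd, &ready, 1, timeout_ms);
    if (n == 1 && (ready.events & EPOLLIN)) {
        char buf[4096];
        ssize_t got = read(ready.data.fd, buf, sizeof buf);
        if (got > 0) out.assign(buf, static_cast<size_t>(got));
    }
    close(epfd);
    return out;
}
```

The benefit of this style is that one thread can wait on many connections at once instead of blocking on each request in turn.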

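Usage

First, find the cookie your browser sends when visiting Zhihu and copy it into the cookie variable in src/confic.cc.

Edit ./startfile/seeds.txt; the crawler starts from the user URLs listed in this file.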
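
The exact format of seeds.txt is not described here. Assuming one user URL per line, a loader could look like the sketch below. load_seeds is a hypothetical name, not the function used in the project. It trims surrounding whitespace and skips blank lines.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical loader: assumes one seed URL per line (an assumption, not
// confirmed by the project). Blank lines are skipped and surrounding
// whitespace, including a trailing '\r', is trimmed.
std::vector<std::string> load_seeds(std::istream& in) {
    std::vector<std::string> seeds;
    std::string line;
    while (std::getline(in, line)) {
        const auto first = line.find_first_not_of(" \t\r\n");
        if (first == std::string::npos) continue;  // blank line
        const auto last = line.find_last_not_of(" \t\r\n");
        seeds.push_back(line.substr(first, last - first + 1));
    }
    return seeds;
}
```

To read the real file, pass a std::ifstream opened on ./startfile/seeds.txt.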

make
./zhihuCrawler

Visit http://localhost:8080 to check the crawler's status.

Output

All crawled data is stored in ./datafile/rawData.raw. Run

./sort.sh

to view the results sorted by vote count.
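
If you would rather sort in C++ than in the shell, the ordering can be sketched as follows. The record layout of rawData.raw is not documented here, so the Answer struct and its fields are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record: the real layout of rawData.raw is not documented here.
struct Answer {
    std::string url;
    int votes;
};

// Sorts answers by vote count, highest first; ties keep their original order.
void sort_by_votes(std::vector<Answer>& answers) {
    std::stable_sort(answers.begin(), answers.end(),
                     [](const Answer& a, const Answer& b) { return a.votes > b.votes; });
}
```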

TODO

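  • Use AJAX requests to fetch each user's full lists of followees and followers

  • Reduce coupling between modules

  • Use proxy IPs to handle 429 errors and IP bans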
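
The proxy item is not implemented yet. One possible shape for it is a small round-robin pool that moves to the next proxy whenever a request comes back with 429 (Too Many Requests). ProxyPool and its methods are hypothetical names, and the addresses in the usage note are placeholders.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical round-robin proxy pool (a sketch, not part of ZhihuCrawler).
class ProxyPool {
public:
    explicit ProxyPool(std::vector<std::string> proxies)
        : proxies_(std::move(proxies)) {
        if (proxies_.empty()) throw std::invalid_argument("proxy list must not be empty");
    }

    // Proxy that the next request should go through.
    const std::string& current() const { return proxies_[index_]; }

    // Report the HTTP status of the last request. On 429 the pool advances to
    // the next proxy and returns true, meaning the request should be retried.
    bool on_response(int status) {
        if (status != 429) return false;
        index_ = (index_ + 1) % proxies_.size();
        return true;
    }

private:
    std::vector<std::string> proxies_;
    std::size_t index_ = 0;
};
```

A caller would send each request through pool.current() and retry when on_response returns true. A production version would likely also add a delay between retries and drop proxies that keep failing.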

More

For more details, see http://zyearn.github.io/blog/2015/09/09/how-to-write-a-event-based-crawler-using-c/

// Writing a crawler in C/C++ is really asking for trouble

zhihucrawler's People

Contributors

lucklove, zyearn


zhihucrawler's Issues

301 Moved Permanently

I tried your code and found that every page returned after a request is a 301.
Your crawler doesn't support HTTPS, does it?
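
One way to check whether these 301 responses are redirects to an https:// URL is to inspect the Location header of the raw response. The sketch below is an illustration only. parse_redirect and the Redirect struct are hypothetical and do not exist in the project.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
struct Redirect {
    bool is_redirect = false;
    std::string location;  // value of the Location header, if present
};

// Parses the status line and headers of a raw HTTP/1.x response and reports
// whether it is a 301/302/307/308 redirect that carries a Location header.
Redirect parse_redirect(const std::string& raw_headers) {
    Redirect result;
    std::istringstream in(raw_headers);
    std::string line;
    if (!std::getline(in, line)) return result;

    // Status line, e.g. "HTTP/1.1 301 Moved Permanently"
    std::istringstream status_line(line);
    std::string version;
    int code = 0;
    status_line >> version >> code;
    if (code != 301 && code != 302 && code != 307 && code != 308) return result;

    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;  // end of headers
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string name = line.substr(0, colon);
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (name != "location") continue;  // header names are case-insensitive
        const auto start = line.find_first_not_of(" \t", colon + 1);
        if (start != std::string::npos) {
            result.location = line.substr(start);
            result.is_redirect = true;
        }
        break;
    }
    return result;
}
```

If the reported location starts with https://, the crawler would need TLS support (for example through a TLS library) to follow it.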
