ZhiHu-Crawler

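A simple Java crawler project that scrapes Zhihu user information. About 1,000,000 records have been crawled so far.

Built on the WebMagic framework, with Maven as the build tool.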

Data Showcase

[Figure: data showcase]

Runtime Environment

JDK version: 10. On earlier versions, manually replace the var declarations in the code with explicit types.
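For readers on an older JDK, the replacement is mechanical: each var becomes the type the compiler would have inferred, or a suitable supertype. A minimal sketch follows; the class, method, and variable names are hypothetical and not taken from the project.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: all names here are hypothetical, not from the project.
class VarReplacementDemo {

    // JDK 10+ form (kept in a comment so this file also compiles on JDK 8):
    //   var followers = new HashMap<String, Integer>();
    // Pre-JDK 10 form: write the declared type explicitly.
    static Map<String, Integer> sampleFollowers() {
        Map<String, Integer> followers = new HashMap<>();
        followers.put("alice", 120);
        followers.put("bob", 45);
        return followers;
    }

    public static void main(String[] args) {
        // Prints the number of entries in the sample map.
        System.out.println(sampleFollowers().size());
    }
}
```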

Database support: SQL Server / MySQL

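Running

Main entry point: Main.java

Before the first run, configure jdbc.properties and cookies.txt.

Add the -s command-line argument to configure options such as whether to clear existing data and the number of threads. The default thread count is 10.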

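Database Configuration

MySQL is used for persistence by default; just fill in the basic connection information.

Persisting to a local SQL Server instance is also supported. Add the -s argument when starting the program to configure it. Windows authentication is used.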
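To make the two options concrete, here is a rough sketch of how JDBC URLs for each database could be assembled. The property keys (host, port, database), the default values, and the class name are assumptions for illustration; the project's actual jdbc.properties may use different keys. The integratedSecurity=true property is the Microsoft JDBC driver's switch for Windows authentication.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch only: property keys and defaults are assumed, not read from the project.
class JdbcUrlSketch {

    // Builds a MySQL URL from basic connection settings.
    static String mysqlUrl(Properties props) {
        String host = props.getProperty("host", "localhost");
        String port = props.getProperty("port", "3306");
        String database = props.getProperty("database", "zhihu");
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // Builds a URL for a local SQL Server using Windows authentication,
    // so no user name or password appears in the URL.
    static String sqlServerUrl(String database) {
        return "jdbc:sqlserver://localhost:1433;databaseName=" + database
                + ";integratedSecurity=true";
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a properties file.
        Properties props = new Properties();
        props.load(new StringReader("host=localhost\nport=3306\ndatabase=zhihu\n"));
        System.out.println(mysqlUrl(props));
        System.out.println(sqlServerUrl("zhihu"));
    }
}
```

With the Microsoft driver, Windows authentication typically also requires the driver's native authentication library to be available on the machine.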

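Cookies Configuration

Log in to Zhihu in a browser, open the developer tools (F12), find the z_c0 key-value pair in the cookie field of the request headers, and put it in cookies.txt.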
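As an illustration of how such a file might be consumed, the sketch below parses cookie values from a list of lines. The file layout is an assumption: one cookie per line, either as z_c0=value or as the bare value. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: assumes one z_c0 cookie per line; the real file format may differ.
class CookieFileSketch {
    static final String COOKIE_NAME = "z_c0";

    // Returns the cookie values, skipping blank lines and
    // stripping an optional "z_c0=" prefix.
    static List<String> parseCookieValues(List<String> lines) {
        List<String> values = new ArrayList<>();
        String prefix = COOKIE_NAME + "=";
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(prefix)) {
                line = line.substring(prefix.length());
            }
            if (!line.isEmpty()) {
                values.add(line);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("z_c0=token-a", "", "  token-b  ");
        System.out.println(parseCookieValues(sample)); // prints [token-a, token-b]
    }
}
```

To read a real file, the lines could come from Files.readAllLines(Paths.get("cookies.txt")) instead of the sample list.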

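Any number of cookies can be added. Crawl speed (determined by the thread count and the crawl interval) can be raised as the number of cookies increases, and using more cookies also lowers the risk of an account being temporarily blocked.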
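One common way to spread requests over several accounts is round-robin rotation, so each account sees only a fraction of the traffic. The sketch below shows that technique in isolation; it is not necessarily how the project distributes cookies, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a thread-safe round-robin selector over a fixed cookie list.
class CookieRotatorSketch {
    private final List<String> cookies;
    private final AtomicInteger next = new AtomicInteger();

    CookieRotatorSketch(List<String> cookies) {
        if (cookies.isEmpty()) {
            throw new IllegalArgumentException("at least one cookie is required");
        }
        this.cookies = cookies;
    }

    // Returns the next cookie in order, wrapping around at the end.
    // floorMod keeps the index valid even if the counter overflows.
    String nextCookie() {
        int index = Math.floorMod(next.getAndIncrement(), cookies.size());
        return cookies.get(index);
    }

    public static void main(String[] args) {
        CookieRotatorSketch rotator = new CookieRotatorSketch(Arrays.asList("a", "b", "c"));
        for (int i = 0; i < 4; i++) {
            System.out.println(rotator.nextCookie()); // prints a, b, c, a on separate lines
        }
    }
}
```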

Run Example

[Figure: run example]

Press Ctrl+C during crawling to stop the crawler.

Average crawl speed: 10,000 records/h (5 accounts, 10 threads, sleep of 2000 + random(1000))
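The reported speed can be sanity-checked with a little arithmetic. Assuming the sleep values are milliseconds, that each thread sleeps independently between requests, and that network time is ignored, the average delay is about 2.5 s, so 10 threads issue at most 10 × 3600 / 2.5 = 14,400 requests per hour. The reported 10,000 records/h sits below that bound, which is plausible once request latency is included. How many records one request yields is not stated in the source, so the sketch below counts requests only; all names are hypothetical.

```java
import java.util.Random;

// Sketch only: units (milliseconds) and per-thread sleeping are assumptions.
class CrawlDelaySketch {

    // A delay of 2000 plus a random value in [0, 1000), i.e. 2000..2999 ms.
    static int delayMillis(Random random) {
        return 2000 + random.nextInt(1000);
    }

    // Upper bound on requests per hour if every thread does nothing but sleep.
    static long maxRequestsPerHour(int threads, double averageDelayMillis) {
        return (long) (threads * 3_600_000.0 / averageDelayMillis);
    }

    public static void main(String[] args) {
        System.out.println(delayMillis(new Random()));
        System.out.println(maxRequestsPerHour(10, 2500.0)); // prints 14400
    }
}
```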

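SQL access statistics

Program Analysis

For the framework syntax used, see the WebMagic documentation.

The project is divided into three modules: Crawler (the main program), Assembly (the crawler components), and Database (persistence).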

Crawler

Assembly

Database

Visual Analysis

For the visualizations, see the img/visual folder.

Contributors

horizonftt
