rxcrawler's Introduction

rxcrawler

## Common crawlers

  • crawler4j
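    Used in proxy-checker, a project that collects proxy IPs.

    • Pros:
      • Simple and easy to use
      • Supports resume (after the service is stopped, a restart continues the earlier tasks)
    • Cons:
      • Handles GET requests only
      • Runs on a single machine only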
  • webmagic

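    • Pros:

      • Clear structure
      • Good extensibility
      • Works out of the box
    • Cons:

      1. Resume is flawed for POST requests (by default only the URL is saved, not the POST body).
      2. When the service is terminated, requests that are in flight may be lost. Usually this is not a problem, but when crawling the products of a category page by page, for example, losing a single request across a restart can lose the rest of the category.
      3. It is built on blocking I/O (BIO), so highly concurrent crawling consumes a large number of threads.

      The first two drawbacks can largely be fixed through extensions, but BIO is part of the core structure and cannot be changed.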
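
The first two drawbacks above could be handled along the following lines. This is only a minimal sketch, not webmagic's API and not necessarily how rxcrawler does it: `StoredRequest` and `ResumableQueue` are hypothetical names, and persistence to disk is left out. The idea is to store the POST body together with the URL, and to keep every request that has been handed out until it is acknowledged, so that a shutdown snapshot still contains it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; these are not webmagic classes.
final class StoredRequest {
    final String method;
    final String url;
    final String body; // empty string for GET

    StoredRequest(String method, String url, String body) {
        this.method = method;
        this.url = url;
        this.body = body == null ? "" : body;
    }

    // The key includes the POST body, so two pages of one endpoint stay distinct.
    String key() {
        return method + " " + url + " " + body;
    }
}

final class ResumableQueue {
    private final Deque<StoredRequest> pending = new ArrayDeque<>();
    private final Map<String, StoredRequest> inFlight = new LinkedHashMap<>();

    synchronized void push(StoredRequest r) {
        pending.addLast(r);
    }

    // Hand out a request, but remember it until ack() confirms it was processed.
    synchronized StoredRequest poll() {
        StoredRequest r = pending.pollFirst();
        if (r != null) {
            inFlight.put(r.key(), r);
        }
        return r;
    }

    synchronized void ack(StoredRequest r) {
        inFlight.remove(r.key());
    }

    // What to persist on shutdown: unacknowledged requests first, then the rest.
    synchronized List<StoredRequest> snapshot() {
        List<StoredRequest> all = new ArrayList<>(inFlight.values());
        all.addAll(pending);
        return all;
    }
}
```

On restart, the persisted snapshot would be pushed back into the queue. A request that was in flight at shutdown is then fetched again instead of being dropped, at the cost of an occasional duplicate download.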

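Reworking webmagic with rx-java

rxcrawler borrows webmagic's structure and interfaces, but rewrites the core spider and downloader on top of NIO and rx-java. A small number of threads can then sustain thousands of concurrent fetches. Combined with squid and the proxy IPs collected by proxy-checker, this greatly improves crawl throughput.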
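
To illustrate why non-blocking I/O needs so few threads, here is a small self-contained sketch. It is not rxcrawler's code. It uses the JDK's `CompletableFuture` in place of rx-java so that it runs without dependencies, and it simulates each download with a timer instead of a real NIO socket. The class name, method names, and URLs are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class AsyncFetchSketch {
    // Simulated non-blocking download: no thread waits while the "response" is pending.
    static CompletableFuture<String> fetch(ScheduledExecutorService timer, String url, long delayMs) {
        CompletableFuture<String> result = new CompletableFuture<>();
        timer.schedule(() -> result.complete("body of " + url), delayMs, TimeUnit.MILLISECONDS);
        return result;
    }

    // Start `count` fetches at once on a single worker thread; return how many succeeded.
    static int fetchAll(int count, long delayMs) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                pending.add(fetch(timer, "https://example.com/item/" + i, delayMs));
            }
            // Wait for every fetch to finish, with a generous upper bound.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
                    .get(30, TimeUnit.SECONDS);
            int done = 0;
            for (CompletableFuture<String> f : pending) {
                if (f.isDone() && !f.isCompletedExceptionally()) {
                    done++;
                }
            }
            return done;
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        int done = fetchAll(1000, 100);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " fetches finished in about " + elapsedMs + " ms on one worker thread");
    }
}
```

With blocking I/O, each of the 1000 requests would hold a thread for the whole wait. Here all of them are pending at the same time while a single thread only handles completions, which is the property the rewrite relies on.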

JD

An example built on rxcrawler that crawls JD.com's mobile API.

rxcrawler's People

Contributors

wuxudong
