sbcrawler

sbcrawler is a lightweight crawler framework.

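Motivation

  1. In day-to-day crawler writing, fancy features such as async I/O, concurrency, and distributed crawling are usually unnecessary.
  2. Small jobs instead produce a lot of repeated code for avoiding bans, resuming after an interruption, and logging progress.
  3. sbcrawler implements the simplest possible crawler framework, so that you can focus on writing the content extraction logic.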

Features

  • Single process, synchronous (no async I/O)
  • Resumes an interrupted crawl from where it stopped
  • Error logging
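
To make the resume and error-logging features concrete, here is a minimal sketch of how a single-process crawler could persist its task queue. It is not sbcrawler's actual implementation: the names CheckpointQueue and run, the JSON state file, and the handle callback are all assumptions made for illustration.

```python
import json
import logging
import os
from collections import deque

logger = logging.getLogger("checkpoint_sketch")  # hypothetical logger name


class CheckpointQueue:
    """FIFO URL queue that can be saved to, and restored from, a JSON file."""

    def __init__(self, path, seeds):
        self.path = path
        if os.path.exists(path):
            # A state file exists: resume the interrupted crawl.
            with open(path, encoding="utf-8") as f:
                state = json.load(f)
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])
        else:
            # Fresh start from the seed URLs.
            self.pending = deque(seeds)
            self.seen = set(seeds)

    def add(self, url):
        """Queue a URL unless it has been queued before."""
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def save(self):
        """Write the state to a temp file, then swap it in with os.replace."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"pending": list(self.pending), "seen": sorted(self.seen)}, f)
        os.replace(tmp, self.path)


def run(queue, handle, max_tasks=None):
    """Process pending URLs one at a time and return how many were processed.

    handle(url) is a caller-supplied function returning the links found at url.
    A URL is removed from the queue only after handle() has returned or failed,
    so a crash in the middle of a task leaves that task in the saved state.
    """
    done = 0
    while queue.pending and (max_tasks is None or done < max_tasks):
        url = queue.pending[0]
        try:
            for link in handle(url):
                queue.add(link)
        except Exception:
            # Log the failure with its traceback and move on to the next URL.
            logger.exception("failed to process %s", url)
        queue.pending.popleft()
        queue.save()
        done += 1
    return done
```

Saving after every task keeps the sketch simple at the cost of one file write per page; for a small, single-process crawl that overhead is usually negligible compared with the network request.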

Usage example

# -*- coding: utf-8 -*-

from sbcrawler import Crawler

class MyCrawlerExample(Crawler):

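    start_url = "https://xxx.xxx.com/xxx/"  # starting seed URL

    allowed_domain = "https://xxx.xxx.com/"  # restrict the crawl to this domain; must include the http(s) scheme

    def extract_links(self, html, task):
        # Extract links and add them to the crawl task list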
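        # Depth 0 and 1: only follow links inside the summary block.
        if task.depth == 0 or task.depth == 1:
            html = html.find('.module_summary', first=True)
        # Depth 2: only follow links inside the sixth table row.
        if task.depth == 2:
            html = html.find("#in_list_main > table > tr:nth-child(6)", first=True)
        # Depth 3 pages are leaves: do not follow any further links.
        if task.depth == 3:
            return

        super().extract_links(html, task)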

    def extract_content(self, html, task):
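        # Only depth-3 pages carry the content to extract.
        if task.depth == 3:
            title = html.find('#title', first=True)
            article = html.find('#article', first=True)
            # Assumption: find(..., first=True) returns None when nothing matches.
            if title is None or article is None:
                return None
            return {'title': title.full_text, 'article': article.full_text}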


if __name__ == "__main__":
    crawler = MyCrawlerExample()
    crawler.start()
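
The example relies on two ideas: every task carries a depth, and links outside allowed_domain are not followed. The sketch below shows one plausible way such filtering could work. It is an assumption, not sbcrawler's code: the Task dataclass and child_tasks function are hypothetical, and the prefix comparison is only a guess at why allowed_domain has to include the scheme.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag, urljoin


@dataclass(frozen=True)
class Task:
    """Hypothetical crawl task: a URL plus its distance from the seed."""
    url: str
    depth: int


def child_tasks(parent, hrefs, allowed_domain):
    """Turn the hrefs found on a page into tasks one level deeper.

    Relative links are resolved against the parent URL, fragments are
    dropped, and only URLs that start with allowed_domain are kept.
    """
    tasks = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(parent.url, href))
        if absolute.startswith(allowed_domain):
            tasks.append(Task(absolute, parent.depth + 1))
    return tasks
```

With a plain prefix check like this, a value without the scheme (for example "xxx.xxx.com") would never match an absolute URL, which is consistent with the requirement stated in the example.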

Installation

pip install git+https://github.com/ffteen/sbcrawler.git
