
boris-code / feapder


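feapder is an easy-to-use, powerful Python crawler framework. It ships with four spider types (AirSpider, Spider, TaskSpider, and BatchSpider) to cover different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and deduplication of massive datasets. The companion crawler management system feaplat provides convenient deployment and scheduling.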

Home Page: http://feapder.com

License: Other

Python 100.00%
scrapy feapder spider crawler python feaplat

feapder's Introduction

FEAPDER


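Introduction

  1. feapder is an easy-to-use yet powerful Python crawler framework. It ships with four spider types (AirSpider, Spider, TaskSpider, and BatchSpider) to cover different scenarios.
  2. It supports resumable crawling, monitoring and alerting, browser rendering, and deduplication of massive datasets.
  3. The companion crawler management system feaplat provides convenient deployment and scheduling.

Pronunciation: [ˈfiːpdə]

Documentation: http://feapder.com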

Requirements:

  • Python 3.6.0+
  • Works on Linux, Windows, macOS

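Installation

From PyPI:

Minimal version:

pip install feapder

Browser rendering version:

pip install "feapder[render]"

Full version:

pip install "feapder[all]"

Differences between the three versions:

  1. Minimal: no browser rendering, no in-memory deduplication, no MongoDB storage
  2. Browser rendering: no in-memory deduplication, no MongoDB storage
  3. Full: all features supported

The full version may fail to install. If it does, see the installation troubleshooting notes.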

Quick start

Create a spider:

feapder create -s first_spider

The generated spider code looks like this:

import feapder


class FirstSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    FirstSpider().start()
        

Run it directly. The output is:

Thread-2|2021-02-09 14:55:11,373|request.py|get_response|line:283|DEBUG|
                -------------- FirstSpider.parse request for ----------------
                url  = https://www.baidu.com
                method = GET
                body = {'timeout': 22, 'stream': True, 'verify': False, 'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}}

<Response [200]>
Thread-2|2021-02-09 14:55:11,610|parser_control.py|run|line:415|DEBUG| parser 等待任务...
FirstSpider|2021-02-09 14:55:14,620|air_spider.py|run|line:80|INFO| 无任务,爬虫结束

How the code works:

  1. start_requests: generates the tasks (requests)
  2. parse: parses the response data

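Contributing

Please read the contribution guide before contributing.

Thanks to everyone who has contributed!

Recommended crawler tools

  1. Online crawler toolbox: http://www.spidertools.cn
  2. Crawler management system: http://feapder.com/#/feapder_platform/feaplat
  3. CAPTCHA recognition library: https://github.com/sml2h3/ddddocr

WeChat donations

If this project has helped you, you can buy the author a coffee as encouragement 🍹

You can also add the author as a friend to get help with problems you run into while using it.

Community

Knowledge Planet (知识星球): 17321694 | Author's WeChat: boris_tm | QQ group: 485067374

When adding the author as a friend, include the note: feapder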

feapder's People

Contributors

ayixi, boris-code, d1rtydann, do1e, floweroda, hijack911, litt1eq, mkdir700, oslijunw, reclusexu, ruixiangs, s0ing, shellmonster, shurelol, valuefish, xiewei18, xmqsvip, yudeqang, yufengsoft


feapder's Issues

Why does the connection keep failing after I configure a proxy API?

********** feapder begin **********
Thread-5|2021-03-31 15:51:30,281|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加5, 失效0, 当前代理数5,
Thread-5|2021-03-31 15:51:30,809|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:33,600|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:33,600|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:33,600|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 10
Thread-5|2021-03-31 15:51:34,604|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:34,604|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:34,604|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 20
Thread-5|2021-03-31 15:51:35,608|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:35,608|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 30
Thread-5|2021-03-31 15:51:36,613|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:36,613|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 40
Thread-5|2021-03-31 15:51:37,617|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:37,617|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:37,617|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 50
Thread-5|2021-03-31 15:51:38,621|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:38,622|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 60
Thread-5|2021-03-31 15:51:39,626|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:39,626|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 70
Thread-5|2021-03-31 15:51:40,629|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:40,629|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 80
Thread-5|2021-03-31 15:51:41,634|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:41,634|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:41,634|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 90
Thread-5|2021-03-31 15:51:42,638|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:42,638|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 100
Thread-5|2021-03-31 15:51:43,643|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:43,643|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:43,643|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 110
Thread-5|2021-03-31 15:51:44,646|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:44,646|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 120
Thread-5|2021-03-31 15:51:45,649|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:51:45,649|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:45,650|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:45,650|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:45,650|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 130
Thread-5|2021-03-31 15:51:46,654|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:46,654|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:46,654|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 140
Thread-5|2021-03-31 15:51:47,659|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:47,659|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 150
Thread-5|2021-03-31 15:51:48,664|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:48,664|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 160
Thread-5|2021-03-31 15:51:49,666|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:49,666|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:49,666|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 170
Thread-5|2021-03-31 15:51:50,669|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:50,670|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 180
Thread-5|2021-03-31 15:51:51,674|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:51,675|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:51,675|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 190
Thread-5|2021-03-31 15:51:52,679|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:52,680|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 200
Thread-5|2021-03-31 15:51:53,682|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:53,682|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:53,682|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 210
Thread-5|2021-03-31 15:51:54,687|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:54,687|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 220
Thread-5|2021-03-31 15:51:55,693|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:55,693|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 230
Thread-5|2021-03-31 15:51:56,698|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:56,698|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:56,698|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 240
Thread-5|2021-03-31 15:51:57,701|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:57,701|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 250
Thread-5|2021-03-31 15:51:58,706|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:58,706|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:58,706|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 260
Thread-5|2021-03-31 15:51:59,711|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:51:59,711|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 270
Thread-5|2021-03-31 15:52:00,715|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0,
Thread-5|2021-03-31 15:52:00,715|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:00,715|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:00,715|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 280
Thread-5|2021-03-31 15:52:01,720|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:01,720|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:01,720|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 290
Thread-5|2021-03-31 15:52:02,721|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:02,722|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:02,722|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 300
Thread-5|2021-03-31 15:52:03,723|request.py|get_response|line:272|DEBUG| 暂无可用代理 ...
Thread-5|2021-03-31 15:52:03,723|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 310

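Thanks to the author

I read the documentation and tried the framework out. This feels like the framework humans were meant to use. Thanks to the author for open-sourcing it.

Failed requests cannot be stored in Redis

This is a compatibility issue: Redis cannot store a Python dict object directly, so it must be converted to a string or bytes before it is stored.

Storage method for failed Requests:

[image]

Error message:

[image]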
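The conversion described in this issue can be sketched as follows. This is a minimal illustration, assuming a failed request can be represented as a plain dict; `serialize_request` and `deserialize_request` are hypothetical helper names, not feapder APIs, and no Redis client is called here. The JSON string is simply the kind of value a Redis client would accept.

```python
import json


def serialize_request(request_dict):
    """Turn a request dict into a JSON string, a type Redis can store as a value."""
    # default=str is a fallback for values json cannot encode natively
    # (for example datetimes); such values come back as strings.
    return json.dumps(request_dict, ensure_ascii=False, sort_keys=True, default=str)


def deserialize_request(raw):
    """Rebuild the dict from what a Redis client returns (bytes or str)."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


# Hypothetical failed request, for illustration only
failed_request = {
    "url": "https://www.baidu.com",
    "method": "GET",
    "retry_times": 3,
    "headers": {"User-Agent": "demo"},
}

payload = serialize_request(failed_request)              # str, safe to store
restored = deserialize_request(payload.encode("utf-8"))  # simulate bytes read back
```

Values that are not JSON-native do not round-trip exactly with this approach, so a real fix may need a richer serializer.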

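About anti-crawling measures

Many sites use anti-crawling measures, and the HTML I get back is incomplete. With mitmproxy configured, the AJAX/JS requests are not triggered, but everything works with Selenium. Do I need to add some configuration?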

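Item storage problem

[image]
Hello, I ran into a behavior while using the framework that may be a latent hazard.
I instantiate the Item directly in the parse body and then assign each of its keys.
The parse body has some conditional branches, as the sample code below shows.
At first I did not notice anything, because in most cases it behaved as I expected:
fields not assigned in the else branch were automatically written to the database as None.
Later, however, I found records where key_a, key_b, and key_c should have had values, but only the status field was set (to 1) and the rest were empty.
I traced this to the code location shown in the image above.
My guess at the cause: when several records are inserted together, the generated SQL statement uses the keys of the first record in the list. If that first record was assigned in the else branch, only one field is used, so the whole batch is stored with only that one field.
I hope the author can follow what I mean, haha.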

item = Item()
item.create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
if 'hhh' in response.text:
    item.status = 1
    item.key_a = response.xpath('...')
    item.key_b = response.xpath('...')
    item.key_c = response.xpath('...')
else:
    item.status = 0
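The reporter's hypothesis, and one possible way around it, can be sketched without the framework. This is only an illustration under that hypothesis: `normalize_batch` is a hypothetical helper, records are modelled as plain dicts, and nothing here is taken from feapder's actual pipeline code. It builds the column list from every record in the batch rather than from the first one and fills missing fields with None.

```python
def normalize_batch(records):
    """Return (columns, rows) so that one INSERT statement fits every record."""
    # Collect keys in first-seen order across ALL records, not just the first.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    # Missing fields become None, which a database driver writes as NULL.
    rows = [tuple(record.get(column) for column in columns) for record in records]
    return columns, rows


# The first record comes from the "else" branch and has fewer keys.
batch = [
    {"create_time": "2021-03-31 15:51:30", "status": 0},
    {"create_time": "2021-03-31 15:51:31", "status": 1,
     "key_a": "a", "key_b": "b", "key_c": "c"},
]

columns, rows = normalize_batch(batch)
print(columns)  # ['create_time', 'status', 'key_a', 'key_b', 'key_c']
```

On the spider side, a simpler workaround is to assign every field in both branches (for example setting key_a, key_b, and key_c to None in the else branch) so that all items share the same keys.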

ModuleNotFoundError: No module named 'utils'

Environment
Python 3.7.8
macOS 10.15.7
PyCharm 2020.3.5

Steps to reproduce
After installing with pip3 install feapder,
run the following in the terminal:
feapder create -s first_spider

(venv) ➜ Spiders feapder create -s first_spider
Traceback (most recent call last):
  File "/PycharmProjects/Spiders/venv/bin/feapder", line 5, in <module>
    from feapder.commands.cmdline import execute
  File "/PycharmProjects/Spiders/venv/lib/python3.7/site-packages/feapder/__init__.py", line 28, in <module>
    from feapder.core.spiders import Spider, BatchSpider, AirSpider
  File "/PycharmProjects/Spiders/venv/lib/python3.7/site-packages/feapder/core/spiders/__init__.py", line 13, in <module>
    from feapder.core.spiders.air_spider import AirSpider
  File "/PycharmProjects/Spiders/venv/lib/python3.7/site-packages/feapder/core/spiders/air_spider.py", line 14, in <module>
    import feapder.utils.tools as tools
  File "/PycharmProjects/Spiders/venv/lib/python3.7/site-packages/feapder/utils/tools.py", line 46, in <module>
    from feapder.utils.email_sender import EmailSender
  File "/PycharmProjects/Spiders/venv/lib/python3.7/site-packages/feapder/utils/email_sender.py", line 18, in <module>
    from utils.log import log

After I changed the import in feapder/utils/email_sender.py to "from feapder.utils.log import log", it worked normally.

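Log output issue

[image]

A suggestion

The log output feels a bit messy, especially once multithreading is enabled.
Could it follow Scrapy's log format, or use loguru to distinguish log levels by color?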
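One way to picture the suggestion is a level-colored formatter. The sketch below uses only the standard library `logging` module instead of loguru, so it is a stand-in for the idea rather than the requested integration. The class name `ColorFormatter`, the logger name `demo_spider`, and the format string are all assumptions made for illustration.

```python
import logging

# ANSI color codes per level (an arbitrary choice for this sketch)
LEVEL_COLORS = {
    logging.DEBUG: "\033[36m",     # cyan
    logging.INFO: "\033[32m",      # green
    logging.WARNING: "\033[33m",   # yellow
    logging.ERROR: "\033[31m",     # red
    logging.CRITICAL: "\033[35m",  # magenta
}
RESET = "\033[0m"


class ColorFormatter(logging.Formatter):
    """Wrap each formatted line in the color that matches its level."""

    def format(self, record):
        message = super().format(record)
        color = LEVEL_COLORS.get(record.levelno, "")
        return f"{color}{message}{RESET}" if color else message


# Putting the thread name in a fixed position helps when many threads log at once.
formatter = ColorFormatter(
    "%(asctime)s [%(threadName)s] %(levelname)s %(name)s: %(message)s"
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("demo_spider")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate lines through the root logger

logger.info("request finished")
logger.warning("no proxy available")
```

Whether the colors render depends on the terminal. When logs are written to a file, a plain formatter is usually the better choice.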

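Suggestion: add restrictions or a check for spider names

When I learn a new framework, I like to name the file (or folder) for each step something like 0_hello_world to tell them apart.
The number makes the system sort the files automatically, which looks tidier.

After trying to create a spider whose name starts with a digit, I found that the spider name is tied to the generated code: names derived from it, such as the class name, cannot start with a digit, which leads to errors that would be easy to prevent.
I suggest adding a simple check in the code, or documenting the naming rules for projects.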
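The simple check proposed here could look roughly like this. It is a sketch, not the framework's implementation: `to_class_name` and `check_spider_name` are hypothetical helpers, and the snake_case to CamelCase mapping is an assumption inferred from the README example, where first_spider becomes FirstSpider.

```python
import keyword


def to_class_name(spider_name):
    """Assumed mapping from a snake_case spider name to a CamelCase class name."""
    return "".join(part.capitalize() for part in spider_name.split("_") if part)


def check_spider_name(spider_name):
    """Raise ValueError if the name cannot serve as a module and class name."""
    class_name = to_class_name(spider_name)
    for label, value in (("module", spider_name), ("class", class_name)):
        # str.isidentifier rejects names that start with a digit or are empty;
        # keyword.iskeyword rejects reserved words such as "class".
        if not value.isidentifier() or keyword.iskeyword(value):
            raise ValueError(
                f"{spider_name!r} gives an invalid {label} name {value!r}; "
                "names must start with a letter or underscore"
            )
    return class_name


print(check_spider_name("first_spider"))  # FirstSpider

try:
    check_spider_name("0_hello_world")
except ValueError as error:
    print(error)
```

Running such a check before any files are generated would turn a confusing syntax error in the generated code into an immediate, readable message.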

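How do I use WEBDRIVER in the custom settings?

Using it this way has no effect:

class SpiderTest(feapder.AirSpider):
    __custom_setting__ = dict(
        WEBDRIVER=dict(
            pool_size=2,  # number of browsers
            load_images=False,  # whether to load images
            user_agent=None,  # a string, or a no-argument function returning the user agent
            proxy=None,  # xxx.xxx.xxx.xxx:xxxx, or a no-argument function returning the proxy address
            headless=False,  # whether to run the browser headless
            driver_type="CHROME",  # CHROME or PHANTOMJS
            timeout=30,  # request timeout
            window_size=(1024, 800),  # window size
            executable_path=None,  # browser path; None uses the default path
        )
    )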
