kiddyuchina / beanbun

1.2K 77.0 249.0 82 KB

Beanbun is a multi-process web crawler framework written in PHP. It is open, highly extensible, and built on Workerman.

License: MIT License

PHP 100.00%

php spider crawler beanbun

beanbun's People

Stargazers

Watchers

Forkers

stonegithubs kxiangtian chenshuphp l5071134 keppelcao smallsong yvlf ken-studio days72115 slocheng folkevil dumin199101 thetrueself joanstrive xxoxx hyxj1220 asdf20122012 hulkwan masdude diycp aimeidefashi justcodingnobb visionzk vanmedia jimmysum stamhe imarlboro hqsit plusdao dmagiceoy edwardyi zhanglei jason-kevin 2015phper xupin110 yongxiongwei jonny77 whu404 iwordz scalerone lcbyz windform w2yn imjerrybao xgs736214763 k155 gaolijuan wzcode xiaoxiaowu998 zhaozonglu mojiajuzi lxqunz aaranxu zsae gbkus123 zhaixue modelsim mrqiu2012 lsjing teller110 hezll icekingcy fufuyuan flashboycn wanglelecc hopher x12311231 guoqing1988 laterz aiwhj dawc hi-noikiy xiangminghu2018 macccha520 genjiluo ly360 marsberrys cheihcheung yhgcs tonygeli kalsolio mrquiet batcom tigersphp zrlhk haohailuo laurel-he m130535 slayerhover scrapies yanhuizen alirizhi im286er taozywu rucky2013 jiafenggit zhangkg rrrronny sunknight safly

beanbun's Issues

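Besides running php start.php stop by hand, how do I exit cleanly from code?

In daemon mode, is there a correct way to stop the crawler from code rather than running php start.php stop manually?
For example, a timer that notices the Redis queue is empty and then exits automatically.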

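Wrong URL returned when crawling links that redirect

When crawling URLs found through Baidu search, the framework follows redirects automatically, but the URL it reports is still the one from before the redirect.
I tried setting the automatic-redirect option of GuzzleHttp\Client to false. With that I get the page that is about to redirect, but $beanbun->options does not contain the redirect target.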

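Deduplication performance

The source code has two deduplication methods. One stores the md5 of each URL directly in a Redis set, which does not perform well once the data reaches millions or tens of millions of entries.
The other uses a Bloom filter mapped onto a Redis bitmap. How large is the performance gap between the two when crawling tens of millions of URLs?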
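
The thread gives no measurements, so the sketch below only illustrates why the two approaches scale differently. It is a minimal in-memory Bloom filter, not Beanbun's implementation: the class name `UrlBloomFilter`, the default sizes, and the md5-based double hashing are all assumptions made for illustration, and a Redis-backed version would replace the local bit string with a Redis bitmap.

```php
<?php
// Hypothetical in-memory Bloom filter for URL deduplication.
// Not Beanbun's own class; names and defaults are illustrative.
final class UrlBloomFilter
{
    private string $bits;  // bit array packed into a binary string
    private int $size;     // total number of bits
    private int $hashes;   // number of bit positions set per URL

    public function __construct(int $size = 1 << 20, int $hashes = 4)
    {
        $this->size = $size;
        $this->hashes = $hashes;
        $this->bits = str_repeat("\0", intdiv($size + 7, 8));
    }

    /** Derive $hashes bit positions from one md5 digest (double hashing). */
    private function positions(string $url): array
    {
        $digest = md5($url);
        // 7 hex digits each, so the arithmetic also fits in a 32-bit int
        // for the small hash counts used here.
        $h1 = hexdec(substr($digest, 0, 7));
        $h2 = hexdec(substr($digest, 7, 7)) | 1; // odd step size
        $out = [];
        for ($i = 0; $i < $this->hashes; $i++) {
            $out[] = ($h1 + $i * $h2) % $this->size;
        }
        return $out;
    }

    public function add(string $url): void
    {
        foreach ($this->positions($url) as $p) {
            $byte = intdiv($p, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($p % 8)));
        }
    }

    /** False means "definitely new"; true means "probably seen". */
    public function mightContain(string $url): bool
    {
        foreach ($this->positions($url) as $p) {
            if ((ord($this->bits[intdiv($p, 8)]) & (1 << ($p % 8))) === 0) {
                return false;
            }
        }
        return true;
    }
}
```

The filter's memory is fixed by its bit count (the default above is 128 KiB) no matter how many URLs are added, whereas a set of md5 strings needs at least 32 bytes per URL before any Redis overhead. The trade-off is false positives: a Bloom filter can report an unseen URL as already crawled, so some pages may be skipped.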

PHP Warning: array_flip()

PHP Warning: array_flip(): Can only flip STRING and INTEGER values! in /vendor/kiddyu/beanbun/src/Lib/Helper.php on line 44

Please update this when you have time. ^_^
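
For context, array_flip() can only turn int and string values into keys, and it raises this warning for anything else (null, floats, nested arrays). What Helper.php does on line 44 is not shown in the report, so the following is only a sketch of one way to guard such a call; `safeFlip` is a hypothetical name, not part of Beanbun.

```php
<?php
/**
 * Hypothetical helper: flip an array while skipping values that
 * array_flip() cannot use as keys (anything other than int or string).
 */
function safeFlip(array $values): array
{
    // array_filter keeps the original keys, so the flipped array still
    // maps each value back to its original position.
    $flippable = array_filter($values, static function ($v): bool {
        return is_int($v) || is_string($v);
    });

    return array_flip($flippable);
}
```

Called with a mixed list, the helper drops the entries that would have triggered the warning and flips the rest.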

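All the examples use GET and there is no POST example, so I don't know where the data is supposed to be set. Could you give a POST example?
$beanbun->seed = [
    //'http://www.950d.com/',
    [
        'http://www.950d.com/list-2.html',
        [
            'method' => 'POST',
        ]
    ]
];
Set up this way, following the example, there is no POST data, and even so it fails with cURL error 3.

Please provide a POST example, thanks.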

Can it handle pages rendered dynamically by JS?

I haven't used it yet, so a question: if it cannot scrape JS-rendered data, can it work together with PhantomJS?

What framework is your documentation site built with?

It's clean and practical.
Is it a framework, or did you build it yourself?

Redis queue can't use a password?

Redis auth issue

Redis auth issue
When connecting to the Redis server, the config should accept an auth setting, and the constructor should perform the authentication.

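How do I pass parameters when requesting a POST endpoint? I couldn't find it in the documentation.

When a site returns an HTTP error, the crawler does not stop and keeps crawling forever?

Some seeds occasionally return 500 or 404, or time out. The crawler then retries endlessly, appears to be crawling an empty address, and never enters afterDownloadPage.

1. Site failures are unavoidable, but how should the crawler handle them correctly?
2. In afterDownloadPage, besides the page property, is it possible to get the HTTP status code, response headers, and cookies returned by the site?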

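I've always wanted to scrape 2345 novels

This can scrape 2345 novels, hehe. Write a few regular expressions and it works. Also, why doesn't the source have unit tests?

Why can't the Beanbun class be found?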

[root@localhost www]# ls
composer.json composer.lock vendor
[root@localhost www]# vim start.php
[root@localhost www]# php start.php
PHP Fatal error: Uncaught Error: Class 'Beanbun\Beanbun' not found in /www/start.php:3
Stack trace:
#0 {main}
thrown in /www/start.php on line 3
[root@localhost www]#

This is the result of installing with Composer just now, then copying start.php and running it.
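
The contents of start.php are not shown, so the cause cannot be confirmed from the report. One common reason for a "Class not found" error right after a Composer install is that the script never loads vendor/autoload.php. The sketch below assumes that is the case here; `findComposerAutoload` is a hypothetical helper, not part of Beanbun or Composer.

```php
<?php
/**
 * Hypothetical helper: return the path of Composer's autoloader under
 * $projectDir, or null when the file does not exist.
 */
function findComposerAutoload(string $projectDir): ?string
{
    $path = rtrim($projectDir, '/\\') . '/vendor/autoload.php';

    return is_file($path) ? $path : null;
}

// Possible use at the top of start.php, before any `new Beanbun`:
//
//   $autoload = findComposerAutoload(__DIR__);
//   if ($autoload === null) {
//       exit("vendor/autoload.php not found; run `composer install` first\n");
//   }
//   require_once $autoload;
```

In the directory listing above, vendor sits next to start.php, so `__DIR__` would be the right base directory under this assumption.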

The urlFilter property has no effect on Windows

Since daemon mode is not available on Windows, does using Beanbun on Windows mean I have to list every page link to crawl?

cURL error 28: Operation timed out

cURL error 28: Operation timed out after 60001 milliseconds with 49054 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

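What causes this, and how can it be fixed?

Documentation does not display

The page is completely blank when I open the documentation.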

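The crawler does not seem to finish on its own, and stopWorker is not executed after crawling completes

As a test I configured $beanbun->UrlFilter and gave the crawler one seed to crawl from. The rules match about 7 pages, but after they were all fetched stopWorker was never executed, and the process was still listed in Task Manager.
I tried php start.php stop and saw in Task Manager that the crawler process had ended, but stopWorker() still was not executed.
Relevant code:
$bean->stopWorker = function($b){ $b->log('stopWorker ran once'); };
Result: the log file does not contain 'stopWorker ran once'.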

Beanbun does not run when integrated into the yii2 framework

use yii\console\Controller;
use Beanbun\Beanbun;

class DemoController extends Controller
{

    public function actionBeanbun()
    {
        $beanbun = new Beanbun();
        $beanbun->seed = [
            'http://www.950d.com/',
            'http://www.950d.com/list-1.html',
            'http://www.950d.com/list-2.html',
        ];
        $beanbun->afterDownloadPage = function ($beanbun) {
            file_put_contents(__DIR__ . '/' . md5($beanbun->url), $beanbun->page);
        };
        $beanbun->start();
    }
}

Problem 1

PHP Notice 'yii\base\ErrorException' with message 'Undefined property: Beanbun\Beanbun::$count'
in /xxx/vendor/kiddyu/beanbun/src/Beanbun.php:136

Problem 2
After adding a count property to Beanbun, it still does not work.

It keeps printing

Usage: php yourfile.php {start|stop|restart|reload|status|connections} [-d]
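
A likely explanation, stated here as an assumption rather than a confirmed diagnosis: the usage text comes from Workerman, which reads the process command from the global `$argv` at index 1. Under the Yii console runner that slot holds the route (for example `demo/beanbun`), not `start`. The sketch below shows one way to rebuild the argument list; `stripYiiRoute` is a hypothetical helper and has not been tested against Beanbun.

```php
<?php
/**
 * Hypothetical helper: drop the Yii route from an argv array so that
 * index 1 holds the process command (start, stop, ...).
 */
function stripYiiRoute(array $argv): array
{
    // $argv[0] is the script, $argv[1] the Yii route,
    // and anything after that is passed through unchanged.
    $rest = array_slice($argv, 2);
    if ($rest === []) {
        $rest = ['start']; // assumed default when no command was given
    }

    return array_merge([$argv[0]], $rest);
}
```

Inside actionBeanbun, this could be applied before `$beanbun->start()` with `global $argv; $argv = stripYiiRoute($argv);`. Whether the Yii console accepts the extra `start` argument for an action that declares no parameters is a separate question that should be checked.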
