querylist-puppeteer's Introduction

QueryList-Puppeteer

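QueryList plugin: use Puppeteer to scrape pages that are rendered dynamically by JavaScript. Using this plugin requires basic Node.js knowledge and the ability to set up a Node runtime environment.

This plugin is a thin wrapper around the PuPHPeteer package and supports the entire Puppeteer API, which makes it very powerful.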

PuPHPeteer: https://github.com/nesk/puphpeteer

Puppeteer: https://github.com/GoogleChrome/puppeteer

QueryList: https://github.com/jae-jae/QueryList

Requirements

  • PHP >= 7.1
  • Node >= 8

Installation

Install the plugin

composer require jaeger/querylist-puppeteer

Install the Node dependency (run it in the project root, the same place you run composer)

npm install @nesk/puphpeteer

Plugin registration options

QueryList::use(Chrome::class,$opt1)

  • $opt1: alias for the chrome function

API

  • chrome($url, $options = []) opens the URL in Chrome and returns a QueryList object with the rendered HTML already set

Usage

Register the plugin in QueryList

use QL\QueryList;
use QL\Ext\Chrome;

$ql = QueryList::getInstance();
// Register the plugin; the default registered method name is: chrome
$ql->use(Chrome::class);
// Or register it under a custom method name
$ql->use(Chrome::class,'chrome');

Basic usage

// The target page is rendered dynamically with Vue.js
$text = $ql->chrome('https://www.iviewui.com/components/button')->find('h1')->text();
print_r($text);
// Output: Button 按钮
$rules = [
 'h1' => ['h1','text']
];
$ql = $ql->chrome('https://www.iviewui.com/components/button');
$data = $ql->rules($rules)->queryData();

Set Puppeteer launch options. Options documentation: https://github.com/GoogleChrome/puppeteer/blob/v1.11.0/docs/api.md#puppeteerlaunchoptions

$text = $ql->chrome('https://www.iviewui.com/components/button',[
  'timeout' => 6000,
  'ignoreHTTPSErrors' => true,
  // ...
])->find('h1')->text();

More advanced usage. See the Puppeteer documentation for the full API: https://github.com/GoogleChrome/puppeteer

$text = $ql->chrome(function ($page,$browser) {
    $page->setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36');
    // Set cookies
    $page->setCookie([
      'name' => 'foo',
      'value' => 'xxx',
      'url' => 'https://www.iviewui.com'
    ],[
       'name' => 'foo2',
       'value' => 'yyy',
       'url' => 'https://www.iviewui.com'
    ]);
    $page->goto('https://www.iviewui.com/components/button');
    // Wait for the h1 element to appear
    $page->waitFor('h1');
    // Get the page's HTML content
    $html = $page->content();
    // Close the browser
    $browser->close();
    // The return value must be the page's HTML content
    return $html;
})->find('h1')->text();

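Debugging

There are many ways to debug. The examples below show how to inspect page loading by taking a screenshot and by launching a visible Chrome browser.

Page screenshot

After running the code below, a page.png screenshot file appears in the project root.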

$text = $ql->chrome(function ($page,$browser) {
    $page->goto('https://www.iviewui.com/components/button');
    // Take a screenshot of the page
    $page->screenshot([
        'path' => 'page.png',
        'fullPage' => true
    ]);
    $html = $page->content();
    $browser->close();
    return $html;
})->find('h1')->text();

Launch a visible Chrome browser

Running the code below launches a Chrome browser window.

$text = $ql->chrome(function ($page,$browser) {
    $page->goto('https://www.iviewui.com/components/button');
    $html = $page->content();
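    // Deliberately sleep for a very long time so that you can watch the Chrome browser start
    sleep(10000000);
    $browser->close();
    // The return value must be the page's HTML content
    return $html;
},[
 'headless' => false, // Launch a visible Chrome browser for easier debugging
 'devtools' => true, // Open the browser's developer tools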
])->find('h1')->text();

querylist-puppeteer's People

Contributors

jae-jae

querylist-puppeteer's Issues

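How do I use Puppeteer methods whose names contain the $ character?

Some Puppeteer functions have a $ in their name and apparently cannot be called from PHP. How can I use them?
$page->$('h1') does not seem to work.
The functions in question: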
page.$(selector)
page.$$(selector)
page.$$eval(selector, pageFunction[, ...args])
page.$eval(selector, pageFunction[, ...args])
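
A possible answer, based on the PuPHPeteer README rather than on anything documented by this plugin: PHP does not allow $ in method names, so PuPHPeteer exposes aliases instead (querySelector for $, querySelectorAll for $$, querySelectorEval for $eval, querySelectorAllEval for $$eval). The sketch below assumes a PuPHPeteer/Rialto version that provides JsFunction::createWithParameters; the closure name $extractHeading is hypothetical.

```php
// Hypothetical closure for $ql->chrome(); it only runs when the plugin calls it.
$extractHeading = function ($page, $browser) {
    $page->goto('https://www.iviewui.com/components/button');
    $page->waitFor('h1');

    // page.$('h1') in JavaScript becomes querySelector('h1') in PHP
    $firstHeading = $page->querySelector('h1');

    // page.$eval(...) becomes querySelectorEval(...); the page function is
    // written in JavaScript and wrapped in a Rialto JsFunction
    // (assumes a Rialto version with createWithParameters)
    $headingText = $page->querySelectorEval(
        'h1',
        \Nesk\Rialto\Data\JsFunction::createWithParameters(['node'])
            ->body('return node.textContent;')
    );

    $html = $page->content();
    $browser->close();
    // The return value must still be the page's HTML content
    return $html;
};

// Usage, assuming $ql is a QueryList instance with the Chrome plugin registered:
// $text = $ql->chrome($extractHeading)->find('h1')->text();
```

The closure still returns the full HTML so that the QueryList chain keeps working; $firstHeading and $headingText are only there to show the aliased calls.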

Fatal error: Uncaught Nesk\Rialto\Exceptions\Node\FatalException:

Fatal error: Uncaught Nesk\Rialto\Exceptions\Node\FatalException: Failed to launch the browser process! [1026/233157.600635:FATAL:zygote_host_impl_linux.cc(117)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox. #0 0x5583c69d2b39 base::debug::CollectStackTrace() #1 0x5583c69454c3 base::debug::StackTrace::StackTrace() #2 0x5583c6955c80 logging::LogMessage::~LogMessage() #3 0x5583c53e7f5e content::ZygoteHostImpl::Init() #4 0x5583c68efcf8 content::ContentMainRunnerImpl::Initialize() #5 0x5583c68ede0b content::RunContentProcess() #6 0x5583c68edf5c content::ContentMain() #7 0x5583c693f3d2 headless::(anonymous namespace)::RunContentMain() #8 0x5583c693f0bc headless::HeadlessShellMain() #9 0x5583c3f40a03 ChromeMain #10 0x7f96abcff555 __libc_start_main #11 0x5583c3f4082a _start Received in /www/wwwroot/snapshot.rsyun.net/vendor/nesk/rialto/src/ProcessSupervisor.php on line 307
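
The error message itself names --no-sandbox as an immediate workaround. Since chrome() accepts Puppeteer launch options as its second argument, and Puppeteer's launch options include an args array of Chromium flags, one way to try this is sketched below. The helper withNoSandbox is hypothetical and not part of the plugin, and disabling the sandbox weakens browser isolation, so it is best limited to trusted pages or to a container.

```php
// Hypothetical helper: adds the sandbox-disabling Chromium flags to a set of
// Puppeteer launch options without dropping flags that are already present.
function withNoSandbox(array $launchOptions = []): array
{
    $flags = ['--no-sandbox', '--disable-setuid-sandbox'];
    $existing = $launchOptions['args'] ?? [];
    $launchOptions['args'] = array_values(array_unique(array_merge($existing, $flags)));
    return $launchOptions;
}

$options = withNoSandbox(['timeout' => 6000]);

// Usage, assuming $ql is a QueryList instance with the Chrome plugin registered:
// $text = $ql->chrome('https://www.iviewui.com/components/button', $options)->find('h1')->text();
```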

Same as with the other plugin: a symfony/process version conflict

`- Can only install one of: symfony/process[4.4.x-dev, v5.0.7].
- Can only install one of: symfony/process[v4.4.0, v5.0.7].
- Can only install one of: symfony/process[v4.4.0-BETA1, v5.0.7].
- Can only install one of: symfony/process[v4.4.0-BETA2, v5.0.7].
- Can only install one of: symfony/process[v4.4.0-RC1, v5.0.7].
- Can only install one of: symfony/process[v4.4.1, v5.0.7].
- Can only install one of: symfony/process[v4.4.2, v5.0.7].
- Can only install one of: symfony/process[v4.4.3, v5.0.7].
- Can only install one of: symfony/process[v4.4.4, v5.0.7].
- Can only install one of: symfony/process[v4.4.5, v5.0.7].
- Can only install one of: symfony/process[v4.4.6, v5.0.7].
- Can only install one of: symfony/process[v4.4.7, v5.0.7].
- Installation request for symfony/process (locked at v5.0.7) -> satisfiable by symfony/process[v5.0.7].

Installation failed, reverting ./composer.json to its original content.`

The libraries this plugin depends on have gone too long without updates...

ProcessFailedException error

Source code
`<?php
namespace app\controller;

use app\BaseController;
use QL\QueryList;
use QL\Ext\Chrome;

class Index extends BaseController
{
    public function index()
    {
        $ql = QueryList::getInstance();

        // Register the plugin; the default registered method name is: chrome
        $ql->use(Chrome::class);
        // The target page is rendered dynamically with Vue.js
        $text = $ql->chrome('https://www.iviewui.com/components/button')->find('h1')->text();
        print_r($text);
    }
}`

Error message:
#0 [0]ProcessFailedException in ProcessSupervisor.php line 309
` if (!empty($process->getErrorOutput())) {
if (IdleTimeoutException::exceptionApplies($process)) {
throw new IdleTimeoutException(
$this->options['idle_timeout'],
new NodeFatalException($process, $this->options['debug'])
);
} else if (NodeFatalException::exceptionApplies($process)) {
throw new NodeFatalException($process, $this->options['debug']);
} elseif ($process->isTerminated() && !$process->isSuccessful()) {
throw new ProcessFailedException($process);
}
}

    if ($process->isTerminated()) {
        throw new Exceptions\ProcessUnexpectedlyTerminatedException($process);
    }
}

`
The framework is TP6 (ThinkPHP 6), the PHP version is 7.3.4, and the Node version is 8.17. How can this problem be solved?

'node' is not recognized as an internal or external command, operable program or batch file.

local.INFO: The command "node "D:\wwwroot\system.test.com\vendor\nesk\rialto\src/node-process/serve.js" D:\wwwroot\system.test.com\vendor\nesk\puphpeteer\src\PuppeteerConnectionDelegate.js "{""idle_timeout"":10000,""log_node_console"":false,""log_browser_console"":false}"" failed.

Exit Code: 1(General error)

Working directory: D:\wwwroot\system.test.com\public

Output:

Error Output:

'node' 不是内部或外部命令,也不是可运行的程序
或批处理文件。
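
The log shows that the node command could not be started from the PHP process, whose working directory is the public folder. A common cause, though not confirmed by the log, is that the PATH seen by the web server differs from the PATH of an interactive shell. A minimal pre-flight check run from the same PHP environment can narrow this down; detectNodeVersion is a hypothetical helper, not part of the plugin or of PuPHPeteer.

```php
// Hypothetical pre-flight check: returns the version string printed by
// `node --version`, or null when the binary cannot be executed from PHP.
function detectNodeVersion(string $binary = 'node'): ?string
{
    $output = [];
    $exitCode = 1;
    // "2>&1" is understood by both cmd.exe and POSIX shells
    @exec(escapeshellarg($binary) . ' --version 2>&1', $output, $exitCode);
    if ($exitCode !== 0 || empty($output)) {
        return null;
    }
    return trim($output[0]);
}

$version = detectNodeVersion();
if ($version === null) {
    echo "node is not reachable from this PHP process; check PATH\n";
} else {
    echo "node found: {$version}\n";
}
```

Running this through the web server, not only from the command line, shows whether the PHP process that launches Puppeteer can actually find Node.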

runtime exception : ProcessSupervisor

expose exceptions as below:

#0 D:\wwwroot\pscraper\vendor\nesk\rialto\src\ProcessSupervisor.php(423): Nesk\Rialto\ProcessSupervisor->checkProcessStatus() #1 D:\wwwroot\pscraper\vendor\nesk\rialto\src\ProcessSupervisor.php(382): Nesk\Rialto\ProcessSupervisor->readNextProcessValue(true) #2 D:\wwwroot\pscraper\vendor\nesk\rialto\src\Traits\CommunicatesWithProcessSupervisor.php(84): Nesk\Rialto\ProcessSupervisor->executeInstruction(Object(Nesk\Rialto\Instruction)) #3 D:\wwwroot\pscraper\vendor\nesk\rialto\src\Traits\CommunicatesWithProcessSupervisor.php(100): Nesk\Rialto\AbstractEntryPoint->proxyAction('call', 'launch', Array) #4 D:\wwwroot\pscraper\vendor\jaeger\querylist-puppeteer\Chrome.php(32): Nesk\Rialto\AbstractEntryPoint->__call('launch', Array) #5 D:\wwwroot\pscraper\vendor\jaeger\querylist-puppeteer\Chrome.php(24): QL\Ext\Chrome::render(Object(QL\QueryList), 'https://ll...', Array) #6 [internal function]: QL\QueryList->QL\Ext\{closure}('https://ll...') #7 D:\wwwroot\pscraper\vendor\jaeger\querylist\src\QueryList.php(67): Closure->call(Object(QL\QueryList), 'https://ll...') #8 D:\wwwroot\pscraper\libs\SScraper.php(29): QL\QueryList->__call('chrome', Array) #9 D:\wwwroot\pscraper\test.php(6): Pscraper\SScraper->getGoodsParams('https://ll...') #10 {main}

Timeout settings have no effect

Setting the idle_timeout and timeout options does not seem to take effect

$ql->chrome(function ($page,$browser) {...}, [
    'idle_timeout' => 0,
    'timeout' => 0,
])

The error is as follows:

The idle timeout (60.000 seconds) has been exceeded. Maybe you should increase the "idle_timeout" option.

(I also tried values other than 0; setting it to 1000000 had no effect either, and the process still exited when the time was up.)
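
One possible explanation, which this page does not confirm: in PuPHPeteer, idle_timeout and read_timeout are options of the Puppeteer constructor (they configure the bridge to the Node process), while the second argument of chrome() is described in the README as Puppeteer launch options. If the plugin does not forward them to the constructor, a workaround is to drive PuPHPeteer directly and hand the HTML to QueryList. The function name renderWithIdleTimeout and the 300-second default are illustrative assumptions.

```php
// Hypothetical helper: renders a page with PuPHPeteer directly so that the
// bridge timeouts (in seconds) can be set on the Puppeteer constructor.
function renderWithIdleTimeout(string $url, int $timeoutSeconds = 300): string
{
    $puppeteer = new \Nesk\Puphpeteer\Puppeteer([
        'idle_timeout' => $timeoutSeconds, // max idle time of the Node process
        'read_timeout' => $timeoutSeconds, // max wait for a single instruction
    ]);

    // Launch options are separate; a timeout of 0 disables the browser start timeout
    $browser = $puppeteer->launch(['timeout' => 0]);
    try {
        $page = $browser->newPage();
        $page->goto($url);
        return $page->content();
    } finally {
        // Always close the browser, even if navigation throws
        $browser->close();
    }
}

// Usage, assuming PuPHPeteer and QueryList are installed and autoloaded:
// $html = renderWithIdleTimeout('https://www.iviewui.com/components/button');
// $text = \QL\QueryList::html($html)->find('h1')->text();
```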
