
Comments (23)

qwqcode commented on May 22, 2024
function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // The charset=* in the <meta> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    
    return $html;
}

$html = handleGbkPage($html);
$ql = (new QueryList())->html($html);

from querylist.

ctfang commented on May 22, 2024
    $listmain  = $ql->encoding('UTF-8','GBK')->rules([
        'title' => array('dd>a', 'text'),
        'link' => array('dd>a', 'href')
    ])->query()->getData();

// Stepping into the library source, I can see the transcoding succeeds, but $listmain is empty
class EncodeService
{
    public static function convert(QueryList $ql, string $outputEncoding, string $inputEncoding = null)
    {
        $html = $ql->getHtml();
        $inputEncoding || $inputEncoding = self::detect($html);
        $html = iconv($inputEncoding, $outputEncoding, $html);
        dump($inputEncoding, $outputEncoding, $html); // debug output
        $ql->setHtml($html);
        return $ql;
    }
    // (rest of the class not shown)
}

wangyouw commented on May 22, 2024

OP, did you find the cause? I'm hitting the same problem.

varphper commented on May 22, 2024

Has this still not been resolved?

luffyzhao commented on May 22, 2024

My solution is:

$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');

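youngda commented on May 22, 2024

Same problem here. I tried every method in the documentation and none of them worked, so I quietly wrote my own regex and the output was fine. As far as I can tell the fetch itself works, but as soon as I use the library's matching the text comes out garbled. The code posted above didn't work for me either. If anyone has solved this, please @ me. Thanks.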

qwqcode commented on May 22, 2024

@youngda First convert the page from GBK to UTF-8, then replace charset=* in the meta tag with utf-8. That solved it for me.
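
A minimal sketch of the two steps described above, assuming the page really is GBK-family encoded and that the mbstring extension is available. The helper name `convertGbkPageToUtf8` is hypothetical and not part of QueryList. Unlike the snippet at the top of the thread, this pattern also tolerates a quoted value such as `<meta charset="gbk">` and the `gb18030` label.

```php
<?php
// Hypothetical helper, not part of QueryList.
function convertGbkPageToUtf8(string $html): string
{
    // Step 1: transcode the raw bytes from GBK to UTF-8 (requires ext-mbstring).
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');

    // Step 2: rewrite the declared charset so it matches the actual bytes.
    $rewritten = preg_replace(
        '/(charset\s*=\s*["\']?)(?:gb2312|gbk|gb18030)/i',
        '${1}utf-8',
        $html
    );

    // preg_replace() returns null on a regex error; fall back to the transcoded HTML.
    return $rewritten ?? $html;
}
```

The result can then be passed to `(new QueryList())->html(...)` as in the snippet at the top of the thread.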

youngda commented on May 22, 2024

@Zneiat That doesn't work in my tests. If I print the fetched HTML directly it looks fine, but once I turn on the matching rules the output is garbled.

shanezhiu commented on May 22, 2024

The HTML page I'm scraping is already UTF-8, but when I read the Chinese text of an element via the text attribute it comes out garbled. This feels like a bug in the library itself.

youngda commented on May 22, 2024

@shanezhiu Agreed, though it's also possible that we just haven't found the right approach and can't handle the library properly.

qwqcode commented on May 22, 2024

@youngda Post your code and I'll take a look.

shanezhiu commented on May 22, 2024

@Zneiat

public function handle_content()
{
		$data = $this->spider
			->rules([
				'title' => ['#activity-name','text']
			])
			->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
			->encoding('UTF-8','GB2312')
			->query()
			->getData()
			->toArray();
		$title = array_pop($data)['title'];
		var_dump($title);exit;
}

shanezhiu commented on May 22, 2024

@youngda A bug seems quite likely. I'll go dig through the source.

qwqcode commented on May 22, 2024

@shanezhiu Try this:

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead
$html = handleGbkPage($html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
    'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // The charset=* in the <meta> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    
    return $html;
}

qwqcode commented on May 22, 2024

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1 XD That page is already encoded as UTF-8, so no conversion is needed.
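
One way to avoid converting a page that is already UTF-8 is to check the bytes first. The following is only a sketch under stated assumptions: `toUtf8IfNeeded` is a hypothetical helper (not a QueryList API), it requires the mbstring extension, and it assumes that anything which is not valid UTF-8 is in the given fallback encoding. A short GBK string can occasionally also be valid UTF-8, so this check is a heuristic rather than a guarantee.

```php
<?php
// Hypothetical helper, not part of QueryList: convert to UTF-8 only when needed.
function toUtf8IfNeeded(string $html, string $fallbackEncoding = 'GBK'): string
{
    // Valid UTF-8 input is returned untouched, so it is never converted twice.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    // Assumption: anything that is not valid UTF-8 is in the fallback encoding.
    return mb_convert_encoding($html, 'UTF-8', $fallbackEncoding);
}
```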

shanezhiu commented on May 22, 2024

@Zneiat Try removing the encoding call and printing the title, then look at the result.

qwqcode commented on May 22, 2024

@shanezhiu It seems we're not discussing the same problem... The issue I ran into is that after converting GBK to UTF-8 nothing is garbled, but phpQuery still can't extract the content.

shanezhiu commented on May 22, 2024

@Zneiat What I'm curious about is whether you actually ran the snippet you provided. When I run it, the result is:

array (size=1)
  0 => 
    array (size=1)
      'title' => string '1603澶���孧H370��棫��硶纭��澶辫��������' (length=152)

This result is clearly incorrect.
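
Garbling of this kind is what one would expect when bytes that are already UTF-8 are decoded as GBK a second time. A small sketch of that effect, using a hypothetical sample string rather than the real article title, and assuming the mbstring extension and a UTF-8 source file:

```php
<?php
// Hypothetical sample text; the source file itself is assumed to be saved as UTF-8.
$original = '大学';

// Wrongly treat the UTF-8 bytes as GBK and "convert" them to UTF-8 again.
$garbled = mb_convert_encoding($original, 'UTF-8', 'GBK');

// The text no longer matches the original once it has been decoded with the wrong charset.
var_dump($garbled === $original); // bool(false)
```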

shanezhiu commented on May 22, 2024

@Zneiat I think both of these are encoding problems.

qwqcode commented on May 22, 2024

@shanezhiu Solved... You're scraping a WeChat Official Account article. The HTML begins with <!--headTrap<body></body><head></head><html></html>--> and ends with <!--tailTrap<body></body><head></head><html></html>-->, and these comments interfere with phpQuery.

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead

$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
var_dump($data);
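
If the exact contents of the trap comments ever vary, a pattern-based variant of the same idea might be more tolerant than matching the two literal strings. This is only a sketch: `stripTrapComments` is a hypothetical helper name, and it assumes the comments always start with `<!--headTrap` or `<!--tailTrap` and end at the nearest `-->`.

```php
<?php
// Hypothetical helper, not part of QueryList: remove the headTrap/tailTrap comments before parsing.
function stripTrapComments(string $html): string
{
    // Non-greedy match from "<!--headTrap" or "<!--tailTrap" up to the nearest "-->".
    $cleaned = preg_replace('/<!--(?:head|tail)Trap.*?-->/s', '', $html);

    // preg_replace() returns null on a regex error; fall back to the untouched input.
    return $cleaned ?? $html;
}
```

The cleaned string can then be loaded with `(new QueryList())->html(...)` exactly as in the snippet above.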

shanezhiu commented on May 22, 2024

@Zneiat Thank you. Yes, that was the cause; I stepped through the code and confirmed it. A maintainer may need to move these comments to a new issue.

qwqcode commented on May 22, 2024

@shanezhiu Haha, you're welcome (/ω\)

youngda commented on May 22, 2024

@Zneiat Thanks, that was exactly the problem. Turns out I just wasn't experienced enough.
