Comments (23)
function handleGbkPage($html)
{
$html = mb_convert_encoding($html, 'UTF-8', 'GBK');
$html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in the <meta/> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
return $html;
}
$html = handleGbkPage($html);
$ql = (new QueryList())->html($html);
from querylist.
$listmain = $ql->encoding('UTF-8','GBK')->rules([
'title' => array('dd>a', 'text'),
'link' => array('dd>a', 'href')
])->query()->getData();
// Stepping into the library source shows the conversion succeeds, but $listmain is empty
class EncodeService
{
public static function convert(QueryList $ql,string $outputEncoding,string $inputEncoding = null)
{
$html = $ql->getHtml();
$inputEncoding || $inputEncoding = self::detect($html);
$html = iconv($inputEncoding,$outputEncoding,$html);
dump($inputEncoding,$outputEncoding,$html);
$ql->setHtml($html);
return $ql;
}
// detect() and the rest of the class are omitted from this excerpt
}

OP, did you find the cause? I have the same problem here.

Is this issue still unresolved?

My solution is:
$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');
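
Same problem. I tried every method in the documentation and none of them worked, so I quietly wrote my own regex and the output is fine. As far as I can tell the fetch itself works; the text only becomes garbled once I use rule matching. I also tried the code posted above and it did not work either. If anyone has solved this, please @ me. Thanks.

@youngda First convert the GBK content to UTF-8, then replace charset=* in the meta tag with utf-8. That solved it for me.

@Zneiat That does not work in my tests. If I print the fetched HTML directly it looks fine, but once I turn on rule matching the output is garbled.

The HTML page I am scraping is already UTF-8, but the Chinese text I get back from the text attribute is garbled. This feels like a bug in the whole library.

@shanezhiu Same feeling, though it is also possible that we just have not found the right approach.

@youngda Post your code and I'll take a look.

@Zneiat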
public function handle_content()
{
$data = $this->spider
->rules([
'title' => ['#activity-name','text']
])
->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
->encoding('UTF-8','GB2312')
->query()
->getData()
->toArray();
$title = array_pop($data)['title'];
var_dump($title);exit;
}

@youngda It is quite likely a bug. I'll dig through the source code.

@shanezhiu Try this:
$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";
$html = file_get_contents($url); // cURL is recommended
$html = handleGbkPage($html);
$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();
function handleGbkPage($html)
{
$html = mb_convert_encoding($html, 'UTF-8', 'GBK');
$html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in the <meta/> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
return $html;
}

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1
XD, this page is already encoded as UTF-8, so no conversion is needed.
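One way to avoid converting a page that is already UTF-8 is to check the bytes before calling mb_convert_encoding. The following is only a sketch: toUtf8IfNeeded is a hypothetical helper, not part of QueryList, and the GBK fallback is an assumption that fits the pages discussed in this thread.

```php
<?php
// Hypothetical helper (not part of QueryList): convert to UTF-8 only when needed.
function toUtf8IfNeeded(string $html, string $fallbackEncoding = 'GBK'): string
{
    // Bytes that already form valid UTF-8 are returned untouched,
    // so a UTF-8 page is never re-decoded as GBK.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    // Assumption: anything that is not valid UTF-8 is in the fallback encoding.
    $html = mb_convert_encoding($html, 'UTF-8', $fallbackEncoding);

    // Rewrite the declared charset, keeping an optional opening quote intact.
    return preg_replace('/(charset=["\']?)(gb2312|gbk)/i', '${1}utf-8', $html) ?? $html;
}
```

The result can then be passed to (new QueryList())->html(...) in the same way as the output of handleGbkPage. A byte check like this is a heuristic: a short GBK string can occasionally also be valid UTF-8, so it is worth verifying against the pages you actually scrape.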
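
@Zneiat You can remove the encoding code, print the title, and look at the result.

@shanezhiu It seems we are not discussing the same problem... The issue I ran into is that after converting GBK to UTF-8 there is no garbled text, but phpQuery still cannot get the content.

@Zneiat What makes me curious is whether you ran the snippet you provided. When I run it, the result is: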
array (size=1)
0 =>
array (size=1)
'title' => string '1603澶â��æ��é��å§H370æ¸�æ¿�棫é��ç�³ç¡¶çºî�¿î�»æ¾¶è¾«ä»�é�ªç�¸î��é��ç�·æ´�é��' (length=152)
This result is clearly incorrect.

@Zneiat I think both of these are encoding problems.

@shanezhiu Solved... You are scraping a WeChat Official Account article. The HTML starts with <!--headTrap<body></body><head></head><html></html>-->
and ends with <!--tailTrap<body></body><head></head><html></html>-->
and these comments interfere with phpQuery.
$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";
$html = file_get_contents($url); // cURL is recommended
$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);
$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
var_dump($data);
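If the exact contents of the trap comments vary between pages, a pattern-based variant of the same idea may be easier to maintain. This is a minimal sketch; stripTrapComments is a hypothetical helper name, and the assumption is that the comments always begin with <!--headTrap or <!--tailTrap.

```php
<?php
// Hypothetical helper (not part of QueryList): remove WeChat's headTrap/tailTrap comments.
function stripTrapComments(string $html): string
{
    // Non-greedy match from "<!--headTrap" or "<!--tailTrap" to the nearest "-->".
    // The /s flag lets "." match newlines in case a comment spans several lines.
    return preg_replace('/<!--(?:head|tail)Trap.*?-->/s', '', $html) ?? $html;
}
```

It would be called on the fetched HTML before passing it to (new QueryList())->html(...), in place of the str_replace call above.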

@Zneiat Thank you. Yes, that is the cause. I stepped through the code and confirmed it. An administrator may need to move these comments to a new issue.

@shanezhiu Haha, you're welcome (/ω\)

@Zneiat Thanks, that was exactly the problem. Clearly I still have a lot to learn.