A collection of interesting Python web crawler examples that are friendly to beginners. They mainly crawl sites such as Taobao, Tmall, WeChat, WeChat Read, Douban, and QQ.

License: MIT License

Python 88.25% JavaScript 3.41% CSS 1.36% HTML 6.98%

crawler spider taobao tmall example python selenium pyquery stock fund multithreading agent-pool wechat wechat-report wereader

examples-of-web-crawlers's Issues

Generate_wx_data.py typo?

Line 94, map.add("", provice. Should it be province?

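With the prebuilt exe of the girlfriend-reminder project, messages are not sent fully automatically: when the scheduled time arrives, I have to click the window or press Enter before the message goes out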

Exception in thread generate_data:
Traceback (most recent call last):
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "main.py", line 152, in generate_data
TypeError: 'NoneType' object is not subscriptable

Generating the WeChat report fails with: ImportError: cannot import name 'Pie'

Generating the WeChat report fails with: ImportError: cannot import name 'Pie'
Trying to install a Pie module fails with: ERROR: Could not find a version that satisfies the requirement pie (from versions: none)
ERROR: No matching distribution found for pie
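
A note on the report above: `Pie` is a chart class inside pyecharts, not a separate package, which would explain why pip finds no distribution named pie. As far as I know, pyecharts 0.5.x exposed chart classes at the top level (`from pyecharts import Pie`) while 1.x moved them to `pyecharts.charts`, so the error is consistent with a version mismatch, although the report does not state the installed version. A minimal sketch, assuming that diagnosis, of trying both locations in order; `first_importable` is a hypothetical helper, not part of the project.

```python
import importlib


def first_importable(candidates):
    """Return the first attribute importable from (module, name) pairs.

    Returns None when no candidate is available.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing: try the next location
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Assumed locations of Pie: pyecharts 1.x first, then pyecharts 0.5.x.
Pie = first_importable([("pyecharts.charts", "Pie"), ("pyecharts", "Pie")])
if Pie is None:
    print("pyecharts is not installed, or neither location provides Pie")
```

Finding the class only fixes the import. Code written for one pyecharts generation may still need adapting to run on the other.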

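Why can't I package my modified code with PyInstaller?

It breaks as soon as the wxpy package is included.

The executable crashes immediately on launch

Cannot get past the Taobao slider CAPTCHA

Hello, whenever the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh to try again (error:NgaRgk)", and the program then reports 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?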

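Tmall crawler problem and fix

An error occurs when the search returns less than one full page of products. Before the crawler starts collecting data, add the following code.

        # Requires: from selenium.common.exceptions import NoSuchElementException
        # "喵~没找到" ("Meow~ not found") is the start of the first page message being matched.
        err1 = self.browser.find_element_by_xpath("//*[@id='content']/div/div[2]").text
        if err1[:5] == "喵~没找到":
            print("Something went wrong")
            return

        # "我们还为您" ("We also ... for you") is the start of the second message being matched.
        try:
            err2 = self.browser.find_element_by_xpath("//*[@id='J_ComboRec']/div[1]").text
        except NoSuchElementException:
            err2 = ""  # element absent: nothing to check
        if err2[:5] == "我们还为您":
            print("Something went wrong")
            return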

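Could you add PDF export for WeChat Book (心书, Xinshu)? 出书啦 (Chushula) has a limitation: posts in personal groups set to private are not visible. Thanks!

I also wanted to modify the code and build this myself, but printing the Xinshu page to PDF through the system print dialog leaves every image blank (this is not a lazy-loading issue). From the page source it looks like all the images are loaded through CSS, and I don't know how to handle that.

OP, have you tested this yourself? I tried for a long time and it never works.

I tried your method and also other methods I found online. As long as the browser is launched by selenium, it gets detected, whether I modify the configuration file or switch to developer mode. Logging in with an account hits a slider that cannot be passed, even when dragged by hand. Logging in through Weibo, the CAPTCHA is not displayed, yet it still has to be entered.

It doesn't work

Bro, this doesn't work. Taobao still detects it; even with the developer mode option added, Taobao can still detect it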

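> Hello, whenever the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh to try again (error:NgaRgk)", and the program then reports 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?

It currently works in my tests. If it does not work for you, log in on the web page manually and drag the slider once; after that it usually will not prompt again.

Originally posted by @shengqiangzhang in #4 (comment)

OP, why is it that as soon as the page is opened through selenium, I cannot log in even manually? This happens even with developer mode added. The problem has been bothering me for a long time.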

life.chauo.net cannot be accessed

How do I use 'excludeSwitches', ['enable-automation'] in headless mode?

It doesn't work with xvfb either? How do you deploy this to a server?

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
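
"Expecting value: line 1 column 1 (char 0)" is what `json.loads` raises when the text does not begin with a JSON value, for example an empty body or an HTML page. Which response triggered it here is not stated in the report. A minimal sketch, with a hypothetical helper `parse_json_or_none`, of inspecting the body instead of letting the exception propagate:

```python
import json


def parse_json_or_none(text):
    """Parse text as JSON; return None instead of raising on invalid input."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML page or empty reply is visible.
        print(f"Not JSON ({exc.msg} at char {exc.pos}); body starts with: {text[:60]!r}")
        return None


print(parse_json_or_none('{"ok": true}'))
print(parse_json_or_none(""))
print(parse_json_or_none("<html>login required</html>"))
```

Printing the first characters of the body usually shows whether the server returned an error page, a login redirect, or nothing at all.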

Application not found: 'QR.png'

C:\Users\Administrator>C:\Users\Administrator\Desktop\say_to_lady\say_to_lady.exe
Getting uuid of QR code.
Downloading QR code.
Traceback (most recent call last):
File "say_to_lady.py", line 154, in
File "site-packages\wxpy\api\bot.py", line 86, in init
File "site-packages\itchat\components\register.py", line 35, in auto_login
File "site-packages\itchat\components\login.py", line 44, in login
File "site-packages\itchat\components\login.py", line 117, in get_QR
File "site-packages\itchat\utils.py", line 85, in print_qr
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'
[3992] Failed to execute script say_to_lady
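
The traceback ends in itchat's `print_qr`, which tries to open QR.png with the system default application. An "application not found" error at that point usually suggests that Windows has no default program associated with .png files, though that is an assumption about this machine. A minimal sketch of the general pattern, opening a file and falling back to printing its path; `open_or_report` is a hypothetical helper and not part of itchat or wxpy.

```python
import os


def open_or_report(path, opener=None):
    """Try to open path with the default application; report the path on failure.

    opener is injectable for testing; by default os.startfile is used where it
    exists (Windows only).
    """
    if opener is None:
        opener = getattr(os, "startfile", None)
    if opener is None:
        print(f"No opener available on this platform; open manually: {path}")
        return False
    try:
        opener(path)
        return True
    except OSError as exc:
        # For example: no application is associated with the file type.
        print(f"Could not open {path} ({exc}); open it manually instead")
        return False
```

On the user side, associating .png files with an image viewer would be the first thing to try.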

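Taobao login

The Taobao login code contains Weibo CSS selectors; it looks like some parts were not fully updated.

8. One-click WeChat personal data report (learn about your WeChat social history) does not run

Following readme.md, I uninstalled the dependencies and reinstalled them. After running the script and scanning the QR code, it fails with an error

Fetching WeChat friend data, please wait patiently...
WeChat friend data has been fetched

Analyzing your group chats, please wait patiently...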
Traceback (most recent call last):
  File "d:/code/8/generate_wx_data.py", line 563, in <module>
    group_common_in()
  File "d:/code/8/generate_wx_data.py", line 526, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 148, in __init__
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 14, in __init__ 
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\base.py", line 28, in __init__  
    self.width = _opts.get("width", "900px")
AttributeError: 'str' object has no attribute 'get'
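
Judging only from the traceback, the string passed as the first positional argument to `Bar` reaches `_opts.get(...)`, so the installed pyecharts appears to treat that argument as chart options rather than as a title. That is consistent with code written for pyecharts 0.5.x running on 1.x, though this report does not state the installed version. A minimal sketch of the mechanism, using a hypothetical stand-in class `FakeChart` rather than pyecharts itself:

```python
class FakeChart:
    """Hypothetical stand-in that mimics the option lookup shown in the traceback."""

    def __init__(self, init_opts=None):
        _opts = init_opts if init_opts is not None else {}
        # Same kind of lookup as the failing line in the traceback (base.py).
        self.width = _opts.get("width", "900px")


# An options mapping works.
print(FakeChart({"width": "600px"}).width)

# A title string passed where options are expected fails like the report.
try:
    FakeChart("共同所在群聊分析")
except AttributeError as exc:
    print(exc)
```

If that is the cause, either installing the pyecharts release the script was written against or adapting the call to the installed API should avoid the error.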

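CSS selector for the total page count

#content > div > div.ui-page > div > b.ui-page-skip > form > input[type="hidden"]:nth-child(7)
This is different from what you wrote at the time. Is there still a way around it now?

The WeChat message-sending feature raises KeyError: 'pass_ticket'. Has the account been banned? Is wxpy still usable?

Could you crawl Huawei Cloud Space albums? I did not succeed and would like to exchange ideas and learn.

Login works and the first page of data is crawled, but the slider step then fails

The log is as follows: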
包邮威德博威No.5羽毛球五号场馆训练比赛12只装耐打稳定室内高手月成交68评价732旺旺在线 55.00 //detail.tmall.com/item.htm?id=13074420768&skuId=22276410079&areaId=310100&user_id=748152180&cat_id=50043727&is_b=1&rn=aa934dd095dab511c3ed3f2cde0da7b6

get button failed: Message: move target out of bounds
(Session info: chrome=76.0.3809.87)

Traceback (most recent call last):
File "C:/Users/Daoling/Downloads/Tmall.py", line 216, in
a.crawl_good_data() # 爬取天猫商品数据
File "C:/Users/Daoling/Downloads/Tmall.py", line 150, in crawl_good_data
EC.presence_of_element_located((By.CSS_SELECTOR, '#J_ItemList > div.product > div.product-iWrap')))
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Error screenshot:

Crawling 5K ultra-HD wallpapers

The program reports errors:

Instance of 'closing' has no 'headers' member
Instance of 'closing' has no 'iter_content' member

The module is already imported, but it still doesn't work
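
The two messages have the shape of pylint's no-member warning (E1101), which is static analysis output and not a runtime exception; if that is what the editor is showing, the script may still run. `contextlib.closing(x)` hands back `x` itself in `with ... as`, so the attributes of the wrapped object are available inside the block, but a linter cannot always infer that. A minimal sketch with a hypothetical `FakeResponse` standing in for a streamed HTTP response:

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a streamed HTTP response object."""

    def __init__(self, body):
        self.headers = {"content-length": str(len(body))}
        self._body = body
        self.closed = False

    def iter_content(self, chunk_size=1024):
        # Yield the body in chunks of at most chunk_size bytes.
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]

    def close(self):
        self.closed = True


response = FakeResponse(b"x" * 2500)
with closing(response) as r:
    # r is the wrapped object, so .headers and .iter_content exist at runtime.
    total = int(r.headers["content-length"])
    received = sum(len(chunk) for chunk in r.iter_content(chunk_size=1024))

print(total, received, response.closed)
```

If the script really does stop at runtime, the actual traceback would be needed to say more.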

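The Douban GUI opens to a blank window

As shown in the screenshot.

For the project that messages your girlfriend via WeChat at different times each day, what is the command to run it? Surely it is not just a matter of creating an ini config file in any directory and running the program?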

configparser.NoSectionError: No section: 'configuration'

Traceback (most recent call last):
File "configparser.py", line 1138, in _unify_values
KeyError: 'configuration'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "say_to_lady.py", line 175, in
File "configparser.py", line 781, in get
File "configparser.py", line 1141, in _unify_values
configparser.NoSectionError: No section: 'configuration'
[1812] Failed to execute script say_to_lady
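
`ConfigParser.read` silently skips files it cannot find, so a missing or misplaced ini file only surfaces later as `NoSectionError` on the first `get`. Whether that is the cause here is an assumption, and where say_to_lady looks for its ini file is not stated in the report. In the sketch below the section name `configuration` comes from the traceback, while the file name `config.ini`, the option name `some_option`, and the helper `load_option` are hypothetical.

```python
import configparser
import os
import tempfile


def load_option(path, section, option):
    """Read one option, failing early with a clear message if the file or section is missing."""
    parser = configparser.ConfigParser()
    files_read = parser.read(path, encoding="utf-8")  # list of files actually parsed
    if not files_read:
        raise FileNotFoundError(f"Config file not found: {os.path.abspath(path)}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] missing in {path}; found {parser.sections()}")
    return parser.get(section, option)


# Demonstration with a temporary file; the names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, "config.ini")
    with open(ini_path, "w", encoding="utf-8") as handle:
        handle.write("[configuration]\nsome_option = hello\n")
    print(load_option(ini_path, "configuration", "some_option"))
```

Reporting the absolute path makes it clear which directory the program searched, which is usually the missing piece when the ini file sits somewhere else.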

AttributeError: 'str' object has no attribute 'get'

Fetching WeChat friend data, please wait patiently...
WeChat friend data has been fetched

Analyzing your group chats, please wait patiently...
Traceback (most recent call last):
File "getUrl.py", line 532, in
group_common_in()
File "getUrl.py", line 498, in group_common_in
bar = Bar('共同所在群聊分析')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 143, in init
super().init(init_opts=init_opts)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 15, in init
super().init(init_opts=init_opts)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/base.py", line 29, in init
self.width = _opts.get("width")
AttributeError: 'str' object has no attribute 'get'

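The one-click WeChat personal report is stuck at face recognition, on Windows 7

It has been stuck on the first face-recognition item for a long time.

Taobao login via Weibo shows a network connection timeout

When I run your Taobao login code and log in via Weibo, it shows a network connection timeout. Selenium may have been detected. Logging in manually with a normal browser works, and I don't know how to resolve this

wxpy seems to be unusable now

Tencent has disabled web WeChat login for many accounts, so wxpy may be unusable for the time being

Tmall crawler

When I use the Tmall product data crawler, only the first page of data is crawled; the command line then shows NOT IMPLEMENTED and no further data can be crawled

Weibo login requires a CAPTCHA

[Bug report] Scanning with WeChat shows an unrecognizable QR code

Hi, is this a pyecharts version problem?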

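[Thanks] Learned a lot, this is exactly what I needed

The theory courses I am studying are giving me a headache, so this is a nice change of pace

One-click WeChat personal data report fails with an error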

Traceback (most recent call last):
File "generate_wx_data.py", line 617, in
File "generate_wx_data.py", line 317, in merge_head_image
File "site-packages\PIL\Image.py", line 1817, in resize
File "site-packages\PIL\ImageFile.py", line 239, in load
OSError: image file is truncated (0 bytes not processed)
[8144] Failed to execute script generate_wx_data

The program won't run

Getting uuid of QR code.
INFO:itchat:Getting uuid of QR code.
Downloading QR code.
INFO:itchat:Downloading QR code.
Traceback (most recent call last):
File "generate_wx_data.py", line 542, in
bot = Bot(cache_path=True)
File "E:\Program Files\python37\lib\site-packages\wxpy\api\bot.py", line 86, in init
loginCallback=login_callback, exitCallback=logout_callback
File "E:\Program Files\python37\lib\site-packages\itchat\components\register.py", line 30, in auto_login
loginCallback=loginCallback, exitCallback=exitCallback)
File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 44, in login
picDir=picDir, qrCallback=qrCallback)
File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 117, in get_QR
utils.print_qr(picDir)
File "E:\Program Files\python37\lib\site-packages\itchat\utils.py", line 85, in print_qr
os.startfile(fileDir)
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'

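The history file is 215 MB and the web page crashes

Login verification problem in the Tmall crawler

During Tmall's slider verification, dragging all the way in one go with the move_by_offset function seems to be detected as non-human.
Dragging step by step along a trajectory in a loop is slow and is also flagged as abnormal. What can be done about this?

When WeChat books are generated for several people in one run, the one opened each time is always the first person's; the file name is not updated automatically

A small suggestion: at the end, open the folder containing the file instead of opening the PDF

Usage

Could the user manual be more detailed, for example on how to run this on Linux?

Got it. Now where do I pick up a girlfriend?

As in the title.

Could you publish a copy on Gitee (码云)? The repository is too large and downloads keep timing out!

The new version of Zhihu is hard to crawl and everything gets blocked. Could you add a Zhihu example?

Could you add a new feature: crawling Taobao product reviews?

So the question is, where do I find a girlfriend?

Login slider verification failed

"Oops, something went wrong. Click refresh to try again (error:e3Euv)". Has anyone run into something similar? Any help is appreciated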

error when login

python taobao_login.py
Traceback (most recent call last):
File "taobao_login.py", line 72, in
a.login() #登录
File "taobao_login.py", line 32, in login
self.browser.find_element_by_xpath('//*[@class="forget-pwd J_Quick2Static"]').click()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: element not interactable
(Session info: chrome=72.0.3626.119)
(Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.15.0-46-generic x86_64)

shengqiangzhang / examples-of-web-crawlers
