Some fun Python web crawler examples that are friendly to beginners. They mainly crawl sites such as Taobao, Tmall, WeChat, WeChat Read (微信读书), Douban, and QQ.

License: MIT License

Python 88.25% JavaScript 3.41% CSS 1.36% HTML 6.98%

agent-pool crawler example fund multithreading pyquery python selenium spider stock taobao tmall wechat wechat-report wereader


examples-of-web-crawlers's Issues

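Login slider verification fails

The page shows "Oops, something went wrong. Click refresh and try again (error:e3Euv)". Has anyone run into something similar? Any help would be appreciated.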

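8. One-click personal WeChat data report (explore your WeChat social history) cannot run

Following readme.md, I uninstalled and reinstalled the dependencies. After running the script and scanning the QR code, it reports an error: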

正在获取微信好友数据信息，请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊，请耐心等待……
Traceback (most recent call last):
  File "d:/code/8/generate_wx_data.py", line 563, in <module>
    group_common_in()
  File "d:/code/8/generate_wx_data.py", line 526, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 148, in __init__
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 14, in __init__ 
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\base.py", line 28, in __init__  
    self.width = _opts.get("width", "900px")
AttributeError: 'str' object has no attribute 'get'
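
The traceback suggests that the script was written for the pyecharts 0.5.x API, where the first positional argument of `Bar` was the chart title, while the installed pyecharts 1.x treats that argument as `init_opts` and calls `.get` on it. The following is a minimal sketch of that mechanism only. `FakeChart` and `try_create` are hypothetical stand-ins, not pyecharts code, so that the example runs without the library installed.

```python
# Hypothetical stand-in for a chart base class. It is NOT pyecharts code;
# it only mimics the `_opts.get("width", ...)` call seen in the traceback.
class FakeChart:
    def __init__(self, init_opts=None):
        _opts = init_opts if init_opts is not None else {}
        self.width = _opts.get("width", "900px")


def try_create(first_arg):
    """Return the chart width, or the error message if construction fails."""
    try:
        return FakeChart(first_arg).width
    except AttributeError as exc:
        return str(exc)


old_style = try_create("共同所在群聊分析")   # title passed positionally (0.5.x habit)
new_style = try_create({"width": "600px"})  # options passed as a mapping
print(old_style)
print(new_style)
```

In pyecharts 1.x the title is normally set with `set_global_opts(title_opts=opts.TitleOpts(title=...))`. Installing a 0.5.x release that matches the script is probably the smaller change, since other chart calls in the script may differ between versions as well.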

python taobao_login.py
Traceback (most recent call last):
  File "taobao_login.py", line 72, in <module>
    a.login() #登录
  File "taobao_login.py", line 32, in login
    self.browser.find_element_by_xpath('//*[@Class="forget-pwd J_Quick2Static"]').click()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: element not interactable
(Session info: chrome=72.0.3626.119)
(Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.15.0-46-generic x86_64)

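Cannot get past the Taobao slider CAPTCHA

Hello, whenever the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh and try again (error:NgaRgk)", and the program then reports 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?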

Login works and the first page of data is crawled, but the slider step fails afterwards

The log is as follows:
包邮威德博威No.5羽毛球五号场馆训练比赛12只装耐打稳定室内高手月成交68评价732旺旺在线 55.00 //detail.tmall.com/item.htm?id=13074420768&skuId=22276410079&areaId=310100&user_id=748152180&cat_id=50043727&is_b=1&rn=aa934dd095dab511c3ed3f2cde0da7b6

get button failed: Message: move target out of bounds
(Session info: chrome=76.0.3809.87)

Traceback (most recent call last):
  File "C:/Users/Daoling/Downloads/Tmall.py", line 216, in <module>
    a.crawl_good_data() # 爬取天猫商品数据
  File "C:/Users/Daoling/Downloads/Tmall.py", line 150, in crawl_good_data
    EC.presence_of_element_located((By.CSS_SELECTOR, '#J_ItemList > div.product > div.product-iWrap')))
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Error screenshot:

AttributeError: 'str' object has no attribute 'get'

正在获取微信好友数据信息，请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊，请耐心等待……
Traceback (most recent call last):
  File "getUrl.py", line 532, in <module>
    group_common_in()
  File "getUrl.py", line 498, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 143, in __init__
    super().__init__(init_opts=init_opts)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 15, in __init__
    super().__init__(init_opts=init_opts)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/base.py", line 29, in __init__
    self.width = _opts.get("width")
AttributeError: 'str' object has no attribute 'get'

Generate_wx_data.py typo?

Line 94, map.add("", provice. Should it be province?

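Usage

Could the user manual be more detailed? How do I run this on Linux?

The WeChat message-sending feature raises KeyError: 'pass_ticket'. Has the account been banned? Is wxpy still usable?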

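Login verification problem in the Tmall crawler

When doing Tmall's slider verification, dragging all the way in one go with the move_by_offset function seems to be detected as non-human.
Dragging step by step along a track in a loop is very slow and is also flagged as abnormal. What can be done about this?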

> Hello, whenever the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh and try again (error:NgaRgk)", and the program then reports 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?

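It currently works fine in my tests. If it does not work for you, log in on the web page manually and complete the slider once; after that it usually will not prompt again.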

Originally posted by @shengqiangzhang in #4 (comment)

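Why is it that, on my side, whenever the page is opened through selenium, even manual login fails? This happens even with developer mode added. This problem has been bothering me for a long time.

Hi, is this a pyecharts version problem?

One-click personal WeChat data report raises an error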

Traceback (most recent call last):
  File "generate_wx_data.py", line 617, in <module>
  File "generate_wx_data.py", line 317, in merge_head_image
  File "site-packages\PIL\Image.py", line 1817, in resize
  File "site-packages\PIL\ImageFile.py", line 239, in load
OSError: image file is truncated (0 bytes not processed)
[8144] Failed to execute script generate_wx_data
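
Pillow raises this error when an image file ends before its data is complete, which here most likely means an avatar download was cut short. One possible workaround is to skip such files before merging them. The sketch below assumes the avatars are JPEG files; `looks_like_complete_jpeg` is a hypothetical helper, and the sample bytes are synthetic, not real images.

```python
import os
import tempfile

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic check: file is non-trivial, starts with SOI and ends with EOI."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == JPEG_START and tail == JPEG_END


# Demonstration with synthetic byte strings written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
samples = {
    "complete.jpg": JPEG_START + b"payload" + JPEG_END,
    "truncated.jpg": JPEG_START + b"payl",
    "empty.jpg": b"",
}
results = {}
for name, data in samples.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    results[name] = looks_like_complete_jpeg(path)
print(results)
```

This is only a heuristic, since some valid JPEG files carry trailing bytes after the end marker. Pillow also offers `ImageFile.LOAD_TRUNCATED_IMAGES = True`, which loads whatever data is present instead of raising.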

configparser.NoSectionError: No section: 'configuration'

Traceback (most recent call last):
  File "configparser.py", line 1138, in _unify_values
KeyError: 'configuration'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "say_to_lady.py", line 175, in <module>
  File "configparser.py", line 781, in get
  File "configparser.py", line 1141, in _unify_values
configparser.NoSectionError: No section: 'configuration'
[1812] Failed to execute script say_to_lady
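
A common cause of `NoSectionError` is that the ini file was never found: `ConfigParser.read()` silently skips missing files, so the failure only surfaces at the first `get()`. The sketch below checks both conditions explicitly. The file name `config.ini`, the key `greeting`, and the helper `load_setting` are assumptions for illustration; only the section name `configuration` comes from the traceback.

```python
import configparser
import os
import tempfile


def load_setting(ini_path, section, key):
    """Read one value, reporting a missing file or section explicitly."""
    config = configparser.ConfigParser()
    # read() returns the list of files it parsed and skips missing ones silently.
    loaded = config.read(ini_path, encoding="utf-8")
    if not loaded:
        raise FileNotFoundError(f"config file not found: {ini_path}")
    if not config.has_section(section):
        raise KeyError(f"section [{section}] missing in {ini_path}")
    return config.get(section, key)


# Demonstration in a temporary directory.
tmp_dir = tempfile.mkdtemp()
ini_path = os.path.join(tmp_dir, "config.ini")
with open(ini_path, "w", encoding="utf-8") as f:
    f.write("[configuration]\ngreeting = hello\n")

value = load_setting(ini_path, "configuration", "greeting")

try:
    load_setting(os.path.join(tmp_dir, "missing.ini"), "configuration", "greeting")
    missing_error = None
except FileNotFoundError as exc:
    missing_error = type(exc).__name__
print(value, missing_error)
```

If the program opens the ini file by a relative path, it has to be started from the directory that contains the file. Whether `say_to_lady` does so is an assumption, as the traceback does not show the path it uses.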

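Could you build a PDF export for WeChat Book (心书, Xinshu)? 出书啦 (Chushula) has a limitation: it cannot see posts whose personal group is set to private. Thanks!

I also wanted to modify the code and build one myself, but when I call the system print-to-PDF directly on a Xinshu page, all the images come out blank (and this is not lazy loading). Looking at the source, the images all seem to be loaded through CSS, and I do not know how to handle that.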

Generating the WeChat report fails with ImportError: cannot import name 'Pie'

Generating the WeChat report fails with: ImportError: cannot import name 'Pie'
Trying to install a Pie module gives: ERROR: Could not find a version that satisfies the requirement pie (from versions: none)
ERROR: No matching distribution found for pie
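
`Pie` is a chart class inside pyecharts rather than a separate package, which is why `pip install pie` finds nothing. In pyecharts 0.5.x it is imported with `from pyecharts import Pie`, and in 1.x with `from pyecharts.charts import Pie`. A generic way to tolerate both layouts is sketched below. `import_first` is a hypothetical helper, and the demonstration uses standard-library modules so that it runs without pyecharts installed.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that provides it."""
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {module_paths}")


# Stdlib demonstration: the first module path does not exist, the second one does.
OrderedDict = import_first("OrderedDict", ["no_such_module_xyz", "collections"])
print(OrderedDict.__name__)
```

For this issue the call would look like `Pie = import_first("Pie", ["pyecharts.charts", "pyecharts"])`. The chart-building calls also differ between the two versions, so fixing the import alone may not be enough to make the script run.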

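Have you actually tested this yourself? I tried for a long time and it never works.

I tried your method and also looked up other methods online. As long as the browser is opened by selenium, it gets detected, whether I modify the config file or switch to developer mode. Logging in with an account hits a slider that cannot be passed, even when dragged by hand. Logging in through Weibo hides the CAPTCHA, yet it still has to be entered.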

The Douban GUI opens to a blank window

As shown in the screenshot.

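With the prebuilt exe of the girlfriend-reminder tool, the automatic message is only sent after I click the window or press Enter once the scheduled time arrives

For the tool that sends WeChat reminders to your girlfriend at different times each day, what is the command to run it? Surely it is not enough to create an ini config file in some arbitrary directory and run it?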

How do I use 'excludeSwitches', ['enable-automation'] in headless mode?

Does it not work even with xvfb? I am wondering how you all deploy this to a server.

Weibo login requires a CAPTCHA

    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
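
"Expecting value: line 1 column 1 (char 0)" means the text handed to the JSON decoder did not start with a JSON value at all. Typical causes are an empty body or an HTML page, such as a login or CAPTCHA page, returned in place of the expected data. A small sketch that reports what was actually received follows; `parse_json` is a hypothetical helper and the sample strings are made up.

```python
import json


def parse_json(text):
    """Return (value, None) on success or (None, description) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        preview = text[:60]
        return None, f"{exc.msg} at char {exc.pos}; body starts with {preview!r}"


data, error = parse_json('{"retcode": 0}')
_, html_error = parse_json("<html>login required</html>")
_, empty_error = parse_json("")
print(data, error)
print(html_error)
print(empty_error)
```

Logging the start of the body this way usually shows quickly whether the server answered with a verification page rather than JSON.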

Crawling 5K ultra-HD wallpapers

The program reports errors:

Instance of 'closing' has no 'headers' member
Instance of 'closing' has no 'iter_content' member

The module is already imported, but it still fails.
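
These two messages match the format of pylint's no-member check (E1101), so they are probably static-analysis warnings from the editor and not runtime errors. At runtime, `contextlib.closing` hands back the wrapped object from `with ... as`, so its attributes are available. The sketch below shows this with `FakeResponse`, a hypothetical stand-in for a streamed HTTP response.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a streamed HTTP response."""

    def __init__(self):
        self.headers = {"content-length": "6"}
        self.closed = False

    def iter_content(self, chunk_size):
        data = b"abcdef"
        for i in range(0, len(data), chunk_size):
            yield data[i:i + chunk_size]

    def close(self):
        self.closed = True


response = FakeResponse()
with closing(response) as r:
    same_object = r is response               # `r` is the wrapped object itself
    total = int(r.headers["content-length"])
    chunks = list(r.iter_content(chunk_size=4))
print(same_object, total, chunks, response.closed)
```

If the script itself runs correctly, the warning can be silenced for those lines or ignored.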

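Tmall crawler

When I use the Tmall product data crawler, it only crawls the first page. The command line then prints NOT IMPLEMENTED and no further data is crawled.

[Bug report] Scanning with WeChat shows an unrecognizable QR code

The history file is 215 MB and the web page crashes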

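Tmall crawler: a problem and a suggested change

The crawler fails when the search returns less than one full page of products. Add the following code before the crawling starts.

        # Requires: from selenium.common.exceptions import NoSuchElementException
        # Stop early when the search returns no results ("喵~没找到" is the page's "nothing found" text).
        err1 = self.browser.find_element_by_xpath("//*[@id='content']/div/div[2]").text
        if err1[:5] == "喵~没找到":
            print("出错了")
            return

        # Stop early when Tmall shows its recommendation block ("我们还为您...") because results do not fill a page.
        try:
            err2 = self.browser.find_element_by_xpath("//*[@id='J_ComboRec']/div[1]").text
            if err2[:5] == "我们还为您":
                print("出错了")
                return
        except NoSuchElementException:
            # Assumed handler: the original snippet ends before its except clause.
            pass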

Why can't the modified code be packaged with PyInstaller?

It breaks as soon as the wxpy package is included.

The program won't run

Getting uuid of QR code.
INFO:itchat:Getting uuid of QR code.
Downloading QR code.
INFO:itchat:Downloading QR code.
Traceback (most recent call last):
  File "generate_wx_data.py", line 542, in <module>
    bot = Bot(cache_path=True)
  File "E:\Program Files\python37\lib\site-packages\wxpy\api\bot.py", line 86, in __init__
    loginCallback=login_callback, exitCallback=logout_callback
  File "E:\Program Files\python37\lib\site-packages\itchat\components\register.py", line 30, in auto_login
    loginCallback=loginCallback, exitCallback=exitCallback)
  File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 44, in login
    picDir=picDir, qrCallback=qrCallback)
  File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 117, in get_QR
    utils.print_qr(picDir)
  File "E:\Program Files\python37\lib\site-packages\itchat\utils.py", line 85, in print_qr
    os.startfile(fileDir)
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'

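The new version of Zhihu is hard to crawl and everything gets blocked. Could you add a Zhihu example?

So here is the question: where do I find a girlfriend?

CSS selector for the total page count

#content > div > div.ui-page > div > b.ui-page-skip > form > input[type="hidden"]:nth-child(7)
This differs from what you wrote originally. Is there still a way to work around it?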

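One-click WeChat personal report hangs at face recognition on Windows 7

It has been stuck on the first face-recognition step for a long time.

The following problem occurs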

Exception in thread generate_data:
Traceback (most recent call last):
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "main.py", line 152, in generate_data
TypeError: 'NoneType' object is not subscriptable

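Could you publish a copy on Gitee (码云)? The repository is too large and downloads keep timing out!

Got it. So where do I pick up my girlfriend?

As the title says.

[3Q] Learned a lot, this is exactly what I needed

Theory courses have been giving me a headache lately, so this is a nice change of pace.

Application not found: 'QR.png'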

C:\Users\Administrator>C:\Users\Administrator\Desktop\say_to_lady\say_to_lady.exe
Getting uuid of QR code.
Downloading QR code.
Traceback (most recent call last):
  File "say_to_lady.py", line 154, in <module>
  File "site-packages\wxpy\api\bot.py", line 86, in __init__
  File "site-packages\itchat\components\register.py", line 35, in auto_login
  File "site-packages\itchat\components\login.py", line 44, in login
  File "site-packages\itchat\components\login.py", line 117, in get_QR
  File "site-packages\itchat\utils.py", line 85, in print_qr
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'
[3992] Failed to execute script say_to_lady

shengqiangzhang / examples-of-web-crawlers