shengqiangzhang / examples-of-web-crawlers

Some interesting examples of Python crawlers that are friendly to beginners, mainly crawling sites such as Taobao, Tmall, WeChat, WeChat Read, Douban, and QQ.

License: MIT License

Python 88.25% JavaScript 3.41% CSS 1.36% HTML 6.98%
crawler spider taobao tmall example python selenium pyquery stock fund multithreading agent-pool wechat wechat-report wereader

examples-of-web-crawlers's Issues

The following problem occurs

Exception in thread generate_data:
Traceback (most recent call last):
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "main.py", line 152, in generate_data
TypeError: 'NoneType' object is not subscriptable

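Taobao: cannot get past the slider CAPTCHA

Hello, when the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh and try again (error:NgaRgk)", and the program then prints 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?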

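Tmall crawler: a problem and a fix

The crawler fails when the search returns less than one full page of products. Before it starts crawling, you should add the following code.

# Requires: from selenium.common.exceptions import NoSuchElementException
# Place inside the crawl method, before any product data is scraped.
err1 = self.browser.find_element_by_xpath("//*[@id='content']/div/div[2]").text
if err1[:5] == "喵~没找到":  # Tmall's "Meow~ nothing found" notice
    print("An error occurred")
    return

try:
    err2 = self.browser.find_element_by_xpath("//*[@id='J_ComboRec']/div[1]").text
except NoSuchElementException:
    err2 = ""  # the recommendation block is absent
if err2[:5] == "我们还为您":  # first five characters of the recommendation banner
    print("An error occurred")
    return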

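OP, have you tested this yourself? I tried for a long time and it never works.

I tried your method and also looked up other methods online. As long as the browser is opened by selenium, it gets detected, whether I modify the configuration file or switch to developer mode. Logging in with an account brings up a slider that cannot be passed, even when dragged by hand. Logging in through Weibo, the CAPTCHA is not visible, yet it still has to be entered.

It doesn't work

This doesn't work. Taobao still detects it; even with developer mode added, Taobao can still detect it.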

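> Hello, when the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh and try again (error:NgaRgk)", and the program then prints 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?

Hello, when the slider CAPTCHA pops up I can never get past it. The page shows "Oops, something went wrong. Click refresh and try again (error:NgaRgk)", and the program then prints 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?

It currently tests fine on my side. If it does not work for you, log in on the web page manually and complete the slider once; after that it usually will not prompt again.

Originally posted by @shengqiangzhang in #4 (comment)

OP, why is it that on my side, whenever the page is opened with selenium, I cannot log in even manually? This happens even after enabling developer mode. This problem has been bothering me for a long time.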

Application not found: 'QR.png' (找不到应用程序: 'QR.png')

C:\Users\Administrator>C:\Users\Administrator\Desktop\say_to_lady\say_to_lady.exe
Getting uuid of QR code.
Downloading QR code.
Traceback (most recent call last):
File "say_to_lady.py", line 154, in <module>
File "site-packages\wxpy\api\bot.py", line 86, in __init__
File "site-packages\itchat\components\register.py", line 35, in auto_login
File "site-packages\itchat\components\login.py", line 44, in login
File "site-packages\itchat\components\login.py", line 117, in get_QR
File "site-packages\itchat\utils.py", line 85, in print_qr
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'
[3992] Failed to execute script say_to_lady

Taobao login

The Taobao login code contains CSS selectors for Weibo; it seems some things have not been fully updated.

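8. One-click generation of a personal WeChat data report (understand your WeChat social history): cannot run

Following the README.md instructions, I uninstalled and reinstalled the dependency. After running the program and scanning the QR code, it fails with this error: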

正在获取微信好友数据信息,请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊,请耐心等待……
Traceback (most recent call last):
  File "d:/code/8/generate_wx_data.py", line 563, in <module>
    group_common_in()
  File "d:/code/8/generate_wx_data.py", line 526, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 148, in __init__
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 14, in __init__ 
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\base.py", line 28, in __init__  
    self.width = _opts.get("width", "900px")
AttributeError: 'str' object has no attribute 'get'

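CSS selector for the total page count

#content > div > div.ui-page > div > b.ui-page-skip > form > input[type="hidden"]:nth-child(7)
This is no longer the same as the selector you originally wrote. Is there any way to work around it now?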

Login works and the first page is crawled, but the slider step then fails

The log is as follows:
包邮威德博威No.5羽毛球五号场馆训练比赛12只装耐打稳定室内高手 月成交68评价732旺旺在线 55.00 //detail.tmall.com/item.htm?id=13074420768&skuId=22276410079&areaId=310100&user_id=748152180&cat_id=50043727&is_b=1&rn=aa934dd095dab511c3ed3f2cde0da7b6

get button failed: Message: move target out of bounds
(Session info: chrome=76.0.3809.87)

Traceback (most recent call last):
File "C:/Users/Daoling/Downloads/Tmall.py", line 216, in <module>
a.crawl_good_data() # 爬取天猫商品数据
File "C:/Users/Daoling/Downloads/Tmall.py", line 150, in crawl_good_data
EC.presence_of_element_located((By.CSS_SELECTOR, '#J_ItemList > div.product > div.product-iWrap')))
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Error screenshot:

微信图片_20190804114913

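Crawling 5K-resolution ultra-HD wallpapers

The program reports these errors:

  1. Instance of 'closing' has no 'headers' member
  2. Instance of 'closing' has no 'iter_content' member

The module is already imported, but it still does not work.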
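
The wording of these two messages matches pylint's no-member check, which is a static-analysis report rather than a runtime failure. Assuming the script wraps a streamed response in contextlib.closing (an assumption, since the issue does not show the code), the sketch below illustrates why the attributes are still reachable at runtime: closing hands back the wrapped object from its `with` statement. FakeResponse and read_all are hypothetical stand-ins so the example runs without network access.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a streamed HTTP response object."""

    def __init__(self):
        self.headers = {"content-length": "3"}
        self.closed = False

    def iter_content(self, chunk_size=1):
        # Yield the body in small chunks, like a streamed download would.
        yield from (b"a", b"b", b"c")

    def close(self):
        self.closed = True


def read_all(response):
    """Read headers and body through closing(), then let it close the response."""
    with closing(response) as r:
        # r is the wrapped response itself, so its attributes are available.
        size = int(r.headers["content-length"])
        data = b"".join(r.iter_content(chunk_size=1))
    return size, data
```

If the script does run correctly, the messages can be treated as linter noise; if it really fails at runtime, the full traceback would be needed to say more.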

configparser.NoSectionError: No section: 'configuration'

Traceback (most recent call last):
File "configparser.py", line 1138, in _unify_values
KeyError: 'configuration'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "say_to_lady.py", line 175, in <module>
File "configparser.py", line 781, in get
File "configparser.py", line 1141, in _unify_values
configparser.NoSectionError: No section: 'configuration'
[1812] Failed to execute script say_to_lady
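
One possible cause, offered as a guess rather than a diagnosis: ConfigParser.read() silently skips files it cannot find, so a missing or misplaced config file only surfaces later as NoSectionError. The sketch below checks both conditions explicitly. The helper name load_section and the error messages are illustrative and not taken from say_to_lady.py.

```python
import configparser
from pathlib import Path


def load_section(path, section):
    """Return one section of an INI file as a dict, with explicit errors."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; missing files are skipped silently.
    files_read = parser.read(path, encoding="utf-8")
    if not files_read:
        raise FileNotFoundError(f"config file not found: {Path(path).resolve()}")
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] missing in {path}")
    return dict(parser[section])
```

Printing the resolved path in the error makes it easier to see whether the program is looking in the current working directory instead of the folder that contains the executable.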

AttributeError: 'str' object has no attribute 'get'

正在获取微信好友数据信息,请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊,请耐心等待……
Traceback (most recent call last):
File "getUrl.py", line 532, in <module>
group_common_in()
File "getUrl.py", line 498, in group_common_in
bar = Bar('共同所在群聊分析')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 143, in __init__
super().__init__(init_opts=init_opts)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 15, in __init__
super().__init__(init_opts=init_opts)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/base.py", line 29, in __init__
self.width = _opts.get("width")
AttributeError: 'str' object has no attribute 'get'
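
The traceback shows pyecharts 1.2.1 receiving the chart title as its first positional argument and treating it as init_opts, which is why a str ends up with .get called on it. The 0.5.x series accepted a title in that position, so the script appears to have been written for the older API. A minimal sketch of a version check follows; the helper names are hypothetical, and installing a 0.5.x release as the fix is an assumption about what the README intends.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a dotted version string such as '1.2.1'."""
    return int(version.split(".")[0])


def title_as_first_arg_is_safe(version):
    # Assumption: only pre-1.0 pyecharts accepts Bar('title') positionally.
    return major_version(version) < 1


def installed_pyecharts_version():
    """Return the installed pyecharts version, or None if it is not installed."""
    try:
        return metadata.version("pyecharts")
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_pyecharts_version() before running the report script shows which API generation is installed. For 1.x, the alternative to downgrading is to rewrite the chart construction for the newer options-based API.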

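Taobao login via Weibo shows a network connection timeout

When I run your Taobao login code and log in through Weibo, it shows a network connection timeout. Selenium may have been detected; I can log in successfully when I operate a normal browser by hand. I do not know how to solve this.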

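Tmall crawler

When I use the Tmall product data crawler, only the first page of data is crawled. The command line then shows NOT IMPLEMENTED, and after that no more data can be crawled.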

Error in the one-click personal WeChat data report generator

Traceback (most recent call last):
File "generate_wx_data.py", line 617, in <module>
File "generate_wx_data.py", line 317, in merge_head_image
File "site-packages\PIL\Image.py", line 1817, in resize
File "site-packages\PIL\ImageFile.py", line 239, in load
OSError: image file is truncated (0 bytes not processed)
[8144] Failed to execute script generate_wx_data
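
Pillow raises this error when an image file ends before all of its data has been read, and "0 bytes not processed" suggests a file that holds little or no data. One plausible cause, not confirmed by the issue, is an avatar that was saved empty or incompletely. The sketch below filters out missing and zero-byte files before they are merged; usable_image_paths is a hypothetical helper, and it does not catch files that are only partially truncated.

```python
import os


def usable_image_paths(paths):
    """Keep only paths that exist as regular files and contain at least one byte."""
    return [p for p in paths if os.path.isfile(p) and os.path.getsize(p) > 0]
```

For partially truncated files, Pillow also offers the ImageFile.LOAD_TRUNCATED_IMAGES flag, which makes it load whatever data is present instead of raising.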

Cannot run the program

Getting uuid of QR code.
INFO:itchat:Getting uuid of QR code.
Downloading QR code.
INFO:itchat:Downloading QR code.
Traceback (most recent call last):
File "generate_wx_data.py", line 542, in <module>
bot = Bot(cache_path=True)
File "E:\Program Files\python37\lib\site-packages\wxpy\api\bot.py", line 86, in __init__
loginCallback=login_callback, exitCallback=logout_callback
File "E:\Program Files\python37\lib\site-packages\itchat\components\register.py", line 30, in auto_login
loginCallback=loginCallback, exitCallback=exitCallback)
File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 44, in login
picDir=picDir, qrCallback=qrCallback)
File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 117, in get_QR
utils.print_qr(picDir)
File "E:\Program Files\python37\lib\site-packages\itchat\utils.py", line 85, in print_qr
os.startfile(fileDir)
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'

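Tmall crawler: login verification problem

During Tmall slider verification, using move_by_offset to drag straight to the end in one motion seems to be detected as non-human.
Dragging along a track in a loop is slow and is also flagged as abnormal. What can be done about this?

Usage

Could the user manual be more detailed? How do I run this on Linux?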

error when login

python taobao_login.py
Traceback (most recent call last):
File "taobao_login.py", line 72, in <module>
a.login() # log in
File "taobao_login.py", line 32, in login
self.browser.find_element_by_xpath('//*[@class="forget-pwd J_Quick2Static"]').click()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: element not interactable
(Session info: chrome=72.0.3626.119)
(Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.15.0-46-generic x86_64)
