shengqiangzhang / examples-of-web-crawlers

Some interesting examples of Python crawlers that are friendly to beginners, mainly crawling sites such as Taobao, Tmall, WeChat, WeRead (微信读书), Douban, and QQ.

License: MIT License

Python 88.25% JavaScript 3.41% CSS 1.36% HTML 6.98%
agent-pool crawler example fund multithreading pyquery python selenium spider stock taobao tmall wechat wechat-report wereader

examples-of-web-crawlers's People

Contributors

acfboy, dahuayuan, ernienishino, k27dong, linguoquan13, shengqiangzhang, xsohydra

examples-of-web-crawlers's Issues

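8. One-click generation of a personal WeChat data report (learn your WeChat social history): cannot run

Following readme.md, I uninstalled the dependencies and reinstalled them. After running the script and scanning the QR code, it reports an error: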

正在获取微信好友数据信息,请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊,请耐心等待……
Traceback (most recent call last):
  File "d:/code/8/generate_wx_data.py", line 563, in <module>
    group_common_in()
  File "d:/code/8/generate_wx_data.py", line 526, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 148, in __init__
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\chart.py", line 14, in __init__ 
    super().__init__(init_opts=init_opts)
  File "C:\Users\youyim\AppData\Local\Programs\Python\Python36\lib\site-packages\pyecharts\charts\base.py", line 28, in __init__  
    self.width = _opts.get("width", "900px")
AttributeError: 'str' object has no attribute 'get'
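
The traceback above shows that the installed pyecharts passes the first positional argument of `Bar(...)` on as `init_opts`, so a title string ends up where an options object is expected. A minimal sketch of one possible adjustment follows, assuming pyecharts 1.x is installed; `uses_v1_api` and `build_group_bar` are hypothetical helper names, not functions from the repository, and the series label is a placeholder.

```python
def uses_v1_api(version: str) -> bool:
    """Return True when a pyecharts version string is 1.0 or newer."""
    return int(version.split(".")[0]) >= 1


def build_group_bar(title, names, counts):
    """Build a bar chart with the pyecharts 1.x API.

    pyecharts is imported lazily so that this module can be imported
    even on a machine where the library is not installed.
    """
    from pyecharts import options as opts
    from pyecharts.charts import Bar

    bar = Bar()  # 1.x: do not pass the title positionally
    bar.add_xaxis(list(names))
    bar.add_yaxis("groups in common", list(counts))  # placeholder series name
    # In 1.x the title is set through global options instead.
    bar.set_global_opts(title_opts=opts.TitleOpts(title=title))
    return bar
```

With pyecharts 1.x installed, `build_group_bar('共同所在群聊分析', names, counts).render("groups.html")` would write the chart to an HTML file. Installing the older pyecharts release that the script was written against is the other route, if the README pins one.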

Taobao login

The Taobao login code contains CSS style selectors for weibo; it looks like some things were not fully changed over.

error when login

python taobao_login.py
Traceback (most recent call last):
  File "taobao_login.py", line 72, in <module>
    a.login()  # log in
  File "taobao_login.py", line 32, in login
    self.browser.find_element_by_xpath('//*[@class="forget-pwd J_Quick2Static"]').click()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotVisibleException: Message: element not interactable
  (Session info: chrome=72.0.3626.119)
  (Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.15.0-46-generic x86_64)

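Taobao: cannot get past the slider CAPTCHA

Hello, when the slider CAPTCHA pops up I can never get past it. The page shows "哎呀,出错了,点击刷新再来一次(error:NgaRgk)" ("Oops, something went wrong, click to refresh and try again"), and then the program shows 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?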

Login works and the first page of data is crawled, but the slider step then fails

The log is as follows:
包邮威德博威No.5羽毛球五号场馆训练比赛12只装耐打稳定室内高手 月成交68评价732旺旺在线 55.00 //detail.tmall.com/item.htm?id=13074420768&skuId=22276410079&areaId=310100&user_id=748152180&cat_id=50043727&is_b=1&rn=aa934dd095dab511c3ed3f2cde0da7b6

get button failed: Message: move target out of bounds
(Session info: chrome=76.0.3809.87)

Traceback (most recent call last):
  File "C:/Users/Daoling/Downloads/Tmall.py", line 216, in <module>
    a.crawl_good_data()  # crawl Tmall product data
  File "C:/Users/Daoling/Downloads/Tmall.py", line 150, in crawl_good_data
    EC.presence_of_element_located((By.CSS_SELECTOR, '#J_ItemList > div.product > div.product-iWrap')))
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Error screenshot:

微信图片_20190804114913

AttributeError: 'str' object has no attribute 'get'

正在获取微信好友数据信息,请耐心等待……
微信好友数据信息获取完毕

正在分析你的群聊,请耐心等待……
Traceback (most recent call last):
  File "getUrl.py", line 532, in <module>
    group_common_in()
  File "getUrl.py", line 498, in group_common_in
    bar = Bar('共同所在群聊分析')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 143, in __init__
    super().__init__(init_opts=init_opts)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/chart.py", line 15, in __init__
    super().__init__(init_opts=init_opts)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyecharts-1.2.1-py3.7.egg/pyecharts/charts/base.py", line 29, in __init__
    self.width = _opts.get("width")
AttributeError: 'str' object has no attribute 'get'

Usage

Could the user manual be more detailed? How do I run this on Linux?

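Login verification problem with the Tmall crawler

During Tmall's slider verification, calling move_by_offset to drag straight to the end in one move seems to be detected as non-human.
Dragging in a loop along a track is slow and is also flagged as abnormal. What should I do about this?

> Hello, when the slider CAPTCHA pops up I can never get past it. The page shows "哎呀,出错了,点击刷新再来一次(error:NgaRgk)" ("Oops, something went wrong, click to refresh and try again"), and then the program shows 'selenium.common.exceptions.TimeoutException: Message:'. How can I get around this?

It currently works in my tests. If it does not work for you, log in through the web page manually and complete the slider once; after that it usually will not prompt again.

Originally posted by @shengqiangzhang in #4 (comment)

OP, why is it that whenever the page is opened with selenium, I cannot log in even manually? Enabling developer mode does not help either. This problem has been bothering me for a long time.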

Error when generating the personal WeChat data report in one click

Traceback (most recent call last):
  File "generate_wx_data.py", line 617, in <module>
  File "generate_wx_data.py", line 317, in merge_head_image
  File "site-packages\PIL\Image.py", line 1817, in resize
  File "site-packages\PIL\ImageFile.py", line 239, in load
OSError: image file is truncated (0 bytes not processed)
[8144] Failed to execute script generate_wx_data
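
The `image file is truncated` error is raised while Pillow loads one of the images being merged, which suggests that at least one downloaded avatar file is incomplete. A rough sketch of a pre-check that skips such files before merging is shown below. It assumes the avatars are saved as JPEG files; `looks_like_complete_jpeg` and `usable_images` are hypothetical helper names, and the marker test is only a heuristic.

```python
JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_like_complete_jpeg(data: bytes) -> bool:
    """Heuristic: a complete JPEG starts with SOI and ends with EOI."""
    return (
        len(data) >= 4
        and data.startswith(JPEG_START)
        and data.endswith(JPEG_END)
    )


def usable_images(paths):
    """Return only the paths whose file content passes the heuristic."""
    good = []
    for path in paths:
        with open(path, "rb") as handle:
            if looks_like_complete_jpeg(handle.read()):
                good.append(path)
    return good
```

An alternative is to set Pillow's `ImageFile.LOAD_TRUNCATED_IMAGES = True` before opening the files, which makes Pillow load whatever data is present instead of raising.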

configparser.NoSectionError: No section: 'configuration'

Traceback (most recent call last):
  File "configparser.py", line 1138, in _unify_values
KeyError: 'configuration'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "say_to_lady.py", line 175, in <module>
  File "configparser.py", line 781, in get
  File "configparser.py", line 1141, in _unify_values
configparser.NoSectionError: No section: 'configuration'
[1812] Failed to execute script say_to_lady
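
One common cause of `NoSectionError` is that the configuration file was never read: `ConfigParser.read` silently skips files it cannot find, so the error only shows up later at `get`. The sketch below makes that case explicit. The `configuration` section name comes from the traceback; `load_section` is a hypothetical helper, and whether a missing file is the cause here is an assumption.

```python
import configparser
import os


def load_section(path, section="configuration"):
    """Read one section of an INI file, failing loudly if it is absent."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it actually managed to parse.
    loaded = parser.read(path, encoding="utf-8")
    if not loaded:
        raise FileNotFoundError(
            "config file not found: " + os.path.abspath(path)
        )
    if not parser.has_section(section):
        raise KeyError("section missing from config file: " + section)
    return dict(parser[section])
```

When the program is started as a packaged .exe, the working directory may differ from the folder that holds the configuration file, so printing the absolute path in the error message helps to see where the file was expected.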

Login works and the first page of data is crawled, but the slider step then fails

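OP, have you actually tested this yourself? I tried for a long time and it never works.

I tried your method and also looked up other methods online. As long as the browser is opened by selenium, it is detected, even after modifying the configuration file and switching to developer mode. Logging in with an account brings up a slider that cannot be passed, even when dragged by hand. Logging in through Weibo requires a CAPTCHA that is not displayed but must still be entered.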

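Crawling 5K ultra-HD aesthetic wallpapers

The program reports errors:

  1. Instance of 'closing' has no 'headers' member
  2. Instance of 'closing' has no 'iter_content' member

The module has already been imported, but it still does not work.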
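
The wording of the two messages matches pylint's no-member check (E1101), which points to a static-analysis warning and not a runtime failure. At runtime, `contextlib.closing(obj)` hands back `obj` itself inside the `with` block, so its attributes are available. The sketch below illustrates this with a stand-in object; `FakeResponse` and `download` are hypothetical names used only for the demonstration, and no network request is made.

```python
from contextlib import closing


class FakeResponse:
    """Stand-in for a streamed HTTP response (hypothetical)."""

    def __init__(self, body: bytes):
        self.headers = {"content-length": str(len(body))}
        self._body = body
        self.closed = False

    def iter_content(self, chunk_size=1024):
        # Yield the body in fixed-size chunks, like a streamed download.
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]

    def close(self):
        self.closed = True


def download(response_factory):
    """Read a response through closing() and report what happened."""
    with closing(response_factory()) as response:
        # 'response' is the wrapped object, not the closing wrapper.
        total = int(response.headers["content-length"])
        data = b"".join(response.iter_content(chunk_size=4))
    return total, data, response.closed
```

If the script runs correctly and only the editor shows these messages, they can likely be ignored or silenced for that line.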

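Tmall crawler

When I use the Tmall product data crawler, I can only crawl the first page of data. The command line then shows NOT IMPLEMENTED and no further data can be crawled.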

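Problem with the Tmall crawler and a fix

An error occurs when the search results do not fill a full page. Before starting to crawl the data, you should add this code.

        # Requires: from selenium.common.exceptions import NoSuchElementException
        # The page shows text starting with "喵~没找到" ("Meow~ not found") when there are no results.
        err1 = self.browser.find_element_by_xpath("//*[@id='content']/div/div[2]").text
        if err1[:5] == "喵~没找到":
            print("出错了")  # "Something went wrong"
            return

        # A block starting with "我们还为您" ("We also ... for you") is checked next.
        try:
            err2 = self.browser.find_element_by_xpath("//*[@id='J_ComboRec']/div[1]").text
        except NoSuchElementException:
            # Assumption: the posted snippet had no except clause; a missing
            # block is treated here as "nothing to report".
            err2 = ""
        if err2[:5] == "我们还为您":
            print("出错了")
            return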

The program will not run

Getting uuid of QR code.
INFO:itchat:Getting uuid of QR code.
Downloading QR code.
INFO:itchat:Downloading QR code.
Traceback (most recent call last):
  File "generate_wx_data.py", line 542, in <module>
    bot = Bot(cache_path=True)
  File "E:\Program Files\python37\lib\site-packages\wxpy\api\bot.py", line 86, in __init__
    loginCallback=login_callback, exitCallback=logout_callback
  File "E:\Program Files\python37\lib\site-packages\itchat\components\register.py", line 30, in auto_login
    loginCallback=loginCallback, exitCallback=exitCallback)
  File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 44, in login
    picDir=picDir, qrCallback=qrCallback)
  File "E:\Program Files\python37\lib\site-packages\itchat\components\login.py", line 117, in get_QR
    utils.print_qr(picDir)
  File "E:\Program Files\python37\lib\site-packages\itchat\utils.py", line 85, in print_qr
    os.startfile(fileDir)
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'

CSS selector for the total page count

#content > div > div.ui-page > div > b.ui-page-skip > form > input[type="hidden"]:nth-child(7)
This is no longer the same as what you wrote at the time. Is there still a way to work around it?

The following problem occurs

Exception in thread generate_data:
Traceback (most recent call last):
  File "threading.py", line 916, in _bootstrap_inner
  File "threading.py", line 864, in run
  File "main.py", line 152, in generate_data
TypeError: 'NoneType' object is not subscriptable

Application not found (找不到应用程序): 'QR.png'

C:\Users\Administrator>C:\Users\Administrator\Desktop\say_to_lady\say_to_lady.exe
Getting uuid of QR code.
Downloading QR code.
Traceback (most recent call last):
  File "say_to_lady.py", line 154, in <module>
  File "site-packages\wxpy\api\bot.py", line 86, in __init__
  File "site-packages\itchat\components\register.py", line 35, in auto_login
  File "site-packages\itchat\components\login.py", line 44, in login
  File "site-packages\itchat\components\login.py", line 117, in get_QR
  File "site-packages\itchat\utils.py", line 85, in print_qr
OSError: [WinError -2147221003] 找不到应用程序: 'QR.png'
[3992] Failed to execute script say_to_lady

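Taobao login via Weibo shows a network connection timeout

When I run your Taobao login code and log in through Weibo, it shows a network connection timeout. Selenium may have been detected, since logging in manually in a normal browser works. I do not know how to solve this.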

It does not work

This does not work. Taobao still detects it; even with this developer mode added, Taobao can still detect it.
