
ruia-pyppeteer's People

Contributors

elfgzp, howie6879, ruiruizhou

ruia-pyppeteer's Issues

Ruia-pyppeteer cannot run on Linux?

Hi,

Thanks for this good work. I am now moving my ruia spiders to a Linux system.

When I run the program on Linux, it returns:

[2019:03:15 08:47:03] INFO Spider Spider started!
[2019:03:15 08:47:03] INFO Spider Worker started: 140541835336968
[2019:03:15 08:47:03] INFO Spider Worker started: 140541835337104
[2019:03:15 08:47:03] ERROR Request <Error: http://news.cnstock.com/bwsd/index.html 'PyppeteerRequest' object has no attribute 'browser'>
[2019:03:15 08:47:03] ERROR Spider 'NoneType' object has no attribute 'html'
[2019:03:15 08:47:03] INFO Spider Stopping spider: Ruia
[2019:03:15 08:47:03] INFO Spider Total requests: 0
[2019:03:15 08:47:03] INFO Spider Time usage: 0:00:00.136289
[2019:03:15 08:47:03] INFO Spider Spider finished!

The ruia spider ran perfectly well on Windows. Do you know the reason?
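
One way to narrow this down is to check whether pyppeteer itself can launch Chromium on that Linux machine; headless Chromium commonly needs the --no-sandbox flag when run as root or inside a container. A minimal sketch, independent of ruia:

import asyncio

from pyppeteer import launch


async def check_chromium():
    # --no-sandbox is commonly required on Linux when Chromium runs as root
    # or inside a container; adjust the args to your environment.
    browser = await launch(headless=True, args=['--no-sandbox'])
    page = await browser.newPage()
    await page.goto('http://news.cnstock.com/bwsd/index.html')
    html = await page.content()
    print(len(html))
    await browser.close()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(check_chromium())

If this fails, the problem is in the Chromium setup on the Linux box rather than in the spider code.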

Do you know how to build a correct request to this website?

This website seems to reject my requests no matter what page options I set.

import asyncio

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerRequest as Request

domain_page = 'http://stock.hexun.com/7x24h/'


class news_Item(Item):
    target_item = TextField(css_select='div.liveNews')
    publish_times = TextField(css_select="dl.newsDl.clearfix > dt:nth-child(1)", many=True)
    news_contents = TextField(css_select="dl.newsDl.clearfix > dd:nth-child(2)", many=True)


async def test():
    pyppeteer_page_options = {'waitUntil': 'networkidle2', 'timeout': 0}
    request = Request(domain_page, pyppeteer_page_options=pyppeteer_page_options)
    response = await request.fetch()
    item = await news_Item.get_item(html=response.html)
    for publish_time, text in zip(item.publish_times, item.news_contents):
        print(publish_time, text)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(test())

Thanks a lot
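
If the site really is rejecting the request rather than just rendering slowly, the default headless-Chromium user agent is a common trigger. A hedged sketch with plain pyppeteer that sets a desktop browser user agent (the UA string is only an example) and waits for the div.liveNews container from the Item above before reading the page:

import asyncio

from pyppeteer import launch

domain_page = 'http://stock.hexun.com/7x24h/'


async def fetch_with_ua():
    browser = await launch(headless=True)
    page = await browser.newPage()
    # Pretend to be a regular desktop browser; headless Chromium's default
    # user agent is an easy thing for sites to block.
    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
    )
    await page.goto(domain_page, waitUntil='networkidle2')
    # Wait until the live-news container targeted by the Item has rendered.
    await page.waitForSelector('div.liveNews')
    html = await page.content()
    print(len(html))
    await browser.close()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(fetch_with_ua())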

Is this output normal when running the example? OS: Win7 64-bit, Python 3.8 64-bit

......
[2020:02:21 14:32:04] INFO Spider <Item {'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2173855883.jpg', 'title': '变脸', 'abstract': '当发哥的风衣、墨镜出现在了凯奇身上⋯⋯'}>
[2020:02:21 14:32:04] ERROR Spider 'gbk' codec can't encode character '\u22ef' in position 19: illegal multibyte sequence
[2020:02:21 14:32:04] INFO Spider Stopping spider: Ruia
[2020:02:21 14:32:04] INFO Spider Total requests: 11
[2020:02:21 14:32:04] INFO Spider Time usage: 0:00:03.090177
[2020:02:21 14:32:04] INFO Spider Spider finished!
[2020:02:21 14:32:04] ERROR asyncio Task was destroyed but it is pending!
task: <Task pending name='Task-38' coro=<<async_generator_athrow without __name__>()>>
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000000002B12F70>
Traceback (most recent call last):
  File "C:\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Python38\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Python38\lib\asyncio\base_events.py", line 715, in call_soon
    self._check_closed()
  File "C:\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
(The same "Event loop is closed" traceback repeats several more times.)
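
Neither message is specific to ruia. The 'gbk' codec error comes from printing characters outside the Windows console's GBK code page, and the repeated "Event loop is closed" messages are a known shutdown artifact of the Proactor event loop on Windows with Python 3.8. A hedged workaround sketch, applied at the top of the script before the spider starts:

import asyncio
import sys

# Force UTF-8 (with replacement) on stdout so log lines containing characters
# outside the GBK code page do not raise UnicodeEncodeError (Python 3.7+).
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

# On Windows with Python 3.8 the default Proactor loop tends to emit
# "Event loop is closed" from transport finalizers at shutdown; switching to
# the selector loop avoids that noise.
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())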

Running jianshu_js_example.py frequently results in INFO Spider Total requests: 0

The log is as follows:

[2020:02:03 23:21:20] INFO  Spider  Spider started!
[2020:02:03 23:21:20] WARNING Spider  Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2020:02:03 23:21:20] WARNING Spider  Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2020:02:03 23:21:20] INFO  Spider  Worker started: 1987584893408
[2020:02:03 23:21:20] INFO  Spider  Worker started: 1987584893544
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:63889/devtools/browser/6a26ef07-748e-4644-8bfd-a8c1e8fe1c5a
[I:pyppeteer.launcher] terminate chrome process...
[2020:02:03 23:21:25] ERROR Spider  <Callback[parse]: Get target_item's value error!
[2020:02:03 23:21:25] INFO  Spider  Stopping spider: Ruia
[2020:02:03 23:21:25] INFO  Spider  Total requests: 0
[2020:02:03 23:21:25] INFO  Spider  Time usage: 0:00:04.869658
[2020:02:03 23:21:25] INFO  Spider  Spider finished!

Sometimes it works normally, but most of the time the log output looks like this. Does this mean the request did not succeed?
Windows 10 x64
Latest ruia and ruia-pyppeteer
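
One thing worth trying, since the log shows Chromium being terminated only a few seconds after launch: give the JS-rendered page more time to finish via pyppeteer_page_options (the same parameter used in the other issues here). A minimal hedged sketch; the URL is an assumption standing in for whatever jianshu_js_example.py actually targets:

import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

# Assumption: the example targets the jianshu front page; replace with the
# URL the example actually uses.
start_url = 'https://www.jianshu.com/'


async def fetch_patiently():
    # Wait for the network to go fully idle and allow up to 60 seconds,
    # so the JS-rendered content has time to appear before extraction.
    pyppeteer_page_options = {'waitUntil': 'networkidle0', 'timeout': 60000}
    request = Request(start_url, pyppeteer_page_options=pyppeteer_page_options)
    response = await request.fetch()
    print(len(response.html))


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(fetch_patiently())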

Running the example jianshu_js_example.py raises: 'JianshuSpider' object has no attribute 'kwargs'

I previously built a crawler with ruia-pyppeteer. When I ran it today it failed with a very strange error, so I created a fresh pipenv environment and ran the jianshu_js_example.py example, which also fails with an error like 'JianshuSpider' object has no attribute 'kwargs'.

It seems to appear as soon as the spider inherits via from ruia_pyppeteer import PyppeteerSpider as Spider.

pipenv install automatically pulled the latest versions of everything.
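
Since a fresh pipenv environment pulls the latest releases of both packages, this looks like a version mismatch between ruia and ruia-pyppeteer. A small hedged sketch (Python 3.8+, where importlib.metadata is in the standard library) to record exactly which versions were installed, so they can be compared with the last combination that worked:

# Print the installed versions of the relevant packages.
from importlib.metadata import version

for pkg in ('ruia', 'ruia-pyppeteer', 'pyppeteer'):
    print(pkg, version(pkg))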

Do you know why I cannot parse the bar names on this website?

domain page: http://barmap.hk/bars?page=1

import re

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerSpider as Spider

domain_page = 'http://barmap.hk/bars?page=1'


class bar_Item(Item):
    target_item = TextField(css_select="div#barmap > div.mdl-layout.mdl-js-layout.is-upgraded > div.mdl-layout__inner-container > div.mdl-layout__content.content > div > div.container > div.mdl-grid > div:nth-child(2)")
    bar_names = TextField(css_select="div#barmap div > a > div > div", many=True)


class bar_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 1

    async def parse(self, response):
        async for item in bar_Item.get_items(html=response.html):
            yield item

    async def process_item(self, item: bar_Item):
        for bar_name in item.bar_names:
            # Strip the react-text markers before printing the bar name.
            bar = re.sub(r'\(\'react\-text\: \d{3,4}', '', bar_name)
            print(bar)


if __name__ == '__main__':
    bar_Spider.start()

The output is always like this; however, I am 100% sure the CSS selector is correct.

[2019:03:25 14:21:10] INFO Spider Spider started!
[2019:03:25 14:21:10] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:25 14:21:10] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:25 14:21:10] INFO Spider Worker started: 2196814351352
[2019:03:25 14:21:10] INFO Spider Worker started: 2196814351488
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:59602/devtools/browser/6ae3158d-6011-43e1-b076-50d174e5cf84
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:25 14:21:12] ERROR Spider Extract div#barmap div > a > div > div error, please check selector or set parameter named default
[2019:03:25 14:21:12] INFO Spider Stopping spider: Ruia
[2019:03:25 14:21:12] INFO Spider Total requests: 1
[2019:03:25 14:21:12] INFO Spider Time usage: 0:00:02.102796
[2019:03:25 14:21:12] INFO Spider Spider finished!
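
barmap.hk appears to be a client-side React app (hence the react-text markers the regex strips), so the bar-name nodes may simply not be in the DOM yet when the HTML is captured. A hedged check with plain pyppeteer that waits for the selector before reading the text:

import asyncio

from pyppeteer import launch

domain_page = 'http://barmap.hk/bars?page=1'


async def check_selector():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(domain_page, waitUntil='networkidle2')
    # Give the React bundle a chance to render, then count the nodes the
    # CSS selector above is supposed to match.
    await page.waitForSelector('div#barmap div > a > div > div', timeout=30000)
    names = await page.querySelectorAllEval(
        'div#barmap div > a > div > div',
        'nodes => nodes.map(n => n.textContent)',
    )
    print(names[:5])
    await browser.close()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(check_selector())

If the waitForSelector call times out here too, the nodes never render for this client and the selector itself is not the problem.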

TypeError: __init__() got an unexpected keyword argument 'html'

import aiohttp
import asyncio
from ruia_pyppeteer import PyppeteerRequest as Request


async def request_example():
    url = "https://deck.tk/07Pw8tfr"
    params = {
        'name': 'ruia',
    }
    headers = {
        'User-Agent': 'Python3.6',
    }
    async with aiohttp.ClientSession() as session:
        request = Request(url=url, method='GET', params=params, request_session=session, headers=headers, load_js=True)
        response = await request.fetch()
        html = await response.text()
        print(html)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(request_example())

produces:

Traceback (most recent call last):                                                                                             
  File "C:\yourproject\spiders\yourspider.py", line 22, in <module>                            
    asyncio.get_event_loop().run_until_complete(request_example())                                                             
  File "c:\python39\lib\asyncio\base_events.py", line 642, in run_until_complete                                               
    return future.result()                                                                                                     
  File "C:\yourproject\spiders\yourspider.py", line 16, in request_example                     
    response = await request.fetch()                                                                                           
  File "C:\yourproject\env\lib\site-packages\ruia_pyppeteer\request.py", line 76, in fetch    
    response = PyppeteerResponse(                                                                                              
  File "C:\yourproject\env\lib\site-packages\ruia_pyppeteer\response.py", line 26, in __init__
    super(PyppeteerResponse, self).__init__(                                                                                   
TypeError: __init__() got an unexpected keyword argument 'html'                                                                
[I:pyppeteer.launcher] terminate chrome process... 
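
The traceback points at a mismatch between the installed ruia and ruia_pyppeteer: PyppeteerResponse passes an html keyword that the installed ruia Response no longer accepts. A hedged diagnostic sketch (assuming ruia exports Response at the package level) to confirm which keywords the installed Response constructor actually takes:

# Print the constructor signature of the installed ruia Response to see
# whether it still accepts an `html` keyword, as ruia_pyppeteer assumes.
import inspect

from ruia import Response  # assumption: Response is exported at package level

print(inspect.signature(Response.__init__))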

Is something wrong with my XPath, or has my IP been blocked by the vendor?

Hi,

Thanks for your help.

I am scraping the news from this website: https://xueqiu.com/?category=livenews

I have double-checked the XPath for the target item and the fields I want to extract, but ruia keeps saying my XPath cannot be located.

Would you help me check whether the problem comes from my IP address (being blocked) or from a mistake in my ruia spider?

import asyncio

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerRequest as Request

domain_page = 'https://xueqiu.com/?category=livenews'


class frame_Item(Item):
    target_item = TextField(xpath_select="//body/div[@id='app']/div[@class='AnonymousHome_container_2te']/div[@class='AnonymousHome_home__col--lf_2Fg']/div[@class='AnonymousHome_home__timeline_VTo']/div[2]/div/div")
    publish_month = TextField(xpath_select="div[@class='div.AnonymousHome_home__timeline-live__hd_JGP']")


async def test():
    pyppeteer_page_options = {'waitUntil': 'networkidle2', 'timeout': 0}
    request = Request(domain_page, pyppeteer_page_options=pyppeteer_page_options)
    response = await request.fetch()
    item = await frame_Item.get_item(html=response.html)
    print(item.publish_month)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(test())
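
A hedged way to tell a blocked/empty page apart from a wrong XPath is to dump the HTML the browser actually rendered and test an XPath against it with lxml. The contains() expression below is only an illustration; the class names on the live page may differ:

import asyncio

from lxml import etree
from ruia_pyppeteer import PyppeteerRequest as Request

domain_page = 'https://xueqiu.com/?category=livenews'


async def debug_xpath():
    request = Request(domain_page, pyppeteer_page_options={'waitUntil': 'networkidle2', 'timeout': 0})
    response = await request.fetch()
    # Save what the browser actually rendered; if this file is a captcha page
    # or an empty shell, the IP/anti-bot theory is the likelier one.
    with open('xueqiu_dump.html', 'w', encoding='utf-8') as f:
        f.write(response.html)
    # Then test an XPath against exactly that HTML.
    tree = etree.HTML(response.html)
    nodes = tree.xpath("//div[contains(@class, 'AnonymousHome_home__timeline')]")
    print('matched nodes:', len(nodes))


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(debug_xpath())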
