python-ruia / ruia-pyppeteer
A Ruia plugin for loading JavaScript - pyppeteer
License: MIT License
Hi,
Thanks for this good work. I am now moving my ruia spiders to a Linux system.
When I run the program on Linux, it returns:
[2019:03:15 08:47:03] INFO Spider Spider started!
[2019:03:15 08:47:03] INFO Spider Worker started: 140541835336968
[2019:03:15 08:47:03] INFO Spider Worker started: 140541835337104
[2019:03:15 08:47:03] ERROR Request <Error: http://news.cnstock.com/bwsd/index.html 'PyppeteerRequest' object has no attribute 'browser'>
[2019:03:15 08:47:03] ERROR Spider 'NoneType' object has no attribute 'html'
[2019:03:15 08:47:03] INFO Spider Stopping spider: Ruia
[2019:03:15 08:47:03] INFO Spider Total requests: 0
[2019:03:15 08:47:03] INFO Spider Time usage: 0:00:00.136289
[2019:03:15 08:47:03] INFO Spider Spider finished!
The ruia spider ran perfectly well on Windows. Do you know the reason?
When I request "https://www.jianshu.com/" I only get seven elements.
I'm using the example in this repo:
https://github.com/python-ruia/ruia-pyppeteer/blob/master/example/jianshu_js_example.py
This website must be rejecting my request, whatever page options I set.
import asyncio

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerRequest as Request

domain_page = 'http://stock.hexun.com/7x24h/'

class news_Item(Item):
    target_item = TextField(css_select='div.liveNews')
    publish_times = TextField(css_select="dl.newsDl.clearfix > dt:nth-child(1)", many=True)
    news_contents = TextField(css_select="dl.newsDl.clearfix > dd:nth-child(2)", many=True)

async def test():
    pyppeteer_page_options = {'waitUntil': 'networkidle2', 'timeout': 0}
    request = Request(domain_page, pyppeteer_page_options=pyppeteer_page_options)
    response = await request.fetch()
    item = await news_Item.get_item(html=response.html)
    for publish_time, text in zip(item.publish_times, item.news_contents):
        print(publish_time, text)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(test())
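For context on the snippet above: `pyppeteer_page_options` is passed through to pyppeteer's `page.goto()`. A sketch of the keys used here, with their meaning per the pyppeteer/puppeteer navigation docs:

```python
# Options forwarded to pyppeteer's page.goto() by ruia-pyppeteer:
pyppeteer_page_options = {
    # when navigation is considered finished; one of
    # 'load', 'domcontentloaded', 'networkidle0', 'networkidle2'
    'waitUntil': 'networkidle2',
    # navigation timeout in milliseconds; 0 disables the timeout
    'timeout': 0,
}
```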
Thanks a lot
......
[2020:02:21 14:32:04] INFO Spider <Item {'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2173855883.jpg', 'title': '变脸', 'abstract': '当发哥的风衣、墨镜出现在了凯奇身上⋯⋯'}>
[2020:02:21 14:32:04] ERROR Spider 'gbk' codec can't encode character '\u22ef' in position 19: illegal multibyte sequence
[2020:02:21 14:32:04] INFO Spider Stopping spider: Ruia
[2020:02:21 14:32:04] INFO Spider Total requests: 11
[2020:02:21 14:32:04] INFO Spider Time usage: 0:00:03.090177
[2020:02:21 14:32:04] INFO Spider Spider finished!
[2020:02:21 14:32:04] ERROR asyncio Task was destroyed but it is pending!
task: <Task pending name='Task-38' coro=<<async_generator_athrow without __name__>()>>
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000000002B12F70>
Traceback (most recent call last):
File "C:\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
self.close()
File "C:\Python38\lib\asyncio\proactor_events.py", line 108, in close
self._loop.call_soon(self._call_connection_lost, None)
File "C:\Python38\lib\asyncio\base_events.py", line 715, in call_soon
self._check_closed()
File "C:\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
(the same "Event loop is closed" traceback repeats nine more times)
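Both symptoms in the log above have well-known workarounds, sketched below (these are general Python-on-Windows fixes, not something specific to ruia): the 'gbk' codec error comes from printing a character the Windows console code page cannot encode, and the "Event loop is closed" noise at shutdown is a known Python 3.8 proactor-loop issue.

```python
import asyncio
import sys

# The log's failing character is U+22EF (midline ellipsis); GBK cannot
# encode it, while UTF-8 round-trips it without loss. Alternatively,
# sys.stdout.reconfigure(encoding='utf-8') forces UTF-8 console output.
text = "\u22ef"
try:
    text.encode("gbk")
except UnicodeEncodeError as exc:
    print("gbk cannot encode it:", exc.reason)
assert text.encode("utf-8").decode("utf-8") == text

# On Python 3.8 for Windows, switching off the default proactor loop
# before starting the spider commonly silences the shutdown tracebacks:
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```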
The log is as follows:
[2020:02:03 23:21:20] INFO Spider Spider started!
[2020:02:03 23:21:20] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2020:02:03 23:21:20] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2020:02:03 23:21:20] INFO Spider Worker started: 1987584893408
[2020:02:03 23:21:20] INFO Spider Worker started: 1987584893544
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:63889/devtools/browser/6a26ef07-748e-4644-8bfd-a8c1e8fe1c5a
[I:pyppeteer.launcher] terminate chrome process...
[2020:02:03 23:21:25] ERROR Spider <Callback[parse]: Get target_item's value error!
[2020:02:03 23:21:25] INFO Spider Stopping spider: Ruia
[2020:02:03 23:21:25] INFO Spider Total requests: 0
[2020:02:03 23:21:25] INFO Spider Time usage: 0:00:04.869658
[2020:02:03 23:21:25] INFO Spider Spider finished!
Sometimes it works, but most of the time the log output looks like this. Does this mean the requests are not succeeding?
Windows 10 x64
Latest ruia and ruia-pyppeteer
I previously built a spider with ruia-pyppeteer. Running it today raised a very strange error, so I created a fresh pipenv environment and ran the jianshu_js_example.py example; it also fails with an error like 'JianshuSpider' object has no attribute 'kwargs'.
It seems to happen whenever the spider inherits via from ruia_pyppeteer import PyppeteerSpider as Spider.
pipenv install always pulls the latest versions automatically.
import re

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerSpider as Spider

domain_page = 'http://barmap.hk/bars?page=1'

class bar_Item(Item):
    target_item = TextField(css_select="div#barmap > div.mdl-layout.mdl-js-layout.is-upgraded > div.mdl-layout__inner-container > div.mdl-layout__content.content > div > div.container > div.mdl-grid > div:nth-child(2)")
    bar_names = TextField(css_select="div#barmap div > a > div > div", many=True)

class bar_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 1

    async def parse(self, response):
        async for item in bar_Item.get_items(html=response.html):
            yield item

    async def process_item(self, item: bar_Item):
        for bar_name in item.bar_names:
            bar = re.sub(r"\('react-text: \d{3,4}", '', bar_name)
            print(bar)

if __name__ == '__main__':
    bar_Spider.start()
The feedback is always like this; however, I am 100% sure the CSS selector is correct.
[2019:03:25 14:21:10] INFO Spider Spider started!
[2019:03:25 14:21:10] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:25 14:21:10] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:25 14:21:10] INFO Spider Worker started: 2196814351352
[2019:03:25 14:21:10] INFO Spider Worker started: 2196814351488
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:59602/devtools/browser/6ae3158d-6011-43e1-b076-50d174e5cf84
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:25 14:21:12] ERROR Spider Extract div#barmap div > a > div > div error, please check selector or set parameter named default
[2019:03:25 14:21:12] INFO Spider Stopping spider: Ruia
[2019:03:25 14:21:12] INFO Spider Total requests: 1
[2019:03:25 14:21:12] INFO Spider Time usage: 0:00:02.102796
[2019:03:25 14:21:12] INFO Spider Spider finished!
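As an aside on the re.sub call in the snippet above: written as a raw string, with a hypothetical sample of the React text marker it strips (the marker format is assumed from the original pattern, not verified against the site):

```python
import re

# Invented sample input modeled on the pattern ('react-text: NNN
raw = "('react-text: 123Happy Hour Bar"
# Strip the marker and the 3-4 digit node id that follows it:
clean = re.sub(r"\('react-text: \d{3,4}", "", raw)
print(clean)  # -> Happy Hour Bar
```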
import aiohttp
import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

async def request_example():
    url = "https://deck.tk/07Pw8tfr"
    params = {
        'name': 'ruia',
    }
    headers = {
        'User-Agent': 'Python3.6',
    }
    async with aiohttp.ClientSession() as session:
        request = Request(url=url, method='GET', params=params, request_session=session, headers=headers, load_js=True)
        response = await request.fetch()
        html = await response.text()
        print(html)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(request_example())
produces:
Traceback (most recent call last):
File "C:\yourproject\spiders\yourspider.py", line 22, in <module>
asyncio.get_event_loop().run_until_complete(request_example())
File "c:\python39\lib\asyncio\base_events.py", line 642, in run_until_complete
return future.result()
File "C:\yourproject\spiders\yourspider.py", line 16, in request_example
response = await request.fetch()
File "C:\yourproject\env\lib\site-packages\ruia_pyppeteer\request.py", line 76, in fetch
response = PyppeteerResponse(
File "C:\yourproject\env\lib\site-packages\ruia_pyppeteer\response.py", line 26, in __init__
super(PyppeteerResponse, self).__init__(
TypeError: __init__() got an unexpected keyword argument 'html'
[I:pyppeteer.launcher] terminate chrome process...
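A TypeError like this (`__init__() got an unexpected keyword argument 'html'`) often points at mismatched ruia / ruia-pyppeteer releases. One way to check what is actually installed, using only the stdlib (Python 3.8+):

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the packages involved; comparing them
# against each project's changelog can confirm a version mismatch.
for pkg in ("ruia", "ruia-pyppeteer", "pyppeteer"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")
```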
Hi,
Thanks for your help.
I am scraping the news from this website "https://xueqiu.com/?category=livenews".
I have double-checked the XPath of the target items and the items I want to extract, but ruia keeps saying my XPath cannot be located.
Would you help me check whether the problem comes from my IP address (being blocked) or from a mistake in my ruia spider?
import asyncio

from ruia import Item, TextField
from ruia_pyppeteer import PyppeteerRequest as Request

domain_page = 'https://xueqiu.com/?category=livenews'

class frame_Item(Item):
    target_item = TextField(xpath_select="//body/div[@id='app']/div[@class='AnonymousHome_container_2te']/div[@class='AnonymousHome_home__col--lf_2Fg']/div[@class='AnonymousHome_home__timeline_VTo']/div[2]/div/div")
    publish_month = TextField(xpath_select="div[@class='div.AnonymousHome_home__timeline-live__hd_JGP']")

async def test():
    pyppeteer_page_options = {'waitUntil': 'networkidle2', 'timeout': 0}
    request = Request(domain_page, pyppeteer_page_options=pyppeteer_page_options)
    response = await request.fetch()
    item = await frame_Item.get_item(html=response.html)
    print(item.publish_month)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(test())
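One thing stands out in the publish_month selector above: its @class value embeds a CSS-style "div." prefix, which an XPath attribute comparison treats literally, so it can never match. A stdlib sketch of that behavior (the HTML sample is invented, and this is offered as a possible cause, not a confirmed diagnosis):

```python
import xml.etree.ElementTree as ET

# Minimal invented document with the class name from the snippet above:
html = "<div><div class='AnonymousHome_home__timeline-live__hd_JGP'>Feb</div></div>"
root = ET.fromstring(html)

# A predicate that embeds "div." inside the attribute value matches nothing:
assert root.findall("div[@class='div.AnonymousHome_home__timeline-live__hd_JGP']") == []
# Dropping the stray "div." prefix matches the element:
assert root.findall("div[@class='AnonymousHome_home__timeline-live__hd_JGP']")[0].text == "Feb"
```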