python3webspider / proxypool
An Efficient ProxyPool with Getter, Tester and Server
Home Page: https://proxypool.scrape.center
License: MIT License
Proxy pool started running
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
Starting to crawl proxies
Getter started running
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
requirements.txt pins the redis version as redis>=2.10.5, so the latest release gets installed by default, and that is now 3.x. Testing confirms that zadd calls then raise errors.
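For reference, a minimal sketch of the signature change (illustrative key and member, not the project's REDIS_KEY):

import redis

db = redis.StrictRedis(host='localhost', port=6379)

# redis-py 2.x accepted score/member pairs as positional arguments:
#   db.zadd('proxies', 100, '127.0.0.1:8080')   # 2.x style: score first
# redis-py 3.x requires a {member: score} mapping instead:
db.zadd('proxies', {'127.0.0.1:8080': 100})     # 3.x style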
Ip processing running
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "C:\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\Python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "C:\Python\Python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "C:\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python\Python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Python\Python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Process Process-2:
Traceback (most recent call last):
File "C:\Python\Python36\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Python\Python36\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 130, in check_pool
adder.add_to_queue()
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 87, in add_to_queue
raw_proxies = self._crawler.get_raw_proxies(callback)
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 28, in get_raw_proxies
for proxy in eval("self.{}()".format(callback)):
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 35, in crawl_ip181
html = get_page(start_url)
File "C:\迅雷下载\ProxyPool-master\proxypool\utils.py", line 14, in get_page
'User-Agent': ua.random,
UnboundLocalError: local variable 'ua' referenced before assignment
Refreshing ip
Waiting for adding
Refreshing ip
Waiting for adding
Refreshing ip
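The UnboundLocalError above presumably happens because get_page binds ua = UserAgent() inside a try block and reads ua.random afterwards; when fake_useragent cannot download its browser list, the binding never happens. A minimal defensive sketch of get_page, with an illustrative fallback UA string:

import requests
from fake_useragent import UserAgent
from fake_useragent.errors import FakeUserAgentError

FALLBACK_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # illustrative fallback

def get_page(url):
    try:
        user_agent = UserAgent().random
    except FakeUserAgentError:
        user_agent = FALLBACK_UA  # UA list could not be fetched; fall back
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text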
The availability rate of free proxies is low; I would like a pool that combines paid and free IPs.
That raises a problem: the paid IPs get tested endlessly, so fees accrue even when no actual business is using the proxies.
I would like the paid-proxy usage to be optimized to work on demand:
paid proxies are pulled from the provider only when a crawler actually needs them, with a configurable number of IPs fetched per pull.
It should be localhost:5555/random, right?
After running for a while it crashes by itself. What is going on, and can it be fixed?
# Start the proxy pool
from proxypool.scheduler import Scheduler
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
    try:
        s = Scheduler()
        s.run()
    except:
        main()

if __name__ == '__main__':
    main()
AttributeError: 'OutStream' object has no attribute 'buffer'
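The AttributeError shows up when the script runs under a console (e.g. Spyder or Jupyter) that replaces sys.stdout with an OutStream object that has no .buffer attribute. A minimal guard, assuming nothing beyond that:

import sys
import io

# Only rewrap stdout when it actually exposes a raw binary buffer;
# IDE consoles that substitute their own stream object are left untouched.
if hasattr(sys.stdout, 'buffer'):
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')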
As the title says.
Not sure whether the author forgot to write it, but it should be changed to random, otherwise nothing can be fetched.
Why does it crawl proxies only once right after startup and then never crawl again on schedule?
Not a bug, just suggestions:
1. The project's setting.py declares a LOG_DIR parameter for the log storage path, but it is never used.
A ...\project\ProxyPool\logs folder should be created, and the config changed from:
logger.add(env.str('LOG_RUNTIME_FILE', 'runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', 'error.log'), level='ERROR', rotation='1 week')
to:
logger.add(env.str('LOG_RUNTIME_FILE', f'{LOG_DIR}/runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', f'{LOG_DIR}/error.log'), level='ERROR', rotation='1 week')
2. If the ENABLE_TESTER, ENABLE_GETTER, and ENABLE_SERVER switches in setting.py are all set to False, running run.py raises an error (and the finally clause of the try block raises again); scheduler.py could be adjusted for this. (This one is nitpicking; feel free to ignore it.)
return self.db.zadd(REDIS_KEY, {proxy: score})
Is it possible to fetch proxies that can get through the GFW?
Traceback (most recent call last):
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\run.py", line 1, in <module>
from proxypool.scheduler import Scheduler
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\scheduler.py", line 4, in <module>
from proxypool.getter import Getter
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\getter.py", line 1, in <module>
from proxypool.tester import Tester
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\tester.py", line 2, in <module>
import aiohttp
File "D:\Anaconda3\lib\site-packages\aiohttp\__init__.py", line 6, in <module>
from .client import * # noqa
File "D:\Anaconda3\lib\site-packages\aiohttp\client.py", line 16, in <module>
from . import client_exceptions, client_reqrep
File "D:\Anaconda3\lib\site-packages\aiohttp\client_reqrep.py", line 18, in <module>
from . import hdrs, helpers, http, multipart, payload
File "D:\Anaconda3\lib\site-packages\aiohttp\helpers.py", line 161, in <module>
@attr.s(frozen=True, slots=True)
TypeError: attributes() got an unexpected keyword argument 'frozen'
I wrote a rough one; drop it into the Crawler class. Each line of the file just needs to be in 'host:port' format.

def crawl_file(self):
    filename = 'proxy.txt'  # the txt file sits in the same directory as the script, so no path prefix is needed
    with open(filename, 'r') as file_to_read:
        for line in file_to_read:  # read line by line
            line = line.strip()  # drop the trailing newline so the yielded proxy string is clean
            if line:
                yield line
During a run the proxy-crawling process seems to die; any idea why?
I can see the tester process and the API process still running, but the getter process does nothing and the number of proxies in the Redis queue keeps shrinking. Does anyone know what causes this?
I am using a remote environment and want to debug this project in PyCharm, but every time I debug run.py it reports that the file cannot be found. How do I debug this project with PyCharm?
Configuration and installation are all done, and the earlier pop-based version works, but with this one the run hangs at "Getter started running". How can I fix this? Thanks.
Proxy pool started running
https://stackoverflow.com/questions/31663288/how-do-i-properly-use-connection-pools-in-redis
I was thinking that instead of creating a new connection to Redis on every request, it would be better to write:

redis_pool = None

class RedisClient(object):
    def __init__(self, host=HOST, port=PORT):
        global redis_pool
        if not redis_pool:
            if PASSWORD:
                redis_pool = redis.Redis(host=host, port=port, password=PASSWORD)
            else:
                redis_pool = redis.Redis(host=host, port=port)
        self._db = redis_pool
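A variant with an explicit redis.ConnectionPool, closer to what the linked Stack Overflow answer describes (HOST, PORT, and PASSWORD stand in for the project's settings):

import redis

HOST, PORT, PASSWORD = 'localhost', 6379, None  # assumed settings, as in the snippet above

# One module-level pool shared by every RedisClient instance.
_pool = redis.ConnectionPool(host=HOST, port=PORT, password=PASSWORD)

class RedisClient(object):
    def __init__(self):
        # Each client borrows connections from the shared pool instead of
        # opening its own TCP connection to Redis.
        self._db = redis.Redis(connection_pool=_pool)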
How do I fix this?
Proxy pool started running
Testing the way shown below works fine; a separate problem is that aiohttp does not support HTTPS proxies.
response = requests.get('https://www.baidu.com', proxies={'https': 'http://125.123.139.131:9999'}, timeout=3)
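For comparison, a minimal sketch of the same check in aiohttp; the proxy URL must use the http:// scheme because aiohttp only talks to proxies over plain HTTP, which is the limitation mentioned above (proxy address taken from the snippet and likely no longer live):

import asyncio
import aiohttp

async def test_proxy(proxy):
    async with aiohttp.ClientSession() as session:
        # aiohttp accepts only http:// proxy URLs, even for https:// targets
        async with session.get('https://www.baidu.com',
                               proxy='http://{}'.format(proxy),
                               timeout=aiohttp.ClientTimeout(total=3)) as resp:
            return resp.status == 200

print(asyncio.get_event_loop().run_until_complete(test_proxy('125.123.139.131:9999')))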
I found that the run:cli entry written in console_scripts cannot be used successfully after installation. Why is it written that way? After I renamed run to pool_run and changed the entry to pool_run:main, it works fine.
Are the proxies served by the local web page ones that have already been tested as usable?
Hi, while using your code I noticed that your Redis credentials are exposed.....
def add(self, proxy, score=INITIAL_SCORE):
    """
    Add a proxy and set its score to the highest.
    :param proxy: proxy
    :param score: score
    :return: result of the add
    """
    if not re.match(r'\d+\.\d+\.\d+\.\d+:\d+', proxy):
        print('Proxy is malformed, discarding:', proxy)
        return
    if not self.db.zscore(REDIS_KEY, proxy):
        return self.db.zadd(REDIS_KEY, {proxy: score})
zincrby(name, amount, value)
The second and third arguments of zincrby in the source need to be swapped.
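Concretely, with an illustrative key and member (not the project's REDIS_KEY):

import redis

db = redis.StrictRedis()

# redis-py 2.x signature:  zincrby(name, value, amount=1)
#   db.zincrby('proxies', '127.0.0.1:8080', -1)
# redis-py 3.x signature:  zincrby(name, amount, value)
db.zincrby('proxies', -1, '127.0.0.1:8080')  # lower this member's score by 1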
When running python run.py from the command line, the project works fine and proxies can be fetched from http://127.0.0.1:5555/random. But when running it directly in VS Code (pressing F5), an exception occurs and the process cannot be stopped; several times I could only kill it by rebooting.
Can anyone explain this?
Everything up front runs fine, but at the testing stage the proxy requests fail; looking for a solution.
File "run.py", line 1, in
from proxypool.scheduler import Scheduler
File "C:\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "C:\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "C:\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp_init_.py", line 6, in
from .client import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\client.py", line 32, in
from . import hdrs, http, payload
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http.py", line 7, in
from .http_parser import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http_parser.py", line 755, in
from ._http_parser import (HttpRequestParser, # type: ignore # noqa
File "aiohttp_http_parser.pyx", line 44, in init aiohttp._http_parser
AttributeError: type object 'URL' has no attribute 'build'
➜ ProxyPool git:(master) pip3 install -r requirements.txt
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting aiohttp>=1.3.3 (from -r requirements.txt (line 1))
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
Could not fetch URL https://pypi.org/simple/aiohttp/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/aiohttp/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
Could not find a version that satisfies the requirement aiohttp>=1.3.3 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for aiohttp>=1.3.3 (from -r requirements.txt (line 1))
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
➜ ProxyPool git:(master)
All the libraries and the Python version meet the requirements,
but running python run.py raises:
ImportError: cannot import name 'etree' from 'lxml'
I really cannot find a solution anywhere; please help.
zadd and zincrby changed between redis-py 2.x and 3.x.
In 3.x, zadd must be passed a dict (member name -> score),
and the amount and value arguments of zincrby are swapped.
redis.exceptions.ResponseError: value is not a valid float
Running it raises an error:
/proxypool/db.py", line 30, in add
return iter(x.items())
AttributeError: 'int' object has no attribute 'items'
import json
import re
from .utils import get_page
from pyquery import PyQuery as pq


class ProxyMetaclass(type):
    def __new__(cls, name, bases, attrs):
        count = 0
        attrs['__CrawlFunc__'] = []
        for k, v in attrs.items():
            if 'crawl_' in k:
                attrs['__CrawlFunc__'].append(k)
                count += 1
        attrs['__CrawlFuncCount__'] = count
        return type.__new__(cls, name, bases, attrs)


class Crawler(object, metaclass=ProxyMetaclass):
    def get_proxies(self, callback):
        proxies = []
        for proxy in eval(f"self.{callback}()"):
            print('Successfully fetched proxy', proxy)
            proxies.append(proxy)
        return proxies

    def crawl_daili66(self, page_count=4):
        """
        Crawl proxies from daili66.
        :param page_count: number of pages
        :return: proxies
        """
        start_url = 'http://www.66ip.cn/{}.html'
        urls = [start_url.format(page) for page in range(1, page_count + 1)]
        for url in urls:
            print('Crawling', url)
            html = get_page(url)
            if html:
                doc = pq(html)
                trs = doc('.containerbox table tr:gt(0)').items()  # index > 0: the first tr holds no ip/port
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_ip3366(self):
        for i in range(1, 4):
            start_url = 'http://www.ip3366.net/?stype=1&page={}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#container #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_kuaidaili(self):
        for i in range(1, 4):
            start_url = 'http://www.kuaidaili.com/free/inha/{}/'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#content .con-body #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_iphai(self):
        start_url = 'http://www.iphai.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('.container .table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

    def crawl_xicidaili(self):
        for i in range(1, 3):
            start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#wrapper #body table tr:gt(0)').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(2)').text()
                    port = tr.find('td:nth-child(3)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_data5u(self):
        start_url = 'http://www.data5u.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            uls = doc('.wlist>ul ul:gt(0)').items()
            for ul in uls:
                ip = ul.find('span:nth-child(1)').text()
                port = ul.find('span:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])
Just swap this in; personally tested and working as of 2019-10-10.
zadd changed in the new version and needs to become zadd(REDIS_KEY, {proxy: score}).
There are two call sites, in RedisClient.add() and RedisClient.max().
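As a sketch, the corrected RedisClient.max() would look like this (MAX_SCORE and REDIS_KEY come from the project's settings; RedisClient.add() changes the same way, as in the add() snippet earlier):

def max(self, proxy):
    """Set this proxy's score to MAX_SCORE, marking it as verified usable."""
    print('Proxy', proxy, 'is valid, setting score to', MAX_SCORE)
    # redis-py 3.x: pass a {member: score} mapping
    return self.db.zadd(REDIS_KEY, {proxy: MAX_SCORE})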
retrying module errors caused by a wrong URL
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))
Is there a redundant function in the Crawl.py module? crawl_ip3366 is defined twice.
D:\Pycharm工作资料\代码流\venv\Scripts\python.exe C:/Users/ThinkPad/Downloads/ProxyPool-master/run.py
浠g悊姹犲紑濮嬭繍琛� (mojibake of 代理池开始运行, i.e. "Proxy pool started running", printed with the wrong console encoding)
The code is too hard to follow; it only works if you do everything exactly as in your book, and I don't want to install Redis anyway.