proxy_pool's Introduction

ProxyPool: Crawler Proxy IP Pool


______                        ______             _
| ___ \_                      | ___ \           | |
| |_/ / \__ __   __  _ __   _ | |_/ /___   ___  | |
|  __/|  _// _ \ \ \/ /| | | ||  __// _ \ / _ \ | |
| |   | | | (_) | >  < \ |_| || |  | (_) | (_) || |___
\_|   |_|  \___/ /_/\_\ \__  |\_|   \___/ \___/ \_____\
                       __ / /
                      /___ /

ProxyPool

A proxy IP pool for web crawlers. It periodically collects free proxies published online and validates them before storing, then periodically re-validates the stored proxies to keep them usable. Both an API and a CLI are provided. You can also add your own proxy sources to improve the quantity and quality of the pool.

  • Docs: document

  • Supported versions:

  • Demo: http://demo.spiderpy.cn (please don't hammer it, thanks)

  • Paid proxy recommendation: luminati-china. BrightData (formerly Luminati) is widely regarded as the leader of the proxy market, with 72 million IPs worldwide, most of them real residential IPs, and a solid success rate. It offers a range of paid plans; if you need high-quality proxy IPs, register and contact the Chinese-language support, and you will receive a $5 credit and a tutorial after activation (PS: if you can't figure it out, see this usage tutorial).

Running the Project

Download the code:
  • git clone
git clone [email protected]:jhao104/proxy_pool.git
  • releases
Download the matching zip file from https://github.com/jhao104/proxy_pool/releases
Install dependencies:
pip install -r requirements.txt
Update the configuration:
# setting.py is the project configuration file

# API server settings

HOST = "0.0.0.0"               # bind address
PORT = 5000                    # listen port


# Database settings

DB_CONN = 'redis://:[email protected]:8888/0'


# ProxyFetcher settings

PROXY_FETCHER = [
    "freeProxy01",      # names of the enabled fetch methods; all fetch methods live in fetcher/proxyFetcher.py
    "freeProxy02",
    # ....
]
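The DB_CONN string follows the usual URL form scheme://:password@host:port/db. A quick way to sanity-check your own connection string with the standard library (parse_db_conn is an illustrative helper, not part of the project):

```python
from urllib.parse import urlparse

def parse_db_conn(conn):
    """Split a DB_CONN URL into its parts (illustrative helper, not project code)."""
    u = urlparse(conn)
    return {
        "scheme": u.scheme,        # e.g. "redis"
        "password": u.password,    # the text between ':' and '@'
        "host": u.hostname,
        "port": u.port,
        "db": int(u.path.lstrip("/") or 0),
    }

print(parse_db_conn("redis://:mypassword@127.0.0.1:6379/0"))
# {'scheme': 'redis', 'password': 'mypassword', 'host': '127.0.0.1', 'port': 6379, 'db': 0}
```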

Start the project:

# Once the prerequisites are in place, the project can be started via proxyPool.py.
# It consists of two parts: the schedule process and the server (API) process.

# start the schedule process
python proxyPool.py schedule

# start the web API service
python proxyPool.py server

Docker Image

docker pull jhao104/proxy_pool

docker run --env DB_CONN=redis://:password@ip:port/0 -p 5010:5010 jhao104/proxy_pool:latest

docker-compose

Run in the project directory:

docker-compose up -d

Usage

  • Api

After the web service starts, the default configuration exposes the API at http://127.0.0.1:5010:

api      method  description           params
/        GET     API introduction      None
/get     GET     get a random proxy    optional: ?type=https to filter proxies that support https
/pop     GET     get and delete a proxy  optional: ?type=https to filter proxies that support https
/all     GET     get all proxies       optional: ?type=https to filter proxies that support https
/count   GET     count the proxies     None
/delete  GET     delete a proxy        ?proxy=host:port
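All endpoints are plain GET requests, so the query strings are easy to build by hand. A small illustrative helper (BASE and endpoint are not part of the project):

```python
from urllib.parse import urlencode

# BASE and endpoint() are illustrative helpers, not part of the project.
BASE = "http://127.0.0.1:5010"

def endpoint(path, **params):
    """Build a request URL for one of the API routes above."""
    return BASE + path + ("?" + urlencode(params) if params else "")

print(endpoint("/get", type="https"))   # http://127.0.0.1:5010/get?type=https
print(endpoint("/count"))               # http://127.0.0.1:5010/count
```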
  • Using it from a crawler

  To use the pool in crawler code, wrap this API in helper functions, for example:

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            # fetch through the proxy
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            return html
        except Exception:
            retry_count -= 1
    # all retries failed: remove the proxy from the pool
    delete_proxy(proxy)
    return None

Extending Proxy Sources

  The project ships with a few free proxy sources, but free proxies are of limited quality, so the proxies you get out of the box may not be ideal. The project therefore provides an extension point for adding your own sources.

  To add a new proxy source:

  • 1. Add a custom static method to the ProxyFetcher class. The method must be a generator (yield) that returns proxies in host:port format, for example:
class ProxyFetcher(object):
    # ....

    # custom proxy source method
    @staticmethod
    def freeProxyCustom1():  # any name not already taken

        # fetch proxies from some website, API, or database
        # suppose you already have a list of proxies
        proxies = ["x.x.x.x:3128", "x.x.x.x:80"]
        for proxy in proxies:
            yield proxy
        # make sure every proxy is returned in valid host:port format
  • 2. After adding the method, update the PROXY_FETCHER entry in setting.py:

  Add the name of your custom method under PROXY_FETCHER:

PROXY_FETCHER = [
    "freeProxy01",
    "freeProxy02",
    # ....
    "freeProxyCustom1"  # make sure this matches the name of the method you added
]

  The schedule process fetches proxies at a fixed interval; on the next fetch it will automatically pick up and call your method.
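A slightly more realistic custom fetcher would scrape a page and extract host:port pairs with a regex. A sketch (the URL and names below are placeholders, not project code; parsing is split out so it can be tested without a network):

```python
import re

# Sketch of a custom fetcher. The URL in freeProxyCustom2() is a placeholder,
# not a real proxy source.
PROXY_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}")

def extract_proxies(text):
    """Pull host:port pairs out of arbitrary page text."""
    return PROXY_RE.findall(text)

def freeProxyCustom2():
    import requests  # only needed when actually fetching
    resp = requests.get("http://example.com/proxies.txt", timeout=10)  # placeholder URL
    yield from extract_proxies(resp.text)

print(extract_proxies("ip: 1.2.3.4:8080, other 10.0.0.1:3128"))
# ['1.2.3.4:8080', '10.0.0.1:3128']
```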

Free Proxy Sources

The free proxy sites currently collected are listed below (in no particular order; this only describes their free offerings; for paid proxy reviews see here):

Name  Status  Update speed  Availability  Link  Code
站大爷 ** 地址 freeProxy01
66代理 * 地址 freeProxy02
开心代理 * 地址 freeProxy03
FreeProxyList * 地址 freeProxy04
快代理 * 地址 freeProxy05
冰凌代理 ★★★ * 地址 freeProxy06
云代理 * 地址 freeProxy07
小幻代理 ★★ * 地址 freeProxy08
免费代理库 * 地址 freeProxy09
89代理 * 地址 freeProxy10
稻壳代理 ★★ *** 地址 freeProxy11

If you know of other good free proxy sites, please file an issue; they will be considered for inclusion in a future update.

Feedback

  Please report any problems in Issues; you can also leave a comment on my blog.

  Your feedback makes this project better.

Contributing

  This project aims to stay a basic, general-purpose proxy pool architecture and does not take on use-case-specific features (though particularly good ideas are, of course, an exception).

  The project is still not perfect. If you find a bug or want a new feature, please describe the bug (or feature) in Issues, and I will do my best to improve it.

  Many thanks to the following contributors for their generous work:

  @kangnwh | @bobobo80 | @halleywj | @newlyedward | @wang-ye | @gladmo | @bernieyangmh | @PythonYXY | @zuijiawoniu | @netAir | @scil | @tangrela | @highroom | @luocaodan | @vc5 | @1again | @obaiyan | @zsbh | @jiannanya | @Jerry12228

Release Notes

changelog

proxy_pool's Issues

Re-validation of stored IPs

Looking at the code, I only see IPs from raw_proxy_queue being validated, with the usable ones moved into useful_proxy_queue; I don't see any code that validates the IPs already in useful_proxy_queue. These free IPs can expire at any time, so they also need to be re-validated periodically.
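A minimal sketch of the re-check loop this issue asks for (function names are illustrative, not the project's actual code; the check is injectable so it can be exercised without network traffic):

```python
from urllib.request import ProxyHandler, build_opener

def is_alive(proxy, timeout=5):
    """Return True if the proxy can fetch a known page (stdlib-only check)."""
    opener = build_opener(ProxyHandler({"http": "http://" + proxy}))
    try:
        return opener.open("http://httpbin.org/ip", timeout=timeout).status == 200
    except Exception:
        return False

def revalidate(useful_proxies, check=is_alive):
    """Keep only the proxies that still respond; the rest get dropped."""
    return {p for p in useful_proxies if check(p)}

# a stub instead of real network traffic, just to show the flow:
print(revalidate({"1.1.1.1:80", "2.2.2.2:80"}, check=lambda p: p.startswith("1.")))
# {'1.1.1.1:80'}
```

Run from a scheduler every few minutes, this keeps useful_proxy_queue from accumulating dead entries.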

Would like to connect!

Hi, your projects are all quite good and I'd like to get to know you. I've been organizing people to work on data collection and mining; together we can do a lot. My WeChat: toyaowu

/get returns an internal server error

get_all returns an empty result, and calling get returns an internal server error. What's going on?
Internal Server Error
Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

Error when running under Python 3, please take a look

[root@iz8vbawf20vjywci9aweg8z Run]# python3 main.py
Process ValidRun:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "../Schedule/ProxyValidSchedule.py", line 61, in run
    p.main()
  File "../Schedule/ProxyValidSchedule.py", line 56, in main
    self.__validProxy()
  File "../Schedule/ProxyValidSchedule.py", line 36, in __validProxy
    for each_proxy in self.db.getAll():
  File "../DB/DbClient.py", line 94, in getAll
    return self.client.getAll()
  File "/home/software/proxy_pool/DB/SsdbClient.py", line 94, in getAll
    return self.__conn.hgetall(self.name).keys()
  File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 1050, in hgetall
    return self.execute_command('hgetall', name)
  File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 225, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 404, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 383, in send_packed_command
    self._sock.sendall(item)
TypeError: a bytes-like object is required, not 'str'

Counter question in ProxyValidSchedule

In ProxyValidSchedule, if a proxy has existed for a long time, its counter will be very large, so even after it dies it takes many rounds before the count goes negative and the proxy is cleaned up. Would setting an upper bound on the counter, say 10, so it stops growing beyond that, clean up expired proxies more effectively?
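The proposed cap can be sketched like this (illustrative only, not the project's actual scoring code):

```python
# Sketch of the capped counter this issue proposes (illustrative only).
MAX_COUNT = 10

def on_check(count, ok):
    """Update a proxy's score: cap growth on success, decrement on failure."""
    return min(count + 1, MAX_COUNT) if ok else count - 1

c = 0
for _ in range(50):     # a long-lived proxy's score stops at the cap
    c = on_check(c, True)
print(c)                # 10
for _ in range(11):     # once it dies, it goes negative within 11 checks
    c = on_check(c, False)
print(c)                # -1
```

Without the cap, a proxy seen 50 times would need 51 failed checks before cleanup; with it, at most 11.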

Error 10061 connecting localhost:8888

Hi, after setting up the dependencies I ran main.py and got ConnectionError: Error 10061 connecting localhost:8888. I'm new and haven't used SSDB before; is my SSDB configuration wrong? Running getFreeProxy on its own does return a list of IPs, so it's probably a database setup problem. I moved it to a cloud server and changed it to the server's public IP with SSDBAdmin, but I still get the 10061 error. What else do I need to configure? Thanks.

Usage problem

Dear author:
Hi, I downloaded your project and got it running on Linux, but it only served IPs to my program for a few minutes; after that, proxyApi started returning 500 errors and stopped providing IPs. Why is that? Below is the log from proxyApi.

...................................
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:58] "GET /get/ HTTP/1.1" 500 -
[2017-01-22 17:07:59,348] ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1642, in full_dispatch_request
    response = self.make_response(rv)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1731, in make_response
    raise ValueError('View function did not return a response')
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:59] "GET /get/ HTTP/1.1" 500 -

No proxies at all for the last two days?

Following the documentation I installed the dependencies and ssdb, but after starting Python there are only two main.py processes, and the log file shows no errors at all. What could be wrong?

Error when accessing on CentOS 7

ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
    response = self.make_response(rv)
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
    raise ValueError('View function did not return a response')
ValueError: View function did not return a response
ERROR:Api.ProxyApi:Exception on /get/ [GET]
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
    response = self.make_response(rv)
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
    raise ValueError('View function did not return a response')
ValueError: View function did not return a response

CentOS 7, Python 2.7.13. Testing locally with curl http://localhost:5000/get/ works fine, but accessing it from a cloud server abroad fails with the error above.

The scheduled job may also have a problem. I'm not very familiar with Python; could you help figure out what's wrong and how to fix it?

ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-07-07 17:29:56 CST)" raised an exception
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
    p.refresh()
  File "../Manager/ProxyManager.py", line 42, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "../ProxyGetter/getFreeProxy.py", line 80, in freeProxySecond
    for proxy in re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html):
  File "/usr/local/python/Lib/re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

[FIXED] SSDB admin UI configuration

After running ProxyApi.py under Api and ProxyRefreshSchedule.py and ProxyValidSchedule.py under Schedule, I want to inspect the results with the SSDBAdmin tool mentioned in the README. How should 'host' and 'port' be set in SSDBAdmin's setting.py?

#!/usr/bin/env python

servers = [
    {
        "host": "172.16.1.69",
        "port": 8889
    },
    {
        "host": "127.0.0.1",
        "port": 8889
    }
]

DEBUG = False

Addendum:

While ProxyApi.py, ProxyRefreshSchedule.py and ProxyValidSchedule.py were running, starting SSDBAdmin's runserver.py failed with socket.error: [Errno 98] Address already in use. I have no prior experience with redis or other databases, so any advice is appreciated, thanks!

How to add a scheme attribute?

  1. This is my first time using a NoSQL DB, and it really does suit this job (maintaining a proxy pool) better than a traditional database.
  2. I now need to add a scheme attribute to each IP. I noticed that in your database factory class DbClient the value is None; I think that is where the scheme could go.
  3. I'm quite familiar with scraping the data itself.
  4. My question is how to modify the code with the least effort. My idea is below.

When scraping proxy IPs, detect the scheme and store it in the SSDB value. That would mean changing the factory class, and probably ProxyManager as well.

What would be the best practice in this situation?
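One low-churn option (a sketch only, not the project's actual schema) is to keep the hash key as host:port and put the scheme into the JSON value that is currently None:

```python
import json

# A sketch (not the project's actual schema): keep the hash key as host:port
# and put the scheme into the JSON value that is currently None.
def pack(scheme):
    return json.dumps({"scheme": scheme})

def unpack(value):
    return json.loads(value)["scheme"] if value else None

store = {}                               # stand-in for the SSDB hash
store["1.2.3.4:8080"] = pack("https")
print(unpack(store["1.2.3.4:8080"]))     # https
```

Because unpack treats None as "no scheme recorded", existing entries keep working while new ones gain the attribute.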

Typo

There is a typo at line 18 of Manager.ProxyManager:

from ProxyGetter.GetFreeProxy import GetFreeProxy
should be
from ProxyGetter.getFreeProxy import GetFreeProxy

Database update logic

Very useful tool, thanks!
A few questions to confirm:

  1. What is the update logic for the proxy list returned by get_all? The list seems to keep growing: an IP that worked yesterday but is dead today is still returned.
  2. Would you consider adding an API that returns only the proxies verified as working within the last X minutes?

Thanks.

Port numbers

Port numbers go up to 65535, so they can have up to 5 digits,
but the crawler's regex only captures the first 4 digits.
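The fix is to let the port part of the pattern match up to five digits. The patterns below illustrate the difference; the project's actual regex may differ slightly:

```python
import re

# \d{1,4} silently drops the last digit of 5-digit ports; \d{1,5} keeps them.
BAD = re.compile(r"(\d{1,3}\.){3}\d{1,3}:\d{1,4}")
GOOD = re.compile(r"(\d{1,3}\.){3}\d{1,3}:\d{1,5}")

text = "proxy at 1.2.3.4:65534"
print(BAD.search(text).group())    # 1.2.3.4:6553  (port truncated)
print(GOOD.search(text).group())   # 1.2.3.4:65534
```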

Why must redis be running first?

Hi jhao104, the documentation says SSDB is used in place of redis, but in practice, if redis isn't running before main.py, it errors out. I'm a beginner, please advise, thanks. (Windows 8, 64-bit.) Also, sometimes it runs fine but no proxies ever show up; the browser just shows a pitiful [ ].

A small change

In proxy_pool/DB/RedisClient.py, pop should be changed to:

def pop(self):
    return self.__conn.spop(self.name)

Does proxy_pool deduplicate IPs?

What I mean is:
if we get the same ip:port from several sources,

for example:
192.168.56.1:123 from X
and
192.168.56.1:123 from Y

can proxy_pool resolve this?
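If proxies are keyed by host:port in a set (or in a database hash), duplicates from different sources collapse automatically. A sketch:

```python
# Sketch: when proxies are stored as keys of a set (or of a DB hash), the
# same host:port arriving from two sources collapses into a single entry.
pool = set()
for source in (["192.168.56.1:123", "10.0.0.1:80"],   # from source X
               ["192.168.56.1:123"]):                 # from source Y
    pool.update(source)
print(sorted(pool))   # ['10.0.0.1:80', '192.168.56.1:123']
```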

Filtering by request type

Could the get call allow specifying an https or http proxy? After all, some proxies only support http or only https.

Thanks!

I'd like to add a new feature

Hi. When using your code I read the results straight from redis. A proxy in useful_proxy_queue is only dropped when you drop it manually, but dropping on a single failure exhausts the pool very quickly. So in my own code, when reading from redis, I added a check: I only drop a proxy after 20 consecutive failures, and it works quite well. Right now this lives in my own code; I'd like to add it to the proxy pool API instead. Would this feature be accepted?

Banned by Xici

Hi, I'm writing a similar project as an exercise, but when scraping Xici proxies I repeatedly get banned and served a blocked page. Do you handle this by constantly rotating headers, or some other way?

Newbie asking for advice, thanks!

It appears ..

Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application. ---># in browser

ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-06-14 23:20:31 CEST)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
    p.refresh()
  File "../Manager/ProxyManager.py", line 42, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "../ProxyGetter/getFreeProxy.py", line 79, in freeProxySecond
    html = getHTMLText(url, headers=HEADER)
  File "../Util/utilFunction.py", line 31, in getHTMLText
    return response.status_code
UnboundLocalError: local variable 'response' referenced before assignment ---># in terminal

how solve?

OverflowError from python2.7 -m Schedule.ProxyRefreshSchedule

Running python2.7 -m Schedule.ProxyRefreshSchedule in the project directory
produces the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 46, in refresh
    self.db.put(proxy)
  File "DB/DbClient.py", line 73, in put
    return self.client.put(value, **kwargs)
  File "/home/sam/app/venv/proxy_pool/DB/SsdbClient.py", line 62, in put
    return self.__conn.hset(self.name, value, None)
  File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 797, in hset
    return self.execute_command('hset', name, key, value)
  File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 218, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 404, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 378, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 281, in connect
    sock = self._connect()
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 308, in _connect
    socket.SOCK_STREAM):
OverflowError: Python int too large to convert to C long

System: CentOS 6.4, 64-bit.
Searching online suggests this error comes from an underlying C function:
http://bugs.python.org/issue21816
Oddly, I don't see anyone else reporting the same error in the issues.

ssdb compatibility problem

Installing ssdb fails under Python 3 (tried 3.4.3 and 3.6.2); I suggest switching to pyssdb or ssdb.py

ubuntu@hp:~/workspace/proxy_pool/Run$ python -V
Python 3.6.2
ubuntu@hp:~/workspace/proxy_pool/Run$ pip install ssdb
Collecting ssdb
  Using cached ssdb-0.0.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-srsinocq/ssdb/setup.py", line 5, in <module>
        from ssdb import __version__
      File "/tmp/pip-build-srsinocq/ssdb/ssdb/__init__.py", line 2, in <module>
        from ssdb.client import StrictSSDB, SSDB
      File "/tmp/pip-build-srsinocq/ssdb/ssdb/client.py", line 3, in <module>
        from itertools import chain, starmap, izip_longest
    ImportError: cannot import name 'izip_longest'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-srsinocq/ssdb/

What runtime environment is required?

Using CentOS 6 and Python 3.5 I get all kinds of errors.

[root@localhost proxy_pool-master]# python -m Schedule.ProxyRefreshSchedule
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/proxy_pool-master/Schedule/ProxyRefreshSchedule.py", line 19, in <module>
    from apscheduler.schedulers.blocking import BlockingScheduler
ImportError: No module named 'apscheduler'

Is something not installed properly?

Periodic IP refresh problem

After the proxy program has been running for a while, a large number of zombie processes appear (see screenshot).

I suspect there is a bug in the ProxyRefreshSchedule class that handles the periodic refresh~

Error installing the ssdb Python driver on Python 2.6

Collecting ssdb
/usr/lib/python2.6/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Using cached ssdb-0.0.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/tmp/pip-build-sfGK5r/ssdb/setup.py", line 5, in <module>
    from ssdb import __version__
  File "ssdb/__init__.py", line 2, in <module>
    from ssdb.client import StrictSSDB, SSDB
  File "ssdb/client.py", line 74
    return {k:int(v) for k,v in list_to_dict(lst).items()}
                     ^
SyntaxError: invalid syntax

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-sfGK5r/ssdb

Error running on CentOS 6, Python 2.7.1

[root@VPS Run]# python main.py
Traceback (most recent call last):
  File "main.py", line 22, in <module>
    from Schedule.ProxyRefreshSchedule import run as RefreshRun
  File "../Schedule/ProxyRefreshSchedule.py", line 21, in <module>
    from apscheduler.schedulers.blocking import BlockingScheduler
  File "/usr/local/python27/lib/python2.7/site-packages/apscheduler/__init__.py", line 2, in <module>
    release = __import__('pkg_resources').get_distribution('APScheduler').version.split('-')[0]
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 292, in get_distribution
    if isinstance(dist,Requirement): dist = get_provider(dist)
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 176, in get_provider
    return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 648, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 546, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: APScheduler

Collection won't start

Starting collection fails with: TypeError: __init__() got an unexpected keyword argument 'minute'

python -m Schedule.ProxyRefreshSchedule fails with an error

Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 40, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "ProxyGetter/getFreeProxy.py", line 102, in freeProxyFifth
    d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
IndexError: list index out of range

I suspect the xpath for that proxy source is broken. After commenting it out, I get a different error.
ProxyGetter/getFreeProxy.py

  91     @staticmethod
  92     @robustCrawl
  93     def freeProxyFifth():
  94         """
  95         抓取guobanjia http://www.goubanjia.com/free/gngn/index.shtml
  96         :return:
  97         """
  98         url = "http://www.goubanjia.com/free/gngn/index.shtml"
  99         tree = getHtmlTree(url)
 100         # at most 15 are published per day now (one page)
 101         #for i in xrange(15):
 102             #d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
 103             #o = d.xpath('.//span/text() | .//div/text()')
 104             #yield ''.join(o[:-1]) + ':' + o[-1]
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 40, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
TypeError: 'NoneType' object is not iterable
