luyishisi / anti-anti-spider
7.2K 7.2K 2.2K 147.21 MB

More and more websites deploy anti-crawler defenses: some hide key data inside images, others use inhuman CAPTCHAs. This repository collects anti-anti-crawler code, sharpening our skills by (harmlessly) taking on sites with different defenses. (Submissions of hard-to-scrape sites are welcome.) (The project is paused for work reasons.)

Home Page: https://www.urlteam.cn

Python 99.92% Shell 0.08%
Topics: geek, python, spider

anti-anti-spider's People

Contributors

copie · ftlikon · leng-yue · luyishisi · metaspider · p-yl · xuna123


anti-anti-spider's Issues

How do I feed a CAPTCHA image from a real project into the recognizer?

if __name__ == '__main__':
    # text, image = gen_captcha_text_and_image()
    text = '1dxz'
    captcha = 'y.jpg'
    image = Image.open(captcha)
    # image.show()
    image = np.array(image)
    image = convert2gray(image)
    image = image.flatten() / 255
    predict_text = crack_captcha(image)
    print("Expected: {} Predicted: {}".format(text, predict_text))

    # train_crack_captcha_cnn()
    # text, image = gen_captcha_text_and_image()
    # image = convert2gray(image)  # generate a new image
    # image = image.flatten() / 255  # flatten the image to 1-D
    # predict_text = crack_captcha(image)  # load the model and predict
    # print("Expected: {}  Predicted: {}".format(text, predict_text))

Error output:

CAPTCHA image channels: (36, 70, 3)
Max CAPTCHA text length: 4
2017-11-28 22:35:53.116468: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
File "/Users/menggui/Desktop/project/zhuanli_spider/cnipr/other_operate/test_2/tensorflow_cnn.py", line 229, in <module>
predict_text = crack_captcha(image)
File "/Users/menggui/Desktop/project/zhuanli_spider/cnipr/other_operate/test_2/tensorflow_cnn.py", line 209, in crack_captcha
text_list = sess.run(predict, feed_dict={X: [captcha_image], keep_prob: 1})
File "/Users/menggui/.pyenv/versions/env_comm_Ana3-4.3.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/Users/menggui/.pyenv/versions/env_comm_Ana3-4.3.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 36, 70) for Tensor 'Placeholder:0', which has shape '(?, 9600)'
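The ValueError spells out the mismatch: the placeholder wants 9600 features per image (60x160 flattened), but y.jpg is 36x70. A minimal sketch of a fix is to resize the image to the dimensions the network was trained on before flattening; the 60x160 figure here is inferred from the (?, 9600) shape in the error and is an assumption about this particular model.

```python
import numpy as np
from PIL import Image

# Assumed training dimensions: 60 * 160 = 9600, matching Tensor '(?, 9600)'.
IMAGE_HEIGHT, IMAGE_WIDTH = 60, 160

def prepare(img: Image.Image) -> np.ndarray:
    img = img.convert("L")                         # grayscale: drops the 3 RGB channels
    img = img.resize((IMAGE_WIDTH, IMAGE_HEIGHT))  # PIL's resize takes (width, height)
    return np.array(img).flatten() / 255           # flatten to the 9600-feature vector

demo = Image.new("RGB", (70, 36))  # stand-in for the 36x70 y.jpg
vec = prepare(demo)
print(vec.shape)  # (9600,)
```

After this, `crack_captcha(vec)` would receive a vector the placeholder can accept; note that a model trained on 60x160 CAPTCHAs may still predict poorly on images stretched from a different aspect ratio.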

CPU training cannot reach full utilization

A 24-core, 48-thread 2695 v2 never reaches full load; utilization stays around 30%, and each training step takes about 1 second.

With a GPU (GTX 960), each step takes about 0.5 seconds.

Ubuntu 16.04, CUDA 8.0, cuDNN 5.1; TensorFlow was compiled locally.
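One thing worth checking (an assumption, not a confirmed diagnosis): with ~1s steps, the per-op work may be too small to saturate the default thread pools. In TensorFlow 1.x, matching the era of this report, the pools can be sized explicitly via the session config:

```python
# Sketch (TensorFlow 1.x): explicitly size the op-level thread pools.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=48,  # threads used inside a single op
    inter_op_parallelism_threads=4,   # ops executed concurrently
)
# sess = tf.Session(config=config)
```

Small batches also limit parallelism; increasing the batch size often raises CPU utilization more than thread tuning does.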

Avoid special characters in file names

 git clone git@github.com:luyishisi/Anti-Anti-Spider.git
Cloning into 'Anti-Anti-Spider'...
remote: Counting objects: 6527, done.
remote: Total 6527 (delta 0), reused 0 (delta 0), pack-reused 6527
Receiving objects: 100% (6527/6527), 118.27 MiB | 196.00 KiB/s, done.
Resolving deltas: 100% (484/484), done.
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:52:46-2016.png: Invalid argument
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:53:29-2016.png: Invalid argument
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:54:50-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Mon-Nov-28-06:00:12-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Mon-Nov-28-12:00:21-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Tue-Dec-20-15:25:52-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/so_img/Mon-Nov-28-18:17:23-2016.png: Invalid argument
error: unable to create file 10.selenium/zhifubao/Tue-Apr-25-15:43:25-2017.png: Invalid argument
error: unable to create file 10.selenium/zhifubao/Tue-Apr-25-15:44:29-2017.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-06:00:12-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-12:00:21-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-17:43:13-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/so_img/Mon-Nov-28-18:17:23-2016.png: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_10_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_17_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_24_08:53:01_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar_27_08:53:01_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar__6_18:45:18_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar__6_18:53:02_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Sat_Mar_18_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Thu_Mar__9_08:53:05_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Wed_Mar_29_08:53:04_2017.xlsx: Invalid argument
Checking out files: 100% (8191/8191), done.
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'
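The failing paths all contain `:`, which Windows forbids in file names, so checkout dies with "Invalid argument". For code that writes timestamped screenshots or exports, one sketch of a portable fix is to strip the forbidden characters before saving (the replacement character `-` is an arbitrary choice):

```python
import re

# Characters Windows rejects in file names: : * ? " < > | plus path separators.
WINDOWS_FORBIDDEN = r'[:*?"<>|\\/]'

def sanitize(name: str) -> str:
    """Replace characters that are invalid in Windows file names."""
    return re.sub(WINDOWS_FORBIDDEN, '-', name)

print(sanitize("Wed-Nov-30-15:52:46-2016.png"))  # Wed-Nov-30-15-52-46-2016.png
```

For files already committed, the only real fix is renaming them in the repository; until then, Windows users cannot complete the checkout.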

CNN CAPTCHA model raises checkpoint errors on repeated predictions

I trained a CAPTCHA recognizer with tensorflow_cnn and prediction accuracy is high; the training samples are generated with PHP to closely mimic the target site's CAPTCHA. After training I use the model to log in to the target site. To tolerate CAPTCHA failures, the login logic retries: login runs in a loop, and each attempt reads a fresh CAPTCHA image and calls crack_captcha. When the first prediction is correct, login succeeds immediately. When the first prediction fails, the next call to crack_captcha raises warnings, every later iteration in the loop does too, and each retry produces more of them:
alert('验证码错误');
9006
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_11 not found in checkpoint
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_10 not found in checkpoint
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_12 not found in checkpoint
2017-11-02 10:22:14.371381: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_14 not found in checkpoint
2017-11-02 10:22:14.371404: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_15 not found in checkpoint
2017-11-02 10:22:14.371410: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_13 not found in checkpoint
2017-11-02 10:22:14.372782: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_17 not found in checkpoint
2017-11-02 10:22:14.372955: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_16 not found in checkpoint
2017-11-02 10:22:14.375048: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_18 not found in checkpoint
2017-11-02 10:22:14.375149: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_19 not found in checkpoint
2393
2017-11-02 10:22:15.583424: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_10 not found in checkpoint
2017-11-02 10:22:15.584296: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_11 not found in checkpoint
2017-11-02 10:22:15.584677: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_13 not found in checkpoint
2017-11-02 10:22:15.584754: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_12 not found in checkpoint
2017-11-02 10:22:15.585934: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_14 not found in checkpoint
2017-11-02 10:22:15.586106: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_16 not found in checkpoint
2017-11-02 10:22:15.586465: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_15 not found in checkpoint
2017-11-02 10:22:15.587122: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_17 not found in checkpoint
2017-11-02 10:22:15.587133: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_18 not found in checkpoint
2017-11-02 10:22:15.588150: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_19 not found in checkpoint
2017-11-02 10:22:15.588196: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_20 not found in checkpoint
2017-11-02 10:22:15.589135: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_21 not found in checkpoint
2017-11-02 10:22:15.589325: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_22 not found in checkpoint
2017-11-02 10:22:15.591827: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_23 not found in checkpoint

CAPTCHA RNN

Does the TensorFlow CAPTCHA recognizer use only a CNN, with no RNN?

About the code

Hi, could your crawler code scrape listing data from shopping sites such as JD, Taobao, Pinduoduo, and Tmall?

Changing the generated image width/height breaks CNN CAPTCHA training

In tensorflow_cnn, training fails as soon as the image width/height changes. The code defines the constants IMAGE_HEIGHT and IMAGE_WIDTH, but changing them alone is not enough and training errors out, because the downstream training code does not adapt to the new image size.
Traceback (most recent call last):
File "TensorFlow_cnn_train.py", line 197, in <module>
train_crack_captcha_cnn()
File "TensorFlow_cnn_train.py", line 183, in train_crack_captcha_cnn
, loss = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.75})
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [26,315] vs. [64,315]
[[Node: logistic_loss/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Add_1, _recv_Placeholder_1_0)]]

Caused by op u'logistic_loss/mul', defined at:
File "TensorFlow_cnn_train.py", line 197, in <module>
train_crack_captcha_cnn()
File "TensorFlow_cnn_train.py", line 165, in train_crack_captcha_cnn
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=Y))
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 171, in sigmoid_cross_entropy_with_logits
return math_ops.add(relu_logits - logits * labels,
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/math_ops.py", line 821, in binary_op_wrapper
return func(x, y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1044, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1434, in _mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [26,315] vs. [64,315]
[[Node: logistic_loss/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Add_1, _recv_Placeholder_1_0)]]
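The [26,315] vs [64,315] mismatch is consistent with the fully connected weight being hardcoded for one input size: if w_d is fixed at [8*20*64, 1024], that only matches a 60x160 image after three 2x2 SAME-padded max-pools, and a different image size redistributes the flattened elements into the wrong batch dimension. A sketch of computing the post-pooling dimensions instead of hardcoding them (the three-pool structure is an assumption about tensorflow_cnn's architecture):

```python
import math

def pooled_dims(h: int, w: int, pools: int = 3) -> tuple:
    """Spatial dims after `pools` 2x2 stride-2 max-pools with SAME padding."""
    for _ in range(pools):
        h, w = math.ceil(h / 2), math.ceil(w / 2)
    return h, w

print(pooled_dims(60, 160))  # (8, 20)  -> w_d = [8*20*64, 1024]
print(pooled_dims(48, 128))  # (6, 16)  -> w_d would need [6*16*64, 1024]
```

Deriving w_d's first dimension from pooled_dims(IMAGE_HEIGHT, IMAGE_WIDTH) would let the constants take effect; any saved checkpoint is still tied to the shape it was trained with.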

The w3school spider scrapes nothing

I don't know why my spider returns nothing; the output JSON file is 0 KB. I changed one thing in the spider: from scrapy.spiders import Spider (an error told me to use spiders), and I also renamed log to logging. I can't quite interpret the run output below; corrections would be appreciated.

D:\LZZZZB\w3school>scrapy crawl w3school
2017-06-21 22:33:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: w3school)
2017-06-21 22:33:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'w3school', 'NEWSPIDER_MODULE': 'w3school.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['w3school.spiders']}
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled item pipelines:
['w3school.pipelines.W3SchoolPipeline']
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Spider opened
2017-06-21 22:33:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-21 22:33:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-21 22:33:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 21, 14, 33, 3, 262577),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 6, 21, 14, 33, 3, 252576)}
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Spider closed (finished)
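The log itself is clean but shows 'ROBOTSTXT_OBEY': True and the spider closing after crawling 0 pages. Two common causes worth checking (assumptions, since the spider code isn't shown):

```python
# settings.py (sketch)
ROBOTSTXT_OBEY = False  # Scrapy 1.x project templates enable this; if
                        # robots.txt disallows the paths, every request is
                        # filtered before it is sent -- matching a run that
                        # crawls 0 pages with no error.

# In the spider class itself, also verify that start_urls is non-empty and
# that allowed_domains matches the site being crawled; otherwise
# OffsiteMiddleware silently drops the requests.
```

Running with the -s ROBOTSTXT_OBEY=False flag is a quick way to test the first hypothesis without editing settings.py. Only disable robots.txt compliance if you have decided it is appropriate for your use.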

Business promotion inquiry

Hello, we are a professional IP proxy provider, 极速HTTP. Signing up with verification comes with 10,000 free IPs (your readers are welcome to try them out :). We would like to discuss a commercial promotion partnership. If you are interested, you can reach me on WeChat: 13982004324. Thanks (if not, sorry for the interruption).

How do I route a Python crawler through a VPN?

When writing crawlers I can use http, socks, and similar proxies, but what does using a VPN look like?
I couldn't find a concrete approach by searching; at the moment I switch the VPN manually. Any pointers would be appreciated.
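One common approach, rather than switching a system-wide VPN by hand, is to have the VPN client (or an adapter such as a local SOCKS bridge) expose a proxy port, and point each request at it. A sketch assuming the client listens on 127.0.0.1:1080 (that address is an assumption; check your client's settings), using requests with SOCKS support (pip install requests[socks]):

```python
# socks5h (vs socks5) resolves DNS through the proxy as well,
# so lookups don't leak outside the tunnel.
proxies = {
    "http": "socks5h://127.0.0.1:1080",
    "https": "socks5h://127.0.0.1:1080",
}

# Usage (requires network access and a running proxy):
# import requests
# r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
# print(r.json())  # should show the VPN exit IP, not your own

print(sorted(proxies))  # ['http', 'https']
```

If the VPN offers no proxy port, the remaining options are coarser: route the whole machine (or a network namespace/container) through the tunnel, which is an OS-level configuration rather than something the crawler controls per request.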
