luyishisi / anti-anti-spider
7.2K 7.2K 2.2K 147.21 MB

More and more websites deploy anti-crawler defenses: some hide key data inside images, others use inhuman CAPTCHAs. This repository collects anti-anti-crawler code, sharpening our skills by (harmlessly) taking on sites with different defenses. (Submissions of hard-to-scrape sites are welcome.) (The project is paused for work reasons.)

Home Page: https://www.urlteam.cn

Python 99.92% Shell 0.08%
Topics: geek, python, spider

anti-anti-spider's People

Contributors

copie · ftlikon · leng-yue · luyishisi · metaspider · p-yl · xuna123


anti-anti-spider's Issues

How do I feed a CAPTCHA image from a real project into the recognizer?

if __name__ == '__main__':
    # text, image = gen_captcha_text_and_image()
    text = '1dxz'
    captcha = 'y.jpg'
    image = Image.open(captcha)
    # image.show()
    image = np.array(image)
    image = convert2gray(image)
    image = image.flatten() / 255
    predict_text = crack_captcha(image)
    print("Expected: {} Predicted: {}".format(text, predict_text))

    # train_crack_captcha_cnn()
    # text, image = gen_captcha_text_and_image()
    # image = convert2gray(image)  # generate a new image
    # image = image.flatten() / 255  # flatten the image to 1-D
    # predict_text = crack_captcha(image)  # load the model and predict
    # print("Expected: {}  Predicted: {}".format(text, predict_text))

Error output:

CAPTCHA image channels: (36, 70, 3)
Max CAPTCHA text length: 4
2017-11-28 22:35:53.116468: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
File "/Users/menggui/Desktop/project/zhuanli_spider/cnipr/other_operate/test_2/tensorflow_cnn.py", line 229, in <module>
predict_text = crack_captcha(image)
File "/Users/menggui/Desktop/project/zhuanli_spider/cnipr/other_operate/test_2/tensorflow_cnn.py", line 209, in crack_captcha
text_list = sess.run(predict, feed_dict={X: [captcha_image], keep_prob: 1})
File "/Users/menggui/.pyenv/versions/env_comm_Ana3-4.3.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/Users/menggui/.pyenv/versions/env_comm_Ana3-4.3.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 36, 70) for Tensor 'Placeholder:0', which has shape '(?, 9600)'
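The ValueError spells out the mismatch: the placeholder wants 9600 features per image (60x160 flattened), but y.jpg is 36x70. A minimal sketch of a fix is to resize the image to the dimensions the network was trained on before flattening; the 60x160 figure here is inferred from the (?, 9600) shape in the error and is an assumption about this particular model.

```python
import numpy as np
from PIL import Image

# Assumed training dimensions: 60 * 160 = 9600, matching Tensor '(?, 9600)'.
IMAGE_HEIGHT, IMAGE_WIDTH = 60, 160

def prepare(img: Image.Image) -> np.ndarray:
    img = img.convert("L")                         # grayscale: drops the 3 RGB channels
    img = img.resize((IMAGE_WIDTH, IMAGE_HEIGHT))  # PIL's resize takes (width, height)
    return np.array(img).flatten() / 255           # flatten to the 9600-feature vector

demo = Image.new("RGB", (70, 36))  # stand-in for the 36x70 y.jpg
vec = prepare(demo)
print(vec.shape)  # (9600,)
```

After this, `crack_captcha(vec)` would receive a vector the placeholder can accept; note that a model trained on 60x160 CAPTCHAs may still predict poorly on images stretched from a different aspect ratio.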

CPU training cannot reach full utilization

A 24-core, 48-thread 2695 v2 never reaches full load; utilization stays around 30%, and each training step takes about 1 second.

With a GPU (GTX 960), each step takes about 0.5 seconds.

Ubuntu 16.04, CUDA 8.0, cuDNN 5.1; TensorFlow was compiled locally.
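One thing worth checking (an assumption, not a confirmed diagnosis): with ~1s steps, the per-op work may be too small to saturate the default thread pools. In TensorFlow 1.x, matching the era of this report, the pools can be sized explicitly via the session config:

```python
# Sketch (TensorFlow 1.x): explicitly size the op-level thread pools.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=48,  # threads used inside a single op
    inter_op_parallelism_threads=4,   # ops executed concurrently
)
# sess = tf.Session(config=config)
```

Small batches also limit parallelism; increasing the batch size often raises CPU utilization more than thread tuning does.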

Avoid special characters in file names

 git clone git@github.com:luyishisi/Anti-Anti-Spider.git
Cloning into 'Anti-Anti-Spider'...
remote: Counting objects: 6527, done.
remote: Total 6527 (delta 0), reused 0 (delta 0), pack-reused 6527
Receiving objects: 100% (6527/6527), 118.27 MiB | 196.00 KiB/s, done.
Resolving deltas: 100% (484/484), done.
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:52:46-2016.png: Invalid argument
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:53:29-2016.png: Invalid argument
error: unable to create file 10.selenium/rewifi/Wed-Nov-30-15:54:50-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Mon-Nov-28-06:00:12-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Mon-Nov-28-12:00:21-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/Tue-Dec-20-15:25:52-2016.png: Invalid argument
error: unable to create file 10.selenium/so_gold/so_img/Mon-Nov-28-18:17:23-2016.png: Invalid argument
error: unable to create file 10.selenium/zhifubao/Tue-Apr-25-15:43:25-2017.png: Invalid argument
error: unable to create file 10.selenium/zhifubao/Tue-Apr-25-15:44:29-2017.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-06:00:12-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-12:00:21-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/Mon-Nov-28-17:43:13-2016.png: Invalid argument
error: unable to create file 3.代码模板/selenium模拟登陆/so_img/Mon-Nov-28-18:17:23-2016.png: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_10_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_17_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Fri_Mar_24_08:53:01_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar_27_08:53:01_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar__6_18:45:18_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Mon_Mar__6_18:53:02_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Sat_Mar_18_08:53:04_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Thu_Mar__9_08:53:05_2017.xlsx: Invalid argument
error: unable to create file 6.爬虫项目源码/17.淘宝关键词采集器/excel/Wed_Mar_29_08:53:04_2017.xlsx: Invalid argument
Checking out files: 100% (8191/8191), done.
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'
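The failing paths all contain `:`, which Windows forbids in file names, so checkout dies with "Invalid argument". For code that writes timestamped screenshots or exports, one sketch of a portable fix is to strip the forbidden characters before saving (the replacement character `-` is an arbitrary choice):

```python
import re

# Characters Windows rejects in file names: : * ? " < > | plus path separators.
WINDOWS_FORBIDDEN = r'[:*?"<>|\\/]'

def sanitize(name: str) -> str:
    """Replace characters that are invalid in Windows file names."""
    return re.sub(WINDOWS_FORBIDDEN, '-', name)

print(sanitize("Wed-Nov-30-15:52:46-2016.png"))  # Wed-Nov-30-15-52-46-2016.png
```

For files already committed, the only real fix is renaming them in the repository; until then, Windows users cannot complete the checkout.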

CNN CAPTCHA model raises checkpoint errors on repeated predictions

I trained a CAPTCHA recognizer with tensorflow_cnn and prediction accuracy is high; the training samples are generated with PHP to closely mimic the target site's CAPTCHA. After training I use the model to log in to the target site. To tolerate CAPTCHA failures, the login logic retries: login runs in a loop, and each attempt reads a fresh CAPTCHA image and calls crack_captcha. When the first prediction is correct, login succeeds immediately. When the first prediction fails, the next call to crack_captcha raises warnings, every later iteration in the loop does too, and each retry produces more of them:
alert('验证码错误');
9006
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_11 not found in checkpoint
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_10 not found in checkpoint
2017-11-02 10:22:14.369586: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_12 not found in checkpoint
2017-11-02 10:22:14.371381: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_14 not found in checkpoint
2017-11-02 10:22:14.371404: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_15 not found in checkpoint
2017-11-02 10:22:14.371410: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_13 not found in checkpoint
2017-11-02 10:22:14.372782: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_17 not found in checkpoint
2017-11-02 10:22:14.372955: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_16 not found in checkpoint
2017-11-02 10:22:14.375048: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_18 not found in checkpoint
2017-11-02 10:22:14.375149: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_19 not found in checkpoint
2393
2017-11-02 10:22:15.583424: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_10 not found in checkpoint
2017-11-02 10:22:15.584296: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_11 not found in checkpoint
2017-11-02 10:22:15.584677: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_13 not found in checkpoint
2017-11-02 10:22:15.584754: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_12 not found in checkpoint
2017-11-02 10:22:15.585934: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_14 not found in checkpoint
2017-11-02 10:22:15.586106: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_16 not found in checkpoint
2017-11-02 10:22:15.586465: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_15 not found in checkpoint
2017-11-02 10:22:15.587122: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_17 not found in checkpoint
2017-11-02 10:22:15.587133: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_18 not found in checkpoint
2017-11-02 10:22:15.588150: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_19 not found in checkpoint
2017-11-02 10:22:15.588196: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_20 not found in checkpoint
2017-11-02 10:22:15.589135: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_21 not found in checkpoint
2017-11-02 10:22:15.589325: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_22 not found in checkpoint
2017-11-02 10:22:15.591827: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key Variable_23 not found in checkpoint

CAPTCHA RNN

Does the TensorFlow CAPTCHA recognizer use only a CNN, with no RNN?

About the code

Hi, could your crawler code scrape listing data from shopping sites such as JD, Taobao, Pinduoduo, and Tmall?

Changing the generated image width/height breaks CNN CAPTCHA training

In tensorflow_cnn, training fails as soon as the image width/height changes. The code defines the constants IMAGE_HEIGHT and IMAGE_WIDTH, but changing them alone is not enough and training errors out, because the downstream training code does not adapt to the new image size.
Traceback (most recent call last):
File "TensorFlow_cnn_train.py", line 197, in <module>
train_crack_captcha_cnn()
File "TensorFlow_cnn_train.py", line 183, in train_crack_captcha_cnn
, loss = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.75})
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [26,315] vs. [64,315]
[[Node: logistic_loss/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Add_1, _recv_Placeholder_1_0)]]

Caused by op u'logistic_loss/mul', defined at:
File "TensorFlow_cnn_train.py", line 197, in <module>
train_crack_captcha_cnn()
File "TensorFlow_cnn_train.py", line 165, in train_crack_captcha_cnn
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=Y))
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 171, in sigmoid_cross_entropy_with_logits
return math_ops.add(relu_logits - logits * labels,
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/math_ops.py", line 821, in binary_op_wrapper
return func(x, y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1044, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1434, in _mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [26,315] vs. [64,315]
[[Node: logistic_loss/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Add_1, _recv_Placeholder_1_0)]]
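The [26,315] vs [64,315] mismatch is consistent with the fully connected weight being hardcoded for one input size: if w_d is fixed at [8*20*64, 1024], that only matches a 60x160 image after three 2x2 SAME-padded max-pools, and a different image size redistributes the flattened elements into the wrong batch dimension. A sketch of computing the post-pooling dimensions instead of hardcoding them (the three-pool structure is an assumption about tensorflow_cnn's architecture):

```python
import math

def pooled_dims(h: int, w: int, pools: int = 3) -> tuple:
    """Spatial dims after `pools` 2x2 stride-2 max-pools with SAME padding."""
    for _ in range(pools):
        h, w = math.ceil(h / 2), math.ceil(w / 2)
    return h, w

print(pooled_dims(60, 160))  # (8, 20)  -> w_d = [8*20*64, 1024]
print(pooled_dims(48, 128))  # (6, 16)  -> w_d would need [6*16*64, 1024]
```

Deriving w_d's first dimension from pooled_dims(IMAGE_HEIGHT, IMAGE_WIDTH) would let the constants take effect; any saved checkpoint is still tied to the shape it was trained with.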

The w3school spider scrapes nothing

I don't know why my spider returns nothing; the output JSON file is 0 KB. I changed one thing in the spider: from scrapy.spiders import Spider (an error told me to use spiders), and I also renamed log to logging. I can't quite interpret the run output below; corrections would be appreciated.

D:\LZZZZB\w3school>scrapy crawl w3school
2017-06-21 22:33:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: w3school)
2017-06-21 22:33:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'w3school', 'NEWSPIDER_MODULE': 'w3school.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['w3school.spiders']}
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-21 22:33:03 [scrapy.middleware] INFO: Enabled item pipelines:
['w3school.pipelines.W3SchoolPipeline']
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Spider opened
2017-06-21 22:33:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-21 22:33:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-21 22:33:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 21, 14, 33, 3, 262577),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 6, 21, 14, 33, 3, 252576)}
2017-06-21 22:33:03 [scrapy.core.engine] INFO: Spider closed (finished)
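The log itself is clean but shows 'ROBOTSTXT_OBEY': True and the spider closing after crawling 0 pages. Two common causes worth checking (assumptions, since the spider code isn't shown):

```python
# settings.py (sketch)
ROBOTSTXT_OBEY = False  # Scrapy 1.x project templates enable this; if
                        # robots.txt disallows the paths, every request is
                        # filtered before it is sent -- matching a run that
                        # crawls 0 pages with no error.

# In the spider class itself, also verify that start_urls is non-empty and
# that allowed_domains matches the site being crawled; otherwise
# OffsiteMiddleware silently drops the requests.
```

Running with the -s ROBOTSTXT_OBEY=False flag is a quick way to test the first hypothesis without editing settings.py. Only disable robots.txt compliance if you have decided it is appropriate for your use.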

Business promotion inquiry

Hello, we are a professional IP proxy provider, 极速HTTP. Signing up with verification comes with 10,000 free IPs (your readers are welcome to try them out :). We would like to discuss a commercial promotion partnership. If you are interested, you can reach me on WeChat: 13982004324. Thanks (if not, sorry for the interruption).

How do I route a Python crawler through a VPN?

When writing crawlers I can use http, socks, and similar proxies, but what does using a VPN look like?
I couldn't find a concrete approach by searching; at the moment I switch the VPN manually. Any pointers would be appreciated.
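One common approach, rather than switching a system-wide VPN by hand, is to have the VPN client (or an adapter such as a local SOCKS bridge) expose a proxy port, and point each request at it. A sketch assuming the client listens on 127.0.0.1:1080 (that address is an assumption; check your client's settings), using requests with SOCKS support (pip install requests[socks]):

```python
# socks5h (vs socks5) resolves DNS through the proxy as well,
# so lookups don't leak outside the tunnel.
proxies = {
    "http": "socks5h://127.0.0.1:1080",
    "https": "socks5h://127.0.0.1:1080",
}

# Usage (requires network access and a running proxy):
# import requests
# r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
# print(r.json())  # should show the VPN exit IP, not your own

print(sorted(proxies))  # ['http', 'https']
```

If the VPN offers no proxy port, the remaining options are coarser: route the whole machine (or a network namespace/container) through the tunnel, which is an OS-level configuration rather than something the crawler controls per request.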
