raawaa / jav-scrapy
A workhorse for batch-scraping AV magnet links or covers.
The main site loads fine in a browser, but the scraping run fails with: fetch failed: ETIMEOUT
Why can't the main site address, www.javbus.com, be used?
error: unknown option `--nomag'
error: unknown option `-n'
Running this on my Linode VPS (Ubuntu x64) produces an error. I'm not an IT person, so any guidance is appreciated. Thanks!
root@ubuntu:~/jav-scrapy# jav -s ipz-634 -o ~/magnet.txt
/root/jav-scrapy/jav.js:16
const baseUrl = 'https://www.javbus.me';
^^^^^
SyntaxError: Use of const in strict mode.
at Module._compile (module.js:439:25)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:902:3
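Magnet links whose name ends in .mp4 are usually the original release, without a casino watermark or a certain site's watermark. Could a filter rule be added?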
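One way such a filter could look is sketched below. It assumes each magnet entry is an object with a `name` and a `link` field; that shape, the function name `filterMp4Magnets`, and the sample data are all hypothetical and may not match what jav.js actually builds.

```javascript
// Sketch only: keeps entries whose display name ends in ".mp4".
// The { name, link } shape is an assumption, not taken from jav.js.
function filterMp4Magnets(magnets) {
    return magnets.filter(function (m) {
        return typeof m.name === 'string' && /\.mp4$/i.test(m.name.trim());
    });
}

// Made-up sample entries for illustration.
const sampleMagnets = [
    { name: 'SAMPLE-001.mp4', link: 'magnet:?xt=urn:btih:aaa' },
    { name: 'SAMPLE-001-C', link: 'magnet:?xt=urn:btih:bbb' },
    { name: 'sample-001.MP4 ', link: 'magnet:?xt=urn:btih:ccc' }
];

const mp4Only = filterMp4Magnets(sampleMagnets);
console.log(mp4Only.map(function (m) { return m.link; }));
```

The match is case-insensitive and ignores trailing whitespace, since scraped names are not guaranteed to be clean.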
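The small-cover download method has been removed (see #23 for that method). In getItemPage, right after the call getItemMagnet(link, meta, callback);, I added a feature that downloads the video snapshots for the current video code:
// Collect all snapshot links
var snapshots = []
$('a.sample-box').each(function (i, e) {
    let $e = $(e);
    snapshots.push($e.attr("href"))
})
getSnapshots(link, snapshots);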
The getSnapshots method:
function getSnapshots(link, snapshots) {
    // Example snapshot URL:
    // https://pics.dmm.co.jp/digital/video/118abp00454/118abp00454jp-1.jpg
    for (var i = 0; i < snapshots.length; i++) {
        getSnapshot(link, snapshots[i])
    }
}
The getSnapshot method:
function getSnapshot(link, snapshotLink) {
    let fanhao = link.split('/').pop();
    let itemOutput = output + "/" + fanhao
    mkdirp.sync(itemOutput);
    let snapshotName = snapshotLink.split('/').pop();
    let fileFullPath = path.join(itemOutput, snapshotName)
    // Download only if the file does not exist yet
    fs.access(fileFullPath, fs.F_OK, function (err) {
        if (err) {
            // Write to a .part file first, then rename once the download ends
            var snapshotFileStream = fs.createWriteStream(fileFullPath + '.part');
            var finished = false;
            request.get(snapshotLink)
                .on('end', function () {
                    if (!finished) {
                        fs.renameSync(fileFullPath + '.part', fileFullPath);
                        finished = true;
                        // console.error(('[' + fanhao + ']').green.bold.inverse + '[snapshot]'.yellow.inverse, fileFullPath);
                    }
                })
                .on('error', function (err) {
                    if (!finished) {
                        finished = true;
                        // console.error(('[' + fanhao + ']').red.bold.inverse + '[snapshot]'.yellow.inverse, err.message.red);
                        errorCount++;
                    }
                })
                .pipe(snapshotFileStream);
        } else {
            // console.log(('[' + fanhao + ']').green.bold.inverse + '[snapshot]'.yellow.inverse, 'file already exists, skip!'.yellow);
        }
    })
}
I'm not very familiar with Node.js, so the methods I added are adapted from the author's original ones.
y> jav -b http://www.javbus.in/genre/28
========== Resource site: https://www.3ubdxu00l1lkcjoz5n.com/ ==========
Parallel connections: 2 Connection timeout: 30 s
Magnet link save location: C:\Users\taw\magnets
Proxy server: none
Failed to fetch page 1: connect ETIMEDOUT 67.228.126.62:80
...attempt 2...
Failed to fetch page 1: connect ETIMEDOUT 67.228.126.62:80
...attempt 3...
Failed to fetch page 1: connect ETIMEDOUT 67.228.126.62:80
...attempt 4...
URL: https://www.javbus5.com/
[ATOM-262] connect ETIMEDOUT 93.46.8.89:443
jav-scrapy/jav.js:370
mag_sizes = _.orderBy(mag_sizes, 'size', 'desc');
^
TypeError: _.orderBy is not a function
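_.orderBy exists only in lodash 4.0.0 and later (3.x called it _.sortByOrder), so one possible cause is an older lodash being installed; the report does not state the version, so that is a guess. As a dependency-free sketch, the same descending sort can be written with Array.prototype.sort. The helper name `sortBySizeDesc` is hypothetical, and it assumes `size` is already numeric, which the traceback does not confirm.

```javascript
// Sketch only: native replacement for _.orderBy(mag_sizes, 'size', 'desc').
// Assumes each entry has a numeric `size` field (an assumption).
function sortBySizeDesc(items) {
    // slice() copies the array so the caller's array is not reordered
    return items.slice().sort(function (a, b) {
        return b.size - a.size;
    });
}

// Made-up sample data for illustration.
const demoSizes = [{ size: 700 }, { size: 4300 }, { size: 1500 }];
const sortedSizes = sortBySizeDesc(demoSizes);
console.log(sortedSizes.map(function (m) { return m.size; }));
```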
iOSdeMac-mini:jav-scrapy Jafar$ npm link
npm ERR! Darwin 16.0.0
npm ERR! argv "/usr/local/bin/node" "/usr/local/bin/npm" "link"
npm ERR! node v7.2.1
npm ERR! npm v3.10.10
npm ERR! path /Users/ios/jav-scrapy
npm ERR! code EACCES
npm ERR! errno -13
npm ERR! syscall symlink
npm ERR! Error: EACCES: permission denied, symlink '/Users/ios/jav-scrapy' -> '/usr/local/lib/node_modules/jav-scarpy'
npm ERR! { Error: EACCES: permission denied, symlink '/Users/ios/jav-scrapy' -> '/usr/local/lib/node_modules/jav-scarpy'
npm ERR! errno: -13,
npm ERR! code: 'EACCES',
npm ERR! syscall: 'symlink',
npm ERR! path: '/Users/ios/jav-scrapy',
npm ERR! dest: '/usr/local/lib/node_modules/jav-scarpy' }
npm ERR!
npm ERR! Please try running this command again as root/Administrator.
npm ERR! Please include the following file with any support request:
npm ERR! /Users/ios/jav-scrapy/npm-debug.log
iOSdeMac-mini:jav-scrapy Jafar$ jav
-bash: jav: command not found
iOSdeMac-mini:jav-scrapy Jafar$ jav -h
-bash: jav: command not found
iOSdeMac-mini:jav-scrapy Jafar$ jav --help
-bash: jav: command not found
iOSdeMac-mini:jav-scrapy Jafar$ cd
iOSdeMac-mini:~ Jafar$ jav --help
-bash: jav: command not found
iOSdeMac-mini:~ Jafar$ ls
Applications Documents Library Music Public ijkplayer-ios
Desktop Downloads Movies Pictures bin jav-scrapy
iOSdeMac-mini:~ Jafar$ cd jav-scrapy/
iOSdeMac-mini:jav-scrapy Jafar$ jav --help
-bash: jav: command not found
iOSdeMac-mini:jav-scrapy Jafar$
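How do I fix this?
Both the JSON and the TXT output contain only a single magnet link.
When I run the last step, sudo npm link, it says the command cannot be found. What is the problem?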
Node: v5.10.0
Probably because the site's page structure changed, script (shown in the screenshot) ends up empty.
The next statement, let meta = parse(script);, then calls parse:
function parse(script) {
    let gid_r = /gid\s+=\s+(\d+)/g.exec(script); // script is empty, so this is null
    let gid = gid_r[1]; // ...and this line then throws
    let uc_r = /uc\s+=\s(\d+)/g.exec(script);
    let uc = uc_r[1];
    let img_r = /img\s+=\s+\'(\http.+\.jpg)/g.exec(script);
    let img = img_r[1];
    return {
        gid: gid,
        img: img,
        uc: uc,
        lang: 'zh'
    };
}
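A defensive variant of parse could check each match before indexing into it, so an empty or changed page yields null instead of a TypeError. This is only a sketch: the name `safeParse` is hypothetical, the regular expressions are copied from the function above, and the caller would still need to decide what to do with a null result (skip the item, retry, or log it).

```javascript
// Sketch only: null-safe version of the parse() shown above.
// Returns null when the script text is missing or any field is not found.
function safeParse(script) {
    if (typeof script !== 'string' || script.length === 0) {
        return null;
    }
    const gid_r = /gid\s+=\s+(\d+)/.exec(script);
    const uc_r = /uc\s+=\s(\d+)/.exec(script);
    const img_r = /img\s+=\s+'(http.+\.jpg)/.exec(script);
    if (!gid_r || !uc_r || !img_r) {
        return null;
    }
    return {
        gid: gid_r[1],
        img: img_r[1],
        uc: uc_r[1],
        lang: 'zh'
    };
}

// Made-up inputs for illustration.
const emptyResult = safeParse('');
const okResult = safeParse("var gid = 123; var uc = 0; var img = 'http://example.com/a.jpg';");
console.log(emptyResult, okResult);
```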
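After a fresh OS install a while ago, I reinstalled this tool, but it now throws an error at runtime. This never happened on Ubuntu 18.04, so I don't know whether javbus changed its page structure or my own system is at fault. The current Node.js version is 10.19.0. Any advice is appreciated; the error output follows: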
$ jav -s KTB -p 20 -o /home/twinfish/movie/Japan/foto/KTB -x http://127.0.0.1:8118
========== Resource site: https://www.javbus.com/ ==========
Parallel connections: 20 Connection timeout: 30 s
Magnet link save location: /home/twinfish/movie/Japan/foto/KTB
Proxy server: http://127.0.0.1:8118
Processing the following video codes...
KTB-054,KTB-053,KTB-052,KTB-051,KTB-050,KTB-049,KTB-048,KTB-047,KTB-046,KTB-045,KTB-044,KTB-043,KTB-042,KTB-041,KTB-040,KTB-039,KTB-038,KTB-037,KTB-036,KTB-035,KTB-034,KTB-033,KTB-032,KTB-031,KTB-030,KTB-029,KTB-028,KTB-026,KTB-027,KTB-025
/media/data2/download/software/net/jav-scrapy/jav.js:204
let img = img_r[1];
^
TypeError: Cannot read property '1' of null
at parse (/media/data2/download/software/net/jav-scrapy/jav.js:204:20)
at Request._callback (/media/data2/download/software/net/jav-scrapy/jav.js:234:28)
at Request.self.callback (/media/data2/download/software/net/jav-scrapy/node_modules/request/request.js:185:22)
at Request.emit (events.js:198:13)
at Request.<anonymous> (/media/data2/download/software/net/jav-scrapy/node_modules/request/request.js:1161:10)
at Request.emit (events.js:198:13)
at IncomingMessage.<anonymous> (/media/data2/download/software/net/jav-scrapy/node_modules/request/request.js:1083:12)
at Object.onceWrapper (events.js:286:20)
at IncomingMessage.emit (events.js:203:15)
at endReadableNT (_stream_readable.js:1145:12)
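Baidu seems to have a partnership with CloudFlare; apparently every site accelerated through CloudFlare asks for a CAPTCHA when accessed from a mainland China IP.
I'm not sure whether a GUI like this would end up being abused or passed around. It isn't well written anyway: I've forgotten all my VC, so I built it in EPL (Easy Programming Language), which I had on hand. I'll port it to VC and upload it when I have time. For now I don't plan to release it; anyone who needs it can compile the .e file themselves. Please don't redistribute it (Pull #36).
Building on the author's code, the GUI adds some extras: conveniently scraping by one of star, label, studio, series, or genre; switching between censored and uncensored; changing the domain to get around blocking; reading and writing the configuration; and so on.
I ran into the following error: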
D:\Develop\nodejs\jav-scrapy\jav.js:200
let gid = gid_r[1];
^
TypeError: Cannot read property '1' of null
at parse (D:\Develop\nodejs\jav-scrapy\jav.js:200:18)
at Request._callback (D:\Develop\nodejs\jav-scrapy\jav.js:235:20)
at Request.self.callback (D:\Develop\nodejs\jav-scrapy\node_modules\request\request.js:200:22)
at emitTwo (events.js:87:13)
at Request.emit (events.js:172:7)
at Request.<anonymous> (D:\Develop\nodejs\jav-scrapy\node_modules\request\request.js:1067:10)
at emitOne (events.js:82:20)
at Request.emit (events.js:169:7)
at IncomingMessage.<anonymous> (D:\Develop\nodejs\jav-scrapy\node_modules\request\request.js:988:12)
at emitNone (events.js:72:20)
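From the symptoms, this looks like a file stream write problem: the cover file is closed before it has been fully written, while the thumbnails download normally.
The domain is frequently blocked.
Running it directly fails to connect.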
$ jav -o ~/dir1/dir2/dir3/magnets.txt
Error:
Error: ENOENT: no such file or directory, open '/home/raawaa/dir1/dir2/dir3/magnets.txt'
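The ENOENT here is consistent with the parent directories (dir1/dir2/dir3) not existing when the output file is opened. A minimal sketch of one way to avoid it is to create the parent directory before opening the file. It uses fs.mkdirSync with the recursive option (available in Node.js 10.12 and later); on older versions the mkdirp package would serve the same purpose. The helper name `openOutputFile` and the temporary demo paths are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch only: create missing parent directories, then open the file for appending.
function openOutputFile(filePath) {
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    return fs.openSync(filePath, 'a');
}

// Demo in a throwaway temp directory so nothing in the home directory is touched.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'jav-out-'));
const demoFile = path.join(demoRoot, 'dir1', 'dir2', 'dir3', 'magnets.txt');
const demoFd = openOutputFile(demoFile);
fs.writeSync(demoFd, 'example line\n');
fs.closeSync(demoFd);
console.log(fs.existsSync(demoFile));
```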
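As the title says.
Running jav -l 500 (a value larger than the number of items on the first page) only parses the first page. The problem seems to have been introduced by your commit "修正控制流上的小问题" ("Fix minor control-flow issues"); reverting it fixes the behavior, but I couldn't work out the cause.
When the number of videos left to process is smaller than the number of videos on the current page, a network request error while fetching a magnet link or cover makes the tool skip the rest of the current page and move straight on to the next page.
For example, with jav -l 2 the tool first scrapes two videos from page 1. If one of them fails, it ends processing of that page immediately and starts scraping two videos from the next page.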
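========== Resource site: http://www.javbus.in ==========
Parallel connections: 2 Connection timeout: 1 s
Magnet link save location: /home/raawaa/magnets.txt
Fetching video links on page 1 ( http://www.javbus.in )...
Processing the following video codes...
HIHL-012,INBA-004
===== Page 1 done =====
Fetching video links on page 2 ( http://www.javbus.in/page/2 )...
Processing the following video codes...
HVG-021,RABS-015
Overall progress (1/2): [=========================-------------------------]
===== Page 2 done =====
Fetching video links on page 3 ( http://www.javbus.in/page/3 )...
Processing the following video codes...
RBD-724,RBD-721
Overall progress (2/2): [==================================================]
Scraped 2 magnet links; this run is complete, waiting for the other crawlers to come home...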
As a result, many videos are skipped without ever being scraped.
jav-scrapy/jav.js:203
let img = img_r[1];
^
TypeError: Cannot read property '1' of null
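It keeps showing the error below. What is the problem?
========== Resource site: https://www.3ubdxu00l1lkcjoz5n.com/ ==========
Parallel connections: 10 Connection timeout: 3 s
Magnet link save location: /Users/qihangchuangfu/magnets
Proxy server: http://127.0.0.1:1087
Fetching video links on page 1 ( https://www.3ubdxu00l1lkcjoz5n.com/ )...
Processing the following video codes...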
NKKD-124,BBAN-230,HUNTA-591,AP-653,BBAN-231,BBAN-229,BBAN-228,HUNTA-589,HUNTA-592,BBSS-019,BBAN-227,AP-654,NKKD-127,FIV-040,DNW-032,KFNE-015,BURI-003,SCR-216,TUE-088,DPGD-003,KRU-038,MMB-234,KRU-021,SGA-128,ABP-856,JUY-845,JUY-848,JUY-838,JUY-843,JUY-841
[BBAN-229][snapshot] tunneling socket could not be established, statusCode=500
[BBAN-229][snapshot] tunneling socket could not be established, statusCode=500
[BBAN-229][snapshot] tunneling socket could not be established, statusCode=500
[BBAN-231][snapshot] Client network socket disconnected before secure TLS connection was established
[BBAN-231][snapshot] Client network socket disconnected before secure TLS connection was established
[BBAN-231][snapshot] Client network socket disconnected before secure TLS connection was established
[BBAN-231][snapshot] Client network socket disconnected before secure TLS connection was established
[BBAN-231][snapshot] Client network socket disconnected before secure TLS connection was established
[BBAN-229] read ECONNRESET
[BBAN-229][snapshot] read ECONNRESET
[BBAN-229][snapshot] read ECONNRESET
[BBAN-229][snapshot] read ECONNRESET
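This is my runtime environment.
After running npm install, I get the warning npm WARN package.json [email protected] No repository field.
When I run jav -h, Windows Script Host pops up an error:
Script: G:/jav_scrapy/jav.js
Line: 1
Char: 1
Error: Invalid character
Code: 800A03F6
Source: Microsoft JScript compilation error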
What causes this? I couldn't find a suitable solution on Google.
TypeError: Cannot read properties of null (reading '1')
at parse (D:!Software\jav-scrapy-0.7.0\jav.js:204:20)
at Request._callback (D:!Software\jav-scrapy-0.7.0\jav.js:235:28)
at Request.self.callback (D:!Software\jav-scrapy-0.7.0\node_modules\request\request.js:185:22)
at Request.emit (node:events:390:28)
at Request.<anonymous> (D:!Software\jav-scrapy-0.7.0\node_modules\request\request.js:1154:10)
at Request.emit (node:events:390:28)
at IncomingMessage.<anonymous> (D:!Software\jav-scrapy-0.7.0\node_modules\request\request.js:1076:12)
at Object.onceWrapper (node:events:509:28)
at IncomingMessage.emit (node:events:402:35)
at endReadableNT (node:internal/streams/readable:1343:12)
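This is probably caused by failing to fetch the image, right? Also, there doesn't seem to be an option to skip downloading images.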
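Background:
function getItemCover(link, meta, done) {
    var fanhao = link.split('/').pop();
    var filename = fanhao + 'l.jpg';
I changed this to
function getItemCover(link, meta, done) {
    var fanhao = link.split('/').pop();
    var filename = meta.title + '.jpg';
so that the cover image is saved with the video title as its file name.
However, some video titles contain the / character, which breaks the file naming,
for example MUDR-034 in the screenshot.
I'd like to ask the author: is there a way to strip the / character from the title, or replace it with a . or a space or some other character that is valid in a file name?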
Edit the JS file (on Windows 10 the path is C:\Users\XXX\AppData\Roaming\npm\node_modules\jav-scarpy\jav.js):
Change
function getItemCover(link, meta, done) {
    var fanhao = link.split('/').pop();
    var filename = fanhao + 'l.jpg';
to
function getItemCover(link, meta, done) {
    var fanhao = link.split('/').pop();
    var temptitle = meta.title;
    var shorttitle = temptitle.substring(0, 200);
    var finaltitle = shorttitle.replace(/[\/*?:"|<>]/g, '');
    var filename = finaltitle + ' -' + meta.actress + '['+ meta.date + '].jpg';
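The same idea can be pulled into a small standalone helper, sketched here under a few assumptions: the name `sanitizeFilename` and the 200-character default are illustrative, and the character set is the one Windows rejects in file names, which also includes the backslash that the pattern above leaves out. Replacing with a space rather than deleting keeps words from running together.

```javascript
// Sketch only: replace characters that are invalid in Windows file names,
// collapse the resulting whitespace, and cap the length.
function sanitizeFilename(title, maxLength) {
    const limit = maxLength || 200;
    return String(title)
        .replace(/[\\\/:*?"<>|]/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .substring(0, limit);
}

// Made-up title for illustration.
const cleanedTitle = sanitizeFilename('Part 1/2: "Title"?');
console.log(cleanedTitle); // Part 1 2 Title
```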
jav -l 50 -p 10 -t 3000
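========== Resource site: http://www.javbus.me ==========
Parallel connections: 10 Connection timeout: 3 s
Magnet link save location: C:\Users\shengjianbin\magnets
Proxy server: none
Fetching video links on page 1 ( http://www.javbus.me )...
Processing the following video codes...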
ABP-435,ABP-439,CHN-099,ABP-437,ABP-438,ABP-436,JUX-790,SNIS-602,PGD-846,GVG-257,GVG-256,GVG-255,GVG-259,GVG-258,GVG-261,RVG-020,GVG-260,DAVK-001,RBD-744,BF-437,BF-438,BF-439,BF-436,ATID-265,ADN-088,ADN-086,ATID-266,SNIS-597,SNIS-598,SNIS-599
D:\Develop\nodejs\GitHub\jav-scrapy\jav.js:204
let img = img_r[1];
^
TypeError: Cannot read property '1' of null
at parse (D:\Develop\nodejs\GitHub\jav-scrapy\jav.js:204:18)
at Request._callback (D:\Develop\nodejs\GitHub\jav-scrapy\jav.js:235:20)
at Request.self.callback (D:\Develop\nodejs\GitHub\jav-scrapy\node_modules\request\request.js:199:22)
at emitTwo (events.js:87:13)
at Request.emit (events.js:172:7)
at Request.<anonymous> (D:\Develop\nodejs\GitHub\jav-scrapy\node_modules\request\request.js:1036:10)
at emitOne (events.js:82:20)
at Request.emit (events.js:169:7)
at IncomingMessage.<anonymous> (D:\Develop\nodejs\GitHub\jav-scrapy\node_modules\request\request.js:963:12)
at emitNone (events.js:72:20)