gsh199449 / spider
A configurable web spider with an easy-to-use web console
License: GNU General Public License v3.0
Cloning the project currently pulls in the IDEA project configuration files. I suggest adding them to .gitignore.
analyzer [index_ansj] not found for field [content]
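I searched for a long time but could not find the spider.war file mentioned in the reference documentation.
Thank you very much for this product; I tried it out yesterday. One suggestion: please add a checkbox to the spider list so that crawl tasks can be published in batches, instead of having to open each template in the editor to publish it. Thanks.
Hello owner, how can webmagic be integrated with Spring?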
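After a task stops because reachMax or exceedRatio was triggered, three numbers disagree with one another: the [valid page count] logged by the CommonSpider onSuccess method, the [crawled count] in the crawl task list, and the [article count] in the site list.
Log
Spider ID 5e21c6bd-4878-413c-b0fa-a46a5c3376ac has processed 31 pages, 6 valid pages, maximum pages to crawl 10, reachMax=false, exceedRatio=true, exiting.
Crawled count
Task name      Crawled count   Crawl status
www.163.com    9               STOP
Article count
Site domain    Article count
www.163.com    7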
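webmagic has released version 0.7; some of its features seem to overlap with what this project already provides.
Hello, when I put your prebuilt war package directly under Tomcat and crawl with your prebuilt template, everything works.
But when I build the project with Eclipse, it starts successfully and the other features work.
Crawling with the same prebuilt template fails with this error: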
23:36:48[WARN com.gs.spider.gather.commons.ContentLengthLimitHttpClientDownloader] download page http://news.qq.com/a/20160418/023093.htm error
javax.net.ssl.SSLKeyException: RSA premaster secret error
at sun.security.ssl.RSAClientKeyExchange.<init>(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.ClientHandshaker.serverHelloDone(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.ClientHandshaker.processMessage(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.Handshaker.processLoop(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.Handshaker.process_record(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source) ~[?:1.8.0_161]
at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source) ~[?:1.8.0_161]
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:394) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:353) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[httpclient-4.5.2.jar:4.5.2]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[httpclient-4.5.2.jar:4.5.2]
at com.gs.spider.gather.commons.ContentLengthLimitHttpClientDownloader.download(ContentLengthLimitHttpClientDownloader.java:112) [classes/:?]
at us.codecraft.webmagic.Spider.processRequest(Spider.java:404) [webmagic-core-0.6.0.jar:?]
at us.codecraft.webmagic.Spider$1.run(Spider.java:321) [webmagic-core-0.6.0.jar:?]
at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74) [webmagic-core-0.6.0.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_161]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_161]
Caused by: java.security.NoSuchAlgorithmException: SunTls12RsaPremasterSecret KeyGenerator not available
at javax.crypto.KeyGenerator.<init>(KeyGenerator.java:169) ~[?:1.8.0_171]
at javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:223) ~[?:1.8.0_171]
at sun.security.ssl.JsseJce.getKeyGenerator(Unknown Source) ~[?:1.8.0_161]
... 28 more
What might be causing this? JDK 1.8, Tomcat 8.5.
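One detail in the trace above: most JDK frames are tagged 1.8.0_161, while the two javax.crypto.KeyGenerator frames are tagged 1.8.0_171. That may mean Eclipse is launching Tomcat with classes from two different JRE installs, but this is only a guess from the trace and is not confirmed anywhere in this thread. A minimal diagnostic sketch follows; the class name JceCheck and the helper keyGeneratorAvailable are made up for illustration. Run inside the same runtime as the failing app, it prints which JRE is in use and whether the algorithm named in the exception can be loaded.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;

// Hypothetical diagnostic helper, not part of the spider project.
final class JceCheck {

    /** Returns true if the running JRE can supply a KeyGenerator for the algorithm. */
    static boolean keyGeneratorAvailable(String algorithm) {
        try {
            KeyGenerator.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Which JRE is actually running this code.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
        // The algorithm name is copied from the "Caused by" line of the trace.
        System.out.println("SunTls12RsaPremasterSecret available: "
                + keyGeneratorAvailable("SunTls12RsaPremasterSecret"));
    }
}
```

If the printed java.home differs between the standalone Tomcat and the Eclipse launch, comparing the two JRE configurations would be a reasonable next step.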
I suggest adding a unified scheduled-task execution module for the current spiders.
"categoryReg": "",
"categoryXPath": "",
"defaultCategory": "体育",
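//I had assumed that leaving the first two fields unspecified would default the category to "体育" (sports). It does not: with the rules above, the resulting category is empty.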
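For reference, the behavior expected above can be sketched as follows. This is not the project's code: the class CategoryFallback and the method resolveCategory are hypothetical names, and the sketch assumes the extracted category arrives as a plain string that is null or blank when neither categoryReg nor categoryXPath matched.

```java
// Hypothetical sketch of the fallback the reporter expected; not taken from the spider source.
final class CategoryFallback {

    /**
     * Returns the extracted category if it has content,
     * otherwise falls back to the configured defaultCategory.
     */
    static String resolveCategory(String extracted, String defaultCategory) {
        if (extracted == null || extracted.trim().isEmpty()) {
            return defaultCategory;
        }
        return extracted.trim();
    }

    public static void main(String[] args) {
        // Empty extraction result, as when categoryReg and categoryXPath are both "".
        System.out.println(resolveCategory("", "体育"));
    }
}
```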
The crawl in the example only handles a single page. How should pagination be handled?
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'commonSpider' defined in URL [file:/C:/Java/IdeaProjects/GatherPlatform/spider-master/target/spider/WEB-INF/classes/mvc-dispatcher-servlet.xml]: Cannot resolve reference to bean 'commonWebpageDAO' while setting bean property 'commonWebpageDAO'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'commonWebpageDAO' defined in file [C:\Java\IdeaProjects\GatherPlatform\spider-master\target\spider\WEB-INF\classes\com\gs\spider\dao\CommonWebpageDAO.class]: Bean instantiation via constructor failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.gs.spider.dao.CommonWebpageDAO]: Constructor threw exception; nested exception is NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{y3YLXLvBRXGPiHARqVc2qQ}{127.0.0.1}{127.0.0.1:9300}]]
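Every time an existing template is updated, a new template ID is created. Is this intentional?
That is not very friendly for editing, especially during testing, when a template may be saved many times.
Also, I have it running now and can crawl a site in minutes. Very nice!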
To fix this, the local ES instance needs two settings added to its
elasticsearch.yml configuration file: network.host: 127.0.0.1
and network.publish_host: 127.0.0.1.
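After changing those settings, it can help to confirm that something is actually listening on the transport address before starting the web app. The sketch below is a plain TCP connect check, assuming the 127.0.0.1:9300 address shown in the NoNodeAvailableException above. The class EsPortCheck and the method canConnect are hypothetical names, and a successful connect only shows that the port is open, not that the cluster is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for a quick reachability check; not part of the spider project.
final class EsPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Address and port taken from the exception message; adjust if ES is configured differently.
        System.out.println("127.0.0.1:9300 reachable: " + canConnect("127.0.0.1", 9300, 1000));
    }
}
```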
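Like LocoySpider (火车头), this tool uses configuration instead of code. Following the documentation, I configured the XPath but still did not get the right data, and the crawler inexplicably jumped to another URL. Could I control which URL it jumps to, and then configure a separate template for the new page?
Could you provide a Docker environment?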
I imported the sample template from the repository, but clicking to store the template still fails with the message:
Please retry java.lang.NullPointerException:null
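Thanks for providing this tool. I downloaded the prebuilt spider.war from Baidu, dropped it into Tomcat and started it, and the page shows the error below. Is some additional configuration needed?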
java.lang.IllegalArgumentException: Non-positive period.
java.util.Timer.schedule(Unknown Source)
com.gs.spider.gather.commons.CommonSpider.<init>(CommonSpider.java:364)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
java.lang.reflect.Constructor.newInstance(Unknown Source)
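The message in this trace comes from java.util.Timer.schedule, which throws IllegalArgumentException when the period argument is zero or negative. That suggests the period passed in the CommonSpider constructor was not positive, possibly because a configuration value was missing or zero in this deployment; that cause is an assumption and is not confirmed in the thread. A minimal sketch that reproduces the exception and shows one possible guard follows. The class TimerPeriodCheck and both method names are hypothetical.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical reproduction and guard; not taken from the spider source.
final class TimerPeriodCheck {

    /** Returns the configured period if it is positive, otherwise the fallback. */
    static long positivePeriodOr(long configuredMillis, long fallbackMillis) {
        return configuredMillis > 0 ? configuredMillis : fallbackMillis;
    }

    /** Returns true if Timer.schedule rejects the given period with IllegalArgumentException. */
    static boolean scheduleRejects(long periodMillis) {
        Timer timer = new Timer(true); // daemon thread, so it never blocks JVM exit
        try {
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // no-op task; only the scheduling call matters here
                }
            }, 0L, periodMillis);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("period 0 rejected:    " + scheduleRejects(0));
        System.out.println("period 1000 rejected: " + scheduleRejects(1000));
        System.out.println("guarded period:       " + positivePeriodOr(0, 60_000));
    }
}
```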
After finishing edits on the edit-template page, clicking Save always lands on /panel/commons/editSpiderInfo. It reproduces every time.
java.lang.NullPointerException
at com.gs.spider.dao.SpiderInfoDAO.getByDomain(SpiderInfoDAO.java:126)
at com.gs.spider.dao.SpiderInfoDAO.index(SpiderInfoDAO.java:51)
at com.gs.spider.service.commons.spiderinfo.SpiderInfoService.lambda$index$2(SpiderInfoService.java:58)
at com.gs.spider.model.utils.ResultBundleBuilder.bundle(ResultBundleBuilder.java:25)
at com.gs.spider.service.commons.spiderinfo.SpiderInfoService.index(SpiderInfoService.java:58)
at com.gs.spider.controller.commons.spiderinfo.SpiderInfoController.save(SpiderInfoController.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
If the data volume is too large, the program is very likely to crash.
webmagic has been upgraded. Should this project follow along?