Gworm

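Gworm is a Java library that extracts specific parts of the data at a given URL and returns them as JSON. Example uses: fetching search results from an e-commerce platform, fetching blog content, or building a DIY API for a site that offers none by extracting data from its HTML.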

version 0.8

Adds concurrency support, a URL generator, and a new CookieManager.

Usage

This library depends on Jsoup and Dom4j.

// Initialize the GwormBox singleton
GwormBox gwormBox = GwormBox.getInstance();
// Initialize the request parameters
RequestProperties rp = RequestProperties.getInstance();
rp.initProperties(new FileInputStream(new File("request.properties")));
// Add a crawl rule; amazonKey maps to the path of the rule file, amazon.xml
gwormBox.addWormConfigPath("amazonKey", "amazon.xml");
// Returns the data (as JSON) extracted from http://www.amazon.cn/s/ref=nb_sb_noss_2?field-keywords=算法
// The last argument, "amazonSearch", matches the url id in amazon.xml
String json = gwormBox.getJson("amazonKey", "http://www.amazon.cn/s/ref=nb_sb_noss_2?field-keywords=算法", "amazonSearch");
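
The example passes a URL containing a raw non-ASCII keyword. Whether Gworm encodes the URL internally is not stated here, so it may be safer to percent-encode the keyword before building the URL. The following is a minimal sketch using only the JDK (Java 10 or later); `SearchUrlBuilder` and `build` are hypothetical names for illustration, not part of Gworm.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Gworm.
final class SearchUrlBuilder {
    // Base of the search URL used in the usage example.
    static final String BASE = "http://www.amazon.cn/s/ref=nb_sb_noss_2?field-keywords=";

    // Appends the keyword, form-encoded as UTF-8, to the base URL.
    static String build(String keyword) {
        return BASE + URLEncoder.encode(keyword, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The keyword from the usage example, written with Unicode escapes.
        System.out.println(build("\u7b97\u6cd5"));
        // prints http://www.amazon.cn/s/ref=nb_sb_noss_2?field-keywords=%E7%AE%97%E6%B3%95
    }
}
```

Note that `URLEncoder` applies form encoding, so a space in the keyword becomes `+`.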

Rule file

amazon.xml looks like this:

<?xml version="1.0" encoding="UTF-8" ?>   
<gworm>   
	<url id = "amazonSearch">   
	    <array id = "amazonSearchArray"  rule = "#rightResultsATF .s-item-container" >
	        <object>
	            <value id = "productName" rule = "h2" get = "text" />
	            <value id = "productPrice" rule = ".a-color-price" get = "text" />
	            <value id = "productImg" rule = "img" get = "attr src" />
	            <value id = "productUrl" rule = "a" get = "attr href" />
	        </object>
	    </array>
	</url>
</gworm> 
		

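All rules must be written inside the gworm tag. The next level down is the url tag, whose id distinguishes URLs that carry different kinds of content. Below that comes an array or object tag: use array to extract a list of items and object to extract a single item. array and object tags can be nested inside each other, and their id attribute may be omitted. Last comes the value tag, one per extracted field. The rule attribute holds a CSS selector; the object tag may omit it. The get attribute states how the data is extracted, in one of three ways: text extracts the text of the Elements matched by the CSS selector; attr followed by an attribute name extracts the value of that attribute from the matched Elements; html extracts the matched Elements themselves.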
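
To make the three forms of the get attribute concrete, here is a minimal sketch of how such a value could be split into a mode and an optional attribute name. It is an illustration only and not Gworm's actual parser; `GetSpec`, `Mode`, and `parse` are hypothetical names.

```java
// Hypothetical illustration, not Gworm's implementation.
final class GetSpec {
    enum Mode { TEXT, ATTR, HTML }

    final Mode mode;
    final String attrName; // null unless mode is ATTR

    private GetSpec(Mode mode, String attrName) {
        this.mode = mode;
        this.attrName = attrName;
    }

    // Accepts "text", "html", or "attr <name>" as used in the rule file.
    static GetSpec parse(String get) {
        String trimmed = get.trim();
        if (trimmed.equals("text")) {
            return new GetSpec(Mode.TEXT, null);
        }
        if (trimmed.equals("html")) {
            return new GetSpec(Mode.HTML, null);
        }
        if (trimmed.startsWith("attr ")) {
            String name = trimmed.substring("attr ".length()).trim();
            if (!name.isEmpty()) {
                return new GetSpec(Mode.ATTR, name);
            }
        }
        throw new IllegalArgumentException("Unsupported get value: " + get);
    }

    public static void main(String[] args) {
        GetSpec spec = parse("attr src");
        System.out.println(spec.mode + " " + spec.attrName); // prints ATTR src
    }
}
```

With Jsoup, these modes would presumably correspond to `Element.text()`, `Element.attr(name)`, and `Element.html()` or `outerHtml()`, though which call Gworm uses for html is an assumption.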

Request parameters

request.properties looks like this:

Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding=gzip, deflate, sdch
Accept-Language=zh-CN,zh;q=0.8
Connection=keep-alive
User-Agent=Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36
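
Each line of this file is a key=value pair naming an HTTP request header. As a rough sketch of what that means in practice, the snippet below reads such a file with `java.util.Properties` and applies the entries to a `URLConnection`. How Gworm's RequestProperties handles the file internally is not documented here, so `HeaderLoader` and its methods are hypothetical and for illustration only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Gworm.
final class HeaderLoader {
    // Parses key=value lines into a sorted map of header names to values.
    static Map<String, String> load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> headers = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            headers.put(name, props.getProperty(name));
        }
        return headers;
    }

    // Sets every loaded header on a connection that has not been opened yet.
    static void apply(Map<String, String> headers, URLConnection connection) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            connection.setRequestProperty(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        // Two lines taken from the request.properties example.
        String sample = "Accept-Language=zh-CN,zh;q=0.8\nConnection=keep-alive\n";
        load(new StringReader(sample))
                .forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Only the first `=` on a line separates key from value, so header values such as `zh-CN,zh;q=0.8` are preserved intact.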

Example link

Uses the configuration above to fetch JSON data for an Amazon search for chocolate.
