Git Product home page Git Product logo

htmlextractor's Introduction

HtmlExtractor是一个Java实现的基于模板的网页结构化信息精准抽取组件,本身并不包含爬虫功能,但可被爬虫或其他程序调用以便更精准地对网页结构化信息进行抽取。

HtmlExtractor是为大规模分布式环境设计的,采用主从架构,主节点负责维护抽取规则,从节点向主节点请求抽取规则,当抽取规则发生变化,主节点主动通知从节点,从而能实现抽取规则变化之后的实时动态生效。

HtmlExtractor项目打成Jar包后运行在从节点上,而运行在主节点上的War包则是另外一个项目:HtmlExtractorServer

如何使用:

1、运行主节点,负责维护抽取规则:

将项目https://github.com/ysc/HtmlExtractorServer打成War包然后部署到Tomcat

2、获取一个HtmlExtractor的实例(从节点),示例代码如下:

String allExtractRegularUrl = "http://localhost:8080/HtmlExtractorServer/api/all_extract_regular.jsp";
String redisHost = "localhost";
int redisPort = 6379;
HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(allExtractRegularUrl, redisHost, redisPort);

3、抽取信息,示例代码如下:

String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");

int i = 1;
for (ExtractResult extractResult : extractResults) {
    System.out.println((i++) + "、网页 " + extractResult.getUrl() + " 的抽取结果");
    for(ExtractResultItem extractResultItem : extractResult.getExtractResultItems()){
        System.out.print("\t"+extractResultItem.getField()+" = "+extractResultItem.getValue());              
    }
    System.out.println("\tdescription = "+extractResult.getDescription());
    System.out.println("\tkeywords = "+extractResult.getKeywords());
}

htmlextractor's People

Contributors

ysc avatar

Watchers

zcw avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.