html-extractor's Introduction

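Web page body content extraction

This code is a Python implementation of the paper 《基于行块分布函数的通用网页正文抽取》 ("General-purpose web page body extraction based on a line-block distribution function"). The paper is aimed at body extraction for search engines, so it strips every tag element. This implementation keeps the tags instead, to give a better reading experience.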
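
The line-block idea can be illustrated with a minimal sketch. This is not the code in this repository and not a faithful reproduction of the paper: it assumes the input is page text with tags already removed, and the names `block_lengths`, `extract_body_lines`, `block_size`, and `threshold`, along with their default values, are hypothetical choices made for illustration. Each block is a sliding window of consecutive lines, and the body is taken to be the run of dense blocks with the largest total character count.

```python
import re


def block_lengths(lines, block_size=3):
    """Character count (whitespace ignored) of each sliding block of lines."""
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    count = max(len(sizes) - block_size + 1, 0)
    return [sum(sizes[i:i + block_size]) for i in range(count)]


def extract_body_lines(text, block_size=3, threshold=80):
    """Return the non-empty lines covered by the densest run of blocks.

    A "run" is a maximal sequence of consecutive blocks whose length
    exceeds `threshold`; the run with the largest total length wins.
    """
    lines = text.split("\n")
    blocks = block_lengths(lines, block_size)
    best_score, best_start, best_end = 0, 0, 0
    start = None
    # The trailing 0 acts as a sentinel that closes a run reaching the end.
    for i, size in enumerate(blocks + [0]):
        if size > threshold and start is None:
            start = i
        elif size <= threshold and start is not None:
            score = sum(blocks[start:i])
            if score > best_score:
                # Block i-1 covers lines up to index i + block_size - 2.
                best_score, best_start, best_end = score, start, i + block_size - 1
            start = None
    return [ln for ln in lines[best_start:best_end] if ln.strip()]


if __name__ == "__main__":
    page = "\n".join(
        ["Home", "About", "Login", "", "A" * 50, "B" * 50, "C" * 50, "", "Copyright", "Contact"]
    )
    for line in extract_body_lines(page):
        print(line)
```

Short navigation and footer lines produce small blocks and fall outside the selected run, while the long paragraph lines form one dense run. Both parameters would need tuning on real pages.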

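####Features:

  • Keeps the tags of the body content
  • Rewrites resource paths (images, hyperlinks, etc.) as absolute paths, even when the original page uses relative paths, so that resources can always be found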
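
As a rough illustration of the absolute-path feature, the sketch below resolves `src` and `href` attributes against the page URL with the standard library's `urljoin`. It assumes Python 3 and quoted attribute values; the `absolutize` function, the `ATTR_RE` pattern, and the `example.com` URL are hypothetical and are not taken from this repository, which may handle the rewrite differently.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href='...' with a quoted value (illustrative only;
# a real HTML parser would be more robust than a regular expression).
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:src|href))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def absolutize(html, base_url):
    """Rewrite relative src/href values in `html` as absolute URLs."""
    def repl(match):
        absolute = urljoin(base_url, match.group("url"))
        return "{0}={1}{2}{1}".format(match.group("attr"), match.group("q"), absolute)
    return ATTR_RE.sub(repl, html)


if __name__ == "__main__":
    snippet = '<img src="/img/a.png"><a href="news/1.html">x</a>'
    print(absolutize(snippet, "http://example.com/detail/182560.html"))
```

URLs that are already absolute pass through `urljoin` unchanged, so the function can be applied to the whole extracted body.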

####Usage:

from html_body_extractor import BodyExtractor
url = 'http://ballpo.com/detail/182560.html'
be = BodyExtractor(url)
be.execute()
print(be.body)  # the single-argument form runs on both Python 2 and Python 3

####Output:

经纪人承认,尽管拉齐奥前锋凯塔(Keita Balde Diao)刚刚与蓝白军团续约,但来自英超联赛的俱乐部仍旧对他保持着浓厚的兴趣。

“今天,对凯塔感兴趣的俱乐部都知道,要想拉齐奥放走他,你必须拿出一大笔的资金,”经纪人萨维尼(Ulisse Savini)告诉TuttoMercatoWeb.com。“没有人打电话给我,但我们都很清楚:对凯塔感兴趣的俱乐部很多,这一点也不意外。除了利物浦经常在关注他之外,还有曼联。”

最后,经纪人解释道,这名19岁的前巴塞罗那球员需要拿到西班牙的护照才能转投英国踢球,尽管这问题不大。

####TODO:

  • Support custom styles
  • Further improve extraction accuracy

html-extractor's People

Contributors

lzjun
