page-signal

page-signal was originally intended as a Clojure library for extracting the main content text from web pages presumed to be articles, blog posts, and the like. There are many approaches to this problem. This solution is based on the paper "Boilerplate Detection using Shallow Text Features", available here: http://www.l3s.de/~kohlschuetter/boilerplate/

The authors segmented each web page into atomic 'blocks' and annotated each block with features. Machine learning techniques were used to determine the most predictive features and to classify each block as either boilerplate or content. The algorithm was trained on a dataset from 2008, so it may now be outdated.
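To make the block-and-feature idea concrete, here is a minimal Java sketch. It is not the classifier from the paper, from Boilerpipe, or from this repository. Word count and link density are among the shallow features the paper discusses, but the class name, field names, and both thresholds below are assumptions made up for illustration.

```java
/** Illustrative only: names and thresholds are assumptions, not values from the paper. */
final class BlockClassifierSketch {

    /** A text block annotated with two shallow features. */
    static final class Block {
        final int numWords;        // total words in the block
        final int numLinkedWords;  // words that appear inside anchor text

        Block(int numWords, int numLinkedWords) {
            this.numWords = numWords;
            this.numLinkedWords = numLinkedWords;
        }

        /** Fraction of the block's words that sit inside links (0 for an empty block). */
        double linkDensity() {
            return numWords == 0 ? 0.0 : (double) numLinkedWords / numWords;
        }
    }

    // Hypothetical cut-offs chosen for this example, not learned from data.
    static final int MIN_WORDS = 15;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Returns true when the block looks like content, false when it looks like boilerplate. */
    static boolean isContent(Block block) {
        return block.numWords >= MIN_WORDS && block.linkDensity() <= MAX_LINK_DENSITY;
    }
}
```

Under these assumed thresholds, a four-word navigation block made entirely of links is rejected, while a 120-word paragraph containing a couple of linked words is kept. A trained model would replace the hand-picked constants.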

One of the authors wrote an open source Java library called Boilerpipe based on the paper. My original intention was to reimplement the algorithm in Clojure and experiment with additional ad hoc rules to enhance the basic algorithm. I no longer have a need for this functionality, but I am leaving this here in case anybody wants to pick up where I left off.

The algorithm was evaluated against the Google News data set (see the paper), which is in the 'L3S-GN1-20100130203947-00001' directory. Precision, recall, and F1-score were calculated. Precision measures how much of the retrieved text is relevant; it improves as less boilerplate is mistakenly extracted. Recall measures how much of the relevant text has been extracted; it improves as less relevant text is missed. The F1-score is the harmonic mean of precision and recall.
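The three metrics can be written down in a few lines of Java. This is a sketch of the standard definitions, not the evaluation code in this repository. The class and parameter names are assumptions, and returning 0 when a denominator is zero is a convention picked for this example.

```java
/** Standard precision/recall/F1 over word counts; names are hypothetical. */
final class ExtractionMetrics {

    /** Share of the extracted words that also appear in the reference text. */
    static double precision(int sharedWords, int retrievedWords) {
        return retrievedWords == 0 ? 0.0 : (double) sharedWords / retrievedWords;
    }

    /** Share of the reference words that the extractor managed to recover. */
    static double recall(int sharedWords, int relevantWords) {
        return relevantWords == 0 ? 0.0 : (double) sharedWords / relevantWords;
    }

    /** Harmonic mean of precision and recall (0 when both are 0). */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

For example, if an extractor returns 100 words, 80 of which are among the 160 words of the reference text, precision is 0.8, recall is 0.5, and F1 is about 0.615. The scores quoted in this README appear to be on a 0 to 100 scale, which corresponds to multiplying these values by 100.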

The array-of-words model is used for this calculation. The longest common substring algorithm (not to be confused with longest common subsequence) is applied to the words of a text rather than, as is typical, to the characters of a string. This algorithm was written in pure Java for better performance. Attempts to reach performance parity with Java in Clojure failed; the Clojure version was still markedly slower. For more on this issue, see this question on StackOverflow: https://stackoverflow.com/questions/14949705/clojure-performance-for-expensive-algorithms and this discussion in the Clojure users group: https://groups.google.com/forum/#!topic/clojure/byHO-9t6X4U%5B1-25%5D
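As an illustration of longest common substring over words, here is a small dynamic-programming sketch in Java. It is not the implementation shipped in this repository. The class name, the method names, and the whitespace tokenizer are assumptions, and the README does not spell out how the result feeds into precision and recall, so that step is left out.

```java
/** Word-level longest common substring; a sketch, not the repository's code. */
final class WordLcs {

    /** Splits text on runs of whitespace (a simplifying assumption about tokenization). */
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Length, in words, of the longest run of consecutive words shared by a and b. */
    static int longestCommonSubstring(String[] a, String[] b) {
        int best = 0;
        // prev[j] = length of the common run ending at a[i-2] and b[j-1]
        int[] prev = new int[b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            int[] curr = new int[b.length + 1];
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    // Extend the run that ended at the previous word of both inputs.
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
                // On a mismatch curr[j] stays 0: a substring must be contiguous.
            }
            prev = curr;
        }
        return best;
    }
}
```

The contiguity requirement is what separates this from longest common subsequence. For the word arrays of "a b c d" and "a x b c", the longest common substring is "b c" (length 2), whereas the longest common subsequence would be "a b c" (length 3). The sketch runs in time proportional to the product of the two lengths and keeps only two rows of the table in memory.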

For more on evaluation metrics see here: http://tomazkovacic.com/blog/74/evaluation-metrics-for-text-extraction-algorithms/

My implementation got an F1-score of 81.7. This is short of the 93.9 achieved by the authors in the paper, so there is clearly room for improvement. In addition, some files in the data set aren't handled well. There is also code that attempts to pick out the headline, but it is still experimental and doesn't work consistently.
