article-extraction-benchmark's Introduction

Article extraction benchmark: open-source libraries and commercial services

We evaluate the quality of article body extraction for the commercial services Zyte Automatic Extraction (ours, listed as AutoExtract in the tables below) and Diffbot, and for the open-source libraries newspaper3k, readability-lxml, dragnet, boilerpipe, html-text, trafilatura, go-readability, Readability.js, Go-DomDistiller, news-please, Goose3, inscriptis, html2text, jusText and BeautifulSoup. We release evaluation datasets and scripts, and provide more details in a whitepaper.

Article extraction is the task of extracting certain fields of an article (e.g. a news story or blog post), such as the headline, article body, publication date and authors. Article extraction systems must work on any website. Here we evaluate only the article body field, as it is one of the most important fields and one of the hardest to get right.

Results of the initial evaluation, done in November 2019:

                  version   F1             precision      recall         accuracy
AutoExtract       Nov 2019  0.970 ± 0.005  0.984 ± 0.002  0.956 ± 0.010  0.470 ± 0.037
Diffbot           Nov 2019  0.951 ± 0.010  0.958 ± 0.009  0.944 ± 0.013  0.348 ± 0.038
boilerpipe        ab3694d   0.860 ± 0.016  0.850 ± 0.016  0.870 ± 0.020  0.006 ± 0.006
dragnet           1b65e7b   0.907 ± 0.014  0.925 ± 0.013  0.889 ± 0.019  0.221 ± 0.030
html-text         0.5.1     0.665 ± 0.015  0.500 ± 0.017  0.994 ± 0.001  0.000 ± 0.000
newspaper3k       0.2.8     0.912 ± 0.014  0.917 ± 0.014  0.906 ± 0.018  0.260 ± 0.032
readability-lxml  0.7.1     0.922 ± 0.014  0.913 ± 0.014  0.931 ± 0.016  0.315 ± 0.035
xpath-text        4.4.2     0.394 ± 0.020  0.246 ± 0.016  0.992 ± 0.001  0.000 ± 0.000

Results for packages added after the original evaluation:

                  version   F1             precision      recall         accuracy
trafilatura       0.5.1     0.945 ± 0.009  0.925 ± 0.011  0.966 ± 0.009  0.221 ± 0.031
go_readability    bdc8717   0.943 ± 0.007  0.912 ± 0.009  0.975 ± 0.007  0.210 ± 0.030
readability_js    Feb 2021  0.887 ± 0.012  0.853 ± 0.013  0.924 ± 0.012  0.149 ± 0.026
go_domdistiller   1c90a88   0.927 ± 0.007  0.901 ± 0.010  0.956 ± 0.010  0.066 ± 0.018
news_please       1.5.17    0.911 ± 0.014  0.917 ± 0.013  0.906 ± 0.018  0.249 ± 0.032
goose3            3.1.8     0.887 ± 0.016  0.930 ± 0.015  0.847 ± 0.021  0.227 ± 0.032
inscriptis        1.1.2     0.679 ± 0.015  0.517 ± 0.017  0.993 ± 0.001  0.000 ± 0.000
html2text         2020.1.16 0.662 ± 0.015  0.499 ± 0.017  0.983 ± 0.002  0.000 ± 0.000
justext           2.2.0     0.802 ± 0.018  0.858 ± 0.017  0.754 ± 0.028  0.088 ± 0.021
beautifulsoup     4.9.3     0.665 ± 0.015  0.499 ± 0.017  0.994 ± 0.001  0.000 ± 0.000

Below you can find more details about the packages and result reproduction.

More details are available:

Clone this repo, and use Python 3.6+.

Evaluation does not require any dependencies. Dependencies listed in requirements.txt are only for re-generating output files for open-source article extraction libraries. See below for their installation details.

JSON data format: a dictionary which maps item ids to dictionaries, with the following fields:

  • articleBody: text of the article
  • url: page url (optional)

All files should have the same keys. Ground truth is in ground-truth.json, and predictions from different systems are in output/*.json files.
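The format described above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper names `load_items` and `check_same_keys` and the sample item ids are hypothetical, and only the `articleBody` and `url` fields come from the description above.

```python
import json


def load_items(path):
    """Load a JSON file that maps item ids to dictionaries with an articleBody field."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for item_id, item in items.items():
        if "articleBody" not in item:
            raise ValueError(f"item {item_id!r} has no articleBody field")
    return items


def check_same_keys(ground_truth, predictions):
    """Return (missing, extra) item ids of predictions relative to the ground truth."""
    gt_ids = set(ground_truth)
    pred_ids = set(predictions)
    return sorted(gt_ids - pred_ids), sorted(pred_ids - gt_ids)


# Illustrative in-memory data (hypothetical ids and texts), in the same shape
# as ground-truth.json and output/*.json.
ground_truth = {
    "id-1": {"articleBody": "Full text.", "url": "https://example.com/a"},
    "id-2": {"articleBody": "Another article."},
}
predictions = {"id-1": {"articleBody": "Full text."}}

missing, extra = check_same_keys(ground_truth, predictions)
print("missing:", missing, "extra:", extra)
```

With real files, `load_items("ground-truth.json")` and `load_items` on one of the output/*.json files would supply the two dictionaries.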

HTML files are in the html folder. They were fetched with the Splash headless browser, with JS disabled by default. They are gzip-compressed and utf-8 encoded.
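Because the pages are gzip-compressed and utf-8 encoded, they can be read with the standard library alone. The sketch below is an illustration only: `read_html` is a hypothetical helper, and the file name `example.html.gz` is an assumption, since the naming of files inside the html folder is not described here.

```python
import gzip
import os
import tempfile


def read_html(path):
    """Read a gzip-compressed, UTF-8 encoded HTML file and return it as a str."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()


# Round-trip demo on a temporary file, so that no repository file is needed.
sample = "<html><body><p>Caf\u00e9 article</p></body></html>"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.html.gz")  # hypothetical file name
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(sample)
    restored = read_html(demo_path)

print(restored == sample)
```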

Screenshots of all pages are not in the repo; they are available on GitHub in the "Releases" section: https://github.com/scrapinghub/article-extraction-benchmark/releases

In addition to benchmarking the AutoExtract and Diffbot services, we also benchmark the open-source libraries listed in the results tables above, which work directly on HTML files without a need for rendering or external resources.

Output from these libraries is already present in the repo in output/*.json files. They were generated with extractors/run_*.py files.

All dependencies are in requirements.txt. Note that dragnet may fail to install on the first try, as you need to have numpy and Cython installed, as well as the libxml2 headers (libxml2-dev on Ubuntu).

boilerpipe requires a custom installation: use python2 and install Java (e.g. default-jre on Ubuntu), then install it with pip install -e git+https://github.com/misja/python-boilerpipe.git@ab3694d7bf695b73f0684a028e70aa816d63e6cb#egg=boilerpipe

go-readability requires a custom installation: see README in extractors/go_readability.

Readability.js requires a custom installation: install Node.js, then install the CLI tool: npm install -g [email protected]

Go-DomDistiller requires a custom installation: see README in extractors/go_domdistiller.

For evaluation, run:

python3 evaluate.py

We report precision, recall, F1 and accuracy, together with their standard deviations estimated with the bootstrap. Please refer to the technical report for more details.
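The idea of a bootstrap estimate can be sketched as follows. This is an illustration under stated assumptions, not the logic of evaluate.py: the whitespace tokenization, the per-item averaging, and the helper names `token_prf` and `bootstrap_mean_std` are all hypothetical, and the exact metric definitions are the ones given in the technical report.

```python
import random
import statistics
from collections import Counter


def token_prf(true_text, pred_text):
    """Precision, recall and F1 from whitespace-token overlap (assumed definition)."""
    true_tokens = Counter(true_text.split())
    pred_tokens = Counter(pred_text.split())
    overlap = sum((true_tokens & pred_tokens).values())  # multiset intersection
    n_pred = sum(pred_tokens.values())
    n_true = sum(true_tokens.values())
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bootstrap_mean_std(values, n_resamples=1000, seed=0):
    """Mean of values, and the standard deviation of that mean under resampling."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement and record the mean of the resample.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.mean(values), statistics.stdev(means)


# Hypothetical (ground truth, prediction) pairs.
pairs = [
    ("a b c d", "a b x"),
    ("the quick brown fox", "the quick brown fox"),
    ("one two three", "menu one two three footer"),
]
f1_scores = [token_prf(true, pred)[2] for true, pred in pairs]
exact_match = [float(true == pred) for true, pred in pairs]

f1_mean, f1_std = bootstrap_mean_std(f1_scores)
acc_mean, acc_std = bootstrap_mean_std(exact_match)
print(f"F1       {f1_mean:.3f} ± {f1_std:.3f}")
print(f"accuracy {acc_mean:.3f} ± {acc_std:.3f}")
```

Treating accuracy as the share of exact matches is likewise an assumption of this sketch. Fixing the seed keeps the reported deviation reproducible between runs.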

License is MIT.

article-extraction-benchmark's People

Contributors

adbar, imduffy15, ldulcic, lopuhin, madrugado, saippuakauppias, whalebot-helmsman


article-extraction-benchmark's Issues

Adding more tools to the benchmark?

Hi,

Thanks for your contribution, it's really useful to see evaluations on real-world data! There are further extraction tools for Python which this repository doesn't feature yet and which could be more efficient than some of the ones you're mentioning. You might have a look at

  • goose3
  • jusText (especially with a custom configuration)
  • inscriptis (html-to-txt conversion)
  • trafilatura (disclaimer: I'm the author).

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details please refer to the evaluation I've performed. The code including baselines is available here.

Adding version numbers or dates to the list

Hi, as time passes some tools evolve. Not all of them are under active development or maintenance, but results do change, see e.g. commit adbar/trafilatura@e5ef7e0 on my benchmark.

It would be nice to provide the version numbers in the evaluation, or at least dates to indicate which version is referenced.
