dragnet-org / dragnet

Just the facts -- web page content extraction

License: MIT License

Makefile 0.33% Python 55.55% C++ 1.79% HTML 41.99% Shell 0.24% Dockerfile 0.10%

dragnet's Introduction

Dragnet

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a web page. It's interested in... 'just the facts.' The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.

For more information on our approach check out:

Our paper Content Extraction Using Diverse Feature Sets, published at WWW in 2013, gives an overview of the machine learning approach.
A comparison of Dragnet and alternate content extraction packages.
This blog post explains the intuition behind the algorithms.

This project was originally inspired by Kohlschütter et al, Boilerplate Detection using Shallow Text Features and Weninger et al CETR -- Content Extraction with Tag Ratios, and more recently by Readability.

GETTING STARTED

Depending on your use case, we provide two separate functions to extract just the main article content or the content and any user generated comments. Each function takes an HTML string and returns the content string.

import requests
from dragnet import extract_content, extract_content_and_comments

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = extract_content(r.content)

# get article and comments
content_comments = extract_content_and_comments(r.content)

We also provide a sklearn-style extractor class (complete with fit and predict methods). You can either train an extractor yourself or load a pre-trained one:

from dragnet.util import load_pickled_model

content_extractor = load_pickled_model(
            'kohlschuetter_readability_weninger_content_model.pkl.gz')
content_comments_extractor = load_pickled_model(
            'kohlschuetter_readability_weninger_comments_content_model.pkl.gz')
            
content = content_extractor.extract(r.content)
content_comments = content_comments_extractor.extract(r.content)

A note about encoding

If you know the encoding of the document (e.g. from HTTP headers), you can pass it down to the parser:

content = content_extractor.extract(html_string, encoding='utf-8')

Otherwise, we try to guess the encoding from a meta tag or specified <?xml encoding=".."?> tag. If that fails, we assume "UTF-8".
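The fallback order described above can be sketched in a few lines. This is only an illustration of the idea, not Dragnet's actual implementation: `guess_encoding` and both regular expressions are hypothetical names introduced here, and real-world detection handles many more cases.

```python
import re

# Hypothetical patterns, for illustration only.
META_RE = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)
XML_DECL_RE = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', re.IGNORECASE)


def guess_encoding(html_bytes, default="utf-8"):
    """Return the encoding declared in a meta tag or XML declaration, else the default."""
    head = html_bytes[:4096]  # declarations normally appear near the top of the document
    for pattern in (META_RE, XML_DECL_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default  # nothing declared: assume UTF-8
```

For raw bytes such as `r.content`, the result could then be handed to the extractor through the `encoding` argument shown earlier.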

Installing

Dragnet is written in Python (developed with 2.7, with support recently added for 3) and built on the numpy/scipy/Cython numerical computing environment. In addition we use lxml (libxml2) for HTML parsing.

We recommend installing from the master branch to ensure you have the latest version.

Installing with Docker:

This is the easiest method to install Dragnet and builds a Docker container with Dragnet and its dependencies.

Install Docker.
Clone the master branch: git clone https://github.com/dragnet-org/dragnet.git
Build the docker container: docker build -t dragnet .
Run the tests: docker run dragnet make test

You can also run an interactive Python session:

docker run -ti dragnet python3

Installing without Docker

Install the dependencies needed for Dragnet. The build depends on GCC, numpy, Cython and lxml (which in turn depends on libxml2). We use provision.sh to set up the dependencies in the Docker container, so you can use it as a template and modify it as appropriate for your operating system.
Clone the master branch: git clone https://github.com/dragnet-org/dragnet.git
Install the requirements: cd dragnet; pip install -r requirements.txt
Build dragnet:

$ cd dragnet
$ make install
# these should now pass
$ make test
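Before running the build steps above, it can help to confirm that the prerequisites are visible from the environment you will build in. The following is a minimal sketch using only the standard library; `check_build_prerequisites` is a hypothetical helper, not part of Dragnet, and it only checks that the modules and the compiler can be found, not that their versions are compatible.

```python
import importlib.util
import shutil


def check_build_prerequisites(modules=("numpy", "Cython", "lxml"), compiler="gcc"):
    """Report which build prerequisites can be found in the current environment."""
    # find_spec returns None when a top-level module cannot be imported
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # shutil.which returns None when the executable is not on PATH
    report[compiler] = shutil.which(compiler) is not None
    return report


if __name__ == "__main__":
    for name, found in check_build_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```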

Contributing

We love contributions! Open an issue, or fork/create a pull request.

More details about the code structure

The Extractor class encapsulates a blockifier, some feature extractors and a machine learning model.

A blockifier implements blockify, which takes an HTML string and returns a list of block objects. A feature extractor is a callable that takes a list of blocks and returns a numpy array of features with shape (len(blocks), nfeatures). There is some additional optional functionality to "train" the feature (e.g. estimate parameters needed for centering) specified in features.py. The machine learning model implements the scikit-learn interface (predict and fit) and is used to compute the content/no-content prediction for each block.
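To make the division of labour concrete, here is a toy sketch of the three pieces wired together. Every class and function name below is hypothetical and the logic is deliberately simplistic; it uses plain lists where Dragnet uses numpy arrays, and a fixed threshold where Dragnet uses a trained model.

```python
import re


class ToyBlockifier:
    """Hypothetical blockifier: splits HTML into text blocks at a few block-level tags."""

    def blockify(self, html):
        parts = re.split(r"</?(?:p|div|li|h[1-6])\b[^>]*>", html, flags=re.IGNORECASE)
        texts = (re.sub(r"<[^>]+>", " ", part).strip() for part in parts)
        return [text for text in texts if text]


def length_features(blocks):
    """Feature extractor: one row per block -> [word count, characters per word]."""
    rows = []
    for block in blocks:
        words = block.split()
        rows.append([len(words), len(block) / max(len(words), 1)])
    return rows


class ThresholdModel:
    """Stand-in for a scikit-learn estimator exposing fit and predict."""

    def __init__(self, min_words=5):
        self.min_words = min_words

    def fit(self, features, labels):
        return self  # nothing to learn in this toy model

    def predict(self, features):
        return [1 if row[0] >= self.min_words else 0 for row in features]


class ToyExtractor:
    def __init__(self, blockifier, feature_extractors, model):
        self.blockifier = blockifier
        self.feature_extractors = feature_extractors
        self.model = model

    def extract(self, html):
        blocks = self.blockifier.blockify(html)
        # concatenate the columns produced by every feature extractor
        per_extractor = [fe(blocks) for fe in self.feature_extractors]
        features = [sum(rows, []) for rows in zip(*per_extractor)]
        predictions = self.model.predict(features)
        return "\n".join(b for b, keep in zip(blocks, predictions) if keep)
```

With this toy setup, `ToyExtractor(ToyBlockifier(), [length_features], ThresholdModel()).extract(html)` keeps only the blocks with at least five words.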

Training/test data

The training and test data are available at dragnet_data.

Training content extraction models

Download the training data (see above). In what follows, ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
Create the block corrected files needed to do supervised learning on the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:
```
from dragnet.data_processing import extract_all_gold_standard_data
rootdir = '/path/to/dragnet_data/'
extract_all_gold_standard_data(rootdir)
```
This solves the longest common subsequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse an HTML document. In this case, remove the offending document and restart the process.
Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
- Whether to use just article content or content and comments.
- The features to use
- The machine learning model to use
For example, to train the randomized decision tree classifier from sklearn using the shallow text features from Kohlschuetter et al., the CETR features from Weninger et al., and the Readability-inspired features:
```
from dragnet.extractor import Extractor
from dragnet.model_training import train_model
from sklearn.ensemble import ExtraTreesClassifier

rootdir = '/path/to/dragnet_data/'

features = ['kohlschuetter', 'weninger', 'readability']

to_extract = ['content', 'comments']   # or ['content']

model = ExtraTreesClassifier(
    n_estimators=10,
    max_features=None,
    min_samples_leaf=75
)
base_extractor = Extractor(
    features=features,
    to_extract=to_extract,
    model=model
)

extractor = train_model(base_extractor, rootdir)
```
This trains the model and, if a value is passed to output_dir, writes a pickled version of it along with some block-level classification errors to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_models.
As a last step, test the performance of the model on the test set (see below).
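The block-labelling step described above (creating the block corrected files) rests on a longest common subsequence (LCS) between the tokens of a page and the tokens of its gold-standard text. The following is a plain-Python sketch of the standard dynamic-programming approach, included only to illustrate the idea; `lcs_mask` is a hypothetical name and is not the function Dragnet uses.

```python
def lcs_mask(tokens, gold_tokens):
    """Return a 0/1 list marking which entries of `tokens` lie on one LCS with `gold_tokens`."""
    n, m = len(tokens), len(gold_tokens)
    # table[i][j] = LCS length of tokens[i:] and gold_tokens[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # walk the table from the start to recover one optimal alignment
    mask = [0] * n
    i = j = 0
    while i < n and j < m:
        if tokens[i] == gold_tokens[j]:
            mask[i] = 1
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask
```

One plausible use of such a mask is to label a block as content when a large enough fraction of its tokens are marked 1; the exact rule is a modelling choice and is assumed here rather than taken from Dragnet.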

Evaluating content extraction models

Use evaluate_model_predictions in model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model, run:

from dragnet.compat import train_test_split
from dragnet.data_processing import prepare_all_data
from dragnet.model_training import evaluate_model_predictions

rootdir = '/path/to/dragnet_data/'
data = prepare_all_data(rootdir)
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)

extractor.fit(train_html, train_labels, weights=train_weights)
predictions = extractor.predict(test_html)
scores = evaluate_model_predictions(test_labels, predictions, test_weights)

Note that this is the same evaluation that is run and printed in train_model.
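To show how block-level predictions can turn into token-level scores, here is a small sketch that weights every block by a number, assumed here to be its token count. `weighted_scores` is a hypothetical helper written for this illustration; it is not the implementation of evaluate_model_predictions.

```python
def weighted_scores(y_true, y_pred, weights):
    """Accuracy, precision, recall and F1 where each block counts `weight` times."""
    tp = fp = fn = tn = 0.0
    for truth, pred, weight in zip(y_true, y_pred, weights):
        if pred and truth:
            tp += weight      # content block predicted as content
        elif pred and not truth:
            fp += weight      # boilerplate predicted as content
        elif truth:
            fn += weight      # content block that was missed
        else:
            tn += weight      # boilerplate correctly rejected
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Weighting by token count means that misclassifying one long article paragraph costs more than misclassifying a short navigation link.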

dragnet's Issues

Compatibility with Newer Python Versions

Hello,

I am having trouble using dragnet with python3.9. In particular, I get an error like this when importing dragnet:

root@2e4bbb389174:/home# python3
Python 3.9.2 (default, Feb 19 2021, 17:23:45)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dragnet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/dragnet/__init__.py", line 1, in <module>
    from dragnet.blocks import Blockifier, PartialBlock, BlockifyError
  File "dragnet/blocks.pyx", line 32, in init dragnet.blocks
  File "/usr/local/lib/python3.9/site-packages/dragnet/compat.py", line 265, in <module>
    from sklearn import __version__ as sklearn_version
  File "/usr/local/lib/python3.9/site-packages/sklearn/__init__.py", line 64, in <module>
    from .base import clone
  File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 14, in <module>
    from .utils.fixes import signature
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/__init__.py", line 14, in <module>
    from . import _joblib
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_joblib.py", line 22, in <module>
    from ..externals import joblib
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/__init__.py", line 119, in <module>
    from .parallel import Parallel
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/parallel.py", line 28, in <module>
    from ._parallel_backends import (FallbackToBackend, MultiprocessingBackend,
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 22, in <module>
    from .executor import get_memmapping_executor
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/executor.py", line 14, in <module>
    from .externals.loky.reusable_executor import get_reusable_executor
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/loky/__init__.py", line 12, in <module>
    from .backend.reduction import set_loky_pickler
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 125, in <module>
    from sklearn.externals.joblib.externals import cloudpickle  # noqa: F401
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/__init__.py", line 3, in <module>
    from .cloudpickle import *
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 152, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 133, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

After some googling around, it appears that this error is related to changes introduced in python3.8. If my understanding of the issue is correct, would it be possible to support newer versions of python in dragnet? Thanks in advance.

publish a new release on pypi?

Hi! It's been about a year since the last release — and my previous ask for it ;) — and I was wondering if folks had bandwidth to publish a new (v2.0.4?) release to pypi. It would be great to get unblocked on upgrading scikit-learn to v0.20. Thanks in advance!

scikit-learn update

Hi,

I am using dragnet in combination with scikit-learn. I would like to use scikit-learn 0.18 together with dragnet, which is a little inconvenient, since dragnet depends on an older version of scikit-learn and has problems loading the pretrained models.

Do you plan an update?

Thanks in advance,
Siavash Sefid Rodi

PyPi installation issue

I am on Python 3.6 (AMI), after installing lxml and numpy.

This is the error I get while trying to install dragnet 2.0.2 from pip --

Collecting dragnet
  Using cached https://files.pythonhosted.org/packages/d8/5b/b1ea2d21c45dc6862b950530e96bf9c7db0a20dd313384ae8aac4140e41a/dragnet-2.0.2.tar.gz
Requirement already satisfied: Cython>=0.21.1 in ./lib64/python2.7/site-packages (from dragnet) (0.28.2)
Requirement already satisfied: lxml in ./lib64/python2.7/site-packages (from dragnet) (4.2.1)
Collecting scikit-learn<0.19.0,>=0.15.2 (from dragnet)
  Downloading https://files.pythonhosted.org/packages/60/f0/c9db37931e1cf1d7d3a210ac3a18771cbe7ff6375c8f50c256793df63df8/scikit_learn-0.18.2-cp27-cp27mu-manylinux1_x86_64.whl (11.6MB)
    100% |████████████████████████████████| 11.7MB 3.5MB/s
Requirement already satisfied: numpy in ./lib64/python2.7/site-packages (from dragnet) (1.14.2)
Collecting scipy (from dragnet)
  Using cached https://files.pythonhosted.org/packages/9c/0b/5deb712a9ea5bb0a1de837d04ef7625c5f3ee44efe7ed0765ceda681d7f1/scipy-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting ftfy<5.0.0,>=4.1.0 (from dragnet)
Collecting wcwidth (from ftfy<5.0.0,>=4.1.0->dragnet)
  Using cached https://files.pythonhosted.org/packages/7e/9f/526a6947247599b084ee5232e4f9190a38f398d7300d866af3ab571a5bfe/wcwidth-0.1.7-py2.py3-none-any.whl
Collecting html5lib (from ftfy<5.0.0,>=4.1.0->dragnet)
  Using cached https://files.pythonhosted.org/packages/a5/62/bbd2be0e7943ec8504b517e62bab011b4946e1258842bc159e5dfde15b96/html5lib-1.0.1-py2.py3-none-any.whl
Collecting six>=1.9 (from html5lib->ftfy<5.0.0,>=4.1.0->dragnet)
  Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting webencodings (from html5lib->ftfy<5.0.0,>=4.1.0->dragnet)
  Using cached https://files.pythonhosted.org/packages/f4/24/2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7/webencodings-0.5.1-py2.py3-none-any.whl
Building wheels for collected packages: dragnet
  Running setup.py bdist_wheel for dragnet ... error
  Complete output from command /home/ec2-user/lambdas/content_extractor/dragnet/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m2PgjV/dragnet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-bNQC3F --python-tag cp27:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-2.7
  creating build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/model_training.py -> build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/data_processing.py -> build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/__init__.py -> build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/extractor.py -> build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/compat.py -> build/lib.linux-x86_64-2.7/dragnet
  copying dragnet/util.py -> build/lib.linux-x86_64-2.7/dragnet
  creating build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/css.py -> build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/standardized.py -> build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/kohlschuetter.py -> build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/readability.py -> build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/weninger.py -> build/lib.linux-x86_64-2.7/dragnet/features
  copying dragnet/features/__init__.py -> build/lib.linux-x86_64-2.7/dragnet/features
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
  copying dragnet/pickled_models/sklearn_0.18.0/kohlschuetter_weninger_readability_content_comments_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
  copying dragnet/pickled_models/sklearn_0.18.0/kohlschuetter_weninger_readability_content_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/sklearn_0.15.2_0.17.1/kohlschuetter_weninger_readability_content_comments_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/sklearn_0.15.2_0.17.1/kohlschuetter_weninger_readability_content_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
  creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
  running build_ext
  building 'dragnet.lcs' extension
  creating build/temp.linux-x86_64-2.7
  creating build/temp.linux-x86_64-2.7/dragnet
  gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/home/ec2-user/lambdas/content_extractor/dragnet/local/lib64/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c dragnet/lcs.cpp -o build/temp.linux-x86_64-2.7/dragnet/lcs.o
  unable to execute 'gcc': No such file or directory
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for dragnet
  Running setup.py clean for dragnet
Failed to build dragnet
Installing collected packages: scikit-learn, scipy, wcwidth, six, webencodings, html5lib, ftfy, dragnet
  Running setup.py install for dragnet ... error
    Complete output from command /home/ec2-user/lambdas/content_extractor/dragnet/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m2PgjV/dragnet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-z8yKqs/install-record.txt --single-version-externally-managed --compile --install-headers /home/ec2-user/lambdas/content_extractor/dragnet/include/site/python2.7/dragnet:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-2.7
    creating build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/model_training.py -> build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/data_processing.py -> build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/__init__.py -> build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/extractor.py -> build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/compat.py -> build/lib.linux-x86_64-2.7/dragnet
    copying dragnet/util.py -> build/lib.linux-x86_64-2.7/dragnet
    creating build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/css.py -> build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/standardized.py -> build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/kohlschuetter.py -> build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/readability.py -> build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/weninger.py -> build/lib.linux-x86_64-2.7/dragnet/features
    copying dragnet/features/__init__.py -> build/lib.linux-x86_64-2.7/dragnet/features
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
    copying dragnet/pickled_models/sklearn_0.18.0/kohlschuetter_weninger_readability_content_comments_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
    copying dragnet/pickled_models/sklearn_0.18.0/kohlschuetter_weninger_readability_content_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.18.0
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    copying dragnet/pickled_models/py2_sklearn_0.18.0/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.18.0
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py2_sklearn_0.15.2_0.17.1
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/sklearn_0.15.2_0.17.1/kohlschuetter_weninger_readability_content_comments_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/sklearn_0.15.2_0.17.1/kohlschuetter_weninger_readability_content_model.pickle.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/sklearn_0.15.2_0.17.1
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    copying dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.15.2_0.17.1
    creating build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_content_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_block_errors.txt -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_comments_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    copying dragnet/pickled_models/py3_sklearn_0.18.0/kohlschuetter_readability_weninger_content_model.pkl.gz -> build/lib.linux-x86_64-2.7/dragnet/pickled_models/py3_sklearn_0.18.0
    running build_ext
    building 'dragnet.lcs' extension
    creating build/temp.linux-x86_64-2.7
    creating build/temp.linux-x86_64-2.7/dragnet
    gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/home/ec2-user/lambdas/content_extractor/dragnet/local/lib64/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c dragnet/lcs.cpp -o build/temp.linux-x86_64-2.7/dragnet/lcs.o
    unable to execute 'gcc': No such file or directory
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
Command "/home/ec2-user/lambdas/content_extractor/dragnet/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-m2PgjV/dragnet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-z8yKqs/install-record.txt --single-version-externally-managed --compile --install-headers /home/ec2-user/lambdas/content_extractor/dragnet/include/site/python2.7/dragnet" failed with error code 1 in /tmp/pip-install-m2PgjV/dragnet/

I also tried Python 2.7.3 and ended up with the same error.
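
The log above stops at "unable to execute 'gcc': No such file or directory", which suggests the build host has no C/C++ compiler on its PATH, so changing the Python version alone would not be expected to help. A minimal sketch for checking this before running pip follows; the helper name `find_compiler` and the list of candidate compilers are illustrative assumptions, not part of dragnet.

```python
import shutil


def find_compiler(candidates=("gcc", "cc", "clang")):
    """Return the path of the first compiler found on PATH, or None.

    The candidate names are an assumption for illustration only.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    compiler = find_compiler()
    if compiler is None:
        print("No C compiler found on PATH; compiled extensions cannot be built.")
    else:
        print("Found compiler:", compiler)
```

Note that `shutil.which` needs Python 3.3 or later; on Python 2.7, `distutils.spawn.find_executable` plays a similar role.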

Possible to get HTML for winning content?

Is there any way to get the HTML of the winning content blocks? I'd also like to get img, code, and pre elements, and to preserve formatting with p and heading tags where possible.

[Error] ValueError: Can't find libxml2 include headers

Hey,
I am working on my undergraduate project and was trying to use dragnet, but I have been unable to install it after repeated attempts. I have already installed the required dependencies, but I am still getting the
"ValueError: Can't find libxml2 include headers" error (the log file is below). Please help. Thanks in advance.

c:\Python27\Scripts\pip run on 01/29/15 17:35:16
Downloading/unpacking dragnet
Getting page https://pypi.python.org/simple/dragnet/
URLs to search for versions for dragnet:

https://pypi.python.org/simple/dragnet/
Analyzing links from page https://pypi.python.org/simple/dragnet/
Skipping https://pypi.python.org/packages/cp27/d/dragnet/dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl#md5=84fb8d211155099f6406c2ada3f63578 (from https://pypi.python.org/simple/dragnet/) because it is not compatible with this Python
Skipping https://pypi.python.org/packages/cp27/d/dragnet/dragnet-1.0.1-cp27-none-macosx_10_10_intel.whl#md5=ddbb966ef2e0e1c0252b4a30db17fb45 (from https://pypi.python.org/simple/dragnet/) because it is not compatible with this Python
Found link https://pypi.python.org/packages/source/d/dragnet/dragnet-1.0.0.tar.gz#md5=1a71b6ad3ad87d98488e0dc4a2d848f6 (from https://pypi.python.org/simple/dragnet/), version: 1.0.0
Found link https://pypi.python.org/packages/source/d/dragnet/dragnet-1.0.1.tar.gz#md5=593bb685674901399dba5c305b3d85b5 (from https://pypi.python.org/simple/dragnet/), version: 1.0.1
Using version 1.0.1 (newest of versions: 1.0.1, 1.0.0)
Downloading from URL https://pypi.python.org/packages/source/d/dragnet/dragnet-1.0.1.tar.gz#md5=593bb685674901399dba5c305b3d85b5 (from https://pypi.python.org/simple/dragnet/)
Running setup.py (path:c:\users\faisal1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet\setup.py) egg_info for package dragnet
Traceback (most recent call last):
File "", line 17, in
File "c:\users\faisal1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet\setup.py", line 50, in
include_dirs = lxml.get_include() + [find_libxml2_include()],
File "c:\users\faisal~1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet\setup.py", line 36, in find_libxml2_include
raise ValueError("Can't find libxml2 include headers")
ValueError: Can't find libxml2 include headers
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 17, in

File "c:\users\faisal~1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet\setup.py", line 50, in

include_dirs = lxml.get_include() + [find_libxml2_include()],

File "c:\users\faisal~1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet\setup.py", line 36, in find_libxml2_include

raise ValueError("Can't find libxml2 include headers")

ValueError: Can't find libxml2 include headers

Cleaning up...
Removing temporary dir c:\users\faisal1\appdata\local\temp\pip_build_FAISAL KHAN...
Command python setup.py egg_info failed with error code 1 in c:\users\faisal1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet
Exception information:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\pip\basecommand.py", line 122, in main
status = self.run(options, args)
File "C:\Python27\lib\site-packages\pip\commands\install.py", line 278, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "C:\Python27\lib\site-packages\pip\req.py", line 1229, in prepare_files
req_to_install.run_egg_info()
File "C:\Python27\lib\site-packages\pip\req.py", line 325, in run_egg_info
command_desc='python setup.py egg_info')
File "C:\Python27\lib\site-packages\pip\util.py", line 697, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command python setup.py egg_info failed with error code 1 in c:\users\faisal~1\appdata\local\temp\pip_build_FAISAL KHAN\dragnet
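
The traceback shows setup.py raising from `find_libxml2_include()` when it cannot locate the libxml2 development headers. Installing the lxml Python package does not necessarily put those C headers on the machine. The search logic of that function is not shown in the log, so the sketch below is only a stand-alone way to check whether a directory looks like a libxml2 include directory. The function name `find_header_dir` and the candidate paths are assumptions for illustration.

```python
import os


def find_header_dir(candidate_dirs, marker=os.path.join("libxml", "xmlversion.h")):
    """Return the first directory containing the marker header, else None.

    libxml2 include directories contain a `libxml/` subdirectory with
    headers such as `xmlversion.h`; that file is used as the marker here.
    """
    for directory in candidate_dirs:
        if os.path.isfile(os.path.join(directory, marker)):
            return directory
    return None


if __name__ == "__main__":
    # Assumed locations; adjust them for the machine in question.
    candidates = ["/usr/include/libxml2", "/usr/local/include/libxml2"]
    found = find_header_dir(candidates)
    print(found or "libxml2 headers not found in the listed directories")
```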

CSS Feature Doesn't Work

Trying to use the css feature raises an error.
I printed the css attribute of the blocks and it was empty for all of them.
Did I do something wrong, or is the feature broken?

KeyError                                  Traceback (most recent call last)
<ipython-input-2-12f95110a825> in <module>
     23 )
     24 
---> 25 extractor = train_model(base_extractor, rootdir)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/model_training.py in train_model(extractor, data_dir, output_dir)
    106     logging.info('fitting and evaluating the extractor features and model...')
    107     try:
--> 108         extractor.fit(train_html, train_labels, weights=train_weights)
    109     except (TypeError, ValueError):
    110         extractor.fit(train_html, train_labels)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/extractor.py in fit(self, documents, labels, weights)
     87         # happens for now, but this might be important if the features change
     88         features_mat = np.concatenate([self.features.fit_transform(blocks)
---> 89                                        for blocks in block_groups])
     90         if weights is None:
     91             self.model.fit(features_mat, labels)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/extractor.py in <listcomp>(.0)
     87         # happens for now, but this might be important if the features change
     88         features_mat = np.concatenate([self.features.fit_transform(blocks)
---> 89                                        for blocks in block_groups])
     90         if weights is None:
     91             self.model.fit(features_mat, labels)

~/anaconda3_501/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    791             delayed(_fit_transform_one)(trans, X, y, weight,
    792                                         **fit_params)
--> 793             for name, trans, weight in self._iter())
    794 
    795         if not result:

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    918                 self._iterating = self._original_iterator is not None
    919 
--> 920             while self.dispatch_one_batch(iterator):
    921                 pass
    922 

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/anaconda3_501/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
    612 def _fit_transform_one(transformer, X, y, weight, **fit_params):
    613     if hasattr(transformer, 'fit_transform'):
--> 614         res = transformer.fit_transform(X, y, **fit_params)
    615     else:
    616         res = transformer.fit(X, y, **fit_params).transform(X)

~/anaconda3_501/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    462         if y is None:
    463             # fit method of arity 1 (unsupervised transformation)
--> 464             return self.fit(X, **fit_params).transform(X)
    465         else:
    466             # fit method of arity 2 (supervised transformation)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/features/css.py in transform(self, blocks, y)
     56             for token in tokens
     57             )
---> 58         return np.column_stack(tuple(feature_vecs)).astype(int)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/features/css.py in <genexpr>(.0)
     54                   for block in blocks)
     55             for attrib, tokens in self.attribute_tokens
---> 56             for token in tokens
     57             )
     58         return np.column_stack(tuple(feature_vecs)).astype(int)

~/anaconda3_501/lib/python3.6/site-packages/dragnet/features/css.py in <genexpr>(.0)
     52         feature_vecs = (
     53             tuple(re.search(token, block.css[attrib]) is not None
---> 54                   for block in blocks)
     55             for attrib, tokens in self.attribute_tokens
     56             for token in tokens

KeyError: 'id'
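
The traceback ends at `block.css[attrib]` in css.py, which raises `KeyError: 'id'` when a block's css dict has no 'id' entry. That matches the observation that css was empty for every block. Why the dicts are empty is not established here. The sketch below only illustrates how a lookup with a default avoids the exception; `FakeBlock` and `css_token_matches` are hypothetical stand-ins, not dragnet classes or a proposed patch.

```python
import re


class FakeBlock:
    """Hypothetical stand-in for a dragnet block; only `css` matters here."""

    def __init__(self, css):
        self.css = css


def css_token_matches(blocks, attrib, token):
    """Return one bool per block: does `token` match the block's css[attrib]?

    A missing attribute is treated as an empty string instead of raising
    KeyError, so blocks without css information simply yield False.
    """
    return [re.search(token, block.css.get(attrib, "")) is not None
            for block in blocks]


blocks = [FakeBlock({"id": "main-content", "class": "post"}), FakeBlock({})]
print(css_token_matches(blocks, "id", "content"))  # [True, False]
```

With every css dict empty, a feature built this way would be all zeros, so the underlying question of why the blockifier produced no css data would still need an answer.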

Installing error within WinPython

Hi. I'm trying to install Dragnet but I get the following error. I have installed lxml, but that doesn't help. Any hints?

C:\Users\ti\Desktop\WinPython-32bit-2.7.10.2\python-2.7.10>pip install dragnet
You are using pip version 7.1.0, however version 7.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
Collecting dragnet
Using cached dragnet-1.0.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 20, in
File "c:\users\ti\appdata\local\temp\pip-build-3ujoqz\dragnet\setup.py", line 50, in
include_dirs = lxml.get_include() + [find_libxml2_include()],
File "c:\users\ti\appdata\local\temp\pip-build-3ujoqz\dragnet\setup.py", line 36, in find_libxml2_include
raise ValueError("Can't find libxml2 include headers")
ValueError: Can't find libxml2 include headers

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in c:\users\ti\appdata\local\temp\pip-build-3ujoqz\dragnet

MemoryError: Unable to allocate array with shape (26577,) and data type <U1847338

base_extractor = Extractor(
File "/home/dragnet/dragnet/model_training.py", line 103, in train_model
train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)
File "/home/dragnet/dragnet/extractor.py", line 124, in get_html_labels_weights
return np.array(all_html), np.array(all_labels), np.array(all_weights)
MemoryError: Unable to allocate array with shape (26577,) and data type <U1847338

I am training with more than 20K documents and get a memory error in numpy.array. Is the problem in the feature engineering that runs before the fitting process?
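
The dtype in the message, `<U1847338`, is a fixed-width unicode dtype: NumPy sizes every element to the longest string (here 1,847,338 characters) at 4 bytes per character. The allocation therefore fails while the list of HTML strings is being converted to an array in `get_html_labels_weights`, before any features are computed. A rough back-of-the-envelope sketch of the requested size follows; the function name is illustrative.

```python
def fixed_width_unicode_bytes(n_items, max_chars, bytes_per_char=4):
    """Bytes needed for a NumPy '<U{max_chars}' array with n_items elements.

    NumPy stores fixed-width unicode strings at 4 bytes per character and
    pads every element to the length of the longest string.
    """
    return n_items * max_chars * bytes_per_char


needed = fixed_width_unicode_bytes(26577, 1847338)
print(needed)                        # 196386808104
print(round(needed / 1e9, 1), "GB")  # 196.4 GB
```

A single very long document inflates the size of every element. One possible direction, untested against dragnet, is to keep the documents in a plain list or an array with `dtype=object`, which stores references rather than padded copies.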

lxml test error on master

I've been looking into the test that's failing in travis: test_lxml_error, on line 51 of test_kohlschuetter.py, which asserts self.assertRaises(etree.XMLSyntaxError, etree.fromstring, '', etree.HTMLParser(recover=True)).

This test was passing before #52 was merged the other day, and I haven't been able to sort out why it's failing on master: when I diff the Python 3 travis outputs, the only differences I notice are Python 3.5.5 vs 3.5.4 and different minor versions of certifi and sqlite.

Also, looking into that code, it seems that the recover=True param in the HTMLParser is specifically supposed to keep the expected XMLSyntaxError from occurring, and taking it out gives the expected behavior.

I have two questions on this:

What is the value of enforcing that this error is raised?
Why is recover=True being passed in, and is it important to the expected behavior? i.e. is it good enough if the exception is still raised when that param isn't passed into the parser?

@matt-peters, in addition to your knowledge of the codebase, it looks like you wrote this test, and I was wondering if you especially had any insight on this?

Is this project maintained?

I can't run dragnet if I simply pip install it. I am trying to compile it manually, but pip install -r requirements fails. I am considering using the Dockerfile, but I'm not sure how that will interact with the other tools I want to use dragnet inside of.

It seems like a pretty big project figuring out how to properly install this - I'm willing to take it on but I'm curious, is this project still being maintained? Why doesn't the pip installation work, is it not up to date or something?

Thanks very much

Not able to install on Windows machine

There are way too many errors when installing on Windows. I have been trying to resolve them all, but still have had no success.
Is it possible to provide a pipwin installation for dragnet?

Invalid ELF header with Dragnet PyPi Wheel

Trying to use the Wheel files in an AWS Lambda but get this error:

Unable to import module 'myprj/__init__': /var/task/dragnet/blocks.so: invalid ELF header

Apparently the problem is that AWS Lambda runs on Linux, while the dragnet wheel was built for another platform (macOS, according to the name dragnet-2.0.4-cp27-cp27m-macosx_10_14_x86_64.whl).

A workaround is explained here: https://tg4.solutions/how-to-resolve-invalid-elf-header-error/

Any chance of a fix, i.e. could the wheels at https://pypi.org/project/dragnet/#files be updated to support all platforms?

Here is an example of a wheel that works on all platforms: https://pypi.org/project/requests/#files (requests-2.23.0-py2.py3-none-any.whl)
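
The two filenames differ in their compatibility tags. The requests wheel ends in `none-any` because it is pure Python, whereas dragnet ships compiled extensions (such as blocks.so), so each of its wheels is tied to one platform and a single file could not cover all of them. A small sketch for reading the tags from a wheel filename follows; the helper names are illustrative.

```python
def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    stem = filename[:-len(".whl")]
    # The last three dash-separated fields are always python, abi, platform.
    python_tag, abi_tag, platform_tag = stem.rsplit("-", 3)[1:]
    return python_tag, abi_tag, platform_tag


def is_platform_independent(filename):
    """True for pure-Python wheels, whose platform tag is 'any'."""
    return wheel_tags(filename)[2] == "any"


print(wheel_tags("dragnet-2.0.4-cp27-cp27m-macosx_10_14_x86_64.whl"))
# ('cp27', 'cp27m', 'macosx_10_14_x86_64')
print(is_platform_independent("requests-2.23.0-py2.py3-none-any.whl"))  # True
```

For a package with compiled code, covering Linux would mean publishing Linux-tagged wheels alongside the macOS ones, or building from source on a Linux host that matches the Lambda runtime.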

Poor performance for content-only extraction

While training new models for a release (as mentioned in #52 and #60), I was getting much worse performance on content extraction than what was reported before #52, and much worse than on content with comments (I consistently get an F1 score of about 0.6).

I'm looking into the cause of this now, but I don't think a new release should be made until it is resolved (or until someone here reports having trained a good model since #52 was merged).

potential improvements / new features to the extraction model?

I was doing a quick lit review to see if/how the state-of-the-art in web content extraction had changed over the past few years, and came upon a conference paper from last September, Learning Web Content Extraction with DOM Features, that seems interesting, relevant, and performant. There's also code: see learnhtml. Is there any interest in implementing its feature set within dragnet, and evaluating model performance with such features? This could be related to updates proposed in Issue #85.

Error while training system

I got a "MemoryError" while training the system.

The output of the error is attached above.

What could be the reason for this error? I also changed the memory setting in the Vagrantfile from "1024" to "2048".

BlockifyError when getting main content

Hi,
I am getting a BlockifyError:

     90
     91         blocks = self._blockifier.blockify(s, encoding=encoding,
---> 92             parse_callback=parse_callback)
     93
     94         # doc needs to be at least three blocks, otherwise return everything

dragnet/blocks.pyx in dragnet.blocks.TagCountNoCSSReadabilityBlockifier.blockify (dragnet/blocks.cpp:8879)()

dragnet/blocks.pyx in dragnet.blocks.Blockifier.blockify (dragnet/blocks.cpp:7963)()

BlockifyError:

when trying to get the main content of the following HTML page with content_extractor.analyze:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.2//EN" "http://www.openmobilealliance.org/tech/DTD/xhtml-mobile12.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de-DE" dir="ltr">

<head>
    <!-- data-info:v:2.0.5542.29760;a:23125861-6957-4cdf-8ac3-9a2eb03636d1;cn:140;az:{did:e4542eb382cf47daa0d27ce0fde0622d, rid: 140, sn: neurope-hp, dt: 2015-03-06T07:50:13.1753311Z, bt: O};ddpi:1;dpio:;dpi:1;dg:downlevel.pc;th:start;PageName:startPage;m:de-de;cb:;l:de-de;mu:de-de;ud:{cid:,vk:homepage,n:,l:de-de,ck:};xd:AA7TmFi;ovc:f;fxd:t;xdpub:2015-03-10 00:56:48Z;xdmap:2015-03-10 01:53:27Z;axd:;f: -->

    <link rel="canonical" href="http://www.msn.com/de-de/" />


        <title>MSN Deutschland – mit Hotmail Nachfolger Outlook und Messenger Skype</title>     
    <meta name="description" content="Nachrichten, Sport, Unterhaltung, Finanzen, Wetter, Reise, Gesundheit, Lifestyle und Rezepte, kombiniert mit Outlook, vormals Hotmail, Skype und Facebook"/>

<meta name="viewport" content="width=device-width,initial-scale=1, maximum-scale=1" />


<link rel="stylesheet" href="//static-hp-neu.s-msn.com/de-de/homepage/_sc/css/a948ecb3-49788a58/direction=ltr.locales=de-de.themes=start.dpi=resolution1x/f9-9b024b-627907c8/a0-318546-28fba29e/21-6ddefc-846b3b2e/a3-9c147b-74baf5b2/bd-3b6326-bdfd55f0/f2-9eb7b7-2182c5e1/weather-css-3f-bc2b1074ad8f00550fd4291cb33ce0-68ddb2ab/weather-css-65-6341176d2157b6b321f0d5b7797e91-12ce4544/finance-css-52-03305154c657e20cda79e04a4d3e45-60c9bccb/finance-css-69-86cde7ca12112455a01bf1c4e4ca4e-74a48c4/health-css-cc-6f1a79dbf4ab1f25e2e969f1b81bab-68ddb2ab?ver=2.0.5542.29760" media="all" />
    <script type="text/javascript">//<![CDATA[
(function(n,t){function o(n,i,r){typeof n!="string"&&(r=i,i=n,n=t);i&&i.splice||(r=i,i=[]);n=="c.dom"&&(l=!0);s(n,i,r)}function s(n,t,i,r){var e,o,s,h;n&&u[n]||(e=k(n,t),e?(s=typeof i=="function",h=l&&s&&n!="dap"&&n!="adLoad"&&t&&(t.length!=1||t[0]!="dap"&&t[0]!="c.dom"),h?setTimeout(function(){o=i.apply(null,e);a(n,o)},1):(o=s?i.apply(null,e):i,a(n,o))):f.push(r||{i:n,d:t,f:i}))}function a(t,i){t&&(i?(u[t]=i,v()):n.console&&console.error("Dependencies resolved, but object still not defined (or is otherwise falsey). id:"+t+"; typeof obj: "+typeof i))}function v(){var r,u,n,t;if(f.length&&!i){do for(r=f,u=r.length,f=[],i=1,t=0;t<u;t++)n=r[t],s(n.i,n.d,n.f,n);while(i>1);i=0}else i&&(i=2)}function k(i,r){for(var s,v,h,f=[],c=r?r.length:0,o=0;o<c;o++){var l=r[o],a=u[l],y=typeof a!="undefined";if(!y){if(s=b.exec(l),s)if(v=s[1],h=n[v],h!==t){f.push(h);continue}else e||(e=setTimeout(d,w));break}f.push(a)}return c==f.length?f:0}function d(){e=0;v()}function g(n,i,r){(typeof n!="object"||n&&n.splice)&&(r=i,i=n,n={});i&&i.splice||(r=i,i=[]);nt(n.js);r&&s(t,i,r)}function nt(n){if(typeof n=="string")y(n);else if(n)for(var t=0;t<n.length;t++)y(n[t])}function y(n){if(!c[n]){c[n]=1;var i=document.getElementsByTagName("script")[0],t=document.createElement("script");t.src=n;t.onload=t.onreadystatechange=function(){this.readyState&&this.readyState!="loaded"&&this.readyState!="complete"||(t.onload=t.onreadystatechange=null,t.parentNode&&t.parentNode.removeChild(t))};i.parentNode.insertBefore(t,i)}}function tt(n){return p?n?r.now():Math.round(r.now()):new Date-h}var r=n.performance,h=((r||{}).timing||{}).navigationStart||(n._timing||{}).start||+new Date,p=r&&typeof r.now=="function",u={image:Image,document:document,location:document.location,window:n,screen:screen,navigator:navigator,date:Date,pageTime:tt,pageStart:h},f=[],i,w=50,e,c={},l=!1,b=/^window\.(.+)$/;o.amd={jQuery:1};o.is=function(n){return typeof 
u[n]!="undefined"};n.define=o;n.require=g})(window);define("navigation",["escape","location"],function(n,t){function r(n,t,i){var s=function(n){return n=n.replace(/\+/g," "),decodeURIComponent(n)},u={},o,e;if(n)for(n=n.split("#")[0],o=n.split("&"),e=0;e<o.length;e++){var h=o[e].split("="),r=h[0],f=h[1];i&&(r=s(r),f&&(f=s(f)));t?(u[r]||(u[r]=[]),u[r].push(f)):u[r]=f}return u}function u(n){var t=f.exec(n);return t?t[2]:!1}var f=/[a-z][a-z0-9+\-.]*:\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])/i,i={getUrl:function(n){return i.filter?i.filter(n):n},navigate:function(n,r){i.filter&&(n=i.filter(n));r?t.replace(n):t.href=n},getHostName:u,isLocal:function(n){var i=u(n);return!i||t.hostname==i},getParams:r,getParamsFromUrl:function(n,t,i){var u=n.split("?")[1];return r(u,t,i)},mergeQueryStringParams:function(t,i){var e,f,o,u;if(i){if(e=t.split("?"),e[1]){f=r(e[1],!1,!0);for(u in i)f[u]=i[u]}else f=i;t=e[0];o="?";for(u in f)t+=f[u]?o+n.urlPart(u)+"="+n.urlPart(f[u]):o+n.urlPart(u),o="&"}return t},filter:null};return i});define("logging",["window"],function(n){function r(n,t){if(n.addEventListener)n.addEventListener("error",t,!1);else if(n.onerror){var i=n.onerror;n.onerror=function(n,r,u,f,e){return i(n,r,u,f,e),t(n,r,u,f,e)}}else n.onerror=t}function t(t){n.console&&(console.error||console.log)(t)}function u(){}function f(){}function e(t){(n.console||{}).timeStamp?console.timeStamp(t):(n.performance||{}).mark&&n.performance.mark(t)}var i=0;return r(n,function(n){return i++,n&&t("[SCRIPTERROR] "+n),!0}),{error:t,fatalError:t,unhandledErrorCount:function(){return i},perfMark:e,warning:u,information:f}})//]]></script><script type="text/javascript" src="//static-hp-neu.s-msn.com/_h/804ff984/webcore/externalscripts/jquery/jquery-1.11.1.min.js"></script>

            <script type="text/javascript"  src="//ads1.msads.net/library/8.3/dapmsn.js"></script>

    <style>.ie10plus ul.swipenav{display:inline-block}body:not(.startpage) #opensh{display:none!important}.homepage.midlevel .pagingsection>button.show,.channelplayerpage.midlevel .pagingsection>button.show{display:none}#main .linkskypeid.integratedskypeflyout>h3{display:none}</style>
</head>

<body class="startpage sp center-content start">


        <div id="banners">


    <span>Durch Nutzung dieser Webseite stimmen Sie der Verwendung von Cookies f&#252;r Analysezwecke, personalisierte Inhalte und Werbung zu.</span>

        </div>
    <div class="head">
        <div>
<div id="topnav">
        <ul class="verticalsnav">
                    <li  class="current">
                        <a href="/de-de">Startseite</a>
                    </li>
                    <li >
                        <a href="/de-de/nachrichten">Nachrichten</a>
                    </li>
                    <li >
                        <a href="/de-de/wetter">Wetter</a>
                    </li>
                    <li >
                        <a href="/de-de/unterhaltung">Unterhaltung</a>
                    </li>
                    <li >
                        <a href="/de-de/sport">Sport</a>
                    </li>
                    <li >
                        <a href="/de-de/finanzen">Finanzen</a>
                    </li>

                <li class="more">
                    <a href="#nav">Mehr ></a>
                    <ul>
        <li >
            <a href="/de-de/lifestyle">Lifestyle</a>
        </li>
        <li >
            <a href="/de-de/gesundheit">Gesundheit &amp; Fitness</a>
        </li>
        <li >
            <a href="/de-de/kochen-und-genuss">Kochen &amp; Genuss</a>
        </li>
        <li >
            <a href="/de-de/reisen">Reisen</a>
        </li>
        <li >
            <a href="/de-de/auto">Auto</a>
        </li>
        <li >
            <a href="/de-de/video">Video</a>
        </li>
                    </ul>
                </li>
        </ul>
</div>

                    <div id="header-common">
            <div class="header">
                <div class="header-logo">
                            <a class="logo" href="/de-de">



<img alt="" height="20" width="20" src="//static-hp-neu.s-msn.com/sc/6d/b23cf2.gif" />
        </a>
            <a class="vertical" href="/de-de">msn</a>




                </div>
                <div id="header-links">
                        <a href="http://www.outlook.com/">Outlook.com</a>
                        <span>|</span>
                        <a href="/de-de/settings/markettoggle"><img alt="de-de" src="//static-hp-neu.s-msn.com/sc/6a/a62410.gif" /></a>

                    <a class="navigation" href="#nav">
                        <img  alt="wechseln zu navigation" width="27" height="20" src="//static-hp-neu.s-msn.com/sc/57/a49b8d.gif" />
                    </a>
                </div>
            </div>
        </div>



<div id="header-search">
    <form action="http://www.bing.com/search?scope=web" method="get" id="srchfrm">
        <div class="searchbox">
            <input type="text" id="q" name="q" value="" />
            <input type="hidden" name="form" value="PRDEDL" />
            <input type="hidden" name="refig" value="2312586169574cdf8ac39a2eb03636d1">
            <input type="submit" class="text" value="Websuche" title="Websuche"/>
        </div>
    </form>
</div>

        </div>


            <div class="upgradebrowser">
                Sie verwenden eine veraltete Browserversion. Bitte verwenden Sie eine <a href="http://support2.microsoft.com/kb/2999871/de-de">unterstütze Version</a>damit Sie MSN optimal nutzen können.
            </div>



    </div>
    <div id="maincontent">

        <div id="main"  data-region="main">




        <div class="stripe first">
<h2>Heute</h2>    <a href="/de-de/nachrichten/politik/kopfsch%c3%bctteln-allerorten-athen-hat-viel-zeit-verspielt/ar-AA9zduo"
       >
<img alt="new caption" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9zDGZ.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="AFP" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/nachrichten/politik/kopfsch%c3%bctteln-allerorten-athen-hat-viel-zeit-verspielt/ar-AA9zduo"
       >

Kopfsch&#252;tteln allerorten: Athen hat viel Zeit verspielt    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/bildergalerien/apple-stellt-v%c3%b6llig-neues-macbook-vor-und-enth%c3%bcllt-letzte-infos-zur-watch/ss-AA9z2xo"
       >

Apple Watch: Das kann sie, so viel kostet sie    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/panorama/nichts-ist-wie-es-scheint-alles-ist-geplant/ar-AA9yiuP"
       >

Verschw&#246;rung? Nichts ist, wie es scheint    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/politik/eklat-hinter-den-kulissen-bei-%e2%80%9eg%c3%bcnther-jauch%e2%80%9c/ar-AA9yEcT"
       >

Eklat hinter den Kulissen bei „G&#252;nther Jauch“    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/nachrichten/generationen-zwist-unter-metzgern/ar-AA9z5Cz"
       >

Schlappe f&#252;r Schlachter-K&#246;nig T&#246;nnies    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/nachrichten" >Nachrichten</a></h2>    <a href="/de-de/nachrichten/wissenundtechnik/apple-will-w%c3%bcnsche-wecken-die-wir-noch-nicht-kennen/ar-AA9yXxe"
       >
<img alt="new caption" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9zdx0.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f&amp;x=1015&amp;y=972" title="Bloomberg" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/nachrichten/wissenundtechnik/apple-will-w%c3%bcnsche-wecken-die-wir-noch-nicht-kennen/ar-AA9yXxe"
       >

Apple will W&#252;nsche wecken, die wir noch nicht kennen    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/other/das-passiert-heute-wer-bekommt-edathys-geld/ar-AA9zDTg"
       >

Das passiert heute: Wer bekommt Edathys Geld?    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/other/man%c3%b6ver-unter-us-f%c3%bchrung-nato-probt-marineeins%c3%a4tze-im-schwarzen-meer/vi-AA9zlqD"
       >

Man&#246;ver unter US-F&#252;hrung: Nato probt Marineeins&#228;tze im Schwarzen Meer    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/other/apple-stellt-watch-vor-%e2%80%93-ab-10-april-bestellbar/vi-AA9z9zR"
       >

Apple stellt Watch vor – ab 10. April bestellbar    </a>

                            </li>
                            <li >
                                    <a href="/de-de/nachrichten/politik/keinen-is-k%c3%a4mpfer-wollen-sie-entkommen-lassen/ar-AA9yZZk"
       >

Keinen IS-K&#228;mpfer wollen sie entkommen lassen    </a>

                            </li>
                    </ul>

        </div>
    <div  class="ad"  id="rectangle1_homepage_cf0a5546-df18-4dff-ab8c-62425edbf3f5">
        <div>
            <div  id="rectangle1_homepage_container_cf0a5546-df18-4dff-ab8c-62425edbf3f5">
                    <script type="text/javascript">
                        //<![CDATA[
                        (function define_dap() 
                        {
                            if (window.dap)
                            {
                                return;
                            }

                            // all data needed to render the ads or refresh them
                            var postEvaluationClassname = "non-empty-ad";
                            var postEvaluationAdSmallClassname = "layout-small";
                            var postEvaluationAdMediumClassname = "layout-medium";
                            var postEvaluationAdLargeClassname = "layout-large";
                            var postEvaluationNoAdClassname = "no-ad";

                            var mediumAdHeight = 200;
                            var largeAdHeight = 550;

                            var numEvalPasses = 10;
                            var contentCheckTimeout = 300;
                            var discernibleAdHeightThreshold = 30;
                            var discernibleAdWidthThreshold = 40;

                            var AdSizeType =
                            {
                                NotAnAd: 0,                     // dimensions are both 0
                                PointSizedAd: 1,                // dimensions are both under threshold
                                NonPointSizedAd: 2,             // dimensions are both above threshold (full size)
                                Inconclusive: 3                 // one dimension is 0, and the other is above threshold
                            };

                            function dapResult(params, width, height, htmlid)
                            {
                                var elem = document.getElementById(htmlid);
                                if (!elem)
                                {
                                    return;
                                }

                                dapMgr.renderAd(htmlid, params, width, height);

                                var renderData = {
                                    params: params,
                                    width: width,
                                    height: height,
                                    htmlid: htmlid,
                                    adSizeType: AdSizeType.NotAnAd,
                                    canDisplayAdChoices: false,
                                    elem: elem
                                };

                                checkAndSetAdContainerVisibility(renderData);
                            }

                            function checkAndSetAdContainerVisibility(renderData)
                            {
                                var retries = numEvalPasses;

                                checkAndSetAdContainerVisibilityRec();

                                function checkAndSetAdContainerVisibilityRec(finalCheck)
                                {
                                    retries--;
                                    var adId = renderData.htmlid;

                                    checkVisibilityAndUpdateRenderDataContextForElement(renderData);

                                    var isLastPass = (retries === 0);
                                    var adSizeType = renderData.adSizeType;
                                    var adDetected = adSizeType !== AdSizeType.Inconclusive;
                                    var doShow = adSizeType === AdSizeType.NonPointSizedAd;
                                    if (adDetected || isLastPass)
                                    {
                                        var adSizeClassname = postEvaluationNoAdClassname;
                                        if (doShow)
                                        {
                                            if (renderData.height < mediumAdHeight)
                                            {
                                                adSizeClassname = postEvaluationAdSmallClassname;
                                            }
                                            else if (renderData.height < largeAdHeight)
                                            {
                                                adSizeClassname = postEvaluationAdMediumClassname;
                                            }
                                            else
                                            {
                                                adSizeClassname = postEvaluationAdLargeClassname;
                                            }
                                        }

                                        setAdContainerDisplayState(adId, doShow, adSizeClassname);

                                        // schedule one final check for RM
                                        if (!finalCheck)
                                        {
                                            setTimeout(
                                                function finalCheckOnAdContainer()
                                                {
                                                    checkAndSetAdContainerVisibilityRec(true);
                                                }, 
                                                (numEvalPasses * contentCheckTimeout) >> 1);
                                        }
                                    } 
                                    else
                                    {
                                        // if we got here, we didn't find anything but script. Try again later.
                                        setTimeout(checkAndSetAdContainerVisibilityRec, contentCheckTimeout);
                                    }
                                }
                            }

                            function checkVisibilityAndUpdateRenderDataContextForElement(renderData)
                            {
                                if (!renderData)
                                {
                                    return;
                                }
                                evaluateAdContent(renderData);
                            }

                            function evaluateAdContent(renderData)
                            {
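                                var adContainer = renderData && renderData.elem;
                                if (!adContainer)
                                {
                                    // nothing to inspect: record the result (if possible) and stop
                                    // before dereferencing the missing container below
                                    if (renderData)
                                    {
                                        renderData.adSizeType = AdSizeType.Inconclusive;
                                    }
                                    return;
                                }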

                                var adIframeCollection = adContainer.getElementsByTagName("iframe");
                                var evaluationResult;
                                for (var adIframe, ndx = 0; (adIframe = adIframeCollection[ndx]); ++ndx)
                                {
                                    // skip script-only iFrame elements
                                    var body = ((adIframe.contentDocument || (adIframe.contentWindow || {}).document) || {}).body;
                                    if (!body || !body.hasChildNodes())
                                    {
                                        continue;
                                    }

                                    var childNode, hasChildDiv = false;
                                    for (var index = body.childNodes.length - 1; (childNode = body.childNodes[index]); --index)
                                    {
                                        if (childNode.nodeType === 1 
                                            && childNode.nodeName !== "SCRIPT")
                                        {
                                            hasChildDiv = true;
                                            renderData.adSizeType = evaluateElement(childNode, renderData);
                                            if (renderData.adSizeType === AdSizeType.NonPointSizedAd)
                                            {
                                                return;
                                            }
                                        }

                                        // Bug 1715559:[dl_ux][FF9.0] [Win7] - Advertisement is overlapping destination section
                                        // For FF lower versions (FF9.0), index may be negative and hence cause js errors
                                        // Add index value check to solve the problem
                                        if (index <= 0)
                                        {
                                            break;
                                        }
                                    }

                                    if (renderData.adSizeType !== AdSizeType.NonPointSizedAd && hasChildDiv)
                                    {
                                        renderData.adSizeType = evaluateElement(adIframe, renderData);
                                    }

                                    if (renderData.adSizeType === AdSizeType.NonPointSizedAd)
                                    {
                                        return;
                                    }
                                }

                                // evaluate every descendant div of the ad container
                                // (getElementsByTagName is not limited to immediate children)
                                var adDivCollection = adContainer.getElementsByTagName("div");
                                for (var adDiv, ndx = 0; (adDiv = adDivCollection[ndx]); ++ndx)
                                {
                                    renderData.adSizeType = evaluateElement(adDiv, renderData);

                                    if (renderData.adSizeType === AdSizeType.NonPointSizedAd)
                                    {
                                        return;
                                    }
                                }
                            }

                            function evaluateElement(element, renderData)
                            {
                                var maxWidth = evaluateElementDimension(element, true, discernibleAdWidthThreshold);
                                var maxHeight = evaluateElementDimension(element, false, discernibleAdHeightThreshold);

                                renderData.width = maxWidth;
                                renderData.height = maxHeight;

                                if (maxWidth > discernibleAdWidthThreshold && maxHeight > discernibleAdHeightThreshold)
                                {
                                    return AdSizeType.NonPointSizedAd;
                                }
                                if (maxWidth > 0 && maxHeight > 0)
                                {
                                    return AdSizeType.PointSizedAd;
                                }
                                return AdSizeType.Inconclusive;
                            }

                            function evaluateElementDimension(element, isWidth, threshold)
                            {
                                var dimensionProperties = isWidth ? ["width", "offsetWidth", "scrollWidth"] : ["height", "offsetHeight", "scrollHeight"];
                                var pixelStyle = isWidth ? "pixelWidth" : "pixelHeight";
                                var dimensionStyle = isWidth ? "width" : "height";
                                var totalProperties = 3;
                                var maxDimension = 0, dimension = 0;

                                for (var i = 0; i < totalProperties; i++)
                                {
                                    if ((dimension = element[dimensionProperties[i]]) > maxDimension) 
                                    {
                                        maxDimension = dimension;
                                        if (maxDimension > threshold) 
                                        {
                                            break;
                                        }
                                    }
                                }

                                var elemStyle = element.style;
                                if (maxDimension <= threshold && elemStyle)
                                {
                                    if ((dimension = elemStyle[pixelStyle]) > maxDimension)
                                    {
                                        maxDimension = dimension;
                                        if (maxDimension <= threshold && (dimension = parseInt(elemStyle[dimensionStyle], 10)) > maxDimension)
                                        {
                                            maxDimension = dimension;
                                        }
                                    }
                                }

                                return maxDimension;
                            }

                            function setAdContainerDisplayState(elemId, doShow, adSizeClassname)
                            {
                                var adHtmlContainer = ((document.getElementById(elemId) || {}).parentNode || {}).parentNode;
                                if (!adHtmlContainer)
                                {
                                    return;
                                }

                                adHtmlContainer.style.display = doShow ? "" : "none";
                                var className = adHtmlContainer.className;

                                className = addOrRemoveClassname(className, postEvaluationClassname, doShow);
                                className = addOrRemoveClassname(className, postEvaluationAdSmallClassname, adSizeClassname === postEvaluationAdSmallClassname);
                                className = addOrRemoveClassname(className, postEvaluationAdMediumClassname, adSizeClassname === postEvaluationAdMediumClassname);
                                className = addOrRemoveClassname(className, postEvaluationAdLargeClassname, adSizeClassname === postEvaluationAdLargeClassname);
                                className = addOrRemoveClassname(className, postEvaluationNoAdClassname, adSizeClassname === postEvaluationNoAdClassname);

                                adHtmlContainer.className = className;
                            }

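                            function addOrRemoveClassname(classNameList, className, add)
                            {
                                // match whole class tokens; a substring test would confuse names that contain one another
                                var classNames = (classNameList || "").split(/\s+/);
                                var result = [];
                                var found = false;
                                for (var i = 0; i < classNames.length; i++)
                                {
                                    var current = classNames[i];
                                    if (!current || (current === className && (found || !add)))
                                    {
                                        continue;
                                    }
                                    if (current === className)
                                    {
                                        found = true;
                                    }
                                    result.push(current);
                                }
                                if (add && !found)
                                {
                                    result.push(className);
                                }
                                return result.join(" ");
                            }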

                            window.dap = dapResult;
                        })();
                        //]]>

                            dap("&amp;AP=1089&amp;PG=MSNDEDE11&amp;PVGUID=2312586169574cdf8ac39a2eb03636d1&amp;PROVIDERID=7HD66FC", 300, 250, "rectangle1_homepage_container_cf0a5546-df18-4dff-ab8c-62425edbf3f5");
                    </script>
            </div>
                    <a href="http://go.microsoft.com/fwlink/?LinkID=286759" target="_blank" class="adchoices" data-piitxt="adchoices">
                        <span> | Advertisement</span>
                    </a>
        </div>
    </div>
        <div class="stripe">
<h2>Shopping</h2>    <a href="/de-de/finanzen/top-stories/computer-mittwochs-kaufen-schuhe-donnerstags/ar-AA9s06u"
       >
<img alt="Shopping for shoes online - best done on Thursdays" height="194" src="//img.s-msn.com/tenant/amp/entityid/BBhCSqz.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="GARO/PHANIE/REX" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/finanzen/top-stories/computer-mittwochs-kaufen-schuhe-donnerstags/ar-AA9s06u"
       >

Buy computers on Wednesdays, shoes on Thursdays    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/verbraucherpreise-sinken-gewinner-und-verlierer/ar-BBi1TT6"
       >

Consumer prices are falling: winners and losers    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/die-teuersten-einkaufsstra%c3%9fen-der-welt/ss-BBgccw1"
       >

The world's most expensive shopping streets    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/neuer-werbespot-des-aldi-konkurrenten-%e2%80%9e%c3%bcberhaupt-nichts-mit-der-marke-lidl-zu-tun%e2%80%9c/ar-BBhUZTq"
       >

New commercial: Lidl wants to become a premium brand    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/discounter-wollen-mehr-bieten-als-nur-billig/ar-BBhZPif"
       >

Discounters want to offer more than just low prices    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/unterhaltung" >Entertainment</a></h2>    <a href="/de-de/unterhaltung/nachrichten/notruf-aus-newtopia/ar-AA9yG3h"
       >
<img alt="Serious incident on &quot;Newtopia&quot;: Isolde collapses - an emergency doctor has to be called" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9yIGU.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="Newtopia/Sat.1" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/unterhaltung/nachrichten/notruf-aus-newtopia/ar-AA9yG3h"
       >

Emergency call from &quot;Newtopia&quot;    </a>

                            </li>
                            <li >
                                    <a href="/de-de/unterhaltung/nachrichten/videokonferenz-mit-helene-fischer-m%c3%b6glich/ar-AA9yvub"
       >

Helene Fischer: video testimony in court?    </a>

                            </li>
                            <li >
                                    <a href="/de-de/unterhaltung/nachrichten/daniel-k%c3%bcblb%c3%b6ck-er-zeigt-seinen-geheimen-freund/ar-AA9ylW2"
       >

Daniel K&#252;blb&#246;ck: He reveals his secret boyfriend!    </a>

                            </li>
                            <li >
                                    <a href="/de-de/unterhaltung/musik/tokio-hotel-lassen-ihre-fans-tief-in-die-tasche-greifen/ar-AA9ykGi"
       >

Are Tokio Hotel ripping off their fans?    </a>

                            </li>
                            <li >
                                    <a href="/de-de/unterhaltung/nachrichten/carolin-kebekus-vergleicht-helene-fischer-fans-mit-ultras-des-1-fc-k%c3%b6ln/ar-AA9ymly"
       >

Kebekus compares Helene Fischer fans to the ultras of 1. FC K&#246;ln    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/sport" >Sport</a></h2>    <a href="/de-de/sport/fussball/schalke-will-sich-nur-nicht-abschlachten-lassen/ar-AA9yrpJ"
       >
<img alt="Schalke's Japanese player Atsuto Uchida, left, in a footrace with Madrid's Cristiano Ronaldo on 18 February 2015 in the first leg of the Champions League round of 16." height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9yNFz.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f&amp;x=1367&amp;y=216" title="Martin Meissner, AP Photo" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/sport/fussball/schalke-will-sich-nur-nicht-abschlachten-lassen/ar-AA9yrpJ"
       >

Schalke's fear of being &quot;slaughtered&quot;    </a>

                            </li>
                            <li >
                                    <a href="/de-de/sport/fussball/%c2%abcarlettos%c2%bb-weiche-hand/ar-AA9yEjM"
       >

A chance for Schalke? &quot;Real is slow and rather confused&quot;    </a>

                            </li>
                            <li >
                                    <a href="/de-de/sport/wintersport/schweinsteiger-gegen-neureuther-rennen-am-gudiberg/ar-AA9xW1L"
       >

Fun on the Gudiberg: Neureuther races slalom against Schweinsteiger    </a>

                            </li>
                            <li >
                                    <a href="/de-de/sport/wintersport/neureuthers-lobeshymne-auf-hirscher/ar-AA9xSRH"
       >

What Neureuther really thinks of Hirscher    </a>

                            </li>
                            <li >
                                    <a href="/de-de/sport/tennis/kerber-auf-der-us-tour-mit-trainer-torben-beltz/ar-AA9ysNg"
       >

Kerber returns to her former coach    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/finanzen" >Finance</a></h2>    <a href="/de-de/finanzen/other/streik-in-kitas-eltern-ratlos/ar-AA9yPdq"
       >
<img alt="Childcare workers' warning strike: 2000 teachers demonstrated last week." height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9yHyw.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="dpa" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/finanzen/other/streik-in-kitas-eltern-ratlos/ar-AA9yPdq"
       >

Strike at daycare centers, parents at a loss    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/zweifel-an-raschem-freihandelsvertrag-von-eu-und-japan/ar-AA9yIrb"
       >

Doubts about a swift free trade agreement between the EU and Japan    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/ezb-%c3%b6ffnet-die-geldschleusen-dax-geht-die-puste-aus/ar-AA9yfoy"
       >

ECB opens the money floodgates - DAX runs out of steam    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/top-stories/lassen-sie-sich-%c3%bcberwachen-%e2%80%93-und-sparen-sie-geld/ar-AA9xYJb"
       >

Let yourself be monitored – and save money    </a>

                            </li>
                            <li >
                                    <a href="/de-de/finanzen/other/general-motors-sch%c3%bcttet-milliarden-an-aktion%c3%a4re-aus/ar-AA9yy45"
       >

General Motors pays out billions to shareholders    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/lifestyle" >Lifestyle</a></h2>    <a href="/de-de/lifestyle/leben/happy-birthday-barbie/ss-AA9y26K"
       >
<img alt="new caption" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9soiv.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f&amp;x=1508&amp;y=321" title="Getty Images/Getty Images" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/lifestyle/leben/happy-birthday-barbie/ss-AA9y26K"
       >

Happy Birthday, Barbie!    </a>

                            </li>
                            <li >
                                    <a href="/de-de/lifestyle/style/f%c3%bcnf-goldene-beauty-regeln/ss-AA9ysSV"
       >

Five golden beauty rules    </a>

                            </li>
                            <li >
                                    <a href="/de-de/lifestyle/leben/liebe-and-beziehung-beim-verlieben-gibt-es-keine-regeln/ar-AA9tTve"
       >

Love &amp; relationships: dating tips you can do without    </a>

                            </li>
                            <li >
                                    <a href="/de-de/lifestyle/lifestylewomen/sex-liebe-karriere-neun-frauenfakten-im-check/ar-AA9w7sp"
       >

Sex, love, career - nine facts about women put to the test    </a>

                            </li>
                            <li >
                                    <a href="/de-de/lifestyle/style/sara-sampaio-bunt-and-sexy-portugiesisches-model-wirbt-f%c3%bcr-calzedonia-bademode/ar-BBihfaq"
       >

Colorful &amp; sexy! Model Sara Sampaio in a bikini    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/gesundheit" >Health &amp; Fitness</a></h2>    <a href="/de-de/gesundheit/kraft/wieso-sport-schlau-macht/ss-AA9woHq"
       >
<img alt="new caption" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9yyoM.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="Photo: Getty Images" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/gesundheit/kraft/wieso-sport-schlau-macht/ss-AA9woHq"
       >

Why exercise makes you smarter    </a>

                            </li>
                            <li >
                                    <a href="/de-de/gesundheit/medizinisch/herzerkrankungen-frauen-sind-st%c3%a4rker-betroffen/ar-AA9xLs8"
       >

Heart disease: women are more severely affected    </a>

                            </li>
                            <li >
                                    <a href="/de-de/gesundheit/ernaehrung/ballaststoffe-diese-lebensmittel-halten-lange-satt/ss-BBicdTp"
       >

Dietary fiber: these foods keep you full for a long time!    </a>

                            </li>
                            <li >
                                    <a href="/de-de/gesundheit/ernaehrung/neue-studie-beweist-margarine-ist-ges%c3%bcnder-als-butter/ss-BBijHZT"
       >

New study proves: margarine is healthier than butter!    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/kochen-und-genuss" >Food &amp; Drink</a></h2>    <a href="/de-de/kochen-und-genuss/rezepte/h%c3%a4hnchen-nach-j%c3%a4gerart/fd-9ce3ffb4-8a6a-5388-a71a-84af5c057886"
       >
<img alt="Hunter-style chicken" height="194" src="//img.s-msn.com//tenant/amp/entityid/AA9yrjo/_h194_w300_m6_utrue_otrue_lfalse.jpg" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/kochen-und-genuss/rezepte/h%c3%a4hnchen-nach-j%c3%a4gerart/fd-9ce3ffb4-8a6a-5388-a71a-84af5c057886"
       >

Hunter-style chicken    </a>

                            </li>
                            <li >
                                    <a href="/de-de/kochen-und-genuss/essen-news/trend-der-filterkaffee-ist-zur%c3%bcck/ar-AA9sRzO"
       >

Trend: filter coffee is back    </a>

                            </li>
                            <li >
                                    <a href="/de-de/kochen-und-genuss/essen-news/10-geniale-rezeptideen-mit-frischk%c3%a4se/ss-AA9n81a"
       >

10 brilliant recipe ideas with cream cheese    </a>

                            </li>
                            <li >
                                    <a href="/de-de/kochen-und-genuss/other/detox-zum-fr%c3%bchst%c3%bcck-so-gut-schmeckt-die-detox-kur/ar-BBfYXSN"
       >

Detox for breakfast: how good the detox regimen tastes    </a>

                            </li>
                            <li >
                                    <a href="/de-de/kochen-und-genuss/essen-news/die-wahrscheinlich-besten-muffin-rezepte-der-welt/ss-BBi8mGu"
       >

Probably the best muffin recipes in the world    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/reisen" >Travel</a></h2>    <a href="/de-de/reisen/artikel/venezianisches-flair-oder-hamburger-kiez-spontane-trips-f%c3%bcr-den-resturlaub-aus-2014/ss-BBijsWm"
       >
<img alt="At this time of year Venice is far less crowded and therefore much more pleasant than in midsummer. Once Carnival in mid-February is over, visitors can stroll through the alleys at their leisure. Particular highlights are the Doge's Palace and St Mark's Basilica, which should not be missed on any sightseeing trip." height="194" src="//img.s-msn.com/tenant/amp/entityid/BBijFW3.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="GetYourGuide" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/reisen/artikel/venezianisches-flair-oder-hamburger-kiez-spontane-trips-f%c3%bcr-den-resturlaub-aus-2014/ss-BBijsWm"
       >

Spontaneous trips for your leftover 2014 vacation days    </a>

                            </li>
                            <li >
                                    <a href="/de-de/reisen/artikel/das-hotel-der-zukunft-er%c3%b6ffnet-in-wien/ss-BBii0H7"
       >

The hotel of the future opens in Vienna    </a>

                            </li>
                            <li >
                                    <a href="/de-de/reisen/artikel/welches-wahrzeichen-ist-das/ss-BBddSlH"
       >

Which landmark is this?    </a>

                            </li>
                            <li >
                                    <a href="/de-de/reisen/artikel/neun-trendige-reiseziele-f%c3%bcr-2015-von-portland-bis-zur-w%c3%bcste-gobi/ss-BBicSXY"
       >

Nine trendy travel destinations for 2015: from Portland to the Gobi Desert    </a>

                            </li>
                            <li >
                                    <a href="/de-de/reisen/artikel/verwunschene-wege-unter-bl%c3%a4tterd%c3%a4chern/ss-BBi8ktO"
       >

Enchanted paths beneath canopies of leaves    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/auto" >Cars</a></h2>    <a href="/de-de/auto/nachrichten/die-hei%c3%9festen-girls-aus-genf-genfer-autosalon-2015/ar-BBif91D"
       >
<img alt="new caption" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9xPMt.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f&amp;x=807&amp;y=280" title="Auto Motor und Sport" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/auto/nachrichten/die-hei%c3%9festen-girls-aus-genf-genfer-autosalon-2015/ar-BBif91D"
       >

The hottest girls from Geneva - Geneva Motor Show 2015    </a>

                            </li>
                            <li >
                                    <a href="/de-de/auto/oldtimer/7-oldtimer-schn%c3%a4ppchen-von-2550-bis-23900-%e2%82%ac-klassiker-zum-sonderpreis/ar-AA9vyOb"
       >

7 classic car bargains from €2,550 to €23,900 - classics at a special price    </a>

                            </li>
                            <li >
                                    <a href="/de-de/auto/nachrichten/satte-ps-zugabe-f%c3%bcr-die-werksautos-tuning-auf-dem-genfer-autosalon-2015/ar-AA9tCQW"
       >

A hefty horsepower boost for factory cars - tuning at the Geneva Motor Show 2015    </a>

                            </li>
                            <li >
                                    <a href="/de-de/auto/other/sitzprobe-audi-r8/ss-AA9sEyH"
       >

Seat test: Audi R8    </a>

                            </li>
                            <li >
                                    <a href="/de-de/auto/nachrichten/coup%c3%a9-kommt-mit-f%c3%bcnfliter-v8-erlk%c3%b6nig-hyundai-genesis-coup%c3%a9-2016/ar-AA9txR4"
       >

Coup&#233; arrives with a five-liter V8 - camouflaged prototype of the Hyundai Genesis Coup&#233; (2016)    </a>

                            </li>
                    </ul>

        </div>
        <div class="stripe">
<h2><a href="/de-de/video" >Video</a></h2>    <a href="/de-de/video/nachrichten/mumifizierte-bergsteiger-entdeckt/vi-AA9uBYh"
       >
<img alt="Mummified mountaineers discovered" height="194" src="//img.s-msn.com/tenant/amp/entityid/AA9uJIr.img?h=194&amp;w=300&amp;m=6&amp;q=60&amp;u=t&amp;o=t&amp;l=f" title="Reuters" width="300" />
    </a>
                    <ul>
                            <li  class="first">
                                    <a href="/de-de/video/nachrichten/mumifizierte-bergsteiger-entdeckt/vi-AA9uBYh"
       >

Mummified mountaineers discovered    </a>

                            </li>
                            <li >
                                    <a href="/de-de/video/nachrichten/drogenschmuggler-werfen-ballast-ab/vi-AA9y3Dc"
       >

Drug smugglers jettison &quot;ballast&quot;    </a>

                            </li>
                            <li >
                                    <a href="/de-de/video/nachrichten/hereinspaziert-wilder-l%c3%b6we-%c3%b6ffnet-autot%c3%bcr/vi-BBigvur"
       >

Come on in: wild lion opens car door    </a>

                            </li>
                            <li >
                                    <a href="/de-de/video/ansehen/crash-in-taiwan-dashcam-zeichnet-alles-auf/vi-AA9uQNf"
       >

Crash in Taiwan: dashcam records everything    </a>

                            </li>
                            <li >
                                    <a href="/de-de/video/nachrichten/vin-diesel-ist-der-star-mit-den-meisten-facebook-freunden/vi-AA9xvIH"
       >

Vin Diesel ist der Star mit den meisten Facebook-Freunden    </a>

                            </li>
                    </ul>

        </div>


        </div>
        <div id="aside"  data-region="aside">

        </div>
<div id="nav">
        <ul class="verticalsnav">
                <li  class="current">
                    <a href="/de-de">Startseite</a>
                </li>
                <li >
                    <a href="/de-de/nachrichten">Nachrichten</a>
                </li>
                <li >
                    <a href="/de-de/wetter">Wetter</a>
                </li>
                <li >
                    <a href="/de-de/unterhaltung">Unterhaltung</a>
                </li>
                <li >
                    <a href="/de-de/sport">Sport</a>
                </li>
                <li >
                    <a href="/de-de/finanzen">Finanzen</a>
                </li>
                <li >
                    <a href="/de-de/lifestyle">Lifestyle</a>
                </li>
                <li >
                    <a href="/de-de/gesundheit">Gesundheit &amp; Fitness</a>
                </li>
                <li >
                    <a href="/de-de/kochen-und-genuss">Kochen &amp; Genuss</a>
                </li>
                <li >
                    <a href="/de-de/reisen">Reisen</a>
                </li>
                <li >
                    <a href="/de-de/auto">Auto</a>
                </li>
                <li >
                    <a href="/de-de/video">Video</a>
                </li>
        </ul>
</div>
    </div>
    <div id="foot">
        <div>            <a href="http://www.microsoft.com/de-de/">&#169; 2015 Microsoft</a>
            <a href="http://go.microsoft.com/fwlink/?LinkId=248688">Datenschutz und Cookies</a>
            <a href="http://windows.microsoft.com/de-de/windows-live/microsoft-services-agreement">Nutzungsbedingungen</a>
            <a href="http://go.microsoft.com/fwlink/?LinkID=286759">&#220;ber unsere Anzeigen</a>
            <a href="https://jfe.qualtrics.com/form/SV_d4ir2X6Zkgjw0rb">Feedback</a>
            <a href="/de-de/nachrichten/schlagzeilen/Impressum/ar-BB56cmH">Impressum</a>
            <a href="/de-de/msn-worldwide">MSN Weltweit</a>
            <a href="http://www.bing.com/explore/newsletter?mkt=de-de&amp;FORM=MF12BH&amp;OCID=MF12BH&amp;wt.mc_id=MF12BH">Newsletter</a>
            <a href="http://go.microsoft.com/fwlink/?LinkId=512703">Hilfe</a>
            <a href="http://advertising.microsoft.com/de-de">Werben auf MSN</a>
</div>
    </div>
            <div>
            <img src="//c.msn.com/c.gif?udc=true&amp;rid=2312586169574cdf8ac39a2eb03636d1&amp;rnd=635615492100105610&amp;rf=&amp;tp=http%253A%252F%252Fwww.msn.com%252Fde-de%252F&amp;di=108&amp;lng=de-de&amp;cv.product=prime&amp;d.dg1=&amp;d.dg2=&amp;d.dg3=&amp;d.dg4=&amp;d.dgk=downlevel.pc&amp;d.imd=0&amp;d.b=Mozilla&amp;d.bv=0.0&amp;d.p=Unknown&amp;d.pv=Unknown%20Unknown" alt="image beacon" width="1" height="1" /><img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=3000001&amp;rn=635615492100105610&amp;c7=http%253A%252F%252Fwww.msn.com%252Fde-de%252F&amp;c8=&amp;c9=" alt="image beacon" width="1" height="1" /><img src="//otf.msn.com/c.gif?js=0&amp;evt=impr&amp;di=108&amp;pi=&amp;ps=&amp;su=http%253A%252F%252Fwww.msn.com%252Fde-de%252F&amp;pageid=startpage&amp;mkt=de-de&amp;pn=startpage&amp;mv=15&amp;pp=False&amp;cv.product=prime&amp;cv.partner=&amp;cv.publcat=&amp;st.dpt=&amp;st.sdpt=&amp;dv.Title1=&amp;cts=635615492100105610&amp;rf=&amp;rid=2312586169574cdf8ac39a2eb03636d1&amp;cvs=Browser&amp;subcvs=homepage&amp;cv.entityId=&amp;cv.entitySrc=&amp;provid=&amp;ar=0&amp;d.dg1=&amp;d.dg2=&amp;d.dg3=&amp;d.g4=&amp;d.dgk=downlevel.pc&amp;d.imd=0&amp;d.b=Mozilla&amp;d.bv=0.0&amp;d.p=Unknown&amp;d.pv=Unknown%20Unknown" alt="image beacon" width="1" height="1" />
        </div>




</body>
    <!--MSNAvailToken--></html>

Thanks and sorry for the long comment.

Sklearn incompatibility

This bug has resurfaced (see closed issue 52).

/usr/local/lib/python3.7/site-packages/sklearn/base.py:253: UserWarning: Trying to unpickle estimator FeatureUnion from version 0.19.1 when using version 0.20.4. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

scikit-learn 0.20.4
Cython 0.29.14

Installed via pip3 install dragnet (OS X 10.15.2 (19C57))
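
The warning above is triggered because the scikit-learn version recorded in the pickle differs from the installed one. A minimal sketch of such a check, using only the standard library; the helper names and the rule of comparing only the major and minor components are illustrative assumptions, not dragnet's or scikit-learn's actual logic:

```python
import re


def parse_version(version):
    """Turn '0.19.1' into (0, 19, 1), keeping only each part's leading digits."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match:
            parts.append(int(match.group()))
    return tuple(parts)


def same_minor_release(pickled_with, installed):
    """Hypothetical rule: compatible only if the major.minor components match."""
    return parse_version(pickled_with)[:2] == parse_version(installed)[:2]


# Versions taken from the warning in this report.
print(same_minor_release("0.19.1", "0.20.4"))  # False
```

A check like this could run before loading a model, so that a mismatch is reported clearly rather than surfacing later as a broken prediction.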

How to maximize recall over precision

For my use case I want to maximize recall, with precision being less important. Is there a straightforward way to achieve this?
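
One common way to trade precision for recall with a probabilistic classifier is to lower the decision threshold. The sketch below is standalone and does not call dragnet. It assumes you can obtain a per-block probability of being content (for example from a scikit-learn classifier's predict_proba); the function name and the sample values are hypothetical:

```python
def select_blocks(block_probs, threshold=0.5):
    """Return indices of blocks whose content probability meets the threshold."""
    return [i for i, p in enumerate(block_probs) if p >= threshold]


# Hypothetical per-block content probabilities.
probs = [0.92, 0.41, 0.08, 0.55, 0.30]

default_pick = select_blocks(probs)        # stricter: favors precision
recall_pick = select_blocks(probs, 0.25)   # looser: favors recall

print(default_pick)  # [0, 3]
print(recall_pick)   # [0, 1, 3, 4]
```

Lowering the threshold keeps more borderline blocks, which raises recall at the cost of letting more boilerplate through.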

Error compiling Cython file

I get an error when running sudo make install:

pair.from_py:152:13: 'uint32_t' is not a type identifier

Vagrant install not working

I followed the Vagrant install instructions. The install log (see vagrant.log) shows some compile warnings.

Then, when I try to run the example code as is, the names fail to import.

vagrant@dragnet:/vagrant/dragnet-test$ python dg-load.py 
Traceback (most recent call last):
  File "dg-load.py", line 2, in <module>
    from dragnet import content_extractor, content_comments_extractor
ImportError: cannot import name content_extractor


vagrant@dragnet:/vagrant/dragnet-test$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec  7 2017, 17:05:42) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import dragnet
>>> dir(dragnet)
['Blockifier', 'BlockifyError', 'Extractor', 'PartialBlock', '_LOADED_MODELS', '__builtins__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', 'blocks', 'compat', 'extract_comments', 'extract_content', 'extract_content_and_comments', 'extractor', 'features', 'load_pickled_model', 'util']
>>>

Did something change in the recent commits? Or maybe I'm doing something wrong.

Training content extraction models?

After following the steps in the Training content extraction section, I'm getting an AttributeError on 'model.fit' when running the code to train additional/new content. What class needs to be instantiated for the model?

from dragnet import models
model = models.kohlschuetter_weninger_model
dragnet.model_training.train_models(datadir, output_dir, features_to_use, model, 'both')
Reading the training and test data...
..done!
Got 980 training, 416 test documents
Initializing features
Training the model
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/opt/virtualenvs/Dragnet/lib/python2.7/site-packages/dragnet/model_training.py", line 394, in train_models
model.fit(features, labels, np.minimum(weights, 200.0))
AttributeError: 'ContentExtractionModel' object has no attribute 'fit'

Thanks!

bump scikit-learn version ceiling to 0.20.1?

Hi! dragnet currently requires scikit-learn<=0.19.2, which prevents users from developing alongside the final Py2/3 compatible release of 0.20.1 (see here). Would you be able to bump the version ceiling and publish a new release? Assuming no version-driven incompatibilities, I think this shouldn't be too much work; you don't have to train new, version-specific models, right?

error when installing on windows

Hello,
I'm new to python.
I cannot install Dragnet on Windows 7 due to an error with mozsci.
Is any help possible? Thanks

Installing collected packages: scikit-learn, mozsci
Found existing installation: scikit-learn 0.18.1
Uninstalling scikit-learn-0.18.1:
Successfully uninstalled scikit-learn-0.18.1
Running setup.py install for mozsci ... error
Complete output from command c:\python27\python.exe -u -c "import setuptools
, tokenize;file='c:\users\vola-g1.sav\appdata\local\temp\pip-build-q1
ppjo\mozsci\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read(
).replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install
--record c:\users\vola-g1.sav\appdata\local\temp\pip-qdnl_z-record\install-rec
ord.txt --single-version-externally-managed --compile:

line breaker

Is it possible to keep line breaks in the extracted content?

I don't understand the install on Vagrant

vagrant@precise64:/vagrant/dragnet-1.1.0$ make test
nosetests --exe --cover-package=dragnet --with-coverage --cover-branches -v --cover-erase
make: nosetests: Command not found
make: *** [nose] Error 127

publish a new release on pypi?

Given the huge number of changes to dragnet since the last release (v1.1, Jan '17), are there near-term plans to publish a new release on PyPI? I see that the failing test got removed, but don't know if there are other critical to-dos before everything's release-ready.

error

ubuntu 14.04;
error:
Traceback (most recent call last):
File "", line 1, in
File "dragnet/__init__.py", line 11, in
from .models import content_extractor, content_comments_extractor
File "dragnet/models.py", line 21, in
content_extractor = pickle.load(fin)
File "_tree.pyx", line 2217, in sklearn.tree._tree.Tree.__cinit__ (sklearn/tree/_tree.c:17574)
ValueError: ("Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'", <type 'sklearn.tree._tree.Tree'>, (8, array([2], dtype=int64), 1))

Can you explain the reason for the problem?
Thank you in advance!

Failed to train the comments-content model with train_many_models

Hi, I am new to machine learning and am trying to train the model based on https://github.com/dragnet-org/dragnet#training-content-extraction-models, but it fails with the error ValueError: setting an array element with a sequence.
I only use the dragnet data, and my code is simply as follows:

features = ['kohlschuetter', 'weninger', 'readability']
to_extract = ['content', 'comments']

extract_all_gold_standard_data(
    data_dir=rootdir,
    nprocesses=20
)
model = ExtraTreesClassifier()
base_extractor = Extractor(
    features=features,
    to_extract=to_extract,
    model=model
)
param_grid={'n_estimators': [10, 20, 50, 75]}
extractor = train_many_models(base_extractor, param_grid, rootdir, train_out_dir, verbose=1)

The detailed error message is:

train.py
WARNING:root:extraction failed: too few blocks (1)
/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py:643: DeprecationWarning: "fit_params" as a constructor argument was deprecated in version 0.19 and will be removed in version 0.21. Pass fit parameters to the "fit" method instead.
  '"fit" method instead.', DeprecationWarning)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Traceback (most recent call last):
  File "/Users/XXXXXX/dev/src/train.py", line 31, in <module>
    extractor = train_many_models(base_extractor, param_grid, rootdir, train_out_dir, verbose=1)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/dragnet/model_training.py", line 190, in train_many_models
    gscv = gscv.fit(train_features, train_labels)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 253, in fit
    sample_weight = check_array(sample_weight, ensure_2d=False)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

However, if I change to_extract = ['content', 'comments'] to to_extract = ['content'], it succeeds.

Could I have your guidance on the failure? Thanks.

How to install the dragnet package?

I've got the following error.

Collecting pytest>=4.0.0 (from -r requirements.txt (line 5))
Using cached pytest-7.4.2-py3-none-any.whl.metadata (7.9 kB)
Collecting pytest-cov>=2.6.0 (from -r requirements.txt (line 6))
Using cached pytest_cov-4.1.0-py3-none-any.whl.metadata (26 kB)
Collecting scikit-learn<0.21.0,>=0.15.2 (from -r requirements.txt (line 7))
Using cached scikit-learn-0.20.4.tar.gz (11.7 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [24 lines of output]
:12: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
Partial import of sklearn during the build process.
Traceback (most recent call last):
File "", line 153, in get_numpy_status
ModuleNotFoundError: No module named 'numpy'
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 149, in prepare_metadata_for_build_wheel
return hook(metadata_directory, config_settings)
File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-6v8axpi2\overlay\Lib\site-packages\setuptools\build_meta.py", line 396, in prepare_metadata_for_build_wheel
self.run_setup()
File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-6v8axpi2\overlay\Lib\site-packages\setuptools\build_meta.py", line 507, in run_setup
super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
File "C:\Users\Administrator\AppData\Local\Temp\pip-build-env-6v8axpi2\overlay\Lib\site-packages\setuptools\build_meta.py", line 341, in run_setup
exec(code, locals())
File "", line 250, in
File "", line 238, in setup_package
ImportError: Numerical Python (NumPy) is not installed.
scikit-learn requires NumPy >= 1.8.2.
Installation instructions are available on the scikit-learn website: http://scikit-learn.org/stable/install.html

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
WARNING: There was an error checking the latest version of pip.

Use more parallelization in training, other speedups

Hi Matt, Dan,

thanks for this wonderful library.

While training some augmented models, I noticed that there are some steps in the process which could benefit a lot from parallelization.
There are also small corners where expanding the interface a bit would streamline processing in some cases.

I could submit one or several PRs, but want to ask whether you would be willing to have them.

I have a fork that I'm collecting changes in.

So far I'd propose to:

Use a multiprocessing.Pool for data pre-processing in data_processing.py:prepare_all_data. (This gives a near-linear speedup, saving a couple of minutes on my 4-core machine.)
Use n_jobs=-1 where possible (e.g. in GridSearchCV). Substantial speedup by default that grows with the number of grid parameters.
Let the Blockifier.blockify interface accept pre-parsed etree instances. I understand that this may be a niche use case, but I'm processing a lot of html where I need to parse the tree before extraction for some pre-processing. Skipping the duplicate parsing saves about 30% overall time:
Parsing 1000 entries with pre-existing trees, three run average: 28.9 seconds (user time)
Parsing 1000 entries from string, three run average: 37.7 seconds (user time)
(Note that this includes the first parsing pass and some trivial overhead for loading the data. Since the time includes python startup and model loading, the 30% saving is a lower bound estimate).
Finally, I've looked at some profiling data and saw that casting the blocks (str_block_list_cast in blocks.pyx: https://github.com/dragnet-org/dragnet/blob/master/dragnet/blocks.pyx#L860) presents a non-trivial overhead. There may be some potential here: I tried simply skipping the entire casting step, only decoding the text to unicode for the regex to work. The extraction seemed to still work fine. So I'm not quite sure in which cases the blocks would be in an unknown (bytes, str) state.

Thanks, best,
Pascal

port existing tests to `pytest`

Currently, dragnet has unittest-based tests run via nose, but we could and probably should take advantage of a more modern and maintained testing framework like pytest. This is more of a chore than an issue, but just wanted to assess buy-in before writing any code.

Tasks:

port the tests into pytest-style
update testing dependencies in requirements.txt
bonus: get Travis CI running tests via pytest (side note: is the make test command in .travis.yml actually working??)
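
As a rough illustration of what the first task involves, the sketch below shows the same check written in unittest style and in pytest style. The function under test, normalize_whitespace, is a hypothetical stand-in and not part of dragnet:

```python
import unittest


def normalize_whitespace(text):
    """Hypothetical function under test: collapse runs of whitespace."""
    return " ".join(text.split())


# unittest style: a TestCase subclass with assert* methods.
class TestNormalizeWhitespace(unittest.TestCase):
    def test_collapses_runs(self):
        self.assertEqual(normalize_whitespace("a  b\n c"), "a b c")


# pytest style: a plain function with a bare assert.
def test_collapses_runs():
    assert normalize_whitespace("a  b\n c") == "a b c"
```

pytest can also collect unittest.TestCase classes as they are, so a port of this kind can proceed file by file.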

ModuleNotFoundError is returned when I import dragnet as dependency

Hi,
I have a python project with the following setup.py:

setup(name='my_package',
      version='0.0.1',
      description='',
      author='',
      author_email='',
      packages=find_packages(where='src'),
      package_dir={'':'src'},
      install_requires=[
          'dragnet',
          'spacy',
          'pandas',
          'pytest'
      ]
)

I'm using python 3.7

When I try to install the project dependencies I have the following error:

Searching for dragnet
Reading https://pypi.org/simple/dragnet/
Downloading https://files.pythonhosted.org/packages/2f/8c/3ae7c2824d612555bc936a0fac43568e8c3a9d4e58a88565b8a6b2a1dc7e/dragnet-2.0.3.tar.gz#sha256=58790e43f670d58277569568b4ca8e70675e985432278c0603a805ca5c9c21b7
Best match: dragnet 2.0.3
Processing dragnet-2.0.3.tar.gz
Writing /var/folders/6h/7qdbkwd17cvd43rgl46c4jr40000gp/T/easy_install-smim9159/dragnet-2.0.3/setup.cfg
Running dragnet-2.0.3/setup.py -q bdist_egg --dist-dir /var/folders/6h/7qdbkwd17cvd43rgl46c4jr40000gp/T/easy_install-smim9159/dragnet-2.0.3/egg-dist-tmp-r55dxunp
Traceback (most recent call last):
  File "/Users/lanottef/miniconda3/envs/my_package/lib/python3.7/site-packages/setuptools/sandbox.py", line 154, in save_modules
    yield saved
  File "/Users/lanottef/miniconda3/envs/my_package/lib/python3.7/site-packages/setuptools/sandbox.py", line 195, in setup_context
    yield
  File "/Users/lanottef/miniconda3/envs/my_package/lib/python3.7/site-packages/setuptools/sandbox.py", line 250, in run_setup
    _execfile(setup_script, ns)
  File "/Users/lanottef/miniconda3/envs/my_package/lib/python3.7/site-packages/setuptools/sandbox.py", line 45, in _execfile
    exec(code, globals, locals)
  File "/var/folders/6h/7qdbkwd17cvd43rgl46c4jr40000gp/T/easy_install-smim9159/dragnet-2.0.3/setup.py", line 25, in <module>
ModuleNotFoundError: No module named 'lxml'

I get the same error if I add lxml to the install_requires param (placing lxml before dragnet).
What am I doing wrong?
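
The traceback shows dragnet's setup.py failing on lxml at line 25 while the package is being built, so the module apparently has to be importable before the build runs. A minimal pre-flight sketch that reports which modules cannot be found; the helper name is hypothetical, and every module in the list other than lxml (the only one confirmed by the traceback) is an assumption:

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'lxml' comes from the traceback; 'numpy' and 'Cython' are assumed extras.
needed_at_build_time = ["lxml", "numpy", "Cython"]
missing = missing_modules(needed_at_build_time)
if missing:
    print("Install these first:", ", ".join(missing))
```

Installing the reported modules in a separate step before installing the project may avoid the error.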

how to train it to get author name and headline

issue installing

When I run make test, I get: *** No rule to make target `test'. Stop.

My command is: vagrant@dragnet:~$ make test

How to install dragnet

Hello. I'm new to Python and I found installing the package very hard.
Which version do I need?
Which packages do I need to install?
It's very complicated for a beginner.
Thanks

How to exclude disclaimer?

I wonder if there is a way to exclude a disclaimer section in the bottom of the content. For example in this page: https://medium.com/eosio/eosio-stack-exchange-proposal-a4b3787e1562
I'm using:

import requests
from dragnet import extract_content
url = "https://medium.com/eosio/eosio-stack-exchange-proposal-a4b3787e1562"
r = requests.get(url)
extract_content(r.content)

Using only Kohlschuetter feature produces TypeError

When I use the kohlschuetter feature alone, trying to extract content produces this error:

Traceback (most recent call last):
 File "test.py", line 35, in <module>
   content = content_extractor.extract(r.content)
 File "dragnet\extractor.py", line 171, in extract
   return str_cast(b'\n'.join(blocks[ind].text for ind in np.flatnonzero(preds)))
TypeError: sequence item 0: expected a bytes-like object, str found

I changed
str_cast(b'\n'.join(blocks[ind].text for ind in np.flatnonzero(preds)))
to
str_cast('\n'.join(blocks[ind].text for ind in np.flatnonzero(preds)))
and the error disappeared.
Is it ok to do so?
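
The TypeError arises because bytes.join accepts only bytes-like items, while the first block text here is a str. Dropping the b prefix would fail in the opposite case, when the block texts are bytes. A minimal sketch of a more defensive join, independent of dragnet; the helper names and the UTF-8 default are illustrative assumptions:

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave str unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_block_texts(texts):
    """Join block texts with newlines, whether they are bytes, str, or mixed."""
    return "\n".join(to_text(t) for t in texts)


print(join_block_texts([b"first block", "second block"]))
```

Normalizing every item to one type before joining avoids depending on which type a given feature set happens to produce.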

incompatible sklearn/joblib version?

Hi everyone,

First of all, great project ;). I am trying to get dragnet to run but have a problem loading the pickled models. This is probably due to a version conflict in joblib, numpy, or sklearn. At least that's what I assume based on this Stack Overflow question:

https://stackoverflow.com/questions/48948209/keyerror-when-loading-pickled-scikit-learn-model-using-joblib

My own versions of sklearn and joblib and numpy are:

sklearn.__version__
Out: '0.19.1'

from sklearn.externals import joblib
joblib.__version__
Out: '0.14.1'

import numpy
numpy.__version__
Out: '1.17.4'

I think that probably this code section:

https://github.com/dragnet-org/dragnet/blob/master/dragnet/compat.py#L265

takes care of loading the correct pickle module for the version in use. It doesn't say anything about joblib, though. On my system (Ubuntu/Python 3, sklearn installed with pip), sklearn uses the system-wide joblib version, so

import joblib == import sklearn.externals.joblib

I hope you can help me; maybe I can even contribute a little to the project. Which versions of joblib, sklearn, and numpy should I try? Here is the error:

content = extract_content(doc.summary())
Traceback (most recent call last):

  File "<ipython-input-5-882782be121c>", line 1, in <module>
    content = extract_content(doc.summary())

  File "/home/tom/.local/lib/python3.6/site-packages/dragnet/__init__.py", line 12, in extract_content
    'kohlschuetter_readability_weninger_content_model.pkl.gz')

  File "/home/tom/.local/lib/python3.6/site-packages/dragnet/util.py", line 168, in load_pickled_model
    return joblib.load(filepath)

  File "/home/tom/.local/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 605, in load
    obj = _unpickle(fobj, filename, mmap_mode)

  File "/home/tom/.local/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
    obj = unpickler.load()

  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)

KeyError: 0

updating / moving the training dataset?

Is there any interest in moving the dragnet_data repository from seomoz to dragnet-org (this) GitHub account? It would be nice to have the two repos together and under the same administrative control.

On a related note, is there any interest in updating the training data (and retraining the various models)? The HTML in the current data is quite old at this point, so the trained models don't know how to learn from, say, HTML5's new syntactic features. I'm sure content extraction performance on newer webpages suffers. I don't know what the legal issues are (if any) of compiling a new dataset, but if somebody could advise, I would be interested in taking on some of the work.

Lastly, if we opted to compile a new training dataset from scratch, we wouldn't have to move the old repository and could, instead, just make a new one alongside this.

Plans to support Python3?

Are there any plans to support Python3?

Can not use the latest version of sklearn

python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator ExtraTreeClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)

python3.6/site-packages/dragnet/models.py:51: UserWarning: Unable to unpickle ContentExtractionModel! Your version of scikit-learn (0.18.2) may not be compatible. Setting extractors to None.
UserWarning)

model file size is around 447Kb for new training using same dataset

I am trying to train the same ExtraTreeClassifier on the same dragnet_dataset, but I am getting a very small model pickle file, and prediction accuracy is also very poor compared to the existing model.
Original model file size - ~50MB
New model file size - 447KB

Am I making a mistake in training? Any help is appreciated.

gcc: error: dragnet/_weninger.c: No such file or directory when installing dragnet

I am getting the following error when trying to install it from the master branch

running build_ext
building 'dragnet._weninger' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/lib64/python2.6/site-packages/numpy/core/include -I/usr/include/python2.6 -c dragnet/_weninger.c -o build/temp.linux-x86_64-2.6/dragnet/_weninger.o
gcc: error: dragnet/_weninger.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4

I am using python 2.6
Thanks

Compatibility with scikit-learn > 0.21.3

Hello,

Currently, dragnet is not compatible with scikit-learn > 0.21.3. I did research and composed a table of compatibilities of pickled dragnet models with new sklearn versions.

Trained with version	Compatible with versions	Not compatible from version	Error
1.0.1	1.0.1; 0.24.2	0.23.2	`AttributeError: 'ExtraTreeClassifier' object has no attribute 'n_features_'`
0.23.2	0.23.2; 0.22.1	0.21.3	`ModuleNotFoundError: No module named 'sklearn.ensemble._forest'`

Models were trained with: python 3.9, Cython==0.29.24, numpy==1.21.4, scipy==1.7.2

Is it possible to update the library with new models? I could help with a pull request.

Getting started code snippet is not working

Hi, I've successfully installed dragnet on Python 2 as mentioned in the README, but when I try to run the demo snippet I get this error:

Traceback (most recent call last):
  File "/home/ibrahimsharaf/workspace/content_detection/dragtest.py", line 2, in <module>
    from dragnet import content_extractor, content_comments_extractor
ImportError: cannot import name content_extractor

Any clues?

import error

Using numpy 1.15.0 and the latest dragnet version, I can't import dragnet from python3:

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dragnet
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/local/lib/python3.5/dist-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

Nor from python2:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from dragnet import features
 File "/usr/local/lib/python2.7/dist-packages/dragnet/features/__init__.py", line 4, in <module>
   from dragnet.features.readability import ReadabilityFeatures
 File "/usr/local/lib/python2.7/dist-packages/dragnet/features/readability.py", line 3, in <module>
   from ._readability import make_readability_features
 File "dragnet/features/_readability.pyx", line 3, in init dragnet.features._readability
 File "__init__.pxd", line 1000, in numpy.import_array
ImportError: numpy.core.multiarray failed to import

malformed encoding leads to `BlockifyError`

I'm occasionally getting BlockifyErrors caused by malformed encoding values set here. Here's the tail of the traceback:

Traceback (most recent call last):
    File "dragnet/blocks.pyx", line 846, in dragnet.blocks.Blockifier.blockify
    File "src/lxml/parser.pxi", line 1689, in lxml.etree.HTMLParser.__init__ 
    File "src/lxml/parser.pxi", line 823, in lxml.etree._BaseParser.__init__
    LookupError: unknown encoding: 'b'UTF-8,''

Looks like there's a trailing comma on "UTF-8", plus it's been incorrectly converted into unicode — possibly by calling str(b"UTF-8") instead of b"UTF-8".decode("utf-8").

I wasn't able to track down a relevant bug in blocks.pyx, so maybe this is just messy web data and 🤷‍♂ . Just posting in case somebody knows what's up!
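
One way to guard against labels like this is to normalize the encoding value before handing it to the parser. The sketch below is standalone and does not touch blocks.pyx; the helper name clean_encoding, the set of stripped characters, and the UTF-8 fallback are assumptions made for illustration:

```python
import codecs


def clean_encoding(raw, default="utf-8"):
    """Normalize an encoding label; fall back to `default` if it is unknown."""
    if isinstance(raw, bytes):
        # Decode properly instead of calling str() on the bytes object.
        raw = raw.decode("ascii", errors="ignore")
    label = raw.strip().strip(",;'\"").strip()
    try:
        return codecs.lookup(label).name
    except LookupError:
        return default


print(clean_encoding(b"UTF-8,"))      # utf-8
print(clean_encoding("not-a-codec"))  # utf-8
```

Falling back to a default keeps extraction running on messy pages, at the risk of mis-decoding documents whose real encoding differs.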

Not able to clone the repository

When I followed the instructions given in the description, I got a permission denied error.
Complete message:
C:\Users\devda>git clone git@github.com:seomoz/dragnet.git
Cloning into 'dragnet'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Missing license file

I think it would be a good idea to add a LICENSE file to the root folder of the project so that it's easy to find and GitHub reports the license in the project description.