BoilerNet

This is the implementation of our paper Boilerplate Removal Using a Neural Sequence Labeling Model.

Web Content Extraction

BoilerNet is now integrated into the SoBigData platform! Use your own or a pre-trained model to extract text from HTML pages or annotate them directly. Available in the SoBigData Method Engine.

Usage

This section explains how to train and evaluate your own model. The datasets are available for download here:

Requirements

This code is tested with Python 3.7.5 and the following package versions:

  • tensorflow==2.1.0
  • numpy==1.17.3
  • tqdm==4.39.0
  • nltk==3.4.5
  • beautifulsoup4==4.8.1
  • html5lib==1.0.1
  • scikit-learn==0.21.3

Preprocessing

usage: preprocess.py [-h] [-s SPLIT_DIR] [-w NUM_WORDS] [-t NUM_TAGS]
                     [--save SAVE]
                     DIRS [DIRS ...]

positional arguments:
  DIRS                  A list of directories containing the HTML files

optional arguments:
  -h, --help            show this help message and exit
  -s SPLIT_DIR, --split_dir SPLIT_DIR
                        Directory that contains train-/dev-/testset split
  -w NUM_WORDS, --num_words NUM_WORDS
                        Only use the top-k words
  -t NUM_TAGS, --num_tags NUM_TAGS
                        Only use the top-l HTML tags
  --save SAVE           Where to save the results

After downloading and extracting one of the zip files above, preprocess your dataset, for example:

python3 net/preprocess.py googletrends-2017/prepared_html/ -s googletrends-2017/50-30-100-split/ -w 1000 -t 50 --save googletrends_data

Training

The training script handles both training and evaluation on the dev and test sets:

usage: train.py [-h] [-l NUM_LAYERS] [-u HIDDEN_UNITS] [-d DROPOUT]
                [-s DENSE_SIZE] [-e EPOCHS] [-b BATCH_SIZE]
                [--interval INTERVAL] [--working_dir WORKING_DIR]
                DATA_DIR

positional arguments:
  DATA_DIR              Directory of files produced by the preprocessing
                        script

optional arguments:
  -h, --help            show this help message and exit
  -l NUM_LAYERS, --num_layers NUM_LAYERS
                        The number of RNN layers
  -u HIDDEN_UNITS, --hidden_units HIDDEN_UNITS
                        The number of hidden LSTM units
  -d DROPOUT, --dropout DROPOUT
                        The dropout percentage
  -s DENSE_SIZE, --dense_size DENSE_SIZE
                        Size of the dense layer
  -e EPOCHS, --epochs EPOCHS
                        The number of epochs
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        The batch size
  --interval INTERVAL   Calculate metrics and save the model after this many
                        epochs
  --working_dir WORKING_DIR
                        Where to save checkpoints and logs

For example, the model can be trained like this:

python3 net/train.py googletrends_data --working_dir googletrends_train

Hyperparameters

In order to reproduce the paper results, use the following hyperparameters:

  • -s googletrends-2017/50-30-100-split -w 1000 -t 50 (preprocessing)
  • -l 2 -u 256 -d 0.5 -s 256 -e 50 -b 16 --interval 1 (training)
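As a rough sketch, the two flag sets above can be combined with the example invocations from the Preprocessing and Training sections. The dataset paths, the googletrends_data output directory, and the googletrends_train working directory are taken from those earlier examples; the helper name as_command is purely illustrative and not part of this repository.

```python
import shlex

# Flags copied from the hyperparameter list above; paths follow the
# googletrends-2017 examples earlier in this README.
PREPROCESS_ARGS = [
    "python3", "net/preprocess.py", "googletrends-2017/prepared_html/",
    "-s", "googletrends-2017/50-30-100-split",
    "-w", "1000", "-t", "50",
    "--save", "googletrends_data",
]
TRAIN_ARGS = [
    "python3", "net/train.py", "googletrends_data",
    "-l", "2", "-u", "256", "-d", "0.5", "-s", "256",
    "-e", "50", "-b", "16", "--interval", "1",
    "--working_dir", "googletrends_train",
]


def as_command(args):
    """Join an argument list into a single shell-safe command line (illustrative helper)."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Print the two commands so they can be reviewed or pasted into a shell.
    print(as_command(PREPROCESS_ARGS))
    print(as_command(TRAIN_ARGS))
```

To run the steps from Python instead of printing them, each list can be passed to subprocess.run(args, check=True) from the repository root.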

Select the checkpoint with the highest F1 score (average over both values) on the validation set.
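The selection rule can be sketched as follows, assuming two F1 values have been collected per saved epoch from the validation metrics. The dictionary layout, the function name best_checkpoint, and the numbers are hypothetical; how the scores are read from the training logs is not shown here.

```python
from statistics import mean


def best_checkpoint(f1_scores):
    """Return the epoch whose mean validation F1 is highest.

    f1_scores maps an epoch number to a pair of F1 values (hypothetical layout).
    Ties go to the earliest epoch.
    """
    if not f1_scores:
        raise ValueError("no checkpoints to choose from")
    # max() keeps the first maximum it sees, so iterate epochs in ascending order.
    return max(sorted(f1_scores), key=lambda epoch: mean(f1_scores[epoch]))


# Made-up numbers for illustration only, not results from the paper.
example = {1: (0.80, 0.90), 2: (0.86, 0.88), 3: (0.84, 0.89)}

if __name__ == "__main__":
    print(best_checkpoint(example))
```

With --interval 1, as in the training flags above, every epoch produces a checkpoint, so the dictionary would hold one entry per epoch.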

Citation

@inproceedings{10.1145/3366424.3383547,
  author = {Leonhardt, Jurek and Anand, Avishek and Khosla, Megha},
  title = {Boilerplate Removal Using a Neural Sequence Labeling Model},
  year = {2020},
  isbn = {9781450370240},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3366424.3383547},
  doi = {10.1145/3366424.3383547},
  booktitle = {Companion Proceedings of the Web Conference 2020},
  pages = {226–229},
  numpages = {4},
  location = {Taipei, Taiwan},
  series = {WWW ’20}
}
