web-entity-extractor-acl2014's Introduction

Web Entity Extractor

This repository contains a toolkit for extracting entities from a given search query and web page.

Requirements

Running the code requires:

  • Java 7
  • Ruby 1.8.7 or 1.9
  • Python 2.7
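Before compiling, it can help to confirm that the three interpreters listed above are on the PATH. The following is a minimal sketch, not part of the toolkit: `check_tool` and `check_requirements` are hypothetical helper names, and the sketch assumes the executables are called `java`, `ruby`, and `python` (on some systems Python 2.7 is installed as `python2` instead). It only checks for presence, not for the exact versions.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the toolkit).

# check_tool NAME: print whether NAME is found on PATH; return 1 if missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# check_requirements: check every interpreter named in the requirements list.
# Assumes the executables are called java, ruby, and python.
check_requirements() {
    status=0
    for tool in java ruby python; do
        check_tool "$tool" || status=1
    done
    return "$status"
}
```

After sourcing the snippet, call `check_requirements`, then confirm the versions by hand with `java -version`, `ruby --version`, and `python --version`.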

Other required libraries and resources can be downloaded using the following commands:

  • ./download-dependencies core: download required Java libraries
  • ./download-dependencies ling: download linguistic resources
  • ./download-dependencies dataset_debug: download a small dataset for testing the installation
  • ./download-dependencies dataset_openweb: download the OpenWeb dataset, which contains diverse queries and web pages
  • ./download-dependencies model: download a model trained on the training data of the OpenWeb dataset
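The targets above can be fetched in one go with a small loop. This is only a sketch: `download_targets` is a hypothetical helper, and the `DOWNLOADER` variable is an assumption introduced here so the loop can be dry-run with another command (for example `echo`). By default it calls the `./download-dependencies` script exactly as listed above and stops at the first target that fails.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the toolkit).
# download_targets TARGET...: run the downloader once per target, in order.
# DOWNLOADER is an assumed override; it defaults to ./download-dependencies.
download_targets() {
    downloader="${DOWNLOADER:-./download-dependencies}"
    for target in "$@"; do
        echo "==> $target"
        "$downloader" "$target" || {
            echo "failed: $target" >&2
            return 1
        }
    done
}
```

For example, `download_targets core ling dataset_debug` fetches everything needed to compile and to run the debug test.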

Compiling

Run the following commands to download necessary libraries and compile:

./download-dependencies core
./download-dependencies ling
make

Testing

To train and test on the debug dataset (30 examples) using the default features, run

./download-dependencies dataset_debug
./web-entity-extractor @mode=main @data=debug @feat=default

For the OpenWeb dataset, make sure the system has enough RAM (about 40 GB is recommended), then run

./download-dependencies dataset_openweb
./web-entity-extractor @memsize=high @mode=main @data=dev @feat=default -numThreads 0 -fold 3
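Since the OpenWeb run is memory-hungry, a quick check of the installed RAM can save a failed run. The sketch below assumes a Linux system whose `/proc/meminfo` has a `MemTotal:` line in kB; `total_ram_gb` and `has_enough_ram` are hypothetical helper names, and the threshold of 40 simply mirrors the recommendation above (the helper counts in GiB, so treat the comparison as approximate).

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the toolkit). Linux only.

# total_ram_gb [FILE]: print total memory in whole GiB.
# FILE defaults to /proc/meminfo and must contain "MemTotal: <n> kB".
total_ram_gb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# has_enough_ram REQUIRED_GB [FILE]: succeed if total memory >= REQUIRED_GB.
has_enough_ram() {
    required_gb="$1"
    meminfo="${2:-/proc/meminfo}"
    [ "$(total_ram_gb "$meminfo")" -ge "$required_gb" ]
}
```

A possible use is `has_enough_ram 40 || echo "less than the recommended RAM" >&2` before starting the OpenWeb command.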

Alternatively, run the pre-trained model on the dataset using

./download-dependencies model
./download-dependencies dataset_openweb
# Test on the training data
./web-entity-extractor @memsize=high @mode=load -loadModel models/openweb-devset @data=dev -numThreads 0
# Test on the test data
./web-entity-extractor @memsize=high @mode=load -loadModel models/openweb-devset @data=test -numThreads 0

The flag -numThreads 0 uses all available CPUs, while -fold 3 runs the system on 3 random splits of the dataset. The first run may take a long time because the system caches all linguistic data.

Interactive Mode

The interactive mode allows the user to apply the trained model to any query and web page.

To use the interactive mode, first train and save a model by adding -saveModel [MODELNAME] to one of the commands above, and then run

./web-entity-extractor @mode=interactive -loadModel [MODELNAME]
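As a concrete illustration of the two steps, the sketch below adds -saveModel to the debug-dataset command from the Testing section and then loads the saved model interactively. The function names `train_and_save` and `run_interactive` and the `EXTRACTOR` override are assumptions made for this example; the model name passed in is whatever you choose for [MODELNAME].

```shell
#!/bin/sh
# Hypothetical wrappers (not shipped with the toolkit).
# EXTRACTOR is an assumed override; it defaults to the launcher used above.
EXTRACTOR="${EXTRACTOR:-./web-entity-extractor}"

# train_and_save MODELNAME: train on the debug dataset and save the model.
train_and_save() {
    "$EXTRACTOR" @mode=main @data=debug @feat=default -saveModel "$1"
}

# run_interactive MODELNAME: start the interactive mode with a saved model.
run_interactive() {
    "$EXTRACTOR" @mode=interactive -loadModel "$1"
}
```

With a model name of your choice, for instance the hypothetical `my-debug-model`, run `train_and_save my-debug-model && run_interactive my-debug-model`.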

License

The code is under the GNU General Public License (v2). See the LICENSE file for the full license.

Contributors

ppasupat
