CohEEL

A library for the automatic detection and disambiguation of knowledge base entity mentions in texts.

Execution

Programs are run via the bin/run script. Every program requires a --configuration parameter, which names a file under src/main/resources. This file sets the required properties, such as the job manager, HDFS, and the paths to certain files.

Spread Wikipedia data dump to HDFS

bin/spread-wikidump.sh

Run preprocessing and classification scripts

# preprocessing: extract main data like surfaces, links, redirects, language models, etc.
bin/run --configuration cluster_tenem --program extract-main

# extract probability that a surface is linked at all
bin/prepare-surface-link-probs-program.sh
bin/run --configuration cluster_tenem --program surface-link-probs

# create training data
bin/prepare-tries.sh
# ... then upload the tries manually to the locations specified in the configuration
bin/run --configuration cluster_tenem --program training-program
# training
mvn scala:run -Dlauncher=MachineLearningTestSuite

# classification
bin/run --configuration cluster_tenem --program classification --parallelism 10

AWS EMR Setup

To set up CohEEL on Amazon Elastic MapReduce (EMR), a working installation of the AWS Command Line Interface is required. Run aws configure to configure the local installation. You also have to set your EC2 key pair name [keyname] and the path to your private key file [pemfile]:

aws configure set emr.key_name [keyname]
aws configure set emr.key_pair_file [pemfile]

The following command starts a cluster (named "coheel") with 20 worker instances of type m1.large. The instance count of 21 covers one master node plus the 20 workers:

# create a new cluster
aws emr create-cluster --name "coheel" \
    --release-label emr-4.2.0 \
    --use-default-roles \
    --applications Name=Hadoop Name=Ganglia \
    --instance-count 21 \
    --instance-type m1.large \
    --configurations '[{ "Classification": "yarn-site", "Properties": { "yarn.nodemanager.resource.cpu-vcores": "1", "yarn.nodemanager.resource.memory-mb": "5120" } }]' \
    --bootstrap-action Name="installFlink",Path="s3://coheel-conf/install-flink-0.10.1.sh"

# wait until the cluster is running and get the name of the master node by executing
aws emr describe-cluster --cluster-id [ClusterId] | grep MasterPublicDnsName | cut -d\" -f4
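The wait-and-lookup step above, and the SSH connection that follows it, can be sketched as two small shell functions. This is a minimal sketch, not part of CohEEL: the function names coheel_master_dns and coheel_ssh_master are hypothetical. It assumes an AWS CLI version that provides the `aws emr wait cluster-running` waiter and the global `--query` and `--output` options.

```shell
# Hypothetical helpers (names are assumptions, not part of CohEEL).

# Block until the cluster is running, then print the master's public DNS name.
# $1 = cluster id ([ClusterId])
coheel_master_dns() {
    aws emr wait cluster-running --cluster-id "$1" || return 1
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.MasterPublicDnsName' --output text
}

# Open an SSH session on the master node as user hadoop.
# $1 = master public DNS name, $2 = path to the private key file ([pemfile])
coheel_ssh_master() {
    ssh -i "$2" "hadoop@$1"
}
```

Usage would look like `coheel_ssh_master "$(coheel_master_dns [ClusterId])" [pemfile]`. Using `--query` instead of grep and cut avoids depending on the exact layout of the JSON output.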

Connect to the master node via ssh (user hadoop and identity file [pemfile]) and install some required or useful dependencies (Maven, Git, jq, tmux):

# install some dependencies (Maven, Git, jq, tmux)
sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo && sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo && sudo yum install -y apache-maven
sudo wget http://stedolan.github.io/jq/download/linux64/jq -O /usr/local/sbin/jq ; sudo chmod go+x /usr/local/sbin/jq
sudo yum install tmux git

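# start an Apache Flink YARN session on the EMR cluster (using 20 workers)
# -n: number of YARN containers (TaskManagers), -s: slots per TaskManager,
# -jm / -tm: JobManager / TaskManager container memory in MB
yarn-session.sh -n 20 -s 1 -jm 768 -tm 4096 -Dfs.overwrite-files=true -Dtaskmanager.memory.fraction=0.5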

To download and set up CohEEL, run:

git clone https://github.com/stratosphere/coheel.git
cd coheel
# automatically retrieve the current cluster setup
source bin/load-aws-config.sh

Run a CohEEL program as usual (see the Execution section), choosing the cluster_aws configuration:

bin/run --configuration cluster_aws --program [...] --parallelism 20 ; coheel_message "CohEEL job finished!"

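The coheel_message function sends an AWS SNS notification with some details once the program has terminated. Because the two commands are separated by ;, the notification is sent whether or not the job succeeded.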
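If the notification should distinguish success from failure, the call can be wrapped as in the following sketch. The wrapper name coheel_run_and_notify is hypothetical. The sketch assumes that coheel_message is already defined in the current shell (for example after sourcing bin/load-aws-config.sh), that it takes the message text as its first argument, and that bin/run exits with a non-zero status when the job fails.

```shell
# Hypothetical wrapper (name is an assumption, not part of CohEEL).
# Assumes coheel_message is defined in this shell and accepts the message
# text as its first argument. Must be called from the repository root.
# $1 = program name, e.g. classification
coheel_run_and_notify() {
    if bin/run --configuration cluster_aws --program "$1" --parallelism 20; then
        coheel_message "CohEEL job '$1' finished successfully"
    else
        status=$?   # exit status of bin/run
        coheel_message "CohEEL job '$1' failed with exit code $status"
        return "$status"
    fi
}
```

For example, `coheel_run_and_notify classification` runs the classification program and reports the outcome in the notification text.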
