Git Product home page Git Product logo

corpusprocessing-hadoop-multir's Introduction

CorpusProcessing-Hadoop-Multir

This is a collection of Hadoop jobs I used to process a Corpus of ~1M documents for input to Multir training.

###/parsing This component works by taking in a set of tokenized sentences in the input format below

Input Format:

SentenceID \t Token1 Token2 Token3 ......\n
1 \t This is an example sentence . \n
2 \t This is another example sentence . \n

It runs the CharnianJohnson parser https://github.com/BLLIP/bllip-parser on a hadoop cluster via hadoop streaming http://hadoop.apache.org/docs/r1.1.2/streaming.html

To use the CharniakJohnson parser with Hadoop Streaming you must put the directory in a jar and put that jar on the HDFS. Then when you run the hadoop-streaming app specify its location with the -archives option, this will extract the contained objects into a directory with the same name as the original jar.

/parsing/cjmapper.py is the mapper that calls the CJ executable and handles the I/O /parsing/run.sh describes how to write the hadoop-streaming command to run the application

###/pre-parse-processing

The pre-parse processing takes in the following information for each sentence:

  • Document Name
  • Tokens
  • Token Offsetes
  • Sentence
  • Sentence offsets

The Input is in the following format:

SentID \t DocumentName \t String of Tokens \t String Of Token Offsets \t Sentence Offsets \t Sentence String
1 \t AFPXXX.txt \t Today 's example of a token string \t 0:5 5:7 8:15 16:18 19:20 21:26 27:33 \t 100 133 \t Today's example of a token string

It outputs the following information for each sentence:

  • Part Of Speech Tags
  • Named Entity Recognition Tags
  • Lemmatized token tags

To run this module you should have the following jars:

  • hadoop-core-0.20.2-cdh3u1.jar
  • stanford-corenlp-1.3.5.jar
  • stanford-corenlp-1.3.5-models.jar
  • joda-time-2.1.jar

From the /pre-parse-processing directory follow these instructions to run the hadoop job

mkdir classFolder

javac -source 1.6 -target 1.6 -classpath /path/to/stanford-corenlp-1.3.5.jar:/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar:/path/to/stanford-corenlp-1.3.5-models.jar:/path/to/joda-time-2.1.jar -d classFolder src/hadoop/PreParseProcessor.java
jar -cvf preparseprocessor.jar -C classFolder/ .
adoop jar preparseprocessor.jar hadoop.PreParseProcessor -libjars /path/to/stanford-corenlp-1.3.5.jar,/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar,/path/to/stanford-corenlp-1.3.5-models.jar,/path/to/joda-time-2.1.jar /hdfs/path/to/input /hdfs/path/to/output 100(map tasks) 12(reduce tasks)

###/post-parse-processing

The post-parse processing takes in the following information for each sentence:

  • Parse String

The input is in the following format:

SentenceID \t parseString
0 \t (S1 (NP (NP (NNP BEIJING)) (, ,) (NP (NNP Oct.) (CD 28)) (PRN (-LRB- -LRB-) (NP (NNP Xinhua)) (-RRB- -RRB-))))

It outputs the following information for each setnence:

  • Dependency Parse

To run this module you should have the following jars:

  • hadoop-core-0.20.2-cdh3u1.jar
  • stanford-corenlp-1.3.5.jar

From the /post-parse-processing directory follow these instructions to run the hadoop job

mkdir classFolder

javac -source 1.6 -target 1.6 -classpath /path/to/stanford-corenlp-1.3.5.jar:/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar -d classFolder src/hadoop/PostParseProcessor.java
jar -cvf postparseprocessor.jar -C classFolder/ .
hadoop jar postparseprocessor.jar hadoop.PostParseProcessor -libjars /path/to/stanford-corenlp-1.3.5.jar,/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar /hdfs/path/to/input /hdfs/path/to/output 300(map taskts)

corpusprocessing-hadoop-multir's People

Contributors

jgilme1 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.