
cluster-preprocessing

This README explains the pre-processing performed to create the cluster lexicons that are used as features in the IXA pipes tools [http://ixa2.si.ehu.es/ixa-pipes]. So far we use the following three methods: Brown, Clark and Word2vec.

TABLE OF CONTENTS

  1. Overview
  2. Brown clusters
  3. Clark clusters
  4. Word2vec clusters
  5. XML/HTML cleaning

OVERVIEW

We induce three types of word clusters: Brown, Clark and Word2vec (K-means clusters over word embeddings). The pre-processing steps for each type are described below.

Brown

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. Then the following steps are performed:

Preclean corpus

This step is performed by using the following function in ixa-pipe-convert:

java -jar ixa-pipe-convert-$version.jar --brownClean corpus-directory/

ixa-pipe-convert will create a .clean file for each file contained in the folder corpus-directory.

  • Move all .clean files into a new directory called, for example, corpus-preclean.
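A quick way to do this from the shell (corpus-preclean is just the example name used in the following steps):

# collect every .clean file produced by ixa-pipe-convert
mkdir corpus-preclean
find corpus-directory/ -name '*.clean' -exec mv {} corpus-preclean/ \;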

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./recursive-tok.sh $lang corpus-preclean

The tokenized version of each file in the directory corpus-preclean will be saved with a .tok suffix.
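The actual recursive-tok.sh ships with this repository; for reference, a minimal sketch of what such a script might do is shown below. The jar name and flags are assumptions, not the script's verified contents, so check the ixa-pipe-tok documentation for the exact invocation of your version.

#!/bin/sh
# Hypothetical sketch of recursive-tok.sh: tokenize every file under the
# given directory to one sentence per line, writing file.tok next to each
# input. Set $version to your ixa-pipe-tok version; the tokenizer flags
# below are assumptions to be checked against the ixa-pipe-tok README.
lang=$1
dir=$2
find "$dir" -type f ! -name '*.tok' | while read -r f; do
  java -jar ixa-pipe-tok-"$version".jar tok -l "$lang" -o oneline \
    < "$f" > "$f.tok"
done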

  • cat to one large file: concatenate all the tokenized files into a single large file called corpus-preclean.tok.
cd corpus-preclean
cat *.tok > corpus-preclean.tok

Format the corpus for Liang's implementation

  • Run the brown-clusters-preprocess.sh script like this to create the format required to induce Brown clusters using Percy Liang's program.
./brown-clusters-preprocess.sh corpus-preclean.tok > corpus-preclean.tok.punct
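Liang's wcluster expects plain whitespace-separated tokens, one sentence per line, so corpus-preclean.tok.punct should look roughly like this (illustrative lines, not real output of the script):

the brown fox jumped over the fence .
it landed on the other side .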

Induce Brown clusters:

brown-cluster/wcluster --text corpus-preclean.tok.punct --c 1000 --threads 8

This induces 1000 Brown clusters using 8 threads in parallel.
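wcluster writes its results to a directory named after the input and its settings (typically corpus-preclean.tok.punct-c1000-p1.out for this invocation); the paths file inside it is the cluster lexicon, one word per line as a tab-separated bit-string path, word and frequency, e.g. (illustrative values):

01101110	government	34214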

Clark

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. Then the following steps are performed:

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./recursive-tok.sh $lang corpus-directory

The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.

  • cat to one large file: concatenate all the tokenized files into a single large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok

Format the corpus

  • Run the clark-clusters-preprocess.sh script like this to create the format required to induce Clark clusters using Clark's implementation.
./clark-clusters-preprocess.sh corpus.tok > corpus.tok.punct.lower
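The .punct.lower suffix suggests that, on top of the tokenization, the script lowercases the corpus for Clark's tool; a minimal stand-in for that step (the real script may do more) would be:

# lowercase the tokenized corpus (approximation; prefer the real script;
# note that many tr implementations only lowercase ASCII characters)
tr '[:upper:]' '[:lower:]' < corpus.tok > corpus.tok.punct.lower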

Train Clark clusters:

To train 100 word clusters use the following command line:

cluster_neyessenmorph -s 5 -m 5 -i 10 corpus.tok.punct.lower - 100 > corpus.tok.punct.lower.100

Word2vec

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. Then the following steps are performed:

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./recursive-tok.sh $lang corpus-directory

The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.

  • cat to one large file: concatenate all the tokenized files into a single large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok

Format the corpus

  • Run the word2vec-clusters-preprocess.sh script like this to create the format required by Word2vec.
./word2vec-clusters-preprocess.sh corpus.tok > corpus-word2vec.txt

Train K-means clusters on top of word2vec word embeddings

To train 400-class clusters using 8 threads in parallel, we use the following command:

word2vec/word2vec -train corpus-word2vec.txt -output corpus-s50-w5.400 -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 8 -classes 400
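With -classes set, word2vec runs K-means over the learned embeddings and writes one word and its class id per line instead of the raw vectors, so corpus-s50-w5.400 will look roughly like this (illustrative ids):

the 17
house 285
in 17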

Cleaning XML, HTML and other formats

There are many ways of cleaning XML, HTML and other metadata that often comes with corpora. As we will usually be processing very large amounts of text, we do not pay too much attention to detail and crudely remove every tag using regular expressions. In the scripts directory there is a Python script that takes either a file or a directory as its argument:

python xml_clean_dir.py corpus-directory/

NOTE that this script will replace your original files with a cleaned version of them.
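The core of this crude approach is just deleting everything between angle brackets; a rough one-line shell equivalent (assuming tags never span lines) is:

# strip every <...> tag, keeping a .bak copy since this edits in place
sed -i.bak 's/<[^>]*>//g' somefile.txt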

Wikipedia

If you are interested in using Wikipedia for your language, here you can find many Wikipedia dumps already extracted to XML, which can be fed directly to the xml_clean_dir.py script:

[http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/]

If your language is not among them, we usually use the Wikipedia Extractor and then the xml_clean_dir.py script:

[http://medialab.di.unipi.it/wiki/Wikipedia_Extractor]

Contact information

Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
[email protected]
