#CLTA
This project publishes the source code of two category-correlation-based bilingual topic models, CC-BiLDA and CC-BiBTM, which can be applied to cross-lingual tasks such as cross-lingual taxonomy alignment.

###Requirements:

  1. JDK 1.8.0_111
  2. Maven 3.3.9

###Data you need:

  1. Biterm Documents or Word Documents
  2. Biterm-Category or Document-Category Distribution file

###Biterm Documents content format:
Each line represents a category biterm document, organised as follows:

    <category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese-chinese biterm document>@#@#@<chinese-english biterm document>@#@#@<english-english biterm document>

For example:

    http://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en@#@#@[呼吸 手套,...]@#@#@[呼吸 full,...]@#@#@[cheap sailor,...]

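
The line format above can be read with a plain split on the `@#@#@` separator. The following is a minimal sketch, not code from this project: the class name `BitermDocumentLine` and its field names are hypothetical, and the three biterm fields are kept as raw strings because the exact list syntax inside the brackets is not specified here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the CLTA code base. */
final class BitermDocumentLine {
    static final String SEP = "@#@#@";

    final String categoryUrl;
    final String categoryLabel;
    final String categoryLang;
    final String zhZhBiterms; // raw "[...]" text, left unparsed
    final String zhEnBiterms; // raw "[...]" text, left unparsed
    final String enEnBiterms; // raw "[...]" text, left unparsed

    private BitermDocumentLine(String[] f) {
        categoryUrl = f[0];
        categoryLabel = f[1];
        categoryLang = f[2];
        zhZhBiterms = f[3];
        zhEnBiterms = f[4];
        enEnBiterms = f[5];
    }

    /** Splits one line into its six fields and rejects lines with a different field count. */
    static BitermDocumentLine parse(String line) {
        // Pattern.quote makes the separator literal; limit -1 keeps empty trailing fields.
        String[] fields = line.split(Pattern.quote(SEP), -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException(
                "expected 6 fields but found " + fields.length);
        }
        return new BitermDocumentLine(fields);
    }

    public static void main(String[] args) {
        // Sample based on the README example; the two Chinese biterm fields are
        // abbreviated to "[...]" so that this source file stays ASCII-only.
        String line = "http://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en"
            + "@#@#@[...]@#@#@[...]@#@#@[cheap sailor,...]";
        BitermDocumentLine doc = parse(line);
        System.out.println(doc.categoryLabel + " (" + doc.categoryLang + ")");
    }
}
```

A Word Documents line could be handled the same way by expecting five fields instead of six.
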
###Word Documents content format:
Each line represents a category word document, organised as follows:

    <category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese word document>@#@#@<translated english word document>

For example:

    http://conference_en#c-7081035-6117083@#@#@committee@#@#@en@#@#@[任命, 报告...]@#@#@[elect, person...]

###Biterm-Category Distribution file content format:
Each line represents a biterm-category distribution, organised as follows:

    <word1>@#@#@<word2>@#@#@<lang1_lang2>\t[<category url>@#@#@<category distribution>,...]

For example:

    稿件@#@#@carry@#@#@ZH_EN [http://cmt_cn#c-8430559-8614325@#@#@1.0]
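
A distribution line has two parts: a key before the tab and a bracketed list after it. The sketch below shows one way to read such a line. It is an illustration under stated assumptions, not the project's own reader: `BitermCategoryLine` is a hypothetical name, list entries are assumed to be separated by commas, and category URLs are assumed to contain no commas.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the CLTA code base. */
final class BitermCategoryLine {
    static final String SEP = "@#@#@";

    final String word1;
    final String word2;
    final String langPair;
    final Map<String, Double> categoryWeights; // category url -> weight, in file order

    private BitermCategoryLine(String w1, String w2, String lp, Map<String, Double> weights) {
        word1 = w1;
        word2 = w2;
        langPair = lp;
        categoryWeights = Collections.unmodifiableMap(weights);
    }

    static BitermCategoryLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab between key and distribution list");
        }
        String[] key = line.substring(0, tab).split(Pattern.quote(SEP), -1);
        if (key.length != 3) {
            throw new IllegalArgumentException("expected 3 key fields but found " + key.length);
        }
        String list = line.substring(tab + 1).trim();
        if (list.length() < 2 || !list.startsWith("[") || !list.endsWith("]")) {
            throw new IllegalArgumentException("distribution list must be enclosed in [ ]");
        }
        String inner = list.substring(1, list.length() - 1).trim();
        Map<String, Double> weights = new LinkedHashMap<>();
        if (!inner.isEmpty()) {
            // Assumption: entries are comma-separated and URLs contain no commas.
            for (String entry : inner.split(",")) {
                String e = entry.trim();
                int sep = e.lastIndexOf(SEP);
                if (sep < 0) {
                    throw new IllegalArgumentException("malformed entry: " + e);
                }
                weights.put(e.substring(0, sep),
                            Double.parseDouble(e.substring(sep + SEP.length())));
            }
        }
        return new BitermCategoryLine(key[0], key[1], key[2], weights);
    }

    public static void main(String[] args) {
        // "word1" is an ASCII placeholder standing in for the Chinese word of the README example.
        String line = "word1@#@#@carry@#@#@ZH_EN\t[http://cmt_cn#c-8430559-8614325@#@#@1.0]";
        BitermCategoryLine row = parse(line);
        System.out.println(row.langPair + " " + row.categoryWeights);
    }
}
```

The parser splits each entry at the last `@#@#@`, so the weight is always taken from the end of the entry.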

###Document-Category Distribution file content format:
Each line represents a document-category distribution, organised as follows:

    <document id>@#@#@<document label>@#@#@<document language>\t[<category url>@#@#@<category distribution>, ...]

For example:

    http://cmt_cn#c-1609047-4017692@#@#@合著者@#@#@zh@#@#@ [http://cmt_cn#c-1609047-4017692@#@#@1.0]

###Input file organization:
Suppose the dataset name is 'A'. For the CC-BiLDA method, the Word Documents file and the Document-Category Distribution file are located at:

    corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A
    corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A.(<avg_pi> or <hier_pi>)

For the CC-BiBTM method, the Biterm Documents file and the Biterm-Category Distribution file are located at:

    corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A
    corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A.(<avg_pi> or <hier_pi>)
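
The naming scheme above can be checked with a small path builder. This is only a sketch: `CorpusLayout` and its method names are hypothetical, and it assumes that the distribution file suffix is the literal text `avg_pi` or `hier_pi`, which the listing above does not state outright.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of the CLTA code base. */
final class CorpusLayout {

    /** corpus/<dataset>/exact matching/<methodDir>/<fileName>_<dataset> */
    static Path documentFile(String dataset, String methodDir, String fileName) {
        return Paths.get("corpus", dataset, "exact matching", methodDir,
                         fileName + "_" + dataset);
    }

    /** Same location as the document file, with ".<piSuffix>" appended to the file name. */
    static Path distributionFile(String dataset, String methodDir,
                                 String fileName, String piSuffix) {
        Path doc = documentFile(dataset, methodDir, fileName);
        // Assumption: piSuffix is "avg_pi" or "hier_pi".
        return doc.resolveSibling(doc.getFileName() + "." + piSuffix);
    }

    public static void main(String[] args) {
        System.out.println(documentFile("A", "CC-BiBTM", "Biterms(for BiBTM)"));
        System.out.println(distributionFile("A", "CC-BiBTM", "Biterms(for BiBTM)", "avg_pi"));
    }
}
```

Checking that both paths exist before starting a long training run is a cheap way to catch a misnamed file early.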

###Compile Project:
Before running the project, compile it with Maven:

    mvn assembly:assembly

###Run Project:
After compilation, the jar package 'alignment-1.0-SNAPSHOT.jar' is generated in the target directory.

If you are using this project for the first time, run:

    java -jar target/alignment-1.0-SNAPSHOT.jar -h

This prints the help options:

    usage: Model Run Options
     -alpha <arg>         Hyper Parameter Alpha
     -avg                 Using Average Category Distribution to inference the
                          GibbsSampling.
     -f <arg>             File Name
     -h                   HELP_DESCRIPTION
     -hier                Using Hierarchy Category Distribution to inference
                          the GibbsSampling.
     -iter <arg>          Iteration Number
     -k <arg>             Topic Number
     -m <arg>             Method for training the corpus, one of <CCBiBTM,
                          CCBiLDA>
     -savestep <arg>      Step to Save
     -source_beta <arg>   Source Beta
     -t <arg>             Data Type
     -target_beta <arg>   Target Beta

You can then follow the help options to run the project on your own datasets. For example:

    java -jar target/alignment-1.0-SNAPSHOT.jar -m CCBiBTM -f "Biterms(for BiBTM)" -t "product catalogue" -iter 300 -savestep 100 -k 100

Options that are not specified fall back to their default values.
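
When the jar is launched from another Java program, values such as `Biterms(for BiBTM)` and `product catalogue` contain spaces and parentheses, so it can help to pass each argument as a separate list element instead of building one shell string. The sketch below assembles such a list using only the flags from the help output above; the class `CltaCommand` and its parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the CLTA code base. */
final class CltaCommand {

    /** Builds the argument list for one training run; each value is its own element. */
    static List<String> build(String jar, String method, String fileName, String dataType,
                              int iterations, int saveStep, int topics) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add("-m");
        cmd.add(method);
        cmd.add("-f");
        cmd.add(fileName);   // no shell quoting needed inside a list
        cmd.add("-t");
        cmd.add(dataType);
        cmd.add("-iter");
        cmd.add(Integer.toString(iterations));
        cmd.add("-savestep");
        cmd.add(Integer.toString(saveStep));
        cmd.add("-k");
        cmd.add(Integer.toString(topics));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("target/alignment-1.0-SNAPSHOT.jar", "CCBiBTM",
                                 "Biterms(for BiBTM)", "product catalogue", 300, 100, 100);
        System.out.println(String.join(" ", cmd));
        // To actually launch the run, the list can be handed to ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```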
