#CLTA
This project publishes the source code of the category-correlation-based bilingual topic models CC-BiLDA and CC-BiBTM, which can be applied to cross-lingual tasks such as cross-lingual taxonomy alignment.
###Requirements:
- JDK 1.8.0_111
- Maven 3.3.9
###Data you need:
- Biterm Documents or Word Documents
- Biterm-Category or Document-Category Distribution file
###Biterm Documents content format:
Each line represents one category biterm document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese-chinese biterm document>@#@#@<chinese-english biterm document>@#@#@<english-english biterm document>
For example:
http://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en@#@#@[呼吸 手套,...]@#@#@[呼吸 full,...]@#@#@[cheap sailor,...]
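
As an illustration of the format above, here is a minimal sketch of how one such line could be split into its six fields. The class and method names (`BitermDocumentLine`, `parse`, `parseBiterms`) are hypothetical and are not taken from this project's source. The sketch also assumes that biterms inside the brackets are separated by commas and that the two words of a biterm are separated by a single space, as the example line suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this project: parses one Biterm Documents line.
final class BitermDocumentLine {
    static final String SEP = "@#@#@";

    final String categoryUrl;
    final String categoryLabel;
    final String categoryLang;
    final List<String[]> zhZhBiterms; // chinese-chinese biterms
    final List<String[]> zhEnBiterms; // chinese-english biterms
    final List<String[]> enEnBiterms; // english-english biterms

    private BitermDocumentLine(String[] f) {
        categoryUrl = f[0];
        categoryLabel = f[1];
        categoryLang = f[2];
        zhZhBiterms = parseBiterms(f[3]);
        zhEnBiterms = parseBiterms(f[4]);
        enEnBiterms = parseBiterms(f[5]);
    }

    // Splits the line on the literal separator and checks the field count.
    static BitermDocumentLine parse(String line) {
        String[] fields = line.split(Pattern.quote(SEP), -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + fields.length);
        }
        return new BitermDocumentLine(fields);
    }

    // Assumption: "[w1 w2, w3 w4]" with comma-separated biterms and space-separated words.
    static List<String[]> parseBiterms(String field) {
        String body = field.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<String[]> biterms = new ArrayList<>();
        for (String item : body.split(",")) {
            String[] words = item.trim().split("\\s+");
            if (words.length == 2) { // entries that are not word pairs are skipped
                biterms.add(words);
            }
        }
        return biterms;
    }
}
```

Calling `BitermDocumentLine.parse(line)` on each line of the file would give access to the category fields and the three biterm lists. Entries that do not consist of exactly two words are silently dropped in this sketch, and a stricter reader might prefer to reject them.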
###Word Documents content format:
Each line represents one category word document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese word document>@#@#@<translated english word document>
For example:
http://conference_en#c-7081035-6117083@#@#@committee@#@#@en@#@#@[任命, 报告...]@#@#@[elect, person...]
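
The following sketch shows one way a whole Word Documents source could be read line by line. It is only an illustration: `WordDocumentReader`, `WordDocument`, `readAll`, and `words` are hypothetical names, and the sketch assumes that words inside the brackets are separated by commas and that blank lines can be skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this project: reads Word Documents lines.
final class WordDocumentReader {
    private static final Pattern SEP = Pattern.compile(Pattern.quote("@#@#@"));

    static final class WordDocument {
        final String url;
        final String label;
        final String lang;
        final List<String> chineseWords;
        final List<String> englishWords; // translated english word document

        WordDocument(String url, String label, String lang,
                     List<String> chineseWords, List<String> englishWords) {
            this.url = url;
            this.label = label;
            this.lang = lang;
            this.chineseWords = chineseWords;
            this.englishWords = englishWords;
        }
    }

    // Reads every non-blank line and requires exactly five fields per line.
    static List<WordDocument> readAll(Reader source) {
        List<WordDocument> docs = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] f = SEP.split(line, -1);
                if (f.length != 5) {
                    throw new IllegalArgumentException(
                            "line " + lineNo + ": expected 5 fields, got " + f.length);
                }
                docs.add(new WordDocument(f[0], f[1], f[2], words(f[3]), words(f[4])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    // Assumption: "[w1, w2, w3]" with comma-separated words.
    static List<String> words(String field) {
        String body = field.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<String> out = new ArrayList<>();
        for (String w : body.split(",")) {
            String t = w.trim();
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

When reading from disk, the `Reader` should be opened with the encoding the corpus was written in (UTF-8 is a reasonable guess for mixed Chinese and English text, but that is an assumption).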
###Biterm-Category Distribution file content format:
Each line represents one biterm-category distribution, organised as follows:
<word1>@#@#@<word2>@#@#@<lang1_lang2>\t[<category url>@#@#@<category distribution>,...]
For example:
稿件@#@#@carry@#@#@ZH_EN [http://cmt_cn#c-8430559-8614325@#@#@1.0]
###Document-Category Distribution file content format:
Each line represents one document-category distribution, organised as follows:
<document id>@#@#@<document label>@#@#@<document language>\t[<category url>@#@#@<category distribution>, ...]
For example:
http://cmt_cn#c-1609047-4017692@#@#@合著者@#@#@zh@#@#@ [http://cmt_cn#c-1609047-4017692@#@#@1.0]
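
Both distribution files share the same shape: a key, a tab, and a bracketed list of `<category url>@#@#@<category distribution>` entries. The sketch below parses such a line under a few assumptions: the key and the list are separated by a tab character as stated in the format, entries are separated by commas, and category URLs contain no commas. The names `CategoryDistributionLine` and `parse` are hypothetical and do not come from this project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this project: parses one distribution line.
final class CategoryDistributionLine {
    static final String SEP = "@#@#@";

    final String[] key;                     // e.g. {word1, word2, lang1_lang2}
    final Map<String, Double> distribution; // category url -> weight, in file order

    private CategoryDistributionLine(String[] key, Map<String, Double> distribution) {
        this.key = key;
        this.distribution = distribution;
    }

    static CategoryDistributionLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab between key and distribution");
        }
        // split() drops trailing empty strings, so a key that ends with the
        // separator still yields only its non-empty parts.
        String[] key = line.substring(0, tab).split(Pattern.quote(SEP));

        String body = line.substring(tab + 1).trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("distribution must be enclosed in [ ]");
        }
        body = body.substring(1, body.length() - 1);

        Map<String, Double> dist = new LinkedHashMap<>();
        for (String entry : body.split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue;
            }
            // The weight follows the last separator in the entry.
            int sep = e.lastIndexOf(SEP);
            if (sep < 0) {
                throw new IllegalArgumentException("malformed entry: " + e);
            }
            String url = e.substring(0, sep);
            double weight = Double.parseDouble(e.substring(sep + SEP.length()));
            dist.put(url, weight);
        }
        return new CategoryDistributionLine(key, dist);
    }
}
```

A `LinkedHashMap` is used so that categories keep the order in which they appear in the file, which makes the parsed result easier to compare against the input when debugging.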
###Input file organization:
Suppose the dataset name is 'A'. For the CC-BiLDA method, the Word Documents file and the Document-Category Distribution file are located at:
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A.(<avg_pi> or <hier_pi>)
For the CC-BiBTM method, the Biterm Documents file and the Biterm-Category Distribution file are located at:
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A.(<avg_pi> or <hier_pi>)
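
To make the naming pattern above concrete, here is a small sketch that builds the two paths for a given dataset. `CorpusPaths` and its methods are hypothetical helpers, not code from this project. The `piSuffix` argument stands for whichever extension `<avg_pi>` or `<hier_pi>` resolves to in your setup; the literal value is an assumption and is left to the caller.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of this project: builds input file locations.
final class CorpusPaths {

    // method is "CC-BiLDA" or "CC-BiBTM"; fileName is e.g. "TextPairs(for BiLDA)".
    static Path documents(String dataset, String method, String fileName) {
        return Paths.get("corpus", dataset, "exact matching", method,
                fileName + "_" + dataset);
    }

    // The distribution file sits next to the documents file, with an extra suffix.
    static Path distribution(String dataset, String method, String fileName,
                             String piSuffix) {
        Path docs = documents(dataset, method, fileName);
        return docs.resolveSibling(docs.getFileName() + "." + piSuffix);
    }
}
```

For dataset 'A', `CorpusPaths.documents("A", "CC-BiBTM", "Biterms(for BiBTM)")` resolves to the Biterm Documents location listed above, using the path separator of the current platform.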
###Compile Project:
Before running the project, compile it with Maven:
mvn assembly:assembly
###Run Project:
After compilation, the jar package is generated in the target directory as 'alignment-1.0-SNAPSHOT.jar'.
If this is your first time using the project, run:
java -jar target\alignment-1.0-SNAPSHOT.jar -h
This prints the help options:
usage: Model Run Options
-alpha <arg> Hyper Parameter Alpha
-avg Using Average Category Distribution to inference the
GibbsSampling.
-f <arg> File Name
-h HELP_DESCRIPTION
-hier Using Hierarchy Category Distribution to inference
the GibbsSampling.
-iter <arg> Iteration Number
-k <arg> Topic Number
-m <arg> Method for training the corpus, one of <CCBiBTM,
CCBiLDA>
-savestep <arg> Step to Save
-source_beta <arg> Source Beta
-t <arg> Data Type
-target_beta <arg> Target Beta
You can then follow the help options to run the project on your own datasets. For example:
java -jar target/alignment-1.0-SNAPSHOT.jar -m CCBiBTM -f "Biterms(for BiBTM)" -t "product catalogue" -iter 300 -savestep 100 -k 100
Options that are not specified fall back to their default values.