Provides functions for hierarchical latent tree analysis on text data
- Pre-processing
  - Extract text from PDF documents
  - Convert each document to bag-of-words representation
  - Main output: data file (in txt and ARFF formats)
- Model building
  - Learn LTM from the bag-of-words data
  - Main output: model file
- Post-processing
  - Topic hierarchy extraction (plain HTML)
  - Build a JavaScript topic tree
  - Main output: topic hierarchy (JavaScript)
You should first obtain a JAR file from the package in either of the following ways:
- Build the SBT project and run sbt-assembly. Rename the generated JAR file to HLTA.jar, which we assume in the steps below.
- Download HLTA.jar from the Releases page.
- To extract text from PDF files:

      java -cp HLTA.jar tm.pdf.ExtractText papers extracted

  Where papers is the input directory and extracted is the output directory.
- To convert text files to bag-of-words representation:

      java -cp HLTA.jar tm.pdf.Convert sample 20 3 extracted

  Where, in argument order: sample is the name given to the output files, 20 is the number of words to be included in the resulting data, 3 is the maximum n to be considered for n-grams, and extracted is the directory of extracted text.
After conversion, you can find:
- sample.arff: count data in ARFF format
- sample.txt: binary data in the format for LTM
- sample.dict-2.csv: information on words after selection, for up to 2-grams
- sample.whole_dict-2.csv: information on words before selection, for up to 2-grams
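To make the difference between the count data and the binary data concrete, here is a toy sketch of a bag-of-words conversion with n-grams. It is not the implementation behind tm.pdf.Convert: the whitespace tokenizer, the "-" joiner for n-grams, and the rule of keeping the most frequent terms are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return every n-gram of `tokens` for n = 1..max_n, joined with '-'."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("-".join(tokens[i:i + n]))
    return grams


def bag_of_words(docs, vocab_size, max_n):
    """Build count rows and binary rows over a selected vocabulary.

    Assumption: terms are selected by total frequency (ties broken
    alphabetically). The real tool may use a different selection rule.
    """
    per_doc = [Counter(ngrams(doc.lower().split(), max_n)) for doc in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked[:vocab_size]]
    # Count data: how often each selected term occurs in each document.
    count_rows = [[counts[term] for term in vocab] for counts in per_doc]
    # Binary data: whether each selected term occurs at all.
    binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]
    return vocab, count_rows, binary_rows


docs = ["latent tree model", "latent tree latent tree", "topic model"]
vocab, count_rows, binary_rows = bag_of_words(docs, vocab_size=4, max_n=2)
print(vocab)
print(count_rows)
print(binary_rows)
```

In this sketch the count rows play the role of the ARFF data and the binary rows play the role of the data passed on to model building.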
To build the model with PEM:

    java -Xmx15G -cp HLTA.jar PEM sample.txt sample.txt 50 5 0.01 3 model 10 15

Where sample.txt is the name of the binary data file (used here as both training and test data) and model is the name of the output model file (the full name will be model.bif).
The full parameter list is:

    PEM training_data test_data max_EM_steps num_EM_restarts EM_threshold UD_test_threshold model_name max_island max_top

The numerical parameters can be divided into two groups:
- EM parameters:
  - max_EM_steps: Maximum number of EM steps (e.g. 50).
  - num_EM_restarts: Number of restarts in EM (e.g. 5).
  - EM_threshold: Threshold of improvement to stop EM (e.g. 0.01).
- Model construction parameters:
  - UD_test_threshold: The threshold used in the unidimensionality test for constructing islands (e.g. 3).
  - max_island: Maximum number of variables in an island (e.g. 10).
  - max_top: Maximum number of variables in the top level (e.g. 15).
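Because PEM takes nine positional arguments, it is easy to put a value in the wrong slot. The sketch below shows one way to assemble the command from named parameters. The helper pem_command and its keyword names are hypothetical and not part of HLTA; it only arranges the documented arguments in the documented order, with defaults taken from the example values above.

```python
def pem_command(training_data, test_data, model_name, *,
                max_em_steps=50, num_em_restarts=5, em_threshold=0.01,
                ud_test_threshold=3, max_island=10, max_top=15,
                jar="HLTA.jar", heap="15G"):
    """Return the PEM invocation as an argument list (hypothetical helper)."""
    return [
        "java", f"-Xmx{heap}", "-cp", jar, "PEM",
        training_data, test_data,
        # EM parameters
        str(max_em_steps), str(num_em_restarts), str(em_threshold),
        # Model construction parameters; model_name sits between them
        str(ud_test_threshold), model_name, str(max_island), str(max_top),
    ]


print(" ".join(pem_command("sample.txt", "sample.txt", "model")))
```

The returned list can be passed to subprocess.run(..., check=True) once HLTA.jar is available in the working directory.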
- To extract topic hierarchy:

      java -cp HLTA.jar HLTAOutputTopics_html_Ltm model.bif topic_output no no 7

  Where model.bif is the name of the model file from PEM and topic_output is the directory for output files.
- To generate topic tree:

      java -cp HLTA.jar tm.hlta.RegenerateHTMLTopicTree topic_output/TopicsTable.html sample

  Where topic_output/TopicsTable.html is the name of the topic file from topic extraction and sample is the base name of the files to be generated.
The output files include:
- sample.html: HTML file for the topic tree
- sample.nodes.js: data for the topic nodes
- lib: JavaScript and CSS files required by the main HTML file
- fonts: fonts used by some CSS files
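The five commands above depend on each other's outputs, so it can help to keep them together in one script. The following is a minimal sketch of such a driver, assuming HLTA.jar is in the working directory. The function names are hypothetical, the argument values are copied from the examples above, and the trailing "no no 7" arguments of the topic extraction step are reproduced verbatim without interpretation. By default it only prints the commands (a dry run).

```python
import subprocess


def pipeline_commands(name="sample", pdf_dir="papers", text_dir="extracted",
                      model="model", topic_dir="topic_output", jar="HLTA.jar"):
    """Return the five documented commands, in order, as argument lists."""
    java = ["java", "-cp", jar]
    data = f"{name}.txt"  # binary data file produced by the conversion step
    return [
        java + ["tm.pdf.ExtractText", pdf_dir, text_dir],
        java + ["tm.pdf.Convert", name, "20", "3", text_dir],
        ["java", "-Xmx15G", "-cp", jar, "PEM", data, data,
         "50", "5", "0.01", "3", model, "10", "15"],
        java + ["HLTAOutputTopics_html_Ltm", f"{model}.bif", topic_dir,
                "no", "no", "7"],
        java + ["tm.hlta.RegenerateHTMLTopicTree",
                f"{topic_dir}/TopicsTable.html", name],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


run_pipeline(pipeline_commands())
```

Calling run_pipeline(pipeline_commands(), dry_run=False) would execute the steps in sequence and stop at the first one that fails.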
- Multidimensional Text Clustering for Hierarchical Topic Detection (IJCAI 2016 Tutorial) by Nevin L. Zhang and Leonard K.M. Poon
- Peixian Chen et al. (AAAI 2016) Progressive EM for Latent Tree Models and Hierarchical Topic Detection [Longer version]
- General questions: Leonard Poon (The Education University of Hong Kong)
- For questions specific to model building with the PEM algorithm: Peixian Chen (The Hong Kong University of Science and Technology)
Contributors: Prof. Nevin L. Zhang, Peixian Chen, Tao Chen, Tengfei Liu, Leonard K.M. Poon, Yi Wang