HLTA

Provides functions for hierarchical latent tree analysis on text data

Three main steps

  1. Pre-processing
    • Extract text from PDF documents
    • Convert each document to bag-of-words representation
    • Main output: data file (in txt and ARFF formats)
  2. Model building
    • Learn LTM from the bag-of-words data
    • Main output: model file
  3. Post-processing
    • Topic hierarchy extraction (plain HTML)
    • Build a JavaScript topic tree
    • Main output: topic hierarchy (JavaScript)

Prerequisites

First obtain a JAR file for the package in one of the following ways:

  1. Build the project with SBT using sbt-assembly. Rename the generated JAR file to HLTA.jar, which is the name assumed in the steps below.
  2. Download HLTA.jar from the Releases page.

Pre-processing

  1. To extract text from PDF files:
java -cp HLTA.jar tm.pdf.ExtractText papers extracted

Where: papers is the input directory and extracted is the output directory.

  2. To convert text files to bag-of-words representation:
java -cp HLTA.jar tm.pdf.Convert sample 20 3 extracted

Where: sample is the name to give the output files, 20 is the number of words to be included in the resulting data, 3 is the maximum n to be considered for n-grams, and extracted is the directory of extracted text.

After conversion, you can find:

  • sample.arff: count data in ARFF format
  • sample.txt: binary data in the format used for LTM learning
  • sample.dict-2.csv: information on the words after selection, for up to 2-grams
  • sample.whole_dict-2.csv: information on the words before selection, for up to 2-grams
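
To make the difference between the count data and the binary data concrete, here is a minimal Python sketch of a bag-of-words conversion with n-grams. It is an illustration only and not the code behind tm.pdf.Convert: the function names, the whitespace tokenizer, the "-" separator for n-grams, and the rule of selecting the terms that occur in the most documents are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams for n = 1..max_n, joined with '-' (assumed separator)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("-".join(tokens[i:i + n]))
    return grams


def build_bag_of_words(documents, vocab_size, max_n):
    """Return (vocabulary, count rows, binary rows) for a list of text documents."""
    # Per-document term counts over all n-grams up to max_n.
    doc_counts = [Counter(ngrams(doc.lower().split(), max_n)) for doc in documents]

    # Hypothetical selection rule: keep the terms found in the most documents,
    # breaking ties alphabetically so that the result is deterministic.
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked[:vocab_size]]

    # Count data: how often each selected term occurs in each document.
    count_rows = [[counts[term] for term in vocab] for counts in doc_counts]
    # Binary data: whether each selected term occurs at all.
    binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]
    return vocab, count_rows, binary_rows


if __name__ == "__main__":
    docs = ["topic model topic", "latent tree model"]
    vocab, count_rows, binary_rows = build_bag_of_words(docs, vocab_size=5, max_n=2)
    print(vocab)
    print(count_rows)
    print(binary_rows)
```

In this sketch the count rows play the role of the ARFF data and the binary rows the role of the txt data; the actual file layouts are not reproduced here.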

Model Building

To build the model with PEM:

java -Xmx15G -cp HLTA.jar PEM sample.txt sample.txt 50 5 0.01 3 model 10 15

Where: sample.txt is the name of the binary data file (given twice here, as both the training data and the test data), and model is the name of the output model file (the full name will be model.bif).

The full parameter list is: PEM training_data test_data max_EM_steps num_EM_restarts EM_threshold UD_test_threshold model_name max_island max_top. The numerical parameters can be divided into two parts:

  • EM parameters:
    • max_EM_steps: Maximum number of EM steps (e.g. 50).
    • num_EM_restarts: Number of restarts in EM (e.g. 5).
    • EM_threshold: Threshold of improvement to stop EM (e.g. 0.01).
  • Model construction parameters:
    • UD_test_threshold: Threshold used in the unidimensionality test for constructing islands (e.g. 3).
    • max_island: Maximum number of variables in an island (e.g. 10).
    • max_top: Maximum number of variables in top level (e.g. 15).
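
The three EM parameters above interact in a way that a small example can show. The following Python sketch runs EM on a toy mixture of two biased coins, not on a latent tree model, and it is not the PEM implementation. It assumes that EM_threshold is compared against the improvement in log-likelihood between consecutive steps; the source only says "threshold of improvement". All function names are hypothetical.

```python
import math
import random


def log_likelihood(heads, n, pi, a, b):
    """Log-likelihood of head counts under a two-coin mixture (constant terms dropped)."""
    total = 0.0
    for h in heads:
        la = pi * a ** h * (1 - a) ** (n - h)
        lb = (1 - pi) * b ** h * (1 - b) ** (n - h)
        total += math.log(la + lb)
    return total


def em_step(heads, n, pi, a, b):
    """One EM step: compute responsibilities, then re-estimate pi, a and b."""
    resp = []
    for h in heads:
        la = pi * a ** h * (1 - a) ** (n - h)
        lb = (1 - pi) * b ** h * (1 - b) ** (n - h)
        resp.append(la / (la + lb))
    weight_a = sum(resp)
    weight_b = len(heads) - weight_a
    new_pi = weight_a / len(heads)
    new_a = sum(r * h for r, h in zip(resp, heads)) / (weight_a * n)
    new_b = sum((1 - r) * h for r, h in zip(resp, heads)) / (weight_b * n)
    return new_pi, new_a, new_b


def run_em(heads, n, max_em_steps, num_em_restarts, em_threshold, seed=0):
    """Run EM from several random starts and return the best (log-likelihood, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_em_restarts):
        params = (0.5, rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9))
        ll = log_likelihood(heads, n, *params)
        for _ in range(max_em_steps):
            params = em_step(heads, n, *params)
            new_ll = log_likelihood(heads, n, *params)
            improvement = new_ll - ll
            ll = new_ll
            # Assumed stopping rule: stop once a step improves by less than the threshold.
            if improvement < em_threshold:
                break
        if best is None or ll > best[0]:
            best = (ll, params)
    return best


if __name__ == "__main__":
    # Number of heads observed in five trials of 10 flips each (toy data).
    print(run_em([5, 9, 8, 4, 7], n=10, max_em_steps=50,
                 num_em_restarts=5, em_threshold=0.01))
```

A larger number of restarts reduces the chance of ending in a poor local optimum, while a larger step limit or a smaller threshold lets each run converge further at the cost of more computation.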

Post-processing

  1. To extract topic hierarchy:
java -cp HLTA.jar HLTAOutputTopics_html_Ltm model.bif topic_output no no 7

Where: model.bif is the name of the model file from PEM and topic_output is the directory for output files.

  2. To generate topic tree:
java -cp HLTA.jar tm.hlta.RegenerateHTMLTopicTree topic_output/TopicsTable.html sample

Where: topic_output/TopicsTable.html is the name of the topic file from topic extraction and sample is the name of the files to be generated.

The output files include:

  • sample.html: HTML file for the topic tree
  • sample.nodes.js: data for the topic nodes
  • lib: JavaScript and CSS files required by the main HTML file
  • fonts: fonts used by some CSS files
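
The five commands in this document can be assembled in one place, which makes it easier to see how the output of each step feeds the next. The Python sketch below only builds and prints the command lines; it does not run them. The function name and its keyword arguments are hypothetical, the default values are the ones used in the examples above, and the trailing arguments "no no 7" of the topic extraction step are passed through verbatim because this document does not explain them.

```python
import shlex


def build_pipeline_commands(jar="HLTA.jar", pdf_dir="papers", text_dir="extracted",
                            name="sample", vocab_size=20, max_n=3,
                            max_em_steps=50, num_em_restarts=5, em_threshold=0.01,
                            ud_test_threshold=3, model="model",
                            max_island=10, max_top=15,
                            topic_dir="topic_output", heap="15G"):
    """Return the five command lines of the pipeline as argument lists."""
    java = ["java", "-cp", jar]
    data_file = f"{name}.txt"  # binary data written by the conversion step
    return [
        # Pre-processing: PDF files to text, then text to bag-of-words data.
        java + ["tm.pdf.ExtractText", pdf_dir, text_dir],
        java + ["tm.pdf.Convert", name, str(vocab_size), str(max_n), text_dir],
        # Model building: the same file serves as training and test data.
        ["java", f"-Xmx{heap}", "-cp", jar, "PEM", data_file, data_file,
         str(max_em_steps), str(num_em_restarts), str(em_threshold),
         str(ud_test_threshold), model, str(max_island), str(max_top)],
        # Post-processing: topic hierarchy, then the JavaScript topic tree.
        # "no", "no", "7" are copied from the example command as-is.
        java + ["HLTAOutputTopics_html_Ltm", f"{model}.bif", topic_dir, "no", "no", "7"],
        java + ["tm.hlta.RegenerateHTMLTopicTree", f"{topic_dir}/TopicsTable.html", name],
    ]


if __name__ == "__main__":
    for command in build_pipeline_commands():
        print(shlex.join(command))
```

To execute the steps instead of printing them, each argument list could be passed to subprocess.run with check=True, so that a failing step stops the pipeline.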

Enquiry

  • General questions: Leonard Poon (The Education University of Hong Kong)
  • For questions specific to the model building using the PEM algorithm: Peixian Chen (The Hong Kong University of Science and Technology)

Acknowledgement

Contributors: Prof. Nevin L. Zhang, Peixian Chen, Tao Chen, Tengfei Liu, Leonard K.M. Poon, Yi Wang
