TopMost provides a complete lifecycle of topic modeling, including dataset preprocessing, model training, testing, and evaluation. It covers the most popular topic modeling scenarios: basic, dynamic, hierarchical, and cross-lingual topic modeling.
This is our survey paper on neural topic models: A Survey on Neural Topic Models: Methods, Applications, and Challenges.
TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:
| Scenario | Models | Evaluation Metrics | Datasets |
|---|---|---|---|
| Basic Topic Modeling | | TC, TD, Clustering, Classification | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Hierarchical Topic Modeling | HDP, SawETM, HyperMiner, ProGBN | TC over levels, TD over levels, Clustering over levels, Classification over levels | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Dynamic Topic Modeling | DTM, DETM | TC over time slices, TD over time slices, Clustering, Classification | NeurIPS, ACL, NYT |
| Cross-lingual Topic Modeling | NMTM, InfoCTM | TC (CNPMI), TD over languages, Classification (intra- and cross-lingual) | ECNews, Amazon Review, Rakuten |
Install TopMost with pip:

```shell
$ pip install topmost
```
Download a preprocessed dataset from our GitHub repo and load it:

```python
import topmost
from topmost.data import download_dataset

dataset_dir = "./datasets/20NG"
download_dataset('20NG', cache_path='./datasets')

device = "cuda"  # or "cpu"

# load a preprocessed dataset
dataset = topmost.data.BasicDatasetHandler(dataset_dir, device=device, read_labels=True, as_tensor=True)
```
```python
# create a model
model = topmost.models.ETM(vocab_size=dataset.vocab_size, pretrained_WE=dataset.pretrained_WE)
model = model.to(device)

# create a trainer
trainer = topmost.trainers.BasicTrainer(model, dataset)

# train the model
trainer.train()
```
```python
# evaluate
# get theta (doc-topic distributions)
train_theta, test_theta = trainer.export_theta()

# get the top words of topics
top_words = trainer.export_top_words()

# evaluate topic diversity
TD = topmost.evaluations.compute_topic_diversity(top_words, _type="TD")
print(f"TD: {TD:.5f}")
```
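For reference, topic diversity (TD) is commonly defined as the fraction of unique words among the top words of all topics. The following is a minimal sketch of that standard definition, not TopMost's internal implementation, which may differ in details:

```python
# Sketch of the standard topic diversity (TD) metric:
# the fraction of unique words among the top words of all topics.
def topic_diversity(top_words):
    """top_words: list of strings, each the space-joined top words of one topic."""
    all_words = [w for topic in top_words for w in topic.split()]
    return len(set(all_words)) / len(all_words)

# Two topics sharing one word ("orbit") out of six total words:
print(topic_diversity(["space satellite orbit", "windows files orbit"]))  # 5/6 ≈ 0.833
```

A TD of 1.0 means every topic's top words are distinct; values near 0 indicate highly redundant topics.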
```python
# evaluate clustering
results = topmost.evaluations.evaluate_clustering(test_theta, dataset.test_labels)
print(results)
```
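Clustering evaluation treats each document's most probable topic as its cluster assignment and compares clusters against the ground-truth labels. As an illustration, here is a sketch of the standard purity metric (our own toy implementation under that standard definition, not necessarily TopMost's exact code):

```python
import numpy as np
from collections import Counter

def clustering_purity(theta, labels):
    """Purity of the clusters formed by assigning each doc to its argmax topic.

    theta: (num_docs, num_topics) doc-topic distributions.
    labels: (num_docs,) ground-truth class labels.
    """
    clusters = np.asarray(theta).argmax(axis=1)
    labels = np.asarray(labels)
    total = 0
    for c in np.unique(clusters):
        # count the most frequent true label inside this cluster
        total += Counter(labels[clusters == c]).most_common(1)[0][1]
    return total / len(labels)

theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
labels = np.array([0, 0, 1, 1])
print(clustering_purity(theta, labels))  # 1.0: clusters align perfectly with labels
```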
```python
# evaluate classification
results = topmost.evaluations.evaluate_classification(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
print(results)
```
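Classification evaluation trains a classifier on the doc-topic distributions and measures test accuracy. As a rough illustration of the idea, here is a toy nearest-centroid classifier in doc-topic space (a stand-in we wrote for this sketch, not the classifier TopMost uses internally):

```python
import numpy as np

def nearest_centroid_accuracy(train_theta, test_theta, train_labels, test_labels):
    """Toy theta-based classification: assign each test doc the class
    whose training centroid is nearest in doc-topic space."""
    train_theta = np.asarray(train_theta)
    test_theta = np.asarray(test_theta)
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    # one centroid per class: the mean theta of its training docs
    centroids = np.stack([train_theta[train_labels == c].mean(axis=0) for c in classes])
    # distance of every test doc to every centroid
    dists = np.linalg.norm(test_theta[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return (preds == np.asarray(test_labels)).mean()

train_theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
test_theta = np.array([[0.85, 0.15], [0.15, 0.85]])
print(nearest_centroid_accuracy(train_theta, test_theta, train_labels, [0, 1]))  # 1.0
```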
```python
# test new documents
import torch
from topmost.preprocessing import Preprocessing

new_docs = [
    "This is a new document about space, including words like space, satellite, launch, orbit.",
    "This is a new document about Microsoft Windows, including words like windows, files, dos."
]

preprocessing = Preprocessing()
parsed_new_docs, new_bow = preprocessing.parse(new_docs, vocab=dataset.vocab)
new_theta = trainer.test(torch.as_tensor(new_bow, device=device).float())
```
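Each row of `new_theta` is a distribution over topics, so the most probable topic per document can be read off with an argmax. A self-contained toy illustration (assuming the output is array-like of shape `(num_docs, num_topics)`):

```python
import numpy as np

# Toy doc-topic distributions for two documents over four topics.
new_theta = np.array([
    [0.05, 0.80, 0.10, 0.05],  # document 1: mostly topic 1
    [0.70, 0.10, 0.10, 0.10],  # document 2: mostly topic 0
])

# Most probable topic per document:
top_topics = new_theta.argmax(axis=1)
print(top_topics)  # [1 0]

# Probability mass each document assigns to its top topic:
print(new_theta[np.arange(len(new_theta)), top_topics])  # [0.8 0.7]
```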
To install TopMost, run this command in your terminal:

```shell
$ pip install topmost
```

This is the preferred method to install TopMost, as it will always install the most recent stable release.

Alternatively, the sources for TopMost can be downloaded from the GitHub repository. Clone the public repository with:

```shell
$ git clone https://github.com/BobXWu/TopMost.git
```

Then install TopMost with:

```shell
$ python setup.py install
```
We provide tutorials for different usage scenarios.
- Original implementations may use different optimizer settings. For simplicity and brevity, our package by default uses the same settings across models.
This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.
- Icon by Flat-icons-com.
- If you want to add models to this package, we welcome your pull requests.
- If you encounter any problems, please contact Xiaobao Wu directly or open an issue in the GitHub repo.