TopMost

TopMost covers the complete lifecycle of topic modeling: dataset preprocessing, model training, testing, and evaluation. It supports the most popular topic modeling scenarios, including basic, dynamic, hierarchical, and cross-lingual topic modeling.

Overview

TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:

| Scenario | Models | Evaluation Metrics | Datasets |
| --- | --- | --- | --- |
| Basic Topic Modeling | | TC, TD, Clustering, Classification | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Hierarchical Topic Modeling | HDP, SawETM, HyperMiner, ProGBN | TC over levels, TD over levels, Clustering over levels, Classification over levels | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Dynamic Topic Modeling | DTM, DETM | TC over time slices, TD over time slices, Clustering, Classification | NeurIPS, ACL, NYT |
| Cross-lingual Topic Modeling | NMTM, InfoCTM | TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual) | ECNews, Amazon Review, Rakuten |

Quick Start

Install

Install TopMost with pip:

$ pip install topmost

Download a preprocessed dataset

Download a preprocessed dataset from our GitHub repo:

import topmost
from topmost.data import download_dataset

dataset_dir = "./datasets/20NG"
download_dataset('20NG', cache_path='./datasets')

Train a model

device = "cuda" # or "cpu"

# load a preprocessed dataset
dataset = topmost.data.BasicDatasetHandler(dataset_dir, device=device, read_labels=True, as_tensor=True)
# create a model
model = topmost.models.ETM(vocab_size=dataset.vocab_size, pretrained_WE=dataset.pretrained_WE)
model = model.to(device)

# create a trainer
trainer = topmost.trainers.BasicTrainer(model, dataset)

# train the model
trainer.train()

Evaluate

# evaluate
# get theta (doc-topic distributions)
train_theta, test_theta = trainer.export_theta()
# get top words of topics
top_words = trainer.export_top_words()

# evaluate topic diversity
TD = topmost.evaluations.compute_topic_diversity(top_words, _type="TD")
print(f"TD: {TD:.5f}")

# evaluate clustering
results = topmost.evaluations.evaluate_clustering(test_theta, dataset.test_labels)
print(results)

# evaluate classification
results = topmost.evaluations.evaluate_classification(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
print(results)
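To make the reported numbers easier to interpret, here is a minimal pure-Python sketch of two of the quantities involved: topic diversity, taken here as the fraction of unique words among the top words of all topics, and clustering purity, where each document is assigned to its highest-probability topic. The function names and toy data are hypothetical and only illustrate the definitions; this is not TopMost's implementation, which may define or compute these metrics differently.

```python
from collections import Counter


def topic_diversity(top_words):
    """Fraction of unique words among the top words of all topics.

    `top_words` is a list with one space-separated string per topic.
    """
    words = [w for topic in top_words for w in topic.split()]
    return len(set(words)) / len(words)


def purity(theta, labels):
    """Clustering purity when each document is assigned to its top topic."""
    # Cluster id = index of the largest entry in each doc-topic row.
    clusters = [max(range(len(row)), key=row.__getitem__) for row in theta]
    # Count how often each label occurs inside each cluster.
    per_cluster = {}
    for cluster, label in zip(clusters, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # Sum the majority-label counts and normalise by the number of documents.
    majority = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return majority / len(labels)


# Hypothetical toy data: 3 topics with 3 top words each, 4 documents.
toy_top_words = ["space orbit launch", "windows files dos", "space files disk"]
toy_theta = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
toy_labels = [0, 1, 1, 0]

td_value = topic_diversity(toy_top_words)
purity_value = purity(toy_theta, toy_labels)
print(f"TD: {td_value:.5f}, purity: {purity_value:.2f}")
```

In the toy data, 7 of the 9 top words are unique, and 3 of the 4 documents carry the majority label of their cluster.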

Test new documents (Optional)

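# test new documents
import torch
from topmost.preprocessing import Preprocessing

new_docs = [
    "This is a new document about space, including words like space, satellite, launch, orbit.",
    "This is a new document about Microsoft Windows, including words like windows, files, dos."
]

# create a preprocessor (assumed here to be topmost.preprocessing.Preprocessing with default settings)
preprocessing = Preprocessing()
# parse the new documents into bag-of-words vectors over the training vocabulary
parsed_new_docs, new_bow = preprocessing.parse(new_docs, vocab=dataset.vocab)
# infer doc-topic distributions with the trained model
new_theta = trainer.test(torch.as_tensor(new_bow, device=device).float())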
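The parsing step maps new documents onto the vocabulary the model was trained with, so that each document becomes a fixed-length count vector. The sketch below illustrates that idea in plain Python. The `bag_of_words` helper, its simple tokenizer, and the toy vocabulary are hypothetical; TopMost's own preprocessing likely applies further steps that this sketch omits.

```python
import re
from collections import Counter


def bag_of_words(docs, vocab):
    """Count in-vocabulary words per document; out-of-vocabulary words are dropped."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        # Lowercase and keep alphabetic tokens only (a deliberately simple tokenizer).
        counts = Counter(t for t in re.findall(r"[a-z]+", doc.lower()) if t in index)
        row = [0] * len(vocab)
        for word, n in counts.items():
            row[index[word]] = n
        rows.append(row)
    return rows


# Hypothetical toy vocabulary; a real one comes from the training dataset.
toy_vocab = ["dos", "files", "orbit", "space", "windows"]
toy_docs = [
    "A new document about space, including words like space, satellite, orbit.",
    "A new document about Microsoft Windows, including words like windows, files, dos.",
]
toy_bow = bag_of_words(toy_docs, toy_vocab)
print(toy_bow)
```

Words outside the vocabulary, such as "satellite" here, contribute nothing to the vector, which is why new documents must be parsed with the same vocabulary used for training.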

Installation

Stable release

To install TopMost, run this command in your terminal:

$ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.

From sources

The sources for TopMost can be downloaded from the GitHub repository. Clone the public repository:

$ git clone https://github.com/BobXWu/TopMost.git

Then install TopMost:

$ python setup.py install
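Whichever installation route is used, it can be useful to confirm that the package is visible to the current Python environment. The following sketch uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution name `topmost` is assumed to match the name used with pip above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("topmost")
print(f"topmost version: {found}" if found else "topmost is not installed")
```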

Tutorials

We provide tutorials for different usages:

  • How to preprocess datasets
  • How to train and evaluate a basic topic model
  • How to train and evaluate a hierarchical topic model
  • How to train and evaluate a dynamic topic model
  • How to train and evaluate a cross-lingual topic model

Notice

Differences from original implementations

  1. Original implementations may use different optimizer settings. For simplicity and brevity, our package uses the same settings for all models by default.

Disclaimer

This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.

Contributors

Xiaobao Wu

Fengjun Pan

Acknowledgments

  • Icon by Flat-icons-com.
  • If you want to add any models to this package, we welcome your pull requests.
  • If you encounter any problem, please either directly contact Xiaobao Wu or leave an issue in the GitHub repo.
