enstop's Introduction

EnsTop

EnsTop provides an ensemble-based approach to topic modelling using pLSA. It makes use of a high-performance Numba-based pLSA implementation to run multiple bootstrapped topic models in parallel, and then clusters the resulting outputs to determine a set of stable topics. It can then refit the document vectors against these stable topics to embed documents into the stable topic space.

Why use EnsTop?

There are a number of advantages to using an ensemble approach to topic modelling. The most obvious is that it produces better, more stable topics. A close second, however, is that, by making use of HDBSCAN for clustering topics, it can learn a "natural" number of topics. That is, while the user needs to specify an estimated number of topics, the actual number of topics produced will be determined by how many stable topics emerge over many bootstrapped runs. In practice this can be either more or less than the estimated number.

Despite all of these extra features the ensemble topic approach is still very efficient, especially in multi-core environments (due to the embarrassingly parallel nature of the ensemble). A run with a reasonably sized ensemble can be completed in around the same time it might take to fit an LDA model, and usually produces superior quality results.
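
To take advantage of that parallelism you mostly turn two knobs. The following is a minimal sketch, assuming the EnsembleTopics constructor accepts the n_starts (number of bootstrapped runs) and n_jobs (number of parallel workers) parameters shown in use further down this page; the values here are illustrative rather than recommendations.

from enstop import EnsembleTopics

# Sixteen bootstrapped pLSA runs spread over four workers; each run is
# independent of the others, which is what makes the ensemble
# embarrassingly parallel.
model = EnsembleTopics(n_components=20, n_starts=16, n_jobs=4)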

In addition to this, EnsTop comes with a pLSA implementation that can be used standalone (and not as part of an ensemble). So if all you are looking for is a good fast pLSA implementation (one that can run considerably faster than many LDA implementations) then EnsTop is the library for you.

How to use EnsTop

EnsTop follows the sklearn API (and inherits from sklearn base classes), so if you use sklearn for LDA or NMF then you already know how to use EnsTop. General usage is very straightforward. The following example uses EnsTop to model topics from the classic 20-Newsgroups dataset, using sklearn's CountVectorizer to generate the required count matrix.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import EnsembleTopics

news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

model = EnsembleTopics(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_
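
Because the number of stable topics is learned from the data, it is worth checking what the fitted model actually found. The following is a sketch, assuming components_ follows the usual sklearn convention of one row per topic over the vocabulary, and a recent sklearn whose vectorizers expose get_feature_names_out:

import numpy as np

# Keep a handle on the vectorizer so topic indices can be mapped back
# to words (news is the 20-Newsgroups bunch from the example above).
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(news.data)
model = EnsembleTopics(n_components=20).fit(data)

# The number of stable topics found may differ from the estimate of 20.
print(model.components_.shape[0])

# The ten highest-weight words in each topic.
words = np.asarray(vectorizer.get_feature_names_out())
for topic in model.components_:
    print(" ".join(words[np.argsort(topic)[-10:][::-1]]))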

How to use pLSA

EnsTop also provides a simple-to-use but fast and effective pLSA implementation out of the box. As with the ensemble topic modeller it follows the sklearn API, and usage is very similar.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import PLSA

news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

model = PLSA(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_
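
In principle unseen documents can then be embedded into the fitted topic space. The following is a sketch only, assuming the standard sklearn transform convention (one of the issue reports further down describes a transform bug in some versions), and it reuses the fitted vectorizer so the columns of the new count matrix line up with the training vocabulary:

vectorizer = CountVectorizer().fit(news.data)
data = vectorizer.transform(news.data)
model = PLSA(n_components=20).fit(data)

# Embed documents the model has never seen, via the same vectorizer.
new_docs = ["the goalie made thirty saves in the third period"]
new_doc_vectors = model.transform(vectorizer.transform(new_docs))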

Installation

The easiest way to install EnsTop is via pip:

pip install enstop

To manually install this package:

wget https://github.com/lmcinnes/enstop/archive/master.zip
unzip master.zip
rm master.zip
cd enstop-master
python setup.py install
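
If you would rather have pip manage the source install (and resolve dependencies for you), installing directly from the repository should be equivalent to the manual steps above:

pip install git+https://github.com/lmcinnes/enstop.git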

Help and Support

Some basic example notebooks are available in the GitHub repository.

Documentation is coming, but this project is still very young. If you need help or have problems, please open an issue and I will try to provide any help and guidance that I can. Please also check the docstrings in the code, which provide descriptions of the parameters.

License

The EnsTop package is 2-clause BSD licensed.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

enstop's People

Contributors

calvinmccarter-at-tempus, gokceneraslan, lmcinnes


enstop's Issues

How do I extract the common topic of a cluster of texts?

Hi @lmcinnes,
thanks for this nice code here...
I am looking for a solution for the following task:
I have a cluster of small texts and want to extract their common topic.
The "headline" above them, so to say.

Do you have a hint for me on how to solve this, or what/where to read?

Thanks
Philip
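
One possible approach, offered here only as a sketch built from the library usage documented above (not a maintainer answer, and whether PLSA behaves well with a single component is an assumption): fit a tiny topic model on just the cluster's texts and read the top words of its one topic off as a headline.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from enstop import PLSA

# Toy stand-in for the small texts in one cluster.
cluster_texts = [
    "the team won the hockey game in overtime",
    "a late goal decided a tight hockey match",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cluster_texts)

# With one component, the topic-word distribution acts as a weighted
# summary of the whole cluster; its top words serve as a rough headline.
model = PLSA(n_components=1).fit(counts)
words = np.asarray(vectorizer.get_feature_names_out())
print(" ".join(words[np.argsort(model.components_[0])[-5:][::-1]]))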

HDBSCAN error stopping EnsembleTopics

The code from your homepage

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import EnsembleTopics

news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

model = EnsembleTopics(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_

results in an error:
File hdbscan\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()

File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

I have sklearn 1.3.0, Python 3.11.4

FloatingPointError. NMF

I can't run the NMF algorithm. When I run:

%%time
nmf_model = NMF(n_components=20, beta_loss='kullback-leibler', solver='mu').fit(data)

... I see the following error stack:

---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit(self, X, y, **params)
   1310         self
   1311         """
-> 1312         self.fit_transform(X, **params)
   1313         return self
   1314 

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit_transform(self, X, y, W, H)
   1285             l1_ratio=self.l1_ratio, regularization='both',
   1286             random_state=self.random_state, verbose=self.verbose,
-> 1287             shuffle=self.shuffle)
   1288 
   1289         self.reconstruction_err_ = _beta_divergence(X, W, H, self.beta_loss,

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle)
   1067                                                   tol, l1_reg_W, l1_reg_H,
   1068                                                   l2_reg_W, l2_reg_H, update_H,
-> 1069                                                   verbose)
   1070 
   1071     else:

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _fit_multiplicative_update(X, W, H, beta_loss, max_iter, tol, l1_reg_W, l1_reg_H, l2_reg_W, l2_reg_H, update_H, verbose)
    810         if update_H:
    811             delta_H = _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H,
--> 812                                                l2_reg_H, gamma)
    813             H *= delta_H
    814 

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H, l2_reg_H, gamma)
    634     else:
    635         # Numerator
--> 636         WH_safe_X = _special_sparse_dot(W, H, X)
    637         if sp.issparse(X):
    638             WH_safe_X_data = WH_safe_X.data

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _special_sparse_dot(W, H, X)
    178             batch = slice(start, start + batch_size)
    179             dot_vals[batch] = np.multiply(W[ii[batch], :],
--> 180                                           H.T[jj[batch], :]).sum(axis=1)
    181 
    182         WH = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)

FloatingPointError: underflow encountered in multiply

I also have the same error for LatentDirichletAllocation if I choose 448 clusters for 25000 rows:

%%time
lda_model = LatentDirichletAllocation(n_components=448).fit(data_vec)
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in fit(self, X, y)
    566                     # batch update
    567                     self._em_step(X, total_samples=n_samples,
--> 568                                   batch_update=True, parallel=parallel)
    569 
    570                 # check perplexity

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _em_step(self, X, total_samples, batch_update, parallel)
    446         # E-step
    447         _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 448                                      parallel=parallel)
    449 
    450         # M-step

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _e_step(self, X, cal_sstats, random_init, parallel)
    399                                               self.mean_change_tol, cal_sstats,
    400                                               random_state)
--> 401             for idx_slice in gen_even_slices(X.shape[0], n_jobs))
    402 
    403         # merge result

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1001             # remaining jobs.
   1002             self._iterating = False
-> 1003             if self.dispatch_one_batch(iterator):
   1004                 self._iterating = self._original_iterator is not None
   1005 

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    832                 return False
    833             else:
--> 834                 self._dispatch(tasks)
    835                 return True
    836 

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    751         with self._lock:
    752             job_idx = len(self._jobs)
--> 753             job = self._backend.apply_async(batch, callback=cb)
    754             # A job can complete so quickly than its callback is
    755             # called before we get here, causing self._jobs to

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    199     def apply_async(self, func, callback=None):
    200         """Schedule a func to be run"""
--> 201         result = ImmediateResult(func)
    202         if callback:
    203             callback(result)

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    580         # Don't delay the application, to avoid keeping the input
    581         # arguments in memory
--> 582         self.results = batch()
    583 
    584     def get(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _update_doc_distribution(X, exp_topic_word_distr, doc_topic_prior, max_iters, mean_change_tol, cal_sstats, random_state)
    115 
    116             doc_topic_d = (exp_doc_topic_d *
--> 117                            np.dot(cnts / norm_phi, exp_topic_word_d.T))
    118             # Note: adds doc_topic_prior to doc_topic_d, in-place.
    119             _dirichlet_expectation_1d(doc_topic_d, doc_topic_prior,

FloatingPointError: underflow encountered in multiply

Could you please help?
I am using Python 3.7.5 x64. Windows 10.
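
A hedged workaround rather than a root-cause fix: numpy only raises FloatingPointError for underflow when its error state has been switched to "raise" somewhere in the session; by default underflow is silently ignored, and underflow in the NMF/LDA multiplicative updates is normally harmless. Resetting the state often lets these fits run:

import numpy as np

# Restore numpy's default handling of underflow before fitting.
np.seterr(under="ignore")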

AttributeError: module 'dask' has no attribute 'delayed'

When I am running the following code:

ens_model = EnsembleTopics(n_components=20, n_starts=8, n_jobs=2).fit(data_vec)

I get the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit(self, X, y)
    719         self
    720         """
--> 721         self.fit_transform(X)
    722         return self
    723 

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit_transform(self, X, y)
    763             self.alpha,
    764             self.solver,
--> 765             self.random_state,
    766         )
    767         self.components_ = V

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_fit(X, estimated_n_topics, model, init, min_samples, min_cluster_size, n_starts, n_jobs, parallelism, topic_combination, n_iter, n_iter_per_test, tolerance, e_step_thresh, lift_factor, beta_loss, alpha, solver, random_state)
    507         alpha=alpha,
    508         solver=solver,
--> 509         random_state=random_state,
    510     )
    511 

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_of_topics(X, k, model, n_jobs, n_runs, parallelism, **kwargs)
    181 
    182     if parallelism == "dask":
--> 183         dask_topics = dask.delayed(create_topics)
    184         staged_topics = [dask_topics(X, k, **kwargs) for i in range(n_runs)]
    185         topics = dask.compute(*staged_topics, scheduler="threads", num_workers=n_jobs)

AttributeError: module 'dask' has no attribute 'delayed'

data_vec is created as follows:

data_vec = CountVectorizer().fit_transform(data)

I cannot run any version of EnsembleTopics.

Could you please help?
I am using Python 3.7.5 x64. Windows 10.
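
A hedged suggestion, not verified against this exact environment: dask missing its delayed attribute usually points to a minimal or partially installed dask. Reinstalling with the complete set of extras pulls in the dependencies that dask.delayed needs:

pip install --upgrade "dask[complete]"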

Get coherence score for PLSA

PLSA and other methods give strange coherence scores:

PLSA(n_components=3).fit(data_vec).coherence()
PLSA(n_components=4).fit(data_vec).coherence()

n=5, -894.0931521853117
n=4, -846.5056881515624
n=1000, -548.1772075123278

When I use gensim, I get quite a good score:

n_topics  coherence
2         0.4492
3         0.4257
4         0.4308
5         0.4443
6         0.4625
7         0.455
8         0.4791
9         0.4897
10        0.5354
11        0.5165
12        0.5149
13        0.5382
14        0.5546
15        0.5669
16        0.5633
17        0.5323

Could you please tell whether there is a bug?
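
A hedged observation rather than a confirmed answer: large negative values like these are characteristic of UMass-style coherence, which sums log conditional probabilities and is therefore at most zero, whereas the 0-to-1 scores gensim commonly reports come from its c_v measure. The two sit on different scales and cannot be compared directly. To compare like with like, you could score the same topics with gensim's u_mass measure; in the sketch below, topic_word_lists, corpus and dictionary are placeholders you would build from your own data.

from gensim.models.coherencemodel import CoherenceModel

# topic_word_lists: top words per topic; corpus and dictionary are the
# usual gensim bag-of-words corpus and Dictionary for the same data.
cm = CoherenceModel(topics=topic_word_lists, corpus=corpus,
                    dictionary=dictionary, coherence="u_mass")
print(cm.get_coherence())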

Call to plsa_refit fails due to missing sample_weight

When using model.transform() on new unseen data, the following error occurs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-30746b9aeea8> in <module>
      1 test_corpus = df1['cleaned_text'].tolist()
      2 test_dtm = vectorizer.transform(test_corpus)
----> 3 test_doc_vecs = model.transform(test_dtm)
      4 labels = np.argmax(test_doc_vecs, axis=1)
      5 

/opt/conda/lib/python3.7/site-packages/enstop/enstop_.py in transform(self, X, y)
    836             n_iter_per_test=5,
    837             tolerance=0.001,
--> 838             random_state=random_state,
    839         )
    840 

TypeError: plsa_refit() missing 1 required positional argument: 'sample_weight'

There seems to be a missing arg here.

Seems a simple fix - I would be happy to make a PR, but I am not sure how to derive the needed arg:

    sample_weight: array of shape (n_docs,)
        Input document weights.

@lmcinnes, if you can shed some light here, this could be a quick fix!
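
A hedged guess based only on the docstring quoted above: if every document should count equally, the missing argument is just uniform weights, an all-ones array with one entry per document.

import numpy as np

# Hypothetical fix sketch: uniform per-document weights for plsa_refit,
# where X is the count matrix being transformed (one row per document).
sample_weight = np.ones(X.shape[0], dtype=np.float32)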
