maartengr / bertopic


Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

License: MIT License

Python 99.93% Makefile 0.07%

bert transformers topic-modeling sentence-embeddings nlp machine-learning topic ldavis topic-modelling topic-models

bertopic's Issues

Hyperparameter tuning

Hi,
Nice work on the package. I had a question.
The model is very sensitive to its parameters. I am trying to build a pipeline that finds the best parameters automatically, using outlier_count as my metric: the lower the number of outliers, the better the model.

Is this the right approach?

Thanks!
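
One way to make the metric concrete is sketched below. It assumes, as elsewhere in these issues, that outliers carry the topic label -1; `outlier_count` and `pick_best` are hypothetical helpers written for this illustration, not part of BERTopic, and the topic assignments are made up. A model that puts every document into a single topic would also score zero outliers, so the count is probably best read alongside some other measure.

```python
from typing import Dict, Hashable, List, Tuple


def outlier_count(topics: List[int]) -> int:
    """Count documents assigned to the outlier topic (assumed label: -1)."""
    return sum(1 for topic in topics if topic == -1)


def pick_best(results: Dict[Hashable, List[int]]) -> Tuple[Hashable, int]:
    """Return the parameter setting with the fewest outliers.

    `results` maps a (hypothetical) parameter setting to the topic labels
    that a fitted model produced for the same documents.
    """
    scored = {params: outlier_count(topics) for params, topics in results.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


# Made-up topic assignments for two settings of min_topic_size.
results = {
    ("min_topic_size", 10): [0, 1, -1, 1, 0, -1],
    ("min_topic_size", 30): [0, 0, -1, 0, 0, 0],
}
best_params, best_score = pick_best(results)
```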

Issue when using n_gram_range other than (1,1)

Hi, really nice work with this package, it's very useful.

Model initialization takes the argument n_gram_range, but I think that it doesn't get used. Should line 241 referenced here be
count = CountVectorizer(ngram_range=n_gram_range, stop_words="english").fit(documents)?

BERTopic/bertopic/model.py

Line 241 in 9f7dca1

count = CountVectorizer(stop_words="english").fit(documents)

It might be nice to have the stop_words argument be configurable at initialization as well, so that the user could pass a corpus-specific set of stop words.
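
To illustrate what the unused argument would change, here is a plain-Python sketch of how an n-gram range expands a list of tokens. `extract_ngrams` is a hypothetical helper written for this illustration; it is not how CountVectorizer is implemented, only a picture of the extra terms a range such as (1, 2) would add to the vocabulary.

```python
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n_gram_range: Tuple[int, int]) -> List[str]:
    """Return all word n-grams with n between the two bounds, inclusive."""
    low, high = n_gram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = ["topic", "modeling", "rocks"]
unigrams = extract_ngrams(tokens, (1, 1))
uni_and_bigrams = extract_ngrams(tokens, (1, 2))
```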

No attribute 'self.verbose'

On running new_topics, new_probabilities = model.transform(new_doc)

c:\tools\anaconda3\envs\autotag\lib\site-packages\bertopic\_bertopic.py in transform(self, documents, embeddings)
    349         if not isinstance(embeddings, np.ndarray):
    350             self.embedding_model = self._select_embedding_model()
--> 351             embeddings = self._extract_embeddings(documents, verbose=self.verbose)
    352 
    353         umap_embeddings = self.umap_model.transform(embeddings)

AttributeError: 'BERTopic' object has no attribute 'verbose'

Seems to be introduced in 0.5 - the issue wasn't present on 0.4.3.

Data Input (vs. LDA & NMF)

Hey, this is an awesome project. My question is how exactly you pre-process long texts. I notice that the data you demo with are all short texts (most of them one sentence each). I tried to imitate that by segmenting my texts into sentences (splitting long texts on periods and semicolons, and validating their lengths) while removing all punctuation, but I get only 1 cluster. Any suggestion would be much appreciated. I understand how transformer models differ from topic models like LDA and NMF, but do you think it's possible for BERT and transformer models to do something similar, that is, take several long text files as input and simply generate a model without a limit on text length? Thank you.
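
The segmentation step described above could look roughly like the sketch below. The splitting characters come from the description; the `min_chars` threshold and the function name `split_into_sentences` are assumptions made for illustration, and a proper sentence tokenizer would likely handle abbreviations and decimals better.

```python
import re
from typing import List


def split_into_sentences(text: str, min_chars: int = 20) -> List[str]:
    """Split on periods and semicolons, dropping fragments that are too short."""
    parts = re.split(r"[.;]", text)
    cleaned = (part.strip() for part in parts)
    return [part for part in cleaned if len(part) >= min_chars]


long_text = (
    "Topic models group documents by theme; they are widely used in research. "
    "Short. Transformer embeddings capture sentence meaning well."
)
sentences = split_into_sentences(long_text)
```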

Making embedding computation more scalable

Hello!

2020-10-31 14:35:53,446 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model
2020-10-31 15:29:37,627 - BERTopic - Transformed documents to Embeddings

It currently takes about an hour to compute embeddings for 20,000 documents in the 20 Newsgroups loaded with:

docs = fetch_20newsgroups(subset='all')['data']

To scale this better, one way is to use the bert-as-service with multiple workers. Have you thought of a possibility to make embedding computation pluggable?

issue due to the _plotly_topic_visualization() method

There is an issue related to the _plotly_topic_visualization() method.
In Python 3.x, d.keys() returns a view object (not a list), so passing a dictionary to the hover_data parameter when creating the figure causes an error.
Here is the code:

# Plotting topics
        fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
                         hover_data={"x": False, "y": False, "Topic": True, "Words": True, "Size": True})

To solve this problem, simply wrap the dictionary in list():

# Plotting topics
        fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
                         hover_data=list({"x": False, "y": False, "Topic": True, "Words": True, "Size": True}))
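
Note that calling list() on a dictionary keeps every key, including the ones mapped to False. If the intent is to show only the flagged columns, a variant along these lines may be closer. This is a plain-Python sketch that does not call plotly, and the variable names are only illustrative; whether a list is needed at all depends on the installed plotly version.

```python
# Column flags as in the issue: True means "show in the hover box".
hover_flags = {"x": False, "y": False, "Topic": True, "Words": True, "Size": True}

# list(dict) yields all keys, regardless of their True/False value.
all_columns = list(hover_flags)

# Keeping only the columns flagged True preserves the original intent.
shown_columns = [column for column, show in hover_flags.items() if show]
```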

inconsistency about outlier class in reduce_topics()

Hey, thanks for this great work.

reduce_topics() mixes up the topic with the largest id and the outlier class, which has class id -1. That happens because of the _map_probabilities() method. When the outlier topic is selected as the from_topic or to_topic variable, the method modifies the last element of the probabilities array, which is not the outlier class's probability, since the outlier class has no entry in the probabilities array.
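
The mechanism can be reproduced in a few lines of plain Python, assuming the probabilities are stored in an array indexed by topic id. The values are made up, and `safe_lookup` is a hypothetical guard, not a function from the library.

```python
# Probabilities for topics 0, 1 and 2; the outlier class (-1) has no entry.
probabilities = [0.1, 0.2, 0.7]

outlier_topic = -1

# Using the outlier id as an index silently selects the LAST topic instead.
picked = probabilities[outlier_topic]


def safe_lookup(probs, topic):
    """Return None for the outlier class instead of indexing with -1."""
    return None if topic == -1 else probs[topic]
```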

Predict multiple topics per document

Would it be feasible to return the probabilities for all of the topics rather than only returning the best topic? This would be similar to LDA, where typically proportions or probabilities are returned for all topics.

I think this could be done by changing transform() to use membership_vector() instead of approximate_predict().

Allow passing all keyword arguments of CountVectorizer to BERTopic constructor

Hi
I think it would be great if we could pass all existing keyword arguments of CountVectorizer to BERTopic, not only n_gram_range and stop_words as is the case today.
Some of them, such as max_df, min_df, strip_accents or even tokenizer, can be of great help when fine-tuning a model.

It could be done by changing the signature of the __init__ method from

    def __init__(self,
                 bert_model: str = 'distilbert-base-nli-mean-tokens',
                 top_n_words: int = 20,
                 nr_topics: int = None,
                 n_gram_range: Tuple[int, int] = (1, 1),
                 min_topic_size: int = 30,
                 n_neighbors: int = 15,
                 n_components: int = 5,
                 stop_words: Union[str, List[str]] = None,
                 verbose: bool = False)

to

    def __init__(self,
                 bert_model: str = 'distilbert-base-nli-mean-tokens',
                 top_n_words: int = 20,
                 nr_topics: int = None,
                 min_topic_size: int = 30,
                 n_neighbors: int = 15,
                 n_components: int = 5,
                 verbose: bool = False,
                 **kwargs)

Then storing the kwargs dictionary as a class attribute self.kwargs
And then in _c_tf_idf

count = CountVectorizer(**self.kwargs).fit(documents)

I can even provide a PR if you want

Thx again for this great package

Olivier Terrier
@kairntech

ModuleNotFoundError when pip installing bertopic in venv

I am getting ModuleNotFoundError: No module named 'bertopic' while the output of pip install bertopic is as follows:

Requirement already satisfied: bertopic in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (0.3.4)
Requirement already satisfied: matplotlib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (3.3.3)
Requirement already satisfied: pandas in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.1.5)
Requirement already satisfied: scikit-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.23.2)
Requirement already satisfied: tqdm in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (4.54.1)
Requirement already satisfied: hdbscan in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.8.26)
Requirement already satisfied: numpy in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.19.4)
Requirement already satisfied: sentence-transformers in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.3.9)
Requirement already satisfied: joblib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.0.0)
Requirement already satisfied: umap-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.4.6)
Requirement already satisfied: torch in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.7.1)
Requirement already satisfied: python-dateutil>=2.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.4.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (8.0.1)
Requirement already satisfied: pytz>=2017.2 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from pandas->bertopic) (2020.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (1.5.4)
Requirement already satisfied: six in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (1.14.0)
Requirement already satisfied: cython>=0.27 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (0.29.21)
Requirement already satisfied: nltk in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5)
Requirement already satisfied: transformers<3.6.0,>=3.1.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5.1)
Requirement already satisfied: numba!=0.47,>=0.46 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from umap-learn->bertopic) (0.52.0)
Requirement already satisfied: typing-extensions in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from torch->bertopic) (3.7.4.3)
Requirement already satisfied: click in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (7.1.2)
Requirement already satisfied: regex in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (2020.11.13)
Requirement already satisfied: requests in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (2.22.0)
Requirement already satisfied: protobuf in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.14.0)
Requirement already satisfied: sacremoses in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.0.43)
Requirement already satisfied: packaging in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (20.3)
Requirement already satisfied: sentencepiece==0.1.91 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.1.91)
Requirement already satisfied: tokenizers==0.9.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.9.3)
Requirement already satisfied: filelock in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.0.12)
Requirement already satisfied: llvmlite<0.36,>=0.35.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (0.35.0)
Requirement already satisfied: setuptools in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (44.0.0)
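
Since pip reports the package as installed, one thing worth checking is whether the interpreter that runs the script is the same one pip installed into. This is an assumption about the cause, not a confirmed diagnosis. A small standard-library check, with `is_importable` as a hypothetical helper name:

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter actually executing this script; compare it with the
# venv path shown in the pip output.
print("Interpreter:", sys.executable)
print("bertopic importable:", is_importable("bertopic"))
```

If the printed interpreter is not the one inside the venv, installing with `python -m pip install bertopic` from that same interpreter keeps the two in sync.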

visualize_topics() not returning the value of "Words" and "Size"

When trying the same example posted, visualize_topics() returns a visualization that for each cluster shows the following when hovering over it:

Words:%{customdata[3]}
Size:%{customdata[4]}

Inconsistency between topic with maximum probability and the predicted one for a document

Hi Maarten,

When using BERTopic on the fetch_20newsgroups dataset to extract topics and their associated representative documents, I found that for a given document the predicted topic was different from the one with the maximum probability. Of course, I only checked this for topic labels other than -1. In other words, there seems to be an inconsistency between predicted topics and probabilities. Is this normal?

When we use the following:

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

Shouldn't we have preds[idx] == numpy.argmax(probs[idx, :]) for each index idx?

Thank you in advance for your response.
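
A check along the following lines lists the documents where the two disagree. It is a plain-Python sketch with made-up values: `find_mismatches` is a hypothetical helper, and it assumes that column j of the probability matrix corresponds to topic j and that outliers are labelled -1.

```python
from typing import List, Sequence


def find_mismatches(preds: Sequence[int], probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices where the predicted topic is not the most probable one.

    Outlier predictions (-1) are skipped, as in the question above.
    """
    mismatches = []
    for idx, (pred, row) in enumerate(zip(preds, probs)):
        if pred == -1:
            continue
        most_probable = max(range(len(row)), key=row.__getitem__)
        if pred != most_probable:
            mismatches.append(idx)
    return mismatches


preds = [0, 2, -1, 1]
probs = [
    [0.8, 0.1, 0.1],  # consistent
    [0.6, 0.3, 0.1],  # predicted 2, but topic 0 is most probable
    [0.3, 0.3, 0.4],  # outlier, skipped
    [0.2, 0.7, 0.1],  # consistent
]
bad = find_mismatches(preds, probs)
```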

embedding_model bug

Thanks for providing an easy-to-use library. When I set the embedding_model parameter at BERTopic initialization, it doesn't load the model I want but defaults to 'distilbert-base-nli-stsb-mean-tokens'. I think this is caused by the elif clause of the _select_embedding_model function in _bertopic.py:

BERTopic/bertopic/_bertopic.py

Line 875 in c271ec6

def _select_embedding_model(self) -> SentenceTransformer:

self.language is referenced before self.embedding_model and since the default language value is 'english', it is returning the transformer models under the self.language clause in spite of whatever embedding models I choose.
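
The ordering problem can be pictured with the sketch below, which checks the explicit model before falling back to the language default. The function name, its signature and the `LANGUAGE_DEFAULTS` table are hypothetical and return plain strings rather than loaded models; only the English default name is taken from the issue.

```python
from typing import Optional

# Hypothetical language defaults; only the English value comes from the issue.
LANGUAGE_DEFAULTS = {"english": "distilbert-base-nli-stsb-mean-tokens"}


def select_embedding_model_name(embedding_model: Optional[str],
                                language: Optional[str]) -> str:
    """Prefer an explicitly requested model over the language default."""
    if embedding_model is not None:
        return embedding_model
    if language in LANGUAGE_DEFAULTS:
        return LANGUAGE_DEFAULTS[language]
    raise ValueError("No embedding model or supported language given.")
```

With this order, a user-supplied model wins even when language keeps its default value of 'english'.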

ValueError: k must be less than or equal to the number of training points

Hi,

I want to use your pipeline with my own embeddings. However, I always get this error:

2020-12-03 15:04:21,143 - BERTopic - Reduced dimensionality with UMAP
2020-12-03 15:04:21 - Reduced dimensionality with UMAP

ValueError                                Traceback (most recent call last)
in ()
      9 npcorpus_embeds = np.array(corpus_embeds)
     10
---> 11 topics = bmodel.fit_transform(cats, npcorpus_embeds)

4 frames
/usr/local/lib/python3.6/dist-packages/hdbscan/prediction.py in __init__(self, data, condensed_tree, min_samples, tree_type, metric, **kwargs)
    102         self.tree = self._tree_type_map[tree_type](self.raw_data,
    103                                                    metric=metric, **kwargs)
--> 104         self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
    105         self.dist_metric = DistanceMetric.get_metric(metric, **kwargs)
    106

sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()

ValueError: k must be less than or equal to the number of training points

I also tried using the built in embedding creation but got the same error. Do you know, what the problem could be?

custom dataset instructions

Hi,

Hope you are all well !

I wanted to apply BERTopic to a custom dataset, but can you provide more details about the input format for training a custom model ?

Thanks for any insights or inputs on that question.

Cheers,
X

Compute Sub-Clusters

Hi,

I used BERTopic on the arXiv dataset and extracted the most frequent topics (the biggest clusters).
Now I want to get the sub-clusters of the biggest cluster. What I did was to simply filter the documents and umap_embeddings with the corresponding cluster label and re-run hdbscan and c-TF-IDF on the sub-sets.
However, the results are not really satisfying. Even though my most frequent topic has a cluster size of 6756 I only get two sub-clusters. One with size 5810 and one with 579. If I repeat the process with the 5810 sub-cluster to get the sub-sub-clusters then hdbscan fails to make any clusters and all documents get label -1.

Is there something wrong about my approach? I feel like hdbscan should be able to find more clusters with cluster sizes of 6756 and 5810. For the first clustering I got 2733 clusters/topics.

The parameters are all on default.

Best
Karol

PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'...

Hi Maarten,

I love BERTopic!

Yet, I'm facing this error (CPU, local machine)

Is this expected?

Thanks,
Charly

Does GPU help?

Hi, firstly thank you so much for this library. I've tried it and it does take some time to get the topics.
Just wondering, will having GPU help speed-wise? Is the speed bottle-necked at the sentence transformers embedding portion?

The option nr_topics seems useless

Hello! This work is remarkable!
I ran into a problem when I trained a topic model using Chinese text data and my own sentence embeddings:

The info logged by the program suggests that the number of topics had been reduced to 30, but when I accessed the results using get_topics(), I found there were still 93 topics. Why did this happen?

By the way, I sometimes run into out-of-memory errors when running this package on my data. I think the reason is that I have millions of texts. Do you have any suggestions on how to apply this package to millions of texts?

llvmlite Error when install

Hi, I am trying to install BERTopic on mac and get the error:


    ----------------------------------------
ERROR: Command errored out with exit status 1: ./venv/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"'; __file__='"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-record-mk9vn8xa/install-record.txt --single-version-externally-managed --compile --install-headers ./venv/include/site/python3.9/llvmlite Check the logs for full command output.

Any idea?

Expose batch_size parameter

It would be nice to have a control of the batch size that converts the docs to embeddings.
The default is 32 and during my run of the algorithm, the GPU memory never exceeded 2.5/16G. It could improve the speed of the embeddings extraction.
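
What the parameter controls can be sketched in plain Python: documents are cut into consecutive chunks, and each chunk would be sent to the embedding model in one pass. `batched` is a hypothetical helper and no real embedding model is called here; the default of 32 is the value mentioned above.

```python
from typing import Iterator, List, Sequence


def batched(docs: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `docs` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(docs), batch_size):
        yield list(docs[start:start + batch_size])


docs = [f"document {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in batched(docs, batch_size=32)]
```

A larger batch size means fewer passes through the model at the cost of more GPU memory per pass, which is why exposing it can help when most of the memory sits unused.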

Applying code to my own dataset

Hi, thank you for this great work. I'm a beginner with BERT and I want to use your code to extract topics from Arabic text (stored in MongoDB). Do you have an idea how I can do this? Thank you so much.

model.transform does not return probabilities for new documents

rl_bertopic_model = BERTopic(language="english")
rl_bertopic_model = rl_bertopic_model.load(f'models/{model_name}')

new_doc = [r"some text"]

new_doc_topics, new_doc_probabilities = rl_bertopic_model.transform(new_doc[0])

new_doc_probabilities is None on 0.5.0 but it works fine on 0.4.3. This is the case regardless of whether the model is trained fresh or loaded from file.

I'm assuming this is related to the low_memory option introduced in 0.5.0. The wording here seems backwards: "If low_memory in BERTopic is set to False, then the probabilities are not calculated to speed up computation and decrease memory usage." - is this how it works? Seems like it should be the other way around.

Thanks

Proper exception handling for documents with no topics

I have come across a few cases in my corpus where probabilities[i] contains no probabilities that equal or exceed min_probability, and thus visualize_distribution will throw an exception on vals = probabilities[labels_idx].tolist().

Better exception handling for these cases, showing an informative alert instead of breaking the code, would be very handy.
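
The kind of guard being asked for might look like the sketch below. Both functions are hypothetical and work on a plain list of probabilities for one document; they are not taken from visualize_distribution, and the message text is only an example.

```python
from typing import List, Tuple


def topics_above_threshold(probabilities: List[float],
                           min_probability: float) -> List[Tuple[int, float]]:
    """Return (topic index, probability) pairs at or above the threshold."""
    return [(idx, p) for idx, p in enumerate(probabilities) if p >= min_probability]


def describe_distribution(probabilities: List[float], min_probability: float) -> str:
    """Summarize the topics to plot, or explain why there is nothing to plot."""
    selected = topics_above_threshold(probabilities, min_probability)
    if not selected:
        return f"No topic reaches min_probability={min_probability}; nothing to plot."
    return ", ".join(f"topic {idx}: {p:.2f}" for idx, p in selected)
```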

huggingface/tokenizers: The current process just got forked, after parallelism has already been used

I'm having a strange warning during the function fit_transform

from bertopic import BERTopic

model_berttopic = BERTopic(language="english", verbose=True, stop_words="english")
topics, probabilities = model_berttopic.fit_transform(documents)
print(topics)
print(probabilities)
model_berttopic.save("bertopic_model")

and the output is

2021-01-27 11:31:07,461 - BERTopic - Loaded embedding model
2021-01-27 20:53:53,191 - BERTopic - Transformed documents to Embeddings
2021-01-27 20:57:27,904 - BERTopic - Reduced dimensionality with UMAP
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

The program has been running for hours without any further logs (I expected Clustered UMAP embeddings with HDBSCAN, Loaded embedding model, Transformed documents to Embeddings and the saving of the model). What is happening? There is no feedback about what it is doing.
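
The warning itself names a workaround: setting TOKENIZERS_PARALLELISM explicitly. A minimal way to do that from Python is shown below. It only addresses the repeated warning and is not claimed to explain the long runtime.

```python
import os

# Set this before importing anything that loads `tokenizers`, so the
# setting is already in place when worker processes are forked.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```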

Error while using model.visualize_distribution(probabilities)

GPU utilization issue

Hello! When I tried to train the model on my local GPU, I saw that even though the code takes up some GPU memory, GPU utilization stays at 0. Could you please give me some hints on how to solve this issue?

Loading datasets

I'm very excited to see that there is now an LDAVis alternative that works with embeddings! Your documentation and Colab illustrate how to load the Newsgroup dataset. But could you also add something (for us less experienced users) about how to load local JSON files, using the 'abstract' field for the NLP but keeping the metadata for other types of analyses, such as dynamic topic modelling (for examples of visualizations, see the also very cool DETM tool)?
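
A minimal sketch of such a loader, using only the standard library, is given below. It assumes the file holds a JSON array of objects that each have an 'abstract' field; the function name `load_abstracts` and the file layout are assumptions, not something the documentation prescribes.

```python
import json
from typing import Dict, List, Tuple


def load_abstracts(path: str) -> Tuple[List[str], List[Dict]]:
    """Read a JSON array of records; return abstracts and remaining metadata."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)

    docs, metadata = [], []
    for record in records:
        abstract = record.get("abstract")
        if not isinstance(abstract, str) or not abstract.strip():
            continue  # skip records without usable text
        docs.append(abstract)
        # Keep every other field, aligned by position with `docs`.
        metadata.append({k: v for k, v in record.items() if k != "abstract"})
    return docs, metadata
```

The returned `docs` list is a plain list of strings, and `metadata[i]` holds the remaining fields (for example a year) of the record that produced `docs[i]`.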

package bertopic[visualization] not found

Thank you for developing this implementation.

I am trying to use the bertopic[visualization] based on your blog post -- Towards Data Science post.

My system throws an error : no matches found: bertopic[visualization]

Could you help?
Thanks,
Anan

`check_documents_type(documents)` in `utils.py` to support non-string portions of documents

Looking at the piece of code below in utils.py:

def check_documents_type(documents):
    """ Check whether the input documents are indeed a list of strings """
    if isinstance(documents, Iterable) and not isinstance(documents, str):
        if not any([isinstance(doc, str) for doc in documents]):
            raise TypeError("Make sure that the iterable only contains strings.")

    else:
        raise TypeError("Make sure that the documents variable is an iterable containing strings only.")

There are a lot of cases where a majority of a document is <class 'str'> and yet there will be an exception raised here.

Better support for such cases would be beneficial, for instance wrapping a single string (where isinstance(documents, str) is True) in an iterable, or adding an option that decides what to do with numbers, dates, etc. within the document text.

There is also the case of double quotations within a text to show quotes from someone that also breaks the code resulting in a TypeError().

This solution may potentially return a modified version as the outcome of such alteration.
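
One possible shape for the proposed behaviour is sketched below. `normalize_documents` is a hypothetical helper, not part of utils.py, and casting non-string items with str() is only one of the policies the proposal leaves open.

```python
from collections.abc import Iterable
from typing import List


def normalize_documents(documents) -> List[str]:
    """Wrap a single string in a list and cast non-string items to str."""
    if isinstance(documents, str):
        return [documents]
    if not isinstance(documents, Iterable):
        raise TypeError("documents must be a string or an iterable of items.")
    return [doc if isinstance(doc, str) else str(doc) for doc in documents]
```

As the issue notes, this returns a modified copy of the input, so callers would need to use the returned list rather than the original object.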

TypingError : Failed in nopython mode pipeline (step: nopython frontend)

I'm trying to load a trained BERTopic model from disk by using BERTopic.load, but I'm getting this error:

TypingError                               Traceback (most recent call last)
<ipython-input-9-2081de8232b3> in <module>
      1 import joblib
      2 with open('bertopic_model', 'rb') as file:
----> 3     model=joblib.load(file)

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
    573         filename = getattr(fobj, 'name', '')
    574         with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 575             obj = _unpickle(fobj)
    576     else:
    577         with open(filename, 'rb') as f:

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
    502     obj = None
    503     try:
--> 504         obj = unpickler.load()
    505         if unpickler.compat_mode:
    506             warnings.warn("The file '%s' has been generated with a "

/usr/lib/python3.8/pickle.py in load(self)
   1208                     raise EOFError
   1209                 assert isinstance(key, bytes_types)
-> 1210                 dispatch[key[0]](self)
   1211         except _Stop as stopinst:
   1212             return stopinst.value

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load_build(self)
    327         NDArrayWrapper is used for backward compatibility with joblib <= 0.9.
    328         """
--> 329         Unpickler.load_build(self)
    330 
    331         # For backward compatibility, we support NDArrayWrapper objects.

/usr/lib/python3.8/pickle.py in load_build(self)
   1701         setstate = getattr(inst, "__setstate__", None)
   1702         if setstate is not None:
-> 1703             setstate(state)
   1704             return
   1705         slotstate = None

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in __setstate__(self, d)
   1026     def __setstate__(self, d):
   1027         self.__dict__ = d
-> 1028         self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
   1029 
   1030     def _init_search_graph(self):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in <listcomp>(.0)
   1026     def __setstate__(self, d):
   1027         self.__dict__ = d
-> 1028         self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
   1029 
   1030     def _init_search_graph(self):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/rp_trees.py in renumbaify_tree(tree)
   1176     point_indices = numba.typed.List.empty_list(point_indices_type)
   1177 
-> 1178     hyperplanes.extend(tree.hyperplanes)
   1179     offsets.extend(tree.offsets)
   1180     children.extend(tree.children)

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py in extend(self, iterable)
    364             # can not be sliced.
    365             self._initialise_list(iterable[0])
--> 366         return _extend(self, iterable)
    367 
    368     def remove(self, item):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    413                 e.patch_message(msg)
    414 
--> 415             error_rewrite(e, 'typing')
    416         except errors.UnsupportedError as e:
    417             # Something unsupported is present in the user code, add help info

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    356                 raise e
    357             else:
--> 358                 reraise(type(e), e, None)
    359 
    360         argtypes = []

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     78         value = tp()
     79     if value.__traceback__ is not tb:
---> 80         raise value.with_traceback(tb)
     81     raise value
     82 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7f2a3dc6f4c0>) found for signature:

 >>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
    With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   - Resolution failure for literal arguments:
   No implementation of function Function(<function impl_append at 0x7f2a3dcf4c10>) found for signature:

    >>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
           With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
          Rejected as the implementation raised a specific error:
            LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
          
          
          File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
              def impl(l, item):
                  casteditem = _cast(item, itemty)
                  ^
          
          During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (597)
     raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py:81
   
   - Resolution failure for non-literal arguments:
   None
   
   During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
   During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (1051)
   
   
   File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
               def impl(l, iterable):
                   <source elided>
                   for i in iterable:
                       l.append(i)
                       ^

  raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071

- Resolution failure for non-literal arguments:
None

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py (101)


File "../env/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
    return l.extend(iterable)
    ^

I tried upgrading to joblib 1.0.0 but I'm still getting the same error. Has anyone run into this error before?
Why not use pickle/dill instead of joblib==0.17.0?

ValueError: k must be less than or equal to the number of training points

Hi Maarten,

I'm trying to get a topic on just a list of words: coffee, alcohol, drunk, cigarettes, smoking, drugs. So that I can have a topic called "Addiction" for example.

This is my code

from bertopic import BERTopic

docs = ['[CLS]', '[UNK]', 'coffee', 'alcohol', '[UNK]', 'drunk', 'cigarettes', 'smoking', 'drugs', '[SEP]']

model = BERTopic(verbose=True)
topics = model.fit_transform(docs)

And this is the error that I'm getting:

2021-02-08 23:08:18,794 - BERTopic - Loaded embedding model
2021-02-08 23:08:18,856 - BERTopic - Transformed documents to Embeddings
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/umap/umap_.py:2214: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
2021-02-08 23:08:21,252 - BERTopic - Reduced dimensionality with UMAP
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 278, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 753, in _cluster_embeddings
    self.cluster_model = hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 922, in fit
    self.generate_prediction_data()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 961, in generate_prediction_data
    self._prediction_data = PredictionData(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/prediction.py", line 104, in __init__
    self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
  File "sklearn/neighbors/_binary_tree.pxi", line 1342, in sklearn.neighbors._kd_tree.BinaryTree.query
ValueError: k must be less than or equal to the number of training points

Can it be something in the parameter settings? I can't figure it out, any help is very appreciated.
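The traceback shows HDBSCAN being created with `min_cluster_size=self.min_topic_size` and then failing in `self.tree.query(data, k=min_samples)`, so one plausible reading is that the ten-item input has fewer points than the neighbor count being requested. Below is a minimal sketch of a pre-flight check under that assumption. `check_corpus_size` is a hypothetical helper, not part of BERTopic, and the value `10` used for `min_topic_size` is only illustrative.

```python
def check_corpus_size(docs, min_topic_size):
    """Raise early if the corpus looks too small to cluster.

    Hypothetical helper (not a BERTopic API). It assumes the clustering
    step needs strictly more documents than `min_topic_size`, which is a
    conservative reading of the traceback above.
    """
    n_docs = len(docs)
    if n_docs <= min_topic_size:
        raise ValueError(
            f"Only {n_docs} documents for min_topic_size={min_topic_size}; "
            "add more documents or lower min_topic_size."
        )
    return n_docs


docs = ["coffee", "alcohol", "drunk", "cigarettes", "smoking", "drugs"]
try:
    check_corpus_size(docs, min_topic_size=10)
except ValueError as err:
    print(err)
```

Failing early like this gives a clearer message than the kd-tree error, though it does not change the fact that a handful of single words is very little data for clustering.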

Model load time

transform() and fit_transform() take the same amount of time to produce results. If I train the model, save it, and load it again, it takes just as long to give predictions. How can I get predictions quickly once I have trained and saved a model?

List of documents in a topic cluster

Hi, I am wondering if there's a way to query a list of documents (either return ID or actual text) in each topic cluster?
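As a possible workaround, the documents can be grouped outside the model. The sketch below assumes that the topic assignments returned by `fit_transform` form a list with one topic id per document, in the same order as the input. `group_docs_by_topic` is a hypothetical helper, and the toy `docs` and `topics` stand in for real model output.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics):
    """Map each topic id to the (index, text) pairs assigned to it."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for index, (doc, topic) in enumerate(zip(docs, topics)):
        grouped[topic].append((index, doc))
    return dict(grouped)


# Toy data standing in for real model output.
docs = ["cats purr", "dogs bark", "stocks fell", "kittens nap"]
topics = [0, 0, 1, 0]

clusters = group_docs_by_topic(docs, topics)
print(clusters[1])  # [(2, 'stocks fell')]
```

Keeping the index alongside the text means the result can be joined back to any external ID column.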

Installation Error

I am running OS X 10.11.6.

$ rustc --version
rustc 1.46.0
$ cargo --version
cargo 1.45.1
...
error[E0554]: `#![feature]` may not be used on the stable release channel
    --> /Users/davidlaxer/.cargo/registry/src/github.com-1ecc6299db9ec823/lock_api-0.3.4/src/lib.rs:91:34
     |
  91 | #![cfg_attr(feature = "nightly", feature(const_fn))]
     |                                  ^^^^^^^^^^^^^^^^^

No random state

Hi,
When I run the model with the same data and the same parameters, I get different clusters. Is there a way to fix random state for reproducibility?

Thanks

Languages bug

BERTopic/bertopic/_bertopic.py

Line 892 in 1ec0313

elif self.language.lower() in languages:

You check whether the language argument is in the languages list, but lowercase is applied first and the entries in the languages list start with an uppercase letter. As a result, it is not possible to initialize BERTopic with any language from this list other than English (which a separate if branch handles).
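The mismatch can be reproduced in isolation. In the sketch below, `languages` is a hypothetical four-entry excerpt standing in for the real list, and both function names are illustrative. Lowercasing both sides is one way to make the check case-insensitive, not necessarily the fix the library adopted.

```python
# Hypothetical excerpt of the supported-language list (capitalized entries).
languages = ["Afrikaans", "Dutch", "French", "German"]


def is_supported_buggy(language):
    # Mirrors the reported check: lowercased input vs. capitalized entries.
    return language.lower() in languages


def is_supported_fixed(language):
    # Compare both sides in lowercase so the match is case-insensitive.
    return language.lower() in {name.lower() for name in languages}


print(is_supported_buggy("Dutch"))  # False
print(is_supported_fixed("Dutch"))  # True
```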

.transform not working after reducing topics

Error when install on windows

Hi everyone,
when I try to install bertopic on Windows (pip install bertopic) I get an error. The problem arises on this line: "Building wheels for collected packages: hdbscan".
The next line is: Building wheel for hdbscan (PEP 517) ... error
...
ERROR: Failed building wheel for hdbscan
Failed to build hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly.

I tried different Python versions (3.5, 3.6, 3.7) but nothing changed.
Has anyone had the same problem and solved it?
Thank you all,

Andrea.

Text preprocessing

Hi! Thanks for developing this awesome library!

I have a question regarding text preprocessing.

From what I understand, the model takes List[str] as an input - basically a list of fulltext documents.

But do we need to preprocess the texts somehow before passing them into the model?

With LDA, I usually preprocess texts (tokenize, lemmatize, remove stopwords, create n-grams, etc.) before running models. But since we're dealing with word embeddings, keeping all words in their original form is important for the context, right?

So I'm not sure how to proceed, should I use list of preprocessed words as an input, or leave texts untouched, or something in between (keeping text as a string but without stopwords, etc.)?

Retrieve the docs in a cluster

Hi Maarten,

First of all, thank you for this great tool and insight!

Am I missing something in the docs? How can I retrieve the indices of the docs belonging to a given cluster number? If this is not implemented yet, is there a quick workaround I could use?

More progressbars

Hi,

I see that there is an option for a tqdm progress bar in _extract_embeddings but no actual access to it. As the dataset I am working on is quite big, I would like to have an estimate of how long things are going to take.
Would it be possible to enable the progress bar from fit_transform?
I don't know if it is possible, but can you add progress bars for the other steps as well?

Best
Karol

Parallelizing `extract_embeddings()`

For large corpora of documents, extracting BERT embeddings will take a long time.

Parallelizing it would be a sweet feature.

Plot customization

Hi Maarten!

Firstly, congratulations on your work.

I would like to suggest that the BERTopic "_plotly_topic_visualization" function called from "visualize_topics()" not only show the plotly figure but also return it as a variable. With the figure in a variable, the user can export it as HTML, PDF, etc. In my case, I would like to embed the figure in a dashboard.

In addition, I would suggest letting the user access the parameters used during topic modelling (i.e., UMAP, HDBSCAN, plotly visualization).

Make the algorithm less memory intensive

When using big data, it becomes infeasible to hold everything in memory at once.
Would it be possible to iterate over the data rather than hold it in memory?

It might also help to expose the n_jobs parameter for UMAP so that the user has some control over the number of cores and therefore the memory consumed.

Finetune on arxiv dataset

Hi,

thanks for your amazing work!
However, I am currently still having problems getting good results.
I want to use BERTopic on the Kaggle arXiv abstract dataset https://www.kaggle.com/Cornell-University/arxiv
It is a dataset that contains the abstract of each paper on arXiv. In total there are 1796908 abstracts, but I am using only 1/4 of them due to hardware constraints, so 449227 abstracts. The raw data is a list of dicts, each containing fields such as author, title, and abstract, but I am only using the abstracts themselves.
My current results are sadly not what I expected. Here is the output of model.get_topics():

##################################
[('withdrawn', 0.12732245199899253), ('arxiv', 0.060818479638394804), ('author', 0.045282397053936205), ('been', 0.043582983757148634), ('paper', 0.04331377340066525), ('authors', 0.03602908119595011), ('has', 0.03413129955351502), ('discussion', 0.020046558277271205), ('version', 0.017570171724893863), ('error', 0.016245558058635576), ('due', 0.016034569088373845), ('4002', 0.015203603208166275), ('article', 0.015178787241213468), ('mcshane', 0.014512825764984364), ('1104', 0.013893798421411663), ('crucial', 0.012724570309587551), ('wyner', 0.011639183974558176), ('proxies', 0.011545341114998098), ('please', 0.011257392365683372), ('0804', 0.010829445454730597)]
##################################
[('withdrawn', 1.2378088374383105), ('been', 0.33161619452791685), ('paper', 0.2815599045047751), ('has', 0.2598521946696877), ('administratively', 0.037473809176819514), ('article', 0.035331755008345955), ('retracted', 0.032019088876856915), ('abstract', 0.03194552517951581), ('withdraw', 0.03023297555105207), ('submission', 0.025366426781504862), ('mistake', 0.024461766176310584), ('rewriting', 0.02108034213126981), ('want', 0.019899921380598113), ('this', 0.018769808691909386), ('shorter', 0.01690150874372038), ('comment', 0.01634104139337519), ('probably', 0.016086126678481083), ('applicable', 0.015210457362549933), ('modification', 0.014865572450063681), ('longer', 0.014582146616905768)]
##################################
[('isotopes', 0.2558417644790476), ('thirty', 0.22683223596981578), ('refereed', 0.19469496374394987), ('publication', 0.1454113600287234), ('isotope', 0.14061126024476908), ('brief', 0.11791781235255983), ('identification', 0.10522952641115933), ('discovery', 0.09816511283302375), ('summary', 0.08730514775501705), ('synopsis', 0.07636960227187568), ('production', 0.07302537404971297), ('discussed', 0.06821321035793458), ('including', 0.06506837933191251), ('presented', 0.06352529575007038), ('twenty', 0.057115384937416365), ('eight', 0.05686672933874315), ('each', 0.054793906411334324), ('far', 0.05099118599417039), ('minerals', 0.04545668048089448), ('observed', 0.04482361175054437)]
##################################
[('withdrawn', 0.8102220016577751), ('author', 0.5882810125654714), ('been', 0.21955498849750296), ('paper', 0.1935837075655028), ('has', 0.17516982841654938), ('pourmohammad', 0.08015698196117896), ('ali', 0.0605915947645577), ('seemann', 0.027628108579230422), ('eqn', 0.02270159607399226), ('admin', 0.022251661779234076), ('request', 0.01530721857357427), ('by', 0.013408498518361402), ('this', 0.012868228621997208), ('modification', 0.010599219818655269), ('authors', 0.010375596329419137), ('arxiv', 0.010147235836885003), ('km', 0.008868127574553979), ('due', 0.0053505630213037635), ('first', 0.004174426091124793), ('at', 0.0013290983743134937)]
##################################
[('de', 0.16471413053677894), ('la', 0.08859824729535025), ('un', 0.07960098252808794), ('en', 0.07656758724369946), ('des', 0.07493017494049987), ('une', 0.0685045487905329), ('est', 0.06506619811186878), ('nous', 0.0552461357202294), ('que', 0.0505970341092853), ('dans', 0.04833955653453861), ('pour', 0.04773108415278024), ('les', 0.04405246081293785), ('et', 0.04269515259835229), ('sur', 0.0425329786858753), ('caract', 0.034373522683066204), ('le', 0.03028301508609669), ('es', 0.029084319074609982), ('ees', 0.028840836535619835), ('cette', 0.023815804070613532), ('eme', 0.023083284220080345)]
##################################
[('model', 0.0029213211816859273), ('two', 0.0029181910122442487), ('it', 0.002917764978256985), ('can', 0.002911863114896897), ('these', 0.002900525114119986), ('our', 0.0028719993646575373), ('show', 0.0028703487897916566), ('results', 0.002862179058792491), ('also', 0.0028543448650093332), ('field', 0.002807120623162036), ('have', 0.0027961151449595436), ('using', 0.002780531524136966), ('between', 0.0027687481202621554), ('or', 0.002762760512175864), ('one', 0.0027467154286522437), ('time', 0.002741766841704294), ('energy', 0.0027274038973420667), ('data', 0.0026880146639130568), ('quantum', 0.0026615769324125527), ('such', 0.002660012066337066)]
##################################
[('withdrawn', 0.25382672910326465), ('arxiv', 0.1859321307804051), ('author', 0.10424878083447243), ('been', 0.09053395420662566), ('paper', 0.0810609839954744), ('has', 0.06435182608618999), ('version', 0.05760067485755415), ('authors', 0.05258540866955608), ('superseded', 0.04795847515378754), ('replaced', 0.043281616461112844), ('merged', 0.03743469139047417), ('0804', 0.03698167270671404), ('1008', 0.03085884218486115), ('because', 0.030835129343355642), ('0812', 0.023350724849799137), ('0901', 0.022639659892192746), ('revised', 0.02187828357506638), ('1306v6', 0.021542196549539806), ('submission', 0.02115347128457661), ('3484', 0.020833341465434623)]
##################################
[('withdrawn', 0.3174814989979916), ('author', 0.11776777433239484), ('been', 0.08872075679275165), ('paper', 0.08199984676632026), ('due', 0.08048541635175363), ('has', 0.07026260094965996), ('error', 0.05649266661391457), ('authors', 0.05390187230801506), ('arxiv', 0.051928726372487306), ('because', 0.034956486686744344), ('mistake', 0.032456919238108894), ('crucial', 0.0322646808282665), ('submission', 0.029450103971990518), ('administrators', 0.02776402639935557), ('admin', 0.024154968037124056), ('proof', 0.02232748069916814), ('errors', 0.017136278392237612), ('lemma', 0.015641869372412024), ('copyright', 0.015397080186955915), ('theorem', 0.014757663158028029)]

As you can see, the extracted topics are kind of bad and not what I have hoped for.
Can you give me some advice why this is not working and what I should finetune?

Best
Karol

Problem with numba package

I keep having the same installation error, related to the numba package. See error below:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-160-1d2f5f7c9d67> in <module>
----> 1 from bertopic import BERTopic

/opt/anaconda3/lib/python3.7/site-packages/bertopic/__init__.py in <module>
----> 1 from bertopic._bertopic import BERTopic
      2 from bertopic._ctfidf import ClassTFIDF
      3 from bertopic._embeddings import languages
      4 
      5 __version__ = "0.4.3"

/opt/anaconda3/lib/python3.7/site-packages/bertopic/_bertopic.py in <module>
     10 
     11 # Models
---> 12 import umap
     13 import hdbscan
     14 from sentence_transformers import SentenceTransformer

/opt/anaconda3/lib/python3.7/site-packages/umap/__init__.py in <module>
      1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
      3 
      4 try:
      5     with catch_warnings():

/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py in <module>
     45 )
     46 
---> 47 from pynndescent import NNDescent
     48 from pynndescent.distances import named_distances as pynn_named_distances
     49 from pynndescent.sparse import sparse_named_distances as pynn_sparse_named_distances

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/__init__.py in <module>
      1 import pkg_resources
      2 import numba
----> 3 from .pynndescent_ import NNDescent, PyNNDescentTransformer
      4 
      5 # Workaround: https://github.com/numba/numba/issues/3341

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in <module>
     19 import heapq
     20 
---> 21 import pynndescent.sparse as sparse
     22 import pynndescent.sparse_nndescent as sparse_nnd
     23 import pynndescent.distances as pynnd_dist

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/sparse.py in <module>
      8 import numba
      9 
---> 10 from pynndescent.utils import norm, tau_rand
     11 from pynndescent.distances import kantorovich
     12 

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/utils.py in <module>
      6 
      7 import numba
----> 8 from numba.core import types
      9 from numba.experimental import structref
     10 import numpy as np

ModuleNotFoundError: No module named 'numba.core'

I am running on macOS Big Sur. Package versions:
bertopic==0.4.3
conda==4.9.2
numba==0.52.0
umap-learn==0.5.0
Python==3.7.6

I've already done a lot of searching on the internet but can't find any solution. Does somebody have the same problem or any idea how to solve this?

Thanks in advance!

Using COVID-Twitter-BERT (CT-BERT)

Hi, firstly thank you so much for this library! :)

I am interested in performing topic modelling using tweets related to COVID-19, and I was wondering if it is possible to integrate the CT-BERT model (from https://github.com/digitalepidemiologylab/covid-twitter-bert) into BERTopic? And if it is indeed possible, how can I go about doing so?

Your help would be very much appreciated! Thank you in advance.

reduce_topics() changes probabilities in place

When we use the method reduce_topics(), it mutates the given probabilities parameter, which becomes identical to the returned probabilities. It would be better if it did not mutate the given one, so that we end up with two different sets of probabilities: one from before the reduction and one from after.
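Until the method stops mutating its input, a caller can keep the earlier values by copying them before the call. The sketch below shows the pattern on plain lists. `reduce_in_place` is a stand-in written for this example, not BERTopic code, and it only imitates a function that modifies its argument.

```python
import copy


def reduce_in_place(probabilities):
    """Stand-in for a call that mutates its argument (not BERTopic code)."""
    for row in probabilities:
        row.pop()  # drop the last topic column in place
    return probabilities


original = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
before = copy.deepcopy(original)    # snapshot taken before the call
after = reduce_in_place(original)   # `original` is now mutated

print(before)  # [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
print(after)   # [[0.7, 0.2], [0.1, 0.6]]
```

`copy.deepcopy` also works on NumPy arrays, so the same pattern should carry over if the probabilities are stored as an array.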

Use SentenceTransformer through Flair

Hi Maarten
Thanks again for this great package; the 0.5 release is just amazing.
Regarding Flair vs SentenceTransformer, it might be interesting to always use Flair, even for SentenceTransformer:

Flair has a top-level class DocumentEmbeddings and several implementations, among them:

TransformerDocumentEmbeddings
SentenceTransformerDocumentEmbeddings
DocumentTFIDFEmbeddings
DocumentPoolEmbeddings

What do you think?

Best regards

Olivier
