maartengr / bertopic
Leveraging BERT and c-TF-IDF to create easily interpretable topics.
Home Page: https://maartengr.github.io/BERTopic/
License: MIT License
Hi,
Nice work on the package. I had a question.
The model is very sensitive to its parameters. I am trying to build a pipeline that automatically finds the best parameters, using outlier_count as my metric: the lower the number of outliers, the better the model.
I want to understand: is this the right approach?
Thanks!
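A minimal sketch of the metric described above, assuming that outliers are labelled -1 as in the other posts on this page. The helper names `outlier_fraction` and `pick_best`, and the parameter labels in `runs`, are hypothetical and not part of BERTopic. Keep in mind that a model which puts every document into one cluster also has zero outliers, so this number is probably best combined with an inspection of the topics themselves.

```python
def outlier_fraction(topics):
    """Return the share of documents assigned to the outlier topic (-1)."""
    if not topics:
        raise ValueError("topics must not be empty")
    return sum(1 for t in topics if t == -1) / len(topics)


def pick_best(runs):
    """Pick the parameter label with the lowest outlier fraction.

    `runs` maps a (hypothetical) parameter label to the list of topics
    that a model trained with those parameters returned.
    """
    return min(runs, key=lambda name: outlier_fraction(runs[name]))


# Made-up topic assignments for two parameter settings.
runs = {
    "min_topic_size=10": [0, 1, -1, -1, 2, 0],
    "min_topic_size=30": [0, 0, -1, 0, 0, 0],
}
best = pick_best(runs)
```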
Hi, really nice work with this package, it's very useful.
Model initialization takes the argument n_gram_range
, but I think that it doesn't get used. Should line 241 referenced here be
count = CountVectorizer(ngram_range=n_gram_range, stop_words="english").fit(documents)
?
Line 241 in 9f7dca1
It might be nice to have the stop_words
argument be configurable at initiation as well, so that the user could pass a corpus-specific set of stop words.
On running new_topics, new_probabilities = model.transform(new_doc)
c:\tools\anaconda3\envs\autotag\lib\site-packages\bertopic\_bertopic.py in transform(self, documents, embeddings)
349 if not isinstance(embeddings, np.ndarray):
350 self.embedding_model = self._select_embedding_model()
--> 351 embeddings = self._extract_embeddings(documents, verbose=self.verbose)
352
353 umap_embeddings = self.umap_model.transform(embeddings)
AttributeError: 'BERTopic' object has no attribute 'verbose'
Seems to be introduced in 0.5 - the issue wasn't present on 0.4.3.
Hey, this is an awesome project. My question is how exactly you pre-process long texts. I notice that the data you demo with are all short texts (most of them are one sentence each). I tried to imitate that by segmenting my texts into sentences (splitting long texts on periods and semicolons, and validating their lengths) while removing all punctuation, but I am getting only 1 cluster. Any suggestion would be much appreciated. I understand how transformer models differ from topic models like LDA and NMF, but do you think it's possible for BERT and transformer models to do something similar, namely taking several long text files as input and simply generating models without a limitation on text length? Thank you.
Hello!
2020-10-31 14:35:53,446 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model
2020-10-31 15:29:37,627 - BERTopic - Transformed documents to Embeddings
It currently takes about an hour to compute embeddings for 20,000 documents in the 20 Newsgroups loaded with:
docs = fetch_20newsgroups(subset='all')['data']
To scale this better, one way is to use the bert-as-service with multiple workers. Have you thought of a possibility to make embedding computation pluggable?
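One way a pluggable embedding step could look is sketched below. It assumes that any backend can be expressed as a callable that maps a batch of strings to a list of vectors. `Embedder`, `embed_in_batches` and `toy_embedder` are hypothetical names for illustration, not BERTopic API. The toy backend only exists so the sketch runs without a model.

```python
from typing import Callable, List, Sequence

# Any backend (local model, remote service, ...) that turns a batch of
# documents into one vector per document.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def embed_in_batches(docs: Sequence[str], embedder: Embedder,
                     batch_size: int = 32) -> List[List[float]]:
    """Embed documents batch by batch with any callable backend."""
    embeddings: List[List[float]] = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        embeddings.extend(embedder(batch))
    return embeddings


def toy_embedder(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in backend: two made-up features per document."""
    return [[float(len(doc)), float(doc.count(" "))] for doc in batch]


vectors = embed_in_batches(["a b", "hello", "one two three"],
                           toy_embedder, batch_size=2)
```

Vectors computed this way could then be handed over as precomputed embeddings, in the way a later post on this page calls fit_transform with a second embeddings argument.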
There is an issue related to the _plotly_topic_visualization() method.
In Python 3.x, d.keys() returns a view object (not a list), so giving a dictionary to the hover_data parameter when creating the fig will cause an error.
Here is the code:
# Plotting topics
fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
                 hover_data={"x": False, "y": False, "Topic": True, "Words": True, "Size": True})
To solve this problem, simply wrap the dictionary in list():
# Plotting topics
fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
hover_data=list({"x": False, "y": False, "Topic": True, "Words": True, "Size": True}))
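As a small illustration of what the proposed fix does in plain Python: calling list() on a dictionary keeps only its keys, so the True/False flags are discarded. The variable names below are illustrative only, and the second variant is just one possible alternative, not the library's behaviour.

```python
hover_spec = {"x": False, "y": False, "Topic": True, "Words": True, "Size": True}

# list() on a dict keeps only the keys, in insertion order.
all_columns = list(hover_spec)

# Alternative that honours the True/False flags instead of dropping them.
visible_columns = [name for name, show in hover_spec.items() if show]
```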
Hey, thanks for this great work.
The reduce_topics() method mixes up the topic with the largest id and the outlier class, which has id -1. This happens because of the _map_probabilities() method. When the outlier topic is assigned to the from_topic or to_topic variables, the method modifies the last element of the probabilities array. That element is not the outlier class's probability, since the outlier class has no entry in the probabilities array.
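The pitfall described above can be reproduced with a plain list. This is a simplified sketch, not the code of _map_probabilities(); `set_probability` is a hypothetical helper, and the assumption is that position i in the array belongs to topic i.

```python
probabilities = [0.1, 0.2, 0.7]   # one entry per topic 0, 1, 2; none for -1
outlier_topic = -1

# Pitfall: -1 is a valid Python index and silently addresses the last topic.
buggy = list(probabilities)
buggy[outlier_topic] = 0.0


def set_probability(probs, topic, value):
    """Set a topic probability, ignoring the outlier topic (hypothetical helper)."""
    if topic < 0:
        # The outlier class has no entry, so leave the values untouched.
        return list(probs)
    updated = list(probs)
    updated[topic] = value
    return updated


safe = set_probability(probabilities, outlier_topic, 0.0)
```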
Would it be feasible to return the probabilities for all of the topics rather than only returning the best topic? This would be similar to LDA, where typically proportions or probabilities are returned for all topics.
I think this could be done by changing transform()
to use membership_vector()
instead of approximate_predict()
.
Hi
I think it would be great if we could pass all existing keyword arguments of CountVectorizer to BERTopic, and not only n_gram_range and stop_words as is the case today.
Some of them, like max_df, min_df, strip_accents or even tokenizer, can be of great help when fine-tuning a model.
It could be done by changing the signature of the __init__ method from
def __init__(self,
             bert_model: str = 'distilbert-base-nli-mean-tokens',
             top_n_words: int = 20,
             nr_topics: int = None,
             n_gram_range: Tuple[int, int] = (1, 1),
             min_topic_size: int = 30,
             n_neighbors: int = 15,
             n_components: int = 5,
             stop_words: Union[str, List[str]] = None,
             verbose: bool = False)
to
def __init__(self,
             bert_model: str = 'distilbert-base-nli-mean-tokens',
             top_n_words: int = 20,
             nr_topics: int = None,
             min_topic_size: int = 30,
             n_neighbors: int = 15,
             n_components: int = 5,
             verbose: bool = False,
             **kwargs)
Then storing the kwargs dictionary as a class attribute self.kwargs
And then in _c_tf_idf
count = CountVectorizer(**self.kwargs).fit(documents)
I can even provide a PR if you want
Thx again for this great package
Olivier Terrier
@kairntech
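The forwarding pattern proposed in the post above can be sketched as follows. `FakeVectorizer` is a stand-in so the example runs without scikit-learn, and `TopicModelSketch` is a hypothetical class, not BERTopic itself. With the real CountVectorizer the same `**kwargs` unpacking would apply.

```python
class FakeVectorizer:
    """Stand-in for CountVectorizer so the sketch runs without scikit-learn."""

    def __init__(self, ngram_range=(1, 1), stop_words=None, min_df=1):
        self.ngram_range = ngram_range
        self.stop_words = stop_words
        self.min_df = min_df


class TopicModelSketch:
    """Hypothetical model that forwards unknown keyword arguments."""

    def __init__(self, top_n_words=20, verbose=False, **vectorizer_kwargs):
        self.top_n_words = top_n_words
        self.verbose = verbose
        # Stored untouched and only unpacked when the vectorizer is built.
        self.vectorizer_kwargs = vectorizer_kwargs

    def build_vectorizer(self):
        return FakeVectorizer(**self.vectorizer_kwargs)


model = TopicModelSketch(top_n_words=10, ngram_range=(1, 2), min_df=5)
vectorizer = model.build_vectorizer()
```

One trade-off of this design is that a misspelled keyword is not rejected in `__init__`; it only raises a TypeError once the vectorizer is actually built.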
I am getting ModuleNotFoundError: No module named 'bertopic'
while the output of pip install bertopic
is as follows:
Requirement already satisfied: bertopic in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (0.3.4)
Requirement already satisfied: matplotlib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (3.3.3)
Requirement already satisfied: pandas in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.1.5)
Requirement already satisfied: scikit-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.23.2)
Requirement already satisfied: tqdm in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (4.54.1)
Requirement already satisfied: hdbscan in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.8.26)
Requirement already satisfied: numpy in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.19.4)
Requirement already satisfied: sentence-transformers in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.3.9)
Requirement already satisfied: joblib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.0.0)
Requirement already satisfied: umap-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.4.6)
Requirement already satisfied: torch in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.7.1)
Requirement already satisfied: python-dateutil>=2.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.4.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (8.0.1)
Requirement already satisfied: pytz>=2017.2 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from pandas->bertopic) (2020.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (1.5.4)
Requirement already satisfied: six in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (1.14.0)
Requirement already satisfied: cython>=0.27 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (0.29.21)
Requirement already satisfied: nltk in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5)
Requirement already satisfied: transformers<3.6.0,>=3.1.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5.1)
Requirement already satisfied: numba!=0.47,>=0.46 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from umap-learn->bertopic) (0.52.0)
Requirement already satisfied: typing-extensions in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from torch->bertopic) (3.7.4.3)
Requirement already satisfied: click in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (7.1.2)
Requirement already satisfied: regex in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (2020.11.13)
Requirement already satisfied: requests in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (2.22.0)
Requirement already satisfied: protobuf in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.14.0)
Requirement already satisfied: sacremoses in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.0.43)
Requirement already satisfied: packaging in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (20.3)
Requirement already satisfied: sentencepiece==0.1.91 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.1.91)
Requirement already satisfied: tokenizers==0.9.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.9.3)
Requirement already satisfied: filelock in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.0.12)
Requirement already satisfied: llvmlite<0.36,>=0.35.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (0.35.0)
Requirement already satisfied: setuptools in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (44.0.0)
When trying the same example posted, visualize_topics() returns a visualization that for each cluster shows the following when hovering over it:
Words:%{customdata[3]}
Size:%{customdata[4]}
Hi Maarten,
When using BERTopic on fetch_20newsgroups dataset to extract topics and their associated representative documents I figured out that for a given document the predicted topic was different from the one with the maximum probability. Of course, I checked it for topic label different from -1. In other words, it seems to have an inconsistency between predicted topics and probabilities. Is this normal ?
When we use the following:
topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)
Shouldn't we have preds[idx] == numpy.argmax(probs[idx, :]) for each index idx?
Thank you in advance for your response.
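A check like the one described above could be sketched in plain Python as follows. It assumes that column i of the probabilities belongs to topic i and that outlier documents (-1) have no column of their own. `find_mismatches` is a hypothetical helper and the numbers are made up.

```python
def find_mismatches(preds, probs):
    """Return indices where the predicted topic is not the most probable one.

    Documents predicted as outliers (-1) are skipped because they have
    no column of their own in the probability matrix.
    """
    mismatches = []
    for idx, (pred, row) in enumerate(zip(preds, probs)):
        if pred == -1:
            continue
        # Plain-Python argmax over one row of probabilities.
        best = max(range(len(row)), key=row.__getitem__)
        if best != pred:
            mismatches.append(idx)
    return mismatches


preds = [0, 1, -1, 1]
probs = [
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.1],
    [0.6, 0.4],
]
mismatches = find_mismatches(preds, probs)
```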
Thanks for providing an easy to use library. When setting an embedding_model parameter in bertopic initialization, it isn't loading the model I want but defaults to 'distilbert-base-nli-stsb-mean-tokens'. I think this is the case because the elif clause of _select_embedding_model function in _bertopic.py
BERTopic/bertopic/_bertopic.py
Line 875 in c271ec6
Hi,
I want to use your pipeline with my own embeddings. However, I always get this error:
ValueError Traceback (most recent call last)
in ()
9 npcorpus_embeds = np.array(corpus_embeds)
10
---> 11 topics = bmodel.fit_transform(cats, npcorpus_embeds)
4 frames
/usr/local/lib/python3.6/dist-packages/hdbscan/prediction.py in init(self, data, condensed_tree, min_samples, tree_type, metric, **kwargs)
102 self.tree = self._tree_type_map[tree_type](self.raw_data,
103 metric=metric, **kwargs)
--> 104 self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
105 self.dist_metric = DistanceMetric.get_metric(metric, **kwargs)
106
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()
ValueError: k must be less than or equal to the number of training points
I also tried using the built in embedding creation but got the same error. Do you know, what the problem could be?
Hi,
Hope you are all well !
I wanted to apply BERTopic to a custom dataset, but can you provide more details about the input format for training a custom model ?
Thanks for any insights or inputs on that question.
Cheers,
X
Hi,
I used BERTopic on the arXiv dataset and extracted the most frequent topics (the biggest clusters).
Now I want to get the sub-clusters of the biggest cluster. What I did was to simply filter the documents and umap_embeddings with the corresponding cluster label and re-run hdbscan and c-TF-IDF on the sub-sets.
However, the results are not really satisfying. Even though my most frequent topic has a cluster size of 6756 I only get two sub-clusters. One with size 5810 and one with 579. If I repeat the process with the 5810 sub-cluster to get the sub-sub-clusters then hdbscan fails to make any clusters and all documents get label -1.
Is there something wrong about my approach? I feel like hdbscan should be able to find more clusters with cluster sizes of 6756 and 5810. For the first clustering I got 2733 clusters/topics.
The parameters are all on default.
Best
Karol
Hi, firstly thank you so much for this library. I've tried it and it does take some time to get the topics.
Just wondering, will having GPU help speed-wise? Is the speed bottle-necked at the sentence transformers embedding portion?
Hello! This work is remarkable!
I got a problem when I trained a topic model using Chinese text data and my own sentence embeddings:
The info given by the program suggests that the number of topics had been reduced to 30, but when I accessed the results using get_topics(), I found there were still 93 topics. Why did this happen?
By the way, I sometimes run into an out-of-memory error when running this package on my data. I think the reason is that I have millions of texts. Do you have any suggestions on how to apply this package to millions of texts?
Hi, I am trying to install BERTopic on mac and get the error:
----------------------------------------
ERROR: Command errored out with exit status 1: ./venv/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"'; __file__='"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-record-mk9vn8xa/install-record.txt --single-version-externally-managed --compile --install-headers ./venv/include/site/python3.9/llvmlite Check the logs for full command output.
Any idea?
It would be nice to have a control of the batch size that converts the docs to embeddings.
The default is 32 and during my run of the algorithm, the GPU memory never exceeded 2.5/16G. It could improve the speed of the embeddings extraction.
Hi, thank you for this great work. I'm a beginner with BERT, and I want to use your code to extract topics from Arabic text (stored in MongoDB). Do you have an idea how I can do this? Thank you so much.
BR
rl_bertopic_model = BERTopic(language="english")
rl_bertopic_model = rl_bertopic_model.load(f'models/{model_name}')
new_doc = [r"some text"]
new_doc_topics, new_doc_probabilities = rl_bertopic_model.transform(new_doc[0])
new_doc_probabilities is None on 0.5.0 but it works fine on 0.4.3. This is the case regardless of whether the model is trained fresh or loaded from file.
I'm assuming this is related to the low_memory option introduced in 0.5.0. The wording here seems backwards: "If low_memory in BERTopic is set to False, then the probabilities are not calculated to speed up computation and decrease memory usage." Is this how it works? It seems like it should be the other way around.
Thanks
I have come across a few cases in my corpus where probabilities[i] contains no probabilities that equal or exceed min_probability, and thus visualize_distribution will throw an exception on vals = probabilities[labels_idx].tolist().
Better exception handling for these cases, showing an informative alert instead of breaking the code, would be very handy.
I'm having a strange warning during the function fit_transform
from bertopic import BERTopic
model_berttopic = BERTopic(language="english", verbose=True, stop_words="english")
topics, probabilities = model_berttopic.fit_transform(documents)
print(topics)
print(probabilities)
model_berttopic.save("bertopic_model")
and the output is
2021-01-27 11:31:07,461 - BERTopic - Loaded embedding model
2021-01-27 20:53:53,191 - BERTopic - Transformed documents to Embeddings
2021-01-27 20:57:27,904 - BERTopic - Reduced dimensionality with UMAP
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
The program has been running for hours without any other logs (I expected Clustered UMAP embeddings with HDBSCAN, Loaded embedding model, Transformed documents to Embeddings and the saving of the model). What is happening? There is no feedback on what it is doing.
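The warning text itself names the environment variable that silences it. A minimal sketch of setting it from Python is shown below, assuming it runs before the tokenizers library is first used in the process. This only addresses the repeated warning; it is not known from this post whether it affects the long runtime.

```python
import os

# Must be set before the tokenizers library is first used in this process,
# for example at the very top of the script, before importing bertopic.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```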
Hello! When I try to train the model on my local GPU, the code takes up some GPU memory but the GPU utilization stays at 0. Could you please give me some hints on how to solve this issue?
I'm very excited to see that there is now an LDAVis alternative that works with embeddings! Your documentation and colab illustrate how to load the Newsgroup dataset. But could you also add something (for us less experienced users) about how to load local JSON files, using the 'abstract' field for the NLP but keeping the metadata for other types of analyses, such as dynamic topic modelling (for examples of visualizations, see the also very cool DETM-tool).
Thank you for developing this implementation.
I am trying to use the bertopic[visualization] based on your blog post -- Towards Data Science post.
My system throws an error : no matches found: bertopic[visualization]
Could you help?
Thanks,
Anan
Looking at the piece of code below in utils.py
:
def check_documents_type(documents):
    """ Check whether the input documents are indeed a list of strings """
    if isinstance(documents, Iterable) and not isinstance(documents, str):
        if not any([isinstance(doc, str) for doc in documents]):
            raise TypeError("Make sure that the iterable only contains strings.")
    else:
        raise TypeError("Make sure that the documents variable is an iterable containing strings only.")
There are many cases where the majority of a document is <class 'str'> and yet an exception is raised here.
Better support for such cases would be beneficial, for instance turning a document for which isinstance(documents, str) is True into an Iterable object, or adding an option that decides what to do with numbers/dates/etc. within the document text.
There is also the case of double quotation marks within a text (used to quote someone), which also breaks the code and results in a TypeError().
Such a solution may return a modified version of the documents as a result of the alteration.
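One possible shape for the more forgiving check requested above is sketched here. `normalize_documents` is a hypothetical helper, not part of BERTopic. It wraps a single string in a list and, for mixed input, reports which element is not a string instead of failing with a generic message.

```python
from collections.abc import Iterable


def normalize_documents(documents):
    """Return a list of strings or raise TypeError naming the first offender."""
    if isinstance(documents, str):
        # Wrap a single document instead of rejecting it.
        return [documents]
    if not isinstance(documents, Iterable):
        raise TypeError("documents must be a string or an iterable of strings")
    docs = list(documents)
    for position, doc in enumerate(docs):
        if not isinstance(doc, str):
            raise TypeError(
                f"document at position {position} has type "
                f"{type(doc).__name__}, expected str"
            )
    return docs
```

Whether non-string entries should instead be converted with str() is a design choice that depends on the data; the sketch only makes the failure easier to locate.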
I'm trying to load a trained BERTopic model from disk by using BERTopic.load
, but I'm getting this error:
TypingError Traceback (most recent call last)
<ipython-input-9-2081de8232b3> in <module>
1 import joblib
2 with open('bertopic_model', 'rb') as file:
----> 3 model=joblib.load(file)
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
573 filename = getattr(fobj, 'name', '')
574 with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 575 obj = _unpickle(fobj)
576 else:
577 with open(filename, 'rb') as f:
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
502 obj = None
503 try:
--> 504 obj = unpickler.load()
505 if unpickler.compat_mode:
506 warnings.warn("The file '%s' has been generated with a "
/usr/lib/python3.8/pickle.py in load(self)
1208 raise EOFError
1209 assert isinstance(key, bytes_types)
-> 1210 dispatch[key[0]](self)
1211 except _Stop as stopinst:
1212 return stopinst.value
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load_build(self)
327 NDArrayWrapper is used for backward compatibility with joblib <= 0.9.
328 """
--> 329 Unpickler.load_build(self)
330
331 # For backward compatibility, we support NDArrayWrapper objects.
/usr/lib/python3.8/pickle.py in load_build(self)
1701 setstate = getattr(inst, "__setstate__", None)
1702 if setstate is not None:
-> 1703 setstate(state)
1704 return
1705 slotstate = None
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in __setstate__(self, d)
1026 def __setstate__(self, d):
1027 self.__dict__ = d
-> 1028 self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
1029
1030 def _init_search_graph(self):
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in <listcomp>(.0)
1026 def __setstate__(self, d):
1027 self.__dict__ = d
-> 1028 self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
1029
1030 def _init_search_graph(self):
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/rp_trees.py in renumbaify_tree(tree)
1176 point_indices = numba.typed.List.empty_list(point_indices_type)
1177
-> 1178 hyperplanes.extend(tree.hyperplanes)
1179 offsets.extend(tree.offsets)
1180 children.extend(tree.children)
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py in extend(self, iterable)
364 # can not be sliced.
365 self._initialise_list(iterable[0])
--> 366 return _extend(self, iterable)
367
368 def remove(self, item):
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
413 e.patch_message(msg)
414
--> 415 error_rewrite(e, 'typing')
416 except errors.UnsupportedError as e:
417 # Something unsupported is present in the user code, add help info
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
356 raise e
357 else:
--> 358 reraise(type(e), e, None)
359
360 argtypes = []
/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
78 value = tp()
79 if value.__traceback__ is not tb:
---> 80 raise value.with_traceback(tb)
81 raise value
82
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7f2a3dc6f4c0>) found for signature:
>>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_append at 0x7f2a3dcf4c10>) found for signature:
>>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
Rejected as the implementation raised a specific error:
LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
def impl(l, item):
casteditem = _cast(item, itemty)
^
During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (597)
raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py:81
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (1051)
File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
def impl(l, iterable):
<source elided>
for i in iterable:
l.append(i)
^
raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py (101)
File "../env/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
return l.extend(iterable)
^
I tried to upgrade to joblib 1.0.0 but I'm still getting the same error. Did someone receive the same error in the past?
Why not use pickle/dill instead of joblib==0.17.0 ?
Hi Maarten,
I'm trying to get a topic on just a list of words: coffee, alcohol, drunk, cigarettes, smoking, drugs. So that I can have a topic called "Addiction" for example.
This is my code
from bertopic import BERTopic
docs = ['[CLS]', '[UNK]', 'coffee', 'alcohol', '[UNK]', 'drunk', 'cigarettes', 'smoking', 'drugs', '[SEP]']
model = BERTopic(verbose=True)
topics = model.fit_transform(docs)
And this is the error that I'm getting:
2021-02-08 23:08:18,794 - BERTopic - Loaded embedding model
2021-02-08 23:08:18,856 - BERTopic - Transformed documents to Embeddings
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/umap/umap_.py:2214: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
warn(
2021-02-08 23:08:21,252 - BERTopic - Reduced dimensionality with UMAP
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 278, in fit_transform
documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 753, in _cluster_embeddings
self.cluster_model = hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 922, in fit
self.generate_prediction_data()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 961, in generate_prediction_data
self._prediction_data = PredictionData(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/prediction.py", line 104, in __init__
self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
File "sklearn/neighbors/_binary_tree.pxi", line 1342, in sklearn.neighbors._kd_tree.BinaryTree.query
ValueError: k must be less than or equal to the number of training points
Can it be something in the parameter settings? I can't figure it out, any help is very appreciated.
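A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that the corpus has fewer documents than the minimum cluster size, so the neighbour query inside HDBSCAN asks for more points than exist. The sketch below shows an early check for that situation. `check_corpus_size` is a hypothetical helper, and the value 30 is taken from the `min_topic_size` default in the `__init__` signature quoted earlier on this page; it may differ between versions.

```python
def check_corpus_size(n_documents, min_topic_size):
    """Raise early if the corpus is too small for the requested cluster size."""
    if n_documents < min_topic_size:
        raise ValueError(
            f"{n_documents} documents given, but min_topic_size={min_topic_size}; "
            "add documents or lower min_topic_size"
        )


docs = ['[CLS]', '[UNK]', 'coffee', 'alcohol', '[UNK]', 'drunk',
        'cigarettes', 'smoking', 'drugs', '[SEP]']

try:
    check_corpus_size(len(docs), min_topic_size=30)  # assumed default
    message = ""
except ValueError as err:
    message = str(err)
```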
transform() and fit_transform() take about the same time to produce results. If I train the model, save it and load it again, it takes the same time to give predictions. How can I quickly get predictions once I have trained and saved a model?
Hi, I am wondering if there's a way to query a list of documents (either return ID or actual text) in each topic cluster?
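A workaround that needs no library support is to group the input documents by the topic list returned from fit_transform, assuming that list is in the same order as the documents. `documents_per_topic` is a hypothetical helper and the data below is made up.

```python
from collections import defaultdict


def documents_per_topic(docs, topics):
    """Map each topic id to the (index, text) pairs assigned to it."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for index, (doc, topic) in enumerate(zip(docs, topics)):
        grouped[topic].append((index, doc))
    return dict(grouped)


docs = ["cats purr", "dogs bark", "stocks fell", "kittens play"]
topics = [0, 0, 1, 0]
by_topic = documents_per_topic(docs, topics)

# Indices only, for one topic.
indices_topic_0 = [index for index, _ in by_topic[0]]
```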
I am running OS X 10.11.6.
$ rustc --version
rustc 1.46.0
$ cargo --version
cargo 1.45.1
...
error[E0554]: `#![feature]` may not be used on the stable release channel
--> /Users/davidlaxer/.cargo/registry/src/github.com-1ecc6299db9ec823/lock_api-0.3.4/src/lib.rs:91:34
|
91 | #![cfg_attr(feature = "nightly", feature(const_fn))]
| ^^^^^^^^^^^^^^^^^
Hi,
When I run the model with the same data and the same parameters, I get different clusters. Is there a way to fix random state for reproducibility?
Thanks
In
BERTopic/bertopic/_bertopic.py
Line 892 in 1ec0313
languages
list. But lowercase is applied first and languages
list elements start with uppercase, so it is not possible to initialize BERTopic with any language from this list besides English (as another if case handles this).
Hi everyone,
when I try to install bertopic on Windows (pip install bertopic) I get an error. The problem arises on this line: " Building wheels for collected packages: hdbscan".
The next line is: Building wheel for hdbscan (PEP 517) ... error
...
ERROR: Failed building wheel for hdbscan
Failed to build hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly.
I tried to use different python version (3.5 - 3.6 - 3.7) but nothing has changed.
Did someone have the same problem and solved it?
Thank you all,
Andrea.
Hi! Thanks for developing this awesome library!
I have a question regarding text preprocessing.
From what I understand, the model takes List[str] as an input - basically a list of full-text documents.
But do we need to preprocess the texts somehow before passing them into the model?
With LDA, I usually preprocess texts (tokenize, lemmatize, remove stopwords, create n-grams, etc.) before running models. But since we're dealing with word embeddings, keeping all words in their original form is important for the context, right?
So I'm not sure how to proceed: should I use a list of preprocessed words as input, leave the texts untouched, or do something in between (keeping each text as a string but without stopwords, etc.)?
Hi Maarten,
First of all, thank you for this great tool and insight!
Am I missing something in the docs? How can I retrieve the indices of the docs belonging to a given cluster number? If this is not implemented yet, is there a quick workaround for this?
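One possible workaround is to group document indices by their assigned topic. This is a minimal sketch, assuming the topics list returned by fit_transform has one entry per input document, in input order. The function name docs_per_topic and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def docs_per_topic(topics: List[int]) -> Dict[int, List[int]]:
    """Map each topic id to the indices of the documents assigned to it."""
    grouped: Dict[int, List[int]] = defaultdict(list)
    for doc_index, topic in enumerate(topics):
        grouped[topic].append(doc_index)
    return dict(grouped)


# Hypothetical data standing in for `topics, _ = model.fit_transform(docs)`.
docs = ["doc a", "doc b", "doc c", "doc d"]
topics = [0, -1, 0, 1]

index_by_topic = docs_per_topic(topics)
print(index_by_topic[0])                     # [0, 2]
print([docs[i] for i in index_by_topic[0]])  # ['doc a', 'doc c']
```

The same mapping gives either the IDs or the actual texts, depending on whether the indices are used directly or looked up in the original list.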
Hi,
I see that there is an option for a tqdm progress bar in _extract_embeddings but no actual access to it. As the dataset I am working on is quite big, I would like to have an estimate of how long things are going to take.
Would it be possible to enable the progress bar from fit_transform?
I don't know if it is possible, but can you add progress bars for the other steps as well?
Best
Karol
For large corpora of documents, extracting BERT embeddings will take a long time.
Parallelizing it would be a sweet feature.
Hi Maarten!
Firstly, congratulations on your work.
I would like to suggest that the BERTopic "_plotly_topic_visualization" function called from "visualize_topics()" not only show the Plotly figure but also return it as a variable. This would be useful because, with the figure in a variable, the user can export it as HTML, PDF, etc. In my case, I would like to embed the figure in a dashboard.
Besides, I would suggest allowing the user to access the parameters used during topic modelling (i.e., UMAP, HDBSCAN, Plotly visualization).
When using big data, it becomes infeasible to hold everything in memory at once.
Would it be possible to iterate over the data rather than hold it in memory?
It might also help to expose the n_jobs parameter for UMAP so that the user has some control over the number of cores and therefore the memory consumed.
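To illustrate what iterating over the data could look like, here is a minimal sketch of lazy batching for the embedding step only. It assumes documents arrive as an iterable and that embeddings can be computed batch by batch; fake_embed is a hypothetical stand-in for a real embedding model, and whether the later dimensionality-reduction and clustering steps can work incrementally is not addressed here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(documents: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size documents without loading them all."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding model: one vector per text."""
    return [[float(len(text))] for text in batch]


# A generator stands in for documents streamed from disk.
stream = (f"document {i}" for i in range(5))
embeddings: List[List[float]] = []
for batch in batched(stream, batch_size=2):
    embeddings.extend(fake_embed(batch))

print(len(embeddings))  # 5
```

Only one batch of raw text is held in memory at a time, although the resulting embeddings are still accumulated in full.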
Hi,
thanks for your amazing work!
However, I currently still have some problems getting good results.
I want to use BERTopic on the Kaggle arXiv abstract dataset https://www.kaggle.com/Cornell-University/arxiv
It is a dataset that contains the abstract of each paper on arXiv: 1796908 abstracts in total, but I am using only 1/4 of them due to hardware constraints, so 449227 abstracts. The raw data is a list of dicts, each containing fields like author, title, abstract, etc., but I am only using the abstracts themselves.
My current results are sadly not what I expected. Here is the output of model.get_topics():
##################################
[('withdrawn', 0.12732245199899253), ('arxiv', 0.060818479638394804), ('author', 0.045282397053936205), ('been', 0.043582983757148634), ('paper', 0.04331377340066525), ('authors', 0.03602908119595011), ('has', 0.03413129955351502), ('discussion', 0.020046558277271205), ('version', 0.017570171724893863), ('error', 0.016245558058635576), ('due', 0.016034569088373845), ('4002', 0.015203603208166275), ('article', 0.015178787241213468), ('mcshane', 0.014512825764984364), ('1104', 0.013893798421411663), ('crucial', 0.012724570309587551), ('wyner', 0.011639183974558176), ('proxies', 0.011545341114998098), ('please', 0.011257392365683372), ('0804', 0.010829445454730597)]
##################################
[('withdrawn', 1.2378088374383105), ('been', 0.33161619452791685), ('paper', 0.2815599045047751), ('has', 0.2598521946696877), ('administratively', 0.037473809176819514), ('article', 0.035331755008345955), ('retracted', 0.032019088876856915), ('abstract', 0.03194552517951581), ('withdraw', 0.03023297555105207), ('submission', 0.025366426781504862), ('mistake', 0.024461766176310584), ('rewriting', 0.02108034213126981), ('want', 0.019899921380598113), ('this', 0.018769808691909386), ('shorter', 0.01690150874372038), ('comment', 0.01634104139337519), ('probably', 0.016086126678481083), ('applicable', 0.015210457362549933), ('modification', 0.014865572450063681), ('longer', 0.014582146616905768)]
##################################
[('isotopes', 0.2558417644790476), ('thirty', 0.22683223596981578), ('refereed', 0.19469496374394987), ('publication', 0.1454113600287234), ('isotope', 0.14061126024476908), ('brief', 0.11791781235255983), ('identification', 0.10522952641115933), ('discovery', 0.09816511283302375), ('summary', 0.08730514775501705), ('synopsis', 0.07636960227187568), ('production', 0.07302537404971297), ('discussed', 0.06821321035793458), ('including', 0.06506837933191251), ('presented', 0.06352529575007038), ('twenty', 0.057115384937416365), ('eight', 0.05686672933874315), ('each', 0.054793906411334324), ('far', 0.05099118599417039), ('minerals', 0.04545668048089448), ('observed', 0.04482361175054437)]
##################################
[('withdrawn', 0.8102220016577751), ('author', 0.5882810125654714), ('been', 0.21955498849750296), ('paper', 0.1935837075655028), ('has', 0.17516982841654938), ('pourmohammad', 0.08015698196117896), ('ali', 0.0605915947645577), ('seemann', 0.027628108579230422), ('eqn', 0.02270159607399226), ('admin', 0.022251661779234076), ('request', 0.01530721857357427), ('by', 0.013408498518361402), ('this', 0.012868228621997208), ('modification', 0.010599219818655269), ('authors', 0.010375596329419137), ('arxiv', 0.010147235836885003), ('km', 0.008868127574553979), ('due', 0.0053505630213037635), ('first', 0.004174426091124793), ('at', 0.0013290983743134937)]
##################################
[('de', 0.16471413053677894), ('la', 0.08859824729535025), ('un', 0.07960098252808794), ('en', 0.07656758724369946), ('des', 0.07493017494049987), ('une', 0.0685045487905329), ('est', 0.06506619811186878), ('nous', 0.0552461357202294), ('que', 0.0505970341092853), ('dans', 0.04833955653453861), ('pour', 0.04773108415278024), ('les', 0.04405246081293785), ('et', 0.04269515259835229), ('sur', 0.0425329786858753), ('caract', 0.034373522683066204), ('le', 0.03028301508609669), ('es', 0.029084319074609982), ('ees', 0.028840836535619835), ('cette', 0.023815804070613532), ('eme', 0.023083284220080345)]
##################################
[('model', 0.0029213211816859273), ('two', 0.0029181910122442487), ('it', 0.002917764978256985), ('can', 0.002911863114896897), ('these', 0.002900525114119986), ('our', 0.0028719993646575373), ('show', 0.0028703487897916566), ('results', 0.002862179058792491), ('also', 0.0028543448650093332), ('field', 0.002807120623162036), ('have', 0.0027961151449595436), ('using', 0.002780531524136966), ('between', 0.0027687481202621554), ('or', 0.002762760512175864), ('one', 0.0027467154286522437), ('time', 0.002741766841704294), ('energy', 0.0027274038973420667), ('data', 0.0026880146639130568), ('quantum', 0.0026615769324125527), ('such', 0.002660012066337066)]
##################################
[('withdrawn', 0.25382672910326465), ('arxiv', 0.1859321307804051), ('author', 0.10424878083447243), ('been', 0.09053395420662566), ('paper', 0.0810609839954744), ('has', 0.06435182608618999), ('version', 0.05760067485755415), ('authors', 0.05258540866955608), ('superseded', 0.04795847515378754), ('replaced', 0.043281616461112844), ('merged', 0.03743469139047417), ('0804', 0.03698167270671404), ('1008', 0.03085884218486115), ('because', 0.030835129343355642), ('0812', 0.023350724849799137), ('0901', 0.022639659892192746), ('revised', 0.02187828357506638), ('1306v6', 0.021542196549539806), ('submission', 0.02115347128457661), ('3484', 0.020833341465434623)]
##################################
[('withdrawn', 0.3174814989979916), ('author', 0.11776777433239484), ('been', 0.08872075679275165), ('paper', 0.08199984676632026), ('due', 0.08048541635175363), ('has', 0.07026260094965996), ('error', 0.05649266661391457), ('authors', 0.05390187230801506), ('arxiv', 0.051928726372487306), ('because', 0.034956486686744344), ('mistake', 0.032456919238108894), ('crucial', 0.0322646808282665), ('submission', 0.029450103971990518), ('administrators', 0.02776402639935557), ('admin', 0.024154968037124056), ('proof', 0.02232748069916814), ('errors', 0.017136278392237612), ('lemma', 0.015641869372412024), ('copyright', 0.015397080186955915), ('theorem', 0.014757663158028029)]
As you can see, the extracted topics are rather poor and not what I had hoped for.
Can you give me some advice on why this is not working and what I should fine-tune?
Best
Karol
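Several of the topics above are dominated by words such as "withdrawn", "retracted" and "superseded", which suggests (the thread does not confirm it) that the corpus contains many short withdrawal notices rather than real abstracts. Here is a minimal sketch, assuming that is the case, of filtering such entries before fitting. The keyword pattern, the max_words threshold and the function name are all assumptions to be adjusted after inspecting the data.

```python
import re
from typing import List

# Assumed keywords, taken from the topic words shown above.
WITHDRAWAL_PATTERN = re.compile(
    r"\b(withdrawn|retracted|superseded)\b", re.IGNORECASE
)


def drop_withdrawal_notices(abstracts: List[str], max_words: int = 40) -> List[str]:
    """Remove short abstracts that look like withdrawal notices."""
    kept = []
    for abstract in abstracts:
        is_short = len(abstract.split()) <= max_words
        # Only short texts are dropped, so a long abstract that merely
        # mentions one of the keywords is kept.
        if is_short and WITHDRAWAL_PATTERN.search(abstract):
            continue
        kept.append(abstract)
    return kept


sample = [
    "This paper has been withdrawn by the author due to a crucial error.",
    "We study quantum field models and show results for energy spectra.",
]
print(drop_withdrawal_notices(sample))
# ['We study quantum field models and show results for energy spectra.']
```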
I keep having the same installation error, related to the numba package. See error below:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-160-1d2f5f7c9d67> in <module>
----> 1 from bertopic import BERTopic
/opt/anaconda3/lib/python3.7/site-packages/bertopic/__init__.py in <module>
----> 1 from bertopic._bertopic import BERTopic
2 from bertopic._ctfidf import ClassTFIDF
3 from bertopic._embeddings import languages
4
5 __version__ = "0.4.3"
/opt/anaconda3/lib/python3.7/site-packages/bertopic/_bertopic.py in <module>
10
11 # Models
---> 12 import umap
13 import hdbscan
14 from sentence_transformers import SentenceTransformer
/opt/anaconda3/lib/python3.7/site-packages/umap/__init__.py in <module>
1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
3
4 try:
5 with catch_warnings():
/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py in <module>
45 )
46
---> 47 from pynndescent import NNDescent
48 from pynndescent.distances import named_distances as pynn_named_distances
49 from pynndescent.sparse import sparse_named_distances as pynn_sparse_named_distances
/opt/anaconda3/lib/python3.7/site-packages/pynndescent/__init__.py in <module>
1 import pkg_resources
2 import numba
----> 3 from .pynndescent_ import NNDescent, PyNNDescentTransformer
4
5 # Workaround: https://github.com/numba/numba/issues/3341
/opt/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in <module>
19 import heapq
20
---> 21 import pynndescent.sparse as sparse
22 import pynndescent.sparse_nndescent as sparse_nnd
23 import pynndescent.distances as pynnd_dist
/opt/anaconda3/lib/python3.7/site-packages/pynndescent/sparse.py in <module>
8 import numba
9
---> 10 from pynndescent.utils import norm, tau_rand
11 from pynndescent.distances import kantorovich
12
/opt/anaconda3/lib/python3.7/site-packages/pynndescent/utils.py in <module>
6
7 import numba
----> 8 from numba.core import types
9 from numba.experimental import structref
10 import numpy as np
ModuleNotFoundError: No module named 'numba.core'
I am running on macOS Big Sur. Package versions:
bertopic==0.4.3
conda==4.9.2
numba==0.52.0
umap-learn==0.5.0
Python==3.7.6
I've already done a lot of searching on the internet but can't find any solution. Does somebody have the same problem or any idea how to solve this?
Thanks in advance!
Hi, firstly thank you so much for this library! :)
I am interested in performing topic modelling using tweets related to COVID-19, and I was wondering if it is possible to integrate the CT-BERT model (from https://github.com/digitalepidemiologylab/covid-twitter-bert) into BERTopic? And if it is indeed possible, how can I go about doing so?
Your help would be very much appreciated! Thank you in advance.
When we use the method reduce_topics(), it mutates the given probabilities parameter, which becomes identical to the returned probabilities. It would be better if it did not mutate the given one, so that there are two different sets of probabilities, before and after.
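Until the behaviour changes, a caller-side workaround is to pass a copy. This sketch uses plain lists and a hypothetical function, reduce_in_place, that merely imitates the in-place mutation described above; it is not the real reduce_topics(). With NumPy arrays, probabilities.copy() plays the same role as copy.deepcopy here.

```python
import copy
from typing import List


def reduce_in_place(probabilities: List[List[float]]) -> List[List[float]]:
    """Hypothetical stand-in that mutates its argument, as described above."""
    for row in probabilities:
        row.pop()  # drop the last topic column in place
    return probabilities


before = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]

# Passing a deep copy keeps the original probabilities intact.
after = reduce_in_place(copy.deepcopy(before))

print(before)  # [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(after)   # [[0.1, 0.7], [0.6, 0.3]]
```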
Hi Maarten
Thx again for this great package and the 0.5 release is just amazing.
Regarding Flair vs SentenceTransformer, maybe it could be interesting to always use Flair, even for SentenceTransformer:
Flair has a top-level class DocumentEmbeddings and several implementations, among them:
TransformerDocumentEmbeddings
SentenceTransformerDocumentEmbeddings
DocumentTFIDFEmbeddings
DocumentPoolEmbeddings
What do you think?
Best regards
Olivier