Comments (4)
Hi @nassera2014!
Unfortunately, I do not know much about Arabic, so I am basing this on my experience with other languages that can be tokenized on whitespace. I am assuming you have your documents in a text file `arabic_documents.txt`, one document per line.
Preprocessing
First things first, you need to preprocess the documents.
We usually use our preprocessing pipeline (which removes stopwords and keeps only the 2000 most frequent tokens in the vocabulary), but you can run preprocessing however you prefer. However, I am not sure how words are tokenized in Arabic; we currently tokenize on whitespace, and Arabic might require more steps.
If whitespace tokenization is ok for Arabic, this snippet should be a good starting point:
```python
from contextualized_topic_models.utils.preprocessing import SimplePreprocessing

documents = [line.strip() for line in open("arabic_documents.txt").readlines()]
sp = SimplePreprocessing(documents, stopwords_language="arabic")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()
```
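For intuition, what whitespace-based preprocessing with stopword removal and a capped vocabulary amounts to can be sketched in plain Python. This is only an illustration of the idea, not the library's actual implementation, and the helper name `simple_preprocess` is made up:

```python
from collections import Counter

def simple_preprocess(documents, stopwords, vocab_size=2000):
    # Tokenize on whitespace and drop stopwords.
    tokenized = [[t for t in doc.split() if t not in stopwords]
                 for doc in documents]
    # Keep only the vocab_size most frequent tokens.
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = {t for t, _ in counts.most_common(vocab_size)}
    # Filter each document down to the retained vocabulary.
    preprocessed = [" ".join(t for t in doc if t in vocab) for doc in tokenized]
    return preprocessed, sorted(vocab)

docs = ["the cat sat", "the dog sat down"]
pre, vocab = simple_preprocess(docs, stopwords={"the"}, vocab_size=10)
```

Arabic may well need extra normalization steps (e.g. diacritic handling) before this kind of whitespace split is meaningful.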
Training
Then you can use the rest of our code to run the topic model. For this step, you need:
- the documents we prepared in the previous step
- an Arabic BERT model (I am using this to create the document representations)
```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare()  # create vocabulary and training data

# generate BERT document embeddings
training_bert = bert_embeddings_from_list(unpreprocessed_documents, "asafaya/bert-base-arabic")

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)
ctm.fit(training_dataset)  # run the model
```
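One constraint worth checking if you swap in a different model: the width of the embeddings produced for the documents must match the `bert_input_size` you pass to `CombinedTM`, which is 768 for a BERT-base model such as `asafaya/bert-base-arabic`. A shape-only sanity check with dummy data (numpy stands in for the real embeddings here):

```python
import numpy as np

n_docs, bert_input_size = 5, 768
# Stand-in for the output of bert_embeddings_from_list: one 768-d vector per document.
training_bert = np.zeros((n_docs, bert_input_size))

# The second dimension must equal the bert_input_size passed to CombinedTM.
assert training_bert.shape[1] == bert_input_size
```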
After the model has been fitted, you can get topics like this:
```python
ctm.get_topic_lists()
```
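This returns one list of top words per topic. With illustrative output (the words below are made up, not real model output), inspecting the topics might look like:

```python
# Hypothetical output of ctm.get_topic_lists() -- placeholder words for illustration
topics = [["economy", "market", "trade"],
          ["health", "hospital", "doctor"]]

for i, words in enumerate(topics):
    print(f"Topic {i}: {' '.join(words)}")
```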
I hope this helps, but feel free to ask questions if something's not clear :)
from contextualized-topic-models.
Closing this for now :) feel free to ping me again.
Thank you so much for your detailed and well-explained response. When I run the following program:
```python
arText = []
for post in posts:
    arText.append(post['text'])

documents = arText
# documents = [line.strip() for line in open("doc.txt").readlines()]
sp = SimplePreprocessing(documents, stopwords_language="arabic")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()
print("proce : ", preprocessed_documents)
print("unproc : ", unpreprocessed_documents)
print("vocab:", vocab)

from contextualized_topic_models_master.contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models_master.contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models_master.contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models_master.contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare()  # create vocabulary and training data

from transformers import AutoTokenizer, AutoModel

# generate BERT data
training_bert = bert_embeddings_from_list(unpreprocessed_documents, "aubmindlab/bert-base-arabertv01")
# asafaya/bert-base-arabic

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)
ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)
ctm.fit(training_dataset)  # run the model
print('topics : ', ctm.get_topics())
```
I get this error:
```
/home/nassera/PycharmProjects/Facebook/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
FileNotFoundError: [Errno 2] No such file or directory: '/home/nassera/.cache/torch/sentence_transformers/sbert.net_models_aubmindlab_bert-base-arabertv01'
```
Note that I use Python 3.6 on an Ubuntu 18.10 virtual machine, and my dataset is stored in a MongoDB database.
Thank you.
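For reference, since the posts come from MongoDB, a defensive way to build the document list is to skip posts whose text field is missing or empty, which also avoids passing blank documents to the preprocessor. The collection and field names below are assumptions, and the pymongo query is shown only in a comment:

```python
def posts_to_texts(posts, field="text"):
    # Keep only posts that actually have a non-empty string text field.
    return [p[field].strip() for p in posts
            if isinstance(p.get(field), str) and p[field].strip()]

# With pymongo this would be, for example:
# from pymongo import MongoClient
# posts = MongoClient()["facebook"]["posts"].find({}, {"text": 1})

# Illustrative documents, including ones that should be skipped:
posts = [{"text": "first post"}, {"no_text": 1}, {"text": "  "}, {"text": "second"}]
documents = posts_to_texts(posts)
```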
Hi!
We tried to replicate the code in a Colab notebook using AraBERT, and it seems to work. Please upgrade your versions of the contextualized-topic-models and sentence-transformers packages (`pip install -U contextualized-topic-models sentence-transformers`); this might solve your problem :)
Silvia