Comments (21)
Two more issues.
I have a dataset of 1k rows (each row is a document, bag-of-words), but on running it through GMM, after tokenization it reports:
processed 483 documents
And on trying some other models, for example the GMNTM model, I get this error:
Tokenizing ...
Using HanLP tokenizer
100% 864/864 [07:23<00:00, 1.95it/s]
Processed 0 documents.
Traceback (most recent call last):
File "GMNTM_run.py", line 85, in <module>
main()
File "GMNTM_run.py", line 60, in main
no_above = docSet.topk_dfs(topk=20)
File "/content/Neural_Topic_Models/dataset.py", line 115, in topk_dfs
return 1.0*dfs_topk[-1][-1]/ndoc
IndexError: list index out of range
Please help me understand why this is happening.
@zll17
from neural_topic_models.
For issue 1, to specify the English tokenizer, you may need to change the default argument lang='zh' to lang='en' (dataset.py line 23: the init of the DocDataset class). This change was made by my collaborator, and I forgot to update the README document at the same time. Sorry for that.
Got it!
I have a dataset that is already in bag-of-words format.
So essentially I want to skip all the preprocessing and be able to input it directly to the model.
When I input the bag-of-words data to GSM.py, the topic names in the output were incomplete.
For example, a topic name 'AKS15' would be displayed as just 'AKS'.
Is this because I ran it through tokenisation? @zll17
For issue 2, this happened because after filtering out stopwords, some documents are left with no words and become empty, and those documents are not counted as 'processed'. That is why it reported only 483 documents processed while there are actually 1k docs. This usually happens when some documents are very short or the stopwords dictionary covers too many words.
Regarding the index-out-of-range error: this is because the 'no_below' or 'no_above' threshold for filtering out words with high document frequency (dfs) is not properly set, so too many words are removed and the whole document set becomes empty (this can happen on a small dataset). To fix it, you can loosen the restriction, for example set 'no_below=0' and 'no_above=0.6', and do not use the 'auto_adj' argument.
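To make those thresholds concrete, here is a self-contained sketch (not the repo's actual code; the function name and data are illustrative) of the kind of filtering that 'no_below' and 'no_above' control:

```python
# Sketch of no_below / no_above filtering: drop words that appear in
# fewer than `no_below` documents, or in more than `no_above` (as a
# fraction) of all documents.
from collections import Counter

def filter_vocab(docs, no_below=0, no_above=0.6):
    ndoc = len(docs)
    dfs = Counter()                      # document frequency per word
    for doc in docs:
        dfs.update(set(doc))
    return {w for w, df in dfs.items()
            if df >= no_below and df / ndoc <= no_above}

docs = [["aks15", "tp53"], ["tp53", "brca1"], ["tp53"]]
vocab = filter_vocab(docs, no_below=0, no_above=0.6)
# 'tp53' appears in all 3 docs (df/ndoc = 1.0 > 0.6), so it is dropped;
# with overly strict thresholds the vocabulary can easily end up empty,
# which is what triggers the IndexError in topk_dfs.
```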
I am not sure what your situation is. When you say 'topic names', do you mean the 'topic words' that are displayed during training? If you use the provided tokenizer, which adopts the spaCy module, that might be the reason for the change in form of the topic words, since spaCy does some stemming.
It seems a valuable idea to improve the filtering strategy to make the models more robust. I will fix that.
> I am not sure what your situation is. When you say 'topic names', do you mean the 'topic words' that are displayed during training? If you use the provided tokenizer, which adopts the spaCy module, that might be the reason for the change in form of the topic words, since spaCy does some stemming.
Yes, I mean topic words.
I'm working with genetic data, and I have the dataset in a bag-of-words format
(rows are transcription factors, columns are proteins; if a protein exists in a particular genotype, the entry is set to 1, else 0).
Now the proteins have names such as 'AKS15', i.e. alphanumeric strings. So the output topic words should contain the entire alphanumeric string, right? But on running the models I get output like 'AKS' (missing the numeric part).
Any idea how I can fix this? @zll17
One more question I wanted to ask was, since I already have my data in a bag of words format, how do I skip the data preprocessing steps?
Two options. One is to convert the genetic data into text, e.g. list all the protein names of one transcription factor on a line, separated by spaces. Customize the tokenizer and remove the spaCy modules; just use a simple split(' ') for tokenization. This will also fix the 'incomplete name' problem. Do not use the 'no_below', 'no_above' or 'auto_adj' strategies; you can even remove the related code in the dataset.py script.
The second option requires modifying the DocDataset class a lot. You can comment out the processing steps in the init method, but you need to preserve self.docs, self.bows, self.vocabulary, and the other necessary variables. Note that the dictionary can no longer use gensim's Dictionary; you need to implement it as a class with a token2id mapping. The key point is to load your BOW data into the variable self.bows, which is the co-occurrence matrix, and to modify the getitem method. This approach requires more work, but it will indeed improve efficiency.
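As a rough, self-contained sketch of the pieces the first option needs (the class names here are illustrative, not the repo's actual classes): a tokenizer that only splits on whitespace, plus a minimal dictionary exposing a token2id mapping like gensim's Dictionary does:

```python
# Space-only tokenizer: nothing stems, lowercases, or strips digits,
# so alphanumeric names like 'AKS15' survive intact.
class SpaceTokenizer:
    def tokenize(self, lines):
        return [line.split() for line in lines]

# Minimal stand-in for gensim's Dictionary: just a token2id mapping
# assigned in first-seen order, plus the reverse lookup.
class SimpleDictionary:
    def __init__(self, docs):
        self.token2id = {}
        for doc in docs:
            for tok in doc:
                self.token2id.setdefault(tok, len(self.token2id))
        self.id2token = {i: t for t, i in self.token2id.items()}

lines = ["AKS15 TP53 BRCA1", "TP53 MYC"]
docs = SpaceTokenizer().tokenize(lines)
vocab = SimpleDictionary(docs)
```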
Following the first option, I modified the tokenizer this way:
The modified repo is here - https://github.com/radhikasethi2011/Neural_Topic_Models
Pushed the dataset as 'edge_allencode_lines.txt'
On running it for NVDM_GSM, I always run into
processed 0 documents
basically, the BOW is not being generated here. I'm just trying to understand why this is happening and how I can fix it @zll17
@radhikasethi2011 Did you set the --rebuild argument after you modified the tokenizer? After your first run with the incorrect tokenizer, the dictionary and BOW were saved as empty. To save processing time, the following runs reuse the processed dataset by default, so you need to set --rebuild to make the new tokenizer process the data again.
@xDarkLemon Ahh yes, I also tried factory resetting my runtime and running again.
I modified tokenizer.py here - https://github.com/radhikasethi2011/Neural_Topic_Models
To only split by space, no other processing. (Added data - edge_allencode_lines.txt in data folder too)
Same errors, processed 0 documents.
Am I doing this right?
Edit: to expand,
I converted my BOW data to a txt file, with each gene and a row of its respective proteins. Can be found here.
And tried to run it through the models again with this tokenisation process
@radhikasethi2011 I have tried the code on your dataset and got processed files. I am using the tokenizer that splits by space, and I have pushed the code to a new branch here: tokenization.py. I was only running the DocDataset creation code and got the processed files, which are not empty:
And I have got something like this as the dictionary file:
Ohh.
Okay, I cloned the issue10 branch to run it again, but hit the same problem.
Any idea why?
Trying your method to generate the dictionary file now
@xDarkLemon
I also wanted to ask one more thing.
Suppose I wanted to find the cosine similarity between the documents, i.e., for each document in the dataset, the similarity between the raw vector and the latent vector (after processing). Is there a way I can generate the latent vectorized data once my raw data has run through the model?
I could then find the cosine similarity between the raw and the latent vector to understand the data.
@xDarkLemon @zll17
I've tried to run the GSM model on your data, and the preprocessing step works fine (although it hit an OOM error on my laptop due to the very large vocabulary size):
Did you add the arguments '--no_below 0' and '--no_above 0.9'? It looks like you are using the default configuration, which may not suit your data.
Another thing I've noticed is that your text data is not converted well. I am not familiar with biology, but there seem to be three kinds of noise in your data:
- The name of a transcription factor concatenated with the name of a protein:
- Extra brackets and quotes at the ends of lines:
- Mixed use of tabs and spaces to separate words:
I made an illustration to show the right process for converting your BOW data to text data, according to my understanding:
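For what it's worth, here is a minimal sketch of that conversion under an assumed data layout (the protein and TF names are made up): one output line per transcription factor, listing only the proteins whose entries are 1, joined by single spaces:

```python
# Convert a binary BOW matrix into text lines for the tokenizer:
# each row (transcription factor) becomes one document listing the
# names of its proteins, with no brackets, quotes, or tabs mixed in.
proteins = ["AKS15", "TP53", "BRCA1"]   # column names (illustrative)
bow = {                                  # row name -> 0/1 vector
    "TF1": [1, 0, 1],
    "TF2": [0, 1, 0],
}

lines = [" ".join(p for p, flag in zip(proteins, row) if flag)
         for row in bow.values()]
# TF1 -> "AKS15 BRCA1", TF2 -> "TP53"
```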
What do you mean by "latent vector (after processing)"? Do you mean the "topic distribution of a document" or "a document's latent representation"?
Yes, you can calculate cosine similarities between documents, just by using the inference method provided by the model. However, you can only compute similarities between vectors in the same space, i.e. between one raw vector and another, or between one latent vector and another; you cannot compute the similarity between a raw vector and a latent vector, since they have different dimensions.
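For reference, a minimal cosine-similarity sketch in plain Python (independent of the repo's API; the example vectors are made up), applicable to any pair of vectors in the same space:

```python
# Cosine similarity between two equal-length vectors; requiring equal
# length is exactly why raw-vs-latent comparisons are not possible.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = [0.9, 0.05, 0.05]   # e.g. topic distributions from inference
doc_b = [0.8, 0.1, 0.1]
sim = cosine(doc_a, doc_b)   # close to 1.0 for similar documents
```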
> I've tried to run the GSM model on your data, and the preprocessing step works fine (although it hit an OOM error on my laptop due to the very large vocabulary size):
> Did you add the arguments '--no_below 0' and '--no_above 0.9'? It looks like you are using the default configuration, which may not suit your data.

My bad, I forgot to add the no_below & no_above conditions.

> Another thing I've noticed is that your text data is not converted well. I am not familiar with biology, but there seem to be three kinds of noise in your data:
> 1. The name of a transcription factor concatenated with the name of a protein:

The idea behind this was to have the TF with a protein in each row, but yep, I understand now that would be noise too.

> 2. Extra brackets and quotes at the ends of lines:
> 3. Mixed use of tabs and spaces to separate words:
> These noises will result in the wrong vocabulary; therefore, it would be a good idea to clean your text data first.
> I made an illustration to show the right process for converting your BOW data to text data, according to my understanding:

Thank you for this, understood. Working on it.
> What do you mean by "latent vector (after processing)"? Do you mean the "topic distribution of a document" or "a document's latent representation"?
> Yes, you can calculate cosine similarities between documents, just by using the inference method provided by the model. However, you can only compute similarities between vectors in the same space, i.e. between one raw vector and another, or between one latent vector and another; you cannot compute the similarity between a raw vector and a latent vector, since they have different dimensions.
Right. So what I understood now is:
x = similarity(raw, raw), y = similarity(latent, latent),
and then plot scatter plots of (x, y) for each document.
By "inference method", do you mean the inference.py file?
On using the inference method provided by the model:
I gave it the path to the text data (not clean as of now, but I will get to that).
Is there another way I should be using the file to get the cosine similarity for all documents in the data (between latent vectors, e.g. the cosine similarity between one vector and the next)? @zll17 @xDarkLemon