
Comments (21)

radhikasethi2011 commented on August 11, 2024

Two more issues.
I have a dataset of 1k rows (each row is a document, in bag-of-words form), but on running it through GMM, after tokenization it says:

processed 483 documents

And when I try some other models, for example the GMNTM model, I get this error:

Tokenizing ...
Using HanLP tokenizer
100% 864/864 [07:23<00:00,  1.95it/s]
Processed 0 documents.
Traceback (most recent call last):
  File "GMNTM_run.py", line 85, in <module>
    main()
  File "GMNTM_run.py", line 60, in main
    no_above = docSet.topk_dfs(topk=20)
  File "/content/Neural_Topic_Models/dataset.py", line 115, in topk_dfs
    return 1.0*dfs_topk[-1][-1]/ndoc
IndexError: list index out of range

Please help me understand why this is happening.
@zll17

from neural_topic_models.

zll17 commented on August 11, 2024

For issue 1, to specify the English tokenizer, you may need to change the default argument lang='zh' to lang='en' (dataset.py line 23, in the __init__ of the DocDataset class). This change was made by my collaborator, and I forgot to update the README at the same time. Sorry for that.
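As a sketch of what this amounts to (the real DocDataset takes many more arguments; only the lang default is shown here, and this signature is a hypothetical simplification, not the repository's exact code):

```python
# Minimal sketch of the relevant part of DocDataset (dataset.py);
# only the `lang` default matters here -- it controls which tokenizer is used.
class DocDataset:
    def __init__(self, taskname, lang="en"):  # default was lang="zh"
        self.taskname = taskname
        self.lang = lang

# Alternatively, pass lang explicitly instead of editing the default:
ds = DocDataset("my_task", lang="en")
```

Passing lang='en' explicitly at the call site avoids touching dataset.py at all.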

radhikasethi2011 commented on August 11, 2024

For issue 1, to specify the English tokenizer, you may need to change the default argument lang='zh' to lang='en' (dataset.py line 23, in the __init__ of the DocDataset class). This change was made by my collaborator, and I forgot to update the README at the same time. Sorry for that.

Got it!
I have a dataset that is already in bag-of-words format.
So essentially I want to skip all the preprocessing and feed it directly to the model.
When I fed the bag-of-words data to GSM.py, the topic names in the output were incomplete.

For example, a topic name 'AKS15' would be displayed as only 'AKS'.
Is this because I ran it through tokenization? @zll17

zll17 commented on August 11, 2024

For issue 2, this happens because after filtering out stopwords, some documents are left with no words and become empty, and those documents are not counted as 'processed'. That is why it reports only 483 documents processed while there are actually 1k docs. This usually happens when some documents are very short or the stopword dictionary covers too many words.

Regarding the index-out-of-range error: the 'no_below' or 'no_above' threshold used to filter out words by document frequency (dfs) is not set properly, so too many words are removed and the whole document set becomes empty (this can happen with a small dataset). To fix it, loosen the restriction, for example set 'no_below=0' and 'no_above=0.6', and do not use the 'auto_adj' argument.
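To see why overly strict thresholds can empty the whole vocabulary (and thus make dfs_topk an empty list, so indexing it raises IndexError), here is a minimal, self-contained sketch of document-frequency filtering; filter_vocab and the toy documents are hypothetical illustrations, not the repository's actual code:

```python
from collections import Counter

def filter_vocab(docs, no_below=0, no_above=0.6):
    """Keep words whose document frequency df satisfies
    no_below <= df <= no_above * ndoc (hypothetical helper)."""
    ndoc = len(docs)
    dfs = Counter(word for doc in docs for word in set(doc))
    return {w for w, df in dfs.items() if no_below <= df <= no_above * ndoc}

docs = [["aks15", "tp53"], ["tp53", "brca1"], ["tp53"]]
loose = filter_vocab(docs, no_below=0, no_above=0.7)   # keeps the rare words
strict = filter_vocab(docs, no_below=2, no_above=0.5)  # removes everything
```

With no_below=2 and no_above=0.5, every word on this toy corpus is filtered out (rare words fail no_below, frequent ones fail no_above), which is exactly the situation where the remaining document set is empty.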

zll17 commented on August 11, 2024

I am not sure what your situation is. By 'topic names', do you mean 'topic words', i.e. the words displayed during training? If you use the provided tokenizer, which relies on the Spacy module, that could be the reason the topic words are transformed, since Spacy does some stemming.

zll17 commented on August 11, 2024

Improving the filtering strategy to make the models more robust seems like a valuable idea. I will fix that.

radhikasethi2011 commented on August 11, 2024

I am not sure what your situation is. By 'topic names', do you mean 'topic words', i.e. the words displayed during training? If you use the provided tokenizer, which relies on the Spacy module, that could be the reason the topic words are transformed, since Spacy does some stemming.

Yes, I mean topic words.
I'm working with genetic data, and I have the dataset in bag-of-words format
(rows are the transcription factors, columns are the proteins; if a protein exists for a particular genotype, the transcription factor's entry is set to 1, else 0).

Now the proteins have names such as 'AKS15', i.e. alphanumeric words. So the output (topic words) should contain the entire alphanumeric string, right? But on running the models I get output like 'AKS' (missing the numeric part).

Any idea how I can fix this? @zll17

One more question: since I already have my data in bag-of-words format, how do I skip the data preprocessing steps?

zll17 commented on August 11, 2024

Two options. One is to convert the genetic data into text, e.g. list all the protein names of one transcription factor on a line, separated by spaces. Customize the tokenizer and remove the Spacy modules; just use a simple split(' ') for tokenization. This will also fix the 'incomplete name' problem. Do not employ the 'no_below', 'no_above', or 'auto_adj' strategy; you can even remove the related code in the dataset.py script.

The second option requires modifying the DocDataset class a lot. You can comment out the processing steps in the __init__ method, but you need to preserve self.docs, self.bows, self.vocabulary, and the other necessary variables. Note that the dictionary can no longer be a gensim Dictionary; you need to implement it as a class with a token2id method. The key point is to load your BOW data into the variable self.bows, which is the co-occurrence matrix, and to modify the __getitem__ method. This approach requires more work, but it will indeed improve efficiency.
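A minimal sketch of the first option's tokenizer, assuming the tokenizer only needs to turn each line into a list of tokens (the class name and interface here are hypothetical, not the repository's exact API):

```python
class WhitespaceTokenizer:
    """Hypothetical replacement for the Spacy-based tokenizer: split on
    whitespace only, so alphanumeric names like 'AKS15' stay intact."""
    def tokenize(self, lines):
        return [line.strip().split() for line in lines]

tok = WhitespaceTokenizer()
tokens = tok.tokenize(["AKS15 TP53", "  BRCA1 "])
```

Because there is no stemming or lemmatization step, a token such as 'AKS15' passes through unchanged, which also fixes the truncated-name problem described above.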

radhikasethi2011 commented on August 11, 2024

Two options. One is to convert the genetic data into text, e.g. list all the protein names of one transcription factor on a line, separated by spaces. Customize the tokenizer and remove the Spacy modules; just use a simple split(' ') for tokenization. This will also fix the 'incomplete name' problem. Do not employ the 'no_below', 'no_above', or 'auto_adj' strategy; you can even remove the related code in the dataset.py script.

The second option requires modifying the DocDataset class a lot. You can comment out the processing steps in the __init__ method, but you need to preserve self.docs, self.bows, self.vocabulary, and the other necessary variables. Note that the dictionary can no longer be a gensim Dictionary; you need to implement it as a class with a token2id method. The key point is to load your BOW data into the variable self.bows, which is the co-occurrence matrix, and to modify the __getitem__ method. This approach requires more work, but it will indeed improve efficiency.

Following the first option, I modified the tokenizer this way:
[screenshot of the modified tokenizer]

The modified repo is here: https://github.com/radhikasethi2011/Neural_Topic_Models
I pushed the dataset as 'edge_allencode_lines.txt'.

On running it for NVDM_GSM, I always run into

processed 0 documents 

[screenshot of the error]
@zll17

radhikasethi2011 commented on August 11, 2024

Basically, the BOW is not being generated here. I'm just trying to understand why this is happening and how I can fix it. @zll17

xDarkLemon commented on August 11, 2024

@radhikasethi2011 Did you set the --rebuild argument after you modified the tokenizer? After your first run with the incorrect tokenizer, the dictionary and BOW were saved as empty. To save processing time, by default it uses the processed dataset on the following runs, so you need to set --rebuild to make the new tokenizer process the data again.

radhikasethi2011 commented on August 11, 2024

@xDarkLemon Ah yes, I also tried factory-resetting my runtime and running again.
I modified tokenizer.py here: https://github.com/radhikasethi2011/Neural_Topic_Models
It now only splits by space, with no other processing. (Added the data, edge_allencode_lines.txt, to the data folder too.)

Same errors: processed 0 documents.
Am I doing this right?

Edit: to expand,
I converted my BOW data to a txt file, with each gene and a row of its respective proteins. It can be found here.

And I tried to run it through the models again with this tokenization process.

xDarkLemon commented on August 11, 2024

@radhikasethi2011 I have tried the code on your dataset and got processed files. I am using the tokenizer that splits by space, and I have pushed the code to a new branch here: tokenization.py. I only ran the DocDataset creation code and got the processed files, which are not empty:
[screenshot: Screen Shot 2021-06-09 at 15 24 17]
And I got something like this as the dictionary file:
[screenshot: Screen Shot 2021-06-09 at 15 50 55]

radhikasethi2011 commented on August 11, 2024

Ohh.
Okay, I cloned the issue10 branch to run it again:
[screenshot of the error]
Any idea why?

I am trying your method to generate the dictionary file now.
@xDarkLemon

radhikasethi2011 commented on August 11, 2024

I also wanted to ask one more thing.
Suppose I wanted to find the cosine similarity between the documents, i.e., for each document in the dataset, the similarity between the raw vector and the latent vector (after processing). Is there a way I can generate the latent vectorized data once my raw data has run through the model?

I could then find the cosine similarity between the raw and the latent vector to understand the data.
@xDarkLemon @zll17

zll17 commented on August 11, 2024

I've tried to run the GSM model on your data, and the preprocessing step works fine (although it hit an OOM error on my laptop due to the overly large vocabulary):

[run screenshot]

Did you add the arguments '--no_below 0' and '--no_above 0.9'? It looks like you are using the default configuration, which may not suit your data.

Another thing I've noticed is that your text data is not converted well. I am not familiar with biology, but there seem to be three kinds of noise in your data:

  1. The name of a transcription factor concatenated with the name of a protein:

[noise screenshot 1]

  2. Extra brackets and quotes at the ends of lines:

[noise screenshot 2]

  3. Mixed use of tabs and spaces to separate words:

[noise screenshot 3]

This noise will result in a wrong vocabulary, so it would be a good idea to clean your text data first.

I made an illustration of the right process to convert your BOW data to text data, according to my understanding:
[BOW-to-text illustration]
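The conversion the illustration describes can be sketched as follows; bow_rows_to_lines, the toy matrix, and the protein names are hypothetical examples, assuming a binary BOW matrix with rows as transcription factors and columns as proteins:

```python
def bow_rows_to_lines(matrix, vocab):
    """Turn a binary BOW matrix into one whitespace-separated text line
    per row, listing the names of the columns that are set to 1."""
    return [" ".join(name for name, flag in zip(vocab, row) if flag)
            for row in matrix]

vocab = ["AKS15", "TP53", "BRCA1"]   # column names (proteins)
matrix = [[1, 0, 1],                 # one row per transcription factor
          [0, 1, 0]]
lines = bow_rows_to_lines(matrix, vocab)
```

Each output line then contains only space-separated protein names, avoiding the concatenation, bracket, and tab issues listed above.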

zll17 commented on August 11, 2024

What do you mean by "latent vector (after processing)"? Do you mean the "topic distribution of a document" or "a document's latent representation"?

Yes, you can calculate cosine similarities between documents, just by using the inference method provided by the model. However, you can only compute similarities between vectors in the same space, i.e. between raw vector and raw vector, or between latent vector and latent vector; you cannot compute the similarity between a raw vector and a latent vector, since they have different dimensions.
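For reference, the cosine similarity between two vectors of the same dimension can be computed like this (a plain-Python sketch, not code from the repository):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Compare two documents in the same space, e.g. two topic distributions:
sim = cosine([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

A raw BOW vector (vocabulary-sized) and a latent vector (topic-sized) have different lengths, so comparing them this way is meaningless, as the comment above explains.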

radhikasethi2011 commented on August 11, 2024

I've tried to run the GSM model on your data, and the preprocessing step works fine (although it hit an OOM error on my laptop due to the overly large vocabulary).

Did you add the arguments '--no_below 0' and '--no_above 0.9'? It looks like you are using the default configuration, which may not suit your data.

My bad, I forgot to add the no_below & no_above conditions.

Another thing I've noticed is that your text data is not converted well. I am not familiar with biology, but there seem to be three kinds of noise in your data:

1. The name of a transcription factor concatenated with the name of a protein:

The idea behind this was to have the TF with a protein in each row, but yes, I understand now that this would be noise too.

2. Extra brackets and quotes at the ends of lines:
3. Mixed use of tabs and spaces to separate words:

This noise will result in a wrong vocabulary, so it would be a good idea to clean your text data first.

I made an illustration of the right process to convert your BOW data to text data, according to my understanding:

Thank you for this, understood. Working on it.

radhikasethi2011 commented on August 11, 2024

What do you mean by "latent vector (after processing)"? Do you mean the "topic distribution of a document" or "a document's latent representation"?

Yes, you can calculate cosine similarities between documents, just by using the inference method provided by the model. However, you can only compute similarities between vectors in the same space, i.e. between raw vector and raw vector, or between latent vector and latent vector; you cannot compute the similarity between a raw vector and a latent vector, since they have different dimensions.

Right. So what I understand now is:

x = similarity(raw, raw), y = similarity(latent, latent)
and then plot scatter plots of (x, y) for each document.

By "inference method", do you mean the inference.py file?

radhikasethi2011 commented on August 11, 2024

On using the inference method provided by the model:
[screenshot]
I gave it the path to the text data (not clean as of now, but I will get to that).

Is there another way I should be using the file to get the cosine similarity for all documents in the data (between latent vectors, e.g. the cosine similarity between one vector and the next)? @zll17 @xDarkLemon
