Comments (9)
This is a known issue when using LSI without rb-gsl. There is a bug in our matrix SVD that I haven't been able to track down. In the meantime, using rb-gsl should fix the issue.
from classifier-reborn.
I was already using rb-gsl beforehand (which I needed to do, since the pure-Ruby version was so slow). Using binding.pry, I determined that $GSL is set to true after I require the classifier-reborn library, yet I'm still getting the error I pointed out above. I'll see if I can debug the issue...
from classifier-reborn.
So here's Day 1 of me trying to solve this problem (and writing down notes so that I remember them when I come back to this problem again). When I call LSI#search, I first construct a ContentNode (using my search term). Here's an example of a ContentNode:
=> #<ClassifierReborn::ContentNode:0x007fafe415b468
@categories=[],
@lsi_norm=nil,
@lsi_vector=nil,
@raw_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:cat=>1}>
After I did so, I then call ContentNode#proximity_norms_for_content, and that method checks the content node's @raw_norm, which is a GSL::Vector filled with NaNs... Obviously, doing operations on that vector is bound to cause an error.
The @raw_vector itself seems to look alright though (although it only gets used if I call ContentNode#proximity_array_for_content), and it won't error out when I try to do matrix operations with it. I guess a workaround for me is to just use ContentNode#proximity_array_for_content and accept the probably worse results. But that seems like a bad workaround; I'd be better off trying to see what is corrupting the @raw_norm. It's not using Ruby's standard library Matrix though... so at least we can confirm that GSL is being used on my machine. Ugh. I wish I had paid more attention to Calculus in college.
EDIT: I see what's corrupting the @raw_norm: normalize. After calculating the @raw_vector, we then call normalize on that vector and save the result into @raw_norm. Except...
content_node.raw_vector
=> GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ]
content_node.raw_vector.normalize
=> GSL::Vector
[ nan nan nan nan nan nan nan ... ]
It seems that GSL::Vector#normalize is failing to normalize the @raw_vector properly. I'm going to have to look at the GSL code to see why that might be the case. This rabbit hole is much deeper than I expected.
EDIT2: So I found out how to normalize a vector.

To normalize a vector v = (x0, y0, z0):

d = sqrt(x0 * x0 + y0 * y0 + z0 * z0)
x1 = x0 / d
y1 = y0 / d
z1 = z0 / d

The new normalized vector is (x1, y1, z1).
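A minimal plain-Ruby sketch of that formula (no GSL required; `normalize` here is my own illustrative helper, not the actual GSL::Vector#normalize):

```ruby
# Plain-Ruby sketch of vector normalization: divide each component
# by the vector's Euclidean length. For the zero vector the length
# is 0.0, so every component becomes 0.0 / 0.0, which is Float::NAN.
def normalize(vector)
  d = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / d }
end

p normalize([3.0, 4.0])      # => [0.6, 0.8]
p normalize([0.0, 0.0, 0.0]) # => [NaN, NaN, NaN]
```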
Well, sqrt(0 * 0 + 0 * 0 + ... + 0 * 0) is going to be sqrt(0)... which is 0. And so, when we normalize a vector where every coordinate is 0, we'll be computing 0/0 for every component. So GSL::Vector#normalize isn't failing; it's math itself that is betraying us.
Would it be okay to write some code to check if a normalized vector has any "NaNs", and then replacing all instances of NaNs with 0s, so that vector multiplication can still occur properly? Or would doing so be seen as too hacky?
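For what it's worth, the check being proposed could look something like this (a hypothetical helper I'm sketching here, not anything that exists in classifier-reborn today):

```ruby
# Hypothetical workaround: after normalizing, swap any NaN components
# for 0.0 so later vector multiplications and sorts don't blow up.
# Whether silently zeroing is acceptable is exactly the open question.
def zero_out_nans(vector)
  vector.map { |x| x.nan? ? 0.0 : x }
end

p zero_out_nans([Float::NAN, 0.5, Float::NAN]) # => [0.0, 0.5, 0.0]
```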
EDIT3: Stack Overflow link: "How do you normalize a zero-vector?", with some possible yet unappealing answers to this dilemma. Fairly depressing stuff.
from classifier-reborn.
I wish I had paid more attention to Calculus in college
same.
Yeah, some form of this problem has been causing issues since the first versions. If you're willing to try to put together a PR with some normalization and NaN handling, that would be amazing!
from classifier-reborn.
I'd be curious to know more about the input that's causing that error.
from classifier-reborn.
Here is the source code of the input that was causing the error.
from classifier-reborn.
@tra38 sorry for the delay.
So I was able to reproduce the issue with the following searches:
array = lsi.search("we",9)
array = lsi.search("we can p",9)
But array = lsi.search("we can predict", 9) works. :(
Maybe we can catch this error, and respond with a sensible message.
from classifier-reborn.
array = lsi.search("we",9)
array = lsi.search("we can p",9)
But, array = lsi.search("we can predict", 9) works. :(
This makes logical sense (for a computer, I mean). Please forgive me if this seems a bit too technical, but I wanted to write the following explanation down to clarify for myself what's going on:
"we" and "can" are stopwords, so obviously the computer will ignore them. "p" isn't a stopword, but ClassifierReborn::Hasher.word_hash_for_words only stores words that have more than 2 characters. "p" has 2 characters or fewer, so... we throw it away. The end result is that we are asking LSI to find a document that is similar to an empty hash, and obviously none of the documents we have are empty hashes. So we get NaN vectors and errors galore.
Interestingly, if I throw in the sentence "we can p" into my array of strings, LSI searching breaks entirely and you will be unable to search for anything. This is how the computer views the sentence "we can p":
=> #<ClassifierReborn::ContentNode:0x007f9aaa208248
@categories=[],
@lsi_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={}>
(This is the same ContentNode that is constructed if we use "we can p" as a search term instead of as a document.)
ContentNode#proximity_norms_for_content, the method we use for searching, will multiply the search term's content node's lsi_norm with every existing content node's lsi_norm, to determine how similar each content node is to the search term. And since the lsi_norm in this ContentNode is all NaN, we can be assured that an error will occur. (This, by the way, is an in-depth explanation of how #64 can occur.)
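The resulting failure mode is easy to reproduce in plain Ruby, without GSL at all: an ordered comparison against NaN returns nil, so sorting a list of similarity scores that contains a NaN raises ArgumentError, which is presumably the kind of failure behind #64:

```ruby
# NaN poisons ordered comparisons: Float#<=> returns nil against NaN,
# so Array#sort raises ArgumentError as soon as it compares one.
scores = [0.42, Float::NAN, 0.13]

begin
  scores.sort
rescue ArgumentError => e
  puts "sort failed: #{e.message}"
end
```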
On the other hand...
array = lsi.search("we can predict", 9)
Well, "predict" isn't a stopword. It's a word that has more than 2 characters. So a new Content Node can be created, which only includes the word "predict":
#<ClassifierReborn::ContentNode:0x007f8b131d8df0
@categories=[],
@lsi_norm=GSL::Vector
[ 5.018e-03 5.836e-03 9.885e-04 -2.808e-02 5.018e-03 5.044e-02 -2.720e-03 ... ],
@lsi_vector=GSL::Vector
[ 3.550e-03 4.129e-03 6.994e-04 -1.987e-02 3.550e-03 3.569e-02 -1.924e-03 ... ],
@raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:predict=>1}>
Since the lsi_norm here is not full of NaNs, searching works as normal.
...However, keep in mind that to replicate the error that led me to post this issue in the first place, you need to type this out...
array = lsi.search("dogs", 9)
We've only just discovered more bugs in the system.
So there are two major issues to worry about then.
- Dealing with invalid search terms... where the word doesn't appear in the corpus at all
- Dealing with "invalid" documents... where the document is composed entirely of stop words.
Number 2 is more of an edge case, since the larger the document, the more likely it is to contain words that aren't stop words. So I'll probably focus on dealing with Number 1 (normalization and NaN handling).
from classifier-reborn.
fixed by #77
from classifier-reborn.