Comments (9)

Ch4s3 commented on May 29, 2024

This is a known issue when using LSI without rb-gsl. There is a bug in our matrix SVD that I haven't been able to track down. In the meantime, using rb-gsl should fix the issue.

from classifier-reborn.

tra38 commented on May 29, 2024

I was already using rb-gsl beforehand (which I needed to do since the pure-Ruby version was so slow). Using binding.pry, I determined that $GSL is set to true after I require the classifier-reborn library, yet I'm still getting the error that I pointed out above. I'll see if I can debug the issue...

tra38 commented on May 29, 2024

So here's Day 1 of me trying to solve this problem (and writing down notes so that I remember when I come back to this problem again). When I call LSI#search, I first construct a ContentNode (using my search term). Here's an example of a ContentNode:

=> #<ClassifierReborn::ContentNode:0x007fafe415b468
 @categories=[],
 @lsi_norm=nil,
 @lsi_vector=nil,
 @raw_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={:cat=>1}>

After doing so, I call ContentNode#proximity_norms_for_content, and that method checks the content node's @raw_norm, which is a GSL::Vector filled with NaNs... Obviously, doing operations on that vector is bound to cause an error.

The @raw_vector itself looks alright though (although it only gets used if I call ContentNode#proximity_array_for_content), and it won't error out when I try to do Matrix operations with it. I guess a workaround would be to just use ContentNode#proximity_array_for_content and accept the probably worse results. But that seems like a bad workaround. I'd be better off trying to see what is causing the @raw_norm to be corrupted. It's not using Ruby's standard library Matrix though...so at least we can confirm that GSL is being used on my machine. Ugh. I wish I had paid more attention to Calculus in college.

EDIT: I see what's corrupting the raw_norm. Normalize. After calculating the raw_vector, we then call normalize on that vector and save the result into @raw_norm. Except...

content_node.raw_vector
=> GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ]
content_node.raw_vector.normalize
=> GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ]

It seems that GSL::Vector#normalize is failing to normalize the raw_vector properly. I'm going to have to look at the GSL code to see why that might be the case. This rabbit hole is much deeper than I expected.

EDIT2: So I found out how to normalize a vector.

To normalize a vector
v1=(x0,y0,z0)
d=sqrt(x0 * x0 + y0 * y0 + z0 * z0)
x1=x0/d
y1=y0/d
z1=z0/d
this is your new normalized vector (x1,y1,z1)

Well, sqrt(0 * 0 + 0 * 0 + ... + 0 * 0) is going to be sqrt(0)...which is 0. So when we normalize a vector where every coordinate is 0, each coordinate becomes 0/0, which is NaN in floating-point arithmetic. So GSL::Vector's normalize isn't failing, it's Math itself that is betraying us.
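The arithmetic can be reproduced in plain Ruby without GSL. This is only a minimal sketch on arrays of floats; naive_normalize is a hypothetical helper written for illustration, not a method from classifier-reborn or rb-gsl.

```ruby
# Hypothetical helper: divide every component by the Euclidean length.
def naive_normalize(vector)
  length = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / length }
end

unit  = naive_normalize([3.0, 4.0])       # length is 5.0
zeros = naive_normalize([0.0, 0.0, 0.0])  # length is 0.0, so every entry is 0.0 / 0.0

puts unit.inspect        # [0.6, 0.8]
puts zeros.all?(&:nan?)  # true
```

Ruby does not raise on float division by zero, so the NaNs appear silently, much like the GSL output above.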

Would it be okay to write some code that checks whether a normalized vector contains any NaNs and then replaces every NaN with 0, so that vector multiplication can still occur properly? Or would doing so be seen as too hacky?
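To make the question concrete, here are two possible shapes for such a fix, sketched on plain Ruby arrays. Both scrub_nans and safe_normalize are hypothetical names; neither is taken from classifier-reborn, and the actual fix may look different.

```ruby
# Option A (hypothetical): normalize first, then replace any NaN with 0.0.
def scrub_nans(vector)
  vector.map { |x| x.nan? ? 0.0 : x }
end

# Option B (hypothetical): check the length up front and skip the division
# entirely when the vector has zero length.
def safe_normalize(vector)
  length = Math.sqrt(vector.sum { |x| x * x })
  return vector.dup if length.zero?

  vector.map { |x| x / length }
end

nan_vector = [Float::NAN, Float::NAN]
puts scrub_nans(nan_vector).inspect        # [0.0, 0.0]
puts safe_normalize([0.0, 0.0]).inspect    # [0.0, 0.0]
puts safe_normalize([3.0, 4.0]).inspect    # [0.6, 0.8]
```

Option B never creates the NaNs in the first place, which may feel less hacky. Either way, the caller still has to decide what a similarity of 0 against every document should mean.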

EDIT3:
StackOverflow Link: How do you normalize a zero-vector?, with some possible yet unappealing answers to this dilemma. Fairly depressing stuff.

Ch4s3 commented on May 29, 2024

"I wish I had paid more attention to Calculus in college"
Same.

Yeah, some form of this problem has been causing issues since the first versions. If you're willing to try and put together a PR with some normalization and NaN handling, that would be amazing!

Ch4s3 commented on May 29, 2024

I'd be curious to know more about the input that's causing that error.

tra38 commented on May 29, 2024

Here is the source code of the input that was causing the error.

Ch4s3 commented on May 29, 2024

@tra38 sorry for the delay.

So I was able to reproduce the issue with the following searches:

array = lsi.search("we",9)
array = lsi.search("we can p",9)

But, array = lsi.search("we can predict", 9) works. :(

Maybe we can catch this error, and respond with a sensible message.

tra38 commented on May 29, 2024

array = lsi.search("we",9)
array = lsi.search("we can p",9)

But, array = lsi.search("we can predict", 9) works. :(

This makes logical sense (for a computer, I mean). Please forgive me if this seems a bit too technical, but I wanted to write the following explanation down to clarify for myself what's going on:

"we" and "can" are stopwords, so obviously the computer will ignore them. "p" isn't a stopword, but ClassifierReborn::Hasher.word_hash_for_words only stores words that have more than 2 characters. "p" has 2 characters or fewer, so...we throw it away. The end result is that we are asking LSI to find a document that is similar to an empty hash, and obviously none of the documents we have are empty hashes. So we have NaN vectors and errors galore.
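The filtering described above can be sketched roughly as follows. This is a heavily simplified stand-in, not the library's Hasher: toy_word_hash is a hypothetical name, the stopword list is a tiny illustrative subset, and no stemming is done.

```ruby
# Hypothetical, simplified version of the hashing step described above.
def toy_word_hash(text)
  stopwords = %w[we can the a is] # illustrative subset only
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]+/).each do |word|
    next if stopwords.include?(word) # drop stopwords
    next unless word.length > 2      # drop words of 2 characters or fewer

    counts[word.to_sym] += 1
  end
  counts
end

puts toy_word_hash("we").empty?                    # true
puts toy_word_hash("we can p").empty?              # true
puts toy_word_hash("we can predict").keys.inspect  # [:predict]
```

Under these assumptions, "we" and "we can p" both reduce to an empty hash, while "we can predict" keeps one word.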

Interestingly, if I throw in the sentence "we can p" into my array of strings, LSI searching breaks entirely and you will be unable to search for anything. This is how the computer views the sentence "we can p":

=> #<ClassifierReborn::ContentNode:0x007f9aaa208248
 @categories=[],
 @lsi_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={}>

(This is the same ContentNode that is constructed if we use "we can p" as a search term instead of as a document.)

ContentNode#proximity_norms_for_content, the method we use for searching, multiplies the search term's content node's lsi_norm with every existing content node's lsi_norm to determine how similar each content node is to the search term. Since the lsi_norm in this ContentNode is all NaN, we can be assured that an error will occur. (This, by the way, is an in-depth explanation of how #64 can occur.)
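The way a NaN norm poisons a similarity score can be shown with a plain dot product. This sketch uses Ruby arrays rather than GSL vectors, and dot is a hypothetical helper, not the library's proximity code.

```ruby
# Hypothetical helper: plain dot product over two arrays of floats.
def dot(a, b)
  a.zip(b).inject(0.0) { |acc, (x, y)| acc + x * y }
end

query_norm = [Float::NAN, Float::NAN, Float::NAN] # "norm" of an all-zero vector
doc_norm   = [0.6, 0.0, 0.8]                      # an ordinary unit vector

score = dot(query_norm, doc_norm)
puts score.nan?                 # true: NaN times anything is NaN, and so is NaN plus anything
puts score == score             # false: NaN is not equal to itself
puts((score <=> 0.5).inspect)   # nil: NaN cannot be ordered against other floats
```

Because the score cannot be compared with other scores, any step that ranks results by similarity has nothing sensible to work with.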

On the other hand...

array = lsi.search("we can predict", 9)

Well, "predict" isn't a stopword, and it has more than 2 characters. So a new ContentNode can be created, which only includes the word "predict":

#<ClassifierReborn::ContentNode:0x007f8b131d8df0
 @categories=[],
 @lsi_norm=GSL::Vector
[ 5.018e-03 5.836e-03 9.885e-04 -2.808e-02 5.018e-03 5.044e-02 -2.720e-03 ... ],
 @lsi_vector=GSL::Vector
[ 3.550e-03 4.129e-03 6.994e-04 -1.987e-02 3.550e-03 3.569e-02 -1.924e-03 ... ],
 @raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={:predict=>1}>

Since the lsi_norm here is not full of NaNs, searching works as normal.

...However, keep in mind that to replicate the error that led to me posting this issue, you need to type this out...

array = lsi.search("dogs", 9)

We only just discovered more bugs in the system.

So there are two major issues to worry about, then.

  1. Dealing with invalid search terms...where the word doesn't appear in the corpus at all
  2. Dealing with "invalid" documents...where the document is composed of all stop words.

Number 2 is more of an edge case, since the larger the document, the more likely it is that there will be words that aren't stop words. So I'll probably focus on dealing with Number 1 (normalization and NaN handling).
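As a rough illustration of guarding against both cases, here is a toy bag-of-words search. It is not classifier-reborn's LSI (there is no SVD and no stemming), and term_vector, unit_or_nil and toy_search are all hypothetical names made up for this sketch.

```ruby
# Count how often each vocabulary term occurs in the text.
def term_vector(text, vocabulary)
  words = text.downcase.scan(/[a-z]+/)
  vocabulary.map { |term| words.count(term).to_f }
end

# Return a unit vector, or nil when the vector has zero length.
def unit_or_nil(vector)
  length = Math.sqrt(vector.sum { |x| x * x })
  length.zero? ? nil : vector.map { |x| x / length }
end

def toy_search(documents, vocabulary, query, max_results)
  query_norm = unit_or_nil(term_vector(query, vocabulary))
  return [] if query_norm.nil? # issue 1: no query word appears in the corpus

  scored = documents.map do |doc|
    doc_norm = unit_or_nil(term_vector(doc, vocabulary))
    next if doc_norm.nil? # issue 2: document has no indexable words

    [doc, query_norm.zip(doc_norm).sum { |q, d| q * d }]
  end
  scored.compact.sort_by { |_, score| -score }.first(max_results).map(&:first)
end

vocabulary = %w[cats predict weather]
documents  = ["cats predict weather", "we can", "weather weather"]

puts toy_search(documents, vocabulary, "dogs", 9).inspect     # []
puts toy_search(documents, vocabulary, "weather", 9).inspect  # ["weather weather", "cats predict weather"]
```

Returning an empty result for an unknown term and skipping empty documents is only one possible policy; raising a descriptive error would be another.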

Ch4s3 commented on May 29, 2024

Fixed by #77.
