Comments (9)
This is a known issue when using LSI without rb-gsl. There is a bug in our matrix SVD that I haven't been able to track down. In the meantime, using rb-gsl should fix the issue.
from classifier-reborn.
I was already using rb-gsl beforehand (which I needed to do, since the pure-Ruby version was so slow). Using binding.pry, I determined that $GSL is set to true after I require the classifier-reborn library, yet I'm still getting the error I pointed out above. I'll see if I can debug the issue...
from classifier-reborn.
So here's Day 1 of me trying to solve this problem (and writing down notes so that I remember them when I come back to this problem again). When I call LSI#search, I first construct a ContentNode (using my search term). Here's an example of a ContentNode:
=> #<ClassifierReborn::ContentNode:0x007fafe415b468
@categories=[],
@lsi_norm=nil,
@lsi_vector=nil,
@raw_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:cat=>1}>
After I did so, I then call ContentNode#proximity_norms_for_content, and that method checks the content node's @raw_norm, which is a GSL::Vector filled with NaNs... Obviously, doing operations on that vector is bound to cause an error.
The @raw_vector itself seems to look alright though (although it only gets used if I call ContentNode#proximity_array_for_content), and it won't error out when I try to do matrix operations with it. I guess a workaround for me is to just use ContentNode#proximity_array_for_content and accept the probably worse results. But that seems like a bad workaround; I'd be better off trying to see what is corrupting the @raw_norm. It's not using Ruby's standard library Matrix though... so at least we can confirm that GSL is being used on my machine. Ugh. I wish I had paid more attention to Calculus in college.
EDIT: I see what's corrupting the @raw_norm: normalize. After calculating the @raw_vector, we then call normalize on that vector and save the result into @raw_norm. Except...
content_node.raw_vector
=> GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ]
content_node.raw_vector.normalize
=> GSL::Vector
[ nan nan nan nan nan nan nan ... ]
It seems that GSL::Vector#normalize is failing to normalize the @raw_vector properly. I'm going to have to look at the GSL code to see why that might be the case. This rabbit hole is much deeper than I expected.
EDIT2: So I found out how to normalize a vector.

To normalize a vector v = (x0, y0, z0):

d = sqrt(x0 * x0 + y0 * y0 + z0 * z0)
x1 = x0 / d
y1 = y0 / d
z1 = z0 / d

The new normalized vector is (x1, y1, z1).
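A minimal plain-Ruby sketch of that formula (no GSL required; `normalize` here is my own illustrative helper, not the actual GSL::Vector#normalize):

```ruby
# Plain-Ruby sketch of vector normalization: divide each component
# by the vector's Euclidean length. For the zero vector the length
# is 0.0, so every component becomes 0.0 / 0.0, which is Float::NAN.
def normalize(vector)
  d = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / d }
end

p normalize([3.0, 4.0])      # => [0.6, 0.8]
p normalize([0.0, 0.0, 0.0]) # => [NaN, NaN, NaN]
```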
Well, sqrt(0 * 0 + 0 * 0 + ... + 0 * 0) is going to be sqrt(0)... which is 0. And so, when we normalize a vector where every coordinate is 0, we'll be computing 0/0 for every component. So GSL::Vector#normalize isn't failing; it's math itself that is betraying us.
Would it be okay to write some code to check if a normalized vector has any "NaNs", and then replacing all instances of NaNs with 0s, so that vector multiplication can still occur properly? Or would doing so be seen as too hacky?
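For what it's worth, the check being proposed could look something like this (a hypothetical helper I'm sketching here, not anything that exists in classifier-reborn today):

```ruby
# Hypothetical workaround: after normalizing, swap any NaN components
# for 0.0 so later vector multiplications and sorts don't blow up.
# Whether silently zeroing is acceptable is exactly the open question.
def zero_out_nans(vector)
  vector.map { |x| x.nan? ? 0.0 : x }
end

p zero_out_nans([Float::NAN, 0.5, Float::NAN]) # => [0.0, 0.5, 0.0]
```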
EDIT3: Stack Overflow link: "How do you normalize a zero-vector?", with some possible yet unappealing answers to this dilemma. Fairly depressing stuff.
from classifier-reborn.
I wish I had paid more attention to Calculus in college
same.
Yeah, some form of this problem has been causing issues since the first versions. If you're willing to try to put together a PR with some normalization and NaN handling, that would be amazing!
from classifier-reborn.
I'd be curious to know more about the input that's causing that error.
from classifier-reborn.
Here is the source code of the input that was causing the error.
from classifier-reborn.
@tra38 sorry for the delay.
So I was able to reproduce the issue with the following searches:
array = lsi.search("we",9)
array = lsi.search("we can p",9)
But array = lsi.search("we can predict", 9) works. :(
Maybe we can catch this error, and respond with a sensible message.
from classifier-reborn.
array = lsi.search("we",9)
array = lsi.search("we can p",9)
But, array = lsi.search("we can predict", 9) works. :(
This makes logical sense (for a computer, I mean). Please forgive me if this seems a bit too technical, but I wanted to write the following explanation down to clarify for myself what's going on:
"we" and "can" are stopwords, so obviously the computer will ignore them. "p" isn't a stopword, but ClassifierReborn::Hasher.word_hash_for_words only stores words that have more than 2 characters. "p" has 2 characters or fewer, so... we throw it away. The end result is that we are asking LSI to find a document that is similar to an empty hash, and obviously none of the documents we have are empty hashes. So we get NaN vectors and errors galore.
Interestingly, if I throw in the sentence "we can p" into my array of strings, LSI searching breaks entirely and you will be unable to search for anything. This is how the computer views the sentence "we can p":
=> #<ClassifierReborn::ContentNode:0x007f9aaa208248
@categories=[],
@lsi_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={}>
(This is the same ContentNode that is constructed if we use "we can p" as a search term instead of as a document.)
ContentNode#proximity_norms_for_content, the method we use for searching, will multiply the search term's content node's lsi_norm with every existing content node's lsi_norm, to determine how similar each content node is to the search term. And since the lsi_norm in this ContentNode is all NaN, we can be assured that an error will occur. (This, by the way, is an in-depth explanation of how #64 can occur.)
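The resulting failure mode is easy to reproduce in plain Ruby, without GSL at all: an ordered comparison against NaN returns nil, so sorting a list of similarity scores that contains a NaN raises ArgumentError, which is presumably the kind of failure behind #64:

```ruby
# NaN poisons ordered comparisons: Float#<=> returns nil against NaN,
# so Array#sort raises ArgumentError as soon as it compares one.
scores = [0.42, Float::NAN, 0.13]

begin
  scores.sort
rescue ArgumentError => e
  puts "sort failed: #{e.message}"
end
```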
On the other hand...
array = lsi.search("we can predict", 9)
Well, "predict" isn't a stopword. It's a word that has more than 2 characters. So a new Content Node can be created, which only includes the word "predict":
#<ClassifierReborn::ContentNode:0x007f8b131d8df0
@categories=[],
@lsi_norm=GSL::Vector
[ 5.018e-03 5.836e-03 9.885e-04 -2.808e-02 5.018e-03 5.044e-02 -2.720e-03 ... ],
@lsi_vector=GSL::Vector
[ 3.550e-03 4.129e-03 6.994e-04 -1.987e-02 3.550e-03 3.569e-02 -1.924e-03 ... ],
@raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:predict=>1}>
Since the lsi_norm here is not full of NaNs, searching works as normal.
...However, keep in mind that to replicate the error that led me to post this issue in the first place, you need to type this out...
array = lsi.search("dogs", 9)
We've only just discovered more bugs in the system.
So there are two major issues to worry about then.
- Dealing with invalid search terms... where the word doesn't appear in the corpus at all
- Dealing with "invalid" documents... where the document is composed entirely of stop words.
Number 2 is more of an edge case, since the larger the document, the more likely it is to contain words that aren't stop words. So I'll probably focus on dealing with Number 1 (normalization and NaN handling).
from classifier-reborn.
fixed by #77
from classifier-reborn.