Comments (21)
It approaches 0 as the match becomes a better fit.
from classifier-reborn.
@Ch4s3 Well then, is this a bug?
b = ClassifierReborn::Bayes.new 'Normal'
b.enable_threshold
b.train_normal("This is really normal")
# Tests
b.classify_with_score("This is not")
=> ["Normal", 0.0]
b.classify_with_score("This is really normal")
=> ["Normal", -1.3862943611198906]
Why is "This is not" matched better to "This is really normal" than "This is really normal" to itself?
Sorry, I had it backwards. The more negative the score, the better the fit. So a score of -11.0 would be a better fit than -8.2. A better way of putting it would be to say that the greater the absolute value of the score, the greater the fit.
Again, sorry about the mistake and any ensuing confusion.
@Ch4s3 No problem.
@MadBomber do you think, then, that the threshold should be "reversed"? Right now it says "anything higher than the threshold is not normal" (as in my example), but, because the closer the match the higher the negative number, the threshold should say "anything lower than the threshold is not normal".
Does this make sense?
looking at the docs
# Returns the scores in each category the provided +text+. E.g.,
# b.classifications "I hate bad words and you"
# => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
# The largest of these scores (the one closest to 0) is the one picked out by #classify
That does seem confusing.
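For what it's worth, the selection rule the docstring describes can be sketched without the gem: given a hash of category to log score, #classify picks the key whose value is largest, i.e. closest to zero. A minimal standalone sketch (the scores hash is copied from the docstring above; this is an illustration of the rule, not the gem's code):

```ruby
# Category => log score, copied from the docstring above.
scores = {
  "Uninteresting" => -12.6997928013932,
  "Interesting"   => -18.4206807439524
}

# classify effectively picks the category whose score is largest,
# i.e. the one closest to zero (all scores are negative logs).
best_category, _best_score = scores.max_by { |_category, score| score }

puts best_category # prints "Uninteresting"
```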
Yeah, it should be the other way around, shouldn't it?
This is how the README explains it.
#### Knowing the Score
When you ask a bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category. The higher the score the closer the fit your text has with that category. The category with the highest score is returned as the best matching category.
The use of negative numbers can be confusing to some people. We know (right?) that -12 is higher (maybe larger?) than -18.
Maybe we need to take a look at how we generate the score and map it to a probability. Explaining 0% to 100% in terms of a match is much easier. It is unambiguous to say that "this text" is a 100% match to this category. The higher the probability the better the match.
Adjusting the score this way will slow down the process. I can't say by how much. Is it worth it to try? This is a major change that would have a wide impact.
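One way the score-to-probability mapping @MadBomber floats could look (a sketch of the idea only, NOT how the gem currently works): exponentiate the per-category log scores and normalize, i.e. a softmax. The example scores are illustrative:

```ruby
# Sketch of one possible score-to-probability mapping (not the gem's
# current behavior): exponentiate the log scores and normalize them.
def scores_to_probabilities(scores)
  max = scores.values.max # subtract the max for numerical stability
  exps = scores.transform_values { |s| Math.exp(s - max) }
  total = exps.values.sum
  exps.transform_values { |e| e / total }
end

probs = scores_to_probabilities("Uninteresting" => -12.7, "Interesting" => -18.4)
# The values sum to 1.0, and the category closest to zero gets the
# larger share, so "a 97% match" style reporting becomes possible.
```

The exponentiation per category is cheap relative to tokenizing and scoring, so the slowdown concern may be smaller than it first appears, but that would need measuring.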
@MadBomber I'd rather just clarify the documentation. It works as expected, and IMO a slowdown isn't really worth it.
Then the meaning of the score is that the larger the absolute value, the better the fit? A score of zero is a definite no-match? And a score of Float::INFINITY is the same as zero?
I just took a dataset of 109,582 English words and trained each word against the category English. I randomly selected 1,124 of these training words to test with classify_with_score keeping each score in an array. Then I ran descriptive_statistics against that array of scores. Here is what I got:
{
:number => 1124.0,
:variance => 0.6600894439244815,
:standard_deviation => 0.8124588875287669,
:min => -11.602839292439027,
:max => 0.0,
:mean => -10.497146916065935,
:mode => -10.909692111879082,
:median => -10.504227003770918,
:range => 11.602839292439027,
:q1 => -10.909692111879082,
:q2 => -10.504227003770918,
:q3 => -9.993401380004928
}
I then took the gibberish word 'hdfiqdof' and got a score of -13.905424385433074.
From the outside looking in these stats tell me that the closer to zero the better the match.
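The descriptive_statistics call above can be roughly approximated in plain Ruby. A sketch with a tiny illustrative score array (not the real 1,124 samples):

```ruby
# Rough stand-in for the descriptive_statistics run above, using toy
# scores instead of the real 1,124 samples.
scores = [-11.6, -10.9, -10.5, -9.9, 0.0]

stats = {
  number: scores.length,
  min:    scores.min,
  max:    scores.max,
  mean:   scores.sum / scores.length.to_f,
  range:  scores.max - scores.min
}
# Same shape as the real run: max is 0.0 (the empty/all-stopword case)
# while genuine matches cluster well below it.
```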
@MadBomber that appears to be correct. So there may be a bug with either classify or classify_with_score, per @bararchy's original example.
@bararchy I took some time to look at your example in detail. Here is what I found. First, we have to remember that part of the training and classification process involves the removal of STOPWORDS for the language specified. The default language is 'en', so in your counter-example "this is not" all three words are stop words. That makes your text the same as an empty string. An empty string always returns a score of 0.0.
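A minimal illustration of the stopword effect described here. The three-word stopword list is a tiny stand-in for the gem's real 'en' list, just enough to reproduce the behavior:

```ruby
# Illustrative stopword filtering: with the default 'en' list, every
# token of "This is not" is removed before scoring. This tiny list is a
# stand-in for the gem's much larger one.
STOPWORDS = %w[this is not].freeze

def meaningful_tokens(text)
  text.downcase.scan(/\w+/) - STOPWORDS
end

meaningful_tokens("This is not")           # => [] - nothing left to classify
meaningful_tokens("This is really normal") # => ["really", "normal"]
```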
The second thing that I noticed is that you used b.enable_threshold and then used classify_with_score to get your result. The threshold feature is only applied to the classify method. I have thought about extending it into the other two classifier methods, but decided it was not necessary because the user application is given the score to do with as it pleases.
This example has in fact turned up what I think is a bug. A score of 0.0 looks to me like it is a mis-match for the category in the same way that Float::INFINITY is a mis-match.
@MadBomber Great catch, I hadn't considered stop words. Two things come to mind: we should handle strings that are only/mostly stop words, and we should handle scores of 0.0 differently.
And we need to document that.
Pinging about this issue :)
Thanks
I'll put it on my holiday to do list
I will also revisit the topic. A few weeks ago I took a couple of hours to review the sensitivity to training sets exposed in the Bayes score. At that time I could not come up with sufficiently general verbiage to explain what actually goes on in a Bayesian classification. The algorithm used in classifier-reborn is very basic. There is a suitability issue that should be in the docs that describes the kinds of applications to which this approach is best applied.
@MadBomber If I use classify_with_score instead of threshold, then let's say I do:
normal = "1234"
test1 = "123abc"
test2 = "abc"
classify_with_score("1234")
=> "Normal"
classify_with_score(test1)
=> 100~
classify_with_score(test2)
=> 0 || nil
Then for sure I can just do: if score > 50 => normal, if score < 50 => not normal. Right?
This is kind of the way threshold works, no?
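The decision rule being proposed here can be sketched as a simple cutoff comparison. The cutoff value is arbitrary and illustrative, not the gem's actual threshold default, and the comparison direction follows the "closer to zero is a better fit" reading:

```ruby
# Sketch of the proposed decision rule: treat any log score at or below
# an illustrative cutoff as "not normal". -10.0 is arbitrary, not the
# gem's actual threshold default.
CUTOFF = -10.0

def normal?(score)
  score > CUTOFF # closer to zero = better fit
end

normal?(-5.7)  # => true  (close to zero, good match)
normal?(-19.8) # => false (far from zero, poor match)
```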
So, the more I use classify_with_score the more I see stuff that makes no sense to me.
Example:
ai_overlord = ClassifierReborn::Bayes.new 'normal_activity'
ai_overlord.train_normal_activity("GET / HTTP/1.2")
=> ["GET / HTTP/1.2"]
ai_overlord.classify_with_score("get / http/1.lol")
=> ["Normal activity", -8.047189562170502]
ai_overlord.classify_with_score("get / http/1.2")
=> ["Normal activity", -5.744604469176456]
ai_overlord.classify_with_score("get / http/1.1")
=> ["Normal activity", -8.047189562170502]
ai_overlord.classify_with_score("get / http/1.?!@?!")
=> ["Normal activity", -19.78325857845494]
Why does get / http/1.2 resemble get / http/1.2 less than get / http/1.?!@?! does? This makes no sense to me :/
The only way this makes sense is if the closer to 0, the more likely the string matches the trained one.
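A toy per-token log scorer may explain the pattern in these numbers (a deliberate simplification, not the gem's exact math): every token contributes a log probability, and tokens never seen in training contribute a much harsher penalty, so the more unseen tokens a string has, the further its total drifts from zero. The 0.2 and 0.01 probabilities below are purely illustrative:

```ruby
# Toy word-level log scorer (a simplification, not the gem's formula):
# known tokens add a mild log penalty, unseen tokens a harsh one, so
# strings full of garbage tokens end up far from zero.
KNOWN_LOGPROB   = Math.log(0.2)  # illustrative per-token probability
UNKNOWN_LOGPROB = Math.log(0.01) # illustrative penalty for unseen tokens

def toy_score(tokens, vocabulary)
  tokens.sum { |t| vocabulary.include?(t) ? KNOWN_LOGPROB : UNKNOWN_LOGPROB }
end

vocab = %w[get http 1 2]
exact   = toy_score(%w[get http 1 2], vocab)   # all tokens known
garbled = toy_score(%w[get http 1 lol], vocab) # one unseen token
# exact is closer to zero than garbled, matching the scores above
```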
@bararchy My guess is that the stemmer is doing something odd with your input.