
Comments (21)

Ch4s3 avatar Ch4s3 commented on May 29, 2024

It approaches 0 as the match becomes a better fit.

from classifier-reborn.

bararchy avatar bararchy commented on May 29, 2024

@Ch4s3 Well then, is this a bug?

b = ClassifierReborn::Bayes.new 'Normal'
b.enable_threshold

b.train_normal("This is really normal")

# Tests

b.classify_with_score("This is not")
=> ["Normal", 0.0]

b.classify_with_score("This is really normal")
=> ["Normal", -1.3862943611198906]

Why does "This is not" match "This is really normal" better than "This is really normal" matches itself?


Ch4s3 avatar Ch4s3 commented on May 29, 2024

Sorry, I had it backwards. The more negative the score the better the fit. So a score of -11.0 would be a better fit than -8.2. A better way of putting it would be to say that the greater the absolute value of the score, the greater the fit.


Ch4s3 avatar Ch4s3 commented on May 29, 2024

Again, sorry about the mistake and any ensuing confusion.


bararchy avatar bararchy commented on May 29, 2024

@Ch4s3 No problem.

@MadBomber do you think the threshold should then be "reversed"? Right now it says "anything higher than the threshold is not 'normal'" (as in my example). But because the closer the match, the larger the negative number, the threshold should say "anything lower than the threshold is not 'normal'".

Does this make sense?


Ch4s3 avatar Ch4s3 commented on May 29, 2024

Looking at the docs:

# Returns the scores in each category the provided +text+. E.g.,
#    b.classifications "I hate bad words and you"
#    =>  {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
# The largest of these scores (the one closest to 0) is the one picked out by #classify

That does seem confusing.


bararchy avatar bararchy commented on May 29, 2024

Yeah, it should be the other way around, shouldn't it?


MadBomber avatar MadBomber commented on May 29, 2024

This is how the README explains it.

#### Knowing the Score

When you ask a bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category.  The higher the score the closer the fit your text has with that category.  The category with the highest score is returned as the best matching category.

The use of negative numbers can be confusing to some people. We know (right?) that -12 is higher (maybe larger?) than -18.

Maybe we need to take a look at how we generate the score and map it to a probability. Explaining 0% to 100% in terms of a match is much easier. It is unambiguous to say that "this text" is a 100% match to this category. The higher the probability the better the match.

Adjusting the score this way will slow down the process. I can't say by how much. Is it worth it to try? This is a major change that would have a wide impact.
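
One way to picture the score-to-probability mapping suggested above is a softmax over the per-category scores. This is only a sketch, assuming the scores behave like log-likelihoods (finite and usually negative); `normalize_scores` is a hypothetical helper and not part of classifier-reborn. The input hash reuses the example values from the `classifications` documentation.

```ruby
# Hypothetical helper (not part of classifier-reborn): turn log-scale
# scores into relative shares that sum to 1.0.
# Assumption: scores are finite and behave like log-likelihoods.
def normalize_scores(scores)
  return {} if scores.empty?

  max = scores.values.max
  # Subtract the largest score before exponentiating so that very
  # negative scores do not all underflow to 0.0.
  weights = scores.transform_values { |s| Math.exp(s - max) }
  total = weights.values.sum
  weights.transform_values { |w| w / total }
end

scores = { "Uninteresting" => -12.6997928013932, "Interesting" => -18.4206807439524 }
probs = normalize_scores(scores)
probs.each { |category, share| puts format("%-14s %.4f", category, share) }
```

Note that this yields a share relative to the other categories, not an absolute measure of fit: with a single trained category the result is always 1.0, so it would not help with a one-category setup like the 'Normal' example earlier in this thread.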


Ch4s3 avatar Ch4s3 commented on May 29, 2024

@MadBomber I'd rather just clarify the documentation. It works as expected, and IMO a slowdown isn't really worth it.


MadBomber avatar MadBomber commented on May 29, 2024

Then the meaning of the score is that the larger the absolute value the better the fit? A score of zero is a definite no match? and a score of Float::INFINITY is the same as zero?


MadBomber avatar MadBomber commented on May 29, 2024

I just took a dataset of 109,582 English words and trained each word against the category English. I randomly selected 1,124 of these training words to test with classify_with_score keeping each score in an array. Then I ran descriptive_statistics against that array of scores. Here is what I got:

{
                :number => 1124.0,
              :variance => 0.6600894439244815,
    :standard_deviation => 0.8124588875287669,
                   :min => -11.602839292439027,
                   :max => 0.0,
                  :mean => -10.497146916065935,
                  :mode => -10.909692111879082,
                :median => -10.504227003770918,
                 :range => 11.602839292439027,
                    :q1 => -10.909692111879082,
                    :q2 => -10.504227003770918,
                    :q3 => -9.993401380004928
}

I then took the gibberish word 'hdfiqdof' and got a score of -13.905424385433074.

From the outside looking in these stats tell me that the closer to zero the better the match.
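
For anyone who wants to repeat this kind of check without an extra gem, a summary like the one above can be approximated in core Ruby. This is a rough sketch: `summarize` is a hypothetical helper, it uses the population variance (which may differ from what descriptive_statistics reports), and the sample scores are made-up placeholders, not results from a real run.

```ruby
# Hypothetical helper: summarize an array of scores with core Ruby only.
# Uses the population variance (divides by n, not n - 1).
def summarize(scores)
  raise ArgumentError, "scores must not be empty" if scores.empty?

  sorted = scores.sort
  n = sorted.size
  mean = sorted.sum / n.to_f
  mid = n / 2
  median = n.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  variance = sorted.sum { |s| (s - mean)**2 } / n

  {
    number: n,
    min: sorted.first,
    max: sorted.last,
    mean: mean,
    median: median,
    variance: variance,
    standard_deviation: Math.sqrt(variance)
  }
end

# Placeholder scores for illustration; in practice these would be the
# second element of each classify_with_score result.
sample_scores = [-10.9, -10.5, -10.0, -11.6, 0.0]
summarize(sample_scores).each { |name, value| puts "#{name}: #{value}" }
```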


Ch4s3 avatar Ch4s3 commented on May 29, 2024

@MadBomber that appears to be correct. So there may be a bug with either classify or classify_with_score per @bararchy's original example.


MadBomber avatar MadBomber commented on May 29, 2024

@bararchy I took some time to look at your example in detail. Here is what I found. First, we have to remember that part of the training and classification process involves the removal of STOPWORDS for the language specified. The default language is 'en', so in your counter-example "this is not" all three words are stop words. That makes your text the same as an empty string, and an empty string always returns a score of 0.0.

The second thing that I noticed is that you used b.enable_threshold and then used classify_with_score to get your result. The threshold feature is applied only to the classify method. I have thought about extending it to the other two classifier methods but decided it was not necessary, because the user application is given the score to do with as it pleases.

This example has in fact turned up what I think is a bug. A score of 0.0 looks to me like it is a mis-match for the category in the same way that Float::INFINITY is a mis-match.
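
Until the library handles this case, an application can guard against it on its own side. The sketch below rests on several assumptions: `STOP_WORDS` is a tiny made-up list (the gem ships its own per-language lists), `meaningful_tokens` and `accept_match` are hypothetical helpers, and the threshold rule assumes that scores closer to zero mean a better fit, as the statistics earlier in this thread suggest.

```ruby
require "set"

# Made-up stop-word list for illustration only; the real lists live in the gem.
STOP_WORDS = Set.new(%w[this is not a the and of to]).freeze

# Hypothetical helper: the words that remain once stop words are removed.
def meaningful_tokens(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Hypothetical helper: apply a threshold to a classify_with_score result.
# Assumption: closer to zero is a better fit, while exactly 0.0 and
# infinite scores are treated as "no match".
def accept_match(category, score, threshold)
  return nil if score.zero? || score.infinite?

  score >= threshold ? category : nil
end

p meaningful_tokens("This is not")            # nothing left to classify
p meaningful_tokens("This is really normal")
p accept_match("Normal", 0.0, -5.0)
p accept_match("Normal", -1.3862943611198906, -5.0)
```

Checking `meaningful_tokens(text).empty?` before calling the classifier lets the application skip input that would otherwise come back with a score of 0.0.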


Ch4s3 avatar Ch4s3 commented on May 29, 2024

@MadBomber Great catch, I hadn't considered stop words. Two things come to mind: we should handle strings that are only/mostly stop words, and we should handle scores of 0.0 differently.


Ch4s3 avatar Ch4s3 commented on May 29, 2024

And we need to document that.


bararchy avatar bararchy commented on May 29, 2024

@Ch4s3 @MadBomber

Pinging about this issue :)

Thanks


Ch4s3 avatar Ch4s3 commented on May 29, 2024

I'll put it on my holiday to do list


MadBomber avatar MadBomber commented on May 29, 2024

On Nov 24, 2015, at 7:54 AM, Chase Gilliam wrote:

> I'll put it on my holiday to do list

I will also revisit the topic. A few weeks ago I took a couple of hours to review the sensitivity to training sets exposed in the Bayes score. At that time I could not come up with sufficiently general verbiage to explain what actually goes on in a Bayesian classification. The algorithm used in classifier-reborn is very basic. There is a suitability issue that should be in the docs, describing the kinds of applications to which this approach is best applied.

o-*


bararchy avatar bararchy commented on May 29, 2024

@MadBomber If I use classify_with_score instead of the threshold, and then, let's say, I do:

normal = "1234"

test1 = "123abc"
test2 = "abc"

classify_with_score("1234")
=> "Normal"

classify_with_score(test1)
=> 100~

classify_with_score(test2)
=> 0 || nil

Then for sure I can just do: if score > 50 => normal, if score < 50 => not normal.
Right?
This is kind of how the threshold works, no?


bararchy avatar bararchy commented on May 29, 2024

So, the more I use classify_with_score the more I see stuff that makes no sense to me.

Example:

ai_overlord = ClassifierReborn::Bayes.new 'normal_activity'

ai_overlord.train_normal_activity("GET / HTTP/1.2")
=> ["GET / HTTP/1.2"]

ai_overlord.classify_with_score("get / http/1.lol")
=> ["Normal activity", -8.047189562170502]

ai_overlord.classify_with_score("get / http/1.2")
=> ["Normal activity", -5.744604469176456]

ai_overlord.classify_with_score("get / http/1.1")
=> ["Normal activity", -8.047189562170502]

ai_overlord.classify_with_score("get / http/1.?!@?!")
=> ["Normal activity", -19.78325857845494]

Why does get / http/1.2 resemble get / http/1.2 less than get / http/1.?!@?! does? This makes no sense to me :/

The only way this makes sense is if the closer to 0 the more likely the string to the trained one.
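
A toy scorer helps show why "closer to 0 means a better match" is the natural reading for a score built from logarithms of probabilities. The class below is a deliberately simplified sketch and not classifier-reborn's actual algorithm: `TinyLogScorer`, its tokenizer, and its add-one smoothing are all assumptions made for illustration, so its numbers will not match the ones above.

```ruby
# Simplified single-category scorer for illustration only.
# The score is the sum of log(smoothed word frequency) over the input words.
class TinyLogScorer
  def initialize
    @counts = Hash.new(0)
    @total = 0
  end

  def train(text)
    tokenize(text).each do |word|
      @counts[word] += 1
      @total += 1
    end
  end

  def score(text)
    vocabulary = @counts.size + 1 # one extra slot reserved for unseen words
    tokenize(text).sum(0.0) do |word|
      # Add-one smoothing keeps unseen words from producing log(0).
      Math.log((@counts[word] + 1.0) / (@total + vocabulary))
    end
  end

  private

  # Naive tokenizer (an assumption): lowercase runs of letters, digits and dots.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9.]+/)
  end
end

scorer = TinyLogScorer.new
scorer.train("GET / HTTP/1.2")

["get / http/1.2", "get / http/1.1", "zzz yyy xxx www", ""].each do |text|
  puts "#{text.inspect}: #{scorer.score(text)}"
end
```

Each word contributes the log of a number between 0 and 1, so every contribution is negative and unseen words push the total further below zero. Input with no words at all sums to exactly 0.0, which is the same shape as the empty-string case discussed earlier in this thread.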


Ch4s3 avatar Ch4s3 commented on May 29, 2024

@bararchy My guess is that the stemmer is doing something odd with your input.

