Comments (17)
Very interesting! The fact that the classifier doesn't have a competing classification may be the root cause. When not_a_number is empty, the classifier only has one classification that it could possibly pick: null is not an option, and even if the match is some infinitesimal number or 0 itself, it's the only classification we know about, so we match it. A solution would be to require that the match coefficient be > 0, even if infinitesimal, and to return null if the match coefficient is 0. It's a breaking change to the API, though, so maybe we could just return "".
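The proposal above could be sketched roughly as follows. This is a hypothetical helper working on a plain hash of coefficients, not the gem's code: the name best_match and the convention that "higher is better, 0 means no match" are assumptions taken from the wording of the comment.

```ruby
# Hypothetical helper: pick the best category from a { name => coefficient } hash,
# but refuse to answer when the best coefficient is not strictly positive.
def best_match(coefficients)
  name, value = coefficients.max_by { |_name, coefficient| coefficient }
  return nil if name.nil? || value <= 0

  name
end

# With a single known category, a plain "max" always returns it...
only_one = { "A number" => 0.0 }
p only_one.max_by { |_name, coefficient| coefficient }.first  # => "A number"
# ...whereas the guarded version declines to classify.
p best_match(only_one)                 # => nil
p best_match({ "A number" => 0.25 })   # => "A number"
```

Returning nil rather than "" keeps "no match" distinguishable from a category whose name happens to be empty, at the cost of the API change mentioned above.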
[13] pry(main)> number_finder.classify("5")
=> "A number"
[14] pry(main)> number_finder.classify("15 and more numbers")
=> "A number"
[15] pry(main)> number_finder.classify("numbers")
=> "A number"
[16] pry(main)> number_finder.classify("lol wut ?")
=> "A number"
[17] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "A number"
[18] pry(main)> number_finder.classify("")
=> "A number"
from classifier-reborn.
@parkr Would it be too bad to say something like: "if a value is not classified in category 'a', then it belongs in category 'b', even though 'b' is empty"?
from classifier-reborn.
Maybe if I give an example of my usage it will be easier to understand my need to leave one category empty.
I'm using this gem to learn HTTP traffic. I set it to "training mode" and show it what "normal traffic" looks like. Then I want it to check traffic, and if the classification isn't "normal traffic", it is "suspicious traffic" and I drop the packet.
Right now it only works if I show it what "suspicious traffic" looks like, but this creates a kind of 'blacklist' situation, and I want more of a 'whitelist' approach.
from classifier-reborn.
I'm just getting back from vacation; I'll take a look soon.
from classifier-reborn.
@Ch4s3 Hi, did you manage to see what's going on here ?
from classifier-reborn.
Sorry, I got bogged down catching up at work. It's on my radar though.
from classifier-reborn.
This is akin to a "none of the above" kind of classification: given a set of categories, if the best fit is less than some threshold, a result indicating "none of the above" or :unknown is returned.
from classifier-reborn.
@MadBomber Good point, this is exactly the answer I was looking for from the classifier :)
I hope it could be implemented.
from classifier-reborn.
Take a look at my fork to see if this is what you had in mind.
MadBomber@5096334
I am not 100% sure this is a good solution to what you want to do. I'm thinking that there will be a large number of false positives. You may find yourself spending more time adjusting the threshold value.
I will submit a pull request after you play with it for a while.
Dewayne
o-*
from classifier-reborn.
In your pry session, I think that if you had used #classify_with_score, you would have seen that the score was being returned as Float::INFINITY for text that was not classified as 'a_number'.
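One plausible way such an Infinity can arise, shown as plain Ruby arithmetic. The assumption here, which I have not confirmed against the gem's source, is that a category's score is a sum of log(count / total) terms with a small fallback count for unseen words, and that an untrained category has a word total of zero. The method name log_score and the fallback value 0.1 are made up for illustration.

```ruby
# Assumed scoring rule, for illustration only: sum of log(count / total) per word.
# `fallback` is a hypothetical stand-in for the weight given to unseen words.
def log_score(word_counts, total_words, words, fallback: 0.1)
  words.reduce(0.0) do |sum, word|
    sum + Math.log(word_counts.fetch(word, fallback) / total_words.to_f)
  end
end

trained   = log_score({ "get" => 1000, "post" => 1000 }, 10_001, %w[get post])
untrained = log_score({}, 0, %w[get post])

p trained    # a finite negative number
p untrained  # Infinity: 0.1 / 0.0 is Infinity, and log(Infinity) is Infinity
p [trained, untrained].max == untrained  # => true
```

Under that assumption, an empty category would always outscore a trained one, which matches the sessions in this thread where the untrained category wins.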
from classifier-reborn.
@MadBomber I just tried your version, again, only training the "normal activity" category.
This is what I do:
ai_overlord = ClassifierReborn::Bayes.new 'normal_activity', 'suspicious_activity', {:enable_threshold => true}
=> #<ClassifierReborn::Bayes:0x0000000210ec38
@auto_categorize=false,
@categories={:"Normal activity"=>{}, :"Suspicious activity"=>{}},
@category_counts={},
@category_word_count={},
@enable_threshold=true,
@language="en",
@threshold=0.0,
@total_words=0>
### Training the classifier
ai_overlord.train_normal_activity("Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept" * 1000)
[35] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[36] pry(main)> ai_overlord.classify_with_score("GET / POST")
=> ["Suspicious activity", Infinity]
## Trying to play around with threshold
[23] pry(main)> ai_overlord.threshold = 0.5
=> 0.5
[24] pry(main)> ai_overlord.classify("GET / ")
=> nil
[25] pry(main)> ai_overlord.threshold = 10.0
=> 10.0
[26] pry(main)> ai_overlord.classify("GET / ")
=> nil
[27] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[28] pry(main)> ai_overlord.threshold = 50.0
=> 50.0
[29] pry(main)> ai_overlord.classify("GET / POST")
=> nil
### Making sure the Threshold is changed inside the class
[37] pry(main)> ai_overlord
=> #<ClassifierReborn::Bayes:0x0000000210ec38
@auto_categorize=false,
@categories={:"Normal activity"=>{:firefox=>1, :chrome=>1000, :mozzila=>1000, :get=>1000, :post=>1000, :http=>1000, :acceptfirefox=>999, :accept=>1, :/=>1000, :"."=>3000}, :"Suspicious activity"=>{}},
@category_counts={:"Normal activity"=>1},
@category_word_count={:"Normal activity"=>10001},
@enable_threshold=true,
@language="en",
@threshold=50.0,
@total_words=10001>
It seems that, again, when one category is empty it always classifies to the empty one. Would the threshold feature help in this case?
Thanks :)
from classifier-reborn.
Given your examples, that is proper behavior. You trained the classifier with only one example: a very long string. You then asked it to classify a very short string. It rejected the string, showing a score of Infinity, which means that there is no matching category.
Try this pattern:
# Notice there is only one category: Normal
ai_overlord = ClassifierReborn::Bayes.new(
  'Normal',
  enable_threshold: true
)
normal_request = "Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept"
10.times { ai_overlord.train_normal(normal_request) }
# Dynamically set the threshold to less than a known sample
ai_overlord.threshold = ai_overlord.classify_with_score(normal_request).last - 0.5
ai_overlord.classify(normal_request)
# Now try to classify a counter-example
abnormal_request = "Safari opera webkit GET POST / http 1.1 1.0 1.2 Accept"
ai_overlord.classify(abnormal_request)
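The single-sample threshold in the pattern above extends naturally to several known-good samples. The sketch below is a hypothetical helper on plain numbers, with no gem calls: it takes scores you have already collected (for example from classify_with_score) and places the threshold a margin below the lowest of them. The name threshold_from_scores is an assumption, and the default margin of 0.5 is simply borrowed from the pattern above.

```ruby
# Hypothetical helper: derive a threshold from the scores of known-good samples.
# Non-finite scores (e.g. Infinity) are ignored, since they say nothing about ranking.
def threshold_from_scores(scores, margin: 0.5)
  finite = scores.select(&:finite?)
  raise ArgumentError, "need at least one finite score" if finite.empty?

  finite.min - margin
end

known_good_scores = [-12.5, -10.0, -14.0, Float::INFINITY]
p threshold_from_scores(known_good_scores)               # => -14.5
p threshold_from_scores(known_good_scores, margin: 2.0)  # => -16.0
```

A larger margin lets more borderline requests through; a smaller one rejects more, so the value is something to tune against real traffic.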
o-*
from classifier-reborn.
So, @MadBomber, I just wanted to ask: is there a way to add a "none of the above" option? That way I can have a default "none of the above" value and train only one classification.
from classifier-reborn.
The feature has been merged into master, but a new version of the gem has not yet been published. You can look at the code to see the details:
https://github.com/jekyll/classifier-reborn/blob/master/lib/classifier-reborn/bayes.rb
Here is the gist:
- It only works with the classify method. All other methods behave as before. If the result falls below a threshold score, or the score is INFINITY, the result returned will be nil. So to see if it was "none of the above", just check for result.nil?
- You can enable the threshold at initialization time with the option 'enable_threshold' set to true. You can also enable or disable threshold processing at any time using the methods enable_threshold and disable_threshold.
- The default threshold is 0.0; any score below this will return a nil result. HOWEVER, the threshold that you should use is one that makes sense for your application. You can set your own threshold at initialization time with the option 'threshold', which expects a floating point number. You can reset the threshold or get its value using the methods 'threshold=' and 'threshold'.
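To make the three points concrete, here is a small stand-in that mimics the behaviour described in the list. It is not the gem's implementation: ThresholdGate and its apply method are hypothetical, the scores are illustrative values, and it works on a category and score like the pair classify_with_score returns in the sessions earlier in this thread.

```ruby
# Hypothetical stand-in for the behaviour described above; not classifier-reborn code.
class ThresholdGate
  attr_accessor :threshold

  def initialize(enable_threshold: false, threshold: 0.0)
    @enabled = enable_threshold
    @threshold = threshold
  end

  def enable_threshold
    @enabled = true
  end

  def disable_threshold
    @enabled = false
  end

  # Returns the category, or nil ("none of the above") when the gate is enabled
  # and the score is infinite or falls below the threshold.
  def apply(category, score)
    return category unless @enabled
    return nil if score.infinite? || score < @threshold

    category
  end
end

gate = ThresholdGate.new(enable_threshold: true, threshold: -200.0)
p gate.apply("Normal", -35.2)                # => "Normal"
p gate.apply("Normal", -512.0)               # => nil
p gate.apply("Suspicious", Float::INFINITY)  # => nil
gate.disable_threshold
p gate.apply("Suspicious", Float::INFINITY)  # => "Suspicious"
```

With the gate disabled, the best-scoring category is always returned, which is the pre-threshold behaviour the earlier comments ran into.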
Check out the unit tests:
https://github.com/jekyll/classifier-reborn/blob/master/test/bayes/bayesian_test.rb
The test at line 82, 'test_classification_with_threshold_again', is your specific scenario as I understood it.
Let us know if you catch any bad guys using this technique.
Dewayne
o-*
from classifier-reborn.
@MadBomber Thanks for the great explanation and the example in the tests.
Right now I'm using a threshold of -200.0 to stop an SQL injection attack from SQLMAP.
I need to play around with letting the classifier learn more, then test a few attacks.
Anyhow, this is fine for my use case, many thanks. (It's also stable; I would push a new gem version ;) )
I'll update if necessary, issue closed :)
from classifier-reborn.
@MadBomber and @bararchy I'm going to try to release a new version soon. I basically just need to get some stuff into the README about the new features.
from classifier-reborn.
I will add a section to the README on the threshold features. Should have a pull request in by tonight.
o-*
from classifier-reborn.