
classifier-reborn's Introduction

Classifier Reborn


Getting Started

Classifier Reborn is a general classifier module that supports Bayesian and other types of classification. It is a fork of cardmagic/classifier under more active development. Currently, it provides a Bayesian classifier and a Latent Semantic Indexer (LSI).

Here is a quick illustration of the Bayesian classifier.

$ gem install classifier-reborn
$ irb
irb(main):001:0> require 'classifier-reborn'
irb(main):002:0> classifier = ClassifierReborn::Bayes.new 'Ham', 'Spam'
irb(main):003:0> classifier.train "Ham", "Sunday is a holiday. Say no to work on Sunday!"
irb(main):004:0> classifier.train "Spam", "You are the lucky winner! Claim your holiday prize."
irb(main):005:0> classifier.classify "What's the plan for Sunday?"
#=> "Ham"

Now, let's build an LSI, classify some text, and find a cluster of related documents.

irb(main):006:0> lsi = ClassifierReborn::LSI.new
irb(main):007:0> lsi.add_item "This text deals with dogs. Dogs.", :dog
irb(main):008:0> lsi.add_item "This text involves dogs too. Dogs!", :dog
irb(main):009:0> lsi.add_item "This text revolves around cats. Cats.", :cat
irb(main):010:0> lsi.add_item "This text also involves cats. Cats!", :cat
irb(main):011:0> lsi.add_item "This text involves birds. Birds.", :bird
irb(main):012:0> lsi.classify "This text is about dogs!"
#=> :dog
irb(main):013:0> lsi.find_related("This text is around cats!", 2)
#=> ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]

There is much more that can be done with Bayes and LSI beyond these quick examples. For more information, see the project documentation.

Notes on JRuby support

gem 'classifier-reborn-jruby', platforms: :java

While experimental, this gem should work on JRuby without any additional changes. Unfortunately, you will not be able to use C bindings to GNU/GSL or similar performance-enhancing native code. Additionally, we do not use fast_stemmer, but rather an implementation of the Porter stemming algorithm. Stemming will therefore differ between MRI and JRuby; however, you may choose to disable stemming and do your own preprocessing (or use some other popular Java library).

If you encounter a problem, please submit your issue with [JRuby] in the title.

Code of Conduct

In order to have a more open and welcoming community, Classifier Reborn adheres to the Jekyll code of conduct adapted from the Ruby on Rails code of conduct.

Please adhere to this code of conduct in any interactions you have in the Classifier community. If you encounter someone violating these terms, please let Chase Gilliam know and we will address it as soon as possible.

Authors and Contributors

The Classifier Reborn library is released under the terms of the GNU LGPL-2.1.


classifier-reborn's Issues

Feature: Classify with minimum threshold

We monkey patched ClassifierReborn::Bayes with a method that classifies text but returns "Not Found" if the classifications method produces a score below a given threshold. The idea is to cut down on false positives.

Would this be an interesting feature to include in the gem proper? If so, I'd be happy to tackle it.
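
For reference, a minimal sketch of what such a patch can look like, relying only on the public classifications method mentioned above (the method name classify_with_threshold and the "Not Found" sentinel are illustrative, not part of the gem):

require 'classifier-reborn'

module ClassifierReborn
  class Bayes
    # Return the best category, or "Not Found" when its score falls below the threshold.
    def classify_with_threshold(text, threshold)
      category, score = classifications(text).max_by { |_, s| s }
      score < threshold ? 'Not Found' : category
    end
  end
end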

Error: undefined method `<=>' when LSI is enabled and posts contain fenced code blocks

I ran into this issue while trying to get better related posts for my blog, so I set lsi: true. It turns out that if more than one post file contains a fenced code block with an opening language tag, the site will not build. To reproduce the error, I have one post called 2014-07-29-post-one.md which contains this fenced code:

```php
<?php
echo "string";
$var = 1;
?>
```

NOTE: With just this one post, jekyll build works fine.

But if I create another post file, 2014-07-29-post-two.md, which has the following fenced code in it:

```php
<?php
echo "string";
$var = 1;
?>
```

With these two files present, jekyll throws the following error:

Populating LSI... 
Rebuilding index... jekyll 2.1.1 | Error:  undefined method `<=>' for (0+1.3158589669405984e-08i):Complex`

If I remove the <?php line from just one of the files, jekyll build works fine.

This issue only happens when LSI is enabled.

I've Googled a lot but found nothing conclusive. I did notice that Jekyll supports GFM by default; in any case, my _config.yml has this configuration:

markdown: kramdown
kramdown:
  input: GFM

lsi: true

Here is the backtrace:

Populating LSI... 
Rebuilding index... 
/Users/adriano/.rvm/gems/ruby-2.1.1/gems/classifier-1.3.4/lib/classifier/lsi.rb:285:in `<=>': undefined method `<=>' for (0.2685144772068373+0i):Complex (NoMethodError)
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/classifier-1.3.4/lib/classifier/lsi.rb:285:in `sort'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/classifier-1.3.4/lib/classifier/lsi.rb:285:in `build_reduced_matrix'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/classifier-1.3.4/lib/classifier/lsi.rb:136:in `build_index'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/related_posts.rb:38:in `build_index'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/related_posts.rb:20:in `build'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/post.rb:246:in `related_posts'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/post.rb:258:in `render'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/site.rb:261:in `block in render'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/site.rb:260:in `each'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/site.rb:260:in `render'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/site.rb:43:in `process'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/command.rb:53:in `process_site'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/commands/build.rb:50:in `build'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/commands/build.rb:33:in `process'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/lib/jekyll/commands/build.rb:17:in `block (2 levels) in init_with_program'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `call'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `block in execute'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `each'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `execute'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary/program.rb:35:in `go'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/mercenary-0.3.4/lib/mercenary.rb:22:in `program'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/gems/jekyll-2.1.1/bin/jekyll:18:in `<top (required)>'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/bin/jekyll:23:in `load'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/bin/jekyll:23:in `<main>'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `eval'
    from /Users/adriano/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `<main>'

Running on:

  • ruby 2.1.1p76
  • jekyll 2.1.1
  • OSX 10.8.5

Something seems wrong with bayes classify method

I'm using the classifier to train on a dataset that contains ~2200 samples, distributed as below:
~1100 of class p
~1100 of class f
and just 1 of class m.

For the test step, I'm using 600 samples, where 300 are f, 300 are p, and again just one belongs to class m.

The training step is below and the classify step is in https://github.com/marciovicente/classifier-reborn/blob/validate/lib/classifier-reborn/classifier.rb#L13

When I run the classify method on each of them, the result is very weird:
{m: 560, f: 27, p: 0}

# training
bayes = ClassifierReborn::Bayes.new('t', 'f', 'm', language: 'pt')

@training.each_with_index do |x,i|
  if (x.last.to_s == :t.to_s)
    bayes.train_t(x.first)
  end
  if x.last.to_s == :f.to_s
    bayes.train_f(x.first)
  end
  if x.last.to_s == :m.to_s
    bayes.train_m(x.first)
  end
end

To go further: the whole dataset is in Portuguese; only the m class has a sentence in English (in both the training and testing sets). I can't understand what's happening, since Bayes is a probabilistic approach.

Could anyone help me?

LSI#build_index is very slow

I have code something like this:

lsi = ClassifierReborn::LSI.new(auto_rebuild: false)
data.each do |row|
  lsi.add_item(row['foo'], row['bar'])
end
lsi.build_index

...and build_index runs very slowly on lots of items.

  • With ~10 items, it runs in <1 second
  • With ~20 items, it runs in ~15 seconds
  • With ~30 items, it runs in ~130 seconds

I tracked it down to #build_index by disabling auto_rebuild. From there, I tracked it through LSI#build_reduced_matrix, to the monkey-patched extension Matrix#SV_decomp, inside the 3-level nested loop:

     while true do
       for row in (0...qrot.row_size-1) do
         for col in (1..qrot.row_size-1) do

Based on the name SV_decomp, I'll hazard a guess that this is supposed to be a Singular Value Decomposition (which I just discovered). A quick search turned up the Ruby-SVD gem, which could be an option.

I don't understand any of the math, or much of this gem's layout yet, but wanted to record my findings.
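
For anyone trying to reproduce the growth pattern, a rough timing sketch using Ruby's Benchmark module (the placeholder corpus and the 'foo'/'bar' keys mirror the snippet above and are not meaningful data):

require 'benchmark'
require 'securerandom'
require 'classifier-reborn'

# Placeholder rows standing in for the issue's `data`.
data = Array.new(30) do
  { 'foo' => Array.new(50) { SecureRandom.hex(3) }.join(' '), 'bar' => :misc }
end

[10, 20, 30].each do |n|
  lsi = ClassifierReborn::LSI.new(auto_rebuild: false)
  data.first(n).each { |row| lsi.add_item(row['foo'], row['bar']) }
  # Time only the index build, since add_item is cheap with auto_rebuild disabled.
  puts "#{n} items: #{Benchmark.realtime { lsi.build_index }.round(1)}s"
end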

A question about the score/threshold

Does the score get lower or higher as the match gets more exact?

Let's say I matched 3 out of 4 words; does it go north (-) of 0.0 or south (+)?
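
A hedged way to see for yourself is to inspect the raw scores: classifications (used in other issues on this page) returns a category-to-score hash, and in the Bayes implementation the scores are log-based, so they are negative and a value closer to 0.0 indicates a stronger match. The numbers below are illustrative, not real output.

require 'classifier-reborn'

classifier = ClassifierReborn::Bayes.new 'Ham', 'Spam'
classifier.train 'Ham', "Sunday is a holiday. Say no to work on Sunday!"
classifier.train 'Spam', "You are the lucky winner! Claim your holiday prize."
classifier.classifications "What's the plan for Sunday?"
#=> e.g. {"Ham"=>-5.1, "Spam"=>-9.7}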

Filter out punctuation marks?

Would you consider a pull request to filter out standalone punctuation marks?

i.e.: {:!=>1, :"@"=>1, :"#"=>1, :"$"=>1, :%=>1, :^=>1, :&=>1, :*=>1, :"?"=>1}

I could filter these out in my own scripts before training.

But I don't imagine many cases where users would want to train on punctuation, and it's probably accidental when it happens.

Thoughts?
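
In the meantime, a small pre-filtering sketch that drops standalone punctuation tokens before training (the helper name and regex are illustrative, and it assumes whitespace-separated tokens):

require 'classifier-reborn'

def strip_standalone_punctuation(text)
  # Drop tokens that consist only of punctuation; keep everything else as-is.
  text.split.reject { |token| token =~ /\A[[:punct:]]+\z/ }.join(' ')
end

classifier = ClassifierReborn::Bayes.new 'Ham', 'Spam'
classifier.train 'Ham', strip_standalone_punctuation("Say no ! ? to work @ # on Sunday")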

Functionality to modify stopword lists?

What is the currently recommended practice for modifying stopwords? Just edit the word out of the list?

Currently, "not" is a stopword in English.

But it would be beneficial in my use cases if "not" were not a stopword, in order to distinguish strings like the ones below.

"This is a dog." => :dog
"This is not a dog."=> :otherAnimal

In that case, "not" might have an impact.

Thoughts?
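
A possible interim approach, shown as a sketch rather than as the recommended practice: the STOPWORDS table (seen later on this page to be a Hash of Sets keyed by language code) can be edited in memory before any training happens. Whether this is supported usage is exactly the open question here.

require 'classifier-reborn'

ClassifierReborn::Hasher::STOPWORDS['en'].delete('not')

classifier = ClassifierReborn::Bayes.new 'dog', 'otherAnimal'
classifier.train 'dog', 'This is a dog.'
classifier.train 'otherAnimal', 'This is not a dog.'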

Tests do not run on 2.2.1

The require fails:

classifier-reborn/test/test_helper.rb:3:in `require': cannot load such file -- test/unit (LoadError)

I'm investigating.
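
One likely cause (an assumption, not a confirmed diagnosis): test/unit stopped shipping as part of the default standard library around Ruby 2.2, so declaring it explicitly in the Gemfile may be enough:

# Gemfile (sketch)
group :test do
  gem 'test-unit'
end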

Should we add a Dockerfile for development and testing?

On my machine I usually prefer not to install a bag full of libraries and run a bunch of processes just to facilitate development. This allows me to keep the system clean while I context-switch a lot. For this reason, I have written a simple Dockerfile that packages the necessary requirements for working with classifier-reborn. It looks something like this:

FROM ruby:2.3
MAINTAINER Sawood Alam <https://github.com/ibnesayeed>

ENV LANG C.UTF-8

RUN apt update && apt install -y libgsl0-dev && rm -rf /var/lib/apt/lists/*

RUN cd /tmp \
    && wget http://download.redis.io/redis-stable.tar.gz \
    && tar xvzf redis-stable.tar.gz \
    && cd redis-stable \
    && make && make install \
    && cd /tmp && rm -rf redis-stable*

WORKDIR /usr/src/app
COPY . /usr/src/app
RUN bundle install
RUN gem install narray nmatrix gsl redis

CMD redis-server --daemonize yes && rake

Using this file I have built a Docker image named classifier-reborn:

$ docker build -t classifier-reborn .

Now, anytime I make any changes in the code, I can simply run the following command to test it:

$ docker run --rm -it -v "$PWD":/usr/src/app classifier-reborn

It will use the Docker image I built before that has all the dependencies installed. Then it will run a Redis server inside the container and invoke the default Rake task. Once the task is done, everything will be wiped clean.

If I want to do something more than running the default Rake task, I can access the Bash prompt of the container:

$ docker run --rm -it -v "$PWD":/usr/src/app classifier-reborn bash

Once I exit from the Bash prompt, my machine will be clean again.

If dependencies change in a way that requires re-running bundle install, the image can simply be rebuilt.

If this workflow seems useful to others, I can push the Dockerfile to the repo with some documentation.

Question regarding languages

Hi
I've seen that the classifier has fast-stemmer as a dependency, right? I think this stemmer doesn't support Spanish; would that mean the classifier won't work properly with documents in Spanish?

What if the text I pass to the classifier has already been processed by a Spanish stemmer; would that work?

Best
Alejandro

Remove all these crazy monkey patches

This repository contains core extensions, which can muck with a lot of stuff.

This "reborn" version should not contain any of these core extensions. Let's figure out a way to use modules & module methods to get it all isolated from the core.

New tagged release and docs

We should add docs for the new classify_with_score and push out a new minor release.

  • Docs for classify_with_score

Bayes ok for short-form text?

Forgive me in advance if this is a total noob question.

I've been trying to get Bayes working to categorize job titles. However, I'm running into an issue where the classifier chooses the category with the least training text:

#!/usr/bin/env ruby
# classifier_reborn_demo.rb

require 'classifier-reborn'
training_set = DATA.read.split("\n")
categories = training_set.shift.split(',').map{ |c| c.strip }
classifier = ClassifierReborn::Bayes.new categories

training_set.each do |a_line|
  next if a_line.empty? || '#' == a_line.strip[0]
  parts = a_line.strip.split(':')
  classifier.train(parts.first, parts.last)
end

puts classifier.classify "Regional Marketing Assistant"

__END__
Consulting,Marketing
# consulting titles
consulting: Consultant
consulting: Consulting
# marketing titles
marketing: Chief Marketing Officer
marketing: Marketing
marketing: Marketing Operations
marketing: Search Engine Optimization
marketing: SEO

The classifier is returning consulting even though Marketing appears in the title.

ruby scripts/classifier_reborn_demo.rb
Consulting
-10.239959789157341

Is it possible that Bayes is not suited for this type of classification? Maybe a bag-of-words algorithm would be better suited?

Thanks in advance!

NoMethodError: undefined method `+' for nil:NilClass at Train method

Hi, I'm getting this error for train method.

NoMethodError: undefined method `+' for nil:NilClass
/Users/my/directory/classifier-reborn/lib/classifier-reborn/bayes.rb:52:in `block in train'
/Users/my/directory/classifier-reborn/lib/classifier-reborn/bayes.rb:51:in `each'
/Users/my/directory/classifier-reborn/lib/classifier-reborn/bayes.rb:51:in `train'

The problem is in:

def train(category, text)
  ...
  @category_counts[category] += 1
  Hasher.word_hash(text, @language).each do |word, count|
    @categories[category][word]      += count
    @category_word_count[category]   += count
    @total_words += count
  end
end

When, for example, @categories[category][word] is nil, I get this error. I worked around it by calling the to_i method.
I changed it to:

def train(category, text)
  ...
  @category_counts[category] =   @category_counts[category].to_i + 1
  Hasher.word_hash(text, @language).each do |word, count|
    @categories[category][word]      = @categories[category][word].to_i + count
    @category_word_count[category]   = @category_word_count[category].to_i + count
    @total_words += count
  end
end

The thing is, I'm trying to reproduce this error in a test. Can you help me?
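
A minimal reproduction sketch, assuming the internals shown in the snippet above (@categories being a hash of per-category word-count hashes): replacing the inner hashes with plain Hashes that have no default of 0 should trigger the same nil + error. The test framework and the instance-variable poking are illustrative only.

require 'minitest/autorun'
require 'classifier-reborn'

class TrainNilErrorTest < Minitest::Test
  def test_train_raises_when_word_counts_lose_their_default
    bayes = ClassifierReborn::Bayes.new 'Ham', 'Spam'
    # Simulate the bad state: per-category word hashes without a default value,
    # so `+= count` hits nil on the first unseen word.
    bayes.instance_variable_set(:@categories, { Ham: {}, Spam: {} })
    assert_raises(NoMethodError) { bayes.train('Ham', 'holiday prize winner') }
  end
end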

Port classifier-reborn to Crystal lang

This is just an idea: would you be OK with trying to port classifier-reborn to Crystal? I use the Bayes classifier in a production setup and would really love to have it available in Crystal as well, for heavier lifting and scaling.

Marshal data too short

I've installed all the dependencies (rb-gsl, gsl, classifier-reborn), and when I try to load my trained LSI data model I get this error: Marshal data too short

I've checked the model and it seems OK.

The code below works fine on my personal computer (Mac OS 10.11), but when I try to run it on my web server (Ubuntu 15.10) it doesn't work. Is there any problem with generating this model on an OS X machine and then using it for classification on another OS? Maybe a GSL library version mismatch or something like that.

require 'classifier-reborn'

model = File.read("#{Rails.root}/bin/lsi-model.dat")
lsi_classifier = Marshal.load(model)
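
One thing worth ruling out (a guess, not a confirmed fix): reading the dump in text mode can mangle binary Marshal data through encoding or newline translation, so reading it in binary mode is safer:

require 'classifier-reborn'

model = File.binread("#{Rails.root}/bin/lsi-model.dat")
lsi_classifier = Marshal.load(model)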

Changing language and auto_categorize attributes

I have to use the Portuguese stopwords with the Bayes classifier, but the initialize method is fixed to English; auto_categorize has the same issue.

def initialize(*args)
      @categories = Hash.new
      options = { language: 'en', auto_categorize: false }
      ...
end

I searched the gem and didn't find a setter. Is there another way?
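
As used elsewhere on this page (another issue passes language: 'pt' directly to Bayes.new), both settings appear to be constructor options rather than attributes with setters. A sketch with hypothetical category names:

require 'classifier-reborn'

classifier = ClassifierReborn::Bayes.new 'Positivo', 'Negativo',
                                         language: 'pt',
                                         auto_categorize: true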

Using Redis for the data structure

I have seen references to Redis in the README for dumping and restoring the complete marshaled snapshot. This, in my opinion, has very little advantage over storing the data in a file.

However, using Redis to store the individual counter objects/hashes (with some namespacing in place) rather than raw memory would be a nice feature. Every time a counter needs to be updated during training, it would update the necessary records in Redis, and when the classify method is called, it would pull the vectors from Redis. This approach has the following advantages:

in case of an accidental program crash, the training data will not be lost
  • a shared Redis store instance with adequate memory can serve many small worker application instances that can participate in training and classifying documents

Possible approach to implement it:

  • define proxy methods that override [] and []= so they save and retrieve records to and from Redis the same way they would with a hash (see the sketch below)
  • in the initialize method, check whether Redis configs are supplied and only then override the hash accessors; this way the classifier works seamlessly with and without Redis while avoiding a branch (conditional check) on each read and write operation

Thoughts?
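
A rough sketch of the proxy idea described above (the class name and key namespace are hypothetical, not the gem's API): a hash-like object whose [] and []= read and write namespaced counters in Redis.

require 'redis'

class RedisCounterHash
  def initialize(redis, namespace)
    @redis = redis
    @namespace = namespace
  end

  # Read a counter, defaulting to 0 like Hash.new(0) would.
  def [](key)
    (@redis.hget(@namespace, key) || 0).to_i
  end

  def []=(key, value)
    @redis.hset(@namespace, key, value)
  end
end

# counts = RedisCounterHash.new(Redis.new, 'bayes:category_counts')
# counts['Ham'] += 1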

Other classification/nlp tools

We already do a bag of words and word counts. Would it be useful to anyone to expose this functionality for other classification uses?

Some other things to consider:

  • N-grams
  • Levenshtein distance
  • Sentiment analysis
  • term frequency–inverse document frequency

Rewrite the SVD

As discussed in #27, we need to rewrite the SVD method here. This could also be an opportunity to remove the monkey patch on Matrix and provide a method like svd(matrix). This will clear up #5 as well.

Here's a resource I found on SVD. I'll definitely need help on this one.

Error when searching for a word that doesn't exist in the corpus

require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new

#add strings to lsi that has nothing to do with dogs

lsi.search("dogs", 8)

/Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0
/gems/classifier-reborn-2.0.4/lib/classifier-reborn
/lsi.rb:213:in `sort_by': comparison of Float with NaN failed (ArgumentError)
    from /Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0
/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:213:in `proximity_norms_for_content'
    from /Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0
/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:225:in `search'
    from chat.rb:24:in `<main>'

I'm assuming that somewhere in the code we have a "0/0" that is being converted into NaN. The error is avoidable as long as you only search for terms that appear in the corpus you trained the LSI on, but... well... what happens if I'm using a very large corpus? How am I supposed to know which words are (or are not) present?

Thoughts about validation

I'm using this gem and it seems awesome, but I have some questions about validation.

What are you using to check the accuracy of a specific model? I've looked through the documentation and didn't find anything about validation, so I'm thinking it would be a great feature to implement a module that measures the precision of a specific model.

In my academic work, I created a simple model validator to check accuracy, and it works fine for me. I had thought of implementing an initial validate method:

classifier.validate(testing_samples)

that returns

accuracy | positive | negative
   0.89      802          98   | positive
   0.81      128         572   | negative

Mean accuracy: 0.8587  

I also think it's possible to create a "fake" cross-validation method (or maybe the real thing), optionally passing a number of folds, which would be the number of iterations in which the testing dataset is shuffled and re-classified. The output would be something like:

classifier.validate(testing_samples, 5)
Fold 1: 0.891
Fold 2: 0.887
Fold 3: 0.821
Fold 4: 0.798
Fold 5: 0.803
Mean accuracy for 5 folds: 0.84
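
A minimal sketch of the proposed helper (the method shape and the [text, expected_category] pair format for testing_samples are part of this proposal, not the gem's API):

require 'classifier-reborn'

# Returns the fraction of samples whose predicted category matches the expected one.
def validate(classifier, testing_samples)
  correct = testing_samples.count { |text, expected| classifier.classify(text) == expected }
  correct.to_f / testing_samples.size
end

# validate(classifier, [["Say no to work on Sunday!", "Ham"], ["Claim your prize!", "Spam"]])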

ArgumentError: invalid byte sequence in US-ASCII

When I run the tests, I get an invalid byte sequence in US-ASCII error from the Hasher class.

$ ruby -v
ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-linux]
$ ruby test/extensions/hasher_test.rb
Loaded suite test/extensions/hasher_test
Started
...E
===================================================================================================
Error: test_default_stopwords(HasherTest): ArgumentError: invalid byte sequence in US-ASCII
/usr/src/app/lib/classifier-reborn/extensions/hasher.rb:59:in `split'
/usr/src/app/lib/classifier-reborn/extensions/hasher.rb:59:in `block (2 levels) in <module:Hasher>'
/usr/src/app/lib/classifier-reborn/extensions/hasher.rb:57:in `each'
/usr/src/app/lib/classifier-reborn/extensions/hasher.rb:57:in `block in <module:Hasher>'
test/extensions/hasher_test.rb:26:in `[]'
test/extensions/hasher_test.rb:26:in `test_default_stopwords'
     23: 
     24:   def test_default_stopwords
     25:     assert_not_empty Hasher::STOPWORDS['en']
  => 26:     assert_not_empty Hasher::STOPWORDS['fr']
     27:     assert_empty Hasher::STOPWORDS['gibberish']
     28:   end
     29: 
===================================================================================================
..

Finished in 0.003000319 seconds.
---------------------------------------------------------------------------------------------------
6 tests, 6 assertions, 0 failures, 1 errors, 0 pendings, 0 omissions, 0 notifications
83.3333% passed
---------------------------------------------------------------------------------------------------
1999.79 tests/s, 1999.79 assertions/s

When I try to interact with it from the IRB console:

$ irb
irb(main):001:0> load './lib/classifier-reborn/extensions/hasher.rb'
=> true
irb(main):002:0> ClassifierReborn::Hasher::STOPWORDS['en']
=> #<Set: {"a", "again", "all", "along", "are", "also", "an", "and", "as", "at", "but", "by", "came", "can", "cant", "couldnt", "did", "didn", "didnt", "do", "doesnt", "dont", "ever", "first", "from", "have", "her", "here", "him", "how", "i", "if", "in", "into", "is", "isnt", "it", "itll", "just", "last", "least", "like", "most", "my", "new", "no", "not", "now", "of", "on", "or", "should", "sinc", "so", "some", "th", "than", "this", "that", "the", "their", "then", "those", "to", "told", "too", "true", "try", "until", "url", "us", "were", "when", "whether", "while", "with", "within", "yes", "you", "youll"}>
irb(main):003:0> ClassifierReborn::Hasher::STOPWORDS['fr']
ArgumentError: invalid byte sequence in US-ASCII
	from /usr/src/app/lib/classifier-reborn/extensions/hasher.rb:59:in `split'
	from /usr/src/app/lib/classifier-reborn/extensions/hasher.rb:59:in `block (2 levels) in <module:Hasher>'
	from /usr/src/app/lib/classifier-reborn/extensions/hasher.rb:57:in `each'
	from /usr/src/app/lib/classifier-reborn/extensions/hasher.rb:57:in `block in <module:Hasher>'
	from (irb):3
	from /usr/local/bin/irb:11:in `<main>'
irb(main):004:0> 
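
Two hedged workarounds, assuming the root cause is the environment's default external encoding being US-ASCII rather than anything in the stopword files themselves: export a UTF-8 locale before running the tests (e.g. LANG=C.UTF-8, as the Dockerfile above already does), or force UTF-8 wherever the stopword files are read (the variable name below is illustrative):

# e.g. in the code that loads the per-language stopword files
File.read(stopword_file, encoding: 'UTF-8')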

Zero vectors can not be normalized

Hi there!

I am using Rake to automate my Jekyll blog generation and I run into the errors below. I didn't encounter them when using jekyll serve --lsi. Any pointers? 😄

Rebuilding index... rake aborted!
Zero vectors can not be normalized
/Library/Ruby/Gems/2.0.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:143:in `block in build_index'
/Library/Ruby/Gems/2.0.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:141:in `times'
/Library/Ruby/Gems/2.0.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:141:in `build_index'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/related_posts.rb:38:in `build_index'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/related_posts.rb:20:in `build'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/document.rb:455:in `related_posts'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/renderer.rb:41:in `run'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/page_renderer.rb:17:in `prepare'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/page_renderer.rb:34:in `render'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/search_entry.rb:20:in `create'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/indexer.rb:64:in `block in generate'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/indexer.rb:63:in `each'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/indexer.rb:63:in `each_with_index'
/Library/Ruby/Gems/2.0.0/gems/jekyll-lunr-js-search-3.0.0/lib/jekyll_lunr_js_search/indexer.rb:63:in `generate'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/site.rb:154:in `block in generate'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/site.rb:153:in `each'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/site.rb:153:in `generate'
/Library/Ruby/Gems/2.0.0/gems/jekyll-3.0.1/lib/jekyll/site.rb:58:in `process'

JRuby Support

There is a pure Ruby version of fast-stemmer as well as a Java version, so it should be possible to add JRuby support. I could probably get that working if there is interest.

classify_with_score and scored_categories return undefined?

Hi,

I tried running ClassifierReborn::LSI from the example in the README, and invoking the classify_with_score and scored_categories methods raises NoMethodError. Am I calling them wrong?

lsi = ClassifierReborn::LSI.new

strings = [ ["This text deals with dogs. Dogs.", "a"],
            ["This text involves dogs too. Dogs!", "a"],
            ["This text revolves around cats. Cats.", "b"],
            ["This text also involves cats. Cats!", "b"],
            ["This text involves birds. Birds.", "c"]]

strings.each {|x| lsi.add_item x.first, x.last}

# NoMethodError
scored_categories = lsi.scored_categories("dog bird cat")

# NoMethodError
lsi.classify_with_score("test")

comparison of Float with Float failed (ArgumentError) on a simple training test

Hi,
I had an error and managed to reproduce it with the following basic test adapted from the Readme example:

require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new
strings = [["Word1 Word2", :cat],
           ["Word1 Word3", :cat],
           ["Word4 Word5 Word6 Word7", :cat]]
strings.each {|x| lsi.add_item x.first, x.last}

the third .add_item fails with:

classifier-reborn-2.0.2/lib/classifier-reborn/lsi.rb:284:in `sort': comparison of Float with Float failed (ArgumentError)

Is there anything I'm doing wrong?

Thanks

Redis backend for LSI

We have successfully abstracted the data structure of Bayes, which allowed us to implement an alternate Redis storage backend (#81) while making it possible to easily add more backends (such as ORM-based ones). However, I found it a little difficult to follow the flow of the LSI implementation well enough to understand all the data structures it needs. Can someone give a high-level overview of LSI's data structures, their relationships, and the desired operations?

As a side note, can we please make sure to abstract the data structure away from the logic from day one for every new algorithm we implement, as indicated in #88.

Issue with sqrt

When I execute Jekyll with LSI enabled (using --lsi) I run into an error:

Math::DomainError: Numerical argument is out of domain - "sqrt"

I believe the error occurs in line 58 of vector.rb. Apparently, the input number is negative and this causes the error. As also mentioned here, this can be fixed by using CMath.sqrt instead of Math.sqrt to use complex numbers.

In my case, the issue shows up when I have duplicate posts (I had them because I was testing the pagination).

Warning on lsi.rb line 237

Hello guys,

I recently set up LSI for my Octopress application, but the following warnings always appear and I don't know why:
/Users/user/.rvm/gems/ruby-2.2.1/gems/classifier-reborn-2.0.3/lib/classifier-reborn/lsi.rb:237: warning: Comparable#== will no more rescue exceptions of #<=> in the next release.

/Users/user/.rvm/gems/ruby-2.2.1/gems/classifier-reborn-2.0.3/lib/classifier-reborn/lsi.rb:237: warning: Return nil in #<=> if the comparison is inappropriate or avoid such comparison.

Encoding::CompatibilityError for hasher

I'm using an older version of Ruby (1.9.3) and I'm getting the following error. I noticed that the hasher.rb file does not have an encoding specified.
May I fix it and send a PR?

Thanks guys.

Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
/Users/joao/.rvm/gems/ruby-1.9.3-p551@admin-novo/bundler/gems/classifier-reborn-e832ae342ff5/lib/classifier-reborn/extensions/hasher.rb:23:in `gsub'
/Users/joao/.rvm/gems/ruby-1.9.3-p551@admin-novo/bundler/gems/classifier-reborn-e832ae342ff5/lib/classifier-reborn/extensions/hasher.rb:23:in `clean_word_hash'
/Users/joao/.rvm/gems/ruby-1.9.3-p551@admin-novo/bundler/gems/classifier-reborn-e832ae342ff5/lib/classifier-reborn/extensions/hasher.rb:16:in `word_hash'
/Users/joao/.rvm/gems/ruby-1.9.3-p551@admin-novo/bundler/gems/classifier-reborn-e832ae342ff5/lib/classifier-reborn/bayes.rb:50:in `train'


Remove enable_threshold config from Bayes classifier

Currently, the threshold option of the Bayes classifier only makes sense if the enable_threshold option is also set to true (at initialization or later, dynamically). This tight interdependence makes me think that the threshold-enabler flag is overhead. However, by making the default value of the threshold property Float::INFINITY, we can cleverly get rid of the enable_threshold option. Let's have a look at the current implementation of the classify method:

def classify(text)
  result, score = classify_with_score(text)
  result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
  result
end

If we change it to something like this:

def classify(text)
  result, score = classify_with_score(text)
  score < @threshold ? nil : result
end

The modified method will behave exactly the same as the current implementation with threshold_enabled set to true, provided the default value of threshold is Float::INFINITY.

However, there is a case where it will not behave the same as the current implementation. Suppose threshold_enabled is set to false and classify_with_score returns a class with score Infinity. In this case the current implementation would return that class, while the modified method would return nil instead. This is obviously a breaking change. However, I think the current behavior without the threshold in action is broken anyway, so we might as well get rid of it completely. Here is what I mean by broken:

[1] pry(main)> require 'classifier-reborn'
[2] pry(main)> b = ClassifierReborn::Bayes.new 'Foo', 'Bar'
[3] pry(main)> b.classifications("Baz")
=> {"Foo"=>Infinity, "Bar"=>Infinity}
[4] pry(main)> b.classify_with_score("Baz")
=> ["Foo", Infinity]

As illustrated above, for the text "Baz" the category Foo is as "irrelevant" as Bar, so why should we return one but not the other? Hence nil is a better choice here, in my opinion. One could argue that a score collision could happen even when the scores are not Infinity, but the keyword here is "irrelevant": with real score values, the classification text vector is the same distance from the competing category vectors, so they are equally relevant and picking either one is equally good. However, if the score is Infinity, then saying one Infinity is equal to the other Infinity is mathematically incorrect.

With this proposal in place, we can get rid of many unnecessary attributes and convenience methods (including threshold_enabled, threshold, threshold=, threshold_enabled?, threshold_disabled?, enable_threshold, and disable_threshold). I am not sure how useful these dynamic enablers and disablers are when the user has access to the raw scores for making custom decisions. This would make the classifier API simpler for consumers, easier to document, and easier to maintain.

undefined method `reduce'

I'm using jekyll (2.5.1) and after upgrading ruby-classifier-reborn to 2.0.2 it errors out:
Error: undefined method `reduce' for #GSL::Vector...

I have no idea where this comes from. All I can tell is that in 2.0.2 a few commits replaced sum with .reduce(0, :+) (926fc3f & 555d4a3), and that might be the source of this. I've downgraded ruby-classifier-reborn back to 2.0.1 for now.

FYI I have ruby-rb-gsl 1.16.0.3 & gsl 1.16 (on Arch Linux x64)

LSI is broken af.

I'm on Ruby 2.2.4. I'm trying to use LSI. Nothing works, and the error messages SUCK. I've tried both the last release (i.e. the gem version) and the latest commit from Github.

lsi = ClassifierReborn::LSI.new
training_data = ["Bcom", "Corporate Administration", "Forensic Auditing"]
category = :accounting
training_data.each do |d|
  begin
    lsi.add_item(d, category)
  rescue StandardError => e
    puts "#{d} misbehaving: #{e.message}"
  end
end

#=> Forensic Auditing misbehaving: comparison of Float with NaN failed

Better yet, if I swap the order of the training data, I get this:

lsi = ClassifierReborn::LSI.new
training_data = ["Corporate Administration", "Forensic Auditing", "Bcom"]
category = :accounting
training_data.each do |d|
  begin
    lsi.add_item(d, category)
  rescue StandardError => e
    puts "#{d} misbehaving: #{e.message}"
  end
end

#=> Forensic Auditing misbehaving: comparison of Float with NaN failed
#=> Bcom misbehaving: comparison of Float with NaN failed

Feature Request: save/load Bayesian classifier

Someone mentioned this in a comment within the last few months, so I thought I'd just consolidate it here. What I'm thinking about is ClassifierReborn#load and ClassifierReborn#save instance methods that serialize the classifier from/to a compressed file. It would look something like this:

b = ClassifierReborn.new load: 'path/to/classifier_data.zip'
# and
b = ClassifierReborn.new 'Ham', 'Spam'
b.train :ham, 'good stuff'
b.train :spam, 'processed stuff'
b.save 'path/to/classifier_data.zip'
# or
b.save # to use a default which could be the place from which the serialized data was loaded on initialization
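
Until such an API exists, a hedged workaround in the spirit of the Marshal usage shown elsewhere on this page (no compression; the file path is illustrative):

require 'classifier-reborn'

b = ClassifierReborn::Bayes.new 'Ham', 'Spam'
b.train 'Ham', 'good stuff'
b.train 'Spam', 'processed stuff'

# Save the whole classifier with Marshal, then restore it later.
File.binwrite('classifier_data.dat', Marshal.dump(b))
b = Marshal.load(File.binread('classifier_data.dat'))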

Weird behavior when one category is empty

So, this is an example

[2] pry(main)> require 'classifier-reborn'
=> true
[3] pry(main)> number_finder = ClassifierReborn::Bayes.new 'a_number', 'not_a_number'
=> #<ClassifierReborn::Bayes:0x00000002d1d330 @categories={:"A number"=>{}, :"Not a number"=>{}}, @category_counts={}, @category_word_count={}, @total_words=0>
[4] pry(main)> number_finder.train_a_number('1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20')
=> ["1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20"]
[5] pry(main)> number_finder
=> #<ClassifierReborn::Bayes:0x00000002d1d330 @categories={:"A number"=>{}, :"Not a number"=>{}}, @category_counts={:"A number"=>1}, @category_word_count={:"A number"=>0}, @total_words=0>
[6] pry(main)> number_finder.train_a_number('2')
=> ["2"]
[7] pry(main)> number_finder.train_a_number('3')
=> ["3"]
[8] pry(main)> number_finder.train_a_number('4')
=> ["4"]
[9] pry(main)> number_finder.train_a_number('5')
=> ["5"]
# More lines...
[11] pry(main)> number_finder.train_a_number('one')
=> ["one"]
[12] pry(main)> number_finder.train_a_number('two')
=> ["two"]
[13] pry(main)> number_finder.classify("5")
=> "A number"
[14] pry(main)> number_finder.classify("15 and more numbers")
=> "A number"
[15] pry(main)> number_finder.classify("numbers")
=> "A number"
[16] pry(main)> number_finder.classify("lol wut ?")
=> "A number"
[17] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "A number"
[18] pry(main)> number_finder.classify("")
=> "A number"

After training not_a_number, it works as expected:

[21] pry(main)> number_finder.train_not_a_number("?")
=> ["?"]
[22] pry(main)> number_finder.train_not_a_number("!")
=> ["!"]
[23] pry(main)> number_finder.train_not_a_number("a b c d e f g h i j k l m n o p")
=> ["a b c d e f g h i j k l m n o p"]
[24] pry(main)> number_finder.train_not_a_number("those are also not numbers")
=> ["those are also not numbers"]
[25] pry(main)> number_finder.classify("")
=> "A number"
[26] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "Not a number"

I'm new to the idea of classifiers, so maybe this is intentional; it just looks strange to me.
