Comments (18)
@Ch4s3 I'm sorry, I've been very busy over the last month, but I'm still interested in contributing. I already have some code for this in a personal project, and I intend to move it here and improve its quality, which is currently quite poor. I'll submit a pull request addressing this issue as soon as possible.
from classifier-reborn.
Thanks @marciovicente, I would appreciate if you could have a look at #142 and provide feedback on what else can possibly be done beyond what is implemented and planned so far.
That seems interesting. I'm not sure I would know how to do that correctly myself, but if you're willing to take a stab at it, I'm happy to answer questions about the code base.
@marciovicente are you still interested in doing this?
@marciovicente Awesome, there's no huge rush to get it in. I don't have any looming release deadlines so I'll look it over as it comes in.
@Ch4s3 Just to keep you updated: I've created the confusion matrix for binary samples, generated when the user calls the `validate` method (`bayes.validate(validate_sample)`).
I'm now working on precision, recall, and F-measure metrics, and I hope to submit a pull request soon.
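These metrics fall out of a binary confusion matrix directly. A minimal Ruby sketch, illustrative only (not the code from the eventual pull request), assuming a matrix keyed as `{actual => {predicted => count}}` with `:t` as the positive class:

```ruby
# Illustrative: derive precision, recall, and F-measure from a binary
# confusion matrix of the form {actual => {predicted => count}}.
def binary_metrics(conf_mat, positive: :t, negative: :f)
  tp = conf_mat[positive][positive]  # true positives
  fp = conf_mat[negative][positive]  # predicted positive, actually negative
  fn = conf_mat[positive][negative]  # predicted negative, actually positive
  precision = tp.to_f / (tp + fp)
  recall    = tp.to_f / (tp + fn)
  f_measure = 2 * precision * recall / (precision + recall)
  { precision: precision, recall: recall, f_measure: f_measure }
end

conf_mat = { t: { t: 8, f: 2 }, f: { t: 1, f: 9 } }
metrics = binary_metrics(conf_mat)
# precision = 8/9, recall = 8/10
```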
@marciovicente ping
@Ch4s3 As I reported in another issue, #76, there seems to be a problem with the Bayes classification. Unfortunately, I can't share my data right now because it is private. :/
Ok, keep me updated and I'll try to get to the bottom of it.
@marciovicente Can you take a look at #92? We added a big test case using real spam filtering data, and we're able to match the expected results. It would be great if we could find cases where we don't perform as well.
@Ch4s3 I created a gist with some samples of my data. Look at the last line; there's one record with the class `:m`. Except for it, all the lines are either `:f` or `:t`.
https://gist.github.com/marciovicente/f836c302697a786c12bb7721fdd5dd2c
I am glad that I saw this ticket. I have been thinking about validation for the last few weeks. In fact, I wrote some code to compute the confusion matrix when I wrote the integration test with a real dataset in #92. But I was not sure where that code should reside, and what exactly one would want to validate. For example:
- A user might want to compare more than one classifier to see how each of them performs in terms of accuracy and their tendency towards false negatives or false positives (depending on the application, one might be less desirable than the other), as well as how much they cost in training and classification time and memory.
- Another intent could be to gather statistics on a classifier in action, to decide whether enough training has been done or more training is required to reach the desired accuracy.

Based on the intent, the API might change. Here are a few options that I considered:
- Make a Rake task that reports statistics for a classifier against a given dataset. We can use the included SMS dataset, allow the user to pass a data file formatted the same way, or load multi-line data from individual text or HTML files organized in sub-folders where the folder name is the class. This mechanism can only satisfy the first intent.
- Write a test/benchmark that reports statistics for a classifier against some included datasets. This is very similar to the Rake task, except it is more limited because the user can't pass a custom dataset. It can be helpful for measuring the cost associated with each algorithm.
- Add an evaluation method (say, `evaluate()`) to each classifier class, to be called on the trained object with supplied test data as an array of arrays `[[category, record]...]`. This would allow a program to interact with the method while the classifier instance (model) is being trained. It would also separate out the concern of how the supplied data is originally stored (spreadsheet, CSV/TSV, individual files, URLs, etc.), as the required formatting would be done before calling the `evaluate()` method. This serves the second intent described above very well, but it has a few limitations: the evaluation logic would be duplicated in each classifier class we implement, and `k-fold` style cross-validation won't be possible (or at least not in a sensible way).
- Add a separate module for validation. This module would expose methods as illustrated below:
```ruby
module ClassifierReborn
  module ClassifierValidator
    module_function

    def evaluate(classifier, test_data)
      conf_mat = {}
      categories = classifier.categories
      categories.each do |actual|
        conf_mat[actual] = {}  # initialize the inner hash before filling it
        categories.each do |predicted|
          conf_mat[actual][predicted] = 0
        end
      end
      test_data.each do |rec|
        conf_mat[rec.first][classifier.classify(rec.last)] += 1
      end
      conf_mat
    end

    def validate(classifier, training_data, test_data)
      classifier.reset
      training_data.each do |rec|
        classifier.train(rec.first, rec.last)
      end
      evaluate(classifier, test_data)
    end

    def cross_validate(classifier, sample_data, fold = 10, *options)
      classifier = ClassifierReborn.const_get(classifier).new(*options) if classifier.is_a?(String)
      sample_data.shuffle!
      partition_size = sample_data.length / fold
      partitioned_data = sample_data.each_slice(partition_size).take(fold)
      conf_mats = []
      fold.times do |i|
        training_data = partitioned_data.take(fold)  # shallow copy of the partition list
        test_data = training_data.slice!(i)          # hold out the i-th partition
        conf_mats << validate(classifier, training_data.flatten(1), test_data)
      end
      classifier.reset
      generate_stats(*conf_mats)
      # Optionally, generate time and memory profiles for individual and accumulated iterations
    end

    def generate_stats(*conf_mats)
      # Derive various statistics for one or more supplied confusion matrices
      # Report summary based on individual and accumulated confusion matrices
    end
  end
end
```
In my opinion, this is the best way to go as it covers all the intents described earlier. Additionally, Rake tasks, tests, and benchmarks can utilize it if needed.
- To measure the performance of a populated classifier model, use `evaluate()` by supplying the classifier instance and test data (for instance, to know when to stop training, or to check whether more training is needed for a desired accuracy).
- To validate based on a manually decided split of training_data and test_data, or to implement one of the many known validation methods, use `validate()` on an initialized classifier.
- To validate using the most popular `k-fold` validation method, use `cross_validate()` by supplying sample_data and an initialized classifier instance (or, optionally, the name of the classifier, such as `Bayes` or `LSI`, to let the method create the instance). Partitioning the sample_data and calling the `validate()` method on the partitions `k` times is done internally.

One or more confusion matrices generated by manually calling the `evaluate()` or `validate()` methods can be supplied to the `generate_stats()` method to get a nice statistical summary; this method is called automatically if `cross_validate()` is used.
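As an illustration of what `generate_stats()` might derive (a sketch, not the eventual implementation), per-fold and overall accuracy can be computed from the confusion matrices, with the overall figure taken from their element-wise accumulation:

```ruby
# Sketch: derive accuracy from confusion matrices of the form
# {actual => {predicted => count}}, as produced by evaluate().
def accuracy(conf_mat)
  correct = conf_mat.sum { |actual, row| row[actual] }  # diagonal counts
  total   = conf_mat.sum { |_, row| row.values.sum }
  correct.to_f / total
end

def generate_stats(conf_mats)
  accumulated = Hash.new { |h, k| h[k] = Hash.new(0) }
  conf_mats.each do |cm|
    cm.each { |actual, row| row.each { |predicted, n| accumulated[actual][predicted] += n } }
  end
  { per_fold: conf_mats.map { |cm| accuracy(cm) }, overall: accuracy(accumulated) }
end

mats = [
  { t: { t: 4, f: 1 }, f: { t: 0, f: 5 } },  # fold 1: 9/10 correct
  { t: { t: 3, f: 2 }, f: { t: 1, f: 4 } }   # fold 2: 7/10 correct
]
stats = generate_stats(mats)
# stats[:overall] is 16/20 = 0.8
```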
This implementation assumes that a classifier is validatable if it responds to the `categories()`, `train()`, `classify()`, and `reset()` methods. The last one is important because we now have persistent storage backend support, which needs to be cleared before any training is done for validation. This can be made optional, so that it is only called if the classifier responds to it. Unfortunately, `LSI` does not implement a `train()` method; instead it has a corresponding `add_item()` method with a different parameter signature. However, to make the API uniform, we can add a method with a `train([categories...], text)` signature that internally calls `add_item()`.
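To make the proposed k-fold flow concrete, here is a self-contained sketch of the partitioning and evaluation logic, using a hypothetical stub classifier in place of a real one and skipping the shuffle so the example stays deterministic:

```ruby
# A stub satisfying the validatable contract (categories/train/classify/reset).
# It learns nothing and always predicts :ham; a real classifier would train.
class StubClassifier
  def categories; [:ham, :spam]; end
  def train(category, text); end
  def classify(text); :ham; end
  def reset; end
end

# Build a confusion matrix {actual => {predicted => count}}.
def evaluate(classifier, test_data)
  conf_mat = {}
  classifier.categories.each do |actual|
    conf_mat[actual] = classifier.categories.to_h { |predicted| [predicted, 0] }
  end
  test_data.each { |category, text| conf_mat[category][classifier.classify(text)] += 1 }
  conf_mat
end

# Partition the sample, hold out one partition per iteration, train on
# the rest, and collect one confusion matrix per fold.
def cross_validate(classifier, sample_data, fold = 5)
  partition_size = sample_data.length / fold
  partitions = sample_data.each_slice(partition_size).take(fold)
  fold.times.map do |i|
    training = partitions.take(fold)  # shallow copy of the partition list
    test     = training.slice!(i)     # hold out the i-th partition
    classifier.reset
    training.flatten(1).each { |category, text| classifier.train(category, text) }
    evaluate(classifier, test)
  end
end

data = Array.new(10) { |i| [i.even? ? :ham : :spam, "message #{i}"] }
conf_mats = cross_validate(StubClassifier.new, data, 5)
# Each fold holds one :ham and one :spam record; the stub maps both to :ham.
```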
I think the last approach seems to be the best.
@Ch4s3:
> I think the last approach seems to be the best.

I will send a PR later this week then.
awesome thanks @ibnesayeed!
Wow! Nice job, @ibnesayeed! 👏
@marciovicente Would you like to have a look at the validation documentation and provide feedback?
@Ch4s3 If the documentation is satisfactory, then this issue can be closed.
thanks @ibnesayeed