
bayesian's Introduction

Naive Bayesian Classification

Perform naive Bayesian classification into an arbitrary number of classes on sets of strings. bayesian also supports term frequency-inverse document frequency calculations (TF-IDF).

Copyright (c) 2011-2017. Jake Brukhman. ([email protected]). All rights reserved. See the LICENSE file for BSD-style license.


Background

This is meant to be a low-barrier-to-entry Go library for basic Bayesian classification. See the code comments for a refresher on naive Bayesian classifiers, and please take some time to understand the underflow edge cases: multiplying many small conditional probabilities can push a result below floating-point precision, which may otherwise lead to inaccurate classifications.


Installation

Using the go command:

go get github.com/jbrukh/bayesian
go install github.com/jbrukh/bayesian

Documentation

See the package documentation on GoPkgDoc.


Features

  • Conditional probability and "log-likelihood"-like scoring.
  • Underflow detection.
  • Simple persistence of classifiers (see Example 3 below).
  • Statistics.
  • TF-IDF support.

Example 1 (Simple Classification)

To use the classifier, first you must create some classes and train it:

import "github.com/jbrukh/bayesian"

const (
    Good bayesian.Class = "Good"
    Bad  bayesian.Class = "Bad"
)

classifier := bayesian.NewClassifier(Good, Bad)
goodStuff := []string{"tall", "rich", "handsome"}
badStuff  := []string{"poor", "smelly", "ugly"}
classifier.Learn(goodStuff, Good)
classifier.Learn(badStuff,  Bad)

Then you can ascertain the scores of each class and the most likely class your data belongs to:

scores, likely, _ := classifier.LogScores(
                        []string{"tall", "girl"},
                     )

Magnitude of the score indicates likelihood. Alternatively (but with some risk of float underflow), you can obtain actual probabilities:

probs, likely, _ := classifier.ProbScores(
                        []string{"tall", "girl"},
                     )
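
The likely value returned alongside the scores is an index into the classes in the order they were passed to NewClassifier, so you can map it back to a class yourself. A minimal sketch:

classes := []bayesian.Class{Good, Bad} // same order as NewClassifier(Good, Bad)
_, likely, _ := classifier.LogScores([]string{"tall", "girl"})
fmt.Println("most likely:", classes[likely])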

Example 2 (TF-IDF Support)

To use the TF-IDF classifier, first create some classes and train it; you must call ConvertTermsFreqToTfIdf() AFTER training and before calling classification methods such as LogScores, SafeProbScores, and ProbScores:

import "github.com/jbrukh/bayesian"

const (
    Good bayesian.Class = "Good"
    Bad  bayesian.Class = "Bad"
)

// Create a classifier with TF-IDF support.
classifier := bayesian.NewClassifierTfIdf(Good, Bad)

goodStuff := []string{"tall", "rich", "handsome"}
badStuff  := []string{"poor", "smelly", "ugly"}

classifier.Learn(goodStuff, Good)
classifier.Learn(badStuff,  Bad)

// Required
classifier.ConvertTermsFreqToTfIdf()

Then you can ascertain the scores of each class and the most likely class your data belongs to:

scores, likely, _ := classifier.LogScores(
    []string{"tall", "girl"},
)

Magnitude of the score indicates likelihood. Alternatively (but with some risk of float underflow), you can obtain actual probabilities:

probs, likely, _ := classifier.ProbScores(
    []string{"tall", "girl"},
)
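
If underflow is a concern, SafeProbScores reports it instead of silently returning distorted probabilities. A minimal sketch, assuming the four-value signature that appears in the issues below (and noting that, per the issue discussion, the current code returns an error on underflow rather than panicking):

probs, likely, _, err := classifier.SafeProbScores(
    []string{"tall", "girl"},
)
if err != nil {
    // Underflow detected: fall back to LogScores, which works in
    // log space and avoids multiplying many tiny probabilities.
}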

Use wisely.
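
Example 3 (Persistence)

The simple-persistence feature lets you save a trained classifier and restore it later. A minimal sketch, assuming the package's WriteToFile and NewClassifierFromFile helpers (the classifier is gob-encoded, per the issue below about encoding/gob); verify the exact signatures in the package docs:

classifier := bayesian.NewClassifier(Good, Bad)
classifier.Learn([]string{"tall", "rich", "handsome"}, Good)
classifier.Learn([]string{"poor", "smelly", "ugly"}, Bad)

// Persist the trained classifier to disk.
if err := classifier.WriteToFile("classifier.gob"); err != nil {
    log.Fatal(err)
}

// Later: restore it without retraining.
restored, err := bayesian.NewClassifierFromFile("classifier.gob")
if err != nil {
    log.Fatal(err)
}
_ = restored // ready to classify immediately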


bayesian's Issues

Release 1.0 is really old - make a new release

1.0 is still importing things like "gob" instead of "encoding/gob", etc. Can you make a new release? I can also help co-maintain the project if that helps.

Tools like dep will pick up release versions for most people, and they will get code that won't work with newer versions of Go.

Thanks!

Seen() is always 0?

package main

import (
	"log"

	"github.com/jbrukh/bayesian"
)

const (
	Arabic  bayesian.Class = "Arabic"
	Malay   bayesian.Class = "Malay"
	Yiddish bayesian.Class = "Yiddish"
)

func main() {

	nbClassifier := bayesian.NewClassifier(Arabic, Malay, Yiddish)
	arabicStuff := []string{"algeria", "bahrain", "comoros"}
	malaysianStuff := []string{"malaysians", "bahasa"}
	yiddishStuff := []string{"jewish", "jews", "israel"}
	nbClassifier.Learn(arabicStuff, Arabic)
	nbClassifier.Learn(malaysianStuff, Malay)
	nbClassifier.Learn(yiddishStuff, Yiddish)

	log.Println(nbClassifier.Learned()) // 3
	log.Printf(`SEEN: %d`, nbClassifier.Seen()) // 0
}
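
One likely explanation (an assumption based on the method names, not verified against the source here): Learned() counts documents trained on, while Seen() counts documents classified, so it stays 0 until a scoring method runs. Appending these lines to main should illustrate this:

	// Hypothetical: classifying a document should increment Seen().
	nbClassifier.LogScores([]string{"algeria"})
	log.Printf(`SEEN: %d`, nbClassifier.Seen()) // expected: 1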

what is good or bad?

Sorry, I didn't understand. After getting the result, how do I know whether it belongs to Good or Bad?

scores, likely, _ := classifier.LogScores(
                        []string{"tall", "girl"},
                     )

probs, likely, _ := classifier.ProbScores(
                        []string{"tall", "girl"},
                     )

Does likely == 1 mean Bad?
@jbrukh

Add classes after classifier creation

Apologies in advance as my knowledge of Go is still somewhat limited, so this may be a naive question.

I want to expose the naive Bayes classifier as an HTTP web service, with both train and classify endpoints. I have no trouble with that, but I want the train endpoint to be able to accept new labels (labels that aren't currently in the classifier). Right now the labels are simply specified as consts and passed into the constructor. Can you think of the best way to add labels at run-time?

Prior probability includes word frequencies?

This is more a question than an actual issue, but anyway.

First, did I get it right that the prior probability P(C_j) of a class is the number of documents within that class, divided by the total number of documents?

And if so, why does the getPriors() function set the prior probability of a class C to the number of words in documents of that class (classData.Total) divided by the total number of words? I'd expect that words play no role in the prior probability.

Probably I have a problem in understanding, so please enlighten me.
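
In symbols, the two candidate priors being contrasted (notation mine, for illustration; the question suggests getPriors() computes the word-level version):

P_{\mathrm{doc}}(C_j) = \frac{\#\{\text{documents in } C_j\}}{\#\{\text{documents total}\}}
\qquad
P_{\mathrm{word}}(C_j) = \frac{\#\{\text{words in documents of } C_j\}}{\#\{\text{words total}\}}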

Panic if underflow is detected in `SafeProbScores`

SafeProbScores ... If an underflow is detected, this method panics

Source

I am a bit confused by the comment on the method above: according to the doc, this method is supposed to panic, but the code instead returns an error.

Am I missing something?

request for a tag of an older commit

git tag -a 1.1 35eb93528ee -m "tag a specific older version that was built against"
git push --tags

In addition, it would be nice if current versions were tagged as well...

Return bayesian.Class Instead of Index?

LogScores, ProbScores, and SafeProbScores all have a return parameter that is the index of the most likely class. If you're ever willing to break the current API, I think it would be much more useful to return the actual bayesian.Class.

It would make simple usage as below much easier.

As it stands, I have to know the index at which I passed each class into the classifier. That's knowledge I'd rather not need, and it makes usage awkward.

const (
    Good bayesian.Class = "Good"
    Bad  bayesian.Class = "Bad"
    Ugly bayesian.Class = "Ugly"
)

classifier := bayesian.NewClassifier(Good, Bad, Ugly)

_, c, _, _ := classifier.SafeProbScores(wht)

if c == Ugly {
    fmt.Println("oh no")
}

Allow classifier to initialise with only one class

The current code panics if the classifier is initialised with just one class.
However, I have an edge case where there might be only a single class.
So I was thinking the classifier could perhaps be allowed to initialise with just one class.
I made the changes in the code and tested them for my use case. It worked; however, the unit tests fail in this case.
Is there any specific reason to keep this limitation?
What would be a good way to solve this, if any?
