nbayesfilter

Naive Bayes filter for Korean bad characters. Written by Hyun Joon Seol ([email protected]).

The Problem

Typographical errors are among the most common problems in data-driven processing in any language, and Korean is no exception. Even when only a small fraction of a large corpus is erroneous, the errors become a real problem if they occur often enough to survive pruning and enter the lexicon. There they take the place of less frequently searched queries and can keep a query search from returning correct results. This project aims to detect these bad characters with a data-driven Bernoulli Naive Bayes classifier. It uses the scikit-learn package and represents each query as character-based bigrams.
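To make the approach concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier over character bigrams, written in plain Python so the mechanics are visible. It is not the project's code: the class name BernoulliBigramNB, the helper char_bigrams, and the four toy strings are all made up for illustration. With scikit-learn, roughly the same model could be built from CountVectorizer(analyzer="char", ngram_range=(2, 2), binary=True) followed by BernoulliNB.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the set of adjacent character pairs in text (presence only)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


class BernoulliBigramNB:
    """Illustrative Bernoulli Naive Bayes over character bigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        docs = [char_bigrams(t) for t in texts]
        self.vocab = sorted(set().union(*docs))
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.prob = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            # Number of documents of this class that contain each bigram.
            doc_freq = Counter(g for d in class_docs for g in d)
            denom = len(class_docs) + 2 * self.alpha
            self.prob[c] = {g: (doc_freq[g] + self.alpha) / denom
                            for g in self.vocab}
        return self

    def log_scores(self, text):
        present = char_bigrams(text)
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            # Bernoulli model: absent bigrams contribute evidence too.
            for g in self.vocab:
                p = self.prob[c][g]
                score += math.log(p if g in present else 1.0 - p)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data, invented for this sketch (not taken from the corpus).
texts = ["가나", "나다", "갃나", "낛다"]
labels = ["correct", "correct", "error", "error"]
model = BernoulliBigramNB().fit(texts, labels)
print(model.predict("갃나"))
```

The Bernoulli variant only records whether a bigram occurs in a query, not how often, which suits short search queries where repeated bigrams are rare.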

Training Corpus

The training corpus consists of two files, correct.txt and error.txt. The first contains manually checked queries that seem to be wrong but are correct; the second contains queries that seem to be wrong and actually are wrong. The labeling was done with the help of two part-time employees. The initial training set, before labeling, comes from a quick script that detects characters that are uncommon in Korean (double final consonants, uncommon diphthongs, etc.). These characters are defined in the file bad.py.
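A candidate-detection script of this kind could look like the sketch below. It relies on the standard Unicode arithmetic for precomposed Hangul syllables (code points U+AC00 to U+D7A3) to split each syllable into its vowel and final consonant. The contents of bad.py are not shown here, so the sets DOUBLE_FINALS and RARE_VOWELS and the function names are assumptions chosen for illustration, not the project's actual definitions.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable

# Jamo tables in Unicode composition order (21 vowels, 28 finals).
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Assumed "uncommon" sets; the real ones live in bad.py.
DOUBLE_FINALS = set("ㄳㄵㄶㄺㄻㄼㄽㄾㄿㅀㅄ")
RARE_VOWELS = set("ㅒㅙㅞ")  # hypothetical choice


def decompose(ch):
    """Return (vowel, final) for a Hangul syllable, else None."""
    cp = ord(ch)
    if not HANGUL_BASE <= cp <= HANGUL_LAST:
        return None
    index = cp - HANGUL_BASE
    # Each initial consonant spans 21 * 28 = 588 code points.
    return MEDIALS[(index % 588) // 28], FINALS[index % 28]


def suspicious_chars(query):
    """List the syllables in query that match the assumed rules."""
    flagged = []
    for ch in query:
        parts = decompose(ch)
        if parts is None:
            continue  # not a precomposed Hangul syllable
        vowel, final = parts
        if final in DOUBLE_FINALS or vowel in RARE_VOWELS:
            flagged.append(ch)
    return flagged


print(suspicious_chars("없다"))
```

Note that 없 is flagged even though 없다 is a perfectly ordinary word. A rule-based filter like this only produces candidates, which is why the queries it selects still need manual labeling into correct.txt and error.txt.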
