EasyAnalyzer

A dictionary-based Chinese word segmenter built on the finite-state machine implementation in Lucene.

FST: finite state transducer; see org.apache.lucene.util.fst.FST

Dictionary:

湖北
工业
大学
湖北工业
工业大学
湖北工业大学
学生
大学生
工业大学生
奥迪Q5
奥迪
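To illustrate how a dictionary like the one above supports prefix lookups, here is a minimal sketch that loads the same words into a plain character trie. The trie is only a simplified stand-in for Lucene's FST, not the project's implementation, and the names `TrieDictionary`, `add`, `wordsStartingAt`, and `sample` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the FST: a plain character trie over the dictionary. */
final class TrieDictionary {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    /** Adds one dictionary word, creating trie nodes as needed. */
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Returns every dictionary word that starts at text[start], shortest first. */
    List<String> wordsStartingAt(String text, int start) {
        List<String> found = new ArrayList<>();
        Node node = root;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break; // no dictionary word continues with this character
            }
            if (node.isWord) {
                found.add(text.substring(start, i + 1));
            }
        }
        return found;
    }

    /** Builds a trie holding the sample dictionary from this README. */
    static TrieDictionary sample() {
        TrieDictionary dict = new TrieDictionary();
        String[] words = {"湖北", "工业", "大学", "湖北工业", "工业大学", "湖北工业大学",
                "学生", "大学生", "工业大学生", "奥迪Q5", "奥迪"};
        for (String w : words) {
            dict.add(w);
        }
        return dict;
    }

    public static void main(String[] args) {
        // Offset 3 is the character 湖 in the sample sentence.
        System.out.println(sample().wordsStartingAt("奥迪Q湖北工业大学生", 3));
    }
}
```

A single walk down the trie yields all dictionary words sharing a start position, which is the information each matching mode needs before deciding which of them to emit.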

Test code:

import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Attribute;

// The *Analyzer classes used below are provided by EasyAnalyzer; import them from the project's packages.
public class Test {
    public static void main(String[] args) throws IOException {
        System.out.print("Prefix-word matching: ");
        token(PrefixWordFSTAnalyzer.create("test/", true));
        System.out.print("Prefix-first matching: ");
        token(PrefixFirstAnalyzer.create("test/", true));

        System.out.print("Longest match: ");
        token(CompleteFSTAnalyzer.create("test/", true));

        System.out.print("Shortest match: ");
        token(ShortestFSTAnalyzer.create("test/", true));

        System.out.print("Maximum-count matching: ");
        token(MaxCountAnalyzer.create("test/", true));
    }

    // Tokenizes a fixed sample string and prints each token on one line.
    private static void token(Analyzer analyzer) throws IOException {
        final TokenStream tokenStream = analyzer.tokenStream("test", "奥迪Q湖北工业大学生");
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            // Print only the first attribute registered on the stream.
            final Iterator<Class<? extends Attribute>> iterator = tokenStream.getAttributeClassesIterator();
            if (iterator.hasNext()) {
                System.out.print(tokenStream.getAttribute(iterator.next()));
                System.out.print(' ');
            }
        }
        tokenStream.end();
        System.out.println();
        tokenStream.close();
    }

}

Segmentation results:

Prefix-word matching

奥迪 Q 湖北工业大学 湖北工业 湖北 生

Prefix-first matching

奥迪Q 奥迪 湖北工业大学 湖北工业 湖北 生

Longest match

奥迪 Q 湖北工业大学 生

Shortest match

奥迪 Q 湖北 工业 大学 生

Maximum-count matching

奥迪 Q 湖北工业大学 湖北工业 湖北 工业大学生 工业大学 工业 大学生 大学
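The longest-match and shortest-match listings above are consistent with classic greedy left-to-right dictionary matching. The following is a minimal sketch of that technique under stated assumptions: it uses a plain `HashSet` instead of Lucene's FST, it emits a single character when no dictionary word starts at the current position, and the names `MatchSketch`, `matchEnds`, and `segment` are hypothetical. It does not reproduce the project's analyzers and does not cover the prefix or maximum-count modes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of greedy longest/shortest dictionary matching. */
final class MatchSketch {
    // The sample dictionary from this README.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "湖北", "工业", "大学", "湖北工业", "工业大学", "湖北工业大学",
            "学生", "大学生", "工业大学生", "奥迪Q5", "奥迪"));

    /** End offsets (exclusive) of dictionary words starting at pos, shortest first. */
    static List<Integer> matchEnds(String text, int pos) {
        List<Integer> ends = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            if (DICT.contains(text.substring(pos, end))) {
                ends.add(end);
            }
        }
        return ends;
    }

    /** Segments text greedily, taking the longest or the shortest word at each position. */
    static List<String> segment(String text, boolean longest) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            List<Integer> ends = matchEnds(text, pos);
            int end;
            if (ends.isEmpty()) {
                end = pos + 1; // assumption: unmatched text is emitted one character at a time
            } else {
                end = longest ? ends.get(ends.size() - 1) : ends.get(0);
            }
            tokens.add(text.substring(pos, end));
            pos = end; // continue after the emitted token, so tokens never overlap
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "奥迪Q湖北工业大学生";
        System.out.println(String.join(" ", segment(text, true)));  // 奥迪 Q 湖北工业大学 生
        System.out.println(String.join(" ", segment(text, false))); // 奥迪 Q 湖北 工业 大学 生
    }
}
```

The only difference between the two modes is which candidate end offset is chosen at each position; the scan itself is identical, which is why a shared prefix structure such as an FST suits both.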

Contributors

gaohanghbut