
yxk9810 / chinese-tokenization

This project forked from jackhcc/chinese-tokenization


Chinese word segmentation implemented with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.).

Python 90.73% Perl 9.27%

chinese-tokenization's Introduction

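Chinese Word Segmentation for Natural Language Processing

Method Overview

  • Traditional algorithms: Chinese word segmentation using N-gram, HMM, maximum entropy, CRF, etc.
  • Neural network methods: CNN, Bi-LSTM, Transformer, etc.
  • Pre-trained language model methods: BERT, etc.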
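As an illustration of the simplest traditional approach listed above, the following is a minimal sketch of unigram segmentation: dynamic programming picks the split whose words have the highest total log-probability. The function name `unigram_segment`, the toy frequency table, the `max_word_len` limit, and the penalty for unknown characters are all assumptions made for this example, not the repository's code.

```python
import math


def unigram_segment(sentence, word_freq, max_word_len=4):
    """Split `sentence` into the word sequence with the highest unigram probability."""
    total = sum(word_freq.values())
    # Assumed penalty for a single character that is not in the dictionary.
    unk_logp = math.log(1.0 / (total * 10))

    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-probability of sentence[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i]: start index of the last word ending at i

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            if word in word_freq:
                logp = math.log(word_freq[word] / total)
            elif end - start == 1:
                logp = unk_logp       # fall back to a single unknown character
            else:
                continue              # unknown multi-character strings are not candidates
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Toy counts, invented for illustration only.
toy_freq = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "生": 1, "的": 10, "起源": 3}
print(unigram_segment("研究生命的起源", toy_freq))
```

With these toy counts the split 研究 / 生命 / 的 / 起源 scores higher than 研究生 / 命 / 的 / 起源, because the product of the word frequencies is larger.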

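Dataset Overview

  • PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005. They are also the standard benchmarks used in academia to evaluate word segmentation tools.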
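Sequence-labeling models such as HMM, CRF, and Bi-LSTM are commonly trained on per-character tags rather than on word lists. The sketch below assumes the training text has one sentence per line with words separated by whitespace, and converts each line to characters plus B/M/E/S tags (begin, middle, end, single). Both the input format and the BMES scheme are assumptions for this example; the source does not state which tagging scheme the repository uses, and the helper names are hypothetical.

```python
def words_to_bmes(words):
    """Map a list of words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character B, last character E, everything between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def line_to_example(line):
    """Turn one whitespace-segmented line into (characters, tags)."""
    words = line.split()  # split() also collapses repeated spaces
    chars = [ch for word in words for ch in word]
    return chars, words_to_bmes(words)


chars, tags = line_to_example("研究  生命  的  起源")
print(list(zip(chars, tags)))
```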

Experimental Procedure

Experimental Results

PKU dataset

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
| Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
| HMM | 0.7936 | 0.8090 | 0.8012 |
| CRF | 0.9409 | 0.9396 | 0.9400 |
| Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
| Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
| BERT | 0.9712 | 0.9635 | 0.9673 |
| BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
| jieba | 0.8559 | 0.7896 | 0.8214 |
| pkuseg | 0.9512 | 0.9224 | 0.9366 |
| THULAC | 0.9287 | 0.9295 | 0.9291 |

MSR dataset

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
| Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
| HMM | 0.7786 | 0.8189 | 0.7983 |
| CRF | 0.9675 | 0.9676 | 0.9675 |
| Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
| Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
| BERT | 0.9841 | 0.9817 | 0.9829 |
| BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
| jieba | 0.8204 | 0.8145 | 0.8174 |
| pkuseg | 0.8701 | 0.8894 | 0.8796 |
| THULAC | 0.8428 | 0.8880 | 0.8648 |
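For readers unfamiliar with how precision, recall, and F1 are usually defined for word segmentation, here is a minimal sketch of the common word-level definition: a predicted word counts as correct only if both of its boundaries match a gold word. The source does not describe the scoring procedure behind the tables, so treating it as equivalent to this sketch is an assumption; the function names are hypothetical.

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_sentences, pred_sentences):
    """Word-level precision, recall, and F1 over parallel lists of segmented sentences."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)  # words with both boundaries right
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(segmentation_prf(gold, pred))  # 2 of 4 words match on each side
```

In this example only 的 and 起源 have matching boundaries, so precision, recall, and F1 are all 0.5.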

Contributors

jackhcc, jaclab-beta
