
chinese-word-segmentation's Introduction

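Chinese Word Segmentation

A native PHP library for Chinese word segmentation, with support for custom dictionaries.

Holding the whole dictionary in native PHP is very memory-hungry: storing the full dictionary as a directed acyclic graph (DAG) takes more than 150 MB of memory.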

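  • Experimenting with ways to store the dictionary structure in less memory
  • Compute TF-IDF values for the segmented words
  • Wrap commonly used part-of-speech filters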
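To make the memory cost more concrete, here is a minimal sketch of a dictionary stored as a trie of nested PHP arrays. It is an illustration under stated assumptions, not the library's TrieTree implementation: the helper names (splitChars, trieInsert, trieHas) and the '__end__' marker key are hypothetical. Every character of every word becomes its own array node, which is one plausible reason a large dictionary gets expensive in plain PHP.

```php
<?php
// Hypothetical helpers for illustration; not the library's TrieTree API.

/** Split a UTF-8 string into individual characters. */
function splitChars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * Insert one word into a trie stored as nested arrays.
 * The multi-character key '__end__' marks the end of a word and holds its
 * payload; it cannot collide with the single-character keys used for nodes.
 */
function trieInsert(array &$trie, string $word, string $payload = ''): void
{
    $node = &$trie;
    foreach (splitChars($word) as $char) {
        if (!isset($node[$char])) {
            $node[$char] = [];
        }
        $node = &$node[$char];
    }
    $node['__end__'] = $payload;
}

/** Return true if $word was inserted as a complete word. */
function trieHas(array $trie, string $word): bool
{
    $node = $trie;
    foreach (splitChars($word) as $char) {
        if (!isset($node[$char])) {
            return false;
        }
        $node = $node[$char];
    }
    return array_key_exists('__end__', $node);
}

$trie = [];
$before = memory_get_usage();
trieInsert($trie, '北京', '34488 ns');
trieInsert($trie, '北', '17860 ns');
trieInsert($trie, '京', '6583 ns');
printf("Trie with 3 words uses about %d bytes\n", memory_get_usage() - $before);
```

Calling trieHas($trie, '北京') returns true, while a prefix that was never inserted as a word does not match. Measuring memory_get_usage() around a larger load gives a rough idea of how the per-node overhead adds up.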

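Customization

Both time and memory are dominated by loading the dictionary, so workloads that segment large amounts of text are only suited to CLI mode.

For example, when analyzing whether e-commerce reviews are positive or negative, it is best to design your own dictionary. Sample dict.txt content (the word frequency and part-of-speech fields are optional):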

很好 1000 l
太好了 3 l
很实用 12 l
....
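As a rough illustration of the line format above, the following sketch parses one dictionary line into its word, frequency, and part-of-speech fields. The function name parseDictLine, the default frequency of 1, and the handling of a line with a tag but no frequency are assumptions made for this example; the library's own loader may behave differently.

```php
<?php
/**
 * Parse one dictionary line of the form "word [frequency] [pos]".
 * Hypothetical helper: the defaults (frequency 1, empty tag) are assumptions.
 * Returns null for a blank line.
 */
function parseDictLine(string $line): ?array
{
    $parts = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($parts) === 0) {
        return null;
    }

    $word = $parts[0];
    $freq = 1;   // assumed default when no frequency is given
    $pos  = '';  // assumed default when no part-of-speech tag is given

    if (isset($parts[1])) {
        if (ctype_digit($parts[1])) {
            $freq = (int) $parts[1];
            $pos  = $parts[2] ?? '';
        } else {
            // Second field is not numeric, so treat it as the tag.
            $pos = $parts[1];
        }
    }

    return ['word' => $word, 'freq' => $freq, 'pos' => $pos];
}

print_r(parseDictLine('很好 1000 l'));
print_r(parseDictLine('太好了'));
```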

Example

<?php
require_once '../vendor/autoload.php';

// The loaded dictionary exceeds PHP's default 128M memory_limit.
ini_set('memory_limit', '1024M');

echo "Initial: " . (memory_get_usage() / 1024 / 1024) . "MB\n";

// Loading the dictionary is the slow, memory-heavy step.
$dict = new \ChineseWordSegmentation\TrieTree();
$dict->load();

echo "In use: " . (memory_get_usage() / 1024 / 1024) . "MB\n";
echo "Peak: " . (memory_get_peak_usage() / 1024 / 1024) . "MB\n";

$str = "我爱北京天安门";
$tags = $dict->extract($str);

echo "Sentence: " . $str . "\nSegments found (frequency, part of speech):\n";
print_r($tags);

Result

Initial: 0.89760589599609MB
In use: 161.15187835693MB
Peak: 161.16103363037MB
Sentence: 我爱北京天安门
Segments found (frequency, part of speech):
Array
(
    [我] => 328841 r
    [爱] => 14878 v
    [北] => 17860 ns
    [北京] => 34488 ns
    [京] => 6583 ns
    [天] => 35979 q
    [天安] => 273 nz
    [天安门] => 34010 ns
    [安] => 8837 v
    [门] => 39823 n
)
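The result contains overlapping entries such as 北, 北京 and 京, which suggests that extract() reports every dictionary word found in the sentence and does not pick a single best segmentation. A naive way to produce the same kind of output is to test every substring up to a maximum length against the dictionary. The sketch below does that with a plain array as the dictionary; findAllWords and its $maxLen parameter are hypothetical, and the library itself may work differently (for example by walking its trie).

```php
<?php
/**
 * Return every dictionary word that occurs as a substring of $text.
 * $dict maps word => "frequency pos", mirroring the sample output format.
 * Hypothetical helper for illustration only.
 */
function findAllWords(string $text, array $dict, int $maxLen = 4): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    $found = [];

    for ($start = 0; $start < $count; $start++) {
        $candidate = '';
        // Grow the candidate one character at a time from this start position.
        for ($len = 1; $len <= $maxLen && $start + $len <= $count; $len++) {
            $candidate .= $chars[$start + $len - 1];
            if (isset($dict[$candidate])) {
                $found[$candidate] = $dict[$candidate];
            }
        }
    }

    return $found;
}

// A tiny dictionary using values taken from the sample output.
$sampleDict = [
    '我'   => '328841 r',
    '爱'   => '14878 v',
    '北'   => '17860 ns',
    '北京' => '34488 ns',
    '京'   => '6583 ns',
];

print_r(findAllWords('我爱北京', $sampleDict));
```

Because the scan moves left to right and grows each candidate from its start position, the matches come out in the same relative order as in the sample output: 我, 爱, 北, 北京, 京.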

With a simple part-of-speech filter and TF-IDF scores for the segmented words, you can extract the content you need.
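The library does not yet ship these two steps, so the following is only a sketch of what they could look like. It assumes that each value has the "frequency pos" string format shown in the sample output; filterByPos and tfIdf are hypothetical names. The TF-IDF variant used here is tf = count / document length with a smoothed idf = log((1 + N) / (1 + df)) + 1. Other weightings are equally valid.

```php
<?php
/**
 * Keep only entries whose part-of-speech tag is in $allowed.
 * Assumes values look like "34488 ns" (frequency, then tag).
 */
function filterByPos(array $tags, array $allowed): array
{
    return array_filter($tags, function (string $value) use ($allowed): bool {
        $parts = explode(' ', $value);
        return in_array($parts[1] ?? '', $allowed, true);
    });
}

/**
 * Score each term of $doc (a list of tokens) against $corpus
 * (a list of token lists), highest score first.
 */
function tfIdf(array $doc, array $corpus): array
{
    $docCount = count($corpus);
    $scores = [];

    foreach (array_count_values($doc) as $term => $count) {
        $term = (string) $term; // numeric tokens come back as int keys
        $df = 0;
        foreach ($corpus as $other) {
            if (in_array($term, array_map('strval', $other), true)) {
                $df++;
            }
        }
        $tf  = $count / count($doc);
        $idf = log((1 + $docCount) / (1 + $df)) + 1; // smoothed, always positive
        $scores[$term] = $tf * $idf;
    }

    arsort($scores);
    return $scores;
}

$tags = ['我' => '328841 r', '北京' => '34488 ns', '爱' => '14878 v'];
print_r(filterByPos($tags, ['ns', 'v']));

$corpus = [['很好', '实用'], ['很好', '太贵'], ['很好', '实用', '很好']];
print_r(tfIdf($corpus[0], $corpus));
```

In the small corpus above, 很好 appears in every document and therefore scores lower than 实用, which is the behavior you want when picking out the distinctive words of a review.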

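Alternatives

The PHP extension scws, although customizing its dictionary is cumbersome.

The Python library jieba (结巴). This library's /dict/dict.txt also uses the dictionary from that project.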
