
Comments (12)

qinwf avatar qinwf commented on September 28, 2024

Yes, that would indeed be better!

from jiebar.

qinwf avatar qinwf commented on September 28, 2024

The latest version of the C++ code already implements this fine-grained rule. It will work once I update the bundled C++ code later on.


qinwf avatar qinwf commented on September 28, 2024

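x-900 is not supported for now.

See the demo of the Python implementation, in particular the last example. This rule is too fine-grained and is probably not suitable to add.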

http://jiebademo.ap01.aws.af.cm/

x-11 = 22


qinwf avatar qinwf commented on September 28, 2024

The GitHub version has been updated; you can give it a try.

cutter = worker()
cutter["AK47 N95"]


taiyun avatar taiyun commented on September 28, 2024

Awesome!


taiyun avatar taiyun commented on September 28, 2024

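I found another problem: decimals get split apart, and the decimal point is dropped as if it were a stopword. For example, 4.55 should not be split into 4 and 55.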


qinwf avatar qinwf commented on September 28, 2024

What about setting symbol to TRUE?

> library("jiebaR")
Loading required package: jiebaRD
> cutter = worker()
> cutter["4.55"]
[1] "4"  "55"
> cutter$symbol = TRUE
> cutter["4.55"]
[1] "4.55"


qinwf avatar qinwf commented on September 28, 2024

Numbers such as 1.0 are supported. Symbols are not output by default; set symbol to TRUE to get them.

cutter = worker()
cutter$symbol = TRUE
cutter["把它升级到2.0"]
[1] "把"   "它"   "升级" "到"   "2.0"

Dates in the xxxx-xxx-xxx format are supported; this was added in a recent commit.

library(devtools)
install_github("qinwf/jiebaRD")
install_github("qinwf/jiebaR")
library("jiebaR")

cutter = worker()
cutter$symbol = TRUE
cutter["今天是2015-07-05"]
[1] "今天"       "是"         "2015-07-05"


taiyun avatar taiyun commented on September 28, 2024

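Are full mode and search engine mode supported?

 Accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis.
 Full mode: scans out every word in the sentence that can form a dictionary word; very fast, but it cannot resolve ambiguity.
 Search engine mode: builds on accurate mode and cuts long words again to improve recall; suited to search engine tokenization.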


qinwf avatar qinwf commented on September 28, 2024

The interface for full mode exists, but it cannot be used from R yet; I'll write it in a few days. Search engine mode is QuerySegment.
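
The idea of cutting long words again for search can be sketched in base R. This is only an illustration, assuming the behaviour described for search engine mode (emit shorter dictionary words found inside a long word, then the long word itself). The function `search_cut` and the toy dictionary are hypothetical; this is not the code of jiebaR or QuerySegment.

```r
# Hypothetical helper, not part of jiebaR: re-cut one already segmented word.
# Every 2- and 3-character substring found in `dict` is emitted first,
# followed by the original word.
search_cut <- function(word, dict) {
  chars <- strsplit(word, "")[[1]]
  n <- length(chars)
  out <- character(0)
  for (len in c(2, 3)) {
    # Only words strictly longer than `len` are cut again
    if (n > len) {
      for (i in seq_len(n - len + 1)) {
        piece <- paste(chars[i:(i + len - 1)], collapse = "")
        if (piece %in% dict) out <- c(out, piece)
      }
    }
  }
  c(out, word)
}
```

With the toy dictionary `c("ab", "cd", "bcd")`, `search_cut("abcd", dict)` gives `"ab" "cd" "bcd" "abcd"`, while a two-character word such as `"ab"` is returned unchanged.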


flydsc avatar flydsc commented on September 28, 2024

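Is it possible to define custom phrases that contain punctuation? For example, I would like “一年之計在於春,一日之計在於晨” to be kept as one complete word instead of being cut into smaller words. Can the current system do this? Thanks!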


qinwf avatar qinwf commented on September 28, 2024

That is a rather special requirement; I'm not sure how common needs like this are.

@flydsc

There is a rule here:
https://github.com/qinwf/jiebaR/blob/master/inst/include/lib/SegmentBase.hpp#L18
https://github.com/qinwf/jiebaR/blob/master/inst/include/lib/SegmentBase.hpp#L55-L67

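Commas and similar characters are handled as special characters. The current rule is that the text is always cut wherever a comma, full stop, or similar character occurs.

If this rule is removed, commas, full stops, and similar symbols are treated like any other character, with no special handling.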

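I made a small change on the comma branch, https://github.com/qinwf/jiebaR/tree/comma ; you can install the package from that branch and try it. The branch also includes some recent updates from master, which are listed at http://qinwenfeng.com/jiebaR/basic/new.html .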

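Neither the user dictionary nor the system dictionary contains entries with commas, full stops, or other special characters, so in most cases these characters will still be split off. Without the rule above, however, we can no longer guarantee that commas and full stops are always split off. That is the concession made to support the special case of dictionary entries that contain commas and full stops.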
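
The effect of the "always cut at punctuation" rule can be sketched in base R. This is an illustration only, not the actual logic in SegmentBase.hpp: `pre_split` and its default separator set are hypothetical, and it assumes that each resulting chunk would then be segmented on its own.

```r
# Hypothetical helper, not part of jiebaR: cut the text at every separator
# character before any dictionary lookup happens.
# Default separators: ASCII comma and period, plus the full-width comma
# (U+FF0C) and the ideographic full stop (U+3002).
pre_split <- function(text, seps = c(",", ".", "\uFF0C", "\u3002")) {
  chars <- strsplit(text, "")[[1]]
  chunks <- character(0)
  current <- ""
  for (ch in chars) {
    if (ch %in% seps) {
      # Close the chunk collected so far, then emit the separator by itself
      if (nzchar(current)) chunks <- c(chunks, current)
      chunks <- c(chunks, ch)
      current <- ""
    } else {
      current <- paste0(current, ch)
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}
```

Because the text is cut before the dictionary is consulted, an entry that itself contains a comma can never match a single chunk. Passing `seps = character(0)` leaves the text in one piece, which corresponds to dropping the rule.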

Using the package from the comma branch, add this entry to the user dictionary:

一年之計在於春,一日之計在於晨

# comma branch
> library(jiebaR)
> cutter = worker(user = "../user.txt", symbol = TRUE)
> cutter["古诗云“一年之計在於春,一日之計在於晨”"]
[1] "古诗"
[2] "云"
[3] "“"
[4] "一年之計在於春,一日之計在於晨"
[5] "”"
> cutter["古人云“一年之計在於春,一日之計在於晨”"]
[1] "古人云"
[2] "“"
[3] "一年之計在於春,一日之計在於晨"
[4] "”"
> cutter["今天是周二,是一个晴天。"]
[1] "今天" "是"   "周二" ","   "是"   "一个" "晴天" "。"
