Comments (17)

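fxsjy commented on August 28, 2024

@fandywang, longer words are preferred. An example:

腾讯科技有限公司 0.01
腾讯 0.1
科技 0.2
有限公司 0.1

These are the probabilities of the four words. Note that the probability of 腾讯科技有限公司 ("Tencent Technology Co., Ltd.") is an order of magnitude lower than that of the other words.

But P(腾讯) * P(科技) * P(有限公司) = 0.1 * 0.2 * 0.1 = 0.002 < 0.01, so the single long word still wins.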
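
The comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the comment, using the toy probabilities quoted there; the `prob` table and the `sequence_prob` helper are illustrative names, not part of jieba.

```python
from math import prod

# Toy probabilities taken from the comment above (not real dictionary values).
prob = {
    "腾讯科技有限公司": 0.01,
    "腾讯": 0.1,
    "科技": 0.2,
    "有限公司": 0.1,
}


def sequence_prob(words):
    """Probability of a segmentation under a unigram model:
    the product of the probabilities of its words."""
    return prod(prob[w] for w in words)


split = sequence_prob(["腾讯", "科技", "有限公司"])  # three short words
whole = sequence_prob(["腾讯科技有限公司"])          # one long word

# The segmentation with the higher probability is preferred.
best = "whole" if whole > split else "split"
print(split, whole, best)
```

Even though the long word is individually rarer than each of its parts, multiplying three probabilities below 1 shrinks the split candidate further, so the long word is chosen here.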

from jieba.

fxsjy commented on August 28, 2024

@massifor, please see this issue: #14

If the dictionary is very large, loading will be slower and memory usage will increase. However, I do not expect segmentation speed to drop much.

fxsjy commented on August 28, 2024

dict.txt already includes the free 2006 lexicon published by Sogou. Which version of the Sogou lexicon are you referring to?

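massifor commented on August 28, 2024

The Sogou input method lexicons: http://pinyin.sogou.com/dict/. I am building a vertical-domain search engine, so I would like to use Sogou's domain-specific lexicons.

Another question: if I make the dictionary very large, will that affect segmentation performance? Thanks!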

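massifor commented on August 28, 2024

Thank you for the quick reply.

I have two more technical questions and would appreciate your advice.
I am building a sports news aggregation site. Sports news contains many distinctive personal names and keywords, so segmenting it with a general-purpose dictionary causes many problems. I therefore need a dedicated dictionary for sports news.

1. I have already crawled a large amount of sports news. Could I use it as a corpus and apply a feature-extraction algorithm to extract keywords and build a dictionary?

2. If 1 is feasible, which open-source tools can convert a corpus into a dictionary?

Thank you very much!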

fxsjy commented on August 28, 2024

@massifor, do you mean unsupervised segmentation? Take a look at this article, it may help: http://www.matrix67.com/blog/archives/5044

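fandywang commented on August 28, 2024

This is mainly a new-word discovery task. Besides periodically crawling the queries and lexicons published by vertical sites, search engines, and input methods, the matrix67 article is indeed a good approach.

However, could adding too many entries to the dictionary have side effects? For example, when “腾讯科技有限公司”, “腾讯”, “腾讯科技”, and “讯科” are all in the dictionary.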

fandywang commented on August 28, 2024

Thanks!

Also, if an entry can have more than one part of speech, how is that handled? (Sorry, I have not had time to read the code in detail yet.)

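fxsjy commented on August 28, 2024

@fandywang, because of Python's speed limitations, a word that is in the dictionary has only one part of speech. An HMM is used to identify the part of speech only for out-of-vocabulary words. Basically, the four states B, M, S, E are crossed with the full set of part-of-speech tags to form the HMM states. For example, ('B','n') and ('B','v') both mark the beginning of a word, but the former is the beginning of a noun and the latter is the beginning of a verb. https://github.com/fxsjy/jieba/blob/master/jieba/posseg/prob_trans.py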
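
As a rough illustration of the crossed state space described above, the sketch below builds (position, tag) pairs with `itertools.product`. The three tags are a small hypothetical subset chosen for the example; the real tag set and transition table live in the linked `prob_trans.py` and are not reproduced here.

```python
from itertools import product

# Position-in-word labels: Begin, Middle, End, Single.
POSITIONS = ("B", "M", "E", "S")

# Hypothetical subset of part-of-speech tags, for illustration only.
POS_TAGS = ("n", "v", "a")

# Each HMM state is a (position, tag) pair, e.g. ('B', 'n') is
# "beginning of a noun" and ('B', 'v') is "beginning of a verb".
STATES = list(product(POSITIONS, POS_TAGS))

for state in STATES:
    print(state)
```

With 4 positions and 3 tags this gives 12 states; with a full tag set the state space grows proportionally, which is one reason decoding is more expensive than plain segmentation.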

fandywang commented on August 28, 2024

Understood, thanks.

wilbyang commented on August 28, 2024

Can jieba be used with Lucene?

fxsjy commented on August 28, 2024

@wilbyang, not at the moment; there is only a Python version.

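niorgai commented on August 28, 2024

Hello, I would like to confirm something about your longer-words-first example above:
Suppose the dictionary frequency of 腾讯科技有限公司 and of 腾讯, 科技, and 有限公司 is 3 in every case.
Does that mean their probabilities are all equal, and that because longer words are preferred the text is still segmented as “腾讯科技有限公司”?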

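yanyiwu commented on August 28, 2024

@niorgai You have misunderstood what "longer words first" means.
Longer words win because the probability of a sentence is computed as the product of its word probabilities. The more words a segmentation contains, the smaller that product becomes (since every probability is < 1.0).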

niorgai commented on August 28, 2024

@aszxqw Hello. In the dictionary, the frequency of 软件 is 4601, 中山大学 is 192, and 学院 is 29249, yet setting the frequency of 中山大学软件学院 to just 3 gives me the long-word result I want. Could you explain why?

yanyiwu commented on August 28, 2024

@niorgai
The total word frequency of the dictionary is 60101878.
a = (4601/60101878) * (192/60101878) * (29249/60101878) = 1.1901463e-13
b = 3/60101878 = 4.99152456e-8
b > a, so the single entry 中山大学软件学院 wins.
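
To see the same arithmetic drive an actual choice of segmentation, here is a minimal unigram segmenter using the frequencies quoted in this thread. It is a sketch under stated assumptions, not jieba's implementation: `best_segmentation` is a hypothetical helper, it works in log space to avoid underflow, and it assigns unknown single characters a frequency of 1 so that every position has at least one candidate.

```python
from math import log

TOTAL = 60101878  # total dictionary frequency quoted above
FREQ = {
    "中山大学": 192,
    "软件": 4601,
    "学院": 29249,
    "中山大学软件学院": 3,  # user-added entry with a very low frequency
}


def best_segmentation(sentence, freq=FREQ, total=TOTAL):
    """Return (log_prob, words) for the most probable split of `sentence`
    under a unigram model, using dynamic programming from right to left."""
    n = len(sentence)
    best = [None] * (n + 1)  # best[i] covers sentence[i:]
    best[n] = (0.0, [])
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Assumption: an unknown single character counts as frequency 1;
            # unknown multi-character strings are not candidates.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                tail_lp, tail_words = best[j]
                candidates.append((log(count / total) + tail_lp, [word] + tail_words))
        best[i] = max(candidates, key=lambda c: c[0])
    return best[0]


log_prob, words = best_segmentation("中山大学软件学院")
print(words, log_prob)
```

Removing the long entry from `FREQ` makes the same function fall back to the three shorter words, which matches the explanation that the product of three small probabilities loses to one small probability.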

niorgai commented on August 28, 2024

@aszxqw That is actually the same as what I originally thought. Thanks!
