chinese_segment_augment's People

Contributors

jiangzhonglian, zhanzecheng

chinese_segment_augment's Issues

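Is a direct linear sum really the best approach?

I have written something similar myself and tried several ways of combining left/right entropy with mutual information, but none of them was very satisfying. Is there a better approach? I have tried weighted sums and ratios with a variety of parameters.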

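The program does not run after download

There are two main problems:

  1. The file paths are wrong when loading files.
    Lines 43 and 48 of demo_run.py are missing a "/".
  2. After fixing that, line 84 of model.py fails:
    word[0], word[1], word[2] = word[1], word[2], word[0]
    TypeError: 'tuple' object does not support item assignment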
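
The TypeError in item 2 arises because tuples are immutable, so their items cannot be reassigned. A minimal sketch of two possible fixes follows, assuming `word` is a 3-element tuple at that point. The rest of model.py is not shown here, so the helper name `rotate_left` is hypothetical.

```python
def rotate_left(word):
    """Return a new tuple with the first element moved to the end."""
    # Build a new tuple instead of assigning to items of the old one.
    return (word[1], word[2], word[0])


word = ("a", "b", "c")
rotated = rotate_left(word)

# Alternative: convert to a list first if in-place assignment is preferred.
as_list = list(word)
as_list[0], as_list[1], as_list[2] = as_list[1], as_list[2], as_list[0]
```

Either variant keeps the rotation from line 84; which one fits depends on whether later code expects a tuple or a list.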

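A problem in the left/right entropy computation

Suppose there are two token sequences, [a,b,c] and [b,c,a]. When computing left entropy, [a,b,c] is converted to b->c->a and stored in the tree. When [b,c,a] is stored in its original order, it also ends up in the tree as b->c->a. Doesn't that cause a problem when computing the left entropy of bc, since the count for a is incremented one extra time?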
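
The collision described in this issue can be reproduced with a toy counter. This is only a sketch: it assumes that 3-grams are rotated to (b, c, a) for left entropy and that rotated and forward paths share one tree. The `Counter` stands in for the trie and is not the repository's data structure.

```python
from collections import Counter

paths = Counter()


def add_forward(gram):
    # Forward storage: the path is the 3-gram itself.
    paths[tuple(gram)] += 1


def add_for_left_entropy(gram):
    # Rotated storage: [a, b, c] becomes b -> c -> a, so the left
    # neighbour `a` ends up as a child of the pair (b, c).
    a, b, c = gram
    paths[(b, c, a)] += 1


add_for_left_entropy(["a", "b", "c"])  # "a" is a true left neighbour of "bc"
add_forward(["b", "c", "a"])           # here "a" is a right neighbour of "bc"

# Both calls hit the same path, so the two cases are indistinguishable.
count = paths[("b", "c", "a")]
```

One possible remedy, not taken from the repository, is to keep rotated paths in a separate tree or to mark them with a flag on the final node.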

Question about the parameters in model.py

PMI = math.log(max(ch.count, 1), 2) - math.log(total, 2) - math.log(one_dict[child.char], 2) - math.log(one_dict[ch.char], 2)
Why does this look different from log2( P(X,Y) / (P(X) * P(Y)) )?
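
One way to compare the two expressions is to expand the textbook definition in terms of raw counts. The derivation below is a sketch that assumes P(X,Y) = c(X,Y)/N and P(X) = c(X)/N with the same total N, where c(X,Y) corresponds to `ch.count`, c(X) and c(Y) to the `one_dict[...]` entries, and N to `total`. Whether these variables play exactly those roles in model.py is an assumption.

```latex
\begin{aligned}
\mathrm{PMI}(X,Y) &= \log_2 \frac{P(X,Y)}{P(X)\,P(Y)}
  = \log_2 \frac{c(X,Y)/N}{\bigl(c(X)/N\bigr)\bigl(c(Y)/N\bigr)} \\
 &= \log_2 c(X,Y) + \log_2 N - \log_2 c(X) - \log_2 c(Y)
\end{aligned}
```

Under these assumptions the quoted line has `- math.log(total, 2)` where the expansion has `+ log2 N`, so it differs from the textbook PMI by the constant 2·log2 N. A constant offset does not change a ranking by PMI alone, but it does change any score that adds PMI to another quantity.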

UnicodeDecodeError: 'ascii' codec can't decode byte

Running it produced this error:
#python3 demo_run.py
Traceback (most recent call last):
  File "demo_run.py", line 44, in <module>
    stopwords = get_stopwords()
  File "/data/home/tengenli/Chinese_segment_augment/utils.py", line 13, in get_stopwords
    stopword = [line.strip() for line in f]
  File "/data/home/tengenli/Chinese_segment_augment/utils.py", line 13, in <listcomp>
    stopword = [line.strip() for line in f]
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
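
An error like this usually means the file was opened without an explicit encoding while the locale's default encoding is ASCII. Below is a sketch of a possible fix, assuming the stopword file is UTF-8 encoded. The `path` parameter and the body of `get_stopwords` are illustrative and are not the repository's actual code in utils.py.

```python
import os
import tempfile


def get_stopwords(path):
    # Pass the encoding explicitly so the result does not depend on the locale.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]


# Demo with a temporary UTF-8 file containing non-ASCII stopwords.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopword.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n了\n")
    stopwords = get_stopwords(demo_path)
```

Alternatively, running Python under a UTF-8 locale changes the default encoding that `open()` picks up, but passing `encoding` explicitly is more portable.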

What is the purpose of this step, and why is it computed this way?

==>result[key] = (values[0] + min(left[d], right[d])) * values[1]
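I can't understand what this step is doing. My understanding is that it would be enough to assign the minimum of the left and right entropy as the value here.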

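def find_word(self, N):
    # Search the tree to get the mutual information
    # e.g. dict{ "a_b": (PMI, occurrence probability), .. }
    bi = self.search_bi()
    # Search the tree to get the left and right entropy
    left = self.search_left()
    right = self.search_right()
    result = {}
    for key, values in bi.items():
        d = "".join(key.split('_'))
        # Formula: score = PMI + min(left entropy, right entropy) => lower entropy means
        #   more order, so the characters are more likely to belong together
        #   PMI measures co-occurrence; values[0] is that co-occurrence value
        result[key] = (values[0] + min(left[d], right[d])) * values[1]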
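
To make the quoted line concrete, the sketch below evaluates the same expression on made-up numbers and compares it with the min-entropy-only score proposed in the question. All values and function names are invented for illustration.

```python
def combined_score(pmi, prob, left_entropy, right_entropy):
    # Same expression as the quoted line: (PMI + min entropy) * probability.
    return (pmi + min(left_entropy, right_entropy)) * prob


def entropy_only_score(left_entropy, right_entropy):
    # The alternative proposed in the question.
    return min(left_entropy, right_entropy)


# Two hypothetical candidates with identical entropies but different PMI.
cohesive = combined_score(pmi=4.0, prob=0.5, left_entropy=1.0, right_entropy=2.0)
loose = combined_score(pmi=0.0, prob=0.5, left_entropy=1.0, right_entropy=2.0)
tie = entropy_only_score(left_entropy=1.0, right_entropy=2.0)
```

With entropy alone the two candidates tie; the PMI term and the probability factor are what separate them in the combined score.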

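When adding nodes, would you consider adding a dictionary to TrieNode?

For long documents, the initial add step is very slow. I tried adding a dictionary of child nodes in C#, and the speedup was quite noticeable. It may use a bit more memory. For reference: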

if (node.DictChilds.ContainsKey(word))
{
    node = node.DictChilds[word];
}
else
{
    var newNode = new TrieNode(word);
    node.Childs.Add(newNode);
    node.DictChilds.Add(word, newNode);
    node = newNode;
}
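
The same idea in Python, as a rough sketch: each node keeps a dictionary from character to child, so finding a child is a hash lookup rather than a scan over a child list. The class and attribute names here are hypothetical and do not mirror the repository's TrieNode.

```python
class TrieNode:
    def __init__(self, char):
        self.char = char
        self.count = 0
        self.children = {}  # char -> TrieNode, average O(1) lookup


def add(root, chars):
    """Insert a sequence of characters, incrementing the final node's count."""
    node = root
    for ch in chars:
        child = node.children.get(ch)
        if child is None:
            child = TrieNode(ch)
            node.children[ch] = child
        node = child
    node.count += 1


root = TrieNode("*")
add(root, "ab")
add(root, "ab")
add(root, "ac")
```

The trade-off matches the one noted in the issue: one extra dictionary per node in exchange for faster insertion on long documents.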

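Personally, I think this is just the classic greedy algorithm from statistical language modeling

What is actually being computed is the conditional probability P(w2|w1) = p(word) / p(w1) = I * pw2. In other words, it compares conditional probabilities between characters across the whole corpus to obtain the transition probability w1->w2, then greedily prefers the w1-w2 pairs with high probability. Choosing those high-probability words, as learned from the corpus, improves the accuracy of the statistical language model.

The so-called degree of freedom is actually quite ill-defined. For example, if a word repeatedly appears at the end of a sentence, is its right entropy supposed to be 0?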

P(S) = P(word1)P(word2|word1)P(word3|word2)...P(wordn|wordn-1)

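I think training a similar Bayesian model, and then using what it learned from the training corpus to obtain a confidence for each candidate word, might be of more practical value. Put simply, you use left/right entropy to stand in for pw2, and I find it very questionable whether that is valid.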
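
The chain-rule formula P(S) above can be sketched as a maximum-likelihood bigram model. This is an illustration on a toy corpus with no smoothing; the corpus and the function names are invented.

```python
from collections import Counter

corpus = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total = sum(unigrams.values())


def cond_prob(w2, w1):
    # MLE estimate of P(w2 | w1) = count(w1, w2) / count(w1).
    # count(w1) also includes sentence-final occurrences of w1.
    return bigrams[(w1, w2)] / unigrams[w1]


def sentence_prob(words):
    # P(S) = P(w1) * P(w2 | w1) * ... * P(wn | wn-1)
    p = unigrams[words[0]] / total
    for w1, w2 in zip(words, words[1:]):
        p *= cond_prob(w2, w1)
    return p


p_abc = sentence_prob(["a", "b", "c"])
```

A real model would need smoothing for unseen pairs, since any bigram with zero count makes the whole product zero.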
