
Comments (3)

jingjingxupku commented on August 22, 2024

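jieba's core model is built on word-frequency information from its training data, so when training on a specific dataset we recomputed the word frequencies on that dataset's training set.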

Here is the code we used to count word frequencies:

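from collections import Counter

def cal(file_name, outname):
    # Count how often each space-separated token occurs in the segmented corpus.
    counts = Counter()
    with open(file_name, encoding='utf-8') as f:
        for line in f:
            for word in line.strip().split(' '):
                if word:  # skip empty strings produced by consecutive spaces
                    counts[word] += 1
    # Sort by frequency, then by word, both in descending order.
    d = sorted(counts.items(), key=lambda x: (x[1], x[0]), reverse=True)
    # Write one "word freq" pair per line.
    with open(outname, 'w', encoding='utf-8') as f:
        for word, freq in d:
            f.write(word + ' ' + str(freq) + '\n')

cal('icwb2-data/training/msr_training.utf8', 'msr_dict.txt')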
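
Before running on the full corpus, the counting and ordering logic can be tried on a couple of in-memory lines to see what the dictionary file will look like. This is only a minimal sketch: `build_dict_lines` and the sample sentences are hypothetical and not part of the original code.

```python
from collections import Counter


def build_dict_lines(segmented_lines):
    """Return "word freq" lines from space-separated, pre-segmented text.

    Hypothetical helper for illustration; it mirrors the counting and
    sorting steps of cal() but works on an in-memory list of strings.
    """
    counts = Counter()
    for line in segmented_lines:
        for word in line.strip().split(' '):
            if word:  # consecutive spaces yield empty strings; skip them
                counts[word] += 1
    # Highest frequency first; ties broken by word in descending order.
    ordered = sorted(counts.items(), key=lambda x: (x[1], x[0]), reverse=True)
    return [word + ' ' + str(freq) for word, freq in ordered]


if __name__ == '__main__':
    # Made-up sample; note the double space in the first line.
    sample = ["我 爱  北京", "我 爱 你"]
    for entry in build_dict_lines(sample):
        print(entry)
    # prints:
    # 爱 2
    # 我 2
    # 北京 1
    # 你 1
```

Each output line holds a word and its count separated by a single space, which is the layout written to `msr_dict.txt` above.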

Here is the code that segments text using the word-frequency file:

import jieba
tokenizer = jieba.Tokenizer(dictionary='msr_dict.txt')
words = list(tokenizer.cut(data))  # cut() returns a generator; data is the text to segment

from pkuseg-python.

lianghuihaoking commented on August 22, 2024

Understood, thank you very much!


echo-ray commented on August 22, 2024

Can word frequencies be added to a user-defined dictionary?

