
Comments (20)

crownpku avatar crownpku commented on May 25, 2024 2

@DoubleAix Thanks for the tip; the code and readme have been updated. If you have a better way to add a user dictionary, feel free to suggest it.

from rasa_nlu_chi.

crownpku avatar crownpku commented on May 25, 2024 1

Hi, you may refer to the following instructions from jieba, and add the corresponding code with your own dictionary in
https://github.com/crownpku/rasa_nlu_chi/blob/master/rasa_nlu/tokenizers/jieba_tokenizer.py

def tokenize(self, text):
        # type: (Text) -> List[Token]
        import jieba
        # MODIFICATION: load the user dictionary before cutting
        jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary
        # MODIFICATION ENDS
        words = jieba.lcut(text.encode('utf-8'))

From jieba:

Loading a dictionary

Developers can supply a custom dictionary to include words that are not in the jieba vocabulary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to the custom dictionary
The dictionary format is the same as dict.txt: one word per line, each line split into three space-separated parts in fixed order: word, frequency (optional), and part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
When the frequency is omitted, jieba uses an automatically computed value that guarantees the word can be segmented out.
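The line format described above can be captured in a few lines of code. The parser below is a hypothetical illustration (not part of jieba itself) of how one dictionary line splits into word, optional frequency, and optional part-of-speech tag:

```python
# Hypothetical parser for one line of a jieba user dictionary; not part of
# jieba, just an illustration of the format described above.
def parse_userdict_line(line):
    """Return (word, freq, pos) for a line 'word [freq] [pos]'."""
    parts = line.strip().split()
    if not parts:
        return None  # blank line
    word, freq, pos = parts[0], None, None
    rest = parts[1:]
    if rest and rest[0].isdigit():  # frequency is optional
        freq = int(rest[0])
        rest = rest[1:]
    if rest:  # part-of-speech tag is optional
        pos = rest[0]
    return word, freq, pos
```

For example, `创新办 3 i` parses to `('创新办', 3, 'i')`, while `凱特琳 nz` omits the frequency.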


BrikerMan avatar BrikerMan commented on May 25, 2024

Could you provide a way to load it from a config file? Thanks.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan The latest commit adds loading from a config file.
Please add the file path of your jieba userdict to the sample_configs/config_jieba_mitie_sklearn.json config file.

BrikerMan avatar BrikerMan commented on May 25, 2024

This project hasn't been merged into the official one yet, right? So I have to build my application on top of this repo and can't just pip install rasa_nlu, correct?

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan rasa_nlu_chi keeps pulling in the latest rasa_nlu code. It can't be merged into the official repo yet because the language-control part of the rasa_nlu main framework still has issues; in the long run it should be merged in as language support.

For Chinese applications, you probably still need rasa_nlu_chi for now.

BrikerMan avatar BrikerMan commented on May 25, 2024

Got it, I'll use this one for now. Thanks a lot. I'll keep digging into it.

BrikerMan avatar BrikerMan commented on May 25, 2024

Ran into an error.
The config file loads fine, and the training data is found.

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    trainer.train(training_data)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/model.py", line 157, in train
    updates = component.train(working_data, self.config, **context)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 37, in train
    example.set("tokens", self.tokenize(example.text))
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 49, in tokenize
    if config['jieba_userdic'] != 'None':
NameError: name 'config' is not defined


BrikerMan avatar BrikerMan commented on May 25, 2024

The cause is that the tokenize method has no config attribute, and the dictionary shouldn't be reloaded on every tokenize call anyway. I moved the load into the train method, which runs fine, but that isn't right either: it should happen when the tokenizer is initialized.

    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUConfig, **Any) -> None
        if config['language'] != 'zh':
            raise Exception("tokenizer_jieba is only used for Chinese. Check your configure json file.")
        # Add jieba userdict file
        if config['jieba_userdic'] != 'None':
            jieba.load_userdict(config['jieba_userdic'])
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.text))
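One way to avoid both the NameError and scattered config lookups is to funnel the check through a small guard that tolerates a missing key. A minimal sketch, not the project's actual code: the key name `jieba_userdic` follows the config above, and `loader` is injected so the sketch runs without jieba (in practice it would be `jieba.load_userdict`):

```python
# Sketch of a guarded user-dict load. `loader` stands in for
# jieba.load_userdict so the helper can be exercised without jieba installed.
def maybe_load_userdict(config, loader):
    """Load the user dictionary if the config names one; return whether it did."""
    path = config.get("jieba_userdic")
    if path and path != "None":  # the sample config stores the literal string "None"
        loader(path)
        return True
    return False
```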


crownpku avatar crownpku commented on May 25, 2024

It doesn't seem easy to pass config into init; that would touch the init definition of the whole tokenizer.
The simplest way seems to be to put it in train and load the jieba userdict while processing training_data. When you say it's not right, do you mean having to load it on every train run? But a full training run over all the user data is a complete pipeline anyway, so loading the userdict once per run doesn't seem like a problem.
I'll fix the code that way for now.

BrikerMan avatar BrikerMan commented on May 25, 2024

What's not right is this: I load the dictionary during train, but prediction doesn't go through that path, so training and prediction end up with different segmentations. Loading the full dictionary once per train run is fine.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan I moved import jieba out of the tokenizer so the import doesn't run on every tokenizer call, and moved loading the jieba userdict into the train function.
If you have a better implementation, feel free to change the code and send a pull request.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan I see what you mean now; inference is indeed broken. Let me think about how to handle it.

BrikerMan avatar BrikerMan commented on May 25, 2024

If I define a custom pipeline component inside my own project, how do I register it? I installed rasa-nlu-chi via pip, and it looks like a custom pipeline requires editing the rasa_nlu.registry.py file. Is there a way to load a custom pipeline without changing the rasa source files?

crownpku avatar crownpku commented on May 25, 2024

For a custom pipeline you only need to edit the config file:

{
  "name": "rasa_nlu_test",
  "pipeline": ["nlp_mitie",
        "tokenizer_jieba",
        "ner_mitie",
        "ner_synonyms",
        "intent_entity_featurizer_regex",
        "intent_featurizer_mitie",
        "intent_classifier_sklearn"],
  "language": "zh",
  "mitie_file": "./data/total_word_feature_extractor_zh.dat",
  "path" : "./models",
  "data" : "./data/examples/rasa/demo-rasa_zh.json",
  "jieba_userdic": "None"
}

Or are you trying to add a new module?

BrikerMan avatar BrikerMan commented on May 25, 2024

After adding it there I get the message:

If you are creating your own component, make sure it is either listed as part of the component_classes in rasa_nlu.registry.py or is a proper name of a class in a module.

It looks like the class needs to be registered, otherwise it doesn't know where to import it from. I want to register a component that converts Chinese uppercase numerals into Arabic numerals.

crownpku avatar crownpku commented on May 25, 2024

A new component does need to be registered. You can use jieba_tokenizer as an example and search the project for the related code.

BrikerMan avatar BrikerMan commented on May 25, 2024

Yes, I saw that. I was just wondering whether there's a way to register it without modifying the rasa code.
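The error message quoted earlier hints at one route that avoids editing registry.py: a pipeline entry may be "a proper name of a class in a module", i.e. a fully-qualified class path. A minimal sketch of how such a lookup can work, using importlib; the helper name is hypothetical, and the stdlib class in the usage example is purely for illustration:

```python
import importlib

# Sketch of resolving a pipeline component from a fully-qualified class path,
# the mechanism suggested by "a proper name of a class in a module".
def get_component_class(name):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, `get_component_class("collections.OrderedDict")` returns the `OrderedDict` class without it ever being listed in a registry.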

crownpku avatar crownpku commented on May 25, 2024

As for adding a custom jieba dictionary, I haven't found a really elegant approach yet.
In the current version (2017-11-16), users need to put their custom jieba dictionaries under rasa_nlu_chi/jieba_userdict/. The system automatically finds and loads them into jieba for both training and prediction.

DoubleAix avatar DoubleAix commented on May 25, 2024

I have some questions about the way the jieba dictionary is added.
Since I installed it into site-packages with python setup.py install:

unzip -l rasa_nlu-0.12.0a1-py3.6.egg | grep jieba
1665 03-12-2018 15:44 rasa_nlu/tokenizers/jieba_tokenizer.py
2200 03-12-2018 16:22 rasa_nlu/tokenizers/__pycache__/jieba_tokenizer.cpython-36.pyc

my rasa_nlu refers to the package at that location, not the package under the git clone https://github.com/crownpku/Rasa_NLU_Chi.git directory.

So in my project directory I run python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.json.
According to your source code:

import glob
import jieba
jieba_userdicts = glob.glob("./jieba_userdict/*")
for jieba_userdict in jieba_userdicts:
    jieba.load_userdict(jieba_userdict)

is it the project directory I run from that needs a jieba_userdict directory to hold the dictionaries?

I think loading a dictionary into jieba should print a console message confirming the load; I suspect many people's dictionaries are never actually loaded without them noticing.

Is my understanding correct?

Thanks!!
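The console hint requested above is easy to add to the glob-based discovery. A sketch, not the project's actual code: the function name is hypothetical, and `loader` stands in for `jieba.load_userdict` so the sketch runs without jieba installed:

```python
import glob

def load_all_userdicts(pattern="./jieba_userdict/*", loader=None):
    """Load every dictionary matching `pattern`, printing what was found."""
    files = sorted(glob.glob(pattern))
    for path in files:
        print("Loading jieba userdict: %s" % path)  # confirm the load happened
        if loader is not None:
            loader(path)  # in practice: jieba.load_userdict(path)
    if not files:
        print("No jieba userdict found under %s" % pattern)
    return files
```

Because the pattern is relative, the jieba_userdict directory must sit under whatever directory the training command is run from, which is exactly the pitfall described above.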

