
Comments (20)

crownpku avatar crownpku commented on May 25, 2024 2

@DoubleAix Thanks for the tip; the code and readme have been updated. If you have a better way to add a user dictionary, feel free to suggest it.

from rasa_nlu_chi.

crownpku avatar crownpku commented on May 25, 2024 1

Hi, you may refer to the following instructions from jieba, and add the corresponding code with your own dictionary in
https://github.com/crownpku/rasa_nlu_chi/blob/master/rasa_nlu/tokenizers/jieba_tokenizer.py

def tokenize(self, text):
        # type: (Text) -> List[Token]
        import jieba
        # MODIFICATION: load the user dictionary before cutting
        jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary
        # MODIFICATION ENDS
        words = jieba.lcut(text.encode('utf-8'))

From jieba:

Loading a dictionary

Developers can supply a custom dictionary to include words that are not in the jieba vocabulary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to the custom dictionary
The dictionary format is the same as dict.txt: one word per line, each line split into three space-separated parts in fixed order: word, frequency (optional), and part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
When the frequency is omitted, jieba uses an automatically computed value that guarantees the word can be segmented out.
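The line format described above can be captured in a few lines of code. The parser below is a hypothetical illustration (not part of jieba itself) of how one dictionary line splits into word, optional frequency, and optional part-of-speech tag:

```python
# Hypothetical parser for one line of a jieba user dictionary; not part of
# jieba, just an illustration of the format described above.
def parse_userdict_line(line):
    """Return (word, freq, pos) for a line 'word [freq] [pos]'."""
    parts = line.strip().split()
    if not parts:
        return None  # blank line
    word, freq, pos = parts[0], None, None
    rest = parts[1:]
    if rest and rest[0].isdigit():  # frequency is optional
        freq = int(rest[0])
        rest = rest[1:]
    if rest:  # part-of-speech tag is optional
        pos = rest[0]
    return word, freq, pos
```

For example, `创新办 3 i` parses to `('创新办', 3, 'i')`, while `凱特琳 nz` omits the frequency.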


BrikerMan avatar BrikerMan commented on May 25, 2024

Could you provide a way to load it from a config file? Thanks.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan The latest commit adds loading from a config file.
Please add the file path of your jieba userdict to the sample_configs/config_jieba_mitie_sklearn.json config file.

BrikerMan avatar BrikerMan commented on May 25, 2024

This project hasn't been merged into the official one yet, right? So I have to build my application on top of this repo and can't just pip install rasa_nlu, correct?

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan rasa_nlu_chi keeps pulling in the latest rasa_nlu code. It can't be merged into the official repo yet because the language-control part of the rasa_nlu main framework still has issues; in the long run it should be merged in as language support.

For Chinese applications, you probably still need rasa_nlu_chi for now.

BrikerMan avatar BrikerMan commented on May 25, 2024

Got it, I'll use this one for now. Thanks a lot. I'll keep digging into it.

BrikerMan avatar BrikerMan commented on May 25, 2024

Ran into an error.
The config file loads fine, and the training data is found.

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    trainer.train(training_data)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/model.py", line 157, in train
    updates = component.train(working_data, self.config, **context)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 37, in train
    example.set("tokens", self.tokenize(example.text))
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 49, in tokenize
    if config['jieba_userdic'] != 'None':
NameError: name 'config' is not defined


BrikerMan avatar BrikerMan commented on May 25, 2024

The cause is that the tokenize method has no config attribute, and the dictionary shouldn't be reloaded on every tokenize call anyway. I moved the load into the train method, which runs fine, but that isn't right either: it should happen when the tokenizer is initialized.

    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUConfig, **Any) -> None
        if config['language'] != 'zh':
            raise Exception("tokenizer_jieba is only used for Chinese. Check your configure json file.")
        # Add jieba userdict file
        if config['jieba_userdic'] != 'None':
            jieba.load_userdict(config['jieba_userdic'])
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.text))
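One way to avoid both the NameError and scattered config lookups is to funnel the check through a small guard that tolerates a missing key. A minimal sketch, not the project's actual code: the key name `jieba_userdic` follows the config above, and `loader` is injected so the sketch runs without jieba (in practice it would be `jieba.load_userdict`):

```python
# Sketch of a guarded user-dict load. `loader` stands in for
# jieba.load_userdict so the helper can be exercised without jieba installed.
def maybe_load_userdict(config, loader):
    """Load the user dictionary if the config names one; return whether it did."""
    path = config.get("jieba_userdic")
    if path and path != "None":  # the sample config stores the literal string "None"
        loader(path)
        return True
    return False
```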


crownpku avatar crownpku commented on May 25, 2024

It doesn't seem easy to pass config into init; that would touch the init definition of the whole tokenizer.
The simplest way seems to be to put it in train and load the jieba userdict while processing training_data. When you say it's not right, do you mean having to load it on every train run? But a full training run over all the user data is a complete pipeline anyway, so loading the userdict once per run doesn't seem like a problem.
I'll fix the code that way for now.

BrikerMan avatar BrikerMan commented on May 25, 2024

What's not right is this: I load the dictionary during train, but prediction doesn't go through that path, so training and prediction end up with different segmentations. Loading the full dictionary once per train run is fine.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan I moved import jieba out of the tokenizer so the import doesn't run on every tokenizer call, and moved loading the jieba userdict into the train function.
If you have a better implementation, feel free to change the code and send a pull request.

crownpku avatar crownpku commented on May 25, 2024

@BrikerMan I see what you mean now; inference is indeed broken. Let me think about how to handle it.

BrikerMan avatar BrikerMan commented on May 25, 2024

If I define a custom pipeline component inside my own project, how do I register it? I installed rasa-nlu-chi via pip, and it looks like a custom pipeline requires editing the rasa_nlu.registry.py file. Is there a way to load a custom pipeline without changing the rasa source files?

crownpku avatar crownpku commented on May 25, 2024

For a custom pipeline you only need to edit the config file:

{
  "name": "rasa_nlu_test",
  "pipeline": ["nlp_mitie",
        "tokenizer_jieba",
        "ner_mitie",
        "ner_synonyms",
        "intent_entity_featurizer_regex",
        "intent_featurizer_mitie",
        "intent_classifier_sklearn"],
  "language": "zh",
  "mitie_file": "./data/total_word_feature_extractor_zh.dat",
  "path" : "./models",
  "data" : "./data/examples/rasa/demo-rasa_zh.json",
  "jieba_userdic": "None"
}

Or are you trying to add a new module?

BrikerMan avatar BrikerMan commented on May 25, 2024

After adding it there I get the message:

If you are creating your own component, make sure it is either listed as part of the component_classes in rasa_nlu.registry.py or is a proper name of a class in a module.

It looks like the class needs to be registered, otherwise it doesn't know where to import it from. I want to register a component that converts Chinese uppercase numerals into Arabic numerals.

crownpku avatar crownpku commented on May 25, 2024

A new component does need to be registered. You can use jieba_tokenizer as an example and search the project for the related code.

BrikerMan avatar BrikerMan commented on May 25, 2024

Yes, I saw that. I was just wondering whether there's a way to register it without modifying the rasa code.
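The error message quoted earlier hints at one route that avoids editing registry.py: a pipeline entry may be "a proper name of a class in a module", i.e. a fully-qualified class path. A minimal sketch of how such a lookup can work, using importlib; the helper name is hypothetical, and the stdlib class in the usage example is purely for illustration:

```python
import importlib

# Sketch of resolving a pipeline component from a fully-qualified class path,
# the mechanism suggested by "a proper name of a class in a module".
def get_component_class(name):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, `get_component_class("collections.OrderedDict")` returns the `OrderedDict` class without it ever being listed in a registry.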

crownpku avatar crownpku commented on May 25, 2024

As for adding a custom jieba dictionary, I haven't found a really elegant approach yet.
In the current version (2017-11-16), users need to put their custom jieba dictionaries under rasa_nlu_chi/jieba_userdict/. The system automatically finds and loads them into jieba for both training and prediction.

DoubleAix avatar DoubleAix commented on May 25, 2024

I have some questions about the way the jieba dictionary is added.
Since I installed it into site-packages with python setup.py install:

unzip -l rasa_nlu-0.12.0a1-py3.6.egg | grep jieba
1665 03-12-2018 15:44 rasa_nlu/tokenizers/jieba_tokenizer.py
2200 03-12-2018 16:22 rasa_nlu/tokenizers/__pycache__/jieba_tokenizer.cpython-36.pyc

my rasa_nlu refers to the package at that location, not the package under the git clone https://github.com/crownpku/Rasa_NLU_Chi.git directory.

So in my project directory I run python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.json.
According to your source code:

import glob
import jieba
jieba_userdicts = glob.glob("./jieba_userdict/*")
for jieba_userdict in jieba_userdicts:
    jieba.load_userdict(jieba_userdict)

is it the project directory I run from that needs a jieba_userdict directory to hold the dictionaries?

I think loading a dictionary into jieba should print a console message confirming the load; I suspect many people's dictionaries are never actually loaded without them noticing.

Is my understanding correct?

Thanks!!
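The console hint requested above is easy to add to the glob-based discovery. A sketch, not the project's actual code: the function name is hypothetical, and `loader` stands in for `jieba.load_userdict` so the sketch runs without jieba installed:

```python
import glob

def load_all_userdicts(pattern="./jieba_userdict/*", loader=None):
    """Load every dictionary matching `pattern`, printing what was found."""
    files = sorted(glob.glob(pattern))
    for path in files:
        print("Loading jieba userdict: %s" % path)  # confirm the load happened
        if loader is not None:
            loader(path)  # in practice: jieba.load_userdict(path)
    if not files:
        print("No jieba userdict found under %s" % pattern)
    return files
```

Because the pattern is relative, the jieba_userdict directory must sit under whatever directory the training command is run from, which is exactly the pitfall described above.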

