Comments (20)
@DoubleAix Thanks for the tip; the code and README have been updated. If you have a better way to add a user dictionary, feel free to suggest it.
from rasa_nlu_chi.
Hi, you may refer to the following instructions from jieba and add the corresponding code, pointing at your own dictionary, in
https://github.com/crownpku/rasa_nlu_chi/blob/master/rasa_nlu/tokenizers/jieba_tokenizer.py
def tokenize(self, text):
    # type: (Text) -> List[Token]
    import jieba

    # MODIFICATION
    jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary
    # MODIFICATION ENDS

    words = jieba.lcut(text.encode('utf-8'))
From jieba:
Loading a dictionary
Developers can specify their own custom dictionary to include words that are not in jieba's built-in vocabulary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to a custom dictionary
The dictionary format is the same as dict.txt: one word per line, with each line split into three space-separated parts in fixed order: word, frequency (optional), and part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
When the frequency is omitted, jieba uses an automatically computed frequency that ensures the word can be segmented out.
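For illustration, a small userdict file in that format might look like the following (sample entries taken from jieba's own documentation; frequency and part-of-speech tag are optional):

```
云计算 5
李小福 2 nr
凱特琳 nz
台中
```

Each line adds one word: 云计算 gets an explicit frequency of 5, 李小福 gets both a frequency and the nr (person name) tag, 凱特琳 gets only a tag, and 台中 relies on the automatically computed frequency.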
Could you provide a way to load the dictionary from the config file? Thanks.
@BrikerMan The latest commit adds a way to load it from the config file.
Please add the file path of your jieba userdict to the sample_configs/config_jieba_mitie_sklearn.json config file.
This project hasn't been merged into the official one yet, right? So I have to build my application on top of this repo and can't just pip install rasa_nlu, correct?
@BrikerMan rasa_nlu_chi keeps tracking the latest rasa_nlu code. The reason it can't be merged into the official repository yet is that the language control part of the rasa_nlu main framework still has problems; in the long run it should be merged in as language support.
For Chinese applications, you will probably still need rasa_nlu_chi for now.
Got it, I'll stick with this one for now. Thanks a lot; I'll keep studying it.
Ran into an error.
The config file loads fine, and the training data is found.
Traceback (most recent call last):
File "train.py", line 21, in <module>
trainer.train(training_data)
File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/model.py", line 157, in train
updates = component.train(working_data, self.config, **context)
File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 37, in train
example.set("tokens", self.tokenize(example.text))
File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 49, in tokenize
if config['jieba_userdic'] != 'None':
NameError: name 'config' is not defined
The cause is that the tokenize method has no config argument, and besides, the dictionary shouldn't be reloaded on every tokenize call. I moved the loading into the train method; that runs, but it still isn't right. It should happen when the tokenizer is initialized.
def train(self, training_data, config, **kwargs):
    # type: (TrainingData, RasaNLUConfig, **Any) -> None
    if config['language'] != 'zh':
        raise Exception("tokenizer_jieba is only used for Chinese. "
                        "Check your configure json file.")
    # Add jieba userdict file
    if config['jieba_userdic'] != 'None':
        jieba.load_userdict(config['jieba_userdic'])
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))
It seems awkward to add config in the init part; that would touch the init definition of the whole tokenizer.
The simplest approach seems to be putting it in train and loading the jieba userdict while processing training_data. When you say it's not right, do you mean that the dictionary gets loaded once on every train run? A full training pass over all the data is a complete run anyway, so loading the userdict once per run seems fine to me.
I'll fix the code with this approach for now.
What's not right is this: the dictionary is loaded during train, but prediction doesn't go through that path, so tokenization differs between training and prediction. Loading the full dictionary once per train run is no problem.
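One way to avoid the train/predict mismatch described above, sketched under the assumption that the same tokenizer instance serves both training and inference, is to load the dictionary once at construction time. The class below is a simplified illustration, not the project's actual component, and userdict_path is a hypothetical parameter:

```python
class JiebaTokenizer(object):
    """Simplified sketch of a tokenizer that loads its userdict once.

    Loading in __init__ means train() and later predictions see the
    same segmentation, unlike loading inside train() only.
    """

    def __init__(self, userdict_path=None):
        self.userdict_path = userdict_path
        if userdict_path is not None:
            import jieba  # imported lazily, as discussed in this thread
            jieba.load_userdict(userdict_path)

    def tokenize(self, text):
        import jieba
        return jieba.lcut(text)
```

Whether this fits depends on how the framework constructs and persists components; as noted earlier in the thread, wiring config into __init__ may require changes to the tokenizer's whole init definition.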
@BrikerMan I moved import jieba out of the tokenizer so the import doesn't run every time the tokenizer does.
I then moved loading the jieba userdict into the train function.
If you have a better implementation, feel free to change the code and send a pull request.
@BrikerMan I see what you mean now; inference is indeed broken. Let me think about how to handle it.
If I define a custom pipeline inside my own project, how do I register it? I installed rasa-nlu-chi with pip. From what I've seen, a custom pipeline requires modifying rasa_nlu/registry.py. Is there a way to load a custom pipeline without changing the rasa source files?
For a custom pipeline you only need to modify the config file:
{
  "name": "rasa_nlu_test",
  "pipeline": ["nlp_mitie",
               "tokenizer_jieba",
               "ner_mitie",
               "ner_synonyms",
               "intent_entity_featurizer_regex",
               "intent_featurizer_mitie",
               "intent_classifier_sklearn"],
  "language": "zh",
  "mitie_file": "./data/total_word_feature_extractor_zh.dat",
  "path": "./models",
  "data": "./data/examples/rasa/demo-rasa_zh.json",
  "jieba_userdic": "None"
}
Or do you want to add a new module?
After adding it there I get this message:
If you are creating your own component, make sure it is either listed as part of the component_classes in rasa_nlu.registry.py or is a proper name of a class in a module.
It seems the class needs to be registered somewhere, otherwise rasa doesn't know where to import it from. I want to register a component that converts Chinese uppercase numerals into Arabic numerals.
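As the error message itself hints ("...or is a proper name of a class in a module"), it may be possible to reference a custom component by its full module path in the pipeline instead of editing rasa_nlu/registry.py. The config below is only a sketch; my_project.components.ChineseNumberNormalizer is a hypothetical class path, and whether the installed version resolves components this way should be verified against its registry code:

```
{
  "name": "rasa_nlu_test",
  "pipeline": ["nlp_mitie",
               "tokenizer_jieba",
               "my_project.components.ChineseNumberNormalizer",
               "intent_featurizer_mitie",
               "intent_classifier_sklearn"],
  "language": "zh"
}
```

For this to work, my_project would also have to be importable from wherever rasa_nlu runs (e.g. installed or on PYTHONPATH).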
A new component does need to be registered. You can take jieba_tokenizer as an example and search the project for the related code.
Yes, I saw that. I was just wondering whether there's a way to register it without modifying the rasa source.
Regarding custom jieba dictionaries, I haven't found a really elegant approach yet.
In the current (2017-11-16) version, users need to put their custom jieba dictionaries under rasa_nlu_chi/jieba_userdict/. The system automatically finds and loads them into jieba at both training and prediction time.
I have some questions about the way the jieba dictionary is added.
Because I installed it with python setup.py install, it lives in site-packages:
unzip -l rasa_nlu-0.12.0a1-py3.6.egg | grep jieba
1665 03-12-2018 15:44 rasa_nlu/tokenizers/jieba_tokenizer.py
2200 03-12-2018 16:22 rasa_nlu/tokenizers/__pycache__/jieba_tokenizer.cpython-36.pyc
My rasa_nlu imports the package from that location, not from the directory produced by git clone https://github.com/crownpku/Rasa_NLU_Chi.git.
So, from my own project directory, I run python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.json.
According to your source code:
import glob
import jieba
jieba_userdicts = glob.glob("./jieba_userdict/*")
for jieba_userdict in jieba_userdicts:
    jieba.load_userdict(jieba_userdict)
Does that mean a jieba_userdict directory has to exist under my project directory (the working directory), and I put the dictionaries there?
I think that when a dictionary is loaded into jieba there should be a console message confirming it; as it stands, people may never notice that nothing was actually loaded.
Is my understanding correct?
Thanks!!
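The logging suggestion above could be sketched like this: resolve the dictionary directory relative to the working directory, log every file that gets loaded, and warn loudly when nothing is found. The ./jieba_userdict default comes from the snippet quoted above; the helper name and return value are illustrative, not part of the project:

```python
import glob
import logging
import os

logger = logging.getLogger(__name__)

def load_jieba_userdicts(dict_dir="./jieba_userdict"):
    """Load every file under dict_dir into jieba, logging what happens.

    Returns the list of loaded files so callers can verify the result.
    """
    files = glob.glob(os.path.join(dict_dir, "*"))
    if not files:
        # Make the silent no-dictionary case visible on the console.
        logger.warning("No jieba userdict found under %s (cwd is %s)",
                       dict_dir, os.getcwd())
        return []
    import jieba  # only needed when there is something to load
    for path in files:
        logger.info("Loading jieba userdict: %s", path)
        jieba.load_userdict(path)
    return files
```

With this in place, a training run started from the wrong working directory produces a visible warning instead of silently skipping the dictionary.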