luozhouyang / autophrasex
Automated Phrase Mining from Massive Text Corpora in Python.
License: Apache License 2.0
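After reading the code, I found that once the random forest classifier is updated, it is applied directly to classify the candidate set. Is POS-Guided Phrasal Segmentation a necessary step? Compared with the original paper, would the HMM in jieba improve performance a little? I would be grateful for an answer, thank you very much!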
Hello, while trying this out I have some questions about the parameters `corpus_files` and `quality_phrase_files`.
My code is as follows:
from autophrasex import *

autophrase = AutoPhrase(
    reader=DefaultCorpusReader(tokenizer=JiebaTokenizer()),
    selector=DefaultPhraseSelector(),
    extractors=[
        NgramsExtractor(N=4),
        IDFExtractor(),
        EntropyExtractor()
    ]
)
predictions = autophrase.mine(
    corpus_files=['answers.txt'],
    quality_phrase_files='userDic.txt',  # quality_phrase_files?? looks like a stop-word list
    callbacks=[
        LoggingCallback(),
        ConstantThresholdScheduler(),
        EarlyStopping(patience=2, min_delta=3)
        # EarlyStopping()
    ]
)
for pred in predictions:
    print(pred)
Thank you all very much for your help!
def _predict_proba(self, phrases):
    features = [self._compose_feature(phrase) for phrase in phrases]
    pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]  # <-- prob[1] on this line
    pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
    return pairs
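Hello, what is the expected format of the input file? I hit the bug below when running. My guess is that a problem with the input features turned an output that should be two-dimensional into a one-dimensional one. My input file has one text per line with no spaces, for example: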
我是一个人。
哈哈哈哈哈。
那里有个苹果。
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-097136691c2f> in <module>
6 LoggingCallback(),
7 ConstantThresholdScheduler(),
----> 8 EarlyStopping(patience=2, min_delta=3)
9 ])
/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in mine(self, corpus_files, quality_phrase_files, N, epochs, callbacks, topk, filter_fn, **kwargs)
122
123 callback.on_epoch_reorganize_phrase_pools_begin(epoch, pos_pool, neg_pool)
--> 124 pos_pool, neg_pool = self._reorganize_phrase_pools(pos_pool, neg_pool, **kwargs)
125 callback.on_epoch_reorganize_phrase_pools_end(epoch, pos_pool, neg_pool)
126
/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in _reorganize_phrase_pools(self, pos_pool, neg_pool, **kwargs)
157 new_pos_pool.extend(deepcopy(pos_pool))
158
--> 159 pairs = self._predict_proba(neg_pool)
160 pairs = sorted(pairs, key=lambda x: x[1], reverse=True)
161 # print(pairs[:10])
/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in _predict_proba(self, phrases)
184 def _predict_proba(self, phrases):
185 features = [self._compose_feature(phrase) for phrase in phrases]
--> 186 pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]
187 pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
188 return pairs
/usr/local/lib/python3.6/dist-packages/autophrasex/autophrase.py in <listcomp>(.0)
184 def _predict_proba(self, phrases):
185 features = [self._compose_feature(phrase) for phrase in phrases]
--> 186 pos_probs = [prob[1] for prob in self.classifier.predict_proba(features)]
187 pairs = [(phrase, prob) for phrase, prob in zip(phrases, pos_probs)]
188 return pairs
IndexError: index 1 is out of bounds for axis 0 with size 1
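One plausible reading of this IndexError is that the classifier was fitted on examples of a single class, so each probability row has only one entry and `prob[1]` does not exist. The sketch below assumes the classifier follows the scikit-learn convention, where `predict_proba` returns one column per label listed in `classes_`. The helper name `positive_class_probs` is hypothetical and not part of autophrasex; it only illustrates looking up the positive column by label instead of by a fixed index.

```python
def positive_class_probs(proba_rows, classes, positive_label=1):
    """Return P(positive_label) for each row of a predict_proba-style output.

    proba_rows: rows holding one probability per entry in `classes`.
    classes: labels in column order (scikit-learn exposes this as `classes_`).
    """
    classes = list(classes)
    if positive_label not in classes:
        # The model never saw a positive example, so report 0.0 for every row.
        return [0.0 for _ in proba_rows]
    col = classes.index(positive_label)
    return [float(row[col]) for row in proba_rows]


# Both classes seen during training: column 1 holds the positive probability.
print(positive_class_probs([[0.8, 0.2], [0.1, 0.9]], classes=[0, 1]))
# Only the positive class seen: a single column, which is what breaks prob[1].
print(positive_class_probs([[1.0], [1.0]], classes=[1]))
# Only the negative class seen.
print(positive_class_probs([[1.0]], classes=[0]))
```

In the library this would correspond to passing `self.classifier.predict_proba(features)` and `self.classifier.classes_`, assuming the classifier exposes `classes_`. Checking that both pools are non-empty before training would be another way to avoid the situation.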
2021-04-16 11:12:39,442 INFO autophrase.py 33] Load quality phrases finished. There are 10386 quality phrases in total.
2021-04-16 11:12:39,937 INFO autophrase.py 36] Selected 1000 frequent phrases.
2021-04-16 11:12:39,938 INFO autophrase.py 39] Size of initial positive pool: 118
2021-04-16 11:12:39,939 INFO autophrase.py 40] Size of initial negative pool: 782
2021-04-16 11:12:39,940 INFO autophrase.py 46] Starting to train model at epoch 1 ...
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-28-7f2482f60f66> in <module>
4 strategy=strategy,
5 N=4,
----> 6 epochs=10)
7
8 for pred in predictions:
~/.pylib/Lib/site-packages/autophrasex/autophrase.py in mine(self, input_doc_files, quality_phrase_files, strategy, N, **kwargs)
45 for epoch in range(kwargs.pop('epochs', 5)):
46 logging.info('Starting to train model at epoch %d ...', epoch + 1)
---> 47 x, y = strategy.compose_training_data(pos_pool, neg_pool, **kwargs)
48 self.classifier.fit(x, y)
49 logging.info('Finished to train model at epoch %d', epoch + 1)
~/.pylib/Lib/site-packages/autophrasex/strategy.py in compose_training_data(self, pos_pool, neg_pool, **kwargs)
155 for p in pos_pool:
156 p = ' '.join(self.tokenizer.tokenize(p))
--> 157 examples.append((self.build_input_features(p), 1))
158 for p in neg_pool:
159 p = ' '.join(self.tokenizer.tokenize(p))
~/.pylib/Lib/site-packages/autophrasex/strategy.py in build_input_features(self, phrase, **kwargs)
210 doc_freq = self.idf_callback.doc_freq_of(phrase)
211 idf = self.idf_callback.idf_of(phrase)
--> 212 pmi = self.ngrams_callback.pmi_of(phrase)
213 left_entropy = self.entropy_callback.left_entropy_of(phrase)
214 right_entropy = self.entropy_callback.right_entropy_of(phrase)
~/.pylib/Lib/site-packages/autophrasex/callbacks.py in pmi_of(self, ngram)
79 ngram_total_occur = sum(self.ngrams_freq[n].values())
80 freq = self.ngrams_freq[n].get(''.join(ngram.split(' ')), 0)
---> 81 return self._pmi_of(ngram, n, freq, unigram_total_occur, ngram_total_occur)
82
83 def pmi(self):
~/.pylib/Lib/site-packages/autophrasex/callbacks.py in _pmi_of(self, ngram, n, freq, unigram_total_occur, ngram_total_occur)
61 indep_prob = reduce(
62 mul, [self.ngrams_freq[1][unigram] for unigram in ngram.split(' ')]) / (unigram_total_occur ** n)
---> 63 pmi = math.log((joint_prob + self.epsilon) / (indep_prob + self.epsilon), 2)
64 return pmi
65
ZeroDivisionError: float division by zero
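After debugging, I found that this is caused by epsilon defaulting to 0 in the callback. I suggest changing the default to a very small value, or passing a parameter such as epsilon=1e-9 explicitly in the getting-started example, which would be friendlier to users.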
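The effect of epsilon can be reproduced in isolation. The sketch below mirrors the formula visible in the traceback (joint probability over the product of unigram probabilities, log base 2), but the standalone function `pmi` and its argument names are assumptions made for illustration, not the library's API.

```python
import math
from functools import reduce
from operator import mul


def pmi(ngram_freq, unigram_freqs, ngram_total, unigram_total, epsilon=1e-9):
    """Pointwise mutual information of an n-gram, smoothed by epsilon."""
    n = len(unigram_freqs)
    joint_prob = ngram_freq / ngram_total
    indep_prob = reduce(mul, unigram_freqs) / (unigram_total ** n)
    # With epsilon == 0 and an unseen unigram, indep_prob is 0 and this divides by zero.
    return math.log((joint_prob + epsilon) / (indep_prob + epsilon), 2)


# An unseen unigram (frequency 0) is safe with a small positive epsilon.
print(pmi(0, [0, 5], ngram_total=100, unigram_total=200))

# The same input with epsilon=0 reproduces the reported failure.
try:
    pmi(0, [0, 5], ngram_total=100, unigram_total=200, epsilon=0)
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```

A small positive epsilon keeps both the numerator and the denominator away from zero, at the cost of a slight bias for very rare n-grams.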
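The environment for this code is Windows 10, Core i7, Python 3.
This import fails: from autophrasex import AutoPhrase, BaiduLacTokenizer, Strategy
Could the LAC module be replaced with another package that offers the same functionality? Users in the Baidu LAC repo have also reported import errors.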
...autophrasex_demo.py", line 11, in <module>
from autophrasex import AutoPhrase, BaiduLacTokenizer, Strategy
File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\__init__.py", line 3, in <module>
from .autophrase import AutoPhrase
File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\autophrase.py", line 8, in <module>
from .strategy import AbstractStrategy
File "D:\ProgramData\Anaconda3\lib\site-packages\autophrasex\strategy.py", line 6, in <module>
from LAC import LAC
File "D:\ProgramData\Anaconda3\lib\site-packages\LAC\__init__.py", line 23, in <module>
from .lac import LAC
File "D:\ProgramData\Anaconda3\lib\site-packages\LAC\lac.py", line 28, in <module>
import paddle.fluid as fluid
File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\__init__.py", line 29, in <module>
from .fluid import monkey_patch_variable
File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\__init__.py", line 35, in <module>
from . import framework
File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 34, in <module>
from .proto import framework_pb2
File "D:\ProgramData\Anaconda3\lib\site-packages\paddle\fluid\proto\framework_pb2.py", line 11, in <module>
from google.protobuf import descriptor_pb2
File "D:\ProgramData\Anaconda3\lib\site-packages\google\protobuf\descriptor_pb2.py", line 1840, in <module>
__module__ = 'google.protobuf.descriptor_pb2'
TypeError: expected bytes, Descriptor found
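The traceback shows that a broken LAC/paddle installation makes the whole `autophrasex` package fail to import, because `LAC` is imported at module level. A common way to soften this is to treat such a dependency as optional. The sketch below is a generic pattern, not code from autophrasex; `load_optional` is a hypothetical helper name.

```python
import importlib


def load_optional(module_name):
    """Import module_name, returning None if the import fails for any reason."""
    try:
        return importlib.import_module(module_name)
    except Exception as exc:
        # Broad on purpose: the failure above is a TypeError, not an ImportError.
        print(f"optional dependency {module_name!r} unavailable: {exc!r}")
        return None


lac_module = load_optional("LAC")
if lac_module is None:
    print("falling back to a tokenizer that does not need LAC")
```

With this pattern, only code paths that actually need LAC would fail, and users of other tokenizers would be unaffected.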
Because after filtering, all of the phrases may end up in pos_pool.