2shou / textgrocery

A simple short-text classification tool based on LibLinear

License: GNU General Public License v3.0

Makefile 1.24% Python 31.12% C 29.24% C++ 38.40%

textgrocery's Introduction

TextGrocery

A simple, efficient short-text classification tool based on LibLinear

Embeds jieba as the default tokenizer to support Chinese tokenization

Other languages: more detailed documentation in Chinese

Performance

Train set: 48k news titles with 32 labels
Test set: 16k news titles with 32 labels
Compared with the SVM and naive Bayes classifiers of scikit-learn

Classifier	Accuracy	Time cost(s)
scikit-learn(nb)	76.8%	134
scikit-learn(svm)	76.9%	121
TextGrocery	79.6%	49

Sample Code

>>> from tgrocery import Grocery
# Create a grocery(don't forget to set a name)
>>> grocery = Grocery('sample')
# Train from list
>>> train_src = [
    ('education', 'Student debt to cost Britain billions within decades'),
    ('education', 'Chinese education for TV experiment'),
    ('sports', 'Middle East and Asia boost investment in top level sports'),
    ('sports', 'Summit Series look launches HBO Canada sports doc series: Mudhar')
]
>>> grocery.train(train_src)
# Or train from file
# Format: Label\tText
>>> grocery.train('train_ch.txt')
# Save model
>>> grocery.save()
# Load model(the same name as previous)
>>> new_grocery = Grocery('sample')
>>> new_grocery.load()
# Predict
>>> new_grocery.predict('Abbott government spends $8 million on higher education media blitz')
education
# Test from list
>>> test_src = [
    ('education', 'Abbott government spends $8 million on higher education media blitz'),
    ('sports', 'Middle East and Asia boost investment in top level sports'),
]
>>> new_grocery.test(test_src)
# Return Accuracy
1.0
# Or test from file
>>> new_grocery.test('test_ch.txt')
# Custom tokenize
>>> custom_grocery = Grocery('custom', custom_tokenize=list)

More examples: sample/

Install

$ pip install tgrocery

Only tested on Unix-based systems

textgrocery's Issues

Not running in Debian

I used this library successfully on OS X (10.10.3), but when I run the same code on my Debian (wheezy) system, it fails with this error:

Traceback (most recent call last):
    File "generate_classifications.py", line 4, in <module>
        from tgrocery import Grocery
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/__init__.py", line 2, in <module>
        from classifier import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/classifier.py", line 6, in <module>
        from .learner import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/__init__.py", line 1, in <module>
        from .learner import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/learner.py", line 26, in <module>
        import liblinear
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/liblinear/python/liblinear.py", line 9, in <module>
        liblinear = CDLL(path.join(path.dirname(path.abspath(__file__)), '../liblinear.so.1'))
    File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
        self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python2.7/dist-packages/tgrocery/learner/liblinear/python/../liblinear.so.1: cannot open shared object file: No such file or directory

More details:

/usr/local/lib is in my path.
OSX is running python 2.7.6, Debian is running python 2.7.3

Any idea why this might be happening?
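The traceback shows that the import fails while loading a compiled shared library from inside the installed package. One way to narrow this down is to check whether the expected files exist at all. The sketch below is a hypothetical helper, not part of tgrocery; the two relative paths are taken from the tracebacks reported in these issues and are assumptions about the installed layout.

```python
import os

# Relative paths taken from the tracebacks in these issues; they are
# assumptions about how the installed package is laid out.
EXPECTED_LIBS = (
    os.path.join("learner", "util.so.1"),
    os.path.join("learner", "liblinear", "liblinear.so.1"),
)


def missing_shared_libs(package_dir, expected=EXPECTED_LIBS):
    """Return the expected shared libraries that are absent under package_dir."""
    return [
        rel for rel in expected
        if not os.path.isfile(os.path.join(package_dir, rel))
    ]
```

For the report above, one would call `missing_shared_libs('/usr/local/lib/python2.7/dist-packages/tgrocery')`. A non-empty result would suggest that the compile step did not produce the libraries during installation, although this is not confirmed as the cause here.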

Training from a file always fails

Training always fails when I read the training data from a file.

optimization finished, #iter = 1
Objective value = 0.000000
nSV = 0
True
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.229 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "/home/sinchuck/PythonPratice/sougou_result.py", line 15, in <module>
    predict_result = grocery.predict('纹身 图片 编程 软件 古风 模具 官网 螺距 螺纹 酒吧 表情 男生 数控技术 客服 切削液 价格表 分界线 钻石 喊麦 霸气')
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/__init__.py", line 43, in predict
    return self.model.predict_text(single_text)
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/classifier.py", line 57, in predict_text
    y = self.text_converter.get_class_name(int(y))
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/converter.py", line 140, in get_class_name
    return self.class_map.to_class_name(class_idx)
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/converter.py", line 114, in to_class_name
    'class idx ({0}) should be less than the number of classes ({0}).'.format(idx, len(self.idx2class)))
KeyError: 'class idx (34190240) should be less than the number of classes (34190240).'

This is the program source:

#!/usr/bin/env python
# encoding: utf-8

from tgrocery import Grocery

grocery = Grocery('sougou')
train_src = '/home/sinchuck/sougou/age.txt'
grocery.train(train_src)
print grocery.get_load_status()

predict_result = grocery.predict('纹身 图片 编程 软件 古风 模具 官网 螺距 螺纹 酒吧 表情 男生 数控技术 客服 切削液 价格表 分界线 钻石 喊麦 霸气')
print predict_result

The age.txt file looks like this:

1		长官 双沟 教师节 柔和 图片 农场 诗句 价格表 行李 小说 蒜苔 征文 表情 家常 根号 发票 星座 百度 天才 魔棒
4		剖腹产 奶粉 喜宝 属鸡 小儿 肚子 属猪 刀口 芦花 价格表 心脏病 功效 线头 声音 儿歌 眼睫毛 湿疹 胶囊 肠胃 先天性
1		小说 书包 全文 养女 软化 成妃 师徒 涟漪 歌曲 媚媚 肉文 进化史 销魂 温柔 伟大 爸爸 幸福 冥界 头发 弄潮
6		发票 汽车 购车 新车 进口车 有限公司 个人 增值税 专用发票 区别 公证书 权益法 消费者 原车 倾尘 赔偿标准 汽车贸易 个人信息 上户 结果
3		编码器 拉链 脉冲雷达 电影 孤舰 处理器 丧尸 口碑 市长 国度 文件 皇帝 暴风 系统 编码 发动机 抗体 封闭抗体 间谍活动 科委
2		苹果 百度 高手 鸡腿 校园 老版 炖法 缠绵 剧情 链接 语音 资源 技巧 学姐 电影 记录 墙式 图片 完整版 版聊
3		软肋 铠甲 待人 国品 浏览器 游戏 纯金 信用社 坚果 美食 分队 百度 金牌 视频 眉毛 节目 酒店 反光膜 电玩 年轻
3		十字绣 农村 服务站 官网 汽车 罚款 成人 前途 学校 流行歌曲 飞云 前景 行业 厨师 时间段 美容 传奇 装饰 对话框 售票点
3		运程 股票 线图 黑衣人 投弹手 图片 意思 生肖 米粒 太极拳 虱子 熊市 煤矿 天空 官网 技付 能市 神符 演员表 侧神
1		百度 废材 逆天 拳皇 魔盗 武神 风云 人类 妖孽 风暴 机甲 星河 仙侠 酒神 重生 原形 电影 帝国 飞天 女友

What is the cause of this? (Note: the same code runs successfully if train_src is changed to a list.)
I hope to receive your reply. Thank you.
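The README gives the file format as `Label\tText`, that is, one tab between label and text. The sample lines above appear to contain two tabs, so checking the file for lines that do not split into exactly two fields may be worth a try. Whether this is the actual cause is not confirmed. The helper below is a hypothetical sketch and not part of tgrocery.

```python
def find_malformed_lines(lines):
    """Return (line_number, line) pairs that are not exactly 'label<TAB>text'."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored
        parts = line.split("\t")
        # Exactly one tab is expected, with a non-empty field on each side.
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            bad.append((number, line))
    return bad
```

It can be applied to an open file object, since iterating over a file yields its lines.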

tgrocery removes files from the folder

Hi Developers,

I was testing a few things with Grocery and used this folder to create the models.

grocery = Grocery('/Users/rahulkumar/Desktop/')
When I executed grocery.save(), it deleted all the files in the Desktop folder, and now nothing is left in that folder.

I checked the log and it performed an os.rmdir() operation on that folder.

Please fix this workflow. It is critical.

Also, I am not sure whether I can recover my files.
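Judging from this report, save() seems to treat the name passed to Grocery as a directory it owns. Until the library guards against this, a caller can check the name first. The function below is a hypothetical precaution, not part of tgrocery; it refuses any name that resolves to an existing, non-empty directory.

```python
import os


def check_model_name(name):
    """Return `name` unchanged, or raise ValueError if it is a non-empty directory."""
    path = os.path.abspath(name)
    if os.path.isdir(path) and os.listdir(path):
        raise ValueError(
            "refusing to use a non-empty directory as model name: %s" % path
        )
    return name
```

A call such as `Grocery(check_model_name('sample'))` would then fail early instead of risking existing files.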

Does not work on the Windows platform

C:\Windows\System32>pip install tgrocery
Downloading/unpacking tgrocery
  Downloading tgrocery-0.1.3.tar.gz
  Running setup.py (path:c:\users\r\appdata\local\temp\pip_build_r\tgrocery\setu
p.py) egg_info for package tgrocery

    package init file 'tgrocery\learner\liblinear\python\__init__.py' not found
(or not a regular file)
Requirement already satisfied (use --upgrade to upgrade): jieba in c:\python27\l
ib\site-packages (from tgrocery)
Installing collected packages: tgrocery
  Running setup.py install for tgrocery
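    'make' is not recognized as an internal or external command,
    operable program or batch file.
    'cp' is not recognized as an internal or external command,
    operable program or batch file.
    'cp' is not recognized as an internal or external command,
    operable program or batch file.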
    package init file 'tgrocery\learner\liblinear\python\__init__.py' not found
(or not a regular file)

Successfully installed tgrocery
Cleaning up...

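How to run prediction on a text file

I am new to Python. After reading the code, predicting a single sentence works fine, but how do I predict for a text file that has one sentence per line? Can predict_text of grocerytextmodel do this? I do not really understand how to use that function.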
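A minimal sketch for line-by-line prediction, assuming only that a single-text predict function is available (such as the `predict` method shown in the README sample). The helper `predict_file` and the file name in the usage note are hypothetical.

```python
import io


def predict_file(predict, path, encoding="utf-8"):
    """Apply `predict` to every non-empty line of a text file.

    Returns a list of (text, prediction) pairs in file order.
    """
    results = []
    with io.open(path, encoding=encoding) as handle:
        for raw in handle:
            text = raw.strip()
            if text:  # skip blank lines
                results.append((text, predict(text)))
    return results
```

With a loaded model this could be called as `predict_file(new_grocery.predict, 'titles.txt')`.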

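Consider exposing more detailed model information

The prediction result seems to contain only the outcome for each class. When iterating on a model, could it also provide something like the decision path or the importance of each feature?

OSError: cannot find the liblinear.so.1 file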

/usr/anaconda2/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/liblinear.py in <module>()
      7 
      8 # For unix the prefix 'lib' is not considered.
----> 9 liblinear = CDLL(path.join(path.dirname(path.abspath(__file__)), '../liblinear.so.1'))
     10 
     11 # Construct constants

/usr/anaconda2/lib/python2.7/ctypes/__init__.pyc in __init__(self, name, mode, handle, use_errno, use_last_error)
    363 
    364         if handle is None:
--> 365             self._handle = _dlopen(self._name, mode)
    366         else:
    367             self._handle = handle

`OSError: /usr/anaconda2/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/../liblinear.so.1: cannot open shared object file: No such file or directory`

What is this "../liblinear.so.1" file? My Python version is 2.7.11.

Could you provide the examples and data for the baseline comparison?

Compare with svm and naive-bayes of scikit-learn
....
I could not find the data or the examples.

Why is the return value an object rather than 'education' as in the documentation?

![qq 20160909160756](https://cloud.githubusercontent.com/assets/17672927/18380018/ff0e3458-76a7-11e6-9466-37e63c3a807c.png)

After running 2to3, an error appears for /usr/local/lib/python3.4/site-packages/tgrocery/learner/util.so.1

How can this error be resolved?

Question about the runtime environment

On Mac OS X with Python 2.7, when I try to run

from tgrocery import Grocery

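I get this error: OSError: dlopen(/Users/user/Library/Python/2.7/lib/python/site-packages/tgrocery/learner/util.so.1, 6): image not found

What is the cause? Thanks!

How should the ratio of training samples per class be chosen for multi-class classification?

For example, I have three classes C1, C2, and C3. Does the amount of training data for these classes need to be kept at 1:1:1?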
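Before deciding on a ratio, it can help to measure the one the training data actually has. The sketch below assumes samples in the `(label, text)` form used by the README sample; the function name is hypothetical, and the code does not say whether a 1:1:1 ratio is required.

```python
from collections import Counter


def label_proportions(samples):
    """Return {label: fraction of samples} for (label, text) pairs."""
    counts = Counter(label for label, _text in samples)
    total = sum(counts.values())
    return {label: count / float(total) for label, count in counts.items()}
```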

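About stopwords and part-of-speech filtering

Many thanks for this project.

I am puzzled by the approach mentioned in your blog: "bigram tokenization, no stopword removal, no part-of-speech filtering". Wouldn't it work better to remove stopwords and to select features by keeping only nouns or by applying a TF-IDF filtering step?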
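For readers unfamiliar with the term, bigram features pair each token with its successor, so word order within a short text is partly preserved. The snippet below is a generic illustration of that idea, not TextGrocery's actual feature extraction; the function name and the space used as a joiner are assumptions.

```python
def unigrams_and_bigrams(tokens):
    """Return the tokens followed by all adjacent-pair (bigram) features."""
    tokens = list(tokens)
    bigrams = [first + " " + second for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams
```

For a title with only a handful of tokens, this roughly doubles the number of features, which may be one reason the technique is used on short texts.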

util = CDLL(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'util.so.1')) File "x:\Program Files\Anaconda2\lib\ctypes\__init__.py", line 365, in __init__ self._handle = _dlopen(self._name, mode)

This appears on Windows 10.
Python 2.7

Is this project related to LibShortText?

The numbers are identical!

http://guoze.me/2014/09/25/libshorttext-introduction/

scikit-learn(nb) 76.8% 134
scikit-learn(svm) 76.9% 121
libshorttext 79.6% 49

Classifier Accuracy Time cost(s)
scikit-learn(nb) 76.8% 134
scikit-learn(svm) 76.9% 121
TextGrocery 79.6% 49

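Is there a blog post that explains how TextGrocery works under the hood?

I tried classifying my own short-text data with both HanLP and TextGrocery, and TextGrocery did far better than HanLP.
So here is my question: HanLP uses its own built-in tokenizer and TextGrocery uses the jieba tokenizer, and both use an SVM. I doubt that such a large difference in results comes from the tokenizer alone. What else differs under the hood? I would really like to understand how TextGrocery works.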

ValueError: Error: Initial-solution specification supported only for solver L2R_LR and L2R_L2LOSS_SVC

IndentationError: unexpected indent

from tgrocery import Grocery
grocery = Grocery('sample')
train_src = [
... ('education', '名师指导托福语法技巧：名词的复数形式'),
... ('education', '**高考成绩海外认可是“狼来了”吗？'),
... ('sports', '图文：法网孟菲尔斯苦战进16强孟菲尔斯怒吼'),
... ('sports', '四川丹棱举行全国长距登山挑战赛近万人参与')
... ]
grocery.train(train_src)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.315 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/tgrocery/__init__.py", line 36, in train
    model = train(self.train_svm_file, '', '-s 4')
  File "/usr/local/lib/python2.7/site-packages/tgrocery/learner/learner.py", line 394, in train
    m = liblinear_train(learner_prob, learner_param)
  File "/usr/local/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/liblinearutil.py", line 147, in train
    raise ValueError('Error: %s' % err_msg)
ValueError: Error: Initial-solution specification supported only for solver L2R_LR and L2R_L2LOSS_SVC

Why do the predict and test methods return str and float?

Traceback (most recent call last):
  File "/home/fatherfox/PycharmProjects/grocery/test.py", line 28, in <module>
    print grocery.predict('考生必读：新托福写作考试评分标准').dec_values
AttributeError: 'str' object has no attribute 'dec_values'

Is multi-label classification of a single sample possible?

Can the same piece of data be assigned to more than one class?

cannot load from file.

Could you please post a sample file format?

I cannot load from a .txt file because I do not know the correct format.

Many thanks for your help
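The README sample states the file format as `Label\tText`: one sample per line, with a single tab between the label and the text. The sketch below writes such a file from `(label, text)` pairs. The function name and the UTF-8 encoding are assumptions, not part of tgrocery.

```python
import io


def write_training_file(path, samples, encoding="utf-8"):
    """Write (label, text) pairs as 'label<TAB>text' lines, one per sample."""
    with io.open(path, "w", encoding=encoding) as handle:
        for label, text in samples:
            # Tabs or newlines inside a field would break the one-tab format.
            for field in (label, text):
                if "\t" in field or "\n" in field:
                    raise ValueError("field contains a tab or newline: %r" % field)
            handle.write(u"%s\t%s\n" % (label, text))
```

A file produced this way would contain lines such as `education<TAB>Chinese education for TV experiment`, where `<TAB>` stands for a single tab character, and could then be passed by path to `grocery.train(...)` as in the README sample.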

Has this not been upgraded to Python 3.x?

It is not Python 3 friendly... ImportError: No module named 'converter'

Is there a recommended alternative for Python 3?

OSError: /usr/local/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/../liblinear.so.1: cannot open shared object file: No such file or directory

PredictResult TypeError

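Is there an option to not use bigrams? May I join this project?

Why does even my pip install have problems?

pip reports that the installation succeeded, but import tgrocery raises the errors below. Could someone explain?
Problem under Python 3: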
from converter import *
ImportError: No module named 'converter'

Problem under Python 2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\anaconda2\lib\site-packages\tgrocery\__init__.py", line 2, in <module>
    from classifier import *
  File "D:\anaconda2\lib\site-packages\tgrocery\classifier.py", line 6, in <module>
    from .learner import *
  File "D:\anaconda2\lib\site-packages\tgrocery\learner\__init__.py", line 1, in <module>
    from .learner import *
  File "D:\anaconda2\lib\site-packages\tgrocery\learner\learner.py", line 21, in <module>
    util = CDLL(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'util.so.1'))
  File "D:\anaconda2\lib\ctypes\__init__.py", line 362, in __init__
    self._handle = _dlopen(self._name, mode)
WindowsError: [Error 126]

Can it be used on Windows? When I run it on Windows it always core dumps, but it works correctly on Linux.

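Trained on my own corpus, and tgrocery's accuracy did not improve

The split between training and test data is 80% / 20%.

scikit-learn(svm): accuracy 78%
tgrocery: accuracy 0.76176%

tgrocery is somewhat faster, though. My corpus consists of short texts (most are under 10 characters), and since tgrocery targets short-text classification I expected its accuracy to be a bit better.
Is there any way to improve the accuracy? Or can anyone recommend methods that suit short-text classification? Thanks.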

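How do I load a custom dictionary when jieba is used for tokenization?

How do I load a custom dictionary when jieba is used for tokenization?
When I use jieba on its own I can load a custom dictionary. Can I do the same in TextGrocery?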
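One possible route, sketched under assumptions: the README sample shows a `custom_tokenize` argument, so a tokenizer built on a jieba instance that has already loaded a user dictionary could be passed in. `jieba.load_userdict` and `jieba.cut` are jieba functions; `make_tokenizer` and the dictionary path are hypothetical, and the whitespace fallback exists only so the snippet runs where jieba is not installed. Whether TextGrocery's built-in jieba tokenizer also picks up a globally loaded dictionary is not confirmed here.

```python
def make_tokenizer(cut):
    """Wrap a `cut(text)` function so it returns a list without blank tokens."""
    def tokenize(text):
        return [token for token in cut(text) if token.strip()]
    return tokenize


try:
    import jieba  # optional; only needed for the jieba-based tokenizer
except ImportError:
    jieba = None

if jieba is not None:
    # Uncomment to load a custom dictionary; 'userdict.txt' is a hypothetical path.
    # jieba.load_userdict('userdict.txt')
    tokenize = make_tokenizer(jieba.cut)
else:
    # Fallback so the sketch still runs without jieba: split on whitespace.
    tokenize = make_tokenizer(lambda text: text.split())
```

The resulting function could then be supplied as `Grocery('custom', custom_tokenize=tokenize)`, following the custom tokenizer line in the README sample.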