
textgrocery's Introduction

TextGrocery


A simple, efficient short-text classification tool based on LibLinear

Uses jieba as the default tokenizer to support Chinese tokenization

Other languages: more detailed documentation in Chinese

Performance

  • Train set: 48k news titles with 32 labels
  • Test set: 16k news titles with 32 labels
  • Compared with the SVM and naive Bayes classifiers of scikit-learn

Classifier          Accuracy  Time cost (s)
scikit-learn(nb)    76.8%     134
scikit-learn(svm)   76.9%     121
TextGrocery         79.6%     49

Sample Code

>>> from tgrocery import Grocery
# Create a grocery (don't forget to set a name)
>>> grocery = Grocery('sample')
# Train from list
>>> train_src = [
    ('education', 'Student debt to cost Britain billions within decades'),
    ('education', 'Chinese education for TV experiment'),
    ('sports', 'Middle East and Asia boost investment in top level sports'),
    ('sports', 'Summit Series look launches HBO Canada sports doc series: Mudhar')
]
>>> grocery.train(train_src)
# Or train from file
# Format: Label\tText
>>> grocery.train('train_ch.txt')
# Save model
>>> grocery.save()
# Load the model (use the same name as before)
>>> new_grocery = Grocery('sample')
>>> new_grocery.load()
# Predict
>>> new_grocery.predict('Abbott government spends $8 million on higher education media blitz')
education
# Test from list
>>> test_src = [
    ('education', 'Abbott government spends $8 million on higher education media blitz'),
    ('sports', 'Middle East and Asia boost investment in top level sports'),
]
>>> new_grocery.test(test_src)
# Return Accuracy
1.0
# Or test from file
>>> new_grocery.test('test_ch.txt')
# Custom tokenizer
>>> custom_grocery = Grocery('custom', custom_tokenize=list)

More examples: sample/
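
The `custom_tokenize=list` call in the sample suggests that the tokenizer can be any callable that takes a text and returns a list of tokens (`list` yields one token per character). Under that assumption, a minimal sketch of a whitespace tokenizer for pre-segmented text might look like this; `whitespace_tokenize` is a hypothetical name and not part of tgrocery.

```python
def whitespace_tokenize(text):
    """Split pre-segmented text on runs of whitespace and return the tokens."""
    return text.split()


# The built-in `list`, used in the sample above, yields one token per character.
char_tokens = list("top level")
# The sketch above yields one token per whitespace-separated word.
word_tokens = whitespace_tokenize("top level  sports")

# Assumed usage, mirroring the README sample (requires tgrocery to be installed):
# custom_grocery = Grocery('custom', custom_tokenize=whitespace_tokenize)
```

Whether a word-level or character-level tokenizer works better will depend on the data, so it is worth comparing both with `test`.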

Install

$ pip install tgrocery

Tested only on Unix-based systems


textgrocery's Issues

Not running on Debian

I used this library successfully on OS X (10.10.3), but when I run the same code on my Debian (wheezy) system, it fails with this error:

Traceback (most recent call last):
    File "generate_classifications.py", line 4, in <module>
        from tgrocery import Grocery
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/__init__.py", line 2, in <module>
        from classifier import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/classifier.py", line 6, in <module>
        from .learner import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/__init__.py", line 1, in <module>
        from .learner import *
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/learner.py", line 26, in <module>
        import liblinear
    File "/usr/local/lib/python2.7/dist-packages/tgrocery/learner/liblinear/python/liblinear.py", line 9, in <module>
        liblinear = CDLL(path.join(path.dirname(path.abspath(__file__)), '../liblinear.so.1'))
    File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
        self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python2.7/dist-packages/tgrocery/learner/liblinear/python/../liblinear.so.1: cannot open shared object file: No such file or directory

More details:

  • /usr/local/lib is in my path.
  • OS X is running Python 2.7.6; Debian is running Python 2.7.3.

Any idea why this might be happening?
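
The traceback shows that the import fails because `liblinear.so.1` is not at the path the package expects. One way to narrow this down is to check for the compiled files directly. The sketch below assumes the two relative locations that appear in tracebacks on this page (`learner/liblinear/liblinear.so.1` and `learner/util.so.1`); `missing_shared_libs` is a hypothetical helper, not part of tgrocery.

```python
import os

# Relative paths taken from the tracebacks on this page; treat them as
# assumptions for other tgrocery versions.
EXPECTED_LIBS = (
    os.path.join("learner", "liblinear", "liblinear.so.1"),
    os.path.join("learner", "util.so.1"),
)


def missing_shared_libs(package_dir, expected=EXPECTED_LIBS):
    """Return the expected shared-library paths that do not exist under package_dir."""
    return [
        os.path.join(package_dir, rel)
        for rel in expected
        if not os.path.isfile(os.path.join(package_dir, rel))
    ]


if __name__ == "__main__":
    # Path copied from the traceback above; adjust it to your installation.
    pkg = "/usr/local/lib/python2.7/dist-packages/tgrocery"
    for path in missing_shared_libs(pkg):
        print("missing: " + path)
```

An empty result means the files are present. A non-empty one suggests that the build step during installation did not produce them, in which case the installation output is the next place to look.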

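Suggestion: dictionary management features

Please consider exposing an interface for deleting, adding, and modifying dictionary entries, and for setting their weights.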

Training from a file always fails

Training always fails when I read the data from a file:

optimization finished, #iter = 1
Objective value = 0.000000
nSV = 0
True
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.229 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "/home/sinchuck/PythonPratice/sougou_result.py", line 15, in <module>
    predict_result = grocery.predict('纹身 图片 编程 软件 古风 模具 官网 螺距 螺纹 酒吧 表情 男生 数控技术 客服 切削液 价格表 分界线 钻石 喊麦 霸气')
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/__init__.py", line 43, in predict
    return self.model.predict_text(single_text)
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/classifier.py", line 57, in predict_text
    y = self.text_converter.get_class_name(int(y))
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/converter.py", line 140, in get_class_name
    return self.class_map.to_class_name(class_idx)
  File "/usr/local/lib/python2.7/dist-packages/tgrocery/converter.py", line 114, in to_class_name
    'class idx ({0}) should be less than the number of classes ({0}).'.format(idx, len(self.idx2class)))
KeyError: 'class idx (34190240) should be less than the number of classes (34190240).'

Here is the source code:

#!/usr/bin/env python
# encoding: utf-8

from tgrocery import Grocery

grocery = Grocery('sougou')
train_src = '/home/sinchuck/sougou/age.txt'
grocery.train(train_src)
print grocery.get_load_status()

predict_result = grocery.predict('纹身 图片 编程 软件 古风 模具 官网 螺距 螺纹 酒吧 表情 男生 数控技术 客服 切削液 价格表 分界线 钻石 喊麦 霸气')
print predict_result

The file age.txt looks like this:

1		长官 双沟 教师节 柔和 图片 农场 诗句 价格表 行李 小说 蒜苔 征文 表情 家常 根号 发票 星座 百度 天才 魔棒
4		剖腹产 奶粉 喜宝 属鸡 小儿 肚子 属猪 刀口 芦花 价格表 心脏病 功效 线头 声音 儿歌 眼睫毛 湿疹 胶囊 肠胃 先天性
1		小说 书包 全文 养女 软化 成妃 师徒 涟漪 歌曲 媚媚 肉文 进化史 销魂 温柔 伟大 爸爸 幸福 冥界 头发 弄潮
6		发票 汽车 购车 新车 进口车 有限公司 个人 增值税 专用发票 区别 公证书 权益法 消费者 原车 倾尘 赔偿标准 汽车贸易 个人信息 上户 结果
3		编码器 拉链 脉冲雷达 电影 孤舰 处理器 丧尸 口碑 市长 国度 文件 皇帝 暴风 系统 编码 发动机 抗体 封闭抗体 间谍活动 科委
2		苹果 百度 高手 鸡腿 校园 老版 炖法 缠绵 剧情 链接 语音 资源 技巧 学姐 电影 记录 墙式 图片 完整版 版聊
3		软肋 铠甲 待人 国品 浏览器 游戏 纯金 信用社 坚果 美食 分队 百度 金牌 视频 眉毛 节目 酒店 反光膜 电玩 年轻
3		十字绣 农村 服务站 官网 汽车 罚款 成人 前途 学校 流行歌曲 飞云 前景 行业 厨师 时间段 美容 传奇 装饰 对话框 售票点
3		运程 股票 线图 黑衣人 投弹手 图片 意思 生肖 米粒 太极拳 虱子 熊市 煤矿 天空 官网 技付 能市 神符 演员表 侧神
1		百度 废材 逆天 拳皇 魔盗 武神 风云 人类 妖孽 风暴 机甲 星河 仙侠 酒神 重生 原形 电影 帝国 飞天 女友

What could be the cause? (Note: the same code runs successfully as soon as train_src is changed to a list.)
I hope to hear back from you. Thank you.
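
One thing worth checking in the file above: each line has two tab characters between the label and the text, whereas the README gives the format as `Label\tText`. If the loader splits each line on a single tab (an assumption; this page does not show the loader's code), such lines would yield three fields instead of two. The sketch below collapses consecutive tabs and reports lines that still do not have exactly two non-empty fields; `normalize_tsv_lines` is a hypothetical helper.

```python
import re


def normalize_tsv_lines(lines):
    """Collapse runs of tabs into one and keep only lines of the form label<TAB>text.

    Returns (clean_lines, bad_line_numbers); line numbers are 1-based.
    """
    clean, bad = [], []
    for number, line in enumerate(lines, start=1):
        line = re.sub(r"\t+", "\t", line.rstrip("\r\n"))
        fields = line.split("\t")
        if len(fields) == 2 and fields[0] and fields[1]:
            clean.append(line)
        else:
            bad.append(number)
    return clean, bad


sample = [
    "1\t\t长官 双沟 教师节\n",  # two tabs, as in age.txt above
    "4\t剖腹产 奶粉 喜宝\n",    # already well-formed
    "no tab on this line\n",    # malformed: reported, not kept
]
clean, bad = normalize_tsv_lines(sample)
```

The cleaned lines can be written back to a file (one per line) and passed to `grocery.train`, or split into `(label, text)` tuples, which the report says already works.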

tgrocery removes files from the folder

Hi Developers,

I was testing a few things with Grocery and used this folder to create the models:

grocery = Grocery('/Users/rahulkumar/Desktop/')

When I executed grocery.save(), it deleted all the files in the Desktop folder, and now nothing is left in it.

I checked the log, and it performed an os.rmdir() operation on that folder.

Please fix this workflow. It's crucial.

Also, I am not sure whether I can recover my files.
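
Independently of any fix in the library, one defensive pattern is to give each model its own dedicated directory and to refuse a directory that already holds files. The sketch below does not call tgrocery at all; `safe_model_dir` and the `models` base directory are hypothetical, and passing a path as the Grocery name is assumed only because the report above does so.

```python
import os


def safe_model_dir(base_dir, model_name):
    """Return a dedicated directory path for a model, refusing risky targets.

    Raises ValueError if model_name is not a plain directory name or if the
    target already exists and is a file or a non-empty directory.
    """
    if model_name in ("", ".", ".."):
        raise ValueError("model name must be a plain directory name")
    if os.sep in model_name or (os.altsep and os.altsep in model_name):
        raise ValueError("model name must not contain a path separator")
    target = os.path.join(os.path.abspath(base_dir), model_name)
    if os.path.exists(target):
        if not os.path.isdir(target) or os.listdir(target):
            raise ValueError("refusing to reuse existing path: " + target)
    return target


# Assumed usage (requires tgrocery to be installed):
# grocery = Grocery(safe_model_dir('models', 'sample'))
```

The check is deliberately strict: a directory that already contains a saved model is refused as well, so overwriting an old model requires removing or renaming it explicitly first.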

Does not work on Windows

C:\Windows\System32>pip install tgrocery
Downloading/unpacking tgrocery
  Downloading tgrocery-0.1.3.tar.gz
  Running setup.py (path:c:\users\r\appdata\local\temp\pip_build_r\tgrocery\setup.py) egg_info for package tgrocery

    package init file 'tgrocery\learner\liblinear\python\__init__.py' not found (or not a regular file)
Requirement already satisfied (use --upgrade to upgrade): jieba in c:\python27\lib\site-packages (from tgrocery)
Installing collected packages: tgrocery
  Running setup.py install for tgrocery
    'make' is not recognized as an internal or external command, operable program or batch file.
    'cp' is not recognized as an internal or external command, operable program or batch file.
    'cp' is not recognized as an internal or external command, operable program or batch file.
    package init file 'tgrocery\learner\liblinear\python\__init__.py' not found (or not a regular file)

Successfully installed tgrocery
Cleaning up...

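How to run predictions on a text file

I'm new to Python. After reading the code, predicting a single sentence works fine, but how do I run predictions on a text file that has one sentence per line? Can predict_text of grocerytextmodel do this? I don't quite understand how to use that function.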
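
The README sample calls `predict` with one text at a time, so a file with one text per line can be handled by looping over its lines. A minimal sketch, assuming only that `predict` is a callable that takes a single text; `predict_lines` and `texts.txt` are hypothetical names, and UTF-8 is an assumed file encoding.

```python
import io


def predict_lines(predict, path, encoding="utf-8"):
    """Apply predict to every non-empty line of a text file.

    Returns a list of (text, prediction) pairs in file order.
    """
    results = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:
                results.append((text, predict(text)))
    return results


# Assumed usage with a trained and loaded model, as in the README sample:
# for text, label in predict_lines(new_grocery.predict, 'texts.txt'):
#     print(label, text)
```

Passing the prediction function in as an argument keeps the helper independent of how the model was created or loaded.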

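Consider outputting more detailed model information

The prediction result seems to contain only the predicted class. For iterating on a model, could it also expose something like the decision path or the importance of each feature?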

OSError: liblinear.so.1 file not found

/usr/anaconda2/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/liblinear.py in <module>()
      7 
      8 # For unix the prefix 'lib' is not considered.
----> 9 liblinear = CDLL(path.join(path.dirname(path.abspath(__file__)), '../liblinear.so.1'))
     10 
     11 # Construct constants

/usr/anaconda2/lib/python2.7/ctypes/__init__.pyc in __init__(self, name, mode, handle, use_errno, use_last_error)
    363 
    364         if handle is None:
--> 365             self._handle = _dlopen(self._name, mode)
    366         else:
    367             self._handle = handle

`OSError: /usr/anaconda2/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/../liblinear.so.1: cannot open shared object file: No such file or directory`

What is this "../liblinear.so.1" file? My Python version is 2.7.11.

A question about the runtime environment

On Mac OS X with Python 2.7, running

from tgrocery import Grocery

raises this error: OSError: dlopen(/Users/user/Library/Python/2.7/lib/python/site-packages/tgrocery/learner/util.so.1, 6): image not found

What is the cause? Thanks!

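About stopwords and part-of-speech filtering

Thank you very much for this project.

I'm a bit puzzled by the approach mentioned in your blog: "bigram tokenization, no stopword removal, no part-of-speech filtering". Wouldn't it work better to remove stopwords and to select features by keeping only nouns or by applying a TF-IDF filter?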

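Is there a blog post explaining how TextGrocery works under the hood?

I tried classifying my own short-text data with both HanLP and TextGrocery, and TextGrocery did far better than HanLP.
So here is my question: HanLP uses its built-in tokenizer and TextGrocery uses the jieba tokenizer, and both use SVM. I doubt that the tokenizer alone explains such a large gap. What else differs under the hood? I'd really like to understand how TextGrocery works internally.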

ValueError: Error: Initial-solution specification supported only for solver L2R_LR and L2R_L2LOSS_SVC

IndentationError: unexpected indent

from tgrocery import Grocery
grocery = Grocery('sample')
train_src = [
... ('education', '名师指导托福语法技巧:名词的复数形式'),
... ('education', '**高考成绩海外认可 是“狼来了”吗?'),
... ('sports', '图文:法网孟菲尔斯苦战进16强 孟菲尔斯怒吼'),
... ('sports', '四川丹棱举行全国长距登山挑战赛 近万人参与')
... ]
grocery.train(train_src)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.315 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/tgrocery/__init__.py", line 36, in train
    model = train(self.train_svm_file, '', '-s 4')
  File "/usr/local/lib/python2.7/site-packages/tgrocery/learner/learner.py", line 394, in train
    m = liblinear_train(learner_prob, learner_param)
  File "/usr/local/lib/python2.7/site-packages/tgrocery/learner/liblinear/python/liblinearutil.py", line 147, in train
    raise ValueError('Error: %s' % err_msg)
ValueError: Error: Initial-solution specification supported only for solver L2R_LR and L2R_L2LOSS_SVC

Why do the predict and test methods return str and float?

Traceback (most recent call last):
  File "/home/fatherfox/PycharmProjects/grocery/test.py", line 28, in <module>
    print grocery.predict('考生必读:新托福写作考试评分标准').dec_values
AttributeError: 'str' object has no attribute 'dec_values'

Cannot load from file

Could you please post a sample of the file format?

I cannot load from a .txt file because I don't have the correct format.

Many thanks for your help.
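
The README describes the file format as `Label\tText`: one sample per line, with the label and the text separated by a tab. A small sketch that writes such a file from `(label, text)` pairs follows; `write_train_file` and `train_sample.txt` are hypothetical names, and UTF-8 is an assumed encoding.

```python
import io


def write_train_file(path, samples, encoding="utf-8"):
    """Write (label, text) pairs as one 'label<TAB>text' line each."""
    with io.open(path, "w", encoding=encoding) as handle:
        for label, text in samples:
            for value in (label, text):
                if "\t" in value or "\n" in value:
                    raise ValueError("labels and texts must not contain tabs or newlines")
            handle.write(u"{0}\t{1}\n".format(label, text))


# Two samples taken from the README's training list.
samples = [
    ("education", "Student debt to cost Britain billions within decades"),
    ("sports", "Middle East and Asia boost investment in top level sports"),
]

# Assumed usage, following the README sample (requires tgrocery to be installed):
# write_train_file("train_sample.txt", samples)
# grocery.train("train_sample.txt")
```

Opened in a text editor, the resulting file shows each label, a single tab, and then the text on the same line.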

Why does the pip installation not work for me?

pip reports a successful installation, but import tgrocery fails with the errors below. Could someone explain?
On Python 3:
from converter import *
ImportError: No module named 'converter'

On Python 2:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "D:\anaconda2\lib\site-packages\tgrocery\__init__.py", line 2, in <module>
    from classifier import *
  File "D:\anaconda2\lib\site-packages\tgrocery\classifier.py", line 6, in <module>
    from .learner import *
  File "D:\anaconda2\lib\site-packages\tgrocery\learner\__init__.py", line 1, in <module>
    from .learner import *
  File "D:\anaconda2\lib\site-packages\tgrocery\learner\learner.py", line 21, in <module>
    util = CDLL(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'util.so.1'))
  File "D:\anaconda2\lib\ctypes\__init__.py", line 362, in __init__
    self._handle = _dlopen(self._name, mode)
WindowsError: [Error 126]

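I trained on my own corpus, and tgrocery's accuracy did not improve

The data was split 80% for training and 20% for testing.

scikit-learn(svm): accuracy 78%
tgrocery: accuracy 0.76176 (about 76.2%)

tgrocery is somewhat faster, though. My corpus consists of short texts (mostly under 10 characters), and since tgrocery targets short-text classification, I expected its accuracy to be better.
Is there any way to improve the accuracy? Or can anyone recommend methods suited to short-text classification? Thanks.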
