yanyiwu / cppjieba
"Jieba" (结巴) Chinese word segmentation, C++ version
License: MIT License
Hi,
I'm trying to use Jieba.Cut(text, result) here, but the result shows that it counts offsets in bytes, not Unicode characters.
My text mixes Chinese and English characters, so I wonder if there is any way to make this work? Thanks for your great work!
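Until character offsets are supported natively, one workaround is to post-process the byte offsets. This is a minimal sketch (the helper name is mine, not part of cppjieba's API): in UTF-8, continuation bytes have the form 10xxxxxx, so counting the non-continuation bytes before a byte offset yields the character offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert a byte offset into a UTF-8 string to a code-point (character)
// offset by counting lead bytes (bytes that are not 10xxxxxx continuations).
std::size_t ByteToCharOffset(const std::string& s, std::size_t byte_offset) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++chars;
        }
    }
    return chars;
}
```

For the mixed string "a中b" ('中' occupies 3 bytes), byte offsets 0, 1 and 4 map to character offsets 0, 1 and 2.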
The MixSegment class in jieba's segmenter lets you specify a user dictionary, so why can't PosTagger for part-of-speech tagging take a user dictionary too?
If I want to use the POS tagging interface together with my own user dictionary, what should I do?
Is it possible to add a user dictionary to the server by adding a line to Server.conf?
usr_dict_path=/usr/share/CppJieba/dict/user.dict.utf8
Sorry, I am not familiar with C++.
Thanks.
As the title says.
As the title says.
Hi, I'm the author of jieba. May I ask how fast cppjieba's segmentation is?
VS uses GBK encoding; it took me quite a while to figure that out.
The mode is MixSegment, using the default dictionary.
"小明硕士毕业于**科学院计算院,先就职于IBM,后在日本京都大学深造"
The result is:
"小明" "硕士" "毕业" "于"
"**科学院" "计算" "院" ","
"先" "就职" "于" "IBM,"
"后" "在" "日本京都大学" "深造"
As the title says.
The standard procedure:
make
make test
make install
#include <cppjieba/Jieba.hpp>
The help docs are all for Linux... Has anyone used this on Windows, calling it from C/C++? I'm a newbie; any pointers would be appreciated.
My custom dictionary contains two words A and B; when A and B happen to be adjacent in the segmentation result, they get merged directly into AB in the output.
MySQL 5.7 already ships with ngram and the Japanese MeCab parser.
First, I admit I may be overthinking a small detail; it probably has no performance impact at all. Still, it feels awkward that cut(), which runs in tight loops, has to test if(!_getInitFlag()) on every call.
1. Personally I think a good design would use assert: when the user calls this function, _getInitFlag should already be true, and it is the user's responsibility to initialize before using cut.
2. Alternatively, if you really worry about errors from users calling cut without initializing, you could use something like Python-style dynamic binding:
BaseSegment() {
    _setInitFlag(False);
    BIND(this->cut, not_init_cut);
}
virtual bool init() {
    BIND(this->cut, have_inited_cut);
}
not_init_cut(str) {
    return False; /* not initialized */
}
have_inited_cut(str) {
    /* just cut, no if(!_getInitFlag()) check */
}
I personally lean toward approach 1: simple and clear. For approach 2, I'm not sure how convenient it is to implement in C++.
Addendum: in practice users usually wrap the segmenter as a singleton, since one XXXSegment object is enough for the whole application. The code at that point will certainly load the dictionary and initialize the object, so checking initialization again inside cut isn't really necessary. :)
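For what it's worth, approach 2 can be emulated in C++ with a pointer-to-member function that init() rebinds; the following is a hypothetical sketch of the idea (class and method names are mine, not cppjieba's):

```cpp
#include <cassert>
#include <string>

// Sketch of "dynamic binding": the public Cut() dispatches through a member
// function pointer, so no per-call init-flag branch is needed. Init() rebinds
// the pointer from the "not initialized" stub to the real implementation.
class Segment {
public:
    Segment() : cut_impl_(&Segment::CutNotInit) {}
    bool Init() {
        // ... dictionary loading would happen here ...
        cut_impl_ = &Segment::CutInited;
        return true;
    }
    bool Cut(const std::string& s) { return (this->*cut_impl_)(s); }
private:
    bool CutNotInit(const std::string&) { return false; }  // not initialized
    bool CutInited(const std::string&) {
        // real segmentation would go here; no if(!_getInitFlag()) check
        return true;
    }
    bool (Segment::*cut_impl_)(const std::string&);
};
```

Before Init(), Cut() dispatches to the stub; afterwards it dispatches to the real implementation, trading the flag branch for one indirect call.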
cppjieba/src/Limonp/StringUtil.hpp
Line 165 in aed1c8f
Why not directly support 24-bit Unicode code points?
I ported it to VS, changed the file encoding to UTF-8, and set the default string encoding to UTF-8, but after running it still does not display correctly.
他来到了网易杭研大厦
[demo] Cut With HMM
他/来/到/了/网/易/杭/研/大/厦
[demo] Cut Without HMM
他/来/到/了/网/易/杭/研/大/厦
我来到北京清华大学
[demo] CutAll
我/来/到/北/京/清/华/大/学
小明硕士毕业于**科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小/明/硕/士/毕/业/于/中/国/科/学/院/计/算/所/,/后/在/日/本/京/都/大/学/深/造
[demo] Insert User Word
男/默/女/泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小", "offset": 0}, {"word": "明", "offset": 2}, {"word": "硕", "offset": 4}, {"word": "士", "offset": 6}, {"word": "毕", "offset": 8}, {"word": "业", "offset": 10}, {"word": "于", "offset": 12}, {"word": "中", "offset": 14}, {"word": "国", "offset": 16}, {"word": "科", "offset": 18}, {"word": "学", "offset": 20}, {"word": "院", "offset": 22}, {"word": "计", "offset": 24}, {"word": "算", "offset": 26}, {"word": "所", "offset": 28}, {"word": ",", "offset": 30}, {"word": "后", "offset": 32}, {"word": "在", "offset": 34}, {"word": "日", "offset": 36}, {"word": "本", "offset": 38}, {"word": "京", "offset": 40}, {"word": "都", "offset": 42}, {"word": "大", "offset": 44}, {"word": "学", "offset": 46}, {"word": "深", "offset": 48}, {"word": "造", "offset": 50}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:x, 是:x, 拖:x, 拉:x, 机:x, 学:x, 院:x, 手:x, 扶:x, 拖:x, 拉:x, 机:x, 专:x, 业:x, 的:x, 。:x, 不:x, 用:x, 多:x, 久:x, ,:x, 我:x, 就:x, 会:x, 升:x, 职:x, 加:x, 薪:x, ,:x, 当:x, 上:x, CEO:eng, ,:x, 走:x, 上:x, 人:x, 生:x, 巅?x, 濉?x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [62], "weight": -1.08658e+063}]
Calling MixSegment directly, with the dictionaries jieba.dict.utf8 and hmm_model.utf8; for the input "B超 T恤" the result is ["B", "超", " T", "恤"], but the Python version of jieba handles it fine. May I ask why? Or am I just using it the wrong way?
Thanks~
When I run the demo project from VS, the results are not correct.
[demo] Cut With HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Cut Without HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutAll
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutForSearch
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Insert User Word
膼/默/女/!
膼默女!
[demo] Locate Words
膹, 0, 1
蕞, 1, 2
蕫, 2, 3
婴, 3, 4
莪, 4, 5
猿, 5, 6
菂, 6, 7
[demo] TAGGING
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["螔:x", "蕠:x", "蛷:x", "-:x", "酆:x", "学:x", "院:x", "蕱:x", "锥:x", "蛷:x", "-:x", "酆:x", "专:x", "业:x", "談:x", "c:x", "一:x", "觾:x", "譅:x", "迌:x", "矛:x", "螔:x", "迧:x", "邸:x", "山:x", "职:x", "軗:x", "薪:x", "矛:x", "毡:x", "蓮:x", "CEO:eng", "矛:x", "谉:x", "蓮:x", "葖:x", "珊:x", "釠?x", "濉?x"]
[demo] KEYWORD
78
81
2016-04-07 17:00:46 E:\Project\cppjieba\include\cppjieba/KeywordExtractor.hpp:81
ERROR words illegal
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
I'm on Windows 7 Professional English 64-bit; the cmd default code page is 936 (GBK).
What is the purpose of this, and what adjustments would be made to the related data structures?
g++ -c -Wall -O3 keywordext_demo.cpp
In file included from keywordext_demo.cpp:3:
In file included from ./../cppjieba/headers.h:8:
In file included from ./../cppjieba/../cppcommon/headers.h:15:
./../cppjieba/../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppjieba/../cppcommon/sort_functs.h:270:37: note: place parentheses around
the '&&' expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppjieba/../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated:
first deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
cd ../cppjieba && make
g++ -c -g -Wall -DDEBUG HMMSegment.cpp
In file included from HMMSegment.cpp:1:
In file included from ./HMMSegment.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG KeyWordExt.cpp
In file included from KeyWordExt.cpp:5:
In file included from ./KeyWordExt.h:8:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MPSegment.cpp
In file included from MPSegment.cpp:5:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MixSegment.cpp
In file included from MixSegment.cpp:1:
In file included from ./MixSegment.h:4:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG SegmentBase.cpp
In file included from SegmentBase.cpp:1:
In file included from ./SegmentBase.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG Trie.cpp
In file included from Trie.cpp:5:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
ar rc libcppjieba.a HMMSegment.o KeyWordExt.o MPSegment.o MixSegment.o SegmentBase.o Trie.o
ar: temporary file: No such file or directory
make[1]: *** [libcppjieba.a] Error 1
make: *** [../cppjieba/libcppjieba.a] Error 2
/*
 * TextRank.hpp
 *
 * Created on: Jul 7, 2015
 * Author: oliverlwang
 */
#ifndef TEXTRANK_H_
#define TEXTRANK_H_

#include "UndirectWeightedGraph.hpp"

namespace CppJieba
{
    class TextRank
    {
    public:
        TextRank() : _span(2) {}
        virtual ~TextRank() {}

        /**
         * @brief Compute the TextRank score of each word.
         *
         * @param words    input words
         * @param wordmap  output map from word to score
         *
         * @retval 0 on success, -1 on failure
         */
        int textRank(vector<string>& words, map<string, double>& wordmap)
        {
            try
            {
                UndirectWeightedGraph graph;
                map< pair<string, string>, double> cm;
                for(size_t i = 0; i < words.size(); ++i)
                {
                    /* syntactic filter */
                    /* ngram: with span=2 the f-measure gets the best result */
                    for(size_t j = i + 1; j < i + _span; ++j)
                    {
                        if(j >= words.size())
                            break;
                        /* use std::pair as the co-occurrence key */
                        pair<string, string> key = make_pair(words[i], words[j]);
                        cm[key] += 1.0;
                    }
                }
                /* add edges */
                for(map< pair<string, string>, double>::iterator it = cm.begin(); it != cm.end(); ++it)
                {
                    /* do not add an edge between a vertex and itself */
                    if(it->first.first == it->first.second)
                        continue;
                    graph.addEdge(it->first.first, it->first.second, it->second);
                }
                /* rank */
                graph.rank();
                wordmap.clear();
                wordmap = graph.ws;
            }
            catch(exception& e)
            {
                cerr << e.what() << endl;
                return -1;
            }
            return 0;
        }

    private:
        int _span; /* scanning span */
    };
} /* namespace CppJieba */

#endif /* TEXTRANK_H_ */
/*
 * UndirectWeightedGraph.hpp
 *
 * Created on: Jul 7, 2015
 * Author: oliverlwang
 */
#ifndef UNDIRECTWEIGHTEDGRAPH_H_
#define UNDIRECTWEIGHTEDGRAPH_H_

#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
#include <map>
#include <set>

using namespace std;

namespace CppJieba
{
    /* edge type */
    struct edge_t
    {
        string start;
        string end;
        double weight;
    };

    class UndirectWeightedGraph
    {
    public:
        UndirectWeightedGraph() : _dampingFactor(0.85) {}
        virtual ~UndirectWeightedGraph() {}

        /**
         * @brief Add an edge to the undirected weighted graph.
         *
         * @param start   one endpoint
         * @param end     the other endpoint
         * @param weight  edge weight
         *
         * @retval none
         */
        void addEdge(const string& start, const string& end, double weight)
        {
            /* add an out edge for vertex start */
            edge_t edge;
            edge.start = start;
            edge.end = end;
            edge.weight = weight;
            _graph[start].push_back(edge);
            /* add an out edge for vertex end */
            edge.start = end;
            edge.end = start;
            _graph[end].push_back(edge);
        }

        /**
         * @brief Rank the words by score.
         *
         * @param none
         *
         * @retval none
         */
        void rank()
        {
            map<string, double> outSum;
            /* initialize word scores */
            double wsdef = (_graph.size() > 0) ? (1.0 / _graph.size()) : 1.0;
            for(map<string, vector<edge_t> >::iterator it = _graph.begin(); it != _graph.end(); ++it)
            {
                ws[it->first] = wsdef;
                outSum[it->first] = weightOutSum(it->second);
            }
            /* iterate 10 times */
            for(int n = 0; n < 10; ++n)
            {
                /* an stl map is sorted by key by default */
                for(map<string, vector<edge_t> >::iterator i = _graph.begin(); i != _graph.end(); ++i)
                {
                    double score = 0.0;
                    for(vector<edge_t>::iterator j = i->second.begin(); j != i->second.end(); ++j)
                    {
                        score += j->weight / outSum[j->end] * ws[j->end];
                    }
                    ws[i->first] = (1.0 - _dampingFactor) + _dampingFactor * score;
                }
            }
            /* normalize */
            if(ws.empty())
                return;
            /* note: take min/max of the scores, not of the map entries
             * (max_element on a map would compare keys first) */
            double max_rank = ws.begin()->second;
            double min_rank = ws.begin()->second;
            for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
            {
                if(m->second > max_rank) max_rank = m->second;
                if(m->second < min_rank) min_rank = m->second;
            }
            for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
            {
                m->second = (m->second - min_rank / 10.0) / (max_rank - min_rank / 10.0);
            }
        }

    public:
        /* word scores */
        map<string, double> ws;

    private:
        /* graph; key: vertex (a word or term), value: its adjacency list */
        map<string, vector<edge_t> > _graph;
        /* d is the damping factor, between 0 and 1, usually set to 0.85 */
        double _dampingFactor;

    private:
        /* sum the weights of a vertex's out edges */
        double weightOutSum(const vector<edge_t>& v)
        {
            double sum = 0.0;
            for(vector<edge_t>::const_iterator it = v.begin(); it != v.end(); ++it)
            {
                sum += it->weight;
            }
            return sum;
        }
    };
} /* namespace CppJieba */

#endif /* UNDIRECTWEIGHTEDGRAPH_H_ */
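For reference, the loop in rank() implements the standard TextRank score update (with damping factor d = 0.85 and a fixed 10 iterations), where adj(V) is the set of neighbors of vertex V:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in adj(V_i)} \frac{w_{ji}}{\sum_{V_k \in adj(V_j)} w_{jk}} \, WS(V_j)
```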
Compiling on Mac:
Scanning dependencies of target cjsegment
[ 9%] Building CXX object src/CMakeFiles/cjsegment.dir/segment.cpp.o
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/segment.cpp:5:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/ArgvContext.hpp:11:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/str_functs.hpp:24:
/Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/std_outbound.hpp:10:10: fatal error:
'tr1/unordered_map' file not found
^
1 error generated.
make[2]: *** [src/CMakeFiles/cjsegment.dir/segment.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/cjsegment.dir/all] Error 2
make: *** [all] Error 2
The example only takes a single key parameter; how do I choose the segmentation mode?
Recently I wanted to write an Erlang binding based on cppjieba and ran into the following problems,
so for now I built the project on top of nodejieba's code instead.
Hence this suggestion:
split out a minimal jiebalib project with the dependencies removed, providing only the engine.
void _loadUserDict(const string& filePath, double defaultWeight, const string& defaultTag)
{
    .................
    buf.clear();
    split(line, buf, " ");
    assert(buf.size() >= 1);
    if(!TransCode::decode(buf[0], nodeInfo.word))
    {
        LogError("line[%u:%s] illegal.", lineno, line.c_str());
        continue;
    }
    if(nodeInfo.word.size() == 1)
    {
        _userDictSingleChineseWord.insert(nodeInfo.word[0]);
    }
    nodeInfo.weight = defaultWeight;
    nodeInfo.tag = (buf.size() == 2 ? buf[1] : defaultTag);
    ............
}

void _loadDict(const string& filePath)
{
    ............
    assert(buf.size() == DICT_COLUMN_NUM);
    if(!TransCode::decode(buf[0], nodeInfo.word))
    {
        LogError("line[%u:%s] illegal.", lineno, line.c_str());
        continue;
    }
    nodeInfo.weight = atof(buf[1].c_str());
    nodeInfo.tag = buf[2];
    _nodeInfos.push_back(nodeInfo);
    }
}
From these two snippets, it looks like the system dictionary format is `流水行云 2 n`
while the user dictionary format is `蓝翔 nz`,
and it seems there is no way to set the word frequency for user entries, right?
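To make the two formats concrete, here is a hypothetical parsing sketch (the struct and function names are mine, not library code): the system dictionary carries `word freq tag`, while the user dictionary carries `word [tag]` and falls back to caller-supplied defaults, which is why user entries cannot set a frequency.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct DictEntry {
    std::string word;
    double weight;
    std::string tag;
};

// System dictionary line: "word freq tag" (three columns).
DictEntry ParseSystemLine(const std::string& line) {
    std::istringstream in(line);
    DictEntry e;
    in >> e.word >> e.weight >> e.tag;
    return e;
}

// User dictionary line: "word [tag]"; the frequency column is absent,
// which is why _loadUserDict always assigns defaultWeight.
DictEntry ParseUserLine(const std::string& line, double default_weight,
                        const std::string& default_tag) {
    std::istringstream in(line);
    DictEntry e;
    in >> e.word;
    e.weight = default_weight;
    if (!(in >> e.tag)) e.tag = default_tag;
    return e;
}
```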
cppjieba is very stable and reliable.
However, once the input text exceeds a certain size there seems to be no response and no error message either. Below is Python 3.4 test code:
#!/usr/bin/env python3
import urllib.request
import sys

def cut(sentence):
    u = "http://127.0.0.1:11200/"
    req = urllib.request.Request(u, data=sentence.encode('utf-8'))
    try:
        f = urllib.request.urlopen(req)
        s = f.read().decode('utf-8')
    except:
        print("Unexpected error:", sys.exc_info()[0])
        return
    return s

sentence = "南京市长江大桥"
print("repeat 500 times:")
s1 = sentence * 500
print(cut(s1))
print("repeat 2000 times:")
s2 = sentence * 2000
print(cut(s2))
Sometimes, on the first run, the 2000-fold repetition is segmented correctly as well. But when I rerun this code, no data comes back again.
Ubuntu 14.04 LTS.
谢谢。
Both "很" and "非常" are in the dictionary, and "很" even has a higher frequency than "非常". Why is "很帅" not split into two words while "非常帅" is? I've only just started using the POS tagging library and don't really understand the algorithm. Could someone who knows explain this?
g++ -std=c++0x -o server server.cpp -L/usr/lib/CppJieba/ -L/usr/lib/CppJieba/Husky -lcppjieba -lhusky -lpthread
On Windows the path might be C:/test/test.
The dictionary (trie) only needs to be built once, in a single process; when other processes segment text, they only need to read that trie's memory instead of loading the file and building the dictionary again. Using mmap would save memory and further reduce startup time.
There are open-source mmap-based C++ allocators, for example:
https://github.com/johannesthoma/mmap_allocator
The HTTP server also achieves centralized segmentation, but I believe mmap would improve efficiency further.
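As a rough sketch of the mmap idea (POSIX-only; the helper is hypothetical, not cppjieba code): mapping a prebuilt dictionary image read-only lets the kernel share one copy of the pages among all processes that map the same file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only; the kernel shares the backing pages between all
// processes mapping the same file, so the dictionary is loaded only once.
// Returns nullptr on failure (including an empty file).
const char* MapFileReadOnly(const char* path, size_t* out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                   PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    *out_len = static_cast<size_t>(st.st_size);
    return static_cast<const char*>(p);
}
```

A serialized trie would still need to be position-independent (offsets instead of pointers) before it could be used directly out of such a mapping.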
May I ask whether cppjieba's Cut family of functions is thread-safe?
cjserver is very handy. I created an upstart conf for cjserver so that it can auto-start on Ubuntu. Tested on Ubuntu Server 14.04.
Create /etc/init/cjserver.conf
as following:
description "cjserver"
start on (local-filesystems and runlevel [2345])
stop on runlevel [016]
respawn
script
exec cjserver /etc/CppJieba/server.conf
end script
Start/Stop cjserver manually as:
sudo service cjserver start
sudo service cjserver stop
sudo service cjserver status
string s = "附近可点击开飞机的科技开发的开放的了骄傲的龙卷风房贷款及付3的即可看见空间打开";
## python version: jieba.posseg.cut
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及 c
付 v
3 m
的 uj
即可 d
看见 v
空间 n
打开 v
## cppjieba version tag
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及付 x
3 x
的 uj
即可 d
看见 v
空间 n
打开 v
## the inconsistency lies around the number 3: Python gives 及/付/3, cppjieba gives 及付/3
As the title says.
When reading a file larger than 20 MB, segmentation fails and keyword extraction errors out!
string s length: 14490184
[demo] CutAll
_count: 390
_words.size: 390
[demo] KEYWORD
2016-04-15 14:43:13 ../include/cppjieba/KeywordExtractor.hpp:79 ERROR words illegal
Hi there,
as the title...
I want to specify some special vocabulary to process.
e.g. "android Google": I want this to be one vocabulary entry "android Google" instead of the separate words "android", "Google".
Can the jieba or cppjieba do that?
Many Thanks.
If not, would you consider adding this feature?
As the title says: with GBK-encoded input, the segmentation result is all single characters.
Example:
vector<string> words;
jieba.DoCutForSearch("他心理健康", words);
yields:
他
心理健康
Python:
seg_list = jieba.cut_for_search("他心理健康")  # search-engine mode
for i in seg_list:
    print i
yields:
他
心理
健康
心理健康
The C++ version has a bug: a search for 心理 finds nothing; you have to search for 心理健康 instead.
Because I don't want to call it once for every sentence.
As the title says.
Hi yanyiwu,
First, thanks for all your hard work.
I'd like to use cc-cedict as the segmentation dictionary, but it contains entries like "同一个世界,同一个梦想" that include punctuation, and jieba cannot recognize such an entry as a single phrase. Could a parameter option be added to support this?
You know, there is so much good stuff in C++11 and C++14, like lambda expressions. You could omit
static bool compare(const X&....) { .................. return ..; } and instead write [](X x, Y y){ return x < y; }
Maybe less code?
std::unordered_map and std::unordered_set are also better choices than TR1.....
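As a sketch of that suggestion (the function below is illustrative, not cppjieba code), a C++11 lambda can replace a named static comparator when sorting, e.g. word/weight pairs by descending weight:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// C++11: an inline lambda replaces a named static comparator function.
// Sorts word/weight pairs by descending weight.
std::vector<std::pair<std::string, double> > SortByWeight(
    std::vector<std::pair<std::string, double> > v) {
    std::sort(v.begin(), v.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) {
                  return a.second > b.second;
              });
    return v;
}
```

The comparator lives right at the call site, so there is no need for a separately declared static function.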
Sharing some other dictionary resources:
dict.367W.utf8.tar.gz iLife([email protected])
In the README