
cppjieba's People

Contributors

aholic, appotry, bigelephant29, bitdeli-chef, bung87, byronhe, dlackty, ixqbar, iynehz, jaiminpan, maliubiao, npes87184, qinwf, questionfish, royguo, shove70, silencezjl, vsooda, w32zhong, wangfenjin, xuangong, yanyiwu

cppjieba's Issues

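Adding a user dictionary to POS tagging

The MixSegment class in jieba segmentation lets you specify a user dictionary,
so why can't PosTagger, the POS tagging class, accept one as well?
If I want to use the POS tagging interface and also need my own user dictionary, what should I do?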

Add user_dict to server

Is it possible to add user_dict to server by adding a line to the Server.conf?

usr_dict_path=/usr/share/CppJieba/dict/user.dict.utf8

Sorry I am not familiar with cpp.

Thanks.

English words and punctuation are not split apart

Mode: MixSegment, with the default dictionary.

"小明硕士毕业于**科学院计算院,先就职于IBM,后在日本京都大学深造"

Result:

"小明" "硕士" "毕业" "于"
"**科学院" "计算" "院" ","
"先" "就职" "于" "IBM,"
"后" "在" "日本京都大学" "深造"

How to use this on Windows

The help docs all cover Linux only. Has anyone used this on Windows, calling it from C/C++? I'm a beginner and would appreciate some pointers.

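Would it be a better design to drop if(!_getInitFlag()) from the cut family of functions?

I admit I may be overthinking a small thing, and it probably has no effect on performance, but having cut functions that run in heavy loops check if(!_getInitFlag()) on every call feels awkward.
1. I think a good design would use assert: when the user calls the function, _getInitFlag should be assumed to be true, and it is the user's job to initialize before calling cut.
2. If you are really worried about errors from calling cut without initializing, you could use something like Python-style dynamic binding:
BaseSegment(){
    _setInitFlag(False);
    BIND(this->cut, not_init_cut);
}
virtual bool init(){
    BIND(this->cut, have_inited_cut);
}
not_init_cut(str) {
    return False; /* Not init */
}
have_inited_cut(str){
    /* Just cut, no if(!_getInitFlag()) */
}
I lean toward option 1, which is simple and clear. As for option 2, I am not sure how convenient it is to implement in C++.

Addendum: in practice most users will wrap the segmenter in a singleton, since one XXXSegment object is enough for the whole application. That code will certainly load the dictionary and initialize the object, so checking for initialization again inside cut is not really necessary :)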
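
For what it's worth, option 2 can be expressed in C++ with a member function pointer that init() swaps. The following is only a minimal sketch: SketchSegment, cutNotInitialized and cutInitialized are hypothetical names, not cppjieba's classes, and the "segmentation" is a placeholder that splits on spaces.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a segmenter base class; not cppjieba's real API.
class SketchSegment {
public:
    // Before init(), cut() is bound to the "not initialized" implementation.
    SketchSegment() : cut_impl_(&SketchSegment::cutNotInitialized) {}

    // Dictionary loading would happen here; on success, rebind cut().
    bool init() {
        cut_impl_ = &SketchSegment::cutInitialized;
        return true;
    }

    // No per-call flag test, just an indirect call through the bound pointer.
    bool cut(const std::string& sentence, std::vector<std::string>& out) const {
        return (this->*cut_impl_)(sentence, out);
    }

private:
    typedef bool (SketchSegment::*CutFn)(const std::string&,
                                         std::vector<std::string>&) const;

    bool cutNotInitialized(const std::string&, std::vector<std::string>&) const {
        return false;  // init() has not been called yet
    }

    bool cutInitialized(const std::string& sentence,
                        std::vector<std::string>& out) const {
        // Placeholder "segmentation": split on single spaces.
        std::string::size_type start = 0;
        while (start <= sentence.size()) {
            std::string::size_type end = sentence.find(' ', start);
            if (end == std::string::npos) end = sentence.size();
            if (end > start) out.push_back(sentence.substr(start, end - start));
            start = end + 1;
        }
        return true;
    }

    CutFn cut_impl_;
};
```

An indirect call is not automatically cheaper than a well-predicted branch, so whether this is worth the extra machinery compared with a plain assert is something to measure rather than assume.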

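[Help] Segmentation does not work correctly on Windows

I ported it to Visual Studio, changed the file encoding to UTF-8, and set the default string encoding to UTF-8, but the output is still wrong after running.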

他来到了网易杭研大厦
[demo] Cut With HMM
他/来/到/了/网/易/杭/研/大/厦
[demo] Cut Without HMM
他/来/到/了/网/易/杭/研/大/厦
我来到北京清华大学
[demo] CutAll
我/来/到/北/京/清/华/大/学
小明硕士毕业于**科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小/明/硕/士/毕/业/于/中/国/科/学/院/计/算/所/,/后/在/日/本/京/都/大/学/深/造
[demo] Insert User Word
男/默/女/泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小", "offset": 0}, {"word": "明", "offset": 2}, {"word": "硕", "offse
t": 4}, {"word": "士", "offset": 6}, {"word": "毕", "offset": 8}, {"word": "业",
"offset": 10}, {"word": "于", "offset": 12}, {"word": "中", "offset": 14}, {"wo
rd": "国", "offset": 16}, {"word": "科", "offset": 18}, {"word": "学", "offset":
20}, {"word": "院", "offset": 22}, {"word": "计", "offset": 24}, {"word": "算",
"offset": 26}, {"word": "所", "offset": 28}, {"word": ",", "offset": 30}, {"wo
rd": "后", "offset": 32}, {"word": "在", "offset": 34}, {"word": "日", "offset":
36}, {"word": "本", "offset": 38}, {"word": "京", "offset": 40}, {"word": "都",
"offset": 42}, {"word": "大", "offset": 44}, {"word": "学", "offset": 46}, {"wo
rd": "深", "offset": 48}, {"word": "造", "offset": 50}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰

[我:x, 是:x, 拖:x, 拉:x, 机:x, 学:x, 院:x, 手:x, 扶:x, 拖:x, 拉:x, 机:x, 专:x,
业:x, 的:x, 。:x, 不:x, 用:x, 多:x, 久:x, ,:x, 我:x, 就:x, 会:x, 升:x, 职:x, 加
:x, 薪:x, ,:x, 当:x, 上:x, CEO:eng, ,:x, 走:x, 上:x, 人:x, 生:x, 巅?x, 濉?x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰

[{"word": "CEO", "offset": [62], "weight": -1.08658e+063}]

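MixSegment cannot segment mixed English+Chinese words

I call MixSegment directly, with jieba.dict.utf8 and hmm_model.utf8 loaded as dictionaries. For the input "B超 T恤" the result is ["B", "超", " T", "恤"];
but the Python version of jieba handles it fine. Why is that? Or am I using it the wrong way?
Thanks~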

Running the demo gives wrong results

When I run the demo project from VS, the results are not correct
[demo] Cut With HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪
/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Cut Without HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪
/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutAll
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪
/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutForSearch
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪
/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Insert User Word
膼/默/女/!
膼默女!
[demo] Locate Words
膹, 0, 1
蕞, 1, 2
蕫, 2, 3
婴, 3, 4
莪, 4, 5
猿, 5, 6
菂, 6, 7
[demo] TAGGING
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰

["螔:x", "蕠:x", "蛷:x", "-:x", "酆:x", "学:x", "院:x", "蕱:x", "锥:x", "蛷:x",
"-:x", "酆:x", "专:x", "业:x", "談:x", "c:x", "一:x", "觾:x", "譅:x", "迌:x", "
矛:x", "螔:x", "迧:x", "邸:x", "山:x", "职:x", "軗:x", "薪:x", "矛:x", "毡:x", "
蓮:x", "CEO:eng", "矛:x", "谉:x", "蓮:x", "葖:x", "珊:x", "釠?x", "濉?x"]
[demo] KEYWORD
78
81
2016-04-07 17:00:46 E:\Project\cppjieba\include\cppjieba/KeywordExtractor.hpp:81
ERROR words illegal
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰

I'm on Windows 7 Professional (English, 64-bit); the cmd default code page is 936 (GBK).

Build fails on OS X Mountain Lion

g++ -c -Wall -O3 keywordext_demo.cpp
In file included from keywordext_demo.cpp:3:
In file included from ./../cppjieba/headers.h:8:
In file included from ./../cppjieba/../cppcommon/headers.h:15:
./../cppjieba/../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppjieba/../cppcommon/sort_functs.h:270:37: note: place parentheses around
the '&&' expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppjieba/../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated:
first deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
cd ../cppjieba && make
g++ -c -g -Wall -DDEBUG HMMSegment.cpp
In file included from HMMSegment.cpp:1:
In file included from ./HMMSegment.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG KeyWordExt.cpp
In file included from KeyWordExt.cpp:5:
In file included from ./KeyWordExt.h:8:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MPSegment.cpp
In file included from MPSegment.cpp:5:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MixSegment.cpp
In file included from MixSegment.cpp:1:
In file included from ./MixSegment.h:4:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG SegmentBase.cpp
In file included from SegmentBase.cpp:1:
In file included from ./SegmentBase.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG Trie.cpp
In file included from Trie.cpp:5:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...
(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
ar rc libcppjieba.a HMMSegment.o KeyWordExt.o MPSegment.o MixSegment.o SegmentBase.o Trie.o
ar: temporary file: No such file or directory
make[1]: *** [libcppjieba.a] Error 1
make: *** [../cppjieba/libcppjieba.a] Error 2

I implemented the TextRank algorithm for keyword extraction; yanyi, please check it. Also, which works better, the IDF method or TextRank?

/*
 * TextRank.hpp
 *
 *  Created on: 2015-07-07
 *      Author: oliverlwang
 */
#ifndef TEXTRANK_H_
#define TEXTRANK_H_

#include "UndirectWeightedGraph.hpp"

namespace CppJieba
{

class TextRank
{
public:
    TextRank() : _span(2) {};
    virtual ~TextRank(){};

    /**
     * @brief score every input word with TextRank.
     *
     * @param words   input words, in document order
     * @param wordmap output: word -> normalized score
     *
     * @retval 0 on success, -1 if an exception was caught
     */
    int textRank(vector<string>& words, map<string, double>& wordmap)
    {
        try
        {
            UndirectWeightedGraph graph;
            map< pair<string, string>, double> cm;

            for(size_t i = 0; i < words.size(); ++i)
            {
                /* syntactic filter */

                /* ngram, when span=2, f-measure gets best result */
                for(size_t j = i + 1; j < i + _span; ++j)
                {
                    if(j >= words.size())
                        break;
                    /* using std::pair as union key */
                    pair<string, string> key = make_pair(words[i], words[j]);
                    cm[key] += 1.0;
                }
            }

            /* add edge */
            for(map< pair<string, string>, double>::iterator it = cm.begin(); it != cm.end(); ++it)
            {
                /* do not add edge between the same vertex */
                if(it->first.first == it->first.second)
                    continue;

                graph.addEdge(it->first.first, it->first.second, it->second);
            }

            /* rank */
            graph.rank();

            wordmap.clear();
            wordmap = graph.ws;
        }
        catch(exception &e)
        {
            cerr << e.what() << endl;
            return -1;
        }
        return 0;
    }

private:
    int _span;             /* scanning span */


};

} /* namespace CppJieba */
#endif /* TEXTRANK_H_ */


/*
 * UndirectWeightedGraph.hpp
 *
 *  Created on: 2015-07-07
 *      Author: oliverlwang
 */

#ifndef UNDIRECTWEIGHTEDGRAPH_H_
#define UNDIRECTWEIGHTEDGRAPH_H_

#include <iostream>
#include <algorithm>
#include <vector>
#include <map>
#include <set>

using namespace std;

namespace CppJieba
{

/* edge type */
struct edge_t
{
    string start;
    string end;
    double weight;
};

class UndirectWeightedGraph
{
public:
    UndirectWeightedGraph(): _dampingFactor(0.85){};
    virtual ~UndirectWeightedGraph(){};

    /**
     * @brief add an edge for the UndirectedWeighted Graph
     *
     * @param string &start
     * @param  string &end
     * @param
     *
     * @retval
     */
    void addEdge(const string &start, const string &end, double weight)
    {
        /* add an out edge for vertex start */
        edge_t _edge;
        _edge.start = start;
        _edge.end = end;
        _edge.weight = weight;

        _graph[start].push_back(_edge);

        /* add an out edge for vertex end */
        _edge.start = end;
        _edge.end = start;

        _graph[end].push_back(_edge);
    }

    /**
     * @brief rank the words according to its score
     *
     * @param none
     *
     * @retval none
     */
    void rank()
    {
        map<string, double> outSum;

        /* initialize words score */
        double wsdef = (_graph.size() > 0) ? (1.0 / _graph.size()) : 1.0;

        for(map<string, vector<edge_t> >::iterator it = _graph.begin(); it != _graph.end(); ++it)
        {
            ws[it->first] = wsdef;
            outSum[it->first] = weightOutSum(it->second);
        }

        /* iterate 10 times */
        for(int iter = 0; iter < 10; ++iter)
        {
            /* stl map is sorted by key by default */
            for(map<string, vector<edge_t> >::iterator i = _graph.begin(); i != _graph.end(); ++i)
            {
                double score = 0.0;
                for(vector<edge_t>::iterator j = i->second.begin(); j != i->second.end(); ++j)
                {
                    score += j->weight / outSum[j->end] * ws[j->end];
                }

                ws[i->first] = (1.0 - _dampingFactor) + _dampingFactor * score;
            }
        }

        /* nothing to normalize for an empty graph */
        if(ws.empty())
            return;

        /* normalize by score; max_element on a map would compare the keys first */
        double max_rank = ws.begin()->second;
        double min_rank = ws.begin()->second;
        for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
        {
            max_rank = max(max_rank, m->second);
            min_rank = min(min_rank, m->second);
        }

        for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
        {
            m->second = (m->second - min_rank / 10.0) / (max_rank - min_rank / 10.0);
        }
    }

public:
    /* words score */
    map<string, double> ws;

private:
    /* Graph, key: vertex which is a words or term, value: In(vertex) */
      map<string, vector<edge_t> > _graph;

    /* d is the damping factor that can be set between 0 and 1, always set to 0.85 */
    double _dampingFactor;

private:
    /* calculate the weight sum of out edge */
    double weightOutSum(const vector<edge_t>& v)
    {
        double sum = 0.0;
        for(vector<edge_t>::const_iterator it = v.begin(); it != v.end(); ++it)
        {
            sum += it->weight;
        }
        return sum;
    }

};

}/* namespace CppJieba */

#endif /* UNDIRECTWEIGHTEDGRAPH_H_ */
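
To make the windowing step in textRank() easier to check, here is a small standalone sketch of just the co-occurrence counting. countCooccurrences and PairCounts are names made up for this illustration. Note that with span = 2 the inner loop only ever pairs each word with its immediate successor.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::pair<std::string, std::string>, double> PairCounts;

// Count ordered pairs (words[i], words[j]) for every i < j < i + span,
// mirroring the nested loop in textRank() above.
PairCounts countCooccurrences(const std::vector<std::string>& words,
                              std::size_t span) {
    PairCounts counts;
    for (std::size_t i = 0; i < words.size(); ++i) {
        for (std::size_t j = i + 1; j < i + span && j < words.size(); ++j) {
            counts[std::make_pair(words[i], words[j])] += 1.0;
        }
    }
    return counts;
}
```

For the word sequence a b a b c and span 2, this yields (a,b) = 2, (b,a) = 1 and (b,c) = 1. Since (a,b) and (b,a) are separate keys, the graph receives two edges between the same pair of vertices.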

Build does not go smoothly on Mac

Building on Mac:
Scanning dependencies of target cjsegment
[ 9%] Building CXX object src/CMakeFiles/cjsegment.dir/segment.cpp.o
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/segment.cpp:5:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/ArgvContext.hpp:11:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/str_functs.hpp:24:
/Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/std_outbound.hpp:10:10: fatal error:
'tr1/unordered_map' file not found
#include <tr1/unordered_map>
         ^

1 error generated.
make[2]: *** [src/CMakeFiles/cjsegment.dir/segment.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/cjsegment.dir/all] Error 2
make: *** [all] Error 2

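Suggestion: split out a standalone jiebalib

I recently wanted to write an Erlang interface on top of cppjieba and ran into the following problems:

  1. the <tr1/unordered_map> dependency in std_outbound.hpp
  2. cppjieba bundles dictionary files in several encodings, a web server, a daemon script, and other modules

So for now I built the project from the nodejieba code.
Hence this suggestion:

Split out a minimal jiebalib project with the dependencies removed, providing only the engine functionality.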

Are the user dictionary and system dictionary formats inconsistent?

            void _loadUserDict(const string& filePath, double defaultWeight, const string& defaultTag)
            {
  .................
                    buf.clear();
                    split(line, buf, " ");
                    assert(buf.size() >= 1);
                    if(!TransCode::decode(buf[0], nodeInfo.word))
                    {
                        LogError("line[%u:%s] illegal.", lineno, line.c_str());
                        continue;
                    }
                    if(nodeInfo.word.size() == 1)
                    {
                        _userDictSingleChineseWord.insert(nodeInfo.word[0]);
                    }
                    nodeInfo.weight = defaultWeight;
                    nodeInfo.tag = (buf.size() == 2 ? buf[1] : defaultTag);
............
            }
            void _loadDict(const string& filePath) 
            {
............
                    assert(buf.size() == DICT_COLUMN_NUM);

                    if(!TransCode::decode(buf[0], nodeInfo.word))
                    {
                        LogError("line[%u:%s] illegal.", lineno, line.c_str());
                        continue;
                    }
                    nodeInfo.weight = atof(buf[1].c_str());
                    nodeInfo.tag = buf[2];

                    _nodeInfos.push_back(nodeInfo);
                }
            }

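From these two snippets, it looks like the system dictionary format is "流水行云 2 n" while the user dictionary format is "蓝翔 nz", and there seems to be no way to set the word frequency in a user dictionary. Is that right?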
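
One way the two formats could be unified is a single line parser that accepts one, two, or three columns. The sketch below is only an illustration of that idea: DictEntry and parseDictLine are hypothetical names, not part of cppjieba, and the numeric column is stored as-is without whatever conversion the real loader applies to it.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical entry type for this sketch.
struct DictEntry {
    std::string word;
    double weight;
    std::string tag;
};

// Accepts "word", "word tag", or "word freq tag" (whitespace separated).
// Missing columns fall back to defaultWeight / defaultTag.
// Returns false, leaving `entry` untouched, for empty lines, lines with
// more than three columns, or a non-numeric freq column.
bool parseDictLine(const std::string& line, double defaultWeight,
                   const std::string& defaultTag, DictEntry& entry) {
    std::istringstream in(line);
    std::vector<std::string> cols;
    std::string col;
    while (in >> col) cols.push_back(col);
    if (cols.empty() || cols.size() > 3) return false;

    DictEntry parsed;
    parsed.word = cols[0];
    parsed.weight = defaultWeight;
    parsed.tag = defaultTag;
    if (cols.size() == 2) {
        parsed.tag = cols[1];
    } else if (cols.size() == 3) {
        char* end = 0;
        double w = std::strtod(cols[1].c_str(), &end);
        if (end == cols[1].c_str() || *end != '\0') return false;
        parsed.weight = w;
        parsed.tag = cols[2];
    }
    entry = parsed;
    return true;
}
```

With this shape, "蓝翔 nz" keeps the default weight while "流水行云 2 n" carries its own, so the same file layout would work for both dictionaries.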

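Text size limit?

cppjieba is very stable and reliable.

However, when the input text exceeds a certain size, there seems to be no response and no error message either. Here is test code for Python 3.4: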

#!/usr/bin/env python3

import urllib.request
import sys

def cut(sentence):
    u = "http://127.0.0.1:11200/"
    req = urllib.request.Request(u, data=sentence.encode('utf-8'))
    try:
        f = urllib.request.urlopen(req)
        s = f.read().decode('utf-8')
    except:
        print("Unexpected error:", sys.exc_info()[0])
        return
    return s

sentence = "南京市长江大桥"
print("repeat 500 times:")
s1 = sentence * 500
print(cut(s1))

print("repeat 2000 times:")
s2 = sentence * 2000
print(cut(s2))

Sometimes, on the first run, the 2000-repetition input is segmented correctly as well. But when I run the code again, no data comes back.

Ubuntu 14.04 LTS.

Thanks.

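Could mmap be used to save memory for the trie or dictionary?

The dictionary or trie only needs to be built once, in one process. Other processes that segment text then only need to read that trie's memory rather than loading the file and building the dictionary again. Using mmap would save memory and further reduce startup time.
There are open-source mmap-based C++ allocators, such as:
https://github.com/johannesthoma/mmap_allocator

Centralizing segmentation behind HTTP achieves a similar effect, but I believe mmap would be more efficient.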

Thread safety

Are cppjieba's Cut family of functions thread-safe?

auto start under ubuntu

cjserver is very handy. I created an Upstart conf for cjserver so that it starts automatically under Ubuntu. Tested on Ubuntu Server 14.04.

Create /etc/init/cjserver.conf as follows:

description "cjserver"
start on (local-filesystems and runlevel [2345])
stop on runlevel [016]
respawn

script
    exec cjserver /etc/CppJieba/server.conf
end script

Start/Stop cjserver manually as:

sudo service cjserver start
sudo service cjserver stop
sudo service cjserver status

inconsistent with jieba raw python version, seems less accurate

string s = "附近可点击开飞机的科技开发的开放的了骄傲的龙卷风房贷款及付3的即可看见空间打开"
## python version: jieba.posseg.cut
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及 c
付 v
3 m
的 uj
即可 d
看见 v
空间 n
打开 v

## cppjieba version tag
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及付 x
3 x
的 uj
即可 d
看见 v
空间 n
打开 v

## the inconsistency lies in the number 3

Does it support words "like this" in custom dictionary?

Hi there,

as the title...

I want to specify some special vocabulary to process.

e.g. "android Google": I want this treated as a single vocabulary entry "android Google" instead of the separate words "android", "Google".

Can jieba or cppjieba do that?

Many Thanks.

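BUG: CutForSearch behaves differently from the Python version

Example:
vector<string> words;
jieba.DoCutForSearch("他心理健康", words);
Result:

心理健康

Python:
seg_list = jieba.cut_for_search("他心理健康") # search engine mode
for i in seg_list:
    print i
Result:

心理
健康
心理健康

The C++ version has a bug: searching for 心理 finds no entry; you have to search for 心理健康.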
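
The Python output above suggests that search mode emits the shorter dictionary words contained in a long word before the long word itself. A minimal sketch of that expansion step, assuming the sentence has already been segmented: expandForSearch and the dict set are hypothetical stand-ins, and std::u32string is used so that each element is one character.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Expand one already-segmented word for search indexing:
// emit its 2-character substrings found in the dictionary (only for words
// longer than 2 characters), then its 3-character substrings (only for
// words longer than 3), and finally the word itself.
// `dict` is a hypothetical stand-in for the real dictionary lookup.
std::vector<std::u32string> expandForSearch(const std::u32string& word,
                                            const std::set<std::u32string>& dict) {
    std::vector<std::u32string> out;
    for (std::size_t n = 2; n <= 3; ++n) {
        if (word.size() <= n) continue;
        for (std::size_t i = 0; i + n <= word.size(); ++i) {
            std::u32string gram = word.substr(i, n);
            if (dict.count(gram)) out.push_back(gram);
        }
    }
    out.push_back(word);
    return out;
}
```

With a dictionary containing 心理 and 健康, expanding 心理健康 gives 心理, 健康, 心理健康, which matches the Python result shown above.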

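Can the user dictionary support phrases that contain punctuation?

Hello yanyiwu,

First of all, thank you for your hard work.

I want to use cc-cedict as the segmentation dictionary, but it contains phrases with punctuation, such as “同一个世界,同一个梦想”, which jieba cannot recognize as a single phrase. Could an option be added to support this?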

Hi yanyiwu, how about making a branch for C++11 or C++14?

You know, there is so much good stuff in C++11 and C++14, like lambda expressions. Instead of writing
static bool compare(const X&....) { .................. return ..} you could write [ ](X x, Y y){return x < y;}
Maybe less code?

std::unordered_map and std::unordered_set are also a better choice than TR1.....
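
To make the comparison concrete, here is a small sketch with the same sort written both ways, plus std::unordered_map used where std::tr1::unordered_map would otherwise appear. WordWeight and the function names are made up for this example and are not taken from cppjieba.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, double> WordWeight;

// C++03 style: the comparator is a separate named function.
static bool compareByWeightDesc(const WordWeight& a, const WordWeight& b) {
    return a.second > b.second;
}

void sortWithFunction(std::vector<WordWeight>& v) {
    std::sort(v.begin(), v.end(), compareByWeightDesc);
}

// C++11 style: the comparator is written inline as a lambda.
void sortWithLambda(std::vector<WordWeight>& v) {
    std::sort(v.begin(), v.end(),
              [](const WordWeight& a, const WordWeight& b) {
                  return a.second > b.second;
              });
}

// std::unordered_map (C++11) in place of std::tr1::unordered_map.
std::unordered_map<std::string, int> countWords(
        const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words) ++counts[w];
    return counts;
}
```

The lambda version saves only a few lines here, so the larger practical gain is probably dropping the <tr1/...> headers, which is what breaks the Mac builds reported above.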
