yanyiwu / cppjieba
"Jieba" (结巴) Chinese word segmentation, C++ version
License: MIT License
Hi,
I'm trying to use Jieba.Cut(text, result) here, but the result shows that it counts offsets in bytes, not Unicode characters.
My text mixes Chinese and English characters, so I wonder if there is any way to make this work? Thanks for your great work!
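Until character offsets are supported natively, one workaround is to post-process the byte offsets. This is a minimal sketch (the helper name is mine, not part of cppjieba's API): in UTF-8, continuation bytes have the form 10xxxxxx, so counting the non-continuation bytes before a byte offset yields the character offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert a byte offset into a UTF-8 string to a code-point (character)
// offset by counting lead bytes (bytes that are not 10xxxxxx continuations).
std::size_t ByteToCharOffset(const std::string& s, std::size_t byte_offset) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++chars;
        }
    }
    return chars;
}
```

For the mixed string "a中b" ('中' occupies 3 bytes), byte offsets 0, 1 and 4 map to character offsets 0, 1 and 2.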
The MixSegment class in jieba's segmenter lets you specify a user dictionary, so why can't PosTagger for part-of-speech tagging take a user dictionary too?
If I want to use the POS tagging interface together with my own user dictionary, what should I do?
Is it possible to add a user dictionary to the server by adding a line to Server.conf?
usr_dict_path=/usr/share/CppJieba/dict/user.dict.utf8
Sorry, I am not familiar with C++.
Thanks.
As the title says.
As the title says.
Hi, I'm the author of jieba. May I ask how fast cppjieba's segmentation is?
VS uses GBK encoding; it took me quite a while to figure that out.
The mode is MixSegment, using the default dictionary.
"小明硕士毕业于**科学院计算院,先就职于IBM,后在日本京都大学深造"
The result is:
"小明" "硕士" "毕业" "于"
"**科学院" "计算" "院" ","
"先" "就职" "于" "IBM,"
"后" "在" "日本京都大学" "深造"
As the title says.
The standard procedure:
make
make test
make install
#include <cppjieba/Jieba.hpp>
The help docs are all for Linux... Has anyone used this on Windows, calling it from C/C++? I'm a newbie; any pointers would be appreciated.
My custom dictionary contains two words A and B; when A and B happen to be adjacent in the segmentation result, they get merged directly into AB in the output.
MySQL 5.7 already ships with ngram and the Japanese MeCab parser.
First, I admit I may be overthinking a small detail; it probably has no performance impact at all. Still, it feels awkward that cut(), which runs in tight loops, has to test if(!_getInitFlag()) on every call.
1. Personally I think a good design would use assert: when the user calls this function, _getInitFlag should already be true, and it is the user's responsibility to initialize before using cut.
2. Alternatively, if you really worry about errors from users calling cut without initializing, you could use something like Python-style dynamic binding:
BaseSegment() {
    _setInitFlag(False);
    BIND(this->cut, not_init_cut);
}
virtual bool init() {
    BIND(this->cut, have_inited_cut);
}
not_init_cut(str) {
    return False; /* not initialized */
}
have_inited_cut(str) {
    /* just cut, no if(!_getInitFlag()) check */
}
I personally lean toward approach 1: simple and clear. For approach 2, I'm not sure how convenient it is to implement in C++.
Addendum: in practice users usually wrap the segmenter as a singleton, since one XXXSegment object is enough for the whole application. The code at that point will certainly load the dictionary and initialize the object, so checking initialization again inside cut isn't really necessary. :)
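For what it's worth, approach 2 can be emulated in C++ with a pointer-to-member function that init() rebinds; the following is a hypothetical sketch of the idea (class and method names are mine, not cppjieba's):

```cpp
#include <cassert>
#include <string>

// Sketch of "dynamic binding": the public Cut() dispatches through a member
// function pointer, so no per-call init-flag branch is needed. Init() rebinds
// the pointer from the "not initialized" stub to the real implementation.
class Segment {
public:
    Segment() : cut_impl_(&Segment::CutNotInit) {}
    bool Init() {
        // ... dictionary loading would happen here ...
        cut_impl_ = &Segment::CutInited;
        return true;
    }
    bool Cut(const std::string& s) { return (this->*cut_impl_)(s); }
private:
    bool CutNotInit(const std::string&) { return false; }  // not initialized
    bool CutInited(const std::string&) {
        // real segmentation would go here; no if(!_getInitFlag()) check
        return true;
    }
    bool (Segment::*cut_impl_)(const std::string&);
};
```

Before Init(), Cut() dispatches to the stub; afterwards it dispatches to the real implementation, trading the flag branch for one indirect call.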
cppjieba/src/Limonp/StringUtil.hpp
Line 165 in aed1c8f
Why not directly support 24-bit Unicode code points?
I ported it to VS, changed the file encoding to UTF-8, and set the default string encoding to UTF-8, but after running it still does not display correctly.
他来到了网易杭研大厦
[demo] Cut With HMM
他/来/到/了/网/易/杭/研/大/厦
[demo] Cut Without HMM
他/来/到/了/网/易/杭/研/大/厦
我来到北京清华大学
[demo] CutAll
我/来/到/北/京/清/华/大/学
小明硕士毕业于**科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小/明/硕/士/毕/业/于/中/国/科/学/院/计/算/所/,/后/在/日/本/京/都/大/学/深/造
[demo] Insert User Word
男/默/女/泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小", "offset": 0}, {"word": "明", "offset": 2}, {"word": "硕", "offset": 4}, {"word": "士", "offset": 6}, {"word": "毕", "offset": 8}, {"word": "业", "offset": 10}, {"word": "于", "offset": 12}, {"word": "中", "offset": 14}, {"word": "国", "offset": 16}, {"word": "科", "offset": 18}, {"word": "学", "offset": 20}, {"word": "院", "offset": 22}, {"word": "计", "offset": 24}, {"word": "算", "offset": 26}, {"word": "所", "offset": 28}, {"word": ",", "offset": 30}, {"word": "后", "offset": 32}, {"word": "在", "offset": 34}, {"word": "日", "offset": 36}, {"word": "本", "offset": 38}, {"word": "京", "offset": 40}, {"word": "都", "offset": 42}, {"word": "大", "offset": 44}, {"word": "学", "offset": 46}, {"word": "深", "offset": 48}, {"word": "造", "offset": 50}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:x, 是:x, 拖:x, 拉:x, 机:x, 学:x, 院:x, 手:x, 扶:x, 拖:x, 拉:x, 机:x, 专:x, 业:x, 的:x, 。:x, 不:x, 用:x, 多:x, 久:x, ,:x, 我:x, 就:x, 会:x, 升:x, 职:x, 加:x, 薪:x, ,:x, 当:x, 上:x, CEO:eng, ,:x, 走:x, 上:x, 人:x, 生:x, 巅?x, 濉?x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [62], "weight": -1.08658e+063}]
Calling MixSegment directly, with the dictionaries jieba.dict.utf8 and hmm_model.utf8; for the input "B超 T恤" the result is ["B", "超", " T", "恤"], but the Python version of jieba handles it fine. May I ask why? Or am I just using it the wrong way?
Thanks~
When I run the demo project from VS, the results are not correct.
[demo] Cut With HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Cut Without HMM
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutAll
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/C/E/O/矛/谉/蓮/葖/珊/釠?濉
[demo] CutForSearch
螔/蕠/蛷/-/酆/学/院/蕱/锥/蛷/-/酆/专/业/談/c/一/觾/譅/迌/矛/螔/迧/邸/山/职/軗/薪/矛/毡/蓮/CEO/矛/谉/蓮/葖/珊/釠?濉
[demo] Insert User Word
膼/默/女/!
膼默女!
[demo] Locate Words
膹, 0, 1
蕞, 1, 2
蕫, 2, 3
婴, 3, 4
莪, 4, 5
猿, 5, 6
菂, 6, 7
[demo] TAGGING
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["螔:x", "蕠:x", "蛷:x", "-:x", "酆:x", "学:x", "院:x", "蕱:x", "锥:x", "蛷:x", "-:x", "酆:x", "专:x", "业:x", "談:x", "c:x", "一:x", "觾:x", "譅:x", "迌:x", "矛:x", "螔:x", "迧:x", "邸:x", "山:x", "职:x", "軗:x", "薪:x", "矛:x", "毡:x", "蓮:x", "CEO:eng", "矛:x", "谉:x", "蓮:x", "葖:x", "珊:x", "釠?x", "濉?x"]
[demo] KEYWORD
78
81
2016-04-07 17:00:46 E:\Project\cppjieba\include\cppjieba/KeywordExtractor.hpp:81
ERROR words illegal
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
I'm on Windows 7 Professional English 64-bit; the cmd default code page is 936 (GBK).
What is the purpose of this, and what adjustments would be made to the related data structures?
g++ -c -Wall -O3 keywordext_demo.cpp
In file included from keywordext_demo.cpp:3:
In file included from ./../cppjieba/headers.h:8:
In file included from ./../cppjieba/../cppcommon/headers.h:15:
./../cppjieba/../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppjieba/../cppcommon/sort_functs.h:270:37: note: place parentheses around
the '&&' expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppjieba/../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated:
first deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
cd ../cppjieba && make
g++ -c -g -Wall -DDEBUG HMMSegment.cpp
In file included from HMMSegment.cpp:1:
In file included from ./HMMSegment.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG KeyWordExt.cpp
In file included from KeyWordExt.cpp:5:
In file included from ./KeyWordExt.h:8:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MPSegment.cpp
In file included from MPSegment.cpp:5:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG MixSegment.cpp
In file included from MixSegment.cpp:1:
In file included from ./MixSegment.h:4:
In file included from ./MPSegment.h:10:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG SegmentBase.cpp
In file included from SegmentBase.cpp:1:
In file included from ./SegmentBase.h:7:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
g++ -c -g -Wall -DDEBUG Trie.cpp
In file included from Trie.cpp:5:
In file included from ./Trie.h:15:
In file included from ./../cppcommon/headers.h:15:
./../cppcommon/sort_functs.h:270:37: warning: '&&' within '||'
[-Wlogical-op-parentheses]
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./../cppcommon/sort_functs.h:270:37: note: place parentheses around the '&&'
expression to silence this warning
...(nodeL.first)==(nodeR.first)&&nodeL.second<nodeR.second;
^
( )
./../cppcommon/sort_functs.h:316:7: warning: 'stat64' is deprecated: first
deprecated in OS X 10.6 [-Wdeprecated-declarations]
if(stat64(pchFileIn, &stllFileSize)<0)
^
/usr/include/sys/stat.h:466:5: note: 'stat64' declared here
int stat64(const char *, struct stat64 *) __OSX_AVAILABLE_BUT_DEPREC...
^
2 warnings generated.
ar rc libcppjieba.a HMMSegment.o KeyWordExt.o MPSegment.o MixSegment.o SegmentBase.o Trie.o
ar: temporary file: No such file or directory
make[1]: *** [libcppjieba.a] Error 1
make: *** [../cppjieba/libcppjieba.a] Error 2
/*
 * TextRank.hpp
 *
 * Created on: Jul 7, 2015
 * Author: oliverlwang
 */
#ifndef TEXTRANK_H_
#define TEXTRANK_H_

#include "UndirectWeightedGraph.hpp"

namespace CppJieba
{
    class TextRank
    {
    public:
        TextRank() : _span(2) {}
        virtual ~TextRank() {}

        /**
         * @brief Compute the TextRank score of each word.
         *
         * @param words    input words
         * @param wordmap  output map from word to score
         *
         * @retval 0 on success, -1 on failure
         */
        int textRank(vector<string>& words, map<string, double>& wordmap)
        {
            try
            {
                UndirectWeightedGraph graph;
                map< pair<string, string>, double> cm;
                for(size_t i = 0; i < words.size(); ++i)
                {
                    /* syntactic filter */
                    /* ngram: with span=2 the f-measure gets the best result */
                    for(size_t j = i + 1; j < i + _span; ++j)
                    {
                        if(j >= words.size())
                            break;
                        /* use std::pair as the co-occurrence key */
                        pair<string, string> key = make_pair(words[i], words[j]);
                        cm[key] += 1.0;
                    }
                }
                /* add edges */
                for(map< pair<string, string>, double>::iterator it = cm.begin(); it != cm.end(); ++it)
                {
                    /* do not add an edge between a vertex and itself */
                    if(it->first.first == it->first.second)
                        continue;
                    graph.addEdge(it->first.first, it->first.second, it->second);
                }
                /* rank */
                graph.rank();
                wordmap.clear();
                wordmap = graph.ws;
            }
            catch(exception& e)
            {
                cerr << e.what() << endl;
                return -1;
            }
            return 0;
        }

    private:
        int _span; /* scanning span */
    };
} /* namespace CppJieba */

#endif /* TEXTRANK_H_ */
/*
 * UndirectWeightedGraph.hpp
 *
 * Created on: Jul 7, 2015
 * Author: oliverlwang
 */
#ifndef UNDIRECTWEIGHTEDGRAPH_H_
#define UNDIRECTWEIGHTEDGRAPH_H_

#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
#include <map>
#include <set>

using namespace std;

namespace CppJieba
{
    /* edge type */
    struct edge_t
    {
        string start;
        string end;
        double weight;
    };

    class UndirectWeightedGraph
    {
    public:
        UndirectWeightedGraph() : _dampingFactor(0.85) {}
        virtual ~UndirectWeightedGraph() {}

        /**
         * @brief Add an edge to the undirected weighted graph.
         *
         * @param start   one endpoint
         * @param end     the other endpoint
         * @param weight  edge weight
         *
         * @retval none
         */
        void addEdge(const string& start, const string& end, double weight)
        {
            /* add an out edge for vertex start */
            edge_t edge;
            edge.start = start;
            edge.end = end;
            edge.weight = weight;
            _graph[start].push_back(edge);
            /* add an out edge for vertex end */
            edge.start = end;
            edge.end = start;
            _graph[end].push_back(edge);
        }

        /**
         * @brief Rank the words by score.
         *
         * @param none
         *
         * @retval none
         */
        void rank()
        {
            map<string, double> outSum;
            /* initialize word scores */
            double wsdef = (_graph.size() > 0) ? (1.0 / _graph.size()) : 1.0;
            for(map<string, vector<edge_t> >::iterator it = _graph.begin(); it != _graph.end(); ++it)
            {
                ws[it->first] = wsdef;
                outSum[it->first] = weightOutSum(it->second);
            }
            /* iterate 10 times */
            for(int n = 0; n < 10; ++n)
            {
                /* an stl map is sorted by key by default */
                for(map<string, vector<edge_t> >::iterator i = _graph.begin(); i != _graph.end(); ++i)
                {
                    double score = 0.0;
                    for(vector<edge_t>::iterator j = i->second.begin(); j != i->second.end(); ++j)
                    {
                        score += j->weight / outSum[j->end] * ws[j->end];
                    }
                    ws[i->first] = (1.0 - _dampingFactor) + _dampingFactor * score;
                }
            }
            /* normalize */
            if(ws.empty())
                return;
            /* note: take min/max of the scores, not of the map entries
             * (max_element on a map would compare keys first) */
            double max_rank = ws.begin()->second;
            double min_rank = ws.begin()->second;
            for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
            {
                if(m->second > max_rank) max_rank = m->second;
                if(m->second < min_rank) min_rank = m->second;
            }
            for(map<string, double>::iterator m = ws.begin(); m != ws.end(); ++m)
            {
                m->second = (m->second - min_rank / 10.0) / (max_rank - min_rank / 10.0);
            }
        }

    public:
        /* word scores */
        map<string, double> ws;

    private:
        /* graph; key: vertex (a word or term), value: its adjacency list */
        map<string, vector<edge_t> > _graph;
        /* d is the damping factor, between 0 and 1, usually set to 0.85 */
        double _dampingFactor;

    private:
        /* sum the weights of a vertex's out edges */
        double weightOutSum(const vector<edge_t>& v)
        {
            double sum = 0.0;
            for(vector<edge_t>::const_iterator it = v.begin(); it != v.end(); ++it)
            {
                sum += it->weight;
            }
            return sum;
        }
    };
} /* namespace CppJieba */

#endif /* UNDIRECTWEIGHTEDGRAPH_H_ */
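For reference, the loop in rank() implements the standard TextRank score update (with damping factor d = 0.85 and a fixed 10 iterations), where adj(V) is the set of neighbors of vertex V:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in adj(V_i)} \frac{w_{ji}}{\sum_{V_k \in adj(V_j)} w_{jk}} \, WS(V_j)
```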
Compiling on Mac:
Scanning dependencies of target cjsegment
[ 9%] Building CXX object src/CMakeFiles/cjsegment.dir/segment.cpp.o
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/segment.cpp:5:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/ArgvContext.hpp:11:
In file included from /Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/str_functs.hpp:24:
/Users/jungle/workspace/my_lib/segment/cppjieba-2.4.0/src/Limonp/std_outbound.hpp:10:10: fatal error:
'tr1/unordered_map' file not found
^
1 error generated.
make[2]: *** [src/CMakeFiles/cjsegment.dir/segment.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/cjsegment.dir/all] Error 2
make: *** [all] Error 2
The example only takes a single key parameter; how do I choose the segmentation mode?
Recently I wanted to write an Erlang binding based on cppjieba and ran into the following problems,
so for now I built the project on top of nodejieba's code instead.
Hence this suggestion:
split out a minimal jiebalib project with the dependencies removed, providing only the engine.
void _loadUserDict(const string& filePath, double defaultWeight, const string& defaultTag)
{
    .................
    buf.clear();
    split(line, buf, " ");
    assert(buf.size() >= 1);
    if(!TransCode::decode(buf[0], nodeInfo.word))
    {
        LogError("line[%u:%s] illegal.", lineno, line.c_str());
        continue;
    }
    if(nodeInfo.word.size() == 1)
    {
        _userDictSingleChineseWord.insert(nodeInfo.word[0]);
    }
    nodeInfo.weight = defaultWeight;
    nodeInfo.tag = (buf.size() == 2 ? buf[1] : defaultTag);
    ............
}

void _loadDict(const string& filePath)
{
    ............
    assert(buf.size() == DICT_COLUMN_NUM);
    if(!TransCode::decode(buf[0], nodeInfo.word))
    {
        LogError("line[%u:%s] illegal.", lineno, line.c_str());
        continue;
    }
    nodeInfo.weight = atof(buf[1].c_str());
    nodeInfo.tag = buf[2];
    _nodeInfos.push_back(nodeInfo);
    }
}
From these two snippets, it looks like the system dictionary format is `流水行云 2 n`
while the user dictionary format is `蓝翔 nz`,
and it seems there is no way to set the word frequency for user entries, right?
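To make the two formats concrete, here is a hypothetical parsing sketch (the struct and function names are mine, not library code): the system dictionary carries `word freq tag`, while the user dictionary carries `word [tag]` and falls back to caller-supplied defaults, which is why user entries cannot set a frequency.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct DictEntry {
    std::string word;
    double weight;
    std::string tag;
};

// System dictionary line: "word freq tag" (three columns).
DictEntry ParseSystemLine(const std::string& line) {
    std::istringstream in(line);
    DictEntry e;
    in >> e.word >> e.weight >> e.tag;
    return e;
}

// User dictionary line: "word [tag]"; the frequency column is absent,
// which is why _loadUserDict always assigns defaultWeight.
DictEntry ParseUserLine(const std::string& line, double default_weight,
                        const std::string& default_tag) {
    std::istringstream in(line);
    DictEntry e;
    in >> e.word;
    e.weight = default_weight;
    if (!(in >> e.tag)) e.tag = default_tag;
    return e;
}
```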
cppjieba is very stable and reliable.
However, once the input text exceeds a certain size there seems to be no response and no error message either. Below is Python 3.4 test code:
#!/usr/bin/env python3
import urllib.request
import sys

def cut(sentence):
    u = "http://127.0.0.1:11200/"
    req = urllib.request.Request(u, data=sentence.encode('utf-8'))
    try:
        f = urllib.request.urlopen(req)
        s = f.read().decode('utf-8')
    except:
        print("Unexpected error:", sys.exc_info()[0])
        return
    return s

sentence = "南京市长江大桥"
print("repeat 500 times:")
s1 = sentence * 500
print(cut(s1))
print("repeat 2000 times:")
s2 = sentence * 2000
print(cut(s2))
Sometimes, on the first run, the 2000-fold repetition is segmented correctly as well. But when I rerun this code, no data comes back again.
Ubuntu 14.04 LTS.
谢谢。
Both "很" and "非常" are in the dictionary, and "很" even has a higher frequency than "非常". Why is "很帅" not split into two words while "非常帅" is? I've only just started using the POS tagging library and don't really understand the algorithm. Could someone who knows explain this?
g++ -std=c++0x -o server server.cpp -L/usr/lib/CppJieba/ -L/usr/lib/CppJieba/Husky -lcppjieba -lhusky -lpthread
On Windows the path might be C:/test/test.
The dictionary (trie) only needs to be built once, in a single process; when other processes segment text, they only need to read that trie's memory instead of loading the file and building the dictionary again. Using mmap would save memory and further reduce startup time.
There are open-source mmap-based C++ allocators, for example:
https://github.com/johannesthoma/mmap_allocator
The HTTP server also achieves centralized segmentation, but I believe mmap would improve efficiency further.
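As a rough sketch of the mmap idea (POSIX-only; the helper is hypothetical, not cppjieba code): mapping a prebuilt dictionary image read-only lets the kernel share one copy of the pages among all processes that map the same file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only; the kernel shares the backing pages between all
// processes mapping the same file, so the dictionary is loaded only once.
// Returns nullptr on failure (including an empty file).
const char* MapFileReadOnly(const char* path, size_t* out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                   PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    *out_len = static_cast<size_t>(st.st_size);
    return static_cast<const char*>(p);
}
```

A serialized trie would still need to be position-independent (offsets instead of pointers) before it could be used directly out of such a mapping.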
May I ask whether cppjieba's Cut family of functions is thread-safe?
cjserver is very handy. I created an upstart conf for cjserver so that it can auto-start on Ubuntu. Tested on Ubuntu Server 14.04.
Create /etc/init/cjserver.conf
as following:
description "cjserver"
start on (local-filesystems and runlevel [2345])
stop on runlevel [016]
respawn
script
exec cjserver /etc/CppJieba/server.conf
end script
Start/Stop cjserver manually as:
sudo service cjserver start
sudo service cjserver stop
sudo service cjserver status
string s = "附近可点击开飞机的科技开发的开放的了骄傲的龙卷风房贷款及付3的即可看见空间打开";
## python version: jieba.posseg.cut
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及 c
付 v
3 m
的 uj
即可 d
看见 v
空间 n
打开 v
## cppjieba version tag
附近 f
可 v
点击 v
开 v
飞机 n
的 uj
科技开发 nt
的 uj
开放 v
的 uj
了 ul
骄傲 a
的 uj
龙卷风 nr
房 n
贷款 n
及付 x
3 x
的 uj
即可 d
看见 v
空间 n
打开 v
## the inconsistency lies around the number 3: Python gives 及/付/3, cppjieba gives 及付/3
As the title says.
When reading a file larger than 20 MB, segmentation fails and keyword extraction errors out!
string s length: 14490184
[demo] CutAll
_count: 390
_words.size: 390
[demo] KEYWORD
2016-04-15 14:43:13 ../include/cppjieba/KeywordExtractor.hpp:79 ERROR words illegal
Hi there,
as the title...
I want to specify some special vocabulary to process.
e.g. "android Google": I want this to be one vocabulary entry "android Google" instead of the separate words "android", "Google".
Can the jieba or cppjieba do that?
Many Thanks.
If not, would you consider adding this feature?
As the title says: with GBK-encoded input, the segmentation result is all single characters.
Example:
vector<string> words;
jieba.DoCutForSearch("他心理健康", words);
yields:
他
心理健康
Python:
seg_list = jieba.cut_for_search("他心理健康")  # search-engine mode
for i in seg_list:
    print i
yields:
他
心理
健康
心理健康
The C++ version has a bug: a search for 心理 finds nothing; you have to search for 心理健康 instead.
Because I don't want to call it once for every sentence.
As the title says.
Hi yanyiwu,
First, thanks for all your hard work.
I'd like to use cc-cedict as the segmentation dictionary, but it contains entries like "同一个世界,同一个梦想" that include punctuation, and jieba cannot recognize such an entry as a single phrase. Could a parameter option be added to support this?
You know, there is so much good stuff in C++11 and C++14, like lambda expressions. You could omit
static bool compare(const X&....) { .................. return ..; } and instead write [](X x, Y y){ return x < y; }
Maybe less code?
std::unordered_map and std::unordered_set are also better choices than TR1.....
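As a sketch of that suggestion (the function below is illustrative, not cppjieba code), a C++11 lambda can replace a named static comparator when sorting, e.g. word/weight pairs by descending weight:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// C++11: an inline lambda replaces a named static comparator function.
// Sorts word/weight pairs by descending weight.
std::vector<std::pair<std::string, double> > SortByWeight(
    std::vector<std::pair<std::string, double> > v) {
    std::sort(v.begin(), v.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) {
                  return a.second > b.second;
              });
    return v;
}
```

The comparator lives right at the call site, so there is no need for a separately declared static function.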
Sharing some other dictionary resources:
dict.367W.utf8.tar.gz iLife([email protected])
In the README