
gojieba's Introduction

GoJieba

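GoJieba is the Go version of the "Jieba" Chinese word segmentation library.

Introduction

  • Supports several segmentation modes: maximum-probability mode, HMM new-word discovery mode, search engine mode, and full mode.
  • The core algorithm is implemented in C++ underneath, so performance is high.
  • Dictionary paths are configurable: NewJieba(...string) and NewExtractor(...string) take variadic arguments, and the default dictionaries are used when no arguments are passed (the recommended approach).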

Usage

go get github.com/yanyiwu/gojieba

Segmentation example

package main

import (
	"fmt"
	"strings"

	"github.com/yanyiwu/gojieba"
)

func main() {
	var s string
	var words []string
	use_hmm := true
	x := gojieba.NewJieba()
	defer x.Free()

	s = "我来到北京清华大学"
	words = x.CutAll(s)
	fmt.Println(s)
	fmt.Println("全模式:", strings.Join(words, "/"))

	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("精确模式:", strings.Join(words, "/"))
	s = "比特币"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("精确模式:", strings.Join(words, "/"))

	x.AddWord("比特币")
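	// `AddWordEx` lets you specify a word's weight; use it when `AddWord` fails because the weight it assigns is too low.
	// The `tag` parameter may be an empty string or a part-of-speech tag.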
	// x.AddWordEx("比特币", 100000, "")
	s = "比特币"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("添加词典后,精确模式:", strings.Join(words, "/"))

	s = "他来到了网易杭研大厦"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("新词识别:", strings.Join(words, "/"))

	s = "小明硕士毕业于**科学院计算所,后在日本京都大学深造"
	words = x.CutForSearch(s, use_hmm)
	fmt.Println(s)
	fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

	s = "长春市长春药店"
	words = x.Tag(s)
	fmt.Println(s)
	fmt.Println("词性标注:", strings.Join(words, ","))

	s = "区块链"
	words = x.Tag(s)
	fmt.Println(s)
	fmt.Println("词性标注:", strings.Join(words, ","))

	s = "长江大桥"
	words = x.CutForSearch(s, !use_hmm)
	fmt.Println(s)
	fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

	wordinfos := x.Tokenize(s, gojieba.SearchMode, !use_hmm)
	fmt.Println(s)
	fmt.Println("Tokenize:(搜索引擎模式)", wordinfos)

	wordinfos = x.Tokenize(s, gojieba.DefaultMode, !use_hmm)
	fmt.Println(s)
	fmt.Println("Tokenize:(默认模式)", wordinfos)

	keywords := x.ExtractWithWeight(s, 5)
	fmt.Println("Extract:", keywords)
}
我来到北京清华大学
全模式: 我/来到/北京/清华/清华大学/华大/大学
我来到北京清华大学
精确模式: 我/来到/北京/清华大学
比特币
精确模式: 比特/币
比特币
添加词典后,精确模式: 比特币
他来到了网易杭研大厦
新词识别: 他/来到/了/网易/杭研/大厦
小明硕士毕业于**科学院计算所,后在日本京都大学深造
搜索引擎模式: 小明/硕士/毕业/于/**/科学/学院/科学院/**科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
长春市长春药店
词性标注: 长春市/ns,长春/ns,药店/n
区块链
词性标注: 区块链/nz
长江大桥
搜索引擎模式: 长江/大桥/长江大桥
长江大桥
Tokenize: [{长江 0 6} {大桥 6 12} {长江大桥 0 12}]
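
The Start and End values in the Tokenize output above look like UTF-8 byte offsets rather than character counts: each of these four characters occupies three bytes, which would explain 0, 6, and 12. A minimal sketch under that assumption follows. It uses a local Word type as a stand-in for the library's result type, so it runs without cgo.

```go
package main

import "fmt"

// Word is a local stand-in for one tokenizer result; it is defined here
// only so the sketch runs without the gojieba dependency.
type Word struct {
	Str   string
	Start int
	End   int
}

// sliceByOffsets returns the substring of s covered by w, treating
// Start and End as byte offsets (an assumption based on the output above).
func sliceByOffsets(s string, w Word) string {
	return s[w.Start:w.End]
}

func main() {
	s := "长江大桥"
	words := []Word{{"长江", 0, 6}, {"大桥", 6, 12}, {"长江大桥", 0, 12}}
	for _, w := range words {
		// Each comparison checks that the offsets recover the token text.
		fmt.Println(sliceByOffsets(s, w) == w.Str)
	}
}
```

If the offsets were rune indices instead, the slice would have to go through []rune(s) first.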

See example in jieba_test, extractor_test

Benchmark

Performance evaluation of the Jieba Chinese word segmentation series

Unittest

go test ./...

Benchmark

go test -bench "Jieba" -test.benchtime 10s
go test -bench "Extractor" -test.benchtime 10s

Contributors

Code Contributors

This project exists thanks to all the people who contribute.

Contact

gojieba's People

Contributors

amberooo, bitdeli-chef, coseyo, elprup, evsio0n, franciosi, hugolgst, kooksee, matryer, sillydong, sy-lht, ttys3, yanyiwu, youyoubao

gojieba's Issues

Error when running a project whose dependencies are managed with go mod

github.com/yanyiwu/gojieba

In file included from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Unicode.hpp:9,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/DictTrie.hpp:15,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/QuerySegment.hpp:8,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp: In instantiation of 'void limonp::LocalVector::reserve(size_t) [with T = std::pair<long long unsigned int, const cppjieba::DictUnit*>; size_t = long long unsigned int]':
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp:83:7: required from 'void limonp::LocalVector::push_back(const T&) [with T = std::pair<long long unsigned int, const cppjieba::DictUnit*>]'
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Trie.hpp:99:81: required from here
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp:95:11: warning: 'void* memcpy(void*, const void*, size_t)' writing to an object of type 'struct std::pair<long long unsigned int, const cppjieba::DictUnit*>' with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
memcpy(ptr_, old, sizeof(T) * capacity_);
~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/utility:70,
from D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/algorithm:60,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/QuerySegment.hpp:4,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_pair.h:198:12: note: 'struct std::pair<long long unsigned int, const cppjieba::DictUnit*>' declared here
struct pair
^~~~
In file included from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Unicode.hpp:9,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/DictTrie.hpp:15,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/QuerySegment.hpp:8,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp: In instantiation of 'limonp::LocalVector& limonp::LocalVector::operator=(const limonp::LocalVector&) [with T = std::pair<long long unsigned int, const cppjieba::DictUnit*>]':
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp:33:11: required from 'limonp::LocalVector::LocalVector(const limonp::LocalVector&) [with T = std::pair<long long unsigned int, const cppjieba::DictUnit*>]'
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Trie.hpp:28:8: required from 'void std::_Construct(_T1*, _Args&& ...) [with _T1 = cppjieba::Dag; _Args = {const cppjieba::Dag&}]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_uninitialized.h:83:18: required from 'static _ForwardIterator std::__uninitialized_copy<_TrivialValueTypes>::__uninit_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; bool _TrivialValueTypes = false]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_uninitialized.h:134:15: required from '_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_uninitialized.h:289:37: required from '_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, std::allocator<_Tp>&) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; _Tp = cppjieba::Dag]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_uninitialized.h:311:2: required from '_ForwardIterator std::__uninitialized_move_if_noexcept_a(_InputIterator, _InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; _Allocator = std::allocator<cppjieba::Dag>]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/vector.tcc:611:7: required from 'void std::vector<_Tp, _Alloc>::_M_default_append(std::vector<_Tp, _Alloc>::size_type) [with _Tp = cppjieba::Dag; _Alloc = std::allocator<cppjieba::Dag>; std::vector<_Tp, _Alloc>::size_type = long long unsigned int]'
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_vector.h:827:4: required from 'void std::vector<_Tp, _Alloc>::resize(std::vector<_Tp, _Alloc>::size_type) [with _Tp = cppjieba::Dag; _Alloc = std::allocator<cppjieba::Dag>; std::vector<_Tp, _Alloc>::size_type = long long unsigned int]'
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Trie.hpp:86:27: required from here
D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/limonp/LocalVector.hpp:63:13: warning: 'void* memcpy(void*, const void*, size_t)' writing to an object of type 'struct std::pair<long long unsigned int, const cppjieba::DictUnit*>' with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
memcpy(ptr_, vec.ptr_, vec.size() * sizeof(T));
~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/utility:70,
from D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/algorithm:60,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/QuerySegment.hpp:4,
from D:\ProgramFiles\Go\gopath\pkg\mod\github.com\yanyiwu\[email protected]\deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
D:/ProgramFiles/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/stl_pair.h:198:12: note: 'struct std::pair<long long unsigned int, const cppjieba::DictUnit*>' declared here
struct pair
^~~~


How should I handle this? Thanks!

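Without HMM, English words and numbers are split into single characters

The problem is as the title says. An example follows.

Using the default dictionary in precise mode, the input and outputs are: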

s := "最近一直用这款,这次涨价了!不过看了下生产日期是2015年的!原来是2014所以便宜!good."

// hmm = false
最近/一直/用/这/款/,/这次/涨价/了/!/不过/看/了/下/生产日期/是/2/0/1/5/年/的/!/原来/是/2/0/1/4/所以/便宜/!/g/o/o/d/

// hmm = true
最近/一直/用/这款/,/这次/涨价/了/!/不过/看/了/下/生产日期/是/2015/年的/!/原来/是/2014/所以/便宜/!/good/.

The Python version does not have this problem:

# hmm = True
最近/一直/用/这款/,/这次/涨价/了/!/不过/看/了/下/生产日期/是/2015/年/的/!/原来/是/2014/所以/便宜/!/good/.

# hmm = False
最近/一直/用/这/款/,/这次/涨价/了/!/不过/看/了/下/生产日期/是/2015/年/的/!/原来/是/2014/所以/便宜/!/good/.

My use case does not need HMM, but I want English words and numbers to stay whole instead of being split into single characters. How can I do that?
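
One possible workaround, independent of the library, is to post-process the hmm = false output and join adjacent single-character ASCII letters and digits back together. The sketch below is only that: mergeASCIIRuns is a hypothetical helper, not a gojieba function, and it would also merge letters or digits that were separate tokens on purpose.

```go
package main

import (
	"fmt"
	"strings"
)

// isASCIIAlnum reports whether tok is exactly one ASCII letter or digit.
func isASCIIAlnum(tok string) bool {
	if len(tok) != 1 {
		return false
	}
	c := tok[0]
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// mergeASCIIRuns joins each run of adjacent single-character ASCII
// letter/digit tokens into one token and leaves all other tokens untouched.
func mergeASCIIRuns(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	var run strings.Builder
	flush := func() {
		if run.Len() > 0 {
			out = append(out, run.String())
			run.Reset()
		}
	}
	for _, tok := range tokens {
		if isASCIIAlnum(tok) {
			run.WriteString(tok)
			continue
		}
		flush()
		out = append(out, tok)
	}
	flush()
	return out
}

func main() {
	// A slice of the hmm = false output shown above.
	tokens := []string{"是", "2", "0", "1", "5", "年", "的", "!", "g", "o", "o", "d"}
	fmt.Println(strings.Join(mergeASCIIRuns(tokens), "/"))
}
```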

Building the test code provided by the project prints many C++ compiler warnings

github.com/yanyiwu/gojieba

In file included from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Unicode.hpp:9,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/DictTrie.hpp:15,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/QuerySegment.hpp:8,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp: In instantiation of ‘void limonp::LocalVector::reserve(size_t) [with T = std::pair<long unsigned int, const cppjieba::DictUnit*>; size_t = long unsigned int]’:
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp:83:7: required from ‘void limonp::LocalVector::push_back(const T&) [with T = std::pair<long unsigned int, const cppjieba::DictUnit*>]’
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Trie.hpp:99:81: required from here
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp:95:11: 警告:‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘struct std::pair<long unsigned int, const cppjieba::DictUnit*>’ with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
95 | memcpy(ptr_, old, sizeof(T) * capacity_);
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/9.1.0/utility:70,
from /usr/include/c++/9.1.0/algorithm:60,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/QuerySegment.hpp:4,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
/usr/include/c++/9.1.0/bits/stl_pair.h:208:12: 附注:‘struct std::pair<long unsigned int, const cppjieba::DictUnit*>’ declared here
208 | struct pair
| ^~~~
In file included from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Unicode.hpp:9,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/DictTrie.hpp:15,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/QuerySegment.hpp:8,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp: In instantiation of ‘limonp::LocalVector& limonp::LocalVector::operator=(const limonp::LocalVector&) [with T = std::pair<long unsigned int, const cppjieba::DictUnit*>]’:
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp:33:11: required from ‘limonp::LocalVector::LocalVector(const limonp::LocalVector&) [with T = std::pair<long unsigned int, const cppjieba::DictUnit*>]’
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Trie.hpp:28:8: required from ‘void std::_Construct(_T1*, _Args&& ...) [with _T1 = cppjieba::Dag; _Args = {const cppjieba::Dag&}]’
/usr/include/c++/9.1.0/bits/stl_uninitialized.h:83:18: required from ‘static _ForwardIterator std::__uninitialized_copy<_TrivialValueTypes>::__uninit_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; bool _TrivialValueTypes = false]’
/usr/include/c++/9.1.0/bits/stl_uninitialized.h:134:15: required from ‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*]’
/usr/include/c++/9.1.0/bits/stl_uninitialized.h:289:37: required from ‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, std::allocator<_Tp>&) [with _InputIterator = const cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; _Tp = cppjieba::Dag]’
/usr/include/c++/9.1.0/bits/stl_uninitialized.h:311:2: required from ‘_ForwardIterator std::__uninitialized_move_if_noexcept_a(_InputIterator, _InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = cppjieba::Dag*; _ForwardIterator = cppjieba::Dag*; _Allocator = std::allocator<cppjieba::Dag>]’
/usr/include/c++/9.1.0/bits/vector.tcc:659:48: required from ‘void std::vector<_Tp, _Alloc>::_M_default_append(std::vector<_Tp, _Alloc>::size_type) [with _Tp = cppjieba::Dag; _Alloc = std::allocator<cppjieba::Dag>; std::vector<_Tp, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/9.1.0/bits/stl_vector.h:937:4: required from ‘void std::vector<_Tp, _Alloc>::resize(std::vector<_Tp, _Alloc>::size_type) [with _Tp = cppjieba::Dag; _Alloc = std::allocator<cppjieba::Dag>; std::vector<_Tp, _Alloc>::size_type = long unsigned int]’
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Trie.hpp:86:27: required from here
/home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/limonp/LocalVector.hpp:63:13: 警告:‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘struct std::pair<long unsigned int, const cppjieba::DictUnit*>’ with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
63 | memcpy(ptr_, vec.ptr_, vec.size() * sizeof(T));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/9.1.0/utility:70,
from /usr/include/c++/9.1.0/algorithm:60,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/QuerySegment.hpp:4,
from /home/susu/go/pkg/mod/github.com/yanyiwu/[email protected]/deps/cppjieba/Jieba.hpp:4,
from jieba.cpp:5:
/usr/include/c++/9.1.0/bits/stl_pair.h:208:12: 附注:‘struct std::pair<long unsigned int, const cppjieba::DictUnit*>’ declared here
208 | struct pair
| ^~~~

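Can errors be caught in Go instead of crashing the process?

I load dictionary files dynamically when using this package. A dictionary file may be missing or its content may be invalid, so I need to guard the NewJieba call and fall back to a degraded mode when loading fails. My current code, shown here, does not manage to catch the error: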

package main

import (
	"fmt"

	"github.com/yanyiwu/gojieba"
)

func main() {
	var jb *gojieba.Jieba
	func() {
		defer func() {
			if err := recover(); err != nil {
				fmt.Println("error:", err)
				jb = gojieba.NewJieba()
			}
		}()
		jb = gojieba.NewJieba("not_exist_file")
	}()

	fmt.Println(jb)
}
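
recover() only intercepts Go panics. The crash logs in other reports on this page show a failed dictionary load ending in SIGABRT during cgo execution, which terminates the process without giving deferred functions a chance to recover. One workaround is to check the files from Go before calling NewJieba. A minimal sketch follows; checkDictFiles is a hypothetical helper, not part of gojieba, and it only verifies that each path is a readable regular file, not that its contents are valid.

```go
package main

import (
	"fmt"
	"os"
)

// checkDictFiles returns an error for the first path that cannot be
// opened as a regular file. It is a hypothetical pre-flight helper.
func checkDictFiles(paths ...string) error {
	for _, p := range paths {
		info, err := os.Stat(p)
		if err != nil {
			return fmt.Errorf("dictionary %q: %w", p, err)
		}
		if !info.Mode().IsRegular() {
			return fmt.Errorf("dictionary %q is not a regular file", p)
		}
		f, err := os.Open(p)
		if err != nil {
			return fmt.Errorf("dictionary %q: %w", p, err)
		}
		f.Close()
	}
	return nil
}

func main() {
	if err := checkDictFiles("not_exist_file"); err != nil {
		// Here the caller would fall back to gojieba.NewJieba() with
		// no arguments instead of passing the bad path.
		fmt.Println("falling back to default dictionaries:", err)
		return
	}
	fmt.Println("custom dictionary looks usable")
}
```

The check is racy by nature (the file can change between the check and the load), so it reduces the chance of an abort rather than ruling it out.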

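Can searches under bleve be more precise?

For example:

	req := bleve.NewSearchRequest(bleve.NewQueryStringQuery("财务管理办法"))
	req.Highlight = bleve.NewHighlight()
	res, err := index.Search(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(res)

The results include every document that contains 财务, 管理, or 办法. Can the search be made more precise, so that it only finds "财务管理办法"?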

Calling the go.Tag function sometimes crashes

pure virtual method called
terminate called without an active exception
SIGABRT: abort
PC=0x7f2526d82067 m=10 sigcode=18446744073709551610
signal arrived during cgo execution

It does not run successfully after installation.

It reports this error:
2016-08-19 10:17:27 ./deps/cppjieba/DictTrie.hpp:153 FATAL exp: [ifs.is_open()] false. open /Users/hqw/Desktop:/Users/hqw/Desktop/gofiles/src/github.com/yanyiwu/gojieba/dict/jieba.dict.utf8 failed.
SIGABRT: abort
PC=0x7fff8aead866 m=0
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x40b6290, 0xc82004dca8, 0xc800000000)
/usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc82004dc70 sp=0xc82004dc40
github.com/yanyiwu/gojieba._Cfunc_NewJieba(0x4500000, 0x4500070, 0x45000e0, 0x0)
??:0 +0x42 fp=0xc82004dca8 sp=0xc82004dc70
github.com/yanyiwu/gojieba.NewJieba(0x0, 0x0, 0x0, 0x0)
/Users/hqw/Desktop/gofiles/src/github.com/yanyiwu/gojieba/jieba.go:37 +0x22d fp=0xc82004dda0 sp=0xc82004dca8
main.main()
/Users/hqw/gojieba.go:14 +0x62 fp=0xc82004df50 sp=0xc82004dda0
runtime.main()
/usr/local/go/src/runtime/proc.go:188 +0x2b0 fp=0xc82004dfa0 sp=0xc82004df50
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82004dfa8 sp=0xc82004dfa0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1

rax 0x0
rbx 0x7fff72e77310
rcx 0x7fff5fbff0e8
rdx 0x0
rdi 0x303
rsi 0x6
rbp 0x7fff5fbff110
rsp 0x7fff5fbff0e8
r8 0x4600100
r9 0x0
r10 0x8000000
r11 0x206
r12 0x500000a
r13 0x5000000
r14 0x6
r15 0x7fff74e55398
rip 0x7fff8aead866
rflags 0x206
cs 0x7
fs 0x0
gs 0x72e70000
exit status 2
Error: process exited with code 1.

NewJieba() memory leak [resolved] (it is go test's problem)

env

❯ go version
go version go1.12.5 linux/amd64

❯ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.3.0 (GCC) 

Add the code below to jieba_test.go:

func BenchmarkNewJiebaMemLeak(b *testing.B) {
	b.ResetTimer()
	// equivalent to x := NewJieba(DICT_PATH, HMM_PATH, USER_DICT_PATH)
	x := NewJieba()
	defer x.Free()
	// Stop Timer before x.Free()
	defer b.StopTimer()
}

Run the benchmark and watch the memory usage:

go test -bench=BenchmarkNewJiebaMemLeak .

Memory usage grows from 76.0 MiB to 718.3 MiB.
Does x.Free() really free the memory?

[root@8700k hacklog]# ps_mem -w 1 | grep jieba
118.6 MiB +  48.5 KiB = 118.6 MiB	gojieba.test
118.6 MiB +  48.5 KiB = 118.7 MiB	gojieba.test
331.1 MiB +  48.5 KiB = 331.1 MiB	gojieba.test
359.6 MiB +  48.5 KiB = 359.7 MiB	gojieba.test
475.1 MiB +  48.5 KiB = 475.1 MiB	gojieba.test
589.6 MiB +  48.5 KiB = 589.6 MiB	gojieba.test
568.3 MiB +  48.5 KiB = 568.4 MiB	gojieba.test
675.7 MiB +  48.5 KiB = 675.8 MiB	gojieba.test
633.1 MiB +  48.5 KiB = 633.2 MiB	gojieba.test

ps_mem is a utility that accurately reports the in-core memory usage of a program;
you can install it with pip install ps_mem.

The result:

❯ go test -test.bench=BenchmarkNewJiebaMemLeak -test.benchmem
/home/hacklog/go/src/github.com/yanyiwu/gojieba/config_test.go
goos: linux
goarch: amd64
pkg: github.com/yanyiwu/gojieba
BenchmarkNewJiebaMemLeak-12    	2000000000	         0.20 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/yanyiwu/gojieba	10.649s

If the Free() call (defer x.Free()) is commented out, the result is:

 79.9 MiB +  41.5 KiB =  79.9 MiB	gojieba.test
118.6 MiB +  41.5 KiB = 118.7 MiB	gojieba.test
424.5 MiB +  41.5 KiB = 424.6 MiB	gojieba.test
716.5 MiB +  41.5 KiB = 716.6 MiB	gojieba.test
987.3 MiB +  41.5 KiB = 987.3 MiB	gojieba.test
  1.3 GiB +  41.5 KiB =   1.3 GiB	gojieba.test
  1.6 GiB +  41.5 KiB =   1.6 GiB	gojieba.test
  1.9 GiB +  41.5 KiB =   1.9 GiB	gojieba.test
  2.2 GiB +  41.5 KiB =   2.2 GiB	gojieba.test

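The bleve example code in the README fails at run time

Error message: panic: error building tokenizer: config idf not found

Cause: the idf and stop_words parameters are missing from the AddCustomTokenizer call; I suggest adding them: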

    err := indexMapping.AddCustomTokenizer("gojieba",
        map[string]interface{}{
            "dictpath":     gojieba.DICT_PATH,
            "hmmpath":      gojieba.HMM_PATH,
            "userdictpath": gojieba.USER_DICT_PATH,
            "idf":          gojieba.IDF_PATH,
            "stop_words":   gojieba.STOP_WORDS_PATH,
            "type":         "gojieba",
        },
    )

cgo exception when building the index on a schedule

Hello, an exception occurs when I build the index. It is intermittent; the error message is below:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort
PC=0x7f905a6be495 m=7 sigcode=18446744073709551610
signal arrived during cgo execution

What could be the cause?

deps/cppjieba/DictTrie.hpp:153 FATAL exp: [ifs.is_open()] false. open

Hello, my project imports your gojieba package, but it fails at run time on 64-bit Windows 7. What is the cause?
The gcc in use is mingw-w64, version x86_64-6.3.0-release-posix-seh-rt_v5-rev1.

./deps/cppjieba/DictTrie.hpp:153 FATAL exp: [ifs.is_open()] false. open /workspace/goWorkSpace/src/github.com/yanyiwu/gojieba/dict/jieba.dict.utf8 failed.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

A 2.3 GB dictionary exhausts the memory of a 64 GB machine

package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"github.com/yanyiwu/gojieba"
	"io"
	"net/http"
	"runtime"
	"strings"
	"time"
)

var (
	host = flag.String("host","127.0.0.1","HTTP server host name")
	port = flag.Int("port",8888,"HTTP server port")
	x = gojieba.NewJieba("/tmp/test.dict.utf8")
)

/**
Start the server as follows (host (127.0.0.1) and port (8888) are optional; both have defaults)
go run server.go -host 0.0.0.0 -port 3306
 */
func main()  {
	flag.Parse()

	// Set the number of threads to the number of CPUs
	runtime.GOMAXPROCS(runtime.NumCPU())

	http.HandleFunc("/segmentation",Handler)
	fmt.Println(fmt.Sprintf("%s:%d",*host,*port))
	http.ListenAndServe(fmt.Sprintf("%s:%d",*host,*port),nil)
}

func Handler(w http.ResponseWriter, req *http.Request)  {
	start_time := time.Now().UnixNano() / 1000000
	// Get the text to segment
	text := req.URL.Query().Get("company_name")
	if text == ""{
		text = req.PostFormValue("company_name")
	}

	words := x.Tag(text)
	split_word := []string{}
	list := make([]string,0)
	for _,word:= range words{
		split_word = strings.Split(word,"/")
		if split_word[1] == "n" {
			list = append(list, split_word[0])
		}
	}
	end_time := time.Now().UnixNano() / 1000000
	fmt.Println("elapsed:",(end_time-start_time),"ms")
	response,_ := json.Marshal(list)

	w.Header().Set("Content-Type", "application/json")
	io.WriteString(w, string(response))
}

Note: the single file /tmp/test.dict.utf8 contains roughly fifty million entries.
The dictionary format is as follows:

常州市伟芳机械有限公司 2 n
兰州金乐塑胶有限公司 2 n
河南兆龙电气设备有限公司 2 n
青岛德润鑫文化传媒有限公司 2 n
重庆禾加合科技发展有限公司 2 n
潍坊崔旺建材销售有限公司 2 n
甘肃龙发装饰工程有限公司 2 n
任丘市大卫电动车有限公司 2 n
建湖县众友服饰有限公司 2 n
曹县小金豆电子商务有限公司 2 n

Example panic

package main

import (
    "fmt"
    "strings"

    "github.com/yanyiwu/gojieba"
)

func main() {
    var s string
    var words []string
    use_hmm := true
    x := gojieba.NewJieba()
    defer x.Free()

    s = "我来到北京清华大学"
    words = x.CutAll(s)
    fmt.Println(s)
    fmt.Println("全模式:", strings.Join(words, "/"))

    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("精确模式:", strings.Join(words, "/"))
    s = "比特币"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("精确模式:", strings.Join(words, "/"))

    x.AddWord("比特币")
    s = "比特币"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("添加词典后,精确模式:", strings.Join(words, "/"))


    s = "他来到了网易杭研大厦"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("新词识别:", strings.Join(words, "/"))

    s = "小明硕士毕业于**科学院计算所,后在日本京都大学深造"
    words = x.CutForSearch(s, use_hmm)
    fmt.Println(s)
    fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

    s = "长春市长春药店"
    words = x.Tag(s)
    fmt.Println(s)
    fmt.Println("词性标注:", strings.Join(words, ","))

    s = "区块链"
    words = x.Tag(s)
    fmt.Println(s)
    fmt.Println("词性标注:", strings.Join(words, ","))

    s = "长江大桥"
    words = x.CutForSearch(s, !use_hmm)
    fmt.Println(s)
    fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

    wordinfos := x.Tokenize(s, gojieba.SearchMode, !use_hmm)
    fmt.Println(s)
    fmt.Println("Tokenize:(搜索引擎模式)", wordinfos)

    wordinfos = x.Tokenize(s, gojieba.DefaultMode, !use_hmm)
    fmt.Println(s)
    fmt.Println("Tokenize:(默认模式)", wordinfos)

    ex := gojieba.NewExtractor()
    defer ex.Free()
    keywords := ex.ExtractWithWeight(s, 5)
    fmt.Println("Extract:", keywords)
}
2016-09-06 16:10:10 ./deps/cppjieba/DictTrie.hpp:153 FATAL exp: [ifs.is_open()] false. open /Users/lingchax/.go/src/github.com/yanyiwu/gojieba/dict/jieba.dict.utf8 failed.
SIGABRT: abort
PC=0x7fff9080cf06 m=0
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x40925b0, 0xc42004f9a0, 0xc400000000)
        /usr/local/Cellar/go/1.7/libexec/src/runtime/cgocall.go:131 +0x110 fp=0xc42004f970 sp=0xc42004f930
hello/vendor/github.com/yanyiwu/gojieba._Cfunc_NewJieba(0x4503290, 0x45032e0, 0x4503370, 0x0)
        ??:0 +0x4e fp=0xc42004f9a0 sp=0xc42004f970
hello/vendor/github.com/yanyiwu/gojieba.NewJieba(0x0, 0x0, 0x0, 0x0)
        /Users/lingchax/.go/src/hello/vendor/github.com/yanyiwu/gojieba/jieba.go:37 +0x1b3 fp=0xc42004fa90 sp=0xc42004f9a0
main.main()
        /Users/lingchax/.go/src/hello/main.go:14 +0x51 fp=0xc42004ff48 sp=0xc42004fa90
runtime.main()
        /usr/local/Cellar/go/1.7/libexec/src/runtime/proc.go:183 +0x1f4 fp=0xc42004ffa0 sp=0xc42004ff48
runtime.goexit()
        /usr/local/Cellar/go/1.7/libexec/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42004ffa8 sp=0xc42004ffa0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/Cellar/go/1.7/libexec/src/runtime/asm_amd64.s:2086 +0x1

rax    0x0
rbx    0x6
rcx    0x7fff5fbff0f8
rdx    0x0
rdi    0x307
rsi    0x6
rbp    0x7fff5fbff120
rsp    0x7fff5fbff0f8
r8     0x8
r9     0x0
r10    0x8000000
r11    0x206
r12    0x7fff5fbff40a
r13    0x4802000
r14    0x7fff78114000
r15    0x7fff759af398
rip    0x7fff9080cf06
rflags 0x206
cs     0x7
fs     0x0
gs     0x0
exit status 2

go env:

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/lingchax/.go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.7/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.7/libexec/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/1x/d4dgqvms23bgmjpyshv4j6580000gn/T/go-build873055974=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

In part-of-speech tagging, real numbers are tagged as eng

s := `电阻\2.21kΩ±1% 1/10W\0603`
sl := x.Tag(s)
fmt.Println(sl)

Result:
[电阻/n /x 2.21/eng k/x Ω/x ±/x 1/x %/x /x 1/x //x 10/m W/x /x 0603/m]
English [2.21]
Number [10 0603]

Update:
In jieba the result is tagged as num; I hope this can be fixed.

words = psg.cut(r"电阻\2.21kΩ±1% 1/10W\0603")
for word, flag in words :
    print("%s %s"%(word, flag))
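
Each item in the Go result above has the form word/tag, and the word can itself be a slash (the item //x). Splitting on every slash would misread such items, so a safer parse cuts at the last slash only. A minimal sketch follows; splitTagged is a hypothetical helper, not a gojieba function.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTagged splits a "word/tag" item at its last slash, so a word that
// is itself "/" (as in "//x") is kept intact. ok is false when the item
// contains no slash at all.
func splitTagged(item string) (word, tag string, ok bool) {
	i := strings.LastIndex(item, "/")
	if i < 0 {
		return "", "", false
	}
	return item[:i], item[i+1:], true
}

func main() {
	for _, item := range []string{"电阻/n", "2.21/eng", "//x", "10/m"} {
		word, tag, ok := splitTagged(item)
		fmt.Printf("%q %q %v\n", word, tag, ok)
	}
}
```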

Cross-platform compilation

Is there no way to cross-compile this? I am on a Mac and want to build a binary for Linux. How can that be done?

Windows 7 64-bit: testing the gojieba bleve example code gives undefined: gojieba.Jieba

The environment is as in the title, with Go 1.7.3.

>go build blevemaintest.go
github.com\yanyiwu\gojieba\bleve\tokenizer.go:12: undefined: gojieba.Jieba
>cgo  blevemain.go
cannot find import "C"

I found that gojieba.Jieba is defined in jieba.go through cgo (C/C++):

package gojieba

/*
#cgo CXXFLAGS: -I./deps -DLOGGING_LEVEL=LL_WARNING -O3 -Wall
#include <stdlib.h>
#include "jieba.h"
*/
import "C"
import "unsafe"

type TokenizeMode int

const (
	DefaultMode TokenizeMode = iota
	SearchMode
)

type Word struct {
	Str   string
	Start int
	End   int
}

type Jieba struct {
	jieba C.Jieba
}

How can I resolve the undefined: gojieba.Jieba error caused by cgo?

Question: loading a custom dictionary file crashes

// The following code crashes the program
gojieba.NewJieba("./dict/a.txt")

How should this problem be handled?

OS: Debian
Golang version: 1.8.1

// Crash output below
2017-04-19 06:09:43 ./deps/cppjieba/DictTrie.hpp:160 FATAL exp: [buf.size() == DICT_COLUMN_NUM] false. split result illegal, line:
SIGABRT: abort
PC=0x7f3409ad3067 m=4 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x6db7a0, 0xc42045fe00, 0xc42045fe00)
/usr/local/go/src/runtime/cgocall.go:131 +0xe2 fp=0xc42045fdd0 sp=0xc42045fd90
github.com/yanyiwu/gojieba._Cfunc_NewJieba(0x7f34040008c0, 0x7f34040008e0, 0x7f3404000940, 0x7f34040009a0, 0x7f34040009f0, 0x0)
github.com/yanyiwu/gojieba/_obj/_cgo_gotypes.go:214 +0x4e fp=0xc42045fe00 sp=0xc42045fdd0
github.com/yanyiwu/gojieba.NewJieba(0xc42045ff38, 0x1, 0x1, 0x0)
/home/lifelong-study/go/src/github.com/yanyiwu/gojieba/jieba.go:43 +0x268 fp=0xc42045ff00 sp=0xc42045fe00
main.main()
/home/lifelong-study/Desktop/Go/Jieba/Preupload/main.go:56 +0x1e9 fp=0xc42045ff88 sp=0xc42045ff00
runtime.main()
/usr/local/go/src/runtime/proc.go:185 +0x20a fp=0xc42045ffe0 sp=0xc42045ff88
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42045ffe8 sp=0xc42045ffe0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1

goroutine 5 [select]:
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42005c720, 0xc42005c780)
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:75 +0x13a
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:67 +0xdf

goroutine 6 [select]:
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42005c720, 0xc42005c780)
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:75 +0x13a
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:67 +0xdf

goroutine 7 [select]:
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42005c720, 0xc42005c780)
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:75 +0x13a
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:67 +0xdf

goroutine 8 [select]:
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42005c720, 0xc42005c780)
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:75 +0x13a
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
/home/lifelong-study/go/src/github.com/blevesearch/bleve/index/analysis.go:67 +0xdf

rax 0x0
rbx 0x7f3408a9a910
rcx 0xffffffffffffffff
rdx 0x6
rdi 0x76c6
rsi 0x76c9
rbp 0x7f340a66a660
rsp 0x7f3408a9a548
r8 0x7f3408a9b700
r9 0x7f3408a9b700
r10 0x8
r11 0x206
r12 0x7f340a668740
r13 0x7f3408a9a690
r14 0x7f3408a9a700
r15 0x7f3408a9a730
rip 0x7f3409ad3067
rflags 0x206
cs 0x33
fs 0x0
gs 0x0
exit status 2
[Finished in 1.6s with exit code 1]
[shell_cmd: go run /home/lifelong-study/Desktop/Go/Jieba/Preupload/main.go]
[dir: /home/lifelong-study/Desktop/Go/Jieba/Preupload]
[path: /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/go/bin]

Stop words have no effect in Extract

Add “硕士” to ./dict/stop_words.utf8:

package main

import (
	"fmt"

	"github.com/yanyiwu/gojieba"
)

func main() {
	dictDir := "./dict/"
	x := gojieba.NewJieba(dictDir+"jieba.dict.utf8", dictDir+"hmm_model.utf8", dictDir+"user.dict.utf8", dictDir+"idf.utf8", dictDir+"stop_words.utf8")
	defer x.Free()

	s := "小明硕士毕业于**科学院计算所,后在日本京都大学深造"
	keywords := x.Extract(s, 5)
	fmt.Println("Extract:", keywords)
}

go run main.go
Extract: [日本京都大学 计算所 小明 深造 硕士]
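
Until the cause is understood, one caller-side workaround is to filter the extracted keywords against the stop-word list after Extract returns. The sketch below is only an illustration: `parseStopWords` and `filterStopWords` are hypothetical helpers, not part of gojieba, and the keyword slice is copied from the output above instead of being produced by a live call.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords builds a set from stop-word file content, one word per line.
// (Hypothetical helper; gojieba does not provide it.)
func parseStopWords(content string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, line := range strings.Split(content, "\n") {
		if w := strings.TrimSpace(line); w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

// filterStopWords returns the keywords that are not in stop, preserving order.
func filterStopWords(keywords []string, stop map[string]struct{}) []string {
	kept := make([]string, 0, len(keywords))
	for _, k := range keywords {
		if _, isStop := stop[k]; !isStop {
			kept = append(kept, k)
		}
	}
	return kept
}

func main() {
	// In a real program the content would be read from ./dict/stop_words.utf8.
	stop := parseStopWords("的\n了\n硕士\n")
	// Keywords copied from the Extract output reported above.
	keywords := []string{"日本京都大学", "计算所", "小明", "深造", "硕士"}
	fmt.Println("Extract:", filterStopWords(keywords, stop))
	// prints: Extract: [日本京都大学 计算所 小明 深造]
}
```

Because filtering happens after extraction, the result can contain fewer keywords than requested; asking Extract for a few extra entries would compensate.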

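Shouldn't the user-defined dictionary take priority?

For example, the user-defined dictionary contains

but the jieba dictionary contains

啊啊啊

The result is that segmentation gives priority to jieba's built-in dictionary, and I could not find any way to configure the priority. How can I make the user-defined dictionary take precedence?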

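The instance cannot be used when it is initialized in a different package

It works when everything is done within the same package, but my project has a dedicated initialization package.
When NewJieba is called in that initialization package, other packages that use the resulting *Jieba object crash.
Error message: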
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x34b6808 pc=0x9b9b8a]

runtime stack:
runtime.throw(0xafb348, 0x2a)
/data/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
/data/go/src/runtime/sigpanic_unix.go:12 +0x2cc

goroutine 285 [syscall, locked to thread]:
runtime.cgocall(0x9a8630, 0xc4206d7518, 0xc400000000)
/data/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4206d74d0 sp=0xc4206d7490
github.com/yanyiwu/gojieba._Cfunc_ExtractWithWeight(0x20b75f0, 0x7f5be80008c0, 0x4, 0x0)
??:0 +0x4e fp=0xc4206d7518 sp=0xc4206d74d0
github.com/yanyiwu/gojieba.(*Jieba).ExtractWithWeight(0xc42002e4d0, 0xc420c48300, 0x27, 0x4, 0x0, 0x0, 0x0)
/data/gowork/src/github.com/yanyiwu/gojieba/jieba.go:130 +0x10c fp=0xc4206d7588 sp=0xc4206d7518

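English segmentation issue: words are split into letters

With certain settings, English text is expected to be split into words but is actually split into individual letters. For example: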

s := "这是一段中文, This is an English sentence"
resF := x.Cut(s, false)  
 // "这/是/一段/中文/,/ /T/h/i/s/ /i/s/ /a/n/ /E/n/g/l/i/s/h/ /s/e/n/t/e/n/c/e/"

resT := x.Cut(s, true)   
// "这是/一段/中文/,/ /This/ /is/ /an/ /English/ /sentence/"

res := x.CutAll(s)    
// "这/是/一段/中文/,/ /T/h/i/s/ /i/s/ /a/n/ /E/n/g/l/i/s/h/ /s/e/n/t/e/n/c/e/"

resF = x.CutForSearch(s, false)   
// "这/是/一段/中文/,/ /T/h/i/s/ /i/s/ /a/n/ /E/n/g/l/i/s/h/ /s/e/n/t/e/n/c/e/"

resT = x.CutForSearch(s, true)
// "这是/一段/中文/,/ /This/ /is/ /an/ /English/ /sentence/"

In summary, word-level splitting of English seems to happen only when hmm=true; Cut(s, hmm=false), CutAll(s), and CutForSearch(s, hmm=false) all split English into letters.

Hoping this can be fixed.

Adding the new word 山东 makes segmentation wrong

package main

import (
	"fmt"
	"strings"

	"github.com/yanyiwu/gojieba"
)

func main() {
	var s string
	var words []string
	x := gojieba.NewJieba()
	defer x.Free()
	s = "山东苹果"
	words = x.Cut(s, false)
	fmt.Println("cut:", strings.Join(words, "/"))
	x.AddWord("山东")
	words = x.Cut(s, false)
	fmt.Println("cut:", strings.Join(words, "/"))
}

Output
cut: 山东/苹果
cut: 山/东/苹果

So explicitly adding the word 山东 actually causes segmentation to fail?

Hello, it fails to run after installation

2019-08-21 09:53:57 /home/lxx/code/go/src/di/marmot/vendor/github.com/yanyiwu/gojieba/deps/cppjieba/HMMModel.hpp:43 FATAL exp: [tmp.size() == STATUS_SUM] false.
SIGABRT: abort
PC=0x7f64097961d7 m=11 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0xa04010, 0xc420057970, 0xc420057978)
/usr/lib/golang/src/runtime/cgocall.go:132 +0xe4 fp=0xc420057940 sp=0xc420057900 pc=0x409ff4
di/marmot/vendor/github.com/yanyiwu/gojieba._Cfunc_NewJieba(0x7f63e00008c0, 0x7f63e00008f0, 0x7f63e0000920, 0x7f63e0000990, 0x7f63e0000a00, 0x0)
di/marmot/vendor/github.com/yanyiwu/gojieba/_obj/_cgo_gotypes.go:214 +0x4e fp=0xc420057970 sp=0xc420057940 pc=0x80ba9e
di/marmot/vendor/github.com/yanyiwu/gojieba.NewJieba(0xc429e235e0, 0x2, 0x2, 0x0)
/home/lxx/code/go/src/di/marmot/vendor/github.com/yanyiwu/gojieba/jieba.go:37 +0x268 fp=0xc420057a70 sp=0xc420057970 pc=0x80c088
di/marmot/nlp.NewFeatureSegmenter(0xc429e235e0, 0x2, 0x2, 0x1)
/home/lxx/code/go/src/di/marmot/nlp/feature_segment.go:10 +0x43 fp=0xc420057ab0 sp=0xc420057a70 pc=0x83f5c3
di/marmot/adserver/route.NewRestAPI(0xc420037c00, 0xc420087a90)
/home/lxx/code/go/src/di/marmot/adserver/route/rest_api.go:176 +0xf42 fp=0xc420057e78 sp=0xc420057ab0 pc=0x9e9ea2
main.main()
/home/lxx/code/go/src/di/marmot/adserver/main.go:27 +0x2ef fp=0xc420057f80 sp=0xc420057e78 pc=0xa0307f
runtime.main()
/usr/lib/golang/src/runtime/proc.go:195 +0x226 fp=0xc420057fe0 sp=0xc420057f80 pc=0x4344d6
runtime.goexit()
/usr/lib/golang/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc420057fe8 sp=0xc420057fe0 pc=0x460a31

rax 0x0
rbx 0x7f6404df5920
rcx 0xffffffffffffffff
rdx 0x6
rdi 0x3da0
rsi 0x3daa
rbp 0x7f6404df5510
rsp 0x7f6404df53c8
r8 0x1
r9 0x7f6404df7700
r10 0x8
r11 0x206
r12 0x7f640a3b9140
r13 0x7f640a3b7dc0
r14 0x7f63e0000dc0
r15 0x7f6404df5580
rip 0x7f64097961d7
rflags 0x206
cs 0x33
fs 0x0
gs 0x0

Add part-of-speech and weight attributes to words

type WordWeight struct {
Word string
Weight float64
Tag string
}

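Rename WordWeight to WordProperty.

Ideally, the tokens returned by every segmentation mode would carry attributes such as word frequency and part of speech.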

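Segmentation results differ from the Python version of jieba

Why do the segmentation results differ from Python's jieba? Python's dict.txt and gojieba's jieba.dict.utf8 look almost the same, yet the results differ, which makes it impossible to keep online and offline processing consistent.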

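How can CPU usage be reduced?

My use case has little Chinese; the vast majority of the text is English. Only common Chinese words need to be segmented.
At 2000 to 3000 segmentations per second, top shows CPU usage above 100%. How can I reduce it?

I currently segment with CutForSearch(str, true).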
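
For a workload like this, one option worth measuring is a fast path that skips the segmenter entirely when the input contains no Han characters. The sketch below is an assumption-laden illustration, not a gojieba feature: `containsHan` and `segment` are hypothetical helpers, and `cut` stands in for whatever segmentation call the application uses.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// containsHan reports whether s has at least one Han character.
func containsHan(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			return true
		}
	}
	return false
}

// segment splits Han-free text on whitespace and calls cut,
// the caller-supplied (expensive) segmenter, only for text containing Han.
func segment(s string, cut func(string) []string) []string {
	if !containsHan(s) {
		return strings.Fields(s)
	}
	return cut(s)
}

func main() {
	calls := 0
	// countingCut is a stand-in segmenter that records how often it is invoked.
	countingCut := func(s string) []string {
		calls++
		return []string{s}
	}

	english := segment("plain english text", countingCut)
	fmt.Println(english, "segmenter calls:", calls)

	mixed := segment("中文 mixed", countingCut)
	fmt.Println(mixed, "segmenter calls:", calls)
}
```

With gojieba, `cut` could be `func(t string) []string { return x.CutForSearch(t, true) }`. The fast path is not output-compatible with the segmenter: `strings.Fields` drops whitespace and leaves punctuation attached to words, so it only fits if downstream code tolerates that difference.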

A dictionary with only one line has no effect.

Code:

package main

import (
	"fmt"
	"strings"

	"github.com/yanyiwu/gojieba"
)

func main() {
	var s string
	var words []string
	// use_hmm := true
	x := gojieba.NewJieba([]string{"./user.dict.utf8"}...)
	defer x.Free()
	s = "王者荣耀"
	words = x.Tag(s)
	fmt.Println("精确模式:", strings.Join(words, "/"))

}

Dictionary 1

王者荣耀 1 n

Result 1

精确模式: 王者/x/荣耀/x

Dictionary 2

王者荣耀 1 n
云计算 1 n

Result 2

精确模式: 王者荣耀/n

AddWord produces incorrect segmentation

Scenario: handling politically sensitive terms, for example: 布局十九大
After calling AddWord("十九大"), the segmentation result is: 布局/十九/大

install package error

github.com/yanyiwu/gojieba

cc1.exe: sorry, unimplemented: 64-bit mode not compiled in

Incompatible with Go 1.9?

When execution reaches
	return &Jieba{
		C.NewJieba(
			dpath,
			hpath,
			upath,
			ipath,
			spath,
		),
	}

the program exits immediately.

The usage example produces multiple "not defined" errors

github.com/yanyiwu/gojieba(.text): _ZNSt13basic_filebufIcSt11char_traitsIcEED1Ev: not defined
github.com/yanyiwu/gojieba(.text): _ZNSt7__cxx1119basic_ostringstreamIcSt11char_traitsIcESaIcEEC1ESt13_Ios_Openmode: not defined
github.com/yanyiwu/gojieba(.text): _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_x: not defined
github.com/yanyiwu/gojieba(.text): _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_x: not defined
github.com/yanyiwu/gojieba(.text): _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc: not defined
github.com/yanyiwu/gojieba(.text): _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_x: not defined
C:\Go\pkg\tool\windows_amd64\link.exe: too many errors
exit status 2

Issue: Building for Linux on macOS reports undefined: gojieba.NewJieba

package main

import (
    "fmt"

    "github.com/yanyiwu/gojieba"
)

func main() {
    res := SplitWords("北京欢迎你", "all", "")
    fmt.Println(res)
}

func SplitWords(text, model, dict string) []string {
    var words []string
    jb := gojieba.NewJieba()

    if dict != "" {
        jb.AddWord(dict)
    }
    defer jb.Free()

    switch model {
    case "all":
        words = jb.CutAll(text)
    case "accurate":
        words = jb.Cut(text, true)
    }
    return words
}
> export GOOS=linux
> go build ts.go
> # command-line-arguments
> ./ts.go:16:8: undefined: gojieba.NewJieba
  • Environment: macOS High Sierra 10.13.6 (17G65)
  • With export GOOS=darwin the build succeeds.
  • The problem occurs with GOOS set to either linux or windows.

Files not found after godep

../vendor/github.com/yanyiwu/gojieba/jieba.cpp:5:10: fatal error: 'cppjieba/Jieba.hpp' file not found

After running godep, the build reports that the file does not exist.
