go-ego / gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages.

License: Apache License 2.0

Go 99.94% HTML 0.06%

go segment nlp gse hmm jieba hmm-viterbi-algorithm trie chinese english

gse's Introduction

gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages, and works with Elasticsearch and Bleve.

简体中文

Gse implements jieba in Go, and aims to add NLP support and more features.

Features:

Supports multiple word segmentation modes: common, search engine, full, precise, and HMM;
Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment info analysis, and stop-word and trim filtering;
Supports multiple languages: English, Chinese, Japanese, and others;
Supports Traditional Chinese;
Supports HMM text cutting using the Viterbi algorithm;
Supports NLP via TensorFlow (in progress);
Named Entity Recognition (in progress);
Works with Elasticsearch and Bleve;
Runs as a JSON RPC service.

Algorithm:

The dictionary is implemented with a double-array trie;
The segmenter uses a shortest-path algorithm (based on word frequency and dynamic programming), together with DAG and HMM word segmentation.
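
The shortest-path idea can be sketched as dynamic programming over a toy frequency table. This is only an illustration under assumptions: the `freq` map, the `segment` function, and the frequency-1 fallback for unknown characters are hypothetical, and this is not gse's actual implementation, which stores its dictionary in a double-array trie.

```go
package main

import (
	"fmt"
	"math"
)

// segment splits text into the word sequence with the highest total
// log-probability, using dynamic programming over rune positions.
// freq is a toy word -> frequency table (hypothetical, not gse's dictionary).
func segment(text string, freq map[string]float64) []string {
	runes := []rune(text)
	n := len(runes)

	total := 0.0
	for _, f := range freq {
		total += f
	}
	logTotal := math.Log(total)

	// best[i] is the best score for runes[i:]; next[i] is the end index of
	// the first word on that best path.
	best := make([]float64, n+1)
	next := make([]int, n+1)
	for i := n - 1; i >= 0; i-- {
		// Fallback: treat a single rune as an unknown word with frequency 1.
		best[i] = math.Log(1) - logTotal + best[i+1]
		next[i] = i + 1
		for j := i + 1; j <= n; j++ {
			f, ok := freq[string(runes[i:j])]
			if !ok {
				continue
			}
			if score := math.Log(f) - logTotal + best[j]; score > best[i] {
				best[i] = score
				next[i] = j
			}
		}
	}

	// Follow the recorded choices from the start of the text.
	var words []string
	for i := 0; i < n; i = next[i] {
		words = append(words, string(runes[i:next[i]]))
	}
	return words
}

func main() {
	freq := map[string]float64{"你好": 50, "世界": 40, "你": 10, "好": 10}
	fmt.Println(segment("你好世界", freq))
}
```

Working backwards from the end of the text keeps each position's best continuation available when shorter prefixes are scored, so every candidate word is looked up only once per start position.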

Text segmentation speed:

Single thread: 9.2 MB/s
Concurrent goroutines: 26.8 MB/s
HMM text segmentation, single thread: 3.2 MB/s (2-core, 4-thread MacBook Pro).

Binding:

gse-bind: bindings for JavaScript and other languages.

Install / update

With Go module support (Go 1.11+), just import:

import "github.com/go-ego/gse"

Otherwise, to install the gse package, run the command:

go get -u github.com/go-ego/gse

Use

package main

import (
	_ "embed"
	"fmt"

	"github.com/go-ego/gse"
)

//go:embed testdata/test_en2.txt
var testDict string

//go:embed testdata/test_en.txt
var testEn string

var (
	text  = "To be or not to be, that's the question!"
	test1 = "Hiworld, Helloworld!"
)

func main() {
	var seg1 gse.Segmenter
	seg1.DictSep = ","
	err := seg1.LoadDict("./testdata/test_en.txt")
	if err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	s1 := seg1.Cut(text)
	fmt.Println("seg1 Cut: ", s1)
	// seg1 Cut:  [to be   or   not to be ,   that's the question!]

	var seg2 gse.Segmenter
	seg2.AlphaNum = true
	if err := seg2.LoadDict("./testdata/test_en_dict3.txt"); err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	s2 := seg2.Cut(test1)
	fmt.Println("seg2 Cut: ", s2)
	// seg2 Cut:  [hi world ,   hello world !]

	var seg3 gse.Segmenter
	seg3.AlphaNum = true
	seg3.DictSep = ","
	err = seg3.LoadDictEmbed(testDict + "\n" + testEn)
	if err != nil {
		fmt.Println("loadDictEmbed error: ", err)
	}
	s3 := seg3.Cut(text + test1)
	fmt.Println("seg3 Cut: ", s3)
	// seg3 Cut:  [to be   or   not to be ,   that's the question! hi world ,   hello world !]

	// example2()
}

Example2:

package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! こんにちは世界, 你好世界."

	new, _ = gse.New("zh,testdata/test_en_dict3.txt", "alpha")

	seg gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	// Loading the default dictionary
	seg.LoadDict()
	// Loading the default dictionary with embed
	// seg.LoadDictEmbed()
	//
	// Loading the Simplified Chinese dictionary
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Loading the Traditional Chinese dictionary
	// seg.LoadDict("zh_t")
	//
	// Loading the Japanese dictionary
	// seg.LoadDict("jp")
	//
	// Load the dictionary
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	pos.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

import (
	"fmt"
	_ "embed"

	"github.com/go-ego/gse"
)

//go:embed test_en_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/zh/test_dict.txt, testdata/zh/test_dict1.txt")
	// seg.LoadStop()
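	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		// err must be used, otherwise the program does not compile
		fmt.Println("NewEmbed error: ", err)
	}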
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()

	text1 := "Hello world, こんにちは世界, 你好世界!"
	s1 := seg.Cut(text1, true)
	fmt.Println(s1)
	fmt.Println("trim: ", seg.Trim(s1))
	fmt.Println("stop: ", seg.Stop(s1))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}

Look at a Chinese example

Look at a Japanese example

Elasticsearch

How to use it with elasticsearch?

go-gse-elastic

Authors

License

Gse is primarily distributed under the terms of "both the MIT license and the Apache License (Version 2.0)". See LICENSE-APACHE, LICENSE-MIT.

Thanks to sego and jieba (jiebago).

gse's Issues

How can I match the most complete words possible?

Please speak English, this is the language everybody of us can speak and write.
Please take a moment to search that an issue doesn't already exist.
Please ask questions or config/deploy problems on our Gitter channel: https://gitter.im/go-ego/ego
Please give all relevant information below for bug reports, incomplete details will be handled as an invalid report.

You MUST delete the content above including this line before posting, otherwise your issue will be invalid.

Gse version (or commit ref):
Go version:
Operating system and bit:
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant
Provide example code:

Log gist:

Description

...

For example:

Open the following website:
https://www.qqxiuzi.cn/zh/pinyin/
Paste the following text into the text box:

 那是力争上游的一种树，笔直的干，笔直的枝。它的干呢，通常是丈把高，像是加以人工似的，一丈以内，绝无旁枝；它所有的桠枝呢，一律向上，而且紧紧靠拢，也像是加以人工似的，成为一束，绝无横斜逸出；它的宽大的叶子也是片片向上，几乎没有斜生的，更不用说倒垂了；它的皮，光滑而有银色的晕圈，微微泛出淡青色。这是虽在北方的风雪的压迫下却保持着倔强挺立的一种树！哪怕只有碗来粗细罢，它却努力向上发展，高到丈许，两丈，参天耸立，不折不挠，对抗着西北风。

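Click the "convert to pinyin" button,
and the single correct reading of each polyphonic word is shown.

The result I want is similar to the website above.

For example, when a text contains several polyphonic characters, I would like to use gse to determine the one correct reading of each of them in context.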

elastic in pure golang

https://github.com/prabhatsharma/zinc

it works...

In Chinese word segmentation, only a single word is separated

Execute the following code (tabooSegmentCustomDicList contains more than 2000 words):

for _, tabooSegmentCustomDic := range tabooSegmentCustomDicList {
	lowerCaseWord := strings.ToLower(tabooSegmentCustomDic.Word)
	segmentutil.AddWord(lowerCaseWord)
}

// AddWord adds a custom word to the segmenter with frequency 100.
func AddWord(word string) bool {
	defer recoverPanic(word)
	err := seg.AddToken(word, 100)
	if err != nil {
		logger.Errorf("Error when AddWord, %s: %v", word, err)
		return false
	}
	return true
}

// TextSegment cuts the text with the shared segmenter.
func TextSegment(text string) []string {
	defer recoverPanic(text)
	return seg.Cut(text)
}

TextSegment("api发送文本loumès 𝘾𝘼𝙍𝙏𝙄𝙀𝙍")

the result is ["api","发","送","文","本","lou","mès"," ","𝘾𝘼𝙍𝙏𝙄𝙀𝙍"]

How is the word frequency in the dictionary defined?

How is the word frequency in the segmentation dictionary defined and generated? Thanks.

Is there a bug in seg.ModeSegment?

The results of seg.Segment and seg.ModeSegment are the same. Is this a bug?

I thought the result of ModeSegment should be like that of seg.CutSearch.

test code:

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	seg  gse.Segmenter
	text = "《复仇者联盟3：无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
)

func main() {
	seg.LoadDict()
	addToken()
	cut()
}

func addToken() {
	seg.AddToken("《复仇者联盟3：无限战争》", 100, "n")
}

// Segment the text using DAG or HMM mode
func cut() {
	// "《复仇者联盟3：无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."

	// use DAG and HMM
	hmm := seg.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)
	// cut use hmm:  [《复仇者联盟3：无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

	cut := seg.Cut(text)
	fmt.Println("cut: ", cut)
	// cut:  [《 复仇者 联盟 3 ： 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

	hmm = seg.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	//cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3：无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
	fmt.Println("analyze: ", seg.Analyze(hmm, text))

	cut = seg.CutSearch(text)
	fmt.Println("cut search: ", cut)
	// cut search:  [《 复仇 者 复仇者 联盟 3 ： 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]

	segment1 := seg.Segment([]byte(text))
	for i, token := range segment1 {
		fmt.Println(i, token.Token().Text())
	}
	segment2 := seg.ModeSegment([]byte(text), true)
	for i, token := range segment2 {
		fmt.Println(i, token.Token().Text())
	}
}

Add more common APIs

Feature Request: Provide Segmenter's Read Dict From io.Reader func

Description

It would be useful to load dictionary resources from other sources, for example from the Go embed package.

Named entity support

named entity
optimize dictionary and model #13, #14

The stop-word dictionary never takes effect, even after adding words

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	text   = "第一次爱的人是谁演唱的"
	new, _ = gse.New("dict.txt")

	seg gse.Segmenter
)

func main() {
	cut()
}

func cut() {
	new.LoadStop("stop.txt")
	new.AddStop("的")
	new.AddStop("是") // adding this line does not help either
	fmt.Println("cut: ", new.Cut(text, true))
	fmt.Println("cut all: ", new.CutAll(text))
	fmt.Println("cut for search: ", new.CutSearch(text, true))
	fmt.Println(new.String(text, true))
}

// Console output:
//2022/02/18 17:44:34 Dict files path: [dict.txt]
//2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
//2022/02/18 17:44:34 Gse dictionary loaded finished.
//2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
//cut: [第一次爱的人是谁演唱的]
//cut all: [第一次爱的人是谁演唱的]
//cut for search: [第一次爱的人是谁演唱的]
//第一次爱的人/n 是/x 谁/x 演唱/v 的/x

how to search emoji?

Is AddToken() thread safe?

Add HMM and CRF support

Gse version (or commit ref): last

Add HMM and CRF support.

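Can segmentation filter out symbols and spaces?

When segmenting with the Slice method, the result always contains symbols and spaces, which are useless for segmentation.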

Why can't I load my user-defined dictionary?

seg.LoadDict("zh, /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt")

2019/07/18 10:40:48 Could not load dictionaries: " /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt", open /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt: no such file or directory

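Memory usage is too high

While using this package I found its memory usage too high; in top, half of the memory is taken by this package. Could it be optimized?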
Showing top 10 nodes out of 50
flat flat% sum% cum cum%
72.97MB 18.91% 18.91% 72.97MB 18.91% github.com/go-ego/gse/vendor/github.com/go-ego/cedar.(*Cedar).addBlock
60.51MB 15.68% 34.58% 60.51MB 15.68% database/sql.convertAssign
51.74MB 13.41% 47.99% 124.71MB 32.31% github.com/go-ego/gse.(*Dictionary).addToken
40.50MB 10.49% 58.48% 40.50MB 10.49% github.com/go-ego/gse.splitTextToWords
36.63MB 9.49% 67.97% 39.63MB 10.27% sync.(*Map).Store
27MB 7.00% 74.97% 27MB 7.00% github.com/go-ego/gse.(*Segmenter).segmentWords
21.50MB 5.57% 80.54% 337.34MB 87.41% sensitive/task.LoadSensitive
13.50MB 3.50% 84.04% 40.50MB 10.49% github.com/go-ego/gse.(*Segmenter).CalcToken
10.50MB 2.72% 86.76% 38MB 9.85% sensitive/task.Analyze.func2.1
7.50MB 1.94% 88.70% 13MB 3.37% github.com/modern-go/reflect2.(*UnsafeSliceType).UnsafeMakeSlice

ExtractTags and TextRank output blanks.

Gse version (or commit ref):
0.64.1
Go version:
1.15.6
Operating system and bit:
OS: macOS Catalina 10.15.7
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant
Provide example code:

Simple code as below:

    text := "那里湖面总是澄清, 那里空气充满宁静"

    seg := gse.New()
    log.Debug().Interface("cut", seg.Cut(text, true)).Interface("cut all", seg.CutAll(text)).Msg("")
    // output: cut=["那里","湖面","总是","澄清",", ","那里","空气","充满","宁静"] cut all=["那里","里湖","湖面","总是","澄清",","," ","那里","空气","充满","宁静"]

    // seg.LoadDict("./dictionary.txt")
    var tr idf.TextRanker
    tr.WithGse(seg)
    result := tr.TextRank(text, 5)
    log.Debug().Interface("text rank", result).Msg("")
    // output: text rank=[{},{},{}]

    var te idf.TagExtracter
    te.WithGse(seg)
    if err := te.LoadIdf(); err != nil {
        log.Error().Err(err).Msg("")
    }
    segments := te.ExtractTags(text, 5)
    log.Debug().Interface("segments", segments).Msg("")
    // output: segments=[{},{},{},{},{}]

Log gist:

Description

ExtractTags and TextRank output blanks.

Any advice?

V1 Release?

Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

Based on the release history, it looks like this library has been in use for quite a while, but it is still at v0. Is there any reason not to issue an official v1 release?

It would also be nice to see quality metrics on the readme, if you have any. E.g. comparison to data like https://universaldependencies.org/

The stop-word dictionary method never takes effect; what is going on?

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	text   = "第一次爱的人是谁演唱的"
	new, _ = gse.New("dict.txt")

	seg gse.Segmenter
)

func main() {
	cut()
}

// loadDictEmbed supported from go1.16
func loadDictEmbed() {
	seg.LoadDictEmbed()
	seg.LoadStopEmbed()
}

func cut() {
	new.LoadStop("stop.txt")
	new.IsStop("是") // after adding "是" to the stop dictionary, "是" still appears in the segmentation result
	fmt.Println("cut: ", new.Cut(text, true))
	fmt.Println("cut all: ", new.CutAll(text))
	fmt.Println("cut for search: ", new.CutSearch(text, true))
	fmt.Println(new.String(text, true))
}

// Output:
// cut: [第一次爱的人是谁演唱的]
//cut all: [第一次爱的人是谁演唱的]
//cut for search: [第一次爱的人是谁演唱的]
// 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x
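
In the README's custom dictionary example, stop words are removed by a separate `seg.Stop(s1)` call on the result of `Cut`, which suggests that filtering is its own step rather than something `Cut` applies automatically. The self-contained sketch below illustrates that two-step flow with a plain map. `filterStop` and the token list are hypothetical and do not use gse; whether this explains the output above is not confirmed here.

```go
package main

import "fmt"

// filterStop returns the tokens that are not in the stop set.
func filterStop(tokens []string, stop map[string]bool) []string {
	var kept []string
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Hand-written tokens standing in for a segmenter's output.
	tokens := []string{"第一次", "爱", "的", "人", "是", "谁", "演唱", "的"}
	stop := map[string]bool{"的": true, "是": true}
	fmt.Println(filterStop(tokens, stop))
}
```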

With Go module support (Go 1.11+), Error

The minimum supported version is actually Go 1.17.

github.com/go-ego/gse

../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:31:25: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:31:31: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:43:25: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:47:25: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:58:27: undefined: ja
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:65:27: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:68:27: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:140:27: undefined: stopDict
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:153:25: undefined: stopDict
note: module requires Go 1.17

"\001" in text gets error result

Gse version (or commit ref):
1fd1428
Go version:
1.14.2
Operating system and bit:
any
Can you reproduce the bug at Examples:
- No
- Yes (provide example code)
- Not relevant
Provide example code:

func TestSegment(t *testing.T) {
	seg := &gse.Segmenter{}
	err := seg.LoadDict("../data/dictionary.txt")
	if err != nil {
		t.Fatal(err)
	}
	data := []byte("\001你好吗", )
	res := seg.Segment(data)
	for _, re := range res {
		t.Log(re.Token().Text())
		t.Log(re.Start())
		t.Log(re.End())
	}
}

Log gist:


    TestSegment: process_test.go:51: 你
    TestSegment: process_test.go:52: 0
    TestSegment: process_test.go:53: 3
    TestSegment: process_test.go:51: 你好
    TestSegment: process_test.go:52: 3
    TestSegment: process_test.go:53: 9
    TestSegment: process_test.go:51: 吗
    TestSegment: process_test.go:52: 9
    TestSegment: process_test.go:53: 12

Description

The first token should be "\001", but we get the second word instead.
The start of the second token should be 1.
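
For reference, the expected byte offsets can be checked with the standard library alone. The sketch below is only an illustration: `runeSpans` is a hypothetical helper, not part of gse, and it simply lists the byte range of every rune so the offsets can be compared with the log above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// span is the half-open byte range [Start, End) of one rune.
type span struct {
	Text       string
	Start, End int
}

// runeSpans decodes the text rune by rune and records byte offsets.
// Invalid bytes are reported as one-byte spans, so End never exceeds
// len(text).
func runeSpans(text string) []span {
	var spans []span
	for i := 0; i < len(text); {
		_, size := utf8.DecodeRuneInString(text[i:])
		spans = append(spans, span{text[i : i+size], i, i + size})
		i += size
	}
	return spans
}

func main() {
	for _, s := range runeSpans("\001你好吗") {
		fmt.Printf("%q %d %d\n", s.Text, s.Start, s.End)
	}
}
```

The control character occupies one byte and each Chinese character three, so the whole input is 10 bytes long and no valid end offset can be larger than that.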

English cut bug

Gse version (or commit ref): 82fc9e4
Go version: 1.20
Operating system and bit: macOS 13.0
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant

seg.LoadDict("zh")
seg.LoadDict("en")
seg.LoadDict("jp")
seg.LoadStop("zh")
logrus.Debugln(seg.CutSearch("Nowadays, there are more and more misunderstanding between parents and children which is so- called generation gap. It is estimated that ( 75 percentages of parents often complain their children’s unreasonable behavior while children usually think their parents too old fashioned )."))

Log gist:

[nowadays ,   there   are   more   and   more   misunderstanding   between   parents   and   children   which   is   so -   called   generation   gap .   it   is   estimated   that   (   75   percentages   of   parents   often   complain   their   children ’ s   unreasonable   behavior   while   children   usually   think   their   parents   too   old   fashioned   ) .]

Description

I read data/en/dict.txt and found it empty, so it seems that gse does not support English text segmentation.

After loading a user-defined dictionary, segmentation does not work as expected.

err := segmenter.LoadDict(strings.Join(files, ","))

segmenter.cut()

The text is not segmented according to the custom dictionary at all.

Bleve

Has anyone tried using this with bleve?

Bleve does this plus a lot more, but lacks decent Chinese / Japanese stemmers.

Using this with bleve would be a powerful stack.

Could not load dictionaries

I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to get it running. So I have two suggestions:

Could the hard-coded dictionary location in gse be removed, or made configurable through parameters?
If dictionary data must be loaded, could it be converted into static Go code with "go-bindata" or a similar tool?

gse dict stops working after an empty string is added

Gse version: v0.80.2
Go version: 1.19

func main() {
	seg := new(gse.Segmenter)
	seg.Dict = gse.NewDict()
	seg.Init()
	seg.AddToken("bj", 100, "n")
	fmt.Println(seg.Dictionary())
	fmt.Println(seg.Find("bj"))
	seg.AddToken("", 100, "n")
	fmt.Println(seg.Dictionary())
	fmt.Println(seg.Find("bj"))
}
// output:
// &{0xc000140000 1 [{[[98 106]] 100 n 0 []}] 100}
// 100 n true
// &{0xc000140000 1 [{[[98 106]] 100 n 0 []} {[] 100 n 0 []}] 200}
// 0  false

Description

After an empty string is added with AddToken, the dictionary's Find function stops working and the instance can no longer segment strings.
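
As a caller-side workaround, empty input can be filtered out before it reaches AddToken. The sketch below is an assumption rather than a fix in the library: `cleanTokens` is a hypothetical helper, and the commented AddToken call only mirrors the call used in the example above.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTokens drops empty and whitespace-only entries so they are never
// passed on to the dictionary.
func cleanTokens(words []string) []string {
	var out []string
	for _, w := range words {
		if strings.TrimSpace(w) == "" {
			continue
		}
		out = append(out, w)
	}
	return out
}

func main() {
	for _, w := range cleanTokens([]string{"bj", "", "  "}) {
		// seg.AddToken(w, 100, "n") would be called here.
		fmt.Println("add:", w)
	}
}
```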

does this software support user custom dictionary?

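Does it support defining English phrases? I looked at the code and the comments are all in Chinese, so why not ask questions in Chinese?

I would like to define some phrases in the dictionary file, as is done for Chinese, for example:
opening the door
so that the cut method matches this phrase directly. Can anyone tell me how to change it?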

Question: Is there any way to get segment info(not only string but with start and end) in hmm and search mode?

Gse version (or commit ref):
0.60
Go version:
1.14
Operating system and bit:
macOS 10.14

Description

In my case, I need to get the start and end info of each word after segmenting in HMM and search mode.
From reading the APIs, I only found:

CutSearch(string, true), which only returns []string with no start and end info
Segment([]byte(text)), which returns segments with start and end info, but does not accept a parameter to choose search mode.

Is there any way to do something like Segment([]byte(text), searchMode)?

Custom English words are not segmented correctly

Add a custom English word:

UI

Segmenting "UIUI" still yields UIUI; it should yield two tokens, UI and UI.

Optimize the dictionary, classify the dictionary, add new words

Optimize the dictionary, classify the dictionary, add new words.

[WIP] feature: implement crf algorithm

pls assign me. :-) I want to implement it, but I think it is long term, so I add the [WIP] tag.

Add HMM basic support

Hmm basic support is completed.

How to build without embed dictionary on Go1.16 or above?

Hi,

I noticed that gse leads to a large binary size. After reviewing the code, I found the problem may lie here, which is caused by the embedded dictionary.

The program binary may change often, but the dictionary rarely changes. Is there a way to build a gse project without the embedded dictionary?

Thanks.

Gse version (or commit ref): 0.70.1
Go version: 1.17
Operating system and bit: Ubuntu 20.04 64bit
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant

Optimize the HMM model

Optimize the HMM model.

Japanese examples use simplified Chinese hanzi

It's hard to read if you only know Japanese.

Found a bug in file dict_util.go

In this func:
func (seg *Segmenter) Reader(reader *bufio.Reader, files ...string) error

these lines:
if fsErr != nil {
	if fsErr == io.EOF {
		...
	}
must be placed after:
...
seg.Dict.AddToken(token)
Otherwise the last line of the dictionary file is skipped, unless that last line is empty.
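
The ordering problem described above can be illustrated with a small, self-contained reader loop. It is a sketch under assumptions: `readTokens` and `parseDict` are hypothetical helpers, not the code in dict_util.go, and they only show that a line returned together with io.EOF must be processed before the loop exits.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readTokens returns the first field of every non-empty line read from r.
// bufio.Reader.ReadString can return data together with io.EOF when the
// last line has no trailing newline, so the line is processed before the
// error is inspected.
func readTokens(r io.Reader) ([]string, error) {
	br := bufio.NewReader(r)
	var tokens []string
	for {
		line, err := br.ReadString('\n')
		if fields := strings.Fields(line); len(fields) > 0 {
			tokens = append(tokens, fields[0])
		}
		if err == io.EOF {
			return tokens, nil
		}
		if err != nil {
			return tokens, err
		}
	}
}

// parseDict is a convenience wrapper for in-memory dictionary text.
func parseDict(text string) ([]string, error) {
	return readTokens(strings.NewReader(text))
}

func main() {
	// The final line deliberately has no trailing newline.
	tokens, err := parseDict("hello 10 n\nworld 20 n")
	fmt.Println(tokens, err)
}
```

Checking the error first and returning on io.EOF would drop "world" here, which matches the symptom reported in this issue.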

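What are the differences between the segmentation methods? Could you explain them?

cut search hm string
Hello, is there a detailed description of how these methods differ?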

Float should not be split

Splitting "loss of 76.7" gives "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

What can I do?

Gse version (or commit ref):v0.69.3
Go version:1.16

Can TF-IDF specify allowPOS by default?

Splitting "2021年09月10日": I want to get "2021年 / 09月 / 10日"

Splitting "2021年09月10日": I want to get "2021年 / 09月 / 10日".
Splitting "중국 규제 리스크에 울고 웃는 종목들현명하게 대응하려면": I want to get "중국 / 규제 / 리스크에...".

I use this code:
words := seg.Cut("2021年09月10日", true)

I changed this code in the package, but I do not think that is a good approach. Is there a better way?
File: hmm_seg.go, line 27
regSkip = regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)

Gse version (or commit ref):v0.69.3
Go version:1.16
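
The pattern quoted above can be tried on its own with Go's standard regexp package to see which substrings it would keep as single units. This is only a sketch: `chunks` is a hypothetical helper, and it does not show how gse applies the pattern internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepTogether is the pattern quoted in the issue above.
var keepTogether = regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)

// chunks returns the substrings that the pattern matches as single units.
func chunks(text string) []string {
	return keepTogether.FindAllString(text, -1)
}

func main() {
	fmt.Println(chunks("2021年09月10日"))
	fmt.Println(chunks("loss of 76.7"))
}
```

The order of the alternatives matters: `\d+\.\d+` comes before `[a-zA-Z0-9]+`, so a decimal number is matched whole instead of stopping at the dot.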

Invalid ranges produced on bad inputs

segments := segmenter.Segment([]byte(w))
for i, seg := range segments {
	if seg.End() > len(w) {
		log.Println("bad split: ", seg.Start(), "/", seg.End(), " w '", w, "' len ", len(w), "hex ", hex.EncodeToString([]byte(w)), "i ", i)
	} else {
		log.Println(w[seg.Start():seg.End()])
	}
}

The code above gives bad ranges on bad inputs, examples:
bad split: 3 / 6 w ' �缊 ' len 4 hex 01e7bc8a i 1
bad split: 5 / 8 w ' Vm�犹 ' len 6 hex 566d01e78ab9 i 2
bad split: 5 / 8 w ' Vm�犹 ' len 6 hex 566d01e78ab9 i 2
bad split: 3 / 6 w ' �榬 ' len 4 hex 01e6a6ac i 1

Expected behavior is that seg.End() should always be in range 0..len(w)

ego commit is recent, bdc71ec
go version go1.9 windows/amd64

package main

import (
	"encoding/hex"
	"flag"
	"fmt"
	"log"

	"github.com/go-ego/gse"
)

func main() {
	flag.Parse()
	var seg gse.Segmenter
	seg.LoadDict()
	text, _ := hex.DecodeString("01e7bc8a")
	segments := seg.Segment([]byte(text))
	fmt.Println(gse.ToString(segments, true))
	for _, seg := range segments {
		log.Println(text[seg.Start():seg.End()])
	}
}

2017/11/12 10:41:25 载入 gse 词典 C:/Users/Valle/go/src/github.com/go-ego/gse/data/dict/dictionary.txt
缊/zg 缊/zg 
2017/11/12 10:41:27 gse 词典载入完毕
2017/11/12 10:41:27 [1 231 188]

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.main()

SkipLog is enabled, but logs are still printed

Gse version (or commit ref): v0.80.2
Go version: 1.20.4
Operating system and bit: centOS 7.6
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant
Provide example code:

var tagExtracter *idf.TagExtracter
var onceSeg sync.Once
var err error
onceSeg.Do(func() {
	seg := gse.Segmenter{
		SkipLog: true,
	}
	err = seg.LoadDict()
	if err == nil {
		var te idf.TagExtracter
		te.WithGse(seg)
		err = te.LoadIdf()
		if err == nil {
			tagExtracter = &te
		}
	}
})

Log gist:

Description

SkipLog is enabled, but the dictionary loading logs are still printed:
2023/05/28 11:29:08 Dict files path: [/var/www/api/vendor/github.com/go-ego/gse/data/dict/zh/idf.txt]
2023/05/28 11:29:08 Load the gse dictionary: "/var/www/api/vendor/github.com/go-ego/gse/data/dict/zh/idf.txt"
2023/05/28 11:29:10 Gse dictionary loaded finished.

...

Expected: with SkipLog enabled, no logs are printed.

How to define custom English terms in a dictionary file

I would like to define some phrases in the dictionary file, as is done for Chinese, for example:
opening the door
so that the cut method matches this phrase directly. Can someone tell me how to change it?

There is a problem with Chinese word segmentation

Gse version (or commit ref): v0.71.0.695, Green Lake!
Go version:1.20
Operating system and bit: Mac OS
Can you reproduce the bug at Examples:
- Yes (provide example code)
Provide example code:

x, _ := gse.New()
fmt.Println(x.Cut("法院调解费是多少", true))

Log gist:
[法院调解费是多少]

Description

费是 ??
...

feature: any plan on implementing the bm25 algorithm ?

Description

I see the bm25.go file at hmm/bm25/bm25.go, so I would like to ask whether the author has any plans for BM25. 😃

If the author plans to implement BM25, I would like to work on it. 🫡

add english Participles

How to disable output of dictionary loading?

Information about dictionary loading is always printed:

2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"

My code is as follows:

	segmenter, err := gse.NewEmbed(static.DictFile)
	segmenter.LoadStopEmbed(static.StopFile)
	segmenter.MoreLog = false // seems to have no effect
	segmenter.SkipLog = true  // seems to have no effect

Is there any option to disable this output?
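
The timestamps in the output above match the default format of Go's standard `log` package. Assuming the messages go through that default logger (an assumption, not confirmed here), one possible workaround is to discard log output while the dictionaries load. `quietly` is a hypothetical helper, and the comment inside it only marks where the loading calls would go.

```go
package main

import (
	"fmt"
	"io"
	"log"
)

// quietly runs fn with the standard logger's output discarded, then
// restores the previous writer. It affects the whole process while fn
// runs, so it is best used once during start-up.
func quietly(fn func()) {
	prev := log.Writer()
	log.SetOutput(io.Discard)
	defer log.SetOutput(prev)
	fn()
}

func main() {
	quietly(func() {
		// Dictionary loading would go here, for example the NewEmbed and
		// LoadStopEmbed calls from the snippet above.
		log.Println("this line is not printed")
	})
	fmt.Println("dictionaries loaded")
}
```

Because the standard logger is global, other goroutines that log during the call are silenced as well, which is why this is only suggested for start-up code.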

Feature request: Any plan to support Korean?

Dear developer(s),

I wonder whether there is any plan to support Korean. Thanks. :)

Loading an 8 MB dictionary makes seg use a lot of memory, the largest share in the whole project. Is there any way to optimize this?

Gse version (or commit ref):
Go version:
Operating system and bit:
Can you reproduce the bug at Examples:
- Yes (provide example code)
- No
- Not relevant
Provide example code:

//go:embed zh/dictionary.txt
var zhSimpleDict string
var (
	seg gse.Segmenter
)
func Init() (err error) {
	return seg.LoadDictEmbed(zhSimpleDict)
}

Log gist:
(pprof) top
Showing nodes accounting for 138.46MB, 97.84% of 141.52MB total
Dropped 26 nodes (cum <= 0.71MB)
Showing top 10 nodes out of 35
flat flat% sum% cum cum%
40.95MB 28.93% 28.93% 77.66MB 54.87% github.com/go-ego/gse.(*Dictionary).AddToken
36.71MB 25.94% 54.87% 36.71MB 25.94% github.com/vcaesar/cedar.(*Cedar).addBlock
31.50MB 22.26% 77.13% 31.50MB 22.26% github.com/go-ego/gse.(*Segmenter).SplitTextToWords
9MB 6.36% 83.49% 129.18MB 91.28% github.com/go-ego/gse.(*Segmenter).LoadDictStr
8.99MB 6.35% 89.85% 8.99MB 6.35% strings.genSplit
5.52MB 3.90% 93.75% 5.52MB 3.90% embed.FS.ReadFile
2.02MB 1.43% 95.18% 2.02MB 1.43% github.com/go-ego/gse/hmm.loadDefEmit
1.78MB 1.26% 96.44% 1.78MB 1.26% github.com/mozillazg/go-pinyin.init
1.13MB 0.8% 97.23% 1.13MB 0.8% github.com/valyala/fasthttp/stackless.NewFunc
0.85MB 0.6% 97.84% 0.85MB 0.6% github.com/goccy/go-json/internal/decoder.init.0

Description

gse uses the most memory in the whole project. Is there any way to optimize it?
The dictionary is 8 MB in size.
...
