
gse's Introduction

gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages. It also works with Elasticsearch and Bleve.


Gse is a Go implementation of jieba that also aims to add NLP support and more features.

Features:

  • Supports multiple word segmentation modes: common, search engine, full, precise, and HMM
  • Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment info analysis, and stop/trim words
  • Supports multiple languages: English, Chinese, Japanese, and others
  • Supports Traditional Chinese
  • Supports HMM text cutting using the Viterbi algorithm
  • Supports NLP with TensorFlow (in progress)
  • Named Entity Recognition (in progress)
  • Works with Elasticsearch and Bleve
  • Runs as a JSON-RPC service

Algorithm:

  • The dictionary is implemented with a double-array trie
  • The segmenter uses a shortest-path algorithm (based on word frequency and dynamic programming), plus DAG and HMM word segmentation
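To make the "word frequency and dynamic programming" idea concrete, here is a minimal, self-contained sketch. It is not gse's implementation: the `segment` function, the toy frequency map, and the fallback frequency of 1 for unknown single runes are all assumptions made for illustration. It picks the split whose words have the highest total log-frequency score.

```go
package main

import (
	"fmt"
	"math"
)

// segment splits text into the dictionary words with the highest total
// log-frequency score. Runes not covered by the dictionary fall back to
// single-rune tokens with an assumed frequency of 1.
func segment(text string, freq map[string]float64) []string {
	runes := []rune(text)
	n := len(runes)

	var total float64
	for _, f := range freq {
		total += f
	}
	logTotal := math.Log(total)

	// best[i] is the best score for runes[i:]; next[i] is where the first
	// word on that best path ends.
	best := make([]float64, n+1)
	next := make([]int, n+1)
	for i := n - 1; i >= 0; i-- {
		best[i] = math.Inf(-1)
		for j := i + 1; j <= n; j++ {
			f, ok := freq[string(runes[i:j])]
			if !ok {
				if j > i+1 {
					continue // unknown multi-rune span: not a candidate
				}
				f = 1 // unknown single rune (illustrative fallback)
			}
			score := math.Log(f) - logTotal + best[j]
			if score > best[i] {
				best[i] = score
				next[i] = j
			}
		}
	}

	// Follow the recorded word boundaries from the start of the text.
	var out []string
	for i := 0; i < n; i = next[i] {
		out = append(out, string(runes[i:next[i]]))
	}
	return out
}

func main() {
	// Toy frequencies, made up for this example.
	freq := map[string]float64{"你好": 50, "世界": 40, "你": 10, "好": 10, "世": 5, "界": 5}
	fmt.Println(segment("你好世界", freq)) // [你好 世界]
}
```

Working backwards from the end of the text means each position is scored once per candidate word, which is the usual reason for using dynamic programming here rather than enumerating every possible split.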

Text Segmentation speed:

Binding:

gse-bind provides bindings for JavaScript and other languages.

Install / update

With Go module support (Go 1.11+), just import:

import "github.com/go-ego/gse"

Otherwise, to install the gse package, run the command:

go get -u github.com/go-ego/gse

Use

package main

import (
	_ "embed"
	"fmt"

	"github.com/go-ego/gse"
)

//go:embed testdata/test_en2.txt
var testDict string

//go:embed testdata/test_en.txt
var testEn string

var (
	text  = "To be or not to be, that's the question!"
	test1 = "Hiworld, Helloworld!"
)

func main() {
	var seg1 gse.Segmenter
	seg1.DictSep = ","
	err := seg1.LoadDict("./testdata/test_en.txt")
	if err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	s1 := seg1.Cut(text)
	fmt.Println("seg1 Cut: ", s1)
	// seg1 Cut:  [to be   or   not to be ,   that's the question!]

	var seg2 gse.Segmenter
	seg2.AlphaNum = true
	seg2.LoadDict("./testdata/test_en_dict3.txt")

	s2 := seg2.Cut(test1)
	fmt.Println("seg2 Cut: ", s2)
	// seg2 Cut:  [hi world ,   hello world !]

	var seg3 gse.Segmenter
	seg3.AlphaNum = true
	seg3.DictSep = ","
	err = seg3.LoadDictEmbed(testDict + "\n" + testEn)
	if err != nil {
		fmt.Println("loadDictEmbed error: ", err)
	}
	s3 := seg3.Cut(text + test1)
	fmt.Println("seg3 Cut: ", s3)
	// seg3 Cut:  [to be   or   not to be ,   that's the question! hi world ,   hello world !]

	// example2()
}

Example2:

package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! こんにちは世界, 你好世界."

	new, _ = gse.New("zh,testdata/test_en_dict3.txt", "alpha")

	seg gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	// Loading the default dictionary
	seg.LoadDict()
	// Loading the default dictionary with embed
	// seg.LoadDictEmbed()
	//
	// Loading the Simplified Chinese dictionary
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Loading the Traditional Chinese dictionary
	// seg.LoadDict("zh_t")
	//
	// Loading the Japanese dictionary
	// seg.LoadDict("jp")
	//
	// Load the dictionary
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	pos.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

import (
	"fmt"
	_ "embed"

	"github.com/go-ego/gse"
)

//go:embed test_en_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/zh/test_dict.txt, testdata/zh/test_dict1.txt")
	// seg.LoadStop()
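	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		// err must be used, otherwise the program does not compile.
		fmt.Println("NewEmbed error: ", err)
	}
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()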

	text1 := "Hello world, こんにちは世界, 你好世界!"
	s1 := seg.Cut(text1, true)
	fmt.Println(s1)
	fmt.Println("trim: ", seg.Trim(s1))
	fmt.Println("stop: ", seg.Stop(s1))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}

Look at a Chinese example

Look at a Japanese example

Elasticsearch

How to use it with elasticsearch?

go-gse-elastic

Authors

License

Gse is primarily distributed under the terms of "both the MIT license and the Apache License (Version 2.0)". See LICENSE-APACHE, LICENSE-MIT.

Thanks to sego and jieba (jiebago).

gse's People

Contributors

amazingrise, appleboy, cocainecong, magicaltux, simon-ding, suntong, vcaesar


gse's Issues

How can I match the most complete words possible?

  1. Please speak English, this is the language everybody of us can speak and write.
  2. Please take a moment to search that an issue doesn't already exist.
  3. Please ask questions or config/deploy problems on our Gitter channel: https://gitter.im/go-ego/ego
  4. Please give all relevant information below for bug reports, incomplete details will be handled as an invalid report.

You MUST delete the content above including this line before posting, otherwise your issue will be invalid.

  • Gse version (or commit ref):
  • Go version:
  • Operating system and bit:
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
  • Provide example code:
  • Log gist:

Description

...

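Can polyphonic characters in a text be tagged and annotated with their single correct reading?

For example:

  1. Open the following website:
    https://www.qqxiuzi.cn/zh/pinyin/

  2. Paste the following text into the text box: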

 那是力争上游的一种树,笔直的干,笔直的枝。它的干呢,通常是丈把高,像是加以人工似的,一丈以内,绝无旁枝;它所有的桠枝呢,一律向上,而且紧紧靠拢,也像是加以人工似的,成为一束,绝无横斜逸出;它的宽大的叶子也是片片向上,几乎没有斜生的,更不用说倒垂了;它的皮,光滑而有银色的晕圈,微微泛出淡青色。这是虽在北方的风雪的压迫下却保持着倔强挺立的一种树!哪怕只有碗来粗细罢,它却努力向上发展,高到丈许,两丈,参天耸立,不折不挠,对抗着西北风。
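  3. Click the convert-to-pinyin button.
    Each polyphonic word is then shown with its single reading.

The desired result is similar to what the website above produces.

For example, when a text contains several polyphonic characters, use gse to determine the one reading each of them has in context.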

In Chinese word segmentation, only a single word is separated

Execute the following code (tabooSegmentCustomDicList contains more than 2000 words):

for _, tabooSegmentCustomDic := range tabooSegmentCustomDicList {
	lowerCaseWord := strings.ToLower(tabooSegmentCustomDic.Word)
	segmentutil.AddWord(lowerCaseWord)
}

func AddWord(word string) bool {
	defer recoverPanic(word)
	err := seg.AddToken(word, 100)
	if err != nil {
		// Two arguments need two verbs; the original format string had only %s.
		logger.Errorf("Error when AddWord, %s: %v", word, err)
		return false
	}
	return true
}

func TextSegment(text string) []string {
	defer recoverPanic(text)
	return seg.Cut(text)
}

TextSegment("api发送文本loumès 𝘾𝘼𝙍𝙏𝙄𝙀𝙍")

the result is ["api","发","送","文","本","lou","mès"," ","𝘾𝘼𝙍𝙏𝙄𝙀𝙍"]

Is there a bug in seg.ModeSegment?

The results of seg.Segment and seg.ModeSegment are the same. Is this a bug?

I thought the result of ModeSegment should be like that of seg.CutSearch.

Test code:

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	seg  gse.Segmenter
	text = "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
)

func main() {
	seg.LoadDict()
	addToken()
	cut()
}

func addToken() {
	seg.AddToken("《复仇者联盟3:无限战争》", 100, "n")
}

// Segment using DAG or HMM mode
func cut() {
	// "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."

	// use DAG and HMM
	hmm := seg.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)
	// cut use hmm:  [《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

	cut := seg.Cut(text)
	fmt.Println("cut: ", cut)
	// cut:  [《 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]

	hmm = seg.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	//cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
	fmt.Println("analyze: ", seg.Analyze(hmm, text))

	cut = seg.CutSearch(text)
	fmt.Println("cut search: ", cut)
	// cut search:  [《 复仇 者 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]

	segment1 := seg.Segment([]byte(text))
	for i, token := range segment1 {
		fmt.Println(i, token.Token().Text())
	}
	segment2 := seg.ModeSegment([]byte(text), true)
	for i, token := range segment2 {
		fmt.Println(i, token.Token().Text())
	}
}

The stop-word dictionary never takes effect, even after adding it

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	text   = "第一次爱的人是谁演唱的"
	new, _ = gse.New("dict.txt")

	seg gse.Segmenter
)

func main() {
	cut()
}

func cut() {
	new.LoadStop("stop.txt")
	new.AddStop("的")
	new.AddStop("是") // adding this line does not help either
	fmt.Println("cut: ", new.Cut(text, true))
	fmt.Println("cut all: ", new.CutAll(text))
	fmt.Println("cut for search: ", new.CutSearch(text, true))
	fmt.Println(new.String(text, true))
}

// Console output:
// 2022/02/18 17:44:34 Dict files path: [dict.txt]
// 2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
// 2022/02/18 17:44:34 Gse dictionary loaded finished.
// 2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
// cut: [第一次爱的人 是 谁 演唱 的]
// cut all: [第一次爱的人 是 谁 演唱 的]
// cut for search: [第一次爱的人 是 谁 演唱 的]
// 第一次爱的人/n 是/x 谁/x 演唱/v 的/x
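The README's custom dictionary example prints seg.Stop(s1) separately from seg.Cut, which suggests (this thread does not confirm it) that stop words are applied as a separate filtering pass over the cut result rather than inside Cut. The sketch below shows that filtering idea with the standard library only; `filterStop` and the hard-coded token list are illustrative and not part of gse.

```go
package main

import "fmt"

// filterStop returns the tokens that are not in the stop set, preserving order.
func filterStop(tokens []string, stop map[string]bool) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// The cut result reported in the issue above.
	tokens := []string{"第一次爱的人", "是", "谁", "演唱", "的"}
	stop := map[string]bool{"的": true, "是": true}
	fmt.Println(filterStop(tokens, stop)) // [第一次爱的人 谁 演唱]
}
```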

Why can't I load my user-defined dictionary?

seg.LoadDict("zh, /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt")

2019/07/18 10:40:48 Could not load dictionaries: " /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt", open /Users/xxxxxx/go/src/github.com/go-ego/gse/data/dict/dictionary.txt: no such file or directory

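Memory usage is too high during use

While using it, I found that memory usage is too high: about half of the entries in the pprof top output belong to this package. Could this be optimized?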
Showing top 10 nodes out of 50
flat flat% sum% cum cum%
72.97MB 18.91% 18.91% 72.97MB 18.91% github.com/go-ego/gse/vendor/github.com/go-ego/cedar.(*Cedar).addBlock
60.51MB 15.68% 34.58% 60.51MB 15.68% database/sql.convertAssign
51.74MB 13.41% 47.99% 124.71MB 32.31% github.com/go-ego/gse.(*Dictionary).addToken
40.50MB 10.49% 58.48% 40.50MB 10.49% github.com/go-ego/gse.splitTextToWords
36.63MB 9.49% 67.97% 39.63MB 10.27% sync.(*Map).Store
27MB 7.00% 74.97% 27MB 7.00% github.com/go-ego/gse.(*Segmenter).segmentWords
21.50MB 5.57% 80.54% 337.34MB 87.41% sensitive/task.LoadSensitive
13.50MB 3.50% 84.04% 40.50MB 10.49% github.com/go-ego/gse.(*Segmenter).CalcToken
10.50MB 2.72% 86.76% 38MB 9.85% sensitive/task.Analyze.func2.1
7.50MB 1.94% 88.70% 13MB 3.37% github.com/modern-go/reflect2.(*UnsafeSliceType).UnsafeMakeSlice

ExtractTags and TextRank output blanks.

  • Gse version (or commit ref):
    0.64.1
  • Go version:
    1.15.6
  • Operating system and bit:
    OS: macOS Catalina 10.15.7
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
  • Provide example code:

Simple code as below:

    text := "那里湖面总是澄清, 那里空气充满宁静"

    seg := gse.New()
    log.Debug().Interface("cut", seg.Cut(text, true)).Interface("cut all", seg.CutAll(text)).Msg("")
    // output: cut=["那里","湖面","总是","澄清",", ","那里","空气","充满","宁静"] cut all=["那里","里湖","湖面","总是","澄清",","," ","那里","空气","充满","宁静"]

    // seg.LoadDict("./dictionary.txt")
    var tr idf.TextRanker
    tr.WithGse(seg)
    result := tr.TextRank(text, 5)
    log.Debug().Interface("text rank", result).Msg("")
    // output: text rank=[{},{},{}]

    var te idf.TagExtracter
    te.WithGse(seg)
    if err := te.LoadIdf(); err != nil {
        log.Error().Err(err).Msg("")
    }
    segments := te.ExtractTags(text, 5)
    log.Debug().Interface("segments", segments).Msg("")
    // output: segments=[{},{},{},{},{}]
  • Log gist:

Description

ExtractTags and TextRank output blanks.

Any advice?

V1 Release?

Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

Based on the release history, it looks like this library has been in use for quite a while, but it's still v0. Is there any reason not to issue an official v1 release?

It would also be nice to see quality metrics on the readme, if you have any. E.g. comparison to data like https://universaldependencies.org/

The stop-word dictionary method never takes effect; what is going on?

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

var (
	text   = "第一次爱的人是谁演唱的"
	new, _ = gse.New("dict.txt")

	seg gse.Segmenter
)

func main() {
	cut()
}

// loadDictEmbed supported from go1.16
func loadDictEmbed() {
	seg.LoadDictEmbed()
	seg.LoadStopEmbed()
}

func cut() {
	new.LoadStop("stop.txt")
	new.IsStop("是") // after adding "是" to the stop dictionary, "是" still appears in the segmentation result
	fmt.Println("cut: ", new.Cut(text, true))
	fmt.Println("cut all: ", new.CutAll(text))
	fmt.Println("cut for search: ", new.CutSearch(text, true))
	fmt.Println(new.String(text, true))
}

// Output:
// cut: [第一次爱的人 是 谁 演唱 的]
// cut all: [第一次爱的人 是 谁 演 唱 的]
// cut for search: [第一次爱的人 是 谁 演唱 的]
// 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x

With Go module support (Go 1.11+), Error

The minimum supported version is Go 1.17

github.com/go-ego/gse

../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:31:25: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:31:31: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:43:25: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:47:25: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:58:27: undefined: ja
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:65:27: undefined: zhS
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:68:27: undefined: zhT
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:140:27: undefined: stopDict
../../pkg/mod/github.com/go-ego/[email protected]/dict_1.16.go:153:25: undefined: stopDict
note: module requires Go 1.17

"\001" in text gets error result

  • Gse version (or commit ref):
    1fd1428
  • Go version:
    1.14.2
  • Operating system and bit:
    any
  • Can you reproduce the bug at Examples:
    • No
    • Yes (provide example code)
    • Not relevant
  • Provide example code:
func TestSegment(t *testing.T) {
	seg := &gse.Segmenter{}
	err := seg.LoadDict("../data/dictionary.txt")
	if err != nil {
		t.Fatal(err)
	}
	data := []byte("\001你好吗", )
	res := seg.Segment(data)
	for _, re := range res {
		t.Log(re.Token().Text())
		t.Log(re.Start())
		t.Log(re.End())
	}
}
  • Log gist:

    TestSegment: process_test.go:51: 你
    TestSegment: process_test.go:52: 0
    TestSegment: process_test.go:53: 3
    TestSegment: process_test.go:51: 你好
    TestSegment: process_test.go:52: 3
    TestSegment: process_test.go:53: 9
    TestSegment: process_test.go:51: 吗
    TestSegment: process_test.go:52: 9
    TestSegment: process_test.go:53: 12

Description

The first token should be "\001", but we get the second word instead.
The start of the second token should be 1.
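For reference, the byte offsets the report expects can be checked independently of gse. The sketch below only uses `unicode/utf8` to list the half-open byte range of every rune in the same input; `span` and `runeSpans` are names invented for this illustration. The control byte occupies [0,1), so the first Chinese character starts at offset 1.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// span is a half-open byte range [Start, End) within the input.
type span struct {
	Text       string
	Start, End int
}

// runeSpans returns the byte range of every rune in data.
func runeSpans(data []byte) []span {
	var out []span
	for i := 0; i < len(data); {
		// DecodeRune reports size 1 for invalid bytes, so the loop always advances.
		_, size := utf8.DecodeRune(data[i:])
		out = append(out, span{string(data[i : i+size]), i, i + size})
		i += size
	}
	return out
}

func main() {
	for _, s := range runeSpans([]byte("\001你好吗")) {
		fmt.Printf("%q [%d,%d)\n", s.Text, s.Start, s.End)
	}
	// "\x01" [0,1)
	// "你" [1,4)
	// "好" [4,7)
	// "吗" [7,10)
}
```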

English cut bug

  • Gse version (or commit ref): 82fc9e4
  • Go version: 1.20
  • Operating system and bit: macOS 13.0
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
seg.LoadDict("zh")
seg.LoadDict("en")
seg.LoadDict("jp")
seg.LoadStop("zh")
logrus.Debugln(seg.CutSearch("Nowadays, there are more and more misunderstanding between parents and children which is so- called generation gap. It is estimated that ( 75 percentages of parents often complain their children’s unreasonable behavior while children usually think their parents too old fashioned )."))
  • Log gist:
[nowadays ,   there   are   more   and   more   misunderstanding   between   parents   and   children   which   is   so -   called   generation   gap .   it   is   estimated   that   (   75   percentages   of   parents   often   complain   their   children ’ s   unreasonable   behavior   while   children   usually   think   their   parents   too   old   fashioned   ) .]

Description

I read data/en/dict.txt and found it empty, so it seems that gse doesn't support English text cutting.

Bleve

Has anyone tried using this with Bleve?

Bleve does this plus a lot more, but lacks decent Chinese / Japanese stemmers.

Using this with Bleve would be a powerful stack.

Could not load dictionaries

I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to get it running. So I have two suggestions:

  1. Can you remove the hard-coded dictionary location in gse, or make it configurable through parameters?
  2. To load dictionary data, would it be possible to convert it into static Go code with "go-bindata" or a similar tool?

gse dict stops working after an empty string is added

  • Gse version: v0.80.2
  • Go version: 1.19
func main() {
	seg := new(gse.Segmenter)
	seg.Dict = gse.NewDict()
	seg.Init()
	seg.AddToken("bj", 100, "n")
	fmt.Println(seg.Dictionary())
	fmt.Println(seg.Find("bj"))
	seg.AddToken("", 100, "n")
	fmt.Println(seg.Dictionary())
	fmt.Println(seg.Find("bj"))
}
// output:
// &{0xc000140000 1 [{[[98 106]] 100 n 0 []}] 100}
// 100 n true
// &{0xc000140000 1 [{[[98 106]] 100 n 0 []} {[] 100 n 0 []}] 200}
// 0  false

Description

The dictionary's Find function stops working after an empty string is added, and the instance can no longer cut strings.
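Until the library rejects empty tokens itself, one possible caller-side workaround is to skip blank words before they reach the dictionary. The sketch below is an assumption-laden illustration, not a gse API: `addIfNonEmpty` is a hypothetical helper, and the `record` closure stands in for a call such as seg.AddToken(word, 100, "n").

```go
package main

import (
	"fmt"
	"strings"
)

// addIfNonEmpty calls add only when word contains something other than
// whitespace. It reports whether add was called.
func addIfNonEmpty(add func(word string) error, word string) (bool, error) {
	if strings.TrimSpace(word) == "" {
		return false, nil
	}
	return true, add(word)
}

func main() {
	var added []string
	// record is a stand-in for the real dictionary call.
	record := func(word string) error {
		added = append(added, word)
		return nil
	}
	for _, w := range []string{"bj", "", "  "} {
		ok, err := addIfNonEmpty(record, w)
		fmt.Printf("%q added=%v err=%v\n", w, ok, err)
	}
	fmt.Println(added) // [bj]
}
```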

Question: Is there any way to get segment info(not only string but with start and end) in hmm and search mode?

  • Gse version (or commit ref):
    0.60
  • Go version:
    1.14
  • Operating system and bit:
    macOS 10.14

Description

In my case, I need the start and end info of each word after segmenting in HMM and search mode.
Reading the API, I only found:

  • CutSearch(string, true), which returns only []string with no start and end info
  • Segment([]byte(text)), which returns segments with start and end info but accepts no parameter to choose search mode

Is there any way to do something like Segment([]byte(text), searchMode)?

How to build without embed dictionary on Go1.16 or above?

Hi,

I noticed that gse leads to a large binary size. After reviewing the code, I found that the cause is probably the embedded dictionary.

The program binary may vary, but the dictionary rarely changes. So is there a way to build a gse project without the embedded dictionary?

Thanks.

  • Gse version (or commit ref): 0.70.1
  • Go version: 1.17
  • Operating system and bit: Ubuntu 20.04 64bit
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant

Found a bug in file dict_util.go

In this func:
func (seg *Segmenter) Reader(reader *bufio.Reader, files ...string) error

these lines:
if fsErr != nil {
if fsErr == io.EOF {
...
}
must be placed after:
...
seg.Dict.AddToken(token)
Otherwise, the last line of the dictionary file is skipped unless the file ends with an empty line.
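The ordering problem described above is a general property of bufio.Reader.ReadString: when the input ends without a trailing newline, the final call returns the last line together with io.EOF. The sketch below is a minimal standalone illustration, not the gse source; `readEntries` and `entriesOf` are hypothetical names. It processes the data first and checks the error afterwards.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readEntries returns every non-blank line from reader, including a final
// line that has no trailing newline.
func readEntries(reader *bufio.Reader) ([]string, error) {
	var lines []string
	for {
		line, err := reader.ReadString('\n')
		// Handle whatever was read before looking at the error: on io.EOF,
		// line may still hold the last, unterminated line.
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			lines = append(lines, trimmed)
		}
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
	}
}

// entriesOf is a convenience wrapper for string input.
func entriesOf(content string) ([]string, error) {
	return readEntries(bufio.NewReader(strings.NewReader(content)))
}

func main() {
	// The second line has no trailing newline, as in the reported case.
	lines, err := entriesOf("你好 100 l\n世界 80 n")
	fmt.Printf("%q %v\n", lines, err)
}
```

Checking for io.EOF before handling the returned text would drop "世界 80 n" in this input, which matches the symptom in the report.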

Float should not be split

Splitting "loss of 76.7" gives "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

What can I do?

  • Gse version (or commit ref):v0.69.3
  • Go version:1.16

Split "2021年09月10日" into "2021年 / 09月 / 10日"

Split "2021年09月10日": I want to get "2021年 / 09月 / 10日".
Split "중국 규제 리스크에 울고 웃는 종목들현명하게 대응하려면": I want to get "중국 / 규제 / 리스크에...".

I use this code:
words := seg.Cut("2021年09月10日", true)

I changed the following code in the package, but I don't think that is a good solution. Is there a better way?
File: hmm_seg.go line: 27
regSkip = regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)

  • Gse version (or commit ref):v0.69.3
  • Go version:1.16
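The pattern quoted above can be tried on its own with the standard `regexp` package to see which chunks it keeps together. This only demonstrates the regular expression, not gse's segmentation; `keepTogether` and `chunks` are names made up for the example. The README's Example2 passes the same kind of pattern to seg.CutDAG(text1, reg), which may be a way to get this behavior without patching the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepTogether is the pattern quoted in the issue above.
var keepTogether = regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)

// chunks returns every substring of text matched by keepTogether.
func chunks(text string) []string {
	return keepTogether.FindAllString(text, -1)
}

func main() {
	fmt.Printf("%q\n", chunks("2021年09月10日")) // ["2021年" "09月" "10日"]
	fmt.Printf("%q\n", chunks("loss of 76.7"))   // ["loss" "of" "76.7"]
}
```

The order of the alternatives matters: `\d+\.\d+` comes before `[a-zA-Z0-9]+`, so "76.7" is matched as one float instead of stopping at "76".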

Invalid ranges produced on bad inputs

segments := segmenter.Segment([]byte(w))
for i, seg := range segments {
	if seg.End() > len(w) {
		log.Println("bad split: ", seg.Start(), "/", seg.End(), " w '", w, "' len ", len(w), "hex ", hex.EncodeToString([]byte(w)), "i ", i)
	} else {
		log.Println(w[seg.Start():seg.End()])
	}
}

The code above gives bad ranges on bad inputs, examples:
bad split: 3 / 6 w ' �缊 ' len 4 hex 01e7bc8a i 1
bad split: 5 / 8 w ' Vm�犹 ' len 6 hex 566d01e78ab9 i 2
bad split: 5 / 8 w ' Vm�犹 ' len 6 hex 566d01e78ab9 i 2
bad split: 3 / 6 w ' �榬 ' len 4 hex 01e6a6ac i 1

Expected behavior is that seg.End() should always be in range 0..len(w)
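Until the ranges are guaranteed to be valid, a caller can guard the slice expression instead of letting it panic. The helper below is a small defensive sketch under that assumption; `sliceSpan` is a hypothetical name, and the byte values are the hex input from the report.

```go
package main

import "fmt"

// sliceSpan returns b[start:end] and true when the range is valid for b,
// or nil and false otherwise, instead of panicking.
func sliceSpan(b []byte, start, end int) ([]byte, bool) {
	if start < 0 || end < start || end > len(b) {
		return nil, false
	}
	return b[start:end], true
}

func main() {
	w := []byte{0x01, 0xe7, 0xbc, 0x8a} // hex 01e7bc8a from the report
	_, ok := sliceSpan(w, 3, 6)         // the reported bad range 3 / 6
	fmt.Println(ok)                     // false
	part, ok := sliceSpan(w, 1, 4)      // the three bytes after the control byte
	fmt.Println(len(part), ok)          // 3 true
}
```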

  1. ego commit is recent, bdc71ec
  2. go version go1.9 windows/amd64
package main

import (
	"encoding/hex"
	"flag"
	"fmt"
	"log"

	"github.com/go-ego/gse"
)

func main() {
	flag.Parse()
	var seg gse.Segmenter
	seg.LoadDict()
	text, _ := hex.DecodeString("01e7bc8a")
	segments := seg.Segment([]byte(text))
	fmt.Println(gse.ToString(segments, true))
	for _, seg := range segments {
		log.Println(text[seg.Start():seg.End()])
	}
}
2017/11/12 10:41:25 载入 gse 词典 C:/Users/Valle/go/src/github.com/go-ego/gse/data/dict/dictionary.txt
缊/zg 缊/zg 
2017/11/12 10:41:27 gse 词典载入完毕
2017/11/12 10:41:27 [1 231 188]

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.main()

Logs are still printed with SkipLog enabled


  • Gse version (or commit ref): v0.80.2
  • Go version: 1.20.4
  • Operating system and bit: centOS 7.6
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
  • Provide example code:
var tagExtracter *idf.TagExtracter
var onceSeg sync.Once
var err error
onceSeg.Do(func() {
	seg := gse.Segmenter{
		SkipLog: true,
	}
	err = seg.LoadDict()
	if err == nil {
		var te idf.TagExtracter
		te.WithGse(seg)
		err = te.LoadIdf()
		if err == nil {
			tagExtracter = &te
		}
	}
})
  • Log gist:

Description

SkipLog is enabled, but the dictionary-loading logs are still printed:
2023/05/28 11:29:08 Dict files path: [/var/www/api/vendor/github.com/go-ego/gse/data/dict/zh/idf.txt]
2023/05/28 11:29:08 Load the gse dictionary: "/var/www/api/vendor/github.com/go-ego/gse/data/dict/zh/idf.txt"
2023/05/28 11:29:10 Gse dictionary loaded finished.

...

Expected: with SkipLog enabled, no logs are printed.

How to define custom English terms in a dictionary file

I would like to define multi-word English terms in the dictionary file, as can be done for Chinese, for example:
opening the door
so that the Cut method matches the whole term directly. Can someone tell me how to do this?

There is a problem with Chinese word segmentation

  • Gse version (or commit ref): v0.71.0.695, Green Lake!
  • Go version:1.20
  • Operating system and bit: Mac OS
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
  • Provide example code:
x, _ := gse.New()
fmt.Println(x.Cut("法院调解费是多少", true))
  • Log gist:
    [法院 调解 费是 多少]

Description

费是 ??
...

add english Participles


How to disable output of dictionary loading?

Information about dictionary loading is always output:

2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Dict files path:  [./xxx]
2023/01/07 23:04:15 Load the gse dictionary: "./xxx"
2023/01/07 23:04:15 Gse dictionary loaded finished.
2023/01/07 23:04:15 Load the stop word dictionary: "./xxx"

My code is as follows:

	segmenter, err := gse.NewEmbed(static.DictFile)
	segmenter.LoadStopEmbed(static.StopFile)
	segmenter.MoreLog = false // seems to have no effect
	segmenter.SkipLog = true  // seems to have no effect

Is there any option to disable such output?
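The timestamp format in the output above looks like the default logger of Go's standard `log` package. Assuming the library does log through that default logger (the thread does not confirm this), a workaround that needs no gse option is to discard the logger's output while the dictionaries load. `quietly` is a hypothetical helper, and the log.Println call inside it stands in for the real loading code.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// quietly runs fn with the standard logger's output discarded, then restores
// the previous destination.
func quietly(fn func()) {
	prev := log.Writer()
	log.SetOutput(io.Discard)
	defer log.SetOutput(prev)
	fn()
}

func main() {
	log.SetOutput(os.Stdout)
	quietly(func() {
		// Stand-in for dictionary loading, e.g. gse.NewEmbed(...).
		log.Println("Load the gse dictionary") // suppressed
	})
	log.Println("visible again")
	fmt.Println("done")
}
```

Note that this silences the process-wide default logger, so messages logged by other goroutines during the call are dropped as well.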

Loading an 8 MB dictionary makes the segmenter use a lot of memory, the most of anything in the project. Is there any way to optimize this?


  • Gse version (or commit ref):
  • Go version:
  • Operating system and bit:
  • Can you reproduce the bug at Examples:
    • Yes (provide example code)
    • No
    • Not relevant
  • Provide example code:
//go:embed zh/dictionary.txt
var zhSimpleDict string
var (
	seg gse.Segmenter
)
func Init() (err error) {
	return seg.LoadDictEmbed(zhSimpleDict)
}
  • Log gist:
    (pprof) top
    Showing nodes accounting for 138.46MB, 97.84% of 141.52MB total
    Dropped 26 nodes (cum <= 0.71MB)
    Showing top 10 nodes out of 35
    flat flat% sum% cum cum%
    40.95MB 28.93% 28.93% 77.66MB 54.87% github.com/go-ego/gse.(*Dictionary).AddToken
    36.71MB 25.94% 54.87% 36.71MB 25.94% github.com/vcaesar/cedar.(*Cedar).addBlock
    31.50MB 22.26% 77.13% 31.50MB 22.26% github.com/go-ego/gse.(*Segmenter).SplitTextToWords
    9MB 6.36% 83.49% 129.18MB 91.28% github.com/go-ego/gse.(*Segmenter).LoadDictStr
    8.99MB 6.35% 89.85% 8.99MB 6.35% strings.genSplit
    5.52MB 3.90% 93.75% 5.52MB 3.90% embed.FS.ReadFile
    2.02MB 1.43% 95.18% 2.02MB 1.43% github.com/go-ego/gse/hmm.loadDefEmit
    1.78MB 1.26% 96.44% 1.78MB 1.26% github.com/mozillazg/go-pinyin.init
    1.13MB 0.8% 97.23% 1.13MB 0.8% github.com/valyala/fasthttp/stackless.NewFunc
    0.85MB 0.6% 97.84% 0.85MB 0.6% github.com/goccy/go-json/internal/decoder.init.0

Description

gse uses the most memory in the whole project. Is there any way to optimize this?
The dictionary is 8 MB.
...
