suggest-go / suggest Goto Github PK

Top-k Approximate String Matching.

Home Page: https://suggest-go.github.io/

License: MIT License

Go 99.67% Dockerfile 0.12% Makefile 0.20%

fuzzy-string-matching golang-library top-k-approximate-string-matching ngram fuzzy-search search-engine language-model spellchecker autocomplete

suggest's People

Contributors

Stargazers

Watchers

Forkers

mewbak lunarforge lisp2021 teng231

suggest's Issues

Roaring bitmap integration

Roaring bitmap package shows significant performance for methods IntPeekable.AdvanceIfNeeded, IntIterator.Next. Needs to add PostingList implementation using this library.

Experiment with Roaring64 for NGramVector

Roaring lib provides a new feature: Roaring64. It requires to replace NGramVector with Roaring64 and provide benchmarks of this experiment.

Support for Protocol Buffers

It requires to add protocol buffers support. Benefits could be found here.

Ability for getting index of collection.

As I seen the api, suggest service can return only strings. It can be much convinient to add to ResultItem index of element in the original collections to ability fetch related data to this result item.

Logger injection

We should have the ability to log debug data, errors, and info

MergeSkip enhancement

https://lemire.me/blog/2019/08/16/faster-threshold-queries-with-cache-sensitive-scancount/#comment-424620

Spellchecker webservice

Continuous Integration via Travis

Map Reduce for n-gram count

Experiment with Map Reduce package https://github.com/chrislusf/gleam for building n-grams counts files

[FUZZIT] Crash at fuzzing target mph

A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

Next word suggestion prototype

Let's say we have the next indexed data:

Red dead redemption 2
Red deer
Red driving school
iPhone X
Dead island
Design pattern

And a user has typed Red de. With Top-k approximate string matching feature we can suggest the next list by the prefix de

Design
Deer
Dead

We need to have the ability to sort the returned data by relevance, i.e. the returned list should look:

Dead
Deer
Design

Fortunately, NGram Language Model gives us this ability. The issue requires to build a Next word suggestion prototype with a combination of topK approximate string search in a dictionary and language model as a scorer.

Posting list intersector

Adapt MergeSkip to perform intersection with Collector dependency.

Phonetic correction

It requires to implement phonetic tokenizer filters such as Double metaphone, Daitch-Mokotoff Soundex and build a special phonetic inverted index for search purposes

Some links

Variable-Length Grams

There was an attempt to implement VGrams for performance enhancement. I see a big potential in it.

See more: http://www.vldb.org/conf/2007/papers/research/p303-li.pdf

Unit tests [The FIRST Principal]

It requires to rewrite tests using The FIRST Principal:

[F]ast: Unit tests should be fast otherwise they will slow down development & deployment.
[I]ndependent: Never ever write tests which depend on other test cases.
[R]epeatable: A repeatable test is one that produces the same results each time you run it.
[S]elf-validating: There must be no manual interpretation of the results.
[T]imely/[T]horoughly: Unit tests must be included for every pull request of a new feature and cover edge cases, errors, and bad inputs.

Stemmer support

https://github.com/snowballstem/snowball/go provides an amazing library for stemmer support. It requires to generate code for Russian and English languages, add StemmerFilters.

Integrate https://github.com/stretchr/testify

https://github.com/stretchr/testify should make the test development easier.

Upload a sample of language model

This requires to upload a sample of 3-Gram language model and build a usage example of the spellchecker cli.

fmt.Errorf apply %w verb

Go 1.13 introduces a new API for errors package
Wrap all errors created by fmt.Errorf and apply the %w verb to the error argument

We can use https://github.com/fatih/errwrap to make this migration easier

demo @ http://54.183.244.111:8000 down?

Posting list processing enhancement

The current workflow of a posting list processing can be described as next:

For each term, a posting list is fetched from file and completely decoded.
The set of posting lists are merged with using of binary search.
Merge result uses for further top k selection.

The strategy described in step 1 is potentially slow. We should completely decode every posting list, because of the inability to perform random access binary searching on the compressed list.
Also, step 1 requires us to allocate memory for storing each decoded posting list, which could be a bottleneck, as well.

In this issue, we need to implement the skipping approach, where the idea can be found in Section 2.3 of Introduction to Information Retrieval. The original reference for all these ideas is the paper by Moffat and Zobel.

Document Collector implementation

The current implementation of merge documents is not performance efficient:

We collect all documents (even when do not have to do it)
We are performing scoring on each document
We return all matching documents (memory inefficient at all)

As solutions, we can use Lucene approach:

Extend store.Output

Interface store.Input has built in abilities to read encoded uint32, uint16 etc. This approach has shown successful experience and simplicity of usage. This proposal aims to extend store.Output with the corresponding encoding methods such as:

WriteVUInt32
WriteUInt32
WriteUInt16

[FUZZIT] Crash at fuzzing target mph

A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

CPMerge inappropriate processing threshold = 1

Because of
6886d01#diff-24fc625b69b8cd5e7f1c88af065d0cf8R93

We have inappropriate results on threshold = 1. It requires to fix it and add some tests to ensure it works properly.

suggest-go / suggest Goto Github PK

suggest's People

Contributors

Stargazers

Watchers

Forkers

suggest's Issues

Recommend Projects

Recommend Topics

Recommend Org