suggest-go / suggest Goto Github PK
View Code? Open in Web Editor NEWTop-k Approximate String Matching.
Home Page: https://suggest-go.github.io/
License: MIT License
Top-k Approximate String Matching.
Home Page: https://suggest-go.github.io/
License: MIT License
Roaring bitmap package shows significant performance for methods IntPeekable.AdvanceIfNeeded, IntIterator.Next
. Needs to add PostingList implementation using this library.
Roaring lib provides a new feature: Roaring64. It requires to replace NGramVector
with Roaring64
and provide benchmarks of this experiment.
It requires to add protocol buffers support. Benefits could be found here.
As I seen the api, suggest service can return only strings. It can be much convinient to add to ResultItem index of element in the original collections to ability fetch related data to this result item.
We should have the ability to log debug data, errors, and info
Experiment with Map Reduce package https://github.com/chrislusf/gleam for building n-grams counts files
A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:
2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO: 0 files found in corpus
FUZZER: INFO: 0 files found in additional-corpus
FUZZER: INFO: 0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2 INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3 NEW ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6 NEW ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336 NEW ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943 REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954 NEW ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969 NEW ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991 REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER: and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER: #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER: #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER: #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER: #3 0x7f80c8a9e0df (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER: #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER:
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec: 14
FUZZER: stat::new_units_added: 7
FUZZER: stat::slowest_unit_time_sec: 0
FUZZER: stat::peak_rss_mb: 32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...
More details can be found here
Cheers,
Fuzzit Bot
Let's say we have the next indexed data:
And a user has typed Red de
. With Top-k approximate string matching
feature we can suggest the next list by the prefix de
We need to have the ability to sort the returned data by relevance, i.e. the returned list should look:
Fortunately, NGram Language Model
gives us this ability. The issue requires to build a Next word suggestion prototype
with a combination of topK approximate string search in a dictionary
and language model
as a scorer.
Adapt MergeSkip
to perform intersection with Collector
dependency.
It requires to implement phonetic tokenizer filters such as Double metaphone
, Daitch-Mokotoff Soundex
and build a special phonetic inverted index for search purposes
Some links
There was an attempt to implement VGrams for performance enhancement. I see a big potential in it.
See more: http://www.vldb.org/conf/2007/papers/research/p303-li.pdf
It requires to rewrite tests using The FIRST Principal:
[F]ast: Unit tests should be fast otherwise they will slow down development & deployment.
[I]ndependent: Never ever write tests which depend on other test cases.
[R]epeatable: A repeatable test is one that produces the same results each time you run it.
[S]elf-validating: There must be no manual interpretation of the results.
[T]imely/[T]horoughly: Unit tests must be included for every pull request of a new feature and cover edge cases, errors, and bad inputs.
https://github.com/snowballstem/snowball/go provides an amazing library for stemmer support. It requires to generate code for Russian and English languages, add StemmerFilters.
https://github.com/stretchr/testify should make the test development easier.
This requires to upload a sample of 3-Gram language model and build a usage example of the spellchecker cli.
Go 1.13 introduces a new API for errors
package
Wrap all errors created by fmt.Errorf
and apply the %w verb to the error argument
We can use https://github.com/fatih/errwrap to make this migration easier
The current workflow of a posting list processing can be described as next:
top k
selection.The strategy described in step 1 is potentially slow. We should completely decode every posting list, because of the inability to perform random access binary searching on the compressed list.
Also, step 1 requires us to allocate memory for storing each decoded posting list, which could be a bottleneck, as well.
In this issue, we need to implement the skipping
approach, where the idea can be found in Section 2.3 of Introduction to Information Retrieval
. The original reference for all these ideas is the paper by Moffat and Zobel.
The current implementation of merge documents is not performance efficient:
As solutions, we can use Lucene approach:
Interface store.Input
has built in abilities to read encoded uint32, uint16 etc. This approach has shown successful experience and simplicity of usage. This proposal aims to extend store.Output
with the corresponding encoding methods such as:
WriteVUInt32
WriteUInt32
WriteUInt16
A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:
2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO: 0 files found in corpus
FUZZER: INFO: 0 files found in additional-corpus
FUZZER: INFO: 0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2 INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3 NEW ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6 NEW ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336 NEW ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943 REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954 NEW ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969 NEW ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991 REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER: and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER: #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER: #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER: #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER: #3 0x7f80c8a9e0df (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER: #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER:
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec: 14
FUZZER: stat::new_units_added: 7
FUZZER: stat::slowest_unit_time_sec: 0
FUZZER: stat::peak_rss_mb: 32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...
More details can be found here
Cheers,
Fuzzit Bot
From time to time, the structure of inverted index changes. There is no way to tell a client that the index has changed, which could bring frustration and waste of time.
For solving this issue, it is needed to store an index version in index header, and check it on index read. On each index structure change, the index version will be increased.
This package doesn't provide good documentation and examples of usage. The issue requires to create documentation, terminology, usage examples, and demos pages.
A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:
2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO: 0 files found in corpus
FUZZER: INFO: 0 files found in additional-corpus
FUZZER: INFO: 0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2 INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3 NEW ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6 NEW ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336 NEW ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943 REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954 NEW ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969 NEW ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991 REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER: and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER: #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER: #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER: #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER: #3 0x7f80c8a9e0df (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER: #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER:
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec: 14
FUZZER: stat::new_units_added: 7
FUZZER: stat::slowest_unit_time_sec: 0
FUZZER: stat::peak_rss_mb: 32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...
More details can be found here
Cheers,
Fuzzit Bot
Because of
6886d01#diff-24fc625b69b8cd5e7f1c88af065d0cf8R93
We have inappropriate results on threshold
= 1. It requires to fix it and add some tests to ensure it works properly.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.