Git Product home page Git Product logo

suggest's People

Contributors

alldroll avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

suggest's Issues

Ability for getting index of collection.

As I seen the api, suggest service can return only strings. It can be much convinient to add to ResultItem index of element in the original collections to ability fetch related data to this result item.

Logger injection

We should have the ability to log debug data, errors, and info

[FUZZIT] Crash at fuzzing target mph

A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

Next word suggestion prototype

Let's say we have the next indexed data:

  • Red dead redemption 2
  • Red deer
  • Red driving school
  • iPhone X
  • Dead island
  • Design pattern

And a user has typed Red de. With Top-k approximate string matching feature we can suggest the next list by the prefix de

  • Design
  • Deer
  • Dead

We need to have the ability to sort the returned data by relevance, i.e. the returned list should look:

  • Dead
  • Deer
  • Design

Fortunately, NGram Language Model gives us this ability. The issue requires to build a Next word suggestion prototype with a combination of topK approximate string search in a dictionary and language model as a scorer.

Unit tests [The FIRST Principal]

It requires to rewrite tests using The FIRST Principal:

[F]ast: Unit tests should be fast otherwise they will slow down development & deployment.
[I]ndependent: Never ever write tests which depend on other test cases.
[R]epeatable: A repeatable test is one that produces the same results each time you run it.
[S]elf-validating: There must be no manual interpretation of the results.
[T]imely/[T]horoughly: Unit tests must be included for every pull request of a new feature and cover edge cases, errors, and bad inputs.

Posting list processing enhancement

The current workflow of a posting list processing can be described as next:

  • For each term, a posting list is fetched from file and completely decoded.
  • The set of posting lists are merged with using of binary search.
  • Merge result uses for further top k selection.

The strategy described in step 1 is potentially slow. We should completely decode every posting list, because of the inability to perform random access binary searching on the compressed list.
Also, step 1 requires us to allocate memory for storing each decoded posting list, which could be a bottleneck, as well.

In this issue, we need to implement the skipping approach, where the idea can be found in Section 2.3 of Introduction to Information Retrieval. The original reference for all these ideas is the paper by Moffat and Zobel.

img106

Document Collector implementation

The current implementation of merge documents is not performance efficient:

  • We collect all documents (even when do not have to do it)
  • We are performing scoring on each document
  • We return all matching documents (memory inefficient at all)

As solutions, we can use Lucene approach:

Extend store.Output

Interface store.Input has built in abilities to read encoded uint32, uint16 etc. This approach has shown successful experience and simplicity of usage. This proposal aims to extend store.Output with the corresponding encoding methods such as:

WriteVUInt32
WriteUInt32
WriteUInt16

[FUZZIT] Crash at fuzzing target mph

A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

Inverted index version validation

From time to time, the structure of inverted index changes. There is no way to tell a client that the index has changed, which could bring frustration and waste of time.

For solving this issue, it is needed to store an index version in index header, and check it on index read. On each index structure change, the index version will be increased.

Documentation and usage examples

This package doesn't provide good documentation and examples of usage. The issue requires to create documentation, terminology, usage examples, and demos pages.

[FUZZIT] Crash at fuzzing target mph

A new crash was discovered for fuzzing target mph.
Here is a snippet of the log:

2019/12/23 14:30:20 downloading seed
2019/12/23 14:30:21 no seed corpus. continue...
2019/12/23 14:30:21 downloading corpus
2019/12/23 14:30:21 no generating corpus yet. continue...
2019/12/23 14:30:21 downloading fuzzer
2019/12/23 14:30:22 downloading additional corpus
2019/12/23 14:30:22 no additional-corpus. skipping...
2019/12/23 14:30:22 Running fuzzing with: ./fuzzer -print_final_stats=1 -exact_artifact_path=./artifact -error_exitcode=76 -max_total_time=3600 corpus additional-corpus seed -rss_limit_mb=1984
FUZZER: INFO: Seed: 3726776796
FUZZER: INFO: 65536 Extra Counters
FUZZER: INFO:        0 files found in corpus
FUZZER: INFO:        0 files found in additional-corpus
FUZZER: INFO:        0 files found in seed
FUZZER: INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
FUZZER: INFO: A corpus is not provided, starting from an empty corpus
FUZZER: #2	INITED ft: 76 corp: 1/1b lim: 4 exec/s: 0 rss: 26Mb
FUZZER: #3	NEW    ft: 78 corp: 2/3b lim: 4 exec/s: 0 rss: 26Mb L: 2/2 MS: 1 InsertByte-
FUZZER: #6	NEW    ft: 79 corp: 3/6b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 3 ChangeByte-CMP-CopyPart- DE: "\x01\x00"-
FUZZER: #336	NEW    ft: 80 corp: 4/9b lim: 4 exec/s: 0 rss: 26Mb L: 3/3 MS: 5 ShuffleBytes-CopyPart-ChangeBinInt-ChangeBit-EraseBytes-
FUZZER: #943	REDUCE ft: 80 corp: 4/7b lim: 4 exec/s: 0 rss: 26Mb L: 1/3 MS: 2 EraseBytes-EraseBytes-
FUZZER: #2954	NEW    ft: 81 corp: 5/13b lim: 6 exec/s: 0 rss: 26Mb L: 6/6 MS: 1 InsertRepeatedBytes-
FUZZER: #7969	NEW    ft: 82 corp: 6/24b lim: 11 exec/s: 0 rss: 28Mb L: 11/11 MS: 5 EraseBytes-EraseBytes-InsertRepeatedBytes-ChangeBinInt-InsertByte-
FUZZER: #17991	REDUCE ft: 134 corp: 7/45b lim: 21 exec/s: 17991 rss: 29Mb L: 21/21 MS: 2 CMP-CrossOver- DE: "\x00\x00\x00\x00"-
FUZZER: ALARM: working on the last Unit for 1200 seconds
FUZZER:        and the timeout value is 1200 (use -timeout=N to change)
FUZZER: MS: 1 ChangeByte-; base unit: e6e51fdc0572094ff3574d5b922354589890309e
FUZZER: 0x0,0x0,0x5e,0xa,0x0,0x0,0x0,0xa,0xb5,0xb5,0xb4,0x4a,0xf3,0x4a,0x4a,0x4a,0x3d,0x0,0x0,0x0,0x0,
FUZZER: \x00\x00^\x0a\x00\x00\x00\x0a\xb5\xb5\xb4J\xf3JJJ=\x00\x00\x00\x00
FUZZER: artifact_prefix='./'; Test unit written to ./artifact
FUZZER: Base64: AABeCgAAAAq1tbRK80pKSj0AAAAA
FUZZER: ==26== ERROR: libFuzzer: timeout after 1200 seconds
FUZZER:     #0 0x4b059f in __sanitizer_print_stack_trace /tmp/final/llvm.src/projects/compiler-rt/lib/ubsan/ubsan_diag_standalone.cc:29:3
FUZZER:     #1 0x4529a8 in fuzzer::PrintStackTrace() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:206:5
FUZZER:     #2 0x430c1d in fuzzer::Fuzzer::AlarmCallback() /tmp/final/llvm.src/projects/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:300:5
FUZZER:     #3 0x7f80c8a9e0df  (/lib/x86_64-linux-gnu/libpthread.so.0+0x110df)
FUZZER:     #4 0x55dc30 in github.com/suggest-go/suggest/pkg/mph.(*mph).Build /home/travis/gopath/src/github.com/suggest-go/suggest/pkg/mph/mph.go:85
FUZZER: 
FUZZER: SUMMARY: libFuzzer: timeout
FUZZER: stat::number_of_executed_units: 17992
FUZZER: stat::average_exec_per_sec:     14
FUZZER: stat::new_units_added:          7
FUZZER: stat::slowest_unit_time_sec:    0
FUZZER: stat::peak_rss_mb:              32
2019/12/23 14:50:25 process finished with error = exit status 77
2019/12/23 14:50:25 Exit Status: 77
2019/12/23 14:50:26 uploading crash...

More details can be found here

Cheers,
Fuzzit Bot

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.