
nxsearch's Introduction

nxsearch


Work in progress. Upstream at: https://github.com/rmind/nxsearch

nxsearch is a full-text search engine library, also provided with a web server integration.

The engine is written in C11 and is distributed under the 2-clause BSD license.

Features

  • Lightweight, simple and fast.
  • BM25 and TF-IDF algorithms.
  • Integrates with the Snowball stemmer.
  • Supports fuzzy matching (using the BK-tree with Levenshtein distance).
  • Supports query logic operators, grouping, nesting, etc.
  • Supports filters in Lua for easy extensibility.
  • Basic UTF-8 and internationalization support.
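Fuzzy matching in the feature list is based on a BK-tree keyed by Levenshtein distance. As a refresher, a minimal edit-distance function (an illustrative sketch, not nxsearch's implementation) looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal Levenshtein edit distance (insert/delete/substitute, cost 1
 * each), using a single-row dynamic programming table. */
static unsigned
levenshtein(const char *a, const char *b)
{
	const size_t alen = strlen(a), blen = strlen(b);
	unsigned row[blen + 1];  /* current row of the DP table (C11 VLA) */

	for (size_t j = 0; j <= blen; j++)
		row[j] = (unsigned)j;

	for (size_t i = 1; i <= alen; i++) {
		unsigned prev_diag = row[0];  /* value of row[i-1][j-1] */
		row[0] = (unsigned)i;
		for (size_t j = 1; j <= blen; j++) {
			const unsigned prev = row[j];  /* row[i-1][j] */
			const unsigned subst = prev_diag + (a[i-1] != b[j-1]);
			const unsigned del = prev + 1, ins = row[j-1] + 1;
			row[j] = subst < del ? (subst < ins ? subst : ins)
			                     : (del < ins ? del : ins);
			prev_diag = prev;
		}
	}
	return row[blen];
}
```

A BK-tree then indexes terms by this metric so that a fuzzy lookup only needs to visit subtrees within the requested distance bound.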

Usage

To try as a web service:

# git submodule update --init --recursive  # ensure you have submodules
docker-compose up app  # spin up the service
open http://127.0.0.1:8000/docs  # documentation page

The NXS_BASEDIR environment variable specifies the base directory where the indexed documents as well as the application data files are stored.

Shell

# Create the index:
curl -XPOST http://127.0.0.1:8000/test-idx

# Index some test documents:
curl -d "cat dog cow" http://127.0.0.1:8000/test-idx/add/1
curl -d "dog cow" http://127.0.0.1:8000/test-idx/add/2
curl -d "cat cat cat" http://127.0.0.1:8000/test-idx/add/3

# Run a query:
curl -s -d "cat" http://127.0.0.1:8000/test-idx/search | jq

Documentation

  • Swagger UI with the endpoint documentation is provided at the /docs URL.

  • The Lua filters API can be found HERE.

  • The C API documentation can be found HERE.

nxsearch's People

Contributors

rmind, toloco


nxsearch's Issues

Add Swagger UI

  • Add Swagger UI for the nxsearch web service (serve it as a static file at /docs).
  • Ideally, it should be auto-generated from the comments in the nxsearch_svc.lua source.
  • Generate and add it as a build step for the nxsearch-svc image.

Implement concurrent compaction

Currently, the index structures are generally append-only: document removal uses tombstones (special markings) to indicate deletions. Many deletions therefore produce gaps in the index files, which wastes space. We want to address this problem by implementing compaction.

The following is a proposal for a concurrent compaction algorithm:

new-dtmap = exclusive-open-create
lock new-dtmap

// Initial concurrent sync (captures most of the data)
dtmap-sync from current-dtmap

lock current-dtmap
  // Sync any remaining data with the lock held if there was a race
  dtmap-sync from current-dtmap

  // Make the new index globally visible
  atomic-posix-rename new-dtmap.filename to current-dtmap.filename

  // Publish the compaction offset for the active index references
  atomic-store-release current-dtmap.compaction-offset <= last offset in new-dtmap
unlock current-dtmap
unlock new-dtmap
  • The active index references check for a compaction-offset change, re-open the index (picking up the new file) and sync from this offset. The existing references to the memory-mapped file itself, primarily idxdoc_t::offset, would require a sequential scan to be adjusted.
  • The next compaction may be triggered only when all active index references have synced. The index deletion and re-creation should be separately handled with the timestamp/generation number to avoid the ABA problem.
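The "publish the compaction offset" step above relies on release/acquire ordering: readers that observe the new offset must also observe the synced data written before it. A minimal sketch of that pattern in C11 atomics (the structure and function names here are illustrative, not nxsearch's actual internals):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared dtmap header; field name is illustrative. */
typedef struct {
	_Atomic uint64_t compaction_offset;
} dtmap_hdr_t;

/* Writer: after the atomic POSIX rename, publish the offset with
 * release semantics so that any reader observing the new value also
 * observes the fully synced data that precedes the store. */
static void
publish_compaction(dtmap_hdr_t *hdr, uint64_t last_off)
{
	atomic_store_explicit(&hdr->compaction_offset, last_off,
	    memory_order_release);
}

/* Reader: an active index reference polls the offset; a change means
 * it must re-open the index (picking up the new file) and sync from
 * this offset. */
static uint64_t
check_compaction(dtmap_hdr_t *hdr)
{
	return atomic_load_explicit(&hdr->compaction_offset,
	    memory_order_acquire);
}
```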

Add multi-field search support

  • Add support for fields (no nested schemas for now), e.g.:
{
    "title": "Short story",
    "content": "Once upon a time ...",
    "date": 1666111000
}
  • Selective search: title: "story" and content: "once"

Function to substitute diacritics

Need a function to substitute diacritics for normalization, e.g. replace ė with e. This will either be used by the built-in normalizer or added as a new filter. Note: the ICU library might have this functionality hidden somewhere.
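To illustrate the substitution idea, here is a toy fold function driven by a hand-written table of UTF-8 byte sequences. This is only a sketch: a real implementation would likely use ICU normalization (decompose, then strip combining marks) rather than an enumerated table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table: a few diacritics and their ASCII base
 * letters, keyed by their UTF-8 byte sequences. */
static const struct { const char *from, *to; } subst[] = {
	{ "\xc4\x97", "e" },  /* ė (U+0117) */
	{ "\xc4\x85", "a" },  /* ą (U+0105) */
	{ "\xc5\xab", "u" },  /* ū (U+016B) */
};

/* Copies 'src' into 'dst' (of size 'dstlen'), applying the
 * substitutions; returns the output length. */
static size_t
fold_diacritics(const char *src, char *dst, size_t dstlen)
{
	size_t n = 0;

	while (*src && n + 1 < dstlen) {
		int matched = 0;
		for (size_t i = 0; i < sizeof(subst) / sizeof(subst[0]); i++) {
			const size_t flen = strlen(subst[i].from);
			if (strncmp(src, subst[i].from, flen) == 0) {
				const size_t tlen = strlen(subst[i].to);
				if (n + tlen + 1 > dstlen)
					goto out;
				memcpy(dst + n, subst[i].to, tlen);
				n += tlen;
				src += flen;
				matched = 1;
				break;
			}
		}
		if (!matched)
			dst[n++] = *src++;  /* pass other bytes through */
	}
out:
	dst[n] = '\0';
	return n;
}
```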

Handle delete of the index with multiple nginx workers

If the index gets deleted (see here), the other nginx workers won't pick it up. Since references use an LRU with TTL, it will eventually get G/C-ed and closed. However, this doesn't work if the index is deleted and immediately re-created. This can be fixed either on the OpenResty side or in the nxsearch lib.

Autogenerate doc ids

Options:

  • Generate doc ids internally in nxs
  • Generate doc ids in the Lua layer

Add support for query logics

Currently, all tokens are treated using OR logic ("science medicine" means "science" OR "medicine"). The query parser should support the following requirements:

  • AND, OR as well as NOT logic.
  • Alternative syntax using &, | and - (minus means exclusion, e.g. science -medicine)
  • Grouping: (C OR C++ OR Python) AND developer AND (Linux OR Unix OR BSD) AND NOT (C# OR Java)
  • Nesting: (software AND (engineer OR developer))
  • Exact terms using quotes, e.g. ("software engineer" OR "software developer")
  • Defined order of operations: does A | B & C mean (A | B) & C or A | (B & C)? I'd say stick with the standard precedence in logic: ¬, ∧, ∨. But left-to-right might also be an option.

Consider using re2c and lemon for lexer and parser.
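The precedence question above can be made concrete with a toy recursive-descent parser where NOT binds tightest, then AND, then OR (the standard convention). This sketch uses single-letter terms and re-emits the query fully parenthesized to show how it grouped; it is not the proposed re2c/lemon implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static void parse_or(const char **p, char *out, size_t outlen);

static void
skip_ws(const char **p) { while (**p == ' ') (*p)++; }

/* factor := '!' factor | '(' or-expr ')' | TERM (a single letter) */
static void
parse_factor(const char **p, char *out, size_t outlen)
{
	skip_ws(p);
	if (**p == '!') {
		char sub[128];
		(*p)++;
		parse_factor(p, sub, sizeof(sub));
		snprintf(out, outlen, "(NOT %s)", sub);
	} else if (**p == '(') {
		(*p)++;
		parse_or(p, out, outlen);
		skip_ws(p);
		if (**p == ')') (*p)++;
	} else {
		snprintf(out, outlen, "%c", *(*p)++);
	}
}

/* and-expr := factor { '&' factor } -- binds tighter than '|' */
static void
parse_and(const char **p, char *out, size_t outlen)
{
	parse_factor(p, out, outlen);
	for (skip_ws(p); **p == '&'; skip_ws(p)) {
		char rhs[128], combined[256];
		(*p)++;
		parse_factor(p, rhs, sizeof(rhs));
		snprintf(combined, sizeof(combined), "(%s AND %s)", out, rhs);
		snprintf(out, outlen, "%s", combined);
	}
}

/* or-expr := and-expr { '|' and-expr } -- lowest precedence */
static void
parse_or(const char **p, char *out, size_t outlen)
{
	parse_and(p, out, outlen);
	for (skip_ws(p); **p == '|'; skip_ws(p)) {
		char rhs[128], combined[256];
		(*p)++;
		parse_and(p, rhs, sizeof(rhs));
		snprintf(combined, sizeof(combined), "(%s OR %s)", out, rhs);
		snprintf(out, outlen, "%s", combined);
	}
}

/* Parse 'q' and write the fully parenthesized form into 'out'. */
static void
explain(const char *q, char *out, size_t outlen)
{
	parse_or(&q, out, outlen);
}
```

With this grammar, A | B & C parses as A | (B & C), which matches the proposed ¬ > ∧ > ∨ precedence.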

Implement document removal

  • Design a solution for the document removal from the index which would work well with concurrent searches.
  • Implement it.

Tokenizer rewrite

Currently, the tokenizer is a quickly written naive implementation: 1) it is not efficient, as it copies the whole input string; 2) it uses libc strtok_r() to replace the separator bytes with a NUL terminator, which may break UTF-8 strings.

Objective: write an efficient and UTF-8 compatible tokenizer, probably using a simple state machine, perhaps leveraging some UTF-8 helpers from the libicu Unicode library.

The new tokenizer should handle separators (dots, commas, etc) and other punctuation. Some cases to consider:

  • Basic punctuation: foo, bar producing foo and bar rather than foo, and bar.
  • Brackets and marking: the [client] is <...>, some *bold* or _underscore_ marks.
  • Broken spacing: Some.Text,which doesn't have spaces right;one;two;three..
  • Dashes: keep-alive, this--done (the em-dash stands for "is").
  • [Optional] Corner cases: U.S.A., I.B.M. or Dennis M. Ritchie.
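A minimal sketch of the non-destructive approach, assuming ASCII input only: scan for runs of alphanumeric bytes and return pointer/length pairs instead of writing NUL terminators into the input (the problem with strtok_r above). A real replacement would classify UTF-8 code points, e.g. with libicu helpers, instead of isalnum().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A token is a (pointer, length) view into the original input. */
typedef struct {
	const char *start;
	size_t len;
} token_t;

/* Fills 'tok' with the next token after *cursor, advancing the
 * cursor; returns 0 at the end of input.  Does not modify the
 * input string. */
static int
next_token(const char **cursor, token_t *tok)
{
	const char *s = *cursor;

	while (*s && !isalnum((unsigned char)*s))
		s++;  /* skip separators and punctuation */
	if (!*s)
		return 0;
	tok->start = s;
	while (*s && isalnum((unsigned char)*s))
		s++;
	tok->len = (size_t)(s - tok->start);
	*cursor = s;
	return 1;
}
```

On the "basic punctuation" case, scanning "foo, bar" yields the tokens foo and bar rather than "foo," and "bar".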

get_http_body: support large payloads

If the payload body is larger than the in-memory buffers, nginx will start using temporary files. The buffer sizes are specified here. The get_http_body() function here needs to be updated to support this. A potentially good way might be to memory-map the file from Lua.

Optimize idxdoc_get_termcount()

Currently, idxdoc_get_termcount() is O(n) and is therefore very inefficient. The t_index_limits test exposes this behaviour (slow search).

  • A simple fix: sort the term IDs on addition and do binary search on lookup.
  • Alternative: instead of sorting, an ordered list could be used with a data structure tailored for this purpose.
  • Another idea: term IDs are known in advance and they do not change, therefore it could use perfect hashing. Generating a perfect hash, though, can be expensive, and some middle ground could be explored, e.g. a nearly perfect hash with some linear-probing solution.
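The "simple fix" bullet can be sketched as follows: keep each document's term IDs sorted on insertion, then the lookup becomes O(log n) via binary search. The names below are illustrative, not nxsearch's internal structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_u32(const void *a, const void *b)
{
	const uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
	return (x > y) - (x < y);
}

/* Insert 'id' keeping the array sorted: O(log n) position search
 * plus an O(n) move; the caller must ensure capacity. */
static void
termlist_add(uint32_t *ids, size_t *n, uint32_t id)
{
	size_t lo = 0, hi = *n;

	while (lo < hi) {
		const size_t mid = (lo + hi) / 2;
		if (ids[mid] < id) lo = mid + 1; else hi = mid;
	}
	memmove(&ids[lo + 1], &ids[lo], (*n - lo) * sizeof(uint32_t));
	ids[lo] = id;
	(*n)++;
}

/* O(log n) membership test via binary search. */
static int
termlist_contains(const uint32_t *ids, size_t n, uint32_t id)
{
	return bsearch(&id, ids, n, sizeof(uint32_t), cmp_u32) != NULL;
}
```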

Sort and cap results

Currently, the results in nxs_resp_t are returned in arbitrary order. They should be ordered by score, and the result count should have a custom cap/limit, e.g. a default of 10k.
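The sort-and-cap step itself is straightforward; a sketch (result_t here is illustrative, not the actual nxs_resp_t layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative result record: a document ID with its relevance score. */
typedef struct {
	unsigned doc_id;
	float score;
} result_t;

static int
cmp_score_desc(const void *p1, const void *p2)
{
	const float a = ((const result_t *)p1)->score;
	const float b = ((const result_t *)p2)->score;
	return (a < b) - (a > b);  /* highest score first */
}

/* Sort by descending score and return the capped result count. */
static size_t
sort_and_cap(result_t *res, size_t n, size_t limit)
{
	qsort(res, n, sizeof(result_t), cmp_score_desc);
	return n < limit ? n : limit;
}
```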

Add support for synonyms

  • We want to add an API to support user-defined synonyms associated with the index.
  • Initially support query-time synonyms, e.g. search for "ewe" would result in search for "ewe" or "sheep".
  • Consider index-time synonyms where tokens are converted to some chosen generic form.
  • Consider using and integrating with a synonyms database for English. We want to handle, at least, cases like "calibre" vs "caliber", "defense" vs "defence", "tech" vs "technology", etc.
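As a minimal illustration of the query-time variant: a lookup that maps a term to a registered synonym, so that a search for "ewe" can be expanded to "ewe" OR "sheep". The static table here stands in for the proposed user-defined, per-index API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy synonym table; a real one would be user-defined per index. */
static const struct { const char *term, *syn; } syns[] = {
	{ "ewe", "sheep" },
	{ "calibre", "caliber" },
	{ "tech", "technology" },
};

/* Returns the registered synonym for 'term', or NULL if none. */
static const char *
synonym_lookup(const char *term)
{
	for (size_t i = 0; i < sizeof(syns) / sizeof(syns[0]); i++) {
		if (strcmp(syns[i].term, term) == 0)
			return syns[i].syn;
	}
	return NULL;
}
```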
