
nxsearch's Introduction

nxsearch


Work in progress. Upstream at: https://github.com/rmind/nxsearch

nxsearch is a full-text search engine library, also provided with a web server integration.

The engine is written in C11 and is distributed under the 2-clause BSD license.

Features

  • Lightweight, simple and fast.
  • BM25 and TF-IDF algorithms.
  • Integrates with the Snowball stemmer.
  • Supports fuzzy matching (using the BK-tree with Levenshtein distance).
  • Supports query logic operators, grouping, nesting, etc.
  • Supports filters in Lua for easy extensibility.
  • Basic UTF-8 and internationalization support.
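Fuzzy matching in the feature list is based on a BK-tree keyed by Levenshtein distance. As a refresher, a minimal edit-distance function (an illustrative sketch, not nxsearch's implementation) looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal Levenshtein edit distance (insert/delete/substitute, cost 1
 * each), using a single-row dynamic programming table. */
static unsigned
levenshtein(const char *a, const char *b)
{
	const size_t alen = strlen(a), blen = strlen(b);
	unsigned row[blen + 1];  /* current row of the DP table (C11 VLA) */

	for (size_t j = 0; j <= blen; j++)
		row[j] = (unsigned)j;

	for (size_t i = 1; i <= alen; i++) {
		unsigned prev_diag = row[0];  /* value of row[i-1][j-1] */
		row[0] = (unsigned)i;
		for (size_t j = 1; j <= blen; j++) {
			const unsigned prev = row[j];  /* row[i-1][j] */
			const unsigned subst = prev_diag + (a[i-1] != b[j-1]);
			const unsigned del = prev + 1, ins = row[j-1] + 1;
			row[j] = subst < del ? (subst < ins ? subst : ins)
			                     : (del < ins ? del : ins);
			prev_diag = prev;
		}
	}
	return row[blen];
}
```

A BK-tree then indexes terms by this metric so that a fuzzy lookup only needs to visit subtrees within the requested distance bound.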

Usage

To try as a web service:

# git submodule update --init --recursive  # ensure you have submodules
docker-compose up app  # spin up the service
open http://127.0.0.1:8000/docs  # documentation page

The NXS_BASEDIR environment variable specifies the base directory where the indexed documents as well as the application data files are stored.

Shell

# Create the index:
curl -XPOST http://127.0.0.1:8000/test-idx

# Index some test documents:
curl -d "cat dog cow" http://127.0.0.1:8000/test-idx/add/1
curl -d "dog cow" http://127.0.0.1:8000/test-idx/add/2
curl -d "cat cat cat" http://127.0.0.1:8000/test-idx/add/3

# Run a query:
curl -s -d "cat" http://127.0.0.1:8000/test-idx/search | jq

Documentation

  • Swagger UI with the endpoint documentation is provided at the /docs URL.

  • The Lua filters API can be found HERE.

  • The C API documentation can be found HERE.

nxsearch's People

Contributors

rmind, toloco


nxsearch's Issues

Add Swagger UI

  • Add Swagger UI for the nxsearch web service (serve it as a static file at /docs).
  • Ideally, it should be auto-generated from the comments in the nxsearch_svc.lua source.
  • Generate and add it as a build step for the nxsearch-svc image.

Implement concurrent compaction

Currently, the index structures are generally append-only: document removal uses tombstones (special markings) to indicate deletions. Many deletions therefore produce gaps in the index files, which wastes space. We want to address this problem by implementing compaction.

The following is a proposal for a concurrent compaction algorithm:

new-dtmap = exclusive-open-create
lock new-dtmap

// Initial concurrent sync (captures most of the data)
dtmap-sync from current-dtmap

lock current-dtmap
  // Sync any remaining data with the lock held if there was a race
  dtmap-sync from current-dtmap

  // Make the new index globally visible
  atomic-posix-rename new-dtmap.filename to current-dtmap.filename

  // Publish the compaction offset for the active index references
  atomic-store-release current-dtmap.compaction-offset <= last offset in new-dtmap
unlock current-dtmap
unlock new-dtmap
  • The active index references check for a compaction-offset change, re-open the index (picking up the new file) and sync from this offset. The existing references to the memory-mapped file itself, primarily idxdoc_t::offset, would require a sequential scan to be adjusted.
  • The next compaction may be triggered only when all active index references have synced. The index deletion and re-creation should be separately handled with the timestamp/generation number to avoid the ABA problem.
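The "publish the compaction offset" step above relies on release/acquire ordering: readers that observe the new offset must also observe the synced data written before it. A minimal sketch of that pattern in C11 atomics (the structure and function names here are illustrative, not nxsearch's actual internals):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared dtmap header; field name is illustrative. */
typedef struct {
	_Atomic uint64_t compaction_offset;
} dtmap_hdr_t;

/* Writer: after the atomic POSIX rename, publish the offset with
 * release semantics so that any reader observing the new value also
 * observes the fully synced data that precedes the store. */
static void
publish_compaction(dtmap_hdr_t *hdr, uint64_t last_off)
{
	atomic_store_explicit(&hdr->compaction_offset, last_off,
	    memory_order_release);
}

/* Reader: an active index reference polls the offset; a change means
 * it must re-open the index (picking up the new file) and sync from
 * this offset. */
static uint64_t
check_compaction(dtmap_hdr_t *hdr)
{
	return atomic_load_explicit(&hdr->compaction_offset,
	    memory_order_acquire);
}
```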

Add multi-field search support

  • Add support for fields (no nested schemas for now), e.g.:
{
    "title": "Short story",
    "content": "Once upon a time ...",
    "date": 1666111000
}
  • Selective search: title: "story" and content: "once"

Function to substitute diacritics

Need a function to substitute diacritics for normalization, e.g. replace ė with e. This will either be used by the built-in normalizer or added as a new filter. Note: the ICU library might have this functionality hidden somewhere.
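To illustrate the substitution idea, here is a toy fold function driven by a hand-written table of UTF-8 byte sequences. This is only a sketch: a real implementation would likely use ICU normalization (decompose, then strip combining marks) rather than an enumerated table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table: a few diacritics and their ASCII base
 * letters, keyed by their UTF-8 byte sequences. */
static const struct { const char *from, *to; } subst[] = {
	{ "\xc4\x97", "e" },  /* ė (U+0117) */
	{ "\xc4\x85", "a" },  /* ą (U+0105) */
	{ "\xc5\xab", "u" },  /* ū (U+016B) */
};

/* Copies 'src' into 'dst' (of size 'dstlen'), applying the
 * substitutions; returns the output length. */
static size_t
fold_diacritics(const char *src, char *dst, size_t dstlen)
{
	size_t n = 0;

	while (*src && n + 1 < dstlen) {
		int matched = 0;
		for (size_t i = 0; i < sizeof(subst) / sizeof(subst[0]); i++) {
			const size_t flen = strlen(subst[i].from);
			if (strncmp(src, subst[i].from, flen) == 0) {
				const size_t tlen = strlen(subst[i].to);
				if (n + tlen + 1 > dstlen)
					goto out;
				memcpy(dst + n, subst[i].to, tlen);
				n += tlen;
				src += flen;
				matched = 1;
				break;
			}
		}
		if (!matched)
			dst[n++] = *src++;  /* pass other bytes through */
	}
out:
	dst[n] = '\0';
	return n;
}
```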

Handle delete of the index with multiple nginx workers

If the index gets deleted (see here), the other nginx workers won't pick it up. Since references use an LRU with TTL, it will eventually get G/C-ed and closed. However, this doesn't work if the index is deleted and immediately re-created. This can be fixed either on the OpenResty side or in the nxsearch lib.

Autogenerate doc ids

Options:

  • Generate doc ids internally in nxs
  • Generate doc ids in the Lua layer

Add support for query logics

Currently, all tokens are treated using OR logic ("science medicine" means "science" OR "medicine"). The query parser should support the following requirements:

  • AND, OR as well as NOT logic.
  • Alternative syntax using &, | and - (minus means exclusion, e.g. science -medicine)
  • Grouping: (C OR C++ OR Python) AND developer AND (Linux OR Unix OR BSD) AND NOT (C# OR Java)
  • Nesting: (software AND (engineer OR developer))
  • Exact terms using quotes, e.g. ("software engineer" OR "software developer")
  • Defined order of operations: does A | B & C mean (A | B) & C or A | (B & C)? I'd say stick with the standard precedence in logic: ¬, ∧, ∨. But left-to-right might also be an option.

Consider using re2c and lemon for lexer and parser.
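The precedence question above can be made concrete with a toy recursive-descent parser where NOT binds tightest, then AND, then OR (the standard convention). This sketch uses single-letter terms and re-emits the query fully parenthesized to show how it grouped; it is not the proposed re2c/lemon implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static void parse_or(const char **p, char *out, size_t outlen);

static void
skip_ws(const char **p) { while (**p == ' ') (*p)++; }

/* factor := '!' factor | '(' or-expr ')' | TERM (a single letter) */
static void
parse_factor(const char **p, char *out, size_t outlen)
{
	skip_ws(p);
	if (**p == '!') {
		char sub[128];
		(*p)++;
		parse_factor(p, sub, sizeof(sub));
		snprintf(out, outlen, "(NOT %s)", sub);
	} else if (**p == '(') {
		(*p)++;
		parse_or(p, out, outlen);
		skip_ws(p);
		if (**p == ')') (*p)++;
	} else {
		snprintf(out, outlen, "%c", *(*p)++);
	}
}

/* and-expr := factor { '&' factor } -- binds tighter than '|' */
static void
parse_and(const char **p, char *out, size_t outlen)
{
	parse_factor(p, out, outlen);
	for (skip_ws(p); **p == '&'; skip_ws(p)) {
		char rhs[128], combined[256];
		(*p)++;
		parse_factor(p, rhs, sizeof(rhs));
		snprintf(combined, sizeof(combined), "(%s AND %s)", out, rhs);
		snprintf(out, outlen, "%s", combined);
	}
}

/* or-expr := and-expr { '|' and-expr } -- lowest precedence */
static void
parse_or(const char **p, char *out, size_t outlen)
{
	parse_and(p, out, outlen);
	for (skip_ws(p); **p == '|'; skip_ws(p)) {
		char rhs[128], combined[256];
		(*p)++;
		parse_and(p, rhs, sizeof(rhs));
		snprintf(combined, sizeof(combined), "(%s OR %s)", out, rhs);
		snprintf(out, outlen, "%s", combined);
	}
}

/* Parse 'q' and write the fully parenthesized form into 'out'. */
static void
explain(const char *q, char *out, size_t outlen)
{
	parse_or(&q, out, outlen);
}
```

With this grammar, A | B & C parses as A | (B & C), which matches the proposed ¬ > ∧ > ∨ precedence.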

Implement document removal

  • Design a solution for the document removal from the index which would work well with concurrent searches.
  • Implement it.

Tokenizer rewrite

Currently, the tokenizer is a quickly written naive implementation: 1) it is not efficient, as it copies the whole input string; 2) it uses libc strtok_r() to replace the separator bytes with a NUL terminator, which may break UTF-8 strings.

Objective: write an efficient and UTF-8 compatible tokenizer, probably using a simple state machine, perhaps leveraging some UTF-8 helpers from the libicu Unicode library.

The new tokenizer should handle separators (dots, commas, etc) and other punctuation. Some cases to consider:

  • Basic punctuation: foo, bar producing foo and bar rather than foo, and bar.
  • Brackets and marking: the [client] is <...>, some *bold* or _underscore_ marks.
  • Broken spacing: Some.Text,which doesn't have spaces right;one;two;three..
  • Dashes: keep-alive, this--done (the em-dash stands for "is").
  • [Optional] Corner cases: U.S.A., I.B.M. or Dennis M. Ritchie.
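A minimal sketch of the non-destructive approach, assuming ASCII input only: scan for runs of alphanumeric bytes and return pointer/length pairs instead of writing NUL terminators into the input (the problem with strtok_r above). A real replacement would classify UTF-8 code points, e.g. with libicu helpers, instead of isalnum().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A token is a (pointer, length) view into the original input. */
typedef struct {
	const char *start;
	size_t len;
} token_t;

/* Fills 'tok' with the next token after *cursor, advancing the
 * cursor; returns 0 at the end of input.  Does not modify the
 * input string. */
static int
next_token(const char **cursor, token_t *tok)
{
	const char *s = *cursor;

	while (*s && !isalnum((unsigned char)*s))
		s++;  /* skip separators and punctuation */
	if (!*s)
		return 0;
	tok->start = s;
	while (*s && isalnum((unsigned char)*s))
		s++;
	tok->len = (size_t)(s - tok->start);
	*cursor = s;
	return 1;
}
```

On the "basic punctuation" case, scanning "foo, bar" yields the tokens foo and bar rather than "foo," and "bar".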

get_http_body: support large payloads

If the payload body is larger than the in-memory buffers, nginx will start using temporary files. The buffer sizes are specified here. The get_http_body() function here needs to be updated to support this. A potentially good way might be to memory-map the file from Lua.

Optimize idxdoc_get_termcount()

Currently, idxdoc_get_termcount() is O(n) and is therefore very inefficient. The t_index_limits test exposes this behaviour (slow search).

  • A simple fix: sort the term IDs on addition and do binary search on lookup.
  • Alternative: instead of sorting, an ordered list could be used with a data structure tailored for this purpose.
  • Another idea: term IDs are known in advance and they do not change, therefore it could use perfect hashing. Generating a perfect hash, though, can be expensive, and some middle ground could be explored, e.g. a nearly perfect hash with some linear-probing solution.
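The "simple fix" bullet can be sketched as follows: keep each document's term IDs sorted on insertion, then the lookup becomes O(log n) via binary search. The names below are illustrative, not nxsearch's internal structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_u32(const void *a, const void *b)
{
	const uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
	return (x > y) - (x < y);
}

/* Insert 'id' keeping the array sorted: O(log n) position search
 * plus an O(n) move; the caller must ensure capacity. */
static void
termlist_add(uint32_t *ids, size_t *n, uint32_t id)
{
	size_t lo = 0, hi = *n;

	while (lo < hi) {
		const size_t mid = (lo + hi) / 2;
		if (ids[mid] < id) lo = mid + 1; else hi = mid;
	}
	memmove(&ids[lo + 1], &ids[lo], (*n - lo) * sizeof(uint32_t));
	ids[lo] = id;
	(*n)++;
}

/* O(log n) membership test via binary search. */
static int
termlist_contains(const uint32_t *ids, size_t n, uint32_t id)
{
	return bsearch(&id, ids, n, sizeof(uint32_t), cmp_u32) != NULL;
}
```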

Sort and cap results

Currently, the results in nxs_resp_t are returned in arbitrary order. They should be ordered by score, and the result count should have a custom cap/limit, e.g. a default of 10k.
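The sort-and-cap step itself is straightforward; a sketch (result_t here is illustrative, not the actual nxs_resp_t layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative result record: a document ID with its relevance score. */
typedef struct {
	unsigned doc_id;
	float score;
} result_t;

static int
cmp_score_desc(const void *p1, const void *p2)
{
	const float a = ((const result_t *)p1)->score;
	const float b = ((const result_t *)p2)->score;
	return (a < b) - (a > b);  /* highest score first */
}

/* Sort by descending score and return the capped result count. */
static size_t
sort_and_cap(result_t *res, size_t n, size_t limit)
{
	qsort(res, n, sizeof(result_t), cmp_score_desc);
	return n < limit ? n : limit;
}
```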

Add support for synonyms

  • We want to add an API to support user-defined synonyms associated with the index.
  • Initially support query-time synonyms, e.g. search for "ewe" would result in search for "ewe" or "sheep".
  • Consider index-time synonyms where tokens are converted to some chosen generic form.
  • Consider using and integrating with a synonyms database for English. We want to handle, at least, cases like "calibre" vs "caliber", "defense" vs "defence", "tech" vs "technology", etc.
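As a minimal illustration of the query-time variant: a lookup that maps a term to a registered synonym, so that a search for "ewe" can be expanded to "ewe" OR "sheep". The static table here stands in for the proposed user-defined, per-index API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy synonym table; a real one would be user-defined per index. */
static const struct { const char *term, *syn; } syns[] = {
	{ "ewe", "sheep" },
	{ "calibre", "caliber" },
	{ "tech", "technology" },
};

/* Returns the registered synonym for 'term', or NULL if none. */
static const char *
synonym_lookup(const char *term)
{
	for (size_t i = 0; i < sizeof(syns) / sizeof(syns[0]); i++) {
		if (strcmp(syns[i].term, term) == 0)
			return syns[i].syn;
	}
	return NULL;
}
```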
