
Let's discuss the upcoming FlexSearch v0.8 here: #415

FlexSearch.js: Next-Generation full text search library for Browser and Node.js

Web's fastest and most memory-flexible full-text search library with zero dependencies.

Basic Start  •  API Reference  •  Document Indexes  •  Using Worker  •  Changelog

Support this Project

You can help me keep this project alive with a personal donation, which also funds further development to cover your needs.

Donate using Open Collective Donate using Github Sponsors Donate using Liberapay Donate using Patreon Donate using Bountysource Donate using PayPal

When it comes to raw search speed, FlexSearch outperforms every other search library out there and also provides flexible search capabilities like multi-field search, phonetic transformations and partial matching.

Depending on the options used, it also provides the most memory-efficient index. FlexSearch introduces a new scoring algorithm called "contextual index", based on a pre-scored lexical dictionary architecture, which actually performs queries up to 1,000,000 times faster compared to other libraries. FlexSearch also provides a non-blocking asynchronous processing model, as well as web workers to perform any updates or queries on the index in parallel through dedicated balanced threads.

Supported Platforms:

  • Browser
  • Node.js

Library Comparison "Gulliver's Travels":

Plugins (external projects):

Get Latest

Build File CDN
flexsearch.bundle.js Download https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.bundle.js
flexsearch.light.js Download https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.light.js
flexsearch.compact.js Download https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.compact.js
flexsearch.es5.js * Download https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.es5.js
ES6 Modules Download The /dist/module/ folder of this Github repository

* The bundle "flexsearch.es5.js" includes polyfills for EcmaScript 5 Support.

Get Latest (NPM)

npm install flexsearch

Compare Web-Bundles

The Node.js package includes all features from flexsearch.bundle.js.

Feature (✓ = included)                                  bundle   compact   light
Presets                                                   ✓        ✓        -
Async Search                                              ✓        ✓        -
Workers (Web + Node.js)                                   ✓        -        -
Contextual Indexes                                        ✓        ✓        ✓
Index Documents (Field-Search)                            ✓        ✓        -
Document Store                                            ✓        ✓        -
Partial Matching                                          ✓        ✓        ✓
Relevance Scoring                                         ✓        ✓        ✓
Auto-Balanced Cache by Popularity                         ✓        -        -
Tags                                                      ✓        -        -
Suggestions                                               ✓        ✓        -
Phonetic Matching                                         ✓        ✓        -
Customizable Charset/Language (Matcher, Encoder,
Tokenizer, Stemmer, Filter, Split, RTL)                   ✓        ✓        ✓
Export / Import Indexes                                   ✓        -        -
File Size (gzip)                                        6.8 kb   5.3 kb   2.9 kb

Performance Benchmark (Ranking)

Run Comparison: Performance Benchmark "Gulliver's Travels"

Operations per second, higher is better, except for the test "Memory", where lower is better.

Rank Library Memory Query (Single Term) Query (Multi Term) Query (Long) Query (Dupes) Query (Not Found)
1 FlexSearch 17 7084129 1586856 511585 2017142 3202006
2 JSii 27 6564 158149 61290 95098 534109
3 Wade 424 20471 78780 16693 225824 213754
4 JS Search 193 8221 64034 10377 95830 167605
5 Elasticlunr.js 646 5412 7573 2865 23786 13982
6 BulkSearch 1021 3069 3141 3333 3265 21825569
7 MiniSearch 24348 4406 10945 72 39989 17624
8 bm25 15719 1429 789 366 884 1823
9 Lunr.js 2219 255 271 272 266 267
10 FuzzySearch 157373 53 38 15 32 43
11 Fuse 7641904 6 2 1 2 3

Load Library

There are 3 types of indexes:

  1. Index is a flat, high-performance index which stores id-content-pairs.
  2. Worker / WorkerIndex is also a flat index which stores id-content-pairs, but runs in the background as a dedicated worker thread.
  3. Document is a multi-field index which can store complex JSON documents (and may also consist of worker indexes).

Most of you will probably need just one of them, depending on your scenario.

Browser

Legacy ES5 Script Tag (Bundled)

<script src="node_modules/flexsearch/dist/flexsearch.bundle.min.js"></script>
<script>

    // FlexSearch is available on window.FlexSearch
    // Access FlexSearch static methods via bundled export (static class methods of FlexSearch)

    const index = new FlexSearch.Index(options);
    const document = new FlexSearch.Document(options);
    const worker = new FlexSearch.Worker(options);

</script>

ESM/ES6 Modules:

<script type="module">

    // FlexSearch is NOT available on window.FlexSearch
    // Access FlexSearch static methods by importing them explicitly

    import Index from "./node_modules/flexsearch/dist/module/index";
    import Document from "./node_modules/flexsearch/dist/module/document";
    import Worker from "./node_modules/flexsearch/dist/module/worker";

    const index = new Index(options);
    const document = new Document(options);
    const worker = new Worker(options);

</script>

ESM/ES6 Bundled Module:

<script type="module">

    // FlexSearch is NOT available on window.FlexSearch
    // Access FlexSearch static methods via bundled export (static class methods of FlexSearch)

    import FlexSearch from "./node_modules/flexsearch/dist/flexsearch.bundle.module.min.js";

    const index = new FlexSearch.Index(options);
    const document = new FlexSearch.Document(options);
    const worker = new FlexSearch.Worker(options);

</script>

Or via CDN:

<script src="https://cdn.jsdelivr.net/gh/nextapps-de/[email protected]/dist/flexsearch.bundle.min.js"></script>

AMD / CommonJS:

var FlexSearch = require("./node_modules/flexsearch/dist/flexsearch.bundle.min.js");

Node.js

npm install flexsearch

In your code include as follows:

const { Index, Document, Worker } = require("flexsearch");

const index = new Index(options);
const document = new Document(options);
const worker = new Worker(options);

Or:

const FlexSearch = require("flexsearch");

const index = new FlexSearch.Index(options);
const document = new FlexSearch.Document(options);
const worker = new FlexSearch.Worker(options);

Basic Usage and Variants

index.add(id, text);
index.search(text);
index.search(text, limit);
index.search(text, options);
index.search(text, limit, options);
index.search(options);
document.add(doc);
document.add(id, doc);
document.search(text);
document.search(text, limit);
document.search(text, options);
document.search(text, limit, options);
document.search(options);
worker.add(id, text);
worker.search(text);
worker.search(text, limit);
worker.search(text, options);
worker.search(text, limit, options);
worker.search(text, limit, options, callback);
worker.search(options);

The worker inherits from type Index and does not inherit from type Document. Therefore, a WorkerIndex basically works like a standard FlexSearch Index. Worker support in documents needs to be enabled by passing the appropriate option during creation: { worker: true }.

Every method called on a Worker index is treated as async. You will get back a Promise, or you can alternatively provide a callback function as the last parameter.
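The promise-or-callback duality can be sketched in plain JavaScript. This is an illustrative stand-in, not how FlexSearch implements it internally; "searchAsync" and its fake result are hypothetical:

```javascript
// Sketch of the async contract: return a Promise, or invoke a trailing callback.
function searchAsync(query, callback){
    // stand-in for the real background work of a worker index
    const promise = Promise.resolve(["match for " + query]);
    if(typeof callback === "function"){
        promise.then(callback);
        return; // callback style: nothing to return
    }
    return promise; // promise style
}

// promise style:
searchAsync("john").then(results => console.log(results));
// callback style:
searchAsync("john", results => console.log(results));
```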

API Overview

Global methods:

Index methods:

WorkerIndex methods:

Document methods:

* For each of those methods there exists an asynchronous equivalent:

Async Version:

Async methods return a Promise; alternatively, you can pass a callback function as the last parameter.

The methods export and import are always async, as is every method you call on a Worker-based index.

Options

FlexSearch is highly customizable. Making use of the right options can really improve your results, as well as memory economy and query time.

Index Options

preset
    Values: "memory" | "performance" | "match" | "score" | "default"
    The configuration profile, used as a shortcut or as a base for your custom settings.
    Default: "default"

tokenize
    Values: "strict" | "forward" | "reverse" | "full"
    The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function.
    Default: "strict"

cache
    Values: Boolean | Number
    Enable/disable the cache, and/or set the capacity of cached entries. When passing a number as the limit, the cache automatically balances stored entries by their popularity. Note: when just using "true", the cache has no limit and grows unbounded.
    Default: false

resolution
    Values: Number
    Sets the scoring resolution.
    Default: 9

context
    Values: Boolean | Context Options
    Enable/disable contextual indexing. When passing "true", the default context values are applied.
    Default: false

optimize
    Values: Boolean
    When enabled, a memory-optimized stack flow is used for the index.
    Default: true

boost
    Values: function(words[], term, index) => Float
    A custom boost function applied when indexing contents. It gets 3 parameters: the array of all words, the current term, and the index at which the term is placed in the word array. You can apply your own calculation, e.g. based on the occurrences of a term, and return a factor (< 1 lowers relevance, > 1 raises it). Note: this feature is currently limited to the "strict" tokenizer.
    Default: null
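As a sketch of the boost signature described above, the following hypothetical function lowers the relevance of terms that occur more than once (the scoring rule is an illustrative assumption, not a built-in):

```javascript
// Hypothetical boost function matching function(words[], term, index) => Float:
// a term occurring n > 1 times gets the factor 1/n (< 1 lowers relevance).
function boost(words, term, index){
    let occurrences = 0;
    for(let i = 0; i < words.length; i++){
        if(words[i] === term) occurrences++;
    }
    return occurrences > 1 ? 1 / occurrences : 1;
}

// Usage sketch (assumes flexsearch is installed; boost requires tokenize: "strict"):
// const index = new Index({ tokenize: "strict", boost: boost });
```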
Language-specific Options and Encoding:

charset
    Values: Charset Payload | String (key)
    Provide a custom charset payload or pass one of the keys of the built-in charsets.
    Default: "latin"

language
    Values: Language Payload | String (key)
    Provide a custom language payload or pass a language shorthand flag (ISO-3166) of a built-in language.
    Default: null

encode
    Values: false | "default" | "simple" | "balance" | "advanced" | "extra" | function(str) => [words]
    The encoding type. Choose one of the built-ins or pass a custom encoding function.
    Default: "default"

stemmer
    Values: false | String | Function
    Default: false

filter
    Values: false | String | Function
    Default: false

matcher
    Values: false | String | Function
    Default: false
Additional Options for Document Indexes:

worker
    Values: Boolean
    Enable/disable and set the count of running worker threads.
    Default: false

document
    Values: Document Descriptor
    Contains the definitions for the document index and storage.

Context Options

resolution
    Values: Number
    Sets the scoring resolution for the context.
    Default: 1

depth
    Values: false | Number
    Enable/disable contextual indexing and set the contextual distance of relevance. Depth is the maximum number of words/tokens a term may be away from another to still be considered relevant.
    Default: 1

bidirectional
    Values: Boolean
    Enables bidirectional search results. If enabled and the source text contains "red hat", it will be found for the queries "red hat" and "hat red".
    Default: true
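Putting the options above together, a contextual index with explicit values might be configured as follows. This is a configuration sketch only; it assumes flexsearch is installed and the chosen values are illustrative, not recommendations:

```javascript
const { Index } = require("flexsearch");

// Contextual index: depth 2 treats terms up to 2 words apart as related,
// bidirectional lets the query "hat red" match a document containing "red hat".
const index = new Index({
    tokenize: "strict",
    context: {
        resolution: 1,
        depth: 2,
        bidirectional: true
    }
});
```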

Document Options

id
    Values: String
    Default: "id"

tag
    Values: false | String
    Default: "tag"

index
    Values: String | Array<String> | Array<Object>

store
    Values: Boolean | String | Array<String>
    Default: false

Charset Options

split
    Values: false | RegExp | String
    The rule to split words when using a non-custom tokenizer (built-ins, e.g. "forward"). Use a string/char or a regular expression.
    Default: /[\W_]+/

rtl
    Values: Boolean
    Enables Right-To-Left encoding.
    Default: false

encode
    Values: function(str) => [words]
    The custom encoding function.
    Default: /lang/latin/default.js
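A custom encode function for this slot could look like the following sketch. The cleanup rules (lowercase, strip non-alphanumerics, split on whitespace) are illustrative assumptions, not the built-in latin encoder:

```javascript
// Matches the documented signature function(str) => [words]
function encode(str){
    return str.toLowerCase()
              .replace(/[^a-z0-9]/g, " ") // replace anything that's not a letter/digit
              .split(/\s+/)               // split into terms on whitespace
              .filter(term => term.length > 0);
}

// Usage sketch (assumes flexsearch): new Index({ encode: encode });
```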

Language Options

stemmer
    Values: false | String | Function
    Disable, or pass a language shorthand flag (ISO-3166), or a custom object.

filter
    Values: false | String | Function
    Disable, or pass a language shorthand flag (ISO-3166), or a custom array.

matcher
    Values: false | String | Function
    Disable, or pass a language shorthand flag (ISO-3166), or a custom array.

Search Options

limit
    Values: Number
    Sets the limit of results.
    Default: 100

offset
    Values: Number
    Applies an offset (skips items).
    Default: 0

suggest
    Values: Boolean
    Enables suggestions in results.
    Default: false

Document Search Options

In addition to the index search options above:

index
    Values: String | Array<String> | Array<Object>
    Sets the document fields which should be searched. When no field is set, all fields are searched. Custom options per field are also supported.

tag
    Values: String | Array<String>
    Sets one or more tags to filter results by.
    Default: false

enrich
    Values: Boolean
    Enriches the IDs from the results with the corresponding stored documents.
    Default: false

bool
    Values: "and" | "or"
    Sets the logical operator used when searching through multiple fields or tags.
    Default: "or"

Tokenizer (Prefix Search)

The tokenizer affects the required memory, as well as query time and the flexibility of partial matches. Try to choose the uppermost of these tokenizers which fits your needs:

"strict"
    Indexes whole words, e.g. "foobar".
    Memory factor (n = length of word): * 1

"forward"
    Incrementally indexes words in forward direction, e.g. "fo", "foo", "foob", "fooba", "foobar".
    Memory factor: * n

"reverse"
    Incrementally indexes words in both directions, e.g. the forward partials plus "ar", "bar", "obar", "oobar".
    Memory factor: * 2n - 1

"full"
    Indexes every possible combination, e.g. "oob", "oba", "bar".
    Memory factor: * n * (n - 1)

Encoders

Encoding affects the required memory, as well as query time and phonetic matches. Try to choose the uppermost of these encoders which fits your needs, or pass in a custom encoder:

false
    Turns off encoding.
    False positives: no. Compression: 0%

"default"
    Case-insensitive encoding.
    False positives: no. Compression: 0%

"simple"
    Case-insensitive encoding, charset normalizations.
    False positives: no. Compression: ~3%

"balance"
    Case-insensitive encoding, charset normalizations, literal transformations.
    False positives: no. Compression: ~30%

"advanced"
    Case-insensitive encoding, charset normalizations, literal transformations, phonetic normalizations.
    False positives: no. Compression: ~40%

"extra"
    Case-insensitive encoding, charset normalizations, literal transformations, phonetic normalizations, Soundex transformations.
    False positives: yes. Compression: ~65%

function()
    Pass a custom encoding function: function(string) => [words]
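A custom encoder in the spirit of the phonetic built-ins might look like this sketch. The two normalization rules are illustrative assumptions, not FlexSearch's actual transformations:

```javascript
// Hypothetical phonetic-style encoder matching function(string) => [words]
function encode(str){
    return str.toLowerCase()
              .replace(/ph/g, "f")       // naive phonetic normalization: "phone" -> "fone"
              .replace(/(.)\1+/g, "$1")  // collapse repeated characters: "pp" -> "p"
              .split(/\W+/)              // split into terms
              .filter(term => term.length > 0);
}
```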

Usage

Create a new index

var index = new Index();

Create a new index and choose one of the presets:

var index = new Index("performance");

Create a new index with custom options:

var index = new Index({
    charset: "latin:extra",
    tokenize: "reverse",
    resolution: 9
});

Create a new index and extend a preset with custom options:

var index = new FlexSearch({
    preset: "memory",
    tokenize: "forward",
    resolution: 5
});

See all available custom options.

Add text item to an index

Every content which should be added to the index needs an ID. When your content has no ID, you need to create one by passing an index, a count, or something else as the ID (a value of type number is highly recommended). Those IDs are unique references to a given content. This is important when you update or remove content through existing IDs. When referencing is not a concern, you can simply use something like count++.

Index.add(id, string)

index.add(0, "John Doe");

Search items

Index.search(string | options, <limit>, <options>)

index.search("John");

Limit the result:

index.search("John", 10);

Check existence of already indexed IDs

You can check if an ID was already indexed by:

if(index.contain(1)){
    console.log("ID is already in index");
}

Async

You can call each method in its async version, e.g. index.addAsync or index.searchAsync.

You can assign callbacks to each async function:

index.addAsync(id, content, function(){
    console.log("Task Done");
});

index.searchAsync(query, function(result){
    console.log("Results: ", result);
});

Or don't pass a callback function and get back a Promise instead:

index.addAsync(id, content).then(function(){
    console.log("Task Done");
});

index.searchAsync(query).then(function(result){
    console.log("Results: ", result);
});

Or use async and await:

async function add(){
    await index.addAsync(id, content);
    console.log("Task Done");
}

async function search(){
    const results = await index.searchAsync(query);
    console.log("Results: ", results);
}

Append Contents

You can append contents to an existing index like:

index.append(id, content);

This will not overwrite the previously indexed contents, as index.update(id, content) would do. Keep in mind that index.add(id, content) also performs an "update" under the hood when the id was already indexed.

Appended contents get their own context and also their own full resolution. Therefore, the relevance isn't stacked; the appended content gets a context of its own.

Let us take this example:

index.add(0, "some index");
index.append(0, "some appended content");

index.add(1, "some text");
index.append(1, "index appended content");

When you query index.search("index"), you will get id 1 as the first entry in the result, because the context starts from zero for the appended data (it isn't stacked onto the old context) and here "index" is the first term.

If you don't want this behavior, just use the standard index.add(id, content) and provide the full length of the content.

Update item from an index

Index.update(id, string)

index.update(0, "Max Miller");

Remove item from an index

Index.remove(id)

index.remove(0);

Add custom tokenizer

A tokenizer splits words/terms into components or partials.

Define a private custom tokenizer during creation/initialization:

var index = new FlexSearch({

    tokenize: function(str){

        return str.split(/\s-\//g);
    }
});

The tokenizer function gets a string as parameter and has to return an array of strings representing a word or term. In some languages every char is a term and terms are not separated by whitespace.

Add language-specific stemmer and/or filter

Stemmer: several linguistic mutations of the same word (e.g. "run" and "running")

Filter: a blacklist of words to be filtered out from indexing at all (e.g. "and", "to" or "be")

Assign a private custom stemmer or filter during creation/initialization:

var index = new FlexSearch({

    stemmer: {

        // object {key: replacement}
        "ational": "ate",
        "tional": "tion",
        "enci": "ence",
        "ing": ""
    },
    filter: [

        // array blacklist
        "in",
        "into",
        "is",
        "isn't",
        "it",
        "it's"
    ]
});

Using a custom filter, e.g.:

var index = new FlexSearch({

    filter: function(value){

        // just add values with length > 1 to the index

        return value.length > 1;
    }
});

Or assign stemmer/filters globally to a language:

Stemmers are passed as an object (key-value pairs), filters as an array.

FlexSearch.registerLanguage("us", {

    stemmer: { /* ... */ },
    filter:  [ /* ... */ ]
});

Or use some pre-defined stemmer or filter of your preferred languages:

<html>
<head>
    <script src="js/flexsearch.bundle.js"></script>
    <script src="js/lang/en.min.js"></script>
    <script src="js/lang/de.min.js"></script>
</head>
...

Now you can assign built-in stemmer during creation/initialization:

var index_en = new FlexSearch.Index({
    language: "en"
});

var index_de = new FlexSearch.Index({
    language: "de"
});

In Node.js, all built-in language pack files are available:

const { Index } = require("flexsearch");

var index_en = new Index({
    language: "en"
});

Right-To-Left Support

Set the tokenizer to at least "reverse" or "full" when using RTL.

Just set the field "rtl" to true and use a compatible tokenizer:

var index = new Index({
    encode: str => str.toLowerCase().split(/[^a-z]+/),
    tokenize: "reverse",
    rtl: true
});

CJK Word Break (Chinese, Japanese, Korean)

Set a custom tokenizer which fits your needs, e.g.:

var index = new FlexSearch.Index({
    encode: str => str.replace(/[\x00-\x7F]/g, "").split("")
});

You can also pass a custom encoder function to apply some linguistic transformations.

index.add(0, "一个单词");
var results = index.search("单词");

Index Documents (Field-Search)

The Document Descriptor

Assuming our document has a data structure like this:

{ 
    "id": 0, 
    "content": "some text"
}

Old syntax FlexSearch v0.6.3 (not supported anymore!):

const index = new Document({
    doc: {
        id: "id",
        field: ["content"]
    }
});

The document descriptor has slightly changed: there is no field branch anymore; instead, these keys are applied one level higher, so each key becomes a main member of the options.

In the new syntax the field "doc" was renamed to document and the field "field" was renamed to index:

const index = new Document({
    document: {
        id: "id",
        index: ["content"]
    }
});

index.add({ 
    id: 0, 
    content: "some text"
});

The field id describes where the ID or unique key lives inside your documents. The default key is "id" when not passed, so you can shorten the example from above to:

const index = new Document({
    document: {
        index: ["content"]
    }
});

The member index holds a list of the fields from your documents which you want to have indexed. When selecting just one field, you can pass a string. When also using the default key id, this shortens to just:

const index = new Document({ document: "content" });
index.add({ id: 0, content: "some text" });

Assuming you have several fields, you can add multiple fields to the index:

var docs = [{
    id: 0,
    title: "Title A",
    content: "Body A"
},{
    id: 1,
    title: "Title B",
    content: "Body B"
}];
const index = new Document({
    id: "id",
    index: ["title", "content"]
});

You can pass custom options for each field:

const index = new Document({
    id: "id",
    index: [{
        field: "title",
        tokenize: "forward",
        optimize: true,
        resolution: 9
    },{
        field:  "content",
        tokenize: "strict",
        optimize: true,
        resolution: 5,
        minlength: 3,
        context: {
            depth: 1,
            resolution: 3
        }
    }]
});

Field options are inherited when global options are also passed, e.g.:

const index = new Document({
    tokenize: "strict",
    optimize: true,
    resolution: 9,
    document: {
        id: "id",
        index:[{
            field: "title",
            tokenize: "forward"
        },{
            field: "content",
            minlength: 3,
            context: {
                depth: 1,
                resolution: 3
            }
        }]
    }
});

Note: the context options of the field "content" inherit from the corresponding field options, whereas the field options themselves inherit from the global options.

Nested Data Fields (Complex Objects)

Assume the document array looks more complex (has nested branches etc.), e.g.:

{
  "record": {
    "id": 0,
    "title": "some title",
    "content": {
      "header": "some text",
      "footer": "some text"
    }
  }
}

Then use the colon separated notation root:child:child to define hierarchy within the document descriptor:

const index = new Document({
    document: {
        id: "record:id",
        index: [
            "record:title",
            "record:content:header",
            "record:content:footer"
        ]
    }
});

Just add the fields you want to query against. Do not add fields to the index which you only need in the result (but do not query against). For this purpose you can store documents independently of the index (read below).

When you want to query through a field, you have to pass the exact key of the field you have defined in the doc as the field name (with colon syntax):

index.search(query, {
    index: [
        "record:title",
        "record:content:header",
        "record:content:footer"
    ]
});

Same as:

index.search(query, [
    "record:title",
    "record:content:header",
    "record:content:footer"
]);

Using field-specific options:

index.search([{
    field: "record:title",
    query: "some query",
    limit: 100,
    suggest: true
},{
    field: "record:title",
    query: "some other query",
    limit: 100,
    suggest: true
}]);

You can perform a search through the same field with different queries.

When passing field-specific options you need to provide the full configuration for each field. They are not inherited like the document descriptor.

Complex Documents

You need to follow 2 rules for your documents:

  1. The document cannot start with an array at the root. This would introduce sequential data, which isn't supported yet. See below for a workaround for such data.
[ // <-- not allowed as document start!
  {
    "id": 0,
    "title": "title"
  }
]
  2. The id can't be nested inside an array (and none of its parent fields can be an array either). This would introduce sequential data, which isn't supported yet. See below for a workaround for such data.
{
  "records": [ // <-- not allowed when ID or tag lives inside!
    {
      "id": 0,
      "title": "title"
    }
  ]
}

Here is an example of a supported complex document:

{
  "meta": {
    "tag": "cat",
    "id": 0
  },
  "contents": [
    {
      "body": {
        "title": "some title",
        "footer": "some text"
      },
      "keywords": ["some", "key", "words"]
    },
    {
      "body": {
        "title": "some title",
        "footer": "some text"
      },
      "keywords": ["some", "key", "words"]
    }
  ]
}

The corresponding document descriptor (when all fields should be indexed) looks like:

const index = new Document({
    document: {
        id: "meta:id",
        tag: "meta:tag",
        index: [
            "contents[]:body:title",
            "contents[]:body:footer",
            "contents[]:keywords"
        ]
    }
});

Again, when searching you have to use the same colon-separated string from your field definition:

index.search(query, { 
    index: "contents[]:body:title"
});

Not Supported Documents (Sequential Data)

This example breaks both rules from above:

[ // <-- not allowed as document start!
  {
    "tag": "cat",
    "records": [ // <-- not allowed when ID or tag lives inside!
      {
        "id": 0,
        "body": {
          "title": "some title",
          "footer": "some text"
        },
        "keywords": ["some", "key", "words"]
      },
      {
        "id": 1,
        "body": {
          "title": "some title",
          "footer": "some text"
        },
        "keywords": ["some", "key", "words"]
      }
    ]
  }
]

You need to apply some kind of structure normalization.

A workaround to such a data structure looks like this:

const index = new Document({
    document: {
        id: "record:id",
        tag: "tag",
        index: [
            "record:body:title",
            "record:body:footer",
            "record:body:keywords"
        ]
    }
});

function add(sequential_data){

    for(let x = 0, data; x < sequential_data.length; x++){

        data = sequential_data[x];

        for(let y = 0, record; y < data.records.length; y++){

            record = data.records[y];

            index.add({
                id: record.id,
                tag: data.tag,
                record: record
            });
        }
    }  
}

// now just use add() helper method as usual:

add([{
    // sequential structured data
    // take the data example above
}]);

You can skip the first loop when your document data has just a single object instead of an outer array.

Add/Update/Remove Documents to/from the Index

Add a document to the index:

index.add({
    id: 0,
    title: "Foo",
    content: "Bar"
});

Update index with a single object or an array of objects:

index.update({
    data:{
        id: 0,
        title: "Foo",
        body: {
            content: "Bar"
        }
    }
});

Remove a single object or an array of objects from the index:

index.remove(docs);

When the id is known, you can also simply remove by id (faster):

index.remove(id);

Join / Append Arrays

In the complex example above, the field keywords is an array, but the markup did not have brackets like keywords[]. That will also detect the array, but instead of appending each entry into its own context, the array will be joined into one large string and added to the index.

The difference between the two ways of adding array contents is the relevance when searching. When adding each item of an array via append() into its own context by using the syntax field[], the relevance of the last entry is on par with that of the first entry. When you leave out the brackets in the notation, the array is joined into one whitespace-separated string. Here the first entry has the highest relevance, whereas the last entry has the lowest relevance.

So, assuming the keywords from the example above are pre-sorted by relevance or popularity, you want to keep this order (the relevance information). For this purpose, do not add brackets to the notation. Otherwise, the entries would be placed into a new scoring context (and the old order would be lost).

You can also leave out the bracket notation for better performance and a smaller memory footprint. Use it when you do not need per-entry granularity of relevance.
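The two behaviors can be sketched in plain JavaScript (illustrative only, not FlexSearch internals):

```javascript
const keywords = ["alpha", "beta", "gamma"]; // assume pre-sorted by relevance

// Without brackets ("keywords"): the array is joined into one string,
// so "alpha" keeps the highest relevance and "gamma" the lowest.
const joined = keywords.join(" "); // "alpha beta gamma"

// With brackets ("keywords[]"): each entry would instead be appended into
// its own context, so all three entries start with the same relevance.
```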

Field-Search

Search through all fields:

index.search(query);

Search through a specific field:

index.search(query, { index: "title" });

Search through a given set of fields:

index.search(query, { index: ["title", "content"] });

Same as:

index.search(query, ["title", "content"]);

Pass custom modifiers and queries to each field:

index.search([{
    field: "content",
    query: "some query",
    limit: 100,
    suggest: true
},{
    field: "content",
    query: "some other query",
    limit: 100,
    suggest: true
}]);

You can perform a search through the same field with different queries.

See all available field-search options.

The Result Set

Schema of the result-set:

fields[] => { field, result[] => { document }}

The first index is an array of the fields the query was applied to. Each of these fields has a record (object) with the 2 properties "field" and "result". The "result" is also an array and includes the results for this specific field. The result could be an array of IDs or, when enriched, of stored document data.

A non-enriched result set now looks like:

[{
    field: "title",
    result: [0, 1, 2]
},{
    field: "content",
    result: [3, 4, 5]
}]

An enriched result set now looks like:

[{
    field: "title",
    result: [
        { id: 0, doc: { /* document */ }},
        { id: 1, doc: { /* document */ }},
        { id: 2, doc: { /* document */ }}
    ]
},{
    field: "content",
    result: [
        { id: 3, doc: { /* document */ }},
        { id: 4, doc: { /* document */ }},
        { id: 5, doc: { /* document */ }}
    ]
}]

When using pluck instead of "field" you can explicitly select just one field and get back a flat representation:

index.search(query, { pluck: "title", enrich: true });
[
    { id: 0, doc: { /* document */ }},
    { id: 1, doc: { /* document */ }},
    { id: 2, doc: { /* document */ }}
]

This result set is a replacement for "boolean search". Instead of applying your boolean logic to a nested object, you can apply your logic yourself on top of the result set dynamically. This opens up huge capabilities for how you process the results. Therefore, the results from the fields aren't squashed into one result anymore. That keeps important information, like the name of the field as well as the relevance of each field's results, which don't get mixed anymore.

A field search applies the query with boolean "or" logic by default. Each field has its own results for the given query.

There is one situation where the bool property is still supported: when you want to switch the default "or" logic of the field search to "and", e.g.:

index.search(query, { 
    index: ["title", "content"],
    bool: "and" 
});

You will just get results which contain the query in both fields. That's it.

Tags

Like the key for the ID just define the path to the tag:

const index = new Document({
    document: { 
        id: "id",
        tag: "tag",
        index: "content"
    }
});
index.add({
    id: 0,
    tag: "cat",
    content: "Some content ..."
});

Your data also can have multiple tags as an array:

index.add({
    id: 1,
    tag: ["animal", "dog"],
    content: "Some content ..."
});

You can perform a tag-specific search by:

index.search(query, { 
    index: "content",
    tag: "animal" 
});

This just gives you results which were tagged with the given tag.

Use multiple tags when searching:

index.search(query, { 
    index: "content",
    tag: ["cat", "dog"]
});

This gives you results which are tagged with at least one of the given tags.

Multiple tags apply as boolean "or" by default. Just one of the tags needs to exist.

This is another situation where the bool property is still supported: when you want to switch the default "or" logic of the tag search to "and", e.g.:

index.search(query, { 
    index: "content",
    tag: ["dog", "animal"],
    bool: "and"
});

You will only get results which contain both tags (in this example there is just one record which has both the tags "dog" and "animal").

Tag Search

You can also fetch results from one or more tags when no query was passed:

index.search({ tag: ["cat", "dog"] });

In this case the result-set looks like:

[{
    tag: "cat",
    result: [ /* all cats */ ]
},{
    tag: "dog",
    result: [ /* all dogs */ ]
}]

Limit & Offset

By default, every query is limited to 100 entries. Unbounded queries lead to issues. Set the limit as an option to adjust the size.

You can set the limit and the offset for each query:

index.search(query, { limit: 20, offset: 100 });

You cannot pre-count the size of the result set; that's a limitation by design of FlexSearch. When you really need a count of all results you are able to page through, just assign a high enough limit, get back all results, and apply your paging offset manually (this also works on the server side). FlexSearch is fast enough that this isn't an issue.
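
A manual paging layer over one bounded query might look like the following sketch (plain JavaScript; the fake `allResults` array and the `page` helper are assumptions standing in for a real `index.search(query, { limit: 1000 })` call):

```javascript
// Fetch once with a high enough limit, then page the flat ID array yourself;
// `allResults` stands in for the array of IDs returned by index.search().
const allResults = Array.from({ length: 250 }, (_, i) => i); // fake IDs

function page(results, pageSize, pageIndex){
    const offset = pageSize * pageIndex;
    return {
        total: results.length, // now a total count can be shown
        page: pageIndex,
        result: results.slice(offset, offset + pageSize)
    };
}

const p = page(allResults, 20, 5);
console.log(p.total, p.result[0]); // -> 250 100
```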

Document Store

Only a document index can have a store. To get this functionality when you only store ID/content pairs, use a document index instead of a flat index.

You can define independently which fields should be indexed and which fields should be stored. This way you can index fields which should not be included in the search result.

Do not use a store when: 1. an array of IDs as the result is good enough, or 2. you already have the contents/documents stored elsewhere (outside the index).

When the store attribute is set, you have to explicitly include all fields which should be stored (it acts like a whitelist).

When the store attribute is not set, the original document is stored as a fallback.

This will add the whole original content to the store:

const index = new Document({
    document: { 
        index: "content",
        store: true
    }
});

index.add({ id: 0, content: "some text" });

Access documents from internal store

You can get indexed documents from the store:

var data = index.get(1);

You can update/change store contents directly without changing the index by:

index.set(1, data);

To update the store and also update the index then just use index.update, index.add or index.append.

When you perform a query, whether on a document index or a flat index, you will always get back an array of IDs.

Optionally you can enrich the query results automatically with stored contents by:

index.search(query, { enrich: true });

Your results now look like:

[{
    id: 0,
    doc: { /* content from store */ }
},{
    id: 1,
    doc: { /* content from store */ }
}]

Configure Storage (Recommended)

This will add just specific fields from a document to the store (the ID doesn't need to be kept in the store):

const index = new Document({
    document: {
        index: "content",
        store: ["author", "email"]
    }
});

index.add(id, content);

You can configure independently what should be indexed and what should be stored. It is highly recommended to make use of this whenever you can.

Here is a useful example of configuring index and store:

const index = new Document({
    document: { 
        index: "content",
        store: ["author", "email"] 
    }
});

index.add({
    id: 0,
    author: "Jon Doe",
    email: "[email protected]",
    content: "Some content for the index ..."
});

You can query through the contents and get back the stored values instead:

index.search("some content", { enrich: true });

Your results now look like:

[{
    field: "content",
    result: [{
        id: 0,
        doc: {
            author: "Jon Doe",
            email: "[email protected]",
        }
    }]
}]

Neither the field "author" nor "email" was indexed, but both are returned from the store.

Chaining

Simply chain methods like:

var index = FlexSearch.create()
                      .addMatcher({'â': 'a'})
                      .add(0, 'foo')
                      .add(1, 'bar');
index.remove(0).update(1, 'foo').add(2, 'foobar');

Contextual Search

Note: This feature is disabled by default because of its extended memory usage. Read the section below for more information and how to enable it.

FlexSearch introduces a new scoring mechanism called Contextual Search, invented by Thomas Wilkerling, the author of this library. A contextual search boosts queries to a completely new level, but also requires some additional memory (depending on depth). The basic idea of this concept is to limit relevance by context instead of calculating relevance through the whole distance of the corresponding document. This way contextual search also improves the results of relevance-based queries on large amounts of text data.

Enable Contextual Scoring

Create an index and use the default context:

var index = new FlexSearch({

    tokenize: "strict",
    context: true
});

Create an index and apply custom options for the context:

var index = new FlexSearch({

    tokenize: "strict",
    context: { 
        resolution: 5,
        depth: 3,
        bidirectional: true
    }
});

Only the tokenizer "strict" is currently supported by the contextual index.

The contextual index requires an additional amount of memory, depending on depth.

Auto-Balanced Cache (By Popularity)

You need to initialize the cache and its limit during the creation of the index:

const index = new Index({ cache: 100 });
const results = index.searchCache(query);

A common scenario for using a cache is an autocomplete or instant search when typing.

When passing a number as the limit, the cache automatically balances stored entries by their popularity.

When just using "true", the cache is unbounded and actually performs 2-3 times faster (because the balancer does not have to run).

Worker Parallelism (Browser + Node.js)

The new worker model from v0.7.0 is divided by document fields (1 worker = 1 field index). This way each worker can solve its tasks (subtasks) completely on its own. The downside of this paradigm is that workers might not be perfectly balanced in storing contents (fields may have contents of different lengths). On the other hand, there is no indication that balancing the storage gives any advantage (in total they all require the same amount).

When using a document index, just apply the option "worker":

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

index.add({ 
    id: 1, tag: "cat", name: "Tom", title: "some", text: "some" 
}).add({
    id: 2, tag: "dog", name: "Ben", title: "title", text: "content" 
}).add({ 
    id: 3, tag: "cat", name: "Max", title: "to", text: "to" 
}).add({ 
    id: 4, tag: "dog", name: "Tim", title: "index", text: "index" 
});
Worker 1: { 1: "cat", 2: "dog", 3: "cat", 4: "dog" }
Worker 2: { 1: "Tom", 2: "Ben", 3: "Max", 4: "Tim" }
Worker 3: { 1: "some", 2: "title", 3: "to", 4: "index" }
Worker 4: { 1: "some", 2: "content", 3: "to", 4: "index" }

When you perform a field search through all fields, the task is balanced perfectly across all workers, which solve their subtasks independently.

Worker Index

Above we have seen that a document index automatically creates one worker per field. You can also create a WorkerIndex directly (just like using Index instead of Document).

Use as ES6 module:

import WorkerIndex from "./worker/index.js";
const index = new WorkerIndex(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");

Or when the bundled version is used instead:

var index = new FlexSearch.Worker(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");

Such a WorkerIndex works pretty much like a created instance of Index.

A WorkerIndex only supports the async variant of all methods. That means a call to index.search() on a WorkerIndex performs async in the same way index.searchAsync() does.

Worker Threads (Node.js)

The worker model for Node.js is based on "worker threads" and works exactly the same way:

const { Document } = require("flexsearch");

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

Or create a single worker instance for a non-document index:

const { Worker } = require("flexsearch");
const index = new Worker({ /* options */ });

The Worker Async Model (Best Practices)

A worker will always perform async. On a query method call you should always handle the returned promise (e.g. use await) or pass a callback function as the last parameter.

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

All requests and sub-tasks will run in parallel (prioritize "all tasks completed"):

index.searchAsync(query, callback);
index.searchAsync(query, callback);
index.searchAsync(query, callback);

Also (prioritize "all tasks completed"):

index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);

Or when you have just one callback when all requests are done, simply use Promise.all() which also prioritize "all tasks completed":

Promise.all([
    index.searchAsync(query),
    index.searchAsync(query),
    index.searchAsync(query)
]).then(callback);

Inside the callback of Promise.all() you will get an array of results as the first parameter, one entry for each query you passed in.

When using await you can prioritize the order (prioritize "first task completed") and resolve requests one by one, while the sub-tasks still run in parallel:

await index.searchAsync(query);
await index.searchAsync(query);
await index.searchAsync(query);

The same applies to index.add(), index.append(), index.remove() and index.update(). There is a special case which isn't disabled by the library, but you need to keep it in mind when using workers.

When you call the "synced" version on a worker index:

index.add(doc);
index.add(doc);
index.add(doc);
// contents aren't indexed yet,
// they are just queued on the message channel

Of course you can do that, but keep in mind that the main thread does not have an additional queue for distributed worker tasks. Running these in a long loop fires contents massively to the message channel via worker.postMessage() internally. Luckily, the browser and Node.js will handle such incoming tasks automatically (as long as enough free RAM is available). When using the "synced" version on a worker index, the content isn't indexed one line below, because all calls are treated as async by default.

When adding/updating/removing large bulks of content to the index (or high frequency), it is recommended to use the async version along with async/await to keep a low memory footprint during long processes.
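
The recommended bulk pattern looks like the following sketch. To keep the snippet self-contained, `addAsync` is simulated here with a plain async function backed by a Map; on a real worker index you would await the index's own async add method instead:

```javascript
// Simulated async add (stands in for an async add on a worker index):
const store = new Map();
async function addAsync(id, content){
    store.set(id, content);
}

async function bulkInsert(docs){
    // Awaiting each call keeps the memory footprint low during long
    // processes, instead of queueing thousands of messages at once:
    for (const [id, content] of docs){
        await addAsync(id, content);
    }
}

bulkInsert([[1, "some"], [2, "content"]])
    .then(() => console.log(store.size)); // -> 2
```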

Export / Import

Export

The export has slightly changed. It now consists of several smaller parts instead of one large bulk. You need to pass a callback function which takes 2 arguments, "key" and "data". This callback is called once for each part, e.g.:

index.export(function(key, data){ 
    
    // you need to store both the key and the data!
    // e.g. use the key for the filename and save your data
    
    localStorage.setItem(key, data);
});

Exporting data to localStorage isn't really a good practice, but if size is not a concern then use it if you like. The export primarily exists for usage in Node.js or to store indexes you want to delegate from a server to the client.

The size of the export corresponds to the memory consumption of the library. To reduce the export size you have to use a configuration with a smaller memory footprint (use the table at the bottom for information about configurations and their memory allocation).

When your save routine runs asynchronously you have to return a promise:

index.export(function(key, data){ 
    
    return new Promise(function(resolve){
        
        // do the saving as async

        resolve();
    });
});

You cannot export the additional table for the "fastupdate" feature. This table consists of references; when stored, it gets fully serialized and becomes too large. The library handles this automatically for you. When importing data, the index automatically disables "fastupdate".

Import

Before you can import data, you need to create your index first. For document indexes, provide the same document descriptor you used when exporting the data. This configuration isn't stored in the export.

var index = new Index({ ... });

To import the data just pass a key and data:

index.import(key, localStorage.getItem(key));

You need to import every key! Otherwise, your index will not work. You need to store the keys from the export and use these keys for the import (the order of the keys can differ).

This is just for demonstration and is not recommended, because you might have other keys in your localStorage which aren't supported as an import:

var keys = Object.keys(localStorage);

for(let i = 0, key; i < keys.length; i++){
    
    key = keys[i];
    index.import(key, localStorage.getItem(key));
}

Languages

Language-specific definitions are divided into two groups:

  1. Charset
    1. encode, type: function(string):string[]
    2. rtl, type: boolean
  2. Language
    1. matcher, type: {string: string}
    2. stemmer, type: {string: string}
    3. filter, type: string[]

The charset contains the encoding logic; the language contains the stemmer, the stopword filter and the matchers. Multiple language definitions can use the same charset encoder. This separation also lets you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.).

To fully describe a custom language on the fly you need to pass:

const index = FlexSearch({
    // mandatory:
    encode: (content) => [words],
    // optionally:
    rtl: false,
    stemmer: {},
    matcher: {},
    filter: []
});

When no parameters are passed, the latin:default schema is used.

| Field | Category | Description |
|---|---|---|
| encode | charset | The encoder function. Has to return an array of separated words (or an empty string). |
| rtl | charset | A boolean property which indicates right-to-left encoding. |
| filter | language | Filters are also known as "stopwords"; they completely exclude words from being indexed. |
| stemmer | language | Stemmers remove word endings and are a kind of "partial normalization". A word ending is only matched when the word length is bigger than the matched partial. |
| matcher | language | Matchers replace all occurrences of a given string regardless of its position and are also a kind of "partial normalization". |
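
To illustrate how these three definitions transform tokens, here is a deliberately simplified sketch in plain JavaScript. The sample rules are invented, and the real work happens inside the charset encoder; this only mirrors the described behavior:

```javascript
const filter  = ["and", "the"];     // stopwords: dropped entirely
const stemmer = { ational: "ate" }; // word ending: "relational" -> "relate"
const matcher = { ph: "f" };        // global replacement: "philipp" -> "filipp"

function pipeline(words){
    return words
        .filter(w => !filter.includes(w))
        .map(w => {
            // matchers replace occurrences at any position:
            for (const [from, to] of Object.entries(matcher))
                w = w.split(from).join(to);
            // stemmers only match endings on sufficiently long words:
            for (const [ending, repl] of Object.entries(stemmer))
                if (w.endsWith(ending) && w.length > ending.length)
                    w = w.slice(0, -ending.length) + repl;
            return w;
        });
}

console.log(pipeline(["the", "relational", "philipp"]));
// -> [ 'relate', 'filipp' ]
```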

1. Language Packs: ES6 Modules

The simplest way to assign charset/language specific encoding via modules is:

import charset from "./dist/module/lang/latin/advanced.js";
import lang from "./dist/module/lang/en.js";

const index = FlexSearch({
    charset: charset,
    lang: lang
});

Just import the default export of each module and assign it accordingly.

The fully qualified example from above is:

import { encode, rtl } from "./dist/module/lang/latin/advanced.js";
import { stemmer, filter, matcher } from "./dist/module/lang/en.js";

const index = FlexSearch({
    encode: encode,
    rtl: rtl,
    stemmer: stemmer,
    matcher: matcher,
    filter: filter
});

The example above shows the standard interface which is at least exported by each charset/language.

You can also define the encoder directly and omit all other options:

import simple from "./dist/module/lang/latin/simple.js";

const index = FlexSearch({
    encode: simple
});

Available Latin Encoders

  1. default
  2. simple
  3. balance
  4. advanced
  5. extra

You can assign a charset by passing it during initialization, e.g. charset: "latin" for the default charset encoder or charset: "latin:soundex" for an encoder variant.

Dialect / Slang

Language definitions (especially matchers) can also be used to normalize the dialect and slang of a specific language.

2. Language Packs: ES5 (Language Packs)

You need to make the charset and/or language definitions available in one of these ways:

  1. All charset definitions are included in the flexsearch.bundle.js build by default, but no language-specific definitions are included
  2. You can load packages located in /dist/lang/ (files refers to languages, folders are charsets)
  3. You can make a custom build

When loading language packs, make sure that the library was loaded before:

<script src="dist/flexsearch.light.js"></script>
<script src="dist/lang/latin/default.min.js"></script>
<script src="dist/lang/en.min.js"></script>

When using the full "bundle" version the built-in latin encoders are already included and you just have to load the language file:

<script src="dist/flexsearch.bundle.js"></script>
<script src="dist/lang/en.min.js"></script>

Because you load the packs as external packages (non-ES6 modules), you have to initialize them via shortcuts:

const index = FlexSearch({
    charset: "latin:soundex",
    lang: "en"
});

Use the charset:variant notation to assign a charset and one of its variants. Passing just the charset without a variant automatically resolves to charset:default.

You can also override existing definitions, e.g.:

const index = FlexSearch({
    charset: "latin",
    lang: "en",
    matcher: {}
});

Passed definitions will not extend the default definitions; they replace them.

When you want to extend a definition, just create a new language file and put all the logic in there.
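
Because passed definitions replace the defaults, one way to extend on the fly is to merge the shipped definition with your additions yourself before passing the result in. A sketch in plain JavaScript (`defaultMatcher` and its entries are invented stand-ins for a matcher shipped with a language pack):

```javascript
// `defaultMatcher` stands in for a matcher taken from a language pack;
// the concrete rules here are invented examples:
const defaultMatcher = { ae: "a", oe: "o" };

const customMatcher = {
    ...defaultMatcher, // keep the shipped rules
    ph: "f"            // and add your own on top
};

// Then pass the merged object during initialization, e.g.:
// const index = FlexSearch({ charset: "latin", lang: "en", matcher: customMatcher });
console.log(Object.keys(customMatcher)); // -> [ 'ae', 'oe', 'ph' ]
```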

Encoder Variants

It is pretty straightforward to use an encoder variant:

<script src="dist/flexsearch.light.js"></script>
<script src="dist/lang/latin/advanced.min.js"></script>
<script src="dist/lang/latin/extra.min.js"></script>
<script src="dist/lang/en.min.js"></script>

When using the full "bundle" version the built-in latin encoders are already included and you just have to load the language file:

<script src="dist/flexsearch.bundle.js"></script>
<script src="dist/lang/en.min.js"></script>

const index_advanced = FlexSearch({
    charset: "latin:advanced"
});

const index_extra = FlexSearch({
    charset: "latin:extra"
});

Partial Tokenizer

In FlexSearch you can't provide your own partial tokenizer, because it is a direct dependency of the core unit. The built-in tokenizer of FlexSearch splits each word into fragments by different patterns:

  1. strict (supports contextual index)
  2. forward
  3. reverse (including forward)
  4. full
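
The difference between these patterns can be sketched by listing which fragments each one would register for a single term. This is a simplified illustration, not the internal implementation (the minimum fragment length of 2 is an assumption here):

```javascript
// "strict": only the whole word is indexed
function strict(word){ return [word]; }

// "forward": all prefixes, e.g. "ca", "cat", "cats"
function forward(word){
    const out = [];
    for (let i = 2; i <= word.length; i++) out.push(word.slice(0, i));
    return out;
}

// "reverse": all prefixes plus all suffixes (includes forward)
function reverse(word){
    const out = forward(word);
    for (let i = word.length - 2; i > 0; i--) out.push(word.slice(i));
    return out;
}

// "full": every inner fragment
function full(word){
    const out = [];
    for (let i = 0; i < word.length; i++)
        for (let j = i + 2; j <= word.length; j++)
            out.push(word.slice(i, j));
    return out;
}

console.log(forward("cats")); // -> [ 'ca', 'cat', 'cats' ]
console.log(reverse("cats")); // -> [ 'ca', 'cat', 'cats', 'ts', 'ats' ]
```

More fragments mean better partial matching but a larger index, which matches the memory/matching trade-off shown in the comparison table below.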

Language Processing Pipeline

This is the default pipeline provided by FlexSearch:

Custom Pipeline

First take a look at the default pipeline in src/common.js. It is very simple and straightforward. The pipeline processes as some sort of inversion of control: the final encoder implementation has to handle the charset and also language-specific transformations. This workaround is left over from many tests.

Inject the default pipeline by e.g.:

this.pipeline(

    /* string: */ str.toLowerCase(),
    /* normalize: */ false,
    /* split: */ split,
    /* collapse: */ false
);

Use the pipeline schema from above to understand the iteration and the difference between pre-encoding and post-encoding. Stemmers and matchers need to be applied after charset normalization but before language transformations; the same goes for filters.

Here is a good example of extending pipelines: src/lang/latin/extra.js → src/lang/latin/advanced.js → src/lang/latin/simple.js.

How to contribute?

Search for your language in src/lang/; if it exists, you can extend it or provide variants (like dialect/slang). If the language doesn't exist, create a new file and check whether any of the existing charsets (e.g. latin) fits your language. When no charset fits, you need to provide a charset as a base for the language.

A new charset should provide at least:

  1. encode A function which normalizes the charset of a passed text content (removes special chars, applies lingual transformations, etc.) and returns an array of separated words. Stemmer, matcher or stopword filter also need to be applied here. When the language has no words, make sure to provide something similar, e.g. each Chinese sign could also be a "word". Don't return the whole text content without splitting.
  2. rtl A boolean flag which indicates right-to-left encoding

Basically, the charset just needs to provide an encoder function along with an indicator for right-to-left encoding:

export function encode(str){ return [str] }
export const rtl = false;

Encoder Matching Comparison

Reference String: "Björn-Phillipp Mayer"

| Query | default | simple | advanced | extra |
|---|---|---|---|---|
| björn | yes | yes | yes | yes |
| björ | yes | yes | yes | yes |
| bjorn | no | yes | yes | yes |
| bjoern | no | no | yes | yes |
| philipp | no | no | yes | yes |
| filip | no | no | yes | yes |
| björnphillip | no | yes | yes | yes |
| meier | no | no | yes | yes |
| björn meier | no | no | yes | yes |
| meier fhilip | no | no | yes | yes |
| byorn mair | no | no | no | yes |
| (false positives) | no | no | no | yes |

Memory Allocation

The book "Gulliver's Travels Swift Jonathan 1726" was fully indexed for the examples below.

The most memory-optimized meaningful setting will allocate just 1.2 Mb for the whole book indexed! This is probably the tiniest memory footprint you will get from a search library.

import { encode } from "./lang/latin/extra.js";

index = new Index({
    encode: encode,
    tokenize: "strict",
    optimize: true,
    resolution: 1,
    minlength: 3,
    fastupdate: false,
    context: false
});

Memory Consumption

The book "Gulliver's Travels" (Swift Jonathan 1726) was completely indexed for this test:


Compare Impact of Memory Allocation

by default a lexical index is very small:
depth: 0, bidirectional: 0, resolution: 3, minlength: 0 => 2.1 Mb

a higher resolution will increase the memory allocation:
depth: 0, bidirectional: 0, resolution: 9, minlength: 0 => 2.9 Mb

using the contextual index will increase the memory allocation:
depth: 1, bidirectional: 0, resolution: 9, minlength: 0 => 12.5 Mb

a higher contextual depth will increase the memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 0 => 21.5 Mb

a higher minlength will decrease memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 3 => 19.0 Mb

using bidirectional will decrease memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3 => 17.9 Mb

enable the option "fastupdate" will increase memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3 => 6.3 Mb

Full Comparison Table

Every search library is in constant competition between these 4 properties:

  1. Memory Allocation
  2. Performance
  3. Matching Capabilities
  4. Relevance Order (Scoring)

FlexSearch provides many parameters which you can use to adjust the optimal balance for your specific use case.

| Modifier | Memory Impact * | Performance Impact ** | Matching Impact ** | Scoring Impact ** |
|---|---|---|---|---|
| resolution | +1 (per level) | +1 (per level) | 0 | +2 (per level) |
| depth | +4 (per level) | -1 (per level) | -10 + depth | +10 |
| minlength | -2 (per level) | +2 (per level) | -3 (per level) | +2 (per level) |
| bidirectional | -2 | 0 | +3 | -1 |
| fastupdate | +1 | +10 (update, remove) | 0 | 0 |
| optimize: true | -7 | -1 | 0 | -3 |
| encoder: "icase" | 0 | 0 | 0 | 0 |
| encoder: "simple" | -2 | -1 | +2 | 0 |
| encoder: "advanced" | -3 | -2 | +4 | 0 |
| encoder: "extra" | -5 | -5 | +6 | 0 |
| encoder: "soundex" | -6 | -2 | +8 | 0 |
| tokenize: "strict" | 0 | 0 | 0 | 0 |
| tokenize: "forward" | +3 | -2 | +5 | 0 |
| tokenize: "reverse" | +5 | -4 | +7 | 0 |
| tokenize: "full" | +8 | -5 | +10 | 0 |
| document index | +3 (per field) | -1 (per field) | 0 | 0 |
| document tags | +1 (per tag) | -1 (per tag) | 0 | 0 |
| store: true | +5 (per document) | 0 | 0 | 0 |
| store: [fields] | +1 (per field) | 0 | 0 | 0 |
| cache: true | +10 | +10 | 0 | 0 |
| cache: 100 | +1 | +9 | 0 | 0 |
| type of ids: number | 0 | 0 | 0 | 0 |
| type of ids: string | +3 | -3 | 0 | 0 |

\* range from -10 to 10, lower is better (-10 => big decrease, 0 => unchanged, +10 => big increase)
\*\* range from -10 to 10, higher is better

Presets

  1. memory (primary optimize for memory)
  2. performance (primary optimize for performance)
  3. match (primary optimize for matching)
  4. score (primary optimize for scoring)
  5. default (the default balanced profile)

These profiles cover standard use cases. It is recommended to apply a custom configuration instead of using a profile to get the best out of your situation. Every profile could be optimized further for its specific task, e.g. an extreme performance-optimized configuration, extreme memory optimization, and so on.

You can pass a preset during creation/initialization of the index.

Best Practices

Use numeric IDs

It is recommended to use numeric id values as references when adding content to the index. The byte length of the passed ids influences memory consumption significantly. If this is not possible, you should consider using an index table and mapping the ids to numeric indexes; this becomes especially important when using contextual indexes on a large amount of content.
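
Such a mapping table can be as simple as the following sketch (plain JavaScript; the helper name and the example key are invented for illustration):

```javascript
// Map long string IDs to compact numeric indexes before adding to the index:
const ids = [];           // numeric index -> original id
const lookup = new Map(); // original id   -> numeric index

function toNumeric(id){
    let num = lookup.get(id);
    if (num === undefined){
        num = ids.length;
        ids.push(id);
        lookup.set(id, num);
    }
    return num;
}

// e.g. index.add(toNumeric("/some/long/document-url"), content);
const n = toNumeric("/some/long/document-url");
console.log(n, ids[n]); // -> 0 /some/long/document-url
```

When returning results, translate the numeric IDs back via `ids[n]`.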

Split Complexity

Whenever you can, try to divide content by categories and add each to its own index, e.g.:

var action = new FlexSearch();
var adventure = new FlexSearch();
var comedy = new FlexSearch();

This way you can also provide different settings for each category. This is actually the fastest way to perform a fuzzy search.

To make this workaround more extendable you can use a short helper:

var index = {};

function add(id, cat, content){
    (index[cat] || (
        index[cat] = new FlexSearch
    )).add(id, content);
}

function search(cat, query){
    return index[cat] ?
        index[cat].search(query) : [];
}

Add content to the index:

add(1, "action", "Movie Title");
add(2, "adventure", "Movie Title");
add(3, "comedy", "Movie Title");

Perform queries:

var results = search("action", "movie title"); // --> [1]

Splitting indexes by categories improves performance significantly.


Copyright 2018-2023 Thomas Wilkerling, Hosted by Nextapps GmbH
Released under the Apache 2.0 License

flexsearch's People

Contributors

0xflotus, aslafy-z, benmccann, bentley-atlassian, danawoodman, davfsa, derzade, desjob, elliot67, fossabot, gamtiq, giraffesyo, greenkeeper[bot], hasparus, itkg-mbaumann, itpiligrim, leyart, lionello, lukevp, maximilianmairinger, mhajder, millette, mlix8hoblc, rxminus, safareli, salim-b, tehshrike, thexeos, ts-thomas, vatz88


flexsearch's Issues

What are "depth" and "threshold"?

I don't know enough fulltext index terminology to infer what these two settings actually mean.

I'm guessing from context that "depth" is the maximum number of words/tokens away a term can be and still be considered relevant.

I have no idea what the "threshold" number implies. :-x

I know I want that sweet contextual searching, so I'd love to figure this out so I can pick numbers appropriate to my use case.

Any reason for all the weird linebreaks in flexsearch.js?

I say weird, but I should rather say… unconventional.

Like:

while(i < length){

                        tmp = arr[i++];

                        const index = "@" + tmp;

                        if(check[index]){

Are they on purpose?

If so, what is their purpose?

If not, could using tools like Prettier (or Prettier + ESLint) help?

How best to return unindexed data for each match (as well as the ID)?

For each item that matches a query, I'd like to be able to get unindexed arbitrary data — not just its ID.

For example: for matches when searching Shakespeare plays, I'd like to be able to return the text of an individual line (which is indexed) but also play name, location, speaker, etc.

What's the best way to achieve this?

I can do this in Elasticlunr (for example) like this:

const index = elasticlunr(function() {
    this.addField('text'); // doc property to be indexed
    this.setRef('id'); // doc property that is the ID of each item
    for (const doc of docs) {
      // doc includes additional arbitrary data for each item: play, speaker, location, etc.
      this.addDoc(doc); 
    }
}

Would I simply need to create an object that maps IDs with item data, or is there a better way to do this?

Great project by the way — thanks so much for building this.

Error while loading language files on node.js

I'm trying to load the language files to use e.g. with the stemmer, but I'm getting a TypeError: Cannot read property 'registerLanguage' of undefined error.

var FlexSearch = require('flexsearch')
require('flexsearch/lang/en')

The error seems to indicate that the flexsearch object is not in scope, but when I pass it as a global variable I get the same error. Am I missing something here?

Property 'length' of undefined when using web-worker

I tried setting the "worker" option to false and everything worked very well. But when I enable this option and set it to any number different than false, my console prints "Uncaught (in promise) TypeError: Cannot read property 'length' of undefined".

Here is the screenshot:

(screenshot not included)

I have around 30,000 items, that's why I want to use the web worker feature.

Any ideas? I can give more information if necessary.

Multiple documents update by query?

Hello, first of all, thanks for creating this nice new search engine. We are looking to use it instead of Elasticsearch, which is very complex, has lots of legacy in its DSL, and makes it difficult to get the desired results. Currently we are interested whether there are any plans to implement updating multiple documents by a single query? It's necessary, for example, to disable some products when their category is disabled.

Also, to avoid creating another ticket, I would like know if it is possible to boost search result based on numeric value stored in search index itself.

Thanks in advance.

Paging with multiple fields/boost.

Setup

var index = FlexSearch.create({
    doc: {
        id: "url",
        field: [
            "title",
            "content"
        ]
    }
});

Working

Invoke:

index.search(
    "test",
    {
        page: true,
        limit: 5
    })

Result:

{
  "page": "0",
  "next": "5",
  "result": [
    {
      "title": "Load Testing V. 1.0.1",
      "content": "test",
      "url": "/Project_Management/validations/validation2"
    },
    {
      "title": "Pre Test Inpsection Report",
      "content": "test",
      "url": "/V_and_V/5016-09-F21"
    },
    {
      "title": "Packaging Validaiton Test Report",
      "content": "test",
      "url": "/V_and_V/5016-09-F19"
    },
    {
      "title": "EMC 60601 Test Plan",
      "content": "test",
      "url": "/V_and_V/5016-09-F23"
    },
    {
      "title": "Third Party Testing",
      "content": "test",
      "url": "/3rd_Party_Testing"
    }
  ]
}

Not working

Invoke:

index.search(
    [
        {
            field: "title",
            query: "test",
            boost: 1
        },
        {
            field: "content",
            query: "test",
            boost: 0.5
        }
    ],
    {
        page: true,
        limit: 5
    })

Result:

{
  "page": "0",
  "next": null,
  "result": [
  ]
}

Comments

I need to be able to page the results, while also search multiple fields with different boost values.

Question on 0.7.0/Field boosting

I see the documentation on indexing different fields in a document has been fleshed out, which is great. I was wondering how that would work.

The readme claims that field searching is a thing in 0.7.0, but the changelog only goes up to 0.6.0 and the version on npm is 0.6.2 – what's the deal there?

Besides wondering where I could find 0.7.0, I have one question: how does boosting work?

I have a document with a title, and a body. I want matches in the title to count towards the score 10x more than matches in the body.

Could I achieve that by setting the boost on the title field to 10, and the boost on the body field to 1? Is that how boost works, or have I misguessed? What is the default boost for a field?
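For comparison, field boosting can also be emulated entirely on the client by searching each field separately and merging by a boost-weighted rank score. A hedged sketch (mergeWithBoost and its inverse-rank scoring are illustrative inventions, not FlexSearch API):

```javascript
// Merge per-field result lists into one ranked id list.
// `fieldResults` maps a field name to an ordered array of ids (best first);
// `boosts` maps a field name to its weight (default 1).
function mergeWithBoost(fieldResults, boosts) {
  const scores = new Map();
  for (const [field, ids] of Object.entries(fieldResults)) {
    const boost = boosts[field] || 1;
    ids.forEach((id, rank) => {
      // inverse-rank scoring: earlier hits count more, scaled by the boost
      const score = boost * (1 / (rank + 1));
      scores.set(id, (scores.get(id) || 0) + score);
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

With boosts `{ title: 10, body: 1 }`, a title hit outweighs a body hit ten to one, which matches the behaviour the question describes.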

TypeError: c is not a function

Thanks for this capability. I am excited to learn how this works for several use cases I have.

I ran your 'best practice' example with some modifications. I cannot find the source of the error:
"c is not a function"

Here is my code

const FlexSearch = require("flexsearch")

const bookstore = new FlexSearch();
const pizzashop = new FlexSearch();
const votingbooth = new FlexSearch();

let settings = {
    action: "score",
    adventure: {
        encode: "extra",
        tokenize: "strict",
        depth: 5,
        threhold: 5,
        doc: {
            id: "id",
            field: ["intent", "text"]
        }
    },
    comedy: {
        encode: "advanced",
        tokenize: "forward",
        threshold: 5
    }
}

let index = {}

const add = (id, cat, intent, text) => {
    console.log(gr(`Starting on Index ${id}`))
    console.log(`for ${cat}, ${intent}, ${text}`)
    try {
        (index[cat] || (
            index[cat] = new FlexSearch(settings[cat])
        )).add(id, intent, text);
    } catch (error) {
        console.log(error)
    }
}

const search = (cat, query) => {
    return index[cat] ? index[cat].search(query) : [];
}

let x = 0
training.map((t) => {
    console.log(b(`Creating index ${x}`))
    x++
    add(x, "bookstore", t.intent, t.text);
    add(x, "pizzashop", t.intent, t.text);
    add(x, "votingbooth", t.intent, t.text);
})

//add(1, "action", "Movie Title");
//add(2, "adventure", "Movie Title");
//add(3, "comedy", "Movie Title");

console.log(r(`THIS SHOULD EXECUTE LAST`))
//index.update(10025, "Road Runner");
//index.remove(10025);
var result1 = search("bookstore", "i am searching for a book"); // --> [1]
var result2 = search("pizzashop", "howdy"); // --> [1]
var result3 = search("votingboooth", "i need directions"); // --> [1]

console.log(`========== FAST SEARCH TEST ==========`)
console.log(result1)
console.log(result2)
console.log(result3)

The log shows an empty array

Error when search "john wick" on demo site

Hi,
Your work is great.

After playing around, I found an issue with your demo.

Searching for "john wi" works fine, but searching for "john wic" returns no results.

Could you check it?

(screenshot omitted)

Serializing as stream instead of string

I'm trying to create an index over a large dataset and I want to separate the script that's creating the index from the script that's using the index. The index creation seems to work very well, but when I use index.export(), I'm getting a RangeError: Invalid string length error. Is there a way to export the index as a file without getting this error? A possible solution would be to allow exporting via a stream that could be written to a file directly.

Thanks!
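Until streaming export is supported, one hedged workaround is to serialize the dump key by key as JSON lines, so no single string ever approaches the engine's length limit. A sketch, assuming the exported data is a plain object of key/value pairs (exportChunks/importChunks are illustrative names, not library API):

```javascript
// Emit the index dump as one self-contained JSON line per key, so every
// individual string stays small.
function* exportChunks(data) {
  for (const [key, value] of Object.entries(data)) {
    yield JSON.stringify({ key, value }) + '\n';
  }
}

// Rebuild the dump object from the JSON lines (empty lines are skipped).
function importChunks(lines) {
  const data = {};
  for (const line of lines) {
    if (!line.trim()) continue;
    const { key, value } = JSON.parse(line);
    data[key] = value;
  }
  return data;
}
```

Each chunk from exportChunks() can be written to an fs.createWriteStream() in a loop, and the file read back line by line for import.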

Error on compile.js

I tried to build this library with 'npm run build-compact' and got some errors like below :

/bin/sh: -c: line 0: unexpected EOF while looking for matching '' /bin/sh: -c: line 1: syntax error: unexpected end of file { Error: Command failed: java -jar node_modules/google-closure-compiler-java/compiler.jar --compilation_level=ADVANCED_OPTIMIZATIONS --use_types_for_optimization=true --new_type_inf=true --jscomp_warning=newCheckTypes --generate_exports=true --export_local_property_definitions=true --language_in=ECMASCRIPT6_STRICT --language_out=ECMASCRIPT6_STRICT --process_closure_primitives=true --summary_detail_level=3 --warning_level=VERBOSE --emit_use_strict=true --output_manifest=log/manifest.log --output_module_dependencies=log/module_dependencies.log --property_renaming_report=log/renaming_report.log' --js='flexsearch.js' --js='lang/**.js' --js='!lang/**.min.js' --define='RELEASE=compact' --define='DEBUG=false' --define='PROFILER=false' --define='SUPPORT_WORKER=false' --define='SUPPORT_ENCODER=true' --define='SUPPORT_CACHE=false' --define='SUPPORT_ASYNC=true' --define='SUPPORT_PRESETS=true' --define='SUPPORT_SUGGESTIONS=false' --define='SUPPORT_SERIALIZE=false' --define='SUPPORT_INFO=false' --define='SUPPORT_DOCUMENTS=true' --define='SUPPORT_WHERE=false' --define='SUPPORT_LANG_DE=false' --define='SUPPORT_LANG_EN=false' --js_output_file='dist/flexsearch.compact.js' && exit 0

and just found a simple error in 'compile.js(116:92)'.

exec("java -jar node_modules/google-closure-compiler-java/compiler.jar" + parameter + "' --js='flexsearch.js' --js='lang/**.js' --js='!lang/**.min.js'" + flag_str + " --js_output_file='dist/flexsearch." + (options["RELEASE"] || "custom") + ".js' && exit 0", function(){

After removing the unnecessary single quotation mark after `parameter + "`, the build process worked fine.
I think it's just a typo... maybe. 😓

Port of this library for Ruby

Hi @ts-thomas ,

I am a beginner at open source contribution. I want to work on a port of this library for Ruby. If possible, can you point me towards any reference/article/blog post related to the scoring algorithm and other implementations used in this library? If anyone is already working on a Ruby port, please let me know; I would also love to contribute to that project.

Exception thrown when searching for a value containing whitespace where suggest is set to true

Hi Thomas

Using the following example

const FlexSearch = require('./flexsearch')

const fs = new FlexSearch({
  encode: 'extra',
  tokenize: 'full',
  threshold: 1,
  depth: 4,
  resolution: 9,
  async: false,
  worker: 1,
  cache: true,
  suggest: true,
  doc: {
    id: 'id',
    field: [ 'intent', 'text' ]
  }
})

fs.add([
  {
    id: 0,
    intent: 'intent',
    text: 'text'
  }, {
    id: 1,
    intent: 'intent',
    text: 'howdy - how are you doing'
  }
])

console.log('INFO', fs.info())

const result = fs.search('howdy', { bool: 'or' })
console.log('RESULT', result)

const result2 = fs.search('howdy -', { bool: 'or' })
console.log('RESULT', result2)

An exception is thrown when using 'howdy -' as the search parameter. When setting suggest to false the search succeeds, but the search for 'howdy -' does not find any results.

The exception thrown is

.../search/flexsearch.js:3308
                    z = suggestions.length;
                                    ^

TypeError: Cannot read property 'length' of undefined
    at intersect (.../servers/search/flexsearch.js:3308:37)
    at FlexSearch.merge_and_sort (.../servers/search/flexsearch.js:1393:22)
    at FlexSearch.search (.../servers/search/flexsearch.js:1561:43)
    at Object.<anonymous> (.../servers/search/test2.js:33:19)
    at Module._compile (internal/modules/cjs/loader.js:734:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:745:10)
    at Module.load (internal/modules/cjs/loader.js:626:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:566:12)
    at Function.Module._load (internal/modules/cjs/loader.js:558:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:797:12)

In flexsearch on line 3068

function intersect(arrays, limit, cursor, suggest, bool, has_not) {

            let result = [];
            let suggestions;
            const length_z = arrays.length;

suggestions is not being assigned, because the while loop on line 3133 evaluates to false:

while(++z < length_z){

so the assignment of the suggestions variable on line 3211 is bypassed:

                    let found = false;

                    i = 0;
                    suggestions = [];

                    while(i < length){

The reason the search for howdy - fails when suggest is false is probably the options passed in. Should I implement my own tokenizer if I want to find queries like howdy -?

Thanks in advance

Regards
William

Distinct values and distinct count

Hello, is it possible to count the distinct values of a field and/or get the distinct values for some fields? For example, when searching products in a catalog, it's useful to know the distinct category ids of the results.
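FlexSearch has no aggregation API, but distinct values can be counted from the returned documents in a post-processing step. A minimal sketch (countDistinct is an illustrative helper, not part of the library):

```javascript
// Count how often each distinct value of `field` occurs in a result set.
function countDistinct(results, field) {
  const counts = new Map();
  for (const doc of results) {
    const value = doc[field];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return counts; // Map of distinct value -> number of occurrences
}
```

`counts.keys()` then yields the distinct category ids, and `counts.size` the distinct count.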

Search results depend on the order of fields

NOTE: I've rewritten the entire issue because I've found a way to reproduce my issue on a very small dataset.

I've noticed that I'm missing search results depending on the order of fields that I provide when creating the index.

In the following example, there are two objects where notation:0 matches the search term WW 8840, and one object where prefLabel:de matches WW 8840. In the first example, only the latter object is returned as a search result even though all fields are supposed to be searched. The second example returns the correct search results just by reordering the fields (putting notation:0 to the end). Note that when specifying notation:0 as the only field to search, it will return the correct results in both cases.

Non-working example (prints 1 and 2 even though the first query should return 3 results):

const FlexSearch = require("flexsearch")

let index = new FlexSearch({
  doc: {
    id: "uri",
    field: [
      "prefLabel:de",
      "notation",
      "editorialNote:de",
    ]
  },
  profile: "score"
})

// Example dataset
let concepts = [
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WW%208720%20-%20WW%209239"}],"created":"2012-07-05","editorialNote":{"de":"(Blutgruppen s. XD 3200)"},"http://www.w3.org/2004/02/skos/core#closeMatch":[{"uri":"http://d-nb.info/gnd/4130604-1"},{"uri":"http://d-nb.info/gnd/4022814-9"},{"uri":"http://d-nb.info/gnd/4070945-0"},{"uri":"http://d-nb.info/gnd/4074195-3"}],"identifier":["152145:13422"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WW 8840 - WW 8879","prefLabel":{"de":"Blutkörperchen (Erythrozyt, Leukozyt), Hämoglobin"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WW%208840%20-%20WW%208879"},
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WD%205000%20-%20WD%205970"}],"created":"2012-07-05","editorialNote":{"de":"(Antibiotika s. XI 3500)"},"http://www.w3.org/2004/02/skos/core#closeMatch":[{"uri":"http://d-nb.info/gnd/4155845-5"},{"uri":"http://d-nb.info/gnd/4276935-8"},{"uri":"http://d-nb.info/gnd/4176522-9"},{"uri":"http://d-nb.info/gnd/4175383-5"},{"uri":"http://d-nb.info/gnd/4148701-1"}],"identifier":["148204:"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WD 5380","prefLabel":{"de":"Pyrrolfarbstoffe, Cytochrome, Chromoproteine (Hämoglobin s. WW 8840)"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WD%205380"},
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WW%208840%20-%20WW%208879"}],"created":"2012-07-05","editorialNote":{},"identifier":["152145:13423"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WW 8840","prefLabel":{"de":"Allgemeines"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WW%208840"}
]

index.add(concepts)

let results
results = index.search("WW 8840")
console.log(results.length) // only matches the second concept (which mentions "WW 8840" in label)

results = index.search("WW 8840", {
  field: "notation"
})
console.log(results.length) // correctly matches two concepts
// with large dataset, also correctly matches the two concepts

Working example (prints 3 and 2 as expected, just by reordering fields):

const FlexSearch = require("flexsearch")

let index = new FlexSearch({
  doc: {
    id: "uri",
    field: [
      "prefLabel:de",
      "editorialNote:de",
      "notation",
    ]
  },
  profile: "score"
})

// Example dataset
let concepts = [
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WW%208720%20-%20WW%209239"}],"created":"2012-07-05","editorialNote":{"de":"(Blutgruppen s. XD 3200)"},"http://www.w3.org/2004/02/skos/core#closeMatch":[{"uri":"http://d-nb.info/gnd/4130604-1"},{"uri":"http://d-nb.info/gnd/4022814-9"},{"uri":"http://d-nb.info/gnd/4070945-0"},{"uri":"http://d-nb.info/gnd/4074195-3"}],"identifier":["152145:13422"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WW 8840 - WW 8879","prefLabel":{"de":"Blutkörperchen (Erythrozyt, Leukozyt), Hämoglobin"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WW%208840%20-%20WW%208879"},
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WD%205000%20-%20WD%205970"}],"created":"2012-07-05","editorialNote":{"de":"(Antibiotika s. XI 3500)"},"http://www.w3.org/2004/02/skos/core#closeMatch":[{"uri":"http://d-nb.info/gnd/4155845-5"},{"uri":"http://d-nb.info/gnd/4276935-8"},{"uri":"http://d-nb.info/gnd/4176522-9"},{"uri":"http://d-nb.info/gnd/4175383-5"},{"uri":"http://d-nb.info/gnd/4148701-1"}],"identifier":["148204:"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WD 5380","prefLabel":{"de":"Pyrrolfarbstoffe, Cytochrome, Chromoproteine (Hämoglobin s. WW 8840)"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WD%205380"},
  {"@context":"https://gbv.github.io/jskos/context.json","broader":[{"uri":"http://rvk.uni-regensburg.de/nt/WW%208840%20-%20WW%208879"}],"created":"2012-07-05","editorialNote":{},"identifier":["152145:13423"],"inScheme":[{"uri":"http://uri.gbv.de/terminology/rvk/"}],"modified":"2018-12-14","notation":"WW 8840","prefLabel":{"de":"Allgemeines"},"type":["http://www.w3.org/2004/02/skos/core#Concept"],"uri":"http://rvk.uni-regensburg.de/nt/WW%208840"}
]

index.add(concepts)

let results
results = index.search("WW 8840")
console.log(results.length) // correctly matches all three concepts

results = index.search("WW 8840", {
  field: "notation"
})
console.log(results.length) // correctly matches two concepts
// with large dataset, also correctly matches the two concepts

Any idea why this is happening? Thanks!

Remove Features: Where / Find / Tags

I'm thinking about removing these features:

  • index.find() (get document by ID will remain)
  • index.where()
  • tag fields
  • where clause in custom search

The main reasons for this are:

  • they do not scale properly; they are only useful up to a medium-sized document collection
  • tags cannot be serialized; instead they need to be recovered from the original documents, which slows down the import function
  • a custom helper function will replace this functionality and is also faster and less redundant

What do you think?

Settings get overridden

We use flexsearch in a React app. It performs pretty well, thanks!
We store the flexsearch settings in a constant outside of a component, and we store documents rather than key-value pairs.
The first initialization of the component works perfectly; all following ones behave incorrectly: the doc property is null. I guess flexsearch accesses the settings object by reference and somehow replaces the doc property.
(screenshot omitted)
Is this behavior expected?

Serialize/Deserialize for SSR ?

Does the library support serializing/deserializing the flexsearch object as JSON?
I'd love to create the index in Node, but deserialize the object in the browser for client-side searching.

Pagination: forwards and backwards

Getting the next page is not a problem, but the previous one is. When I request the previous page, I get an array instead of an object, and the page fields are missing.
Could you give a simple example of paginating back and forth?
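One hedged way to page backwards over a forward-only cursor API is to remember the cursor of every visited page on a stack. A sketch, assuming fetchPage(cursor) wraps index.search(query, { page: cursor, limit }) and returns { page, next, result } as in the documented pagination format (the Pager class itself is an illustrative invention):

```javascript
// Two-way paging over a cursor-only API: the stack holds the cursor of
// every page visited so far; the top of the stack is the current page.
class Pager {
  constructor(fetchPage) {
    this.fetchPage = fetchPage;
    this.stack = [true];   // `true` requests the first page
  }
  current() {
    return this.fetchPage(this.stack[this.stack.length - 1]);
  }
  next() {
    const res = this.current();            // learn the next cursor
    if (res.next !== null) this.stack.push(res.next);
    return this.current();                 // fetch and return the new page
  }
  prev() {
    if (this.stack.length > 1) this.stack.pop();
    return this.current();                 // stays on page 1 at the start
  }
}
```

next() fetches the current page once more to read its cursor, which is wasteful but keeps the sketch simple; caching the last response would avoid the double fetch.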

Multivalue attributes

What is the best way to handle documents with multi-value attributes?
For example, a document with an m:n relation to another entity.
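A common workaround (not a built-in feature) is to flatten a multi-value attribute into a single joined string field before indexing, and to declare that joined field in the document descriptor. A sketch with an illustrative flattenTags helper:

```javascript
// Join an array-valued attribute into one searchable string field.
// The joined field name defaults to `<field>_joined`.
function flattenTags(doc, field, joined = field + '_joined') {
  return { ...doc, [joined]: (doc[field] || []).join(' ') };
}
```

The index would then be created with e.g. `field: ["tags_joined"]`, and every document passed through flattenTags() before add().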

Relevance: can it be based on number of times a term occurs?

I would expect that if I search for a term, and that term appears once in document A but several times in document B, that B would have a higher position in the results than A. But that does not seem to be the case.

Example:

const FlexSearch = require(`flexsearch`)

const index = new FlexSearch({
	tokenize: `strict`,
	encode: `advanced`,
	cache: false,
	doc: {
		id: `id`,
		field: {
			content: {
				threshold: 9,
				resolution: 10,
			},
		},
	},
})

index.add([{
	id: 1,
	content: `billy bob thorton`,
}, {
	id: 2,
	content: `billy who now what billy okay so what now thorton?`,
}])

console.log(
	index.search(`billy`)
)
// => [ { id: 1, content: 'billy bob thorton' },
//  { id: 2,
//    content: 'billy who now what billy okay so what now thorton?' } ]

I would expect that a search for billy would have a higher score for document id 2 than document id 1, but the search returns document id 1 as the top result.

Tested with [email protected].
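If term frequency matters for your ranking, one hedged option is to re-rank the returned documents yourself, since the stored documents are available in the result. A sketch (rankByTermFrequency is an illustrative helper, not library API):

```javascript
// Re-rank results by how often the query terms occur in `field`,
// highest raw term frequency first.
function rankByTermFrequency(results, field, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const freq = doc => {
    const words = String(doc[field]).toLowerCase().split(/\s+/);
    return terms.reduce(
      (sum, term) => sum + words.filter(w => w.includes(term)).length, 0);
  };
  // sort a copy so the original result array is untouched
  return [...results].sort((a, b) => freq(b) - freq(a));
}
```

Applied to the example above, the document containing "billy" twice would come first.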

Cyrillic languages support

Hello,

I've run into the following behaviour.

This example works as expected:

const FlexSearch = require('flexsearch');
const index = new FlexSearch();

index.add(1, 'Foobar')
console.log(index.search('Foobar'));
// [ 1 ]

But this one shows no results.

const FlexSearch = require('flexsearch');
const index = new FlexSearch();

index.add(1, 'Фообар')
console.log(index.search('Фообар'));
// []

I've tested in node and in browser.
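The default encoders are Latin-oriented, which is the likely reason Cyrillic input is dropped. A hedged workaround is a custom encoder that only lowercases and normalizes non-letter characters, passed via the encode option (passing a function for encode follows the 0.6.x docs; treat the exact option shape as an assumption):

```javascript
// Script-agnostic encoder: lowercase, then collapse everything that is not
// a letter or digit (in any script, via Unicode property escapes) to spaces.
function encodeCyrillic(str) {
  return str
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, ' ')
    .trim();
}

// e.g. const index = new FlexSearch({ encode: encodeCyrillic });
```

This keeps Cyrillic (and other non-Latin) terms intact at the cost of losing the phonetic normalization of the built-in encoders.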

Benchmark with algolia ?

Can someone do a benchmark comparing this library and Algolia?
I just want to know if I should drop Algolia for a better alternative.
Thank you ;)

Development Roadmap (Please Participate)

Please make suggestions or give some feedback.

1. Extract Core Functionality

The extraction of the core functionality is basically required for many upcoming features as well as for still existing ones, like:

  • Plugin API
  • Custom Tooling
  • Language-specific ports or migrations
  • Pluggable Workflows
  • All kinds of extensions

These existing features have to remain as core functionality:

  1. Lexical Pre-Scored Index
  2. Contextual-based Map
  3. Index-related Settings:
    • threshold
    • resolution
    • depth
    • rtl
  4. Matching Tokens (Query)
  5. Cursor-based Pagination
  6. Logical Operators
  7. Cross-Process Intersection
  8. Index-based Suggestions

The basic core API should have these methods:

  1. create
  2. init
  3. add
  4. update
  5. remove
  6. destroy
  7. match (search)

These missing features also need to be integrated as core functionality:

  • Providing abstract I/O, supporting various kinds of index storage:
    • In-Memory
    • Partial Persistent Storage (persistent documents, in-memory index)
    • Storage-only (persistent documents, persistent index)

These functions should be extracted as an optional tooling:

  • System-specific Features (Browser, Node.js):
    • Web Worker
    • Async
  • Language-specific Features:
    • Encoder
    • Tokenizer
    • Matcher, Stemmer, Filter
  • Documents (Field-Search)
  • Custom Search
  • Find / Where / Tags
  • Export / Import (Serialization)
  • Cache
  • Presets

2. Plugin API

The plugin API is required to provide additional tooling and features in a modular and extendable manner. The plugin API should have these capabilities:

  1. Extend via ad hoc methods
  2. Extend via pipeline
  3. Extend via events (callbacks)
  4. Plugin Package Descriptor

3. Prerequisites

  1. Extract language-specific logic
  2. Provide process connectivity and refactor

4. Language Port

There are several requests for a TypeScript port. The advantage of TypeScript over plain JavaScript may be too small, since TypeScript also compiles to JavaScript and is less optimized than the Google Closure Compiler for this purpose.

Technically there are two targets:

  1. Browser
  2. System (OS)

Browsers are already covered, as is Node.js. A TypeScript port would not cover any additional ecosystem; only the formal codebase would differ, and in the end it is just a different pattern for the same result. That's why I prefer a browser-less, system-wide port over TypeScript. Rust is pretty close to TypeScript/JavaScript and covers target 2, so it might be a better candidate for a port.

There is no final decision at the moment, so let us discuss the pros and cons here.

Logical Operator (Please Vote)

Which kind of expression do you prefer?

1. required / optional / prohibited

var results = index.search([{
    field: "title",
    query: "foobar",
    presence: "required"
},{
    field: "body",
    query: "content",
    presence: "optional"
},{
    field: "blacklist",
    query: "xxx",
    presence: "prohibited"
}]);

2. and / or / not

var results = index.search([{
    field: "title",
    query: "foobar",
    bool: "and"
},{
    field: "body",
    query: "content",
    bool: "or"
},{
    field: "blacklist",
    query: "xxx",
    bool: "not"
}]);

3. + / -

var results = index.search([{
    field: "+title",
    query: "foobar"
},{
    field: "body",
    query: "content"
},{
    field: "-blacklist",
    query: "xxx"
}]);
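For what it's worth, variant 3 could be normalized into variant 1 internally with a tiny parser, so both syntaxes could coexist. A sketch (parsePresence is illustrative):

```javascript
// Map the "+field" / "-field" shorthand onto explicit presence values:
// "+" -> required, "-" -> prohibited, no prefix -> optional.
function parsePresence(queries) {
  return queries.map(({ field, query }) => {
    const prefix = field[0];
    const presence =
      prefix === '+' ? 'required' :
      prefix === '-' ? 'prohibited' : 'optional';
    return {
      field: presence === 'optional' ? field : field.slice(1),
      query,
      presence
    };
  });
}
```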

The benchmarks with different presets seem unfair

The benchmarks for the query and memory tests use different presets, yet are compared against a single configuration of the other libraries.
It would be helpful to compare flexsearch's performance across presets, while also showing a full, unbiased picture.

Can't destroy index if created with doc parameter

flexsearch version 0.5.1

Problem

Can't destroy index instance in the browser because of the error.

Details

Here is test HTML:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Benchmark Presets</title>
    <style>
        body{
            font-family: sans-serif;
        }
        table td{
            padding: 1em 2em;
        }
        button{
            padding: 5px 10px;
        }
    </style>
</head>
<body>
<div id="container"></div>
<script src="../dist/flexsearch.min.js"></script>
<script>
  (function(){
    var index = new FlexSearch({
        doc: {
            id: 'id',
            field: 'title'
        }
    });
    index.add([
      { id: 1, title: 'foo' },
      { id: 2, title: 'bar' }
    ])
    console.log(index.search('foo'))
    index.destroy()
  })();
</script>
</body>
</html>

Window console displays error:

 TypeError: a is undefined[Learn More]
flexsearch.min.js:33:45

Unexpected exception when attempting to call Index.search method

I tried to use a code example from the unit tests, but got the following error:

Code to reproduce:

const FlexSearch = require('flexsearch')

// tslint:disable

;(async () => {
  const index = new FlexSearch({
    async: true,
    doc: {
      id: 'id',
      field: [ 'data:name' ]
    }
  })

  const data = [{
    id: 2,
    data: {
      title: 'Title 3',
      body: 'Body 3'
    }
  }, {
    id: 1,
    data: {
      title: 'Title 2',
      body: 'Body 2'
    }
  }, {
    id: 0,
    data: {
      title: 'Title 1',
      body: 'Body 1'
    }
  }]

  await index.add(data)

  console.log(index.search)

  const result = await index.search({
    field: 'data:body',
    query: 'body'
  })

  console.dir(result)
})()

Output:

[Function]
(node:10016) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'search' of undefined
    at h.search (C:\Users\User\Documents\Projects\test\node_modules\flexsearch\dist\flexsearch.node.js:24:281)
    at C:\Users\User\Documents\Projects\test\index.js:38:30
    at process._tickCallback (internal/process/next_tick.js:43:7)
    at Function.Module.runMain (internal/modules/cjs/loader.js:778:11)
    at startup (internal/bootstrap/node.js:300:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:826:3)
(node:10016) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10016) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Environment: Node
Node version: v11.2.0
Flexsearch version: "^0.5.2"

Data doesn't get indexed

Hi

I am trying to run the example you posted in issue #30 without any luck.

Here is the code:

const FlexSearch = require('flexsearch')

// provide a document descriptor for each index
// the field "id" and at least one "field" is mandatory.

const settings = {
  'bookstore': {
    preset: 'score',
    doc: {
      id: 'id',
      field: ['intent', 'text']
    }
  },
  'pizzashop': {
    encode: 'extra',
    tokenize: 'strict',
    depth: 5,
    threshold: 5,
    doc: {
      id: 'id',
      field: ['intent', 'text']
    }
  },
  'votingbooth': {
    encode: 'advanced',
    tokenize: 'forward',
    threshold: 5,
    doc: {
      id: 'id',
      field: ['intent', 'text']
    }
  }
}

const index = {}

const add = (cat, doc) => {
  const i = index[cat] || (
    index[cat] = new FlexSearch(settings[cat])
  )
  i.add(doc)
}

const search = (cat, query) => {
  return index[cat] ? index[cat].search(query) : []
}

// provide documents which have the same structure as defined in the document descriptor above

const bookstore = [{
  id: 0,
  intent: 'intent',
  text: 'text'
}, {
  id: 1,
  intent: 'intent',
  text: 'i am searching for a book'
}]

const pizzashop = [{
  id: 0,
  intent: 'intent',
  text: 'text'
}, {
  id: 1,
  intent: 'intent',
  text: 'howdy'
}]

const votingbooth = [{
  id: 0,
  intent: 'intent',
  text: 'text'
}, {
  id: 1,
  intent: 'intent',
  text: 'i need directions'
}]

// add a full document or an array of documents to the index

add('bookstore', bookstore)
add('pizzashop', pizzashop)
add('votingbooth', votingbooth)

console.log('INFO', index['bookstore'].info())
console.log('INFO', index['pizzashop'].info())
console.log('INFO', index['votingbooth'].info())

console.log('INFO', index['bookstore'])
// search

const result1 = search('bookstore', 'i am searching for a book') // --> [1]
const result2 = search('pizzashop', 'howdy') // --> [1]
const result3 = search('votingbooth', 'i need directions') // --> [1]

console.log('========== FAST SEARCH TEST ==========')
console.log(result1)
console.log(result2)
console.log(result3)

and the output I get is:

INFO { id: 0,
  memory: 0,
  items: 0,
  sequences: 0,
  chars: 0,
  cache: false,
  matcher: 0,
  worker: undefined,
  threshold: 1,
  depth: 4,
  contextual: true }
INFO { id: 3,
  memory: 0,
  items: 0,
  sequences: 0,
  chars: 0,
  cache: false,
  matcher: 0,
  worker: undefined,
  threshold: 5,
  depth: 5,
  contextual: true }
INFO { id: 6,
  memory: 0,
  items: 0,
  sequences: 0,
  chars: 0,
  cache: false,
  matcher: 0,
  worker: undefined,
  threshold: 5,
  depth: 0,
  contextual: 0 }
INFO k {
  id: 0,
  o: [],
  f: 'strict',
  w: false,
  async: false,
  threshold: 1,
  b: 9,
  depth: 4,
  C: false,
  m: false,
  s: [Function: bound ],
  a:
   { id: [ 'id' ],
     field: [ [Array], [Array] ],
     index: { intent: [k], text: [k] },
     keys: [ 'intent', 'text' ] },
  h:
   [ [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {},
     [Object: null prototype] {} ],
  i: [Object: null prototype] {},
  c: [Object: null prototype] {},
  g:
   [Object: null prototype] {
     '0': { id: 0, intent: 'intent', text: 'text' },
     '1':
      { id: 1, intent: 'intent', text: 'i am searching for a book' } },
  v: true,
  cache: false,
  j: false }
========== FAST SEARCH TEST ==========
[]
[]
[]

Am I missing something?

Node version: 11.9.0

Thanks in advance...

Sorting

Pretty neat. Performs really well.

I read in #7: "Flexsearch is a micro library whose complexity we want to keep as low as possible in the core."

What about sorting? We are currently considering replacing our list filters with flexsearch. It would be nice to use the same index for sorting as well.

How does suggestion work?

I tried to activate the suggest option, but it does not change anything in the result. How does it work?

thanks.

Contextual scoring doesn't seem to be working

When I set a depth, I would expect that if I search for multiple terms, documents that contain those terms near each other would score higher.

Example:

const FlexSearch = require(`flexsearch`)

const index = new FlexSearch({
	tokenize: `strict`,
	encode: `advanced`,
	cache: false,
	doc: {
		id: `id`,
		field: {
			content: {
				threshold: 9,
				resolution: 10,
				depth: 2,
			},
		},
	},
})

index.add([{
	id: 1,
	content: `billy who now what billy okay so what now thorton?`,
}, {
	id: 2,
	content: `billy bob thorton`,
}])

console.log(
	index.search(`billy thorton`)
)
// => [ { id: 1,
//    content: 'billy who now what billy okay so what now thorton?' },
//  { id: 2, content: 'billy bob thorton' } ]

I would expect document id 2 to be the top result, since it contains "billy" and "thorton" within two words of each other, but the top result is actually document id 1.

Tested in [email protected].

Results are not unique when matches in more than one field

I expected matching documents to be unique within the result. What is the reasoning behind repeating them?

Example:

const f = new FlexSearch({
	doc: {
		id: 'id',
		field: ['field1', 'field2']
	}
})

const docs = [
	{id: 1, field1: 'phrase', field2: 'phrase'}
]

f.add(docs)
console.log(f.search('phrase'))
// Result = [{id: 1, field1: "phrase", field2: "phrase"}, {id: 1, field1: "phrase", field2: "phrase"}]
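Until the duplicates are resolved in the library, a hedged client-side fix is to keep only the first result per document id. A sketch with an illustrative dedupeById helper:

```javascript
// Drop every result after the first occurrence of a given document id.
function dedupeById(results, idField = 'id') {
  const seen = new Set();
  return results.filter(doc => {
    const id = doc[idField];
    if (seen.has(id)) return false;
    seen.add(id);
    return true;
  });
}
```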

Contextual Search documentation is missing

The readme includes the line

Note: This feature is actually not enabled by default. Read here how to enable.

but the "here" link doesn't go to any page, and I can't find the intended target in the repo :-o

How to create an index for a book

Hey,
First of all, thanks for the amazing library!

I would like to know if it's possible to index a book and get back the topic name, sub-topic title, and paragraph number, and whether it is possible to find two paragraphs together.
For example:

book:
[
    {
        "topic": "topic",
        "content": [
            {
                "title": "title1",
                "parts": [
                    "word1, word2, word3, word4, word5",
                    "word6, word7, word8, word9, word10",
                ]
            }
        ]
    }
]

index.search("word2 word3") // = [{topic: "topic1", title: "title1", part: 0}]
index.search("word5 word6") // = [{topic: "topic1", title: "title1", part: 0}, {topic: "topic1", title: "title1", part: 0}]

Thanks
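One hedged way to model this is to flatten the nested book structure into one document per paragraph before indexing, so each hit carries its topic, title, and part number. A sketch (flattenBook and the field names are illustrative, mirroring the example above):

```javascript
// Turn the nested book structure into flat documents suitable for a
// FlexSearch document index (doc: { id: "id", field: ["text"] }).
function flattenBook(book) {
  const docs = [];
  let id = 0;
  for (const chapter of book) {
    for (const section of chapter.content) {
      section.parts.forEach((text, part) => {
        docs.push({
          id: id++,
          topic: chapter.topic,
          title: section.title,
          part,              // paragraph number within the section
          text
        });
      });
    }
  }
  return docs;
}
```

Searching across two adjacent paragraphs (like "word5 word6") would additionally require indexing overlapping windows of paragraph pairs, since each document here holds exactly one paragraph.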
