
meilisearch's Introduction


⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

[Screenshots: a bright-colored and a dark-colored application for finding movies screening near the user]

🔥 Try it! 🔥

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

[Logos of languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP]

โš™๏ธ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

๐Ÿ‘ฉโ€๐Ÿ’ป Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.


meilisearch's Issues

Rank results following rules

Create a RankedStreamBuilder with settings to customize the sorting algorithm.

This builder produces a RankedStream, which streams the results using a StreamWithState (carrying the levenshtein state) and sorts them with some possible optimisations.

The internal StreamWithState returns each matching word (with the right levenshtein distance) associated with the document id, the word index in the query, the attribute where the word is located, and the position of the word in this attribute.

This raw data must be sorted to match the following rules:

  1. Typo: Will rank higher a word that doesn't contain a typing mistake.
  2. Words: Will rank higher an object matching more of the words typed by the user.
  3. Proximity: Will rank higher words close to each other (George Clooney is better than George word Clooney).
  4. Attribute: Will rank higher a match in a more important attribute (Title, Description).
  5. Words position: Will rank higher words that are at the beginning of an attribute.
  6. Exact: Will rank higher words that match exactly without any suffix (if prefix search is supported).

The first step in following the rules is to collect results (document id, word index in the query, attribute, and position) in a collection to help sort them.

But these rules must be applied using bucket sorting: if, after sorting results with the first rule, some matches are equal, use the next rule to break the tie, and so on whenever equalities are detected.

The typo rule is not the most complex to implement: sort documents according to the levenshtein distance only. If there are multiple words, take the maximum of the minimum distances of the query words.

The words rule is straightforward: the index of the word in the query is returned in the result of the internal StreamWithState stream, so we know which query word each result refers to and can simply count the number of distinct matching words ([1, 2, 3] is better than [1, 2] or [1, 1, 1]).

The proximity rule can be implemented by sorting the results by attribute first, then by the word index in the query, and finally by the index in the attribute. Once the results for each document are sorted, sort documents by the smallest sum of distances between each query word.

The attribute rule has already been fulfilled in the previous rule application.

The words position rule has already been granted by the previous sorting: select the different word indexes in the query, take the one nearest to the start of the attribute, and rank documents based on these values (lower is better).

The exact rule is not defined for the moment.

Conclusion

The best way to implement all of these rules is to create a collection for each document that matches something. Documents can be added as the search progresses; the collections contain structures with the index of the word in the query, followed by the distance of the match, the attribute in which the word matched, and finally the index of this word in the attribute.
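A minimal Rust sketch of that per-document collection (all names are illustrative, not the engine's actual types), with the typo rule applied as an example:

use std::collections::HashMap;

// One raw match produced by the internal stream.
#[derive(Debug, Clone, Copy)]
struct Match {
    query_index: u32, // index of the word in the query
    distance: u8,     // levenshtein distance of the match
    attribute: u16,   // attribute in which the word matched
    word_index: u32,  // position of the word inside that attribute
}

fn main() {
    // Collect matches per document as the search progresses.
    let mut matches: HashMap<u64, Vec<Match>> = HashMap::new();
    matches.entry(42).or_default().push(Match {
        query_index: 0, distance: 1, attribute: 0, word_index: 3,
    });

    // Rule 1 (typo): fewer typing mistakes ranks higher, so order
    // documents by the total distance of their matches.
    let mut docs: Vec<(u64, Vec<Match>)> = matches.into_iter().collect();
    docs.sort_by_key(|(_, ms)| ms.iter().map(|m| u32::from(m.distance)).sum::<u32>());
}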

Policy for prefix search

On a search with one or multiple words in the query, the user can choose whether each word can be searched as a prefix or not.

I propose 3 possibilities (sketched after this list):

  • None => No Prefix, Never!
  • Last => Consider that only the last word can be a prefix
  • All => Consider that all words in the query can be a prefix
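A sketch of how this policy could look in Rust (naming is illustrative):

enum PrefixPolicy {
    /// No prefix, never!
    None,
    /// Only the last word of the query can be a prefix.
    Last,
    /// All words in the query can be prefixes.
    All,
}

fn prefix_allowed(policy: &PrefixPolicy, word_index: usize, query_len: usize) -> bool {
    match policy {
        PrefixPolicy::None => false,
        PrefixPolicy::Last => word_index + 1 == query_len,
        PrefixPolicy::All => true,
    }
}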

Improve document ranking when a query word is known to be exact

Currently, document ranking doesn't seem to take into account whether a query word is at the end of the query or not. When a word is in the middle of a query, or is followed by a space, it is more likely to be finished/in its exact form.

For example, if a user queries "cat and dog", we must understand that the word "cat" is a finished word, and documents containing the exact representation must be more interesting than documents containing words starting with "cat", like "catalog" or "catherine".

This is already one of the criteria used by default, but it is nearly the last one in the list (link to the code).

WordArea and Attribute must not panic

Currently WordArea and Attribute panic if the arguments are greater than the internal limit.

This is an issue when texts are too long, for example: the serializer must know when a WordArea cannot be constructed because words are too far apart.

We must change the constructors so that possible errors can be caught; we will keep a pub(crate) unchecked constructor that panics, just for tests.

Fix `CappedBTreeMap` insertion

The current CappedBTreeMap::insert function pops out the smallest value when capacity is reached, but it doesn't test whether the inserted value is smaller than the ones already in the tree, so the popped/returned value is not always the smallest one.
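A sketch of a fixed insert, shown here on a set that keeps the cap largest values (the map version is analogous; names are illustrative):

use std::collections::BTreeSet;

struct CappedBTreeSet<T: Ord> {
    set: BTreeSet<T>,
    cap: usize,
}

impl<T: Ord> CappedBTreeSet<T> {
    /// Inserts `value` and returns the rejected value, if any.
    fn insert(&mut self, value: T) -> Option<T> {
        if self.set.len() < self.cap {
            self.set.insert(value);
            return None;
        }
        // At capacity: compare against the current smallest before
        // popping, so the returned value really is the smallest one.
        let is_greater = match self.set.first() {
            Some(smallest) => value > *smallest,
            None => false,
        };
        if is_greater {
            let evicted = self.set.pop_first();
            self.set.insert(value);
            evicted
        } else {
            Some(value) // the new value itself is the reject
        }
    }
}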

New highlight style

Currently, for highlights, the engine returns the field name, the start of the occurrence, and the length.

e.g.

 "matches": [
      {
        "attribute": "xxx",
        "start": 62,
        "length": 7
      }
]

It would be great to have the content of the field with the highlights visible, with configurable text set before and after each highlight. That would be useful for many usages, like rendering HTML, printing on a terminal, etc. But keeping the start and length information of a match is also interesting in some cases.

e.g.

"_highlightResult": {
	"attribute_name": {
		"value": "I am the text of attribute_name and the match is <em>chocolate</em>",
		"matchedWords": [
			"chocolate"
		],
		"matchesPositions": [
			{ "start": 50, "length": 9 }
		],
		"prefixTag": "<em>",
		"postfixTag": "</em>"
	}
},

Allow querying a range of documents

It would be an interesting feature to allow a user to query the engine by specifying the exact range of documents they want; it is useful for pagination. Until now, the workaround is to query enough documents to reach the upper bound of the final needed range.

By specifying the range, the engine can skip many computations inside the bucket sort algorithm.
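A small illustrative helper showing how a page maps to such a range; the query call mirrors the QueryBuilder API sketched elsewhere in this document and is hypothetical:

use std::ops::Range;

/// Computes the document range for a given zero-based page.
fn page_range(page: usize, hits_per_page: usize) -> Range<usize> {
    let start = page * hits_per_page;
    start..start + hits_per_page
}

// page_range(1, 20) == 20..40, which could then be passed directly
// to the engine: builder.query("chateau", page_range(1, 20))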

Remove the principle of positive and negative update

It could be easier for the user to create an update without having to care about whether it is a positive or a negative one.

It must be a single kind of update without any distinction; the user should be able to insert document field updates, remove document fields, or completely remove documents.

Implement query filtering

Another great feature to have is a way to let the user decide which documents should be yielded. An API like the following would be perfect.

fn filter_function(doc: DocumentId, view: &DatabaseView<D>) -> bool {
    // ...
}

let builder = view.query_builder()?;
let builder = builder.with_filter(filter_function);

let documents = builder.query("chateau", 0..20);

We must rethink how the query builder system works.

For the moment, a user can create a QueryBuilder without any parameters; if they need to apply a distinct rule, they can create a DistinctQueryBuilder from it. This is a new type with its own, very custom query function.

If we want to introduce a new way to filter documents, we will multiply the number of types, and it will be hard to combine kinds of filters (i.e. filter, distinct). An idea could be to introduce a Query trait that any filter would implement; this way we could introduce query filters that wrap other query filters.

We need to be careful: the query filter must always be applied before the query distinct, for example. A type restriction could enforce this (see the sketch below).
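A minimal sketch of what such a Query trait and a wrapping filter could look like (all names are hypothetical):

/// A composable query: decides whether a candidate document is kept.
trait Query {
    fn accepts(&self, doc: u64) -> bool;
}

/// The base query accepts every candidate document.
struct All;

impl Query for All {
    fn accepts(&self, _doc: u64) -> bool {
        true
    }
}

/// A filter wrapping an inner query: the filter is checked first,
/// so it always runs before whatever the inner query does.
struct Filtered<Q, F> {
    inner: Q,
    filter: F,
}

impl<Q: Query, F: Fn(u64) -> bool> Query for Filtered<Q, F> {
    fn accepts(&self, doc: u64) -> bool {
        (self.filter)(doc) && self.inner.accepts(doc)
    }
}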

Schema must be aware of the document id

Currently the schema knows which fields of the document should be indexed and stored. Using the SchemaBuilder struct, we declare each field in the order of importance we want, and the document id is unknown to the finally built schema.

The advantage of this solution is that it will be impossible to serialize a document that does not have an id. The serializer will not be able to generate the appropriate key and therefore not be able to store it in the database.

We must absolutely keep some kind of "raw" API that will allow us to store a document under any id and with any attributes.

How should we modify the SchemaBuilder API? What if the user wants the id to be stored and indexed?

We will probably need to transform the id (i.e. hash it) to be able to save it in the database; when a user needs to identify a document (e.g. to delete it), they will refer to its original id, not the transformed one.
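A sketch of such a transformation, using the standard library's default hasher purely for illustration:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives the internal storage key from the user-provided id, so any
/// string id maps to a fixed-size key. The original id is what users
/// keep referring to; this key stays internal.
fn document_key(user_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    hasher.finish()
}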

Implement some sort of incremental indexing

What is incremental indexing?

Currently the engine takes an index composed of 3 files: one for the finite-state transducer (.fst), one for the lists of document ids associated with where each word matches (a DocIndex) (.idx), and one that represents the key-values used to feed RocksDB (.sst).

This kind of index is immutable, and that is the power of the data structure; you can read more in the fst repo. In the following lines, I will use the term "index" for this batch of files that work together.

Problem

A use case that should/must be managed by this library is the possibility to add and remove documents in real time without needing to re-index all the documents. This feature is really important when new results must be exposed within a small amount of time.

Schema must be constructed from a file

We must be able to create a Schema from a file. The format could be TOML, for example.

It could be something of this form; the order in which the attributes are declared is important.
There is a pull request about that in the toml repository.

identifier = "id"

[attributes."id"]
stored = true

[attributes."title"]
stored = true
indexed = true

[attributes."description"]
stored = true
indexed = true

[attributes."image"]
stored = true

Index language configuration

If the user has a database with documents translated into multiple languages, the best use case is to create one index per language. In this case, it would be great to let the end user choose a default language for each index. This will apply the tokenizer and the default stop words for this language during indexing.

Of course, this should be totally optional. By default, no stop words will be used, and the tokenizer will split on spaces and default sentence separators like dots, commas, etc.

Schema must be de/serialized from a json format

Thanks to a previous issue (#47), a Schema can be built from a TOML file.

The proposal for this issue is to handle the JSON format in the same way as the TOML format.

An example of the json schema:

{
    "identifier": "id",
    "attributes": {
        "id": {
            "strored": true
        },
        "title": {
            "strored": true,
            "indexed": true
        },
        "description": {
            "strored": true,
            "indexed": true
        },
        "image": {
            "indexed": true
        }
    }
}

Use the Stream instead of a Levenshtein automaton for short words

It will give better performance.

Currently, if a query word is 4 characters or fewer, a Levenshtein automaton with an allowed distance of 0 is created and used only to match prefixes; this is much slower than combining the stream and starts_with map methods.

The best way to make this possible is to define an enum that can be either a stream + starts_with combination or a Levenshtein automaton. We will keep the Levenshtein pool, but it will live under this wrapper.
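A sketch of that wrapper enum (the names and the distance choice are illustrative):

/// Short query words use a plain prefix match on the lexicographically
/// ordered stream; longer ones keep the pooled Levenshtein automaton.
enum WordMatcher {
    Prefix(String),
    Levenshtein { word: String, distance: u8 },
}

impl WordMatcher {
    fn for_query_word(word: &str) -> WordMatcher {
        if word.chars().count() <= 4 {
            WordMatcher::Prefix(word.to_string())
        } else {
            WordMatcher::Levenshtein { word: word.to_string(), distance: 1 }
        }
    }
}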

Create a capped `BTreeMap` with a limited pool of nodes

...that will allow us to store nodes with values in a simple memory buffer, removing most of the cache misses while iterating over the tree.

A good way for nodes to point to other nodes is to use node indexes into the memory buffer instead of raw pointers; this will also allow us to grow the node pool.
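A sketch of such an index-based pool (names are illustrative):

/// Nodes reference their children by index into the pool's buffer
/// instead of by raw pointer, so the buffer can grow or be moved
/// without invalidating any link.
struct Node<T> {
    value: T,
    left: Option<u32>,  // index into `Pool::nodes`
    right: Option<u32>, // index into `Pool::nodes`
}

struct Pool<T> {
    nodes: Vec<Node<T>>,
}

impl<T> Pool<T> {
    /// Allocates a node in the buffer and returns its index.
    fn alloc(&mut self, value: T) -> u32 {
        let index = self.nodes.len() as u32;
        self.nodes.push(Node { value, left: None, right: None });
        index
    }
}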

Discussion issue on implementing a client-server protocol

We must think about a way to communicate with the database, to insert, delete and update documents.

The database will listen on a TCP port and should be queryable from any language (e.g. Rust, Ruby, Python, JavaScript...). The protocol must be versioned.

We thought that the first protocol to implement could be a plain-text one that follows some of the Redis protocol specifications. This way it will be easy to implement in other languages.

But the second version of the protocol could be oriented toward something else, more like a binary format for example.

This issue is blocked by #22 and #35.

`RankedStream` optimisations

The RankedStream stream can be set up to return a maximum number of results, so optimisations are possible.

The internal StreamWithState stream gives results in lexicographic order; each result (a word) is associated with a number of document ids, positions... This information is used to rank results.
The order in which the internal stream gives results helps to know whether other matching words can still appear later.

If the stream has a limited number of results to return, some results are already known to be "frozen" ones: no result can be better than these.

For example: a document that contains exactly the query word (with a levenshtein distance of zero) at the first position of the first attribute cannot be displaced; it is called "frozen" and can be returned immediately.

Another example: for a one-word query, the internal stream gives matching words in lexicographic order. Once all the words matching exactly (with a levenshtein distance of zero) have been returned, and there are enough of them, there is no reason to pull more results from the internal stream: the next ones will have a bigger distance, so the accumulated results can already be sorted and returned.

Distinct with automatic group size

The distinct feature (see #12) allows de-duplicating objects based on a generic function.

The implementation retains the ability to choose the number of results returned per object family.

The proposal is to let users choose between the following:

  • x, keep the x best results
  • Auto, will keep all results with the same level

Pre-config

attributeForDistinct = [groupId]
attributeForAutoDistinct = [color]

Indexed data

{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:15, name: t-shirt, color:blue }
{ groupId:2, id:16, name: t-shirt, color:marine }
{ groupId:2, id:17, name: t-shirt, color:red }

Request

Query: T-shirt marine

Results

Current:

{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }

Simply distinguished by groupId

Possible result 1:

{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }

Because the marine color is not present on groupId:1 but is on groupId:2

Possible result 2:

{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }

Because marine matches once in the color category

Give more information about the processing in the search response

Search responses are only composed of an array of documents and, for each document, the list of matches. We might need more information for debugging, metrics, etc.

Here is some information that would be interesting to find in a response (one possible shape is sketched after these lists).

Process:

  • The total number of documents related to the query
  • The processing time in milliseconds

Pagination:

  • The page number requested by the user
  • The number of pages that can be requested
  • The number of hits per page

Debugging:

  • The indices used for search (useful in case of a search on multiple indices)
  • The complete ranking information (gives what is needed to understand why results are in this order)
  • Return all information about the request made by the user
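One possible shape for this extra information, as a plain Rust struct (all field names are hypothetical, not the engine's actual response format):

/// Extra information attached to a search response.
struct SearchResponseInfo {
    // Process
    total_hits: u64,           // total number of documents related to the query
    processing_time_ms: u64,   // processing time in milliseconds
    // Pagination
    page: u32,                 // the page requested by the user
    total_pages: u32,          // the number of pages that can be requested
    hits_per_page: u32,
    // Debugging
    indices_used: Vec<String>, // useful when searching on multiple indices
    query: String,             // echo of the request made by the user
}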

Implement a bucket sort algorithm

After some time experimenting with a basic sorting method that sorts all documents at once, it appears that it was not the best idea.

The new algorithm should use a bucket sort: a sort applied only to sub-groups, criterion by criterion. Buckets are sorted by a criterion and split; each sub-bucket is then sorted and split again by the next criterion. The advantage of this technique is that only the minimum number of buckets is taken into account: buckets that will not contain the returned documents are never processed or sorted.
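A minimal sketch of this recursion, with criteria modeled as plain comparison functions (names are illustrative):

use std::cmp::Ordering;

/// Sorts a group by the first criterion, splits it into buckets of
/// equal rank, then recurses into each bucket with the next criterion.
fn bucket_sort<D>(docs: &mut [D], criteria: &[fn(&D, &D) -> Ordering]) {
    let Some((criterion, rest)) = criteria.split_first() else { return };
    docs.sort_by(|a, b| criterion(a, b));

    let mut start = 0;
    while start < docs.len() {
        let mut end = start + 1;
        while end < docs.len() && criterion(&docs[start], &docs[end]).is_eq() {
            end += 1;
        }
        // A real implementation would stop recursing into buckets that
        // lie entirely outside the requested range of results.
        bucket_sort(&mut docs[start..end], rest);
        start = end;
    }
}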

Policy for matching words

On a search with one or multiple words in the query, the user can choose what happens if a query word is not present in a document, or if the engine has not found enough documents. This means being able to make a query word optional or not.

I propose 4 possibilities (sketched after this list):

  • None => If a query word is missing, the document is not considered
  • Last Word => If we have no results, we remove the last query word and try again
  • First Word => If we have no results, we remove the first query word and try again
  • Optional => All words in a query are always optional; if one is missing, the document is kept
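A sketch of this policy and of the retry loop for the two word-removal variants (the enum and the search callback are illustrative):

enum MatchingWordsPolicy {
    None,
    LastWord,
    FirstWord,
    Optional,
}

fn search_with_retries(
    mut words: Vec<String>,
    policy: &MatchingWordsPolicy,
    search: impl Fn(&[String]) -> Vec<u64>,
) -> Vec<u64> {
    loop {
        let results = search(&words);
        if !results.is_empty() || words.len() <= 1 {
            return results;
        }
        match policy {
            MatchingWordsPolicy::LastWord => { words.pop(); }
            MatchingWordsPolicy::FirstWord => { words.remove(0); }
            // `None` never retries; `Optional` is handled at match
            // time, not by removing query words.
            _ => return results,
        }
    }
}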

Moving from RocksDB to Sled

Currently the underlying key-value store is RocksDB; this is one of the most important dependencies we have. It allows the search engine to:

  • accept updates atomically and provide snapshots of the DB.
  • implement custom ranking rules based on the document attributes stored in the DB
  • probably enable other ways of querying the data without affecting the current system (i.e. geo-search)

But RocksDB is a C++ library that does not expose detailed error types. Therefore it will never be possible to solve issue #24.

Another problem is that it multiplies the number of tools we need to compile the library: a C++ compiler, make, and cmake.

The solution to these problems is to use a pure-Rust library, and we are thinking of sled. If we move to it, it will be much easier for us to maintain the sled code base, making it fit our use cases better.

For the moment, Sled does not have many of the features we need to make the switch.

Save blobs using a key-value store

The problem we faced while implementing this database was how to save the blobs generated by our engine. The engine can work with positive and negative blobs, as explained in the deep-dive.

These blobs are currently written to the filesystem. The problem is that the file order must be preserved; you can read more about why in issue #15. Additionally, the positive blobs are associated with an sst file, a file that contains all the key-values of the newly added documents, a format RocksDB can be populated with.

One of the features the final database must have is to be fed with updates (addition, deletion, update) while it is running. The big problem was: how can we ensure that the modifications received by the database are saved durably (according to the ACID properties)?

A temporary solution could be to use an existing key-value store like RocksDB, which already follows some of the ACID properties. It would be used to store the key-values of the newly indexed documents and the index files (.idx, .map), using column families as a separation.

When an update is made, we first store the index files under an incremented id as the key, update the document key-values, and set the current key id to the newly created one. We have a little problem here: ingesting an sst file cannot be done inside a transaction.

Allow serializing unstructured data to create an update

Currently only struct serialization is supported to create an update; in the future it would be great to serialize HashMaps, for example. This would let users send custom objects without needing to recompile.

The current schema does not know anything about the document's unique identifier; we must resolve that issue first (#22) so users can insert documents without needing to compute the id.

Improve the query distinct function performance

The query distinct function calls the user-given distinct function once inside the raw bucket sort pass and again inside the final aggregation loop.

https://github.com/Kerollmops/MeiliDB/blob/e36deb60dacc8e70db00d7747555758d15f949b9/src/rank/query_builder.rs#L151-L191

The user-given distinct function is always CPU-consuming because it fetches document fields from the key-value store.

It would be interesting to store the outputs of the first distinct function calls.

We need to accept soft updates

Soft updates are updates that can be ingested by the database without any CPU overhead.
Unlike heavy updates, they don't carry any fst-docids entry, so no blob merge is needed and entry modifications can be seen in real time.

Obviously soft updates can only be done on non-indexed attributes.

Allowing this can be useful for custom ranking rules that use frequently changing non-indexed attributes, like a number_of_clicks attribute for example.

Change the DocIndex file format to allow stream writing

Currently, the file format that stores the document indexes associated with the words of the fst map puts an index at the start of the file, followed by all the corresponding document indexes. You can read more about the format in the deep-dive.

This flat data structure does not allow us to "stream" the data to disk directly when merging indexes, for example, because we would need to know the final index in order to write it up front.

I propose to invert the order and store the index and its size at the end of the file; this way we can "stream" the document indexes to the file and keep only the offsets in RAM. When all words have been traversed and written append-only to the file, we write all the offsets followed by the size of this index (all the offsets).

To read this file and the document indexes it contains, we read the last 8 bytes to learn the size of the index (the offsets), then use this index, as before, to know which document indexes are associated with an id.
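A sketch of writing and reading this layout, using in-memory buffers for brevity (names and the choice of little-endian are illustrative):

use std::convert::TryInto;

/// Appends every doc-indexes block, then the offsets, then the size
/// (in bytes) of those offsets as the final 8 bytes.
fn write_file(blocks: &[Vec<u8>]) -> Vec<u8> {
    let mut file = Vec::new();
    let mut offsets: Vec<u64> = Vec::new();

    // Stream every block out, keeping only its offset in RAM.
    for block in blocks {
        offsets.push(file.len() as u64);
        file.extend_from_slice(block);
    }

    // Append the index (the offsets), then its size, last.
    let index_start = file.len();
    for offset in &offsets {
        file.extend_from_slice(&offset.to_le_bytes());
    }
    let index_len = (file.len() - index_start) as u64;
    file.extend_from_slice(&index_len.to_le_bytes());
    file
}

/// Reads the last 8 bytes to find the size of the index, then decodes
/// the offsets from the end of the file.
fn read_offsets(file: &[u8]) -> Vec<u64> {
    let size_pos = file.len() - 8;
    let index_len = u64::from_le_bytes(file[size_pos..].try_into().unwrap()) as usize;
    file[size_pos - index_len..size_pos]
        .chunks_exact(8)
        .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
        .collect()
}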

The indexation must create only one file

...with all keys and values.

Currently the fst map and the values associated with it (i.e. document id, attribute and position) are in two different files. This invites mistakes: a user can ask to open an fst map dump with the wrong values dump and produce wrong results while searching in it.

We must disallow this; it must be impossible to make that kind of mistake. Creating only one file by concatenating the current ones could be a valid solution.

Implement better error types

It is important for the library user to know what went wrong when something goes wrong 😃
But it is also important for us to know the source of an error.

There is a big problem for the moment: we use RocksDB through a C++ library wrapper, and errors are just plain strings. We have no information about the kind of error. Is it an IO error, a logic error? Does the key I want to remove not exist, or is it misspelled?

We can probably create a workaround: an error enum wrapping the RocksDB error String...
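A sketch of that workaround (the variant names are illustrative):

use std::fmt;

/// Library error type wrapping the stringly-typed RocksDB error until
/// finer-grained variants can be extracted.
#[derive(Debug)]
enum Error {
    Io(std::io::Error),
    RocksDb(String), // the opaque error string from the C++ wrapper
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "io error: {}", e),
            Error::RocksDb(s) => write!(f, "rocksdb error: {}", s),
        }
    }
}

impl std::error::Error for Error {}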

Add a sessionId on the search request

For the moment, the search statistics and analytics made on the server are not specific enough. We cannot know whether the search "pants" is unique or preceded by "p", then "pa", then "pan", then "pant", and finally "pants". This means we are unable to find the most searched words for an index.

It would be great to add a sessionId to the search request headers to keep a history of what a user searches. The goal is not to spy on users but just to have enough information to know what the final search is.

Amortize document updates ingestion costs

The current update ingestion implementation recomputes the reverse index.
This is a significant CPU cost, especially when adding only a small number of documents.

We can't deal with this anymore!
We must be able to accept updates of any size and make them available to search with minimal overhead.

We thought about something similar to database caching algorithms: putting the most recently used entries in memory to provide rapid access to them. When too many entries are cached, a compaction is performed (i.e. part of Sled's future strategy).

So, basically, when a user proposes some document updates/insertions/deletions, we will not put every document attribute update inside an SST file, paying the cost of keeping words-docindexes ordered in RAM and constructing a new FST.

Inserting the words-docindexes and document attribute values directly into the key-value store is the solution. Internally, the database will extract the words of the indexed attributes and update the special "revindex" keyspace, dedicated to reverse index data, storing the DocIndexes associated with words as key-values, like the current FST.

The advantage of a key-value store is that, like a BTree and an FST, keys are ordered, making O(log n) unions possible between these datastructures at search time.

Obviously, we should find a rule to trigger compaction. It must be based on the latency involved in doing the union, probably related to the number of words/doc-indexes stored as raw keys. A compaction will update the FST by integrating the "revindex" words-docindexes into it.

Make the indexing more customizable

The engine uses a lot of magic numbers, mostly during the indexing phase: the typo tolerance configuration, the number of words indexed in a field, the biggest possible proximity, etc.

This issue should allow the end user to configure all these parameters.

One possibility is to store them in the config already saved in the KV store. The search and indexing config parts could be separated.

Think about data oriented design

Currently, the most CPU-consuming parts of MeiliDB are the criteria; they account for almost 40% of the total time (for 9 million fields).

We spotted something interesting in the time measurements: one of our criteria takes much more time than the others, even though it only sums one of the Match properties.

The sum_of_typos criterion takes 5.41ms to sort 97360 documents and the sum_of_words_attribute takes 19.76ms for the exact same number of documents.

97360 total documents to classify
626460 total matches to classify

query_all took 87.45 ms

criterion SumOfTypos,               documents group of size 97360
criterion SumOfTypos                sort took 5.41 ms

criterion NumberOfWords,            documents group of size 97360
criterion NumberOfWords             sort took 5.36 ms

criterion WordsProximity,           documents group of size 97360
criterion WordsProximity            sort took 7.24 ms

criterion SumOfWordsAttribute,      documents group of size 97360
criterion SumOfWordsAttribute       sort took 19.76 ms

criterion SumOfWordsPosition,       documents group of size 33657
criterion SumOfWordsPosition        sort took 10.48 ms

criterion Exact,                    documents group of size 16708
criterion Exact                     sort took 1.96 ms

criterion DocumentId,               documents group of size 16708
criterion DocumentId                sort took 1.98 ms

They are nearly the same algorithms:

https://github.com/Kerollmops/MeiliDB/blob/0a3d069fbc0108cca6953c813453b2dd24d8b68d/src/rank/criterion/sum_of_typos.rs#L14-L26

https://github.com/Kerollmops/MeiliDB/blob/0a3d069fbc0108cca6953c813453b2dd24d8b68d/src/rank/criterion/sum_of_words_attribute.rs#L13-L19

The only difference is that the attribute field is not at the start of Match.
It is probably padded, making the CPU cache unhappy.

struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}

So we thought about data oriented design, putting related data in the same memory space to make the CPU cache happy.

All of these fields will probably be stored in the same vectors; each vector represents a property.
All of the document match properties will live in the same five vectors; everything becomes a matter of indexes and slices (see the sketch after the link below).

// OLD algorithm struct
struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}

let matches: Vec<Match>;

// NEW algorithm struct
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}

https://stackoverflow.com/questions/17074324/how-can-i-sort-two-vectors-in-the-same-way-with-criteria-that-uses-only-one-of
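The linked answer boils down to sorting a permutation of indexes, which maps directly onto the vector-per-property layout; a sketch with illustrative names:

struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}

/// Returns the order of the matches sorted by attribute. Only the
/// `attribute` vector is read; the other columns stay untouched,
/// which is exactly what this layout is meant to exploit.
fn order_by_attribute(matches: &Matches) -> Vec<usize> {
    let mut order: Vec<usize> = (0..matches.attribute.len()).collect();
    order.sort_by_key(|&i| matches.attribute[i]);
    order
}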

Save the state of the search engine (its index)

In other words, dump the index files. It is fairly easy to dump the fst::Map and the doc_indexes, but we need to dump the RocksDB too, as an SST file if possible.

https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F

It seems that we will be forced to iterate over the keys and select the ones that are present in the metas. But if it is possible to remove keys while we do operations on the index (differences of negative and positive doc indexes), that may not be necessary, and an easy big dump could be sufficient.

The real advantage of doing so is that we can give the new index to other search engines and keep a state even if the engine crashes. Ideally the state should be a single file (e.g. a tar.sz file: tar combined with snappy).

Implement an open_read_only for Database

This way it will be possible to run multiple instances of the same database and only test the query functions, the ones that do not affect the database content.
