
charabia's Introduction


⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

[Screenshots: a bright-colored and a dark-colored application for finding movies screening near the user]

🔥 Try it! 🔥

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

[Image: logos of languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP]

โš™๏ธ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.

charabia's People

Contributors

afluffyhotdog, bors[bot], carofg, choznerol, crudiedo, curquiza, cymruu, daniel-shuy, datamaker, dependabot[bot], draliragab, dureuill, gmourier, goodhoko, harshalkhachane, irevoire, kerollmops, kination, manythefish, marinpostma, matthias-wright, meili-bors[bot], meili-bot, miiton, mosuka, qbx2, roms1383, samyak2, xshadowlegendx, yenwel


charabia's Issues

Disable HMM feature of Jieba

Today, we are using the Hidden Markov Model algorithm (HMM) provided by the cut method of Jieba to segment unknown Chinese words in the Chinese segmenter.

drawback

Following the subdiscussion in the official discussion about Chinese support in Meilisearch, it seems that the HMM feature of Jieba is not relevant in the context of a search engine. This feature creates longer words and inconsistencies in the segmentation, which reduces the recall of Meilisearch without significantly raising the precision.

enhancement

Deactivate the HMM feature in Chinese segmentation.
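
For illustration, a minimal sketch of the intended change, assuming the jieba-rs API where `cut` takes an `hmm` flag:

```rust
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    // Passing `false` disables the Hidden Markov Model used to guess unknown
    // words, keeping the segmentation shorter and more consistent for search.
    let words = jieba.cut("南京市长江大桥", false);
    println!("{:?}", words);
}
```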

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Latin script: Segmenter should split camelCased words

Today, Meilisearch splits snake_case, SCREAMING_CASE, and kebab-case properly but doesn't split PascalCase or camelCase.

drawback

Meilisearch doesn't completely support code documentation.

enhancement

Make the Latin Segmenter split camelCase and PascalCase words (see the sketch after the examples):

  • "camelCase" -> ["camel", "Case"]
  • "PascalCase" -> ["Pascal", "Case"]
  • "IJsland" -> ["IJsland"] (Language trap)
  • "CASE" -> ["CASE"] (another trap)

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Arabic script: Implement specialized Segmenter

Currently, the Arabic script is segmented on whitespace and punctuation.

Drawback

Following the dedicated discussion on Arabic language support and the linked issues, agglutinated words are not segmented, as pointed out in this comment:

the agglutinated word الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The and it's always connected (not space-separated) to the next word.

Enhancement

We should find a specialized segmenter for the Arabic script or, failing that, a dictionary to implement our own segmenter inspired by the Thai segmenter.


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Decompose Japanese compound words

Summary

The morphological dictionary that Lindera includes by default is IPADIC.
IPADIC includes many compound words, for example 関西国際空港 (Kansai International Airport).
However, if you index in the default mode, the word 関西国際空港 (Kansai International Airport) will be indexed as the single term 関西国際空港, and you will not be able to search for the keyword 空港 (Airport).
So, Lindera has a function to decompose such compound words.
This is a feature similar to Kuromoji's search mode.

`num_graphemes_from_bytes` does not work when used for a prefix of a raw Token

The Issue

The output of num_graphemes_from_bytes is wrong when:

  • num_bytes is smaller than the length of the string
  • the token does not have its char_map initialized, possibly because the Token was created outside of the Tokenizer or because the unicode segmenter was not run.

It should return num_bytes back since each character is assumed to occupy one byte. Instead, it returns the length of the underlying string.
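
For illustration, a sketch of the expected fallback when no char_map is available (written as a free function; in charabia this logic lives in a method on Token):

```rust
fn fallback_num_graphemes(lemma: &str, num_bytes: usize) -> usize {
    // Without a char_map, each character is assumed to occupy one byte, so the
    // number of graphemes in the prefix is the prefix length in bytes, clamped
    // to the token's own length instead of always returning lemma.len().
    num_bytes.min(lemma.len())
}

fn main() {
    assert_eq!(fallback_num_graphemes("tokenizer", 4), 4); // prefix of 4 bytes
    assert_eq!(fallback_num_graphemes("tok", 10), 3);      // clamped to the string length
}
```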

Context

This bug was introduced by me in #59 😆

See also: meilisearch/milli#426 (comment)

Classify tokens after Segmentation instead of Normalization

Classifying tokens after Segmentation instead of after Normalization in the tokenization pipeline would enhance the precision of the stop_words classification.
Today, stop words need to be normalized to be properly classified; however, the normalization is more or less lossy and can classify unexpected stop words.
For instance, in French, maïs (corn in 🇬🇧) would be normalized as mais (but in 🇬🇧), and so maïs will be classified as a stop word if the stop word list contains mais.
This misclassification would not happen if the classifier were called before the normalizer.
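
A minimal illustration of the problem, assuming a deunicode-style Latin normalizer (the deunicode crate):

```rust
use deunicode::deunicode;

fn main() {
    let stop_words = ["mais"]; // French stop word "mais" (but)
    let token = "maïs";        // "maïs" (corn) is not a stop word
    // Classifying after normalization wrongly flags "maïs" as a stop word...
    assert!(stop_words.contains(&deunicode(token).as_str()));
    // ...while classifying before normalization keeps it as a regular word.
    assert!(!stop_words.contains(&token));
}
```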

Technical approach

Invert the normalization step and the classification step in the tokenization process

Refactor normalizers

Today, creating a normalizer is much harder than creating a segmenter, mainly because of the char map, a field that is necessary to manage highlights.

Technical Approach

Refactor the Normalizer trait to expose a normalize_str and a normalize_char method that take a Cow<str> as parameter and return a Cow<str>. All the char-map creation should be done in a single function calling these two methods.
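
A hedged sketch of the trait shape this describes (signatures assumed from the wording above, not charabia's final API):

```rust
use std::borrow::Cow;

trait Normalizer {
    /// Normalize a whole lemma in one pass.
    fn normalize_str<'a>(&self, lemma: Cow<'a, str>) -> Cow<'a, str>;
    /// Normalize the slice holding a single character.
    fn normalize_char<'a>(&self, c: Cow<'a, str>) -> Cow<'a, str>;
}

// The char map needed for highlighting would then be built by a single shared
// function calling these two methods, instead of inside every normalizer.
```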

Implement Jyutping normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological representation.
In order to have accurate phonology for Cantonese, we should normalize Chinese characters into Jyutping using the kCantonese dictionary of the Unihan database.
We should find an efficient way to normalize characters, so the dictionary may need to be reformatted.

Files expected to be modified

Misc

related to product#503
original source of the dictionary: unihan.zip in https://unicode.org/Public/UNIDATA/

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Implement Pinyin normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological representation.
In order to have accurate phonology for Mandarin, we should normalize Chinese characters into Pinyin using the pinyin crate.
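
A rough sketch of the transliteration step, assuming the pinyin crate's ToPinyin trait (the real normalizer would also have to maintain the char map):

```rust
use pinyin::ToPinyin;

fn normalize_to_pinyin(text: &str) -> String {
    let mut out = String::new();
    for (c, reading) in text.chars().zip(text.to_pinyin()) {
        match reading {
            // Replace a Han character by its toneless Pinyin reading.
            Some(p) => out.push_str(p.plain()),
            // Keep non-Han characters untouched.
            None => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(normalize_to_pinyin("中国"), "zhongguo");
}
```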

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Add an allowlist to the tokenizer builder

Today, Charabia automatically detects the language of the provided text and chooses the best tokenization pipeline accordingly.

drawback

Sometimes the detection is not accurate, mainly when the provided text is short, and the user can't manually declare the languages contained in the provided text.

enhancement

Add a new setting to the TokenizerBuilder forcing the detection to choose from a subset of languages, and, when there is no choice left to make, skip the detection and pick the specialized pipeline directly.
Whatlang, the library used to detect the language, provides the Detector::with_allowlist method to restrict detection to a subset of languages.

Technical approach:

  1. add an optional allowlist parameter to the method detect of the Detect trait in detection/mod.rs
  2. add a segment_with_allowlist and a segment_str_with_allowlist method, each taking an additional allowlist parameter, to the Segment trait in segmenter/mod.rs
  3. add an allowlist method to the TokenizerBuilder struct in tokenizer.rs

The allowlist should be a hashmap of Script -> [Languages]
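
A minimal sketch of the whatlang side of the feature, using the Detector::with_allowlist constructor mentioned above:

```rust
use whatlang::{Detector, Lang};

fn main() {
    // Restrict detection to the languages declared by the user.
    let detector = Detector::with_allowlist(vec![Lang::Eng, Lang::Fra]);
    if let Some(info) = detector.detect("Une phrase un peu courte") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }
}
```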

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Compile/Install Charabia on OpenBSD

I am on OpenBSD running on a Raspberry Pi 4. I am unable to install Meilisearch because cargo is not able to find charabia, so I decided to compile from source.
I downloaded the source from GitHub and ran `cargo run` in the charabia source code. I get the error:

```
error: failed to parse manifest at `/home/kabira/LibOpenSource/charabia/Cargo.toml`

Caused by:
  namespaced features with the `dep:` prefix are only allowed on the nightly channel and requires the `-Z namespaced-features` flag on the command-line
```

Any workaround suggestions would be great.

Explain the name of the repo in the README

Following @CaroFG's idea, we could explain the name of the repo in the README, since some people find it "offensive".

Here are some explanations Many (@ManyTheFish) made on Twitter:

We chose the name of this repository in the same spirit as discord or meili: naming it after the problem we want to solve.
Personally, I don't feel like it's an offensive word, but more a funny pun on "char".
Moreover, other tokenizers don't always have an understandable name, for instance lindera, maintained by @minoru_osuka, or even jieba.
I hope my explanation was clear enough, and I hope the name will not discourage you from using or even contributing to the project! 😊

readme Hebrew segmentation link points to jieba

As the title says: in the README, clicking "unicode-segmentation" on the Hebrew row takes the user to the jieba repo.

I assume the correct link would be the same as Latin's "unicode-segmentation."

Tokenizer for Ja/Ko

Hello~
I'm currently testing the tokenizer with Japanese/Korean, but it seems it is not working correctly.

Is there some working plan for this?

Thanks.

Handle words containing non-separating dots and commas in Latin tokenization

Summary

Handle S.O.S as one word (S.O.S) instead of three (S, O, S), and numbers like 3.5 as one word (3.5) instead of two (3, 5).

Explanation

The current tokenizer considers any . or , as a hard separator, meaning that the two separated words are not considered to be part of the same context.
But there are exceptions for some words, like numbers, that are separated by . or , but should be considered as one and only one word.

We should modify the current Latin tokenizer to handle this case.
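
An illustrative sketch of the exception (not the actual charabia separator logic): a dot or comma is only a hard separator when it is not surrounded by alphanumeric characters.

```rust
/// `idx` is the byte index of the candidate `.` or `,` inside `text`.
fn is_hard_separator(text: &str, idx: usize) -> bool {
    let c = text[idx..].chars().next().unwrap();
    if c != '.' && c != ',' {
        return false;
    }
    let prev = text[..idx].chars().next_back();
    let next = text[idx + c.len_utf8()..].chars().next();
    // "S.O.S" and "3.5" keep their dots; "end. Next" still splits.
    !(prev.map_or(false, |p| p.is_alphanumeric()) && next.map_or(false, |n| n.is_alphanumeric()))
}

fn main() {
    assert!(!is_hard_separator("3.5", 1));      // kept inside the number
    assert!(!is_hard_separator("S.O.S", 1));    // kept inside the acronym
    assert!(is_hard_separator("end. Next", 3)); // still a separator
}
```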

Move the FST based Segmenter in a standalone file

For the Thai segmenter, we tried a finite-state-transducer (FST) based segmenter.
This segmenter has really good performance, and the dictionaries encoded as FSTs are smaller than raw txt/csv/tsv dictionaries.
For now, the segmenter lives in the Thai segmenter file (segmenter/thai.rs); in order to reuse it for other languages, it would be better to move it to its own file.
A new struct FstSegmenter may be created wrapping all the iterative segmentation logic.
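
A rough sketch, assuming the fst crate, of the kind of iterative longest-match logic an FstSegmenter could wrap (the real code will differ, e.g. it should walk the FST once per position instead of probing every prefix):

```rust
use fst::Set;

/// Greedy longest-match segmentation against a dictionary encoded as an FST.
/// Characters not covered by the dictionary are emitted one at a time.
fn segment<'a>(dict: &Set<Vec<u8>>, mut text: &'a str) -> Vec<&'a str> {
    let mut words = Vec::new();
    while !text.is_empty() {
        // Default: a single character, used when no dictionary word matches.
        let mut end = text.chars().next().unwrap().len_utf8();
        // Keep the longest prefix of `text` present in the dictionary.
        for boundary in text.char_indices().map(|(i, c)| i + c.len_utf8()) {
            if dict.contains(text[..boundary].as_bytes()) {
                end = boundary;
            }
        }
        words.push(&text[..end]);
        text = &text[end..];
    }
    words
}

fn main() -> Result<(), fst::Error> {
    // Entries must be provided in lexicographic order when building the set.
    let dict = Set::from_iter(["bon", "bonjour", "le", "monde"])?;
    assert_eq!(segment(&dict, "bonjourlemonde"), vec!["bonjour", "le", "monde"]);
    Ok(())
}
```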

File expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Handle non-breakable spaces

The tokenizer must handle non-breakable spaces.

For example, it should handle the following examples this way:

  • 3 456 678, where the space is a non-breaking space and is not considered as a separator
    - Альфа where ь is not considered as a space, so a separator
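
For illustration, a small sketch (not charabia's actual separator logic) where non-breaking spaces are excluded from the separator set:

```rust
/// Treat U+00A0 NO-BREAK SPACE and U+202F NARROW NO-BREAK SPACE as part of the
/// token instead of as separators.
fn is_separator(c: char) -> bool {
    !matches!(c, '\u{00A0}' | '\u{202F}') && (c.is_whitespace() || c.is_ascii_punctuation())
}

fn main() {
    assert!(!is_separator('\u{00A0}')); // "3 456 678" stays a single token
    assert!(is_separator(' '));
    assert!(is_separator(','));
}
```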

Related to meilisearch/meilisearch#1335 cf @shekhirin comment

Enhance Chinese normalizer by unifying `Z`, `Simplified`, and `Semantic` variants

Following the official discussion about Chinese support in Meilisearch, it is relevant to normalize Chinese characters by unifying Z, Simplified, and Semantic variants before transliterating them into Pinyin.

To know more about each variant, you can read the dedicated report on unicode.org.

There are several dictionaries listing variants that we could use; I suggest using the kvariants dictionary made by hfhchan (see the related documentation in the same repo).

Technical approach

Import and rework the dictionary into a key-value binding of each variant; then, in the Chinese normalizer, convert the provided character before transliterating it into Pinyin.
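
A rough sketch of the lookup, assuming the kvariants data has been reworked into a compiled-in variant-to-canonical table (the entries below are only illustrative placeholders):

```rust
use std::collections::HashMap;
use once_cell::sync::Lazy;

// In the real implementation this table would be generated from the kvariants
// dictionary; the two entries here are illustrative placeholders.
static KVARIANTS: Lazy<HashMap<char, char>> = Lazy::new(|| {
    HashMap::from([('戶', '户'), ('戸', '户')])
});

fn unify_variant(c: char) -> char {
    // Convert the character to its canonical variant before the Pinyin step.
    *KVARIANTS.get(&c).unwrap_or(&c)
}

fn main() {
    assert_eq!(unify_variant('戶'), '户');
    assert_eq!(unify_variant('山'), '山'); // unknown characters pass through
}
```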

Files expected to be modified

Misc

related to meilisearch/product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Handle multi languages in the same attribute

The Tokenizer currently uses the whatlang library, which detects the (most probable) language of the attribute.

The Tokenizer must be able to detect several languages in the same attribute.

Also, maybe it would be a better idea to let the user decide the language?

Implement an efficient `Nonspacing Mark` Normalizer

In an Information Retrieval (IR) context, removing nonspacing marks such as diacritics is a good way to increase recall without losing much precision, for instance in the Latin, Arabic, or Hebrew scripts.

Technical Approach

Implement a new Normalizer, named NonspacingMarkNormalizer, that removes the nonspacing marks from a provided token (a naive implementation with the exhaustive list can be found in the Misc section).
Because there are a lot of sparse character ranges to match, it would be inefficient to create a big if-forest to know whether a character is a nonspacing mark.
For this reason, I suggest trying several implementations of the naive version in a small local project and benchmarking them.
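
As a starting point, a naive sketch with only a few well-known ranges hard-coded (the real list is far longer, which is exactly why the data structure should be benchmarked with the crates listed below):

```rust
fn is_nonspacing_mark(c: char) -> bool {
    // A few well-known nonspacing-mark ranges; the exhaustive list is much longer.
    matches!(c,
        '\u{0300}'..='\u{036F}'   // Combining Diacritical Marks (Latin)
        | '\u{0591}'..='\u{05BD}' // Hebrew points and accents
        | '\u{0610}'..='\u{061A}' // Arabic signs
        | '\u{064B}'..='\u{065F}' // Arabic diacritics
    )
}

fn remove_nonspacing_marks(token: &str) -> String {
    token.chars().filter(|c| !is_nonspacing_mark(*c)).collect()
}

fn main() {
    // "é" written as "e" + COMBINING ACUTE ACCENT loses its mark.
    assert_eq!(remove_nonspacing_marks("e\u{0301}"), "e");
}
```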

Interesting Rust Crates

  • hyperfine: a small command-line tool to benchmark several binaries
  • roaring-rs: a bitmap data structure that has an efficient contains method
  • once_cell: a good library for creating lazy statics, already used in the repository

Misc

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Upgrade Whatlang dependency

Whatlang introduced new languages and scripts in its newer versions.
We should upgrade our dependency to the latest version.

Reimplement Japanese Segmenter

Reimplement Japanese segmenter using Lindera.

TODO list

  • Read CONTRIBUTING.md about Segmenter implementation
  • Lindera loads dictionaries at initialization
    • Ensure that Lindera is not initialized at each tokenization
    • Add a feature flag for Japanese
  • Use a custom config to initialize Lindera (better segmentation for search usage)

```rust
TokenizerConfig { mode: Mode::Decompose(Penalty::default()), ..TokenizerConfig::default() }
```

  • test segmenter

関西国際空港限定トートバッグ すもももももももものうち should give ["関西", "国際", "空港", "限定", "トートバッグ", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

  • Add benchmarks
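
A heavily hedged sketch of the initialization concern above (exact Lindera import paths and signatures vary between versions, so treat this as structural pseudocode rather than the final API):

```rust
use lindera::tokenizer::{Tokenizer, TokenizerConfig};
use lindera::mode::{Mode, Penalty};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the tokenizer once and reuse it: loading the IPADIC dictionary is
    // the expensive part and must not happen on every tokenization.
    let config = TokenizerConfig {
        // Decompose mode splits compounds such as 関西国際空港 into 関西 / 国際 / 空港.
        mode: Mode::Decompose(Penalty::default()),
        ..TokenizerConfig::default()
    };
    let tokenizer = Tokenizer::with_config(config)?;
    for token in tokenizer.tokenize("関西国際空港限定トートバッグ")? {
        println!("{}", token.text);
    }
    Ok(())
}
```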

Implement a Japanese specialized Normalizer

Today, there is no specialized normalizer for the Japanese Language.

drawback

Meilisearch is unable to find the hiragana version of a word with a katakana query; for instance, ダメ is also spelled 駄目 or だめ.

Technical approach

Create a new Japanese normalizer that unifies hiragana and katakana equivalences.

Interesting libraries

  • wana_kana seems promising for converting everything into hiragana
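
A hedged sketch of the unification using the wana_kana crate mentioned above (depending on the crate version this is a free function or a trait method; the real normalizer would also have to maintain the char map):

```rust
use wana_kana::to_hiragana::to_hiragana;

fn main() {
    // Katakana and hiragana spellings collapse to the same normalized form,
    // so a query for ダメ can match documents containing だめ.
    assert_eq!(to_hiragana("ダメ"), "だめ");
    assert_eq!(to_hiragana("だめ"), "だめ");
}
```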

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Tokenizer refactoring strategy

Implementation Branch: tokenizer-v1.0.0
Draft PR: #77

Summary

As a fast search engine, Meilisearch needs a tokenizer that strikes a pragmatic balance between processing time and relevancy.
The current implementation of the tokenizer lacks clarity and contains ugly hotfixes, making contributions, optimizations, and maintenance difficult.

How to find a pragmatic balance between processing time and relevancy?

First of all, we are not linguists and we don't speak or understand most of the languages that we would want to support; this means that we can't write a tokenizer from scratch and prove whether it is relevant.
That's why the current implementation, and the future ones, rely on segmentation libraries like jieba, unicode-segmentation, or lindera to segment text into words; these libraries are recommended and included in the library by external contributors.
But this has some limits, and the main one is processing time: some libraries, even if they have good relevancy, don't suit our needs because the processing time is too long (👋 Jieba).

Relevancy

Because we can't measure relevancy by ourselves, we want to continue to rely on the community and external libraries.
From this perspective, we need to make the inclusion of an external library by an external contributor as easy as possible:

Code shape

  • Refactor Pipeline by removing preprocessors and making normalizers global #76
  • Refactor Analyzer in order to make a new Tokenizer registration straightforward #76
  • Simplify the return value of Tokenizer (returning a Script and a &str instead of a Token) #76
  • Wrap normalizers in an iterator allowing them to yield several items from one (["l'aventure"] -> ["l", "'", "aventure"]) #76
  • Add a search mode in Segmenter returning all the word derivations (tokenizer search modes do n-grams internally)
  • Enhance clarity by renaming some structures, functions, and files (Segmenter instead of Tokenizer, chinese_cmn.rs instead of jieba.rs) #76
  • Create a test macro allowing contributors to easily test their tokenizer, improving the trust we have in tests by ensuring that all tokenizers are equally tested

Documentation and contribution processes

  • Add documenting comments in main structures (Token, Tokenizer trait..) #76
  • Add a template of a tokenizer as a dummy example of how to add a new tokenizer #76
  • Add a template of a normalizer as a dummy example of how to add a new normalizer
  • Add a CONTRIBUTING.md explaining how to test, bench, and implement tokenizers
  • Enhance README.md
  • Create an issue triage process differentiating each tokenizer scope (detector, segmenter, normalizer, classifier) #88

Minimal requirement to have no regressions

  • Use unicode-segmentation instead of legacy tokenizer for Latin tokenization #76
  • Reimplement Chinese Segmenter (using Jieba)
  • Reimplement Japanese Segmenter (using Lindera) #89
  • Reimplement Deunicode Normalizer only on Script::Latin
  • Reimplement traditional Chinese translation preprocessor into a Normalizer only on Language::Cmn
  • Reimplement control Character remover Normalizer

Processing time

Because tokenization has an impact on Meilisearch's performance, we have to measure the processing time of every new implementation and define limits that must not be exceeded for a contribution to be merged. Sometimes we should consider implementing something ourselves instead of relying on an external library that could significantly impact Meilisearch's performance.

  • Refactor benchmarks to ease benchmark creation by any contributor
  • Define hard limits, like throughput thresholds, to objectively accept or refuse a contribution
  • Add workflows that run benchmarks on the main branch #91

Publish the Meilisearch tokenizer as a crate

In order to increase visibility and external contributions, we may publish this library as a crate.

  • #51
  • Add a user documentation
  • #35

crates.io link: https://crates.io/crates/charabia

NLP

For now, we don't plan to use NLP to tokenize in Meilisearch.

Requirement or advice about Chinese word segmentation

Describe the requirement
I expect the Chinese input text to be split into all possible words, for example:
[screenshot]

The behavior of the current version
[screenshot]

Optimization advice
I notice that you use Jieba's default construction, and this causes some highlighting or search errors in Chinese word segmentation. So, could you use the cut_all method from the Jieba library for Chinese word segmentation?
[screenshot]

Additional text or screenshots
[screenshot]

I await your reply, thanks @ManyTheFish

Add Actual Tokenizer state

Export the actual Meilisearch Tokenizer into this repository to start with a compatible version of it.
The goal is to enhance this iso-functional state and be able to test it iteratively on Meilisearch instead of delivering a final version.

Implement a Compatibility Decomposition Normalizer

Meilisearch is unable to find Canonical and Compatibility equivalences; for instance, ｶﾞｷﾞｸﾞｹﾞｺﾞ can't be found with the query ガギグゲゴ.

Technical approach

Implement a new Normalizer CompatibilityDecompositionNormalizer using the method nfkd of the unicode-normalization crate.
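
A small sketch of the normalization step, using the nfkd method of the unicode-normalization crate:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKD maps half-width katakana plus dakuten to the same code points as the
    // decomposed full-width form, so both spellings normalize identically.
    let document: String = "ｶﾞｷﾞｸﾞｹﾞｺﾞ".nfkd().collect();
    let query: String = "ガギグゲゴ".nfkd().collect();
    assert_eq!(document, query);
}
```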

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Make Latin Segmenter split on `'`

In French, some determiners and adverbs are fused with words that begin with a vowel, using the ' character:

  • l'aventure
  • d'avantage
  • qu'il
  • ...

By default, the Latin segmenter doesn't split them.

Publish tokenizer to crates.io

We should automate this publication in CI (triggered on each release, for example).

  • Manually publish the first version
  • Add meili-bot as an Owner
  • Automate using CI

⚠️ Should be done by a core-engine team member

Korean support

Hello. I'm going to submit a PR for Korean support; please review.
