
charabia's Introduction


⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

[Screenshots: a bright-colored and a dark-colored application for finding movies screening near the user]

🔥 Try it! 🔥

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

[Image: logos of languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP]

โš™๏ธ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.

charabia's People

Contributors

afluffyhotdog, bors[bot], carofg, choznerol, crudiedo, curquiza, cymruu, daniel-shuy, datamaker, dependabot[bot], draliragab, dureuill, gmourier, goodhoko, harshalkhachane, irevoire, kerollmops, kination, manythefish, marinpostma, matthias-wright, meili-bors[bot], meili-bot, miiton, mosuka, qbx2, roms1383, samyak2, xshadowlegendx, yenwel


charabia's Issues

Disable HMM feature of Jieba

Today, we are using the Hidden Markov Model algorithm (HMM) provided by the cut method of Jieba to segment unknown Chinese words in the Chinese segmenter.

drawback

Following the subdiscussion in the official discussion about Chinese support in Meilisearch, it seems that the HMM feature of Jieba is not relevant in the context of a search engine. This feature creates longer words and inconsistencies in the segmentation, which reduces the recall of Meilisearch without significantly raising the precision.

enhancement

Deactivate the HMM feature in Chinese segmentation.
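
For illustration, a minimal sketch of the intended change, assuming the jieba-rs API where `cut` takes an `hmm` flag:

```rust
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    // Passing `false` disables the Hidden Markov Model used to guess unknown
    // words, keeping the segmentation shorter and more consistent for search.
    let words = jieba.cut("南京市长江大桥", false);
    println!("{:?}", words);
}
```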

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Latin script: Segmenter should split camelCased words

Today, Meilisearch splits snake_case, SCREAMING_CASE, and kebab-case properly but doesn't split PascalCase or camelCase.

drawback

Meilisearch doesn't completely support code documentation.

enhancement

Make the Latin Segmenter split camelCase and PascalCase words (see the sketch after the examples):

  • "camelCase" -> ["camel", "Case"]
  • "PascalCase" -> ["Pascal", "Case"]
  • "IJsland" -> ["IJsland"] (Language trap)
  • "CASE" -> ["CASE"] (another trap)

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Arabic script: Implement specialized Segmenter

Currently, the Arabic script is segmented on whitespace and punctuation.

Drawback

Following the dedicated discussion on Arabic language support and the linked issues, agglutinated words are not segmented, as pointed out in this comment:

the agglutinated word الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The and it's always connected (not space-separated) to the next word.

Enhancement

We should find a specialized segmenter for the Arabic script or, failing that, a dictionary to implement our own segmenter inspired by the Thai segmenter.


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Decompose Japanese compound words

Summary

The morphological dictionary that Lindera includes by default is IPADIC.
IPADIC includes many compound words, for example 関西国際空港 (Kansai International Airport).
However, if you index in the default mode, the word 関西国際空港 (Kansai International Airport) will be indexed as the single term 関西国際空港, and you will not be able to search for the keyword 空港 (Airport).
So, Lindera has a function to decompose such compound words.
This is a feature similar to Kuromoji's search mode.

`num_graphemes_from_bytes` does not work when used for a prefix of a raw Token

The Issue

The output of num_graphemes_from_bytes is wrong when:

  • num_bytes is smaller than the length of the string
  • the token does not have its char_map initialized, possibly because the Token was created outside of the Tokenizer or because the unicode segmenter was not run.

It should return num_bytes back since each character is assumed to occupy one byte. Instead, it returns the length of the underlying string.
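
For illustration, a sketch of the expected fallback when no char_map is available (written as a free function; in charabia this logic lives in a method on Token):

```rust
fn fallback_num_graphemes(lemma: &str, num_bytes: usize) -> usize {
    // Without a char_map, each character is assumed to occupy one byte, so the
    // number of graphemes in the prefix is the prefix length in bytes, clamped
    // to the token's own length instead of always returning lemma.len().
    num_bytes.min(lemma.len())
}

fn main() {
    assert_eq!(fallback_num_graphemes("tokenizer", 4), 4); // prefix of 4 bytes
    assert_eq!(fallback_num_graphemes("tok", 10), 3);      // clamped to the string length
}
```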

Context

This bug was introduced by me in #59 😆

See also: meilisearch/milli#426 (comment)

Classify tokens after Segmentation instead of Normalization

Classifying tokens after Segmentation instead of after Normalization in the tokenization pipeline would enhance the precision of the stop_words classification.
Today, stop words need to be normalized to be properly classified; however, the normalization is more or less lossy and can classify unexpected stop words.
For instance, in French, maïs (corn in 🇬🇧) would be normalized as mais (but in 🇬🇧), and so maïs will be classified as a stop word if the stop word list contains mais.
This misclassification would not happen if the classifier were called before the normalizer.
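
A minimal illustration of the problem, assuming a deunicode-style Latin normalizer (the deunicode crate):

```rust
use deunicode::deunicode;

fn main() {
    let stop_words = ["mais"]; // French stop word "mais" (but)
    let token = "maïs";        // "maïs" (corn) is not a stop word
    // Classifying after normalization wrongly flags "maïs" as a stop word...
    assert!(stop_words.contains(&deunicode(token).as_str()));
    // ...while classifying before normalization keeps it as a regular word.
    assert!(!stop_words.contains(&token));
}
```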

Technical approach

Invert the normalization step and the classification step in the tokenization process

Refactor normalizers

Today, creating a normalizer is much harder than creating a segmenter, mainly because of the char map, a field that is necessary to manage highlights.

Technical Approach

Refactor the Normalizer trait to expose a normalize_str and a normalize_char method that take a Cow<str> as parameter and return a Cow<str>. All the char-map creation should be done in a single function calling these two methods.
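
A hedged sketch of the trait shape this describes (signatures assumed from the wording above, not charabia's final API):

```rust
use std::borrow::Cow;

trait Normalizer {
    /// Normalize a whole lemma in one pass.
    fn normalize_str<'a>(&self, lemma: Cow<'a, str>) -> Cow<'a, str>;
    /// Normalize the slice holding a single character.
    fn normalize_char<'a>(&self, c: Cow<'a, str>) -> Cow<'a, str>;
}

// The char map needed for highlighting would then be built by a single shared
// function calling these two methods, instead of inside every normalizer.
```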

Implement Jyutping normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological representation.
In order to have accurate phonology for Cantonese, we should normalize Chinese characters into Jyutping using the kCantonese dictionary of the Unihan database.
We should find an efficient way to normalize characters, so the dictionary may need to be reformatted.

Files expected to be modified

Misc

related to product#503
original source of the dictionary: unihan.zip in https://unicode.org/Public/UNIDATA/

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Implement Pinyin normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological representation.
In order to have accurate phonology for Mandarin, we should normalize Chinese characters into Pinyin using the pinyin crate.
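
A rough sketch of the transliteration step, assuming the pinyin crate's ToPinyin trait (the real normalizer would also have to maintain the char map):

```rust
use pinyin::ToPinyin;

fn normalize_to_pinyin(text: &str) -> String {
    let mut out = String::new();
    for (c, reading) in text.chars().zip(text.to_pinyin()) {
        match reading {
            // Replace a Han character by its toneless Pinyin reading.
            Some(p) => out.push_str(p.plain()),
            // Keep non-Han characters untouched.
            None => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(normalize_to_pinyin("中国"), "zhongguo");
}
```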

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Add an allowlist to the tokenizer builder

Today, Charabia automatically detects the language of the provided text and chooses the best tokenization pipeline accordingly.

drawback

Sometimes the detection is not accurate, mainly when the provided text is short, and the user can't manually declare the languages contained in the provided text.

enhancement

Add a new setting to the TokenizerBuilder forcing the detection to choose from a subset of languages, and, when there is no choice left to make, skip the detection and pick the specialized pipeline directly.
Whatlang, the library used to detect the language, provides the Detector::with_allowlist method to restrict detection to a subset of languages.

Technical approach:

  1. add an optional allowlist parameter to the method detect of the Detect trait in detection/mod.rs
  2. add a segment_with_allowlist and a segment_str_with_allowlist method, each taking an additional allowlist parameter, to the Segment trait in segmenter/mod.rs
  3. add an allowlist method to the TokenizerBuilder struct in tokenizer.rs

The allowlist should be a hashmap of Script -> [Languages]
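
A minimal sketch of the whatlang side of the feature, using the Detector::with_allowlist constructor mentioned above:

```rust
use whatlang::{Detector, Lang};

fn main() {
    // Restrict detection to the languages declared by the user.
    let detector = Detector::with_allowlist(vec![Lang::Eng, Lang::Fra]);
    if let Some(info) = detector.detect("Une phrase un peu courte") {
        println!("script: {:?}, lang: {:?}", info.script(), info.lang());
    }
}
```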

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Compile/Install Charabia on OpenBSD

I am on OpenBSD running on a Raspberry Pi 4. I am unable to install Meilisearch because cargo is not able to find charabia, so I decided to compile from source.
I downloaded the source from GitHub and ran `cargo run` in the charabia source code. I get the error:

```
error: failed to parse manifest at `/home/kabira/LibOpenSource/charabia/Cargo.toml`

Caused by:
  namespaced features with the `dep:` prefix are only allowed on the nightly channel and requires the `-Z namespaced-features` flag on the command-line
```

Any workaround suggestions would be great.

Explain the name of the repo in the README

Following @CaroFG's idea, we could explain the name of the repo in the README, since some people find it "offensive".

Here are some explanations Many (@ManyTheFish) made on Twitter:

We chose the name of this repository in the same spirit as discord or meili: naming it after the problem we want to solve.
Personally, I don't feel like it's an offensive word, but more a funny pun on "char".
Moreover, other tokenizers don't always have an understandable name, for instance lindera, maintained by @minoru_osuka, or even jieba.
I hope my explanation was clear enough, and I hope the name will not discourage you from using or even contributing to the project! 😊

readme Hebrew segmentation link points to jieba

As the title says: in the README, clicking "unicode-segmentation" on the Hebrew row takes the user to the jieba repo.

I assume the correct link would be the same as Latin's "unicode-segmentation."

Tokenizer for Ja/Ko

Hello~
I'm currently testing the tokenizer with Japanese/Korean, but it seems it is not working correctly.

Is there some working plan for this?

Thanks.

Handle words containing non-separating dots and commas in Latin tokenization

Summary

Handle S.O.S as one word (S.O.S) instead of three (S, O, S), and numbers like 3.5 as one word (3.5) instead of two (3, 5).

Explanation

The current tokenizer considers any . or , as a hard separator, meaning that the two separated words are not considered to be part of the same context.
But there are exceptions for some words, like numbers, that are separated by . or , but should be considered as one and only one word.

We should modify the current Latin tokenizer to handle this case.
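
An illustrative sketch of the exception (not the actual charabia separator logic): a dot or comma is only a hard separator when it is not surrounded by alphanumeric characters.

```rust
/// `idx` is the byte index of the candidate `.` or `,` inside `text`.
fn is_hard_separator(text: &str, idx: usize) -> bool {
    let c = text[idx..].chars().next().unwrap();
    if c != '.' && c != ',' {
        return false;
    }
    let prev = text[..idx].chars().next_back();
    let next = text[idx + c.len_utf8()..].chars().next();
    // "S.O.S" and "3.5" keep their dots; "end. Next" still splits.
    !(prev.map_or(false, |p| p.is_alphanumeric()) && next.map_or(false, |n| n.is_alphanumeric()))
}

fn main() {
    assert!(!is_hard_separator("3.5", 1));      // kept inside the number
    assert!(!is_hard_separator("S.O.S", 1));    // kept inside the acronym
    assert!(is_hard_separator("end. Next", 3)); // still a separator
}
```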

Move the FST based Segmenter in a standalone file

For the Thai segmenter, we tried a finite-state-transducer (FST) based segmenter.
This segmenter has really good performance, and the dictionaries encoded as FSTs are smaller than raw txt/csv/tsv dictionaries.
For now, the segmenter lives in the Thai segmenter file (segmenter/thai.rs); in order to reuse it for other languages, it would be better to move it to its own file.
A new struct FstSegmenter may be created wrapping all the iterative segmentation logic.
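
A rough sketch, assuming the fst crate, of the kind of iterative longest-match logic an FstSegmenter could wrap (the real code will differ, e.g. it should walk the FST once per position instead of probing every prefix):

```rust
use fst::Set;

/// Greedy longest-match segmentation against a dictionary encoded as an FST.
/// Characters not covered by the dictionary are emitted one at a time.
fn segment<'a>(dict: &Set<Vec<u8>>, mut text: &'a str) -> Vec<&'a str> {
    let mut words = Vec::new();
    while !text.is_empty() {
        // Default: a single character, used when no dictionary word matches.
        let mut end = text.chars().next().unwrap().len_utf8();
        // Keep the longest prefix of `text` present in the dictionary.
        for boundary in text.char_indices().map(|(i, c)| i + c.len_utf8()) {
            if dict.contains(text[..boundary].as_bytes()) {
                end = boundary;
            }
        }
        words.push(&text[..end]);
        text = &text[end..];
    }
    words
}

fn main() -> Result<(), fst::Error> {
    // Entries must be provided in lexicographic order when building the set.
    let dict = Set::from_iter(["bon", "bonjour", "le", "monde"])?;
    assert_eq!(segment(&dict, "bonjourlemonde"), vec!["bonjour", "le", "monde"]);
    Ok(())
}
```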

File expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Handle non-breakable spaces

The tokenizer must handle non-breakable spaces.

For example, it should handle the following examples this way:

  • 3 456 678, where the space is a non-breaking space and is not considered as a separator
    - Альфа where ь is not considered as a space, so a separator
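
For illustration, a small sketch (not charabia's actual separator logic) where non-breaking spaces are excluded from the separator set:

```rust
/// Treat U+00A0 NO-BREAK SPACE and U+202F NARROW NO-BREAK SPACE as part of the
/// token instead of as separators.
fn is_separator(c: char) -> bool {
    !matches!(c, '\u{00A0}' | '\u{202F}') && (c.is_whitespace() || c.is_ascii_punctuation())
}

fn main() {
    assert!(!is_separator('\u{00A0}')); // "3 456 678" stays a single token
    assert!(is_separator(' '));
    assert!(is_separator(','));
}
```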

Related to meilisearch/meilisearch#1335 cf @shekhirin comment

Enhance Chinese normalizer by unifying `Z`, `Simplified`, and `Semantic` variants

Following the official discussion about Chinese support in Meilisearch, it is relevant to normalize Chinese characters by unifying Z, Simplified, and Semantic variants before transliterating them into Pinyin.

To know more about each variant, you can read the dedicated report on unicode.org.

There are several dictionaries listing variants that we could use; I suggest using the kvariants dictionary made by hfhchan (see the related documentation in the same repo).

Technical approach

Import and rework the dictionary into a key-value binding of each variant; then, in the Chinese normalizer, convert the provided character before transliterating it into Pinyin.
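
A rough sketch of the lookup, assuming the kvariants data has been reworked into a compiled-in variant-to-canonical table (the entries below are only illustrative placeholders):

```rust
use std::collections::HashMap;
use once_cell::sync::Lazy;

// In the real implementation this table would be generated from the kvariants
// dictionary; the two entries here are illustrative placeholders.
static KVARIANTS: Lazy<HashMap<char, char>> = Lazy::new(|| {
    HashMap::from([('戶', '户'), ('戸', '户')])
});

fn unify_variant(c: char) -> char {
    // Convert the character to its canonical variant before the Pinyin step.
    *KVARIANTS.get(&c).unwrap_or(&c)
}

fn main() {
    assert_eq!(unify_variant('戶'), '户');
    assert_eq!(unify_variant('山'), '山'); // unknown characters pass through
}
```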

Files expected to be modified

Misc

related to meilisearch/product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Handle multi languages in the same attribute

The Tokenizer currently uses the whatlang library, which detects the (most probable) language of the attribute.

The Tokenizer must be able to detect several languages in the same attribute.

Also, maybe it would be a better idea to let the user decide the language?

Implement an efficient `Nonspacing Mark` Normalizer

In an Information Retrieval (IR) context, removing nonspacing marks such as diacritics is a good way to increase recall without losing much precision, for instance in the Latin, Arabic, or Hebrew scripts.

Technical Approach

Implement a new Normalizer, named NonspacingMarkNormalizer, that removes the nonspacing marks from a provided token (a naive implementation with the exhaustive list can be found in the Misc section).
Because there are a lot of sparse character ranges to match, it would be inefficient to create a big if-forest to know whether a character is a nonspacing mark.
For this reason, I suggest trying several implementations of the naive version in a small local project and benchmarking them.
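
As a starting point, a naive sketch with only a few well-known ranges hard-coded (the real list is far longer, which is exactly why the data structure should be benchmarked with the crates listed below):

```rust
fn is_nonspacing_mark(c: char) -> bool {
    // A few well-known nonspacing-mark ranges; the exhaustive list is much longer.
    matches!(c,
        '\u{0300}'..='\u{036F}'   // Combining Diacritical Marks (Latin)
        | '\u{0591}'..='\u{05BD}' // Hebrew points and accents
        | '\u{0610}'..='\u{061A}' // Arabic signs
        | '\u{064B}'..='\u{065F}' // Arabic diacritics
    )
}

fn remove_nonspacing_marks(token: &str) -> String {
    token.chars().filter(|c| !is_nonspacing_mark(*c)).collect()
}

fn main() {
    // "é" written as "e" + COMBINING ACUTE ACCENT loses its mark.
    assert_eq!(remove_nonspacing_marks("e\u{0301}"), "e");
}
```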

Interesting Rust Crates

  • hyperfine: a small command-line tool to benchmark several binaries
  • roaring-rs: a bitmap data structure that has an efficient contains method
  • once_cell: a good library for creating lazy statics, already used in the repository

Misc

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Upgrade Whatlang dependency

Whatlang introduced new languages and scripts in its newer versions.
We should upgrade our dependency to the latest version.

Reimplement Japanese Segmenter

Reimplement Japanese segmenter using Lindera.

TODO list

  • Read CONTRIBUTING.md about Segmenter implementation
  • Lindera loads dictionaries at initialization
    • Ensure that Lindera is not initialized at each tokenization
    • Add a feature flag for Japanese
  • Use a custom config to initialize Lindera (better segmentation for search usage)

```rust
TokenizerConfig { mode: Mode::Decompose(Penalty::default()), ..TokenizerConfig::default() }
```

  • test segmenter

関西国際空港限定トートバッグ すもももももももものうち should give ["関西", "国際", "空港", "限定", "トートバッグ", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

  • Add benchmarks
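
A heavily hedged sketch of the initialization concern above (exact Lindera import paths and signatures vary between versions, so treat this as structural pseudocode rather than the final API):

```rust
use lindera::tokenizer::{Tokenizer, TokenizerConfig};
use lindera::mode::{Mode, Penalty};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the tokenizer once and reuse it: loading the IPADIC dictionary is
    // the expensive part and must not happen on every tokenization.
    let config = TokenizerConfig {
        // Decompose mode splits compounds such as 関西国際空港 into 関西 / 国際 / 空港.
        mode: Mode::Decompose(Penalty::default()),
        ..TokenizerConfig::default()
    };
    let tokenizer = Tokenizer::with_config(config)?;
    for token in tokenizer.tokenize("関西国際空港限定トートバッグ")? {
        println!("{}", token.text);
    }
    Ok(())
}
```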

Implement a Japanese specialized Normalizer

Today, there is no specialized normalizer for the Japanese Language.

drawback

Meilisearch is unable to find the hiragana version of a word with a katakana query; for instance, ダメ is also spelled 駄目 or だめ.

Technical approach

Create a new Japanese normalizer that unifies hiragana and katakana equivalences.

Interesting libraries

  • wana_kana seems promising for converting everything into hiragana
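
A hedged sketch of the unification using the wana_kana crate mentioned above (depending on the crate version this is a free function or a trait method; the real normalizer would also have to maintain the char map):

```rust
use wana_kana::to_hiragana::to_hiragana;

fn main() {
    // Katakana and hiragana spellings collapse to the same normalized form,
    // so a query for ダメ can match documents containing だめ.
    assert_eq!(to_hiragana("ダメ"), "だめ");
    assert_eq!(to_hiragana("だめ"), "だめ");
}
```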

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Tokenizer refactoring strategy

Implementation Branch: tokenizer-v1.0.0
Draft PR: #77

Summary

As a fast search engine, Meilisearch needs a tokenizer that strikes a pragmatic balance between processing time and relevancy.
The current implementation of the tokenizer lacks clarity and contains ugly hotfixes, making contributions, optimizations, and maintenance difficult.

How to find a pragmatic balance between processing time and relevancy?

First of all, we are not linguists and we don't speak or understand most of the languages that we would want to support; this means that we can't write a tokenizer from scratch and prove whether it is relevant.
That's why the current implementation, and the future ones, rely on segmentation libraries like jieba, unicode-segmentation, or lindera to segment text into words; these libraries are recommended and included in the library by external contributors.
But this has some limits, and the main one is processing time: some libraries, even if they have good relevancy, don't suit our needs because the processing time is too long (👋 Jieba).

Relevancy

Because we can't measure relevancy by ourselves, we want to continue to rely on the community and external libraries.
From this perspective, we need to make the inclusion of an external library by an external contributor as easy as possible:

Code shape

  • Refactor Pipeline by removing preprocessors and making normalizers global #76
  • Refactor Analyzer in order to make a new Tokenizer registration straightforward #76
  • Simplify the return value of Tokenizer (returning a Script and a &str instead of a Token) #76
  • Wrap normalizers in an iterator allowing them to yield several items from one (["l'aventure"] -> ["l", "'", "aventure"]) #76
  • Add a search mode in Segmenter returning all the word derivations (tokenizer search modes do n-grams internally)
  • Enhance clarity by renaming some structures, functions, and files (Segmenter instead of Tokenizer, chinese_cmn.rs instead of jieba.rs) #76
  • Create a test macro allowing contributors to easily test their tokenizer, improving the trust we have in tests by ensuring that all tokenizers are equally tested

Documentation and contribution processes

  • Add documenting comments in main structures (Token, Tokenizer trait..) #76
  • Add a template of a tokenizer as a dummy example of how to add a new tokenizer #76
  • Add a template of a normalizer as a dummy example of how to add a new normalizer
  • Add a CONTRIBUTING.md explaining how to test, bench, and implement tokenizers
  • Enhance README.md
  • Create an issue triage process differentiating each tokenizer scope (detector, segmenter, normalizer, classifier) #88

Minimal requirement to have no regressions

  • Use unicode-segmentation instead of legacy tokenizer for Latin tokenization #76
  • Reimplement Chinese Segmenter (using Jieba)
  • Reimplement Japanese Segmenter (using Lindera) #89
  • Reimplement Deunicode Normalizer only on Script::Latin
  • Reimplement traditional Chinese translation preprocessor into a Normalizer only on Language::Cmn
  • Reimplement control Character remover Normalizer

Processing time

Because tokenization has an impact on Meilisearch's performance, we have to measure the processing time of every new implementation and define limits that must not be exceeded for a contribution to be merged. Sometimes we should consider implementing something ourselves instead of relying on an external library that could significantly impact Meilisearch's performance.

  • Refactor benchmarks to ease benchmark creation by any contributor
  • Define hard limits, like throughput thresholds, to objectively accept or refuse a contribution
  • Add workflows that run benchmarks on the main branch #91

Publish the Meilisearch tokenizer as a crate

In order to increase visibility and external contributions, we may publish this library as a crate.

  • #51
  • Add a user documentation
  • #35

crates.io link: https://crates.io/crates/charabia

NLP

For now, we don't plan to use NLP to tokenize in Meilisearch.

Requirement or advice about Chinese word segmentation

Describe the requirement
I expect the Chinese input text to be split into all possible words, for example:
[screenshot]

The behavior of the current version
[screenshot]

Optimization advice
I notice that you use Jieba's default construction, and this causes some highlighting or search errors in Chinese word segmentation. So, could you use the cut_all method from the Jieba library for Chinese word segmentation?
[screenshot]

Additional text or screenshots
[screenshot]

I await your reply, thanks @ManyTheFish

Add Actual Tokenizer state

Export the actual Meilisearch Tokenizer into this repository to start with a compatible version of it.
The goal is to enhance this iso-functional state and be able to test it iteratively on Meilisearch instead of delivering a final version.

Implement a Compatibility Decomposition Normalizer

Meilisearch is unable to find Canonical and Compatibility equivalences; for instance, ｶﾞｷﾞｸﾞｹﾞｺﾞ can't be found with the query ガギグゲゴ.

Technical approach

Implement a new Normalizer CompatibilityDecompositionNormalizer using the method nfkd of the unicode-normalization crate.
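
A small sketch of the normalization step, using the nfkd method of the unicode-normalization crate:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKD maps half-width katakana plus dakuten to the same code points as the
    // decomposed full-width form, so both spellings normalize identically.
    let document: String = "ｶﾞｷﾞｸﾞｹﾞｺﾞ".nfkd().collect();
    let query: String = "ガギグゲゴ".nfkd().collect();
    assert_eq!(document, query);
}
```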

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Make Latin Segmenter split on `'`

In French, some determiners and adverbs are fused with words that begin with a vowel, using the ' character:

  • l'aventure
  • d'avantage
  • qu'il
  • ...

By default, the Latin segmenter doesn't split them.

Publish tokenizer to crates.io

We should automate this publication in CI (triggered on each release, for example).

  • Manually publish the first version
  • Add meili-bot as an Owner
  • Automate using CI

⚠️ Should be done by a core-engine team member

Korean support

Hello. I'm going to submit a PR for Korean support; please review.
