pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text

License: Apache License 2.0

Rust 91.56% Python 8.44%
rust rust-library rust-crate language-detection language-classification language-recognition natural-language-processing nlp language-identification nlp-machine-learning language-processing

lingua-rs's People

Contributors

dependabot[bot], eekkaiia, lhr0909, martindisch, pemistahl

lingua-rs's Issues

Add Kanji support

Before we start, I would like to make some concepts clear. Kanji are Japanese characters based on Chinese symbols, and I will use "Chinese characters" as a collective name for Simplified Chinese characters, Traditional Chinese characters and Kanji.

It seems that all Chinese characters are identified as Chinese with confidence values of 100 percent in Lingua, which is not right. In fact, some Kanji words are written exactly the same as in Chinese (like 豆腐 (tofu), 科学 (science)), while other Kanji are used in neither Simplified nor Traditional Chinese at all. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese and "経済" in Kanji, but all of them are determined by Lingua 1.4 to be 100% Chinese.

This is not a big problem, as a slightly lengthier text such as a Japanese tweet is likely to contain kana, which helps Lingua distinguish the languages. But it is still incorrect to classify Kanji used only in Japanese as 100% Chinese, so I have to point it out.

Also see greyblake/whatlang-rs/issues/122
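A cheap disambiguation heuristic for the kana case mentioned above can be sketched in a few lines. Note that contains_kana is a hypothetical helper, not part of Lingua's API; it only checks the Hiragana and Katakana Unicode blocks:

```rust
/// Hypothetical helper (not part of Lingua): true if the text contains
/// Hiragana (U+3040..U+309F) or Katakana (U+30A0..U+30FF), which among
/// the CJK languages only Japanese uses.
fn contains_kana(text: &str) -> bool {
    text.chars().any(|c| ('\u{3040}'..='\u{30FF}').contains(&c))
}

fn main() {
    // "経済" is Kanji-only, so this check alone cannot settle it...
    assert!(!contains_kana("経済"));
    // ...but longer Japanese text almost always contains kana.
    assert!(contains_kana("経済は日本語です"));
}
```

For Kanji-only input such as "経済", a lookup table of Kanji used only in Japanese would still be needed, which is essentially what this issue asks for.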

Memory leak

First, thanks for your work on Lingua.

I found a memory leak when using lingua with the tokio runtime. I think the problem comes from the fact that the language detector might be sent or shared (Send/Sync) across multiple threads.

I tried using with_preloaded_language_models() on the builder, with no success.
Even when I build a new LanguageDetector in each method call, I still have a memory leak.
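Since the detector is designed to be shared rather than rebuilt per call, one pattern that rules out repeated model allocations is to construct it once and hand out Arc clones. Below is a minimal std-only sketch with a stand-in Detector type (not Lingua's actual API); the same shape works with tokio tasks:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for lingua::LanguageDetector: expensive to build, cheap to share.
struct Detector {
    models: Vec<u8>,
}

impl Detector {
    fn new() -> Self {
        // Pretend this loads large language models into memory.
        Detector { models: vec![0; 1024] }
    }

    fn detect(&self, text: &str) -> usize {
        // Placeholder for real detection; just returns the input length.
        debug_assert!(!self.models.is_empty());
        text.len()
    }
}

fn main() {
    // Build once, share everywhere: each thread clones the Arc, not the models.
    let detector = Arc::new(Detector::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let d = Arc::clone(&detector);
            thread::spawn(move || d.detect("bonjour"))
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 7);
    }
}
```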

Issue with fragment coordinates inside DetectionResult

I've identified a slight problem with the DetectionResults produced by detect_multiple_languages_of. I do appreciate that this is an "experimental" feature, but it probably needs to be addressed.

To obtain the text in these fragments I'm doing this:

let mut output_str = format!("document {} self.n_ldocs {} text:\n|{}|\n", self.path_str, self.n_ldocs, ldoc_text);
for detection_result in detection_results {
    let fragment: &str = &ldoc_text[detection_result.start_index()..detection_result.end_index()];
    let confidence = self.handling_framework.language_detector.compute_language_confidence(fragment, detection_result.language());
    if confidence > 0.25 {
        ... // probably sensible language classification
    }
    else {
        ... // may be gibberish, or rather technical or something like that
    }
}	

Obviously these are coordinates on the bytes in the String.

With the above, VERY occasionally, I get a panic. Examples:
"thread '<unnamed>' panicked at 'byte index 654 is not a char boundary; it is inside 'ο' (bytes 653..655) of ``ἐυηνεμον: ἐυ \"good\" + ἀνεμος \"wind\". ..."
or
"thread '<unnamed>' panicked at 'byte index 65 is not a char boundary; it is inside '\u{f0fc}' (bytes 64..67) of `` 7.3 ..."

In processing about 2 million words I only get a handful of panics, and these appear to be on VERY exotic and obscure Unicode: accented Ancient Greek or U+F0FC "Private Use Character".

But the trouble is that, currently, I'm not too sure how to "catch" such a thing before it panics: if a char boundary is violated by [result.start_index()..result.end_index()] this does not produce a Result, it just panics! And this in turn means that the whole of the rest of the document I'm parsing never gets processed. (My documents are being parsed in parallel threads so other documents are unaffected).

Naturally, I'm trying to find a way to test each proposed slice to find out whether or not it's "legal". But even if I do, obviously this would mean a lot more processing time.

So maybe you might want to think about providing indices on the chars rather than the bytes: users could then do this:

let text_vec = whole_string.chars().collect::<Vec<_>>();
let fragment = text_vec[detection_result.start_index_of_chars()..detection_result.end_index_of_chars()].iter().cloned().collect::<String>();
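Until char-based indices exist, the panic itself can be avoided without a full chars() round-trip: str::get performs the same slicing but returns an Option instead of panicking on a non-boundary index. safe_fragment below is just an illustrative wrapper, not Lingua API:

```rust
// Safe counterpart to `&text[start..end]`: returns None instead of
// panicking when an index is not on a char boundary.
fn safe_fragment(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let text = "ἐυηνεμον";
    // Byte 1 falls inside the 3-byte 'ἐ', so `&text[0..1]` would panic:
    assert!(safe_fragment(text, 0, 1).is_none());
    assert_eq!(safe_fragment(text, 0, 3), Some("ἐ"));
}
```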

Publish musllinux wheel

Hi,

first of all thanks for this cool project!
I would like to build Alpine Linux images with it, but I cannot use your latest version v2.0.2 because no musllinux wheel is published.
I found an example here of how this could be accomplished.

setMinimumRelativeDistance doesn't seem to work in the WASM build

I am trying to use setMinimumRelativeDistance with WASM in Node, but the output doesn't seem to change.

const detectorBuilder = LanguageDetectorBuilder.fromISOCodes6391([
  "fr",
  "en",
  "de",
]);
detectorBuilder.setMinimumRelativeDistance(0.99);
detectorBuilder.enablePreloadingLanguageModels();
const detector = detectorBuilder.build();

detector.computeLanguageConfidenceValues("prologue");
/*
[
  { language: 'french', confidence: 1 },
  { language: 'english', confidence: 0.8472942943014279 },
  { language: 'german', confidence: 0.755856527416462 }
]
*/

It also doesn't look like I need to call the init method that the documentation shows.
Am I using the WASM build correctly?

How to rebuild existing language models?

Hello, maybe someone can explain to me how to update the already existing language models? I am writing my bachelor's thesis on botanical terms in Latvian, and I have already added a couple of thousand of these botanical terms to the .txt files, but what should I do next to teach the model to tell them apart? I have added 1500 lines of text to "Lv.txt" as well as "la.txt". How can I rebuild the language models?

Add absolute confidence metric

In addition to the current relative confidence metric, an absolute confidence metric shall be implemented that can say how likely it is that a given text is written in a specific language, independently of all the other languages.

Long runtime of language detection

Hi,

I'm evaluating the lingua-rs library and I discovered a long running time for the following program:

#[macro_use]
extern crate lazy_static;

use std::env;
use std::fs;

use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

lazy_static! {
    static ref DETECTOR: LanguageDetector = {
        LanguageDetectorBuilder::from_languages(&[
            Language::Dutch,
            Language::English,
            Language::French,
            Language::German,
            Language::Hungarian,
            Language::Italian,
            Language::Portuguese,
            Language::Russian,
            Language::Spanish,
            Language::Finnish,
            Language::Swedish,
        ])
        .build()
    };
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        eprintln!("missing argument");
        std::process::exit(1);
    }

    let filename = args.get(1).unwrap();
    match fs::read_to_string(filename) {
        Ok(content) => match DETECTOR.detect_language_of(&content) {
            Some(lang) => println!("{},{}", filename, lang.iso_code_639_3()),
            _ => println!("{},", filename),
        },
        _ => println!("{},", filename),
    };

    std::process::exit(0);
}

First I thought the reason for the long running time could be the construction of the language detector. But even after moving this part into a lazy_static block, the runtime is very slow. Is a running time of 9.82 seconds to be expected with lingua for an article of 26,712 words? Are there ways to speed up the program?

I would welcome your response,
Nico
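One workaround worth measuring (a sketch of an idea, not a Lingua feature): classify only a bounded prefix of the article, since a few kilobytes of text are usually enough to identify the language of a monolingual document. The helper below cuts at a char boundary so the slice never panics:

```rust
// Truncate to at most `max_bytes` without splitting a multi-byte character.
fn truncate_on_char_boundary(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text;
    }
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

fn main() {
    let article = "Grüße aus Berlin. ".repeat(1000); // ~20 KB of "content"
    let sample = truncate_on_char_boundary(&article, 4096);
    assert!(sample.len() <= 4096);
    // detect_language_of(sample) would now see at most 4 KiB of text.
}
```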

error[E0308]: mismatched types

Hello,

This crate looks awesome! I plan on using it to detect languages of subtitles. Thanks for writing this crate! I added it to cargo.toml, but I get this on cargo build:

error[E0308]: mismatched types
   --> /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/lingua-1.3.2/src/json.rs:265:32
    |
262 | fn get_language_models_directory(language: Language) -> Dir<'static> {
    |                                                         ------------ expected `include_dir::Dir<'static>` because of return type
...
265 |         Language::Afrikaans => AFRIKAANS_MODELS_DIRECTORY,
    |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `include_dir::Dir`, found struct `include_dir::dir::Dir`
    |
    = note: perhaps two different versions of crate `include_dir` are being used?

For more information about this error, try `rustc --explain E0308`.

Here is my full Cargo.toml dependency list:

clap = { version = "3.0", features = ["derive"] }
regex = "1"
tmdb = "3.0.0"
log = "0.4"
pretty_env_logger = "0.4.0"
walkdir = "2"
dirs = "3.0"
serde = { version = "1.0", features = ["derive"]   }
serde_yaml = "0.8"
md-5 = "0.9"
unicode-truncate = "0.2.0"
lazy_static = "1.4.0"
common-path = "1.0.0"
lingua = "1.3.2"
$ rustc --version
rustc 1.58.1 (db9d1b20b 2022-01-20)

I'm kind of a Rust noob so I'm not sure how to fix this...

[Q] What's the difference?

I found lingua-rs, lingua-py in your repositories.

What is the difference between lingua-rs and lingua-py?


P.S.
First, I could not find lingua-py on PyPI.
So, I tried creating a Python binding of lingua-rs as lingua-py a little while ago.
What do you think about this initiative?

Significant startup time

Hi,

I am trying to embed this into my program, but I am seeing a very long startup time when setting up the detector. Can I ask (at a high level) why that is, other than the model files being larger than those of other libraries due to the use of 5-grams? Is there anything we can do to speed up the initialization time?

Happy to help contribute if needed.

Individual vs inclusive 639-3 codes

Hi,

I'm doing a comparison between Lingua and the new FastText model for NLLB with your benchmark (which, BTW, if you are interested, I can submit a PR with the necessary changes to run the benchmark with the new FastText model). This model uses ISO 639-3, but I found some differences between the set of codes in the FastText NLLB model and Lingua. These exist because FT almost always uses individual language codes, whereas Lingua uses inclusive codes in most cases.

The ideal, I think, would be to identify all possible languages and therefore always use individual codes, but I know that this is hard, especially for pluricentric languages (like Malay or Serbo-Croatian), and even more so if the variants are mutually intelligible. Or maybe there's no data to train a model for each variant.

So, I just wanted to point out these differences in case they are useful for you. These are the conversions I'm doing:

fn map_fasttext_to_lingua(label: &str) -> Option<Language> {
    let language = label.split('_').collect::<Vec<&str>>()[4]; // skip the __label__ prefix

    // Convert FT codes to Lingua codes that do not match directly
    match language {
        // FT uses individual Azerbaijani (North, South) codes, Lingua uses the inclusive code
        "azb" | "azj" => return Some(Language::Azerbaijani),
        // FT uses the individual Albanian Tosk code, Lingua uses the inclusive SQ code
        // Seems that all the text in the test set is Tosk
        "als" => return Some(Language::Albanian),
        // FT uses the individual Standard Latvian code
        "lvs" | "lvg" => return Some(Language::Latvian),
        // Although the individual Indonesian code is used in Lingua, the inclusive
        // Malay code is also in use
        "zsm" => return Some(Language::Malay),
        // same with Mongolian
        "khk" => return Some(Language::Mongolian),
        "pes" | "prs" => return Some(Language::Persian),
        _ => {},
    }

    for lingua_language in Language::iter() {
        if language == lingua_language.iso_code_639_3().to_string() {
            return Some(lingua_language);
        }
    }
    println!("Language code '{}' not found", language);
    None
}

I do not speak any of the languages that differ and do not know the source of the test data, so I cannot tell if this is 100% true. But there are test sets where the FastText model supports both variants and still says the text is only one variant:

$ cat lingua-rs/language-models/lv/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    989 __label__lvs_Latn
      4 __label__est_Latn
      3 __label__lit_Latn
      1 __label__pol_Latn
      1 __label__oci_Latn
      1 __label__kor_Hang
      1 __label__hun_Latn
$ cat lingua-rs/language-models/fa/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    985 __label__pes_Arab
     13 __label__prs_Arab
      1 __label__yue_Hant
      1 __label__arb_Arab
$ cat lingua-rs/language-models/az/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    999 __label__azj_Latn
      1 __label__tur_Latn

so maybe Lingua is using inclusive codes but in practice it is only covering one of the variants of that inclusive code?

For context, these are the list of inclusive and individual codes and names from Wikipedia:
Latvian lav – inclusive code

Farsi fas – inclusive code

Azerbaijani aze – inclusive code

  • azj – North Azerbaijani
  • azb – South Azerbaijani

There is also the case of Malay, where Lingua uses the inclusive code msa, but this code includes Indonesian ind. Maybe the Lingua code should be Standard Malay zsm? But this is a difficult case and may need much more work, since Wikipedia says they are close to mutually intelligible, and we already know from the benchmark that tools struggle to differentiate between them:

$ cat lingua-rs/language-models/ms/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    802 __label__ind_Latn
    186 __label__zsm_Latn
      5 __label__eng_Latn
      3 __label__jav_Latn
      1 __label__yue_Hant
      1 __label__pol_Latn
      1 __label__hrv_Latn
      1 __label__cat_Latn
$ cat lingua-rs/language-models/id/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr 
    957 __label__ind_Latn
     38 __label__zsm_Latn
      4 __label__jav_Latn
      1 __label__sun_Latn

Sorry about this "brick" of text and thank you for your tool, it is really helpful!

Add WASM support

This isn't quite specific to lingua-rs, but I've been looking into WebAssembly lately and it would be great to be able to use lingua-rs in a wasm project. I did an initial test, but it failed on a problem with bzip2-sys. I'll keep looking into it and let you know what I find, but I thought you might be interested.

Make each language a separate feature

Hey. As I understand it, all languages are included in the binary. That is convenient, but it produces very large binaries and significantly increases the compilation time. It would be great to be able to include only the languages I need by specifying them as features in Cargo.toml.

Language detector is sometimes non-deterministic

I am observing non-deterministic output from lingua on text that is, to be fair, hard to classify. Minimal reproduction:

use lingua::{Language, LanguageDetectorBuilder};
use std::collections::HashMap;

fn main() {
    let langs = vec![
        "en", "de", "fr", "es", "it", "pt", "ja", "ko", "hi", "ar", "zh",
    ];

    let detector = LanguageDetectorBuilder::from_languages(
        &langs
            .iter()
            .map(|l| {
                Language::from_iso_code_639_1(
                    &l.parse()
                        .unwrap_or_else(|e| panic!("failed to load lang {}; {:?}", l, e)),
                )
            })
            .collect::<Vec<_>>(),
    )
    .build();

    let text = "ام وی با نیکی میناج تیزر داشت؟؟؟؟؟؟ i vote for bts ( _ ) as the _ via ( _ )";

    let mut results: HashMap<String, usize> = HashMap::new();
    for _ in 0..100 {
        let lang = detector
            .detect_language_of(text)
            .unwrap()
            .iso_code_639_1()
            .to_string();
        *results.entry(lang).or_insert(0) += 1;
    }

    dbg!(results);
}

output:

[src/bin/lingua_ndt_test.rs:34] results = {
    "en": 44,
    "ar": 56,
}

Lingua version: lingua = "1.2.2"

Is this intentional, or a known issue?

Thanks!

LanguageDetector needs at least 2 languages to choose from Error

Hi, I copied the sample code into my project and deleted all the languages except English:

use lingua::Language::English;
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

pub fn is_valid_english_word(word: String) {
    let languages = vec![English];
    let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&languages).build();
    let detected_language: Option<Language> = detector.detect_language_of("languages are awesome");

    assert_eq!(detected_language, Some(English));
}

My use case is that I am building a game like Wordle, and I want a function that I can pass an arbitrary string into and have it return a bool indicating whether the string is a valid English word or not...

Is there a reason this is not supported by lingua-rs? Or is there some other function I could call to accomplish this efficiently? Thanks!

PS. here is the full error I see when running it:

thread 'main' panicked at 'LanguageDetector needs at least 2 languages to choose from', /Users/jim/.cargo/registry/src/github.com-1ecc6299db9ec823/lingua-1.4.0/src/builder.rs:91:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
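For the Wordle-style use case described above, a statistical language detector is arguably the wrong tool: it estimates which language a text resembles, not whether a string is a dictionary word. A plain word-list lookup answers the actual question; a minimal sketch (the three-word list stands in for a real dictionary file):

```rust
use std::collections::HashSet;

// `dictionary` is a placeholder for a real word list loaded from a file.
fn is_valid_english_word(dictionary: &HashSet<&str>, word: &str) -> bool {
    dictionary.contains(word.to_lowercase().as_str())
}

fn main() {
    let dictionary = HashSet::from(["crane", "slate", "adieu"]);
    assert!(is_valid_english_word(&dictionary, "CRANE"));
    assert!(!is_valid_english_word(&dictionary, "xyzzy"));
}
```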

Failing to compile lingua 1.0.2

When trying to compile a project that uses lingua 1.0.2 as a dependency, compilation fails with the following error. This is a new Rust project with no other dependencies and just a println, as I just wanted to check out the library.

Compiling lingua v1.0.2
error[E0603]: module `export` is private
   --> C:\Users\luana\.cargo\registry\src\github.com-1ecc6299db9ec823\lingua-1.0.2\src\ngram.rs:19:12
    |
19  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |
note: the module `export` is defined here
   --> C:\Users\luana\.cargo\registry\src\github.com-1ecc6299db9ec823\serde-1.0.119\src\lib.rs:275:5
    |
275 | use self::__private as export;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `lingua`.

Feature flag to disable wasm

Oddly, I had to add a feature flag to lingua-rs to disable wasm so that I could use it in a WebAssembly project. It wasn't actually disabling wasm so much as disabling the wasm-bindgen integration. I'm running it in a non-browser setting, and the code generated by bindgen causes problems. Let me know if you'd like me to put in a PR. It was only a couple of minor changes.

Improvements with multi-language detection?

Firstly: THANK YOU! Not least for attempting the daunting task of multi-language detection. I have great hopes for this crate!

Ultimately, say one is creating an Elasticsearch index (my case), and a proportion of your documents do contain mixed languages. Detecting the language of fragments of lines or paragraphs is critically important: if you try to index French text with an English stemmer analyser, you risk building a very defective index, so in that case you're likely to opt not to use a stemmer analyser at all...

Here is an example of output from multi-language detection:

..text:

|Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email is [email protected]. But on my phone (after installing \"Plex TV\") I entered forgon34 for the pwd. But I think I used f*******9f when setting up on Linux.
installing Plex on Linux (from IT Diary 2022-01)
- from here
chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|

These are the detection results (languages were English, French, German, Spanish, Latin, Irish):

DetectionResult { start_index: 0, end_index: 76, word_count: 2, language: French } |Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- |

DetectionResult { start_index: 76, end_index: 277, word_count: 2, language: English } |various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email |

DetectionResult { start_index: 277, end_index: 317, word_count: 2, language: Latin } |is [email protected]. But on my phone |

DetectionResult { start_index: 317, end_index: 423, word_count: 2, language: English } |(after installing \"Plex TV\") I entered forgon34 for the pwd. But I think I used f*******9f when setting |

DetectionResult { start_index: 423, end_index: 470, word_count: 2, language: French } |up on Linux.
installing Plex on Linux (from IT |

DetectionResult { start_index: 470, end_index: 497, word_count: 5, language: English } |Diary 2022-01)
- from here
|

DetectionResult { start_index: 497, end_index: 592, word_count: 2, language: Latin } |chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|

Obviously this is an analysis of what you might call "jottings", not proper sentences, and indeed "jottings of a technical jargony IT-language kind". But even so, I think there should be some way of "giving up the attempt" and saying "can't make head or tail of this part of your text: REJECT".

Perhaps more importantly, as it currently stands the multi-language detection part doesn't deliver confidence levels in its DetectionResults. Without such levels it is difficult to put these results to any practical use. I can of course subject each DetectionResult text fragment to a second analysis for confidence, using the same LanguageDetector with the Language value from detection_result.language(). This actually makes the above results much more manageable: not surprisingly, the supposed French and Latin fragments above turn out to have very low confidence, the English ones higher.

So I think it'd be nice to incorporate a confidence rating into DetectionResult. I suspect it'd also help with the previous issue of "False positives with gibberish": i.e. if feeding in about 10 lines of gibberish splits the text into multiple language fragments, but all of them have a confidence of 0.2 or less, one can conclude: GIBBERISH detected!

Naturally, all this is going to take CPU power. But you could probably find some optimisations by incorporating the confidence rating directly, compared to the two-stage analysis I'm doing above...

GPU Support

Could you make it possible for the calculations to be done on the GPU instead of the CPU? The performance is quite bad when running on the CPU, but that is the case for every neural network. Both TensorFlow and Torch have Rust bindings, so this could speed up the performance drastically.

Note: the performance in Rust is somehow worse than in Python. Is this due to the libraries, or did you implement the matrix multiplication yourself?

Bug in iso_code_639_3() debug print

println!("{:?} -- {} -- {:?}", i, i.iso_code_639_3(), i.iso_code_639_3());
Italian -- ita -- ITA

Debug print shouldn't be giving a different value than a normal print.

Also it seems impossible to do anything with the value returned:

30 |       if i == "Italian" { println!("Match"); }
   |            ^^ no implementation for `Language == str`
   
30 |       if i.to_owned() == "Italian" { println!("Match"); }
   |                          ^^^^^^^^^ expected enum `Language`, found `&str`
   
30 |       if i.display() == "Italian" { println!("Match"); }
   |            ^^^^^^^ method not found in `&Language`

It appears that we're missing a Display implementation... Do I just need to use the strum_macros crate? Should I attempt to make a PR to add the Display trait?
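For reference, the Debug/Display split can be illustrated with a hypothetical IsoCode enum (not Lingua's actual code): #[derive(Debug)] prints the variant name, while a manual Display impl controls what `{}` and `to_string()` produce, which is why the two can legitimately differ:

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum IsoCode {
    Ita,
}

impl fmt::Display for IsoCode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display emits the lowercase ISO code, unlike Debug's variant name.
        match self {
            IsoCode::Ita => write!(f, "ita"),
        }
    }
}

fn main() {
    let code = IsoCode::Ita;
    assert_eq!(format!("{:?}", code), "Ita"); // Debug
    assert_eq!(code.to_string(), "ita"); // Display
    // String comparison then works without any extra crate:
    assert!(code.to_string() == "ita");
}
```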

The accuracy is looking really good; I've tested so far with about 50 subtitles (grabbing 10 lines evenly distributed throughout the file) and it's 100% accurate so far. Thanks for such an awesome library, and your help using it. :)

model.rs panics when creating language model from text that contains multi-byte Unicode characters

let slice = &lowercased_line[i..i + ngram_length];

This line of code resulted in the error below when reading a line containing the character 'ō':

thread 'main' panicked at 'byte index 23 is not a char boundary; it is inside 'ō' (bytes 22..24) of nau mai haere mai ki tō mātou pae tukutuku. kia kore koe e ngaro taku reo rangatira.', src/model.rs:135:30

A similar error occurred in the compute_relative_frequencies function at model.rs:160 for the same reason. I believe these functions iterate through the line by byte, which doesn't work with the multi-byte Unicode representation of 'ō'. I noticed a different approach was used in implementing the from function for TestDataLanguageModel, which appears to handle multi-byte Unicode correctly:

model.rs:185: let slice = &chars[i..i + ngram_length].iter().collect::<String>();
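The char-based approach can be sketched as a standalone function (an illustration of the technique, not the exact model.rs code): collecting the chars first makes every slice boundary a character boundary by construction:

```rust
// Extract all ngrams of `ngram_length` chars; slicing a Vec<char>
// can never land inside a multi-byte UTF-8 encoding.
fn char_ngrams(line: &str, ngram_length: usize) -> Vec<String> {
    let chars: Vec<char> = line.chars().collect();
    if chars.len() < ngram_length {
        return Vec::new();
    }
    (0..=chars.len() - ngram_length)
        .map(|i| chars[i..i + ngram_length].iter().collect())
        .collect()
}

fn main() {
    // 'ō' is two bytes in UTF-8 but a single char, so no panic occurs.
    assert_eq!(char_ngrams("tō m", 2), vec!["tō", "ō ", " m"]);
}
```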

Detect multiple languages in mixed-language text

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]

Add builder pattern to WASM API

The current WASM API does not strictly follow the builder pattern used in the Rust API. That's because I did not know back then how to accomplish it with wasm-bindgen. While working on the WASM API for grex, I found out how to implement the builder pattern. Before creating the JavaScript distribution for Lingua, these changes have to be applied first.

Spoken languages

Chinese is a written language, but it is spoken in dialects; for example, people write Chinese but speak Mandarin. Other spoken dialects exist as well, such as Cantonese, Hokkien and Teochew. Is there a way to detect them?

Add low accuracy mode

Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors. The large language models also consume significant amounts of memory. These requirements might not be feasible for systems running low on resources.

For users who want to classify mostly long texts or need to save resources, a so-called low accuracy mode will be implemented that loads only a small subset of the language models into memory. The API will be as follows:

LanguageDetectorBuilder::from_all_languages().with_low_accuracy_mode().build();

The downside of this approach is that detection accuracy for short texts consisting of less than 120 characters will drop significantly. However, detection accuracy for texts which are longer than 120 characters will remain mostly unaffected.

Add traditional vs. simplified Chinese support

Hi! I'm considering using Lingua in a project. It compares favourably to CLD2 and Whatlang on our dataset (social media posts), but one of our requirements is that we need to distinguish between traditional and simplified Chinese, which Lingua does not support.

Are there any plans to support this? Our requirement for Chinese support probably won't be crucial until later in the year, so if support is in development that would go a long way.

Thanks!

Let LanguageDetectorBuilder::build() return a Result

One of the things I've been working on is an HTTP service wrapper around lingua. My implementation allows the caller to define the various pieces of a detector or just use the default for the service instance. If they specify invalid options (say, an unrecognized language or out of range minimum relative distance) I have to either validate the options manually before attempting to create the builder or catch the panic. If validation was done in LanguageDetectorBuilder::build and that returned a Result<LanguageDetector, Error> or the like I could rely on the library itself for validation.

Reduce resources to load language models

Currently, the language models are parsed from json files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python.

One promising candidate could be ndarray.

lingua-java

Can you provide a java version of the library?

Add Python bindings

As the pure Python implementation of Lingua is quite slow, let's write Python bindings for the Rust implementation using PyO3 and Maturin.

Find a more compact compression algorithm for language model files

The json language models are currently compressed as zip files. For WASM compilation and usage of the library in the browser, it is beneficial to compress the language models as compactly as possible. Let's investigate whether there is a more compact compression algorithm that produces smaller language model files.

A promising candidate could be the Brotli algorithm.

Intermittent error: linking with `cc` failed: exit status: 1

I've gotten this a couple of times after typing cargo build, but the next time it works again.

error: linking with `cc` failed: exit status: 1
  |
  = note: "cc" "-m64" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.11thvd6dfrecx3o8.rcgu.o" ... (long list of .rcgu.o object files elided)
"/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.4tj3l2apmepaqk7o.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.4vcyp6bxyab24zc7.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.4vmtw5bwzte5qtkr.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.4wkgdwbwm8hy6pp8.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.5367g0tx3pii9kdh.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.536yflxwv5vgc7h8.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.53i4wpjs1rpth6gs.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.55lpsyjefz3ki9gs.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.56i92estdp8qzqzt.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.5fjy3a5k4rzazruq.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.5nl3qrvfvztuhgg.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.66o534q71pzh0a3.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.79vxic2w4oezj3g.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.emhx73d7ov0kx64.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.llabwf19r62xsec.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.lzc28pp1tpsxrs7.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.qmmdulbjm1e81pq.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.rd682tx83iua0gw.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.uhegtwb0e3wphzn.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.uhf4hdkal7zj7q2.rcgu.o" 
"/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.x0kztkb91whu3yq.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.xb2i9ovqwx2xtqi.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.xy1aeiw4qbq6ilv.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.yak2iprw2uwpnto.rcgu.o" "/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c.17d0rw6hh0hiypvy.rcgu.o" "-Wl,--as-needed" "-L" "/home/michael/rust/subtitle/target/debug/deps" "-L" "/home/michael/rust/subtitle/target/debug/build/bzip2-sys-46e0634bee4121eb/out/lib" "-L" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/home/michael/rust/subtitle/target/debug/deps/liblingua-ed6b1bd7ec39d525.rlib" "/home/michael/rust/subtitle/target/debug/deps/libserde_json-ce92a1ab39bbbe59.rlib" "/home/michael/rust/subtitle/target/debug/deps/libryu-050e098a2dda45ef.rlib" "/home/michael/rust/subtitle/target/debug/deps/libitoa-4d5d7f7e186c0f94.rlib" "/home/michael/rust/subtitle/target/debug/deps/libzip-ece73881225903f5.rlib" "/home/michael/rust/subtitle/target/debug/deps/libtime-8908003473c053ea.rlib" "/home/michael/rust/subtitle/target/debug/deps/libthiserror-3ee51433ef07d1bd.rlib" "/home/michael/rust/subtitle/target/debug/deps/libbzip2-c55932eb9934526a.rlib" "/home/michael/rust/subtitle/target/debug/deps/libbzip2_sys-af16f7c3f0e5902a.rlib" "/home/michael/rust/subtitle/target/debug/deps/libflate2-1a1115700cf39281.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcrc32fast-c0f218621d004ca5.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_zulu_language_model-64a6743b6005349f.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_yoruba_language_model-224e4ff34fcb195c.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_xhosa_language_model-5d228a3f42c254d9.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/liblingua_welsh_language_model-c55490ea0e1bdd0f.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_vietnamese_language_model-c600e5b3e1db245c.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_urdu_language_model-c96c4d2f92a2667d.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_ukrainian_language_model-e1ec8232f6068720.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_turkish_language_model-2a58406bb93ffdab.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_tswana_language_model-d9f098582b21f0a5.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_tsonga_language_model-7f86ca071e58cfe4.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_thai_language_model-a13168ee29cd3c08.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_telugu_language_model-c2199a51b4e8562b.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_tamil_language_model-206331b4d82e51d7.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_tagalog_language_model-4ad60186d5af52d7.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_swedish_language_model-f785be7e8ef1494a.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_swahili_language_model-39d95c7df060de33.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_spanish_language_model-7fdee99c07de22e0.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_sotho_language_model-7e719f856a1b04c2.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_somali_language_model-a82ef18286060ab2.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_slovene_language_model-b1f3e1a6d0be5710.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_slovak_language_model-f0e6ef65b3743fb7.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_shona_language_model-5f55a81467a2cdab.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/liblingua_serbian_language_model-543c98ed7aa217c3.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_russian_language_model-7fefcf6e93f267fb.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_romanian_language_model-7ed889fce4073500.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_punjabi_language_model-b4c7fa3c91260141.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_portuguese_language_model-96e3737eb32f94ff.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_polish_language_model-1e7ee06be25535b1.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_persian_language_model-737f5ca384c73be1.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_nynorsk_language_model-08ce01d989598e6e.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_mongolian_language_model-b3e93dc69da5510b.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_marathi_language_model-1babe9f83f9ac371.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_maori_language_model-789ffa462326b3ae.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_malay_language_model-4f5d6c88575ad5b8.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_macedonian_language_model-fd2ea2b3f16ee49e.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_lithuanian_language_model-39c4a59c0136305a.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_latvian_language_model-5a0fed3217f4bfd4.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_latin_language_model-ecdd1a700629f92d.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_korean_language_model-320b3755d47bc459.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_kazakh_language_model-0c411fff1cbd1b74.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_japanese_language_model-bacfcc8004bf5bbb.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/liblingua_italian_language_model-bb4b17f555b03c85.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_irish_language_model-e4c738bd0d0a105c.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_indonesian_language_model-8554c850a1b88420.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_icelandic_language_model-e7fdd34007264d9c.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_hungarian_language_model-f84383d3e7f90613.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_hindi_language_model-0e831745b8245449.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_hebrew_language_model-206d54f2583050ae.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_gujarati_language_model-b8507c6398184236.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_greek_language_model-8ce379ce04fb1a30.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_german_language_model-6133f4d10bac6939.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_georgian_language_model-83b9f30207395bdc.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_ganda_language_model-ca8db64e0c7bc2e6.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_french_language_model-e131e22c490d9d18.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_finnish_language_model-3eb323b232510a5f.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_estonian_language_model-41160611c5f9ba78.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_esperanto_language_model-a2096676e482b4c0.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_english_language_model-0cc07153de6797eb.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_dutch_language_model-45f3d38c59d18a2d.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_danish_language_model-053ebfe0e1698880.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/liblingua_czech_language_model-46b3630a71c51783.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_croatian_language_model-d545bb4691cb6a72.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_chinese_language_model-2c228e16eb104292.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_catalan_language_model-606670fad98fdb20.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_bulgarian_language_model-d46042781fb7dbc3.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_bosnian_language_model-6656750d55377201.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_bokmal_language_model-2823183bbc539117.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_bengali_language_model-3c80623941f109b8.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_belarusian_language_model-a813d6a3347cb529.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_basque_language_model-243062d09687b528.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_azerbaijani_language_model-8a12b2b9a4530432.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_armenian_language_model-f7155c575bcc13fa.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_arabic_language_model-2f5a48313782c3eb.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_albanian_language_model-c941a6053509daa6.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblingua_afrikaans_language_model-a5ad036e3a3655f2.rlib" "/home/michael/rust/subtitle/target/debug/deps/libinclude_dir-e641a0289c35b12d.rlib" "/home/michael/rust/subtitle/target/debug/deps/libserde-651822612513554e.rlib" "/home/michael/rust/subtitle/target/debug/deps/libfraction-847fa46f8973b16f.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum-933bb0e8c83d276b.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_rational-6e27a372e1d2ccbd.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/libnum_complex-b5f896634803a7d3.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_bigint-3a4897602813e640.rlib" "/home/michael/rust/subtitle/target/debug/deps/librayon-eb5c879b6830fa22.rlib" "/home/michael/rust/subtitle/target/debug/deps/librayon_core-7a7f669454868282.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_cpus-cc15a9bf47fd9fc0.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcrossbeam_deque-e17884f69efc8ee2.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcrossbeam_epoch-63429b775bf0550a.rlib" "/home/michael/rust/subtitle/target/debug/deps/libmemoffset-c771a95d298bf070.rlib" "/home/michael/rust/subtitle/target/debug/deps/libscopeguard-3c0626f1adc91bb3.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcrossbeam_channel-c30e0327498e4778.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcrossbeam_utils-915b534af8c46b3e.rlib" "/home/michael/rust/subtitle/target/debug/deps/libitertools-f87152c6de794f46.rlib" "/home/michael/rust/subtitle/target/debug/deps/libstrum-5ed8345b830f2bf8.rlib" "/home/michael/rust/subtitle/target/debug/deps/libregex-d0fa6d6ad1ccfd8c.rlib" "/home/michael/rust/subtitle/target/debug/deps/libaho_corasick-a0937382ff8ed835.rlib" "/home/michael/rust/subtitle/target/debug/deps/libregex_syntax-218be71617600397.rlib" "/home/michael/rust/subtitle/target/debug/deps/libonce_cell-9c211aa164c984a2.rlib" "/home/michael/rust/subtitle/target/debug/deps/libmaplit-579da72ef836b0c9.rlib" "/home/michael/rust/subtitle/target/debug/deps/libsubparse-269016addd310109.rlib" "/home/michael/rust/subtitle/target/debug/deps/libchardet-4cdfc950178dc823.rlib" "/home/michael/rust/subtitle/target/debug/deps/libvobsub-ee9e302cc061972b.rlib" "/home/michael/rust/subtitle/target/debug/deps/libsafemem-193ea9e9ac99ae03.rlib" "/home/michael/rust/subtitle/target/debug/deps/libregex-5bb32593868c7702.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/libutf8_ranges-1c62538f44b87a5c.rlib" "/home/michael/rust/subtitle/target/debug/deps/libregex_syntax-542f75a84d24ad30.rlib" "/home/michael/rust/subtitle/target/debug/deps/libucd_util-825ca322f095cf2c.rlib" "/home/michael/rust/subtitle/target/debug/deps/libthread_local-a54ac9afb9104833.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblazy_static-54f0ec464d8b1ca3.rlib" "/home/michael/rust/subtitle/target/debug/deps/libaho_corasick-857d483733623952.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnom-c116472db25f6052.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblog-a163410a1fa14877.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblog-1d2b579999f3967f.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblazy_static-3455d2760b51f45b.rlib" "/home/michael/rust/subtitle/target/debug/deps/libimage-24d2dc56c561d784.rlib" "/home/michael/rust/subtitle/target/debug/deps/libenum_primitive-6ef34525dd90e139.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_traits-5885b75fedf4c4ce.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_rational-20a9545941de99a5.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_iter-30335a17ad6a3ea6.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_integer-feda2d697837371e.rlib" "/home/michael/rust/subtitle/target/debug/deps/libnum_traits-1e97bec4fca918d0.rlib" "/home/michael/rust/subtitle/target/debug/deps/liberror_chain-50a3756de0889b91.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcast-0b8ed6e1671cc356.rlib" "/home/michael/rust/subtitle/target/debug/deps/libitertools-29154d2738053e3a.rlib" "/home/michael/rust/subtitle/target/debug/deps/libeither-bc343242c2f141a8.rlib" "/home/michael/rust/subtitle/target/debug/deps/libfailure-680141de5c607a7f.rlib" "/home/michael/rust/subtitle/target/debug/deps/libbacktrace-e04326a532a73f51.rlib" "/home/michael/rust/subtitle/target/debug/deps/libminiz_oxide-c1ece4febcc85929.rlib" 
"/home/michael/rust/subtitle/target/debug/deps/libadler-7d5e004860bdd1ff.rlib" "/home/michael/rust/subtitle/target/debug/deps/libobject-41cda1797f29291c.rlib" "/home/michael/rust/subtitle/target/debug/deps/libmemchr-7a85b35aceb91ef0.rlib" "/home/michael/rust/subtitle/target/debug/deps/liblibc-998997e8b8431161.rlib" "/home/michael/rust/subtitle/target/debug/deps/libaddr2line-257b95aac0bdd3a6.rlib" "/home/michael/rust/subtitle/target/debug/deps/libgimli-da9bfac713f7c0d7.rlib" "/home/michael/rust/subtitle/target/debug/deps/librustc_demangle-548379ae201bc0f6.rlib" "/home/michael/rust/subtitle/target/debug/deps/libencoding_rs-36899678387093fe.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcfg_if-d79d4230b67dff33.rlib" "/home/michael/rust/subtitle/target/debug/deps/libcombine-580aa026ac4ade6a.rlib" "/home/michael/rust/subtitle/target/debug/deps/libascii-6b2d56dc9f0fde1b.rlib" "/home/michael/rust/subtitle/target/debug/deps/libbyteorder-a8009d40ace471f8.rlib" "-Wl,--start-group" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-4c74cbab78ec4891.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libpanic_unwind-0ef58120f7b95253.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libminiz_oxide-e35e56ad39c7e20e.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libadler-671a9f10c55c6c87.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libobject-ee577127549b7793.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libmemchr-bed369233e55d851.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libaddr2line-e8504b1ed73d6c6f.rlib" 
"/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libgimli-411eeeec028606dc.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-0ddec007a0883060.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_demangle-7c5cb27d99d10614.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libhashbrown-6c448d94453f4d95.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_alloc-22835d1ac5e3244b.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libunwind-84878e033904a7a4.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcfg_if-c0badcb9f7c5eab7.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liblibc-b4424726f33da388.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc-aa0bad4c4d134922.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_core-483ad457673e0f5c.rlib" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-6cfcec236d576603.rlib" "-Wl,--end-group" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-5667a4a7e2c48d47.rlib" "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L" "/home/michael/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" 
"/home/michael/rust/subtitle/target/debug/deps/subtitle-8217c7740d593e6c" "-Wl,--gc-sections" "-pie" "-Wl,-zrelro,-znow" "-nodefaultlibs"
  = note: collect2: error: ld returned 1 exit status
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"

Bias towards non-English languages?

I won't bombard you with any more issues. I just think this crate is really excellent and am excited by it. It's going to make my Elasticsearch indices and my use of them much better.

Most of the strings I'm subjecting to analysis are roughly 100 to maybe 1000 characters long.

I have quite a few bilingual documents in my corpus, almost all pairing English with some other language, usually with English in one column and the other language in the other. Parsing such a document tends to produce quite a bit of text with, say, Irish and English mixed.

In those cases Irish almost always seems to be chosen as "language with the highest confidence". So then I thought I'd examine the confidence levels for all 6 languages across these bilingual Irish-English strings. To my surprise, it is usually Irish 1.0 and English 0.0, or occasionally something like Irish 0.88 and English 0.09.

This tends to suggest that if a non-English language is detected it is given a higher "weighting" than English.

But if you are offering multiple-language detection (which I realise is an experimental feature at this stage), a bias against any one language in this way is a bit unfortunate: it makes it harder to identify strings that appear to contain runs of more than one language, which you would then pass to detect_multiple_languages_of for more detailed analysis.

I'd be interested to hear what you have to say about this. Meanwhile I may well clone your app and see if there are any obvious ways I might be able to tweak things a bit to address some of the issues I have currently.

Make lazy-loading of language models optional

Currently, all language models are preloaded into memory when creating an instance of LanguageDetector.
In order to be consistent with the JVM implementation of Lingua, language models should be lazy-loaded by default. A new method LanguageDetectorBuilder.with_preloaded_language_models() shall be introduced that explicitly enables the previous behavior.
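The lazy-loading pattern being requested can be illustrated with `std::sync::OnceLock`: the model is only built on first access, while the proposed `with_preloaded_language_models()` would force it eagerly at construction time. This is a minimal sketch of the pattern only, not lingua's actual internals; the `LanguageModel` struct and `load_model` function below are made up for illustration.

```rust
use std::sync::OnceLock;

// Stand-in for a real n-gram model; in lingua this would be the
// deserialized language model data.
struct LanguageModel {
    ngrams: Vec<&'static str>,
}

fn load_model() -> LanguageModel {
    // Expensive in reality (file I/O plus deserialization); trivial here.
    LanguageModel { ngrams: vec!["th", "he", "in"] }
}

struct Detector {
    model: OnceLock<LanguageModel>,
}

impl Detector {
    fn new(preload: bool) -> Self {
        let detector = Detector { model: OnceLock::new() };
        if preload {
            // Eager behavior: pay the loading cost at construction time.
            detector.model.get_or_init(load_model);
        }
        detector
    }

    fn model(&self) -> &LanguageModel {
        // Lazy behavior: the first caller triggers loading; later
        // callers get the cached instance.
        self.model.get_or_init(load_model)
    }
}

fn main() {
    let lazy = Detector::new(false);
    assert!(lazy.model.get().is_none()); // nothing loaded yet
    assert_eq!(lazy.model().ngrams.len(), 3); // loaded on first access

    let eager = Detector::new(true);
    assert!(eager.model.get().is_some()); // loaded up front
    println!("lazy and eager loading both behave as expected");
}
```

`OnceLock` also makes the lazy variant thread-safe for free: concurrent first accesses race to initialize, but only one `load_model` result is kept.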

Types `IsoCode639_X`: please derive `Copy, Hash, EnumVariantNames, Deserialize` and `Serialize` also

The enums IsoCode639_1 and IsoCode639_3 are part of your public API.

Please consider replacing them with the types from the crate codes-iso-639.

If this is not possible, please also derive Copy, Hash, EnumString, EnumVariantNames, Deserialize and Serialize.
This would make the enums much more useful for consumers of your API.

#[cfg(feature = "serde")]
use serde::{Deserialize, Serialize};
use strum_macros::{EnumString, EnumVariantNames};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, EnumString, EnumVariantNames)]
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
#[allow(clippy::upper_case_acronyms)]
#[strum(ascii_case_insensitive, serialize_all = "kebab_case")]
pub enum IsoCode639_1 {...}

The same applies to IsoCode639_3.

Add parallel equivalents for all methods in `LanguageDetector`

This issue is related to #262.

Python's approaches to concurrency and parallelism are not very efficient when applied to algorithms implemented in Rust. So let's write parallel equivalents for all methods in LanguageDetector with the help of the awesome Rayon library. Python code then just needs to call those methods to benefit from performant and efficient low-level parallelism.
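The shape of such a parallel equivalent can be sketched with the standard library alone; the real implementation would use Rayon's `par_iter`, and `detect_language_of` below is a dummy stand-in, not lingua's detector.

```rust
use std::thread;

// Dummy stand-in for LanguageDetector::detect_language_of.
fn detect_language_of(text: &str) -> Option<&'static str> {
    if text.contains("the") { Some("English") } else { None }
}

// Sequential method: one call per text.
fn detect_languages_of(texts: &[&str]) -> Vec<Option<&'static str>> {
    texts.iter().map(|t| detect_language_of(t)).collect()
}

// Parallel equivalent: split the slice across scoped threads. With
// Rayon this whole body would collapse to
// `texts.par_iter().map(|t| detect_language_of(t)).collect()`.
fn detect_languages_in_parallel_of(texts: &[&str]) -> Vec<Option<&'static str>> {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_size = texts.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = texts
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|t| detect_language_of(t)).collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order preserves the input order of results.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let texts = ["the quick brown fox", "xyzzy", "over the lazy dog"];
    let sequential = detect_languages_of(&texts);
    let parallel = detect_languages_in_parallel_of(&texts);
    assert_eq!(sequential, parallel); // same results, computed concurrently
    println!("{parallel:?}");
}
```

The important property for the Python bindings is that the sequential and parallel methods return identical, order-preserving results, so callers can switch between them freely.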

False positives with gibberish

There are some false positives when inputting gibberish: Lingua identifies it as a language when it should return None.

Examples:
vszzc hvwg wg zcbu hslh
5HeQsKSTseGZrDvdCAUYr6DyxS5jy4953UWACh9bN2rUFkj2sDuY3BS
VGhpcyBpcyBhbiBleGFtcGxlIG9mIGJhc2U2NA==
KZDWQ4DDPFBHAY3ZIJUGE2KCNRSUORTUMNDXQ3CJI44W2SKHJJUGGMSVGJHECPJ5

The project I'm working on has a lot of gibberish. We need to distinguish between different languages and gibberish. I've been looking for solutions but I'm not an expert at NLP.

I'd like your opinion on what the best solution for that use case would be.
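As a stopgap, strings like the base64 and cipher-text examples above can often be pre-filtered with a crude character heuristic before calling the detector. The sketch below flags input whose words are mostly vowel-free or mix digits into letters; this is a naive assumption about Latin-script text, not a proper NLP solution and not part of lingua's API.

```rust
// Naive gibberish pre-filter: real words in most Latin-script
// languages contain vowels, while base64 and cipher text tend to mix
// digits, long consonant runs and unusual casing.
fn looks_like_gibberish(text: &str) -> bool {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return true;
    }
    let suspicious = words
        .iter()
        .filter(|w| {
            let letters = w.chars().filter(|c| c.is_ascii_alphabetic()).count();
            let vowels = w.chars().filter(|c| "aeiouAEIOU".contains(*c)).count();
            let digits = w.chars().filter(|c| c.is_ascii_digit()).count();
            // No vowels in a longish word, or digits mixed into letters.
            (letters >= 4 && vowels == 0) || (digits > 0 && letters > 0)
        })
        .count();
    // Flag the text if more than half of its words look suspicious.
    suspicious * 2 > words.len()
}

fn main() {
    assert!(looks_like_gibberish("vszzc hvwg wg zcbu hslh"));
    assert!(looks_like_gibberish("VGhpcyBpcyBhbiBleGFtcGxlIG9mIGJhc2U2NA=="));
    assert!(!looks_like_gibberish("This is a perfectly ordinary sentence"));
    println!("heuristic behaves as expected");
}
```

A heuristic like this only gates the obvious cases; ambiguous input would still need a confidence-based threshold on the detector's side.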
