pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text

License: Apache License 2.0

Go 100.00%
natural-language-processing language-detection language-recognition language-classification language-identification language-processing nlp nlp-machine-learning golang-library go

lingua-go's Introduction


1. What does this library do?

Its task is simple: It tells you which language some text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

So far, the only other comprehensive open source library in the Go ecosystem for this task is Whatlanggo. Unfortunately, it has two major drawbacks:

  1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it does not provide adequate results.
  2. The more languages take part in the decision process, the less accurate the detection results become.

Lingua aims at eliminating these problems. She needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. She draws on both rule-based and statistical methods but does not use any dictionaries of words. She does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:

  • A
    • Afrikaans
    • Albanian
    • Arabic
    • Armenian
    • Azerbaijani
  • B
    • Basque
    • Belarusian
    • Bengali
    • Norwegian Bokmal
    • Bosnian
    • Bulgarian
  • C
    • Catalan
    • Chinese
    • Croatian
    • Czech
  • D
    • Danish
    • Dutch
  • E
    • English
    • Esperanto
    • Estonian
  • F
    • Finnish
    • French
  • G
    • Ganda
    • Georgian
    • German
    • Greek
    • Gujarati
  • H
    • Hebrew
    • Hindi
    • Hungarian
  • I
    • Icelandic
    • Indonesian
    • Irish
    • Italian
  • J
    • Japanese
  • K
    • Kazakh
    • Korean
  • L
    • Latin
    • Latvian
    • Lithuanian
  • M
    • Macedonian
    • Malay
    • Maori
    • Marathi
    • Mongolian
  • N
    • Norwegian Nynorsk
  • P
    • Persian
    • Polish
    • Portuguese
    • Punjabi
  • R
    • Romanian
    • Russian
  • S
    • Serbian
    • Shona
    • Slovak
    • Slovene
    • Somali
    • Sotho
    • Spanish
    • Swahili
    • Swedish
  • T
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Tsonga
    • Tswana
    • Turkish
  • U
    • Ukrainian
    • Urdu
  • V
    • Vietnamese
  • W
    • Welsh
  • X
    • Xhosa
  • Y
    • Yoruba
  • Z
    • Zulu

4. How good is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

  1. a list of single words with a minimum length of 5 characters
  2. a list of word pairs with a minimum length of 10 characters
  3. a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua and Whatlanggo running over the data of Lingua's supported 75 languages. Additionally, I have added Google's CLD3 to the comparison with the help of the gocld3 bindings. Languages that are not supported by CLD3 or Whatlanggo are simply ignored during the detection process.

Each of the following sections contains two plots. The bar plot shows the detailed accuracy results for each supported language. The box plot illustrates the distributions of the accuracy values for each classifier. The boxes themselves represent the areas which the middle 50 % of data lie within. Within the colored boxes, the horizontal lines mark the median of the distributions.

4.1 Single word detection

[Plot: Single Word Detection Performance]
[Bar plot: Single Word Detection Performance]

4.2 Word pair detection

[Plot: Word Pair Detection Performance]
[Bar plot: Word Pair Detection Performance]

4.3 Sentence detection

[Plot: Sentence Detection Performance]
[Bar plot: Sentence Detection Performance]

4.4 Average detection

[Plot: Average Detection Performance]
[Bar plot: Average Detection Performance]

4.5 Mean, median and standard deviation

The table below shows detailed statistics for each language and classifier including mean, median and standard deviation.

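Accuracy values are given in percent. Each of the four column groups (Average, Single Words, Word Pairs, Sentences) lists the classifiers in the same order: Lingua (high accuracy mode), Lingua (low accuracy mode), CLD3, Whatlang. A dash marks a language that the classifier does not support.

Language | Average: Lingua high, Lingua low, CLD3, Whatlang | Single Words: Lingua high, Lingua low, CLD3, Whatlang | Word Pairs: Lingua high, Lingua low, CLD3, Whatlang | Sentences: Lingua high, Lingua low, CLD3, Whatlang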
Afrikaans 79 64 55 51 58 38 22 21 81 62 46 39 97 93 98 92
Albanian 88 80 55 - 69 54 18 - 95 86 48 - 100 99 98 -
Arabic 98 94 90 89 96 88 79 77 99 96 92 91 100 99 100 99
Armenian 100 100 99 - 100 100 100 - 100 100 100 - 100 100 97 -
Azerbaijani 90 82 81 64 77 71 62 45 92 78 82 58 99 96 99 91
Basque 84 75 62 - 71 56 33 - 87 76 62 - 93 92 92 -
Belarusian 97 92 84 81 92 80 67 64 99 95 86 80 100 100 100 98
Bengali 100 100 99 100 100 100 98 100 100 100 99 100 100 100 99 100
Bokmal 58 50 - 34 39 27 - 15 59 47 - 28 77 75 - 60
Bosnian 35 29 33 - 29 23 19 - 35 29 28 - 41 36 52 -
Bulgarian 87 78 70 61 70 56 45 37 91 81 66 57 99 96 98 89
Catalan 70 58 48 - 51 33 19 - 74 60 42 - 87 82 84 -
Chinese 100 100 92 100 100 100 92 100 100 100 83 100 100 100 100 100
Croatian 73 60 42 55 53 36 26 28 74 57 42 44 90 86 58 91
Czech 80 71 64 50 66 54 39 31 84 72 65 46 91 87 88 71
Danish 81 70 58 47 61 45 26 24 84 70 54 38 98 95 95 79
Dutch 77 64 58 47 55 36 29 22 81 61 47 36 96 94 97 82
English 81 63 54 49 55 29 22 17 89 62 44 35 99 97 97 94
Esperanto 84 66 57 52 67 44 22 25 85 61 51 45 98 93 98 88
Estonian 92 83 70 61 80 62 41 36 96 88 69 53 100 99 99 94
Finnish 96 91 80 71 90 77 58 45 98 95 84 70 100 100 99 98
French 89 77 55 64 74 52 22 37 94 83 49 59 99 98 94 97
Ganda 91 84 - - 79 65 - - 95 87 - - 100 100 - -
Georgian 100 100 98 100 100 100 99 100 100 100 100 100 100 100 96 100
German 89 80 66 65 74 57 40 38 94 84 62 60 100 99 98 97
Greek 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
Gujarati 100 100 100 100 100 100 99 100 100 100 100 100 100 100 100 100
Hebrew 100 100 - 90 100 100 - 76 100 100 - 94 100 100 - 99
Hindi 73 33 58 52 61 11 34 27 64 20 45 40 95 67 95 88
Hungarian 95 90 76 62 87 77 53 37 98 94 76 53 100 100 99 95
Icelandic 93 88 71 - 83 72 42 - 97 92 70 - 100 99 99 -
Indonesian 61 47 46 67 39 25 26 39 61 46 45 66 83 71 66 95
Irish 91 85 67 - 82 70 42 - 94 90 66 - 96 95 94 -
Italian 87 71 62 56 69 42 31 25 92 74 57 47 100 98 98 96
Japanese 100 100 98 99 100 100 97 100 100 100 96 100 100 100 100 97
Kazakh 92 90 82 - 80 78 62 - 96 93 83 - 99 99 99 -
Korean 100 100 99 100 100 100 100 100 100 100 100 100 100 100 98 100
Latin 87 73 62 - 72 49 44 - 93 76 58 - 97 94 83 -
Latvian 93 87 75 59 85 75 51 36 97 90 77 54 99 97 98 87
Lithuanian 95 87 72 62 86 76 42 38 98 89 75 56 100 98 99 92
Macedonian 84 72 60 62 66 52 30 39 86 70 54 55 99 95 97 94
Malay 31 31 22 - 26 22 11 - 38 36 22 - 28 35 34 -
Maori 91 82 52 - 82 62 22 - 92 87 43 - 99 98 91 -
Marathi 85 39 84 73 74 16 69 52 85 30 84 74 96 72 98 93
Mongolian 97 95 83 - 93 89 63 - 99 98 87 - 99 99 99 -
Nynorsk 66 52 - 34 41 25 - 10 66 49 - 24 91 81 - 69
Persian 90 80 76 70 78 62 57 46 94 80 70 66 100 98 99 99
Polish 95 90 77 66 85 77 51 45 98 93 80 59 100 99 99 94
Portuguese 81 69 53 57 59 42 21 26 85 70 40 48 99 95 97 96
Punjabi 100 100 100 100 100 100 99 100 100 100 100 100 100 100 100 100
Romanian 87 72 53 59 69 49 24 34 92 74 48 52 99 94 88 90
Russian 90 78 71 53 76 59 48 40 95 84 72 52 98 92 93 68
Serbian 88 78 78 57 74 62 63 34 90 80 75 51 99 91 95 86
Shona 91 81 76 68 78 56 51 44 96 86 79 65 100 100 99 95
Slovak 84 75 63 - 64 49 32 - 90 78 61 - 99 97 96 -
Slovene 82 67 63 48 61 39 29 25 87 68 60 38 99 93 99 81
Somali 92 85 69 68 82 64 38 38 96 90 70 66 100 100 100 99
Sotho 86 72 49 - 67 43 15 - 90 75 33 - 100 97 98 -
Spanish 70 56 48 48 44 26 16 19 69 49 32 33 97 94 96 93
Swahili 81 70 57 - 60 43 25 - 84 68 49 - 98 97 98 -
Swedish 84 72 61 49 64 46 30 24 88 76 56 40 99 94 96 83
Tagalog 78 66 - 52 52 36 - 23 83 67 - 43 98 96 - 90
Tamil 100 100 100 100 100 100 100 100 100 100 100 100 100 100 99 100
Telugu 100 100 99 100 100 100 99 100 100 100 100 100 100 100 99 100
Thai 100 100 99 100 100 100 100 100 100 100 100 100 100 100 98 99
Tsonga 84 72 - - 66 46 - - 89 73 - - 98 97 - -
Tswana 84 71 - - 65 44 - - 88 73 - - 99 96 - -
Turkish 94 87 69 54 84 71 41 26 98 91 70 44 100 100 97 92
Ukrainian 92 86 81 72 84 75 62 53 97 92 83 71 95 93 98 93
Urdu 91 80 61 57 80 65 39 31 94 78 53 46 98 96 92 94
Vietnamese 91 87 66 73 79 76 26 36 94 87 74 85 99 98 99 97
Welsh 91 82 69 - 78 61 43 - 96 87 66 - 99 99 98 -
Xhosa 82 69 66 - 64 45 40 - 85 67 65 - 98 94 92 -
Yoruba 74 62 15 22 50 33 5 11 77 61 11 14 96 92 28 41
Zulu 81 70 63 70 62 45 35 44 83 72 63 68 97 94 92 98
Mean 86 77 69 67 74 61 48 48 89 78 67 63 96 93 93 91
Median 89.0 80.0 68.0 62.0 74.0 57.0 41.0 38.0 94.0 81.0 66.0 57.0 99.0 97.0 98.0 94.0
Standard Deviation 13.08 17.29 19.04 20.22 18.41 24.9 27.86 29.28 13.12 18.93 21.83 24.22 11.05 11.91 13.95 11.24
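
The summary rows at the bottom of the table can be reproduced from a column of per-language accuracy values along the following lines. This is only a sketch: the input values are made up for illustration, the function names are hypothetical, and the population form of the standard deviation is an assumption, since the README does not say which form the accuracy reporter uses.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// median sorts a copy of xs and returns the middle value
// (or the average of the two middle values for an even count).
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

// stdDev returns the population standard deviation (an assumption,
// see the note above), or 0 for an empty slice.
func stdDev(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	m := mean(xs)
	sumSq := 0.0
	for _, x := range xs {
		sumSq += (x - m) * (x - m)
	}
	return math.Sqrt(sumSq / float64(len(xs)))
}

func main() {
	// Illustrative values only, not taken from the table.
	accuracies := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	fmt.Printf("mean=%.2f median=%.2f stddev=%.2f\n",
		mean(accuracies), median(accuracies), stdDev(accuracies))
	// Prints: mean=5.00 median=4.50 stddev=2.00
}
```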

5. Why is it better than other libraries?

Every language detector uses a probabilistic n-gram model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available, and probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.
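
To make the effect of the n-gram size concrete, here is a minimal sketch that splits a single word into character n-grams of sizes 1 to 5. The helper charNgrams is a hypothetical name and not part of Lingua's API; the sketch only illustrates the counting argument, not how the library stores or scores its models.

```go
package main

import "fmt"

// charNgrams returns all contiguous character n-grams of length n in word.
// It works on runes rather than bytes so that non-ASCII letters stay intact.
func charNgrams(word string, n int) []string {
	runes := []rune(word)
	if n <= 0 || n > len(runes) {
		return nil
	}
	ngrams := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		ngrams = append(ngrams, string(runes[i:i+n]))
	}
	return ngrams
}

func main() {
	word := "größe" // five letters
	total := 0
	for n := 1; n <= 5; n++ {
		grams := charNgrams(word, n)
		total += len(grams)
		fmt.Printf("%d-grams (%d): %q\n", n, len(grams), grams)
	}
	fmt.Println("n-grams of sizes 1 to 5 in total:", total)
}
```

A five-letter word yields only three trigrams, but fifteen n-grams once sizes 1 to 5 are combined, which gives a statistical model considerably more evidence to work with.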

A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique to one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, is the probabilistic n-gram model taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.
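
The first step of such a rule-based filter, determining the script of the input, can be sketched with Go's standard unicode package. This is an illustration under assumptions, not Lingua's actual engine: the list of scripts is a small hypothetical subset and dominantScript is an invented name.

```go
package main

import (
	"fmt"
	"unicode"
)

// script pairs a display name with the matching Unicode range table.
type script struct {
	name  string
	table *unicode.RangeTable
}

// scripts is a small, hypothetical subset chosen for this example.
var scripts = []script{
	{"Latin", unicode.Latin},
	{"Cyrillic", unicode.Cyrillic},
	{"Greek", unicode.Greek},
	{"Arabic", unicode.Arabic},
	{"Han", unicode.Han},
}

// dominantScript counts the characters belonging to each known script and
// returns the name of the most frequent one, or "Unknown" if none matches.
func dominantScript(text string) string {
	counts := make([]int, len(scripts))
	for _, r := range text {
		for i, s := range scripts {
			if unicode.Is(s.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "Unknown", 0
	for i, c := range counts {
		if c > bestCount {
			best, bestCount = scripts[i].name, c
		}
	}
	return best
}

func main() {
	for _, text := range []string{"Привет, мир", "languages are awesome", "Γειά σου", "123 !?"} {
		fmt.Printf("%q -> %s\n", text, dominantScript(text))
	}
}
```

Once the script is known, every language that is not written in it can be dropped before any statistical model has to be loaded.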

In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages will never occur in an input text, do not let them take part in the classification process. The filtering mechanism of the rule-based engine is quite good, but filtering based on your own knowledge of the input text is always preferable.

6. Test report generation

If you want to reproduce the accuracy results above, you can generate the test reports yourself for all classifiers and languages by doing:

cd cmd
go run accuracy_reporter.go

For gocld3 to run successfully, you need to install the exact version 3.17.3 of Google's protocol buffers which is a bit unfortunate. For each detector and language, a test report file is then written into /accuracy-reports. As an example, here is the current output of the Lingua German report:

##### German #####

>>> Accuracy on average: 89.23%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 73.90%
Erroneously classified as Dutch: 2.30%, Danish: 2.10%, English: 2.00%, Latin: 1.90%, Bokmal: 1.60%, Basque: 1.20%, French: 1.20%, Italian: 1.20%, Esperanto: 1.10%, Swedish: 1.00%, Afrikaans: 0.80%, Tsonga: 0.70%, Nynorsk: 0.60%, Portuguese: 0.60%, Yoruba: 0.60%, Finnish: 0.50%, Sotho: 0.50%, Welsh: 0.50%, Estonian: 0.40%, Irish: 0.40%, Polish: 0.40%, Spanish: 0.40%, Swahili: 0.40%, Tswana: 0.40%, Bosnian: 0.30%, Icelandic: 0.30%, Tagalog: 0.30%, Albanian: 0.20%, Catalan: 0.20%, Croatian: 0.20%, Indonesian: 0.20%, Lithuanian: 0.20%, Maori: 0.20%, Romanian: 0.20%, Xhosa: 0.20%, Zulu: 0.20%, Latvian: 0.10%, Malay: 0.10%, Slovak: 0.10%, Slovene: 0.10%, Somali: 0.10%, Turkish: 0.10%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 94.10%
Erroneously classified as Dutch: 0.90%, Latin: 0.80%, English: 0.70%, Swedish: 0.60%, Danish: 0.50%, French: 0.40%, Bokmal: 0.30%, Irish: 0.20%, Tagalog: 0.20%, Afrikaans: 0.10%, Esperanto: 0.10%, Estonian: 0.10%, Finnish: 0.10%, Italian: 0.10%, Maori: 0.10%, Nynorsk: 0.10%, Somali: 0.10%, Swahili: 0.10%, Tsonga: 0.10%, Turkish: 0.10%, Welsh: 0.10%, Zulu: 0.10%

>> Detection of 1000 sentences (average length: 111 chars)
Accuracy: 99.70%
Erroneously classified as Dutch: 0.20%, Latin: 0.10%

7. How to add it to your project?

go get github.com/pemistahl/lingua-go

8. How to build?

Lingua requires at least Go version 1.18.

git clone https://github.com/pemistahl/lingua-go.git
cd lingua-go
go build

The source code is accompanied by an extensive unit test suite. To run the tests, simply say:

go test

9. How to use?

9.1 Basic usage

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    if language, exists := detector.DetectLanguageOf("languages are awesome"); exists {
        fmt.Println(language)
    }

    // Output: English
}

9.2 Minimum relative distance

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word prologue, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated in the following way:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        WithMinimumRelativeDistance(0.9).
        Build()

    language, exists := detector.DetectLanguageOf("languages are awesome")

    fmt.Println(language)
    fmt.Println(exists)

    // Output:
    // Unknown
    // false
}

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise Unknown will be returned most of the time as in the example above. This is the return value for cases where language detection is not reliably possible. This value is not meant to be included in the set of input languages when building the language detector. If you include it, it will be automatically removed from the set of input languages.
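
The README does not spell out the exact rule behind the threshold. The following sketch assumes that the minimum relative distance is compared against the gap between the two highest scores, here taken to be normalized values between 0.0 and 1.0. That is an assumption made for illustration only; the type scored and the function pick are hypothetical and not part of Lingua's API.

```go
package main

import (
	"fmt"
	"sort"
)

// scored pairs a language name with a score between 0.0 and 1.0.
type scored struct {
	language string
	value    float64
}

// pick returns the best-scoring language, or "Unknown" and false when the
// gap between the best and the second-best score is below minDistance.
func pick(candidates []scored, minDistance float64) (string, bool) {
	if len(candidates) == 0 {
		return "Unknown", false
	}
	// Sort a copy in descending order of score.
	sorted := append([]scored(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].value > sorted[j].value })
	if len(sorted) == 1 {
		return sorted[0].language, true
	}
	if sorted[0].value-sorted[1].value < minDistance {
		return "Unknown", false
	}
	return sorted[0].language, true
}

func main() {
	// Hypothetical scores for a short input text.
	candidates := []scored{
		{"English", 0.93}, {"French", 0.04}, {"German", 0.02}, {"Spanish", 0.01},
	}
	fmt.Println(pick(candidates, 0.9)) // gap of 0.89 is below 0.9
	fmt.Println(pick(candidates, 0.5)) // gap of 0.89 is above 0.5
}
```

Under this assumption, a threshold of 0.9 rejects even a fairly clear winner, which is why high values are a poor fit for short input.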

9.3 Confidence values

Knowing about the most likely language is nice but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? These questions can be answered as well:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 0.93
    // French: 0.04
    // German: 0.02
    // Spanish: 0.01
}

In the example above, a slice of ConfidenceValue is returned containing all possible languages sorted by their confidence value in descending order. Each value is a probability between 0.0 and 1.0. The probabilities of all languages will sum to 1.0. If the language is unambiguously identified by the rule engine, the value 1.0 will always be returned for this language. The other languages will receive a value of 0.0.
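
One common way to turn summed log-probabilities into values that lie between 0.0 and 1.0 and sum to 1.0 is a softmax-style normalization. The sketch below shows that technique with made-up numbers. Whether Lingua computes its confidence values in exactly this way is an assumption, and normalize is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// normalize converts summed log-probabilities into values that sum to 1.0.
// Subtracting the maximum first keeps math.Exp from underflowing to zero
// for every language when the log-probabilities are very negative.
func normalize(logProbs map[string]float64) map[string]float64 {
	maxLogProb := math.Inf(-1)
	for _, lp := range logProbs {
		if lp > maxLogProb {
			maxLogProb = lp
		}
	}
	sum := 0.0
	for _, lp := range logProbs {
		sum += math.Exp(lp - maxLogProb)
	}
	result := make(map[string]float64, len(logProbs))
	for language, lp := range logProbs {
		result[language] = math.Exp(lp-maxLogProb) / sum
	}
	return result
}

func main() {
	// Made-up summed log-probabilities, for illustration only.
	logProbs := map[string]float64{
		"English": -120.0,
		"French":  -123.1,
		"German":  -123.8,
		"Spanish": -124.5,
	}
	values := normalize(logProbs)
	for _, language := range []string{"English", "French", "German", "Spanish"} {
		fmt.Printf("%s: %.2f\n", language, values[language])
	}
}
```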

There is also a method for returning the confidence value for one specific language only:

confidence := detector.ComputeLanguageConfidence("languages are awesome", lingua.French)
fmt.Printf("%.2f", confidence)

// Output:
// 0.04

The value that this method computes is a number between 0.0 and 1.0. If the language is unambiguously identified by the rule engine, the value 1.0 will always be returned. If the given language is not supported by this detector instance, the value 0.0 will always be returned.

9.4 Eager loading versus lazy loading

By default, Lingua uses lazy-loading to load only those language models on demand which are considered relevant by the rule-based filter engine. For web services, for instance, it is rather beneficial to preload all language models into memory to avoid unexpected latency while waiting for the service response. If you want to enable the eager-loading mode, you can do it like this:

lingua.NewLanguageDetectorBuilder().
    FromAllLanguages().
    WithPreloadedLanguageModels().
    Build()

Multiple instances of LanguageDetector share the same language models in memory which are accessed asynchronously by the instances.
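
The general idea of loading a shared model at most once, on first use, can be sketched with sync.Once from the standard library. This illustrates lazy loading as a technique only; it is not Lingua's implementation, and the types model and lazyModel are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// model is a hypothetical stand-in for a loaded language model.
type model struct {
	language string
}

// lazyModel loads its model at most once, on first use.
type lazyModel struct {
	once  sync.Once
	value *model
	loads int // counts how often the expensive load actually ran
}

// get returns the shared model and loads it first if necessary.
// sync.Once makes this safe to call from many goroutines.
func (l *lazyModel) get(language string) *model {
	l.once.Do(func() {
		l.loads++
		l.value = &model{language: language}
	})
	return l.value
}

func main() {
	var german lazyModel
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			german.get("German")
		}()
	}
	wg.Wait()
	fmt.Println("loads:", german.loads) // prints "loads: 1"
}
```

Eager loading corresponds to calling get for every language up front, so that the cost is paid at start-up rather than during the first request.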

9.5 Low accuracy mode versus high accuracy mode

Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors. The large language models also consume significant amounts of memory. These requirements might not be feasible for systems running low on resources. If you want to classify mostly long texts or need to save resources, you can enable a low accuracy mode that loads only a small subset of the language models into memory:

lingua.NewLanguageDetectorBuilder().
    FromAllLanguages().
    WithLowAccuracyMode().
    Build()

The downside of this approach is that detection accuracy for short texts consisting of fewer than 120 characters will drop significantly. However, detection accuracy for texts which are longer than 120 characters will remain mostly unaffected.

In high accuracy mode (the default), the language detector consumes approximately 1,800 MB of memory if all language models are loaded. In low accuracy mode, memory consumption is reduced to approximately 110 MB. The goal is to further reduce memory consumption in later releases.

An alternative for a smaller memory footprint and faster performance is to reduce the set of languages when building the language detector. In most cases, it is not advisable to build the detector from all supported languages. When you have knowledge about the texts you want to classify you can almost always rule out certain languages as impossible or unlikely to occur.

9.6 Detection of multiple languages in mixed-language texts

In contrast to most other language detectors, Lingua is able to detect multiple languages in mixed-language texts. This feature can yield quite reasonable results but it is still in an experimental state and therefore the detection result is highly dependent on the input text. It works best in high-accuracy mode with multiple long words for each language. The shorter the phrases and their words are, the less accurate are the results. Reducing the set of languages when building the language detector can also improve accuracy for this task if the languages occurring in the text are equal to the languages supported by the respective language detector instance.

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    sentence := "Parlez-vous français? " + 
        "Ich spreche Französisch nur ein bisschen. " +
        "A little bit is better than nothing."

    for _, result := range detector.DetectMultipleLanguagesOf(sentence) {
        fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])
    }

    // Output:
    // French: 'Parlez-vous français? '
    // German: 'Ich spreche Französisch nur ein bisschen. '
    // English: 'A little bit is better than nothing.'
}

In the example above, a slice of DetectionResult is returned. Each entry in the slice describes a contiguous single-language text section, providing start and end indices of the respective substring.

9.7 Methods to build the LanguageDetector

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance. The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages:

// Include all languages available in the library.
lingua.NewLanguageDetectorBuilder().FromAllLanguages()

// Include only languages that are not yet extinct (= currently excludes Latin).
lingua.NewLanguageDetectorBuilder().FromAllSpokenLanguages()

// Include only languages written with Cyrillic script.
lingua.NewLanguageDetectorBuilder().FromAllLanguagesWithCyrillicScript()

// Exclude only the Spanish language from the decision algorithm.
lingua.NewLanguageDetectorBuilder().FromAllLanguagesWithout(lingua.Spanish)

// Only decide between English and German.
lingua.NewLanguageDetectorBuilder().FromLanguages(lingua.English, lingua.German)

// Select languages by ISO 639-1 code.
lingua.NewLanguageDetectorBuilder().FromIsoCodes639_1(lingua.EN, lingua.DE)

// Select languages by ISO 639-3 code.
lingua.NewLanguageDetectorBuilder().FromIsoCodes639_3(lingua.ENG, lingua.DEU)

10. What's next for version 1.5.0?

Take a look at the planned issues.

11. Contributions

Any contributions to Lingua are very much appreciated. Please read the instructions in CONTRIBUTING.md for how to add new languages to the library.

lingua-go's Issues

panic: runtime error: slice bounds out of range [:10] with length 9

When I run the code from this example:
https://github.com/pemistahl/lingua-go#96-detection-of-multiple-languages-in-mixed-language-texts

go run . test.txt

I got this error, if the text has only one word:

English 0 10 :
panic: runtime error: slice bounds out of range [:10] with length 9

goroutine 1 [running]:
main.main()
	/home/rom/w/kube/apps/tts/split/split-text.go:49 +0x3ce
exit status 2

How to reproduce:
cat test.txt

testword

cat ./split-text.go

package main

import (
  "fmt"
  "github.com/pemistahl/lingua-go"
  "os"
)

func getFileContent(filename string) string {
	testData, err := os.ReadFile(filename)
	if err != nil {
		panic(err.Error())
	}
	return string(testData)
}

func main() {
  if len(os.Args) < 2 {
    fmt.Println("Missing parameter, provide file name!")
    return
  }
  filename := os.Args[1]

  languages := []lingua.Language{
    lingua.English,
    lingua.Finnish,
  }

  detector := lingua.NewLanguageDetectorBuilder().
    FromLanguages(languages...).
    Build()

  sentence := getFileContent(filename)
  for _, result := range detector.DetectMultipleLanguagesOf(sentence) {
      fmt.Printf("%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())
      fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])
  }
}
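
Until the index calculation itself is corrected, a caller can guard the slice expression defensively. The helper below is a hypothetical workaround sketch, not part of the library: it clamps the reported indices to the bounds of the string and does not address why the end index exceeds the text length in the first place.

```go
package main

import "fmt"

// safeSlice returns s[start:end] after clamping both indices to valid bounds.
// It returns an empty string when the clamped range is empty or inverted.
func safeSlice(s string, start, end int) string {
	if start < 0 {
		start = 0
	}
	if end > len(s) {
		end = len(s)
	}
	if start > end {
		return ""
	}
	return s[start:end]
}

func main() {
	sentence := "testword\n" // 9 bytes, like the one-word file in the report
	// The reported indices were 0 and 10, which is one past the end.
	fmt.Printf("%q\n", safeSlice(sentence, 0, 10))
}
```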

`lingua.Unknown` is not handled appropriately if included in the set of input languages

I have started testing the library a few days ago and just saw a first nil pointer panic like this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4314fd8]

goroutine 22414 [running]:
github.com/pemistahl/lingua-go.loadJson(0xa834340, 0x45537a0)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/json.go:37 +0x178
github.com/pemistahl/lingua-go.languageDetector.loadLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:612 +0x8d
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:552 +0x146
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:526 +0x145
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:484 +0xcb
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:452 +0xb7
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:176 +0x455

Seems as if even reading from an embed File can fail at some point.

I'm using go version go1.17.2 darwin/amd64.

How to get back an ISO 639-1 code

How do I get back the ISO 639-1 code for a language? My use case is a text that is turned to HTML and I want to set the lang attribute, e.g. lang="en" or lang="de".

Sadly, I'm new to Go, so this is what I have in a test and I always get back EU for English.

func TestLanguage(t *testing.T) {
	language := lingua.English
	if lingua.IsoCode639_1(language) != lingua.EN {
		t.Logf("Language: %s, ISO 639-1: %s", language.String(), lingua.IsoCode639_1(language).String())
		t.Fail()
	}
}

Result:

    page_test.go:84: Language: English, ISO 639-1: EU

I guess I'm confused about the IsoCode639_1 func and type?
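
A likely explanation, assuming both Language and IsoCode639_1 are integer-based types: in Go, lingua.IsoCode639_1(language) is a type conversion, not a function call. It reinterprets the numeric value of the language constant as an ISO code constant, and the two enumerations do not line up. Another snippet on this page uses the method form language.IsoCode639_1(), which performs a real lookup. The sketch below reproduces the pitfall with small hypothetical stand-in types rather than the library's own.

```go
package main

import "fmt"

// Language and IsoCode are hypothetical stand-ins for two unrelated
// integer-based enumerations, each sorted alphabetically.
type Language int
type IsoCode int

const (
	Afrikaans Language = iota // 0
	Basque                    // 1
	English                   // 2
)

const (
	AF IsoCode = iota // 0
	EN                // 1
	EU                // 2
)

var isoNames = map[IsoCode]string{AF: "AF", EN: "EN", EU: "EU"}

// String makes IsoCode values print as their two-letter code.
func (c IsoCode) String() string { return isoNames[c] }

var isoCodeOf = map[Language]IsoCode{Afrikaans: AF, Basque: EU, English: EN}

// IsoCode looks up the code that actually belongs to the language.
func (l Language) IsoCode() IsoCode { return isoCodeOf[l] }

func main() {
	// Type conversion: English has the value 2, and IsoCode(2) is EU.
	fmt.Println(IsoCode(English)) // prints "EU"

	// Method call: a real lookup.
	fmt.Println(English.IsoCode()) // prints "EN"
}
```

For the lang attribute, the code would then still have to be lower-cased, for example with strings.ToLower on its string form.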

Add possibility to select language by ISO string as part of this library

I would really appreciate the possibility to select a language by ISO string as part of this library. I plan to load some configuration including an ISO string from JSON, and keeping the mapping myself is kind of a pain. Something along these lines would be great:

var stringToIsoCode639_1 = map[string]IsoCode639_1{
	"AF": AF,
	// ...
}

func GetLanguageFromStringIsoCode639_1(code string) Language {
	for _, language := range AllLanguages() {
		if language.IsoCode639_1() == stringToIsoCode639_1[code] {
			return language
		}
	}
	return -1
}

Also, I noticed that this function is not exactly optimal. It has linear complexity with regard to the number of languages. That is probably not noticeable because the number of languages is relatively small, but it can still be optimized with a lookup map:

var IsoCode639_1ToLanguage = map[IsoCode639_1]Language{
	AF: Afrikaans,
	// ...
}

func GetLanguageFromIsoCode639_1(isoCode IsoCode639_1) Language {
	if val, ok := IsoCode639_1ToLanguage[isoCode]; ok {
		return val
	}
	return -1
}

How do you generate the ngram probabilities?

Hi, how do you actually generate the ngram probabilities?

I suspect that there might be something rotten in the state of Denmark, because I just tried to load the data for English trigrams and received these:

...
afd:0.0006082185532001166 afe:0.1033211267248698 aff:0.2505226878191564 afg:0.018677378071186912 afh:0.00011404097872502186 afi:0.008679785602959997 afj:6.335609929167881e-05 afk:0.00022808195745004371 afl:0.00424485865254248 afm:0.0002914380567417225 afn:0.000532191234050102 afo:0.010098962227093602
...
aôn:1 aõs:0.8 aúc:0.1 aúj:0.1 aúl:0.8 aül:1 aýt:1 aÿe:1 aća:1 ača:0.5 ači:0.5 ağl:0.4 ağr:0.2 ała:0.14285714285714285 ałb:0.14285714285714285 ało:0.14285714285714285 ały:0.14285714285714285 ałę:0.2857142857142857 ańs:0.6363636363636364 aşa:0.16666666666666666 aşg:0.16666666666666666 aşı:0.16666666666666666 aši:0.4 ašk:0.2 ašm:0.2 ašp:0.2 aţi:1 aźm:1 aża:1 aži:0.6666666666666666 ažs:0.3333333333333333 ași:1 aʔi:1 affo:1
...

And it certainly does not seem right.

`go get` error: `unknown revision serialization/v1.2.0`

I hope this is related to the library and not something with my Go installation.

When I run: go get github.com/pemistahl/lingua-go
I get the following error:

go: downloading github.com/pemistahl/lingua-go/serialization v1.2.0
github.com/pemistahl/lingua-go imports
        github.com/pemistahl/lingua-go/serialization: reading github.com/pemistahl/lingua-go/serialization/go.mod at revision serialization/v1.2.0: unknown revision serialization/v1.2.0

Support absolute language confidence metric

Hi,
In my scenario, the goal is to detect whether the input text is in English or in some other language. I'm not sure how to use the library for this. For instance, if the input text is in a specific language such as Vietnamese, I expect it to be detected as non-English.

	languages := []lingua.Language{
		lingua.English,
		lingua.Vietnamese,
		lingua.Unknown,
	}

	sentence := "Thông tin tài khoản của bạn"

	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(languages...).
		WithMinimumRelativeDistance(0.9).
		Build()

	confidenceValues := detector.ComputeLanguageConfidenceValues(sentence)

	for _, elem := range confidenceValues {
		fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
	}

output:

Vietnamese: 1.00
English: 0.00

When I remove lingua.Vietnamese from the list of languages, the program outputs English: 1.00. I would like the result to be some "other language" type rather than English.
Please help me with how to do this.
Thanks in advance.

Detect multiple languages in mixed-language text

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections of multiple languages, it will be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]

Add absolute confidence metric

I do see that the README has this example:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 1.00
    // French: 0.79
    // German: 0.75
    // Spanish: 0.72
}

But if I do detector.ComputeLanguageConfidenceValues("yo bebo ein large quantity of tasty leche"), English is still going to result in 1.0. How do I get something like a certainty / probability that the text is English? Because 1.0 doesn't seem so helpful in that case. It might just be my lack of math experience, I'm assuming this is possible with the values above in the example, but I don't exactly see how.

I tested with a long Italian text but the output is "17", which is English. How to make it work correctly?

I tested an adaptation of basic.go for calling the Go function from C code:

package main

import "C"

import (
        "fmt"
        "github.com/pemistahl/lingua-go"
)

var lan lingua.Language

//var lan int

//export Langdetectfunct
func Langdetectfunct(text *C.char) int {

    textS := C.GoString(text);

    detector := lingua.NewLanguageDetectorBuilder().
        FromAllLanguages().
        Build()

    if language, exists := detector.DetectLanguageOf(textS); exists {
        lan = language
    }
    lan = lingua.English

    return int(lan)

}

func main() {
    // https://github.com/pemistahl/lingua-go/blob/main/language.go 

    testo := "Il liceo classico, noto in passato anche come ginnasio, è una scuola secondaria di secondo grado quinquennale a ciclo unico del sistema scolastico italiano incentrata sugli studi umanistici. Fu istituito come scuola d'élite con la riforma Gentile nel 1923, traendo origini dal ginnasio-liceo>

    ctesto := C.CString(testo);

    res := Langdetectfunct(ctesto);
    fmt.Println(res)
}

Although the Italian text is long, it outputs "17", which according to the language codes listed at https://github.com/pemistahl/lingua-go/blob/main/language.go is "English":

raphy@raohy:~/go-lang-detect$ go run basic.go
17

Why? Did I make any mistake? How to make it work correctly?
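One likely cause is visible in the snippet itself: `lan = lingua.English` sits after the `if` block, so it runs unconditionally and overwrites whatever was detected. The sketch below reproduces that control flow with a stub in place of the real detector and shows a variant that falls back only when detection fails. `detect`, `Language` and the constants are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the library's language type and values.
type Language int

const (
	Unknown Language = iota
	English
	Italian
)

// detect is a stub standing in for a real detector call.
func detect(text string) (Language, bool) {
	if strings.Contains(text, "liceo") {
		return Italian, true
	}
	return Unknown, false
}

// buggy mirrors the snippet above: the assignment after the if always runs.
func buggy(text string) Language {
	var lan Language
	if language, exists := detect(text); exists {
		lan = language
	}
	lan = English
	return lan
}

// fixed returns the detected language and falls back only on failure.
func fixed(text string) Language {
	if language, exists := detect(text); exists {
		return language
	}
	return Unknown
}

func main() {
	text := "Il liceo classico"
	fmt.Println(buggy(text) == English, fixed(text) == Italian)
}
```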

Compile-time language inclusion

If the generated data were split per language, the bundled language data could be stripped down immensely.

Tag handling

Each file can contain a //go:build directive that controls the inclusion of languages.

By default, all generated files can include //go:build !lingua_ignore, which means "unless built with -tags lingua_ignore, include this file". That is the same behaviour as now.

Then, a file with the build constraint //go:build (!lingua_ignore && !lingua_no<language>) || lingua_<language> is built when either -tags lingua_<language> is specified, or neither lingua_ignore nor lingua_no<language> is specified.

Thus, if you want all languages to be included, you simply do nothing, and when you want to reduce the language set to the minimum, you use build tags like -tags lingua_ignore,lingua_en,lingua_es,etc.

If you want to exclude only a few languages, you add -tags lingua_noge without adding lingua_ignore.

Model loading

For now, models are loaded from a single point in detector.go through embed.FS.
Instead, each language-model/<language> directory could contain a .go file that has the aforementioned build constraints.

This file can also load all *.zip files into a separate embed.FS value, which can then be passed to the "main" filesystem in the language-model package.

The language-model package can then implement the fs.SubFS interface.
It could be as simple as a generated file with a switch/case over all available languages that includes all language-model/* packages.
Or, if you don't want to use code generation, it should be simple enough to add a Register method that the init function of each language-model/<language>/ package can call. It won't be called if the language package is ignored.

Language detection is sometimes non-deterministic

To reproduce the issue:

package main

import (
	"github.com/pemistahl/lingua-go"
	"log"
)

func main() {
	detectorAll := lingua.NewLanguageDetectorBuilder().FromAllLanguages().WithPreloadedLanguageModels().Build()
	for i := 0; i < 1000; i++ {
		lang, _ := detectorAll.DetectLanguageOf("Az elmúlt hétvégén 12-re emelkedett az elhunyt koronavírus-fertőzöttek száma Szlovákiában. Mindegyik szociális otthon dolgozóját letesztelik, Matovič szerint az ingázóknak még várniuk kellene a teszteléssel")
		log.Println(lang.IsoCode639_1().String())
	}
}

Thank you for amazing work!

Detection of multiple languages: bytes, runes

Detection of multiple languages sometimes returns indices in bytes, but sometimes in runes (code points):
To reproduce:

package main

import (
  "fmt"
  "github.com/pemistahl/lingua-go"
)

func main() {
  sentence := ""
  fmt.Printf("--- this will return indices in bytes:")
  sentence = "Parlez çççç? I would like"
  split(sentence);

  fmt.Printf("\n\n")
  fmt.Printf("--- this will return indices in code points (runes):")
  sentence = "ççççfran"
  split(sentence);
}

func split(sentence string) {
  languages := []lingua.Language{
    lingua.English,
    lingua.French,
  }

  detector := lingua.NewLanguageDetectorBuilder().
    FromLanguages(languages...).
    // WithLowAccuracyMode().
    Build()
  detectionResults := detector.DetectMultipleLanguagesOf(sentence)

  fmt.Printf("\ninput str:\n%s\n", sentence)

  for i := 0; i < len(sentence); i++ {
    fmt.Printf("% x", sentence[i])
    // fmt.Printf("%q", sentence[i])
  }
  fmt.Printf("\n")

  for _, result := range detectionResults {
    fmt.Printf("\n%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())

    fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])

    fmt.Printf("%s: '%s'\n", result.Language(), string([]rune(sentence)[result.StartIndex():result.EndIndex()]))
  }
}

output:

--- this will return indices in bytes:
input str:
Parlez çççç? I would like
 50 61 72 6c 65 7a 20 c3 a7 c3 a7 c3 a7 c3 a7 3f 20 49 20 77 6f 75 6c 64 20 6c 69 6b 65

French 0 17 :
French: 'Parlez çççç? '
French: 'Parlez çççç? I wo'

English 17 29 :
English: 'I would like'
English: 'uld like'


--- this will return indices in code points (runes):
input str:
ççççfran
 c3 a7 c3 a7 c3 a7 c3 a7 66 72 61 6e

French 0 8 :
French: 'çççç'
French: 'ççççfran'
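Until the indices are consistent, callers that slice the original string need to know which unit they were given. A minimal sketch for converting a rune index into a byte offset with the standard library only. `byteOffset` is a hypothetical helper and not part of lingua-go.

```go
package main

import "fmt"

// byteOffset converts a rune (code point) index into the byte offset in s.
// It returns len(s) when runeIndex is at or past the end of the string.
func byteOffset(s string, runeIndex int) int {
	count := 0
	for offset := range s { // ranging over a string yields byte offsets of runes
		if count == runeIndex {
			return offset
		}
		count++
	}
	return len(s)
}

func main() {
	s := "ççççfran"
	start, end := byteOffset(s, 0), byteOffset(s, 4)
	fmt.Printf("%d %d %q\n", start, end, s[start:end])
}
```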

Find more memory-efficient data structure for language models

Currently, the language models are loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python.

One promising candidate could be Gonum.
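As one illustration of the kind of structure that could be investigated, n-grams can be kept in a sorted slice with a parallel slice of float32 values and looked up by binary search. This is an assumption for discussion, not what the library does, and whether it actually saves memory compared with a map would have to be measured. `ngramTable` and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ngramTable stores n-grams in sorted order with their values alongside.
type ngramTable struct {
	keys   []string  // sorted ascending
	values []float32 // values[i] belongs to keys[i]
}

// newNgramTable builds the table from a map once, at load time.
func newNgramTable(m map[string]float32) ngramTable {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	values := make([]float32, len(keys))
	for i, k := range keys {
		values[i] = m[k]
	}
	return ngramTable{keys: keys, values: values}
}

// lookup finds an n-gram by binary search in O(log n).
func (t ngramTable) lookup(ngram string) (float32, bool) {
	i := sort.SearchStrings(t.keys, ngram)
	if i < len(t.keys) && t.keys[i] == ngram {
		return t.values[i], true
	}
	return 0, false
}

func main() {
	table := newNgramTable(map[string]float32{"aff": 0.25, "afe": 0.1, "afo": 0.01})
	fmt.Println(table.lookup("aff"))
	fmt.Println(table.lookup("zzz"))
}
```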

panic: decimal division by 0

There is a panic in the latest 1.3.1 version.

panic: decimal division by 0

goroutine 41191 [running]:
github.com/shopspring/decimal.Decimal.QuoRem({0xc0248f08e0, 0xffff9e58}, {0xc058a1f080, 0xffff9e58}, 0x10)
    /home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:565 +0x2c5
github.com/shopspring/decimal.Decimal.DivRound({0xc0248f08e0?, 0x58a1e100?}, {0xc058a1f080?, 0x7f1272d8?}, 0x10)
    /home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:607 +0x56
github.com/shopspring/decimal.Decimal.Div(...)
    /home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:552
github.com/pemistahl/lingua-go.languageDetector.computeConfidenceValues({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
    /home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:615 +0x1af
github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
    /home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:351 +0x8b4
github.com/pemistahl/lingua-go.languageDetector.DetectLanguageOf({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
    /home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:147 +0x58

Library size optimization

Thanks for this very efficient library (it's the best I've tested so far).

Unfortunately, I struggle with its size, because whatever parameters I choose, it keeps adding around 120MiB to my app (which is 50MiB, assets included). Since I am using Kubernetes, the Docker image size matters.

I am only interested in checking a few languages, but it seems that whatever languages or options I choose, the whole package is still compiled. Maybe I am missing something.

If not, it would be nice to be able to provide the languages as imports (lingua.English, ..., lingua.Languages) in order to keep the binary small.

Reduce "bloat"

Hi,

Thanks for the excellent work first and foremost, but may I suggest keeping metadata (e.g. 61c7054) separately, outside this repository. You could create another repo, e.g. github.com/pemistahl/lingua-go-accuracy-reports or similar.

The comparisons are useful but also currently bloat the repository, plus they introduce quite a few extra dependencies, i.e. https://github.com/pemistahl/lingua-go/blob/main/go.sum.

What do you think?

Add low accuracy mode

Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors. The large language models also consume significant amounts of memory. These requirements might not be feasible for systems running low on resources.

For users who want to classify mostly long texts or need to save resources, a so-called low accuracy mode will be implemented that loads only a small subset of the language models into memory. The API will be as follows:

lingua.NewLanguageDetectorBuilder().FromAllLanguages().WithLowAccuracyMode().Build()

The downside of this approach is that detection accuracy for short texts consisting of fewer than 120 characters will drop significantly. However, detection accuracy for texts which are longer than 120 characters will remain mostly unaffected.

Go type not supported in export: lingua.Language

I would like to call lingua-go functions from C++ code.

I tried to generate the .h file and the .so file for this code (following the instructions found here: https://github.com/vladimirvivien/go-cshared-examples).

basic.go :

package main

import "C"

import (
        "github.com/pemistahl/lingua-go"
)

var lan lingua.Language

//export Langdetectfunct
func Langdetectfunct(text string) lingua.Language {

    detector := lingua.NewLanguageDetectorBuilder().
        FromAllLanguages().
        Build()

    if language, exists := detector.DetectLanguageOf(text); exists {
        lan = language
    }
    lan = lingua.English

    return lan

}

func main() {}

Doing:

raphy@raohy:~/go-lang-detect$ go build -o basic.so -buildmode=c-shared basic.go

I get :

raphy@raohy:~/go-lang-detect$ go build -o basic.so -buildmode=c-shared basic.go
# command-line-arguments
./basic.go:14:35: Go type not supported in export: lingua.Language

Detection of multiple languages strange results

I ran a few tests with lingua as I am interested in the “Detection of multiple languages in mixed-language texts” feature.

I checked the following texts:

Hello, I told you the house is green. Hallo, ich habe dir gesagt, das Haus ist grün.

Hallo, ich sage, das Haus ist grün. Hello, I told you the house is green.

Lingua returned the following to me:

First text:

English: 'Hello, I told you the house is green. Hallo, '
German: 'ich habe dir gesagt, das Haus ist grün.'

Second text:

German: 'Hallo, ich sage das Haus ist grün. Hello, I '
English: 'told you the house is green.'

However, "Hello" is not a correct German word, and "Hallo" is not a correct English word.

Perhaps a parameter could be added to DetectMultipleLanguagesOf that ensures that punctuation marks are taken into account and that only one language is returned per sentence.
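A rough sketch of the pre-processing this suggestion implies: split the text at sentence-ending punctuation first, then run single-language detection on each sentence separately. `splitSentences` is a hypothetical helper with a deliberately naive rule (it splits after `.`, `!` or `?`), and the detection step itself is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences cuts text after each '.', '!' or '?' and trims whitespace.
func splitSentences(text string) []string {
	var sentences []string
	start := 0
	for i, r := range text {
		if r == '.' || r == '!' || r == '?' {
			end := i + 1 // these punctuation marks are one byte long in UTF-8
			if s := strings.TrimSpace(text[start:end]); s != "" {
				sentences = append(sentences, s)
			}
			start = end
		}
	}
	// Keep any trailing text that has no closing punctuation.
	if s := strings.TrimSpace(text[start:]); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	text := "Hello, I told you the house is green. Hallo, ich habe dir gesagt, das Haus ist grün."
	for _, s := range splitSentences(text) {
		fmt.Printf("%q\n", s)
	}
}
```

A rule this simple will mis-split abbreviations and decimal numbers, so it only illustrates the idea.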

Strange matching for Spanish phrase detected as Finnish

Hey! I've been messing with this library, and most of it seems great! There is one issue I've run into: a Spanish phrase is detected as Finnish with a confidence level of 1. I'm unsure if this is intended.

Phrase: ¿les gustan los pokemon?

With the following code:

package main

import (
	"log"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.
		NewLanguageDetectorBuilder().
		FromAllSpokenLanguages().
		WithPreloadedLanguageModels().
		Build()

	content := "¿les gustan los pokemon?"
	lang, reliable := detector.DetectLanguageOf(content)
	log.Println(lang.String(), reliable)

	log.Println(" --- ")

	confidences := detector.ComputeLanguageConfidenceValues(content)
	for _, langConf := range confidences {
		log.Println(langConf.Language().String(), langConf.Value())
	}
}

The following output is produced:

2022/04/20 00:22:43 Finnish true
2022/04/20 00:22:43  --- 
2022/04/20 00:22:43 Finnish 1
2022/04/20 00:22:43 English 0.9883978684270469
2022/04/20 00:22:43 Indonesian 0.978563900119626
2022/04/20 00:22:43 Spanish 0.9747851212151981
2022/04/20 00:22:43 Croatian 0.9724182360849759
2022/04/20 00:22:43 Lithuanian 0.9647225277871057
2022/04/20 00:22:43 Estonian 0.9641581778214242
2022/04/20 00:22:43 Esperanto 0.9606587809451471
2022/04/20 00:22:43 Polish 0.9594230676987932
2022/04/20 00:22:43 Slovene 0.9546050214213473
2022/04/20 00:22:43 Malay 0.9541465232681227
2022/04/20 00:22:43 Albanian 0.9524198444722406
2022/04/20 00:22:43 Italian 0.9486618781887298
2022/04/20 00:22:43 Catalan 0.946963416607054
2022/04/20 00:22:43 Danish 0.9403916449998727
2022/04/20 00:22:43 Bosnian 0.9269675882527444
2022/04/20 00:22:43 Portuguese 0.9261989417434195
2022/04/20 00:22:43 German 0.919921338933763
2022/04/20 00:22:43 Sotho 0.9152876229202939
2022/04/20 00:22:43 Dutch 0.9145928120132025
2022/04/20 00:22:43 French 0.9140644855054184
2022/04/20 00:22:43 Slovak 0.9125324543349711
2022/04/20 00:22:43 Latvian 0.9119548274103094
2022/04/20 00:22:43 Tswana 0.9030296447404719
2022/04/20 00:22:43 Romanian 0.8980252449808623
2022/04/20 00:22:43 Nynorsk 0.8962667914904449
2022/04/20 00:22:43 Tagalog 0.8961041054613276
2022/04/20 00:22:43 Swedish 0.8861739698250194
2022/04/20 00:22:43 Hungarian 0.8860583424196719
2022/04/20 00:22:43 Bokmal 0.8860501842325473
2022/04/20 00:22:43 Swahili 0.8855438630695021
2022/04/20 00:22:43 Czech 0.877987508198549
2022/04/20 00:22:43 Welsh 0.8706583132077192
2022/04/20 00:22:43 Turkish 0.8635506224236865
2022/04/20 00:22:43 Yoruba 0.8618678522282041
2022/04/20 00:22:43 Basque 0.8587542505212317
2022/04/20 00:22:43 Afrikaans 0.8435800177987139
2022/04/20 00:22:43 Maori 0.8429171795365868
2022/04/20 00:22:43 Ganda 0.8407646218672701
2022/04/20 00:22:43 Icelandic 0.8248853640378799
2022/04/20 00:22:43 Tsonga 0.8245248538291974
2022/04/20 00:22:43 Irish 0.817982923494266
2022/04/20 00:22:43 Zulu 0.8175325635441859
2022/04/20 00:22:43 Shona 0.8008811823165958
2022/04/20 00:22:43 Xhosa 0.7829601259301775
2022/04/20 00:22:43 Vietnamese 0.774240344355879
2022/04/20 00:22:43 Azerbaijani 0.7541427903961347
2022/04/20 00:22:43 Somali 0.7538078988192347

I'm not sure why it ranked Spanish as 4th. Is there a good method to get around this? Unfortunately given my use case I need to detect from a wide range of languages like this.

This library is overall awesome, I'm using the latest stable release, thank you for this!

Strange results for Chinese with Japanese

To reproduce:

package main

import (
	"github.com/pemistahl/lingua-go"
	"fmt"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	text := "上海大学是一个好大学. わー!"
	if language, exists := detector.DetectLanguageOf(text); exists {
		fmt.Println(language.String()) // Japanese
	}
}

Expected:
Get Chinese for this case.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

This happens because the code at that line returns Japanese if any character from japaneseCharacterSet exists in the text. I'm unsure if this is intended.
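One alternative heuristic, sketched here as an assumption and not as the library's behaviour, is to look at the share of kana among all Han and kana characters instead of reacting to a single kana character. The threshold of 0.2 is arbitrary and `kanaShare` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode"
)

// kanaShare returns the fraction of Han and kana characters in text that are
// Hiragana or Katakana. It returns 0 when the text contains neither.
func kanaShare(text string) float64 {
	han, kana := 0, 0
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Hiragana, r), unicode.Is(unicode.Katakana, r):
			kana++
		case unicode.Is(unicode.Han, r):
			han++
		}
	}
	if han+kana == 0 {
		return 0
	}
	return float64(kana) / float64(han+kana)
}

func main() {
	const threshold = 0.2 // arbitrary cut-off chosen for this sketch
	for _, text := range []string{"上海大学是一个好大学. わー!", "これはペンです"} {
		share := kanaShare(text)
		fmt.Printf("%q kana share %.2f, mostly kana: %v\n", text, share, share >= threshold)
	}
}
```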

Thanks for awesome work!

Built to wasm, runs slowly

I built the simple example to detect the word 'app', and it works. But when I build it to wasm, it takes about 6 seconds to finish. Is there any way to solve this problem?

Panics at loadJson

Code to reproduce:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 1.00
    // French: 0.79
    // German: 0.75
    // Spanish: 0.72
}

go.mod

module lingua

go 1.16

require github.com/pemistahl/lingua-go v1.0.0

go env:

❯ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitriysmotrov/Library/Caches/go-build"
GOENV="/Users/dmitriysmotrov/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/dmitriysmotrov/.gvm/pkgsets/go1.16.5/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dmitriysmotrov/.gvm/pkgsets/go1.16.5/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/dmitriysmotrov/.gvm/gos/go1.16.5"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitriysmotrov/.gvm/gos/go1.16.5/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.5"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/dmitriysmotrov/space/dsxack/lingua/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/z5/8ts06jv92yjc5sp5mdsdzr2h0000gn/T/go-build2817996487=/tmp/go-build -gno-record-gcc-switches -fno-common"

Expect: no panics

Actual:

panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10e9c82]

goroutine 22 [running]:
archive/zip.(*ReadCloser).Close(0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/archive/zip/reader.go:161 +0x22
panic(0x11841e0, 0x12c0160)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/runtime/panic.go:965 +0x1b9
github.com/pemistahl/lingua-go.loadJson(0x18, 0x5, 0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/json.go:32 +0x18e
github.com/pemistahl/lingua-go.loadFivegrams(...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:925
github.com/pemistahl/lingua-go.germanFivegramModel.func1.1()
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:368 +0x45
sync.(*Once).doSlow(0xc0000b0f30, 0xc000094b68)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:68 +0xec
sync.(*Once).Do(...)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:59
github.com/pemistahl/lingua-go.germanFivegramModel.func1(0x11831c0, 0xc00009afc0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:367 +0xbb
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:530 +0x1cb
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:516 +0xf7
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:474 +0xca
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:442 +0xca
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:170 +0x525
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10e9c82]

goroutine 18 [running]:
archive/zip.(*ReadCloser).Close(0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/archive/zip/reader.go:161 +0x22
panic(0x11841e0, 0x12c0160)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/runtime/panic.go:965 +0x1b9
github.com/pemistahl/lingua-go.loadJson(0x18, 0x1, 0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/json.go:32 +0x18e
github.com/pemistahl/lingua-go.loadUnigrams(...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:925
github.com/pemistahl/lingua-go.germanUnigramModel.func1.1()
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:368 +0x45
sync.(*Once).doSlow(0xc0000b1d40, 0xc000064b68)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:68 +0xec
sync.(*Once).Do(...)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:59
github.com/pemistahl/lingua-go.germanUnigramModel.func1(0x11831c0, 0xc00009b050)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:367 +0xbb
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:538 +0x128
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:516 +0xf7
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:474 +0xca
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:442 +0xca
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:170 +0x525
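
Both traces show `archive/zip.(*ReadCloser).Close` being called on a nil receiver while another panic is already in flight, which is the usual symptom of deferring `Close` before checking the error returned when opening the archive. Whether that is exactly what `loadJson` did in this version is an assumption. The sketch below only shows the safe ordering, using a hypothetical `readFirstEntry` helper.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
)

// readFirstEntry opens a zip archive and returns the contents of its first file.
func readFirstEntry(path string) ([]byte, error) {
	r, err := zip.OpenReader(path)
	if err != nil {
		// r is nil here, so Close must not be deferred before this check.
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	defer r.Close()

	if len(r.File) == 0 {
		return nil, fmt.Errorf("%s contains no files", path)
	}
	f, err := r.File[0].Open()
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return io.ReadAll(f)
}

func main() {
	// A missing archive now surfaces as an error instead of a nil dereference.
	_, err := readFirstEntry("does-not-exist.zip")
	fmt.Println(err != nil)
}
```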
