Compact Language Detector v3 (CLD3)

Model

CLD3 is a neural network model for language identification. This package contains the inference code and a trained model. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
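The fraction computation for a single ngram order can be sketched as follows (a minimal illustration; CLD3's real extractor additionally hashes each ngram down to a small id range):

```python
from collections import Counter

def ngram_fractions(text, n):
    """Fraction of n-gram positions occupied by each distinct n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

# "banana" yields 4 trigram positions: ban, ana, nan, ana
fracs = ngram_fractions("banana", 3)  # fracs["ana"] == 0.5, i.e. 2/4
```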

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer. The remaining components of the network are a hidden (Rectified linear) layer and a softmax layer.

To get a language prediction for the input text, we simply perform a forward pass through the network.
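The whole pipeline above can be sketched in a few lines of Python. This is a toy illustration only: the weights below are random placeholders (the real model's are trained), a single ngram order is used (the real model concatenates averaged embeddings for several orders), and CRC32 merely stands in for the actual hashing.

```python
import math
import random
import zlib

random.seed(0)
BUCKETS, DIM, CLASSES = 16, 4, 3  # toy sizes; the real model is larger

# Random placeholder parameters standing in for trained weights.
emb = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]
W_h = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_o = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(CLASSES)]

def forward(text, n=3):
    # 1. Extract n-grams and the fraction of positions each occupies.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    frac = {g: grams.count(g) / len(grams) for g in set(grams)}
    # 2. Hash each n-gram to a bucket and average embeddings by fraction.
    avg = [0.0] * DIM
    for g, f in frac.items():
        e = emb[zlib.crc32(g.encode()) % BUCKETS]
        avg = [a + f * x for a, x in zip(avg, e)]
    # 3. Hidden rectified-linear layer.
    hid = [max(0.0, sum(w * x for w, x in zip(row, avg))) for row in W_h]
    # 4. Softmax layer over the language classes.
    logits = [sum(w * x for w, x in zip(row, hid)) for row in W_o]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

probs = forward("banana")  # one probability per (toy) language class
```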

[Figure omitted]

Supported Languages

The model outputs BCP-47-style language codes, shown in the table below. For some languages, output is differentiated by script. Language and script names are from Unicode CLDR.

Output Code Language Name Script Name
af Afrikaans Latin
am Amharic Ethiopic
ar Arabic Arabic
bg Bulgarian Cyrillic
bg-Latn Bulgarian Latin
bn Bangla Bangla
bs Bosnian Latin
ca Catalan Latin
ceb Cebuano Latin
co Corsican Latin
cs Czech Latin
cy Welsh Latin
da Danish Latin
de German Latin
el Greek Greek
el-Latn Greek Latin
en English Latin
eo Esperanto Latin
es Spanish Latin
et Estonian Latin
eu Basque Latin
fa Persian Arabic
fi Finnish Latin
fil Filipino Latin
fr French Latin
fy Western Frisian Latin
ga Irish Latin
gd Scottish Gaelic Latin
gl Galician Latin
gu Gujarati Gujarati
ha Hausa Latin
haw Hawaiian Latin
hi Hindi Devanagari
hi-Latn Hindi Latin
hmn Hmong Latin
hr Croatian Latin
ht Haitian Creole Latin
hu Hungarian Latin
hy Armenian Armenian
id Indonesian Latin
ig Igbo Latin
is Icelandic Latin
it Italian Latin
iw Hebrew Hebrew
ja Japanese Japanese
ja-Latn Japanese Latin
jv Javanese Latin
ka Georgian Georgian
kk Kazakh Cyrillic
km Khmer Khmer
kn Kannada Kannada
ko Korean Korean
ku Kurdish Latin
ky Kyrgyz Cyrillic
la Latin Latin
lb Luxembourgish Latin
lo Lao Lao
lt Lithuanian Latin
lv Latvian Latin
mg Malagasy Latin
mi Maori Latin
mk Macedonian Cyrillic
ml Malayalam Malayalam
mn Mongolian Cyrillic
mr Marathi Devanagari
ms Malay Latin
mt Maltese Latin
my Burmese Myanmar
ne Nepali Devanagari
nl Dutch Latin
no Norwegian Latin
ny Nyanja Latin
pa Punjabi Gurmukhi
pl Polish Latin
ps Pashto Arabic
pt Portuguese Latin
ro Romanian Latin
ru Russian Cyrillic
ru-Latn Russian Latin
sd Sindhi Arabic
si Sinhala Sinhala
sk Slovak Latin
sl Slovenian Latin
sm Samoan Latin
sn Shona Latin
so Somali Latin
sq Albanian Latin
sr Serbian Cyrillic
st Southern Sotho Latin
su Sundanese Latin
sv Swedish Latin
sw Swahili Latin
ta Tamil Tamil
te Telugu Telugu
tg Tajik Cyrillic
th Thai Thai
tr Turkish Latin
uk Ukrainian Cyrillic
ur Urdu Arabic
uz Uzbek Latin
vi Vietnamese Latin
xh Xhosa Latin
yi Yiddish Hebrew
yo Yoruba Latin
zh Chinese Han (including Simplified and Traditional)
zh-Latn Chinese Latin
zu Zulu Latin

Installation

CLD3 is designed to run in the Chrome browser, so it relies on code in Chromium. The steps for building and running the demo of the language detection model are:

  • Check out the Chromium repository.
  • Copy the code to //third_party/cld_3.
  • Uncomment the language_identifier_main executable in src/BUILD.gn.
  • Build and run the model using the commands:
gn gen out/Default
ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
out/Default/language_identifier_main

Bugs and Feature Requests

Open a GitHub issue for this repository to file bugs and feature requests.

Announcements and Discussion

For announcements regarding major updates as well as general discussion, please subscribe to: [email protected]

Credits

Original authors of the code in this package include (in alphabetical order):

  • Alex Salcianu
  • Andy Golding
  • Anton Bakalov
  • Chris Alberti
  • Daniel Andor
  • David Weiss
  • Emily Pitler
  • Greg Coppola
  • Jason Riesa
  • Kuzman Ganchev
  • Michael Ringgaard
  • Nan Hua
  • Ryan McDonald
  • Slav Petrov
  • Stefan Istrate
  • Terry Koo

cld3's Issues

Easy examples yield funny results

Code to reproduce:

import gcld3
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=10000)
results = detector.FindTopNMostFreqLangs(text=sample_text, num_langs=2)
print(sample_text)
for result in results:
    print(result.language, result.is_reliable, result.probability, result.proportion)

Weird results:
tus ojos me hace sentir
lt True 0.786892831325531 1.0 # 🤖😬🤣
und False 0.0 0.0

sin red y voy a mil
af True 0.8103252649307251 1.0 # y is not in afrikaans
und False 0.0 0.0

yo te veo pero tu no ves
ja-Latn True 0.9469742178916931 1.0 # japanese, really? these are the most basic spanish words
und False 0.0 0.0

aunque no me veas, mirame
de True 0.9972571730613708 1.0 # no and me are very simple words that are not German
und False 0.0 0.0

esta al reves
eo True 0.7365820407867432 1.0 # in Esperanto there's no word ending with -es
und False 0.0 0.0

aunque no veas
de True 0.9875902533531189 1.0 # no and me are very simple words that are not German
und False 0.0 0.0

Non-ISO 639-* language codes

There are some languages that are not part of ISO 639-1; they are named correctly according to the next available inclusive standard (ISO 639-2 or ISO 639-3). However, there are two incorrect or incorrectly named languages:

  1. The Hebrew language code is usually 'he', not 'iw'.
  2. 'Filipino' is not 'fil' but 'Tagalog', 'tl'.

Fixing these languages should be as easy as writing a wrapper.
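A minimal sketch of such a wrapper, mapping the two codes mentioned above (iw → he, fil → tl) while leaving script suffixes like -Latn intact. The mapping table and function name here are illustrative, not part of CLD3:

```python
# Hypothetical wrapper normalizing CLD3's legacy codes to the usual ISO 639-1 ones.
LEGACY_TO_ISO = {"iw": "he", "fil": "tl"}

def normalize_code(cld3_code):
    """Map a CLD3 output code to its usual ISO 639-1 equivalent."""
    base, sep, script = cld3_code.partition("-")  # keep e.g. "-Latn" suffixes
    return LEGACY_TO_ISO.get(base, base) + sep + script
```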

request for documentation: how to add a new language

Apologies for overlooking this, but are there any instructions on how to add a new language? In my case, this is about multiple low-resource languages, some without ISO639-3 codes. I have training data at hand but would welcome a howto or a few pointers on where to start. Thanks a lot!

Import gcld3 fails

Hi! I'm trying to use gcld3 for language detection on MacOS (12.6.1). I have installed the latest protobuf through Homebrew and am trying to import gcld3 in a Jupyter notebook, but the cell just gets stuck processing indefinitely. Is there anything I'm missing here? Why can't I import the library?

Thanks!

Request for a branch or tag

Would it be possible to create a simple tag ('0.1', 'alpha', 'beta', ... or whatever) to easily identify the current version/status of the master branch?

Installation instructions unclear

1- What does "check out the Chromium repository." mean, should I just look at it?
2- "copy the code to //third_party/cld_3", what code, and where is this folder supposed to be? cd // in my shell sends me to /. Should I have a directory with path /third_party/cld_3?
3- "Uncomment language_identifier_main executable in src/BUILD.gn" why isn't it already uncommented?
4- Why does the repo have a CMakeLists.txt but the users have to follow these weird "build" instructions?

This whole language detection thing sounds pretty awesome but I was very disappointed to find these instructions.

Thanks.

Concurrency issues

What is the right way of using NNetLanguageIdentifier concurrently with threads?

I thought that if I create a separate instance for each thread it should be OK, but I start getting access-violation exceptions in the NNetLanguageIdentifier constructor when threads run concurrently. I was able to solve that by adding a global lock, but I wonder what the best way to use the code concurrently is. Maybe creating a single instance of NNetLanguageIdentifier and sharing it between threads would be OK? Thoughts?
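One common pattern, sketched below with Python's threading.local, is to create one identifier per thread lazily; the helper name and factory argument are hypothetical, not part of gcld3. If construction itself turns out not to be thread-safe, serializing only construction under a lock while keeping per-thread instances for inference is a reasonable compromise:

```python
import threading

_tls = threading.local()

def get_detector(factory):
    """Return this thread's detector instance, creating it lazily."""
    if not hasattr(_tls, "detector"):
        _tls.detector = factory()
    return _tls.detector

# Usage with gcld3 (assuming it is installed) would look like:
#   det = get_detector(lambda: gcld3.NNetLanguageIdentifier(0, 1000))
#   det.FindLanguage(text=some_text)
```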

Support for libcld3 installation

In order to include this library in another C++ program, the libcld3.a and include files need to be installed. However, the CMake build does not support this right now. I was able to add that support by appending the following code to the CMakeLists.txt file:

install(DIRECTORY include/ DESTINATION include)
install(TARGETS ${PROJECT_NAME} 
    ARCHIVE DESTINATION lib
    LIBRARY DESTINATION lib
    RUNTIME DESTINATION bin)

You will need to create an /include directory with all the *.h files. Then you can compile and install the library as follows:

cd cld3
mkdir build.release
cd build.release
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target install

Using cld3 in your program is pretty simple:

#include <string>

#include <nnet_language_identifier.h>

std::string get_language(const std::string &str)
{
    if (str.empty()) return "";
    chrome_lang_id::NNetLanguageIdentifier lang_id;
    const chrome_lang_id::NNetLanguageIdentifier::Result result = lang_id.FindLanguage(str);
    return result.is_reliable ? result.language : "";
}

Hope this helps anyone who might be stuck.

Expose Span Information from FindTopNMostFrequentLangs

When calling FindTopNMostFrequentLangs(text,num_langs), it would be helpful to know the ranges of text that each result applies to. For example, if you had the string "Hello, my name is 三船 敏郎. It's a pleasure to meet you.", it would be helpful to know that English applies to indices 0-16 and 24-52, while Japanese applies to indices 17-23. I propose the following:

  1. Add vector<pair<int,int>> to LangChunkStats that keeps track of ranges of text the language applies to. The vector can be populated using the script_span.offset and script_span.text_bytes.
  2. Add the vector to Result when populating results vector.

These small changes would give the caller more detailed information about the language of each section of text, if there are multiple languages detected.
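As a rough illustration of the proposed bookkeeping, given (language, offset, text_bytes) triples such as script_span would supply, the per-language ranges could be collected like this (the names and shapes here are hypothetical, not the actual CLD3 structures):

```python
from collections import defaultdict

def language_ranges(spans):
    """Collect inclusive (start, end) byte ranges per detected language.

    spans: iterable of (language, offset, text_bytes) triples.
    """
    ranges = defaultdict(list)
    for lang, offset, nbytes in spans:
        ranges[lang].append((offset, offset + nbytes - 1))
    return dict(ranges)

# The example from the issue: English at 0-16 and 24-52, Japanese at 17-23.
result = language_ranges([("en", 0, 17), ("ja", 17, 7), ("en", 24, 29)])
```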

Support for Unit Testing

The CMakeLists.txt file is missing support for building the unit tests to validate all the languages. You can append the following code to create a language_id_test executable that will test each of the supported languages and report any errors:

add_executable(language_id_test src/nnet_lang_id_test.cc src/nnet_lang_id_test_data.cc)
target_link_libraries(language_id_test cld3 ${Protobuf_LITE_LIBRARIES})

Bad identification for short input

Hi,
Could anyone confirm that short inputs are usually not correctly identified by CLD3?
Some examples:

text: Hello
language: sr
probability: 0.830728
reliable: 1
proportion: 1

text: Hello world
language: ky
probability: 0.719188
reliable: 1
proportion: 1

text: Hello my world
language: ky
probability: 0.521224
reliable: 0
proportion: 1

text: Hello my great world
language: ja
probability: 0.278577
reliable: 0
proportion: 1

text: Hello the great world of Artificial Intelligence
language: en
probability: 0.980107
reliable: 1
proportion: 1

Swedish detected when all text is arabic

ruby تَخيّل أنّ الله بِعظمتِه يُحِبُك نَاقِلة وبَاحِثة للمَواضِيع التَرفيهية والمُفيدة بَعض الثِريد ز تَعود لصَاحبِيها الأصلِيين حُقوقية عَربية مُسلِمة أسعَى لأن أترُكَ أثَراً للأجَر لعضُو تكست يونجون 𓂅 أوّل قناة بالتِيلي لنَشر مُتتَاليات يونجوُن صَور مفَلترة فخَمة مَعلومِات عنهه كلهِة هنا مُنو باعتِقادك يستحِق لقِب فرقةه ممُهده الطريق فَي الكيبوب بايّسك من txt ٌ مُتتاليات الفرقِةه الصَاعدة تكِست هَلقناة مُختصةه فقط لنِشر عن كُل ما يخص مِلوك الجَيل الرَابع تمورو باي توقذر مُتتاليات أيديتِ ز أيكونز وصور عِرض فِخمةه كل ذا تلقِاها بهِل قناة جَست للاِشتراك لِستةه ودعِم لالِيسون َ محُتوى قنواتِ الكيبوب والانِمي القِبول اقِل مَ اقبَل الزيَادة و أكثَر لتِموَرو بآي تَوقيذِر 𓂅 كلهِة هنا 彡 تحَبيِن تكست 𓂅 أعلاَن حقيِقي أكبَر قناَه لہ فخَر الروكيز تكسِت ̧ مُتتاليات ونشِر مُرتب وكلشِيء يخصهم 𓏲 فاَن لتكسِت أشتركِ 彡 تعَرف تصَمم يِن ᤣ كليِشةه حقيِقيةه أوُل قناَه لہ تعليِم تصاميِم و ايديِتز بسهُوله وشرح مَفهوم ̧ تحبِ تصممِ أدخلِي هناَ

It is detected as sv by the FindLanguage function. However, if you remove the first word, "ruby", it begins to be detected as Arabic, but still with a pretty high sv score:

Lang Id, name, score
49 sv 15.801
88 ar 13.1218

To me this seems quite unexpected :)

Training set

Is the training set that was used to train CLD3 available somewhere? Alternatively, and a bit off-topic, is there a sort of standard or very good dataset used for writing a language classifier like this one?

Korean is detected as many different languages with some symbols

[1] pry(main)> identifier =  CLD3::NNetLanguageIdentifier.new(1, 2048)
=> #<CLD3::NNetLanguageIdentifier:0x0000557ad8972f68
 @cc=#<CLD3::Unstable::NNetLanguageIdentifier::Pointer address=0x0000557ad8014870>>
[2] pry(main)> identifier.find_language('안녕하세요')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[3] pry(main)> identifier.find_language('A: 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[4] pry(main)> identifier.find_language('A. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[5] pry(main)> identifier.find_language('Q. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[6] pry(main)> identifier.find_language('"안녕하세요"')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[7] pry(main)> identifier.find_language('Q:안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[8] pry(main)> identifier.find_language('A. 코스프레?')
=> #<struct Struct::Result language=:zh, probability=0.27146071195602417, reliable?=false, proportion=1.0>
[9] pry(main)> identifier.find_language('A. 코스프레?\n마녀 하고 싶어요')
=> #<struct Struct::Result language=:ne, probability=0.9822306632995605, reliable?=true, proportion=1.0>

Korean uses a specialized character set called Hangul (한글), so 1-gram-based detection should succeed at almost a 100% rate, but it is detected as zh, ne, hi, etc.
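As a workaround, a caller can pre-filter for Hangul before trusting the model on short mixed strings; the Hangul Syllables block is U+AC00–U+D7A3. This helper is a hypothetical sketch, not part of CLD3:

```python
def hangul_fraction(text):
    """Fraction of non-space characters in the Hangul Syllables block."""
    letters = [c for c in text if not c.isspace()]
    if not letters:
        return 0.0
    return sum('\uac00' <= c <= '\ud7a3' for c in letters) / len(letters)

# A caller could short-circuit to "ko" when this fraction dominates,
# instead of letting leading symbols like "A." or "Q:" sway the model.
```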

fatal error C1083: Cannot open include file: 'google/protobuf/port_def.inc': No such file or directory

I am having the following error when running pip install gcld3:

ERROR:

\AppData\Local\Temp\pip-install-7_v9ujss\gcld3_9a116eb59c8049b5a46f1c8cf8ca323d\src\cld_3/protos/feature_extractor.pb.h(10): fatal error C1083: Cannot open include file: 'google/protobuf/port_def.inc': No such file
or directory

System information:

  • Windows 10
  • Python 3.8
  • protobuf==3.12.2
  • libprotoc 3.12.1

Notes:
I also tried with Python 3.7 as well as using the most recent version of protobuf==3.15.8

cmake: error: 'google/protobuf/stubs/common.h' file not found

As for #16 I'm trying to build the static library:

ip-192-168-1-105:AI loretoparisi$ git clone https://github.com/google/cld3.git
Cloning into 'cld3'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 429 (delta 6), reused 8 (delta 3), pack-reused 413
Receiving objects: 100% (429/429), 2.88 MiB | 481.00 KiB/s, done.
Resolving deltas: 100% (295/295), done.

and then building with Cmake

ip-192-168-1-105:AI loretoparisi$ cd cld3/
ip-192-168-1-105:cld3 loretoparisi$ mkdir build
ip-192-168-1-105:cld3 loretoparisi$ cd build/
ip-192-168-1-105:build loretoparisi$ cmake ..
-- The C compiler identification is AppleClang 10.0.0.10001145
-- The CXX compiler identification is AppleClang 10.0.0.10001145
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Found Protobuf: /usr/local/lib/libprotobuf.dylib (found version "3.5.1") 
-- Protobuf_FOUND= TRUE
-- Protobuf_VERSION= 3.5.1
CMake Warning at CMakeLists.txt:11 (message):
  Protobuf 2.5 and CLD3 seems happy together.  This script does NOT check if
  your verison of protobuf is compatible.
-- Protobuf_LIBRARIES= /usr/local/lib/libprotobuf.dylib
-- Protobuf_LITE_LIBRARIES= /usr/local/lib/libprotobuf-lite.dylib
-- PROTO_HDRS= /Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.h;/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/sentence.pb.h;/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/task_spec.pb.h
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/loretoparisi/Documents/Projects/AI/cld3/build

Now running make

ip-192-168-1-105:build loretoparisi$ make
[  2%] Running C++ protocol buffer compiler on src/task_spec.proto
[  5%] Running C++ protocol buffer compiler on src/feature_extractor.proto
[  8%] Running C++ protocol buffer compiler on src/sentence.proto
Scanning dependencies of target cld3
[ 10%] Building CXX object CMakeFiles/cld3.dir/cld_3/protos/feature_extractor.pb.cc.o
In file included from /Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.cc:4:
/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
#include <google/protobuf/stubs/common.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make[2]: *** [CMakeFiles/cld3.dir/cld_3/protos/feature_extractor.pb.cc.o] Error 1
make[1]: *** [CMakeFiles/cld3.dir/all] Error 2
make: *** [all] Error 2
ip-192-168-1-105:build lor

Since I'm on macOS I have tried a protobuf update

Error: protobuf 3.5.1_1 is already installed
To upgrade to 3.7.0, run `brew upgrade protobuf`

After upgrading to protobuf 3.7.0 I get the same very error.

gcld3 compatible with pylint and mypy

It would be nice to make the python library gcld3 compatible with pylint and mypy by defining the types of the module.
It seems completely lost by pybind.

Thanks in advance,
Loic

Spanish manual language problems detection

Hi cld3 team!
Thank you so much for this development, it is so useful!
I have used your language detection package via R (see code) and then done some manual tagging for Spanish (see the "human" column in this csv) and have found some things that might be interesting, but I am unsure of how to make them useful for you.
For instance, related to this issue, from a list of conference titles, those in "Spanglish" got tagged as English w/cld2 and as Spanish with cld3.
Also, while cld3 got real better at distinguishing Galician from Spanish there is still one case in which it got this tag wrong: "SAMEBibl: Sistema Automático de Migración a Europeana para Bibliotecas" (should be Spanish)
Hope this is somewhat useful :)

Relation among CLD2 Score and CLD3 Accuracy

In my project I have to port the language detector from CLD2 to CLD3. CLD2 has a concept of a Score and a Percentage of some language in the text. Internally, the Score is calculated from a probability (not exposed, in my understanding) in some way; my assumption was from the field textBytes that represents the size in bytes of the text, the accuracy, and the distribution of each label in the text, something like Acc = 1 - textBytes/Score.
In CLD2 the function that normalizes these scores is

normalized_score3[2] = GetNormalizedScore(language3[2],
                                                  ULScript_Common,
                                                  bytecount3,
                                                  doc_tote->Score(2));

That said, since I need to upgrade to CLD3, I have at some point to convert from CLD2 Score to CLD3 accuracy value. Any hint how to achieve that?

Here for reference:
dachev/node-cld#52

Not able to import the cld3 Python module on Mac

Installed protobuf
Installed https://github.com/Elizafox/cld3

On import I get the following error:
ImportError: dlopen(/Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so, 2): Symbol not found: __ZN6google8protobuf2io18StringOutputStreamC1EPSs
Referenced from: /Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so
Expected in: flat namespace
in /Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so

Also Tried to install https://github.com/jbaiter/cld3
But get the following build error

Collecting git+https://github.com/jbaiter/cld3.git
Cloning https://github.com/jbaiter/cld3.git to /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1
Building wheels for collected packages: cld3
Running setup.py bdist_wheel for cld3 ... error
Complete output from command /Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-wheel-53bhx___ --python-tag cp35:
/Users/karan.kothari/anaconda/lib/python3.5/distutils/extension.py:132: UserWarning: Unknown Extension options: 'include_paths'
warnings.warn(msg)
running bdist_wheel
running build
running build_ext
building 'cld3' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.5
creating build/temp.macosx-10.7-x86_64-3.5/src
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3/protos
creating build/temp.macosx-10.7-x86_64-3.5/src/script_span
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/karan.kothari/anaconda/include -arch x86_64 -Isrc -I/Users/karan.kothari/anaconda/include/python3.5m -c src/cld3.cpp -o build/temp.macosx-10.7-x86_64-3.5/src/cld3.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/embedding_feature_extractor.h:23:0,
from src/nnet_language_identifier.h:22,
from src/cld3.cpp:593:
src/feature_extractor.h:45:47: fatal error: cld_3/protos/feature_extractor.pb.h: No such file or directory
#include "cld_3/protos/feature_extractor.pb.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1


Failed building wheel for cld3
Running setup.py clean for cld3
Failed to build cld3
Installing collected packages: cld3
Found existing installation: cld3 0.2.2
Uninstalling cld3-0.2.2:
Successfully uninstalled cld3-0.2.2
Running setup.py install for cld3 ... error
Complete output from command /Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-record-u49dl8xm/install-record.txt --single-version-externally-managed --compile:
/Users/karan.kothari/anaconda/lib/python3.5/distutils/extension.py:132: UserWarning: Unknown Extension options: 'include_paths'
warnings.warn(msg)
running install
running build
running build_ext
building 'cld3' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.5
creating build/temp.macosx-10.7-x86_64-3.5/src
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3/protos
creating build/temp.macosx-10.7-x86_64-3.5/src/script_span
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/karan.kothari/anaconda/include -arch x86_64 -Isrc -I/Users/karan.kothari/anaconda/include/python3.5m -c src/cld3.cpp -o build/temp.macosx-10.7-x86_64-3.5/src/cld3.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/embedding_feature_extractor.h:23:0,
from src/nnet_language_identifier.h:22,
from src/cld3.cpp:593:
src/feature_extractor.h:45:47: fatal error: cld_3/protos/feature_extractor.pb.h: No such file or directory
#include "cld_3/protos/feature_extractor.pb.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1

----------------------------------------

Rolling back uninstall of cld3
Command "/Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-record-u49dl8xm/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/

Can't install gcld3 on MacOS Ventura 13.2.1

This looks to be extremely common, but after trying dozens of solutions, I'm still getting an error when installing gcld3.

First, here's the complete error output. I'm using the same recommended format that solved Issue #49 in this example:

coreynorthcutt@Coreys-MacBook-Pro mixedlang % CPATH=/opt/homebrew/include pip3 install gcld3
WARNING: Skipping /usr/local/lib/python3.11/site-packages/six-1.16.0-py3.11.egg-info due to invalid metadata entry 'name'
Collecting gcld3
  Using cached gcld3-3.0.13.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gcld3
  Building wheel for gcld3 (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      /usr/local/lib/python3.11/site-packages/setuptools/__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated. Requirements should be satisfied by a PEP 517 installer. If you are using pip, you can try `pip install --use-pep517`.
        dist.fetch_build_eggs(dist.setup_requires)
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-13-x86_64-cpython-311
      creating build/lib.macosx-13-x86_64-cpython-311/gcld3
      copying gcld3/__init__.py -> build/lib.macosx-13-x86_64-cpython-311/gcld3
      running build_ext
      building 'gcld3.pybind_ext' extension
      creating build/temp.macosx-13-x86_64-cpython-311
      creating build/temp.macosx-13-x86_64-cpython-311/gcld3
      creating build/temp.macosx-13-x86_64-cpython-311/src
      creating build/temp.macosx-13-x86_64-cpython-311/src/cld_3
      creating build/temp.macosx-13-x86_64-cpython-311/src/cld_3/protos
      creating build/temp.macosx-13-x86_64-cpython-311/src/script_span
      clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -stdlib=libc++ -I/private/var/folders/0r/1w5d1pkx3ql47d5jx_jqdnd40000gn/T/pip-install-vuwi41an/gcld3_476945d9d6e345219101ad8b4486ddb6/.eggs/pybind11-2.10.3-py3.11.egg/pybind11/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c gcld3/pybind_ext.cc -o build/temp.macosx-13-x86_64-cpython-311/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
      In file included from gcld3/pybind_ext.cc:5:
      In file included from gcld3/../src/nnet_language_identifier.h:22:
      In file included from gcld3/../src/embedding_feature_extractor.h:23:
      In file included from gcld3/../src/feature_extractor.h:45:
      gcld3/../src/cld_3/protos/feature_extractor.pb.h:10:10: fatal error: 'google/protobuf/port_def.inc' file not found
      #include <google/protobuf/port_def.inc>
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for gcld3
  Running setup.py clean for gcld3
Failed to build gcld3

The error with clang stands out: error: command '/usr/bin/clang' failed with exit code 1

This particular error appears no matter how I try to install cld3. I get varying preceding error output, e.g. if I try to pip install pycld3 instead... but the clang error is always there.

I've so far...

  • done a brew update and brew upgrade.
  • verified that Pip, Python, Protoc, Wheel, and Setuptools all have the latest versions.
  • tried testing in a virtual environment (as some recommended in other threads).
  • tried appending the --use-pep517 flag as the warning above suggests.
  • tried setting a million different ENV variables in my local session (every thread seems to suggest something different)

Support for Kurdish (Sorani)

Central Kurdish (Sorani) is a Kurdish language with several million native speakers in Iraq and Iran. The code (ckb) is assigned to the language, and it uses the Arabic script.

Also, I suggest renaming the current Kurdish entry on the list to Kurdish (Kurmanji), in order to identify both Kurdish languages easily.

Train a new model

Hi, does anyone know how to train a new model on custom data? There isn't any documentation...

Traditional Chinese support

Currently the model detects both zh-hans and zh-hant as just zh:

import cld3

documents = [
    "把中坛元帅固定在神轿",
    "親愛的牽起你的手 並將他們放在我手中",
]

for i, d in enumerate(documents):
    print(i, cld3.get_language(d))

will output:

0 LanguagePrediction(language='zh', probability=0.9999771118164062, is_reliable=True, proportion=1.0)
1 LanguagePrediction(language='zh', probability=0.9999103546142578, is_reliable=True, proportion=1.0)

while the first sentence is zh-hans and the second is zh-hant.

issue about detect incorrectlly

I try with many text of korean but CLD3 is unable to detect it.

for example:
Korean text: "이 회의에서는 업계 전반의" => output: vi => should be ko

English text: "hello world" => output: ky => should be en

How can CLD3 be made to detect languages more accurately?

Thank you very much.
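One pragmatic guard, short of retraining, is to reject predictions whose usual script never appears in the input: vi is written in the Latin script, yet the Korean sample contains no Latin letters, and ky is written in Cyrillic, yet "hello world" contains none. The sketch below is a hypothetical illustration (the EXPECTED_SCRIPT table is an assumption), not part of CLD3:

```python
import unicodedata

# Hypothetical sanity check (not part of cld3): a prediction is suspect if
# the script its language is normally written in never occurs in the input.
# EXPECTED_SCRIPT is a tiny illustrative mapping from language code to the
# keyword that appears in Unicode character names for that script.
EXPECTED_SCRIPT = {"en": "LATIN", "vi": "LATIN", "ky": "CYRILLIC", "ko": "HANGUL"}

def script_consistent(text: str, lang: str) -> bool:
    expected = EXPECTED_SCRIPT.get(lang)
    if expected is None:
        return True  # no expectation recorded for this language: do not veto
    return any(expected in unicodedata.name(ch, "") for ch in text)

print(script_consistent("이 회의에서는 업계 전반의", "vi"))  # False: no Latin letters
print(script_consistent("hello world", "ky"))                # False: no Cyrillic letters
print(script_consistent("이 회의에서는 업계 전반의", "ko"))  # True
```

Predictions that fail this check can be treated as unreliable regardless of their reported probability.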

Chromium Dependency

Is it possible to avoid the dependency on Chromium? Is it needed only for the protobuf dependency? If so, it would be worthwhile to add external libraries and a Makefile so that cld3 can be compiled directly, without the Chromium overhead...

ImportError (Python3.9, protobuf 21.5)

I installed gcld3 using pip, and it worked until protobuf was updated in Homebrew.
Now import gcld3 no longer works.
Essentially, gcld3's extension tries to load libprotobuf.30.dylib, but I have libprotobuf.32.dylib (and renaming it to libprotobuf.30.dylib does not work).
What would be a solution to this problem?

Error Message:

Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/__init__.py", line 1, in <module>
    from .pybind_ext import *
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ImportError: dlopen(/Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/pybind_ext.cpython-39-darwin.so, 2): Library not loaded: /usr/local/opt/protobuf/lib/libprotobuf.30.dylib
  Referenced from: /Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/pybind_ext.cpython-39-darwin.so
  Reason: image not found

"https" makes French be identified as English

I used a simple JavaCPP adapter:
https://github.com/bytedeco/javacpp

import org.bytedeco.javacpp.Loader;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.annotation.Platform;

@Platform(
    include = {"LangDetect.h"},
    link = {"cLangDetect"}
)
public class LangDetect extends Pointer {
    private native void detect(String var1, int var2, LangData var3);

    private native void allocate();

    public LangDetect() {
        this.allocate();
    }

    public void detect(String str, LangData result) {
        this.detect(str, str.length(), result);
    }

    static {
        Loader.load();
    }
}

Test with http inside the URL:

object CLD2Example {
    @JvmStatic
    fun main(args: Array<String>) {
        val ldetect = LangDetect()
        val ld = LangData()
        ldetect.detect("Sampension,3ème caisse de retraite danoise\uD83C\uDDE9\uD83C\uDDF0,#BoycottIsrael,boycott 4 firmes liées à des colonies!\n" +
                "#GroupPalestine\n" +
                "⏩https://t.co/gPIEbpotvk https://t.co/P4TaPvcBdX ", ld)
        println(ld.getName(0))
        println(ld.getScore(0))
    }
}

output:
ENGLISH
0.99

Test without http inside the URL:

object CLD2Example {
    @JvmStatic
    fun main(args: Array<String>) {
        val ldetect = LangDetect()
        val ld = LangData()
        ldetect.detect("Sampension,3ème caisse de retraite danoise\uD83C\uDDE9\uD83C\uDDF0,#BoycottIsrael,boycott 4 firmes liées à des colonies!\n" +
                "#GroupPalestine\n" +
                "⏩://t.co/gPIEbpotvk ://t.co/P4TaPvcBdX ", ld)
        println(ld.getName(0))
        println(ld.getScore(0))
    }
}

output:
FRENCH
0.99
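One mitigation is to strip URL-like tokens before detection, so that English fragments such as https and t.co cannot dominate a short input. This is a hypothetical preprocessing sketch; the regex is deliberately loose and is not something cld3 ships:

```python
import re

# Hypothetical preprocessing sketch: drop URL-like tokens ("https://..."
# as well as the scheme-less "://..." form used in the second test above)
# before passing text to the language detector.
URL_RE = re.compile(r"(?:https?)?://\S+")

def strip_urls(text: str) -> str:
    return URL_RE.sub(" ", text)

tweet = ("Sampension, 3ème caisse de retraite danoise, boycott 4 firmes "
         "liées à des colonies! https://t.co/gPIEbpotvk ://t.co/P4TaPvcBdX")
print(strip_urls(tweet))
```

With the URLs removed, only the natural-language portion of the tweet reaches the detector.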

Cannot install gcld3 on OS X

Hi all. I have protobuf 3.15.8 installed, but when I try to install gcld3 through pip, the protobuf headers aren't found when building from source:

$ pip install gcld3
Collecting gcld3
  Using cached gcld3-3.0.13.tar.gz (647 kB)
Building wheels for collected packages: gcld3
  Building wheel for gcld3 (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-wheel-jab5wg99
       cwd: /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/
  Complete output (25 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.9-x86_64-3.7
  creating build/lib.macosx-10.9-x86_64-3.7/gcld3
  copying gcld3/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/gcld3
  running build_ext
  building 'gcld3.pybind_ext' extension
  creating build/temp.macosx-10.9-x86_64-3.7
  creating build/temp.macosx-10.9-x86_64-3.7/gcld3
  creating build/temp.macosx-10.9-x86_64-3.7/src
  creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3
  creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3/protos
  creating build/temp.macosx-10.9-x86_64-3.7/src/script_span
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/lib/python3.7/site-packages/pybind11/include -I/Users/erip/miniconda3/envs/langid/include/python3.7m -c gcld3/pybind_ext.cc -o build/temp.macosx-10.9-x86_64-3.7/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
  In file included from gcld3/pybind_ext.cc:5:
  In file included from gcld3/../src/nnet_language_identifier.h:22:
  In file included from gcld3/../src/embedding_feature_extractor.h:23:
  In file included from gcld3/../src/feature_extractor.h:45:
  gcld3/../src/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
  #include <google/protobuf/stubs/common.h>
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for gcld3
  Running setup.py clean for gcld3
Failed to build gcld3
Installing collected packages: gcld3
    Running setup.py install for gcld3 ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-record-20khe1jh/install-record.txt --single-version-externally-managed --compile --install-headers /Users/erip/miniconda3/envs/langid/include/python3.7m/gcld3
         cwd: /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/
    Complete output (25 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.7
    creating build/lib.macosx-10.9-x86_64-3.7/gcld3
    copying gcld3/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/gcld3
    running build_ext
    building 'gcld3.pybind_ext' extension
    creating build/temp.macosx-10.9-x86_64-3.7
    creating build/temp.macosx-10.9-x86_64-3.7/gcld3
    creating build/temp.macosx-10.9-x86_64-3.7/src
    creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3
    creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3/protos
    creating build/temp.macosx-10.9-x86_64-3.7/src/script_span
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/lib/python3.7/site-packages/pybind11/include -I/Users/erip/miniconda3/envs/langid/include/python3.7m -c gcld3/pybind_ext.cc -o build/temp.macosx-10.9-x86_64-3.7/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
    In file included from gcld3/pybind_ext.cc:5:
    In file included from gcld3/../src/nnet_language_identifier.h:22:
    In file included from gcld3/../src/embedding_feature_extractor.h:23:
    In file included from gcld3/../src/feature_extractor.h:45:
    gcld3/../src/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
    #include <google/protobuf/stubs/common.h>
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-record-20khe1jh/install-record.txt --single-version-externally-managed --compile --install-headers /Users/erip/miniconda3/envs/langid/include/python3.7m/gcld3 Check the logs for full command output.

Do you have any thoughts?

Did I hit a bug in gcld3?

I ran a few experiments to see what detection returns for text not in the supported-languages list. Usually whatever is detected is marked unreliable, so I can reject it, but I stumbled upon an example where the result is unexpectedly bad.

import gcld3
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
sample = "The last part of this text is pure gibberish with well crafted punctuation. Този текст е на Български. Sdslkmnscd scsun dc mcsaducsdnmlmc icmmklmdsc!"
result = detector.FindTopNMostFreqLangs(text=sample, num_langs=5)
for i in result:
    print(i.language, i.is_reliable, i.proportion, i.probability)

will surprisingly output this:
en True 0.4444444477558136 0.9999370574951172
bg True 0.28070175647735596 0.9173890948295593
hu True 0.27485379576683044 0.9084945917129517
und False 0.0 0.0
und False 0.0 0.0

For input with one part good text and one part garbage, the result depends on which part comes first and which has the larger proportion, but it can still be interpreted correctly; the example above, however, is quite bad.
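Defensive consumption of such output can be sketched as below (LanguagePrediction is recreated as a plain namedtuple so the snippet is self-contained, and the thresholds are assumptions). Notably, no reasonable probability threshold separates the correct bg (0.917) from the spurious hu (0.908) here, which is exactly what makes this example bad:

```python
from collections import namedtuple

# Stand-in for the result type returned by the gcld3 binding.
LanguagePrediction = namedtuple(
    "LanguagePrediction", "language probability is_reliable proportion")

def plausible(preds, min_probability=0.9, min_proportion=0.25):
    """Keep predictions that are reliable, confident, and cover a
    non-trivial share of the input. Thresholds are illustrative."""
    return [p for p in preds
            if p.is_reliable
            and p.probability >= min_probability
            and p.proportion >= min_proportion]

# The output shown above:
results = [
    LanguagePrediction("en", 0.9999370574951172, True, 0.4444444477558136),
    LanguagePrediction("bg", 0.9173890948295593, True, 0.28070175647735596),
    LanguagePrediction("hu", 0.9084945917129517, True, 0.27485379576683044),
    LanguagePrediction("und", 0.0, False, 0.0),
]
print([p.language for p in plausible(results)])        # ['en', 'bg', 'hu']
print([p.language for p in plausible(results, 0.95)])  # ['en'] - bg is lost too
```

Raising min_probability enough to drop the spurious hu also drops the correct bg, so thresholding alone cannot rescue this case.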

Increase the number of supported languages

Hi!
Do you have any plans to increase the number of supported languages up to 200-300?
Languages like Chuvash (chv), Mari (mhr), Hill Mari (mrj), and Komi (kpv), which have a presence on the web, are not included here, and hence are not in the multilingual C4 dataset.

Chinese text detected as Haitian Creole

>>> import gcld3
>>> ld = gcld3.NNetLanguageIdentifier(0, 50)
>>> res = ld.FindLanguage('污水')
>>> print(res.language, res.probability)
ht 0.7535305619239807
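Note that min_num_bytes is 0 in the snippet above, which disables the identifier's own minimum-length check; two CJK characters (six UTF-8 bytes) give the ngram model very little evidence. A guard along these lines can help; the thresholds below are assumptions, not CLD3 defaults:

```python
def trust(text: str, probability: float,
          min_bytes: int = 20, min_probability: float = 0.9) -> bool:
    """Illustrative guard: distrust predictions made on very short inputs
    or with low probability. Thresholds are assumptions, not CLD3 defaults."""
    if len(text.encode("utf-8")) < min_bytes:
        return False  # too little evidence for the ngram model
    return probability >= min_probability

print(trust("污水", 0.7535305619239807))  # False: only 6 bytes of input
```

Passing a non-zero min_num_bytes to NNetLanguageIdentifier achieves a similar effect inside the library itself.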

Calling detect_language_mixed on an empty string crashes the R session

To reproduce:

library(cld3)
detect_language_mixed("")

Expected: this produces some result

Actual: R crashes with the following messages:

> detect_language_mixed("")

 *** caught illegal operation ***
address 0x7f6807716e58, cause 'illegal operand'

Traceback:
 1: cld3_detect_language_mixed(as_string(text, vectorize = FALSE),     size)
 2: detect_language_mixed("")

Tests are failing

The language detection tests are failing: the model confuses Bosnian with Croatian and Indonesian with Malay. These language pairs are admittedly similar, but it would be good if the tests could be adapted so that they pass. Note that increasing max_num_bytes even to 100000 does not fix the test.

Below is the output of the test

Running TestPredictions
  Misclassification:
    Text: Novi predsjednik Mešihata Islamske zajednice u Srbiji (IZuS) i muftija dr. Mevlud ef. Dudić izjavio je u intervjuu za Anadolu Agency (AA) kako je uvjeren da će doći do vraćanja jedinstva među muslimanima i unutar Islamske zajednice na prostoru Sandžaka, te da je njegova ruka pružena za povratak svih u okrilje Islamske zajednice u Srbiji nakon skoro sedam godina podjela u tom dijelu Srbije. Dudić je za predsjednika Mešihata IZ u Srbiji izabran 4. januara, a zvanična inauguracija će biti obavljena u prvoj polovini februara. Kako se očekuje, prisustvovat će joj i reisu-l-ulema Islamske zajednice u Srbiji Husein ef. Kavazović koji će i zvanično promovirati Dudića u novog prvog čovjeka IZ u Srbiji. Dudić će danas boraviti u prvoj zvaničnoj posjeti reisu Kavazoviću, što je njegov privi simbolični potez nakon imenovanja.
    Expected language: bs
    Predicted language: hr
  Misclassification:
    Text: berdiri setelah pengurusnya yang berusia 83 tahun, Fayzrahman Satarov, mendeklarasikan diri sebagai nabi dan rumahnya sebagai negara Islam Satarov digambarkan sebagai mantan ulama Islam  tahun 1970-an. Pengikutnya didorong membaca manuskripnya dan kebanyakan dilarang meninggalkan tempat persembunyian bawah tanah di dasar gedung delapan lantai mereka. Jaksa membuka penyelidikan kasus kriminal pada kelompok itu dan menyatakan akan membubarkan kelompok kalau tetap melakukan kegiatan ilegal seperti mencegah anggotanya mencari bantuan medis atau pendidikan. Sampai sekarang pihak berwajib belum melakukan penangkapan meskipun polisi mencurigai adanya tindak kekerasan pada anak. Pengadilan selanjutnya akan memutuskan apakah anak-anak diizinkan tetap tinggal dengan orang tua mereka. Kazan yang berada sekitar 800 kilometer di timur Moskow merupakan wilayah Tatarstan yang
    Expected language: id
    Predicted language: ms
  Failure: 2 wrong predictions

Undefined symbol

I followed the instructions in the README up to the command ninja -C out/Default third_party/cld_3/src/src:language_identifier_main. I uncommented language_identifier_main and set the PATH variable, but when I run that command I get the following error. I have tried changing the generated files and updating gcc, but I am not sure why linking against the standard library is failing.

[306/306] LINK ./language_identifier_main
FAILED: language_identifier_main
>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>> referenced by language_identifier_main.cc:29 (../../third_party/cld_3/src/src/language_identifier_main.cc:29)
/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout
>>> referenced by language_identifier_main.cc
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:35 (../../third_party/cld_3/src/src/language_identifier_main.cc:35)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:36 (../../third_party/cld_3/src/src/language_identifier_main.cc:36)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:37 (../../third_party/cld_3/src/src/language_identifier_main.cc:37)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout
>>> referenced by language_identifier_main.cc
>>> referenced by language_identifier_main.cc:48 (../../third_party/cld_3/src/src/language_identifier_main.cc:48)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(bool)
>>> referenced by language_identifier_main.cc:49 (../../third_party/cld_3/src/src/language_identifier_main.cc:49)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(float)
>>> referenced by language_identifier_main.cc:50 (../../third_party/cld_3/src/src/language_identifier_main.cc:50)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()
>>> referenced by language_identifier_main.cc:54 (../../third_party/cld_3/src/src/language_identifier_main.cc:54)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::ios_base::getloc() const
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::ctype<char>::id
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::locale::use_facet(std::__1::locale::id&) const
>>> referenced by __locale:212 (../../buildtools/third_party/libc++/trunk/include/__locale:212)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::locale::~locale()
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::put(char)
>>> referenced by ostream:1001 (../../buildtools/third_party/libc++/trunk/include/ostream:1001)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::flush()
>>> referenced by ostream:1002 (../../buildtools/third_party/libc++/trunk/include/ostream:1002)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

Python binding forks and different fixes

Update: CLD3 now has official Python bindings from Google themselves: gcld3

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3


This issue documents some Python binding forks, in the hope that fixes can be merged at the higher upstreams as much as possible:

Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name https://github.com/bsolomon1124/pycld3 by @bsolomon1124

Python Binding Documentation

(based on the documentation from https://github.com/Elizafox/cld3 )

Usage:

Here are some examples:

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> cld3.get_frequent_languages("This piece of text is in English. Този текст е на Български.", 5)
[LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592), LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)]

In short:

  • get_language returns the most likely language as the named tuple LanguagePrediction. proportion is always 1.0 when called this way.
  • get_frequent_languages returns the top guesses, up to the specified maximum (5 in the example above); the maximum is mandatory. proportion is set to the fraction of input bytes found to be in that language.

In the normal cld3 library, "und" may be returned as a language for unknown languages (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing will be returned. This also means, as a consequence, get_frequent_languages may return fewer results than what you asked for, or none at all.
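The filtering described above can be sketched as follows (LanguagePrediction stands in for the binding's named tuple; this is an illustration of the documented behavior, not the binding's actual source):

```python
from collections import namedtuple

# Stand-in for the named tuple returned by the binding.
LanguagePrediction = namedtuple(
    "LanguagePrediction", "language probability is_reliable proportion")

def drop_unknown(predictions):
    # "und" entries carry no usable statistics, so they are filtered out;
    # callers may therefore receive fewer results than they asked for.
    return [p for p in predictions if p.language != "und"]

raw = [
    LanguagePrediction("bg", 0.917, True, 0.585),
    LanguagePrediction("en", 0.999, True, 0.415),
    LanguagePrediction("und", 0.0, False, 0.0),
]
print([p.language for p in drop_unknown(raw)])  # ['bg', 'en']
```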

nnet_lang_id_test::TestPredictions() assertion failure

With latest master, I'm seeing two assertions fail in nnet_lang_id_test::TestPredictions():

Running TestPredictions
  Misclassification: 
    Text: Novi predsjednik Mešihata Islamske zajednice u Srbiji (IZuS) i muftija dr. Mevlud ef. Dudić izjavio je u intervjuu za Anadolu Agency (AA) kako je uvjeren da će doći do vraćanja jedinstva među muslimanima i unutar Islamske zajednice na prostoru Sandžaka, te da je njegova ruka pružena za povratak svih u okrilje Islamske zajednice u Srbiji nakon skoro sedam godina podjela u tom dijelu Srbije. Dudić je za predsjednika Mešihata IZ u Srbiji izabran 4. januara, a zvanična inauguracija će biti obavljena u prvoj polovini februara. Kako se očekuje, prisustvovat će joj i reisu-l-ulema Islamske zajednice u Srbiji Husein ef. Kavazović koji će i zvanično promovirati Dudića u novog prvog čovjeka IZ u Srbiji. Dudić će danas boraviti u prvoj zvaničnoj posjeti reisu Kavazoviću, što je njegov privi simbolični potez nakon imenovanja. 
    Expected language: bs
    Predicted language: hr
  Misclassification: 
    Text: berdiri setelah pengurusnya yang berusia 83 tahun, Fayzrahman Satarov, mendeklarasikan diri sebagai nabi dan rumahnya sebagai negara Islam Satarov digambarkan sebagai mantan ulama Islam  tahun 1970-an. Pengikutnya didorong membaca manuskripnya dan kebanyakan dilarang meninggalkan tempat persembunyian bawah tanah di dasar gedung delapan lantai mereka. Jaksa membuka penyelidikan kasus kriminal pada kelompok itu dan menyatakan akan membubarkan kelompok kalau tetap melakukan kegiatan ilegal seperti mencegah anggotanya mencari bantuan medis atau pendidikan. Sampai sekarang pihak berwajib belum melakukan penangkapan meskipun polisi mencurigai adanya tindak kekerasan pada anak. Pengadilan selanjutnya akan memutuskan apakah anak-anak diizinkan tetap tinggal dengan orang tua mereka. Kazan yang berada sekitar 800 kilometer di timur Moskow merupakan wilayah Tatarstan yang
    Expected language: id
    Predicted language: ms
  Failure: 2 wrong predictions

Is this currently expected, or can an incorrect build configuration (or dependencies) cause this?

I built using the makefile from #5, and the protobuf version is 3.1 / 3.3.2 (tested on two different versions).

Add support to release linux aarch64 wheels

Problem

On aarch64, pip install gcld3 builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system, and building the wheel takes more time than downloading and extracting one from PyPI.

Resolution

On aarch64, pip install gcld3 should download a prebuilt wheel from PyPI.

@jasonriesa, Please let me know your interest in releasing aarch64 wheels. I can help with this.
