Compact Language Detector v3 (CLD3)

Model

CLD3 is a neural network model for language identification. This package contains the inference code and a trained model. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
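The fraction computation for a single ngram order can be sketched as follows (a minimal illustration; CLD3's real extractor additionally hashes each ngram down to a small id range):

```python
from collections import Counter

def ngram_fractions(text, n):
    """Fraction of n-gram positions occupied by each distinct n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

# "banana" yields 4 trigram positions: ban, ana, nan, ana
fracs = ngram_fractions("banana", 3)  # fracs["ana"] == 0.5, i.e. 2/4
```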

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer. The remaining components of the network are a hidden (Rectified linear) layer and a softmax layer.

To get a language prediction for the input text, we simply perform a forward pass through the network.
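The whole pipeline above can be sketched in a few lines of Python. This is a toy illustration only: the weights below are random placeholders (the real model's are trained), a single ngram order is used (the real model concatenates averaged embeddings for several orders), and CRC32 merely stands in for the actual hashing.

```python
import math
import random
import zlib

random.seed(0)
BUCKETS, DIM, CLASSES = 16, 4, 3  # toy sizes; the real model is larger

# Random placeholder parameters standing in for trained weights.
emb = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]
W_h = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_o = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(CLASSES)]

def forward(text, n=3):
    # 1. Extract n-grams and the fraction of positions each occupies.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    frac = {g: grams.count(g) / len(grams) for g in set(grams)}
    # 2. Hash each n-gram to a bucket and average embeddings by fraction.
    avg = [0.0] * DIM
    for g, f in frac.items():
        e = emb[zlib.crc32(g.encode()) % BUCKETS]
        avg = [a + f * x for a, x in zip(avg, e)]
    # 3. Hidden rectified-linear layer.
    hid = [max(0.0, sum(w * x for w, x in zip(row, avg))) for row in W_h]
    # 4. Softmax layer over the language classes.
    logits = [sum(w * x for w, x in zip(row, hid)) for row in W_o]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

probs = forward("banana")  # one probability per (toy) language class
```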

[Figure omitted]

Supported Languages

The model outputs BCP-47-style language codes, shown in the table below. For some languages, output is differentiated by script. Language and script names are from Unicode CLDR.

Output Code Language Name Script Name
af Afrikaans Latin
am Amharic Ethiopic
ar Arabic Arabic
bg Bulgarian Cyrillic
bg-Latn Bulgarian Latin
bn Bangla Bangla
bs Bosnian Latin
ca Catalan Latin
ceb Cebuano Latin
co Corsican Latin
cs Czech Latin
cy Welsh Latin
da Danish Latin
de German Latin
el Greek Greek
el-Latn Greek Latin
en English Latin
eo Esperanto Latin
es Spanish Latin
et Estonian Latin
eu Basque Latin
fa Persian Arabic
fi Finnish Latin
fil Filipino Latin
fr French Latin
fy Western Frisian Latin
ga Irish Latin
gd Scottish Gaelic Latin
gl Galician Latin
gu Gujarati Gujarati
ha Hausa Latin
haw Hawaiian Latin
hi Hindi Devanagari
hi-Latn Hindi Latin
hmn Hmong Latin
hr Croatian Latin
ht Haitian Creole Latin
hu Hungarian Latin
hy Armenian Armenian
id Indonesian Latin
ig Igbo Latin
is Icelandic Latin
it Italian Latin
iw Hebrew Hebrew
ja Japanese Japanese
ja-Latn Japanese Latin
jv Javanese Latin
ka Georgian Georgian
kk Kazakh Cyrillic
km Khmer Khmer
kn Kannada Kannada
ko Korean Korean
ku Kurdish Latin
ky Kyrgyz Cyrillic
la Latin Latin
lb Luxembourgish Latin
lo Lao Lao
lt Lithuanian Latin
lv Latvian Latin
mg Malagasy Latin
mi Maori Latin
mk Macedonian Cyrillic
ml Malayalam Malayalam
mn Mongolian Cyrillic
mr Marathi Devanagari
ms Malay Latin
mt Maltese Latin
my Burmese Myanmar
ne Nepali Devanagari
nl Dutch Latin
no Norwegian Latin
ny Nyanja Latin
pa Punjabi Gurmukhi
pl Polish Latin
ps Pashto Arabic
pt Portuguese Latin
ro Romanian Latin
ru Russian Cyrillic
ru-Latn Russian Latin
sd Sindhi Arabic
si Sinhala Sinhala
sk Slovak Latin
sl Slovenian Latin
sm Samoan Latin
sn Shona Latin
so Somali Latin
sq Albanian Latin
sr Serbian Cyrillic
st Southern Sotho Latin
su Sundanese Latin
sv Swedish Latin
sw Swahili Latin
ta Tamil Tamil
te Telugu Telugu
tg Tajik Cyrillic
th Thai Thai
tr Turkish Latin
uk Ukrainian Cyrillic
ur Urdu Arabic
uz Uzbek Latin
vi Vietnamese Latin
xh Xhosa Latin
yi Yiddish Hebrew
yo Yoruba Latin
zh Chinese Han (including Simplified and Traditional)
zh-Latn Chinese Latin
zu Zulu Latin

Installation

CLD3 is designed to run in the Chrome browser, so it relies on code in Chromium. The steps for building and running the demo of the language detection model are:

  • Check out the Chromium repository.
  • Copy the code to //third_party/cld_3.
  • Uncomment the language_identifier_main executable in src/BUILD.gn.
  • Build and run the model using the commands:
gn gen out/Default
ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
out/Default/language_identifier_main

Bugs and Feature Requests

Open a GitHub issue for this repository to file bugs and feature requests.

Announcements and Discussion

For announcements regarding major updates as well as general discussion, please subscribe to: [email protected]

Credits

Original authors of the code in this package include (in alphabetical order):

  • Alex Salcianu
  • Andy Golding
  • Anton Bakalov
  • Chris Alberti
  • Daniel Andor
  • David Weiss
  • Emily Pitler
  • Greg Coppola
  • Jason Riesa
  • Kuzman Ganchev
  • Michael Ringgaard
  • Nan Hua
  • Ryan McDonald
  • Slav Petrov
  • Stefan Istrate
  • Terry Koo

cld3's Issues

Easy examples yield funny results

Code to reproduce:

import gcld3
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=10000)
results = detector.FindTopNMostFreqLangs(text=sample_text, num_langs=2)
print(sample_text)
for result in results:
    print(result.language, result.is_reliable, result.probability, result.proportion)

Weird results:
tus ojos me hace sentir
lt True 0.786892831325531 1.0 # 🤖😬🤣
und False 0.0 0.0

sin red y voy a mil
af True 0.8103252649307251 1.0 # y is not in afrikaans
und False 0.0 0.0

yo te veo pero tu no ves
ja-Latn True 0.9469742178916931 1.0 # japanese, really? these are the most basic spanish words
und False 0.0 0.0

aunque no me veas, mirame
de True 0.9972571730613708 1.0 # no and me are very simple words that are not German
und False 0.0 0.0

esta al reves
eo True 0.7365820407867432 1.0 # in Esperanto there's no word ending with -es
und False 0.0 0.0

aunque no veas
de True 0.9875902533531189 1.0 # no and me are very simple words that are not German
und False 0.0 0.0

Non-ISO 639-* language codes

There are some languages that are not part of ISO 639-1; they are named correctly according to the next available inclusive standard (ISO 639-2 or ISO 639-3). However, there are two incorrect or incorrectly named languages:

  1. The Hebrew language code is usually 'he', not 'iw'.
  2. 'Filipino' is not 'fil' but 'Tagalog', 'tl'.

Fixing these languages should be as easy as writing a wrapper.
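A minimal sketch of such a wrapper, mapping the two codes mentioned above (iw → he, fil → tl) while leaving script suffixes like -Latn intact. The mapping table and function name here are illustrative, not part of CLD3:

```python
# Hypothetical wrapper normalizing CLD3's legacy codes to the usual ISO 639-1 ones.
LEGACY_TO_ISO = {"iw": "he", "fil": "tl"}

def normalize_code(cld3_code):
    """Map a CLD3 output code to its usual ISO 639-1 equivalent."""
    base, sep, script = cld3_code.partition("-")  # keep e.g. "-Latn" suffixes
    return LEGACY_TO_ISO.get(base, base) + sep + script
```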

request for documentation: how to add a new language

Apologies for overlooking this, but are there any instructions on how to add a new language? In my case, this is about multiple low-resource languages, some without ISO639-3 codes. I have training data at hand but would welcome a howto or a few pointers on where to start. Thanks a lot!

Import gcld3 fails

Hi! I'm trying to use gcld3 for language detection on MacOS (12.6.1). I have installed the latest protobuf through Homebrew and am trying to import gcld3 in a Jupyter notebook, but the cell just gets stuck processing indefinitely. Is there anything I'm missing here? Why can't I import the library?

Thanks!

Request for a branch or tag

Would it be possible to create a simple tag ('0.1', 'alpha', 'beta', ... or whatever) to easily identify the current version/status of the master branch?

Installation instructions unclear

1- What does "check out the Chromium repository." mean, should I just look at it?
2- "copy the code to //third_party/cld_3", what code, and where is this folder supposed to be? cd // in my shell sends me to /. Should I have a directory with path /third_party/cld_3?
3- "Uncomment language_identifier_main executable in src/BUILD.gn" why isn't it already uncommented?
4- Why does the repo have a CMakeLists.txt but the users have to follow these weird "build" instructions?

This whole language detection thing sounds pretty awesome but I was very disappointed to find these instructions.

Thanks.

Concurrency issues

What is the right way of using NNetLanguageIdentifier concurrently with threads?

I thought that if I create a separate instance for each thread it should be OK, but I start getting access-violation exceptions in the NNetLanguageIdentifier constructor when threads run concurrently. I was able to solve that by adding a global lock, but I wonder what the best way to use the code concurrently is. Maybe creating a single instance of NNetLanguageIdentifier and sharing it between threads would be OK? Thoughts?
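One common pattern, sketched below with Python's threading.local, is to create one identifier per thread lazily; the helper name and factory argument are hypothetical, not part of gcld3. If construction itself turns out not to be thread-safe, serializing only construction under a lock while keeping per-thread instances for inference is a reasonable compromise:

```python
import threading

_tls = threading.local()

def get_detector(factory):
    """Return this thread's detector instance, creating it lazily."""
    if not hasattr(_tls, "detector"):
        _tls.detector = factory()
    return _tls.detector

# Usage with gcld3 (assuming it is installed) would look like:
#   det = get_detector(lambda: gcld3.NNetLanguageIdentifier(0, 1000))
#   det.FindLanguage(text=some_text)
```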

Support for libcld3 installation

In order to include this library in another C++ program, the libcld3.a and include files need to be installed. However, the CMake build does not support this right now. I was able to add that support by appending the following code to the CMakeLists.txt file:

install(DIRECTORY include/ DESTINATION include)
install(TARGETS ${PROJECT_NAME} 
    ARCHIVE DESTINATION lib
    LIBRARY DESTINATION lib
    RUNTIME DESTINATION bin)

You will need to create an /include directory with all the *.h files. Then you can compile and install the library as follows:

cd cld3
mkdir build.release
cd build.release
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target install

Using cld3 in your program is pretty simple:

#include <string>

#include <nnet_language_identifier.h>

std::string get_language(const std::string &str)
{
    if (str.empty()) return "";
    chrome_lang_id::NNetLanguageIdentifier lang_id;
    const chrome_lang_id::NNetLanguageIdentifier::Result result = lang_id.FindLanguage(str);
    return result.is_reliable ? result.language : "";
}

Hope this helps anyone who might be stuck.

Expose Span Information from FindTopNMostFrequentLangs

When calling FindTopNMostFrequentLangs(text,num_langs), it would be helpful to know the ranges of text that each result applies to. For example, if you had the string "Hello, my name is 三船 敏郎. It's a pleasure to meet you.", it would be helpful to know that English applies to indices 0-16 and 24-52, while Japanese applies to indices 17-23. I propose the following:

  1. Add vector<pair<int,int>> to LangChunkStats that keeps track of ranges of text the language applies to. The vector can be populated using the script_span.offset and script_span.text_bytes.
  2. Add the vector to Result when populating results vector.

These small changes would give the caller more detailed information about the language of each section of text, if there are multiple languages detected.
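As a rough illustration of the proposed bookkeeping, given (language, offset, text_bytes) triples such as script_span would supply, the per-language ranges could be collected like this (the names and shapes here are hypothetical, not the actual CLD3 structures):

```python
from collections import defaultdict

def language_ranges(spans):
    """Collect inclusive (start, end) byte ranges per detected language.

    spans: iterable of (language, offset, text_bytes) triples.
    """
    ranges = defaultdict(list)
    for lang, offset, nbytes in spans:
        ranges[lang].append((offset, offset + nbytes - 1))
    return dict(ranges)

# The example from the issue: English at 0-16 and 24-52, Japanese at 17-23.
result = language_ranges([("en", 0, 17), ("ja", 17, 7), ("en", 24, 29)])
```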

Support for Unit Testing

The CMakeLists.txt file is missing support for building the unit tests to validate all the languages. You can append the following code to create a language_id_test executable that will test each of the supported languages and report any errors:

add_executable(language_id_test src/nnet_lang_id_test.cc src/nnet_lang_id_test_data.cc)
target_link_libraries(language_id_test cld3 ${Protobuf_LITE_LIBRARIES})

Bad identification for short input

Hi,
Could anyone confirm that short inputs are usually not correctly identified by CLD3?
Some examples:

text: Hello
language: sr
probability: 0.830728
reliable: 1
proportion: 1

text: Hello world
language: ky
probability: 0.719188
reliable: 1
proportion: 1

text: Hello my world
language: ky
probability: 0.521224
reliable: 0
proportion: 1

text: Hello my great world
language: ja
probability: 0.278577
reliable: 0
proportion: 1

text: Hello the great world of Artificial Intelligence
language: en
probability: 0.980107
reliable: 1
proportion: 1

Swedish detected when all text is arabic

ruby تَخيّل أنّ الله بِعظمتِه يُحِبُك نَاقِلة وبَاحِثة للمَواضِيع التَرفيهية والمُفيدة بَعض الثِريد ز تَعود لصَاحبِيها الأصلِيين حُقوقية عَربية مُسلِمة أسعَى لأن أترُكَ أثَراً للأجَر لعضُو تكست يونجون 𓂅 أوّل قناة بالتِيلي لنَشر مُتتَاليات يونجوُن صَور مفَلترة فخَمة مَعلومِات عنهه كلهِة هنا مُنو باعتِقادك يستحِق لقِب فرقةه ممُهده الطريق فَي الكيبوب بايّسك من txt ٌ مُتتاليات الفرقِةه الصَاعدة تكِست هَلقناة مُختصةه فقط لنِشر عن كُل ما يخص مِلوك الجَيل الرَابع تمورو باي توقذر مُتتاليات أيديتِ ز أيكونز وصور عِرض فِخمةه كل ذا تلقِاها بهِل قناة جَست للاِشتراك لِستةه ودعِم لالِيسون َ محُتوى قنواتِ الكيبوب والانِمي القِبول اقِل مَ اقبَل الزيَادة و أكثَر لتِموَرو بآي تَوقيذِر 𓂅 كلهِة هنا 彡 تحَبيِن تكست 𓂅 أعلاَن حقيِقي أكبَر قناَه لہ فخَر الروكيز تكسِت ̧ مُتتاليات ونشِر مُرتب وكلشِيء يخصهم 𓏲 فاَن لتكسِت أشتركِ 彡 تعَرف تصَمم يِن ᤣ كليِشةه حقيِقيةه أوُل قناَه لہ تعليِم تصاميِم و ايديِتز بسهُوله وشرح مَفهوم ̧ تحبِ تصممِ أدخلِي هناَ

It is detected as sv by the FindLanguage function. However, if you remove the first word, "ruby", it begins to be detected as Arabic, but still with a pretty high sv score:

Lang Id, name, score
49 sv 15.801
88 ar 13.1218

To me this seems quite unexpected :)

Training set

Is the training set that was used to train CLD3 available somewhere? Alternatively, and a bit off-topic, is there a sort of standard or very good dataset used for writing a language classifier like this one?

Korean is detected as many different languages with some symbols

[1] pry(main)> identifier =  CLD3::NNetLanguageIdentifier.new(1, 2048)
=> #<CLD3::NNetLanguageIdentifier:0x0000557ad8972f68
 @cc=#<CLD3::Unstable::NNetLanguageIdentifier::Pointer address=0x0000557ad8014870>>
[2] pry(main)> identifier.find_language('안녕하세요')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[3] pry(main)> identifier.find_language('A: 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[4] pry(main)> identifier.find_language('A. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[5] pry(main)> identifier.find_language('Q. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[6] pry(main)> identifier.find_language('"안녕하세요"')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[7] pry(main)> identifier.find_language('Q:안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[8] pry(main)> identifier.find_language('A. 코스프레?')
=> #<struct Struct::Result language=:zh, probability=0.27146071195602417, reliable?=false, proportion=1.0>
[9] pry(main)> identifier.find_language('A. 코스프레?\n마녀 하고 싶어요')
=> #<struct Struct::Result language=:ne, probability=0.9822306632995605, reliable?=true, proportion=1.0>

Korean uses a specialized character set called Hangul (한글), so 1-gram-based detection should succeed at almost a 100% rate, but it is detected as zh, ne, hi, etc.
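As a workaround, a caller can pre-filter for Hangul before trusting the model on short mixed strings; the Hangul Syllables block is U+AC00–U+D7A3. This helper is a hypothetical sketch, not part of CLD3:

```python
def hangul_fraction(text):
    """Fraction of non-space characters in the Hangul Syllables block."""
    letters = [c for c in text if not c.isspace()]
    if not letters:
        return 0.0
    return sum('\uac00' <= c <= '\ud7a3' for c in letters) / len(letters)

# A caller could short-circuit to "ko" when this fraction dominates,
# instead of letting leading symbols like "A." or "Q:" sway the model.
```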

fatal error C1083: Cannot open include file: 'google/protobuf/port_def.inc': No such file or directory

I am having the following error when running pip install gcld3:

ERROR:

\AppData\Local\Temp\pip-install-7_v9ujss\gcld3_9a116eb59c8049b5a46f1c8cf8ca323d\src\cld_3/protos/feature_extractor.pb.h(10): fatal error C1083: Cannot open include file: 'google/protobuf/port_def.inc': No such file
or directory

System information:

  • Windows 10
  • Python 3.8
  • protobuf==3.12.2
  • libprotoc 3.12.1

Notes:
I also tried with Python 3.7 as well as using the most recent version of protobuf==3.15.8

cmake: error: 'google/protobuf/stubs/common.h' file not found

As for #16 I'm trying to build the static library:

ip-192-168-1-105:AI loretoparisi$ git clone https://github.com/google/cld3.git
Cloning into 'cld3'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 429 (delta 6), reused 8 (delta 3), pack-reused 413
Receiving objects: 100% (429/429), 2.88 MiB | 481.00 KiB/s, done.
Resolving deltas: 100% (295/295), done.

and then building with Cmake

ip-192-168-1-105:AI loretoparisi$ cd cld3/
ip-192-168-1-105:cld3 loretoparisi$ mkdir build
ip-192-168-1-105:cld3 loretoparisi$ cd build/
ip-192-168-1-105:build loretoparisi$ cmake ..
-- The C compiler identification is AppleClang 10.0.0.10001145
-- The CXX compiler identification is AppleClang 10.0.0.10001145
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Found Protobuf: /usr/local/lib/libprotobuf.dylib (found version "3.5.1") 
-- Protobuf_FOUND= TRUE
-- Protobuf_VERSION= 3.5.1
CMake Warning at CMakeLists.txt:11 (message):
  Protobuf 2.5 and CLD3 seems happy together.  This script does NOT check if
  your verison of protobuf is compatible.
-- Protobuf_LIBRARIES= /usr/local/lib/libprotobuf.dylib
-- Protobuf_LITE_LIBRARIES= /usr/local/lib/libprotobuf-lite.dylib
-- PROTO_HDRS= /Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.h;/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/sentence.pb.h;/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/task_spec.pb.h
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/loretoparisi/Documents/Projects/AI/cld3/build

Now running make

ip-192-168-1-105:build loretoparisi$ make
[  2%] Running C++ protocol buffer compiler on src/task_spec.proto
[  5%] Running C++ protocol buffer compiler on src/feature_extractor.proto
[  8%] Running C++ protocol buffer compiler on src/sentence.proto
Scanning dependencies of target cld3
[ 10%] Building CXX object CMakeFiles/cld3.dir/cld_3/protos/feature_extractor.pb.cc.o
In file included from /Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.cc:4:
/Users/loretoparisi/Documents/Projects/AI/cld3/build/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
#include <google/protobuf/stubs/common.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make[2]: *** [CMakeFiles/cld3.dir/cld_3/protos/feature_extractor.pb.cc.o] Error 1
make[1]: *** [CMakeFiles/cld3.dir/all] Error 2
make: *** [all] Error 2
ip-192-168-1-105:build lor

Since I'm on macOS I have tried a protobuf update

Error: protobuf 3.5.1_1 is already installed
To upgrade to 3.7.0, run `brew upgrade protobuf`

After upgrading to protobuf 3.7.0 I get the same very error.

gcld3 compatible with pylint and mypy

It would be nice to make the python library gcld3 compatible with pylint and mypy by defining the types of the module.
It seems completely lost by pybind.

Thanks in advance,
Loic

Spanish manual language problems detection

Hi cld3 team!
Thank you so much for this development, it is so useful!
I have used your language detection package via R (see code) and then done some manual tagging for Spanish (see the "human" column in this csv) and have found some things that might be interesting, but I am unsure of how to make them useful for you.
For instance, related to this issue, from a list of conference titles, those in "Spanglish" got tagged as English w/cld2 and as Spanish with cld3.
Also, while cld3 got real better at distinguishing Galician from Spanish there is still one case in which it got this tag wrong: "SAMEBibl: Sistema Automático de Migración a Europeana para Bibliotecas" (should be Spanish)
Hope this is somewhat useful :)

Relation among CLD2 Score and CLD3 Accuracy

In my project I have to port the language detector from CLD2 to CLD3. CLD2 has a concept of a Score and a Percentage of some language in the text. Internally, the Score is calculated from a probability (not exposed, in my understanding) in some way; my assumption was from the field textBytes that represents the size in bytes of the text, the accuracy, and the distribution of each label in the text, something like Acc = 1 - textBytes/Score.
In CLD2 the function that normalizes these scores is

normalized_score3[2] = GetNormalizedScore(language3[2],
                                                  ULScript_Common,
                                                  bytecount3,
                                                  doc_tote->Score(2));

That said, since I need to upgrade to CLD3, I have at some point to convert from CLD2 Score to CLD3 accuracy value. Any hint how to achieve that?

Here for reference:
dachev/node-cld#52

Not able to import the cld3 Python module on Mac

Installed protobuf
Installed https://github.com/Elizafox/cld3

On import I get the following error:
ImportError: dlopen(/Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so, 2): Symbol not found: __ZN6google8protobuf2io18StringOutputStreamC1EPSs
Referenced from: /Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so
Expected in: flat namespace
in /Users/karan.kothari/anaconda/lib/python3.5/site-packages/cld3.cpython-35m-darwin.so

Also Tried to install https://github.com/jbaiter/cld3
But get the following build error

Collecting git+https://github.com/jbaiter/cld3.git
Cloning https://github.com/jbaiter/cld3.git to /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1
Building wheels for collected packages: cld3
Running setup.py bdist_wheel for cld3 ... error
Complete output from command /Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-wheel-53bhx___ --python-tag cp35:
/Users/karan.kothari/anaconda/lib/python3.5/distutils/extension.py:132: UserWarning: Unknown Extension options: 'include_paths'
warnings.warn(msg)
running bdist_wheel
running build
running build_ext
building 'cld3' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.5
creating build/temp.macosx-10.7-x86_64-3.5/src
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3/protos
creating build/temp.macosx-10.7-x86_64-3.5/src/script_span
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/karan.kothari/anaconda/include -arch x86_64 -Isrc -I/Users/karan.kothari/anaconda/include/python3.5m -c src/cld3.cpp -o build/temp.macosx-10.7-x86_64-3.5/src/cld3.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/embedding_feature_extractor.h:23:0,
from src/nnet_language_identifier.h:22,
from src/cld3.cpp:593:
src/feature_extractor.h:45:47: fatal error: cld_3/protos/feature_extractor.pb.h: No such file or directory
#include "cld_3/protos/feature_extractor.pb.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1


Failed building wheel for cld3
Running setup.py clean for cld3
Failed to build cld3
Installing collected packages: cld3
Found existing installation: cld3 0.2.2
Uninstalling cld3-0.2.2:
Successfully uninstalled cld3-0.2.2
Running setup.py install for cld3 ... error
Complete output from command /Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-record-u49dl8xm/install-record.txt --single-version-externally-managed --compile:
/Users/karan.kothari/anaconda/lib/python3.5/distutils/extension.py:132: UserWarning: Unknown Extension options: 'include_paths'
warnings.warn(msg)
running install
running build
running build_ext
building 'cld3' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.5
creating build/temp.macosx-10.7-x86_64-3.5/src
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3
creating build/temp.macosx-10.7-x86_64-3.5/src/cld_3/protos
creating build/temp.macosx-10.7-x86_64-3.5/src/script_span
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/karan.kothari/anaconda/include -arch x86_64 -Isrc -I/Users/karan.kothari/anaconda/include/python3.5m -c src/cld3.cpp -o build/temp.macosx-10.7-x86_64-3.5/src/cld3.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/embedding_feature_extractor.h:23:0,
from src/nnet_language_identifier.h:22,
from src/cld3.cpp:593:
src/feature_extractor.h:45:47: fatal error: cld_3/protos/feature_extractor.pb.h: No such file or directory
#include "cld_3/protos/feature_extractor.pb.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1

----------------------------------------

Rolling back uninstall of cld3
Command "/Users/karan.kothari/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-record-u49dl8xm/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/q4/539_wsw970984yk_xfmvxdrwwhmd5b/T/pip-req-build-64de67b1/

Can't install gcld3 on MacOS Ventura 13.2.1

This looks to be extremely common, but after trying dozens of solutions, I'm still getting an error when installing gcld3.

First, here's the complete error output. I'm using the same recommended format that solved Issue #49 in this example:

coreynorthcutt@Coreys-MacBook-Pro mixedlang % CPATH=/opt/homebrew/include pip3 install gcld3
WARNING: Skipping /usr/local/lib/python3.11/site-packages/six-1.16.0-py3.11.egg-info due to invalid metadata entry 'name'
Collecting gcld3
  Using cached gcld3-3.0.13.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gcld3
  Building wheel for gcld3 (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      /usr/local/lib/python3.11/site-packages/setuptools/__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated. Requirements should be satisfied by a PEP 517 installer. If you are using pip, you can try `pip install --use-pep517`.
        dist.fetch_build_eggs(dist.setup_requires)
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-13-x86_64-cpython-311
      creating build/lib.macosx-13-x86_64-cpython-311/gcld3
      copying gcld3/__init__.py -> build/lib.macosx-13-x86_64-cpython-311/gcld3
      running build_ext
      building 'gcld3.pybind_ext' extension
      creating build/temp.macosx-13-x86_64-cpython-311
      creating build/temp.macosx-13-x86_64-cpython-311/gcld3
      creating build/temp.macosx-13-x86_64-cpython-311/src
      creating build/temp.macosx-13-x86_64-cpython-311/src/cld_3
      creating build/temp.macosx-13-x86_64-cpython-311/src/cld_3/protos
      creating build/temp.macosx-13-x86_64-cpython-311/src/script_span
      clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -stdlib=libc++ -I/private/var/folders/0r/1w5d1pkx3ql47d5jx_jqdnd40000gn/T/pip-install-vuwi41an/gcld3_476945d9d6e345219101ad8b4486ddb6/.eggs/pybind11-2.10.3-py3.11.egg/pybind11/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c gcld3/pybind_ext.cc -o build/temp.macosx-13-x86_64-cpython-311/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
      In file included from gcld3/pybind_ext.cc:5:
      In file included from gcld3/../src/nnet_language_identifier.h:22:
      In file included from gcld3/../src/embedding_feature_extractor.h:23:
      In file included from gcld3/../src/feature_extractor.h:45:
      gcld3/../src/cld_3/protos/feature_extractor.pb.h:10:10: fatal error: 'google/protobuf/port_def.inc' file not found
      #include <google/protobuf/port_def.inc>
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for gcld3
  Running setup.py clean for gcld3
Failed to build gcld3

The error with clang stands out: error: command '/usr/bin/clang' failed with exit code 1

This particular error appears no matter how I try to install cld3. I get varying preceding error output, e.g. if I try to pip install pycld3 instead... but the clang error is always there.

I've so far...

  • done a brew update and brew upgrade.
  • verified that Pip, Python, Protoc, Wheel, and Setuptools all have the latest versions.
  • tried testing in a virtual environment (as some recommended in other threads).
  • tried appending the --use-pep517 flag as the warning above suggests.
  • tried setting a million different ENV variables in my local session (every thread seems to suggest something different)

Support for Kurdish (Sorani)

Central Kurdish (Sorani) is a Kurdish language with several million native speakers in Iraq and Iran. The code (ckb) is assigned to the language, and it uses the Arabic script.

Also, I suggest renaming the current Kurdish entry on the list to Kurdish (Kurmanji), in order to identify both Kurdish languages easily.

Train a new model

Hi, does anyone know how to train a new model on custom data? There isn't any documentation...

Traditional Chinese support

Currently the model detects both zh-hans and zh-hant as just zh:

import cld3

documents = [
    "把中坛元帅固定在神轿",
    "親愛的牽起你的手 並將他們放在我手中",
]

for i, d in enumerate(documents):
    print(i, cld3.get_language(d))

will output:

0 LanguagePrediction(language='zh', probability=0.9999771118164062, is_reliable=True, proportion=1.0)
1 LanguagePrediction(language='zh', probability=0.9999103546142578, is_reliable=True, proportion=1.0)

while the first sentence is zh-hans and the second is zh-hant.

issue about detect incorrectlly

I try with many text of korean but CLD3 is unable to detect it.

for example:
Korean text: "이 회의에서는 업계 전반의" => output: vi => should be ko

English text: "hello world" => output: ky => should be en

How can CLD3 be made to detect languages more accurately?

Thank you very much.
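One pragmatic guard, short of retraining, is to reject predictions whose usual script never appears in the input: vi is written in the Latin script, yet the Korean sample contains no Latin letters, and ky is written in Cyrillic, yet "hello world" contains none. The sketch below is a hypothetical illustration (the EXPECTED_SCRIPT table is an assumption), not part of CLD3:

```python
import unicodedata

# Hypothetical sanity check (not part of cld3): a prediction is suspect if
# the script its language is normally written in never occurs in the input.
# EXPECTED_SCRIPT is a tiny illustrative mapping from language code to the
# keyword that appears in Unicode character names for that script.
EXPECTED_SCRIPT = {"en": "LATIN", "vi": "LATIN", "ky": "CYRILLIC", "ko": "HANGUL"}

def script_consistent(text: str, lang: str) -> bool:
    expected = EXPECTED_SCRIPT.get(lang)
    if expected is None:
        return True  # no expectation recorded for this language: do not veto
    return any(expected in unicodedata.name(ch, "") for ch in text)

print(script_consistent("이 회의에서는 업계 전반의", "vi"))  # False: no Latin letters
print(script_consistent("hello world", "ky"))                # False: no Cyrillic letters
print(script_consistent("이 회의에서는 업계 전반의", "ko"))  # True
```

Predictions that fail this check can be treated as unreliable regardless of their reported probability.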

Chromium Dependency

Is it possible to avoid the dependency on Chromium? Is it needed only for the protobuf dependency? If so, it would be worthwhile to add external libraries and a Makefile so that cld3 can be compiled directly, without the Chromium overhead...

ImportError (Python3.9, protobuf 21.5)

I installed gcld3 using pip, and it worked until protobuf was updated in Homebrew.
Now import gcld3 no longer works.
Essentially, gcld3's extension tries to load libprotobuf.30.dylib, but I have libprotobuf.32.dylib (and renaming it to libprotobuf.30.dylib does not work).
What would be a solution to this problem?

Error Message:

Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/__init__.py", line 1, in <module>
    from .pybind_ext import *
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ImportError: dlopen(/Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/pybind_ext.cpython-39-darwin.so, 2): Library not loaded: /usr/local/opt/protobuf/lib/libprotobuf.30.dylib
  Referenced from: /Volumes/MacintoshHDD/Programming/PycharmProjects/_venvs/rhymedict-crawlers/lib/python3.9/site-packages/gcld3/pybind_ext.cpython-39-darwin.so
  Reason: image not found

"https" makes French be identified as English

I used a simple JavaCPP adapter:
https://github.com/bytedeco/javacpp

import org.bytedeco.javacpp.Loader;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.annotation.Platform;

@Platform(
    include = {"LangDetect.h"},
    link = {"cLangDetect"}
)
public class LangDetect extends Pointer {
    private native void detect(String var1, int var2, LangData var3);

    private native void allocate();

    public LangDetect() {
        this.allocate();
    }

    public void detect(String str, LangData result) {
        this.detect(str, str.length(), result);
    }

    static {
        Loader.load();
    }
}

Test with http inside the URL:

object CLD2Example {
    @JvmStatic
    fun main(args: Array<String>) {
        val ldetect = LangDetect()
        val ld = LangData()
        ldetect.detect("Sampension,3ème caisse de retraite danoise\uD83C\uDDE9\uD83C\uDDF0,#BoycottIsrael,boycott 4 firmes liées à des colonies!\n" +
                "#GroupPalestine\n" +
                "⏩https://t.co/gPIEbpotvk https://t.co/P4TaPvcBdX ", ld)
        println(ld.getName(0))
        println(ld.getScore(0))
    }
}

output:
ENGLISH
0.99

Test without http inside the URL:

object CLD2Example {
    @JvmStatic
    fun main(args: Array<String>) {
        val ldetect = LangDetect()
        val ld = LangData()
        ldetect.detect("Sampension,3ème caisse de retraite danoise\uD83C\uDDE9\uD83C\uDDF0,#BoycottIsrael,boycott 4 firmes liées à des colonies!\n" +
                "#GroupPalestine\n" +
                "⏩://t.co/gPIEbpotvk ://t.co/P4TaPvcBdX ", ld)
        println(ld.getName(0))
        println(ld.getScore(0))
    }
}

output:
FRENCH
0.99
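One mitigation is to strip URL-like tokens before detection, so that English fragments such as https and t.co cannot dominate a short input. This is a hypothetical preprocessing sketch; the regex is deliberately loose and is not something cld3 ships:

```python
import re

# Hypothetical preprocessing sketch: drop URL-like tokens ("https://..."
# as well as the scheme-less "://..." form used in the second test above)
# before passing text to the language detector.
URL_RE = re.compile(r"(?:https?)?://\S+")

def strip_urls(text: str) -> str:
    return URL_RE.sub(" ", text)

tweet = ("Sampension, 3ème caisse de retraite danoise, boycott 4 firmes "
         "liées à des colonies! https://t.co/gPIEbpotvk ://t.co/P4TaPvcBdX")
print(strip_urls(tweet))
```

With the URLs removed, only the natural-language portion of the tweet reaches the detector.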

Cannot install gcld3 on OS X

Hi all. I have protobuf 3.15.8 installed, but when I try to install gcld3 through pip, the protobuf headers aren't found when building from source:

$ pip install gcld3
Collecting gcld3
  Using cached gcld3-3.0.13.tar.gz (647 kB)
Building wheels for collected packages: gcld3
  Building wheel for gcld3 (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-wheel-jab5wg99
       cwd: /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/
  Complete output (25 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.9-x86_64-3.7
  creating build/lib.macosx-10.9-x86_64-3.7/gcld3
  copying gcld3/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/gcld3
  running build_ext
  building 'gcld3.pybind_ext' extension
  creating build/temp.macosx-10.9-x86_64-3.7
  creating build/temp.macosx-10.9-x86_64-3.7/gcld3
  creating build/temp.macosx-10.9-x86_64-3.7/src
  creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3
  creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3/protos
  creating build/temp.macosx-10.9-x86_64-3.7/src/script_span
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/lib/python3.7/site-packages/pybind11/include -I/Users/erip/miniconda3/envs/langid/include/python3.7m -c gcld3/pybind_ext.cc -o build/temp.macosx-10.9-x86_64-3.7/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
  In file included from gcld3/pybind_ext.cc:5:
  In file included from gcld3/../src/nnet_language_identifier.h:22:
  In file included from gcld3/../src/embedding_feature_extractor.h:23:
  In file included from gcld3/../src/feature_extractor.h:45:
  gcld3/../src/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
  #include <google/protobuf/stubs/common.h>
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for gcld3
  Running setup.py clean for gcld3
Failed to build gcld3
Installing collected packages: gcld3
    Running setup.py install for gcld3 ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-record-20khe1jh/install-record.txt --single-version-externally-managed --compile --install-headers /Users/erip/miniconda3/envs/langid/include/python3.7m/gcld3
         cwd: /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/
    Complete output (25 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.7
    creating build/lib.macosx-10.9-x86_64-3.7/gcld3
    copying gcld3/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/gcld3
    running build_ext
    building 'gcld3.pybind_ext' extension
    creating build/temp.macosx-10.9-x86_64-3.7
    creating build/temp.macosx-10.9-x86_64-3.7/gcld3
    creating build/temp.macosx-10.9-x86_64-3.7/src
    creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3
    creating build/temp.macosx-10.9-x86_64-3.7/src/cld_3/protos
    creating build/temp.macosx-10.9-x86_64-3.7/src/script_span
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/include -arch x86_64 -I/Users/erip/miniconda3/envs/langid/lib/python3.7/site-packages/pybind11/include -I/Users/erip/miniconda3/envs/langid/include/python3.7m -c gcld3/pybind_ext.cc -o build/temp.macosx-10.9-x86_64-3.7/gcld3/pybind_ext.o -std=c++11 -stdlib=libc++
    In file included from gcld3/pybind_ext.cc:5:
    In file included from gcld3/../src/nnet_language_identifier.h:22:
    In file included from gcld3/../src/embedding_feature_extractor.h:23:
    In file included from gcld3/../src/feature_extractor.h:45:
    gcld3/../src/cld_3/protos/feature_extractor.pb.h:9:10: fatal error: 'google/protobuf/stubs/common.h' file not found
    #include <google/protobuf/stubs/common.h>
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/erip/miniconda3/envs/langid/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"'; __file__='"'"'/private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-install-gsc_bqq7/gcld3_2a1eb194f3294e64ad0609ee476c9a1a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/v8/ls8m50ks3zz0hzq21681r8m40000gn/T/pip-record-20khe1jh/install-record.txt --single-version-externally-managed --compile --install-headers /Users/erip/miniconda3/envs/langid/include/python3.7m/gcld3 Check the logs for full command output.

Do you have any thoughts?

Did I hit a bug in gcld3?

I ran a few experiments to see what detection returns for text not in the supported-languages list. Usually whatever is detected is marked unreliable, so I can reject it, but I stumbled upon an example where the result is unexpectedly bad.

import gcld3
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
sample = "The last part of this text is pure gibberish with well crafted punctuation. Този текст е на Български. Sdslkmnscd scsun dc mcsaducsdnmlmc icmmklmdsc!"
result = detector.FindTopNMostFreqLangs(text=sample, num_langs=5)
for i in result:
    print(i.language, i.is_reliable, i.proportion, i.probability)

will surprisingly output this:
en True 0.4444444477558136 0.9999370574951172
bg True 0.28070175647735596 0.9173890948295593
hu True 0.27485379576683044 0.9084945917129517
und False 0.0 0.0
und False 0.0 0.0

For input with one part good text and one part garbage, the result depends on which part comes first and which has the larger proportion, but it can still be interpreted correctly; the example above, however, is quite bad.
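Defensive consumption of such output can be sketched as below (LanguagePrediction is recreated as a plain namedtuple so the snippet is self-contained, and the thresholds are assumptions). Notably, no reasonable probability threshold separates the correct bg (0.917) from the spurious hu (0.908) here, which is exactly what makes this example bad:

```python
from collections import namedtuple

# Stand-in for the result type returned by the gcld3 binding.
LanguagePrediction = namedtuple(
    "LanguagePrediction", "language probability is_reliable proportion")

def plausible(preds, min_probability=0.9, min_proportion=0.25):
    """Keep predictions that are reliable, confident, and cover a
    non-trivial share of the input. Thresholds are illustrative."""
    return [p for p in preds
            if p.is_reliable
            and p.probability >= min_probability
            and p.proportion >= min_proportion]

# The output shown above:
results = [
    LanguagePrediction("en", 0.9999370574951172, True, 0.4444444477558136),
    LanguagePrediction("bg", 0.9173890948295593, True, 0.28070175647735596),
    LanguagePrediction("hu", 0.9084945917129517, True, 0.27485379576683044),
    LanguagePrediction("und", 0.0, False, 0.0),
]
print([p.language for p in plausible(results)])        # ['en', 'bg', 'hu']
print([p.language for p in plausible(results, 0.95)])  # ['en'] - bg is lost too
```

Raising min_probability enough to drop the spurious hu also drops the correct bg, so thresholding alone cannot rescue this case.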

Increase the number of supported languages

Hi!
Do you have any plans to increase the number of supported languages up to 200-300?
Languages like Chuvash (chv), Mari (mhr), Hill Mari (mrj), and Komi (kpv), which have a presence on the web, are not included here, and hence are not in the multilingual C4 dataset.

Chinese text detected as Haitian Creole

>>> import gcld3
>>> ld = gcld3.NNetLanguageIdentifier(0, 50)
>>> res = ld.FindLanguage('污水')
>>> print(res.language, res.probability)
ht 0.7535305619239807
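Note that min_num_bytes is 0 in the snippet above, which disables the identifier's own minimum-length check; two CJK characters (six UTF-8 bytes) give the ngram model very little evidence. A guard along these lines can help; the thresholds below are assumptions, not CLD3 defaults:

```python
def trust(text: str, probability: float,
          min_bytes: int = 20, min_probability: float = 0.9) -> bool:
    """Illustrative guard: distrust predictions made on very short inputs
    or with low probability. Thresholds are assumptions, not CLD3 defaults."""
    if len(text.encode("utf-8")) < min_bytes:
        return False  # too little evidence for the ngram model
    return probability >= min_probability

print(trust("污水", 0.7535305619239807))  # False: only 6 bytes of input
```

Passing a non-zero min_num_bytes to NNetLanguageIdentifier achieves a similar effect inside the library itself.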

Calling detect_language_mixed on an empty string crashes the R session

To reproduce:

library(cld3)
detect_language_mixed("")

Expected: this produces some result

Actual: R crashes with the following messages:

> detect_language_mixed("")

 *** caught illegal operation ***
address 0x7f6807716e58, cause 'illegal operand'

Traceback:
 1: cld3_detect_language_mixed(as_string(text, vectorize = FALSE),     size)
 2: detect_language_mixed("")

Tests are failing

The language detection tests are failing: the model confuses Bosnian with Croatian and Indonesian with Malay. These language pairs are admittedly similar, but it would be good if the tests could be adapted so that they pass. Note that increasing max_num_bytes even to 100000 does not fix the test.

Below is the output of the test

Running TestPredictions
  Misclassification:
    Text: Novi predsjednik Mešihata Islamske zajednice u Srbiji (IZuS) i muftija dr. Mevlud ef. Dudić izjavio je u intervjuu za Anadolu Agency (AA) kako je uvjeren da će doći do vraćanja jedinstva među muslimanima i unutar Islamske zajednice na prostoru Sandžaka, te da je njegova ruka pružena za povratak svih u okrilje Islamske zajednice u Srbiji nakon skoro sedam godina podjela u tom dijelu Srbije. Dudić je za predsjednika Mešihata IZ u Srbiji izabran 4. januara, a zvanična inauguracija će biti obavljena u prvoj polovini februara. Kako se očekuje, prisustvovat će joj i reisu-l-ulema Islamske zajednice u Srbiji Husein ef. Kavazović koji će i zvanično promovirati Dudića u novog prvog čovjeka IZ u Srbiji. Dudić će danas boraviti u prvoj zvaničnoj posjeti reisu Kavazoviću, što je njegov privi simbolični potez nakon imenovanja.
    Expected language: bs
    Predicted language: hr
  Misclassification:
    Text: berdiri setelah pengurusnya yang berusia 83 tahun, Fayzrahman Satarov, mendeklarasikan diri sebagai nabi dan rumahnya sebagai negara Islam Satarov digambarkan sebagai mantan ulama Islam  tahun 1970-an. Pengikutnya didorong membaca manuskripnya dan kebanyakan dilarang meninggalkan tempat persembunyian bawah tanah di dasar gedung delapan lantai mereka. Jaksa membuka penyelidikan kasus kriminal pada kelompok itu dan menyatakan akan membubarkan kelompok kalau tetap melakukan kegiatan ilegal seperti mencegah anggotanya mencari bantuan medis atau pendidikan. Sampai sekarang pihak berwajib belum melakukan penangkapan meskipun polisi mencurigai adanya tindak kekerasan pada anak. Pengadilan selanjutnya akan memutuskan apakah anak-anak diizinkan tetap tinggal dengan orang tua mereka. Kazan yang berada sekitar 800 kilometer di timur Moskow merupakan wilayah Tatarstan yang
    Expected language: id
    Predicted language: ms
  Failure: 2 wrong predictions

Undefined symbol

I followed the instructions in the README up to the command ninja -C out/Default third_party/cld_3/src/src:language_identifier_main. I uncommented language_identifier_main and set the PATH variable, but when I run that command I get the following error. I have tried changing the generated files and updating gcc, but I am not sure why linking against the standard library is failing.

[306/306] LINK ./language_identifier_main
FAILED: language_identifier_main
>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>> referenced by language_identifier_main.cc:29 (../../third_party/cld_3/src/src/language_identifier_main.cc:29)
/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout
>>> referenced by language_identifier_main.cc
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:35 (../../third_party/cld_3/src/src/language_identifier_main.cc:35)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:36 (../../third_party/cld_3/src/src/language_identifier_main.cc:36)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by language_identifier_main.cc:37 (../../third_party/cld_3/src/src/language_identifier_main.cc:37)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

>>> referenced by string:1571 (../../buildtools/third_party/libc++/trunk/include/string:1571)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::cout
>>> referenced by language_identifier_main.cc
>>> referenced by language_identifier_main.cc:48 (../../third_party/cld_3/src/src/language_identifier_main.cc:48)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(bool)
>>> referenced by language_identifier_main.cc:49 (../../third_party/cld_3/src/src/language_identifier_main.cc:49)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(float)
>>> referenced by language_identifier_main.cc:50 (../../third_party/cld_3/src/src/language_identifier_main.cc:50)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()
>>> referenced by language_identifier_main.cc:54 (../../third_party/cld_3/src/src/language_identifier_main.cc:54)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(main)

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::ios_base::getloc() const
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::ctype<char>::id
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::locale::use_facet(std::__1::locale::id&) const
>>> referenced by __locale:212 (../../buildtools/third_party/libc++/trunk/include/__locale:212)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::locale::~locale()
>>> referenced by ios:756 (../../buildtools/third_party/libc++/trunk/include/ios:756)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::put(char)
>>> referenced by ostream:1001 (../../buildtools/third_party/libc++/trunk/include/ostream:1001)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: undefined symbol: std::__1::basic_ostream<char, std::__1::char_traits<char> >::flush()
>>> referenced by ostream:1002 (../../buildtools/third_party/libc++/trunk/include/ostream:1002)
>>>               obj/third_party/cld_3/src/src/language_identifier_main/language_identifier_main.o:(std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::endl<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&))

/home/myuser/chromium/src/out/Default/../../third_party/llvm-build/Release+Asserts/bin/ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

Python binding forks and different fixes

Update: CLD3 now has official Python bindings from Google themselves: gcld3

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3


This issue documents some Python binding forks, in the hope that fixes can be merged at the higher upstreams as much as possible:

Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name https://github.com/bsolomon1124/pycld3 by @bsolomon1124

Python Binding Documentation

(based on the documentation from https://github.com/Elizafox/cld3 )

Usage:

Here are some examples:

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> cld3.get_frequent_languages("This piece of text is in English. Този текст е на Български.", 5)
[LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592), LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)]

In short:

  • get_language returns the most likely language as the named tuple LanguagePrediction. proportion is always 1.0 when called this way.
  • get_frequent_languages returns the top guesses, up to the specified maximum (5 in the example above); the maximum is mandatory. proportion is set to the fraction of input bytes found to be in that language.

In the normal cld3 library, "und" may be returned as a language for unknown languages (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing will be returned. This also means, as a consequence, get_frequent_languages may return fewer results than what you asked for, or none at all.
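The filtering described above can be sketched as follows (LanguagePrediction stands in for the binding's named tuple; this is an illustration of the documented behavior, not the binding's actual source):

```python
from collections import namedtuple

# Stand-in for the named tuple returned by the binding.
LanguagePrediction = namedtuple(
    "LanguagePrediction", "language probability is_reliable proportion")

def drop_unknown(predictions):
    # "und" entries carry no usable statistics, so they are filtered out;
    # callers may therefore receive fewer results than they asked for.
    return [p for p in predictions if p.language != "und"]

raw = [
    LanguagePrediction("bg", 0.917, True, 0.585),
    LanguagePrediction("en", 0.999, True, 0.415),
    LanguagePrediction("und", 0.0, False, 0.0),
]
print([p.language for p in drop_unknown(raw)])  # ['bg', 'en']
```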

nnet_lang_id_test::TestPredictions() assertion failure

With latest master, I'm seeing two assertions fail in nnet_lang_id_test::TestPredictions():

Running TestPredictions
  Misclassification: 
    Text: Novi predsjednik Mešihata Islamske zajednice u Srbiji (IZuS) i muftija dr. Mevlud ef. Dudić izjavio je u intervjuu za Anadolu Agency (AA) kako je uvjeren da će doći do vraćanja jedinstva među muslimanima i unutar Islamske zajednice na prostoru Sandžaka, te da je njegova ruka pružena za povratak svih u okrilje Islamske zajednice u Srbiji nakon skoro sedam godina podjela u tom dijelu Srbije. Dudić je za predsjednika Mešihata IZ u Srbiji izabran 4. januara, a zvanična inauguracija će biti obavljena u prvoj polovini februara. Kako se očekuje, prisustvovat će joj i reisu-l-ulema Islamske zajednice u Srbiji Husein ef. Kavazović koji će i zvanično promovirati Dudića u novog prvog čovjeka IZ u Srbiji. Dudić će danas boraviti u prvoj zvaničnoj posjeti reisu Kavazoviću, što je njegov privi simbolični potez nakon imenovanja. 
    Expected language: bs
    Predicted language: hr
  Misclassification: 
    Text: berdiri setelah pengurusnya yang berusia 83 tahun, Fayzrahman Satarov, mendeklarasikan diri sebagai nabi dan rumahnya sebagai negara Islam Satarov digambarkan sebagai mantan ulama Islam  tahun 1970-an. Pengikutnya didorong membaca manuskripnya dan kebanyakan dilarang meninggalkan tempat persembunyian bawah tanah di dasar gedung delapan lantai mereka. Jaksa membuka penyelidikan kasus kriminal pada kelompok itu dan menyatakan akan membubarkan kelompok kalau tetap melakukan kegiatan ilegal seperti mencegah anggotanya mencari bantuan medis atau pendidikan. Sampai sekarang pihak berwajib belum melakukan penangkapan meskipun polisi mencurigai adanya tindak kekerasan pada anak. Pengadilan selanjutnya akan memutuskan apakah anak-anak diizinkan tetap tinggal dengan orang tua mereka. Kazan yang berada sekitar 800 kilometer di timur Moskow merupakan wilayah Tatarstan yang
    Expected language: id
    Predicted language: ms
  Failure: 2 wrong predictions

Is this currently expected, or can an incorrect build configuration (or dependencies) cause this?

I built using the makefile from #5, and the protobuf version is 3.1 / 3.3.2 (tested on two different versions).

Add support to release linux aarch64 wheels

Problem

On aarch64, pip install gcld3 builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system, and building the wheel takes more time than downloading and extracting one from PyPI.

Resolution

On aarch64, pip install gcld3 should download a prebuilt wheel from PyPI.

@jasonriesa, Please let me know your interest in releasing aarch64 wheels. I can help with this.
