
Comments (7)

bsolomon1124 commented on May 2, 2024 7

Using the work of everyone here (thank you everyone!) I've tried to combine the change sets into one clean set of commits and put a shiny new wrapper on things, which also sits on PyPI as pycld3.

https://github.com/bsolomon1124/pycld3

Reviews appreciated. Again, I've made my best effort to make sure the incremental changes across different forks are picked up and put together.

from cld3.

iamthebot commented on May 2, 2024 5

@lpla I've fixed these memory leaks in my fork of CLD3. Basically, the elizafox version creates a new model object on each call to get_language and, on top of that, never cleans it up. My fork keeps the original functions (but cleans up the objects) and adds a class called LanguageIdentifier, which permits reuse of the model for faster performance.

The fork is iamthebot/cld3
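To illustrate the difference between the two calling patterns, here is a minimal pure-Python sketch. It does not use the real bindings: FakeModel and its predict method are hypothetical stand-ins for the native model, and only the names get_language and LanguageIdentifier come from the description above. The counter simply shows how many model objects each pattern constructs.

```python
class FakeModel:
    """Hypothetical stand-in for the native CLD3 model (not the real API)."""

    instances = 0  # counts how many models have been constructed

    def __init__(self):
        FakeModel.instances += 1

    def predict(self, text):
        # Toy rule for illustration only: blank input is "und", anything else "en".
        return "en" if text.strip() else "und"


def get_language(text):
    """Per-call pattern: builds a fresh model on every call."""
    model = FakeModel()
    return model.predict(text)


class LanguageIdentifier:
    """Reuse pattern: builds the model once and keeps it for later calls."""

    def __init__(self):
        self._model = FakeModel()

    def get_language(self, text):
        return self._model.predict(text)


if __name__ == "__main__":
    sentences = ["hello there!", "how are you?", "   "]

    for s in sentences:
        get_language(s)
    print("models built, per-call pattern:", FakeModel.instances)

    FakeModel.instances = 0
    ident = LanguageIdentifier()
    for s in sentences:
        ident.get_language(s)
    print("models built, reuse pattern:", FakeModel.instances)
```

In Python the per-call objects are garbage collected, so this sketch only shows the construction overhead; in a native extension, each uncollected model would also be leaked memory.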


bsolomon1124 commented on May 2, 2024 2

@iamthebot

I believe there's still a small error in your fork.

You use the comparison:

str(res.language) != ident.kUnknown:

This is not doing what you think it is.

Originally, res.language is a C++ string, while ident.kUnknown is a const char array (with value "und").

However, str(res.language) does not do the correct coercion in the same way that str(b"hello") does not decode the string; it just makes a str representation of that bytes object.

>>> str(b"hello")
"b'hello'"
>>> str(b"hello") == "hello"  # No!
False

What is needed here is:

if <bytes> res.language != <bytes> ident.kUnknown:

You can prove this for yourself by throwing this into get_language():

cdef string tst = b"und" 
print(tst)
print(str(tst) == ident.kUnknown)
print(tst.decode("utf-8") == ident.kUnknown)

Then running

python3 setup.py build_ext --inplace --quiet && python3 -c 'import cld3; cld3.get_language("hello there!")'

will produce False, False for the two comparisons.
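The same pitfall can be reproduced in plain Python without building the extension. In this sketch UNKNOWN is an assumed stand-in for ident.kUnknown as it appears on the Python side (a bytes object), and the two helper functions are hypothetical names used only for illustration.

```python
UNKNOWN = b"und"  # assumed stand-in for ident.kUnknown (bytes on the Python side)


def is_known_buggy(language: bytes) -> bool:
    # str(b"und") is the 7-character text "b'und'", and a str never equals
    # a bytes object, so this check is True for every input.
    return str(language) != UNKNOWN


def is_known_fixed(language: bytes) -> bool:
    # Compare bytes with bytes, as in the <bytes> casts suggested above.
    return language != UNKNOWN


if __name__ == "__main__":
    for lang in (b"en", b"und"):
        print(lang, is_known_buggy(lang), is_known_fixed(lang))
```

The buggy check reports the unknown marker as a known language, which is why the comparison never filters out "und" results.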


bsolomon1124 commented on May 2, 2024 1

Hi @jasonriesa and @akihiroota87: do the maintainers of google/cld3 have any interest in incorporating Python bindings within this repo, by reviewing and combining the various forks mentioned above?

As a tangentially related change, as a part of those forks, the Chromium dependency was removed. If that wasn't the case, the logical solution might be a git submodule, but since the C source itself has changed in the forks, that becomes difficult.


iamthebot commented on May 2, 2024 1

Thanks @bsolomon1124! I actually just copied that part from the elizafox cld3 fork so I guess many of us had been using this in its broken form for a while lol. The new wrapper looks great and we'll switch to using it soon.


lpla commented on May 2, 2024

I have been testing the Elizafox/cld3 Python binding and I had severe memory issues. The more sentences I detect, the more memory is used. I don't know if this is an issue in cld3 or in the Python binding specifically.

And given that I cannot open an issue in any of the Python binding forks, I thought I would report it here.


bact commented on May 2, 2024

gcld3 - a Python binding for CLD3 from Google

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3

