
Comments (3)

oborchers commented on May 26, 2024

Hi @mathias3! Help would be very welcome. Although I don't have much time for this repository, it's still a good thing to have, and it fills a niche that more recent advancements in the NLP world cannot fill.

I've created version 0.1.17 and fixed the most glaring issues with the repository, mainly related to incompatibilities with Gensim and Python.

There is also still the develop branch, which contains many fixes and new features that I originally planned to implement or that are partially implemented. For example, the code for the following changes is fully or partially there:

  • Added Hierarchical (Convolutional) Embeddings for all Models
  • Added MaxPooling
  • Added Features to Sentencevectors
  • Added further unittests
  • Workaround for Numpy memmap issue (numpy/numpy#13172)
  • SVD ram subsampling for SIF / uSIF (customizable, standard is 1 GB of RAM)
  • Minor fixes for nan-handling
  • Minor fixes for sentencevectors class
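
For readers unfamiliar with the pooling terms in the list above, here is a minimal sketch of the difference between average pooling and max pooling over word vectors. It uses plain Python lists and two hypothetical helper names (`average_pool`, `max_pool`); it is not the code from the develop branch, only an illustration of the idea.

```python
def average_pool(vectors):
    """Element-wise mean of equally sized word vectors (illustrative helper)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def max_pool(vectors):
    """Element-wise maximum of equally sized word vectors (illustrative helper)."""
    return [max(dim) for dim in zip(*vectors)]


# Two made-up 3-dimensional word vectors for one sentence.
words = [[1.0, -2.0, 0.5], [3.0, 0.0, -0.5]]

print(average_pool(words))  # [2.0, -1.0, 0.0]
print(max_pool(words))      # [3.0, 0.0, 0.5]
```

Averaging blends every word into the sentence vector, while max pooling keeps only the strongest value per dimension.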

There are a few things which might make sense to add to the roadmap:

  • Newer models (I don't know, not up to date in this regard)
  • Working the hierarchical op into the main averaging cython routine
  • Support for a user-definable embedding class (i.e. an fse version of BaseKeyedVectors to get away from the Gensim dependency)
  • Different CI (Travis free mode is no longer available)
  • Add pre_inference and post_inference (I think I forgot this one)
  • Refactoring the horribly complicated Input class
  • Reworking the threading (at least from my last experience the input thread is the bottleneck, not the actual computation)
  • Untangling the bad design decision to actually store the BaseKeyedVectors from Gensim internally. If users want mmap, they can just load that and pass it.
  • Edit: Approximate nearest neighbor search (e.g. via annoy support)?
  • Return vectors only above a certain threshold #34
  • Fix zero division error #47
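
As an illustration of the zero division item, here is a minimal sketch of an averaging routine that guards against an empty sentence. It assumes the error arises when none of a sentence's tokens are in the vocabulary, which is a guess rather than a description of issue #47; `safe_average` and the toy `vocab` are hypothetical and not part of fse.

```python
def safe_average(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known.

    `vectors` maps token -> list of floats of length `dim` (hypothetical layout).
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Without this guard, dividing by len(known) == 0 would raise ZeroDivisionError.
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Toy two-dimensional vocabulary, made up for this example.
vocab = {"fast": [1.0, 0.0], "sentence": [0.0, 1.0]}

print(safe_average(["fast", "sentence", "oov"], vocab, 2))  # [0.5, 0.5]
print(safe_average(["oov"], vocab, 2))                      # [0.0, 0.0]
```

Returning a zero vector is only one possible choice; raising an explicit error or skipping the sentence would be equally reasonable designs.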

Happy to work on some of the issues as well; I should have more time next year.

Who might be interested to help?
@mathias3 @grantmwilliams @AlexMRuch

from fast_sentence_embeddings.

oborchers commented on May 26, 2024

@mathias3: There is also a new version on PyPI: 0.1.17


oborchers commented on May 26, 2024

Fixed / added in 0.2.0:

  • Offering pretrained models and making them accessible
  • Fix zero division error
  • Bugfixes for Python 3.8 builds
  • Code refactoring to black style

