cliqz-oss / keyvi

Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.

Home Page: https://cliqz.com

License: Apache License 2.0

Languages: C++ 84.01%, Python 13.36%, Shell 1.63%, C 0.61%, CMake 0.30%, Makefile 0.09%
Topics: python, cpp, search, data-structures, fst, big-data

keyvi's Introduction

Keyvi is developed and maintained by Cliqz Engineering Team and Hendrik Muhs. Cliqz is a provider of innovative, privacy-focused browser technologies with integrated quick-search functionality and anti-tracking.

DEPRECATED

Hey there, fellow Keyvi lovers! This is to inform you that Keyvi has found a new home, and will continue to be developed under the fork at https://github.com/KeyviDev/keyvi. Please go there to get the latest and greatest Keyvi, packed with new, exciting features and bugfixes.

This repo is kept for historical reasons, and will not be actively maintained.

Keyvi

Keyvi, short for "key value index", is a special subtype of the popular key value store (KVS) technologies. As the name suggests, keyvi is immutable once compiled, which makes it an index rather than a store. Keyvi's strengths are a high compression ratio and extreme scalability. If you need online reads and writes, keyvi is not for you; if your use case is mostly reads with infrequent writes, keyvi may be worth checking out.

Introduction

Install

Quick

Precompiled binary wheels are available for OS X and Linux on PyPI. To install, use:

pip install pykeyvi

From Source

The core is a header-only C++ library, but the third-party TPIE library needs to be compiled once. The command-line tools are also part of the C++ code. For instructions, check the Readme file.

For the Python extension pykeyvi, check the Readme file in the pykeyvi subfolder.

Usage

Internals

If you would like to dig into the fundamentals, keyvi is inspired by the following 2 papers:

Licence and 3rdparty dependencies

keyvi is licensed under the Apache License 2.0, see licence for details.

In addition, keyvi uses 3rdparty libraries which define their own licences. Please check their respective licences. The 3rdparty libraries can be found at keyvi/3rdparty.

Contributing

  • Bug reports, feature requests and general questions can be added as an Issue.

  • PRs are welcome.

  • Questions? Concerns? Feel free to contact us.

keyvi's People

Contributors

ankit-cliqz, david-cliqz, hendrikmuhs, michael-a-cliqz, narekgharibyan, simonalger, subu-cliqz

keyvi's Issues

Revisit Madvise match changes

see #144

(It was somehow closed automatically after git index rewrites and needs to be rebased.)

Note: Not sure if the madvise change itself is useful, but some changes in match.h are.

Core dump while processing a 33GB dataset

This is probably the biggest dataset we have tried yet. The error happens at around 88% (Progress: 223600000/253842680 according to Jenkins). The backtrace is (excluding the Python fluff):

#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1  0x00007efd82de9ceb in memcpy (__len=4, __src=0x7fff198ea2d0, __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51
#2  WriteRawValue (length=4, buffer=0x7fff198ea2d0, offset=<optimized out>, this=0x2dce340) at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_persistence.h:160
#3  keyvi::dictionary::fsa::internal::SparseArrayBuilder<keyvi::dictionary::fsa::internal::SparseArrayPersistence<unsigned short> >::WriteTransition (this=0x37f59a0, offset=0, 
    transitionId=<optimized out>, transitionPointer=<optimized out>) at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_builder.h:408
#4  0x00007efd82e01849 in WriteState (unpacked_state=..., offset=4294967180, this=0x37f59a0) at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_builder.h:246
#5  keyvi::dictionary::fsa::internal::SparseArrayBuilder<keyvi::dictionary::fsa::internal::SparseArrayPersistence<unsigned short> >::PersistState (this=<optimized out>, unpacked_state=...)
    at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_builder.h:86
#6  0x00007efd82e2517e in ConsumeStack (end=<optimized out>, this=<optimized out>) at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/generator.h:371
#7  Add<std::basic_string<char> > (value=..., input_key=..., this=0x3372fc8) at /raid/keyvi/keyvi/src/cpp/dictionary/fsa/generator.h:218
#8  keyvi::dictionary::DictionaryCompiler<keyvi::dictionary::fsa::internal::SparseArrayPersistence<unsigned short>, keyvi::dictionary::fsa::internal::JsonValueStore>::Compile(std::function<void (unsigned long, unsigned long, void*)>, void*) (this=0x3372e50, progress_callback=..., user_data=user_data@entry=0x7efd704a6b50)
    at /raid/keyvi/keyvi/src/cpp/dictionary/dictionary_compiler.h:138
#9  0x00007efd82ddbda4 in __pyx_pf_7pykeyvi_22JsonDictionaryCompiler_18Compile (__pyx_v_args=0x7efd7044e050, __pyx_v_self=0x7efd70869d90) at src/pykeyvi.cpp:3946
#10 __pyx_pw_7pykeyvi_22JsonDictionaryCompiler_19Compile (__pyx_v_self=0x7efd70869d90, __pyx_args=0x7efd7044e050, __pyx_kwds=<optimized out>) at src/pykeyvi.cpp:3835
# dpkg -l | grep keyvi
ii  keyvi              0.1.3         amd64        key value index
ii  python-keyvi       0.1.4         amd64        python bindings for keyvi

Could the version mismatch between keyvi (0.1.3) and python-keyvi (0.1.4) be the problem?

[RFC] Subpackages for pykeyvi

For a new feature I would like to discuss modularization. I would like to put the new feature into a subpackage ('pykeyvi.index'). After experimenting a bit with Cython, it unfortunately turns out that Cython does not support this easily:

https://github.com/cython/cython/wiki/FAQ#how-to-compile-cython-with-subpackages

There is a recommendation on the wiki:

https://github.com/cython/cython/wiki/PackageHierarchy

This would build a shared object ('so') for every subpackage, which would waste a lot of space and increase memory usage.

RFC: Move the Cython extension into a hidden namespace ('pykeyvi._core') while keeping its flat structure, then create a Python module structure that imports from it. This keeps a single so file but still provides subpackages:

 - pykeyvi
  - __init__.py
  - index
   - __init__.py

pykeyvi/__init__.py:

from pykeyvi._core import JsonDictionaryCompiler, ...

pykeyvi/index/__init__.py:

from pykeyvi._core import IndexReader, IndexWriter

In the longer run this also solves 2 problems:

  • cleanup existing structure
  • create python code extensions (like the keyvicli library)
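
The re-export layout proposed above can be tried out without Cython. The following is a minimal sketch, assuming a pure-Python stand-in for the compiled extension and a hypothetical package name pykeyvi_demo (neither is part of keyvi). It only shows that the top-level package and the subpackage can both expose objects that live in one hidden module:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the compiled extension. In the proposal this
# would be the single Cython-built shared object, not a .py file.
CORE_SOURCE = '''
class JsonDictionaryCompiler:
    pass

class IndexReader:
    pass

class IndexWriter:
    pass
'''


def build_demo_package(root: Path, name: str = "pykeyvi_demo") -> str:
    """Create the proposed layout on disk and return the package name."""
    pkg = root / name
    (pkg / "index").mkdir(parents=True)
    (pkg / "_core.py").write_text(CORE_SOURCE)
    # The top-level package re-exports the flat API from the hidden module.
    (pkg / "__init__.py").write_text(
        f"from {name}._core import JsonDictionaryCompiler\n"
    )
    # The subpackage re-exports only the index-related classes.
    (pkg / "index" / "__init__.py").write_text(
        f"from {name}._core import IndexReader, IndexWriter\n"
    )
    return name


with tempfile.TemporaryDirectory() as tmp:
    name = build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        top = importlib.import_module(name)
        index = importlib.import_module(f"{name}.index")
        core = importlib.import_module(f"{name}._core")
        # Both public modules hand out the very same class objects.
        same_reader = index.IndexReader is core.IndexReader
        same_compiler = top.JsonDictionaryCompiler is core.JsonDictionaryCompiler
    finally:
        sys.path.remove(tmp)

print(same_reader, same_compiler)
```

Because the classes are defined once and only imported elsewhere, the subpackages add no second copy of the code, which is the point of keeping a single so file.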

Installation dependencies on Ubuntu

Apart from the installation steps mentioned, a few additional packages need to be installed; otherwise you get a "msgpack package not found" error and other similar exceptions:
easy_install msgpack-python
apt-get install python-tk python-setuptools
sudo python -m easy_install -U snappy
These were the requirements for installing keyvi on a c3.2xlarge machine on AWS.

0.2 Cleanup TODOs

  • Integer and Completion dictionary type mess
  • move memory parameter into parameter list

pykeyvi segfault 11

sample code

import pykeyvi

# Minimal reproduction: the segfault is raised inside WriteToFile
compiler = pykeyvi.StringDictionaryCompiler()
compiler.Add('a', 'b')
compiler.WriteToFile('dump1.kv')

Stacktrace

keyvi::dictionary::DictionaryCompiler<keyvi::dictionary::fsa::internal::SparseArrayPersistence<unsigned short>, keyvi::dictionary::fsa::internal::StringValueStore, keyvi::dictionary::sort::TpieSorter<keyvi::dictionary::sort::key_value_pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, keyvi::dictionary::fsa::ValueHandle> > >::WriteToFile<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 243
1   pykeyvi.so                    	0x000000010d0aef1b __pyx_pw_7pykeyvi_24StringDictionaryCompiler_15WriteToFile(_object*, _object*) + 411

OSX, Python 2.7/3.6

Configurable compressor support

Currently keyvi only supports zlib, which compresses well but is very slow and does not do a good job on short strings. Add a framework that

  1. allows the user to select the compression method
  2. allows the user to select the compression threshold
  3. also includes short string compression methods.
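
To make the request more concrete, here is a minimal sketch of what a selectable method plus threshold could look like. It is not keyvi's API: the COMPRESSORS registry, encode_value, decode_value and the default threshold of 32 bytes are all hypothetical names and values chosen for illustration. Only zlib from the Python standard library is used; a short string compressor would be registered as a further entry.

```python
import zlib
from typing import Callable, Dict, Tuple

Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

# Hypothetical registry: method name -> (compress, decompress).
COMPRESSORS: Dict[str, Codec] = {
    "raw": (lambda data: data, lambda data: data),
    "zlib": (zlib.compress, zlib.decompress),
}


def encode_value(value: bytes, method: str = "zlib",
                 threshold: int = 32) -> Tuple[str, bytes]:
    """Return (method_used, payload).

    Values shorter than the threshold, or values that do not shrink,
    are stored raw so that short strings are not penalized.
    """
    if method == "raw" or len(value) < threshold:
        return "raw", value
    compress, _ = COMPRESSORS[method]
    packed = compress(value)
    if len(packed) >= len(value):
        return "raw", value
    return method, packed


def decode_value(method_used: str, payload: bytes) -> bytes:
    """Invert encode_value using the method recorded next to the payload."""
    _, decompress = COMPRESSORS[method_used]
    return decompress(payload)


short_value = b"keyvi"
long_value = b'{"title": "keyvi"}' * 50

short_method, short_payload = encode_value(short_value)
long_method, long_payload = encode_value(long_value)

print(short_method, len(short_payload))
print(long_method, len(long_payload) < len(long_value))
```

Storing the method name next to each payload is one possible design; it lets values written with different settings coexist, at the cost of a small marker per value.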

Document parameters - what and how - for keyvi command line and python API

Currently, we use the keyvi compiler option "floating_point_precision" for word embeddings in the sharding/compiling step. It would be nice to be able to pass this option through the keyvi command line and Python API for any keyvi file whose values are vectors.

Ex: keyvi_compiler_options = {"minimization": "off", "floating_point_precision": "single"}

This will help reduce the size of massive keyvi files composed of vector values (e.g. document vectors: ~2.1B vectors, ~300 dimensions). I haven't been able to figure out how to use this feature. A standalone example with documentation would be useful.
@hendrikmuhs
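
As a rough illustration of why the precision setting matters for vector values, the sketch below packs a 300-dimensional vector as raw single and double precision floats using only the Python standard library. It does not call keyvi, and keyvi's actual value encoding and compression will give different absolute numbers; the vector contents are made up.

```python
import struct

DIMENSIONS = 300
# Made-up example vector; real embeddings would come from a model.
vector = [0.1 * i for i in range(DIMENSIONS)]

# Raw little-endian encodings: 4 bytes per value vs 8 bytes per value.
single = struct.pack(f"<{DIMENSIONS}f", *vector)
double = struct.pack(f"<{DIMENSIONS}d", *vector)

print(len(single), len(double))  # 1200 2400

# Precision lost by the round trip through single precision.
restored = struct.unpack(f"<{DIMENSIONS}f", single)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(max_error < 1e-5)
```

At ~2.1B vectors, the raw float payload alone would be roughly 2.5 TB in single precision versus roughly 5 TB in double precision, before any serialization overhead or compression.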
