annoy's Introduction

Annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Install

To install, simply do pip install --user annoy to pull down the latest version from PyPI.

For the C++ version, just clone the repo and #include "annoylib.h".

Background

There are some other libraries for nearest neighbor search. Annoy is almost as fast as the fastest libraries (see below), but one feature really sets Annoy apart: it can use static files as indexes. In particular, this means you can share an index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing about Annoy is that it tries to minimize its memory footprint, so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPUs, you only need to build the index once. You can also pass around and distribute static files to use in production environments, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Summary of features

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular, you cannot add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7.
  • Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)

Python code example

from annoy import AnnoyIndex
import random

f = 40  # Length of item vector that will be indexed

t = AnnoyIndex(f, 'angular')
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

Right now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other ids, you will have to keep track of a map yourself, as sketched below.
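
A minimal sketch of such a mapping (the helper names key_to_id and id_to_key are illustrative, not part of Annoy):

from annoy import AnnoyIndex

index = AnnoyIndex(40, 'angular')
key_to_id = {}   # e.g. 'track:123' -> 0
id_to_key = []   # position i holds the key for item i

def add(key, vector):
    i = len(id_to_key)   # ids are assigned sequentially from 0
    key_to_id[key] = i
    id_to_key.append(key)
    index.add_item(i, vector)

def neighbors(key, n):
    ids = index.get_nns_by_item(key_to_id[key], n)
    return [id_to_key[i] for i in ids]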

Full Python API

  • AnnoyIndex(f, metric) returns a new index that's read-write and stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
  • a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
  • a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees give higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
  • a.save(fn, prefault=False) saves the index to disk and loads it (see next function). After saving, no more items can be added.
  • a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
  • a.unload() unloads.
  • a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between accuracy and speed. If you set include_distances to True, it will return a two-element tuple of lists: the first containing the neighbor ids and the second the corresponding distances.
  • a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) same but query by vector v.
  • a.get_item_vector(i) returns the vector for item i that was previously added.
  • a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
  • a.get_n_items() returns the number of items in the index.
  • a.get_n_trees() returns the number of trees in the index.
  • a.on_disk_build(fn) prepares annoy to build the index in the specified file instead of RAM (execute before adding items; no need to save after build). See the sketch after this list.
  • a.set_seed(seed) will initialize the random number generator with the given seed. Only used for building up the trees, i.e. only necessary to pass this before adding the items. Will have no effect after calling a.build(n_trees) or a.load(fn).
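
For example, a minimal sketch of building a large index directly on disk with on_disk_build (the file name is illustrative):

from annoy import AnnoyIndex
import random

f = 40
a = AnnoyIndex(f, 'angular')
a.on_disk_build('big.ann')  # must be called before adding any items
for i in range(100000):
    a.add_item(i, [random.gauss(0, 1) for _ in range(f)])
a.build(10)  # the index ends up in big.ann; no save() needed

b = AnnoyIndex(f, 'angular')
b.load('big.ann')  # mmap it back like any saved index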

Notes:

  • There's no bounds checking performed on the values so be careful.
  • Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u, v is equal to sqrt(2(1-cos(u,v))); a quick check of this is sketched below
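
A quick numerical check of that relationship (a sketch; assumes numpy is available):

from annoy import AnnoyIndex
import numpy as np

f = 10
u, v = np.random.randn(f), np.random.randn(f)

t = AnnoyIndex(f, 'angular')
t.add_item(0, u)
t.add_item(1, v)
t.build(1)

cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(t.get_distance(0, 1))    # angular distance as reported by Annoy
print(np.sqrt(2 * (1 - cos)))  # should match closely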

The C++ API is very similar: just #include "annoylib.h" to get access to it.

Tradeoffs

There are just two main parameters needed to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k.

  • n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
  • search_k is provided at runtime and affects the search performance. A larger value will give more accurate results, but will take longer to return.

If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa. Basically it's recommended to set n_trees as large as possible given the amount of memory you can afford, and it's recommended to set search_k as large as possible given the time constraints you have for the queries.
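
For example, reusing the test.ann index built in the Python example above (10 trees), the same query can be run at different points on that tradeoff (a sketch; the search_k value is illustrative):

from annoy import AnnoyIndex
import random

f = 40
u = AnnoyIndex(f, 'angular')
u.load('test.ann')

v = [random.gauss(0, 1) for _ in range(f)]
fast = u.get_nns_by_vector(v, 10)                      # search_k defaults to 10 * 10
accurate = u.get_nns_by_vector(v, 10, search_k=10000)  # inspects up to 10000 nodes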

You can also accept slower search times in favour of reduced loading times, memory usage, and disk IO. On supported platforms the index is prefaulted during load and save, causing the file to be pre-emptively read from disk into memory. If you set prefault to False, pages of the mmapped index are instead read from disk and cached in memory on-demand, as necessary for a search to complete. This can significantly increase early search times but may be better suited for systems with low memory compared to index size, when few queries are executed against a loaded index, and/or when large areas of the index are unlikely to be relevant to search queries.

How does it work

It works by using random projections and building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times to get a forest of trees. k has to be tuned to your needs, based on the tradeoff you want between precision and performance.
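
A minimal sketch of a single split in the spirit of the description above (this mirrors the idea, not Annoy's C++ internals):

import numpy as np

def split(points):
    # Sample two points; the splitting hyperplane has normal a - b and
    # passes through their midpoint (a + b) / 2, so it is equidistant from them.
    a, b = points[np.random.choice(len(points), 2, replace=False)]
    normal = a - b
    offset = -np.dot(normal, (a + b) / 2)
    side = points @ normal + offset > 0
    return points[side], points[~side]

Recursing on each half until a node holds few enough points yields one tree; repeating the whole procedure k times yields the forest.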

Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.
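
For intuition, here is a sketch of the packing idea in Python (Annoy itself does this in C++ over 64-bit words with hardware popcount):

import numpy as np

def hamming(u, v):
    # u, v: arrays of 0/1 values. Pack 8 bits per byte, XOR, count set bits.
    x = np.bitwise_xor(np.packbits(u), np.packbits(v))
    return int(np.unpackbits(x).sum())

print(hamming(np.array([0, 1, 1, 0]), np.array([1, 1, 0, 0])))  # 2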

Dot Product distance (contributed by Peter Sobot and Pavel Korobov) reduces the provided vectors from dot (or "inner-product") space to a more query-friendly cosine space using a method by Bachrach et al., at Microsoft Research, published in 2014.

More info

Source code

It's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)

The code should support Windows, thanks to Qiang Kou and Timothy Riley.

To run the tests, execute python setup.py nosetests. The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.

Discuss

Feel free to post any questions or comments to the annoy-user group. I'm @fulhack on Twitter.

annoy's People

Contributors

berkerpeksag, chdsbd, dvenum, eddelbuettel, erikbern, fredrikluo, hanabi1224, joshua-chin, jrade, karlhigley, keithmcneill, linhtran174, lisitsyn, ltla, maumueller, moritz-h, natanlaverde, negation, novoselrok, orf, p16i, pkorobov, psobot, quasimondo, rbares, renehollander, rosmo, starius, tjrileywisc, yonromai


annoy's Issues

Euclidean metric is much slower than Angular?

I am running an experiment with 1.2 million nodes, each represented by a 500-dim vector. On the same machine, when using Angular distance, it takes about 1 minute to build a tree; whereas using Euclidean, it takes about 1 hour. Is this difference caused by Euclidean distance having to calculate an extra offset value, or other things?

Idea for improved hyperplane splits

Instead of sampling just 2 points, sample k points for some reasonable k as well as a uniformly random hyperplane h. Then, project all k points down to h. In 2d this would amount to points on a line. It's now trivial to find a second hyperplane h' perpendicular to h that splits the projected points and will in turn split the points in the original space. See the diagram below for a visual representation. This method should improve splits by making the splits more evenly distributed while still maintaining the nondeterminism to avoid trees replicating the same splits.

[diagram omitted]
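
One reading of the proposal, as a sketch in numpy (k and the helper name are illustrative):

import numpy as np

def improved_split(points, k=16):
    # Sample k points and a uniformly random direction.
    sample = points[np.random.choice(len(points), k, replace=False)]
    d = np.random.randn(points.shape[1])
    d /= np.linalg.norm(d)
    # Projecting onto d reduces the sample to points on a line; splitting at
    # the median projection gives a perpendicular hyperplane h' that divides
    # the sampled points evenly.
    threshold = np.median(sample @ d)
    side = points @ d > threshold
    return points[side], points[~side]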

Euclidean distance instead of angular used?

Whether or not I create an index using the angular distance or the Euclidean distance, the distances returned for the nearest neighbor search always seem to be Euclidean distances (get_nns_by_vector(include_distances=True)).

I don't know what goes wrong because based on my cursory glance through the source this shouldn't be happening. I also don't know if it's only these distance calculations that are affected or if the whole index always uses the Euclidean distance instead of the angular distance.

kNN for high-dimensional sparse vectors?

By high dimension, I mean vectors with more than 3000 elements.
By sparse, I mean fewer than 1% of the entries are nonzero.
Is annoy a good choice for this kind of kNN task? If the answer is no, what can we do?

Ability to insert items with non-integer keys

Annoy currently only supports indexes keyed by integers. This makes it difficult to build indexes with non-numeric keys as it requires keeping track of a separate key --> int mapping outside of annoy. It would be nice if annoy could support non-numeric keys and keep an internal integer mapping for array indexing.
Erik - are there any hesitations about my adding support for this?

Cannot use after install

import annoy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "annoy/__init__.py", line 15, in <module>
    from .annoylib import *
ImportError: No module named annoylib

ImportError: Symbol not found: _PyModule_Create2

I installed annoy with pip2 install annoy on my Mac, and it went very well. But when I try to import annoy in CPython, an error comes up!

>>> import annoy
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/annoy/__init__.py", line 15, in <module>
    from .annoylib import *
ImportError: dlopen(/usr/local/lib/python2.7/site-packages/annoy/annoylib.so, 2): Symbol not found: _PyModule_Create2
  Referenced from: /usr/local/lib/python2.7/site-packages/annoy/annoylib.so
  Expected in: flat namespace
 in /usr/local/lib/python2.7/site-packages/annoy/annoylib.so

According to my speculation, the error comes from setup.py compiling annoylib.so, but I am not sure about this. Any idea? My Mac OS is El Capitan, version 10.11.1. Python comes from brew install python and it is Python 2.7.10.

Is it a bug in searching?

in order to implement it for my own use, I read the source code.
I guess there might be a logic bug -- unless I understand the idea in a wrong way..
I will try to explain it step by step.

During the _make_tree() step, a memory optimisation is used for nodes with n_descendants <= K, which means a subtree with only leaf nodes can have a size as large as K.

In _get_all_nns(), the vector nns is used to store all potential candidates, and its size is limited to k * number_of_trees, where k is the number of nearest neighbours to report.

In _get_all_nns(), the search of a tree stops when it meets an internal node with only leaf nodes, and all of these leaf nodes are added to the vector nns.

The problem is that the vector nns quickly reaches its size limit after searching only a few trees, and the majority of the forest is never visited.

In my test example, I build 500 trees for 2000 nodes, the vectors have dimension 400, and I want to find the top 5 NNs. _get_all_nns() visits only 10 trees and stops, leaving the remaining 490 trees untouched.
Maybe that explains why annoy works faster than many other programs...

I am not totally convinced that it is a heuristic rather than a bug in the program...

Any comments are helpful.

Add Maximum Inner Product Search

See http://arxiv.org/pdf/1410.5410v2.pdf

This is pretty easy to implement. I will probably do it myself, but in case I don't get to it, I wanted to jot some notes down and this seems like a good place to do it.

The basic algorithm is very simple and can be used in Annoy Angular mode with the following modifications (sketched in code after these steps), assuming we use the recommended m=2, U=0.75 parameters in the paper:

Initialization:

  • Add two to the input dimension before any add_item has been called, and pad the last two dimensions with 0s in add_item.

When it is time to build():

  • First, find the norm of the largest vector in the input set and compute a scale parameter U / ||x_max||, and then re-scale all vectors by this factor. Re-scaling the universe does not change the argmax of any maximum inner product.
  • While rescaling, set the extra two dimensions of each vector to [0.5 - ||x_i||^2, 0.5 - ||x_i||^4]
  • Resume the standard build process.

When it is time to query:

  • Just append two zeros to the query vector. That's it. The result will be the maximum dot product, instead of the maximum cosine distance, because of the brilliant / insane proofs in the paper above.
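
Putting the steps together as a sketch in numpy (this transcribes the description above with m=2, U=0.75; it is not Annoy's shipped implementation):

import numpy as np

U = 0.75

def preprocess(X):
    # X: (n, f) matrix of item vectors.
    X = X * (U / np.max(np.linalg.norm(X, axis=1)))  # rescale the universe
    n2 = np.sum(X * X, axis=1)                       # squared norms ||x_i||^2
    extra = np.column_stack([0.5 - n2, 0.5 - n2 ** 2])  # [0.5 - ||x||^2, 0.5 - ||x||^4]
    return np.hstack([X, extra])                     # now f + 2 dimensions

def preprocess_query(q):
    return np.concatenate([q, [0.0, 0.0]])           # just append two zeros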

Use 8-bit elements

I bet you can quantize the vector elements to 8 bit and make the index 4x smaller – this should help with cache locality as well as fitting more trees in the same index. Will try at some point
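
A sketch of the kind of per-vector quantization this suggests (illustrative only; Annoy does not currently do this):

import numpy as np

def quantize(v):
    # Map float components to int8 with a symmetric per-vector scale.
    # Assumes v is not all zeros.
    scale = np.max(np.abs(v)) / 127.0
    q = np.round(v / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale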

Item ids must be sequential starting at zero

It was non-obvious to me that the integer ids supplied to add_item() must be sequential and start at zero. But if they are not, get_n_items() returns bogus counts. Is this the intended contract? It seems like if the ids are simply increasing sequential integers, AnnoyIndex could manage incrementing the ids itself. Why require specifying them in add_item() at all? Am I missing something?

For example, this appears to work:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(0, [1, 2, 3])
>>> t.add_item(1, [2, 2, 3])
>>> t.add_item(2, [3, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
3

But use arbitrary ids, and the counts are tricked:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(1, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
2

Or:
>>>
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(100, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
101

"Too Many Files Open" Error - Files Not Being Closed?

I've been using the ANNOY Python API to build and save several index tables in order to load and reuse subsets of these in another script (at most about 400 in a subset at a time). The library is wonderfully fast, but I've been encountering "too many files open" errors after some time. I'm loading each index once, doing stuff with it, then unloading it before the loop continues onwards with the next index table.

Here is a partial code snippet that reproduces the error by reloading the same index table (510 x 64) over and over (run the loop multiple times if the error doesn't show up immediately):

from annoy import AnnoyIndex
index_table_filename = 'index.ann'  # path to a previously saved index table
index_table = AnnoyIndex( 64, metric='angular' )
for i in xrange(10000):
    index_table.load( index_table_filename )
    print i  # just seeing when it failed
    index_table.unload()  # unload() takes no arguments

After some sleuthing into the GitHub source code, I notice that unload() unmaps the file from memory, but doesn't close the file opened using load(). Would it be possible to get file closing functionality incorporated into the Python wrapper somehow please? Or am I missing something vital in using ANNOY?

Thanks for all the hard work on the API!

Memory leak in get_item_vector() ?

I don't really know how to diagnose memory leaks; however, I do notice that memory usage keeps increasing when calling get_item_vector repeatedly.

#!/usr/bin/env python3

from annoy import AnnoyIndex
from random import random, sample
from numpy import array

index = AnnoyIndex(500)
collection_count = 1000000

for i in range(collection_count):
    if (i + 1) % 10000 == 0:
        print('Inserting vector #{}'.format(i + 1))

    index.add_item(i, array([random() for _ in range(500)]))

print('building index')
index.build(1)


for i in range(collection_count):
    print('{}: {}'.format(i, len(index.get_item_vector(i))))

However, the memory problem is gone if I do a gc.collect() after each call to get_item_vector. Does that help?

My installation of annoy

$ pip show annoy

---
Name: annoy
Version: 1.6.2
Location: /home/jeffrey04/machine-learning/lib/python3.4/site-packages
Requires:

Using python 3.4 on Ubuntu 14.04 64bit

$ python --version
Python 3.4.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.3 LTS
Release:        14.04
Codename:       trusty
$ uname -a
Linux gideon 4.2.0-19-generic #23~14.04.1-Ubuntu SMP Thu Nov 12 12:33:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

norm of splitting points

(I'm only considering the angular case here)

At creation time, the split points are created as a vector difference here
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L182-L184
This ends up with the split points of different nodes having different norms.

At query time, the margin value that is used as priority for the PQ (https://github.com/spotify/annoy/blob/master/src/annoylib.h#L552), is computed as a dot product: https://github.com/spotify/annoy/blob/master/src/annoylib.h#L165

This means that the norm of the split points ends up having an effect on the order of the traversal of the tree, which I don't think is an intended effect.

The java version of annoy computes the cosine margin from the split point which sidesteps this problem, and intuitively seems the correct thing to do.

Is there something that I am missing here?

What are you using Annoy for?

(I couldn't think of a better way to ask than filing a ticket)

I'm planning to speak at the NYC ML meetup later this year about ANN and was curious if people are happy to share some use cases for Annoy!

Compilation error (boost/random.hpp: No such file or directory)

I tried to install after cloning from github.
About my machine:

$ uname -a
Linux 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

And the error message:

$ python setup.py install
running install
running build
running build_py
running build_ext
building 'annoy.annoylib' extension
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoymodule.cc -o build/temp.linux-x86_64-2.7/src/annoymodule.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/annoymodule.cc:15:0:
src/annoylib.h:32:28: fatal error: boost/random.hpp: No such file or directory
 #include <boost/random.hpp>
                            ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Is there anything I can do to fix this?

Thank you!

Installing Error

Hi, I installed annoy with pip on a CentOS server but got these errors. Any idea?

    gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xxx/.pyenv/versions/2.7.10/include/python2.7 -c src/annoymodule.cc -o build/temp
.linux-x86_64-2.7/src/annoymodule.o -O3 -march=native -ffast-math
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
    In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
                     from src/annoymodule.cc:16:
    /home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1188:0: warning: "_POSIX_C_SOURCE" redefined
     #define _POSIX_C_SOURCE 200112L
     ^
    In file included from /usr/include/stdio.h:28:0,
                     from src/annoylib.h:18,
                     from src/annoymodule.cc:15:
    /usr/include/features.h:162:0: note: this is the location of the previous definition
     # define _POSIX_C_SOURCE 200809L
     ^
    In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
                     from src/annoymodule.cc:16:
    /home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1210:0: warning: "_XOPEN_SOURCE" redefined
     #define _XOPEN_SOURCE 600
     ^
    In file included from /usr/include/stdio.h:28:0,
                     from src/annoylib.h:18,
                     from src/annoymodule.cc:15:
    /usr/include/features.h:164:0: note: this is the location of the previous definition
     # define _XOPEN_SOURCE 700
     ^

    /tmp/ccfNu0mQ.s: Assembler messages:
    /tmp/ccfNu0mQ.s:21160: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:21163: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:21532: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:24347: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:24350: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:24718: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:26863: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:26865: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:27238: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:29676: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:29678: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:30054: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:32409: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:32413: Error: suffix or operands invalid for `vbroadcastss'
    /tmp/ccfNu0mQ.s:32782: Error: suffix or operands invalid for `vbroadcastss'
    error: command 'gcc' failed with exit status 1

AnnoyIndex's bulk load option

I have to load in, say, 1 million (700-dim) records from a csv file or pandas dataframe.
Is there any way to do a bulk load for AnnoyIndex instead of adding each record one by one with add_item(i, v)?

Windows ?

One can use MapViewOfFile on Windows in lieu of mmap -- and the CRAN package mmap by Jeff Ryan does so (implementing mmap access to file from R). See the CRAN page or the GitHub copy of the repo.

We could probably borrow the functionality to deal with int32_t relatively easily. Obviously not urgent.

other distance functions

I realized today it's pretty easy to generalize to arbitrary distance functions. If every split is just the midpoint normal of two points, then you can split by checking whether d(a, x) < d(b, x), where a and b are the points defining the split.

Interestingly this generalizes to crazy shit like edit distance. Would be cool to do a spell checker using Annoy :)
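
A sketch of that split rule in Python (d can be any distance function, e.g. an edit distance over strings):

import random

def split(items, d):
    # Pick two pivots; send each item to the side of its nearer pivot.
    a, b = random.sample(items, 2)
    left = [x for x in items if d(a, x) < d(b, x)]
    right = [x for x in items if d(a, x) >= d(b, x)]
    return left, right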

Does windows support now?

I am working on a Win10 desktop, encountering the same issue: C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe failed with exit status 2.

small typo in the API documentation

Hi there,

In the "Full Python API" section under README.rst, this method
a.get_item_vector_item(i)
should be
a.get_item_vector(i)

Best,
Pei

bug with "get_nns_by_item"?

Hi. I think I have uncovered a bug in "get_nns_by_item". Here is my code to reproduce it.
The bug does not seem to appear in "get_nns_by_vector".

import random


import annoy
DIM = 50
t = annoy.AnnoyIndex(DIM, metric='angular')  # Length of item vector that will be indexed

for i in range(100):
    v = [random.gauss(0, 1) for _ in range(DIM)]
    t.add_item(i, v)

t.build(10)

print('nb_items:', t.get_n_items())

#BUG Annoy : TypeError: get_nns_by_item() takes no keyword arguments

#works...
#res1 = t.get_nns_by_item(50, 3, -1, True)
#does not work:
res1 = t.get_nns_by_item(50, 3, search_k=-1, include_distances=True)
print('res1:', res1)

#both calls work with get_nns_by_vector:
vq = [random.gauss(0, 1) for _ in range(DIM)]
res2a = t.get_nns_by_vector(vq, 2, -1, True)
res2b = t.get_nns_by_vector(vq, 2, search_k=-1, include_distances=True)
print('res2a:', res2a)
print('res2b:', res2b)

Failures in _annoy_test.py

Hello!

I ran tests from _annoy_test.py and got 7 failures:

FAIL: test_get_nns_by_item (_annoy_test.AngularIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item
self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]

Second list contains 1 additional elements.
First extra element 2:
2

- [0, 1]
+ [0, 1, 2]
?      +++

FAIL: test_get_nns_by_vector (_annoy_test.AngularIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector
self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]

Second list contains 1 additional elements.
First extra element 2:
2

- [0, 1]
+ [0, 1, 2]
?      +++

FAIL: test_large_index (_annoy_test.AngularIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index
self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [5, 4] != [504, 505]

First differing element 0:
5
504

- [5, 4]
+ [504, 505]

FAIL: test_large_index (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index
self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j])
AssertionError: Lists differ: [228, 229] != [535, 534]

First differing element 0:
228
535

- [228, 229]
+ [535, 534]

FAIL: test_precision_10 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10
self.assertTrue(self.precision(10) == 1.0)
AssertionError: False is not true

FAIL: test_precision_100 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100
self.assertTrue(self.precision(100) >= 0.99)
AssertionError: False is not true

FAIL: test_precision_1000 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000
self.assertTrue(self.precision(1000) >= 0.99)
AssertionError: False is not true


Ran 14 tests in 2.622s

FAILED (failures=7)

Process finished with exit code 7

Add method or parameter for returning (itemid, score) tuples

It would be nice to be able to iterate over a get_nns call with not just the itemid but the score also, especially as it's available to the C++ code; currently you have to reconstruct the score manually by normalizing and taking your own dot product in numpy or something.

load file returns 0

I ran annoy 1.2.2 on Amazon EC2 with Ubuntu Linux 14.04, and the file loading always returns 0. Any idea why? I would really appreciate your ideas.

from annoy import AnnoyIndex
f=50
i=AnnoyIndex(f)
i.load('items.tree')
0

PS: the file is small enough, several Kbs

Vector elements are truncated to ints

I did a sudo pip install of annoy on my OSX laptop and had the following unusual results:

>>> from annoy import AnnoyIndex
>>> import random
>>>
>>> f = 2
>>> t = AnnoyIndex(f)
>>> t.add_item(0, [0, 2.5])
0
>>> t.add_item(1, [3.5, 0])
0
>>> t.build(10)
0
>>> t.get_item_vector(0)
[0, 2]
>>> t.get_item_vector(1)
[3, 0]
>>> t.get_distance(0,1)
2.0

A few issues:

  • A 0 is returned and printed by the add_item() and build() methods.
  • When the items are queried, the vector that is returned has truncated elements.
  • The get_distance() method doesn't return the proper angular distance, nor the distance under the assumption that the vector elements are truncated.

cannot import name AnnoyIndex


ImportError Traceback (most recent call last)
/home/itri/Desktop/annoy.py in <module>()
----> 1 from annoy import AnnoyIndex
2 import random
3
4 f = 40
5 t = AnnoyIndex(f) # Length of item vector that will be indexed

/home/itri/Desktop/annoy.py in <module>()
----> 1 from annoy import AnnoyIndex
2 import random
3
4 f = 40
5 t = AnnoyIndex(f) # Length of item vector that will be indexed

ImportError: cannot import name AnnoyIndex

the log of install:
Running setup.py install for annoy
Successfully installed annoy-1.5.1

And my OS is redhat 7

thanks~

Use LMDB for storage

@LongbinChen pointed out that LMDB would be a good way to support a series of things.

http://symas.com/mdb/microbench/

If we swap out the storage to rely on LMDB we would get a lot of features for free

  • Ability to use any key, not just integers
  • Out-of-core data (dataset doesn't fit in RAM)
  • Decouple index structure from nodes (could potentially set _K independently and optimize for it).
  • Incremental add_item (and deletes) would be much easier to implement

I'm not sure how tricky this would be, but probably not more than 5-10h. Mentioning it in case someone wants to give it a shot :)

When the data size is too large to fit into the memory?

I've got 50 million 1000-dim vectors that need to be indexed, but I don't have that much RAM to hold all the data while building the trees. Is it possible to modify annoy to support that amount of data? Any hint on how to do that?

double save turns process into zombie on OS X

A very easy way to turn my python process into a zombie that can't be killed even with -9, and blocks my Terminal tab:

from annoy import *
import random

def main():
    f = 10
    t = AnnoyIndex(f)
    for i in xrange(0):
        v = [random.gauss(0, 1) for z in xrange(f)]
        t.add_item(i, v)
    t.build(-1)
    t.save("t.ann")
    t.save("t.ann")

In fact, this is all I need:

from annoy import *
t = AnnoyIndex(10)
t.save("t.ann")
t.save("t.ann")

This happens on an OS X Yosemite. The same thing simply coredumps on Ubuntu.

idea for faster/better tree structure

What if at every split we just sample two items and compute the hyperplane that separates them?

For angular distance you would normalize the two items first. This would also generalize to arbitrary pseudometrics nicely: Manhattan, KL-divergence, etc.

For high-dimensional distributions where there's structure in the data, this is more likely to sample from that structure. It will also be much faster in general since we are guaranteed to split the point set into two subsets.

Calculate NN for only one vector

Hi,

In case I want to get the NNs for only one vector A1, the "build" function needs to calculate NNs for all vectors A1..n. But I don't need the NNs for vectors other than A1, so calculating NNs for all other vectors is a waste of time.
Is there any good way to calculate the NNs for only one vector?

Parallelize build?

Conceptually, if the trees are independent, shouldn't we be able to build the trees in parallel across multiple cores? Is this something that could be implemented easily?

I haven't had a chance to dive deeply into the code/algorithm yet so I may be misunderstanding something.

install from pip broken

Tried both on an EC2 machine and on my laptop in a new virtualenv. Same error:

Downloading/unpacking annoy
  Downloading annoy-1.0.2.tar.gz
  Running setup.py (path:/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py) egg_info for package annoy
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>
        for line in open('README.rst'):
    IOError: [Errno 2] No such file or directory: 'README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>

    for line in open('README.rst'):

IOError: [Errno 2] No such file or directory: 'README.rst'

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /Users/rogueleaderr/tmp/tenv/build/annoy
Storing debug log for failure in /Users/rogueleaderr/.pip/pip.log

Seems to install okay if I clone the repo and do

python setup.py install

But it would be nice to install in one line from pip.

Vector metadata

It would be convenient to associate each vector with metadata which would then be saved in the index on disk. Currently, you need to maintain a separate data structure, which is inconvenient in some cases.

can't install with Python3

Hi. Newbie to Github here. Thanks for this wonderful piece of software.

I think I've hit a small bug in the install with Python 3:

(annoy)horace89@horace89:~/tst1/annoy$ python -V
Python 3.4.0
(annoy)horace89@horace89:~/tst1/annoy$ pip install annoy              
Collecting annoy
  Using cached annoy-1.4.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-ygnacgga/annoy/setup.py", line 34, in <module>
        long_description = readme_note + fobj.read()
      File "/home/horace89/.virtualenvs/annoy/lib/python3.4/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3767: ordinal not in range(128)

If I build a virtualenv with Python 2.7, the error disappears and the build is ok.
If I use a virtualenv with Python 3.4 and edit the setup.py file to comment out fobj.read():
with open('README.rst') as fobj: long_description = readme_note  # + fobj.read()
then the error disappears and the build is ok.
So I think there's an issue with the utf-8 encoding.

annoy compilation fail

Installing annoy from repo on EC2 m3.2xlarge instance (Ubuntu 13.10):

$ uname -a
Linux m3-2xlarge 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

fails with

$ python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/annoy
copying annoy/__init__.py -> build/lib.linux-x86_64-2.7/annoy
running build_ext
building 'annoy.annoylib' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoylib.cc -o build/temp.linux-x86_64-2.7/src/annoylib.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Angular<float>]’:
src/annoylib.cc:542:30:   required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117:   required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
     add_item(item, &w[0]);
                         ^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Angular<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31:   required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117:   required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
     get_nns_by_vector(&w[0], n, &result);
                                        ^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Euclidean<float>]’:
src/annoylib.cc:542:30:   required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121:   required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
     add_item(item, &w[0]);
                         ^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Euclidean<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31:   required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121:   required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
     get_nns_by_vector(&w[0], n, &result);
                                        ^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I tried adding -fpermissive to CFLAGS, which changes the errors into warnings. Then annoy builds and installs, but doesn't work correctly anymore. It returns fewer than requested/nonsense NNs:

======================================================================
FAIL: test_get_nns_by_item (__main__.AngularIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/_annoy_test.py", line 41, in test_get_nns_by_item
    self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]

======================================================================
FAIL: test_large_index (__main__.EuclideanIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/_annoy_test.py", line 129, in test_large_index
    self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [104, 105] != [356, 357]

etc.

Any idea what's going on?

Segfault if you save() without doing build()

It's a PEBKAC error but it's sort of user unfriendly.

Program received signal SIGSEGV, Segmentation fault.
*__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
3710    malloc.c: No such file or directory.
        in malloc.c
(gdb) where
#0  *__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
#1  0x00007fffefff95bc in AnnoyIndexPython<float, Angular<float> >::save_py(std::string const&) () from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#2  0x00007fffefff44b7 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (AnnoyIndexPython<float, Angular<float> >::*)(std::string const&), boost::python::default_call_policies, boost::mpl::vector3<void, AnnoyIndexPython<float, Angular<float> >&, std::string const&> > >::operator()(_object*, _object*) ()
   from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#3  0x00007fffefdb7bde in boost::python::objects::function::call(_object*, _object*) const () from /usr/lib/libboost_python-py26.so.1.42.0
#4  0x00007fffefdb7e88 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#5  0x00007fffefdbf703 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const () from /usr/lib/libboost_python-py26.so.1.42.0
#6  0x00007fffefff2c67 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<ErrnoException, void (*)(ErrnoException const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(ErrnoException const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) ()
   from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#7  0x00007fffefdbf4d3 in boost::python::handle_exception_impl(boost::function0<void>) () from /usr/lib/libboost_python-py26.so.1.42.0
#8  0x00007fffefdb4788 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#9  0x000000000041f117 in PyObject_Call ()
#10 0x00000000004a74e8 in PyEval_EvalFrameEx ()
#11 0x00000000004a97f1 in PyEval_EvalCodeEx ()
#12 0x00000000004a98c2 in PyEval_EvalCode ()
#13 0x00000000004c9aee in PyRun_FileExFlags ()
#14 0x00000000004c9d04 in PyRun_SimpleFileExFlags ()
#15 0x000000000041a89d in Py_Main ()
#16 0x00007ffff69e7cad in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>,
    fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe158) at libc-start.c:228
#17 0x0000000000419a99 in _start ()

About distances

Hi there. I'm trying to use Annoy to get distance information using 'angular' (or cosine distance). But I can't get it right.

My code is below:

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be indexed
for i in xrange(1000):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees

v = [random.gauss(0, 1) for z in xrange(f)]
data = t.get_nns_by_vector(v, 5, include_distances=True)  # will find the 5 nearest neighbors
print data

I expected an output with distances less than 1. But what I got is:

([227, 987, 542, 302, 407], [0.9761863350868225, 1.085146427154541, 1.0856424570083618, 1.1076160669326782, 1.116296648979187])

Would you please tell me what is wrong?

Windows support?

The readme suggests that pip install annoy should work on Windows, but I get the following error:

c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory

My googling tells me that unistd.h is a Linux-only file.

Is there an easy fix for this?

Thanks for the awesome package!

And in case the full error message is useful:

C:\Anaconda>pip install annoy
Collecting annoy
  Downloading annoy-1.5.2.tar.gz (625kB)
    100% |################################| 626kB 409kB/s
Building wheels for collected packages: annoy
  Running setup.py bdist_wheel for annoy
  Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\set
up.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:

  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-2.7
  creating build\lib.win-amd64-2.7\annoy
  copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
  running build_ext
  building 'annoy.annoylib' extension
  creating build\temp.win-amd64-2.7
  creating build\temp.win-amd64-2.7\Release
  creating build\temp.win-amd64-2.7\Release\src
  C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inclu
de -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
  cl : Command line warning D9002 : ignoring unknown option '-O3'
  cl : Command line warning D9002 : ignoring unknown option '-march=native'
  cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
  annoymodule.cc
  C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unw
ind semantics are not enabled. Specify /EHsc
  c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or d
irectory
  error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

  ----------------------------------------
  Failed building wheel for annoy
Failed to build annoy
Installing collected packages: annoy
  Running setup.py install for annoy
    Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9
\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\
s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    running build_ext
    building 'annoy.annoylib' extension
    C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inc
lude -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
    cl : Command line warning D9002 : ignoring unknown option '-O3'
    cl : Command line warning D9002 : ignoring unknown option '-march=native'
    cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
    annoymodule.cc
    C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but u
nwind semantics are not enabled. Specify /EHsc
    c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or
 directory
    error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

    ----------------------------------------
Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(
compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip
-5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-j
intu9\annoy

Out of memory under OS X when adding items to index

First of all, thanks for open-sourcing this. The landscape of ANN solutions is quite sparse, so it's great to see something happening :)

I've been doing small tests under OS X with a few dozen items so far. Since the results looked very promising I decided to throw some more data at annoy.
However, during the first run I immediately ran out of memory and had to restart my notebook (16GB RAM). The test involved iterating over 500 items, each containing one hundred 128-dimensional vectors (SIFT image descriptors).

I assumed that the vectors were too large and therefore reduced the dimensions to 64 and later 32. This didn't have any effect on the memory usage though. I then decided to reduce the number of items down to 50 and then 25. This didn't make a difference either.

Are there any known problems with annoy under OS X?
Also, can you give recommendations regarding a system configuration where annoy is known to run well? I'd be happy to run this inside a VM if necessary.

Thanks,
Chris
