
autofaiss's People

Contributors

bamine, davnn, dependabot[bot], dobraczka, evaia, hitchhicker, josephcappadona, mbompr, nateagr, quentin-auge, rom1504, victor-paltz

autofaiss's Issues

Distributed training

Hi,
Thanks to all maintainers of this project; it's a great tool for streamlining the building and tuning of a Faiss index.

I have a quick question about training an index in distributed mode. Am I correct that the training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and docs, I believe that's the case. If so, would it be possible to train the index in a distributed fashion?

Control verbosity of messages

Hi, thanks for this library, it really helps when working with faiss! One minor problem: I would like to control the verbosity of the messages, since I use autofaiss inside my own library. The simplest way to do that would probably be through Python's logging module.

Is there anything planned in that regard?
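For example, if autofaiss routed its messages through Python's logging module, downstream code could tune the verbosity itself. A minimal sketch, assuming a module-level logger named "autofaiss" (an assumption, not a documented API):

import logging

# Assumption: autofaiss logs via logging.getLogger("autofaiss").
# Raise its level so only warnings and errors reach the application's handlers.
logging.getLogger("autofaiss").setLevel(logging.WARNING)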

Torch Tensor support?

I want to ask whether KNN search with torch tensors is supported. Many thanks!
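If torch tensors are not accepted directly, one workaround is to convert them to a contiguous float32 numpy array before calling build_index. A sketch (the tensor here is hypothetical):

import numpy as np
import torch
from autofaiss import build_index

embeddings_t = torch.rand(1000, 512)  # hypothetical torch tensor
# Faiss expects contiguous float32 numpy arrays, so move to CPU and convert.
embeddings = np.ascontiguousarray(embeddings_t.detach().cpu().numpy(), dtype=np.float32)
index, index_infos = build_index(embeddings, save_on_disk=False)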

Make ingestion pipeline require less disk space

Currently the flow is:

  • download a large amount of embeddings
  • convert to numpy
  • run autofaiss to produce an index

It works well but requires a large amount of disk space.

It's possible instead to do download -> convert -> add for each part of the embedding collection (and remove the temporary files before moving on to the next part); a sketch of this loop is below.
One way to do this could be to open-source the pyspark job that does this.
It could also be implemented directly in Python here.

A simpler option could also be better support for remote file systems directly in quantize.
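A rough sketch of the incremental loop using faiss directly (not the autofaiss implementation; the index key, dimension, file layout and number of parts are hypothetical):

import os
import numpy as np
import faiss

d = 512  # hypothetical embedding dimension
index = faiss.index_factory(d, "OPQ32,IVF1024,PQ32")  # hypothetical index key

# Train once on a sample that fits in memory, then add the parts one by one.
training_sample = np.float32(np.random.rand(100_000, d))
index.train(training_sample)

for part in range(10):  # hypothetical number of parts
    path = f"embeddings/part{part}.npy"  # assume this part was downloaded and converted already
    index.add(np.float32(np.load(path)))
    os.remove(path)  # free the disk space before fetching the next part

faiss.write_index(index, "knn.index")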

Vector normalization while building index

Hi!
According to the docs, faiss doesn't natively support cosine similarity as a distance metric. The closest one is inner product, which additionally requires pre-normalizing the embedding vectors. In the FAQ the authors propose doing this manually with their function faiss.normalize_L2.
I have exactly this case and would be glad if autofaiss had an optional flag to pre-normalize the vectors before building the index.
It seems to me that it's not so difficult: one would add faiss.normalize_L2 in each place where we iterate over embedding_reader. If so, I can make a PR.
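For reference, the manual workaround that can already be used before calling build_index (a sketch of the existing workaround, not of the proposed flag):

import faiss
import numpy as np
from autofaiss import build_index

embeddings = np.float32(np.random.rand(10_000, 768))  # hypothetical embeddings
# normalize_L2 works in place on a contiguous float32 array; after this,
# maximum inner product search is equivalent to cosine similarity.
faiss.normalize_L2(embeddings)
index, index_infos = build_index(embeddings, save_on_disk=False, metric_type="ip")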

make autofaiss not use TemporaryDirectory

TemporaryDirectory points to a local folder which may not have enough room.
The user should specify the temporary folder (in fact we already have an option for this).

x8 vs x4fsr

INFO:autofaiss: Computing best hyperparameters for index faiss_titles.faiss 05/05/2022, 07:16:53                                                            
WARNING:autofaiss:The maximum nearest neighbors coverage is 10.65% for this index. It means that when requesting 20 nearest neighbors, the average number of retrieved neighbors will be 2. The program will try to find the best hyperparameters to reach 95% of this max coverage at least, and then will optimize the search time for this target. The index search speed could be higher than the requested max search speed.

What can we do to prevent this?

This happened with "OPQ768_768,IVF262144_HNSW32,PQ768x8" -> bad max coverage
With the index_key "OPQ768_768,IVF262144_HNSW32,PQ768x4fsr", everything was ok. The vectors were just a bit too compressed.

My d is 768.

Thank you

fix estimation of training memory used by autofaiss

I just tried it and the new estimation at https://github.com/criteo/autofaiss/pull/81/files doesn't fully capture the memory needed for training.

When training an index such as OPQ32_224,IVF131072_HNSW32,PQ32x8, faiss trains the index in 2 steps.
The first step does seem to use the memory assumed by the current estimation (for example 21.5GB for 11M vectors of dimension 512), but the second step uses some more RAM.
I am not sure yet what these 2 steps are, but I'd guess something like a primary then a secondary index.

Let's figure it out, then add some more tests for this (these could be scheduled tests instead of tests that run on every commit).

module 'faiss' has no attribute 'swigfaiss'

python 3.8.12
autofaiss                 2.13.2                   pypi_0    pypi
faiss-cpu                 1.7.2                    pypi_0    pypi
libfaiss                  1.7.2            h2bc3f7f_0_cpu    pytorch

First of all, thank you for the great project! I get the error: module 'faiss' has no attribute 'swigfaiss' when running the following command:

import autofaiss

autofaiss.build_index(
    "embeddings.npy",
    "autofaiss.index",
    "autofaiss.json",
    metric_type="ip",
    should_be_memory_mappable=True,
    make_direct_map=True)

The error appears when running with make_direct_map=True.

Tested with conda 4.11.0 and mamba 0.15.3, using either the pytorch or conda-forge channel.

Fix potential out of disk problem when producing N indices

When we produce N indices (with nb_indices_to_keep larger than 1), the function optimize_and_measure_indices downloads the N indices from remote storage in one shot (see here). If the machine running autofaiss has limited disk space, this fails with a "No space left on device" error.

Add a function to load npz vectors

Hi! I have numpy matrices saved as npz files. Unfortunately autofaiss supports only npy. Could you add that functionality?
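In the meantime, an npz archive can be unpacked into npy files that autofaiss already understands. A sketch (file names and directory are hypothetical):

import numpy as np

archive = np.load("embeddings.npz")  # hypothetical npz file
# Write each array stored in the archive out as a standalone .npy file.
for name in archive.files:
    np.save(f"embeddings_dir/{name}.npy", np.float32(archive[name]))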

Make current available memory properly aggregate all the memory needs

  • the final index size should be subtracted from the amount of memory the adding step is allowed to use
  • the untrained index size should be subtracted from the amount of memory the training step is allowed to use

This would make it possible to have stronger guarantees about how much memory autofaiss uses.

[Feature Request:] Add new features to a previously built index

Right now there does not seem to be an easy way to take an already-built index and add more embeddings to it (from the same distribution). This is indirectly supported by autofaiss, since distributed training already does it, and it is also easily supported by the faiss backbone. But I wonder if we can expose a simple interface that takes a built index and adds more embeddings from a new set (using all the bells and whistles provided by autofaiss/embedding-reader for reading embeddings in the numpy/parquet formats). Perhaps an update_index interface?

Thanks!
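Until such an interface exists, one possible fallback is to use faiss directly on the built index, assuming it is an addable, already-trained index type (e.g. IVF). A sketch, not an autofaiss API:

import faiss
import numpy as np

index = faiss.read_index("knn.index")  # index previously built by autofaiss
new_embeddings = np.float32(np.random.rand(10_000, 512))  # hypothetical new vectors
index.add(new_embeddings)  # works because the index is already trained
faiss.write_index(index, "knn_updated.index")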

get_optimal_index_keys_v2 support faiss AutoTune

def get_optimal_index_keys_v2(
    nb_vectors: int,
    dim_vector: int,
    max_index_memory_usage: str,
    flat_threshold: int = 1000,
    quantization_threshold: int = 10000,
    force_pq: Optional[int] = None,
    make_direct_map: bool = False,
    should_be_memory_mappable: bool = False,
    ivf_flat_threshold: int = 1_000_000,
    use_gpu: bool = False,
) -> List[str]:
    """
    Gives a list of interesting indices to try, *the one at the top is the most promising*
    See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index for
    detailed explanations.
    """
    # Exception cases:

Make embedding iterator faster on high latency file systems

s3 and hdfs are high latency, high bandwidth file systems.
On these file systems, fetching files sequentially is slow.
Today our embedding iterator reads files sequentially.

This could be made faster by reading files in parallel, or even parts of files in parallel, using pyarrow readers, which use threads internally.
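For illustration, a rough sketch of reading several parquet embedding files concurrently with a thread pool and pyarrow (not the autofaiss implementation; paths and column name are hypothetical):

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pyarrow.parquet as pq

paths = [f"embeddings/part-{i:05d}.parquet" for i in range(16)]  # hypothetical layout
embedding_column = "embedding"  # hypothetical column name

def read_part(path):
    # pyarrow releases the GIL during I/O and decoding, so threads overlap the latency.
    table = pq.read_table(path, columns=[embedding_column])
    return np.float32(np.vstack(table[embedding_column].to_pylist()))

with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(read_part, paths))

embeddings = np.concatenate(parts)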

build_index can't handle empty numpy files

Hello,

I'm currently running a workflow in argo which is generating several embedding files, in parallel, based on a database search.
If no data was found, the workflow returns an empty numpy file:

np.save(os.path.join(output, "features", filename), np.empty(0, np.float32))

Sadly, build_index is not capable of handling those files:

Using 4 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 04/08/2022, 09:54:53
Reading total number of vectors and dimension 04/08/2022, 09:54:53

  0%|          | 0/16 [00:00<?, ?it/s]
 19%|█▉        | 3/16 [00:00<00:00, 29.92it/s]
 56%|█████▋    | 9/16 [00:00<00:00, 87.73it/s]
>>> Finished "Reading total number of vectors and dimension" in 0.1517 secs
>>> Finished "Launching the whole pipeline" in 0.1517 secs
Traceback (most recent call last):
  File "/usr/local/bin/autofaiss", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 395, in main
    fire.Fire({"build_index": build_index, "tune_index": tune_index, "score_index": score_index})
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 143, in build_index
    nb_vectors, vec_dim = read_total_nb_vectors_and_dim(
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 258, in read_total_nb_vectors_and_dim
    for c in p.imap_unordered(file_to_line_count, file_paths):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 252, in file_to_line_count
    return matrix_reader.get_row_count()
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 101, in get_row_count
    return self.get_shape()[0]

It would be great if it could handle them, by just showing a warning in the logs, or with a flag to allow it.
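In the meantime, one workaround is to filter out the empty parts before calling build_index by inspecting only the .npy headers. A sketch (the directory layout is hypothetical, and np.save is assumed to write format version 1.0, its default for plain float arrays):

import glob
import numpy as np
from numpy.lib import format as npy_format

def npy_is_empty(path):
    # Read only the .npy header to get the shape, without loading the data.
    with open(path, "rb") as f:
        npy_format.read_magic(f)
        shape, _fortran_order, _dtype = npy_format.read_array_header_1_0(f)
    return shape[0] == 0

files = [f for f in glob.glob("features/*.npy") if not npy_is_empty(f)]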

Windows parallelization

Hi! Thank you for the great project! Unfortunately I'm experiencing some issues, which could be caused by Windows (10 Pro) and I'm not sure how to solve them.

I installed autofaiss with conda into a new env with Python 3.6. First, I had problems with the import:
ImportError: DLL load failed while importing _swigfaiss: The specified module could not be found.

I solved that by first installing openblas, numpy and faiss from conda-forge:
conda create --name faiss_env python=3.6
conda activate faiss_env
conda install conda-forge::blas=*=openblas
conda install -c conda-forge numpy
conda install -c conda-forge faiss
pip install autofaiss

Then I tried to run the example from the README, but I encountered an error in embedding_reader:

~\.conda\envs\faiss_env\lib\site-packages\embedding_reader\get_file_list.py in _get_file_list(path, file_format, sort_result)
     42     path = make_path_absolute(path)
     43     fs, path_in_fs = fsspec.core.url_to_fs(path)
---> 44     prefix = path[: path.index(path_in_fs)]
ValueError: substring not found

I found out that the problem is in the fsspec.core.url_to_fs method, namely in the private method _strip_protocol on line 402 of fsspec\core.py:
urlpath = fs._strip_protocol(url)
This line changes backward slashes to forward slashes and therefore the substring path_in_fs is not found in the string path.

Now comes the incomprehensible part: when I changed the private method _strip_protocol to the public method strip_protocol (I only deleted the leading underscore), the ValueError disappeared and the function preserved the backward slashes in the path... but then another error appeared:
RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *) at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98: Error: 'f' failed: could not open C:\Users\USER\AppData\Local\Temp\tmp2jqscc1t for writing: Permission denied

This seems to me like a parallelization problem and I don't know how to solve it. I suppose that my fix for the ValueError was not the correct one and there is still some problem with the Windows implementation.

Can you give me some advice on how to find a solution to this?

Thanks!

add option to save keys from parquet embeddings into a new parquet collection

To avoid reading the embeddings parquet a second time, we could consider extracting, yielding and saving the keys from the parquet files in the read-embeddings function.
These keys could be saved either as parquet or in some format convenient for fast random access (e.g. arrow or hdf5 for one-way lookup, leveldb for two-way).
That would probably be convenient, but let's keep this for another PR.

(Another option is to do this in a separate utility that would read only the key column; it remains to be seen which is best.)
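The key-column-only utility would be cheap thanks to parquet's columnar layout. A minimal sketch (file paths and column name are hypothetical):

import pyarrow.parquet as pq

# Reading only the key column skips the much larger embedding column entirely.
keys = pq.read_table("embeddings/part-00000.parquet", columns=["key"])
pq.write_table(keys, "keys/part-00000.parquet")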

build_index is very slow

machine:

  • cpu-machine:Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  • mem: 32G
  • cpu-cores: 16

code:

from autofaiss import build_index
import numpy as np

embeddings = np.float32(np.random.rand(1000000, 512))
index, index_infos = build_index(embeddings, save_on_disk=False)

log:
(screenshot of the build log, not reproduced here)

Add all parameters from doc to readme

    embeddings_path: str
        Local path containing all preprocessed vectors and cached files.
        Files will be added if empty.
    output_path: str
        Destination path of the quantized model on the local machine.
    index_key: Optional(str)
        Optional string to give to the index factory in order to create the index.
        If None, an index is chosen based on a heuristic.
    index_param: Optional(str)
        Optional string with hyperparameters to set on the index.
        If None, the hyperparameters are chosen based on a heuristic.
    max_index_query_time_ms: float
        Bound on the query time for KNN search; this bound is approximate.
    max_index_memory_usage: str
        Maximum size allowed for the index; this bound is strict.
    current_memory_available: str
        Memory available on the machine creating the index; having more memory is a boost
        because it reduces the swapping between RAM and disk.
    use_gpu: bool
        Experimental, gpu training is faster, not tested so far.
    metric_type: str
        Similarity function used for the query:
            - "ip" for inner product
            - "l2" for euclidean distance

use merging strategy in non-pyspark mode as well

The strategy of creating a few small indices reduces the memory usage during adding and (if using the special merge-on-disk function) completely caps the memory used by autofaiss in general, making it possible to create arbitrarily big indices with a fixed amount of RAM.

Let's use that strategy not only in pyspark mode, but also in the normal mode.
Producing N indices in normal mode should also be possible by reusing the code from the distributed path.

multi index ideas

  • building one index or a thousand indices from one embedding set has the same cost if doing one training and grouping at read time (this allows one index per strict category)
  • building N index parts then merging may make it easier to parallelize reading and building. It could also postpone the memory cost to merge time, which might be beneficial (for example it unlocks building in many memory-constrained executors then merging on one big machine afterwards, or maybe even merging with memory mapping so that the merge itself uses no memory)

some info at https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk and https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp

Tests

  • check hnsw size > flat size (a test sketch is below)
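A possible test along these lines, comparing serialized sizes of raw faiss indexes on random data (a sketch, not existing autofaiss test code):

import faiss
import numpy as np

def test_hnsw_index_is_larger_than_flat():
    # HNSW stores graph links on top of the raw vectors, so its serialized
    # size should exceed that of a flat index built on the same data.
    xb = np.float32(np.random.rand(1000, 64))
    flat = faiss.index_factory(64, "Flat")
    hnsw = faiss.index_factory(64, "HNSW32")
    flat.add(xb)
    hnsw.add(xb)
    assert len(faiss.serialize_index(hnsw)) > len(faiss.serialize_index(flat))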

decrease memory used by merging

Currently merging in distributed mode requires storing the whole index in memory.
Possible strategies:

  • improve the faiss merge-into step to avoid putting everything in memory
  • produce N indices instead of one and let the user search all of them at query time

Suspicious constant 1-recall score

I have trained 3 different indexes and every time my 1-recall@20 scores are exactly the same:

INFO:autofaiss: 1-recall@20: 0.802
INFO:autofaiss: 1-recall@40: 0.824

But there is some variation in the 20-recall and 40-recall scores.

Agreement to three decimal places seems too much of a coincidence.

What do you think about it?

add_with_ids is not implemented for Flat indexes

Hello, I'm encountering an issue using autofaiss with flat indexes.
build_index raises an error in distributed mode for flat indexes (in my case the embeddings are an ndarray; I did not test with parquet embeddings). This error could be related to facebookresearch/faiss#1212 (the method index.add_with_ids is not implemented for flat indexes).

import numpy as np
from autofaiss import build_index

build_index(
    embeddings=np.ones((100, 512)),
    distributed="pyspark",
    should_be_memory_mappable=True,
    index_path="hdfs://root/user/foo/knn.index",
    index_key="Flat",
    nb_cores=20,
    max_index_memory_usage="32G",
    current_memory_available="48G",
    ids_path="hdfs://root/user/foo/test_indexing_out/ids",
    temporary_indices_folder="hdfs://root/user/foo/indices/tmp/",
    nb_indices_to_keep=5,
    index_infos_path="hdfs://root/user/r.laby/test_indexing_out/index_infos.json",
)

raises

RuntimeError: Error in virtual void faiss::Index::add_with_ids(faiss::Index::idx_t, const float*, const idx_t*) at /project/faiss/faiss/Index.cpp:39: add_with_ids not implemented for this type of index

Is this expected? Or could it be fixed?
Thanks!
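One possible workaround on the faiss side (a sketch, not necessarily how autofaiss should fix it) is to wrap the flat index in an IDMap, which does implement add_with_ids by storing an explicit id mapping:

import faiss
import numpy as np

d = 512
index = faiss.index_factory(d, "IDMap,Flat")  # plain "Flat" has no add_with_ids
embeddings = np.float32(np.ones((100, d)))
ids = np.arange(100, dtype=np.int64)
index.add_with_ids(embeddings, ids)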

GPU on A100

import numpy as np
from autofaiss import build_index

embeddings = np.float32(np.random.rand(700, 700))


build_index(
    embeddings=embeddings,  # type: ignore
    index_path="knn.index",
    index_infos_path="infos.json",
    should_be_memory_mappable=True,
    use_gpu=True,
)

On my A100, setting use_gpu=True breaks the flow.

get_optimal_index_keys_v2 returns an empty list

I am using autofaiss 2.14.0 and it works for some parts of the data I am working on, but not for others. I keep getting this error and I do not know where to look:

2022-04-21 17:46:40,649 [INFO]: There are 16325691 embeddings of dim 768
2022-04-21 17:46:40,653 [INFO]: >>> Finished "Reading total number of vectors and dimension" in 37.7308 secs
2022-04-21 17:46:40,653 [INFO]:         Compute estimated construction time of the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,659 [INFO]:                 -> Train: 16.7 minutes
2022-04-21 17:46:40,659 [INFO]:                 -> Add: 2.3 minutes
2022-04-21 17:46:40,659 [INFO]:                 Total: 19.0 minutes
2022-04-21 17:46:40,659 [INFO]:         >>> Finished "Compute estimated construction time of the index" in 0.0057 secs
2022-04-21 17:46:40,659 [INFO]:         Checking that your have enough memory available to create the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,802 [INFO]:         >>> Finished "Checking that your have enough memory available to create the index" in 0.1431 secs
2022-04-21 17:46:40,803 [INFO]: >>> Finished "Launching the whole pipeline" in 37.8808 secs
Traceback (most recent call last):
  File "process.py", line 26, in <module>
    chunks_to_precalculated_knn_(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 373, in chunks_to_precalculated_knn_
    index, embeddings = chunks_to_index_and_embed(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 334, in chunks_to_index_and_embed
    index = index_embeddings(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 288, in index_embeddings
    build_index(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 224, in build_index
    necessary_mem, index_key_used = estimate_memory_required_for_index_creation(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/build.py", line 46, in estimate_memory_required_for_index_creation
    index_key = get_optimal_index_keys_v2(
IndexError: list index out of range

Misunderstanding of the estimated computing time

I am not sure whether I am misunderstanding something or there is an error, but when building my index, autofaiss reports "Train: 16.7 minutes" while the whole pipeline actually finishes in ~11 secs (Finished "Launching the whole pipeline" in 11.1440 secs)?

Using 16 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 01/28/2022, 08:15:47
There are 4269 embeddings of dim 1024
	Compute estimated construction time of the index 01/28/2022, 08:15:47
		-> Train: 16.7 minutes
		-> Add: 0.0 seconds
		Total: 16.7 minutes
	>>> Finished "Compute estimated construction time of the index" in 0.0000 secs
	Checking that your have enough memory available to create the index 01/28/2022, 08:15:47
20.6MB of memory will be needed to build the index (more might be used if you have more)
	>>> Finished "Checking that your have enough memory available to create the index" in 0.0009 secs
	Selecting most promising index types given data characteristics 01/28/2022, 08:15:47
	>>> Finished "Selecting most promising index types given data characteristics" in 0.0000 secs
	Creating the index 01/28/2022, 08:15:47
		-> Instanciate the index HNSW15 01/28/2022, 08:15:47
		>>> Finished "-> Instanciate the index HNSW15" in 0.0036 secs
The index size will be approximately 17.2MB
The memory available for adding the vectors is 7.0GB(total available - used by the index)
Will be using at most 1GB of ram for adding
		-> Adding the vectors to the index 01/28/2022, 08:15:47
Using a batch size of 244140 (memory overhead 953.7MB)
100%|██████████| 1/1 [00:00<00:00, 74.53it/s]		>>> Finished "-> Adding the vectors to the index" in 0.1602 secs
	>>> Finished "Creating the index" in 0.1647 secs
	Computing best hyperparameters 01/28/2022, 08:15:47

	>>> Finished "Computing best hyperparameters" in 3.3091 secs
The best hyperparameters are: efSearch=21
	Compute fast metrics 01/28/2022, 08:15:50
2000
	>>> Finished "Compute fast metrics" in 7.6499 secs
	Saving the index on local disk 01/28/2022, 08:15:58
	>>> Finished "Saving the index on local disk" in 0.0091 secs
Recap:
{'99p_search_speed_ms': 30.39110283832997,
 'avg_search_speed_ms': 3.7983315605670214,
 'compression ratio': 0.9678652870286923,
 'index_key': 'HNSW15',
 'index_param': 'efSearch=21',
 'nb vectors': 4269,
 'reconstruction error %': 0.0,
 'size in bytes': 18066382,
 'vectors dimension': 1024}
>>> Finished "Launching the whole pipeline" in 11.1440 secs
