The full log: Faiss assertion err == CUBLAS_STATUS_SUCCE

index.add(xb)  # error happens here

faiss::gpu::runMatrixMult failure (facebookresearch/faiss)

Comments (37)

tigert1998 commented on May 27, 2024

conda install -c conda-forge faiss-gpu
This fixed it for me.

wickedfoo commented on May 27, 2024

Is it possible that you ran out of GPU memory?

wickedfoo commented on May 27, 2024

What were you trying to run?

hellolovetiger commented on May 27, 2024

train data shape: (2000000, 1000)
base data shape: (20000000, 1000)
query data shape: (1000000, 1000)
data type: float32

My code:

index = faiss.index_factory(d, "OPQ16_512,IVF4096,PQ16")
co = faiss.GpuClonerOptions()
co.usePrecomputed = False
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happens here

My GPU has 8 GB of memory. I just tried the benchmark bench_gpu_sift1m.py and got the same error.

wickedfoo commented on May 27, 2024

index.add(xb)  # error happens here

Instead of passing all of the (20000000, 1000) array at once, try passing it in chunks of (10000, 1000) or so.
This is an issue that will be fixed at some point: the GPU side is less friendly unless you chunk the input beforehand, but eventually we'll handle that automatically.
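A minimal sketch of the chunked add suggested above, assuming `index` is any object with a Faiss-style `add(x)` method that takes an (n, d) float32 array. The helper name `add_in_chunks` and the default of 10000 rows per call are illustrative, not part of the Faiss API:

```python
import numpy as np


def add_in_chunks(index, xb, chunk_size=10000):
    """Add the rows of xb to index, at most chunk_size rows per add() call.

    index: any object with a Faiss-style add(x) taking an (n, d) float32 array.
    xb: an (n, d) array; a numpy.memmap also works.
    Returns the number of rows added.
    """
    n = xb.shape[0]
    for start in range(0, n, chunk_size):
        # Faiss expects C-contiguous float32 input; convert this slice only
        # (no copy is made when it is already in that form).
        chunk = np.ascontiguousarray(xb[start:start + chunk_size], dtype=np.float32)
        index.add(chunk)
    return n
```

With the code from this thread, `index.add(xb)` would become `add_in_chunks(index, xb)`. Because only one slice is handled per call, this also works when `xb` is a `numpy.memmap`: just the rows of the current chunk need to be paged in.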

wickedfoo commented on May 27, 2024

Only GpuIndexFlat* handles passing large amounts of data all at once for add or search at present.

hellolovetiger commented on May 27, 2024

I see. Actually, I used numpy.memmap to load the data. Sorry, could you give me some guidance on how to chunk the input data so that it can be passed to index.add?
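One way to open such a file, assuming (hypothetically) that the base vectors were saved as a raw row-major float32 dump with no header, for example via `xb.astype('float32').tofile(path)`. The helper name `open_float32_memmap` is illustrative; it maps the file without reading it and infers the number of rows from the file size:

```python
import os

import numpy as np


def open_float32_memmap(path, d):
    """Map a raw little-endian float32 file of shape (n, d) without reading it.

    Assumes a plain row-major dump with no header; n is inferred from the size.
    """
    nbytes = os.path.getsize(path)
    row_bytes = d * 4  # 4 bytes per float32 component
    if nbytes % row_bytes != 0:
        raise ValueError("file size is not a multiple of d * 4 bytes")
    n = nbytes // row_bytes
    return np.memmap(path, dtype="<f4", mode="r", shape=(n, d))
```

Slices such as `xb[i:i + 10000]` are then read from disk only when accessed, so feeding them to `index.add` one at a time keeps host memory bounded. Files in the `.fvecs` format from corpus-texmex.irisa.fr store a per-vector dimension header and would need different handling.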

hellolovetiger commented on May 27, 2024

Also, I notice that GPU memory usage during training is always about 20%. That's strange.

hellolovetiger commented on May 27, 2024

I just made some changes to the benchmark script bench_gpu_sift1m.py and still get the same error. Adding only the first 10000 vectors does not work either, so it does not seem to be a memory issue. Maybe something is wrong with cuBLAS. By the way, do you plan to publish an official Docker image to avoid problems caused by installation?

#################################################################
#  Approximate search experiment
#################################################################

print("============ Approximate search")

index = faiss.index_factory(d, "IVF4096,PQ64")

# faster, uses more memory
# index = faiss.index_factory(d, "IVF16384,Flat")

co = faiss.GpuClonerOptions()

# here we are using a 64-byte PQ, so we must set the lookup tables to
# 16 bit float (this is due to the limited temporary memory).
co.useFloat16 = True

index = faiss.index_cpu_to_gpu(res, 0, index, co)

print("train")

index.train(xt)

print("add vectors to index")

index.add(xb[:10000])

mdouze commented on May 27, 2024

Hi
Note that the code above will not work for 1000-dim data (because 1000 is not a multiple of 64).
We do not have plans for a Docker image.
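The constraint behind this is that product quantization splits each d-dimensional vector into M sub-vectors of d / M components, so M has to divide d. A small check, with an illustrative helper name and candidate list of M values:

```python
def valid_pq_m(d, candidates=(4, 8, 16, 32, 64)):
    """Return the PQ sub-quantizer counts M from candidates that divide d.

    PQ cuts each d-dim vector into M sub-vectors of d / M dims each,
    so only values with d % M == 0 are usable.
    """
    return [m for m in candidates if d % m == 0]
```

For d = 1000 only M = 4 and M = 8 survive from that list. In the "OPQ16_512,IVF4096,PQ16" string used earlier in the thread, the OPQ16_512 stage first maps the vectors to 512 dimensions, which 16 divides evenly.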

hellolovetiger commented on May 27, 2024

Hi, mdouze. The code above is from bench_gpu_sift1m.py. I used the data from http://corpus-texmex.irisa.fr/, following the instructions in https://github.com/facebookresearch/faiss/tree/master/benchs. I just wanted to check whether the benchmark code works. It turns out to give the same error as my own code.

mdouze commented on May 27, 2024

Ok, so this is the exact script bench_gpu_sift1m.py applied to the SIFT1M dataset and not your 20M*1000-dim dataset, correct?
On which type of GPU are you running this?

hellolovetiger commented on May 27, 2024

Yes to your first question.
My GPU is a GeForce GTX 1080.

mdouze commented on May 27, 2024

It could be the same bug as issue #8. Unfortunately we do not have the hardware to reproduce it, so we would be grateful if you could narrow down the error for us:

Does it still crash in the add?
If yes, could you add fewer vectors until it does not crash any more?
Could you set co.usePrecomputed = False and test again?
Could you reduce the 2 numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?
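Because a failed Faiss assertion aborts the whole process, one way to work through these checks is to run each configuration in a fresh interpreter. The sketch below uses hypothetical helper names: `factory_grid` builds the factory strings obtained by halving both numbers in "IVF4096,PQ64" (keeping only PQ sizes that divide d = 128, the SIFT1M dimension), and `crashes` reports whether a given Python snippet exits abnormally:

```python
import itertools
import subprocess
import sys


def factory_grid(nlists=(4096, 2048, 1024, 512), ms=(64, 32, 16, 8), d=128):
    """IVF/PQ factory strings with both numbers halved step by step.

    Only PQ sizes M that divide d are kept (d=128 is the SIFT1M dimension).
    """
    return ["IVF%d,PQ%d" % (nlist, m)
            for nlist, m in itertools.product(nlists, ms)
            if d % m == 0]


def crashes(code, timeout=600):
    """Run a Python snippet in a fresh interpreter; True if it exits abnormally.

    A failed Faiss assertion calls abort(), which would also kill the calling
    interpreter, so each experiment runs in its own process.
    """
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, timeout=timeout)
    return proc.returncode != 0
```

Each string would be plugged into a copy of the benchmark's indexing code (not shown, since it depends on local data paths) and passed to `crashes`.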

wickedfoo commented on May 27, 2024

You can also try running cuda-memcheck on bench_gpu_sift1m.py to see if anything gets printed out that does not look like the following:

========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaPointerGetAttributes. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x126239]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e44]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]
=========
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x11de53]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e65]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]

Another thing to try is resetting the GPU via nvidia-smi and running again.

Also, you could investigate which CUDA shared libraries it loads, to check for a mismatch if you have multiple CUDA SDK versions installed.
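On Linux, one way to see which CUDA libraries a process actually loaded is to read `/proc/self/maps` after `import faiss`. A sketch, assuming the libraries of interest are `libcublas`, `libcudart` and the driver's `libcuda`; the function names are illustrative:

```python
import re

# A path containing libcublas*, libcudart* or libcuda* followed by .so[...]
_CUDA_LIB = re.compile(r"(/\S*lib(?:cublas|cudart|cuda)\S*\.so[\w.]*)")


def cuda_libs_in_maps(maps_text):
    """Sorted unique CUDA shared-library paths found in /proc/<pid>/maps text."""
    return sorted(set(_CUDA_LIB.findall(maps_text)))


def loaded_cuda_libs():
    """Linux only: CUDA libraries mapped into this process (import faiss first)."""
    with open("/proc/self/maps") as f:
        return cuda_libs_in_maps(f.read())
```

If the result contains paths from more than one CUDA installation (for example `cuda-8.0` alongside another version), that is the kind of mismatch described above.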

wickedfoo commented on May 27, 2024

Also, I notice that my GPU memory occupation in training is always about 20%. That's strange.

Faiss GPU reserves about 18% of available GPU memory up front for scratch space. This amount is controllable via StandardGpuResources, but it will run slower if you decrease it by a lot (due to cudaMalloc/cudaFree overhead). 1-2 GB of scratch space seems to be appropriate for most workloads.
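A sketch of choosing that size explicitly. `StandardGpuResources.setTempMemory` is the Faiss call for this; the 18% default and the 1-2 GB range come from the comment above, while clamping the fraction into that range is an illustrative policy, not Faiss behavior. Helper names are hypothetical:

```python
GIB = 1024 ** 3


def pick_temp_memory(total_gpu_bytes, fraction=0.18, lo=1 * GIB, hi=2 * GIB):
    """Scratch-space size: a fraction of GPU memory, clamped to [lo, hi] bytes."""
    return max(lo, min(hi, int(total_gpu_bytes * fraction)))


def configure_scratch(res, total_gpu_bytes):
    """Apply the chosen size to a faiss.StandardGpuResources-like object."""
    size = pick_temp_memory(total_gpu_bytes)
    # Set this right after creating res, before any index uses it.
    res.setTempMemory(size)
    return size
```

Usage would look like `res = faiss.StandardGpuResources()` followed by `configure_scratch(res, 8 * GIB)` for an 8 GB card.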

hellolovetiger commented on May 27, 2024

For your questions:

could you add fewer vectors until it does not crash any more?
It will always crash no matter how small the number of vectors is.
could you set co.usePrecomputed = False and test again?
It works, but not for my own code. I will keep trying.
could you reduce the 2 numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?
It still fails as long as co.usePrecomputed = True.

Some other info:
Output of ldd gpu/test/demo_ivfpq_indexing_gpu:

linux-vdso.so.1 =>  (0x00007ffcc0066000)
libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007fa709dfd000)
liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007fa709661000)
libcublas.so.8.0 => /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0 (0x00007fa706cb1000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa706aa9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa70688b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa706687000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa706383000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa70607d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fa705e6e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa705c58000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa705893000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa70b606000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00007fa70408a000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fa703d70000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fa703b34000)

hellolovetiger commented on May 27, 2024

@wickedfoo

Here is the cuda-memcheck result (with co.usePrecomputed = True):

============ Approximate search
train
add vectors to index
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141
Aborted
========= Error: process didn't terminate successfully
========= Internal error (7)
========= No CUDA-MEMCHECK results found

Resetting the GPU via nvidia-smi doesn't help.
There is only one CUDA SDK installed: V8.0.44.

wickedfoo commented on May 27, 2024

Are you compiling with clang or gcc?

hellolovetiger commented on May 27, 2024

gcc

mdouze commented on May 27, 2024

I believe this is related to the GPU, similar to issue #8.

yhpku commented on May 27, 2024

I have the same problem. My GPU is a TITAN X. I want to index 1000000 vectors of dimension 512 using faiss.GpuIndexFlatL2, and it hits this issue. If I cut the number from 1000000 to 500000, it works normally. The maximum number of vectors seems to be 500000, because 600000 vectors also cause this problem. The following is my code:
import time
import numpy as np
import faiss

time_1 = time.time()
d = 1000000 # dimension
nb = 512 # database size
nq = 1000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02

res = faiss.StandardGpuResources()

index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
# index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)

print(index.ntotal)
print(' build the index time %f ms' % ((time.time() - time_1) * 1000))

time_1 = time.time()
k = 1                       # we want to see the nearest neighbor
D, I = index.search(xc[:1], k)
print(' search time %f ms' % ((time.time() - time_1) * 1000))
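Narrowing down the largest working size by hand (500000 works, 600000 fails) can be automated with a binary search. A sketch, assuming `ok(n)` is a hypothetical predicate that returns True when adding n vectors succeeds (in practice it would run the add in a separate process, since the assertion aborts the interpreter) and that success is monotone in n:

```python
def largest_ok(lo, hi, ok):
    """Largest n in [lo, hi] with ok(n) True.

    Assumes ok(lo) is True and ok is monotone: True up to some n, False after.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Each probe costs one full add, so about 20 probes cover the range from 1 to 1000000.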

hellolovetiger commented on May 27, 2024

@yhpku, thanks. I tried a GTX 1080 and a Titan X, and both failed. Yours seems to be caused by running out of memory (OOM): IndexFlatL2 loads all the data at once for add or search, so maybe 500000 is the upper limit for a Titan X. You can try IndexIVFPQ, which stores the vectors with lossy compression.

mdouze commented on May 27, 2024

Hi @yhpku, in the code above you use 512 vectors in 1M dimensions. Is this what you want?

yhpku commented on May 27, 2024

@mdouze, no. I mean 1M vectors in 512 dimensions.

mdouze commented on May 27, 2024

@hellolovetiger, Titan X should work. Does bench_gpu_sift1m.py crash on Titan X? What error?

mdouze commented on May 27, 2024

@yhpku, please fix your code then.

hellolovetiger commented on May 27, 2024

On the Titan X, for demo_ivfpq_indexing_gpu, the error is:

Adding the vectors to the index
Segmentation fault (core dumped)

For bench_gpu_sift1m.py,

============ Approximate search
train
WARNING clustering 100000 points to 4096 centroids: please provide at least 159744 training points
add vectors to index
Segmentation fault (core dumped)

The error goes away if I set co.usePrecomputed = False.

For my own code:

#train data shape: (2000000, 1000)
#base data shape: (20000000, 1000)
#query data shape: (1000000, 1000)
#data type: float32

index = faiss.index_factory(d, "OPQ16_512,IVF1024,PQ16")
co = faiss.GpuClonerOptions()
co.useFloat16 = False
co.usePrecomputed = False
co.indicesOptions = faiss.INDICES_CPU
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happens here

The error is:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 5669326848 B, highwater 5669326848 B)
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141
Aborted (core dumped)

When I cut the base data from 20M to 3M, the error becomes:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 6144000000 B, highwater 6144000000 B)
Faiss assertion err == cudaSuccess failed in char* faiss::gpu::StackDeviceMemory::Stack::getAlloc(size_t, cudaStream_t) at utils/StackDeviceMemory.cpp:71
Aborted (core dumped)

It seems to have become a memory issue.
By the way, will GpuIndexIVFPQ still run into memory issues if the set of base vectors is too big?

wickedfoo commented on May 27, 2024

@hellolovetiger,

You are running out of GPU memory. Do not try to add so many vectors at once. 3M * 1000 * sizeof(float) is 12 GB.

Try adding the vectors in chunks of 10000 to 50000 instead.
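The arithmetic behind this, as a small sketch with illustrative helper names: raw float32 input takes n * d * 4 bytes, and a chunk size can be derived from a memory budget. The budget here counts only the input slice; leaving extra headroom for the index's own temporary buffers is an assumption, not a measured figure:

```python
FLOAT32_BYTES = 4


def raw_bytes(n, d):
    """Bytes needed for n uncompressed float32 vectors of dimension d."""
    return n * d * FLOAT32_BYTES


def rows_per_chunk(budget_bytes, d):
    """Largest number of d-dim float32 rows that fits in budget_bytes (at least 1)."""
    return max(1, budget_bytes // (d * FLOAT32_BYTES))
```

For example, a 200 MB budget at d = 1000 gives 50000 rows, the top of the suggested range. Also, 3000000 * 512 * 4 = 6144000000 bytes, the same figure as the allocation in the second WARN line above; this is consistent with (though not confirmed as) the 512-dimensional OPQ16_512 output being materialized for all 3M vectors at once.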

wickedfoo commented on May 27, 2024

Once added to the index, the vectors are compressed via PQ, and then you can add more. But before compression, each vector takes 4000 bytes of memory (= 1000 * sizeof(float)), not 16 bytes (PQ16).
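The difference in footprint, sketched for the 20M x 1000 dataset in this thread with PQ16 codes (16 bytes per vector). The helper is illustrative and ignores the ids stored alongside each code and other per-index overhead:

```python
def footprint(n, d, code_bytes):
    """(uncompressed bytes, PQ code bytes) for n vectors.

    Uncompressed: d float32 values (4 bytes each) per vector.
    Compressed: code_bytes per vector; ids and list overhead are ignored.
    """
    return n * d * 4, n * code_bytes
```

That is 80 GB of raw floats versus 320 MB of codes, a 250x reduction, which is why the vectors can be added chunk by chunk even though the full raw dataset could never fit on an 8 GB card.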

wickedfoo commented on May 27, 2024

Problems with attempting to add large CPU resident vectors all at once will be fixed internally at some point. But in the meantime you will have to incrementally add them.

hellolovetiger commented on May 27, 2024

Got it. Thanks, @wickedfoo. It would be good to add this information to the wiki. 😃

yhpku commented on May 27, 2024

@mdouze, sorry, that was a typo. The actual code is as follows, and the error output is "Faiss assertion err == cudaSuccess failed in faiss::gpu::StackDeviceMemory::Stack::~Stack() at utils/StackDeviceMemory.cpp:54", followed by "Aborted (core dumped)".

import time
import numpy as np
import faiss

time_1 = time.time()
d = 512                           # dimension
nb = 700000                      # database size
nq = 1000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02
res = faiss.StandardGpuResources()
index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
print(index.is_trained)
index.add(xb)
print(index.ntotal)
print(' build the index time %f ms' % ((time.time() - time_1) * 1000))
time_1 = time.time()
D, I = index.search(xc[:2], 1)
print(D)
print(I)
print(' search time %f ms' % ((time.time() - time_1) * 1000))

mdouze commented on May 27, 2024

Closing this issue now, because the discussion has drifted. Please open a new one if this is blocking you.

anty-zhang commented on May 27, 2024

Recently, I started using faiss and ran into the same problem. I looked through many issues and tried almost all the solutions mentioned above, but none of them worked.

Eventually I found that nvcc and nvidia-smi showed different CUDA versions, so I adjusted the nvcc version to match nvidia-smi, and luckily it finally worked. So note that the nvcc version must be consistent with the nvidia-smi version.

My mismatched nvcc and nvidia-smi versions

If you hit the same problem when compiling faiss, this may help you.

How to choose the best CUDA Toolkit version is explained here.
The difference between nvcc and nvidia-smi is explained here.

my env
my makefile
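A sketch of that check. The "CUDA Version" field printed by nvidia-smi is the newest CUDA version the installed driver supports, while `nvcc --version` reports the toolkit used to compile faiss; a toolkit newer than the driver's version is the usual failure case. Function names are illustrative, and `query` only works where both tools are on PATH:

```python
import re
import subprocess


def nvcc_version(text):
    """(major, minor) from `nvcc --version` output, e.g. '... release 8.0, V8.0.44'."""
    m = re.search(r"release (\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def driver_cuda_version(text):
    """(major, minor) from the 'CUDA Version: X.Y' field in nvidia-smi's header."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def compare_cuda_versions(nvcc_text, smi_text):
    """Report both versions, whether they match, and whether the toolkit is newer."""
    toolkit = nvcc_version(nvcc_text)
    driver = driver_cuda_version(smi_text)
    known = toolkit is not None and driver is not None
    return {
        "toolkit": toolkit,
        "driver": driver,
        "match": known and toolkit == driver,
        "toolkit_newer": known and toolkit > driver,
    }


def query():
    """Run both tools on this machine (nvcc and nvidia-smi must be on PATH)."""
    nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    return compare_cuda_versions(nvcc, smi)
```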

zhangxinyu-xyz commented on May 27, 2024

You are lucky. Unfortunately, it did not work when I tried to use faiss-gpu on CUDA 11.1.

sayfulloh11 commented on May 27, 2024

conda install -c conda-forge faiss-gpu

Hi,
I tried the same command. Thank you, it resolved my problem.
