
Comments (5)

yuyichao commented on August 20, 2024

The difference is that if the reader of the Ref wants to use what the producer of the Ref value has created, i.e. if the thread that reads a non-NULL pointer wants to actually use that pointer for anything, then it needs to make sure the memory operations are ordered. You aren't interested in the pointer itself, you are interested in the content of a whole bunch of memory referred to by the pointer; the synchronization is to make sure that the initialization of that memory has finished before you use the pointer, from the point of view of the thread reading the non-NULL value. This isn't a given just because you've read a non-NULL value.

In practice, for the return value of dlsym: the loading of the shared library itself (involving mmap) will probably be fine without any synchronization. The initialization of the library, including relocations and global constructors etc., won't get any automatic synchronization though, and it needs synchronization on the reader side even if the dlopen/dlsym implementation uses locks to synchronize internally.

Now I'm 99% sure this is one of the cases covered by the consume ordering, which is cheaper on ARM (and slightly cheaper on AArch64), since calling through that pointer already guarantees a pretty strong ordering on the reader side in hardware, and consume ordering allows the compiler to take advantage of this. However, AFAICT the implementation of consume is a mess and it seems that LLVM doesn't even support it yet...

So you always need a release store. Ideally, the read is a consume read and there will be essentially no performance cost at all (one instruction difference on AArch64 and one extra barrier on ARM). For most function pointers, just using unordered on the reader side is undefined behavior but will probably be fine, since breaking it would require the compiler to reorder around an opaque function pointer (still, UB...). It's actually not much different from the current situation though, since the Refs are well aligned and a torn read is basically not going to happen without the compiler doing something funny...
Using an acquire load will be safe and well defined and has no performance cost on x86 (an x86 load is stronger than an acquire load).
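[Editor's note: for concreteness, a minimal sketch of the store-release / load-acquire pairing described above, written with Julia's field-level @atomic macro (Julia 1.7+). The PtrBox type and the "libfoo"/:foo names are made up for illustration; this is not CUDAapi.jl's code.]

using Libdl

# Hypothetical cache type; the @atomic field declaration makes every access
# go through atomic instructions with an explicit ordering.
mutable struct PtrBox
    @atomic ptr::Ptr{Cvoid}
end

const FPTR = PtrBox(C_NULL)

function publish_fptr()
    lib = Libdl.dlopen("libfoo")      # placeholder library name
    fptr = Libdl.dlsym(lib, :foo)     # placeholder symbol name
    # Release store: everything dlopen/dlsym did (mmap, relocations, global
    # constructors) happens-before any acquire load that sees this value.
    @atomic :release FPTR.ptr = fptr
    return fptr
end

function read_fptr()
    # Acquire load: pairs with the release store above; free on x86, roughly
    # one extra instruction/barrier on AArch64/ARM.
    fptr = @atomic :acquire FPTR.ptr
    return fptr == C_NULL ? publish_fptr() : fptr
end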


c42f commented on August 20, 2024

To be clear, I'm not an expert on memory consistency and the ways that compilers and hardware can fail to provide a consistent view in the presence of data races. I just read that article recently and it's somewhat alarming :-)

What you clearly do need here is for the stores and loads of the pointer-sized word to be atomic so that the reader thread will see either 0 or the whole pointer word. It seems this would be guaranteed by using an LLVM unordered atomic load and store https://llvm.org/docs/Atomics.html#unordered. (I think I read somewhere that this is guaranteed on x86 (_64) regardless though, for (aligned?) loads and stores up to the pointer width. Worth looking up.)
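[Editor's note: as a rough Julia illustration of "atomic but unordered": :monotonic (LLVM's relaxed) is the weakest ordering the field-level @atomic macro exposes and roughly plays the role unordered plays in this discussion. It rules out a torn pointer word but gives the reader no happens-before edge for the memory the pointer refers to. The WeakPtrBox type and the "libfoo"/:foo names are hypothetical.]

using Libdl

mutable struct WeakPtrBox            # hypothetical holder, for illustration
    @atomic ptr::Ptr{Cvoid}
end
const WEAK = WeakPtrBox(C_NULL)

# Writer: relaxed atomic store; the pointer word cannot tear, but there is
# no release fence ordering the library's initialization before it.
@atomic :monotonic WEAK.ptr = Libdl.dlsym(Libdl.dlopen("libfoo"), :foo)

# Reader: relaxed atomic load; sees either C_NULL or a whole pointer, but is
# not guaranteed to see the library's initialization as complete.
p = @atomic :monotonic WEAK.ptr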


yuyichao commented on August 20, 2024

I don't know what you are using it for, but if you need to generate a pointer in one thread and use it in another, you need the store to be a release store and the load to be an acquire load.

Now, depending on how you are using it, it is indeed possible that the hardware gives you the semantics you want. However, the compiler is allowed to break it.


c42f commented on August 20, 2024

Yes, I should have clarified in my previous message: at the very minimum it should be an unordered atomic load/store. But acquire / release may be needed; I just don't have a good feeling for when they're required.

@yuyichao great to have you on the scene here! What's being done here is that @runtime_ccall is memoizing the result of a dlsym in a global cache using a Ref. The code in question:

CUDAapi.jl/src/call.jl

Lines 21 to 41 in 4dd48be

@gensym fptr_cache
@eval __module__ begin
    const $fptr_cache = Ref(C_NULL)
end
return quote
    # use a closure to hold the lookup and avoid code bloat in the caller
    @noinline function lookup_fptr()
        library = Libdl.dlopen($(esc(library)))
        $(esc(fptr_cache))[] = Libdl.dlsym(library, $(esc(function_name)))
        $(esc(fptr_cache))[]
    end
    fptr = $(esc(fptr_cache))[]
    if fptr == C_NULL # folded into the null check performed by ccall
        fptr = lookup_fptr()
    end
    ccall(fptr, $(map(esc, args)...))
end

The interesting thing here is that it should be safe to do the lookup and store the result to memory multiple times. But indeed the result will be written by one thread and read by other threads. I'd be super interested to know what can go wrong if we don't use acquire/release.

If it's load acquire and store release semantics we need, Threads.Atomic{Ptr} seems to cater well for that.


c42f commented on August 20, 2024

You aren't interested in the pointer itself, you are interested in the content of a whole bunch of memory referred to by the pointer; the synchronization is to make sure that the initialization of that memory has finished before you use the pointer, from the point of view of the thread reading the non-NULL value.

Oh I see, that makes a whole heap of sense. Thanks for the excellent explanation, I really appreciate understanding this better.

So in summary, I believe we should just use Threads.Atomic{Ptr} getindex and setindex! for this (which use a load-acquire and a store-release internally, respectively). From what you say it sounds like that should be correct and also not have a performance cost on x86.
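[Editor's note: a hedged sketch of what that memoization could look like, not CUDAapi.jl's actual code. Threads.Atomic is restricted to a fixed set of integer and floating-point element types, so the sketch stores the pointer bits in a Threads.Atomic{UInt}; the library and symbol names are placeholders.]

using Libdl

# Hypothetical cache: the pointer bits live in a Threads.Atomic{UInt},
# since Threads.Atomic does not take Ptr as an element type.
const fptr_cache = Threads.Atomic{UInt}(0)

@noinline function lookup_fptr()
    lib = Libdl.dlopen("libfoo")       # placeholder library
    fptr = Libdl.dlsym(lib, :foo)      # placeholder symbol
    fptr_cache[] = UInt(fptr)          # setindex!: release store
    return fptr
end

function cached_fptr()
    bits = fptr_cache[]                # getindex: acquire load
    return bits == 0 ? lookup_fptr() : Ptr{Cvoid}(bits)
end

# Usage: the cached pointer can be passed straight to ccall, e.g.
#   ccall(cached_fptr(), Cint, ())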

