
Comments (5)

yuyichao commented on August 20, 2024

The difference is that if the reader of the Ref wants to use what the producer of the Ref value has created, i.e. if the thread that reads a non-NULL pointer wants to actually use that pointer for anything, then it needs to make sure the memory operations are ordered. You aren't interested in the pointer itself, you are interested in the content of a whole bunch of memory referred to by the pointer; the synchronization is to make sure that the initialization of that memory has finished before you use the pointer, from the point of view of the thread reading the non-NULL value. This isn't a given just because you've read a non-NULL value.

In practice, for the return value of dlsym: the loading of the shared library itself (involving mmap) will probably be fine without any synchronization. The initialization of the library, including relocations and global constructors etc., won't get any automatic synchronization though, and it needs synchronization on the reader side even if the dlopen/dlsym implementation uses locks to synchronize internally.

Now I'm 99% sure this is one of the cases covered by the consume ordering, which is cheaper on ARM (and slightly cheaper on AArch64), since calling through that pointer already guarantees a pretty strong ordering on the reader side in hardware, and consume ordering allows the compiler to take advantage of this. However, AFAICT the implementation of consume is a mess and it seems that LLVM doesn't even support it yet...

So you always need a release store. Ideally, the read is a consume read and there will be essentially no performance cost at all (one instruction difference on AArch64 and one extra barrier on ARM). For most function pointers, just using unordered on the reader side is undefined behavior but will probably be fine, since breaking it would require the compiler to reorder around an opaque function pointer (still, UB...). It's actually not much different from the current situation though, since the Refs are well aligned and a torn read is basically not going to happen without the compiler doing something funny...
Using an acquire load will be safe and well defined and has no performance cost on x86 (an x86 load is stronger than an acquire load).
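[Editor's note: for concreteness, a minimal sketch of the store-release / load-acquire pairing described above, written with Julia's field-level @atomic macro (Julia 1.7+). The PtrBox type and the "libfoo"/:foo names are made up for illustration; this is not CUDAapi.jl's code.]

using Libdl

# Hypothetical cache type; the @atomic field declaration makes every access
# go through atomic instructions with an explicit ordering.
mutable struct PtrBox
    @atomic ptr::Ptr{Cvoid}
end

const FPTR = PtrBox(C_NULL)

function publish_fptr()
    lib = Libdl.dlopen("libfoo")      # placeholder library name
    fptr = Libdl.dlsym(lib, :foo)     # placeholder symbol name
    # Release store: everything dlopen/dlsym did (mmap, relocations, global
    # constructors) happens-before any acquire load that sees this value.
    @atomic :release FPTR.ptr = fptr
    return fptr
end

function read_fptr()
    # Acquire load: pairs with the release store above; free on x86, roughly
    # one extra instruction/barrier on AArch64/ARM.
    fptr = @atomic :acquire FPTR.ptr
    return fptr == C_NULL ? publish_fptr() : fptr
end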


c42f commented on August 20, 2024

To be clear, I'm not an expert on memory consistency and the ways that compilers and hardware can fail to provide a consistent view in the presence of data races. I just read that article recently and it's somewhat alarming :-)

What you clearly do need here is for the stores and loads of the pointer-sized word to be atomic so that the reader thread will see either 0 or the whole pointer word. It seems this would be guaranteed by using an LLVM unordered atomic load and store https://llvm.org/docs/Atomics.html#unordered. (I think I read somewhere that this is guaranteed on x86 (_64) regardless though, for (aligned?) loads and stores up to the pointer width. Worth looking up.)
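[Editor's note: as a rough Julia illustration of "atomic but unordered": :monotonic (LLVM's relaxed) is the weakest ordering the field-level @atomic macro exposes and roughly plays the role unordered plays in this discussion. It rules out a torn pointer word but gives the reader no happens-before edge for the memory the pointer refers to. The WeakPtrBox type and the "libfoo"/:foo names are hypothetical.]

using Libdl

mutable struct WeakPtrBox            # hypothetical holder, for illustration
    @atomic ptr::Ptr{Cvoid}
end
const WEAK = WeakPtrBox(C_NULL)

# Writer: relaxed atomic store; the pointer word cannot tear, but there is
# no release fence ordering the library's initialization before it.
@atomic :monotonic WEAK.ptr = Libdl.dlsym(Libdl.dlopen("libfoo"), :foo)

# Reader: relaxed atomic load; sees either C_NULL or a whole pointer, but is
# not guaranteed to see the library's initialization as complete.
p = @atomic :monotonic WEAK.ptr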


yuyichao commented on August 20, 2024

I don't know what you are using it for, but if you need to generate a pointer in one thread and use it in another, you need the store to be a release store and the load to be an acquire load.

Now, depending on how you are using it, it is indeed possible that the hardware gives you the semantics you want. However, the compiler is allowed to break it.


c42f commented on August 20, 2024

Yes, I should have clarified in my previous message: at the very minimum it should be an unordered atomic load/store. But acquire / release may be needed; I just don't have a good feeling for when they're required.

@yuyichao great to have you on the scene here! What's being done here is that @runtime_ccall is memoizing the result of a dlsym in a global cache using a Ref. The code in question:

CUDAapi.jl/src/call.jl

Lines 21 to 41 in 4dd48be

@gensym fptr_cache
@eval __module__ begin
    const $fptr_cache = Ref(C_NULL)
end
return quote
    # use a closure to hold the lookup and avoid code bloat in the caller
    @noinline function lookup_fptr()
        library = Libdl.dlopen($(esc(library)))
        $(esc(fptr_cache))[] = Libdl.dlsym(library, $(esc(function_name)))
        $(esc(fptr_cache))[]
    end
    fptr = $(esc(fptr_cache))[]
    if fptr == C_NULL # folded into the null check performed by ccall
        fptr = lookup_fptr()
    end
    ccall(fptr, $(map(esc, args)...))
end

The interesting thing here is that it should be safe to do the lookup and store the result to memory multiple times. But indeed the result will be written by one thread and read by other threads. I'd be super interested to know what can go wrong if we don't use acquire/release.

If it's load acquire and store release semantics we need, Threads.Atomic{Ptr} seems to cater well for that.


c42f commented on August 20, 2024

You aren't interested in the pointer itself, you are interested in the content of a whole bunch of memory referred to by the pointer; the synchronization is to make sure that the initialization of that memory has finished before you use the pointer, from the point of view of the thread reading the non-NULL value.

Oh I see, that makes a whole heap of sense. Thanks for the excellent explanation, I really appreciate understanding this better.

So in summary, I believe we should just use Threads.Atomic{Ptr} getindex and setindex! for this (which use a load-acquire and a store-release internally, respectively). From what you say it sounds like that should be correct and also not have a performance cost on x86.
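[Editor's note: a hedged sketch of what that memoization could look like, not CUDAapi.jl's actual code. Threads.Atomic is restricted to a fixed set of integer and floating-point element types, so the sketch stores the pointer bits in a Threads.Atomic{UInt}; the library and symbol names are placeholders.]

using Libdl

# Hypothetical cache: the pointer bits live in a Threads.Atomic{UInt},
# since Threads.Atomic does not take Ptr as an element type.
const fptr_cache = Threads.Atomic{UInt}(0)

@noinline function lookup_fptr()
    lib = Libdl.dlopen("libfoo")       # placeholder library
    fptr = Libdl.dlsym(lib, :foo)      # placeholder symbol
    fptr_cache[] = UInt(fptr)          # setindex!: release store
    return fptr
end

function cached_fptr()
    bits = fptr_cache[]                # getindex: acquire load
    return bits == 0 ? lookup_fptr() : Ptr{Cvoid}(bits)
end

# Usage: the cached pointer can be passed straight to ccall, e.g.
#   ccall(cached_fptr(), Cint, ())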

