Comments (23)
I can help with this; I plan to play with some of the other fetch* functions in coming days/weeks. Here are the new things, AFAICT:
Reads/writes with implied barriers (@tibbe or someone: what's the benefit of these over using a normal read/write + one of the barriers exported here? I see we don't have writeBarrier and loadLoadBarrier exposed in ghc-prim yet):
- atomicReadIntArray#
- atomicWriteIntArray#
Additional fetch* functions:
- fetchAddIntArray#
- fetchSubIntArray#
- fetchAndIntArray#
- fetchNandIntArray#
- fetchOrIntArray#
- fetchXorIntArray#
Atomic ops on new small arrays:
- casSmallArray#
For maintaining backwards compatibility, the new fetch* functions can easily be made into CAS loops. The only thing I'm not sure about is what should be done with casSmallArray; I think that might depend on what happens in primitive. I just opened this ticket: haskell/primitive#20.
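The CAS-loop backport mentioned above might look roughly like this. This is only a sketch: it uses an IORef-based stand-in for CAS (built on base's atomicModifyIORef') so it runs with base alone, and the names casIORef and fetchAddViaCas are hypothetical; a real fallback would operate on a MutableByteArray via the cas primitive that atomic-primops already exposes.

```haskell
import Data.IORef

-- Hypothetical stand-in for a real CAS primitive, built on atomicModifyIORef'.
-- Returns True iff the swap from 'old' to 'new' succeeded.
casIORef :: Eq a => IORef a -> a -> a -> IO Bool
casIORef ref old new = atomicModifyIORef' ref $ \cur ->
  if cur == old then (new, True) else (cur, False)

-- The CAS-loop shape a pre-7.10 fetchAdd fallback would take:
-- read, attempt CAS, retry on contention; returns the OLD value, like fetchAdd.
fetchAddViaCas :: IORef Int -> Int -> IO Int
fetchAddViaCas ref n = loop
  where
    loop = do
      old <- readIORef ref
      ok  <- casIORef ref old (old + n)
      if ok then return old else loop

main :: IO ()
main = do
  r   <- newIORef (40 :: Int)
  old <- fetchAddViaCas r 2
  new <- readIORef r
  print (old, new)  -- (40,42)
```

Note the behavioral caveat discussed later in this thread: under contention a CAS loop can retry indefinitely, whereas a hardware fetch-add is wait-free.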
from haskell-lockfree.
I can give advice, if needed. Otherwise I'm not sure how I can help. :)
Thanks for the offer to help, Brandon! I added you as a collaborator on this repo, or feel free to make pull requests if you prefer.
Well, I guess there are really three parts to this:
- the basic wrapper that exposes a Primop at the IO level
- testing that wrapper/primop
- benchmarking
And help with any or all of those is welcome. Johan, do you have some other place that you have tested/benchmarked the new inline primops in GHC 7.10? For one thing, it would be great to quantify how much faster the inline primops are compared to the GHC 7.8/7.6 out of line primops. Do you have those numbers?
Aha, I see that many of these wrappers already exist in the GHC testsuite, in AtomicPrimops.hs, so we can just steal them from Johan's code there. Also, it probably doesn't hurt to duplicate that test code...
Actually, Johan, the more I look at your 7.10 changes the more extensive they seem -- native code x86 support, and LLVM, and out-of-line fallbacks for all codegens. Nice. Have all of those variants passed the tests (e.g. Sparc and PPC)?
Brandon, I don't have a good answer for your question about reads/writes with implied barriers, but maybe Johan does. I'm afraid I don't use those and I don't actually know if an Intel architecture treats a load followed by a full barrier any differently than an atomic load (i.e. with the "lock" prefix in asm?).
However, what should make a big performance difference in our current implementation will be which operations, on which backends, are inline.
- the *Barrier operations exposed by Data.Atomics are all calls to C functions in the GHC RTS, and thus out-of-line
- it looks like atomicRead is actually out-of-line in GHC 7.10's native x86 backend, whereas it is inline in the LLVM codegen
And help with any or all of those is welcome. Johan, do you have some other place that you have tested/benchmarked the new inline primops in GHC 7.10? For one thing, it would be great to quantify how much faster the inline primops are compared to the GHC 7.8/7.6 out of line primops. Do you have those numbers?
IIRC I did benchmark manually when I added the primops, but probably not as thoroughly as I'd have liked (I only had a little bit of time to add the primops). I checked that we output the same assembly code as GCC, I think. Help benchmarking would be appreciated. It's a bit annoying that GHC doesn't have good support for timing benchmarks in the test suite.
Have all of those variants passed the tests (e.g. Sparc and PPC)?
Except they seem to fail for newer GCCs as they renamed the __builtin_fetch_and_X builtins. Erik is working on fixing that I think (there's a ticket on GHC's Trac).
Brandon, I don't have a good answer for your question about reads/writes with implied barriers, but maybe Johan does. I'm afraid I don't use those and I don't actually know if an Intel architecture treats a load followed by a full barrier any differently than an atomic load (i.e. with the "lock" prefix in asm?).
I believe they should be the same. I did some background reading at the time I implemented the primops. To implement atomic reads and writes on x86 you need a memory barrier (which a fence and a lock both provide, at the same cost AFAIK) on either the read or the write, but not both. Putting it on the write is cheaper as there are typically more reads than writes.
However, what should make a big performance difference in our current implementation will be which operations, on which backends, are inline.
Somewhat annoyingly, things aren't as inline as we'd like them to be, because we need to use a CallishMachOp to convince GHC not to float our ops past other reads/writes. They should be somewhat better than before though, and the performance in LLVM should probably be even better.
@rrnewton I was imagining implementing the backwards-compatible versions of these new functions in terms of the cas that we expose, but now that I've started I remembered of course that what we do is copy their implementations in cbits/primops.cmm. So dumb question: is that what we want to do with these new fetch* functions? Or do we not want to mess around with that if we don't have to?
And if it's the former, I guess I'm not sure how to proceed; it looks like they're implemented by way of doAtomicRMW in https://github.com/ghc/ghc/blob/f44333eae7bc7dc7b6003b75874a02445f6b633b/compiler/codeGen/StgCmmPrim.hs
Personally I think it's fine if the new primitives are only for the new GHC (7.10). Sure, ideally they could be backported to old GHC as foreign primops, but the need does not seem pressing. By the time any libs actually use them seriously I bet everyone will have moseyed along to 7.10 anyway ;-), and if I'm wrong we could work on backporting them at a later date.
Okay, great. Then if it's okay with you I think I'll implement them with a CAS loop for <7.10; I think that should be very straightforward and save people some headaches.
And maybe I should emit a warning in the case that we've fallen back to a CAS loop for those functions, since the behavior under contention is different, and users of the lib seem more likely to care about that sort of thing. What do you think?
Yeah, both sound good to me!
@tibbe Any ideas as to the best way to get atomicRead/WriteIntArray for GHC < 7.10? I can't get a full barrier with a write/readByteArray + the barriers we have exposed (storeLoadBarrier, loadLoadBarrier, writeBarrier).
The only idea I have is maybe to use (\a offs -> fetchAddIntArray a offs 0) for atomicReadIntArray, and I guess a CAS loop for atomicWriteIntArray, although that's even worse.
FYI I'm working on the fetch* tests, and will try to make a PR tomorrow. If there's no better solution for atomicRead/WriteIntArray I'll implement some sort of approach like the above, with WARNING pragmas. I guess that's slightly better than making the export list conditional.
@jberryman you only need a barrier for one of them (I suggest the write). I'd use a CAS for the write.
@jberryman you only need a barrier for one of them (I suggest the write). I'd use a CAS for the write.
@tibbe Oh I remember you mentioning:
To implement atomic reads and writes on x86 you need a memory barrier (which a fence and a lock both provide, at the same cost AFAIK) on either the read or the write, but not both. Putting it on the write is cheaper as there are typically more reads than writes.
...and I meant to follow up. So first of all there's the issue of the re-ordering that GHC does. If I understood you, however they're implemented, GHC is guaranteed not to move any reads or writes across them?
On the hardware level, are you sure we need only one barrier? I think I may see what you mean, but my brain is a little foggy right now: x86 only moves stores past loads (to different locations), so if atomicWriteIntArray contains a barrier that stops loads moving across it then we're okay.
But what if the user wants to use atomicReadIntArray surrounded by normal writeByteArray, and (rightly) expects them not to be moved across? And what about the other architectures that GHC targets?
Finally, a longish bit below. I've been getting killed trying to implement Peterson's lock (I think the canonical demonstration of x86 re-ordering issues), and I'm hitting a couple issues, which I would love some help with:
- The algorithm spins forever at a certain point in the busy wait unless I do one of the following:
  - set NoBuffering and do a putStr in the busy wait
  - do a putStrLn "" in the busy wait (although this seems to hang too, for 1M iterations, but works okay for 100K!)
  I don't think this is related to the issue solved by -fno-omit-yields, but I seem to run into this sort of trouble every time I try to write a busy wait or read/write loop
- I can't get the thing to fail when substituting the normal, non-barrier reads and writes! i.e. it still works in the shoddy way described above, and does not demonstrate failure as we'd like
- I observed ONE SINGLE failure (the test completed but assertions failed); I didn't get any useful information and haven't been able to reproduce it again (yet), but I am fairly certain it wasn't from any change to the algorithm, e.g. replacing with non-atomic reads and writes for a test run
It's frustrating not having any proof that atomicRead/WriteIntArray are behaving as advertised.
Anyway, if you'd like to take a look, here's the test I've been banging my head against; I hope I got the imports right. You don't need HUnit; just replace the asserts with the obvious test and error.
import Control.Monad
import Test.HUnit (Assertion, assertEqual, assertBool)
import Data.Atomics (atomicReadIntArray, atomicWriteIntArray)
import Data.Primitive
import Control.Concurrent
import Data.List (sort)
import System.IO

main :: IO ()
main = do
  -- MAKE SURE -N2, or greater
  hSetBuffering stdout NoBuffering -- TODO DEBUGGING (THIS APPEARS NECESSARY FOR PUTSTR TRICK BELOW TO WORK, TOO)
  test_atomic_read_write_barriers2 100000

test_atomic_read_write_barriers2 :: Int -> Assertion
test_atomic_read_write_barriers2 iters = do
  let theWrite mba = atomicWriteIntArray mba 0
      theRead mba = atomicReadIntArray mba 0
  {- NOTE: WE WANT TO MAKE SURE THESE FAIL, BUT THEY DON'T !!
  let theWrite mba (v :: Int) = writeByteArray mba 0 v
      theRead mba = readByteArray mba 0 :: IO Int
  -}
  let true  = 1 :: Int
      false = 0 :: Int
  -- For kicks, a bunch of padding to ensure these are on different cache lines:
  flag0 <- newByteArray (sizeOf (undefined :: Int) * 32)
  flag1 <- newByteArray (sizeOf (undefined :: Int) * 32)
  turn  <- newByteArray (sizeOf (undefined :: Int) * 32)
  writeByteArray flag0 0 false
  writeByteArray flag1 0 false
  -- We use our lock to get an atomic counter:
  counter <- newByteArray (sizeOf (undefined :: Int) * 32)
  writeByteArray counter 0 (0 :: Int)
  let petersonIncr flagA flagB turnVal = do
        theWrite flagA true
        theWrite turn turnVal
        let busyWait = do
              flagBVal <- theRead flagB
              turnVal' <- theRead turn
              if turnVal == 1 then putStr "x" else putStr "+" -- TODO DEBUGGING (THIS APPEARS NECESSARY, AND MUST HAPPEN HERE)
              when (flagBVal == true && turnVal' == 1) busyWait
        busyWait
        -- start critical section --
        old <- theRead counter
        theWrite counter (old + 1)
        -- exit critical section --
        theWrite flagA false
        return old
  out1 <- newEmptyMVar
  out2 <- newEmptyMVar
  void $ forkIO $
    (replicateM iters $ petersonIncr flag0 flag1 1)
      >>= putMVar out1
  void $ forkIO $
    (replicateM iters $ petersonIncr flag1 flag0 0)
      >>= putMVar out2
  -- make sure we got some interleaving, and that output was correct:
  res1 <- takeMVar out1
  res2 <- takeMVar out2
  let numGaps gaps _ [] = gaps
      numGaps gaps prev (x:xs)
        | prev + 1 == x = numGaps gaps x xs
        | otherwise     = numGaps (gaps + 1) x xs
  -- if this fails, fix the test or call with more iters
  assertBool "test_atomic_read_write_barriers2 had enough interleaving to be legit" $
    numGaps (0 :: Int) (-1 :: Int) res1 > 10000
      && numGaps (0 :: Int) (-1 :: Int) res2 > 10000
  -- braindead merge check:
  let ok = sort res1 == res1
        && sort res2 == res2
        && sort (res1 ++ res2) == [0 .. iters * 2 - 1]
  assertBool "test_atomic_read_write_barriers2" ok
...and I meant to follow up. So first of all there's the issue of the re-ordering that GHC does. If I understood you, however they're implemented ensures GHC won't move any reads or writes across them?
As long as they're implemented out-of-line, as a C call or a mach op inside GHC, GHC won't move anything across.
On the hardware level, are you sure we need only one barrier? I think I may see what you mean, but my brain is a little foggy right now: x86 only moves stores past loads (to different locations), so if atomicWriteIntArray contains a barrier that stops loads moving across it then we're okay.
I don't quite remember the details (but I remember the conclusion): one should be enough.
But what if the user wants to use atomicReadIntArray surrounded by normal writeByteArray, and (rightly) expects them to not be moved across?
I don't think that's a correct expectation. The guarantees for ordering apply to data-race-free programs, and if you throw in a non-atomic operation on the same memory location your program is no longer race free and all bets are off.
Does the problem in that test case stem from a busy-wait that does not allocate? GHC had a bug (IMHO) where IO threads that don't allocate could never be preempted, potentially starving other threads. I don't know if this has been fixed.
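If that diagnosis is right, the usual workaround is to make the busy-wait allocate or explicitly yield, so the scheduler gets a safe point at which to run the other thread. A minimal base-only sketch (nothing here is specific to atomic-primops):

```haskell
import Control.Concurrent
import Data.IORef

main :: IO ()
main = do
  done <- newIORef False
  _ <- forkIO $ writeIORef done True
  -- Without the yield, a tight non-allocating spin like this can starve the
  -- forked thread under GHC's scheduler; yield gives it a chance to run.
  let spin = do
        d <- readIORef done
        if d then putStrLn "done" else yield >> spin
  spin
```

(The comment above notes that in this thread's actual test, an explicit yield reportedly did not help, so the putStr workaround may be doing something more than just yielding.)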
@tibbe , thanks for the reply
I don't think that's a correct expectations. The guarantees for ordering apply to data race free programs and if you throw in a non-atomic operation on the same memory location your program is no longer race free and all bets are off.
This is confusing. When I read "Implies a full memory barrier" I assume normal reads and writes won't be moved across.
Is the only guarantee we have that, if we restrict ourselves to some subset of operations, they will be ordered with respect to each other? If so, what is that subset (atomicRead/WriteIntArray + the fetch* + cas operations)?
And the fetch* and cas* families of functions are also documented as "Implies a full memory barrier". Is that also incorrect? i.e. are operations only ordered w/r/t other cas/fetch operations? That's not what I gathered from previous discussions.
Edit: And atomic-primops exports what I understood are supposed to be proper memory barriers; does your "all bets are off" comment pertain to those as well?
Does the problem in that test case stem from a busy-wait that does not allocate? GHC had a bug (IMHO) where IO threads that don't allocate could never be preempted, potentially starving other threads. I don't know if this has been fixed.
Yeah, that's the issue I was referring to as well. Inserting a yield into the busy wait, or compiling with -fno-omit-yields, doesn't help.
@rrnewton It seems to me not so important to have the atomicRead/WriteInt functions, since: we have our memory barrier functions; neither of us has had a need for reads and writes with full barriers; what they're actually supposed to be doing is, at this point in time, very confusing (to me); and I can't get a test to work.
Also, casSmallArray I think needs to wait on primitive.
Do you think I should submit a PR with the new fetch* functions implemented, and with my current WIP on atomic reads/writes commented out?
@rrnewton When you have the time, would you consider rolling a new atomic-primops release with these changes? I'm working on a new library that will make use of fetchOrIntArray.
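For context, a typical use of a fetch-or is setting a flag bit atomically and learning whether it was already set. A base-only sketch with an IORef-based analogue (the fetchOr below is a hypothetical stand-in for fetchOrIntArray, which operates on a MutableByteArray):

```haskell
import Data.Bits
import Data.IORef

-- IORef analogue of a fetch-or: OR a mask into the word, return the OLD value.
fetchOr :: IORef Int -> Int -> IO Int
fetchOr ref mask = atomicModifyIORef' ref $ \old -> (old .|. mask, old)

main :: IO ()
main = do
  flags  <- newIORef (0 :: Int)
  first  <- fetchOr flags 1  -- try to claim bit 0
  second <- fetchOr flags 1  -- second claim sees the bit already set
  -- a thread "won" the flag iff the bit was clear in the returned old value
  print (testBit first 0, testBit second 0)  -- (False,True)
```

Because the old value comes back atomically with the update, exactly one caller observes the bit clear, which is what makes fetch-or useful for one-shot flags and claim bits.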
@jberryman -- I am back above water now post ICFP and I can do this soonish. Also, I added you to the maintainers list so feel free to go ahead and do the release/upload if you get to it first.
Just released what we've got as 0.8. I'm closing this issue. I think folks can open separate issues/requests for the missing atomicRead/Write and casSmall functions if they want. Hope ICFP went well!