Comments (23)
I can help with this; I plan to play with some of the other fetch* functions in coming days/weeks. Here are the new things, AFAICT:
Reads/writes with implied barriers (@tibbe or someone: what's the benefit of these over using a normal read/write + one of the barriers exported here? I see we don't have writeBarrier and loadLoadBarrier exposed in ghc-prim yet):
- atomicReadIntArray#
- atomicWriteIntArray#
Additional fetch* functions:
- fetchAddIntArray#
- fetchSubIntArray#
- fetchAndIntArray#
- fetchNandIntArray#
- fetchOrIntArray#
- fetchXorIntArray#
Atomic ops on new small arrays:
- casSmallArray#
For maintaining backwards compatibility, the new fetch* functions can easily be made into CAS loops. The only thing I'm not sure about is what should be done with casSmallArray; I think that might depend on what happens in primitive. I just opened this ticket: haskell/primitive#20.
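The CAS-loop backport mentioned above might look roughly like this. This is only a sketch: it uses an IORef-based stand-in for CAS (built on base's atomicModifyIORef') so it runs with base alone, and the names casIORef and fetchAddViaCas are hypothetical; a real fallback would operate on a MutableByteArray via the cas primitive that atomic-primops already exposes.

```haskell
import Data.IORef

-- Hypothetical stand-in for a real CAS primitive, built on atomicModifyIORef'.
-- Returns True iff the swap from 'old' to 'new' succeeded.
casIORef :: Eq a => IORef a -> a -> a -> IO Bool
casIORef ref old new = atomicModifyIORef' ref $ \cur ->
  if cur == old then (new, True) else (cur, False)

-- The CAS-loop shape a pre-7.10 fetchAdd fallback would take:
-- read, attempt CAS, retry on contention; returns the OLD value, like fetchAdd.
fetchAddViaCas :: IORef Int -> Int -> IO Int
fetchAddViaCas ref n = loop
  where
    loop = do
      old <- readIORef ref
      ok  <- casIORef ref old (old + n)
      if ok then return old else loop

main :: IO ()
main = do
  r   <- newIORef (40 :: Int)
  old <- fetchAddViaCas r 2
  new <- readIORef r
  print (old, new)  -- (40,42)
```

Note the behavioral caveat discussed later in this thread: under contention a CAS loop can retry indefinitely, whereas a hardware fetch-add is wait-free.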
from haskell-lockfree.
I can give advice, if needed. Otherwise I'm not sure how I can help. :)
Thanks for the offer to help, Brandon! I added you as a collaborator on this repo, or feel free to make pull requests if you prefer.
Well, I guess there are really three parts to this:
- the basic wrapper that exposes a Primop at the IO level
- testing that wrapper/primop
- benchmarking
And help with any or all of those is welcome. Johan, do you have some other place that you have tested/benchmarked the new inline primops in GHC 7.10? For one thing, it would be great to quantify how much faster the inline primops are compared to the GHC 7.8/7.6 out of line primops. Do you have those numbers?
Aha, I see that many of these wrappers already exist in the GHC testsuite, in AtomicPrimops.hs, so we can just steal them from Johan's code there. Also, it probably doesn't hurt to duplicate that test code...
Actually, Johan, the more I look at your 7.10 changes the more extensive they seem -- native code x86 support, and LLVM, and out-of-line fallbacks for all codegens. Nice. Have all of those variants passed the tests (e.g. Sparc and PPC)?
Brandon, I don't have a good answer for your question about reads/writes with implied barriers, but maybe Johan does. I'm afraid I don't use those and I don't actually know if an Intel architecture treats a load followed by a full barrier any differently than an atomic load (i.e. with the "lock" prefix in asm?).
However, what should make a big performance difference in our current implementation will be which operations, on which backends, are inline.
- the *Barrier operations exposed by Data.Atomics are all calls to C functions in the GHC RTS, and thus out-of-line
- it looks like atomicRead is actually out-of-line in GHC 7.10's native x86 backend, whereas it is inline in the LLVM codegen
And help with any or all of those is welcome. Johan, do you have some other place that you have tested/benchmarked the new inline primops in GHC 7.10? For one thing, it would be great to quantify how much faster the inline primops are compared to the GHC 7.8/7.6 out of line primops. Do you have those numbers?
IIRC I did benchmark manually when I added the primops, but probably not as thoroughly as I'd have liked (I only had a little bit of time to add the primops). I checked that we output the same assembly code as GCC, I think. Help benchmarking would be appreciated. It's a bit annoying that GHC doesn't have good support for timing benchmarks in the test suite.
Have all of those variants passed the tests (e.g. Sparc and PPC)?
Except they seem to fail for newer GCCs as they renamed the __builtin_fetch_and_X builtins. Erik is working on fixing that I think (there's a ticket on GHC's Trac).
Brandon, I don't have a good answer for your question about reads/writes with implied barriers, but maybe Johan does. I'm afraid I don't use those and I don't actually know if an Intel architecture treats a load followed by a full barrier any differently than an atomic load (i.e. with the "lock" prefix in asm?).
I believe they should be the same. I did some background reading at the time I implemented the primops. To implement atomic reads and writes on x86 you need a memory barrier (which a fence and a lock both provide, at the same cost AFAIK) on either the read or the write, but not both. Putting it on the write is cheaper as there are typically more reads than writes.
However, what should make a big performance difference in our current implementation will be which operations, on which backends, are inline.
Somewhat annoyingly, things aren't as inline as we'd like them to be, because we need to use a CallishMachOp to convince GHC not to float our ops past other reads/writes. They should be somewhat better than before though, and the performance in LLVM should probably be even better.
@rrnewton I was imagining implementing the backwards-compatible versions of these new functions in terms of the cas that we expose, but now that I've started I remembered of course that what we do is copy their implementations in cbits/primops.cmm. So dumb question: is that what we want to do with these new fetch* functions? Or do we not want to mess around with that if we don't have to?
And if it's the former, I guess I'm not sure how to proceed; it looks like they're implemented by way of doAtomicRMW in https://github.com/ghc/ghc/blob/f44333eae7bc7dc7b6003b75874a02445f6b633b/compiler/codeGen/StgCmmPrim.hs
Personally I think it's fine if the new primitives are only for the new GHC (7.10). Sure, ideally they could be backported to old GHC as foreign primops, but the need does not seem pressing. By the time any libs actually use them seriously I bet everyone will have moseyed along to 7.10 anyway ;-), and if I'm wrong we could work on backporting them at a later date.
Okay, great. Then if it's okay with you I think I'll implement them with a CAS loop for <7.10; I think that should be very straightforward and save people some headaches.
And maybe I should emit a warning in the case that we've fallen back to a CAS loop for those functions, since the behavior under contention is different, and users of the lib seem more likely to care about that sort of thing. What do you think?
Yeah, both sound good to me!
@tibbe Any ideas as to the best way to get atomicRead/WriteIntArray for GHC < 7.10? I can't get a full barrier with a write/readByteArray + the barriers we have exposed (storeLoadBarrier, loadLoadBarrier, writeBarrier).
The only idea I have is maybe to use (\a offs -> fetchAddIntArray a offs 0) for atomicReadIntArray, and I guess a CAS loop for atomicWriteIntArray, although that's even worse.
FYI I'm working on the fetch* tests, and will try to make a PR tomorrow. If there's no better solution for atomicRead/WriteIntArray I'll implement some sort of approach like the above, with WARNING pragmas. I guess that's slightly better than making the export list conditional.
@jberryman you only need a barrier for one of them (I suggest the write). I'd use a CAS for the write.
@jberryman you only need a barrier for one of them (I suggest the write). I'd use a CAS for the write.
@tibbe Oh I remember you mentioning:
To implement atomic reads and writes on x86 you need a memory barrier (which a fence and a lock both provide, at the same cost AFAIK) on either the read or the write, but not both. Putting it on the write is cheaper as there are typically more reads than writes.
...and I meant to follow up. So first of all there's the issue of the re-ordering that GHC does. If I understood you, however they're implemented, GHC is guaranteed not to move any reads or writes across them?
On the hardware level, are you sure we need only one barrier? I think I may see what you mean, but my brain is a little foggy right now: x86 only moves stores past loads (to different locations), so if atomicWriteIntArray contains a barrier that stops loads moving across it then we're okay.
But what if the user wants to use atomicReadIntArray surrounded by normal writeByteArray, and (rightly) expects them not to be moved across? And what about the other architectures that GHC targets?
Finally, a longish bit below. I've been getting killed trying to implement Peterson's lock (I think the canonical demonstration of x86 re-ordering issues), and I'm hitting a couple issues, which I would love some help with:
- The algorithm spins forever at a certain point in the busy wait unless I do one of the following:
  - set NoBuffering and do a putStr in the busy wait
  - do a putStrLn "" in the busy wait (although this seems to hang too, for 1M iterations, but works okay for 100K!)
  I don't think this is related to the issue solved by -fno-omit-yields, but I seem to run into this sort of trouble every time I try to write a busy wait or read/write loop
- I can't get the thing to fail when substituting the normal, non-barrier reads and writes! i.e. it still works in the shoddy way described above, and does not demonstrate failure as we'd like
- I observed ONE SINGLE failure (the test completed but assertions failed); I didn't get any useful information and haven't been able to reproduce it again (yet), but I am fairly certain it wasn't from any change to the algorithm, e.g. replacing with non-atomic reads and writes for a test run
It's frustrating not having any proof that atomicRead/WriteIntArray are behaving as advertised.
Anyway, if you'd like to take a look, here's the test I've been banging my head against; I hope I got the imports right. You don't need HUnit; just replace the asserts with the obvious test and error.
import Control.Monad
import Test.HUnit (Assertion, assertEqual, assertBool)
import Data.Atomics (atomicReadIntArray, atomicWriteIntArray)
import Data.Primitive
import Control.Concurrent
import Data.List (sort)
import System.IO

main :: IO ()
main = do
  -- MAKE SURE -N2, or greater
  hSetBuffering stdout NoBuffering -- TODO DEBUGGING (THIS APPEARS NECESSARY FOR PUTSTR TRICK BELOW TO WORK, TOO)
  test_atomic_read_write_barriers2 100000

test_atomic_read_write_barriers2 :: Int -> Assertion
test_atomic_read_write_barriers2 iters = do
  let theWrite mba = atomicWriteIntArray mba 0
      theRead mba = atomicReadIntArray mba 0
  {- NOTE: WE WANT TO MAKE SURE THESE FAIL, BUT THEY DON'T !!
  let theWrite mba (v :: Int) = writeByteArray mba 0 v
      theRead mba = readByteArray mba 0 :: IO Int
  -}
  let true  = 1 :: Int
      false = 0 :: Int
  -- For kicks, a bunch of padding to ensure these are on different cache lines:
  flag0 <- newByteArray (sizeOf (undefined :: Int) * 32)
  flag1 <- newByteArray (sizeOf (undefined :: Int) * 32)
  turn  <- newByteArray (sizeOf (undefined :: Int) * 32)
  writeByteArray flag0 0 false
  writeByteArray flag1 0 false
  -- We use our lock to get an atomic counter:
  counter <- newByteArray (sizeOf (undefined :: Int) * 32)
  writeByteArray counter 0 (0 :: Int)
  let petersonIncr flagA flagB turnVal = do
        theWrite flagA true
        theWrite turn turnVal
        let busyWait = do
              flagBVal <- theRead flagB
              turnVal' <- theRead turn
              if turnVal == 1 then putStr "x" else putStr "+" -- TODO DEBUGGING (THIS APPEARS NECESSARY, AND MUST HAPPEN HERE)
              when (flagBVal == true && turnVal' == 1) busyWait
        busyWait
        -- start critical section --
        old <- theRead counter
        theWrite counter (old + 1)
        -- exit critical section --
        theWrite flagA false
        return old
  out1 <- newEmptyMVar
  out2 <- newEmptyMVar
  void $ forkIO $
    (replicateM iters $ petersonIncr flag0 flag1 1)
      >>= putMVar out1
  void $ forkIO $
    (replicateM iters $ petersonIncr flag1 flag0 0)
      >>= putMVar out2
  -- make sure we got some interleaving, and that output was correct:
  res1 <- takeMVar out1
  res2 <- takeMVar out2
  let numGaps gaps _ [] = gaps
      numGaps gaps prev (x:xs)
        | prev + 1 == x = numGaps gaps x xs
        | otherwise     = numGaps (gaps + 1) x xs
  -- if this fails, fix the test or call with more iters
  assertBool "test_atomic_read_write_barriers2 had enough interleaving to be legit" $
    numGaps (0 :: Int) (-1 :: Int) res1 > 10000
      && numGaps (0 :: Int) (-1 :: Int) res2 > 10000
  -- braindead merge check:
  let ok = sort res1 == res1
        && sort res2 == res2
        && sort (res1 ++ res2) == [0 .. iters * 2 - 1]
  assertBool "test_atomic_read_write_barriers2" ok
...and I meant to follow up. So first of all there's the issue of the re-ordering that GHC does. If I understood you, however they're implemented ensures GHC won't move any reads or writes across them?
As long as they're implemented out-of-line, as a C call or a mach op inside GHC, GHC won't move anything across.
On the hardware level, are you sure we need only one barrier? I think I may see what you mean, but my brain is a little foggy right now: x86 only moves stores past loads (to different locations), so if atomicWriteIntArray contains a barrier that stops loads moving across it then we're okay.
I don't quite remember the details (but I remember the conclusion): one should be enough.
But what if the user wants to use atomicReadIntArray surrounded by normal writeByteArray, and (rightly) expects them to not be moved across?
I don't think that's a correct expectation. The guarantees for ordering apply to data-race-free programs, and if you throw in a non-atomic operation on the same memory location your program is no longer race free and all bets are off.
Does the problem in that test case stem from a busy-wait that does not allocate? GHC had a bug (IMHO) where IO threads that don't allocate could never be preempted, potentially starving other threads. I don't know if this has been fixed.
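If that diagnosis is right, the usual workaround is to make the busy-wait allocate or explicitly yield, so the scheduler gets a safe point at which to run the other thread. A minimal base-only sketch (nothing here is specific to atomic-primops):

```haskell
import Control.Concurrent
import Data.IORef

main :: IO ()
main = do
  done <- newIORef False
  _ <- forkIO $ writeIORef done True
  -- Without the yield, a tight non-allocating spin like this can starve the
  -- forked thread under GHC's scheduler; yield gives it a chance to run.
  let spin = do
        d <- readIORef done
        if d then putStrLn "done" else yield >> spin
  spin
```

(The comment above notes that in this thread's actual test, an explicit yield reportedly did not help, so the putStr workaround may be doing something more than just yielding.)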
@tibbe , thanks for the reply
I don't think that's a correct expectations. The guarantees for ordering apply to data race free programs and if you throw in a non-atomic operation on the same memory location your program is no longer race free and all bets are off.
This is confusing. When I read "Implies a full memory barrier" I assume normal reads and writes won't be moved across.
Is the only guarantee we have that, if we restrict ourselves to some subset of operations, they will be ordered with respect to each other? If so, what is that subset (atomicRead/WriteIntArray + the fetch* + cas operations)?
And the fetch* and cas* families of functions are also documented as "Implies a full memory barrier". Is that also incorrect? i.e. are operations only ordered w/r/t other cas/fetch operations? That's not what I gathered from previous discussions.
Edit: And atomic-primops exports what I understood are supposed to be proper memory barriers; does your "all bets are off" comment pertain to those as well?
Does the problem in that test case stem from a busy-wait that does not allocate? GHC had a bug (IMHO) where IO threads that don't allocate could never be preempted, potentially starving other threads. I don't know if this has been fixed.
Yeah, that's the issue I was referring to as well. Inserting a yield into the busy wait, or compiling with -fno-omit-yields, doesn't help.
@rrnewton It seems to me not so important to have the atomicRead/WriteInt functions, since: we have our memory barrier functions; neither of us has had a need for reads and writes with full barriers; what they're actually supposed to be doing is, at this point in time, very confusing (to me); and I can't get a test to work.
Also, casSmallArray I think needs to wait on primitive.
Do you think I should submit a PR with the new fetch* functions implemented, and with my current WIP on atomic reads/writes commented out?
@rrnewton When you have the time, would you consider rolling a new atomic-primops release with these changes? I'm working on a new library that will make use of fetchOrIntArray.
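For context, a typical use of a fetch-or is setting a flag bit atomically and learning whether it was already set. A base-only sketch with an IORef-based analogue (the fetchOr below is a hypothetical stand-in for fetchOrIntArray, which operates on a MutableByteArray):

```haskell
import Data.Bits
import Data.IORef

-- IORef analogue of a fetch-or: OR a mask into the word, return the OLD value.
fetchOr :: IORef Int -> Int -> IO Int
fetchOr ref mask = atomicModifyIORef' ref $ \old -> (old .|. mask, old)

main :: IO ()
main = do
  flags  <- newIORef (0 :: Int)
  first  <- fetchOr flags 1  -- try to claim bit 0
  second <- fetchOr flags 1  -- second claim sees the bit already set
  -- a thread "won" the flag iff the bit was clear in the returned old value
  print (testBit first 0, testBit second 0)  -- (False,True)
```

Because the old value comes back atomically with the update, exactly one caller observes the bit clear, which is what makes fetch-or useful for one-shot flags and claim bits.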
@jberryman -- I am back above water now post ICFP and I can do this soonish. Also, I added you to the maintainers list so feel free to go ahead and do the release/upload if you get to it first.
Just released what we've got as 0.8. I'm closing this issue. I think folks can open separate issues/requests for the missing atomicRead/Write and casSmall functions if they want. Hope ICFP went well!