
Comments (7)

v0lker commented on May 2, 2024

it seems i was wrong (but please read on, it gets worse). looking at the latest C17 draft [1], it needs to be volatile sig_atomic_t, and purely because of the signal handling:

"When the processing of the abstract machine is interrupted by receipt of a signal, the values of
objects that are neither lock-free atomic objects nor of type volatile sig_atomic_t are unspecified,
as is the state of the floating-point environment. The value of any object modified by the handler
that is neither a lock-free atomic object nor of type volatile sig_atomic_t becomes indeterminate
when the handler exits, as does the state of the floating-point environment if it is modified by the
handler and not restored to its original state."

[1] http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf

the above also means that the signal handlers are theoretically broken, because they dereference objects that are not of type volatile sig_atomic_t. furthermore, i think a lot of the stuff that is done in the signal handler is theoretically not generally safe to do there:

"If the signal occurs other than as the result of calling the abort or raise function, the behavior is
undefined if [...] the signal handler calls any function in the standard library other than [abort, _Exit, quick_exit, fns in stdatomic.h, atomic_is_lock_free, signal]

SUMMARY

  1. having a volatile that is very frequently queried from all the threads introduces a memory bottleneck / scalability issues. rt-app should be as scalable as possible, and bottlenecks should only exist where they are created intentionally.
  2. installing a signal handler and terminating via it requires the flag to be of type volatile sig_atomic_t
  3. the current signal handler is invalid according to the standard

PROPOSAL
how about using a signalfd to poll for the termination signals, then calling the shutdown fn from there?
instead of blindly pthread_join()ing the threads, have a flag per thread that is similarly (seldom) polled, or, what seems a nicer design, a pipe per thread to read from; then polling can be done on an fd set including:

  • the signalfd
  • the individual thread fds
  • the timer fd to sleep for opts.duration (or just poll the above 2 every and sleep for )

this would solve all issues above.

there should be no need for a write barrier after continue_running is modified ("There is a sequence point after the evaluations of the function designator and the actual arguments but before the actual call."; since there are non-inlineable function calls, we have sequence points in the right places), but superfluous barriers wouldn't hurt.

would you like to have a PR for this?

from rt-app.

jlelli commented on May 2, 2024

my pLoad is about a third of the previous value

While the description of the problem and the proposal are certainly interesting, I'm having trouble understanding how continue_running affects p_load calculation, as the calibration phase happens before the threads are created. Could you explain?

Also, do you have some sort of benchmark or example json that highlights the problem (and might show benefit of the proposed solution)?


v0lker commented on May 2, 2024
  1. the bottleneck i assumed to exist is either not there or, in any case, i couldn't measure it. i think i misread something, got confused and assumed continue_running would be read every time the work loop was executed, which is obviously not the case. see below (4.) for some data corroborating that the scalability is ok.

  2. as you wrote, continue_running shouldn't have anything to do with the calibration, and i was curious how the different value came about. i couldn't reproduce it anymore but found the executable. it is, curiously, statically linked against libjson-c and dynamically against everything else, and looking at the disassembly, the actual workload is done differently: there are only 4 calls to ldexp in the faster executable (vs the expected 12 in the others). the faster version doesn't seem to do any explicit conversion of the return values from ldexp, whereas the regular one is peppered with cvttsd2si instructions. i'm assuming i accidentally used different compiler options (some optimisation seems to have been applied), which then led to the confusion. my apologies.

  3. the signal handler issue is probably largely theoretical, but it seems to be invoking undefined behaviour as outlined above.

  4. some data:

  • using doc/examples/template.json but setting "run" : 50000, to make it a little bit busier.
  • number of threads: 1, 2, 4, 8, 16
  • machine is an 8 core hyperthreaded xeon skylake
  • first, 8...10 calibration runs (also gathering single-threaded data)
  • then the remaining ones with the lowest result from the calibration run
  • there is likely quite a bit of noise in the data (machine was under very light load)

looking at the actual run time (column 3 in the log) that the phase took - i hope that is a good metric (is it?) - you can see that it scales very well from 1...8 threads, after which the run time increases (8 cores with hyperthreading, i.e. 16 hardware threads, so i'm inferring that there might be one FP unit per core)

type                      cores  pLoad                          t_run  [us]
                                  [ns]    Min.    1st Qu.   Median   Mean    3rd Qu.   Max.

dynamic without volatile    1     45     50780    70259    71399    70775    72396     75017
dynamic without volatile    2            54283    70788    72001    70961    72929     78027
dynamic without volatile    4            52209    70662    72200    70698    73265    107469
dynamic without volatile    8            51925    63860    71114    68986    72860    123700
dynamic without volatile   16            77181    91548    94234    95206    95403    148676

dynamic with volatile       1     45     40892    69514    71290    68744    72438     75206
dynamic with volatile       2            51594    70687    71942    70712    72875    104465
dynamic with volatile       4            52436    71180    72412    71391    73398    103340
dynamic with volatile       8            52109    63560    70558    68612    72766    122087
dynamic with volatile      16            64546    90148    93774    92515    94868    129300

static without volatile     1     36     32902    59491    71079    66288    72362     75172
static without volatile     2            52790    69560    70800    69035    71869     92948
static without volatile     4            51630    70468    71649    70353    72690    108864
static without volatile     8            51139    60098    70367    67226    72285    130469
static without volatile    16            98007   103075   103100   103555   103157    136111

static with volatile        1     36     35767    52928    70204    64570    71867     94839
static with volatile        2            53156    69918    71081    69723    72062     74155
static with volatile        4            54037    70516    71640    70382    72728     74785
static with volatile        8            51025    58962    69852    66360    71930    124560
static with volatile       16            91108   103074   103098   103568   103158    174910

mostly dynamic fast         1     13     57059    70934    73026    72532    74410     76624
mostly dynamic fast         2            56752    73190    74501    73115    75481     80892
mostly dynamic fast         4            55133    73632    75126    73782    76191    101513
mostly dynamic fast         8            54644    66822    74342    71988    75873    116326
mostly dynamic fast        16            76146    84872    87182    87325    89265    113314

as in, the fast version seems to take a bit longer in the under-utilised case (could it be that the loop is so short that the FP pipeline is not kept full?) but scales quite a bit better.

the distributions are somewhat gaussian, some a little bimodal. i blame the noise. not sure whether that is interesting, but i could upload the plots.

again, apologies for the hassle, i thought i actually found something interesting, but i think the data tell us that there isn't a scalability issue.

and the signal handler issue seems mostly theoretical, so, shall we close this issue?


vingu-linaro commented on May 2, 2024

Regarding the signal handler issue, I agree with the theory, but I think we are safe in our case because we are not doing any read-modify-write, only a read xor a write.


v0lker commented on May 2, 2024

you mean wrt continue_running? yes, currently safe, because of the type. wrt the other variables? i read it the way that the contents of opt, threads, ft_data and implicit accesses to log_level would be unspecified, so everything that is done with them is potentially problematic.

and i read the second paragraph the way that the various calls to the standard library make the app's shutdown behaviour (hopefully not the whole app's behaviour) undefined.

not sure it's worth rewriting the shutdown behaviour just because of this, though.


jlelli commented on May 2, 2024

not sure it's worth rewriting the shutdown behaviour just because of this, though

Mmm, right. I'm tempted to say that, since after all we are shutting down everything, even if we stumble upon some undefined behavior, it shouldn't really matter.

Thanks a lot for your experiments above!!!


v0lker commented on May 2, 2024

closing it, then. iiuc, the worst that can happen is that something might break at shutdown, and there don't seem to be any observations of this occurring.

