Built the latest master (<a class="commit-link" data-hovercard-type="commit" data-hove

Also seeing something similar I guess: <div class="snippet-clipboard-content notra

You mean if in interpolate_bilinear you replace <code

Bisect suggests <a class="commit-link" data-hovercard-type="commit" data-hovercard-url

<a href="https://github.com/darktable-org/darktable/files/15299250/darktable-d-pipe.tx

I guess we have to go the hard way, there seem to be some uses of <code class="notrans

Segfault opening darktable,about darktable-org/darktable

Comments (112)

piratenpanda commented on August 17, 2024 1

Also seeing something similar I guess:

Thread 289 "worker 5" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffdf78006c0 (LWP 28537)]
0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
    at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
544	    const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
(gdb) bt full
#0  0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
    at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
        uv = {<optimized out>, <optimized out>}
        cv = {<optimized out>, <optimized out>}
        k = <optimized out>
        k = <optimized out>
        UV = Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0x7ffdf77f0e08

from darktable.

elstoc commented on August 17, 2024 1

I tried removing DT_OMP_FOR(collapse(2)) entirely from the interpolate_bilinear function in fast_guided_filter.h and that did prevent the issue. So we're definitely looking in the right place.

Similarly if I replace it with the code that was removed in 5aa29dc, it works

from darktable.

jenshannoschwalm commented on August 17, 2024 1

You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2) by

#ifdef _OPENMP
#pragma omp parallel for collapse(2) default(none) \
  dt_omp_firstprivate(in, out, width_out, height_out, width_in, height_in, ch) \
  schedule(simd:static)
#endif

that "works"?

Pinging @ralfbrown and @dterrahe here, i don't yet fully understand that new omp macro stuff yet, what could be wrong here? Might ____DT_CLONE_TARGETS__ be a problem? Could it be we have some parameters wrongfully used as shared ?

from darktable.

piratenpanda commented on August 17, 2024 1

both work fine

from darktable.

ralfbrown commented on August 17, 2024 1

Had it earlier in the thread, but that message is now being hidden by default....

objdump -S build/lib64/darktable/plugins/libcolorequal.so >colorequal.s

from darktable.

elstoc commented on August 17, 2024

I think it might be related to color equalizer. Retried with a clean config and unedited images. As soon as I enable color equalizer on an image I get the segfault.

Ping @jenshannoschwalm?

from darktable.

elstoc commented on August 17, 2024

Bisect suggests 5aa29dc is the first bad commit

from darktable.

jenshannoschwalm commented on August 17, 2024

No idia so far. Using CE daily without issues, also the code has not changed for quite a while except the new OMP macro usage. Maybe a log can give a clue -d pipe for a first try. Just recompiled everything and having no issues ...

from darktable.

elstoc commented on August 17, 2024

darktable-d-pipe.txt

from darktable.

jenshannoschwalm commented on August 17, 2024

I guess we have to go the hard way, there seem to be some uses of DT_OMP_FOR_SIMD that seem to be wrong. Could you try with DT_OMP_FOR in interpolate_bilinear found in fast_guided_filter.h that simply looks wrong to me.

from darktable.

TurboGit commented on August 17, 2024

Likewise, I'm using CE often.

from darktable.

elstoc commented on August 17, 2024

Could you try with DT_OMP_FOR in interpolate_bilinear found in fast_guided_filter.h that simply looks wrong to me.

Not entirely sure what you mean. If you mean to replace DT_OMP_FOR_SIND with DT_OMP_FOR on line 103 that didn't seem to work. Edit: looking at your latest PR seems that was what you meant. Building latest master after pulling that PR didn't change anything

from darktable.

ralfbrown commented on August 17, 2024

Clone targets shouldn't be an issue, the directive basically tells the compiler to compile the function multiple times, with a different target architecture each time. But I won't categorically rule it out.

Since all of the variables named in the original OpenMP directive are const, it wouldn't matter whether they are declared shared or private since there are no modifications to be propagated to other threads.

DT_OMP_FOR(collapse(2)) expands to #pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
DT_OMP_FOR_SIMD(collapse(2)) expands to #pragma omp parallel for simd default(firstprivate) schedule(simd:static) collapse(2)

@elstoc Does your copy still work correctly if you use the original directive but remove the "simd:" from it? That would give the equivalent of DT_OMP_FOR. Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()? If so, the problem is related to merging the nested loops into a mega-loop, but parallelizing just the outer loop gives nearly all of the speedup we'd get with the collapse(2).

from darktable.

elstoc commented on August 17, 2024

You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2) by...

Yes. Works with this replacement

Does your copy still work correctly if you use the original directive but remove the "simd:" from it?

Yes. Also works.

Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()?

No. Crashes.

from darktable.

ralfbrown commented on August 17, 2024

Hmm. Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__ ? Beginning to suspect a compiler bug....

from darktable.

elstoc commented on August 17, 2024

Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__ ?

no. BTW I also tried downgrading gcc (from 14.1.1 to 13.2.1) to no effect

from darktable.

jenshannoschwalm commented on August 17, 2024

Beginning to suspect a compiler bug....

Doesn't this look more like a macro problem? Are you sure the #pragma omp parallel for default(firstprivate) is fully correct?

Also - i am very surprised that the bug is so obvious for chris and i didn't have a single crash for long also using master.
gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) here

from darktable.

ralfbrown commented on August 17, 2024

That was going to be my next try:
replace the DT_OMP_FOR with

#ifdef _OPENMP
#pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
#endif

which should be exactly the pragma that DT_OMP_FOR(collapse(2)) expands into.

default(none) forces manual naming of the sharing status of all variables used by the loop, so if it compiled without errors, all of the variables have been named, and they are all firstprivate - which is what default(firstprivate) gives you without having to explicitly name the variables in the pragma.

from darktable.

elstoc commented on August 17, 2024

@ralfbrown that #ifdef replacement also caused failure

from darktable.

elstoc commented on August 17, 2024

Clutching at straws here but worth mentioning that I only build using the build.sh script, and clear both the build and install directories beforehand (just in case the people who have it working use some other mechanism).

from darktable.

jenshannoschwalm commented on August 17, 2024

my build script but i guess nothing special

#!/bin/sh

DTDIR=/home/hanno/sources/darktable
ANSWER=""

if [ ! -d "$DTDIR" ]; then
  echo "no darktable directory found"
  read -t 10 ANSWER
  exit
fi

cd $DTDIR

if [ -d "build" ]; then
  echo "uninstall old builds <sudo>"
  cd build
  sudo make uninstall
  make clean
  cd ..
  sudo rm -fr /home/hanno/sources/darktable/build
  sudo rm -fr /home/hanno/.cache/ccache
fi

git checkout master
git pull upstream master
git fetch upstream
git submodule update
rm -fr /home/hanno/.cache/darktable/cached_v*

./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo

Have you tried building with clang, possibly ruling out a compiler problem? Last line here

export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo

BTW - not mentioning ArchL... here but it wouldn't be the first time :-)

from darktable.

elstoc commented on August 17, 2024

Have you tried building with clang

I get a bunch of CMake errors when trying to build with clang which I don't know how to resolve, so stick with gcc

from darktable.

piratenpanda commented on August 17, 2024

Have you tried building with clang, possibly ruling out a compiler problem? Last line here

does not change anything unfortunately. Still crashing in colorequal.c:544

from darktable.

jenshannoschwalm commented on August 17, 2024

A summary as i understand it for now:

There could be 1) a bug in the way we use the for default(firstprivate) pragma, - @elstoc tests seem to indicate that 2) that could be correct but there is a compiler/omp issue either depending on build or system or 3) there is a bug in CE code that triggers the issue on some systems - @piratenpanda report might pinpoint to that or 4) don't know.

@piratenpanda i might be good to report about your system here :-) (distribution, gcc version, libgomp?

There is a hypothesis i would like you both to test, could you compile with this colorequal.c replacing what we have? colorequal.zip

from darktable.

elstoc commented on August 17, 2024

Seems to work for me with the provided colorequal.zip

from darktable.

piratenpanda commented on August 17, 2024

also running arch here, gcc 14.1.1

For me colorequal.zip does not work, but bt full now is longer:

#0  0x00007fffd81af8ac in _prefilter_chromaticity._omp_fn.3 ()
    at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
        uv = {<optimized out>, <optimized out>}
        cv = {<optimized out>, <optimized out>}
        k = <optimized out>
        k = <optimized out>
        UV = 0x7ffe45800040
        saturation = 0x7ffe11c00040
        sat_shift = <optimized out>
        pixels = <optimized out>
        a_full = 0x7ffe0c200040
        b_full = 0x7ffe0bc00040
#1  0x00007ffff6fce997 in gomp_thread_start (xdata=<optimized out>)
    at /usr/src/debug/gcc/gcc/libgomp/team.c:129
        team = 0x7fffa4026480
        task = 0x7fffa40284e0
        data = <optimized out>
        thr = <optimized out>
        pool = <optimized out>
        local_fn = 0x7fffd81af740 <_prefilter_chromaticity._omp_fn.3>
        local_data = 0x7fffd33e11e0
#2  0x00007ffff1dbfded in start_thread (arg=<optimized out>)
    at pthread_create.c:447
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140729162663616, -2270786405953605323, 140729162663616, -57928, 110, 140736737418544, -2270786405907467979, -2269964834939244235}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#3  0x00007ffff1e430dc in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

from darktable.

jenshannoschwalm commented on August 17, 2024

So both of you are on arch ...

@piratenpanda could you try to use DT_OMP_FOR() at that specific loop L 544 ? That loop is so silly-easy, how could there be something wrong?

Any chance to check on a debug version? Or maybe gdb?

@ralfbrown any fresh idea?

from darktable.

ralfbrown commented on August 17, 2024

Clutching at straws too.... I might get a fresh idea looking at the generated code. After compiling, run

objdump -S build/lib64/darktable/plugins/libcolorequal.so >colorequal.s

and upload colorequal.s. (You can compress it, as it will be fairly large.)

In the mean time, I'll suggest @elstoc trying shared instead of firstprivate, i.e.

#ifdef _OPENMP
#pragma omp parallel for default(shared) schedule(static) collapse(2)
#endif

from darktable.

elstoc commented on August 17, 2024

I'll suggest @elstoc trying shared instead of firstprivate

crash

upload colorequal.s

colorequal.s.txt
(renamed to .txt so github allows it)

from darktable.

jenshannoschwalm commented on August 17, 2024

I think I found a bug in ce code

for(size_t k = 0; k < pixels; k++)
  {
    // For each correction factor, we re-express it as a[0] * U + a[1] * V + b
    const float uv[2] = { UV[2 * k + 0], UV[2 * k + 1] };
    const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
                          a_full[4 * k + 2] * uv[0] + a_full[4 * k + 3] * uv[1] + b_full[2 * k + 1] };

seems to be bad, we should align the float array or use 2 const float each. Will do the pr tonight.

from darktable.

AxelG-DE commented on August 17, 2024

interestingly both of my machines do not crash. #
Anything I can do, let me know

from darktable.

ralfbrown commented on August 17, 2024

Looking through the objdump output, the one substantive difference I see between Chris's compilation and mine is that mine does (float)size_t_var using the vcvtsi2ss instruction while Chris's uses vcvtusi2ss. The latter is "more correct", but a quick Google indicates that that instruction was only added with AVX-512, which my machine doesn't have. Both of our compilations vectorize the interpolate_bilinear loop using 256-bit vector registers; it may take additional compiler flags to get 512-bit vectorization (using zmmN registers instead of ymmN).

@elstoc as a test, try changing all of the size_ts in interpolate_bilinear to ssize_t (two s's, refers to a signed version). If that eliminates the crash, then the vctvusi2ss instruction is the culprit and we need to figure out how to keep GCC from using it on your machine.

from darktable.

elstoc commented on August 17, 2024

that instruction was only added with AVX-512, which my machine doesn't have

Ooh have I got bleeding edge hardware as well as a bleeding edge distro? Arch is the canary in the coal mine

try changing all of the size_ts in interpolate_bilinear to ssize_t

crash

from darktable.

ralfbrown commented on August 17, 2024

Well, I am now officially out of ideas other than reverting that instance of DT_OMP_FOR back to the original written-out pragma, with a comment to the effect that the written-out version is needed to avoid crashes.

from darktable.

jenshannoschwalm commented on August 17, 2024

Just a reminder, @elstoc machine didn't crash when the buffers where enlarged. @elstoc @piratenpanda would you try
colorequal.zip

@ralfbrown there are two ideas 1) make sure each buffer is at least one cacheline larger than required per "plane" 2) don't use non-aligned 2float arrays on the stack. Let's see ...

from darktable.

elstoc commented on August 17, 2024

reverting that instance of DT_OMP_FOR

Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.

from darktable.

jenshannoschwalm commented on August 17, 2024

reverting that instance of DT_OMP_FOR

Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.

I agree with @elstoc here. Trying to find the OMP specifications for simd default(firstprivate) ...

from darktable.

elstoc commented on August 17, 2024

try colorequal.zip

crash

from darktable.

piratenpanda commented on August 17, 2024

same for me

from darktable.

ralfbrown commented on August 17, 2024

@elstoc is your current crash in _prefilter_chromaticity at line 544(-ish) or in interpolate_bilinear? I just took a close look at your original backtrace, and its crash was at 544, same as piratenpanda reported. Messing with the DT_OMP_FOR in interpolate_bilinear may merely be causing downstream effects (due to inlining?) which determine whether there's a crash or not....

@jenshannoschwalm Even though the uv and cv arrays aren't aligned, the compiler is vectorizing and optimizing them away, into 256-bit vector registers (thus running four loop iterations at once, plus cleanup of the leftover few). Chris's version is using a bunch of AVX-512 instructions.

from darktable.

elstoc commented on August 17, 2024

Latest backtrace (from the colorequal.c provided by @jenshannoschwalm ) is darktable_bt_O3ZQN2.txt

from darktable.

piratenpanda commented on August 17, 2024

Here's mine darktable_bt_DER5N2.txt

from darktable.

jenshannoschwalm commented on August 17, 2024

Could you check what happens with modified loop parameters? Start with 8 for example and stop as k < pixels-8?
What happens when doing that loop with DT_OMP_FOR()?

from darktable.

ralfbrown commented on August 17, 2024

Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun. But the buffer size has been increased by at least 64 over the actual number of pixels....

In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.

from darktable.

jenshannoschwalm commented on August 17, 2024

In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.

See const double val = 0.5 / (double)SATSIZE * (double)i; in initializing the table.

from darktable.

ralfbrown commented on August 17, 2024

I did see that, but if you look more closely, isat in _get_satweight is limited to the range SATSIZE/2 to 3*SATSIZE/2 because 1+CLAMP() can only take values 0.5 to 1.5, and the array index used is (int)floor(isat). But the array itself runs from 0 to 2*SATSIZE, and all of those entries have been filled by _init_satweights.

from darktable.

jenshannoschwalm commented on August 17, 2024

Right, now i see that and it's indeed restricting the effect. Probably i simply oversaw that while introducing the contrast thing.

Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun.

Due to optimizing it could also be the UV overrun.

One other idea, could this all be due to the #include "common/extra_optimizations.h" ?

from darktable.

piratenpanda commented on August 17, 2024

One other idea, could this all be due to the #include "common/extra_optimizations.h" ?

I removed it and it didn't change anything unfortunately

from darktable.

TurboGit commented on August 17, 2024

Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.

Maybe but we had no report about crash before, so maybe a quite restricted issue on this part of code as it fixes the bug for you and @piratenpanda.

from darktable.

elstoc commented on August 17, 2024

Sure but you'd had no report about a crash until I reported it either. And I haven't really tested master much since that PR went in - I literally built it and it crashed because of an existing edit. Also that PR is only a couple of weeks old and not everyone updates that frequently

from darktable.

jenshannoschwalm commented on August 17, 2024

I literally built it and it crashed because of an existing edit

would you share raw & xmp?

from darktable.

piratenpanda commented on August 17, 2024

For me it happens as soon as colorequalizer is enabled. Doesn't matter if presets or just manual changes

from darktable.

jenshannoschwalm commented on August 17, 2024

Could you please also check tonequalizer module using "avaraged guided filter" in "preserve details" ?

Or "focus peaking" ?

from darktable.

jenshannoschwalm commented on August 17, 2024

Could you also test this?
test.zip

(replaces colorequlizer and fast guided filter code)

BTW i have no cpu info from @piratenpanda and @elstoc (in case we have hit a genuine compiler/gomp problem)

from darktable.

piratenpanda commented on August 17, 2024

I'm on a Ryzen 7950X

test.zip still crashes in colorequal.c:544

from darktable.

jenshannoschwalm commented on August 17, 2024

And this?
colorequal.zip

from darktable.

piratenpanda commented on August 17, 2024

still crashes

from darktable.

jenshannoschwalm commented on August 17, 2024

If you remove the DT_OMP_FOR_SIMD(aligned(a_full, b_full, saturation, UV: 64)) in line 539 or replace it with DT_OMP_FOR() ?

from darktable.

piratenpanda commented on August 17, 2024

both cases crash

from darktable.

jenshannoschwalm commented on August 17, 2024

If you remove the DT_OMP_FOR(collapse(2)) lines - first in interpolate_bilinear2() and then in the "generic?

from darktable.

elstoc commented on August 17, 2024

check tonequalizer module using "avaraged guided filter" in "preserve details" ? Or "focus peaking" ?

Both fine

i have no cpu info

Intel(R) Core(TM) i5-11600K

from darktable.

piratenpanda commented on August 17, 2024

If you remove the DT_OMP_FOR(collapse(2)) lines - first in interpolate_bilinear2() and then in the "generic?

both in fast_guided_filter.h correct?

If yes, still crashes

from darktable.

jenshannoschwalm commented on August 17, 2024

So it seems not to be a cpu feature issue, i think from memory @ralfbrown is using the same cpu as @piratenpanda

If you remove the DT_OMP_FOR(collapse(2)) lines - first in interpolate_bilinear2() and then in the "generic?

both in fast_guided_filter.h correct?

Yes.

from darktable.

jenshannoschwalm commented on August 17, 2024

Maybe we might ask at pixels for more users running master on arch vs other distros?

from darktable.

ralfbrown commented on August 17, 2024

I've got a rather older CPU than @piratenpanda - Threadripper 3970X, bought four years ago. Chris's CPU is around a year newer than mine, and supports AVX-512.

from darktable.

jenshannoschwalm commented on August 17, 2024

@ALL involved here, a summary from what @piratenpanda has been doing from code i published here (for those too lazy to inspect)
On his system it doesn't help

to remove the DT_OMP_FOR_SIMD(aligned(a_full, b_full, saturation, UV: 64)) line from the loop in question
to increase the size of the allocated buffers
to remove the DT_OMP_FOR() pragma from interpolate_bilinear which is different from @elstoc results and would rule out a simple local revert - so it seems to me.

This "all doesn't make sense" to me, that loop is so dammed simple, i guess the "real problem" is somewhere else .

Ideas

could we make the build disabling AVX-512 just to make sure ...
could either of you both affected run dt via gdb so we get register values - check manually for out-of-bounds problems.

from darktable.

piratenpanda commented on August 17, 2024

Never did this before so please tell me if you need something else. This is with the modified fast_guided_filter.h (both removed) and DT_OMP_FOR() in colorequal.c from your latest zip file

rax            0x7fffd80f8260      140736818283104
rbx            0xb7d               2941
rcx            0xb7d               2941
rdx            0x3840              14400
rsi            0xb7d               2941
rdi            0x7ffe3a76e730      140729879291696
rbp            0x7ffdb89f0e70      0x7ffdb89f0e70
rsp            0x7ffdb89f0dc0      0x7ffdb89f0dc0
r8             0x7ffe3a97d730      140729881450288
r9             0x7ffe3a1fee20      140729873591840
r10            0x7ffe50a81bb8      140730251615160
r11            0x1800              6144
r12            0x41c0              16832
r13            0xb7d               2941
r14            0x1430              5168
r15            0xb7d               2941
rip            0x7fffd80e18ce      0x7fffd80e18ce <_prefilter_chromaticity._omp_fn.3+398>
eflags         0x10283             [ CF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x10001             65537
k1             0xffff              65535
k2             0xff0ffeff          4279238399
k3             0xe0                224
k4             0xffe0              65504
k5             0x0                 0
k6             0x7                 7
k7             0x7                 7
fs_base        0x7ffdb8a006c0      140727700948672
gs_base        0x0                 0

from darktable.

elstoc commented on August 17, 2024

gdb is not my favourite thing... is this the sort of thing you're after? (running latest master)

Thread 84 "worker res 1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff197fa6c0 (LWP 905847)]
0x00007fffa04cd13c in _prefilter_chromaticity._omp_fn.3 () at /srv/local/coding/darktable-master/src/iop/colorequal.c:544
544	    const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
(gdb) bt
#0  0x00007fffa04cd13c in _prefilter_chromaticity._omp_fn.3 () at /srv/local/coding/darktable-master/src/iop/colorequal.c:544
#1  0x00007ffff727e997 in gomp_thread_start (xdata=<optimized out>) at /usr/src/debug/gcc/gcc/libgomp/team.c:129
#2  0x00007ffff1e18ded in ??? () at /usr/lib/libc.so.6
#3  0x00007ffff1e9c0dc in ??? () at /usr/lib/libc.so.6
(gdb) info registers
rax            0x7fffa04e3200      140735882867200
rbx            0x1237              4663
rcx            0x1249              4681
rdx            0x5fcc0             392384
rsi            0x1227              4647
rdi            0x7ffee21e9cc0      140732692077760
rbp            0x7fff197eae70      0x7fff197eae70
rsp            0x7fff197eae00      0x7fff197eae00
r8             0x7fff18f36680      140733611992704
r9             0x1800              6144
r10            0x1                 1
r11            0x1801              6145
r12            0x7ffee556f680      140732746102400
r13            0x7fff20797360      140733738218336
r14            0x122e              4654
r15            0x1241              4673
rip            0x7fffa04cd13c      0x7fffa04cd13c <_prefilter_chromaticity._omp_fn.3+428>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x7                 7
k1             0xff                255
k2             0x7                 7
k3             0x7                 7
k4             0x7                 7
k5             0x7                 7
k6             0x7                 7
k7             0x7                 7
fs_base        0x7fff197fa6c0      140733621184192
gs_base        0x0                 0

from darktable.

ralfbrown commented on August 17, 2024

If you could provide objdump output of the compiled colorequal code, we can see exactly which instruction is segfaulting (and figure out some of the variable values).

from darktable.

piratenpanda commented on August 17, 2024

Just to be clear, run on the libcolorequal.so file? And with which options?

from darktable.

piratenpanda commented on August 17, 2024

I ran all commands I found so far

objdump_g.txt
objdump_h.txt
objdump_TC.txt

from darktable.

piratenpanda commented on August 17, 2024

colorequal.s.txt

I see it now, must have missed that

from darktable.

elstoc commented on August 17, 2024

And for mine (matching the gdb output above - i.e. latest master)...

colorequal.s.txt

from darktable.

ralfbrown commented on August 17, 2024

Tracing things now, @piratenpanda's crash is on reading 64 bytes (using AVX-512) at %r9[4*%rdx+0xC0]. %r9 is thus either a_full or b_full and %rdx is k. Looks like the vectorization is doing 128 bytes per iteration (so we'd need to boost the buffer by at least 2*DT_CACHELINE_SIZE if the compiler isn't properly limiting things).

Chris's crash is on reading 32 bytes (using AVX/AVX2) from either a_full or b_full. Based on the fact that %rdx is used with multipliers of both 2 and 4 around the crash point for both of you, I will tentatively say that Chris's crash is from reading b_full while Benjamin's is from reading a_full. That may just be an artifact of instruction ordering, i.e. which array is read first.

More to come....

I had already been preparing a variation that splits out the parallelized loops into separate functions (my experience has been that the compiler sometimes vectorizes better when the loop is the only thing in a function, even if it subsequently gets inlined). So please try the following version and report the backtrace, registers from gdb, and the (new) objdump output when it crashes.

colorequal.zip

from darktable.

elstoc commented on August 17, 2024

gdb:

Thread 62 "worker res 0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff3bfff6c0 (LWP 923598)]
0x00007fff90924b2d in _apply_prefilter._omp_fn.0 () at /srv/local/coding/darktable-master/src/iop/colorequal.c:498
498	    const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
(gdb) bt
#0  0x00007fff90924b2d in _apply_prefilter._omp_fn.0 () at /srv/local/coding/darktable-master/src/iop/colorequal.c:498
#1  0x00007ffff727e997 in gomp_thread_start (xdata=<optimized out>) at /usr/src/debug/gcc/gcc/libgomp/team.c:129
#2  0x00007ffff1e18ded in ??? () at /usr/lib/libc.so.6
#3  0x00007ffff1e9c0dc in ??? () at /usr/lib/libc.so.6
(gdb) info registers
rax            0x7fff9093a200      140735618982400
rbx            0x2000              8192
rcx            0x1197              4503
rdx            0xaa820             698400
rsi            0x118d              4493
rdi            0x7ffedf729f20      140732647251744
rbp            0x7fff3bfefe70      0x7fff3bfefe70
rsp            0x7fff3bfefe00      0x7fff3bfefe00
r8             0x7ffee4aa77b0      140732734797744
r9             0x1fff              8191
r10            0xfda               4058
r11            0x1                 1
r12            0x7ffee8fbe7b0      140732807243696
r13            0x7ffee2ada3f8      140732701451256
r14            0x1278              4728
r15            0x12be              4798
rip            0x7fff90924b2d      0x7fff90924b2d <_apply_prefilter._omp_fn.0+445>
eflags         0x10206             [ PF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x10001             65537
k1             0xff                255
k2             0x2fffffff          805306367
k3             0x0                 0
k4             0x0                 0
k5             0x1d8               472
k6             0x3b000             241664
k7             0x0                 0
fs_base        0x7fff3bfff6c0      140734200018624
gs_base        0x0                 0

obj dump
colorequal.s.txt

from darktable.

piratenpanda commented on August 17, 2024

darktable_bt_6VEQN2.txt

rax            0x7fffd8051260      140736817599072
rbx            0xb7d               2941
rcx            0xb7d               2941
rdx            0x3840              14400
rsi            0xb7d               2941
rdi            0x7ffe425f8730      140730011977520
rbp            0x7ffde23f0e70      0x7ffde23f0e70
rsp            0x7ffde23f0dc0      0x7ffde23f0dc0
r8             0x7fff7e3f7730      140735311476528
r9             0x7ffe2c5fee20      140729642905120
r10            0x7fff956bebb8      140735700265912
r11            0x1800              6144
r12            0x41c0              16832
r13            0xb7d               2941
r14            0x140f              5135
r15            0xb7d               2941
rip            0x7fffd803a8ce      0x7fffd803a8ce <_prefilter_chromaticity._omp_fn.3+398>
eflags         0x10283             [ CF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x7                 7
k1             0xffff              65535
k2             0x7                 7
k3             0x7                 7
k4             0x0                 0
k5             0x7                 7
k6             0x0                 0
k7             0x0                 0
fs_base        0x7ffde24006c0      140728399300288
gs_base        0x0                 0

colorequal.s.txt

from darktable.

jenshannoschwalm commented on August 17, 2024

Would we possibly need/want to increase the cacheline size?

from darktable.

ralfbrown commented on August 17, 2024

@piratenpanda It appears you ran an old binary but ran objdump on a new compilation, as the reported function doesn't exist with the code I posted.

@elstoc Definitely a crash on reading a_full this time, at offset 96 within a 128-byte batch (to get enough values to shuffle around into 256-bit vector registers). I realized while looking through the code that none of the buffer allocations are checked for success, but that isn't the problem here since the crash is not the result of a null-pointer deref.

@jenshannoschwalm No need to increase the cacheline size - a) if the parallelized loop is reading past the end of the buffer, that's a code generation bug which we should try to work around, and b) if we wanted to up the padding for just this function, we don't have to bump it up everywhere else that rounds to cache lines.

from darktable.

jenshannoschwalm commented on August 17, 2024

yes i saw the missing checks for allocation too while looking for tiled support. We you take of that in a pr or would you want me to do?

from darktable.

ralfbrown commented on August 17, 2024

Already added the checks in my copy.

from darktable.

piratenpanda commented on August 17, 2024

I hope this is better otherwise I don't know what I am doing wrong:

darktable_bt_06A7N2.txt
colorequal.s.txt

rbx            0xb7d               2941
rcx            0xb7d               2941
rdx            0x3840              14400
rsi            0xb7d               2941
rdi            0x7ffe35df8730      140729802262320
rbp            0x7ffdebdf0e70      0x7ffdebdf0e70
rsp            0x7ffdebdf0dc0      0x7ffdebdf0dc0
r8             0x7fff8bff7730      140735542163248
r9             0x7ffe2fbfee20      140729699528224
r10            0x7fffb90bebb8      140736297954232
r11            0x1800              6144
r12            0x41c0              16832
r13            0xb7d               2941
r14            0x140f              5135
r15            0xb7d               2941
rip            0x7fffd81af8ce      0x7fffd81af8ce <_prefilter_chromaticity._omp_fn.3+398>
eflags         0x10283             [ CF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x7                 7
k1             0xffff              65535
k2             0x7                 7
k3             0x7                 7
k4             0x0                 0
k5             0x7                 7
k6             0x0                 0
k7             0x0                 0
fs_base        0x7ffdebe006c0      140728560780992
gs_base        0x0                 0

from darktable.

ralfbrown commented on August 17, 2024

That still looks like an old executable. If you're not doing the install step (not recommended), are you explicitly running the copy in the build directory by using the full path on the commandline?

from darktable.

piratenpanda commented on August 17, 2024

I cleaned the build directory, replaced the c file, built fresh, installed, objdump is from the build directory rest from running the installed version.

from darktable.

ralfbrown commented on August 17, 2024

Odd, unless you have a separate "production" copy on your path. Please try running explicitly from the build directory, i.e. type build/bin/darktable while in the top directory of the repo.

from darktable.

piratenpanda commented on August 17, 2024

there it shows the "dirty" which I was expecting. Really strange..

rax            0x1fff              8191
rbx            0x13cb              5067
rcx            0x7fffd812b260      140736818492000
rdx            0x107a0             67488
rsi            0xf200              61952
rdi            0x13d1              5073
rbp            0x7ffe0f1f0e70      0x7ffe0f1f0e70
rsp            0x7ffe0f1f0e00      0x7ffe0f1f0e00
r8             0x7ffe0c5fb790      140729106020240
r9             0x7ffe3e9fdbe8      140729949084648
r10            0x7ffe0bbfdbe8      140729095543784
r11            0x13c8              5064
r12            0x7ffe0d3fee14      140729120714260
r13            0x13ce              5070
r14            0x13b5              5045
r15            0x13af              5039
rip            0x7fffd8114e0f      0x7fffd8114e0f <_apply_prefilter._omp_fn.0+431>
eflags         0x10287             [ CF PF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
Couldn't get extended state status: Kein passender Prozess gefunden.

darktable_bt_A2S1N2.txt

edit: seems I somehow installed to /usr/local in the last few days.. (most likely by copy pasting Hanno's clang lines) explains the issue I guess

from darktable.

ralfbrown commented on August 17, 2024

That looks like the correct version now, as your crash is exactly the same place as Chris's - reading a_full at offset 96 within a 128-byte chunk.

Try adding substantial padding on the buffer just to verify whether the compiler is generating an over-run. In _prefilter_chromaticity, change

    a_full = dt_alloc_align_float(pixels * 4); 
    b_full = dt_alloc_align_float(pixels * 2);

    a_full = dt_alloc_align_float(pixels * 4 + 2000); 
    b_full = dt_alloc_align_float(pixels * 2 + 2000);

and adding the same padding to the allocation of UV in process().

Also try restricting the parallel loop to a single thread to check whether there's an issue in calculating how many iterations each thread should run, by changing the DT_OMP_FOR in _apply_prefilter to DT_OMP_FOR(num_threads(1)).

from darktable.

elstoc commented on August 17, 2024

Possibly easier if you just provide a test file to replace :) (for me anyway)

from darktable.

ralfbrown commented on August 17, 2024

colorequal.zip

from darktable.

elstoc commented on August 17, 2024

Thanks... crash

from darktable.

piratenpanda commented on August 17, 2024

rax            0x7fffd8dc6260      140736831709792
rbx            0xb7d               2941
rcx            0xb7d               2941
rdx            0x3840              14400
rsi            0x7ffe381f8730      140729840011056
rdi            0x1fff              8191
rbp            0x7ffde45f0e70      0x7ffde45f0e70
rsp            0x7ffde45f0dc0      0x7ffde45f0dc0
r8             0x7ffe31eff730      140729736230704
r9             0x165d              5725
r10            0x7ffe1c3fee20      140729372372512
r11            0xb7d               2941
r12            0xb7d               2941
r13            0xb7d               2941
r14            0x7fffb96bebb8      140736304245688
r15            0x140f              5135
rip            0x7fffd8daee16      0x7fffd8daee16 <_apply_prefilt--Type <RET> for more, q to quit, c to continue without paging--
er._omp_fn.0+390>
eflags         0x10283             [ CF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x7                 7
k1             0xffff              65535
k2             0x7                 7
k3             0x7                 7
k4             0x0                 0
k5             0x7                 7
k6             0x0                 0
k7             0x0                 0
fs_base        0x7ffde46006c0      140728434951872
gs_base        0x0                 0

darktable_bt_YGJ3N2.txt
colorequal.s.txt

from darktable.

ralfbrown commented on August 17, 2024

I was expecting that, but it pretty much confirms that the problem isn't a simple overrun of the end of the buffer. Now to try single-threaded....

colorequal.zip

from darktable.

elstoc commented on August 17, 2024

crash (let me know if you want any output)

from darktable.

ralfbrown commented on August 17, 2024

Yes, please - objdump and registers. I want to verify that the crash is the same place.

My only remaining experiment at the moment is to remove the #include "common/extra_optimizations.h" to see if that's causing an error in newer compiler versions (I'm still stuck on GCC 12.3).

from darktable.

elstoc commented on August 17, 2024

rax            0x7fff9093a200      140735618982400
rbx            0x1237              4663
rcx            0x1249              4681
rdx            0x492fe0            4796384
rsi            0x1227              4647
rdi            0x7ffee01ef040      140732658544704
rbp            0x7fffb5fe1d80      0x7fffb5fe1d80
rsp            0x7fffb5fe1d00      0x7fffb5fe1d00
r8             0x7fff00408040      140733197615168
r9             0x1fff              8191
r10            0x1                 1
r11            0x2000              8192
r12            0x7ffedf8c9040      140732648951872
r13            0x7fff20364040      140733733814336
r14            0x122e              4654
r15            0x1241              4673
rip            0x7fff90924afe      0x7fff90924afe <_apply_prefilter._omp_fn.0+350>
eflags         0x10283             [ CF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0xfeef8000          4277108736
k1             0xff                255
k2             0x60000ff           100663551
k3             0x7                 7
k4             0x7                 7
k5             0x7                 7
k6             0x7                 7
k7             0x7                 7
fs_base        0x7fffb5ffb6c0      140736246822592
gs_base        0x0                 0

colorequal.s.txt

from darktable.

ralfbrown commented on August 17, 2024

Yep, same crash as when using all available cores, only now the index is much larger. So the problem is related to hitting the end of the buffer, I'm just not seeing how....

What happens if you set the upper limit of the loop in _apply_prefilter to npixels-64 instead of npixels?

from darktable.

elstoc commented on August 17, 2024

within your amended colorequal code or the code in master?

from darktable.

jenshannoschwalm commented on August 17, 2024

@piratenpanda checked that already I think

from darktable.

ralfbrown commented on August 17, 2024

@elstoc amended code

@jenshannoschwalm different setup, so double-checking.

That and removing extra_optimizations.h are the only ideas I have left, given that things work on my machine.

from darktable.

jenshannoschwalm commented on August 17, 2024

We could try to disable any AVX512 code generation while compiling?
in interpolate_bilinear the compiled code converts floats to size_t? Any chance for imprecision leading to undefined offsets - resulting in bad pointers later thus writing "somewhere"?
Let's ask in pixls for a) arch users with other hardware b) fedora 39/40 users with AVX512 hardware?

from darktable.

ralfbrown commented on August 17, 2024

Chris doesn't have AVX-512
Any bad offsets in interpolate_bilinear would crash in that code, not later when only the filled buffer of pixel values still exists.
Sure, but without limiting it to just AVX-512 hardware on the Fedora side.

from darktable.

Segfault opening darktable about darktable HOT 112 CLOSED

Comments (112)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent