Git Product home page Git Product logo

Comments (19)

Cyriuz avatar Cyriuz commented on June 21, 2024 2

If you run this after sentry_init it works for me:

void lockUnhandledExceptionFilter()
{
	HMODULE kernel = GetModuleHandle("kernel32.dll");
	if (!kernel)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating kernel32 module.\n");
		return;
	}

	FARPROC setUnhandledExceptionFilter = GetProcAddress(kernel, "SetUnhandledExceptionFilter");
	if (!setUnhandledExceptionFilter)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating SetUnhandledExceptionFilter.\n");
		return;
	}

	#ifdef Q_OS_WIN64
	static const BYTE newBody[] = {
		0x33, 0xC0, // xor eax,eax
		0xC3 // ret
	};
	#else
	static const BYTE newBody[] = {
		0x33, 0xC0, // xor eax,eax
		0xC2, 0x04, 0x00, // ret 4
	};
	#endif
	SIZE_T bytes = 0;
	if (WriteProcessMemory(GetCurrentProcess(), (LPVOID)setUnhandledExceptionFilter, (LPCVOID)newBody, sizeof(newBody), &bytes) == 0)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed writing process memory.\n");
	}
}

I tried to simplify the other examples I found as much as possible but I'm no Windows programmer so take it with a grain of salt.

The best way to get to the bottom of what it is that clears the filter would be to make a proper hook, and print the callstack in there, but its somewhat involved and I'm not sure what I would do with the info anyway since I guess it theoretically could be multiple things.

To get the WER module you have to build sentry-native with CRASHPAD_WER_ENABLED

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024 2

Is there any updates on this issue?

We must understand that we cannot solve this problem within the Native SDK.

While a solution like the one proposed here: #833 (comment) can work to mitigate the issue, we have to be clear that this kind of solution is the reason why the problem exists in the first place.

We already had support topics that took weeks to resolve. In the end, we discovered that a customer code dependency overwrote SetUnhandledExceptionFilter() in the same way described above. This means that it is not only wrong inside the Native SDK but also wrong as an "officially" suggested mitigation from Sentry's perspective.

Also, sentry_reinstall_backend() is not a solution to the problem but rather a way to identify and debug the issue. If it is used as a permanent mitigation, then only as a last resort, for instance, when a closed source dependency with an unresponsive dev team misbehaves.

IMO, the correct approach for Sentry is to extend the "Advanced Usage" documentation and provide users of the Native SDK with steps to identify the issue, isolate potential culprits, and implement solid mitigations. I did this with multiple users in an ad hoc fashion over the last two years, and while all users had different setups and root causes, there were a lot of common steps in between. Another aspect is to collect known culprits and potential mitigations, if any.

Since the focus over the last six to nine months has been primarily on supporting downstream SDKs to improve quality on mobile platforms, these Desktop documentation topics have been put on the back burner. But we will tackle them.

cc: @kahest

from sentry-native.

Swatinem avatar Swatinem commented on June 21, 2024 1

IMO hooking the UnhandledExceptionFilter completely is a really big hammer. We might be breaking apps in surprising ways that we can’t predict yet.

There are legitimate uses for temporarily using it and then restoring the original handler afterwards. We don’t want to break those usecases. Although as the examples show, restoring the original handler is often not done correctly by other tools either.

We might want to offer this either as a compile time flag or an explicit runtime API though, wdyt @supervacuus ? Then the responsibility of (not) breaking any unpredictable usecases is up to the app author.

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

Hi @mntt-SSE! I could not look into this on any of my Windows machines yet, but it seems you can access both types of machines to verify behavior and configuration.

Can you look in the registry to see whether the machines differ in the settings for HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps? I could imagine some unexpected interaction with the crashpad WER module.

from sentry-native.

mntt-SSE avatar mntt-SSE commented on June 21, 2024

Thanks for your answer! On a machine where the problem exists, the LocalDumps entry looks like this:
image

And each entry has a DumpFolder key with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. All of them are the same, and none seem to point to the application I'm using, and the DumpFolder is also different from the one where I get the crash dumps (although that folder also exists, but is empty).

On a machine where Sentry works as expected, no LocalDumps entry exists at all under Windows Error Reporting. I verified with 2 machines that don't work and 1 that does work, and they fit the pattern.

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

Thx for verifying, @mntt-SSE; that is what I expected. LocalDumps bypasses application-local handling in favor of a centralized dump approach. It should be possible to deactivate it for your application by creating a new key (similar to the other apps below the LocalDumpskey), but I haven't tested it yet. Here are the docs for the key: https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

And each entry has a DumpFolder key with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. All of them are the same, and none seem to point to the application I'm using, and the DumpFolder is also different from the one where I get the crash dumps (although that folder also exists, but is empty).

Maybe I should also say, that what is interesting are the values in LocalDumps, not the sub-keys which probably just provide a counter config because those affect all applications not listed below.

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

For further root-causing, it would certainly help if you could also gather the debug output of the Native SDK on the systems where the dump happens in %LOCALAPPDATA%\CrashDumps; maybe something obvious is already visible there.

from sentry-native.

mntt-SSE avatar mntt-SSE commented on June 21, 2024

The LocalDumps key itself does not contain any values other than (Default), which has no data.

I have so far tried to:

  1. Create a key under LocalDumps with the name of the application, and set its DumpType value to 0.
  2. Create a key under Windows Error Reporting called ExcludedApplications, and add a DWORD value with the name of the application as the key, and 1 as the value.
  3. Under the ExcludedApplications key, create another key with the name of the application.
  4. Call the Windows API function WerAddExcludedApplication. This did the same as 2.

I tried restarting the computer after each step, but nothing worked :/ I also checked the corresponding keys under HKEY_CURRENT_USER, but there is nothing there.

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init() was still successful). Can you check if sentry_init() returns successfully (i.e., doesn't return 1) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).

The LocalDumps key itself does not contain any values other than (Default), which has no data.

That means that it is configured to use the default values.

from sentry-native.

mntt-SSE avatar mntt-SSE commented on June 21, 2024

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 2 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 520m" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "executing task on worker threa" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "sending request using winhttp to <address>:
x-sentry-auth:<auth> sentry_key=<key>, sentry_version=7, sentry_client=sentry.native/0.5.3 
content-type:application/x-sentry-envelope 
content-length:17594 " (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 41 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 243m" (level=-1)

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init() was still successful). Can you check if sentry_init() returns successfully (i.e., doesn't return 1) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).

sentry_init() returns 0 on all machines I've tested it on. And I can also see that the crashpad_handler.exe process is starting when the application is started, and quits when the application crashes (or when it exits nicely).

from sentry-native.

supervacuus avatar supervacuus commented on June 21, 2024

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

Okay, the logs are wrapped, interleaved, and incomplete. They also show some other transmission (maybe a manual event) but not the crash (because with crashpad, you would not see a transmission from within our worker_thread).

Maybe it is easier to debug this issue with the plain sentry_example.exe. In any case, I would expect a program configured as you described it in your initial description to print all these things - at some point in its execution - until it crashes:

[sentry] INFO using database path "your_database_path"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
[sentry] DEBUG starting crashpad backend with handler "your_handler_path"
[sentry] DEBUG using minidump url "your_dsn_url"
[sentry] DEBUG registering crashpad WER handler "your_wer_module_dll"
[sentry] INFO started crashpad client handler
[sentry] DEBUG processing and pruning old runs
[sentry] INFO flushing session and queue before crashpad handler
[sentry] INFO handing control over to crashpad

Since you sentry_init() successfully (at least according to the return value), I am especially interested if you see the last two lines. If those never appear, then the client-side crash-handler is never invoked, which means that the crashpad_handler will not be triggered (even if it is running).

You can also verify this externally by deleting your database path, executing your program, provoking a crash, and looking into (the newly created) path; there must be a last_crash marker file in there.

from sentry-native.

Cyriuz avatar Cyriuz commented on June 21, 2024

I have been plagued by the same issue, that some machines just refuse to send any reports. After reading this discussion and talking to @supervacuus on discord I spent a day trying a bunch of things:

To start by answering the last question, if the 2 last lines of the sentry output: Those lines does appear on working machines and does not on the machines that doesn't work. This indicates that the UnhandledExceptionFilter is not run, so I started trying to figure out why.

The first thing that proved that it was not a Windows issue (as a lot of blogs and discussions around the web indicates it could be) was that it worked to do sentry_reinstall_backend() just before triggering the crash.

So something in the startup process overwrites our exception handler, and I'm pretty sure it is not any of our dependencies since it should behave the same for all machines in that case, and I couldn't find any reference to SetUnhandledExceptionFilter. Not sure what else it could be, some driver or something? I tried inspecting loaded dlls but didn't see anything significant, but I assume it could be something that gets loaded and then released during startup.

I tried the sentry_example.exe which seems to work fine on all machines, so I would assume it has to do with it being a gui application.

I finally tried replacing the SetUnhandledExceptionFilter with a simple return by overwriting the function address and it works perfectly, but it feels very sad. Searching a bit on github though this seems to be the way people do it. Maybe it is something sentry-native should do to increase its reliability? I assume if you use sentry-native you want it to have full control of these things. But at the same time it is a bit harsh...

from sentry-native.

mntt-SSE avatar mntt-SSE commented on June 21, 2024

Thank you @Cyriuz and @supervacuus for looking into this!

I have also done some more investigation. Firstly, the reason I didn't get the full Sentry logs was because I initialized Sentry before the other log handling. When I switched the order I got output identical to yours @supervacuus, except for the registering crashpad WER handler "your_wer_module_dll", where it says no WER handler is registered.

Secondly, I have found a workaround. I'm building a Qt application (Qt 5.15.8), and I noticed that if I initialize Sentry after the first window is shown (using QWidget::show()), then everything works as expected, which strengthen your theory about it haviing to do with UI @Cyriuz . So now I'm flashing a tiny window before the main one, which is far from ideal, but at least I'm getting the events. Would still be good to have another solution though.

from sentry-native.

Swatinem avatar Swatinem commented on June 21, 2024

That is some very good feedback, thanks for the investigation!

We have indeed heard reports before that Qt / display drivers are constantly overwriting the UnhandledExceptionFilter, similar to managed language runtimes.

While it would be possible to hook / lock overwriting it, we decided against that, as it is a very deep "incision" that we don’t want to do by default.

from sentry-native.

Cyriuz avatar Cyriuz commented on June 21, 2024

Fair enough. Although I wonder if it can still be considered somewhat standard as so many other projects seem to do the same? And considering that sentry-native is, in my understanding at least, supposed to take ownership of the entire crash reporting pipeline I feel like there might be an argument for including it, I guess it could even be optional? Or maybe it could just be a good topic for a page on the sentry docs as a workaround so people can find it easier? I assume it potentially affects most people that use sentry native on Windows.

from sentry-native.

W2Wizard avatar W2Wizard commented on June 21, 2024

Just going to throw in my 2 cents here. Had a very similar issue with regards to QT (5.15.8), the moment it was initialized the entire crash reporting seized to function. There we're no registry keys, or misplaced dumps, in fact there was absolutely nothing.

A quick test basically confirmed what I suspected, I would throw and exception right after sentry initialized, it worked. Did the same right after calling QApplication a(argc, argv); nope back to being broken.

Weirdly enough it worked on some Windows machines just fine (e.g: server, ...) but on others basically nothing worked. No dumps, no logs (only the one indicating that sentry had initialized), no indication that something went wrong whatsoever.

from sentry-native.

rodolfoBee avatar rodolfoBee commented on June 21, 2024

This issue is also affecting a user of the Unreal SDK v0.16.0. It happens with both auto and manual initialisation:

FConfigureSettingsDelegate SettingsDelegate;
SettingsDelegate.BindDynamic(this, &USentryCrashReporterIntegration::HandleSettingsDelegate);

USentrySubsystem* SentrySubsystem = GEngine-<GetEngineSubsystem>USentrySubsystem<();
SentrySubsystem-<InitializeWithSettings(SettingsDelegate);

Is there any updates on this issue?

from sentry-native.

Cyriuz avatar Cyriuz commented on June 21, 2024

While I respect your decision and stance on the issue, mostly for discussions sake, I would argue that because of the nature of this library, its only purpose being catching crashes, the argument that the "solution is the reason why the problem exists in the first place" is not really helpful. For sure it is not good that many dependencies and drivers use this api wrong/hooks it, but that doesn't change the fact that this is the situation we're in.

We want to take ownership of the exception handling but can't possibly do it any other way. If it was just dependencies it would be one thing, it is easy enough to debug who is resetting the hook and we can complain upstream, but for drivers that does it on customer PCs, its just not feasible.

Is it really that big of a deal to say that we don't want anyone else to be able to handle exceptions, given the scope of this lib? My proposed fix doesn't involve using the apis wrong, just making sure no one else is allowed. I agree that it feels wrong in the context of doing it in a library, and maybe it is also wrong to do it here, but I see no other option and I think it would help a lot of users of this lib to save a lot of time debugging things.

In any case docs covering all of this that is easy to find would be great so you can easily find and make your own decisions :)

from sentry-native.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.