At Zed we see 2-5 panics per week across

I will simply trust <a class="user-mention notranslate" data-hovercard-ty

Error codes really ought to include ENFS and EFUSE. ENFS: no m

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

panic: unexpected error during closedir "Bad file descriptor" for unix::fs::Dir,about rust-lang/rust

Comments (29)

bjorn3 commented on July 22, 2024 3

I just tried compiling Zed on Linux with the rustc_codegen_cranelift backend. It crashes reliably with:

fatal runtime error: IO Safety violation: owned file descriptor already closed

Thread 1 "Zed" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: Bestand of map bestaat niet.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007ffff786fe8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7820fb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff780b472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000555562b7e5d1 in std::sys::pal::unix::abort_internal () at library/std/src/sys/pal/unix/mod.rs:366
#5  0x0000555560e37664 in std::sys::pal::unix::fs::debug_assert_fd_is_open () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/std/src/rt.rs:43
#6  0x0000555560dffad6 in std::os::fd::owned::{impl#7}::drop () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/std/src/os/fd/owned.rs:181
#7  0x0000555560e38069 in core::ptr::drop_in_place<std::os::fd::owned::OwnedFd> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#8  0x000055555b0f8dde in core::ptr::drop_in_place<std::sys::pal::unix::fd::FileDesc> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#9  0x000055555b0f928c in core::ptr::drop_in_place<std::sys::pal::unix::process::process_common::Stdio> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#10 0x000055555b0f8856 in core::ptr::drop_in_place<core::option::Option<std::sys::pal::unix::process::process_common::Stdio>> ()
    at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#11 0x000055555b0f930f in core::ptr::drop_in_place<std::sys::pal::unix::process::process_common::Command> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#12 0x000055555b0f8d67 in core::ptr::drop_in_place<std::process::Command> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#13 0x000055555b10be12 in alacritty_terminal::tty::unix::new () at /home/bjorn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/alacritty_terminal-0.23.0/src/tty/unix.rs:285
#14 0x000055555afdc160 in terminal::{impl#7}::new () at crates/terminal/src/terminal.rs:377
#15 0x000055555a49c2fb in project::terminals::{impl#0}::create_terminal () at crates/project/src/terminals.rs:167

$ dist/rustc-clif -vV
rustc 1.80.0-nightly (b1ec1bd65 2024-05-18)
binary: rustc
commit-hash: b1ec1bd65f89c1375d2cf2fb733a87ef390276d3
commit-date: 2024-05-18
host: x86_64-unknown-linux-gnu
release: 1.80.0-nightly
Cranelift version: 0.108.0

Upon further investigation, I found alacritty/alacritty#7996, which fixes the crash I found here. I think it is very likely that the bug this fixes is responsible for the panic described in this issue too.

from rust.

ConradIrwin commented on July 22, 2024 3

Nvm, I just read the attached issue. I’ll pull that into zed and see if it fixes out problems

from rust.

ConradIrwin commented on July 22, 2024 3

We've been running the fixed alacritty in preview for a week, and on stable for a few days, and have seen 0 instances of this - this either means we've been unusually lucky, or this is now solved.

Thank you very much to @the8472 for adding the debug asserts, and to @jakobhellermann for figuring out the problem.

from rust.

workingjubilee commented on July 22, 2024 1

Thank you for sharing!

I am not necessarily opposed to this change, I just felt it would be useful to know a bit more context.

Hmm. I wonder if, in another thread that is not bound by the executor, but does wait until after the worktree code has started up, destroying one of the directories in question would do it? That would make sense, and would explain why we get such a lovely "error" from the OS. That would be a classic case of "force majeure" that we typically do want to ignore, yes.

from rust.

the8472 commented on July 22, 2024 1

I will simply trust @the8472 that EBADF is actually a highly specific error, in this context at least

Too much trust 😅. It's entirely possible that the usual suspects (FUSE, network filesystems, bugs in filesystem impls) do something crazy. But there's hope, let it at least interact with reality before it gets crushed again.

At least libfuse docs say this about the release operation:

The filesystem may reply with an error, but error values are not returned to close() or munmap() which triggered the release.

from rust.

workingjubilee commented on July 22, 2024 1

Error codes really ought to include ENFS and EFUSE.

ENFS: no more-specific error available, just some hot garbage from NFS.
EFUSE: no more-specific error available, just some hot garbage from FUSE.

from rust.

workingjubilee commented on July 22, 2024

@ConradIrwin Can you discuss the tracebacks for this, or does your telemetry lack sufficiently sophisticated crash reporting to make that possible?

from rust.

workingjubilee commented on July 22, 2024

I expect that the standard library should not panic on situations that do happen in reality.

I know you meant something slightly aside from the reading of this, but I do expect the standard library to panic on indexing a slice with its length, or greater than. And people write code like that every day. So clearly the criteria has to be something more like

whether it is reasonable to proceed anyways (arguable?)
whether it is fixable (arguably not, if we get an error as joyously "descriptive" as EBADF)

from rust.

ConradIrwin commented on July 22, 2024

Fair: I do expect panics on programmer error or in a situation where it's unsafe to continue. I am relatively (but it's obviously hard to be 100%) sure this is not programmer error, and I'm convinced that it'd be just fine if the program kept running (c.f. how this error is ignored when dropping a file descriptor).

The full back-trace is below. All line numbers in Zed code are relative to v0.131.6 and in rust 1.77.1. I'm not sure why the top line number is missed by symbolicate, but the panic message narrows it down to this assertion

I believe the reason this panics at on line 4018 of worktree.rs in Zed's code is that child_paths goes out of scope, which was retaining a ReadDir returned from std::fs::read_dir() (via smol) here. I can't see any shenanigans on this codepath that would lead us to to invalidate the file descriptor, which leads me to believe that "this just happens sometimes" when interacting with the filesystem. (Though we are holding this across async/await, so it seems possible that something else is affecting the system's state, but it's very unclear as to what would cause this kind of a problem).

Panic `unexpected error during closedir: Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" }` on thread 38 (com.apple.root.user-initiated-qos)
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/std/src/sys/pal/unix/fs.rs:0:	_$LT$std..sys..pal..unix..fs..Dir$u20$as$u20$core..ops..drop..Drop$GT$::drop::h84b244fb776f88f3
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507:	core::ptr::drop_in_place::<std::path::PathBuf>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507:	core::ptr::drop_in_place::<std::sys::pal::unix::fs::InnerReadDir>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/alloc/src/sync.rs:1756:	<alloc::sync::Arc<std::sys::pal::unix::fs::InnerReadDir>>::drop_slow
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/mem/mod.rs:394:	core::mem::size_of_val_raw::<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/alloc/layout.rs:201:	<core::alloc::layout::Layout>::for_value_raw::<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/alloc/src/boxed.rs:1241:	<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send> as core::ops::drop::Drop>::drop
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507:	core::ptr::drop_in_place::<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507:	core::ptr::drop_in_place::<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>>>
crates/worktree/src/worktree.rs:4018:	<worktree::BackgroundScanner>::scan_dir::{closure#0}
crates/worktree/src/worktree.rs:3754:	<worktree::BackgroundScanner>::scan_dirs::{closure#0}::{closure#0}::{closure#0}
crates/gpui/src/executor.rs:457:	<gpui::executor::Scope>::spawn::<<worktree::BackgroundScanner>::scan_dirs::{closure#0}::{closure#0}::{closure#0}>::{closure#0}
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/task/poll.rs:51:	<core::task::poll::Poll<()>>::map::<core::result::Result<(), alloc::boxed::Box<dyn core::any::Any + core::marker::Send>>, core::result::Result<(), alloc::boxed::Box<dyn core::any::Any + core::marker::Send>>::Ok>
/Users/administrator/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.7.0/src/raw.rs:557:	<async_task::raw::RawTask<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>, (), <gpui::executor::BackgroundExecutor>::spawn_internal<()>::{closure#0}, ()>>::run
libdispatch.dylib	_dispatch_client_callout
libdispatch.dylib	_dispatch_root_queue_drain
libdispatch.dylib	_dispatch_worker_thread2
libsystem_pthread.dylib	_pthread_wqthread
libsystem_pthread.dylib	start_wqthread

from rust.

the8472 commented on July 22, 2024

For comparison a failure to close a file during the drop of a file descriptor is silently ignored as I would expect.

This is different because closing regular files can emit IO errors for which there is no good recovery strategy because it is unclear after calling close whether the FD is already closed successfully or not. The recommended strategy for error handling in those cases is to call sync_all on files first before closing them.

EBADF is very different. It indicates that something closed a file descriptor it did not own, which is a violation of IO safety. Unless you have some evidence that this is just macos weirdness I would expect this to indicate a very serious bug in userspace which can lead to data corruption and UB due to file descriptor confusion.

This error should not be suppressed.

Arguably we should also not silently ignore EBADF for OwnedFd and only suppress other errors such as EIO.

And maybe we should add a fn close(self) -> Result<()> API to allow non-panicking error handling, though even then EBADF shouldn't be silently ignored and the application should be aborted to prevent data from being accidentally clobbered.

from rust.

the8472 commented on July 22, 2024

Caveat: closedir operates on a DIR*, not on a fd directly. So it's possible that the C library returns EBADF for something that's not an invalid file descriptor but some other invalid state in that wrapping struct.
Is the macos C library open source? Then we can check there. If not it'd require reproducing the issue and tracing syscalls to tell the difference.

from rust.

joboet commented on July 22, 2024

It is, though as with everything Apple, it's a mess.

Errors can only occur from the close call, the rest of closedir just does some reclaiming. Most of the kernel is open-source as well, so one could try to figure out if anything except the fd lookup could cause the EBADF, but I'm not motivated enough to do that.

from rust.

the8472 commented on July 22, 2024

Thanks for looking it up. I think that is enough evidence that there's probably some library-UB going on in zed and we shouldn't try to silence it.

from rust.

joboet commented on July 22, 2024

The only place in zed itself which calls as_raw_fd is this code for getting the PID of the child process running in the integrated terminal. While it's definitely very fishy (it's not tied to the lifetime of the PTY), it doesn't close anything so I doubt that this issue is caused there.

As for dependencies, it might help to get a complete backtrace to see which ReadDir::drop specifically is causing this.

from rust.

the8472 commented on July 22, 2024

It's not necessarily readdir that's causing this, it could be the victim of something else going rogue and closing the wrong fd numbers.

from rust.

joboet commented on July 22, 2024

It must be... Though I'm hoping that there is enough locality between the close and the read_dir to point us in the direction.

from rust.

the8472 commented on July 22, 2024

I have an idea how to harden std types against these kinds of things, but it would be fairly expensive and imo only make sense for debug_assert builds. Randomization for file descriptor numbers. But I'm not sure if people would be on board with it.

from rust.

saethlin commented on July 22, 2024

I think that is a very interesting idea and would support merging it as a debug assertion if it finds a bug in Zed.

I am not yet convinced this is a bug in Zed.

from rust.

the8472 commented on July 22, 2024

Yeah, without rootcause or even reproducer I'm not claiming certainty.

Another possibility that I see is some race in the Arc leading to a double-drop.

from rust.

the8472 commented on July 22, 2024

@ConradIrwin I don't think this is a "this just happens sometimes" thing. File descriptors should not randomly vanish from a process. File descriptors are densely allocated resources holding critical data. If they were to get misnumbered under the nose of a process things could go seriously wrong, including data corruption. And if Arc is broken that'd also be serious.

So there could be a serious bug somewhere and that imo warrants further investigation rather than silencing it in the standard library. That it is some benign (although surprising) macos behavior is also possibility, but this should be proven before we silence it.

I have opened #124130 to abort on EBADF in OwnedFd. That should provide a signal to tell whether it's specific to ReadDir or file descriptors are getting clobbered in general. But you'll have to build Zed with a custom rustc build for that, or nightly if it gets merged.

from rust.

workingjubilee commented on July 22, 2024

Arguably we should also not silently ignore EBADF for OwnedFd and only suppress other errors such as EIO.

Old MacDonald had a farm...

Yes, I'm in favor of us, at least for as long as we can tolerate it, silencing specifically whatever errors we expect to have happen and for us to be able to do nothing about, instead of "any error at all". I will simply trust @the8472 that EBADF is actually a highly specific error, in this context at least, and not merely the surface description of "something bad happened lol".

from rust.

ConradIrwin commented on July 22, 2024

Thanks for all the input here, and particularly the reassurance that "this shouldn't happen".

It seems like the next steps are to try and narrow down:

Is this a logic bug in Zed or it's dependencies.
Is this a filesystem bug in macOS (or Fuse, or Samba - which I know we have some users using).

For the in-process case it seems like a "bad file descriptor" means "one that the OS doesn't think you have", so either:

Something is generating arbitrary numbers and sneaking them into this Dir object (seems very unlikely)
Something is closing arbitrary numbers as file descriptors and sometimes this happens to close an open Dir object (seems very unlikely).
Something is closing a file-descriptor twice, and between the first and second close the file-descriptor is re-used by the Dir open, so by the time we come to drop the Dir we have lost it (seems most plausible).
(maybe) Something is closing all file-descriptors on app shutdown, but rust code continues to run after that?
...your idea here...

Although most of our code is rust, we do like to the Cocoa runtime, and a few other non-rust libraries, so the search space is rather large. Ideas for narrowing it down would be very appreciated! (In rust, I think we just need to audit all our crates for AsRawFd/IntoRawFd, and assuming no rust bugs, everything else should be "safe" according to https://rust-lang.github.io/rfcs/3128-io-safety.html?)

For the filesystem bug case, I haven't yet tried reproducing this on other filesystems. I know we have users using Samba at least, and Fuse/sshfs seems quite probable too, so that might be a fruitful path.

If anyone's interested in this kind of spelunking, I'd love a second brain while working through this: https://calendly.com/conradirwin/pairing, otherwise any interesting hypothesis that we can test for how to get a reproduction would be awesome.

If this does turn out to be a bug on the file system side, then I think we do need to not panic in rust. If it turns out to be a bug on the user-space side, then we'll be glad to have tracked it down, and the current behavior seems ok (until we do find a filesystem bug, at which point we'll want to be back to square one).

In parallel we might want to explore ideas from #98338 on how we can handle fallible drops for things like this.

from rust.

the8472 commented on July 22, 2024

(In rust, I think we just need to audit all our crates for AsRawFd/IntoRawFd, and assuming no rust bugs, everything else should be "safe" according to https://rust-lang.github.io/rfcs/3128-io-safety.html?)

Yes, that's the intent.

I'm currently writing a FUSE testcase on linux to see if I can get a EBADF propagated from one of its replies to a close().

from rust.

ConradIrwin commented on July 22, 2024

Thanks! I should be able to do something similar for macFUSE / Samba next week.

from rust.

the8472 commented on July 22, 2024

strace excerpt:

openat(AT_FDCWD, "./fuse-test/foo", O_RDONLY|O_CLOEXEC) = 5
close(5)                                = -1 EBADF (Bad file descriptor)

Hope has been canceled once again.

Test program

use std::{fs::File, thread, time::{Duration, UNIX_EPOCH}};
use fuser::{FileAttr, FileType};

struct TestFs {}

const DUMMY_ATTR: FileAttr = FileAttr {
    ino: 2,
    size: 13,
    blocks: 1,
    atime: UNIX_EPOCH,
    mtime: UNIX_EPOCH,
    ctime: UNIX_EPOCH,
    crtime: UNIX_EPOCH,
    kind: FileType::RegularFile,
    perm: 0o644,
    nlink: 1,
    uid: 501,
    gid: 20,
    rdev: 0,
    flags: 0,
    blksize: 512,
};

impl fuser::Filesystem for TestFs {
    fn init(&mut self, _req: &fuser::Request<'_>, _config: &mut fuser::KernelConfig) -> Result<(), libc::c_int> {
        Ok(())
    }

    fn lookup(&mut self, _req: &fuser::Request<'_>, parent: u64, name: &std::ffi::OsStr, reply: fuser::ReplyEntry) {
        let d = Duration::from_secs(1);
        reply.entry(&d, &DUMMY_ATTR, 0);
    }

    fn open(&mut self, _req: &fuser::Request<'_>, _ino: u64, _flags: i32, reply: fuser::ReplyOpen) {
        reply.opened(1, 0);
    }

    fn flush(&mut self, _req: &fuser::Request<'_>, ino: u64, fh: u64, lock_owner: u64, reply: fuser::ReplyEmpty) {
        reply.error(libc::EBADF);
    }
}

fn main() {
    let fs = TestFs {};

    let t = thread::spawn(|| {
        //let options = [MountOption::RO, MountOption::AutoUnmount, MountOption::AllowRoot];
        fuser::mount2(fs, "fuse-test", &[]).expect("mount failed");
    });

    thread::sleep(Duration::from_millis(500));

    let f = File::open("./fuse-test/foo").unwrap();
    drop(f);

    t.join().unwrap();
}

from rust.

the8472 commented on July 22, 2024

If this replicates on macos and users have some janky fuse drivers that'd be a likely suspect. Though those drivers should have a bug filed against them.

from rust.

ConradIrwin commented on July 22, 2024

A quick (though inconclusive) update: Although I'm not certain it doesn't reproduce, I was unable to reproduce with the example program (it fails before it gets to the end), and my testing with Zed itself, I could only reproduce libgit2/libgit2#6796, not this error with sshfs.

Still to try: updating the example program to run to completion, and setting up a Samba server to test that one.

from rust.

ConradIrwin commented on July 22, 2024

If you can reproduce this reliably, I’d love to debug it together - do you have time to talk the first week of July? If so please reach out to [email protected]

In my mind there’s quite a chance this is a bug in zed or its dependencies, but it’s also possible that this is a failure introduced by the filesystem.

things I’d like to try:

adding logging to open/close fds to see if we can find a double close
Knowing more about the file system you’re using

from rust.

bjorn3 commented on July 22, 2024

I can confirm that using the current master branch of alacritty-terminal fixes the IO safety crash with rustc_codegen_cranelift. It does still crash, but that is because of some unimplemented things in rustc_codegen_cranelift (rust-lang/rustc_codegen_cranelift#1492), not a bug in Zed or any of it's dependencies.

diff --git a/crates/terminal/Cargo.toml b/crates/terminal/Cargo.toml
index ef6260730..3ce869597 100644
--- a/crates/terminal/Cargo.toml
+++ b/crates/terminal/Cargo.toml
@@ -14,7 +14,7 @@ doctest = false
 
 
 [dependencies]
-alacritty_terminal = "0.23"
+alacritty_terminal = { git = "https://github.com/alacritty/alacritty", version = "0.24.1-dev" }
 anyhow.workspace = true
 collections.workspace = true
 dirs = "4.0.0"

alacritty/alacritty#7996 hasn't been published to crates.io yet, so I had to use a git dependency.

from rust.

panic: unexpected error during closedir "Bad file descriptor" for unix::fs::Dir about rust HOT 29 CLOSED

Comments (29)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent