
Comments (21)

cbiffle commented on August 19, 2024

Alright, after some further analysis on programs here, I think as of current nightly, my main foe is <Enumerate<T> as Iterator>::next. Failing to inline it causes a checked add (inside the function) and then often a bounds check or other check (in the loop where the information about the range of the index is hidden).

Enumerate is not actually special, so I'm going to evaluate the potential change by copying it into a crate and festooning it with inline attributes and seeing what happens. (Currently all problematic calls to enumerate are in code I control, so I can do this.)
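As a minimal sketch of what that experiment looks like (the type and function names here are illustrative, not libcore's):

```rust
// A local copy of Enumerate, festooned with inline attributes as described
// above. `InlineEnumerate` and `enumerate_inline` are illustrative names.
struct InlineEnumerate<I> {
    iter: I,
    count: usize,
}

impl<I: Iterator> Iterator for InlineEnumerate<I> {
    type Item = (usize, I::Item);

    // Forcing inlining here exposes the counter's range to the surrounding
    // loop, which is what lets the checked add and bounds checks disappear.
    #[inline(always)]
    fn next(&mut self) -> Option<(usize, I::Item)> {
        let item = self.iter.next()?;
        let i = self.count;
        self.count += 1;
        Some((i, item))
    }
}

fn enumerate_inline<I: Iterator>(iter: I) -> InlineEnumerate<I> {
    InlineEnumerate { iter, count: 0 }
}
```

Whether this actually shrinks a given binary still has to be checked per call site.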

from rust.

cbiffle commented on August 19, 2024

In case anyone wonders about that and tries it themselves:

Yes, you can write your own Enumerate outside core and set it to inline aggressively and have it reliably go away at compile time. In our case, we then hit FilterMap, which in turn hits FlatMap, until eventually you've copied the entire iterator machinery. This is problematic because the internal implementation of the iterator machinery uses a bunch of unstable features. So in the end, I decided against this approach.

It turns out that, annoyingly, (0..array.len()).zip(&array) is significantly more efficient with the current compiler than array.iter().enumerate(), so there are now some places I've snuck that in.
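For concreteness, the two forms are semantically identical; only the generated code differs (function names are illustrative):

```rust
// Idiomatic form: per the observation above, this currently codegens worse
// at small opt levels when Enumerate::next fails to inline.
fn sum_enumerate(array: &[u32]) -> u32 {
    array.iter().enumerate().map(|(i, &x)| i as u32 + x).sum()
}

// Equivalent form that the current compiler handles better: the index range
// is visible to the optimizer, so per-iteration checks can be elided.
fn sum_zip(array: &[u32]) -> u32 {
    (0..array.len()).zip(array).map(|(i, &x)| i as u32 + x).sum()
}
```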

It's off the topic of my original post but I'm now looking at f32::clamp. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code. I think there's a potentially strong argument for inline(always) on f32::clamp so that may be the new direction.
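For context on why the panic is there at all: f32::clamp asserts that min <= max, and that assertion's panic message is what pulls in the float-formatting code. A hypothetical panic-free variant (illustrative, not a libcore API) for callers who already guarantee ordered bounds:

```rust
// Unlike f32::clamp, this never panics: if lo > hi it silently returns hi
// instead of asserting, so no panic or formatting machinery is referenced.
#[inline(always)]
fn clamp_nopanic(x: f32, lo: f32, hi: f32) -> f32 {
    x.max(lo).min(hi)
}
```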

kevinmehall commented on August 19, 2024

I've also encountered similar behavior, and it still persists on nightly. It seems worse at opt-level=z than opt-level=s, but in my case the overall code size is smaller with z despite the bloat in certain cases like this one.

Compiler Explorer reproducer

saethlin commented on August 19, 2024

@cbiffle One thing that's really missing here is that we have no tests at all for the code size of realistic codebases produced by -Copt-level=z. If we were to add Hubris or lilos to rustc-perf (which reports artifact sizes in addition to build time), would you expect that to be a decent signal of how effective various tuning of the standard library or compiler is at providing the optimizations that help you?

workingjubilee commented on August 19, 2024

So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and core::slice::Iter -- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying some inline(always) attributes in this neighborhood. (This might not be sufficient for the Enumerate case but could definitely be applied to the array/slice Index impls.)

If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.

@cbiffle While I would first check whether anyone else has already tried something like this in another PR... yes, we are generally open to people opening such PRs! The main question is whether they do something like "regress compiler performance", so they generally get thrown against rustc-perf first to examine that. It's also not atypical for inline(always) to be the "wrong answer" when considered across all possible types, if it leads to over-inlining, but for many extremely trivial functions in libcore it's often worth checking that assumption, yeah.

In any case, ideally they come with a codegen test (see tests/codegen and codegen-tests in the rustc-dev-guide) that verifies that the emitted IR has the properties you want for the code you wanted optimized. That way whatever is implemented for the optimization can get removed later hypothetically if LLVM ever learns to Just Do The Right Thing.
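A rough sketch of the shape such a test might take (the directive and CHECK lines are illustrative; see the rustc-dev-guide for the actual harness syntax):

```rust
// Hypothetical codegen test: assert that range-indexing a slice compiles
// down to IR containing no panic call. The CHECK comments are inert here;
// the real harness feeds them to FileCheck.

//@ compile-flags: -Copt-level=z

// CHECK-LABEL: @first_half
// CHECK-NOT: panic
#[no_mangle]
pub fn first_half(s: &[u8]) -> &[u8] {
    // The bound n <= s.len() is provable, so the range check should vanish.
    let n = s.len() / 2;
    &s[..n]
}
```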

workingjubilee commented on August 19, 2024

@cbiffle Also, a bit obvious but: I strongly recommend investigating the latest behaviors on 1.77 and nightly! There has been a lot of effort, much of it very recent, to make rustc automatically inline more code by doing so on the MIR level, before LLVM has a chance to make decisions (and thus, possibly avoid making good ones, although this sometimes proves too over-eager). And it'd be great to land codegen or assembly tests even if this behavior is fixed for you.

saethlin commented on August 19, 2024

I think the only solution to problems like this is #[inline(always)]. That has the deeply unfortunate side effect of producing inlining in unoptimized builds (so, perversely, fixing this will probably degrade compile times for unoptimized builds; yes, this behavior is deliberate; yes, I hate it deeply), but I don't think we have a better tool for telling LLVM that, no, actually its inlining heuristics are wrong for this function and we know better.

In at least one case, opt-level=3 winds up being smaller because of more aggressive inlining causing checks to be elided.

This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z in embedded.

Anyone working on this should be aware of the unfortunate interaction with IPSCCP and codegen tests; I'm quite sure that our current and only opt-level=z test is broken because of it: #119878 (comment)

kevinmehall commented on August 19, 2024

@saethlin's recent PR #120863 added #[inline(always)] to <Range<usize> as SliceIndex<[T]>>::index and <RangeTo<usize> as SliceIndex<[T]>>::index but not index_mut, explaining the difference between the & and &mut slices in my example on nightly.

cbiffle commented on August 19, 2024

@workingjubilee - I can confirm @kevinmehall's observation that the behavior mostly continues on nightly. The Index impls for slices are more reliably inlined (thanks @saethlin!) but several of my other bugbears, such as Enumerate, are not.

(This has the exciting side effect that appeasing clippy, by using enumerate instead of iterating over a range and indexing, actually makes the code bigger and performance worse!)

re: opt-level=3 being smaller sometimes:

This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of opt-level=z in embedded.

Well, the "sometimes" here is the problem. We currently have images that will fit in a system with z that will not fit with 3. I haven't done a deep analysis here recently but last I looked, it was mostly a result of aggressive loop unrolling undoing some attempts at size optimization. These are the sorts of things I'd have hit with pragmas to control unrolling in C, but as far as I'm aware we have no direct equivalent in rustc.

Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always) / inline(never) from outside the source tree, in response to observed compiler behavior. This way my weird constraints don't need to affect e.g. the compile performance of debug builds. Yeah, this would be fragile across compiler versions, but we update the toolchain approximately annually because a lot of things related to binary size are fragile across compiler versions. Anyway, end of side note.

cbiffle commented on August 19, 2024

Anyone working on this should be aware of the unfortunate interaction with IPSCCP and codegen tests

Thanks for the reference on IPSCCP; I was unfamiliar with it, and it explains several cases where my 15-year-old mental model of LLVM's behavior wasn't matching what I was seeing.

workingjubilee commented on August 19, 2024

@cbiffle Well, you can still pass llvm-args to the backend.

One thing that might be interesting to do, that a few people suggested long ago but that hasn't really been brought up since, is to pass special instructions to the pass manager to cope with the fact that Rust generates a lot of IR yet also benefits a lot from inlining, potentially much more on both counts than, say, clang. Given that LLVM recently-ish changed their pass manager interface a lot (and by recently I mean multiple versions ago For Good), it might benefit significantly from someone taking a whack at that, especially for -Copt-level=z.
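For instance, inliner tuning flags can already be forwarded to the backend (a sketch; the flag name is LLVM's, not rustc's, and which options exist varies by LLVM version):

```sh
# Hypothetical invocation: raise LLVM's inlining threshold for a size-focused
# build and compare the resulting binary size against the default.
RUSTFLAGS="-C llvm-args=--inline-threshold=275" cargo build --release
```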

the8472 commented on August 19, 2024

Could Cranelift be more easily tweaked for code size? In my naive understanding, the egraph concept seems more amenable to the iterative "inline, optimize, check size, repeat, pick the best result" approach that would be difficult for LLVM.

Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of inline(always) / inline(never) from outside the source tree, in response to observed compiler behavior.

Does PGO help? It may not be designed to help with size, but for iterators it might help coincidentally since they also tend to be hot.
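For reference, the usual host-side PGO flow with rustc looks like the following sketch (paths and the binary name are illustrative); as the replies note, the hard part on bare metal is running the instrumented build and getting the profile back at all:

```sh
# 1. Build with instrumentation and run representative workloads.
RUSTFLAGS="-C profile-generate=/tmp/pgo-data" cargo build --release
./target/release/myapp
# 2. Merge the raw profiles and rebuild using them.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
RUSTFLAGS="-C profile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```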

workingjubilee commented on August 19, 2024

@the8472 Currently, Cranelift is somewhat moot in this case: it does not support 32-bit Arm.

cbiffle commented on August 19, 2024

@the8472 - unfortunately I have no idea how to apply PGO to a bare-metal embedded platform.

kevinmehall commented on August 19, 2024

IIRC Cranelift doesn't do inlining (only MIR inlining) and considers most interprocedural optimizations out of scope. But yes, it seems like egraphs could be much better at this.

Is PGO feasible on Cortex-M? Instrumented build might be too much overhead to even fit on the device, and would need a way to get the profile back to the host. Perhaps something like ARM ETM could capture an execution trace and post-process it into a profile, but it doesn't seem like anyone has implemented such a thing.

bjorn3 commented on August 19, 2024

Could cranelift be more easily tweaked for code size?

Cranelift doesn't support optimizing away loads and stores beyond very limited cases. And doesn't support inlining and other inter-procedural optimizations. As such it is almost certainly going to result in larger binaries than LLVM.

bjorn3 commented on August 19, 2024

For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib-like functionality? You could put the "dylib" at a fixed address to avoid all the complications of a dynamic linker, as long as you make sure the "dylib" doesn't reference the main executable; you only need to ensure that the kernel maps the "dylib" at the expected address.

Implementing this in a way that doesn't require a (relatively) huge dylib covering the entirety of libcore and liballoc, including all unused functions, may be non-trivial though. Maybe you could compile the entire system once, record all symbols actually used from the dylib, then recompile the dylib with a version script making only those symbols public plus --gc-sections to discard the rest, and finally recompile all executables to update the expected locations of all symbols. With -Zdylib-lto you can apply LTO separately to the dylib and the main executable.

All of this is probably a lot more fragile than making LLVM produce smaller executables, though.
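The version-script step described above might look like this sketch (symbol names are purely illustrative):

```text
/* Keep only the symbols the executables were recorded as actually using;
   everything else becomes local, and --gc-sections can then discard it. */
{
  global:
    memcpy;
    memset;
  local:
    *;
};
```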

cbiffle commented on August 19, 2024

For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality?

Bit off topic, but -- we looked into that early on, and the rustc/LLVM support for RWPI (read-write position independence), so that we could share the read-only text and not any read-write data segments, wasn't there yet. (Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.) I haven't seen signs that RWPI is stable so far. We're comfortable with our current approach for now.

bjorn3 commented on August 19, 2024

Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.

Libcore and liballoc only have read-only data afaik, so sharing all data too for the dylib shouldn't be a problem, right? You could add more crates to the dylib if you make sure they don't have any mutable data either.

the8472 commented on August 19, 2024

It's off the topic of my original post but I'm now looking at f32::clamp. This has stopped inlining reliably recently, and that's a problem because it contains a panic that pulls in all of the floating point formatting code.

Maybe those paths could be outlined into #[cold] functions; that'd help with inlining but not with code size. But maybe there's a reason why this isn't already the case. cc @m-ou-se

In the meantime using build-std + panic_immediate_abort might help.
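That workaround is a nightly-only .cargo/config.toml setting (a sketch; it requires the rust-src component, and build-std-features replaces the default std features):

```toml
[unstable]
build-std = ["core", "alloc"]
build-std-features = ["panic_immediate_abort"]
```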

workingjubilee commented on August 19, 2024

It turns out that, annoyingly, (0..array.len()).zip(&array) is significantly more efficient with the current compiler than array.iter().enumerate(), so there are now some places I've snuck that in.

That's hilarious.

Anyway, thanks for noting all this, @cbiffle. One of the things we've noticed is that we apparently don't enable a number of MIR opts with -Copt-level=s or -Copt-level=z that are probably critical for reducing code size. We mostly need to figure out a good testing setup to ensure that enabling them is mostly improvements and not regressions.
