Comments (10)
It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.
#!/bin/sh
set -eux
RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?
A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main"
in a subshell and then check if the resulting wall-clock time measurement is too large.
from rust.
Here are the contents of main.rs
:
/*
```
[package]
name = "perf-regression-debug"
version = "0.1.0"
edition = "2021"
[dependencies]
rand = { version = "0.8.5", features = ["alloc", "small_rng"] }
[[bin]]
name = "main"
path = "main.rs"
```
```
$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real 0m4.182s
user 0m4.181s
sys 0m0.000s
```
```
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real 0m7.699s
user 0m7.694s
sys 0m0.004s
```
```
$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release
$ time ./target/release/main
real 0m7.634s
user 0m7.628s
sys 0m0.005s
```
*/
use core::hint::black_box;
use core::ptr;
use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};
const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;
#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
if buf[0] < 0x80 {
return (buf[0] as u32, 1);
}
if buf[0] < 0xF0 {
let x = u32::from_le(unsafe {
ptr::from_ref(buf).cast::<u32>().read_unaligned()
});
if buf[0] < 0b11000000 {
let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
return (decoded, 2);
}
if buf[0] < 0b11100000 {
let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
return (decoded, 3);
}
let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
return (decoded, 4);
}
let decoded = u32::from_le(unsafe {
ptr::from_ref(buf)
.cast::<u8>()
.add(1)
.cast::<u32>()
.read_unaligned()
});
(decoded, ((buf[0] & 0x0F) + 2) as usize)
}
#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
for buf in bufs {
black_box(decode_u32(black_box(buf)));
}
}
const U32_MASKS: [u32; 5] = [
u32::MAX >> 25,
u32::MAX >> 24,
u32::MAX >> 16,
u32::MAX >> 8,
u32::MAX,
];
fn main() {
let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
.unwrap()
.map(|ii| U32_MASKS[ii]);
let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
let encoded_values: Vec<[u8; 5]> =
std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
.map(|value| {
let mut buf = [0u8; 5];
encode_u32(&mut buf, value);
buf
})
.take(RAND_VEC_LEN)
.collect();
for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
benchmark_decode_u32(&encoded_values);
}
}
fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
if x & 0b01111111 == x {
buf[0] = x as u8;
return 1;
}
if x < 0x10000000 {
if x < 0x00004000 {
x <<= 2;
buf[0] = 0b10000000 | ((x as u8) >> 2);
buf[1] = (x >> 8) as u8;
return 2;
}
if x < 0x00200000 {
x <<= 3;
buf[0] = 0b11000000 | ((x as u8) >> 3);
buf[1] = (x >> 8) as u8;
buf[2] = (x >> 16) as u8;
return 3;
}
x <<= 4;
buf[0] = 0b11100000 | ((x as u8) >> 4);
buf[1] = (x >> 8) as u8;
buf[2] = (x >> 16) as u8;
buf[3] = (x >> 24) as u8;
return 4;
}
let bytes = x.to_le_bytes();
buf[0] = 0b11110011;
buf[1] = bytes[0];
buf[2] = bytes[1];
buf[3] = bytes[2];
buf[4] = bytes[3];
5
}
from rust.
@jmillikin How does the performance go on 1.78.0, the current stable?
from rust.
v1.78.0 generates the same assembly for {benchmark_,}decode_u32
as v1.77.2: https://godbolt.org/z/a5hcnh9az.
from rust.
wow, and that's the version that actually has the LLVM upgrade.
from rust.
I tried using cargo-bisect-rustc
and it pointed to 8d76d07, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.
cc @cjgillot who may have some ideas on whether the new code is working as intended or not.
My suspicion is that unification of buf[0]
with the buf[0:4] as u32
is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.
searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01
regressed nightly: nightly-2023-12-31
searched commit range: 3cdd004...2a3e635
regressed commit: 8d76d07
bisected with cargo-bisect-rustc v0.6.8
Host triple: x86_64-unknown-linux-gnu
Reproduce with:
cargo bisect-rustc --start=1.76.0 --end=1.77.0 --script=./test.sh
from rust.
@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh
? 😅
from rust.
Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
let prefix_0 = buf[0];
if prefix_0 < 0x80 {
return (prefix_0 as u32, 1);
}
if prefix_0 < 0xF0 {
let x = u32::from_le(unsafe {
ptr::from_ref(buf).cast::<u32>().read_unaligned()
});
let prefix_1 = buf[0];
// let prefix_1 = prefix_0;
if prefix_1 < 0b11000000 {
let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
return (decoded, 2);
}
if prefix_1 < 0b11100000 {
let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
return (decoded, 3);
}
let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
return (decoded, 4);
}
let decoded = u32::from_le(unsafe {
ptr::from_ref(buf)
.cast::<u8>()
.add(1)
.cast::<u32>()
.read_unaligned()
});
(decoded, ((prefix_0 & 0x0F) + 2) as usize)
}
This version is compiled to the same assembly as the original in both v1.76.0 and v1.77.2.
If the statement let prefix_1 = prefix_0;
is uncommented, then both compiler versions emit the slow assembly.
from rust.
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
from rust.
Disabling GVN via -Zmir-enable-passes=-GVN
produces the fast version on nightly. The MIR inliner is not relevant in case anyone was wondering that (I was).
from rust.
Related Issues (20)
- ICE: wfchk: type parameter .. out of range when instantiating, args=`[{const error}]` HOT 1
- ICE when compiling HOT 5
- Bad dependency on a windows DLL in rust 1.78 HOT 27
- Next-gen solver: Compiler "hang" & eventual OOM when facing complex mutually recursive structs & many projections HOT 6
- std::os::fd module missing on Hermit HOT 2
- rustc segfaults when loading mimalloc with LD_PRELOAD HOT 15
- ICE: kcfi: encode_const: unexpected kind `UnevaluatedConst { .. }`
- Japanese translation of Rust by Example is broken
- Reference to temporary variable passed as `&'static dyn` parameter compiles HOT 2
- What should SIMD bitmasks look like? HOT 6
- clarify local_crate_exports_generics
- Miscompile for `riscv64gc-unknown-none-elf` using `inout()` in `asm!()`
- "you might have meant to return this value" suggests invalid syntax HOT 1
- tidy error: rustc-perf workspace doesn't have a Cargo.lock
- inherent method selection uses subtyping even when using paths
- "./x.py run miri --stage 1" no longer works HOT 5
- Bootstrap rustc wrapper verbosity not tracked
- Mutex isn't guaranteed to have atomic acquire/release semantics HOT 13
- `unnameable_types` lint emits incorrect visibility syntax HOT 1
- Suggest adding `{ .. }` on accidental usage of unit variant pattern for a struct variant HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rust.