Git Product home page Git Product logo

Comments (10)

jmillikin avatar jmillikin commented on June 23, 2024 1

It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main"  --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?

A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" in a subshell and then check if the resulting wall-clock time measurement is too large.

from rust.

blyxyas avatar blyxyas commented on June 23, 2024

Here are the contents of main.rs:

/*

```
[package]
name = "perf-regression-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = { version = "0.8.5", features = ["alloc", "small_rng"] }

[[bin]]
name = "main"
path = "main.rs"
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release
$ time ./target/release/main
real    0m7.634s
user    0m7.628s
sys     0m0.005s
```

*/

use core::hint::black_box;
use core::ptr;

use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};

const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;

#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	if buf[0] < 0x80 {
		return (buf[0] as u32, 1);
	}
	if buf[0] < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});
		if buf[0] < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if buf[0] < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((buf[0] & 0x0F) + 2) as usize)
}

#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
	for buf in bufs {
		black_box(decode_u32(black_box(buf)));
	}
}

const U32_MASKS: [u32; 5] = [
	u32::MAX >> 25,
	u32::MAX >> 24,
	u32::MAX >> 16,
	u32::MAX >> 8,
	u32::MAX,
];

fn main() {
	let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
		.unwrap()
		.map(|ii| U32_MASKS[ii]);

	let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
	let encoded_values: Vec<[u8; 5]> =
		std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
			.map(|value| {
				let mut buf = [0u8; 5];
				encode_u32(&mut buf, value);
				buf
			})
			.take(RAND_VEC_LEN)
			.collect();

	for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
		benchmark_decode_u32(&encoded_values);
	}
}

fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
	if x & 0b01111111 == x {
		buf[0] = x as u8;
		return 1;
	}
	if x < 0x10000000 {
		if x < 0x00004000 {
			x <<= 2;
			buf[0] = 0b10000000 | ((x as u8) >> 2);
			buf[1] = (x >> 8) as u8;
			return 2;
		}
		if x < 0x00200000 {
			x <<= 3;
			buf[0] = 0b11000000 | ((x as u8) >> 3);
			buf[1] = (x >> 8) as u8;
			buf[2] = (x >> 16) as u8;
			return 3;
		}
		x <<= 4;
		buf[0] = 0b11100000 | ((x as u8) >> 4);
		buf[1] = (x >> 8) as u8;
		buf[2] = (x >> 16) as u8;
		buf[3] = (x >> 24) as u8;
		return 4;
	}
	let bytes = x.to_le_bytes();
	buf[0] = 0b11110011;
	buf[1] = bytes[0];
	buf[2] = bytes[1];
	buf[3] = bytes[2];
	buf[4] = bytes[3];
	5
}

from rust.

workingjubilee avatar workingjubilee commented on June 23, 2024

@jmillikin How does the performance go on 1.78.0, the current stable?

from rust.

jmillikin avatar jmillikin commented on June 23, 2024

v1.78.0 generates the same assembly for {benchmark_,}decode_u32 as v1.77.2: https://godbolt.org/z/a5hcnh9az.

from rust.

workingjubilee avatar workingjubilee commented on June 23, 2024

wow, and that's the version that actually has the LLVM upgrade.

from rust.

jmillikin avatar jmillikin commented on June 23, 2024

I tried using cargo-bisect-rustc and it pointed to 8d76d07, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.

cc @cjgillot who may have some ideas on whether the new code is working as intended or not.

My suspicion is that unification of buf[0] with the buf[0:4] as u32 is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.


searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01
regressed nightly: nightly-2023-12-31
searched commit range: 3cdd004...2a3e635
regressed commit: 8d76d07

bisected with cargo-bisect-rustc v0.6.8

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --start=1.76.0 --end=1.77.0 --script=./test.sh 

from rust.

blyxyas avatar blyxyas commented on June 23, 2024

@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh? 😅

from rust.

jmillikin avatar jmillikin commented on June 23, 2024

Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	let prefix_0 = buf[0];
	if prefix_0 < 0x80 {
		return (prefix_0 as u32, 1);
	}

	if prefix_0 < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});

		let prefix_1 = buf[0];
		// let prefix_1 = prefix_0;
		if prefix_1 < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if prefix_1 < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((prefix_0 & 0x0F) + 2) as usize)
}

This version is compiled to the same assembly as the original in both v1.76.0 and v1.77.2.

If the statement let prefix_1 = prefix_0; is uncommented, then both compiler versions emit the slow assembly.

from rust.

apiraino avatar apiraino commented on June 23, 2024

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

from rust.

saethlin avatar saethlin commented on June 23, 2024

Disabling GVN via -Zmir-enable-passes=-GVN produces the fast version on nightly. The MIR inliner is not relevant in case anyone was wondering that (I was).

from rust.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.