
Comments (8)

workingjubilee avatar workingjubilee commented on June 23, 2024 1

@Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h

from rust.

jhorstmann avatar jhorstmann commented on June 23, 2024 1

It seems the early exit somehow makes LLVM lose track of the equivalence to the vpavgb instruction. Another workaround is therefore to force LLVM to calculate both the Ok and Err versions:

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, __m256i> {
    let shr3 = _mm256_srli_epi32::<3>(x);

    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));

    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);

    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);

    if _mm256_movemask_epi8(c1) != 0 {
        return Err(c2);
    }

    Ok(c2)
}

But I guess this will break down as soon as the function gets inlined if the error value is not otherwise used.


saethlin avatar saethlin commented on June 23, 2024

Blaming rust-lang/stdarch#1477

Did you confirm that this is the responsible change or are you guessing?


workingjubilee avatar workingjubilee commented on June 23, 2024

@Nugine This is definitely more instructions and more bytes on each, so I'm marking it with I-heavy, but it appears this comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture?

I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different.

Also, can you be more precise on what architectures and with what target features you're testing on? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so -Ctarget-cpu=native makes it more likely your benchmarks can be run-to-run and job-to-job inconsistent.
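One way to make benchmark reports more precise is to log which SIMD features the CPU actually exposes at the start of each run. The following is only a sketch: the function name detected_simd_features and the choice of features to probe are assumptions for illustration, not part of base64-simd or of any benchmark harness mentioned in this thread.

```rust
/// Returns the names of a few x86_64 SIMD features detected at runtime.
/// Hypothetical helper; the feature list is an arbitrary illustrative choice.
pub fn detected_simd_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            found.push("ssse3");
        }
        if is_x86_feature_detected!("sse4.1") {
            found.push("sse4.1");
        }
        if is_x86_feature_detected!("avx") {
            found.push("avx");
        }
        if is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
    }
    // On other architectures the list is simply empty.
    found
}
```

Printing this list next to each benchmark result makes it easier to tell whether two CI jobs ran on comparable hardware.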


Nugine avatar Nugine commented on June 23, 2024

Base64 decoding in base64-simd has been slower than radix64 since Rust 1.75.0. By comparing the asm generated by 1.74.1 and 1.75.0, I found that one of the vpavgb instructions is missing. LLVM doesn't emit vpavgb for one of the _mm256_avg_epu8 calls; it emits a longer sequence of equivalent instructions instead.

rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM.
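For context, vpavgb computes a rounded average of unsigned bytes, (a + b + 1) >> 1, in each lane without overflowing. The scalar sketch below shows those semantics using a widening add. It is an illustration only, with hypothetical function names, and is not copied from stdarch; it shows the kind of multi-step pattern a compiler has to recognize before it can fold the computation back into a single instruction.

```rust
/// Rounded average of two bytes, matching the per-lane result of vpavgb.
/// The operands are widened to u16 so that a + b + 1 cannot overflow.
fn avg_round_u8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

/// Applies the rounded average across 32 byte lanes, the width of one AVX2 register.
fn avg_round_lanes(a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = avg_round_u8(a[i], b[i]);
    }
    out
}
```

If the pattern match fails, the widen, add, shift, and narrow steps are emitted separately, which would be consistent with the longer instruction sequence described above.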

To see the asm, you can use the following commands.

git clone https://github.com/Nugine/simd.git
cd simd
rustup override set 1.74.1 # or 1.75.0
RUSTFLAGS="--cfg vsimd_dump_symbols" cargo asm -p base64-simd --lib --simplify --target x86_64-unknown-linux-gnu  --context 1 -- base64_simd::multiversion::decode::avx2 > base64-decode-avx2.asm
cat base64-decode-avx2.asm

Target: x86_64-unknown-linux-gnu
Instruction: AVX2

I have extracted the decode function and reproduced the regression. https://rust.godbolt.org/z/KG4cT6aPK
I'm looking for:

  • a stable workaround method to generate vpavgb
  • why the optimization is missing


Nugine avatar Nugine commented on June 23, 2024

@Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h

Cool! I'll try asm wrapper.
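The linked decode_asm function is not reproduced here. As a rough idea of what such a wrapper might look like, the sketch below forces a single vpavgb through inline asm. The names avg_epu8_asm, avg_bytes_avx2 and avg_bytes are hypothetical, and the actual code behind the godbolt link may differ.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::{asm, x86_64::*};

/// Hypothetical wrapper: emits exactly one vpavgb instead of relying on
/// LLVM to pattern-match the generic form of _mm256_avg_epu8.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avg_epu8_asm(a: __m256i, b: __m256i) -> __m256i {
    let out: __m256i;
    unsafe {
        asm!(
            "vpavgb {out}, {a}, {b}",
            out = lateout(ymm_reg) out,
            a = in(ymm_reg) a,
            b = in(ymm_reg) b,
            // No memory access, no stack use, no flags touched, no side effects.
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    out
}

/// Loads two 32-byte arrays, averages them with the asm wrapper, stores the result.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avg_bytes_avx2(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let avg = avg_epu8_asm(va, vb);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, avg);
    }
    out
}

/// Safe entry point: returns None when AVX2 is not available at runtime.
pub fn avg_bytes(a: [u8; 32], b: [u8; 32]) -> Option<[u8; 32]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return Some(unsafe { avg_bytes_avx2(&a, &b) });
        }
    }
    None
}
```

The trade-off is that an asm block is opaque to LLVM: it will not be constant-folded or combined with neighbouring operations, so a wrapper like this is best limited to the one intrinsic that regressed.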


workingjubilee avatar workingjubilee commented on June 23, 2024

Based on jhorstmann's remark, it would be nicest to fix this in LLVM: LLVM appears to have the information necessary to do this optimization, it is just missing it in the early-return case. I don't think partially reverting a diff is unwarranted, however.


apiraino avatar apiraino commented on June 23, 2024

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

