Comments (64)

tarcieri avatar tarcieri commented on June 4, 2024 5

All right, squeaky wheel gets the grease. The chacha20 crate was previously running at ~3.5cpb on my laptop with the SSE2 backend.

I rewrote the buffering logic and added a new AVX2 backend which can compute two ChaCha20 blocks in parallel. I've got it down to ~1.4cpb now:

(Screenshot: benchmark results, 2020-01-16.)

Will double check I didn't break anything and cut a new release soon, then bump the chacha20poly1305 crate.

from rage.

tarcieri avatar tarcieri commented on June 4, 2024 3

Note: I plan on adding core::arch and packed_simd optimizations to the chacha20 and poly1305 crates soon

str4d avatar str4d commented on June 4, 2024 3

Yeah, most of the performance difference is that rage's cryptographic dependencies are essentially pure-Rust at this point, while age is using the Go standard library, which includes assembly for basically everything. See also #38, which was necessary because Go's scrypt is around 64x faster due to having SHA-2 assembly.

The other place to look at for optimisation is my implementation of STREAM. Currently encryption of each chunk involves an allocation because I am not using the Aead::encrypt_in_place API. We could instead allocate a ciphertext-sized buffer inside StreamWriter and then track how much plaintext we are writing into it.
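The allocation pattern being described can be sketched as follows. This is a toy stand-in (the cipher here is a plain XOR, not the real aead crate's AeadInPlace::encrypt_in_place, and StreamWriter is stripped down to just the buffer); the point is only the shape: one long-lived buffer inside the writer instead of a fresh Vec per chunk.

```rust
// Sketch of the buffer-reuse idea: keep one ciphertext-sized buffer inside
// the writer and encrypt each chunk in place, instead of allocating a new
// Vec for every chunk.
const CHUNK_SIZE: usize = 64 * 1024;

struct StreamWriter {
    buffer: Vec<u8>, // reused for every chunk; no per-chunk allocation
}

impl StreamWriter {
    fn new() -> Self {
        StreamWriter { buffer: Vec::with_capacity(CHUNK_SIZE) }
    }

    // Toy stand-in for AEAD in-place encryption (real code would call the
    // aead crate's encrypt_in_place on this buffer).
    fn encrypt_in_place(buf: &mut [u8], key: u8) {
        for b in buf.iter_mut() {
            *b ^= key;
        }
    }

    // Copy the plaintext into the reused buffer and encrypt it in place.
    fn write_chunk(&mut self, plaintext: &[u8], key: u8) -> &[u8] {
        self.buffer.clear(); // keeps the allocation, drops the length
        self.buffer.extend_from_slice(plaintext);
        Self::encrypt_in_place(&mut self.buffer, key);
        &self.buffer
    }
}

fn main() {
    let mut w = StreamWriter::new();
    let ct = w.write_chunk(b"hello", 0xAA).to_vec();
    let mut rt = ct.clone();
    StreamWriter::encrypt_in_place(&mut rt, 0xAA); // toy XOR round-trips
    assert_eq!(rt, b"hello");
}
```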

tarcieri avatar tarcieri commented on June 4, 2024 3

chacha20poly1305 v0.3.1 is out with the AVX2 backend, so all you should need to do is cargo update and then build with the following $RUSTFLAGS:

RUSTFLAGS="-Ctarget-feature=+avx2"

Here's the benchmarked improvement on the full AEAD construction:

(Screenshot: benchmark results, 2020-01-16.)

...so encryption is ~60% faster, and decryption is unchanged.

Note that there's still some low-hanging fruit, like a SIMD implementation of Poly1305, and pipelining the execution of ChaCha20 and Poly1305 so they can run in parallel.

str4d avatar str4d commented on June 4, 2024 3

I hadn't focused on parallelizing STREAM yet because age doesn't yet either, and I want to eliminate as much of the delta as possible before relying on threads. That being said, I've put an initial strategy in #148 (cache logical_cpus chunks, then use rayon to process them in parallel), which gives the following benchmarks:

$ cargo clean
$ cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
5.69user 3.03system 0:01.86elapsed 467%CPU (0avgtext+0avgdata 6268maxresident)k
0inputs+0outputs (0major+945minor)pagefaults 0swaps
$ cargo clean
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
4.47user 2.84system 0:01.69elapsed 432%CPU (0avgtext+0avgdata 6236maxresident)k
0inputs+0outputs (0major+938minor)pagefaults 0swaps

By those numbers, rage is 1.28x slower than age using 4.67x more CPU, and rage compiled with AVX2 is 1.17x slower than age using 4.32x more CPU.
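The strategy in #148 works because each STREAM chunk is sealed under an independent nonce, so chunks can be processed in any order. A std-only sketch of that shape, with a toy per-chunk cipher standing in for ChaCha20-Poly1305 and scoped threads standing in for rayon:

```rust
use std::thread;

const CHUNK_SIZE: usize = 4; // tiny for illustration; rage uses 64 KiB chunks

// Toy stand-in for sealing one STREAM chunk under its counter-derived nonce.
fn encrypt_chunk(counter: u64, chunk: &[u8]) -> Vec<u8> {
    chunk.iter().map(|b| b ^ (counter as u8)).collect()
}

// Encrypt all chunks in parallel; results land in order because each
// thread writes into its own pre-allocated output slot.
fn encrypt_parallel(plaintext: &[u8]) -> Vec<Vec<u8>> {
    let chunks: Vec<&[u8]> = plaintext.chunks(CHUNK_SIZE).collect();
    let mut out = vec![Vec::new(); chunks.len()];
    thread::scope(|s| {
        for (counter, (chunk, slot)) in chunks.iter().zip(out.iter_mut()).enumerate() {
            s.spawn(move || *slot = encrypt_chunk(counter as u64, chunk));
        }
    });
    out
}

fn main() {
    let pt = b"abcdefgh";
    // The parallel result must match the sequential one chunk-for-chunk.
    let sequential: Vec<Vec<u8>> = pt.chunks(CHUNK_SIZE).enumerate()
        .map(|(i, c)| encrypt_chunk(i as u64, c))
        .collect();
    assert_eq!(encrypt_parallel(pt), sequential);
}
```

Spawning one thread per chunk is purely illustrative; #148 instead caps the batch at logical_cpus chunks and hands them to rayon.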

tarcieri avatar tarcieri commented on June 4, 2024 3

FWIW the chacha20 crate has runtime AVX2 detection implemented in the unreleased v0.7.0 version, although rage presently uses c2-chacha instead because it provides marginally better performance when the +avx2 target feature is enabled.

I'm also looking at implementing some end-to-end SIMD buffering in chacha20 which might erase that performance difference when used as a combined chacha20poly1305 AEAD construction.

All that said, one of the big goals of the next release of the RustCrypto crates is runtime detection so target feature customization is no longer required, although that might come at a small performance hit until we can work through all of the impacts that has on e.g. inlining and other optimizations.

str4d avatar str4d commented on June 4, 2024 2

I've used pprof to generate a flame graph for rage running as part of the above command (without the explicit AVX2 flag):

(Flame graph: 2020-03-29-rage-flamegraph.)

Reading the 2 GiB input from /dev/zero is around 17.6% of the execution time, and 23.9% is time inside the c2-chacha crate.

The largest time sink is clearly the poly1305 crate, which does not yet have an AVX2 implementation and is 53.1% of overall execution. I'm going to work on RustCrypto/universal-hashes#46 this weekend to try and address this.

str4d avatar str4d commented on June 4, 2024 2

I've managed to speed up poly1305 by refactoring it 😄

Same age binary as last time (dunno why my laptop is feeling faster today):

$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.15user 0.54system 0:01.69elapsed 100%CPU (0avgtext+0avgdata 10264maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps

Current master of rage + current master of poly1305 (equivalent to the published crate):

$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.37user 0.76system 0:04.13elapsed 99%CPU (0avgtext+0avgdata 41656maxresident)k
0inputs+56outputs (0major+8804minor)pagefaults 0swaps

Current master of rage + RustCrypto/universal-hashes#48:

$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.95user 0.63system 0:03.58elapsed 100%CPU (0avgtext+0avgdata 41636maxresident)k
0inputs+56outputs (0major+8802minor)pagefaults 0swaps

Flame graph (highlighted sections are the poly1305 crate, taking up 41.7% of execution time):

(Flame graph: 2020-03-29-rage-flamegraph-poly1305-refactored.)

str4d avatar str4d commented on June 4, 2024 2

It's true that you're unlikely to be armoring 2 GiB of data, but it's not outside the intended use case. Armoring was specifically added to the spec to handle CRLF platforms (because the binary spec is canonical LF and would be broken by platforms that translate LF to CRLF).

Also, let me take my wins where I can 😅

str4d avatar str4d commented on June 4, 2024 1

I ran a brief test on my laptop, of the form:

time head -c 2147483648 /dev/urandom | cargo run --release -- -r age1somerecipient >/dev/null

Switching to Aead::encrypt_in_place (instead of letting it allocate a new ~64 KiB Vec for every chunk) does not speed up encryption at all (before and after both take around 15.5 seconds to encrypt 2 GiB on my laptop). I'll hunt for other possible hotspots in my code, but I expect that the necessary performance work is on the upstream crates.

str4d avatar str4d commented on June 4, 2024 1

Current master of each (measured on my laptop - Thinkpad P1 with Xeon E-2176M):

$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.98user 0.88system 0:01.85elapsed 100%CPU (0avgtext+0avgdata 8276maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.28user 1.01system 0:04.29elapsed 99%CPU (0avgtext+0avgdata 4180maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.34user 0.89system 0:04.23elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
0inputs+0outputs (0major+188minor)pagefaults 0swaps

rage is 2.32x slower than age, and rage compiled with AVX2 is 2.29x slower than age. These are basically the same now due to c2-chacha, but there's some small overhead that is improved with explicit AVX2 compilation. Not enough for me to worry about though.

str4d avatar str4d commented on June 4, 2024 1

(Note that the flame graphs are probabilistic; running the test repeatedly, I see poly1305 taking anywhere from 41.7% up to 49% of execution time.)

str4d avatar str4d commented on June 4, 2024 1

Re-ran the numbers on my desktop now that we've finally pulled in the poly1305 performance improvements:

FiloSottile/age@31500bf compiled with Go 1.13 (aka what my CI system generates for interoperability testing):

$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.79user 0.67system 0:01.45elapsed 100%CPU (0avgtext+0avgdata 8288maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps

rage 70cbf9a compiled with Rust 1.45.0 (the MSRV):

$ cargo clean
$ cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.34user 0.64system 0:02.99elapsed 99%CPU (0avgtext+0avgdata 4992maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swap
$ cargo clean
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.43user 0.67system 0:02.11elapsed 99%CPU (0avgtext+0avgdata 5348maxresident)k
0inputs+0outputs (0major+278minor)pagefaults 0swaps

By those numbers, rage is 2.06x slower than age, and rage compiled with AVX2 is 1.46x slower than age.

tarcieri avatar tarcieri commented on June 4, 2024 1

FWIW, I have some ideas about improving ILP/SIMD parallelism I'll be trying to prototype soon: RustCrypto/traits#444

str4d avatar str4d commented on June 4, 2024 1

I dug in further, and I think RustCrypto/stream-ciphers#262 would close almost all of the remaining gap between chacha20 and c2-chacha. After that, I suspect the remaining performance lead Go has is likely due to us not having one-pass encryption/decryption (RustCrypto/AEADs#74).

tarcieri avatar tarcieri commented on June 4, 2024 1

The prospective v0.5 PR for the universal-hash crate adds a multi-block input API:

RustCrypto/traits#965

That should make it possible to interleave encryption+authentication / authentication+decryption passes at the granularity of blocks that the backend SIMD implementations operate over.
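A toy illustration of why a multi-block MAC input API enables this interleaving. The primitives here are stand-ins (an XOR keystream and a wrapping byte-sum "MAC", not ChaCha20 or Poly1305): once the MAC can absorb input block-by-block, encrypt-then-authenticate collapses into a single pass that touches each block exactly once, producing the same ciphertext and tag as the two-pass version.

```rust
const BLOCK: usize = 64;

// Toy keystream: one byte per block, derived from the block index.
fn keystream_block(i: usize) -> u8 {
    (i as u8).wrapping_mul(31).wrapping_add(7)
}

// Toy incremental "MAC": a wrapping sum of the bytes absorbed so far.
fn mac_update(state: &mut u64, data: &[u8]) {
    for &b in data {
        *state = state.wrapping_add(b as u64);
    }
}

// Two passes: encrypt everything, then MAC the whole ciphertext.
fn two_pass(pt: &[u8]) -> (Vec<u8>, u64) {
    let ct: Vec<u8> = pt.chunks(BLOCK).enumerate()
        .flat_map(|(i, blk)| blk.iter().map(move |b| b ^ keystream_block(i)))
        .collect();
    let mut tag = 0u64;
    mac_update(&mut tag, &ct);
    (ct, tag)
}

// One pass: encrypt a block, then immediately absorb it into the MAC.
fn one_pass(pt: &[u8]) -> (Vec<u8>, u64) {
    let mut ct = Vec::with_capacity(pt.len());
    let mut tag = 0u64;
    for (i, blk) in pt.chunks(BLOCK).enumerate() {
        let start = ct.len();
        ct.extend(blk.iter().map(|b| b ^ keystream_block(i)));
        mac_update(&mut tag, &ct[start..]);
    }
    (ct, tag)
}

fn main() {
    let pt: Vec<u8> = (0..200u8).collect();
    // Same ciphertext and tag either way; the one-pass version just
    // touches each block once.
    assert_eq!(two_pass(&pt), one_pass(&pt));
}
```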

str4d avatar str4d commented on June 4, 2024 1

Re-ran the benchmarks on my Ryzen 9 5950X against #303, and that PR with RustCrypto/AEADs#415 applied.

Pre-release

  • ac72edc compiled with Rust 1.56.0

Command        Time (s)   Relative
age            0.75       1
rage           1.91       2.54
rage-avx2      1.90       2.53
age -a         2.93       1
rage -a        3.43       1.17
rage-avx2 -a   3.45       1.18

Pre-release plus universal-hash 0.5

Command        Time (s)   Relative
age            0.80       1
rage           1.55       1.94
rage-avx2      1.47       1.84
age -a         2.83       1
rage -a        3.03       1.07
rage-avx2 -a   3.09       1.09

The new traits in universal-hash 0.5 enable a significant speed-up, I suspect because we now only select the backend at runtime at the level of an entire message rather than on every block. But we still trail behind without one-pass encryption, and trying to implement that makes things significantly slower (and will likely do so until we figure out a way to lift the runtime checks to the AEAD level).

paulmillr avatar paulmillr commented on June 4, 2024

Yeah, STREAM seems to be the slowest part here.

Multicore optimizations should also speed things up massively. See the gist for a tiny & very performant example of Rust threads.

str4d avatar str4d commented on June 4, 2024

Oh heh, looks like my benchmarks were being limited by the speed of /dev/urandom - I tested age and measured the same 15.5 seconds. Switching to /dev/zero I get:

$ time head -c 2147483648 /dev/zero | go run ./cmd/age -r age1somerecipient >/dev/null

real	0m2.536s
user	0m2.596s
sys	0m1.242s
$ time head -c 2147483648 /dev/zero | cargo run --release -- -r age1somerecipient >/dev/null 
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/rage -r age1somerecipient`

real	0m9.313s
user	0m9.404s
sys	0m1.456s
$ # Apply patch
$ time head -c 2147483648 /dev/zero | cargo run --release -- -r age1somerecipient >/dev/null 
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/rage -r age1somerecipient`

real	0m9.200s
user	0m9.230s
sys	0m1.511s

Still no difference using Aead::encrypt_in_place (the minor delta between unpatched and patched was within the system noise), but I see rage being around 4x slower than age.

str4d avatar str4d commented on June 4, 2024

I adapted the chacha20 benchmark to rage, and get between 9.8 and 10.3 cycles per byte on current master.

Before vs after cargo update:

stream/encrypt/131072   time:   [1328643.4735 cycles 1362445.9702 cycles 1406940.9323 cycles]    
                        thrpt:  [10.7341 cpb 10.3946 cpb 10.1367 cpb]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
---
stream/encrypt/131072   time:   [1113173.3014 cycles 1131564.0177 cycles 1152718.4918 cycles]    
                        thrpt:  [8.7945 cpb 8.6331 cpb 8.4928 cpb]
                 change:
                        time:   [-19.391% -17.546% -15.723%] (p = 0.00 < 0.05)
                        thrpt:  [+18.656% +21.279% +24.055%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild
  5 (5.00%) high severe

Before cargo update vs after with RUSTFLAGS="-Ctarget-feature=+avx2":

stream/encrypt/131072   time:   [1281326.5182 cycles 1287772.9972 cycles 1296258.5135 cycles]    
                        thrpt:  [9.8897 cpb 9.8249 cpb 9.7757 cpb]
---
stream/encrypt/131072   time:   [739731.5694 cycles 743151.4588 cycles 747047.6420 cycles]       
                        thrpt:  [5.6995 cpb 5.6698 cpb 5.6437 cpb]
                 change:
                        time:   [-42.466% -42.006% -41.531%] (p = 0.00 < 0.05)
                        thrpt:  [+71.030% +72.433% +73.811%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

str4d avatar str4d commented on June 4, 2024

I've opened #58 with the benchmark and the dependency update.

str4d avatar str4d commented on June 4, 2024

More measurements of the improvement on my desktop (i7-8700K CPU @ 3.70GHz).

Before cargo update (e78c6a2) vs after cargo update (eee96f4):

Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 6.6372 s (20k iter
stream/encrypt/131072   time:   [1212026.6085 cycles 1214296.5898 cycles 1217916.5950 cycles]
                        thrpt:  [9.2920 cpb 9.2643 cpb 9.2470 cpb]
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.4919 s (20k iter
stream/encrypt/131072   time:   [1002757.7861 cycles 1002970.1976 cycles 1003166.3307 cycles]
                        thrpt:  [7.6536 cpb 7.6521 cpb 7.6504 cpb]
                 change:
                        time:   [-17.660% -17.306% -16.972%] (p = 0.00 < 0.05)
                        thrpt:  [+20.441% +20.928% +21.448%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild

Before cargo update vs after cargo update with RUSTFLAGS="-Ctarget-feature=+avx2":

Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 6.6394 s (20k iter
stream/encrypt/131072   time:   [1212129.4345 cycles 1212365.0293 cycles 1212570.7408 cycles]
                        thrpt:  [9.2512 cpb 9.2496 cpb 9.2478 cpb]
                 change:
                        time:   [-0.7644% -0.3612% +0.0306%] (p = 0.05 > 0.05)
                        thrpt:  [-0.0306% +0.3625% +0.7702%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low severe
  5 (5.00%) low mild
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.7746 s (30k iter
stream/encrypt/131072   time:   [702772.3597 cycles 702891.9797 cycles 703037.8149 cycles]
                        thrpt:  [5.3638 cpb 5.3626 cpb 5.3617 cpb]
                 change:
                        time:   [-42.097% -41.931% -41.730%] (p = 0.00 < 0.05)
                        thrpt:  [+71.615% +72.209% +72.703%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild

And current master without vs with AVX2:

Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.4703 s (20k iter
stream/encrypt/131072   time:   [998447.7639 cycles 998734.2589 cycles 999051.0488 cycles]
                        thrpt:  [7.6222 cpb 7.6197 cpb 7.6176 cpb]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) low severe
  3 (3.00%) low mild
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.7943 s (30k iter
stream/encrypt/131072   time:   [705379.2776 cycles 705501.0598 cycles 705635.0365 cycles]
                        thrpt:  [5.3836 cpb 5.3825 cpb 5.3816 cpb]
                 change:
                        time:   [-29.475% -29.247% -28.993%] (p = 0.00 < 0.05)
                        thrpt:  [+40.830% +41.337% +41.795%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild

str4d avatar str4d commented on June 4, 2024

And age vs rage (current master of each) on my desktop:

$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.96user 0.48system 0:01.45elapsed 99%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+763minor)pagefaults 0swaps

$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.09user 0.39system 0:01.47elapsed 100%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+763minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
4.57user 0.62system 0:05.19elapsed 100%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+503minor)pagefaults 0swaps

$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
4.50user 0.68system 0:05.22elapsed 99%CPU (0avgtext+0avgdata 1856maxresident)k
0inputs+0outputs (0major+505minor)pagefaults 0swaps
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release

$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.12user 0.75system 0:03.88elapsed 99%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+504minor)pagefaults 0swaps

$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.98user 0.89system 0:03.88elapsed 99%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+504minor)pagefaults 0swaps

So the current status is that rage is 3.57x slower than age, and rage compiled with AVX2 is 2.66x slower than age.

paulmillr avatar paulmillr commented on June 4, 2024

It would be great to understand what exactly slows us down at this point. I'm not sure what the best way to profile traces in Rust is.

paulmillr avatar paulmillr commented on June 4, 2024

rage compiled with AVX2 is 2.66x slower than age.

Is there any reason not to compile with AVX2? I think almost every x86 CPU nowadays supports it?

tarcieri avatar tarcieri commented on June 4, 2024

The three main things are:

  1. Poly1305 implementation isn't SIMD. See all of the discussion here about that
  2. chacha20poly1305 crate is 2-pass instead of 1-pass. I can open an issue for that if anyone wants to try to convert it to 1-pass as it should be fairly easy (edit: opened RustCrypto/AEADs#74)
  3. chacha20 crate is still slower than it could be even with AVX2. See benchmarking versus c2-chacha crate here (c2-chacha is ~45% faster)

Re: the third item, the c2-chacha crate implements the stream-cipher API. With a small API change to the chacha20poly1305 crate I could make the underlying ChaCha implementation generic, so you could swap in its implementation.

str4d avatar str4d commented on June 4, 2024

Is there any reason not to compile with AVX2? I think almost every x86 CPU nowadays supports it?

Nope! Looking at the December 2019 Steam hardware survey, 77.05% of the surveyed Windows machines (which made up 96.86% of the survey, so I'm not looking at the macOS or Linux figures) support AVX2. Given that gamers tend towards newer hardware, this is most likely an upper bound on support (by how much, IDK). See also this Rust discussion thread.

paulmillr avatar paulmillr commented on June 4, 2024

@tarcieri I thought a nonce-misuse-resistant construction can't be 1-pass? Specifically, SIV. Am I wrong?

tarcieri avatar tarcieri commented on June 4, 2024

@paulmillr that's true (for encryption; decryption in a SIV mode can still be 1-pass), but we're talking about ChaCha20Poly1305 here...

tarcieri avatar tarcieri commented on June 4, 2024

If anyone would like to try wiring it up, chacha20poly1305 v0.4 now has a generic ChaChaPoly1305 type which should theoretically be usable with the ChaCha20 implementation in the c2-chacha crate.

Benchmarks showed its AVX2 backend was about 40% faster than the chacha20 crate. I've been meaning to investigate why and see if there's something suboptimal in the chacha20 crate (whose implementation is significantly simpler than what's in c2-chacha + ppv-lite86)

str4d avatar str4d commented on June 4, 2024

Ooh, thanks! I'll try that today 😃

tarcieri avatar tarcieri commented on June 4, 2024

Also note that the chacha20 dependency in chacha20poly1305 is now optional if c2-chacha ends up working out.

paulmillr avatar paulmillr commented on June 4, 2024

What about parallelism / multicore usage? Anything we could do here?

tarcieri avatar tarcieri commented on June 4, 2024

STREAM is "embarrassingly parallel", so pick any parallelization strategy you want.

paulmillr avatar paulmillr commented on June 4, 2024

rust go brrrrrr

tarcieri avatar tarcieri commented on June 4, 2024

@str4d a few options for additional improvements:

paulmillr avatar paulmillr commented on June 4, 2024

I assume STREAM is still not parallel? I'd focus on this instead of on low-level dangerous asm code.

tarcieri avatar tarcieri commented on June 4, 2024

Rogaway's STREAM is trivially parallelized and seekable.

By comparison, the CHAIN construction in the same paper is the one which is sequential-by-design.

paulmillr avatar paulmillr commented on June 4, 2024

@tarcieri I understand it's parallelizable. But our Rust implementation of it isn't, for now?

tarcieri avatar tarcieri commented on June 4, 2024

If so, that's a deficiency in the STREAM implementation. The main benefit of STREAM over CHAIN is its parallelizability.

paulmillr avatar paulmillr commented on June 4, 2024

Quick note: I tried using age to encrypt a big file -- over 100 GB -- and it seems much slower than those benchmark numbers. It takes something like 1 minute to encrypt 5 GB.

Haven't tried rage though.

paulmillr avatar paulmillr commented on June 4, 2024

I don't understand, though, why non-parallel rage with AVX2 is 1.46x slower than age while parallel rage is only 1.28x/1.17x slower. Only a ~20% improvement? Where's the 600% multi-core boost?

str4d avatar str4d commented on June 4, 2024

Merged some performance improvements to armoring (among other things), so re-running the benchmarks.

For future updates I'll be using this script for simplicity:

#!/usr/bin/env bash

# Place age binary to compare against in here.
BINARIES=./tmp
BUILD=1

function run {
    binary=$@
    echo "$binary"
    echo "==="
    head -c 2147483648 /dev/zero | time $BINARIES/$binary -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
    echo
}

# Prepare binaries
if [[ $BUILD -ne 0 ]]; then
    cargo clean
    cargo build --release
    cp target/release/rage $BINARIES/rage
    cargo clean
    RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
    cp target/release/rage $BINARIES/rage-avx2
fi

# Run tests
run age
run rage
run rage-avx2
run age -a
run rage -a
run rage-avx2 -a

Configuration:

  • Intel Core i7-8700K
  • FiloSottile/age@902a3d4 compiled with Go 1.15
  • 7fb88a1 compiled with Rust 1.45.0 (the MSRV)
    • I checked with latest stable (1.49.0), and saw no appreciable difference.

age
===
0.92user 0.62system 0:01.56elapsed 99%CPU (0avgtext+0avgdata 12308maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

rage
===
2.66user 0.44system 0:03.12elapsed 99%CPU (0avgtext+0avgdata 4844maxresident)k
0inputs+0outputs (0major+305minor)pagefaults 0swaps

rage-avx2
===
1.60user 0.53system 0:02.15elapsed 99%CPU (0avgtext+0avgdata 4892maxresident)k
0inputs+0outputs (0major+304minor)pagefaults 0swaps

age -a
===
4.23user 0.98system 0:05.19elapsed 100%CPU (0avgtext+0avgdata 10276maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps

rage -a
===
5.04user 0.49system 0:05.54elapsed 99%CPU (0avgtext+0avgdata 5256maxresident)k
0inputs+0outputs (0major+450minor)pagefaults 0swaps

rage-avx2 -a
===
3.41user 0.58system 0:03.99elapsed 99%CPU (0avgtext+0avgdata 5388maxresident)k
0inputs+0outputs (0major+451minor)pagefaults 0swaps

  • rage is 100% (2x) slower than age.
  • rage compiled with AVX2 is 38% slower than age.
  • rage -a is 7% slower than age -a.
  • rage -a compiled with AVX2 is 23% faster than age -a.

Finally we're getting somewhere! 🚀

str4d avatar str4d commented on June 4, 2024

And now that I've managed to update all the dependencies (#187, #186), and we have poly1305 0.6.2 with runtime AVX2 detection (RustCrypto/universal-hashes#97, thanks @tarcieri!), let's run the benchmarks again!

Configuration:

age
===
1.08user 0.50system 0:01.59elapsed 100%CPU (0avgtext+0avgdata 12312maxresident)k
0inputs+0outputs (0major+199minor)pagefaults 0swaps

rage
===
1.69user 0.57system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 4892maxresident)k
0inputs+0outputs (0major+305minor)pagefaults 0swaps

rage-avx2
===
1.52user 0.67system 0:02.20elapsed 99%CPU (0avgtext+0avgdata 5068maxresident)k
0inputs+0outputs (0major+307minor)pagefaults 0swaps

age -a
===
4.10user 1.14system 0:05.21elapsed 100%CPU (0avgtext+0avgdata 10356maxresident)k
0inputs+0outputs (0major+201minor)pagefaults 0swaps

rage -a
===
3.54user 0.60system 0:04.14elapsed 99%CPU (0avgtext+0avgdata 5508maxresident)k
0inputs+0outputs (0major+469minor)pagefaults 0swaps

rage-avx2 -a
===
3.34user 0.65system 0:03.99elapsed 99%CPU (0avgtext+0avgdata 5548maxresident)k
0inputs+0outputs (0major+471minor)pagefaults 0swaps

  • rage is 43% slower than age.
  • rage compiled with AVX2 is 38% slower than age.
  • rage -a is 21% faster than age -a.
  • rage -a compiled with AVX2 is 23% faster than age -a.

🚄💨

paulmillr avatar paulmillr commented on June 4, 2024

What are the use cases of armor, though? Not that many, I guess?

paulmillr avatar paulmillr commented on June 4, 2024

It's awesome in any case.

Is it possible to compile one binary that uses AVX2 when available and falls back to a non-vectorized implementation?

str4d avatar str4d commented on June 4, 2024

I just tried switching from chacha20poly1305 0.7 to 0.8, using chacha20 instead of c2-chacha (since the latter has not yet been updated with the new trait versions). Results:

Command        chacha20poly1305 0.7.1 + c2-chacha 0.3.1   chacha20poly1305 0.8.0 + chacha20 0.7.1 (#245)
age            1.58                                       1.57
rage           2.24                                       4.20
rage-avx2      2.16                                       2.53
age -a         5.18                                       5.19
rage -a        4.67 (4.36)                                6.62
rage-avx2 -a   4.28                                       5.00
  • The age commands are both latest master with Go 1.15, just from the two separate bench.sh runs.
  • The -a runs with a second parenthesised number show apparent toggling behaviour (but I can't consistently hit the second number). The remaining results are fairly stable (to within a few tens of milliseconds).

So I'm not sure how chacha20 0.7 is supposed to detect AVX2 support at runtime, but it is clearly not working. Compile-time detection does seem to work, but is still noticeably slower than c2-chacha.

tarcieri avatar tarcieri commented on June 4, 2024

Just verified it's working by running the benchmarks in the chacha20 directory of https://github.com/rustcrypto/stream-ciphers

$ cargo +nightly bench --features force-soft
     Running unittests (/Users/bascule/src/RustCrypto/stream-ciphers/target/release/deps/chacha20-131591e6cddd7159)

running 5 tests
test bench1_10     ... bench:          23 ns/iter (+/- 4) = 434 MB/s
test bench2_100    ... bench:         224 ns/iter (+/- 9) = 446 MB/s
test bench3_1000   ... bench:       2,396 ns/iter (+/- 818) = 417 MB/s
test bench4_10000  ... bench:      23,897 ns/iter (+/- 3,623) = 418 MB/s
test bench5_100000 ... bench:     240,243 ns/iter (+/- 45,351) = 416 MB/s
$ cargo +nightly bench
     Running unittests (/Users/bascule/src/RustCrypto/stream-ciphers/target/release/deps/chacha20-01c227d3ba15500b)

running 5 tests
test bench1_10     ... bench:          12 ns/iter (+/- 1) = 833 MB/s
test bench2_100    ... bench:          81 ns/iter (+/- 43) = 1234 MB/s
test bench3_1000   ... bench:       1,030 ns/iter (+/- 98) = 970 MB/s
test bench4_10000  ... bench:      10,620 ns/iter (+/- 921) = 941 MB/s
test bench5_100000 ... bench:     107,213 ns/iter (+/- 8,021) = 932 MB/s

str4d avatar str4d commented on June 4, 2024

Here are the flame graphs for the two cases (zoomed in on the chunk-writing phase):

c2-chacha 0.3.1 + poly1305 0.6.2:
(flame graph image)

chacha20 0.7.1 + poly1305 0.7.1:
(flame graph image)

It's clear that chacha20 is indeed enabling AVX2 at runtime, which makes the size of the slowdown relative to compile-time detection surprising.

I don't see anything in poly1305 0.7 that should have affected performance, so assuming the same wall clock time is spent on that part, my guess is that chacha20's AVX2 implementation is inherently slower than the one in c2-chacha. Most of the time is in the rounds function, but there are also some _mm256_castsi256_si128 and _mm256_extractf128_si256 calls that look suspiciously heavy.

str4d avatar str4d commented on June 4, 2024

RustCrypto/stream-ciphers#261 helps to close the gap significantly:

Command        c2-chacha 0.3.1   chacha20 0.7.1 (#245)   #245 + RustCrypto/stream-ciphers#261
age            1.58              1.57                    1.57
rage           2.24              4.20                    3.38
rage-avx2      2.16              2.53                    2.48
age -a         5.18              5.19                    5.18
rage -a        4.67 (4.36)       6.62                    5.78
rage-avx2 -a   4.28              5.00                    4.61

paulmillr avatar paulmillr commented on June 4, 2024

Awesome stuff!!

str4d avatar str4d commented on June 4, 2024

RustCrypto/stream-ciphers#267 does indeed close the remaining gap (for +avx2 mode). Re-running the benchmarks:

Command        c2-chacha 0.3.1   chacha20 0.7.3   chacha20 0.7.3 + RustCrypto/stream-ciphers#267
age            1.62              1.62             1.62
rage           2.28              3.41             2.64
rage-avx2      2.21              2.54             2.14
age -a         5.17              5.17             5.19
rage -a        4.83              5.91             5.12
rage-avx2 -a   4.67              4.97             4.62

paulmillr avatar paulmillr commented on June 4, 2024

Is it possible to combine rage and rage-avx2? That is, runtime AVX2 detection without the performance loss we've seen before. Maybe there's also some small issue that adds the perf hit.

str4d commented on June 4, 2024

Runtime AVX2 detection without performance loss is impossible, because rage-avx2 compiles out all the runtime checks, at which point the compiler is able to optimise out a bunch of calls and simplify the assembly. The best we can do is minimise the performance hit of those runtime checks, by caching the check at as high a level as is reasonable (giving the compiler more scope for optimisations over larger chunks of code).

Currently we check for AVX2 support inside both chacha20 and poly1305 during construction, and then check a cached token on each operation. I'm not sure what it would look like to extract these checks to the chacha20poly1305 level, and I suspect we'd get more bang for our buck working on one-pass encryption/decryption (RustCrypto/AEADs#74).

That being said, the runtime detection gap was seemingly smaller when using c2-chacha, so there might be some tweaks we can make to chacha20 to arrange the compilation units more effectively. That's gonna take some assembly spelunking that I don't have time to do at present.
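The "cache the check at as high a level as is reasonable" idea can be sketched with std only. Note that `probe_simd_support` here is a hypothetical stand-in for the real CPUID probe (on x86_64 that would be `std::arch::is_x86_feature_detected!("avx2")`), not the actual chacha20/poly1305 internals:

```rust
use std::sync::OnceLock;

// Hypothetical probe, for illustration only. A real implementation on
// x86_64 would call std::arch::is_x86_feature_detected!("avx2").
fn probe_simd_support() -> bool {
    cfg!(any(target_arch = "x86_64", target_arch = "x86"))
}

/// Run the (relatively expensive) probe once and cache the result, so that
/// per-operation call sites only pay for a cheap atomic load afterwards.
fn simd_available() -> bool {
    static CACHED: OnceLock<bool> = OnceLock::new();
    *CACHED.get_or_init(probe_simd_support)
}
```

This is essentially the "cached token" approach: the branch on `simd_available()` remains at runtime, which is the residual cost that compiling with `-Ctarget-feature=+avx2` eliminates entirely.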

str4d commented on June 4, 2024

Huh, I just re-ran the benchmarks on my machine (on current main with chacha20poly1305 0.9), and got significantly less runtime detection gap:

| Command | Time (s) |
| --- | --- |
| age | 1.59 |
| rage | 2.35 |
| rage-avx2 | 2.09 |
| age -a | 5.16 |
| rage -a | 4.82 |
| rage-avx2 -a | 4.26 |

(The -a numbers historically seem to jump around a bit, I presume depending on precisely what armoring characters get used, so I don't see the rage-avx2 -a drop as significant here.)

So I think we're actually probably fine on my last point above (at least, switching away from c2-chacha is not a regression in the autodetect case).

paulmillr commented on June 4, 2024

Can we get a new version of rage out?

paulmillr commented on June 4, 2024

@str4d ping: it would be useful to have a new release.

str4d commented on June 4, 2024

@paulmillr 0.7.0 is now out with the above changes.

paulmillr commented on June 4, 2024

Weird: this person says rage is 5x slower: FiloSottile/age#109 (comment)

Tronic commented on June 4, 2024

Speaking of latest-gen desktop CPUs, core count does not matter: it is still slow on a Windows/Ryzen system, but on Linux/Intel it can at least do about 700 MB/s encryption and decryption (on tmpfs or on an SSD, not much difference). You absolutely need to process several blocks in parallel threads to make it faster (a single thread caps out at about 1 GB/s with current crypto libraries), and perhaps check your I/O path to remove any extra buffer copying. Ideally you read encrypted data by mmap or, if that is not possible, by readinto, to avoid some copying; feed that directly to ChaCha (no copies) and use a regular write on the part of the decrypted buffer that holds file data (it is hard to avoid the copy here entirely).

Several gigabytes per second should be very possible, but then you cannot afford any extra copies.
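The copy-avoidance idea can be sketched roughly: reuse one preallocated buffer for every chunk and transform it in place, so the only unavoidable copies are the read into the buffer and the write out of it. The XOR "keystream" below is a toy stand-in for a real cipher, purely for illustration:

```rust
use std::io::{Read, Write};

// Stand-in for an in-place cipher operation (a real STREAM chunk would be
// ChaCha20-Poly1305); here we just XOR every byte with a fixed value.
fn apply_keystream_in_place(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b ^= 0x5a;
    }
}

fn copy_transform<R: Read, W: Write>(mut src: R, mut dst: W) -> std::io::Result<()> {
    // One 64 KiB buffer reused for every chunk: no per-chunk allocation,
    // and the data is transformed where it was read.
    let mut buf = vec![0u8; 64 * 1024];
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        apply_keystream_in_place(&mut buf[..n]);
        dst.write_all(&buf[..n])?;
    }
    Ok(())
}
```

Because the toy XOR is its own inverse, running `copy_transform` twice round-trips the data; with a real AEAD the in-place encrypt and decrypt calls would differ, but the buffer-reuse shape stays the same.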

str4d commented on June 4, 2024

@Tronic I'm well aware that we will eventually need to add threading support to boost performance further. However, the last time I tried that (#57 (comment)) I saw only a 20%-ish throughput improvement while using 4x more CPU. So there are clearly other bottlenecks that need addressing first before we add multithreading support.

In any case, this particular issue is about catching up to the performance of the Go age implementation, which is also single-threaded. Let's move multithreading discussions to #271.

str4d commented on June 4, 2024

Re-ran the benchmarks on my old and new machines:

Intel Core i7-8700K

| Command | Time (s) | Relative |
| --- | --- | --- |
| age | 1.78 | 1 |
| rage | 2.50 | 1.40 |
| rage-avx2 | 2.39 | 1.34 |
| age -a | 5.69 | 1 |
| rage -a | 4.78 | 0.84 |
| rage-avx2 -a | 4.60 | 0.81 |

Baselines are higher (probably because I have Firefox open), but otherwise it's the same approximate ratios we've seen before.

AMD Ryzen 9 5950X

| Command | Time (s) | Relative |
| --- | --- | --- |
| age | 0.74 | 1 |
| rage | 2.01 | 2.72 |
| rage-avx2 | 1.87 | 2.53 |
| age -a | 2.80 | 1 |
| rage -a | 3.58 | 1.28 |
| rage-avx2 -a | 3.23 | 1.15 |

Compared to the i7-8700K:

- age is 41% faster at native and 49% faster at armored.
- rage is 20% faster at native and 25% faster at armored.
- rage-avx2 is 22% faster at native and 30% faster at armored.

Yay, I have a new target to optimise for!

paulmillr commented on June 4, 2024

What's holding us back at this point?

str4d commented on June 4, 2024

Per my earlier comment (#57 (comment)), I'm almost certain it's our lack of one-pass encryption: the Go AEAD impl uses separate custom assembly for ChaCha20Poly1305, whereas the Rust Crypto AEAD impl is compositional so the ChaCha20 assembly is separate from the Poly1305 assembly.
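As a rough illustration of why one-pass processing helps (better cache locality and more scope for interleaving the two computations), compare a two-pass loop with a fused loop. The XOR "cipher" and multiply-add "MAC" below are toy stand-ins, not ChaCha20 or Poly1305:

```rust
// Toy stand-in for a stream cipher applied to one 64-byte block.
fn xor_block(chunk: &mut [u8]) {
    for b in chunk.iter_mut() {
        *b ^= 0x5a;
    }
}

// Toy stand-in for a MAC update over the ciphertext.
fn mac_update(tag: &mut u64, chunk: &[u8]) {
    for &b in chunk {
        *tag = tag.wrapping_mul(31).wrapping_add(b as u64);
    }
}

// Two passes: the whole buffer is streamed through the cache twice,
// once for the cipher and once for the MAC.
fn two_pass(buf: &mut [u8]) -> u64 {
    for chunk in buf.chunks_mut(64) {
        xor_block(chunk);
    }
    let mut tag = 0;
    for chunk in buf.chunks(64) {
        mac_update(&mut tag, chunk);
    }
    tag
}

// One pass: each block is MACed while it is still hot in cache, right
// after being enciphered.
fn one_pass(buf: &mut [u8]) -> u64 {
    let mut tag = 0;
    for chunk in buf.chunks_mut(64) {
        xor_block(chunk);
        mac_update(&mut tag, chunk);
    }
    tag
}
```

Both variants produce the same ciphertext and tag; the fused version simply touches memory once per block, which is the structural advantage the Go assembly exploits (and that the compositional Rust Crypto AEAD currently gives up).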
