simdutf8's Introduction

simdutf8 – High-speed UTF-8 validation

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs, but now heavily improved.

Status

This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs.

Features

  • basic API for the fastest validation, optimized for valid UTF-8
  • compat API as a fully compatible replacement for std::str::from_utf8()
  • Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64
  • 🆕 ARM64 (aarch64) SIMD is supported with Rust 1.59 and 1.60 (use feature aarch64_neon) and Nightly (no extra feature needed)
  • 🆕 WASM (wasm32) SIMD is supported
  • x86-64: Up to 23 times faster than the std library on valid non-ASCII, up to four times faster on ASCII
  • aarch64: Up to eleven times faster than the std library on valid non-ASCII, up to four times faster on ASCII (Apple Silicon)
  • Faster than the original simdjson implementation
  • Selects the fastest implementation at runtime based on CPU support (on x86)
  • Falls back to the excellent std implementation if SIMD extensions are not supported
  • Written in pure Rust
  • No dependencies
  • No-std support

Quick start

Add the dependency to your Cargo.toml file:

[dependencies]
simdutf8 = "0.1.4"

For ARM64 SIMD support on Rust 1.59 and 1.60:

[dependencies]
simdutf8 = { version = "0.1.4", features = ["aarch64_neon"] }

Use simdutf8::basic::from_utf8() as a drop-in replacement for std::str::from_utf8().

use simdutf8::basic::from_utf8;

println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());

If you need detailed information on validation failures, use simdutf8::compat::from_utf8() instead.

use simdutf8::compat::from_utf8;

let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));

APIs

Basic flavor

Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.

Compat flavor

The compat flavor is fully API-compatible with std::str::from_utf8(). In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

It also fails early: errors are checked on the fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a slight performance penalty compared to the basic API even if the input is valid UTF-8.
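As a minimal sketch of how the two accessors can be read together, the snippet below classifies a buffer as valid, truncated, or invalid. It uses std::str::from_utf8() so that it runs without extra dependencies; since the compat flavor is API-compatible, swapping the import for simdutf8::compat::from_utf8 should behave the same. The Check enum and the check() function are hypothetical names introduced for illustration only.

```rust
// Assumption: std is used as a stand-in; replace with
// `use simdutf8::compat::from_utf8;` to use the SIMD-accelerated validator.
use std::str::from_utf8;

/// Hypothetical result type, not part of simdutf8.
#[derive(Debug, PartialEq)]
pub enum Check {
    /// The whole buffer is valid UTF-8.
    Valid,
    /// The buffer ends inside a multi-byte sequence; more data could complete it.
    Incomplete { valid_up_to: usize },
    /// `len` bytes starting at `valid_up_to` can never be part of valid UTF-8.
    Invalid { valid_up_to: usize, len: usize },
}

/// Classifies `bytes` using `valid_up_to()` and `error_len()`.
pub fn check(bytes: &[u8]) -> Check {
    match from_utf8(bytes) {
        Ok(_) => Check::Valid,
        Err(e) => match e.error_len() {
            // `None` means the input ended unexpectedly.
            None => Check::Incomplete { valid_up_to: e.valid_up_to() },
            Some(len) => Check::Invalid { valid_up_to: e.valid_up_to(), len },
        },
    }
}
```

The Incomplete case is the one that matters for streamed data: the bytes after valid_up_to can be kept and re-validated together with the next chunk.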

Implementation selection

X86

The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro, unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile-time and runtime selection is disabled.

For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
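The selection order described above can be sketched roughly as follows. This is not the crate's internal code: pick_implementation() and the labels it returns are hypothetical, and the real library dispatches to validation functions rather than returning a name. Only std::is_x86_feature_detected! and the target_feature cfg are taken from the text.

```rust
/// Hypothetical helper that reports which implementation would be chosen.
pub fn pick_implementation() -> &'static str {
    // Compile-time selection: if the build already targets AVX 2
    // (for example via -C target-cpu=native), no runtime check is needed.
    if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )) {
        return "avx2 (compile time)";
    }

    // Runtime selection: only compiled on x86 targets and requires std.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }

    // No usable SIMD extension: fall back to the std implementation.
    "std"
}
```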

ARM64

The SIMD implementation is only available on Rust Nightly and Rust 1.59 or later. On Rust Nightly it is now turned on automatically. To get the SIMD implementation with Rust 1.59 and 1.60 the crate feature aarch64_neon needs to be enabled. For Rust Nightly this will no longer be required (but does not hurt either). It is expected that the SIMD implementation will be enabled automatically starting with Rust 1.61.

WASM32

For wasm32 support, the implementation is selected at compile time based on the presence of the simd128 target feature. Use RUSTFLAGS="-C target-feature=+simd128" to enable the WASM SIMD implementation. WASM, at the time of this writing, doesn't have a way to detect SIMD through WASM itself. Although this capability is available in various WASM host environments (e.g., wasm-feature-detect in the web browser), there is no portable way from within the library to detect this.
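Because the decision is made purely at compile time, it can be expressed with a cfg check and no runtime probing. The sketch below only illustrates that idea; WASM_SIMD and wasm_implementation() are hypothetical names and not part of the crate.

```rust
/// True only when compiling for wasm32 with the simd128 target feature enabled,
/// e.g. with RUSTFLAGS="-C target-feature=+simd128".
pub const WASM_SIMD: bool = cfg!(all(target_arch = "wasm32", target_feature = "simd128"));

/// Hypothetical helper naming the implementation fixed at build time.
pub fn wasm_implementation() -> &'static str {
    if WASM_SIMD {
        "simd128"
    } else {
        // Without the target feature the std implementation is used.
        "std"
    }
}
```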

Building/Targeting WASM

See this document for more details.

Access to low-level functionality

If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible in the simdutf8::{basic, compat}::imp hierarchy. Traits facilitating streaming validation are available there as well.

Optimisation flags

Do not use opt-level = "z", which prevents inlining and makes the code quite slow.

Minimum Supported Rust Version (MSRV)

This crate's minimum supported Rust version is 1.38.0.

Benchmarks

The benchmarks were run with criterion; the tables were created with critcmp. Source code and data are in the bench directory.

The naming schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64KiB slice where the very first character is invalid UTF-8. Library versions are simdutf8 v0.1.2 and simdjson v0.9.2. When comparing with simdjson, simdutf8 is compiled with #[inline(never)].

Configurations:

  • X86-64: PC with an AMD Ryzen 7 PRO 3700 CPU (Zen2) on Linux with Rust 1.52.0
  • Aarch64: MacBook Air with an Apple M1 CPU (Apple Silicon) on macOS with rustc 1.54.0-nightly (881c1ac40 2021-05-08).

simdutf8 basic vs std library on x86-64 (AMD Zen2)

Simdutf8 is up to 23 times faster than the std library on valid non-ASCII, up to four times on pure ASCII.

simdutf8 basic vs std library on aarch64 (Apple Silicon)

Simdutf8 is up to eleven times faster than the std library on valid non-ASCII, up to four times faster on pure ASCII.

simdutf8 basic vs simdjson on x86-64

Simdutf8 is faster than simdjson on almost all inputs.

simdutf8 basic vs simdutf8 compat UTF-8 on x86-64

There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.

Technical details

For inputs shorter than 64 bytes validation is delegated to core::str::from_utf8() except for the direct-access functions in simdutf8::{basic, compat}::imp.

The SIMD implementation is mostly similar to the one in simdjson, except that it has additional optimizations for the pure ASCII case. It also uses prefetch with AVX 2 on x86, which leads to slightly better performance with some Intel CPUs on synthetic benchmarks.

For the compat API, we need to check the error status vector on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.

Care is taken that all functions are properly inlined up to the public interface.

Thanks

  • to the authors of simdjson for coming up with the high-performance SIMD implementation and in particular to Daniel Lemire for his feedback. It was very helpful.
  • to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.

License

This code is dual-licensed under the Apache License 2.0 and the MIT License.

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license as well.

simdjson itself is distributed under the Apache License 2.0.

References

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

simdutf8's People

Contributors

almann, cryze, hkratz, lemire

simdutf8's Issues

[Bug] Test failure on arm64

Hi,
I packaged your crate for debian. The tests all pass except on arm64. The --no-features and the --arch64_neon_prefetch
features fail. On all other arches the tests pass (x86-64, x86, armel, armhf). Log:

Compiling simdutf8 v0.1.4 (/usr/share/cargo/registry/simdutf8-0.1.4)
     Running `CARGO=/usr/bin/cargo CARGO_CRATE_NAME=simdutf8 CARGO_MANIFEST_DIR=/usr/share/cargo/registry/simdutf8-0.1.4 CARGO_PKG_AUTHORS='Hans Kratz <[email protected]>' CARGO_PKG_DESCRIPTION='SIMD-accelerated UTF-8 validation.' CARGO_PKG_HOMEPAGE='https://github.com/rusticstuff/simdutf8' CARGO_PKG_LICENSE='MIT OR Apache-2.0' CARGO_PKG_LICENSE_FILE='' CARGO_PKG_NAME=simdutf8 CARGO_PKG_REPOSITORY='https://github.com/rusticstuff/simdutf8' CARGO_PKG_VERSION=0.1.4 CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_VERSION_MINOR=1 CARGO_PKG_VERSION_PATCH=4 CARGO_PKG_VERSION_PRE='' CARGO_PRIMARY_PACKAGE=1 LD_LIBRARY_PATH='/tmp/tmp.Twjfx6GdWG/target/debug/deps:/usr/lib' rustc --crate-name simdutf8 --edition=2018 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="aarch64_neon"' --cfg 'feature="aarch64_neon_prefetch"' --cfg 'feature="default"' --cfg 'feature="hints"' --cfg 'feature="public_imp"' --cfg 'feature="std"' -C metadata=1e3b6e54b7c540bd -C extra-filename=-1e3b6e54b7c540bd --out-dir /tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/deps --target aarch64-unknown-linux-gnu -C incremental=/tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/incremental -L dependency=/tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/deps -L dependency=/tmp/tmp.Twjfx6GdWG/target/debug/deps -C debuginfo=2 --cap-lints warn -C linker=aarch64-linux-gnu-gcc -C link-arg=-Wl,-z,relro --remap-path-prefix /usr/share/cargo/registry/simdutf8-0.1.4=/usr/share/cargo/registry/simdutf8-0.1.4 --remap-path-prefix /tmp/tmp.Twjfx6GdWG/registry=/usr/share/cargo/registry`
     Running `CARGO=/usr/bin/cargo CARGO_CRATE_NAME=simdutf8 CARGO_MANIFEST_DIR=/usr/share/cargo/registry/simdutf8-0.1.4 CARGO_PKG_AUTHORS='Hans Kratz <[email protected]>' CARGO_PKG_DESCRIPTION='SIMD-accelerated UTF-8 validation.' CARGO_PKG_HOMEPAGE='https://github.com/rusticstuff/simdutf8' CARGO_PKG_LICENSE='MIT OR Apache-2.0' CARGO_PKG_LICENSE_FILE='' CARGO_PKG_NAME=simdutf8 CARGO_PKG_REPOSITORY='https://github.com/rusticstuff/simdutf8' CARGO_PKG_VERSION=0.1.4 CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_VERSION_MINOR=1 CARGO_PKG_VERSION_PATCH=4 CARGO_PKG_VERSION_PRE='' CARGO_PRIMARY_PACKAGE=1 LD_LIBRARY_PATH='/tmp/tmp.Twjfx6GdWG/target/debug/deps:/usr/lib' rustc --crate-name simdutf8 --edition=2018 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 --test --cfg 'feature="aarch64_neon"' --cfg 'feature="aarch64_neon_prefetch"' --cfg 'feature="default"' --cfg 'feature="hints"' --cfg 'feature="public_imp"' --cfg 'feature="std"' -C metadata=450361c6ec8c3c1b -C extra-filename=-450361c6ec8c3c1b --out-dir /tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/deps --target aarch64-unknown-linux-gnu -C incremental=/tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/incremental -L dependency=/tmp/tmp.Twjfx6GdWG/target/aarch64-unknown-linux-gnu/debug/deps -L dependency=/tmp/tmp.Twjfx6GdWG/target/debug/deps -C debuginfo=2 --cap-lints warn -C linker=aarch64-linux-gnu-gcc -C link-arg=-Wl,-z,relro --remap-path-prefix /usr/share/cargo/registry/simdutf8-0.1.4=/usr/share/cargo/registry/simdutf8-0.1.4 --remap-path-prefix /tmp/tmp.Twjfx6GdWG/registry=/usr/share/cargo/registry`
error[E0658]: use of unstable library feature 'stdsimd'
   --> src/lib.rs:1:1
    |
1   | / #![deny(warnings)]
2   | | #![warn(unused_extern_crates)]
3   | | #![deny(
4   | |     clippy::all,
...   |
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/implementation/aarch64/neon.rs:234:33
    |
234 |     _prefetch(ptr.cast::<i8>(), _PREFETCH_READ, _PREFETCH_LOCALITY3);
    |                                 ^^^^^^^^^^^^^^
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/implementation/aarch64/neon.rs:234:49
    |
234 |     _prefetch(ptr.cast::<i8>(), _PREFETCH_READ, _PREFETCH_LOCALITY3);
    |                                                 ^^^^^^^^^^^^^^^^^^^
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/implementation/aarch64/neon.rs:233:31
    |
233 |     use core::arch::aarch64::{_prefetch, _PREFETCH_LOCALITY3, _PREFETCH_READ};
    |                               ^^^^^^^^^
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/implementation/aarch64/neon.rs:233:42
    |
233 |     use core::arch::aarch64::{_prefetch, _PREFETCH_LOCALITY3, _PREFETCH_READ};
    |                                          ^^^^^^^^^^^^^^^^^^^
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/implementation/aarch64/neon.rs:233:63
    |
233 |     use core::arch::aarch64::{_prefetch, _PREFETCH_LOCALITY3, _PREFETCH_READ};
    |                                                               ^^^^^^^^^^^^^^
    |
    = note: see issue #48556 <https://github.com/rust-lang/rust/issues/48556> for more information

For more information about this error, try `rustc --explain E0658`.
error: could not compile `simdutf8` due to 6 previous errors

Add streaming API which works with the basic and compat APIs

Currently only full slices can be validated using the basic API. With a streaming API offering init(), update(), and finish_validation() functions, validation could be done on the fly.

With the compat API this can currently be awkwardly emulated by remembering how far the given slice is valid using the Utf8Error::valid_up_to() method.
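A rough sketch of that emulation is shown below, assuming a compat-style from_utf8(). It uses std::str::from_utf8() so that it runs standalone; simdutf8::compat::from_utf8 could be substituted. StreamValidator, InvalidUtf8, and the method bodies are hypothetical and do not describe an existing simdutf8 API. The sketch copies every chunk into a temporary buffer, which is part of what makes this approach awkward.

```rust
// Assumption: std stands in for simdutf8::compat::from_utf8.
use std::str::from_utf8;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct InvalidUtf8;

/// Hypothetical streaming validator built on `valid_up_to()`.
#[derive(Default)]
pub struct StreamValidator {
    /// Bytes of an incomplete trailing sequence, carried into the next update.
    pending: Vec<u8>,
    /// Set once invalid data has been seen.
    failed: bool,
}

impl StreamValidator {
    /// Validates the next chunk together with any carried-over bytes.
    pub fn update(&mut self, chunk: &[u8]) -> Result<(), InvalidUtf8> {
        if self.failed {
            return Err(InvalidUtf8);
        }
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);
        match from_utf8(&buf) {
            Ok(_) => Ok(()),
            // Input ended mid-sequence: remember the tail for the next call.
            Err(e) if e.error_len().is_none() => {
                self.pending = buf[e.valid_up_to()..].to_vec();
                Ok(())
            }
            Err(_) => {
                self.failed = true;
                Err(InvalidUtf8)
            }
        }
    }

    /// Succeeds only if no error occurred and no sequence is left incomplete.
    pub fn finish_validation(self) -> Result<(), InvalidUtf8> {
        if self.failed || !self.pending.is_empty() {
            Err(InvalidUtf8)
        } else {
            Ok(())
        }
    }
}
```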

Run Fuzzer on wasm32 Targeted Code

As part of #56, there is a remaining TODO to integrate with the fuzzer. Based on the README for rust-fuzz, x86-64 is required, so we cannot run the fuzzer natively on something like wasm32-wasi.

https://github.com/rust-fuzz/cargo-fuzz/blob/63730da7f95cfb21f6f5a9b0a74532f98d3983a4/README.md?plain=1#L13-L16

In order to integrate with the fuzzer, we may want to take an approach similar to the benchmarking (shim to the WASM and use a WASM runtime to embed the functionality).

Miri reports UB with simd_bitmask (FW)

A forward from simd-lite/simd-json#264,

I managed to replicate the UB error from miri with this test:

cargo +nightly  miri test  --features public_imp utf8_validator

I could trigger it with avx2, sse4.2 and avx

#[cfg(all(feature = "public_imp", target_feature = "avx2"))]
#[test]
fn utf8_validator() {
    use simdutf8::basic::imp::ChunkedUtf8Validator;
    unsafe {
        let mut utf8_validator = simdutf8::basic::imp::x86::avx2::ChunkedUtf8ValidatorImp::new();
        let tmpbuf = [
            49, 46, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
            32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
            32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
        ];
        utf8_validator.update_from_chunks(&tmpbuf);
        assert!(utf8_validator.finalize(None).is_ok());
    }
}

Integrating into simd-json

It would make a lot of sense to use this in simd-json. It should be straightforward to do, but it might require making some of the internals public. Let me know what you think and I can help kick this off.

Add armv7 neon support

  • Still some required intrinsics missing.
  • Runtime feature detection is required.

WIP draft pull request: #43

Validated ring buffer iterator

It would be nice to have a validated ring buffer iterator. Not sure if the consumer would want (pointer, byteLength) or (pointer, ignoreNPrefixBytes, byteLength) to keep them L1 cache aligned.

Upstream into libcore/libstd?

First thanks for the excellent crate!
Since this crate provides so much speedup compared to the std one, would it make sense to upstream this crate into libcore/libstd?

Heads-up: const_err lint is going away

This crate carries an allow(const_err). That lint is going away since it becomes a hard error, which causes a warning due to the removed lint being used, which then triggers deny(warnings).

The crate does not actually seem to trigger const_err (according to crater), so the hard error itself should not be a problem. The allow(const_err) can likely just be removed.

Benchmarking error

Am I doing something stupid? Happy to add a PR to the docs on running the benchmark suite.

x86 WSL2 Ubuntu 20.04 environment using rust nightly toolchain

crb002@LAPTOP-PNLGM1UH:~/github/meta_coreutils/thirdparty/simdutf8/bench$ cargo bench
   Compiling simdutf8-bench v0.0.1 (/home/crb002/github/meta_coreutils/thirdparty/simdutf8/bench)
warning: unused import: `simdutf8_bench::define_cpb_benchmark`
 --> benches/cpb_simdjson.rs:1:5
  |
1 | use simdutf8_bench::define_cpb_benchmark;
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused import: `simdutf8_bench::define_throughput_benchmark`
 --> benches/throughput_simdjson.rs:1:5
  |
1 | use simdutf8_bench::define_throughput_benchmark;
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

error[E0601]: `main` function not found in crate `throughput_simdjson`
 --> benches/throughput_simdjson.rs:1:1
  |
1 | / use simdutf8_bench::define_throughput_benchmark;
2 | |
3 | | #[cfg(feature = "simdjson")]
4 | | define_throughput_benchmark!(BenchFn::Simdjson);
  | |________________________________________________^ consider adding a `main` function to `benches/throughput_simdjson.rs`


error[E0601]: `main` function not found in crate `cpb_simdjson`
 --> benches/cpb_simdjson.rs:1:1
  |
1 | / use simdutf8_bench::define_cpb_benchmark;
2 | |
3 | | #[cfg(feature = "simdjson")]
4 | | define_cpb_benchmark!(BenchFn::Simdjson);
  | |_________________________________________^ consider adding a `main` function to `benches/cpb_simdjson.rs`

error: aborting due to previous error; 1 warning emitted

error: aborting due to previous error; 1 warning emitted

For more information about this error, try `rustc --explain E0601`.
For more information about this error, try `rustc --explain E0601`.
error: could not compile `simdutf8-bench`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

Question. Speed on large inputs.

Why do I get only about 12 GB/s with this manual benchmark?

use std::time::Instant;

use simdutf8::basic::from_utf8;

fn main() {
    let mut vec: Vec<u8> = Vec::new();

    for i in 0..1024 * 1024 * 10 {
        vec.push((i % 10) as u8 + b'0');
    }

    // println!("{:?}", vec);

    println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());

    let start = Instant::now();

    let decoded = from_utf8(vec.as_slice()).unwrap();
    // let decoded = std::str::from_utf8(vec.as_slice()).unwrap();
    println!("length: {}", decoded.len());

    let elapsed = Instant::now().duration_since(start);
    println!("Elapsed time: {:?}", elapsed);
    let giga = 1024 * 1024 * 1024;
    println!("Speed: {:?} GB/s", 1000000.0 / (elapsed.as_micros() as f64) * (vec.len() as f64) / (giga as f64));
}

When I run the benchmark (slightly patched), I get about 80 GB/s.

1-latin/1048576         time:   [12.134 µs 12.144 µs 12.155 µs]
                        thrpt:  [80.339 GiB/s 80.416 GiB/s 80.481 GiB/s]

Add aarch64 neon support

This has now been merged but there are still some open points:

  • The generated assembly is sub-par.
  • Documentation needs to be updated.

Mislink on Windows with lld and thinlto

It appears that validate_utf8_basic and similar functions trigger a mislink on Windows with lld and thinlto. This is not a terribly uncommon combination, so it may be worth exploring alternatives that do not cause this mislink.

This is an issue @Kixiron ran into while using bytecheck, which uses simdutf8 for fast string validation. The issue was traced back to simdutf8 using a release build with debug symbols and WinDbg, then the memory backing the AtomicPtr was rewound to the beginning of the application and verified to be invalid. This indicates that the function pointer placed in it was not relocated to the correct address.

Performance on short strings

If you are only processing short byte sequences (less than 64 bytes), the excellent scalar algorithm in the standard library is likely faster. If there is no native implementation for your platform (yet), use the standard library instead.

To my knowledge, there is no hard engineering reason why you'd ever be slower irrespective of the string length. In the worst case, you can always do...

if(short string) {
  do that
} else {
  do this
}

This adds one predictable branch.
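In Rust, the length-based dispatch suggested above could be sketched like this. SIMD_THRESHOLD, validate(), and validate_long() are hypothetical names; validate_long() merely reuses the standard library as a placeholder for a SIMD code path, so the sketch shows the branch structure only.

```rust
/// Assumed cut-over point; 64 bytes is the figure mentioned in the text.
const SIMD_THRESHOLD: usize = 64;

/// Validates `input`, picking the code path by length.
pub fn validate(input: &[u8]) -> bool {
    if input.len() < SIMD_THRESHOLD {
        // Short input: the scalar std algorithm is likely faster.
        std::str::from_utf8(input).is_ok()
    } else {
        // Long input: hand over to the (placeholder) SIMD path.
        validate_long(input)
    }
}

/// Placeholder for a SIMD implementation; here it simply calls std.
fn validate_long(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok()
}
```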

Chunked iterator API like `Utf8Chunks`

I've wanted chunked UTF-8 decoding twice recently for different escaping routines, and have used simdutf8::compat::from_utf8 in a loop to achieve that. I would really like to be able to use an API like Utf8Chunks from #[feature(utf8_lossy)] or bstr::Utf8Chunks, but with the faster validation of this crate. Utf8Chunks avoids the disconnect between the length of the valid prefix and the prefix as a string. Additionally, I suspect an API for this could avoid some overhead from decoding in a loop.

I ended up writing something close to this:

use std::borrow::Cow;
use std::str;

pub fn from_utf8_lossy(mut v: &[u8]) -> Cow<'_, str> {
    match simdutf8::compat::from_utf8(v) {
        Ok(s) => s.into(),
        Err(mut err) => {
            let mut cleaned = String::with_capacity(v.len());
            loop {
                cleaned.push_str(unsafe { str::from_utf8_unchecked(&v[..err.valid_up_to()]) });
                cleaned.push_str("\u{FFFD}");
                if let Some(error_len) = err.error_len() {
                    v = &v[err.valid_up_to() + error_len..];
                    match simdutf8::compat::from_utf8(v) {
                        Ok(v) => cleaned.push_str(v),
                        Err(err1) => {
                            err = err1;
                            continue;
                        }
                    }
                }
                break cleaned.into();
            }
        }
    }
}

Compare to the stdlib implementation of String::from_utf8_lossy, which avoids any direct offset fiddling and unchecked conversions:

pub fn from_utf8_lossy(v: &[u8]) -> Cow<'_, str> {
    let mut iter = Utf8Chunks::new(v);

    let first_valid = if let Some(chunk) = iter.next() {
        let valid = chunk.valid();
        if chunk.invalid().is_empty() {
            debug_assert_eq!(valid.len(), v.len());
            return Cow::Borrowed(valid);
        }
        valid
    } else {
        return Cow::Borrowed("");
    };

    const REPLACEMENT: &str = "\u{FFFD}";

    let mut res = String::with_capacity(v.len());
    res.push_str(first_valid);
    res.push_str(REPLACEMENT);

    for chunk in iter {
        res.push_str(chunk.valid());
        if !chunk.invalid().is_empty() {
            res.push_str(REPLACEMENT);
        }
    }

    Cow::Owned(res)
}
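For illustration, a chunk iterator of the kind requested here could be layered on a compat-style from_utf8() roughly as follows. This is only a sketch: Chunk and Chunks are hypothetical types, not an existing simdutf8, std, or bstr API, and std::str::from_utf8() is used so that the example is self-contained (simdutf8::compat::from_utf8 could be swapped in). It also restarts validation after every invalid sequence, so it does not remove the per-iteration overhead mentioned above.

```rust
// Assumption: std stands in for simdutf8::compat::from_utf8.
use std::str::from_utf8;

/// Hypothetical chunk: a valid prefix followed by an invalid byte sequence.
pub struct Chunk<'a> {
    pub valid: &'a str,
    pub invalid: &'a [u8],
}

/// Hypothetical iterator over the chunks of a byte slice.
pub struct Chunks<'a> {
    rest: &'a [u8],
}

impl<'a> Chunks<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { rest: bytes }
    }
}

impl<'a> Iterator for Chunks<'a> {
    type Item = Chunk<'a>;

    fn next(&mut self) -> Option<Chunk<'a>> {
        if self.rest.is_empty() {
            return None;
        }
        match from_utf8(self.rest) {
            Ok(valid) => {
                self.rest = &[];
                Some(Chunk { valid, invalid: &[] })
            }
            Err(e) => {
                let (good, tail) = self.rest.split_at(e.valid_up_to());
                // `None` means a truncated sequence at the end: take all of it.
                let bad_len = e.error_len().unwrap_or(tail.len());
                let (invalid, rest) = tail.split_at(bad_len);
                self.rest = rest;
                // SAFETY: the validator reported `good` as valid UTF-8.
                let valid = unsafe { std::str::from_utf8_unchecked(good) };
                Some(Chunk { valid, invalid })
            }
        }
    }
}
```

With such an iterator, the offset arithmetic and the unchecked conversion live in one place instead of in every caller.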
