Git Product home page Git Product logo

lasso's Introduction

CI Security Audit Coverage Docs.rs Crates.io

A multithreaded and single threaded string interner that allows strings to be cached with a minimal memory footprint, associating them with a unique key that can be used to retrieve them at any time. A Rodeo allows O(1) internment and resolution and can be turned into a RodeoReader to allow for contention-free resolutions with both key to str and str to key operations. It can also be turned into a RodeoResolver with only key to str operations for the lowest possible memory usage.

Which interner do I use?

For single-threaded workloads Rodeo is encouraged, while multi-threaded applications should use ThreadedRodeo. Both of these are the only way to intern strings, but most applications will hit a stage where they are done interning strings, and at that point is where the choice between RodeoReader and RodeoResolver. If the user needs to get keys for strings still, then they must use the RodeoReader (although they can still transfer into a RodeoResolver) at this point. For users who just need key to string resolution, the RodeoResolver gives contention-free access at the minimum possible memory usage. Note that to gain access to ThreadedRodeo the multi-threaded feature is required.

Interner Thread-safe Intern String str to key key to str Contention Free Memory Usage
Rodeo N/A Medium
ThreadedRodeo Most
RodeoReader Medium
RodeoResolver Least

Cargo Features

By default lasso has one dependency, hashbrown, and only Rodeo is exposed. Hashbrown is used since the raw_entry api is currently unstable in the standard library's hashmap. The raw hashmap API is used for custom hashing within the hashmaps, which works to dramatically reduce memory usage To make use of ThreadedRodeo, you must enable the multi-threaded feature.

  • multi-threaded - Enables ThreadedRodeo, the interner for multi-threaded tasks
  • ahasher - Use ahash's RandomState as the default hasher
  • no-std - Enables no_std + alloc support for Rodeo and ThreadedRodeo
    • Automatically enables the following required features:
      • ahasher - no_std hashing function
  • serialize - Implements Serialize and Deserialize for all Spur types and all interners
  • inline-more - Annotate external apis with #[inline]

Example: Using Rodeo

use lasso::Rodeo;

let mut rodeo = Rodeo::default();
let key = rodeo.get_or_intern("Hello, world!");

// Easily retrieve the value of a key and find the key for values
assert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!(Some(key), rodeo.get("Hello, world!"));

// Interning the same string again will yield the same key
let key2 = rodeo.get_or_intern("Hello, world!");

assert_eq!(key, key2);

Example: Using ThreadedRodeo

use lasso::ThreadedRodeo;
use std::{thread, sync::Arc};

let rodeo = Arc::new(ThreadedRodeo::default());
let key = rodeo.get_or_intern("Hello, world!");

// Easily retrieve the value of a key and find the key for values
assert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!(Some(key), rodeo.get("Hello, world!"));

// Interning the same string again will yield the same key
let key2 = rodeo.get_or_intern("Hello, world!");

assert_eq!(key, key2);

// ThreadedRodeo can be shared across threads
let moved = Arc::clone(&rodeo);
let hello = thread::spawn(move || {
    assert_eq!("Hello, world!", moved.resolve(&key));
    moved.get_or_intern("Hello from the thread!")
})
.join()
.unwrap();

assert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!("Hello from the thread!", rodeo.resolve(&hello));

Example: Creating a RodeoReader

use lasso::Rodeo;

// Rodeo and ThreadedRodeo are interchangeable here
let mut rodeo = Rodeo::default();

let key = rodeo.get_or_intern("Hello, world!");
assert_eq!("Hello, world!", rodeo.resolve(&key));

let reader = rodeo.into_reader();

// Reader keeps all the strings from the parent
assert_eq!("Hello, world!", reader.resolve(&key));
assert_eq!(Some(key), reader.get("Hello, world!"));

// The Reader can now be shared across threads, no matter what kind of Rodeo created it

Example: Creating a RodeoResolver

use lasso::Rodeo;

// Rodeo and ThreadedRodeo are interchangeable here
let mut rodeo = Rodeo::default();

let key = rodeo.get_or_intern("Hello, world!");
assert_eq!("Hello, world!", rodeo.resolve(&key));

let resolver = rodeo.into_resolver();

// Resolver keeps all the strings from the parent
assert_eq!("Hello, world!", resolver.resolve(&key));

// The Resolver can now be shared across threads, no matter what kind of Rodeo created it

Example: Making a custom-ranged key

Sometimes you want your keys to only inhabit (or not inhabit) a certain range of values so that you can have custom niches. This allows you to pack more data into what would otherwise be unused space, which can be critical for memory-sensitive applications.

use lasso::{Key, Rodeo};

// First make our key type, this will be what we use as handles into our interner
#[derive(Copy, Clone, PartialEq, Eq)]
struct NicheKey(u32);

// This will reserve the upper 255 values for us to use as niches
const NICHE: usize = 0xFF000000;

// Implementing `Key` is unsafe and requires that anything given to `try_from_usize` must produce the
// same `usize` when `into_usize` is later called
unsafe impl Key for NicheKey {
    fn into_usize(self) -> usize {
        self.0 as usize
    }

    fn try_from_usize(int: usize) -> Option<Self> {
        if int < NICHE {
            // The value isn't in our niche range, so we're good to go
            Some(Self(int as u32))
        } else {
            // The value interferes with our niche, so we return `None`
            None
        }
    }
}

// To make sure we're upholding `Key`'s safety contract, let's make two small tests
#[test]
fn value_in_range() {
    let key = NicheKey::try_from_usize(0).unwrap();
    assert_eq!(key.into_usize(), 0);

    let key = NicheKey::try_from_usize(NICHE - 1).unwrap();
    assert_eq!(key.into_usize(), NICHE - 1);
}

#[test]
fn value_out_of_range() {
    let key = NicheKey::try_from_usize(NICHE);
    assert!(key.is_none());

    let key = NicheKey::try_from_usize(u32::max_value() as usize);
    assert!(key.is_none());
}

// And now we're done and can make `Rodeo`s or `ThreadedRodeo`s that use our custom key!
let mut rodeo: Rodeo<NicheKey> = Rodeo::new();
let key = rodeo.get_or_intern("It works!");
assert_eq!(rodeo.resolve(&key), "It works!");

Example: Creation using FromIterator

use lasso::Rodeo;
use core::iter::FromIterator;

// Works for both `Rodeo` and `ThreadedRodeo`
let rodeo = Rodeo::from_iter(vec![
    "one string",
    "two string",
    "red string",
    "blue string",
]);

assert!(rodeo.contains("one string"));
assert!(rodeo.contains("two string"));
assert!(rodeo.contains("red string"));
assert!(rodeo.contains("blue string"));
use lasso::Rodeo;
use core::iter::FromIterator;

// Works for both `Rodeo` and `ThreadedRodeo`
let rodeo: Rodeo = vec!["one string", "two string", "red string", "blue string"]
    .into_iter()
    .collect();

assert!(rodeo.contains("one string"));
assert!(rodeo.contains("two string"));
assert!(rodeo.contains("red string"));
assert!(rodeo.contains("blue string"));

Benchmarks

Benchmarks were gathered with Criterion.rs
OS: Windows 10
CPU: Ryzen 9 3900X at 3800Mhz
RAM: 3200Mhz
Rustc: Stable 1.44.1

Rodeo

STD RandomState

Method Time Throughput
resolve 1.9251 μs 13.285 GiB/s
try_resolve 1.9214 μs 13.311 GiB/s
resolve_unchecked 1.4356 μs 17.816 GiB/s
get_or_intern (empty) 60.350 μs 433.96 MiB/s
get_or_intern (filled) 57.415 μs 456.15 MiB/s
try_get_or_intern (empty) 58.978 μs 444.06 MiB/s
try_get_or_intern (filled) 57.421 μs 456.10 MiB/s
get (empty) 37.288 μs 702.37 MiB/s
get (filled) 55.095 μs 475.36 MiB/s

AHash

Method Time Throughput
try_resolve 1.9282 μs 13.264 GiB/s
resolve 1.9404 μs 13.181 GiB/s
resolve_unchecked 1.4328 μs 17.851 GiB/s
get_or_intern (empty) 38.029 μs 688.68 MiB/s
get_or_intern (filled) 33.650 μs 778.30 MiB/s
try_get_or_intern (empty) 39.392 μs 664.84 MiB/s
try_get_or_intern (filled) 33.435 μs 783.31 MiB/s
get (empty) 12.565 μs 2.0356 GiB/s
get (filled) 26.545 μs 986.61 MiB/s

FXHash

Method Time Throughput
resolve 1.9014 μs 13.451 GiB/s
try_resolve 1.9278 μs 13.267 GiB/s
resolve_unchecked 1.4449 μs 17.701 GiB/s
get_or_intern (empty) 32.523 μs 805.27 MiB/s
get_or_intern (filled) 30.281 μs 864.88 MiB/s
try_get_or_intern (empty) 31.630 μs 828.00 MiB/s
try_get_or_intern (filled) 31.002 μs 844.78 MiB/s
get (empty) 12.699 μs 2.0141 GiB/s
get (filled) 29.220 μs 896.28 MiB/s

ThreadedRodeo

STD RandomState

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 54.336 μs 482.00 MiB/s 364.27 μs 71.897 MiB/s
try_resolve 54.582 μs 479.82 MiB/s 352.67 μs 74.261 MiB/s
get_or_intern (empty) 266.03 μs 98.447 MiB/s N\A N\A
get_or_intern (filled) 103.04 μs 254.17 MiB/s 441.42 μs 59.331 MiB/s
try_get_or_intern (empty) 261.80 μs 100.04 MiB/s N\A N\A
try_get_or_intern (filled) 102.61 μs 255.25 MiB/s 447.42 μs 58.535 MiB/s
get (empty) 80.346 μs 325.96 MiB/s N\A N\A
get (filled) 92.669 μs 282.62 MiB/s 439.24 μs 59.626 MiB/s

AHash

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 22.261 μs 1.1489 GiB/s 265.46 μs 98.658 MiB/s
try_resolve 22.378 μs 1.1429 GiB/s 268.58 μs 97.513 MiB/s
get_or_intern (empty) 157.86 μs 165.91 MiB/s N\A N\A
get_or_intern (filled) 56.320 μs 465.02 MiB/s 357.13 μs 73.335 MiB/s
try_get_or_intern (empty) 161.46 μs 162.21 MiB/s N\A N\A
try_get_or_intern (filled) 55.874 μs 468.73 MiB/s 360.25 μs 72.698 MiB/s
get (empty) 43.520 μs 601.79 MiB/s N\A N\A
get (filled) 53.720 μs 487.52 MiB/s 360.66 μs 72.616 MiB/s

FXHash

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
try_resolve 17.289 μs 1.4794 GiB/s 238.29 μs 109.91 MiB/s
resolve 19.833 μs 1.2896 GiB/s 237.05 μs 110.48 MiB/s
get_or_intern (empty) 130.97 μs 199.97 MiB/s N\A N\A
get_or_intern (filled) 42.630 μs 614.35 MiB/s 301.60 μs 86.837 MiB/s
try_get_or_intern (empty) 129.30 μs 202.55 MiB/s N\A N\A
try_get_or_intern (filled) 42.508 μs 616.12 MiB/s 337.29 μs 77.648 MiB/s
get (empty) 28.001 μs 935.30 MiB/s N\A N\A
get (filled) 37.700 μs 694.68 MiB/s 292.15 μs 89.645 MiB/s

RodeoReader

STD RandomState

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 1.9398 μs 13.185 GiB/s 4.3153 μs 5.9269 GiB/s
try_resolve 1.9315 μs 13.242 GiB/s 4.1956 μs 6.0959 GiB/s
resolve_unchecked 1.4416 μs 17.741 GiB/s 3.1204 μs 8.1964 GiB/s
get (empty) 38.886 μs 673.50 MiB/s N\A N\A
get (filled) 56.271 μs 465.42 MiB/s 105.12 μs 249.14 MiB/s

AHash

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 1.9404 μs 13.181 GiB/s 4.1881 μs 6.1069 GiB/s
try_resolve 1.8932 μs 13.509 GiB/s 4.2410 μs 6.0306 GiB/s
resolve_unchecked 1.4128 μs 18.103 GiB/s 3.1691 μs 8.0703 GiB/s
get (empty) 11.952 μs 2.1399 GiB/s N\A N\A
get (filled) 27.093 μs 966.65 MiB/s 56.269 μs 465.44 MiB/s

FXHash

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 1.8987 μs 13.471 GiB/s 4.2117 μs 6.0727 GiB/s
try_resolve 1.9103 μs 13.389 GiB/s 4.2254 μs 6.0529 GiB/s
resolve_unchecked 1.4469 μs 17.677 GiB/s 3.0923 μs 8.2709 GiB/s
get (empty) 12.994 μs 1.9682 GiB/s N\A N\A
get (filled) 29.745 μs 880.49 MiB/s 52.387 μs 499.93 MiB/s

RodeoResolver

Method Time (1 Thread) Throughput (1 Thread) Time (24 Threads) Throughput (24 Threads)
resolve 1.9416 μs 13.172 GiB/s 3.9114 μs 6.5387 GiB/s
try_resolve 1.9264 μs 13.277 GiB/s 3.9289 μs 6.5097 GiB/s
resolve_unchecked 1.6638 μs 15.372 GiB/s 3.1741 μs 8.0578 GiB/s

lasso's People

Contributors

1011x avatar andrew-rj avatar coastalwhite avatar domenicquirl avatar jameysharp avatar jonas-schievink avatar jyn514 avatar kixiron avatar nerosnm avatar nhwn avatar paolobarbolini avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

lasso's Issues

Regression in 0.3.0: The same string is interned differently

Hello,

I've recently come across a very interesting case that seems to cause the interner to give different indices for the same string interned twice.

This is a regression from 0.2.0 to 0.3.0.

I have reduced the case to

use lasso::Rodeo;

fn main() {
    let mut rodeo = Rodeo::default();

    rodeo.get_or_intern("a");
    rodeo.get_or_intern("b");
    rodeo.get_or_intern("c");
    rodeo.get_or_intern("d");
    rodeo.get_or_intern("e");
    rodeo.get_or_intern("f");
    rodeo.get_or_intern("g");
    rodeo.get_or_intern("h");
    rodeo.get_or_intern("i");
    rodeo.get_or_intern("j");
    rodeo.get_or_intern("k");
    rodeo.get_or_intern("l");
    rodeo.get_or_intern("m");
    rodeo.get_or_intern("n");
    rodeo.get_or_intern("o");
    rodeo.get_or_intern("p");
    rodeo.get_or_intern("q");
    rodeo.get_or_intern("r");
    rodeo.get_or_intern("s");
    rodeo.get_or_intern("t");
    rodeo.get_or_intern("u");
    rodeo.get_or_intern("v");
    rodeo.get_or_intern("w");
    rodeo.get_or_intern("x");
    rodeo.get_or_intern("y");
    rodeo.get_or_intern("z");
    rodeo.get_or_intern("aa");
    rodeo.get_or_intern("bb");
    rodeo.get_or_intern("cc");
    rodeo.get_or_intern("dd");
    rodeo.get_or_intern("ee");
    rodeo.get_or_intern("ff");
    rodeo.get_or_intern("gg");
    rodeo.get_or_intern("hh");
    rodeo.get_or_intern("ii");
    rodeo.get_or_intern("jj");
    rodeo.get_or_intern("kk");
    rodeo.get_or_intern("ll");
    rodeo.get_or_intern("mm");
    rodeo.get_or_intern("nn");
    rodeo.get_or_intern("oo");
    rodeo.get_or_intern("pp");
    rodeo.get_or_intern("qq");
    rodeo.get_or_intern("rr");
    rodeo.get_or_intern("ss");
    rodeo.get_or_intern("tt");
    rodeo.get_or_intern("uu");
    rodeo.get_or_intern("vv");
    rodeo.get_or_intern("ww");
    rodeo.get_or_intern("xx");
    rodeo.get_or_intern("yy");
    rodeo.get_or_intern("zz");
    rodeo.get_or_intern("aaa");
    rodeo.get_or_intern("bbb");
    rodeo.get_or_intern("ccc");

    let var = rodeo.get_or_intern("ddd");

    rodeo.get_or_intern("eee");

    dbg!(var, rodeo.get_or_intern("ddd"));
}
[dependencies]
lasso = "0.3.0"

The output that I see from this is

[src/main.rs:66] var = Spur {
    key: 56,
}
[src/main.rs:66] rodeo.get_or_intern("ddd") = Spur {
    key: 58,
}

Possible race condition in ThreadedRodeo

Hi,

Really appreciate this great library! I've encountered a small issue, which I believe could be a race condition in ThreadedRodeo.

It is located in https://github.com/Kixiron/lasso/blob/master/src/multi_threaded.rs#L341-L342

self.map.insert(string, key);
self.strings.insert(key, string);

Suppose a new string is being interned by two threads running at the same time, then the if at line 331 (shard.read().get(...)) would return None for both threads. Both threads continue to line 341, one will call self.map.insert() which will write the string into the map, the second thread would do the same, overwriting the value from the first thread.

This would then create a situation where two Spurs could exist that resolve to identical strings, but not be equal to each other because their keys are different.

One option to address this would be to lock the shard for the duration as we insert into the arena and increment the key. If we would like to avoid locking for that long, perhaps checking the return value from self.map.insert(...) would help, as we can verify that we already added that string in, and just return its key, but then we would need to undo the other operations, and it doesn't seem possible to remove strings from the arena. This way we may end up using slightly more memory and keys than actually needed (which is already the case in this scenario), but at the very least, the Spur equality property would be maintained.

serialize+multi-threaded feature combination doesn't compile

When I try to use both features I get the following compile error:

   Compiling lasso v0.6.0
error[E0599]: no function or associated item named `deserialize` found for struct `Vec<_, _>` in the current scope
   --> /home/xxxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.6.0/src/reader.rs:423:40
    |
423 |         let vector: Vec<String> = Vec::deserialize(deserializer)?;
    |                                        ^^^^^^^^^^^ function or associated item not found in `Vec<_, _>`

error[E0599]: no function or associated item named `deserialize` found for struct `Vec<_, _>` in the current scope
   --> /home/xxxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.6.0/src/resolver.rs:305:40
    |
305 |         let vector: Vec<String> = Vec::deserialize(deserializer)?;
    |                                        ^^^^^^^^^^^ function or associated item not found in `Vec<_, _>`

error[E0599]: no function or associated item named `deserialize` found for struct `Vec<_, _>` in the current scope
   --> /home/xxxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.6.0/src/single_threaded.rs:928:40
    |
928 |         let vector: Vec<String> = Vec::deserialize(deserializer)?;
    |                                        ^^^^^^^^^^^ function or associated item not found in `Vec<_, _>`

error[E0277]: the trait bound `String: Deserialize<'_>` is not satisfied
   --> /home/xxxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.6.0/src/multi_threaded.rs:963:45
    |
963 |         let deser_map: HashMap<String, K> = HashMap::deserialize(deserializer)?;
    |                                             ^^^^^^^^^^^^^^^^^^^^ the trait `Deserialize<'_>` is not implemented for `String`
    |
    = help: the trait `Deserialize<'de>` is implemented for `&'a str`
    = note: required for `hashbrown::HashMap<String, K>` to implement `Deserialize<'_>`

Some errors have detailed explanations: E0277, E0599.
For more information about an error, try `rustc --explain E0277`.
error: could not compile `lasso` due to 4 previous errors

Compilation exited abnormally with code 101 at Tue Nov 15 14:06:59

Issue producing a release build

Thaks for this nice library :) Here is the issue I am facing:

When producing a release build of a library using lasso 0.4 with the following dependencies

serde = { version = "1.0.114", features = ["derive"] }
serde_json = "1.0.56"
typetag = "0.1"
lasso = { version = "0.4", features = ["multi-threaded", "serialize"] }

During

cargo build --release

Following compilation errors appear

error[E0277]: the size for values of type `str` cannot be known at compilation time
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
    |
457 |         let elem: &_ = $slice.get_unchecked($idx);
    |                               ^^^^^^^^^^^^^ doesn't have a size known at compile-time
    |
   ::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:455:49
    |
455 |                 let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
    |                                                 ------------------------------------------- in this macro invocation
    |
    = help: the trait `std::marker::Sized` is not implemented for `str`
    = note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0277]: the size for values of type `str` cannot be known at compilation time
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:433:27
    |
433 |         let mut strings = Vec::with_capacity(capacity.strings);
    |                           ^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
    |
    = help: the trait `std::marker::Sized` is not implemented for `str`
    = note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
    = note: required by `std::vec::Vec::<T>::with_capacity`

error[E0599]: no method named `push` found for struct `std::vec::Vec<str>` in the current scope
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:471:29
    |
471 |                     strings.push(allocated);
    |                             ^^^^ method not found in `std::vec::Vec<str>`
    |
    = note: the method `push` exists but the following trait bounds were not satisfied:
            `str: std::marker::Sized`

error[E0599]: no method named `get_unchecked` found for struct `std::vec::Vec<str>` in the current scope
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
    |
457 |         let elem: &_ = $slice.get_unchecked($idx);
    |                               ^^^^^^^^^^^^^ method not found in `std::vec::Vec<str>`
    |
   ::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:476:38
    |
476 | ...                   unsafe { index_unchecked!(strings, key.into_usize()) };
    |                                ------------------------------------------- in this macro invocation
    |
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0308]: mismatched types
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:490:13
    |
490 |             strings,
    |             ^^^^^^^ expected `&str`, found `str`
    |
    = note: expected struct `std::vec::Vec<&'static str>`
               found struct `std::vec::Vec<str>`

error[E0277]: the size for values of type `str` cannot be known at compilation time
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
    |
457 |         let elem: &_ = $slice.get_unchecked($idx);
    |                               ^^^^^^^^^^^^^ doesn't have a size known at compile-time
    |
   ::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:960:49
    |
960 |                 let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
    |                                                 ------------------------------------------- in this macro invocation
    |
    = help: the trait `std::marker::Sized` is not implemented for `str`
    = note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0277]: the size for values of type `str` cannot be known at compilation time
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:938:27
    |
938 |         let mut strings = Vec::with_capacity(capacity.strings);
    |                           ^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
    |
    = help: the trait `std::marker::Sized` is not implemented for `str`
    = note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
    = note: required by `std::vec::Vec::<T>::with_capacity`

error[E0599]: no method named `push` found for struct `std::vec::Vec<str>` in the current scope
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:976:29
    |
976 |                     strings.push(allocated);
    |                             ^^^^ method not found in `std::vec::Vec<str>`
    |
    = note: the method `push` exists but the following trait bounds were not satisfied:
            `str: std::marker::Sized`

error[E0599]: no method named `get_unchecked` found for struct `std::vec::Vec<str>` in the current scope
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
    |
457 |         let elem: &_ = $slice.get_unchecked($idx);
    |                               ^^^^^^^^^^^^^ method not found in `std::vec::Vec<str>`
    |
   ::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:981:38
    |
981 | ...                   unsafe { index_unchecked!(strings, key.into_usize()) };
    |                                ------------------------------------------- in this macro invocation
    |
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0308]: mismatched types
   --> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:995:13
    |
995 |             strings,
    |             ^^^^^^^ expected `&str`, found `str`
    |
    = note: expected struct `std::vec::Vec<&'static str>`
               found struct `std::vec::Vec<str>`

error: aborting due to 10 previous errors

Some errors have detailed explanations: E0277, E0308, E0599.
For more information about an error, try `rustc --explain E0277`.
error: could not compile `lasso`.

The issue doesn't appear when producing a non release build. Ended moving to lasso 0.3.1 but the Key serializations isn't as nicer as in 0.4.1 :)

Thanks!

`Default` trait is not implemented for rodeo with spur types other than `Spur`

I tried this code

#[derive(Default)]
struct Foo {
    strings: Rodeo<lasso::MiniSpur>,
}

I expected this to happen

The code compiles.

Instead this happened

This does not compile.
Compiler says error[E0277]: the trait bound `lasso::Rodeo<lasso::MiniSpur>: std::default::Default` is not satisfied.

Meta

Lasso version: 0.5.0

Rustc version: 1.52.0

> rustc -Vv
rustc 1.52.0 (88f19c6da 2021-05-03)
binary: rustc
commit-hash: 88f19c6dab716c6281af7602e30f413e809c5974
commit-date: 2021-05-03
host: x86_64-unknown-linux-gnu
release: 1.52.0
LLVM version: 12.0.0

Additional context

impl Default for Rodeo<Spur, RandomState> redirects to Self::new(), and Self::new() is implemented for any Rodeo<K, RandomState> where K: Key.
Is there any reason not to implement impl<K: Key> Default for Rodeo<K, RandomState>?

`raw_entry_mut` to speed up `try_get_or_intern`?

To my knowledge dashmap can't do dat, but on nightly std's hash map and hashbrown on stable can. What this allows you to do is performing a single hash map look-up, no matter whether the input is already interned or needs to be stored.

The big downside is more complicated code, especially with some feature cfgs for nightly, and, well, this only being an optimisation for the ST interner. So all this may be considered »not worth it«.

`Clone` implementation?

Hello, I noticed that Rodeo doesn't have a Clone impl. Is there a reason for this? (Just out of curiosity; it won't be too hard to work around it.)

Compilation failure on nightly rust

I tried this code

Nightly: nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.78.0-nightly (4a0cc881d 2024-03-11)

cargo +nightly check -F serialize

Results in compilation error:

error[E0275]: overflow evaluating the requirement `&_ well-formed`
   --> src/util.rs:445:13
    |
445 |             $slice[$idx]
    |             ^^^^^^^^^^^^
    |
   ::: src/reader.rs:453:49
    |
453 |                 let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
    |                                                 ------------------------------------------- in this macro invocation
    |
    = note: this error originates in the macro `index_unchecked` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0275]: overflow evaluating the requirement `&_ well-formed`
    --> src/util.rs:445:13
     |
445  |             $slice[$idx]
     |             ^^^^^^^^^^^^
     |
    ::: src/rodeo.rs:1132:49
     |
1132 |                 let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
     |                                                 ------------------------------------------- in this macro invocation
     |
     = note: this error originates in the macro `index_unchecked` (in Nightly builds, run with -Z macro-backtrace for more info)

For more information about this error, try `rustc --explain E0275`.
warning: `lasso` (lib) generated 7 warnings
error: could not compile `lasso` (lib) due to 2 previous errors; 7 warnings emitted

Possibly related to #45, but a fix will be required for debug.

A few bits I noticed in 72a56a7 (clean code + perf)

https://github.com/Kixiron/lasso/tree/72a56a76564493d7e191bb46b0f53034452ecd99

  • Arena allocator for the strings to fragment heap memory less?
    • Capacity would then relate to spare bytes in the current arena allocator chunk.
    • You could use the typed-arena crate. Has good code.
  • NonZeroUsize::new(int + 1).unwrap_or_else(|| { can just be a new_unchecked call.
  • You can blanket-implement from_usize for all Keys by just expecting try_from_usize.

Other than that code's lookin' good. °^°b

Feature request: Update to new hashbrown version to avoid duplicated dependencies

There is a new semver version of hashbrown. It would be nice to get a new release of lasso that uses it so we can avoid duplicate dependencies. From the output of cargo deny check:

45 │ ╭ hashbrown 0.13.2 registry+https://github.com/rust-lang/crates.io-index
46 │ │ hashbrown 0.14.3 registry+https://github.com/rust-lang/crates.io-index
   │ ╰──────────────────────────────────────────────────────────────────────^ lock entries
   │  
   = hashbrown v0.13.2
     └── lasso v0.7.2
         └── my_crate 0.1.0
   = hashbrown v0.14.3
     ├── dashmap v5.5.3
     │   └── lasso v0.7.2
     │       └── my_crate 0.1.0
     └── ordered-multimap v0.7.1
         └── rust-ini v0.20.0
             └── my_crate 0.1.0

As can be seen, lasso on it's own (with the multi-threaded feature) is enough to pull in two versions thanks to dashmap depending on the newer version. Other libraries that I use (like rust-ini) also pulls in the newer version.

Deadlock / panics for multi-threading in 0.7.0

Hi, just used lasso for the first time and I was getting deadlocks and panics inside of DashMap - had no idea what I was doing wrong. After like an hour of debugging I noticed that 0.7.0 just came out a week ago so on a hunch I tried 0.6.0 - deadlocks and panics went away!

If it helps here's the panic trace:

   0: std::panicking::begin_panic_handler
             at /rustc/3a8a131e9509c478ece1c58fe0ea2d49463d2300/library\std\src\panicking.rs:577
   1: core::panicking::panic_fmt
             at /rustc/3a8a131e9509c478ece1c58fe0ea2d49463d2300/library\core\src\panicking.rs:67
   2: core::panicking::panic
             at /rustc/3a8a131e9509c478ece1c58fe0ea2d49463d2300/library\core\src\panicking.rs:117
   3: dashmap::mapref::entry::VacantEntry<ref$<str$>,lasso::keys::Spur,ahash::random_state::RandomState>::insert<ref$<str$>,lasso::keys::Spur,ahash::random_state::RandomState>
             at C:\Users\Andy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\dashmap-5.4.0\src\mapref\entry.rs:106
   4: enum2$<dashmap::mapref::entry::Entry<ref$<str$>,lasso::keys::Spur,ahash::random_state::RandomState> >::or_try_insert_with
             at C:\Users\Andy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\dashmap-5.4.0\src\mapref\entry.rs:82
   5: lasso::threaded_rodeo::ThreadedRodeo<lasso::keys::Spur,ahash::random_state::RandomState>::try_get_or_intern
             at C:\Users\Andy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\lasso-0.7.0\src\threaded_rodeo.rs:329
   6: lasso::threaded_rodeo::ThreadedRodeo<lasso::keys::Spur,ahash::random_state::RandomState>::get_or_intern<lasso::keys::Spur,ahash::random_state::RandomState,ref$<str$> >
             at C:\Users\Andy\.cargo\registry\src\index.crates.io-6f17d22bba15001f\lasso-0.7.0\src\threaded_rodeo.rs:289

LockfreeArena::store_str contains a data race

This simple stress test encounters a data race:

use std::sync::Arc;
use rand::{SeedableRng, Rng};
use rand::rngs::SmallRng;

fn main() {
    let rodeo: Arc<lasso::ThreadedRodeo<lasso::Spur>> = Arc::new(lasso::ThreadedRodeo::new());
    let mut threads = Vec::new();
    for _ in 0..64 {
        let rodeo = Arc::clone(&rodeo);
        let handle = std::thread::spawn(move || {
            let mut rng = SmallRng::from_entropy();
            for _ in 0..1_000 {
                rodeo.get_or_intern(rng.gen::<u64>().to_string());
            }
        });
        threads.push(handle);
    }
    for t in threads {
        t.join().unwrap();
    }
}

Miri says:


error: Undefined Behavior: Data race detected between (1) Read on thread `<unnamed>` and (2) Write on thread `<unnamed>` at alloc120819+0x8. (2) just happened here
   --> /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/arenas/atomic_bucket.rs:178:13
    |
178 |             addr_of_mut!((*self.as_ptr()).len).write(new_length);
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Data race detected between (1) Read on thread `<unnamed>` and (2) Write on thread `<unnamed>` at alloc120819+0x8. (2) just happened here
    |
help: and (1) occurred earlier here
   --> src/main.rs:13:17
    |
13  |                 rodeo.get_or_intern(rng.gen::<u64>().to_string());
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    = note: BACKTRACE (of the first span):
    = note: inside `lasso::arenas::atomic_bucket::UniqueBucketRef::set_len` at /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/arenas/atomic_bucket.rs:178:13: 178:65
    = note: inside `lasso::arenas::atomic_bucket::UniqueBucketRef::push_slice` at /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/arenas/atomic_bucket.rs:212:18: 212:49
    = note: inside `lasso::arenas::lockfree::LockfreeArena::store_str` at /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/arenas/lockfree.rs:121:46: 121:70
    = note: inside `lasso::ThreadedRodeo::try_get_or_intern::<std::string::String>` at /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/threaded_rodeo.rs:322:49: 322:83
    = note: inside `lasso::ThreadedRodeo::get_or_intern::<std::string::String>` at /home/ben/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lasso-0.7.0/src/threaded_rodeo.rs:289:9: 289:36
note: inside closure
   --> src/main.rs:13:17
    |
13  |                 rodeo.get_or_intern(rng.gen::<u64>().to_string());
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to previous error; 1 warning emitted

Outside of Miri, this readily manifests as an infinite loop or a panic.

Files with spaces break bazel

Hello,

There are 2 files that spaces in the name: Before Release.md, and To Do.md.

Those 2 break rust bazel projects that use lasso.

Will you consider a PR to fix this?

impl Index<Key> for Rodeo

Indexing an interner by key seems very convenient. Unfortunately none of the Rodeo family of types currently allow for that. I propose that for each of the Rodeo types:

  1. rodeo[key] replace the behavior of rodeo.resolve(key),
  2. rodeo.resolve(key) replace the behavior of rodeo.try_resolve(key),
  3. rodeo.try_resolve() be deprecated.

Obviously (2) will result in an API breaking change. This shouldn't be an issue since lasso is still pre-v1.0.0, but if avoiding this is desired, there are other options. In addition to (1), we could:

  • leave .try_resolve() and deprecate .resolve() (unnecessarily wordy; inconsistent naming),
  • leave both .try_resolve() and .resolve() (there's now more than one way of doing something).

I'm not completely happy with either of these alternative solutions, but then again, I don't even use lasso (yet) so perhaps consideration should be placed on its users rather than my API preferences.

Implement FromIterator

string_interner uses FromIterator to allow this example:

let interner = vec!["Elephant", "Tiger", "Horse", "Tiger"]
    .into_iter()
    .collect::<StringInterner>();

Would be nice to do the same with no boilerplate using lasso!

Rodeo memory use is at least twice the theoretical requirement

(I wrote a comparison of memory usage of different interners. This issue is suggesting some improvements from my own interner implementation that are the cause of the nearly 3x memory usage difference.) You've quoted memory footprint as a key feature of lasso, so I thought you'd be interested to see my analysis. I've not actually measured performance characteristics of any of these tweaks, just the memory usage patterns.

The current definition of the (single threaded) Rodeo is roughly (degenericized)

struct Rodeo {
    arena: Arena<u8>,
    string_to_spur: HashMap<&'self.arena [u8], Spur>,
    spur_to_string: Vec<&'self.arena [u8]>,
}

When using the raw_entry API, the size of the str -> Spur map can be cut at least to a third by instead storing (effectively) HashSet<Spur> (though for raw_entry it's a HashMap<Spur, ()>) and using raw_entry to compare the Spur entries as if they were the string they map to. The cost for this memory saving is an extra indirection involved in hashing/comparing entries in the map (for any str -> Spur conversions).

There is also a further optimization potential for the Spur -> str map. There are two potential options here:

  1. Reduce memory usage of the map by about 25%. Instead of storing the &[u8] directly in the vector, store indexes into the arena. IIUC, these would minimally be { bucket_idx: u32, start_idx: u32, end_idx: u32 } for the current arena. (In my own crate, it's just (u32, u32) as the arena is a single contiguous allocation; however, I believe the less-copies arena approach you have here to probably be significantly better for tail-latency.) The cost would be extra indirection both on Spur -> str lookups and on str -> Spur lookups if the above change to that map is used.

  2. Allow interning of real &'static str in the interner without copying them into internal storage. With a direct Spur -> str map that doesn't go indirect through the arena, this is trivial to support; just put the &'static str into the map. And interning of statically known strings is a surprisingly common and useful thing to be able to do; most contexts that benefit from an interner have some set of known names they know they'll always use and already have baked into the binary's static data section in order to treat them specially. (Here's rustc's list!)

If lasso can incorporate even just the first size optimization for the Spur -> str map, even if just with the hashbrown feature, I'd feel a lot more comfortable recommending lasso instead of maintaining my own interner.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.