sile / patricia_tree Goto Github PK

View Code? Open in Web Editor NEW

102.0 7.0 16.0 161 KB

A memory-efficient patricia tree implementation written in Rust

License: MIT License

Rust 100.00%

rust patricia-tree

patricia_tree's Introduction

patricia_tree

Memory-efficient data structures based on patricia tree (a.k.a, radix tree).

Documentation

A common prefixes of the keys in a patricia tree are represented by a shared path. So if the prefixes of the key set is highly redundant, the memory usage of the resulting patricia tree will be drastically less than more generic data structures (e.g., BTreeMap).

See Radix tree for more details.

Examples

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("bar", 2);
map.insert("baz", 3);
assert_eq!(map.len(), 3);

assert_eq!(map.get("foo"), Some(&1));
assert_eq!(map.get("bar"), Some(&2));
assert_eq!(map.get("baz"), Some(&3));

Benchmarks

$ cargo run --example insert_lines --release -- --version 2> /dev/null
insert_lines 0.1.0

///
/// INPUT: Wikipedia
///
$ curl -s https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz | gzip -d > enwiki-latest-all-titles-in-ns0
$ du -hs enwiki-latest-all-titles-in-ns0
271M    enwiki-latest-all-titles-in-ns0

// HashSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind hash < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 0:10.23
# MEMORY: 1001548  // 978 MB

// BTreeSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind btree < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 0:10.90
# MEMORY: 1112068  // 1,086 MB

// PatriciaSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind patricia < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 1:12.55
# MEMORY: 434340   // 424 MB

///
/// INPUT: Google 5-gram
///
$ curl -s http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-0.gz | gzip -d > googlebooks-eng-all-5gram-20120701-0
$ du -hs googlebooks-eng-all-5gram-20120701-0
331M    googlebooks-eng-all-5gram-20120701-0

// HashSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind hash < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:08.36
# MEMORY: 1115544  // 1,089 MB

// BTreeSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind btree < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:06.85
# MEMORY: 942236   // 920 MB

// PatriciaSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind patricia < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:25.62
# MEMORY: 223616   // 218 MB

patricia_tree's People

Contributors

Stargazers

Watchers

Forkers

naokiri koba-e964 leshow jdv9so ives9638 untitaker jaxrtech ferrosync spedoske monoid alexkirsz machworklab jakobeha little-dude liushu-project

patricia_tree's Issues

Get longest common prefix

I want to segment a long string

Expect:

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("foobar", 2);
assert_eq!(map.get_longest_common_prefix("fo"), "fo");
assert_eq!(map.get_longest_common_prefix("fobaz"), "fo");

Current:

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("foobar", 2);
assert_eq!(map.get_longest_common_prefix("fo"), None);
assert_eq!(map.get_longest_common_prefix("fobaz"), None);
assert_eq!(map.get_longest_common_prefix("foo"), Some(("foo".as_bytes(), &1)));
assert_eq!(map.get_longest_common_prefix("fooba"), Some(("foo".as_bytes(), &1)));
assert_eq!(map.get_longest_common_prefix("foobar"), Some(("foobar".as_bytes(), &2)));
assert_eq!(map.get_longest_common_prefix("foobarbaz"), Some(("foobar".as_bytes(), &2)));

possible to get all values in a path?

would it be possible to traverse the tree using a key, picking up all of the matching values in the path?

Something like this:

use patricia_tree::PatriciaMap;

fn main() {
    let mut t = PatriciaMap::new();
    t.insert("z", vec!["z"]);
    t.insert("aba", vec!["a"]);
    t.insert("abb", vec!["b"]);
    t.insert("abc", vec!["c"]);

    assert!(t.get_all_in(&"abd").eq(vec!["a", "b", "c"].iter()));
}

we walk along the tree using the key, gathering up the matches as we go, when it no longer matches, return the gathered results

How to serialize a PatriciaMap ?

I followed the serde tutorial on how to do serialization, and tried to serialize my PatriciaMap with this code :

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("bar", 2);
map.insert("baz", 3);

let serialized = serde_json::to_string(&map).unwrap();

let deserialized: PatriciaMap<usize> = serde_json::from_str(&serialized).unwrap();

However it fails at deserialization with the following error :

invalid type: sequence, expected a byte string at line 1 column 2

So how can I properly serialize / deserialize a PatriciaMap ?

Is it safe to share a read only PatriciaMap across multiple thread?

In another word, is it safe to add unsafe impl<V> Sync for Node<V> {} ?

It is a dangerous trap to mix and match `insert` and `insert_str`

The insert_str methods in #15 seem to lead to a trap/footgun: if you mix and match insert and insert_str, then depending on what exactly you did, sometimes you won't be able to get things again afterwards.

Example:

use patricia_tree::PatriciaMap;

pub fn main() {
    let mut pmap = PatriciaMap::new();
    pmap.insert("ab/»", true);
    pmap.insert("ab/¡", true);
    pmap.insert_str("ab/«", true);

    assert!(*pmap.get("ab/«").unwrap()); // <--- this unwrap fails
    assert!(*pmap.get("ab/»").unwrap());
    assert!(*pmap.get("ab/¡").unwrap());
}

The motivation for insert_str makes sense and I think I can imagine, based on the tree structure, why this issue occurs. Maybe it would have been better for a stringy-PatriciaMap to be a different type, that ensures you can't mix them up?

I don't know whether it is interesting to you or not to break backwards compat for the sake of distinguishing these at the type system level, but it felt like it was worth mentioning. I only realised this issue after I introduced so many inserts that I couldn't actually track down where I was inserting wrong, so in the end I replaced all my insert_str with insert since they're easier to search for in a code base that also uses other kinds of maps :).

Longest common prefix lookup

I'd like to use this library for longest common prefix lookup.

For example, indexing IP networks and testing for most specific inclusion:

let mut map = PatriciaMap::new();
map.insert("110000001010100001100100", "in network 192.168.100.0/24");
map.insert("1100000010101000", "in network 192.168.0.0/16");

// find the most specific network that contains 192.168.100.50/32
map.get_longest_prefix("11000000101010000110010000110010") == "in network 192.168.100.0/24"

I think impl would consist of adding a version of .get() that chains an .or(self.value()) onto the search:

  pub(crate) fn get_longest_prefix(&self, key: &[u8]) -> Option<&V> {
      let common_prefix_len = self.skip_common_prefix(key);
      let next = &key[common_prefix_len..];
      if common_prefix_len == self.label().len() {
          if next.is_empty() {
              self.value()
          } else {
              self.child()
                  .and_then(|child| child.get_longest_prefix(next))
                  .or(self.value()) // Added
          }
      } else if common_prefix_len == 0 && self.label().get(0) <= key.get(0) {
          self.sibling()
              .and_then(|sibling| sibling.get_longest_prefix(next)) // Changed
      } else {
          None
      }
  }

Expose the return type of PatriciaMap::common_prefixes as a concrete type

The return type of PatriciaMap::common_prefixes is impl 'b + Iterator<Item = (&'b [u8], &'a V)>. This causes issues when you need to compose it in your own iterator--for instance, if you build your own data structure on top of a PatriciaMap. Instead, you will either have to Box the iterator, or use the type_alias_impl_trait unstable feature.

It would be better from the perspective of a user of your API is PatriciaMap::common_prefixes returned a CommonPrefixesIter<'a, 'b, V>.

is it possible to make keys unicode safe?

Hey there. Thanks for this library. I'm trying to create a JSON from the tree but it fails with unicode characters as they might get splitted on the tree. It would be nice to have unicode support and not breaking down multibytes chars. Thoughts?

is there any plan for serde derive

it is nice to have Serialize and Deserialize trait for patricia tree. it can save the cost of disk for storing and updating(merging patricia tree) .

PatriciaMap::iter has wrong length

Running the following code:

# Cargo.toml
[package]
name = "patricia_tree_issue"
version = "0.1.0"
edition = "2021"

[dependencies]
patricia_tree = "0.5.0"

# cargo --version
cargo 1.68.0-nightly (2381cbdb4 2022-12-23)

// src/main.rs
use patricia_tree::PatriciaMap;

fn main() {
    let mut map = PatriciaMap::new();
    map.insert("1", 0);
    map.insert("2", 0);
    map.remove("2");
    map.insert("2", 0);

    assert_eq!(map.len(), map.iter().count());
}

... produces the following output.

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `1`', src/main.rs:10:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/panicking.rs:64:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/panicking.rs:201:5
   4: patricia_tree_issue::main
             at ./src/main.rs:10:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/ops/function.rs:507:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

You would expect the iterator length to match with the map's length. The iter_mut method has the same problem.

Unable to get StringPatriciaMap to work with utf-8 keys

Hello!

I'm trying to build a japanese dictionary app with the StringPatriciaMap as a searchable index for searching the dictionary quickly.

Given the following mini example, this does not work:

    let mut map = StringPatriciaMap::<u8>::new();
    map.insert("インターポール", 1);
    map.insert("インターポル", 2);
    map.insert("インターリーブ", 3);
    map.insert("インターン", 4);

    assert_eq!(map.get("インターポール"), Some(&1)); //works
    assert_eq!(map.get("インターポル"), Some(&2)); //doesn't work

map.get will return Some if I search for the first key, but None if I search for the second or any other key.

This might just be a misunderstanding on how this data structure works, so please enlighten me if that's the case.

Random question

I've looked at this code a bunch and always wondered, what's up with these two conditions. Do you happen to remember? Apologies for the sort of weird ask:

    pub(crate) fn insert(&mut self, key: &[u8], value: V) -> Option<V> {
        if self.label().get(0) > key.get(0) {

This condition also shows up a bunch:

} else if common_prefix_len == 0 && self.label().get(0) <= key.get(0) {

I don't understand why it would matter that the first byte is <=, or conversely on insertion why it would matter if the label being added starts with a larger byte. Any insight?

The condition on insert makes it such that if you have "abc" and insert "dabc" you get "dabc" - sibling -> "abc" instead of the other way around, but I'm not sure why that matters.