Git Product home page Git Product logo

patricia_tree's Introduction

patricia_tree

patricia_tree Documentation Actions Status Coverage Status License: MIT

Memory-efficient data structures based on patricia tree (a.k.a, radix tree).

Documentation

A common prefixes of the keys in a patricia tree are represented by a shared path. So if the prefixes of the key set is highly redundant, the memory usage of the resulting patricia tree will be drastically less than more generic data structures (e.g., BTreeMap).

See Radix tree for more details.

Examples

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("bar", 2);
map.insert("baz", 3);
assert_eq!(map.len(), 3);

assert_eq!(map.get("foo"), Some(&1));
assert_eq!(map.get("bar"), Some(&2));
assert_eq!(map.get("baz"), Some(&3));

Benchmarks

$ cargo run --example insert_lines --release -- --version 2> /dev/null
insert_lines 0.1.0

///
/// INPUT: Wikipedia
///
$ curl -s https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz | gzip -d > enwiki-latest-all-titles-in-ns0
$ du -hs enwiki-latest-all-titles-in-ns0
271M    enwiki-latest-all-titles-in-ns0

// HashSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind hash < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 0:10.23
# MEMORY: 1001548  // 978 MB

// BTreeSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind btree < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 0:10.90
# MEMORY: 1112068  // 1,086 MB

// PatriciaSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind patricia < enwiki-latest-all-titles-in-ns0
# LINES: 13450823
# ELAPSED: 1:12.55
# MEMORY: 434340   // 424 MB

///
/// INPUT: Google 5-gram
///
$ curl -s http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-0.gz | gzip -d > googlebooks-eng-all-5gram-20120701-0
$ du -hs googlebooks-eng-all-5gram-20120701-0
331M    googlebooks-eng-all-5gram-20120701-0

// HashSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind hash < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:08.36
# MEMORY: 1115544  // 1,089 MB

// BTreeSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind btree < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:06.85
# MEMORY: 942236   // 920 MB

// PatriciaSet
$ /usr/bin/time -f "# ELAPSED: %E\n# MEMORY: %M" cargo run --example insert_lines --release -- --kind patricia < googlebooks-eng-all-5gram-20120701-0
# LINES: 9814743
# ELAPSED: 0:25.62
# MEMORY: 223616   // 218 MB

patricia_tree's People

Contributors

alexkirsz avatar astariul avatar erichdongubler avatar jakobeha avatar jaxrtech avatar koba-e964 avatar leshow avatar little-dude avatar seespring avatar sile avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

patricia_tree's Issues

Get longest common prefix

I want to segment a long string

Expect:

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("foobar", 2);
assert_eq!(map.get_longest_common_prefix("fo"), "fo");
assert_eq!(map.get_longest_common_prefix("fobaz"), "fo");

Current:

use patricia_tree::PatriciaMap;

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("foobar", 2);
assert_eq!(map.get_longest_common_prefix("fo"), None);
assert_eq!(map.get_longest_common_prefix("fobaz"), None);
assert_eq!(map.get_longest_common_prefix("foo"), Some(("foo".as_bytes(), &1)));
assert_eq!(map.get_longest_common_prefix("fooba"), Some(("foo".as_bytes(), &1)));
assert_eq!(map.get_longest_common_prefix("foobar"), Some(("foobar".as_bytes(), &2)));
assert_eq!(map.get_longest_common_prefix("foobarbaz"), Some(("foobar".as_bytes(), &2)));

possible to get all values in a path?

would it be possible to traverse the tree using a key, picking up all of the matching values in the path?

Something like this:

use patricia_tree::PatriciaMap;

fn main() {
    let mut t = PatriciaMap::new();
    t.insert("z", vec!["z"]);
    t.insert("aba", vec!["a"]);
    t.insert("abb", vec!["b"]);
    t.insert("abc", vec!["c"]);

    assert!(t.get_all_in(&"abd").eq(vec!["a", "b", "c"].iter()));
}

we walk along the tree using the key, gathering up the matches as we go, when it no longer matches, return the gathered results

How to serialize a PatriciaMap ?

I followed the serde tutorial on how to do serialization, and tried to serialize my PatriciaMap with this code :

let mut map = PatriciaMap::new();
map.insert("foo", 1);
map.insert("bar", 2);
map.insert("baz", 3);

let serialized = serde_json::to_string(&map).unwrap();

let deserialized: PatriciaMap<usize> = serde_json::from_str(&serialized).unwrap();

However it fails at deserialization with the following error :

invalid type: sequence, expected a byte string at line 1 column 2


So how can I properly serialize / deserialize a PatriciaMap ?

It is a dangerous trap to mix and match `insert` and `insert_str`

The insert_str methods in #15 seem to lead to a trap/footgun: if you mix and match insert and insert_str, then depending on what exactly you did, sometimes you won't be able to get things again afterwards.

Example:

use patricia_tree::PatriciaMap;

pub fn main() {
    let mut pmap = PatriciaMap::new();
    pmap.insert("ab/»", true);
    pmap.insert("ab/¡", true);
    pmap.insert_str("ab/«", true);

    assert!(*pmap.get("ab/«").unwrap()); // <--- this unwrap fails
    assert!(*pmap.get("ab/»").unwrap());
    assert!(*pmap.get("ab/¡").unwrap());
}

The motivation for insert_str makes sense and I think I can imagine, based on the tree structure, why this issue occurs. Maybe it would have been better for a stringy-PatriciaMap to be a different type, that ensures you can't mix them up?

I don't know whether it is interesting to you or not to break backwards compat for the sake of distinguishing these at the type system level, but it felt like it was worth mentioning. I only realised this issue after I introduced so many inserts that I couldn't actually track down where I was inserting wrong, so in the end I replaced all my insert_str with insert since they're easier to search for in a code base that also uses other kinds of maps :).

Longest common prefix lookup

I'd like to use this library for longest common prefix lookup.

For example, indexing IP networks and testing for most specific inclusion:

let mut map = PatriciaMap::new();
map.insert("110000001010100001100100", "in network 192.168.100.0/24");
map.insert("1100000010101000", "in network 192.168.0.0/16");

// find the most specific network that contains 192.168.100.50/32
map.get_longest_prefix("11000000101010000110010000110010") == "in network 192.168.100.0/24"

I think impl would consist of adding a version of .get() that chains an .or(self.value()) onto the search:

  pub(crate) fn get_longest_prefix(&self, key: &[u8]) -> Option<&V> {
      let common_prefix_len = self.skip_common_prefix(key);
      let next = &key[common_prefix_len..];
      if common_prefix_len == self.label().len() {
          if next.is_empty() {
              self.value()
          } else {
              self.child()
                  .and_then(|child| child.get_longest_prefix(next))
                  .or(self.value()) // Added
          }
      } else if common_prefix_len == 0 && self.label().get(0) <= key.get(0) {
          self.sibling()
              .and_then(|sibling| sibling.get_longest_prefix(next)) // Changed
      } else {
          None
      }
  }

Expose the return type of PatriciaMap::common_prefixes as a concrete type

The return type of PatriciaMap::common_prefixes is impl 'b + Iterator<Item = (&'b [u8], &'a V)>. This causes issues when you need to compose it in your own iterator--for instance, if you build your own data structure on top of a PatriciaMap. Instead, you will either have to Box the iterator, or use the type_alias_impl_trait unstable feature.

It would be better from the perspective of a user of your API is PatriciaMap::common_prefixes returned a CommonPrefixesIter<'a, 'b, V>.

is it possible to make keys unicode safe?

Hey there. Thanks for this library. I'm trying to create a JSON from the tree but it fails with unicode characters as they might get splitted on the tree. It would be nice to have unicode support and not breaking down multibytes chars. Thoughts?

is there any plan for serde derive

it is nice to have Serialize and Deserialize trait for patricia tree. it can save the cost of disk for storing and updating(merging patricia tree) .

PatriciaMap::iter has wrong length

Running the following code:

# Cargo.toml
[package]
name = "patricia_tree_issue"
version = "0.1.0"
edition = "2021"

[dependencies]
patricia_tree = "0.5.0"
# cargo --version
cargo 1.68.0-nightly (2381cbdb4 2022-12-23)
// src/main.rs
use patricia_tree::PatriciaMap;

fn main() {
    let mut map = PatriciaMap::new();
    map.insert("1", 0);
    map.insert("2", 0);
    map.remove("2");
    map.insert("2", 0);

    assert_eq!(map.len(), map.iter().count());
}

... produces the following output.

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `1`', src/main.rs:10:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/panicking.rs:64:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/panicking.rs:201:5
   4: patricia_tree_issue::main
             at ./src/main.rs:10:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/270c94e484e19764a2832ef918c95224eb3f17c7/library/core/src/ops/function.rs:507:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

You would expect the iterator length to match with the map's length. The iter_mut method has the same problem.

Unable to get StringPatriciaMap to work with utf-8 keys

Hello!

I'm trying to build a japanese dictionary app with the StringPatriciaMap as a searchable index for searching the dictionary quickly.

Given the following mini example, this does not work:

    let mut map = StringPatriciaMap::<u8>::new();
    map.insert("インターポール", 1);
    map.insert("インターポル", 2);
    map.insert("インターリーブ", 3);
    map.insert("インターン", 4);

    assert_eq!(map.get("インターポール"), Some(&1)); //works
    assert_eq!(map.get("インターポル"), Some(&2)); //doesn't work

map.get will return Some if I search for the first key, but None if I search for the second or any other key.

This might just be a misunderstanding on how this data structure works, so please enlighten me if that's the case.

Random question

I've looked at this code a bunch and always wondered, what's up with these two conditions. Do you happen to remember? Apologies for the sort of weird ask:

    pub(crate) fn insert(&mut self, key: &[u8], value: V) -> Option<V> {
        if self.label().get(0) > key.get(0) {

This condition also shows up a bunch:

} else if common_prefix_len == 0 && self.label().get(0) <= key.get(0) {

I don't understand why it would matter that the first byte is <=, or conversely on insertion why it would matter if the label being added starts with a larger byte. Any insight?

The condition on insert makes it such that if you have "abc" and insert "dabc" you get "dabc" - sibling -> "abc" instead of the other way around, but I'm not sure why that matters.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.