Git Product home page Git Product logo

fastbloom's Introduction

fastbloom

OSCS Status docs.rs Test Rust Test Python Benchmark Crates Latest Release PyPI Latest Release Sonatype Nexus (Snapshots)

A fast bloom filter | counting bloom filter implemented by Rust for Rust and Python!

Language: 简体中文

setup

Python

requirements

Python >= 3.7

install

Install the latest fastbloom version with:

pip install fastbloom-rs

Rust

fastbloom-rs = "{latest}"

Java

maven

<dependency>
    <groupId>io.github.yankun1992</groupId>
    <artifactId>fastbloom</artifactId>
    <version>{latest-version}</version>
</dependency>

Examples

BloomFilter

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

Reference: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. Full text article

Python

basic usage

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)

bloom.add_str('hello')
bloom.add_bytes(b'world')
bloom.add_int(9527)

assert bloom.contains('hello')
assert bloom.contains(b'world')
assert bloom.contains(9527)

assert not bloom.contains('hello world')

build bloom filter from bytes or list

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)
bloom.add_str('hello')
assert bloom.contains('hello')

bloom2 = BloomFilter.from_bytes(bloom.get_bytes(), bloom.hashes())
assert bloom2.contains('hello')

bloom3 = BloomFilter.from_int_array(bloom.get_int_array(), bloom.hashes())
assert bloom3.contains('hello')

there are some bulk api for python to reduce ffi cost between python and rust

bloom = BloomFilter(100_000_000, 0.01)
inserts = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100]
checks = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100, 190, 290, 390]
results = [True, True, True, True, True, True, True, True, True, True, True, True, False, False, False]

bloom.add_int_batch(inserts)
contains = bloom.contains_int_batch(checks)
assert contains == results

bloom.add_str_batch(list(map(lambda x: str(x), inserts)))
assert bloom.contains_str_batch(list(map(lambda x: str(x), checks))) == results

bloom.add_bytes_batch(list(map(lambda x: bytes(x), inserts)))
assert bloom.contains_bytes_batch(list(map(lambda x: bytes(x), checks))) == results

more examples at py_tests.

Rust

use fastbloom_rs::{BloomFilter, FilterBuilder};

let mut bloom = FilterBuilder::new(100_000_000, 0.01).build_bloom_filter();

bloom.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);
assert_eq!(bloom.contains(b"helloworld!"), false);

more examples at docs.rs

CountingBloomFilter

A Counting Bloom filter works in a similar manner as a regular Bloom filter; however, it is able to keep track of insertions and deletions. In a counting Bloom filter, each entry in the Bloom filter is a small counter associated with a basic Bloom filter bit.

Reference: F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006

Python

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01)
cbf.add('hello')
cbf.add('hello')
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' in cbf  # because 'hello' added twice. 
# If add same element larger than 15 times, then remove 15 times the filter will not contain the element.
cbf.remove('hello')
assert 'hello' not in cbf

A CountingBloomFilter has a four bits counter to save hash index, so when insert an element repeatedly, the counter will spill over quickly. So, you can set enable_repeat_insert to False to check whether the element has added. if it has added, it will not add again. enable_repeat_insert default set to True.

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01, False)
cbf.add('hello')
cbf.add('hello')  # because enable_repeat_insert=False, this addition will not take effect. 
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' not in cbf 

more examples at py_tests.

Rust

use fastbloom_rs::{CountingBloomFilter, FilterBuilder};

let mut builder = FilterBuilder::new(100_000, 0.01);
let mut cbf = builder.build_counting_bloom_filter();
cbf.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);

benchmark

computer info

CPU Memory OS
AMD Ryzen 7 5800U with Radeon Graphics 16G Windows 10

add one str to bloom filter

Benchmark insert one str to bloom filter:

bloom_add_test          time:   [41.168 ns 41.199 ns 41.233 ns]
                        change: [-0.4891% -0.0259% +0.3417%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

add one million to bloom filter

Benchmark loop insert (1..1_000_000).map(|n| { n.to_string() }) to bloom filter:

bloom_add_all_test      time:   [236.24 ms 236.86 ms 237.55 ms]
                        change: [-3.4346% -2.9050% -2.3524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

check one contains in bloom filter

bloom_contains_test     time:   [42.065 ns 42.102 ns 42.156 ns]
                        change: [-0.7830% -0.5901% -0.4029%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  9 (9.00%) high severe

check one not contains in bloom filter

bloom_not_contains_test time:   [22.695 ns 22.727 ns 22.773 ns]
                        change: [-3.1948% -2.9695% -2.7268%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

add one str to counting bloom filter

counting_bloom_add_test time:   [60.822 ns 60.861 ns 60.912 ns]
                        change: [+0.2427% +0.3772% +0.5579%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

add one million to counting bloom filter

Benchmark loop insert (1..1_000_000).map(|n| { n.to_string() }) to counting bloom filter:

counting_bloom_add_million_test
                        time:   [272.48 ms 272.58 ms 272.68 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

fastbloom's People

Contributors

bitcoin-eagle avatar yankun1992 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

fastbloom's Issues

Would it be possible to have add_if_not_contains?

Your library looks great. My use case is in deduplication, so I would be wanting to know if a given string is already in the Bloom filter, and add it if not. I can certainly make a call to contains, and then a call to add_str, but it seems like that will be repeating computing all the hash values. I assume it would run roughly twice as slow as a combined function could. Would that be something that would be possible to add? It would return whether it was added or not. Thanks!

cross language support (java)

Hi!

I'd like to share my bloom filter output to Java services. Are there any Java libs that are compatible with the data structure provided by fastbloom?

Thanks

Request: Method for lookups based on pre-computed hashes

I have a use-case where I need to look up a key within multiple bloom filters of identical fixed size and properties. Unioning the filters together doesn't help in this scenario because I need to identify which bloom filters contain the key. Ideally, I'd create the hash of the key once, then look up the set of hashes within the multiple bloom filters to avoid the re-hashing of the content.

From python, maybe it'd look like:

hashes = bloom.get_hashes('hello')
if bloom.contains_hashes(hashes):
   ... do some stuff ...

Features: enhance bulk operations API.

Now there has some bulk insert API. Eg:

# in BloomFilter
def add_int_batch(self, array: Sequence[int]):
    ...
def add_str_batch(self, array: Sequence[str]):
    ...
...

Add more batch insert API and support batch check API.

Request: Counting bloom filter function to get current value

Another request if possible. For the counting bloom filter, is it possible add a function to get the current count?

Eg:

cbf.add("asdf")
cbf.add("asdf")
print(cbf.count("asdf")) --> 2
print(cbf.count("abcd")) --> 0

Not sure if this is possible since I'm not a bloom filter expert. I think looking at the code it uses 4 bits for the counting, so will be encoding a count between 0 and 15 to track the count, right?

Load from file (wsl, Python)

Hi Yankun,

I testing load from file test.bin created with 10bil items, test.bin with 11.7Gb it load to Ram but need 23.4Gb Ram to load. My pc 16Gb ram and it load swap after loaded (224 second), 6Gb in real Ram and 6Gb in swap and bloomfilter it progress slow.
So it need double ram with .bin file to load and progress quickly ?

Thanks,

Bug: CountingBloomFilter should not use hash1 membership if enable_repeat_insert=true

Found a tricky bug in my code which I think is caused by this.

CountingBloomFilter is using k hashes to track the count, but is also adding one more hash to test membership (hash1). Even if enable_repeat_insert=true, this count is still created and leveraged. I'm not sure if this additional hash is typical design?

Maybe when enable_repeat_insert=true, it should not be using the hash1 anywhere and should use the count>0 as the membership test only.

In my weird case I believe the above is impacting me because:

  1. Test membership of key 'abc', it thinks correctly that it is present because it has non-zero count, but is actually an error collision
  2. I loop decrement until 'abc' is not present anymore to get the count
  3. Now I loop increment 'abc' until it's back to it's original count

Theoretically even with collisions I think this shouldn't impact any of the underlying numbers. But I think it is, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when it processes the first increment it'll add a presence hash. This then breaks some other numbers in my bloom filter causing the error.

Also the addition of hash1 for presence tracking adds to the error rate unnecessarily in the enable_repeat_insert=true scenario.

Serialize bloom filter using serde

Thank you so much for this crate. Many serious applications of fastbloom may need serialization and deserialization support. Is this planned?

Add set cardinality etimation

Hello,

I want to implement a set cardinality estimation function
I'm working on a PR (#11) but I cannot manage to launch my unit tests.
Do you have any advice on how to build the project and launch the rust and python test suite ?

I installed maturin, launched maturin develop --release
then tried to launch the unit tests with python -m unittest py_tests.test_bloom.test_bloom_add
but the test fails.

Thanks for your advices

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.