bits-and-blooms / bloom Goto Github PK

View Code? Open in Web Editor NEW

2.3K 41.0 230.0 123 KB

Go package implementing Bloom filters, used by Milvus and Beego.

License: BSD 2-Clause "Simplified" License

Go 84.92% Makefile 15.08%

bloom bloom-filters go

bloom's Introduction

Bloom filters

This library is used by popular systems such as Milvus and beego.

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    filter := bloom.NewWithEstimates(1000000, 0.01)

You should call NewWithEstimates conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

    if filter.Test([]byte("Love"))

For numerical data, we recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)

Godoc documentation: https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3

Installation

go get -u github.com/bits-and-blooms/bloom/v3

Verifying the False Positive Rate

Sometimes, the actual false positive rate may differ (slightly) from the theoretical false positive rate. We have a function to estimate the false positive rate of a Bloom filter with m bits and k hashing functions for a set of size n:

    if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 ...

You can use it to validate the computed m, k parameters:

    m, k := bloom.EstimateParameters(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(m, k, n)

    f := bloom.NewWithEstimates(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(f.m, f.k, n)

You would expect ActualfpRate to be close to the desired false-positive rate fp in these cases.

The EstimateFalsePositiveRate function creates a temporary Bloom filter. It is also relatively expensive and only meant for validation.

Serialization

You can read and write the Bloom filters as follows:

	f := New(1000, 4)
	var buf bytes.Buffer
	bytesWritten, err := f.WriteTo(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	var g BloomFilter
	bytesRead, err := g.ReadFrom(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	if bytesRead != bytesWritten {
		t.Errorf("read unexpected number of bytes %d != %d", bytesRead, bytesWritten)
	}

Performance tip: When reading and writing to a file or a network connection, you may get better performance by wrapping your streams with bufio instances.

E.g.,

	f, err := os.Create("myfile")
	w := bufio.NewWriter(f)

	f, err := os.Open("myfile")
	r := bufio.NewReader(f)

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

bloom's People

Contributors

Stargazers

Watchers

Forkers

andradeandrey drhodes mjarco xushiwei liluxdev channelmeter fd newblue cenkalti rsrsps unixfile amazonthrowaway wheelcomplex thepaul archs eliothedeman kyledj ondi bigtong lucifadva zumur godeep autoroute pombreda erikdubbelboer j0m advancedlogic digdeeply pombredanne turgon zoutaiqi wchh gitter-badger arussellsaw frairon zouhuigang gtfierro sunbenxin db7 lovoo tompao carolma leonklingele wilhg tecnickcom ajaybodhe giddily swxctx hilalisa jzhongming m3db 17media aravind-kumar teddy-hoo lucianthorr tmthrgd zhanglei lyw007 aries-wang russellwmy xingzhexiaozhu maciej chhsiao1981 carlsama vinitkantrod traetox waterem rusenask danieldc metabd gongxun0928 jameschen79 imbuba xianqingv5 emperorearth huangnauh sheldonzhong sakishen 00arthur00 danceiny my-stars pnordahl go-packages annchain jared2501 backwardn kungehero arlendotcn fantboy trendingtechnology kirill-shkodkin segmentio anissac jiamingming axamon flyzhang007 shilindev ldkingvivi jiwubu joe-elliott

bloom's Issues

why not use uint8 on bloom bitset?

the bitset is
type BitSet struct { length uint set []uint64 }
now.
cause the bloomFilter used more memory

Missing /v3 suffix in tagged revision

Hello,
when trying to add dependency:

github.com/bits-and-blooms/bloom/v3 v3.0.0

I get

go: github.com/bits-and-blooms/bloom/[email protected]: go.mod has non-.../v3 module path "github.com/bits-and-blooms/bloom" (and .../v3/go.mod does not exist) at revision v3.0.0

I found that the /v3 suffix has been added to master, but is still missing on v3.0.0 tag.

New release tag

Could you possibly set another release tag? v1 is missing some of the useful String functions and JSON marshaling which I'd like to use with the security of gopkg.

Thanks!

how much memory is used per item / false positive percentage?

lower false positive rates = higher memory used. is there a vague guide on the number of bytes?

    filter := bloom.NewWithEstimates(1000000, 0.01)

for a cdn deployment with 100 million items, what's a good number for the bloom item count and false positive rate?
also how much memory is needed for this bloom filter?
i was thinking if it's using a lot of memory, would it be advisable to use roaring bitmaps as "replacement" to bloom filter as suggested by chatgpt here:
RoaringBitmap/roaring#388

Backward compatibillity break test failures on s390x

With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

--- FAIL: TestHashBasic (0.00s)
    murmur_test.go:29: Backward compatibillity break.
    murmur_test.go:29: Backward compatibillity break.
[...]
--- FAIL: TestHashRandom (0.02s)
    murmur_test.go:60: Backward compatibillity break.
    murmur_test.go:60: Backward compatibillity break.
[...]
FAIL
exit status 1
FAIL	github.com/willf/bloom	0.217s

You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

Strange false positive with single item capacity filter

Greetings,

We've encountered a strange false positive case with a filter provisioned with single capacity and 1:billion error rate. If we increase the filter capacity to two or higher the false positive disappears, or if we keep the capacity at one and use a different error rate (e.g. 1:million) the false positive disappears.

A test to illustrate against the latest release v3.6.0:

func TestBitsAndBloom(t *testing.T) {
  assert := assert.New(t)

  // Capacity = 1 with 1:million error rate - passes
  filter := bitsandbloom.NewWithEstimates(1, 0.00000001)
  filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
  assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist")      // Passes
  assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Passes

  // Capacity = 2 with 1:billion error rate - passes
  filter = bitsandbloom.NewWithEstimates(2, 0.000000001)
  filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
  assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist")      // Passes
  assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Passes

  // Capacity = 1 with 1:billion error rate - fails
  filter = bitsandbloom.NewWithEstimates(1, 0.000000001)
  filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
  assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist")      // Passes
  assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Fails
}

We appreciate a single item capacity filter isn't the intended use case of a bloom filter (an edge case on our side), but the behaviour is curious. If not considered a bug, could readme guidance be included on any minimum capacity to achieve the required error rate?

Thanks,
-Dotjim

Applying the bloom filter for HASH160 addresses

Example Address List.

0af80f98e511577e5c87d56069896c70fe4a6263
0af81021340c23bb20429e47c3bdd62137c322d3
0af811994346cdca9d6dbe8e7da4f3393130fc71
0af81250856d377cf7f6914ebe196a9570ada313
0af814b46c4ef5c57dbc53ae03c812eea37d4aac
0af816bb57c0adac8bca0401d295724cd80b0956
64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
64dfb8c23802f3aa67c9078c0e7970fa17584892
64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
64dfba34a39823c273a33fadf49112c063ec3724

It gives quite a lot of positive and false results and for this reason I cannot use it.
I want to use bloom filter to filter about 25 million hash160 addresses.
How can I use the settings more efficiently?

I want to use it for this project.
https://github.com/oguzdelioglu/odelbrain

divide by zero error if number of bits is 0

I'm creating bloom filter from a list of strings with NewWithEstimates. I pass double size of the list as argument n. If the list is empty, n is 0.

n := 0
f := bloom.NewWithEstimates(n * 2, 0.01)
f.Get([]byte("test"))
panic: runtime error: integer divide by zero

I can add a check for n == 0 and put another value instead of 0, however it would be nice if the module itself did this check:

diff --git a/bloom.go b/bloom.go
index e19ba10..b7f74c7 100644
--- a/bloom.go
+++ b/bloom.go
@@ -72,6 +72,9 @@ type BloomFilter struct {
 
 // New creates a new Bloom filter with _m_ bits and _k_ hashing functions
 func New(m uint, k uint) *BloomFilter {
+	if m == 0 {
+		m = 1
+	}
 	return &BloomFilter{m, k, bitset.New(m)}
 }

TestAndAdd always returns false

// Equivalent to calling Test(data) then Add(data). Returns the result of Test.
func (f *BloomFilter) TestAndAdd(data []byte) bool {
f.locations(data)
for i := uint(0); i < f.k; i++ {
if !f.b.Test(f.locBuff[i]) {
f.present = false
}
f.b.Set(f.locBuff[i])
}
return f.present
}

needs f.present = true initialization

OOB read using ReadFrom

When upgrading to latest version we observe this error:

panic: runtime error: index out of range [14977] with length 14977

goroutine 761 [running]:
github.com/bits-and-blooms/bitset.(*BitSet).ReadFrom(0xc00036ff00, {0x50e95a0?, 0xc0003b8dc0?})
	github.com/bits-and-blooms/[email protected]/bitset.go:955 +0x392
github.com/bits-and-blooms/bloom/v3.(*BloomFilter).ReadFrom(0xc001096a50, {0x50e95a0, 0xc0003b8dc0})
	github.com/bits-and-blooms/bloom/[email protected]/bloom.go:367 +0xd6
github.com/minio/minio/cmd.(*dataUpdateTracker).deserialize(0xc000eb4850, {0x50e95a0, 0xc0003b8dc0}, {0x1b?, 0xc001344f00?, 0x0?})
	github.com/minio/minio/cmd/data-update-tracker.go:434 +0x487
github.com/minio/minio/cmd.(*dataUpdateTracker).load(0xc000eb4850, {0x50fa338, 0xc000ff7280}, {0xc001254700?, 0x64, 0xc000128c80?})
[...]

We are de-serializing into a new instance of &bloom.BloomFilter{}.

Possible regression from bits-and-blooms/bitset#103

Problem with uniformity of FNV implementation

Hi Will,

Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.

I tested five different methods of finding bit locations:

your current implementation
an implementation using Murmur3
your implementation preceding changeset ed0c3ad
an implementation using Go's native FNV-1
an implementation using Go's native FNV-1a

The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

I think the problem stems from your addition of the index on line 77.
That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher,
Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1. I.e.: Uniformity is critical or the math doesn't hold up.

I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

I hope this was clear enough, and you can find it useful. Thank you!

Is one hash enough?

Hi Will, is one hash really enough for a bloom filter?

Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

Non-fnv-collisions (same bucket chosen from a different hash) do benefit still though, and there will be more of those.

I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

Cheers,
Bruce

is there a way to save results to disk when reboot?

Appears to be data races in Concurrent methods

WARNING: DATA RACE
Read at 0x0000002f40e8 by goroutine 8:
  github.com/willf/bloom.fnv64Hash()
      github.com/willf/bloom/_test/_obj_test/bloom.go:86 +0x134
  github.com/willf/bloom.baseHashes()
      github.com/willf/bloom/_test/_obj_test/bloom.go:95 +0x92
  github.com/willf/bloom.(*BloomFilter).Test()
      github.com/willf/bloom/_test/_obj_test/bloom.go:192 +0x92
  github.com/willf/bloom.TestConcurrent.func2()
      /Users/will/code/go/src/github.com/willf/bloom/bloom_test.go:43 +0x83

WriteTo() variant that exports only the bitset

Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

I'm eager to submit a PR to change that if we could agree on the design.

Of the top of my head we could do one of those things:

expose BloomFilter.b – the simplest solution. However will allow developers to shoot themselves in the foot.
add a function that would return a copy of BloomFilter.b. *BloomFilter.BisetCopy?
add a *BloomFilter.WriteBitSetTo() function?
and my favourite: add a *BloomFilter.BitSetWriter() function that would return a

type BitSetWriter interface {
  WriterTo
}

go get fails due to bitset include

When I type in

go get github.com/willf/bloom

I get the result

package bitset: unrecognized import path "bitset"

I see that you wrote the package bit set, however, the way you import it doesn't work with go get. I'm not sure if a bug, but something to consider.

Edit: I found your other package. Unfortunately, there is a change before it works. I believe the Sum function requires the []byte to be passed in.

HI WILL

this is spam

hack day December 2, 2016

I'll be working on this repo and the underlying bitset project on Dec 2 as part of GitHub's open source Friday efforts.

If you have suggestions, please put them in comments below.

test

Incorrect return value for EstimateFalsePositiveRate

At line 77 EstimateFalsePositiveRate returns the number of false positives found divided by 100. This makes no sense as the number of tests performed depends on the input number. Choosing a higher n will now increase the number of false positives by a too high amount.

The correct return value would be the number of false positives found depending on the number of tests performed:

fp_rate = float64(fp) / float64(rounds) * float64(100)

Max number of elements

What is max number of elements in bloom filter? If it will be 200-300 millions - its ok?

Issue when importing package from master branch

when I importing bloom from master branch go get -u github.com/bits-and-blooms/bloom@master error appear that bitset package has invalid go.mod module github.com/willf/bitset

Is this library support persistence feature ?

hi,

Is this library support persistence feature ?
such as :
filter := bloom.New(20*n, 5)

filter.save(file_name);
or
filter.load(file_name);

thanks.

The VERSION file is correct?

as title, the release is 3.1.0, but ..
https://github.com/bits-and-blooms/bloom/blob/master/VERSION

bits-and-blooms can ensure thread-safe/go-routine safe？

if not，i should use lock/unlock to use Add method？thanks for your answer

Looking for new maintainer

I am not writing Go code anymore. Would someone like to take over as maintainer? @lemire ?

panic: runtime error: integer divide by zero when n is zero

bloom.NewWithEstimates(0, 1.0e-6).TestString("boom")

False positive rate increases dramatically on big sets

I wrote a simple code to test the FP%, and found that FP% was good (~2%) on small sets, but suddenly began increasing at some scale, and reached 100% very quickly.

func data(i uint) []byte {
	return []byte(fmt.Sprintf("%x", md5.Sum([]byte(string(i)))))
}

func falsePositive(n, b, k uint) float64 {
	f := bloom.New(uint(n * b), k)
	for i := uint(0); i < n; i++ {
		f.Add(data(i))
	}
	fp, rounds := 0, 10000
	for i := 0; i < rounds; i++ {
		if f.Test(data(uint(i) + n)) {
			fp++
		}
	}
	return float64(fp)/float64(rounds)
}

func main() {
	fmt.Println(falsePositive(1000, 8, 4))
	fmt.Println(falsePositive(10000, 8, 4))
	fmt.Println(falsePositive(100000, 8, 4))
	fmt.Println(falsePositive(1000000, 8, 4))
	fmt.Println(falsePositive(1104000, 8, 4))
	fmt.Println(falsePositive(1108000, 8, 4))
	fmt.Println(falsePositive(1112000, 8, 4))
	fmt.Println(falsePositive(1116000, 8, 4))
}

On macbook pro 64-bit, the output was:

My project needs a filter over tens of million items, but the filter fails at scale of ~1 million due to 100% false positives.

Create SECURITY.md

Hey there!

I belong to an open source security research community, and a member (@akincibor) has found an issue, but doesn’t know the best way to disclose it.

If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

Thank you for your consideration, and I look forward to hearing from you!

(cc @huntr-helper)

What license is this code released under?

Hi,

I couldn't figure out any other way to contact you about this. What license are you releasing this code under?

Thanks!

murmur3 over xxHash?

Is this matter of licensing issues or are there technical reasons to use murmur3 over xxHash?

Typo in README: binary/encoding

s@binary/encoding@encoding/binary@

Wrong sum256 hash value on big endian host architecture

On big-endian architecture bytes of data unpacked (1, 2) to uint64 as big-endian resulting in wrong hash value. Data bytes must be unpacked as little-endian for correct hash calculation.

Possible way to fix that issue:

diff --git a/murmur.go b/murmur.go
index c8947e1..c93b1ba 100644
--- a/murmur.go
+++ b/murmur.go
@@ -39,6 +39,7 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 package bloom
 
 import (
+	"encoding/binary"
 	"math/bits"
 	"unsafe"
 )
@@ -59,8 +60,8 @@ type digest128 struct {
 func (d *digest128) bmix(p []byte) {
 	nblocks := len(p) / block_size
 	for i := 0; i < nblocks; i++ {
-		t := (*[2]uint64)(unsafe.Pointer(&p[i*block_size]))
-		k1, k2 := t[0], t[1]
+		b := (*[16]byte)(unsafe.Pointer(&p[i*block_size]))
+		k1, k2 := binary.LittleEndian.Uint64(b[:8]), binary.LittleEndian.Uint64(b[8:])
 		d.bmix_words(k1, k2)
 	}
 }
@@ -269,8 +270,8 @@ func (d *digest128) sum256(data []byte) (hash1, hash2, hash3, hash4 uint64) {
 	// we do not want to append to an actual array!!!
 	if tail_length+1 == block_size {
 		// We are left with no tail!!!
-		word1 := *(*uint64)(unsafe.Pointer(&tail[0]))
-		word2 := uint64(*(*uint32)(unsafe.Pointer(&tail[8])))
+		word1 := binary.LittleEndian.Uint64(tail[:8])
+		word2 := uint64(binary.LittleEndian.Uint32(tail[8:8+4]))
 		word2 = word2 | (uint64(tail[12]) << 32) | (uint64(tail[13]) << 40) | (uint64(tail[14]) << 48)
 		// We append 1.
 		word2 = word2 | (uint64(1) << 56)

Error in readme.md of v3

The link to reference in pkg.go.dev should be https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3 other than https://pkg.go.dev/github.com/bits-and-blooms/bloom . The latter one is the doc for v2, the former one is for v3

Incorrect tests

Some of the tests are faulty. TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on wikipedia). We get that for 10000 entries and a 0.0001% precision (p = 0.000001 because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range and estimateParameters uses a 1-0 range) the hash should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) there for result in a much higher false positive chance of 0.03%.

Even with the #20 fix the tests still pass but this is only because the Fnv hashing functions used results in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many other result in a much higher false positive rate than expected.

For testing I quickly hacked a version which uses the Sha256 hashing function and the research from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.

module declares its path as github.com/bits-and-blooms/bitset but require github.com/willf/bitset

go get -u github.com/willf/bloom

NOTICE:

go: downloading github.com/willf/bloom v1.0.0
go: downloading github.com/willf/bloom v2.0.3+incompatible
go: downloading github.com/willf/bitset v1.2.2
go: downloading github.com/spaolacci/murmur3 v1.1.0
go get: github.com/willf/bitset@none updating to
        github.com/willf/[email protected]: parsing go.mod:
        module declares its path as: github.com/bits-and-blooms/bitset
                but was required as: github.com/willf/bitset

Question for hashing strings and order inside it

Hey,

I have a question if it is possible to hash strings not depending on the order of strings? I mean

hash("black cat") = hash("cat black")

this generates false negatives because changing order of sentence results in totally different hash. Is there any option for it? Maybe splitting sentences and combining hashed strings?

Thank you for any response.

go.sum missing

EstimateFalsePositiveRate rounds should be configurable?

After some confusion I realized EstimateFalsePositiveRate only does 10000 rounds, which is not enough to accurately estimate probability below 0.5% or so. I did the following, but maybe the number of rounds (here, x) should be a parameter of the function so the user can configure it.

const x = 10000000
for i := uint32(0); i < uint32(x); i++ {
    binary.BigEndian.PutUint32(n1, i+uint32(n)+1)
    if f.Test(n1) {
        fp++
    }
}
fp_rate = float64(fp) / float64(x)

Thanks for the library, by the way, it's awesome!

Add TestOrAdd, TestOrAddString

@lemire please review #69

Dead link in README file

The link in your README to the murmerhash algorithm goes to a dead link.

Meaning of this line?

n := uint(1000)
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love")

What does it mean by load of 20,5 keys and why are we multiplying by 1000 ? would love an explanation. thanks

2x 128 bit hash instead of 2x 64 bit hash. Why?

I noticed that for computing multiple hashes, you make use of the work of Less Hashing, Same Performance, which is calculated by: gi(x) = h1(x)+ih2(x) . For that you generate a 256 bits long hash partitioned into 4 uint64s. So I wonder why you decided to generate a 256 long hash partitioned into 4 uint64s. instead of 128 long hash partitioned into 2 uint64s 🤔 Wouldn't it be the same regarding hashing, but with a performance improvement?

feature: `Clear`

I use bloom with cache to check whether entry is exist or not before access to cache.

above situation, to reduce false positive over the long time, i need to refresh bloom filter periodically.

Is possible adding to Clear method which set bit to zero? because i want to clear partial bits which are sampled not all.