clarkduvall / hyperloglog Goto Github PK

View Code? Open in Web Editor NEW

441.0 441.0 48.0 111 KB

HyperLogLog and HyperLogLog++ implementation in Go/Golang.

Home Page: http://godoc.org/github.com/clarkduvall/hyperloglog

License: MIT License

Go 100.00%

hyperloglog's People

Contributors

Stargazers

Watchers

hyperloglog's Issues

Consider panic instead of returning error on invalid precision

Weird results post-merge

I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:

func main() {
	add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
		h := fnv.New64a()
		h.Write([]byte(strconv.Itoa(i)))
		hll.Add(h)
	}

	a, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(a, i%100)
	}
	fmt.Printf("a: %d\n", a.Count())

	b, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(b, (i%100)+5)
	}
	fmt.Printf("b: %d\n", b.Count())

	a.Merge(b)
	fmt.Printf("a+b: %d\n", a.Count())
}

I get the output:

a: 100
b: 100
a+b: 6

So the Counts for a and b individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!

Count error rate very high after merge (hll++)

every time I merge 2 hll++ with precision = 18. I find that the count result is totally wrong

func HashFnv1(str string) hash.Hash64 {
    hashed := fnv.New64a()
    hashed.Write([]byte(str))
    return hashed
}

func BenchmarkHLLPP(b *testing.B) {
    hll1,_ := hyperloglog.NewPlus(18)
    hll2,_ := hyperloglog.NewPlus(18)

    for i := 0; i < 1000; i++ {
        hll1.Add(HashFnv1("entry_1_"+strconv.Itoa(i)))
        hll2.Add(HashFnv1("entry_2_"+strconv.Itoa(i)))
    }

    b.Log("count hll1 = ", hll1.Count())
    b.Log("count hll2 = ", hll2.Count())

    for i := 0; i < b.N; i++ {
        hll1.Merge(hll2)
    }
    b.Log("count hll1 | hll2 =", hll1.Count())
}

gives me this

BenchmarkHLLPP-4       50000         36267 ns/op
--- BENCH: BenchmarkHLLPP-4
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000

Another case of really large Count value

Similar to #18 I have another case where the Count method returns an insanely large value for a very specific set of inputs.

Here is the input data: hll.txt

And here is the script to test it:

package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"log"

	"github.com/clarkduvall/hyperloglog"
)

func main() {
	a, _ := hyperloglog.NewPlus(12)

	for {
		var l string
		if _, err := fmt.Scanln(&l); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		h := fnv.New64a()
		h.Write([]byte(l))
		a.Add(h)
		//log.Printf("added val %q to hll, count is now: %d", l, a.Count())
	}

	log.Printf("a.Count: %d", a.Count())
}

Like in #18 I narrowed down the dataset so that omitting the last line causes the large number to not be output:

 /tmp ▻ cat hll.txt | ./hlltest                                                   
2017/05/22 14:49:54 a.Count: 18446744073709551604                                 
 /tmp ▻ cat hll.txt | head -n-1| ./hlltest                                        
2017/05/22 14:50:04 a.Count: 20478

Let me know if there's anything I can do to help debug more, thanks!

Return boolean on Add if register was modified similar to Redis

http://redis.io/commands/pfadd

Extremely large Count() value after a Merge

So this is a super specific test case, and unfortunately I haven't had much luck in boiling it down farther. But for a very specific input we're seeing extremely high Count() values after Merge'ing a populated HLL+ into an empty one.

Attached is the input data: hll.txt

This is the test code:

package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"log"

	"github.com/clarkduvall/hyperloglog"
)

func main() {
	a, _ := hyperloglog.NewPlus(12)

	for {
		var l string
		if _, err := fmt.Scanln(&l); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		h := fnv.New64a()
		h.Write([]byte(l))
		a.Add(h)
		log.Printf("added val %q to hll, count is now: %d", l, a.Count())
	}

	b, _ := hyperloglog.NewPlus(12)
	b.Merge(a)
	log.Printf("b.Count: %d", b.Count())
}

To run:

cat hll.txt | ./hlltest

The output should end with something like:

2016/12/12 14:20:06 b.Count: 18446744073709551615

If you omit the last line like this:

cat hll.txt | head -n-1 | ./hlltest

You get a much more reasonable output:

2016/12/12 14:26:46 b.Count: 3099

Let me know if there's any other information you need to help debug this, thanks!

Large errors hll++ at high cardinality

Once I set my cardinality to 100k using hll++, the error rate becomes massive:

approx:  10506
actual:  100000
delta:   89494

My sample code is pretty much cribbed from your test/bench code. Set the truth variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

Complete repro code:

package main

import (
    "github.com/clarkduvall/hyperloglog"
    "fmt"
    "hash/fnv"
    "hash"
)

func hash64(s string) hash.Hash64 {
    h := fnv.New64()
    h.Write([]byte(s))
    return h
}

func main() {
    truth := 100000

    h, err := hyperloglog.NewPlus(16);
    if err != nil {
        fmt.Println(err)
        return
    }

    for i := 0; i < truth; i++ {
            istr := fmt.Sprintf("%s", i)
            h.Add(hash64(istr))
    }
    epp := h.Count()

    fmt.Println("approx: ", epp)
    fmt.Println("actual: ", truth)
    fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
}

clarkduvall / hyperloglog Goto Github PK

hyperloglog's People

Contributors

Stargazers

Watchers

Forkers

hyperloglog's Issues

Consider panic instead of returning error on invalid precision

Weird results post-merge

Count error rate very high after merge (hll++)

Another case of really large Count value

Return boolean on Add if register was modified similar to Redis

Extremely large Count() value after a Merge

Large errors hll++ at high cardinality

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent