clarkduvall / hyperloglog Goto Github PK
View Code? Open in Web Editor NEWHyperLogLog and HyperLogLog++ implementation in Go/Golang.
Home Page: http://godoc.org/github.com/clarkduvall/hyperloglog
License: MIT License
HyperLogLog and HyperLogLog++ implementation in Go/Golang.
Home Page: http://godoc.org/github.com/clarkduvall/hyperloglog
License: MIT License
I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:
func main() {
add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
h := fnv.New64a()
h.Write([]byte(strconv.Itoa(i)))
hll.Add(h)
}
a, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
add(a, i%100)
}
fmt.Printf("a: %d\n", a.Count())
b, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
add(b, (i%100)+5)
}
fmt.Printf("b: %d\n", b.Count())
a.Merge(b)
fmt.Printf("a+b: %d\n", a.Count())
}
I get the output:
a: 100
b: 100
a+b: 6
So the Counts for a
and b
individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!
every time I merge 2 hll++ with precision = 18. I find that the count result is totally wrong
func HashFnv1(str string) hash.Hash64 {
hashed := fnv.New64a()
hashed.Write([]byte(str))
return hashed
}
func BenchmarkHLLPP(b *testing.B) {
hll1,_ := hyperloglog.NewPlus(18)
hll2,_ := hyperloglog.NewPlus(18)
for i := 0; i < 1000; i++ {
hll1.Add(HashFnv1("entry_1_"+strconv.Itoa(i)))
hll2.Add(HashFnv1("entry_2_"+strconv.Itoa(i)))
}
b.Log("count hll1 = ", hll1.Count())
b.Log("count hll2 = ", hll2.Count())
for i := 0; i < b.N; i++ {
hll1.Merge(hll2)
}
b.Log("count hll1 | hll2 =", hll1.Count())
}
gives me this
BenchmarkHLLPP-4 50000 36267 ns/op
--- BENCH: BenchmarkHLLPP-4
operation_test.go:160: count hll1 = 1000
operation_test.go:161: count hll2 = 1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 = 1000
operation_test.go:161: count hll2 = 1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 = 1000
operation_test.go:161: count hll2 = 1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 = 1000
Similar to #18 I have another case where the Count
method returns an insanely large value for a very specific set of inputs.
Here is the input data: hll.txt
And here is the script to test it:
package main
import (
"fmt"
"hash/fnv"
"io"
"log"
"github.com/clarkduvall/hyperloglog"
)
func main() {
a, _ := hyperloglog.NewPlus(12)
for {
var l string
if _, err := fmt.Scanln(&l); err == io.EOF {
break
} else if err != nil {
panic(err)
}
h := fnv.New64a()
h.Write([]byte(l))
a.Add(h)
//log.Printf("added val %q to hll, count is now: %d", l, a.Count())
}
log.Printf("a.Count: %d", a.Count())
}
Like in #18 I narrowed down the dataset so that omitting the last line causes the large number to not be output:
/tmp ▻ cat hll.txt | ./hlltest
2017/05/22 14:49:54 a.Count: 18446744073709551604
/tmp ▻ cat hll.txt | head -n-1| ./hlltest
2017/05/22 14:50:04 a.Count: 20478
Let me know if there's anything I can do to help debug more, thanks!
So this is a super specific test case, and unfortunately I haven't had much luck in boiling it down farther. But for a very specific input we're seeing extremely high Count() values after Merge'ing a populated HLL+ into an empty one.
Attached is the input data: hll.txt
This is the test code:
package main
import (
"fmt"
"hash/fnv"
"io"
"log"
"github.com/clarkduvall/hyperloglog"
)
func main() {
a, _ := hyperloglog.NewPlus(12)
for {
var l string
if _, err := fmt.Scanln(&l); err == io.EOF {
break
} else if err != nil {
panic(err)
}
h := fnv.New64a()
h.Write([]byte(l))
a.Add(h)
log.Printf("added val %q to hll, count is now: %d", l, a.Count())
}
b, _ := hyperloglog.NewPlus(12)
b.Merge(a)
log.Printf("b.Count: %d", b.Count())
}
To run:
cat hll.txt | ./hlltest
The output should end with something like:
2016/12/12 14:20:06 b.Count: 18446744073709551615
If you omit the last line like this:
cat hll.txt | head -n-1 | ./hlltest
You get a much more reasonable output:
2016/12/12 14:26:46 b.Count: 3099
Let me know if there's any other information you need to help debug this, thanks!
Once I set my cardinality to 100k using hll++, the error rate becomes massive:
approx: 10506
actual: 100000
delta: 89494
My sample code is pretty much cribbed from your test/bench code. Set the truth
variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.
Complete repro code:
package main
import (
"github.com/clarkduvall/hyperloglog"
"fmt"
"hash/fnv"
"hash"
)
func hash64(s string) hash.Hash64 {
h := fnv.New64()
h.Write([]byte(s))
return h
}
func main() {
truth := 100000
h, err := hyperloglog.NewPlus(16);
if err != nil {
fmt.Println(err)
return
}
for i := 0; i < truth; i++ {
istr := fmt.Sprintf("%s", i)
h.Add(hash64(istr))
}
epp := h.Count()
fmt.Println("approx: ", epp)
fmt.Println("actual: ", truth)
fmt.Println("delta: ", int64(truth)-int64(h.Count()))
}
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.