Git Product home page Git Product logo

hyperloglog's People

Contributors

clarkduvall avatar dim avatar ianwilkes avatar mysza avatar nieksand avatar sdsykes avatar zserge avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hyperloglog's Issues

Weird results post-merge

I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:

func main() {
	add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
		h := fnv.New64a()
		h.Write([]byte(strconv.Itoa(i)))
		hll.Add(h)
	}

	a, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(a, i%100)
	}
	fmt.Printf("a: %d\n", a.Count())

	b, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(b, (i%100)+5)
	}
	fmt.Printf("b: %d\n", b.Count())

	a.Merge(b)
	fmt.Printf("a+b: %d\n", a.Count())
}

I get the output:

a: 100
b: 100
a+b: 6

So the Counts for a and b individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!

Count error rate very high after merge (hll++)

every time I merge 2 hll++ with precision = 18. I find that the count result is totally wrong

func HashFnv1(str string) hash.Hash64 {
    hashed := fnv.New64a()
    hashed.Write([]byte(str))
    return hashed
}

func BenchmarkHLLPP(b *testing.B) {
    hll1,_ := hyperloglog.NewPlus(18)
    hll2,_ := hyperloglog.NewPlus(18)

    for i := 0; i < 1000; i++ {
        hll1.Add(HashFnv1("entry_1_"+strconv.Itoa(i)))
        hll2.Add(HashFnv1("entry_2_"+strconv.Itoa(i)))
    }

    b.Log("count hll1 = ", hll1.Count())
    b.Log("count hll2 = ", hll2.Count())

    for i := 0; i < b.N; i++ {
        hll1.Merge(hll2)
    }
    b.Log("count hll1 | hll2 =", hll1.Count())
}

gives me this

BenchmarkHLLPP-4       50000         36267 ns/op
--- BENCH: BenchmarkHLLPP-4
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000
    operation_test.go:161: count hll2 =  1000
    operation_test.go:166: count hll1 | hll2 = 244
    operation_test.go:160: count hll1 =  1000

Another case of really large Count value

Similar to #18 I have another case where the Count method returns an insanely large value for a very specific set of inputs.

Here is the input data: hll.txt

And here is the script to test it:

package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"log"

	"github.com/clarkduvall/hyperloglog"
)

func main() {
	a, _ := hyperloglog.NewPlus(12)

	for {
		var l string
		if _, err := fmt.Scanln(&l); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		h := fnv.New64a()
		h.Write([]byte(l))
		a.Add(h)
		//log.Printf("added val %q to hll, count is now: %d", l, a.Count())
	}

	log.Printf("a.Count: %d", a.Count())
}

Like in #18 I narrowed down the dataset so that omitting the last line causes the large number to not be output:

 /tmp ▻ cat hll.txt | ./hlltest                                                   
2017/05/22 14:49:54 a.Count: 18446744073709551604                                 
 /tmp ▻ cat hll.txt | head -n-1| ./hlltest                                        
2017/05/22 14:50:04 a.Count: 20478

Let me know if there's anything I can do to help debug more, thanks!

Extremely large Count() value after a Merge

So this is a super specific test case, and unfortunately I haven't had much luck in boiling it down farther. But for a very specific input we're seeing extremely high Count() values after Merge'ing a populated HLL+ into an empty one.

Attached is the input data: hll.txt

This is the test code:

package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"log"

	"github.com/clarkduvall/hyperloglog"
)

func main() {
	a, _ := hyperloglog.NewPlus(12)

	for {
		var l string
		if _, err := fmt.Scanln(&l); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		h := fnv.New64a()
		h.Write([]byte(l))
		a.Add(h)
		log.Printf("added val %q to hll, count is now: %d", l, a.Count())
	}

	b, _ := hyperloglog.NewPlus(12)
	b.Merge(a)
	log.Printf("b.Count: %d", b.Count())
}

To run:

cat hll.txt | ./hlltest

The output should end with something like:

2016/12/12 14:20:06 b.Count: 18446744073709551615

If you omit the last line like this:

cat hll.txt | head -n-1 | ./hlltest

You get a much more reasonable output:

2016/12/12 14:26:46 b.Count: 3099

Let me know if there's any other information you need to help debug this, thanks!

Large errors hll++ at high cardinality

Once I set my cardinality to 100k using hll++, the error rate becomes massive:

approx:  10506
actual:  100000
delta:   89494

My sample code is pretty much cribbed from your test/bench code. Set the truth variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

Complete repro code:

package main

import (
    "github.com/clarkduvall/hyperloglog"
    "fmt"
    "hash/fnv"
    "hash"
)

func hash64(s string) hash.Hash64 {
    h := fnv.New64()
    h.Write([]byte(s))
    return h
}

func main() {
    truth := 100000

    h, err := hyperloglog.NewPlus(16);
    if err != nil {
        fmt.Println(err)
        return
    }

    for i := 0; i < truth; i++ {
            istr := fmt.Sprintf("%s", i)
            h.Add(hash64(istr))
    }
    epp := h.Count()

    fmt.Println("approx: ", epp)
    fmt.Println("actual: ", truth)
    fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.