roaringbitmap / roaring Goto Github PK
View Code? Open in Web Editor NEWRoaring bitmaps in Go (golang), used by InfluxDB, Bleve, DataDog
Home Page: http://roaringbitmap.org/
License: Apache License 2.0
Roaring bitmaps in Go (golang), used by InfluxDB, Bleve, DataDog
Home Page: http://roaringbitmap.org/
License: Apache License 2.0
There are several methods with "fast" variants. Other than a minor interface change, what do they differently? Can you document this?
The performance-critical functions should be rewritten in assembly, to use the popcnt instruction, as in Will's bitset library:
https://github.com/willf/bitset/blob/master/popcnt_amd64.s
Of course, benchmarks should be provided.
Implement a fast "IsSubset" comparison function between bitmaps... It is already implemented in the C version of Roaring...
when roaringarray.go's func rangeOfOnes(start, last int) container is changed to be implemented purely with bitmaps, arrays, or rle16 containers, I see test suite failure. Here are the 3 options:
For context: this
https://github.com/RoaringBitmap/roaring/blob/master/roaringarray.go#L42
is being replaced by:
func rangeOfOnes(start, last int) container {
return newBitmapContainerwithRange(start, last) // option 1
//return newArrayContainerRange(start, last) // option 2
//return newRunContainer16Range(uint64(start), uint64(last)) // option3
}
as in glycerine#1
and here are the suite fails, which are different in each case:
bitmaps for runs, test failures:
func TestDoubleAndNotBug01(t *testing.T) fails, line 1681
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 1681:
Expected: true
Actual: false
----------
arrays for runs: 2 distinct fails:
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 1616:
Expected: 'true'
Actual: 'false'
(Should be equal)
--- FAIL: TestDoubleAdd2 (0.01s)
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 1629:
Expected: 'true'
Actual: 'false'
(Should be equal)
--- FAIL: TestDoubleAdd3 (0.01s)
----------
with rle16:
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 787:
Expected: '96617'
Actual: '9'
(Should be equal)
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 865:
Expected: '35392'
Actual: '1928'
(Should be equal)
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 886:
Expected: '131999'
Actual: '131351'
(Should be equal)
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 933:
Expected: 'true'
Actual: 'false'
(Should be equal)
Failures:
* /Users/jaten/go/src/github.com/glycerine/roaring/roaring_test.go
Line 1589:
Expected: 'true'
Actual: 'false'
(Should be equal)
--- FAIL: TestFlipBigA (0.12s)
A Roaring data structure really stores 32-bit unsigned integers. It is problematic to use as inputs and outputs signed ints.
turning checkbox into issue, from #70:
update the unit tests to aggressively call runOptimize alot, and run the smat tests.
Currently, the binary And, Or, Xor and AndNot in-place functions do not attempt to do the computation in-place.
A genuine in-place implementation, as is available in the Java counterpart https://github.com/lemire/RoaringBitmap would use less memory and be faster in some instances (because it would avoid memory allocation and unnecessary copies).
Hello,
I think i've found a bug in AddMany & BitmapOf methods.
Here is a small test:
package libtest
import (
"github.com/RoaringBitmap/roaring"
"testing"
)
var array = []uint32{5580, 33722, 44031, 57276, 83097}
func TestRoaringBitmapBitmapOf(t *testing.T) {
bmp := roaring.BitmapOf(array...)
if len(array) != int(bmp.GetCardinality()) {
t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
t.FailNow()
}
}
func TestRoaringBitmapAdd(t *testing.T) {
bmp := roaring.NewBitmap()
for _, v := range array {
bmp.Add(v)
}
if len(array) != int(bmp.GetCardinality()) {
t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
t.FailNow()
}
}
func TestRoaringBitmapAddMany(t *testing.T) {
bmp := roaring.NewBitmap()
bmp.AddMany(array)
if len(array) != int(bmp.GetCardinality()) {
t.Errorf("length diff %d!=%d", len(array), bmp.GetCardinality())
t.FailNow()
}
}
Result:
--- FAIL: TestRoaringBitmapBitmapOf (0.00s)
roaring_test.go:17: length diff 5!=4
--- FAIL: TestRoaringBitmapAddMany (0.00s)
roaring_test.go:39: length diff 5!=4
FAIL
the cython implementation by @andreasvc has a nice idea from lucene "that it uses arrays not only when a block contains less than 2 ** 12 elements, but also when it contains more than 2 ** 32 - 2 ** 12 elements; i.e., blocks that are mostly full are stored just as compactly as blocks that are mostly empty"
it would be nice to support that in this version.
We need to compare the performance of this implementation with sensible competitors.
doesn't this overflow int?
https://github.com/RoaringBitmap/roaring/blob/master/util.go#L283
Currently, the OrCardinality function materializes temporary containers that are immediately (?) garbage collected. This is wasteful. Instead, we should have fast container-to-container union functions that just return the cardinality without doing memory allocation, as is the case with the intersections.
Line 416 in db18267
Hi, thanks for the cool work on roaring bitmaps.
Here's something strange that I originally encountered while trying to write some automated fuzz testing for roaring.
It's possible I'm using the Flip() API incorrectly, or the bug is really in the willf.BitSet that I'm comparing roaring against, but please see here for a simple test...
https://github.com/steveyen/roaring/blob/master/roaring_test.go#L1685
bm := roaring.NewBitmap()
bs := bitset.New(0)
bm.AddInt(0)
bs = bs.Set(0)
bm.Flip(1, 2)
bs = bs.Flip(1)
So(equalsBitSet(bs, bm), ShouldEqual, true) // Fails here.
}
often times in the run container operations it appeared faster to convert to bitmaps and the do the operations there. Meaure and see if any of the remaining operations could benefit from this kind of tuning.
Golint warning makes sense to me... but this would break existing code...
type name will be used as roaring.RoaringBitmap by other packages, and that stutters; consider calling this Bitmap
In Java, we are stuck with short, int, long. The Java roaring library effectively emulates uint16 using short (a major pain). However, go has exactly the right native type (uint16).
Unless there is a very good reason for it, I recommend that the type "short" be replaced by uint16 for clarity.
(I could make this change, naturally... but I try to avoid doing such far ranging changes without discussing it first.)
The Java version contains many more optimizations...
You have named the package goroaring. I approve even though the simpler "roaring" could be even better.
There is a real problem however with the repo name (go-roaring) on github. This violates one of the Go's convention that the name of the repository should match the package name:
It would help a lot if this repo will start using http://semver.org spec. Simple git tag v0.1.0
.
Both the C and the java binary search implementations are subject to a classic integer overflow bug in binary search; e.g. at
https://github.com/RoaringBitmap/CRoaring/blob/master/include/roaring/containers/run.h#L91
instead of
int32_t middleIndex = (low + high) >> 1;
prefer:
int32_t middleIndex = low + ((high-low) >> 1);
For better compression when there are long runs of consecutive values, the Java version has run containers (as of version 0.5). They should be implemented :
These can be created by the "runOptimize" function...
These containers can improve the compression ratio in some cases.
It should be possible to iterate over the unset bits.
Though the code is well tested and works, and there has been several contributions to make it run faster, there could still be large inefficiencies. We should profile this code.
When two array containers are intersected, we select the right algorithm according to a heuristic which depends (arbitrarily) on a parameter (64). This parameter was chosen more or less randomly a long time ago. It could be that either a larger or, more likely, a smaller value could be preferable in practice. We should test this out:
https://github.com/RoaringBitmap/roaring/blob/master/setutil.go#L202-L208
Automated continuous integration through Travis would be nice:
http://docs.travis-ci.com/user/languages/go/
I think only the project owner can set this up.
https://github.com/RoaringBitmap/roaring/pull/70/files#diff-fd38fd3c1b1dffc8f36af21de50228f5R1204
line 1203 of rle16.go, at the top of isubtract() I make a copy of the backing slice rc.iv:
func (rc *runContainer16) isubtract(del interval16) {
origiv := make([]interval16, len(rc.iv))
copy(origiv, rc.iv)
...
The copy of rc.iv into origiv is needed for correctness, because subtracting an interval can actually split an existing interval in two and make the overall slice grow.
Hence under the current implementation using a backing slice of intervals, a pathological subtraction of a set of intervals could cause O(N^2) time. For large sets a red-black tree, e.g. https://github.com/glycerine/rbtree, or converting to a bitmapContainer first and doing the subtraction there, is probably faster; but needs to be measured.
This sequence of calls causes a panic on linux/arm and darwin/amd64
bm := NewRoaringBitmap()
bm.AddRange(21, 26)
bm.AddRange(9, 14)
bm.AddRange(11, 16)
The element struct means that our 16-bit keys probably consume a lot more than 16 bits. The code should be rewritten without an element struct for reduced memory usage.
Java's iterators can skip over values :
It should be possible create a RoaringBitmap in Java, save it and load it back in Go. And vice versa.
Currently andCardinaly and orCardinality are implemented in such a way that intermediate containers are constructed and immediately discarded. This could be optimized significantly.
More testing is needed:
currently these functions are just stubbed out. Once implemented, the whole lazy combination system could be groomed and tuned.
package main
import (
"fmt"
"github.com/roaring"
)
func main() {
bm := roaring.NewRoaringBitmap()
roaring.Flip(bm, 0, 10)
fmt.Printf("%v cardinality is: %d\n", bm, bm.GetCardinality())
}
Perhaps the best place to document copyright (by adding something like Copyright 2016 owner <[email protected]>
) would be README.md
. Thanks.
Sometimes the function "AndNot" shows a bug: on a big set and a subset "roaring.AndNot(subset, set)" is not empty, but it should be.
The code and the sets are in the 7zip file at:
https://drive.google.com/file/d/0B7aIxjl31WjSOGVVTE5YNzRuTzA/view?usp=sharing
Keep up the good work,
Enrico
CRoaring
converts "full" bitsets to full runs, for faster processing.
Thoughts on serialization... things that could save a bunch of time...
When I need quick and portable serialization, I typically use msgpack2 from http://msgpack.org. It's a binary version of JSON that has a readable spec and is portable to all languages.
A reason I like msgpack2: there is an easy to use library for Go that is very fast.
Fast to deploy and fast when reading/writing.
https://github.com/tinylib/msgp in particular will parse your .go files and do codegen for your Go structs, making setting up serialization a 30 second task. All that is required is that your struct fields be Exported (capitalized).
Usually compression isn't needed afterwards, depending on the entropy of your data. You have to measure, but sometimes it is a win. If desired, then snappy can be easily applied without sacrificing decoding speed. Other general purpose compressors like gzip and bzip will sometimes compress slightly more, but are super slow to decode, hogging cpu and wasting valuable time at runtime. I usually use my implementation of the streaming snappy format here, https://github.com/glycerine/go-unsnap-stream, as it comes with nice command line tools (snap and unsnap). Since Google open sourced snapped (http://google.github.io/snappy/) quite a while ago, there are libraries for most languages for this as well.
Finally, one other nice thing about msgpack2 is that you don't need to predefine a schema; like JSON the data is self describing. It gives you some data-evolution capability (based of field names rather than field numbers) while being dynamic language friendly. There is no need for a pre-compiler (as in protobufs/thrift/capnproto) and java and python can read msgpack2 with the available libraries. msgpack2 is quite a bit more strongly typed that JSON however, and you can quickly move binary data without a whole lot of base64 encoding fuss.
That said, if one really wants a schema for future proofing the data evolution, then protobufs are also highly tuned and have libraries for the big 4 (Go, C++, Java, python). Also I used to maintain a capnproto implementation for Go, so I can discuss that if it is of interest. However the java bindings were never finished, so I didn't really consider capnproto a choice here where java support is expected.
In Java, the best performing function to compute the union between
many bitmaps is horizontal_or. It uses much less memory than FastOr
and avoids cardinality maintenance overhead.
It should be ported to Go:
The current copy-on-write implementation is not effective. It should be re-engineered.
In theory, adding benchmarks in Go is trivial. It is as easy as adding a test.
Unfortunately, Convey seems to get in the way: whenever I try to run a single benchmark, Convey insists on running all the tests, which takes 2 minutes or so...
I do not know Convey well enough to understand how well it supports benchmarks.
To understand what I have in mind, grab https://github.com/willf/bitset and try...
go test -bench=Iterate
It will run a single benchmark and nothing else. This very handy when optimizing the code. I cannot make this work with Convey.
Right now, we can only add content to a bitmap... but we should be able to remove them as well.
Roaring is ideally suited for parallel execution, as all containers can be operated upon in distinct threads.
../gopath/src/github.com/RoaringBitmap/roaring/roaring.go:1038: too many arguments to return
This code is puzzling to some of us :
https://github.com/tgruben/roaring/blob/master/roaring_test.go#L1321-L1323
c.c. @tgruben
I ran across a paper about HICAMP Bitmaps from folks at Stanford, and wondered if there were any good ideas to be learned from it.
http://web.stanford.edu/~hlitz/papers/a7-wang.pdf
A common filter, used in 90% of my queries, is to restrict a document set to a [begin,end) interval or time-window.
If I was to construct document IDs in chronological order, then this filter can be efficiently represented by a [first, last) interval of two integers.
Q: what is the efficient way to represent/construct such a bitset in Roaring, and use it in intersections?
Wrinkle: I typically have 100e6 documents, so it seems super inefficient (but please correct me if I'm wrong) to do the following:
func CreateContinguousSpanBitmap(begTime, endTime uint64, chronoOrderedDocs []Doc) *roaring.Bitmap {
b := roaring.NewBitmap()
n := len(chronoOrderedDocs)
for i:=uint32(0); i < n; i++ {
b.Add(i)
}
b.RemoveRange(begTime, endTime)
b.Flip(0, n)
return b
}
There's got to be a faster way, no?
Hi,
just wanted to ask for some advice before trying the code out. Say I have a Roaring Bitmap which can contain any unsigned integer below 4,294,967,296. It will be very sparsely populated and the highest integer may be much lower than the maximum. I want to count the set bits below an index. Is the recommended solution to make a new RoaringBitmap of length equal to the index-1 and AND it with the main bitmap and then do a count of the resulting bitmap?
Any advice much appreciated on this, or any other approach. Performance will be the key factor for my use case.
Thanks!
In practice this may not be dangerous, but concurrently performing operations like Or(rb1, rb2)
triggers warnings in Go's race detector due to concurrently writing the roaringarray.needCopyOnWrite
flags on the argument bitmaps.
Example test:
func TestConcurrentBitmapAccess(t *testing.T) {
rb1 := roaring.BitmapOf(0, 1, 10)
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
rb2 := roaring.NewBitmap()
rb2.Or(rb1)
wg.Done()
}()
}
wg.Wait()
}
Run via go test -race
and you should see race warnings like:
==================
WARNING: DATA RACE
Write at 0x00c4201436e8 by goroutine 9:
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*roaringArray).appendCopyMany()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaringarray.go:91 +0x3f6
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.Or()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:686 +0xa73
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*Bitmap).Or()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:578 +0x46
github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess.func1()
/Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:67 +0x2e5
Previous write at 0x00c4201436e8 by goroutine 8:
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*roaringArray).appendCopyMany()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaringarray.go:91 +0x3f6
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.Or()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:686 +0xa73
github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring.(*Bitmap).Or()
/Users/verm/go/src/github.com/figly/datasearch/vendor/github.com/RoaringBitmap/roaring/roaring.go:578 +0x46
github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess.func1()
/Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:67 +0x2e5
Goroutine 9 (running) created at:
github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess()
/Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:69 +0x115
testing.tRunner()
/usr/local/Cellar/go/1.7.1/libexec/src/testing/testing.go:610 +0xc9
Goroutine 8 (finished) created at:
github.com/figly/datasearch/figsearch.TestConcurrentBitmapAccess()
/Users/verm/go/src/github.com/figly/datasearch/figsearch/codec_test.go:69 +0x115
testing.tRunner()
/usr/local/Cellar/go/1.7.1/libexec/src/testing/testing.go:610 +0xc9
==================
A workaround is to call rb.Clone()
before using in any other operations in a separate goroutine.
Possible solutions:
Leave as is but clarify this behavior in documentation. The current documentation gives me the impression that reading bitmaps in concurrent goroutines should be safe if I haven't set them to be copy-on-write. The race warning suggests otherwise. This may be wrong (the racing goroutines are setting the memory to the same value) but at minimum it's worrying and some doc mention would be helpful.
Use atomic loads/stores when accessing that flag. At minimum using an atomic store to implement roaringarray.setNeedsCopyOnWrite
fixes this particular race for me; I'm not sure if other accesses of the flag need similar treatment.
Have operations like Or
use appendWithoutCopy
and appendWithoutCopyMany
if the argument bitmaps aren't copy-on-write. (I haven't thought about what the performance indications are here.)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.