konsumlamm / rrb-vector Goto Github PK

An implementation of a Relaxed Radix Balanced Vector in Haskell.

License: BSD 3-Clause "New" or "Revised" License

Haskell 100.00%

rrb-vector's Introduction

rbb-vector

An implementation of a Relaxed Radix Balanced Vector (RRB-Vector), an efficient sequence data structure. It supports fast indexing, iteration, concatenation and splitting.

For more information, see rrb-vector on Hackage.

rrb-vector's People

Contributors

Stargazers

Watchers

Forkers

boarders sgraf812 tracsis meooow25

rrb-vector's Issues

Reorder arguments to index, adjust, and adjust'

For other arrays, containers, strings, etc., including Sequence that users will often be converting from, index takes the container before the Int, and adjust and adjust' take the (a -> a) function as the first argument. In addition to being consistent with these cases, I think one can argue that the standard order is more often useful for currying.

I realize this is a breaking change, but I think it'd be much better to make it now, before there are too many users, rather than wait. Users that are surprised by the change should at least get a compiler error, plus you'll obviously bump the library's version number.

Thanks very much for this library!! Sorry to nit-pick.

Transient API

RRB trees are appealing because like vectors, they offer an even more efficient transient (as opposed to persistent) API based on unsafe mutable primitives.

Can we have such a module in this library? Plus maybe another module that wraps the unsafe API with linear types.

Pattern synonyms?

Hi, @konsumlamm, thanks for publishing rrb-vector. I'd like to try it out as a replacement for Seqs in a few projects I'm working on. When working with Seqs, I find using the bundled pattern synonyms quite convenient. Would you be interested in adding similar pattern synonyms for Vector (I didn't see any in the haddocks)? If you think it might be a good choice, I could try to implement them and open a PR. Thanks!

Question: Why sliceable arrays?

Why are the arrays stored in the RRB-tree nodes O(1)-sliceable?
It adds a memory cost of 2 words per node, but the benefit is not clear to me from looking through the code.
The original paper describing the tree and its Scala implementation use regular fixed-length arrays, so reading that didn't shed light on it either.

`unzip = unzipWith id`

It would be nice to define unzip simply as unzipWith id, but we need to make sure that this doesn't decrease its performance.

Sorting utilities

It would be useful to have sorting utilities that function directly on rrb vectors so conversions between types to get sorting aren't necessary; sortBy at the very least would be useful.

Add value-strict interface

I would find a value-strict interface to this data structure helpful.

The proposal is to add a new module called Data.RRBVector.Strict which exports a newtype wrapper around the existing type, and functions and typeclass instances that are all value-strict.

If this sounds OK I'd be happy to prepare a PR.

Property tests for invariants

I would like to add some tests that check that structure invariants of the tree are not violated, if that sounds ok.
Functions <|, ><, insertAt, deleteAt seem to be violating invariants.

lens type-classes instances

Supporting the appropriate type-classes from lens like Seq does could make transitioning from Seq to Vector as simple as just changing the used type constructor for code using lens.

Unboxed RRB-Vector

Originally proposed by u/sansboarders on reddit.

Using unboxed arrays in the tree structure would probably improve performance (note that I haven't done any benchmarks yet, so I'm not sure). There are a few different "kinds of unboxed values" to consider:

I don't know what the differences (in performance and ergonomics) between these three are or if it's possible to use these for the inner nodes (instead of only the leafs), so input on this is welcome.

One thing I'd like to avoid is duplicating too much code for this. One solution would be to implement a common typeclass, although this would probably make things slower (unless everything is specialized?) and type inference worse, so I'm not sure it's worth the trouble.

Improve `Arbitrary1` instance

Currently, the Arbitrary1 instance in the test suite only generates balanced and a specific kind of unbalanced trees. It would be nice to have more kinds of unbalanced trees or trees resulting from applying several operations to an initially balanced tree.

Optimized `replicate`

Currently, replicate n x can be implemented as fromList (replicate n x) (using replicate for lists).

However, this can be optimized by sharing as many nodes as possible, avoiding unnecessary allocations. This would not only be faster than using fromList, but also make the resulting vector consume less memory.

`cons` and `snoc` don't respect the offset of an existing Array

When you cons or snoc a vector that previously had its head removed, the resultant Vector has the head in it.

This is due to the copying in the cons and snoc functions ignoring the offset in the original Arrays. See here: https://hackage.haskell.org/package/rrb-vector-0.2.0.1/docs/src/Data.RRBVector.Internal.Array.html#cons

Reproduction below.

 > import Data.RRBVector
 > x = fromList "ab"
 > viewl x
 Just ('a',fromList "b")
 > let Just (_, y) = viewl x
 > y
 fromList "b"
 > 'a' <| y
 fromList "aa"

(Semi-)Sparse Vector?

Your folks' great work has inspired me to think about a sparse vector/array type, that is usually (much) more space efficient than IntMap. I'm thinking of a Word64 of bits for present/missing, together with a SmallArray of the non-missing elements (for dimension <= 64) or sparse subvectors. Hopefully with bit-counting instructions for countLeadingZeros or countTrailingZeros, indexing would be pretty fast.

My application is sparse linear algebra, where vectors vary gradually in sparseness. For example, one very large case from integer factoring I believe starts with an integer (or integer mod p) matrix with around 100,000 rows, each of which averages 20-30 nonzero entries. As one does (clever) Gaussian elimination, the rows steadily get more dense. Thus I'm thinking that for most of the algorithm, this data structure would be much better than either extreme of IntMap (good for very sparse vectors), or fully dense arrays.

For linear algebra at least, the basic operations are singleton, zipWith/merge (e.g. adding sparse vectors), map, and fold. For high-performance (exact) work, it'd be nice eventually to also have these types over unboxed integers mod p where p < 2^8, 2^16, 2^32, or 2^64 respectively.

What do you think? Thanks in advance for any advice or interest, and sorry if this is somewhat off-topic for rrb-vector.

MutableArray in Buffer is not UNPACKed

rrb-vector/src/Data/RRBVector/Internal/Buffer.hs

Lines 19 to 20 in 204a643

 -- | A mutable array buffer with a fixed capacity. 

 data Buffer s a = Buffer !(A.MutableArray s a) !(IntRef s)

Unpacking the MutableArray would likely give a small performance benefit. Or better, Buffer could use a SmallMutableArray directly since it does not need to support slicing.

	-- \| A mutable array buffer with a fixed capacity.
	data Buffer s a = Buffer !(A.MutableArray s a) !(IntRef s)