Comments (21)
Yes, slicing and views in general are still lacking. I hope it would not be too hard to implement. Question: why is the view of a subblock a subarray of the blocked matrix? It feels like this should be a subarray of the same type that each block is stored as.
from blockarrays.jl.
There is no type for storing blocks, as the data is stored as a `Vector{Float64}` (ordered at the moment by block, but soon to be re-written to be ordered by column). Or rather, the type for each block is a `SubArray{Float64,2,BlockBandedMatrix{…},Tuple{Block,Block},false}`, which has all the features of a `StridedMatrix`: it is equivalent to a BLAS matrix, as it has a pointer, stride, etc.
For fast algorithms I think this is much better than the `BlockArray` approach of storing the blocks. For example, if the memory is stored by column you can use LAPACK's QR code to calculate a QR decomposition in a block-wise fashion, as one can get not only just a block, but a contiguous sequence of blocks in a `StridedMatrix`-like format. Having the blocks stored separately does not allow this.
Having the blocks stored separately allows for other things though. Say you have a block matrix like:

```
[A    B]
[B^T  0]
```

and want to act with the Schur complement `S = B^T M^(-1) B` on a vector `v` (as part of preconditioning in an iterative linear solver). If `B` and `M` are stored separately, this can be done much faster than actually having to extract the blocks in each iteration.
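To make that concrete, here is a minimal sketch (the matrices `M`, `B` and vector `v` below are stand-ins, not from any package): the Schur complement is never formed, `M` is factorized once, and each application of `S` to a vector is just a multiply, a solve, and another multiply with the separately stored blocks.

```julia
using LinearAlgebra

# Stand-in data: a well-conditioned M (diagonally dominated) and a dense B.
n = 50
M = rand(n, n) + n * I
B = rand(n, n)
v = rand(n)

F = lu(M)                            # factorize M once, reuse every iteration

# Apply S = B' * inv(M) * B to a vector without ever materializing S.
schur_apply(v) = B' * (F \ (B * v))

y = schur_apply(v)
```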
In my application these matrices are sparse, so I don't think it is as easy in that case.
For sparse arrays, one could do it with a filter of the indices, which I think is what happens when you call `A[1:n,1:n]` on a sparse matrix. But I agree there's a cost attached.
In any case, I think this is off topic, as both `BlockArray` and `PseudoBlockArray` have their use cases. And each one should have a `SubArray{T,…,Tuple{Block}}` implemented.
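A small sketch of the indexing behaviour mentioned above (stand-in data, standard-library SparseArrays only): range indexing on a sparse matrix filters the stored entries into a new sparse matrix, i.e. it copies, while a `view` is lazy but falls back to generic (non-BLAS) kernels.

```julia
using SparseArrays

S = sprand(10, 10, 0.3)

sub = S[1:5, 1:5]        # builds a new SparseMatrixCSC by filtering indices
v = view(S, 1:5, 1:5)    # lazy alternative: no copy, but generic kernels
```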
I agree with @KristofferC: many operations occur block-by-block, so having the blocks be contiguous in memory feels more natural, and seems like it would be more efficient. It also allows for the possibility of having blocks with different array types.
I'm confused: I'm advocating for contiguous in memory, while the current `BlockArray` is a wrapper around an array of arrays, so there is no guarantee of memory locality.

Ordering by columns gives the best of both worlds: you can do things block by block and still use LAPACK, but at the same time get LAPACK-compatible views of contiguous blocks together.

In any case, this is not really a central issue, as there can exist multiple subtypes of `AbstractBlockArray`.
PS Using arrays of arrays will be very inefficient when there are many small blocks. This is why I decided to use a single `Vector` with LAPACK-compatible blocks.

Note that my current implementation of `BlockBandedMatrix` actually does have the blocks themselves contiguous. I'm proposing to reimplement it to order by columns, so that I can use LAPACK's QR to calculate the QR decomposition.
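A minimal sketch of why column ordering enables this (stand-in data, not the actual `BlockBandedMatrix` internals): with column-major storage, a run of neighbouring block columns is itself one strided panel, so LAPACK's QR can factorize it in place, which is the key step of a block-wise QR.

```julia
using LinearAlgebra

A = rand(8, 8)              # stand-in for column-major block-matrix data
panel = view(A, :, 1:4)     # two neighbouring 8×2 block columns as one panel
P0 = copy(panel)            # keep a copy to verify the factorization
F = qr!(panel)              # LAPACK QR, overwriting the panel in place
```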
Right, that makes a lot of sense. I misunderstood how ApproxFun worked; it's actually quite different from `PseudoBlockArray` (I got misled by that statement). It's really neat that you were able to get views to work, and it does seem like there's a strong argument for storing the data that way.

Could you explain what you mean by "ordered by column"? Are blocks still contiguous in memory in that case?
Sorry, I guess I misunderstood how `PseudoBlockArray` worked.

Ordering by column just means listing the entries by columns, just like in `Matrix`. In the case of `BlockBandedMatrix`, we omit the zero block entries. In the case of `BlockMatrix`, it becomes essentially just a wrapper around a `Matrix`.

While the blocks are not contiguous in memory, they still have the structure of a strided matrix: it is enough to know the first entry, size, and stride to use LAPACK and BLAS Level 3 routines.

I think at this point the best thing to do would be to implement the ordering by column, so we can see exactly what the speeds of the two implementations are for matrix multiplication.
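The strided-matrix point can be seen in a few lines of plain Julia (stand-in data, standard library only): a block view into a column-major `Matrix` is not contiguous, yet its first element, size, and strides are enough for BLAS Level 3 routines to operate on it directly.

```julia
using LinearAlgebra

A = rand(6, 6)
blk = view(A, 1:3, 4:6)   # 3×3 block; consecutive columns are 6 apart in memory
st = strides(blk)          # (1, 6): inherits the parent's column stride
C = zeros(3, 3)
mul!(C, blk, blk)          # dispatches to BLAS gemm on the strided view
```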
That's interesting. It seems to me that the answer might depend on the size of the blocks. The application that got me interested in this package will likely have a small number of large blocks (say, 1000x1000). So I wasn't concerned about the block-level logic being fast. I don't mind slowly iterating over the blocks as long as within-block operations are as fast as possible. I'm getting the impression that your use-case is different, involving lots of little blocks, so you're more motivated to make the block-level logic efficient. Is that accurate?
I think I misunderstood as well. So, you store the whole matrix contiguously, with the caveat that you have to use views to extract a block without copying? Is that correct?
There's almost no benefit in having blocks contiguous in memory:

```julia
n = 1000; A = rand(n,n); B = rand(n,n); M = rand(2n,2n); M2 = rand(10n,2n);
@btime A*B                                     # 31.949 ms (2 allocations: 7.63 MiB)
@btime view(M,1:n,1:n)*view(M,n+1:2n,n+1:2n)   # 37.348 ms (74 allocations: 7.63 MiB)
@btime view(M2,1:n,1:n)*view(M2,n+1:2n,n+1:2n) # 39.615 ms (74 allocations: 7.63 MiB)
```
> So, you store the whole matrix contiguously with the caveat that you have to use views to extract a block without copying?

Yes, that's right. In the non-banded case this would mean the data is stored exactly like a `Matrix`, but with the block structure attached on top.
Oh, but that is exactly `PseudoBlockArray` then? (crappy name, I know)
That's what I thought. Let me clarify: there are three things under discussion:

- The current `ApproxFun.BlockBandedMatrix`, which is not like `PseudoBlockArray`, since blocks are contiguous in memory
- The proposed `ApproxFun.BlockBandedMatrix`, which is like `PseudoBlockArray` but without the zero blocks stored in memory. The reason to rewrite is that there is almost no penalty for not having contiguous blocks, and it allows LAPACK-compatible views like `view(A,Block.(1:2),Block(1))` that allow for a block-wise QR
- The current `ApproxFun.BandedBlockBandedMatrix`, which hasn't been discussed, but stores the blocks in a format where they can be retrieved in a way compatible with BandedMatrices.jl

But this discussion is off topic; my only point is that I am fairly confident that the `PseudoBlockArray` approach is faster computationally overall, especially when you have many blocks. (Provided the blocks are dense.)
OK, I did the following quick experiment showing that using `view`s really is fast:

```julia
using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks, 0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks, (K-1)b+1:K*b, (J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N', 'N', 1.0,
                       view(A.blocks, (K-1)b+1:K*b, (P-1)b+1:P*b),
                       view(B.blocks, (P-1)b+1:P*b, (J-1)b+1:J*b),
                       1.0, cur_block)
        end
    end
    Y
end

b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
Y = similar(A)
@btime A*A                      # 1.572 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks        # 95.898 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A)  # 348.103 μs (2100 allocations: 131.25 KiB)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
Y = similar(A)
@btime A*A                      # 13.259 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks        # 324.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A)  # 707.913 μs (20 allocations: 1.25 KiB)

# Convert A to a BlockArray
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N), fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B                      # 6.474 ms (120016 allocations: 7.40 MiB)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N), fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B                      # 47.627 ms (960016 allocations: 58.90 MiB)
```
Those are some really interesting results. It's pretty clear that multiplication hasn't been properly implemented yet for `BlockArray`s and `PseudoBlockArray`s. To make the comparison complete, I've taken your `A_mul_B_block!` and written a version of it for `BlockArray`s, and rerun the benchmarks (I've also rearranged things slightly to make it easier for me to follow). Using a matrix of blocks does end up being ever so slightly faster than views into a single matrix, and there's no memory allocation at all anymore. So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

Also, what do you think accounts for the big difference between `A.blocks*A.blocks` and `A_mul_B_block!(Y, A, A)`? I don't have a very good intuition for that.
```julia
using BenchmarkTools
using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks, 0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks, (K-1)b+1:K*b, (J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N', 'N', 1.0,
                       view(A.blocks, (K-1)b+1:K*b, (P-1)b+1:P*b),
                       view(B.blocks, (P-1)b+1:P*b, (J-1)b+1:J*b),
                       1.0, cur_block)
        end
    end
    Y
end

b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
Y = similar(A)
@btime A*A                      # 1.446 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks        # 37.028 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A)  # 386.679 μs (2100 allocations: 131.25 KiB)

# Convert A to a BlockArray
B = BlockArray(Matrix{Float64}, fill(b,N), fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B                      # 5.557 ms (120016 allocations: 7.40 MiB)

# replace views with an array of blocks
function A_mul_B_blockarray!(Y, A, B)
    fill!(Y, 0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = Y.blocks[K,J]
        for P = 1:N
            BLAS.gemm!('N', 'N', 1.0, A.blocks[K,P], B.blocks[P,J], 1.0, cur_block)
        end
    end
    Y
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B)  # 337.071 μs (0 allocations: 0 bytes)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
Y = similar(A)
@btime A*A                      # 11.767 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks        # 143.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A)  # 226.180 μs (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N), fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B                      # 38.895 ms (960016 allocations: 58.90 MiB)

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B)  # 224.157 μs (0 allocations: 0 bytes)

# Even bigger blocks, removing the slow methods.
b = 1000; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n), fill(b,N), fill(b,N))
Z = similar(A.blocks)
@btime BLAS.gemm!('N', 'N', 1.0, A.blocks, A.blocks, 0.0, Z)  # 109.154 ms (0 allocations: 0 bytes)
Y = similar(A)
@btime A_mul_B_block!(Y, A, A)       # 143.229 ms (20 allocations: 1.25 KiB)
B = BlockArray(Matrix{Float64}, fill(b,N), fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B)  # 131.430 ms (0 allocations: 0 bytes)
```
> So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

The benefit is that you can combine neighbouring blocks and still get a strided matrix. For example:

```julia
function A_mul_B_col_block!(Y, A, B)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks, (K-1)b+1:K*b, (J-1)b+1:J*b)
        A_mul_B!(cur_block,
                 view(A.blocks, (K-1)b+1:K*b, :),
                 view(B.blocks, :, (J-1)b+1:J*b))
    end
    Y
end

@btime A_mul_B_col_block!(Y, A, A)  # 275.924 μs (300 allocations: 18.75 KiB)
```

In particular, I want to calculate a QR factorization, and this is very complicated with `BlockArray`s, and will be slow.

Note that the allocations should disappear eventually: https://github.com/JuliaLang/julia/issues/14955
This is now merged.