Comments (21)

KristofferC avatar KristofferC commented on June 21, 2024

Yes, slicing and views in general are still lacking. I hope it would not be too hard to implement. Question: why is the view of a subblock a SubArray of a blocked matrix? It feels like this should be a SubArray of the same type that each block is stored as.

from blockarrays.jl.

dlfivefifty avatar dlfivefifty commented on June 21, 2024

There is no type for storing blocks: the data is stored as a Vector{Float64} (ordered at the moment by block, but soon to be rewritten to be ordered by column). Or rather, the type for each block is a SubArray{Float64,2,BlockBandedMatrix{…},Tuple{Block,Block},false}, which has all the features of a StridedMatrix: it is equivalent to a BLAS matrix, since it has a pointer, stride, etc.

For fast algorithms I think this is much better than the BlockArray approach of storing the blocks separately. For example, if the memory is stored by column you can use LAPACK's QR code to calculate a QR decomposition block-wise, since one can get not just a single block but a contiguous sequence of blocks in a StridedMatrix-like format. Having the blocks stored separately does not allow this.

KristofferC avatar KristofferC commented on June 21, 2024

Having the blocks stored separately allows for other things though. Say you have a block matrix like:

[A     B]
[B^T   0]

and want to act with the Schur complement S = B^T M^(-1) B on a vector v (as part of preconditioning in an iterative linear solver). If B and M are stored separately, this can be done much faster than having to extract the blocks in each iteration.
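To make the point concrete, here is a minimal sketch (with made-up sizes, not real solver code): S can be applied matrix-free by factorizing M once, outside the iteration, and chaining three products, so no block ever needs to be extracted inside the loop.

```julia
using LinearAlgebra

# Sketch: apply S = Bᵀ M⁻¹ B to a vector without ever forming S.
# Sizes and matrices are placeholders, not from a real problem.
n, m = 6, 3
M = Symmetric(rand(n, n)) + n * I   # diagonally dominant, hence SPD
B = rand(n, m)
F = cholesky(M)                     # factorize M once, outside the loop

apply_S(v) = B' * (F \ (B * v))     # three products, no blocks extracted

v = rand(m)
S_dense = B' * (Matrix(M) \ B)      # explicit S, for checking only
@assert apply_S(v) ≈ S_dense * v
```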


KristofferC avatar KristofferC commented on June 21, 2024

In my application these matrices are sparse so I don't think it is as easy then.

dlfivefifty avatar dlfivefifty commented on June 21, 2024

For sparse arrays, one could do it with a filter of the indices, which I think is what happens when you call A[1:n,1:n] on a sparse matrix. But I agree there's a cost attached.
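For instance (a small sketch with arbitrary sizes): getindex on a SparseMatrixCSC builds a new sparse matrix from the stored entries that fall inside the index range, which is correct but allocates on every call.

```julia
using SparseArrays

# Sketch: slicing a sparse matrix filters its stored indices into a
# new SparseMatrixCSC — correct, but it allocates on each extraction.
A = sprand(10, 10, 0.3)
blk = A[1:5, 1:5]           # a new sparse matrix, not a cheap view

@assert blk isa SparseMatrixCSC
@assert Matrix(blk) == Matrix(A)[1:5, 1:5]
```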

In any case, I think this is off topic, as both BlockArray and PseudoBlockArray have their use cases. And each one should have a SubArray{T,…,Tuple{Block}} implemented.

maximerischard avatar maximerischard commented on June 21, 2024

I agree with @KristofferC, many operations occur block-by-block, and so having the blocks be contiguous in memory feels more natural, and seems like it would be more efficient. It also allows for the possibility of having blocks with different array types.

dlfivefifty avatar dlfivefifty commented on June 21, 2024

I'm confused: I'm the one advocating for contiguity in memory, while the current BlockArray is a wrapper around an array of arrays, so there is no guarantee of memory locality.

Ordering by columns gives the best of both worlds: you can do things block by block and still use LAPACK, while at the same time getting LAPACK-compatible views of several contiguous blocks at once.

In any case, this is not really a central issue as there can exist multiple subtypes of AbstractBlockArray.

dlfivefifty avatar dlfivefifty commented on June 21, 2024

PS Using arrays of arrays will be very inefficient when there are many small blocks. This is why I decided to use a single Vector with LAPACK-compatible blocks.

Note that my current implementation of BlockBandedMatrix actually does have the blocks themselves contiguous. I'm proposing to reimplement it to order by columns, so that I can use LAPACK's QR to calculate the QR decomposition.

maximerischard avatar maximerischard commented on June 21, 2024

Right, that makes a lot of sense. I misunderstood how ApproxFun worked; it's actually quite different from PseudoBlockArray (I got misled by that statement). It's really neat that you were able to get views to work, and it does seem like there's a strong argument for storing the data that way.

Could you explain what you mean by “ordered by column”? Are blocks still contiguous in memory in that case?

dlfivefifty avatar dlfivefifty commented on June 21, 2024

Sorry, I guess I misunderstood how PseudoBlockArray worked.

Ordering by column just means listing the entries by columns, just like in Matrix. In the case of BlockBandedMatrix, we omit the zero block entries. In the case of BlockMatrix, it becomes essentially just a wrapper around a Matrix.

While the blocks are not contiguous in memory, they still have the structure of a strided matrix: it is enough to know the first entry, the size, and the stride to use LAPACK and BLAS level-3 routines.
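Concretely (a small standalone sketch, not BlockArrays code): any block of a column-major Matrix is a strided view, and mul! on such views dispatches straight to BLAS.

```julia
using LinearAlgebra

# Sketch: a block of a column-major Matrix is strided — the first element,
# size, and stride are everything BLAS level 3 needs, with no copy made.
M = collect(reshape(1.0:16.0, 4, 4))
V = view(M, 2:3, 2:3)                # a 2×2 "block", no data movement

@assert strides(V) == (1, 4)         # column stride = parent's leading dim
Y = zeros(2, 2)
mul!(Y, V, V)                        # dispatches to BLAS gemm on the views
@assert Y ≈ Matrix(V) * Matrix(V)
```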

I think at this point the best thing to do would be to implement the ordering by column, so we can see exactly what the speeds of the two implementations are for matrix multiplication.

maximerischard avatar maximerischard commented on June 21, 2024

That's interesting. It seems to me that the answer might depend on the size of the blocks. The application that got me interested in this package will likely have a small number of large blocks (say, 1000x1000). So I wasn't concerned about the block-level logic being fast. I don't mind slowly iterating over the blocks as long as within-block operations are as fast as possible. I'm getting the impression that your use-case is different, involving lots of little blocks, so you're more motivated to make the block-level logic efficient. Is that accurate?

KristofferC avatar KristofferC commented on June 21, 2024

I think I misunderstood as well. So, you store the whole matrix contiguously, with the caveat that you have to use views to extract a block without copying? Is that correct?

dlfivefifty avatar dlfivefifty commented on June 21, 2024

There's almost no benefit in having blocks contiguous in memory:

using BenchmarkTools

n = 1000; A = rand(n,n); B = rand(n,n); M = rand(2n,2n); M2 = rand(10n,2n);

@btime A*B # 31.949 ms (2 allocations: 7.63 MiB)
@btime view(M,1:n,1:n)*view(M,n+1:2n,n+1:2n)  # 37.348 ms (74 allocations: 7.63 MiB)
@btime view(M2,1:n,1:n)*view(M2,n+1:2n,n+1:2n)  # 39.615 ms (74 allocations: 7.63 MiB)

dlfivefifty avatar dlfivefifty commented on June 21, 2024

So, you store the whole matrix contiguously with the caveat that you have to use views to extract a block without copying?

Yes, that's right. In the non-banded case this would mean the data is stored exactly like a Matrix, but with the block structure attached on top.

KristofferC avatar KristofferC commented on June 21, 2024

Oh, but that is exactly "PseudoBlockArray" then? (Crappy name, I know.)

dlfivefifty avatar dlfivefifty commented on June 21, 2024

That's what I thought. Let me clarify: there are three things under discussion:

  1. The current ApproxFun.BlockBandedMatrix, which is not like PseudoBlockArray, since its blocks are contiguous in memory.
  2. The proposed ApproxFun.BlockBandedMatrix, which is like PseudoBlockArray but without the zero blocks stored in memory. The reason to rewrite is that there is almost no penalty for not having contiguous blocks, and it allows LAPACK-compatible views like view(A,Block.(1:2),Block(1)), which allow for a block-wise QR.
  3. The current ApproxFun.BandedBlockBandedMatrix, which hasn't been discussed, but stores the blocks in a format where they can be retrieved in a way compatible with BandedMatrices.jl.

But this discussion is off topic; my only point is that I am fairly confident that the PseudoBlockArray approach is faster computationally overall, especially when you have many blocks.

(Provided the blocks are dense.)
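As a sketch of the block-wise QR point (assumed column-major layout, toy sizes, not the actual ApproxFun code): a whole block column is a single strided panel, so one LAPACK call covers a contiguous run of blocks.

```julia
using LinearAlgebra

# Sketch: with column-major storage, blocks (1:N, 1) form one strided
# panel, so LAPACK's QR can process the whole block column in one call.
b, N = 3, 4
n = N * b
data = rand(n, n)            # stand-in for the underlying block storage

panel = view(data, :, 1:b)   # the full first block column, no copy
F = qr(panel)                # one LAPACK factorization of the panel
@assert F.Q * F.R ≈ panel
```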

dlfivefifty avatar dlfivefifty commented on June 21, 2024

OK, I did the following quick experiment showing that using views really is fast:

using BlockArrays, BenchmarkTools, LinearAlgebra

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end


b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.572 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks #  95.898 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) #   348.103 μs (2100 allocations: 131.25 KiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 13.259 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks #  324.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) #     707.913 μs (20 allocations: 1.25 KiB)




# Convert A to Block Array
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 6.474 ms (120016 allocations: 7.40 MiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 47.627 ms (960016 allocations: 58.90 MiB)

maximerischard avatar maximerischard commented on June 21, 2024

Those are some really interesting results. It's pretty clear that multiplication hasn't been properly implemented yet for BlockArrays and PseudoBlockArrays. To make the comparison complete, I've taken your A_mul_B_block! and written a version of it for BlockArrays, and rerun the benchmarks (I've also rearranged things slightly to make them easier for me to follow). It does end up being ever so slightly faster using a matrix of blocks than views into a single matrix, and there's no memory allocation at all anymore. So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

Also, what do you think accounts for the big difference between A.blocks*A.blocks and A_mul_B_block!(Y, A, A)? I don't have a very good intuition for that.

using BenchmarkTools

using BlockArrays, LinearAlgebra

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end

b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.446 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks # 37.028 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) # 386.679 μs (2100 allocations: 131.25 KiB)

# Convert A to Block Array
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 5.557 ms (120016 allocations: 7.40 MiB)

# replace views with array of blocks
function A_mul_B_blockarray!(Y, A, B)
    fill!(Y,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = Y.blocks[K,J]
        for P = 1:N
            BLAS.gemm!('N','N', 1.0, A.blocks[K,P], B.blocks[P,J],
                       1.0, cur_block)
        end
    end
    Y
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 337.071 μs (0 allocations: 0 bytes)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 11.767 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks # 143.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) # 226.180 μs (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 38.895 ms (960016 allocations: 58.90 MiB)

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 224.157 μs (0 allocations: 0 bytes)

# Even bigger blocks, removing the slow methods.
b = 1000; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Z = similar(A.blocks)
@btime BLAS.gemm!('N', 'N', 1.0, A.blocks, A.blocks, 0.0, Z) # 109.154 ms (0 allocations: 0 bytes)
Y = similar(A)
@btime A_mul_B_block!(Y, A, A) # 143.229 ms (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 131.430 ms (0 allocations: 0 bytes)

dlfivefifty avatar dlfivefifty commented on June 21, 2024

So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

The benefit is that you can combine neighbouring blocks and still get a strided matrix. For example:

function A_mul_B_col_block!(Y, A, B)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        A_mul_B!(cur_block,
                 view(A.blocks,(K-1)b+1:K*b,:),
                 view(B.blocks,:,(J-1)b+1:J*b))
    end
    Y
end

@btime A_mul_B_col_block!(Y, A, A) #   275.924 μs (300 allocations: 18.75 KiB)

In particular, I want to calculate a QR factorization; this is very complicated with BlockArrays, and will be slow.

Note that the allocations should disappear eventually: https://github.com/JuliaLang/julia/issues/14955

dlfivefifty avatar dlfivefifty commented on June 21, 2024

This is now merged.
