chenxm1986 / cktso
Pursuing the best performance of linear solver in circuit simulation
Just wondering if you could provide library files for the Apple M1 series CPUs. They are really fast.
Related to #2: since we are dealing with a b whose number of columns may greatly exceed the number of rows, factorization time no longer matters much. Rather, the quality of the factorization, more specifically the number of fill-ins, is what really matters, and it can make an easy 10% difference in the total computation time. If possible, we would like to use our own optimized ordering and skip the built-in AMD or any other minimum-degree algorithm.
It seems the referenced paper is still not indexed by IEEE. Would it be possible to provide it here if it has already been accepted and published? You could add a footnote indicating that the copyright belongs to IEEE so that it can be shared here legally.
Dear Mr. Chen,
we have tested your CKTSO matrix solver in our circuit simulation software and generally liked the performance. We tried it on AMD and Intel Windows platforms, with CSR-formatted (row-major) matrices.
However, we did not achieve any multi-thread performance improvement over single-thread mode, only a slowdown.
We ran transient simulations, i.e., many refactorization and solve calls.
We believe we are using the library as recommended in the user guide.
To check that we are doing everything properly, could you help with the following:
Thank you and best regards,
Gergely
Hi Dr. Chen, we recently ran some benchmark tests on cktso. The test matrices are drawn from real circuit designs, with dimensions ranging from 1.0E+04 up to 1.3E+07. We used the benchmark demos from both NICSLU and CKTSO and ran into some weird behavior. When we
run ./benchmark add20.mtx #nthreads, the output is just fine, something like the following:
Analysis time = 4900 us.
Factorization average time = 100 us, min time = 82 us.
Refactorization average time = 49 us, min time = 47 us.
Solve average time = 10 us, min time = 9 us.
Residual = 2.47485e-10.
Transposed solve average time = 12 us, min time = 11 us.
Residual = 2.44494e-10.
NNZ(L) = 9867, NNZ(U) = 7472.
Factorization flops = 133187, solve flops = 32283.
Determinent = 5.86668*10^(-3351).
Memory usage = 646989 bytes, max memory usage = 646989 bytes.
However, taking CKTSO as an example, if we run ./benchmark ourcircuitmatrix #nthreads, we get something like this:
Analysis time = 0 us.
Factorization average time = 0 us, min time = 0 us.
Refactorization average time = 0 us, min time = 0 us.
Solve average time = 0 us, min time = 0 us.
Residual = 7062.9.
Transposed solve average time = 0 us, min time = 0 us.
Residual = 7062.9.
NNZ(L) = 0, NNZ(U) = 0.
Factorization flops = 106382044954745, solve flops = 10.
Determinent = 4.67441e-310*10^(6.91969e-310).
Memory usage = 1480 bytes, max memory usage = 61772 bytes.
I really don't know how to tune the numerous parameters in CKTSO or NICSLU. Can you please give me some advice on tuning the solver? Since other popular direct solvers like KLU or PARDISO solved all of our test cases without much tuning, I'm really confused by this weird result.
I'm trying to use the DLL directly from Julia; however, the factorization function always returns -8, while the others seem to work fine (return 0). Here is my script:
using CEnum
using SparseArrays
const _libcktso = joinpath(dirname(@__FILE__),"..","cktso","win10_x64","cktso_l.dll")
mutable struct __cktso_l_dummy end
const ICktSo_L = Ptr{__cktso_l_dummy}
function CKTSO_L_CreateSolver(inst, iparm, oparm)
    ccall((:CKTSO_L_CreateSolver, _libcktso), Cint, (Ptr{ICktSo_L}, Ptr{Ptr{Cint}}, Ptr{Ptr{Clonglong}}), inst, iparm, oparm)
end
function CKTSO_L_DestroySolver(inst)
    ccall((:CKTSO_L_DestroySolver, _libcktso), Cint, (ICktSo_L,), inst)
end
function CKTSO_L_Analyze(inst, is_complex, n, ap, ai, ax, threads)
    ccall((:CKTSO_L_Analyze, _libcktso), Cint, (ICktSo_L, Bool, Clonglong, Ptr{Clonglong}, Ptr{Clonglong}, Ptr{Cdouble}, Cint), inst, is_complex, n, ap, ai, ax, threads)
end
function CKTSO_L_Factorize(inst, ax, fast)
    ccall((:CKTSO_L_Factorize, _libcktso), Cint, (ICktSo_L, Ptr{Cdouble}, Bool), inst, ax, fast)
end
function CKTSO_L_CleanUpGarbage(inst)
    ccall((:CKTSO_L_CleanUpGarbage, _libcktso), Cint, (ICktSo_L,), inst)
end
function CKTSO_L_Determinant(inst, mantissa, exponent)
    ccall((:CKTSO_L_Determinant, _libcktso), Cint, (ICktSo_L, Ptr{Cdouble}, Ptr{Cdouble}), inst, mantissa, exponent)
end
a = Ref{ICktSo_L}(0)
b = Cint[]
c = Clonglong[]
solver = CKTSO_L_CreateSolver(a, b, c)
A = sprand(100, 100, 0.01)
# make sure the diagonal is 1
for i in 1:100
    A[i, i] = 1
end
ap = Clonglong.(A.colptr) .- 1; # -1 because the indices in Julia are 1-based
ai = Clonglong.(A.rowval) .- 1; #
ax = Cdouble.(A.nzval);
CKTSO_L_Analyze(a[], Bool(false), Clonglong(0), pointer_from_objref(ap), pointer_from_objref(ai), pointer_from_objref(ax), Cint(0)) # return 0
res = CKTSO_L_Factorize(a[], pointer_from_objref(ax), Bool(false)) # return -8
display(res)
When b is actually a 2D matrix, the time cost of factorization can be negligible (since you apparently only have to do it once). In this case, does cktso natively support b as a sparse or dense 2D matrix? If so, does cktso also solve the slices of b in parallel?
I am not sure whether it is feasible, but I am wondering if we could add an additional argument to CKTSO_L_MV, like transpose, so that if x is a square matrix, we could get a transposed x directly. The benefit is that sometimes the user needs to run transpose(x) immediately after solving Ax = B (2D); having such an in-place transpose would greatly reduce the memory allocation and time cost when x is very large.
Can we have an in-place version of CKTSO_L_SolveMV? The goal is to reduce memory allocation and usage (and thus improve performance). Right now we need to allocate x first in order to execute CKTSO_L_SolveMV(id, nb, b, x, transpose); could the logic be optimized so that the memory of b is reused?
I know this may sound like too much to ask for the near future, but have you considered utilizing cloud FPGA services to achieve further parallel speedups? Do you have any experience in this field?
I've recently read a paper by Tarek Nechma, who claims to have had success with this, though on local FPGA hardware.
Thank you for any answer or hint.
KLU does not support single precision, and it seems cktso does not either. How hard would it be to add support for single precision? I also wonder why most sparse solvers do not support single precision.
When the system is relatively large (e.g., B is 20000 x 20000), creating (and allocating) such an identity matrix takes almost as long as factoring and solving the system (benchmarked in Julia). Would it be possible to provide a helper function so that we could simply pass the sparse matrix A and a preallocated space B? I feel this scenario is quite common in numerical computation.