chenxm1986 / cktso
Pursuing the best performance of linear solver in circuit simulation
Just wondering if you could provide library files for the Apple M1 series CPUs. They are really fast.
Related to #2: since we are dealing with a b whose number of columns may greatly exceed the number of rows, factorization time no longer matters much. Rather, the quality of the factorization, more specifically the number of fill-ins, is what really matters, and it can make an easy 10% difference in the total computation time. If possible, we would like to use our own optimized ordering and skip the built-in AMD or any other minimum-degree algorithm.
It seems the referenced paper is still not indexed by IEEE. Would it be possible to provide it here if it has already been accepted and published? You could add a footnote indicating that the copyright belongs to IEEE so that it can be shared here legally.
Dear Mr. Chen,
we have tested your CKTSO matrix solver in our circuit simulation software and generally liked the performance. We tried it on AMD and Intel Windows platforms, with CSR-formatted (row-major) matrices.
However, we did not achieve any multi-thread performance improvement over single-thread mode, only a slowdown.
We ran transient simulations, i.e., many refactorization and solve calls.
We believe we are using the library as recommended in the user guide.
To check that we are doing everything properly, could you help with the following:
Thank you and best regards,
Gergely
Hi Dr. Chen, we recently ran some benchmark tests on cktso. The test matrices are drawn from real circuit designs, with dimensions ranging from 1.0E+04 up to 1.3E+07. We used the benchmark demos from both NICSLU and CKTSO and ran into some weird behavior. When we
run ./benchmark add20.mtx #nthreads, the output is just fine, something like the following:
Analysis time = 4900 us.
Factorization average time = 100 us, min time = 82 us.
Refactorization average time = 49 us, min time = 47 us.
Solve average time = 10 us, min time = 9 us.
Residual = 2.47485e-10.
Transposed solve average time = 12 us, min time = 11 us.
Residual = 2.44494e-10.
NNZ(L) = 9867, NNZ(U) = 7472.
Factorization flops = 133187, solve flops = 32283.
Determinent = 5.86668*10^(-3351).
Memory usage = 646989 bytes, max memory usage = 646989 bytes.
However, taking CKTSO as an example, if we run ./benchmark ourcircuitmatrix #nthreads, we get something like this:
Analysis time = 0 us.
Factorization average time = 0 us, min time = 0 us.
Refactorization average time = 0 us, min time = 0 us.
Solve average time = 0 us, min time = 0 us.
Residual = 7062.9.
Transposed solve average time = 0 us, min time = 0 us.
Residual = 7062.9.
NNZ(L) = 0, NNZ(U) = 0.
Factorization flops = 106382044954745, solve flops = 10.
Determinent = 4.67441e-310*10^(6.91969e-310).
Memory usage = 1480 bytes, max memory usage = 61772 bytes.
I really don't know how to tune the numerous parameters in CKTSO or NICSLU. Can you please give me some advice on tuning the solver? Since other popular direct solvers like KLU or PARDISO solved all of our test cases without much tuning, I'm really confused by this weird result.
I'm trying to use the DLL directly from Julia; however, the factorization function always returns -8, while the others seem to work fine (return 0). Here is my script:
using CEnum
using SparseArrays
const _libcktso = joinpath(dirname(@__FILE__),"..","cktso","win10_x64","cktso_l.dll")
mutable struct __cktso_l_dummy end
const ICktSo_L = Ptr{__cktso_l_dummy}
function CKTSO_L_CreateSolver(inst, iparm, oparm)
    ccall((:CKTSO_L_CreateSolver, _libcktso), Cint, (Ptr{ICktSo_L}, Ptr{Ptr{Cint}}, Ptr{Ptr{Clonglong}}), inst, iparm, oparm)
end
function CKTSO_L_DestroySolver(inst)
    ccall((:CKTSO_L_DestroySolver, _libcktso), Cint, (ICktSo_L,), inst)
end
function CKTSO_L_Analyze(inst, is_complex, n, ap, ai, ax, threads)
    ccall((:CKTSO_L_Analyze, _libcktso), Cint, (ICktSo_L, Bool, Clonglong, Ptr{Clonglong}, Ptr{Clonglong}, Ptr{Cdouble}, Cint), inst, is_complex, n, ap, ai, ax, threads)
end
function CKTSO_L_Factorize(inst, ax, fast)
    ccall((:CKTSO_L_Factorize, _libcktso), Cint, (ICktSo_L, Ptr{Cdouble}, Bool), inst, ax, fast)
end
function CKTSO_L_CleanUpGarbage(inst)
    ccall((:CKTSO_L_CleanUpGarbage, _libcktso), Cint, (ICktSo_L,), inst)
end
function CKTSO_L_Determinant(inst, mantissa, exponent)
    ccall((:CKTSO_L_Determinant, _libcktso), Cint, (ICktSo_L, Ptr{Cdouble}, Ptr{Cdouble}), inst, mantissa, exponent)
end
a = Ref{ICktSo_L}(0)
b = Cint[]
c = Clonglong[]
solver = CKTSO_L_CreateSolver(a, b, c)
A = sprand(100, 100, 0.01)
# make sure the diagonal is 1
for i in 1:100
    A[i, i] = 1
end
ap = Clonglong.(A.colptr) .- 1; # -1 because the indices in Julia are 1-based
ai = Clonglong.(A.rowval) .- 1; #
ax = Cdouble.(A.nzval);
CKTSO_L_Analyze(a[], Bool(false), Clonglong(0), pointer_from_objref(ap), pointer_from_objref(ai), pointer_from_objref(ax), Cint(0)) # return 0
res = CKTSO_L_Factorize(a[], pointer_from_objref(ax), Bool(false)) # return -8
display(res)
When b is actually a 2D matrix, the time cost of factorization can be negligible (since you apparently only have to do it once). In this case, does cktso natively support b as a sparse or dense 2D matrix? If so, does cktso also solve the slices of b in parallel?
I am not sure whether it is feasible, but I am wondering if we could add an additional argument to CKTSO_L_MV, like transpose, so that if x is a square matrix, we could get a transposed x directly. The benefit is that sometimes the user needs to run transpose(x) immediately after solving Ax = B (2D); having such an in-place transpose would greatly reduce the memory allocation and time cost when x is very large.
Can we have an in-place version of CKTSO_L_SolveMV? The goal is to reduce memory allocation and usage (and thus improve performance). Right now we need to allocate x first in order to execute CKTSO_L_SolveMV(id, nb, b, x, transpose); could the logic be optimized so that the memory of b is reused?
I know this may sound like too much to ask for the near future, but have you considered utilizing cloud FPGA services to achieve further parallel speedups? Do you have any experience in this field?
I've recently read a paper by Tarek Nechma, who claims to have had success with this, though on local FPGA hardware.
Thank you for any answer or hint.
KLU does not support single precision, and it seems cktso does not either. How hard would it be to add support for single precision? I also wonder why most sparse solvers do not support single precision.
When the system is relatively large (e.g., B is 20000 x 20000), creating (and allocating) such an identity matrix takes almost as long as factoring and solving the system (benchmarked in Julia). Would it be possible to provide a helper function so that we could simply pass the sparse matrix A and a preallocated space B? I feel this scenario is quite common in numerical computation.