Comments (8)
Hi Francis
Could you run the tests with just slab decomposition? It is normally the best solution. Choose slab=True when creating the TensorProductSpace.
Installing without conda should not be that hard on supercomputers. They usually have all the dependencies you need. FFTW, HDF5 in parallel, MPI etc.
from shenfun.
Ah. Just realized this is 2D, so it's already slab.
from shenfun.
BTW, I do not find the results strange, but I cannot really understand them without a better knowledge of your PC. Is the memory shared, distributed, heterogeneous? How many cores in each CPU? By the looks of it I'm guessing 2 cores per CPU because the performance is very good on 2 cores, but then there is very little speed-up going to 4. That could be explained with your computer having fast interconnect between 2 cores on the same CPU, but slower from CPU to CPU. But you could also have 4 cores on one CPU and fast interconnect between cores 1 and 2, but slower between (1, 2) and to the two remaining (3, 4). If you look at our MPI paper we discuss this to some length in section 4. The thing with spectral codes is that they require communication between all cores because the basis functions are global. So spectral codes are very heavy on communication and require fast interconnect between all cores. Most computers are not built to handle that. On dedicated supercomputers with very fast interconnect the codes can scale well up to thousands of cores and I've used this code up to 70,000 cores with near perfect scaling. But that does not mean that the code will scale at all on a different computer with a different hardware. These things are complicated:-)
from shenfun.
Hello Mikael and thanks for your respones. Greatly appreciated.
Yes, this is a 2D problem so it is always a slab.
I did a 3D version of my code and did some tests with 512^3 and found the scalings of
np = 1, 2, 4, 8, 16
1., 0.93, 0.75, 0.54, 0.28
I suspect that if I used more degrees fo freedom I would get even better performance. This suggests that my architure can have good scalability.
As for the hardware, this is a brand new inspiron, 1 processor with 18 cores, which I have been led to believe should have great scalability.
I might try doing some tests with parts of the code to see how computing the flux, RHS, many times scales. If that scales well then the problem might be with the time stepping. I plan to try this tomorrow and will let you know how things go.
from shenfun.
I have some strange results to share but might help to point out where the problem occurs.
Below, I will copy a simple code that times how long it takes to compute the gradient 20 times. It doesn't do any updating so it's not physically meaningful but it should be a good measure of how the effort scales with mulitple cores. The results are as follows:
N=2048
np scaling
1 1
2 0.5
4 0.45
8 0.35
N=4096
np scaling
1 1
2 0.43
4 0.36
8 0.26
In both cases the scaling from 1 to 2 is awful. It actually takes longer with two cores than with 1. But, going from 2 to 4, actually scales very well. The efficiency from 2 to 4 is 0.96 for the first and 0.84 for the second. Again, I would think with more points it should scale better.
I guess I have some questions.
First, do you get similar results on one of your machines that scales well?
Second, could it be there is a problem in what's going on from 1 to 2?
Third, I don't remember if I've turned on the optimization or not. Should that make a difference in the scalability?
As an aside, I have tried to install shenfun on one of our servers and have problems both with and without conda, but I will save that for another issue. ; )
from shenfun.
`from mpi4py import MPI
import numpy as np
from shenfun import *
from mpi4py_fft import generate_xdmf, fftw
import matplotlib.pyplot as plt
import time
import sys
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# Geometry
M = 12
N = (2**M, 2**M)
L = [20.0, 20.]
# Parameters
F2 = 1e-8
# Funcion Spaces
V1 = Basis(N[0], 'F', dtype='D', domain=(0, L[0]))
V2 = Basis(N[1], 'F', dtype='d', domain=(0, L[1]))
T = TensorProductSpace(comm, (V1, V2), **{'planner_effort': 'FFTW_MEASURE'})
TV = VectorTensorProductSpace(T)
# Padded Spaces
Tp = T.get_dealiased((1.5, 1.5))
TVp = TV.get_dealiased((1.5, 1.5))
q_hat, q, qb = Function(T), Array(T), Array(T)
gradq_hat = Function(TV)
qp, gradqp, Up = Array(Tp), Array(TVp), Array(TVp) # Padded arrays
#FJP
Ub, gradqb = Array(TVp), Array(TVp)
# Wavenumbers
K = np.array(T.local_wavenumbers(True,True))
K2 = np.sum(K*K + F2, 0, dtype=float)
K_over_K2 = K.astype(float) / np.where(K2 == 0, 1, K2).astype(float)
# Initialization
np.random.seed(rank)
q[:] = 1e-4*2.*(np.random.random(q.shape).astype(q.dtype)-0.5)
q_hat = T.forward(q, q_hat)
if __name__ == '__main__':
# Solve
time_initial = time.time()
rhs_hat = Function(T)
for ii in np.arange(20):
# Gradient of q
gradq_hat = 1j*K*q_hat
gradqp = TVp.backward(gradq_hat, gradqp)
# Velocity
Up[0] = Tp.backward( 1j*K_over_K2[1]*q_hat, Up[0])
Up[1] = Tp.backward(-1j*K_over_K2[0]*q_hat, Up[1])
# Flux
rhs = -(Up[0]*gradqp[0] + Up[1]*gradqp[1])
rhs_hat = Tp.forward(rhs, rhs_hat)
time_final = time.time()
print('Total time = ', (time_final - time_initial))
`
from shenfun.
Hi Francis,
-
I think you are are going about this the wrong way. You need to first benchmark your computer using some common benchmark tools. When you have an idea of what to expect it will be much easier to analyse the shenfun results. There are MPI benchmarks available for most processors, for example here for intel. Note that you should only look into the performance of
alltoall
, because that is all that is used by shenfun. In fact, we are only usingalltoallw
. -
You may see more speed-up on your computer using threads. Choose number of threads when creating the TensorProductSpace.
-
It is not really strange that the performance does not increase going from 1 to 2 processors. That is very often the case. One processor requires NO communication, whereas 2 does a whole lot. If communications are slow, then 2 may be slower than 1 even though the serial operations are twice as fast. For 1 processor you can also do
collapse_fourier=True
, which will speed up 1 processor even further, but it cannot be used for padding or 2D slab. -
MPI has a whole bunch of environment variables that may be tweaked to enhance performance. Each function usually has several different implementations and you only get the default unless you ask for something else. For example, there's a
MPICH_SHARED_MEM_COLL_OPT
that tries to take advantage of shared memory, that I suspect you computer has? -
You can get either MPICH or OpenMPI from conda. You should probably try both on your computer to figure out which one performs best.
-
Shenfun is not optimized for shared memory, only distributed memory. So even though all cores have access to all memory on a shared memory computer, shenfun will still assume memory is distributed and send arrays back and forth. Optimizing for shared memory would probably lead to major speed-ups, but I have not looked into it.
-
Speed-up will most likely come from tweaking the MPI parameters and not shenfun's implementation. Shenfun just calls
alltoallw
for any array, 2D, 3D, 4D etc, and that's the beauty of it. Other parallel FFT vendors have codes that use 10,000 lines just to get the 3D pencil implemented. We have a completely generic implementation on just a few hundred lines of code. But we have to callalltoallw
, there is simply no room for change or optimization on shenfun's side. Improve MPICH'salltoallw
and shenfun's performance will improve.
from shenfun.
Hello Mikael,
All excellent points. I'll start from the first step and go from there.
I will keep you posted as I make progess.
from shenfun.
Related Issues (20)
- Dx not working with 1D mixed function space HOT 4
- Issue on Stokes equations with doubly periodic boundary conditions. HOT 2
- Error when evaluating derivatives for nonstandard domain HOT 7
- Installation problem: error in "cimport fastgl_wrap" HOT 2
- Old shenfun script for 2D Poisson equation dosn't work with new version HOT 2
- 1d poisson HOT 1
- Bug in FunctionSpace with bc as dictionary HOT 4
- Solve DGL with non-constant (discrete) variable? HOT 5
- AttributeError: 'tuple' object has no attribute 'copy' HOT 1
- Install problem HOT 1
- problems in solving 2D Allen Cahn equation HOT 2
- Two questions in the demo MixedPoisson.py HOT 7
- How to compute the numerical solution on the general points HOT 4
- 1. Solver for multidimensional problem 2. Product with variable coefficients HOT 5
- Two questions about the Legendre-Galerkin method for the Cahn-Hilliard equation HOT 2
- How to compute the integral about the nonlinear term in the Navier-Stokes equation? HOT 2
- How to assemble the right-hand side vector? HOT 3
- Understanding the chebyshev solver in poisson2ND.py HOT 3
- Time dependent boundary conditions in segments for Rayleigh-Benard 2D HOT 2
- Implementing the one-dimensional Kuramoto-Sivashinsky system HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from shenfun.