
x3d2's Issues

Single precision support for MPI messages

There is a compile-time selection mechanism for precision in the code base; however, the MPI modules are currently hardcoded to MPI_DOUBLE_PRECISION. We need to figure out a way to fix this so that we can run single-precision simulations.

A very incomplete attempt is here (albeit in the wrong place; it should be in the base backend)

integer :: MPI_FP_PREC = dp

Then, based on the dp parameter, MPI_FP_PREC can be set and used everywhere afterwards. This will probably require converting the mpi_sendrecv modules into classes, so that they can be instantiated at the backends with the MPI_FP_PREC.
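For example, a minimal sketch of how MPI_FP_PREC could be set from dp (assuming dp is the working-precision kind parameter from a common module; m_common is a hypothetical name):

use mpi
use m_common, only: dp
integer :: MPI_FP_PREC

! pick the matching MPI datatype for the compile-time precision
if (dp == kind(0.0d0)) then
   MPI_FP_PREC = MPI_DOUBLE_PRECISION
else
   MPI_FP_PREC = MPI_REAL
end if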

Thanks @JamieJQuinn for pointing this out in #67.

Investigate using pFUnit as a formal testing framework

Currently, unit tests are defined as regular subroutines and run through ctest. We may get better utilities by using pFUnit, such as better float comparison, test tagging, and running tests in parallel.

@slaizet said Thibault tried this out last year and described it as "painful", so I'd like to know what benefits we might get from pFUnit before ripping out our current working test framework.
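For reference, a pFUnit test looks roughly like the hedged sketch below (pFUnit's preprocessor expands the @test/@assertEqual annotations; the test body here is an invented example):

@test
subroutine test_vecadd()
   use funit
   real :: a(3), b(3)

   a = [1., 2., 3.]
   b = a + a
   ! tolerance-based float comparison is one of the utilities we'd gain
   @assertEqual([2., 4., 6.], b, 1.e-6)
end subroutine test_vecadd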

Field size in x/y/z orientations

We minimize the memory requirement via our memory pool strategy and have implemented the allocator module to provide this functionality. The allocator module is responsible for allocating memory on the device or host, and for passing a pointer to a memory region when a method requires temporary field storage.

Currently, the allocator module allocates a 3D array with specified sizes:
allocate (field(SZ, nx, ny*nz/SZ))
However, the number of vertices can differ in the x, y, and z directions. Therefore, when switching from one orientation to another, or when requesting a pointer from the allocator module, we may need to specify a size so that we have temporary storage with the correct shape. A Fortran 2003 feature called bounds remapping allows assigning a pointer with a different size or dimension to an existing array; I have tested this feature with both the nvidia and gcc compilers, so I don't expect any issues here.
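A minimal sketch of bounds remapping, assuming a flat 1D memory block (names hypothetical):

real, allocatable, target :: block_1d(:)
real, pointer :: field(:, :, :)

allocate (block_1d(SZ*nx*(ny*nz/SZ)))
! remap the flat storage to the (SZ, nx, ny*nz/SZ) layout
field(1:SZ, 1:nx, 1:ny*nz/SZ) => block_1d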

To sum up, we need to improve the allocator module a little so that we can specify the shape of the array we need. I think it will be a simple fix to the get_block method in the allocator module, though it may require us to convert get_block from a function to a subroutine. Until this is fixed we can only run simulations where (nx_local == ny_local == nz_local).

I think it also makes sense to allocate a 1D array of the right size (nx*ny*nz) when create_block is executed, because then Fortran compilers should be more comfortable assuming that our pointers point to a contiguous piece of memory without any strides.

Shall we have a `host_allocator`?

We are capable of running the entire algorithm on CPUs or GPUs, but there are certain operations we need to carry out in host memory. One example is initialisation (as of now we only initialise the velocity fields for a TGV simulation, but it won't be too different for any other initial field).

x3d2/src/solver.f90

Lines 116 to 128 in eaeae17

do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         x = (i - 1)*globs%dx
         y = (j - 1)*globs%dy
         z = (k - 1)*globs%dz
         u_init(i, j, k) = sin(x)*cos(y)*cos(z)
         v_init(i, j, k) = -cos(x)*sin(y)*cos(z)
         w_init(i, j, k) = 0
      end do
   end do
end do

Because this is a one-time operation, it would be a waste to have dedicated functions for each backend. Therefore, we allocate an array in host memory, fill it up with the initial condition, and simply set the u, v, and w field_ts with these temporary arrays, which get deallocated soon after.

x3d2/src/solver.f90

Lines 130 to 132 in eaeae17

call solver%backend%set_field_data(solver%u, u_init)
call solver%backend%set_field_data(solver%v, v_init)
call solver%backend%set_field_data(solver%w, w_init)

This doesn't really cause any ongoing memory issues, because the initial field arrays are allocated and deallocated before the simulation starts; however, the current primitive support for output is a different story. The output Fortran arrays are allocated at the beginning and stay there until the end, increasing the memory footprint of a simulation running with the OpenMP backend. (With the CUDA backend there are 3 fields living in host memory, but we don't really care, as the limitation is normally on the GPU memory, and a few host arrays don't cause any harm.) Currently we just copy the solution fields back to these output arrays at the end of a simulation.

x3d2/src/solver.f90

Lines 668 to 670 in eaeae17

call self%backend%get_field_data(u_out, self%u)
call self%backend%get_field_data(v_out, self%v)
call self%backend%get_field_data(w_out, self%w)

This is very primitive and likely to change with proper IO, but the idea will be the same: if we allocate Fortran arrays to safely copy data back to host memory, they'll increase our memory footprint on the OpenMP backend.

A potential solution

A solution I have in mind involves having a host_allocator which, in a simulation with the OpenMP backend, just points to the default allocator, and on the CUDA backend points to an instance of allocator_t (as opposed to the cuda_allocator_t that our default allocator on the CUDA backend points to). We can set this up in the main code somewhere around here

x3d2/src/xcompact.f90

Lines 117 to 134 in eaeae17

#ifdef CUDA
   cuda_allocator = cuda_allocator_t(globs%nx_loc, globs%ny_loc, &
                                     globs%nz_loc, SZ)
   allocator => cuda_allocator
   print*, 'CUDA allocator instantiated'
   cuda_backend = cuda_backend_t(globs, allocator)
   backend => cuda_backend
   print*, 'CUDA backend instantiated'
#else
   omp_allocator = allocator_t(globs%nx_loc, globs%ny_loc, globs%nz_loc, SZ)
   allocator => omp_allocator
   print*, 'OpenMP allocator instantiated'
   omp_backend = omp_backend_t(globs, allocator)
   backend => omp_backend
   print*, 'OpenMP backend instantiated'
#endif
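A hedged sketch of the extra setup (host_allocator and its target instance are hypothetical names):

#ifdef CUDA
   ! a separate host-memory pool alongside the CUDA one
   host_allocator_instance = allocator_t(globs%nx_loc, globs%ny_loc, &
                                         globs%nz_loc, SZ)
   host_allocator => host_allocator_instance
#else
   ! on the OpenMP backend the host allocator is just the default allocator
   host_allocator => omp_allocator
#endif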

And then pass this host_allocator to the solver, where we can do

u_io => solver%host_allocator%get_block(DIR_C)

! we're certain %data exists because u_io is obtained from host_allocator
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         u_io%data(i, j, k) = sin((i - 1)*dx)*cos((j - 1)*dy)*cos((k - 1)*dz)
      end do
   end do
end do
call solver%backend%set_field_data(solver%u, u_io%data, u_io%dir)

call solver%host_allocator%release_block(u_io)

In case we're running the OpenMP backend, host_allocator and allocator are the same instance, so they share the same memory pool; unless we request more from them near the peak memory use (which occurs in transeq), this will only grab an available memory block, operate on it, and make it available again for later use.

This will definitely be useful for setting the ICs of the domain, and maybe for some test programs (#68 (comment)), but it's probably not worth it unless we can make use of this with the proper IO we're planning to have soon. So please share your thoughts, @pbartholomew08 @JamieJQuinn, in particular considering ADIOS and its requirements.

Have SZ, nx_loc, n_groups etc. accessible "globally"

At the moment all the variables that define block sizes are defined in various places and sometimes duplicated over multiple structures. For example, backend_t has nx_loc, ny_loc, nz_loc, but so has backend_t%xdirps%n etc.
SZ is accessed in m_omp_common or m_cuda_common depending on the backend, meaning we can't access it outside the backend functions (unless being passed as argument from there etc.).

It would be good to have a single, consistent way of accessing all of these variables. It would also reduce the number of arguments that need to be passed around.

Add a new interface in backends for summing two fields

With #37 we have three distinct subroutines carrying out similar work: sum_yintox, sum_zintox, and vecadd. Just like we did in the reorder subroutine, we can combine all of these into one interface with a switch statement.

backend%add(u, u_, SUM_Y2X)

We can make the switch statement optional, so that if nothing is passed the two arrays are summed without any reordering, reproducing the current vecadd.

The only thing is, we need this add to be able to scale u_, at least so that it can be used in the time integrator and also at the very end in the pressure correction step. I don't think we need this scaling option when we're adding a y- or z-oriented array into x. Given this is the case, is it better to have 2 interfaces, one combining sum_yintox and sum_zintox, and the other just dealing with vecadd, including scaling the input/output fields? Any thoughts?
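For discussion, a hedged sketch of what the single-interface variant might look like (argument names are hypothetical):

subroutine add(self, u, u_, reorder_op, scale)
   class(base_backend_t) :: self
   class(field_t), intent(inout) :: u
   class(field_t), intent(in) :: u_
   integer, optional, intent(in) :: reorder_op ! e.g. SUM_Y2X, SUM_Z2X
   real(dp), optional, intent(in) :: scale     ! scales u_ before the sum
end subroutine add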

Reuse result field in transeq_component as a temporary field

In transeq we call the _x, _y, and _z components separately, and each time we pass a field to store the result as well as the 3 temporary fields that the distributed algorithm requires. Because the result is assembled only at the very last step, we can actually pass only 2 temporary fields and use the result field in place of one temporary storage. This eliminates 1 field's worth of memory usage, reducing the total down to 14 (see #37). It won't make any difference in terms of performance in the CUDA backend; however, OpenMP backend performance will be improved slightly. On the OpenMP backend, reusing the result field as a temporary array removes an unnecessary single-field-sized read, reducing the total from 14 to 13 per transeq_component call and saving 7% of memory bandwidth, and I expect an equivalent speedup in this case.

x3d2/src/cuda/backend.f90

Lines 259 to 263 in 26ccadd

call exec_dist_transeq_3fused( &
   ru_dev, &
   u_dev, self%u_recv_s_dev, self%u_recv_e_dev, &
   u_dev, self%u_recv_s_dev, self%u_recv_e_dev, &
   du_dev, duu_dev, d2u_dev, &

ru_dev is the result (line 260); du_dev, duu_dev, and d2u_dev are the temporary fields (line 263).

x3d2/src/cuda/exec_dist.f90

Lines 112 to 113 in 26ccadd

call transeq_3fused_subs<<<blocks, threads>>>( &
   r_u, v, du, dud, d2u, &

And here on line 113, let's say we'll have r_u, v, dud, d2u only, because du will be stored in r_u.
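In other words, the call would become something like (a hedged sketch):

call transeq_3fused_subs<<<blocks, threads>>>( &
   r_u, v, dud, d2u, &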

Use GPU-GPU transfers directly into ghost buffer space in fields

Currently, ghost cells are stored in a separate buffer from the main field. This leads to complexity in subroutines like der_univ_dist in kernels_dist.f90. We should test:

  1. storing the ghost cell buffer directly inside the field array, eliminating the separate boundary calculations
  2. using CUDA-aware MPI to transfer this buffer region directly between field arrays (currently the buffer is copied into a separate array still on the GPU, then transferred)

At first glance, we definitely need to:

  1. change allocator to output arrays that include the ghost cells (rough sketch after this list)
  2. update transpose functions to understand new memory layout
  3. update send/recv ghost cell functions to send/recv directly into arrays
  4. update kernels which use ghost cells
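Step 1 might look roughly like this (n_halo and the layout names are assumptions): the allocator pads the derivative direction so ghost cells live inside the field array.

! pad the second (derivative) dimension by n_halo cells on each side
allocate (field(SZ, 1 - n_halo:n + n_halo, n_groups))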

Use more standard build directories/layout

CMake supports building into a standard /path/lib, /path/bin, etc. layout, e.g. x3d2/build/lib, x3d2/build/bin.

This will make it easier to find things (e.g. the executable) and ensure that generated (sub)modules are placed in a standard location for better editor integration.
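A minimal sketch of what this could look like, using the standard CMake output-directory variables (the exact layout is up for discussion):

# place executables, libraries, and generated .mod files in standard dirs
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_Fortran_MODULE_DIRECTORY ${CMAKE_BINARY_DIR}/mod)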

Split performance measurements from unit tests

Currently, performance measurements are performed at the end of some unit tests. These slow down the testing cycle and can be pulled out into their own subroutines. This will additionally help identify some common setup code that can be refactored out.

We also need:

  • A way to tag tests so a user can run only the unit or performance tests
  • A way of regularly running performance tests to check for performance regressions

fprettify...

fprettify doesn't allow fine-grained control over some of the whitespace preferences. Let's use this thread to list the "wrong" things fprettify does.

  • I'll start with the most important one: fprettify doesn't understand the chevron syntax
    call gpu_kernel<<<blocks, threads>>>(arguments, ...)
    and adds lots of spaces between < and >. Changing the settings for the whitespace around <, > messes up the if statements.

Implement differentiation on the GPU

Child issue of (#2).

To differentiate a line of data along the direction $j$, we need

  • The bulk differentiation stencil.
  • The two left-side differentiation stencils.
  • The two right-side differentiation stencils.
  • Coefficients for the forward pass of the Thomas algorithm.
  • Coefficients for the backward pass of the Thomas algorithm.
  • Coefficients for the upper diagonal of the tridiagonal coefficient matrix.

A cuda_diffengine_t object can hold a device copy of the differentiation stencils, in a way similar to the diffengine_t type:

type(stencil), device :: bulk_stencil
type(stencil), device :: left_stencils(2)
type(stencil), device :: right_stencils(2)

The CPU diffengine_t implementation is supported by a tridiagsolv_t type that is responsible for pre-computing the coefficient arrays and providing an entry-point procedure to solve a system. In a CUDA implementation, however, we'd like to fuse the stencil application with the forward pass of the tridiagonal algorithm. The fwd and bwd arrays could be made components of the cuda_diffengine_t type and computed during the execution of its constructor.

The cuda_diffengine_t's diff procedure could then contain the code for both the (stencil application + forward pass) and the backward pass of the tridiagonal algorithm. Something along the lines of:

i = threadIdx%x
b = blockIdx%x

! stencil application fused with the forward pass of the Thomas algorithm
! (boundary stencils at the ends of the line elided for brevity)
do j = 2, n
   du(j) = afi*(u(i, j + 1, b) - u(i, j - 1, b)) &
         & + bfi*(u(i, j + 2, b) - u(i, j - 2, b)) &
         & - du(j - 1)*fwd(j) ! Forward pass
end do

du(n) = du(n)*bwd(n)
do j = n - 1, 1, -1
   du(j) = (du(j) - up_coeff(j)*du(j + 1))*bwd(j)
end do

Add tests for `diffengine` with dirichlet conditions and second order derivatives

Creation of a diffengine instance and differentiation using its diff type-bound procedure is currently only tested with periodic boundary conditions.

A test for differentiation using currently implemented dirichlet boundary schemes (see src/stencil_definitions.f90) would presumably look very similar to the current test, with the exception of the object creation:

  diffeng = diffengine_t("compact6", length=n, order=1, dx=dx, &
                         & left_key="dirichlet", right_key="dirichlet")

Testing second-order differentiation would also look very similar, except for the order dummy argument to the diffengine constructor.

  ! Creates a diffengine for periodic second order differentiation
  ! with 6th order compact scheme.
  diffeng = diffengine_t("compact6", length=n, order=2, dx=dx)

Potential refactoring of the backends into a solver class and a backend class

We can do a refactoring so that the backend class only provides subroutines that execute kernels. To do this we need to extract a solver class from the current backend class. For example, the transeq operation is very specific to the Incompact3D algorithm, so we can define it under a solver class. We still need child classes for different architectures, as the specifics of the operations we carry out differ based on the architecture. However, most operations the solver needs rely on tridiagonal solvers, so the tridiagonal solver kernels can be provided via a backend class that is independent of the solver.

This could be useful when we want to extend the capabilities of the new framework, such as adding low Mach number support. The low Mach number solver can specify its own preferences and execute transeq in a slightly different way.
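A hedged sketch of the proposed split (type and member names are hypothetical):

! the solver owns algorithm-level operations such as transeq;
! the backend only exposes kernel execution (e.g. tridiagonal solves)
type :: solver_t
   class(base_backend_t), pointer :: backend
contains
   procedure :: transeq ! Incompact3D-specific, built on backend kernels
end type solver_t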

Possible cleanup needed in cuda complex reorder kernels

Currently, the subroutines in src/cuda/kernels/complex.f90 are considered temporary and are tied to the current implementation of the FFT Poisson solver using cuFFT. When this is reimplemented with cuFFTMP, these functions are likely to be removed.

Only if we decide to keep the existing cuFFT-specific reordering subroutines should we tidy them up and document them properly. This issue is just a reminder and can be closed if the relevant subroutines are removed as planned.

Support for stretched meshes

Incompact3D supports stretched meshes in the y-direction. My understanding is that it is pretty straightforward: after obtaining the solution from the regular tridiagonal solver, every line in the field along the y-direction gets pointwise-multiplied by an array to factor the stretching into the final result.

I plan to add support for stretching by passing this array of stretching factors to the second phase of the distributed algorithm and fusing the operation into the final substitution phase.
https://github.com/xcompact3d/x3d2/blob/main/src/cuda/kernels_dist.f90#L145-L149
So when this support is added, even when there is no stretching we'll be multiplying the final result by an array of ones, but I don't think there will be a meaningful performance difference once stretching support is in place.
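A minimal sketch of the fusion (all names hypothetical; sc/sw stand in for the backward-pass coefficients): the recurrence must keep using the unscaled value, so we hold it in a register and write only the stretched value to memory.

subroutine backward_subs_stretched(du, sc, sw, stretch, n)
   integer, intent(in) :: n
   real(dp), intent(inout) :: du(n)
   real(dp), intent(in) :: sc(n), sw(n), stretch(n)
   real(dp) :: cur, prev
   integer :: j

   prev = du(n)*sw(n)
   du(n) = prev*stretch(n)
   do j = n - 1, 1, -1
      cur = (du(j) - sc(j)*prev)*sw(j) ! standard backward pass, unscaled
      du(j) = cur*stretch(j)           ! stretched result written out
      prev = cur
   end do
end subroutine backward_subs_stretched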

Periodic thomas solver constructor assumes diagonal is `1`

Instances of the base tridiagsolv type can be constructed for an arbitrary tridiagonal matrix:

$$\begin{bmatrix} b_1 & c_1 & & & 0 \\ a_2 & b_2 & c_2 & & \\ & a_3 & b_3 & \ddots & \\ & & \ddots & \ddots & c_{n-1} \\ 0 & & & a_n & b_n \end{bmatrix}$$

Although the tridiagonal solver is expected to be used mostly in the case where $b_i$ are all 1, the base (nonperiodic) tridiagonal solver makes no assumption on these values.

! low == ai, up == ci, diag == bi
tsolver = tridiagsolv(low, up, diag)

The constructor of the periodic tridiagonal solver does, however:

  • construct_periodic only takes two arguments low and up
  • inherited components from the base type are based on a diag array that assumes the diagonal of the original matrix is 1:
    ! Runs non-periodic (base type) constructor which sets up forward
    ! and backward coefficient arrays.
    diag = [2., &
         & (1., i = 2, n - 1), &
         & 1. + up(1) * up(1) &
         & ]
    solver%tridiagsolv = tridiagsolv(low, up, diag=diag)

Instead, construct_periodic should take three arguments low, up, and diag just like the nonperiodic constructor, and initialise the base components as

diag = [diag(1) + solver%gamma, &
     & (diag(i), i = 2, n - 1), &
     & diag(n) - (low(1) * up(n)) / solver%gamma &
     & ]
solver%tridiagsolv = tridiagsolv(low, up, diag=diag)

See https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm#Variants

The tridiagonal solvers are currently only tested in cases where the diagonal is 1. To test the above, the Thomas solver types should be tested on one or more systems in which the diagonal coefficients are not all 1.

Unit testing for the Poisson solver

We plan to provide two different Poisson solvers: an iterative one and an FFT-based direct one. We need a high-level unit test for the Poisson solver so that both methods can be tested in a single program.

Implement slab transport contribution on the GPU

Slab transport evaluates the contribution to the transport term coming from derivatives along a single direction. Considering a slab of data in the $j$-orientation, we want to compute

$$u_j \frac{\partial u_i}{\partial x_j} + \frac{\partial (u_i u_j)}{\partial x_j} + \nu \frac{\partial^2 u_i}{\partial x_j^2}$$

Implementation of slab transport can then be split into three steps:

  1. Add a new cuda_diffengine_t derived type providing a diff bound procedure that executes a CUDA kernel to perform the differentiation (#3).
  2. Compute and combine derivatives into a transport_dir bound procedure on the cuda slab.
  3. Implement the top level transport bound procedure on the cuda slab type.

Although the GPU implementation might eventually differ from the CPU implementation for performance reasons (e.g. computation of all derivatives and their sum within a single CUDA kernel), I suggest that the first working version follows the CPU implementation closely.

Just like the diffengine_t type is responsible for instantiating a tridiagonal solver from the bulk and boundary differentiation schemes, the cuda_diffengine_t could hold device arrays and stencil instances for the Thomas solver coefficients and stencils, respectively. These device objects could be initialised from host code before calling the differentiation kernel.

Paire (sym) issue in the transport equation

This issue is about the infamous 'paire' in Incompact3D (renamed as sym in the new framework). It also ties into our discussion on Tuesday with @pbartholomew08 and @Nanoseb.

In the transport equation we call the directional derivatives paying attention to this 'paire', and pass the right preprocessed coefficient arrays (ff, fs, fw and sf, ss, sw).

We found a solution to this a while ago with @slaizet. Here is the plan:
Assume we have a function that obtains all the derivatives for the transport equation for all three momentum equations in one direction and writes them back to 3 solution fields;

call transeq_dir(du, dv, dw, u, v, w, ...)

in the x direction it has to be (see https://github.com/xcompact3d/Incompact3d/blob/master/src/transeq.f90#L117-L122)

du = der_paired(u*u) + u*der_nopair(u)
dv = der_nopair(u*v) + u*der_paired(v)
dw = der_nopair(u*w) + u*der_paired(w)

And in the y direction the paired/non-paired pattern looks a bit different (https://github.com/xcompact3d/Incompact3d/blob/master/src/transeq.f90#L190-L195):

du = der_nopair(v*u) + v*der_paired(u)
dv = der_paired(v*v) + v*der_nopair(v)
dw = der_nopair(v*w) + v*der_paired(w)

So we know how the final calls should look. To simplify this process, let's call the function that executes the x-directional derivatives with the argument order rearranged:

call transeq_dir(dv, du, dw, v, u, w, ...)

it'll do the following for us (swap all the u's with v's and all the v's with u's):

dv = der_paired(v*v) + v*der_nopair(v)
du = der_nopair(v*u) + v*der_paired(u)
dw = der_nopair(v*w) + v*der_paired(w)

(let's reorder the equations to make them look more natural);

du = der_nopair(v*u) + v*der_paired(u)
dv = der_paired(v*v) + v*der_nopair(v)
dw = der_nopair(v*w) + v*der_paired(w)

So we're just changing the order of the fields we pass, and then the paired/non-paired calls are always in one particular order inside the function. This way we hide the paire-specific stuff in the transport equation.
Does anyone foresee any issues resulting from this approach?

Back to the discussion on Tuesday: the main question was where we switch the order of u, v, w. The current state in the Backends PR is that we call transeq_dir in all directions in u, v, w order as shown below, and then in the CUDA or OpenMP backends we carry out the shift and call transeq_dir as v, u, w in y and w, u, v in z.

   subroutine transeq(self, du, dv, dw, u, v, w)
      implicit none

      class(base_backend_t) :: self
      class(field_t), intent(out) :: du, dv, dw
      class(field_t), intent(in) :: u, v, w

      ! x derivatives in the transport equation
      call self%transeq_x(du, dv, dw, u, v, w, self%xdirps)

      ! get u_y, v_y, w_y by rearranging u, v, w
      ! similar to the x direction, obtain derivatives in y.
      call self%transeq_y(du_y, dv_y, dw_y, u_y, v_y, w_y, self%ydirps)

      ! get u_z, v_z, w_z by rearranging u, v, w
      ! get the derivatives in z
      call self%transeq_z(du_z, dv_z, dw_z, u_z, v_z, w_z, self%zdirps)

This is actually what we agreed on Tuesday, but thinking about it a bit more I'm not sure it's the right way. If we carry out this argument rearrangement in the CUDA/OpenMP backends, we'll be doing something common at a lower level. I think this is conceptually wrong; we should do the reordering at the base backend level and call the transeq_dir functions as below. What do you think?

      call self%transeq_y(dv_y, du_y, dw_y, v_y, u_y, w_y, self%ydirps)
      call self%transeq_z(dw_z, du_z, dv_z, w_z, u_z, v_z, self%zdirps)

Reordering test fails when run in parallel

The current test is performed in serial, and passes. However, if run in parallel it fails with SIGFPE; @Nanoseb tracked this down to n_padded needing to be passed rather than n_loc.

Proposal:

  1. Duplicate the test for serial and parallel execution
  2. Rewrite reordering operations to use padded values

Reordering operations

Need reordering steps from:

  • x -> y
    • cuda
    • omp
  • x -> z
    • cuda
    • omp
  • y -> x
    • cuda
    • omp
  • y -> z
    • cuda
    • omp
  • z -> x
    • cuda
    • omp
  • z -> y
    • cuda
    • omp

Note: z -> x is not required as of 19/02/2024.

Curl operator requires z -> x, 27/02/2024.

Design a global logging system for better error reporting

Currently errors are reported throughout the codebase with regular print statements:

print *, "Error: something happened"

While this is generally fine, a more sophisticated system that looks something like...

logger.warn("This is just a warning")
> WARNING: This is just a warning

logger.error("This is a catastrophic error")
> ERROR: This is a catastrophic error

logger.info("Just reporting regular information")
> INFO: Just reporting regular information

would allow some nice benefits:

  • flexible, global redirection of different types of errors to files & other output streams
  • uniform hierarchy of messages (e.g. error, warn, info, debug, etc.)

Mainly just adding this issue to discuss whether we want this & what it could look like.
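As a starting point for discussion, a minimal sketch of such a logger in plain Fortran (all names hypothetical, no fypp yet):

module m_logger
   use iso_fortran_env, only: error_unit
   implicit none

   integer, parameter :: LOG_ERROR = 1, LOG_WARN = 2, LOG_INFO = 3
   integer :: log_level = LOG_INFO ! messages above this level are dropped

contains

   subroutine log_msg(level, msg)
      integer, intent(in) :: level
      character(*), intent(in) :: msg
      character(7), parameter :: tags(3) = &
         [character(7) :: 'ERROR', 'WARNING', 'INFO']

      if (level <= log_level) then
         write (error_unit, '(a, a, a)') trim(tags(level)), ': ', msg
      end if
   end subroutine log_msg

end module m_logger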

Tasks:

  • Integrate fypp into build system
  • Write logger
  • Migrate all existing print statements to logger
  • Catch future uses of print with some static analysis

Reorder kernels in CUDA backend fail with SZ=64

This is due to us going above the maximum thread count per block when we set SZ=64. Some of the reorder kernels require a 2D thread block with SZ*SZ threads. We need to fix these kernels so that when SZ is 64 or above they work on tiles of 32 by 32.
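A hedged sketch of the launch-configuration fix (dim3 comes from the cudafor module; n_blocks is a stand-in for the kernel's block count, and each thread would then loop over its share of the SZ-by-SZ tile):

! cap the thread block at 32x32 = 1024 threads
threads = dim3(32, 32, 1)
blocks = dim3((SZ + 31)/32, (SZ + 31)/32, n_blocks)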

Check impact of -ffast-math on performance & correctness

Just noticed we're setting -ffast-math or -fast in the release build. For a summary of why this is dangerous yet helpful, see this blog post. I haven't personally found it to affect the correctness of fluid simulations, but I also haven't found it to improve performance, so I tend to stick to -O3 just in case.

I think it's worth testing

  1. whether -ffast-math actually improves performance, and
  2. whether we're potentially affecting correctness of results

Implement local to global and global to local index mapping

With the backend-dependent layout we should have method(s) to obtain the local index (i,j,k) and from that the global index (or vice versa). This will enable users who need the Cartesian coordinates to get them without duplicating code.
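A minimal sketch of the local-to-global direction (the offset arguments are assumptions, presumably derived from the rank's position in the decomposition):

! map a local index to the global index given this rank's offsets
pure function local_to_global(i, j, k, nx_offset, ny_offset, nz_offset) &
   result(global)
   integer, intent(in) :: i, j, k, nx_offset, ny_offset, nz_offset
   integer :: global(3)

   global = [i + nx_offset, j + ny_offset, k + nz_offset]
end function local_to_global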

Refactoring tridiagonal solver related parameters, coefficient arrays, and preprocessor functions into a class

We've implemented a universal tridiagonal solver to use with compact schemes, in both the CUDA backend and the OpenMP backend. These kernels are implemented in a way that isolates them as much as possible, and this is highlighted in the tests.

Now we need to think about a class we can use to store all the data these kernels need when solving the tridiagonal systems. We can create multiple instances of this class and store all the data associated with different operations in their own instances. I think we can pass input arguments when instantiating this class, and the class constructor can immediately do all the preprocessing. This somewhat resembles the old diffengine class we developed for the Thomas algorithm, but in this new approach the actual solver and the data are completely decoupled from each other. A benefit of such decoupling is that the kernels can be implemented in C and then called from Fortran anywhere else.

For the moment, I believe it is better to handle halo data and manage its required storage at the backend level, not within this tridiagonal solver preprocessor class.
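A hedged sketch of such a class (the member names are borrowed from the exec_dist excerpt below; everything else is hypothetical):

type :: tdsops_t
   integer :: n
   ! preprocessed coefficient arrays for the distributed solver
   real(dp), allocatable :: dist_sa(:), dist_sc(:)
contains
   procedure :: preprocess ! called by the constructor
end type tdsops_t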

Performance improvement in transeq in the OpenMP backend

I'm copying my comment in #27 so that we don't forget about it.

x3d2/src/omp/exec_dist.f90

Lines 164 to 185 in 2d906a5

do k = 1, n_block
   call der_univ_subs(du(:, :, k), &
                      du_recv_s(:, :, k), du_recv_e(:, :, k), &
                      tdsops_du%n, tdsops_du%dist_sa, tdsops_du%dist_sc)
   call der_univ_subs(dud(:, :, k), &
                      dud_recv_s(:, :, k), dud_recv_e(:, :, k), &
                      tdsops_dud%n, tdsops_dud%dist_sa, tdsops_dud%dist_sc)
   call der_univ_subs(d2u(:, :, k), &
                      d2u_recv_s(:, :, k), d2u_recv_e(:, :, k), &
                      tdsops_d2u%n, tdsops_d2u%dist_sa, tdsops_d2u%dist_sc)
   do j = 1, n
      !$omp simd
      do i = 1, SZ
         rhs(i, j, k) = -0.5_dp*(v(i, j, k)*du(i, j, k) + dud(i, j, k)) &
                        + nu*d2u(i, j, k)
      end do
      !$omp end simd
   end do
end do

I realised that here we're unnecessarily writing 3 field-sized arrays to main memory, potentially increasing the runtime by 20%.

In the second phase of the algorithm we pass parts of the du, dud, and d2u arrays into der_univ_subs, and they're all rewritten in place. Later we combine them into rhs for the final result. Ideally, we want du, dud, and d2u to be read once and rhs to be written only once. However, because of the way der_univ_subs works, the updated data in the du arrays gets written to main memory after the der_univ_subs call, even though we don't need it at all.

There are three ways we can fix this

  • In the parallel do loop in the second phase we can copy the relevant parts of the du, dud, and d2u arrays into (SZ, n) sized temporary arrays. We then pass the temporary arrays into der_univ_subs, and at the end use these temporaries to obtain the final rhs. This is the easiest solution, but it may not be the best in terms of performance (see the sketch after this list).
  • We can write an alternative der_univ_subs with separate input and output arrays. This way we can pass parts of the du arrays as we do now, and pass a small temporary array as the output. Because the du arrays will be input-only, no data will be written to main memory. Then we can combine the temporaries to get rhs.
  • If we're writing an alternative der_univ_subs to be used in transeq, we can go one step further and have a fused version of it. This would probably be the most performant solution; der_univ_subs is relatively lightweight, so it isn't really hard to do. The new subroutine can take all of du, dud, and d2u as input and write the final result rhs.
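A minimal sketch of the first option, using the names from the excerpt above (du_tmp, dud_tmp, d2u_tmp are hypothetical (SZ, n) temporaries, private to each thread in the parallel region):

do k = 1, n_block
   ! the in-place updates now land in the temporaries, so the full
   ! du/dud/d2u arrays are never written back to main memory
   du_tmp(:, :) = du(:, :, k)
   dud_tmp(:, :) = dud(:, :, k)
   d2u_tmp(:, :) = d2u(:, :, k)
   call der_univ_subs(du_tmp, du_recv_s(:, :, k), du_recv_e(:, :, k), &
                      tdsops_du%n, tdsops_du%dist_sa, tdsops_du%dist_sc)
   call der_univ_subs(dud_tmp, dud_recv_s(:, :, k), dud_recv_e(:, :, k), &
                      tdsops_dud%n, tdsops_dud%dist_sa, tdsops_dud%dist_sc)
   call der_univ_subs(d2u_tmp, d2u_recv_s(:, :, k), d2u_recv_e(:, :, k), &
                      tdsops_d2u%n, tdsops_d2u%dist_sa, tdsops_d2u%dist_sc)
   do j = 1, n
      !$omp simd
      do i = 1, SZ
         rhs(i, j, k) = -0.5_dp*(v(i, j, k)*du_tmp(i, j) + dud_tmp(i, j)) &
                        + nu*d2u_tmp(i, j)
      end do
      !$omp end simd
   end do
end do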
