amrex-astro / mini-castro

7.0 7.0 5.0 27.79 MB

a mini-app version of castro

License: MIT License

Makefile 0.91% Fortran 59.70% Python 2.84% C++ 24.96% C 2.33% Shell 9.26%

mini-castro's Issues

AMR does not work on GPUs

When using inputs.128 with amr.max_level = 1 on summitdev, bad data is generated and the timesteps get really small. This does not occur with inputs.64, and does not occur in debug mode.

PGI 17.9 + OpenACC fails to compile eos_type.F90

Compiling eos_type.F90 with PGI 17.9 and OpenACC enabled fails with the following errors:

PGF90-W-0435-Array declared with zero size (..//Source/eos_type.F90: 132)
  0 inform,   1 warnings,   0 severes, 0 fatal for eos_type_module
PGF90-S-0155-Module variables used in acc routine need to be in !$acc declare create() - zion$p (..//Source/eos_type.F90: 190)
PGF90-S-0155-Module variables used in acc routine need to be in !$acc declare create() - zion$sd (..//Source/eos_type.F90: 190)
composition:
    173, Generating acc routine seq
         Generating Tesla code
  0 inform,   0 warnings,   2 severes, 0 fatal for composition
PGF90-S-0155-Module variables used in acc routine need to be in !$acc declare create() - small_x$p (..//Source/eos_type.F90: 214)
PGF90-S-0155-Accelerator region ignored; see -Minfo messages  (..//Source/eos_type.F90)
normalize_abundances:
      0, Accelerator region ignored
    214, Accelerator restriction: invalid loop
  0 inform,   0 warnings,   2 severes, 0 fatal for normalize_abundances
clean_state:
    224, Generating acc routine seq
         Generating Tesla code
eos_get_small_temp:
    255, Generating acc routine seq
         Generating Tesla code
eos_get_small_dens:
    268, Generating acc routine seq
         Generating Tesla code
eos_get_max_temp:
    282, Generating acc routine seq
         Generating Tesla code
eos_get_max_dens:
    296, Generating acc routine seq
         Generating Tesla code
make: *** [tmp_build_dir/o/3d.pgi.MPI.EXE/eos_type.o] Error 2

This occurs on commit eaa153c ("Add a .gitignore for Util/").

Write code generation tool for creation of CUDA wrappers to device functions

This issue describes a recommended approach for launching Fortran functions as CUDA kernels. It uses as an example the Castro function

ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi)

To run this function on the device, a CUDA kernel must be launched first, and the function is then called inside that kernel as a device function. This should be done by wrapping the declaration in DEVICE_LAUNCHABLE(), as:

DEVICE_LAUNCHABLE(ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi));

(When we're not compiling for the device, this will be a simple C++ preprocessor function macro that does nothing.)
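A minimal host-only sketch of that do-nothing form follows. It assumes the macro supplies the void return type (as the device expansion in this issue suggests) and uses a typedef as a stand-in for Real; the empty function body is a hypothetical placeholder for the Fortran routine. The macro is variadic because the wrapped declaration contains commas, which the preprocessor would otherwise treat as separate macro arguments.

```cpp
// Host build: the wrapper only turns its argument into a plain declaration.
// Variadic so that the commas inside the argument list are passed through.
#define DEVICE_LAUNCHABLE(...) void __VA_ARGS__

typedef double Real;  // assumption: stand-in for the code's Real type

// Expands to: void ca_compute_temp(const int* lo, ..., const int* state_hi);
DEVICE_LAUNCHABLE(ca_compute_temp(const int* lo, const int* hi, const Real* state,
                                  const int* state_lo, const int* state_hi));

// Hypothetical placeholder body so the sketch links on its own;
// in the real code this routine is implemented in Fortran.
void ca_compute_temp(const int* lo, const int* hi, const Real* state,
                     const int* state_lo, const int* state_hi)
{
    (void)lo; (void)hi; (void)state; (void)state_lo; (void)state_hi;
}
```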

This should be expanded to:

__device__ void ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

That is, it should prepend __device__ to the declaration of the target Fortran function (which must have attributes(device) added to it manually). It should also create a second, __global__ function declaration with the same arguments and the name prefixed with cuda_.
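As a rough illustration of the text transformation such a tool would perform, the sketch below builds both declarations from the text inside DEVICE_LAUNCHABLE(...). The names Wrappers and make_wrappers are hypothetical, and the parsing assumes the input has the simple form name(args) with no leading whitespace or return type.

```cpp
#include <stdexcept>
#include <string>

// The two declarations generated for one wrapped routine.
struct Wrappers {
    std::string device_decl;  // __device__ declaration of the Fortran routine
    std::string global_decl;  // __global__ declaration of the cuda_ kernel
};

// `signature` is the text inside DEVICE_LAUNCHABLE(...), for example
// "ca_compute_temp(const int* lo, const int* hi)".
inline Wrappers make_wrappers(const std::string& signature)
{
    const std::string::size_type paren = signature.find('(');
    if (paren == std::string::npos || paren == 0) {
        throw std::invalid_argument("expected a signature of the form name(args)");
    }
    const std::string name = signature.substr(0, paren);
    const std::string args = signature.substr(paren);  // keeps the parentheses

    Wrappers w;
    w.device_decl = "__device__ void " + name + args + ";";
    w.global_decl = "__global__ void cuda_" + name + args + ";";
    return w;
}
```

A real tool would also have to emit the kernel body and rewrite the call sites, which this sketch does not attempt.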

The new cuda_ function should look like:

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const amrex::Real* state, const int* state_lo, const int* state_hi)
{
    int blo[3];
    int bhi[3];
    get_loop_bounds(blo, bhi, lo, hi);
    ca_compute_temp(blo, bhi, state, state_lo, state_hi);
}

and should be defined in a separate compilation unit, not the header file (a reasonable choice would be a single .cpp file that contains all of the newly created CUDA functions).

Note that get_loop_bounds is a function that is found in AMReX_Device.H.
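To make the role of the per-thread bounds concrete, here is a host-side stand-in, not the AMReX implementation. It assumes each thread owns exactly one zone and takes the thread's global 3D index as an explicit argument (tid, a hypothetical parameter), whereas the device routine would read the CUDA thread and block indices itself.

```cpp
#include <algorithm>

// Hypothetical host-side stand-in for get_loop_bounds.
// blo/bhi receive the one-zone box owned by the thread with index `tid`.
inline void get_loop_bounds_sketch(int blo[3], int bhi[3],
                                   const int lo[3], const int hi[3],
                                   const int tid[3])
{
    for (int d = 0; d < 3; ++d) {
        blo[d] = lo[d] + tid[d];           // the single zone for this thread
        bhi[d] = std::min(blo[d], hi[d]);  // bhi < blo means an empty loop range
    }
}
```

With this convention a thread whose index falls outside the box gets bhi < blo in at least one dimension, so a Fortran loop over blo:bhi executes zero iterations.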

The corresponding call to this function should be:

DEVICE_LAUNCH(ca_compute_temp(lo, hi, state, state.loVect(), state.hiVect()));

This should be replaced by:

 dim3 numThreads, numBlocks;
 amrex::Device::c_threads_and_blocks(lo, hi, numBlocks, numThreads);
 cuda_ca_compute_temp<<<numBlocks, numThreads, 0, amrex::Device::cudaStream()>>>(lo, hi, state, state.loVect(), state.hiVect());

Note that this makes it a requirement that lo and hi are the first two arguments to the function. This way they can be replaced by the zone index corresponding to each CUDA thread.

AmrLevel::Derive() generates bad data

With certain combinations of boxes, e.g. inputs.128 with amr.max_grid_size=32, the diagnostics report NaNs. This occurs because the calls to derive that fill in the fields for the diagnostics generate data containing NaNs, which suggests a problem in FillPatch.

Experiment with single precision

We should compile StarLord in single precision and compare the results to double precision, looking at both performance and relative accuracy. The main difficulties are:

We sometimes use large numbers in the code such as 1.d200, which could be replaced with HUGE().
In astrophysics we often represent quantities with numbers that exceed the single-precision floating-point range, for example white dwarf systems where the total energy is O(1e50). A solution here is to use scaled numbers in the simulation itself (say, scaled by the maximum value on the domain in the initial conditions) and undo the scaling only for I/O, when we need the absolute numbers.
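The range problem and the scaling idea can be sketched in a few lines. IEEE single precision tops out near 3.4e38, so a value of order 1e50 cannot be stored directly, but the same value divided by a reference scale can. The helper names below are hypothetical, and the choice of scale (for example the domain maximum in the initial conditions) is left to the caller.

```cpp
#include <cmath>
#include <limits>

// True if `value` is too large in magnitude to be stored as a float.
inline bool overflows_float(double value)
{
    return std::fabs(value) > static_cast<double>(std::numeric_limits<float>::max());
}

// Store a quantity relative to a reference scale so it fits in single precision.
inline float to_scaled(double value, double scale)
{
    return static_cast<float>(value / scale);
}

// Undo the scaling, e.g. when writing absolute numbers for I/O.
inline double from_scaled(float scaled, double scale)
{
    return static_cast<double>(scaled) * scale;
}
```

For instance, overflows_float(1.0e50) is true, while to_scaled(2.5e50, 1.0e50) is a perfectly ordinary float of about 2.5 that from_scaled converts back for output.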

RFC: should the default CUDA version be upgraded from 8 to 9?

Currently the default CUDA version that StarLord anticipates is version 8. However, MPI + CUDA codes on summitdev crash at run time if using nvprof with CUDA 8. This crash is avoided if one compiles with version 8 but runs with version 9 (the latest available on summitdev). Is there a reason not to change the default compilation version to 9 to fix this problem without requiring the above workaround?

Check for filcc validity

The "thread-safe" version of filcc that was written to accommodate CUDA support in StarLord may not fill corner zones with valid values. Need to check this and fix it if so.

Can't compile eos_type.f90 with gcc/6.3.1 + OpenACC

Hi,

On commit eaa153c ("Add a .gitignore for Util/") I encounter the following errors when compiling eos_type.F90 with GCC 6.3.1 and OpenACC:

..//Source/eos_type.F90:177:38:

..//Source/eos_type.F90:175:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:177:38:

     use bl_constants_module, only: ONE
                                      1
Error: USE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:178:43:

..//Source/eos_type.F90:175:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:178:43:

     use network, only: aion, aion_inv, zion
                                           1
Error: USE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:180:17:

..//Source/eos_type.F90:175:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:180:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:207:38:

..//Source/eos_type.F90:205:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:207:38:

     use bl_constants_module, only: ONE
                                      1
Error: USE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:208:43:

..//Source/eos_type.F90:205:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:208:43:

     use extern_probin_module, only: small_x
                                           1
Error: USE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:210:17:

..//Source/eos_type.F90:205:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:210:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:228:17:

..//Source/eos_type.F90:226:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:228:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:259:17:

..//Source/eos_type.F90:257:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:259:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:272:17:

..//Source/eos_type.F90:270:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:272:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:286:17:

..//Source/eos_type.F90:284:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:286:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
..//Source/eos_type.F90:300:17:

..//Source/eos_type.F90:298:21:

     !$acc routine seq
                     2
..//Source/eos_type.F90:300:17:

     implicit none
                 1
Error: IMPLICIT NONE statement at (1) cannot follow !$ACC ROUTINE statement at (2)
make: *** [tmp_build_dir/o/3d.gnu.MPI.EXE/eos_type.o] Error 1

Reduce memory usage by using only one edge state array

At present sxm, sxp, etc. have size NQ, since we compute all of the edge-state reconstructions first, and then in a separate step "trace" under them (which for MOL is a no-op). In Castro this was simplified (4f84d83b283a35551a1ea523f582294ce1e4f02a) to use only a single copy of the edge state array, and then loop from 1 to NQ, in each case doing the PPM reconstruction and then immediately doing the "tracing." If we can do this in StarLord, we can cut down on the memory footprint by quite a bit, since there are six of these arrays and NQ is ~20 when we use aprox13.
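A back-of-the-envelope estimate of the saving, assuming each of the six arrays shrinks from NQ components to one, double-precision storage, and ignoring ghost zones. The 32^3 box used in the note below is a hypothetical example size, and edge_state_bytes is an illustrative helper, not part of the code base.

```cpp
#include <cstddef>

// Bytes needed for `narrays` edge-state arrays over an nx*ny*nz box,
// each holding `ncomp` components of `bytes_per_value` bytes.
inline std::size_t edge_state_bytes(std::size_t nx, std::size_t ny, std::size_t nz,
                                    std::size_t narrays, std::size_t ncomp,
                                    std::size_t bytes_per_value)
{
    return nx * ny * nz * narrays * ncomp * bytes_per_value;
}
```

For a 32^3 box, edge_state_bytes(32, 32, 32, 6, 20, 8) gives 31457280 bytes (30 MiB), while the single-component version edge_state_bytes(32, 32, 32, 6, 1, 8) gives 1572864 bytes (1.5 MiB), a factor of NQ = 20 smaller.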

This cannot be done trivially because we currently launch separate kernels for the reconstruction and tracing steps. But perhaps there is a related approach we can take here.

ptxas application ptx input, line 9; fatal   : Unsupported .version 6.0; current version is '5.0'
ptxas fatal   : Ptx assembly aborted due to errors
nvlink fatal   : elfLink fatbinary error
pgacclnk: child process exit status 2: /autofs/nccs-svm1_sw/summitdev/.swci/0-core/opt/spack/20171006/linux-rhel7-ppc64le/gcc-4.8.5/pgi-17.10-y76zzohzqn6lgwv6ktlihmfwj5ebdp2x/linuxpower/17.10/bin/pgnvd
make: *** [Castro3d.pgi.MPI.CUDA.ex] Error 2

$ module list

Currently Loaded Modules:
  1) hsi/5.0.2.p5   2) xalt/0.7.5   3) lsf-tools/1.0   4) DefApps   5) pgi/17.10   6) spectrum-mpi/10.2.0.0-20180110   7) git/2.13.0   8) cuda/9.0.69   9) emacs/25.1  10) python/3.5.2

Seems to work with CUDA 8.

StarLord does not work with the IBM compiler on GPUs

It compiles, but completely fails at runtime. Need to investigate why.
