luchete80 / weldformgpu

CUDA implementation of Weakly-Compressible Smoothed Particle Hydrodynamics for Elasto Plastic and thermal coupled Mechanics

License: GNU General Public License v3.0

CMake 0.65% C++ 68.13% Cuda 22.14% C 2.72% Batchfile 0.01% Jupyter Notebook 6.35%
cuda deformable-bodies-dynamics deformable-solids gpu nvidia plasticity smoothed-particle-hydrodynamics solid-mechanics sph

weldformgpu's Introduction

WeldFormGPU is the GPU version of WeldForm. It is being migrated to become the combined CPU / GPU solver.

Features

It has been adapted exclusively to solid mechanics, and it includes:

  • Mechanic Solver
  • Thermal Solver
  • Coupled ThermoMechanical Solver (in progress)
  • Contact formulation (in progress)
  • Adaptive search only in case of plastic strain threshold (in progress)


It is heavily based on Kirk Fraser's thesis and works on an SPH model of Friction Stir Welding (FSW).

Requirements

GPU Version

  1. Solver executables. Check which version corresponds to your GPU [here].
  2. Visual Studio Runtime Libraries [here](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170)
  3. CUDA runtime libraries.

Building Instructions

  1. Install Visual Studio Community 2019 (not tested on 2022 yet)
  2. Install CMake
  3. Download and install the CUDA compilers (tested only on version 11.4)
  4. Set the CUDA compiler path on CMD: set CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1"
  5. Create a directory for the build binaries
  6. Run make.bat (located in the root directory) to set the MSVC environment variables
  7. Clone this repo (for example to c:\WeldFormGPU\src)
  8. Inside the build directory run: cmake c:\WeldFormGPU\src -G "NMake Makefiles"
  9. Inside the build directory run: nmake.exe


weldformgpu's People

Contributors

luchete80

Forkers

xupeiwust simzc

weldformgpu's Issues

SOA vs AOS

https://stackoverflow.com/questions/17924705/structure-of-arrays-vs-array-of-structures
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)

Does this make sense in an all-GPU code?


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <thrust/device_vector.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

#define BLOCKSIZE   1024

/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {

    unsigned int    x1;
    unsigned int    x2;
    unsigned int    code;
    bool            done;

};

/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {

    unsigned int    *x1;
    unsigned int    *x2;
    unsigned int    *code;
    bool            *done;

};


/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {
        cellAoS tempCell = d_cells[tid];

        tempCell.x1 = tempCell.x1 + 10;
        tempCell.x2 = tempCell.x2 + 10;

        d_cells[tid] = tempCell;
    }

}

/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        d_x1[tid] = d_x1[tid] + 10;
        d_x2[tid] = d_x2[tid] + 10;

    }

}

/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        cell.x1[tid] = cell.x1[tid] + 10;
        cell.x2[tid] = cell.x2[tid] + 10;

    }

}

/********/
/* MAIN */
/********/
int main() {

    const int N = 2048 * 2048 * 4;

    TimingGPU timerGPU;

    thrust::host_vector<cellAoS>    h_cells(N);
    thrust::device_vector<cellAoS>  d_cells(N);

    thrust::host_vector<unsigned int>   h_x1(N);
    thrust::host_vector<unsigned int>   h_x2(N);

    thrust::device_vector<unsigned int> d_x1(N);
    thrust::device_vector<unsigned int> d_x2(N);

    for (int k = 0; k < N; k++) {

        h_cells[k].x1 = k + 1;
        h_cells[k].x2 = k + 2;
        h_cells[k].code = k + 3;
        h_cells[k].done = true;

        h_x1[k] = k + 1;
        h_x2[k] = k + 2;

    }

    d_cells = h_cells;

    d_x1 = h_x1;
    d_x2 = h_x2;

    cellSoA cell;
    cell.x1 = thrust::raw_pointer_cast(d_x1.data());
    cell.x2 = thrust::raw_pointer_cast(d_x2.data());
    cell.code = NULL;
    cell.done = NULL;

    timerGPU.StartCounter();
    AoSvsSoA_v1<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_cells.data()), N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());

    //timerGPU.StartCounter();
    //AoSvsSoA_v2<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
    //gpuErrchk(cudaPeekAtLastError());
    //gpuErrchk(cudaDeviceSynchronize());
    //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());

    timerGPU.StartCounter();
    AoSvsSoA_v3<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(cell, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());

    h_cells = d_cells;

    h_x1 = d_x1;
    h_x2 = d_x2;

    // --- Check results
    for (int k = 0; k < N; k++) {
        if (h_x1[k] != k + 11) {
            printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);
            break;
        }
        if (h_x2[k] != k + 12) {
            printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);
            break;
        }
        if (h_cells[k].x1 != k + 11) {
            printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);
            break;
        }
        if (h_cells[k].x2 != k + 12) {
            printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);
            break;
        }
    }

}

How is matrix multiplication done? Shared memory

As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.

Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad memory (or software managed cache) to minimize global memory accesses from a CUDA block as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 7. A is therefore read B.width times from global memory and B is read A.height times.
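The code sample referenced above did not survive the copy. Below is a minimal sketch of that non-shared-memory version, following the structure described (the Matrix struct and kernel name are illustrative, not types from this repo):

// Minimal sketch of matrix multiplication without shared memory: each thread
// reads one full row of A and one full column of B to compute one element of C.
// Matrices are stored row-major in linear memory.
struct Matrix {
    int width;
    int height;
    float *elements;
};

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C) {

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row >= C.height || col >= C.width) return;

    float Cvalue = 0.0f;
    for (int k = 0; k < A.width; ++k)
        Cvalue += A.elements[row * A.width + k] * B.elements[k * B.width + col];

    C.elements[row * C.width + col] = Cvalue;
}

The shared-memory variant tiles A and B into __shared__ sub-blocks so each global-memory element is read only once per tile instead of once per thread.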

Neib test in 1D or 2D format

https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, and here is a fully worked example. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.

general, dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there is additional, non-trivial complexity
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers instead of 1
  • (note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration of that)

Also, here is a thrust method for building a general dynamically allocated 2D array.

flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
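A minimal sketch of the flattened / "simulated 2D" access pattern (illustrative kernel and names, not code from this repo):

// A rows x cols array stored as one contiguous 1D allocation; element (row, col)
// is addressed as data[row * cols + col], so no pointer tree is needed on the device.
__global__ void scale2D(float * __restrict__ data, int rows, int cols, float factor) {

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < rows && col < cols)
        data[row * cols + col] *= factor;
}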

general, dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so even less efficient. Here is a fully worked example (2nd code example).

special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case, with no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
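A short sketch of that special case, assuming the width is a compile-time constant (names and sizes are illustrative):

#define NCOLS 128   // array width known at compile time

// The parameter type carries the width, so arr[row][col] compiles down to a single
// flat access (row * NCOLS + col); no second pointer dereference is involved.
__global__ void addOne(float (*arr)[NCOLS], int nrows) {

    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < nrows)
        for (int col = 0; col < NCOLS; ++col)
            arr[row][col] += 1.0f;
}

// Host side (sketch): allocate nrows * NCOLS floats with cudaMalloc, then launch as
// addOne<<<grid, block>>>((float (*)[NCOLS])d_ptr, nrows);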

doubly-subscripted host code, singly-subscripted device code:

Finally another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree", however this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so less efficient, and there is some complexity associated with building the pointer "tree", for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.

Accessing arrays inside the kernel creates an error

void __global__ ThermalSolveKernel (double *dTdt,
                                    double3 *x, double *h,
                                    double *m, double *rho,
                                    double *T, double *k_T, double *cp,
                                    int *neib_part, int *neib_offs /*or count*/) {

    printf("searching nb..\n");
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    printf("dT: %f\n", dTdt[i]);

CUDA operator interaction errors

[ 11%] Building CUDA object src/cuda/CMakeFiles/cuda.dir/Interaction.cu.obj
Interaction.cu
D:/Luciano/Numerico/persiangpu/src/cuda/Interaction.cu(105): error: more than one operator "*" matches these operands:
            function "operator*(const float &, const float3 &)"
            function "operator*(double, const Vector &)"
            operand types are: double * float3

(The same ambiguous "operator*" error is reported a second time at Interaction.cu(105) and again at lines 142, 146, 155, 158, 162, 166 and 185.)

Some dTdt have wrong values

dTdt: 0.000000
dTdt: 0.000000
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: 0.000000
dTdt: 0.000000

ambiguous operator

src\cuda\interaction.cu(142): error: more than one operator "*" matches these operands:
            function "operator*(double, const Vector &)"
                D:\Luciano\Numerico\persiangpu\src\Vector.h(99): here
            function "operator*(const float &, const float3 &)"
                D:\Luciano\Numerico\persiangpu\src\vector_math.h(511): here
            operand types are: double * float3
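The compiler sees two usable overloads for double * float3, operator*(const float &, const float3 &) and operator*(double, const Vector &), and neither is a better match. A minimal sketch of one way out, assuming the float3 overload is the intended one, is to make the scalar's type explicit at the call site:

// Sketch only: 'scalar' and 'v' are illustrative names, not identifiers from Interaction.cu.
double scalar = 0.5;
float3 v = make_float3(1.0f, 2.0f, 3.0f);

// float3 w = scalar * v;        // ambiguous: double * float3 matches both overloads
float3 w = (float)scalar * v;    // exact match for operator*(const float &, const float3 &)

The alternatives are to add an exact overload taking (double, const float3 &) to the math header, or to avoid including both conflicting operator sets in Interaction.cu.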

Memory allocation host2dev copy crash main test program

This crashes the main thermal test program:

	// double3 *x =  (double3 *)malloc(dom.Particles.size());
	// for (int i=0;i<dom.Particles.size();i++){
		// x[i] = make_double3(dom.Particles[i]->x);
	// }
	// int size = dom.Particles.size() * sizeof(double3);
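Two things stand out in the commented block: malloc is given dom.Particles.size() bytes instead of size() * sizeof(double3), and nothing is ever copied to the device. A hedged sketch of the intended staging (make_double3 taking a Particle's x is assumed to exist as used above; otherwise the three components have to be passed individually):

	// Sketch only, not the repo's actual code:
	int count = dom.Particles.size();
	size_t bytes = count * sizeof(double3);          // size in bytes, not element count

	double3 *h_x = (double3 *)malloc(bytes);
	for (int i = 0; i < count; i++)
		h_x[i] = make_double3(dom.Particles[i]->x);  // assumed overload, as in the block above

	double3 *d_x = NULL;
	cudaMalloc(&d_x, bytes);
	cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);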

Change Stress to float3*, symtensor or float2**

../gpusph/src/integrators/RepackingIntegrator.cc: BUFFER_DKDE | BUFFER_TAU |
../gpusph/src/define_worker_commands.h: (BUFFER_FORCES | BUFFER_XSPH | BUFFER_TAU | BUFFER_DKDE | BUFFER_CFL |
../gpusph/src/cuda/visc.cu: float2 **tau = bufwrite.getRawPtr<BUFFER_TAU>();
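For reference, the gpusph buffers above pack the symmetric tau tensor (6 unique components per particle) into three float2 arrays. A sketch of the same layout for a stress tensor, making no assumption about WeldFormGPU's current Stress type:

// Sketch: structure-of-arrays storage of a symmetric 3x3 tensor, gpusph-style.
// Component order per particle: s0 = (xx, xy), s1 = (xz, yy), s2 = (yz, zz).
struct SymTensorSoA {
    float2 *s0;
    float2 *s1;
    float2 *s2;
};

__device__ inline float trace(const SymTensorSoA t, int i) {
    return t.s0[i].x + t.s1[i].y + t.s2[i].y;   // xx + yy + zz
}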

Undefined references of Domain

/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `main':
main.cpp:(.text+0x1ec): undefined reference to `SPH::Domain::Kernel_Set(Kernels_Type const&)'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::WholeVelocity()':
main.cpp:(.text._ZN3SPH6Domain13WholeVelocityEv[_ZN3SPH6Domain13WholeVelocityEv]+0x2a8): undefined reference to `SPH::EOS(unsigned long const&, double const&, double const&, double const&, double const&)'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::Solve(double, double, double, char const*, unsigned long)':
main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x20e): undefined reference to `SPH::Domain::StartAcceleration(Vector const&)'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x2f0): undefined reference to `SPH::Domain::ClearNbData()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x605): undefined reference to `SPH::Domain::PrimaryComputeAcceleration()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x64b): undefined reference to `SPH::Domain::LastComputeAcceleration()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0xa2e): undefined reference to `SPH::Domain::AdaptiveTimeStep()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0xab0): undefined reference to `SPH::Domain::ClearNbData()'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::YZPlaneCellsNeighbourSearch(int)':
main.cpp:(.text._ZN3SPH6Domain27YZPlaneCellsNeighbourSearchEi[_ZN3SPH6Domain27YZPlaneCellsNeighbourSearchEi]+0x24): undefined reference to `omp_get_thread_num'
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/gpuweldform.dir/build.make:100: src/gpuweldform] Error 1
make[1]: *** [CMakeFiles/Makefile2:135: src/CMakeFiles/gpuweldform.dir/all] Error 2
make: *** [Makefile:130: all] Error 2
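The last undefined reference (omp_get_thread_num) means the gpuweldform target is compiled with OpenMP code but not linked against the OpenMP runtime; the SPH::Domain and SPH::EOS references suggest the library that defines Domain is not being linked either. A hedged CMake sketch for the OpenMP part (the target name is taken from the log, everything else is an assumption about the project's CMakeLists.txt):

# Sketch only, not the repository's actual CMakeLists.txt.
find_package(OpenMP REQUIRED)
target_link_libraries(gpuweldform PRIVATE OpenMP::OpenMP_CXX)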

Unresolved extern function '_ZN7tensor3clEii'

ptxas fatal : Unresolved extern function '_ZN7tensor3clEii'
make[2]: *** [src/cuda/CMakeFiles/cuda.dir/build.make:102: src/cuda/CMakeFiles/cuda.dir/SubDomain.cu.o] Error 255
make[1]: *** [CMakeFiles/Makefile2:162: src/cuda/CMakeFiles/cuda.dir/all] Error 2
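_ZN7tensor3clEii demangles to tensor3::operator()(int, int). ptxas reports an unresolved extern when a __device__ function called from SubDomain.cu is only defined in another translation unit; the usual remedies are to define the operator inline in the header (with __device__), or to enable relocatable device code. A hedged CMake sketch for the latter (the cuda target name is taken from the log above):

# Sketch only: enable relocatable device code (-rdc=true) for the cuda target
# so device functions defined in one .cu file can be linked from another.
set_target_properties(cuda PROPERTIES CUDA_SEPARABLE_COMPILATION ON)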

Vector unresolved externals

main.cu.obj : error LNK2019: unresolved external symbol "public: __cdecl Vector::Vector(double,double,double)" (??0Vector@@qeaa@NNN@Z) referenced in function "public: __cdecl SPH::Boundary::Boundary(void)" (??0Boundary@SPH@@qeaa@XZ)
main.cu.obj : error LNK2019: unresolved external symbol "public: class Vector & __cdecl Vector::operator=(float const &)" (??4Vector@@QEAAAEAV0@AEBM@Z) referenced in function "public: __cdecl SPH::Boundary::Boundary(void)
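These LNK2019 errors say main.cu.obj uses Vector::Vector(double,double,double) and Vector::operator=(const float &) but their definitions are never seen by the linker. One common remedy, sketched here under the assumption that the members currently live in a .cpp file that the CUDA target does not link, is to define them inline in Vector.h so both host and nvcc compilation pick them up (this is an illustrative fragment, not the repo's Vector class):

// Sketch only: minimal Vector fragment with inline, host/device-callable members.
class Vector {
public:
    double x, y, z;

    __host__ __device__ Vector() : x(0.0), y(0.0), z(0.0) {}
    __host__ __device__ Vector(double a, double b, double c) : x(a), y(b), z(c) {}

    __host__ __device__ Vector &operator=(const float &s) {
        x = s; y = s; z = s;   // assumed semantics: broadcast the scalar to all components
        return *this;
    }
};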
