luchete80 / weldformgpu

CUDA implementation of Weakly-Compressible Smoothed Particle Hydrodynamics for Elasto Plastic and thermal coupled Mechanics

License: GNU General Public License v3.0

CMake 0.65% C++ 68.13% Cuda 22.14% C 2.72% Batchfile 0.01% Jupyter Notebook 6.35%
cuda deformable-bodies-dynamics deformable-solids gpu nvidia plasticity smoothed-particle-hydrodynamics solid-mechanics sph

weldformgpu's Introduction

WeldFormGPU is the GPU version of WeldForm. It is being migrated to become the combined CPU / GPU solver.

Features

It has been adapted exclusively to solid mechanics, and it includes:

  • Mechanic Solver
  • Thermal Solver
  • Coupled ThermoMechanical Solver (in progress)
  • Contact formulation (in progress)
  • Adaptive search only in case of plastic strain threshold (in progress)


It is heavily based on Kirk Fraser's thesis and works on an SPH model of Friction Stir Welding (FSW).

Requirements

GPU Version

  1. Solver executables. Check which version corresponds to your GPU [here].
  2. Visual Studio Runtime Libraries [here](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170)
  3. CUDA runtime libraries.

Building Instructions

  1. Install Visual Studio Community 2019 (not tested on 2022 yet)
  2. Install CMake
  3. Download and install the CUDA compilers (tested only on version 11.4)
  4. Set the CUDA compiler path on CMD: set CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1"
  5. Create a directory for the build binaries
  6. Run make.bat (located in the root directory) to set the MSVC environment variables
  7. Clone this repo (for example to c:\WeldFormGPU\src)
  8. Inside the build directory run: cmake c:\WeldFormGPU\src -G "NMake Makefiles"
  9. Inside the build directory run: nmake.exe


weldformgpu's People

Contributors

luchete80

Forkers

xupeiwust simzc

weldformgpu's Issues

SOA vs AOS

https://stackoverflow.com/questions/17924705/structure-of-arrays-vs-array-of-structures
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)

Does this make sense in an all-GPU code?


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <thrust/device_vector.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

#define BLOCKSIZE   1024

/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {

    unsigned int    x1;
    unsigned int    x2;
    unsigned int    code;
    bool            done;

};

/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {

    unsigned int    *x1;
    unsigned int    *x2;
    unsigned int    *code;
    bool            *done;

};


/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {
        cellAoS tempCell = d_cells[tid];

        tempCell.x1 = tempCell.x1 + 10;
        tempCell.x2 = tempCell.x2 + 10;

        d_cells[tid] = tempCell;
    }

}

/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        d_x1[tid] = d_x1[tid] + 10;
        d_x2[tid] = d_x2[tid] + 10;

    }

}

/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        cell.x1[tid] = cell.x1[tid] + 10;
        cell.x2[tid] = cell.x2[tid] + 10;

    }

}

/********/
/* MAIN */
/********/
int main() {

    const int N = 2048 * 2048 * 4;

    TimingGPU timerGPU;

    thrust::host_vector<cellAoS>    h_cells(N);
    thrust::device_vector<cellAoS>  d_cells(N);

    thrust::host_vector<unsigned int>   h_x1(N);
    thrust::host_vector<unsigned int>   h_x2(N);

    thrust::device_vector<unsigned int> d_x1(N);
    thrust::device_vector<unsigned int> d_x2(N);

    for (int k = 0; k < N; k++) {

        h_cells[k].x1 = k + 1;
        h_cells[k].x2 = k + 2;
        h_cells[k].code = k + 3;
        h_cells[k].done = true;

        h_x1[k] = k + 1;
        h_x2[k] = k + 2;

    }

    d_cells = h_cells;

    d_x1 = h_x1;
    d_x2 = h_x2;

    cellSoA cell;
    cell.x1 = thrust::raw_pointer_cast(d_x1.data());
    cell.x2 = thrust::raw_pointer_cast(d_x2.data());
    cell.code = NULL;
    cell.done = NULL;

    timerGPU.StartCounter();
    AoSvsSoA_v1<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_cells.data()), N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());

    //timerGPU.StartCounter();
    //AoSvsSoA_v2<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
    //gpuErrchk(cudaPeekAtLastError());
    //gpuErrchk(cudaDeviceSynchronize());
    //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());

    timerGPU.StartCounter();
    AoSvsSoA_v3<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(cell, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());

    h_cells = d_cells;

    h_x1 = d_x1;
    h_x2 = d_x2;

    // --- Check results
    for (int k = 0; k < N; k++) {
        if (h_x1[k] != k + 11) {
            printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);
            break;
        }
        if (h_x2[k] != k + 12) {
            printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);
            break;
        }
        if (h_cells[k].x1 != k + 11) {
            printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);
            break;
        }
        if (h_cells[k].x2 != k + 12) {
            printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);
            break;
        }
    }

}

How is matrix multiplication done? Shared memory

As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.

Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad memory (or software managed cache) to minimize global memory accesses from a CUDA block as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 7. A is therefore read B.width times from global memory and B is read A.height times.
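The code sample referenced above did not survive the copy. Below is a minimal sketch of that non-shared-memory version, following the structure described (the Matrix struct and kernel name are illustrative, not types from this repo):

// Minimal sketch of matrix multiplication without shared memory: each thread
// reads one full row of A and one full column of B to compute one element of C.
// Matrices are stored row-major in linear memory.
struct Matrix {
    int width;
    int height;
    float *elements;
};

__global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C) {

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row >= C.height || col >= C.width) return;

    float Cvalue = 0.0f;
    for (int k = 0; k < A.width; ++k)
        Cvalue += A.elements[row * A.width + k] * B.elements[k * B.width + col];

    C.elements[row * C.width + col] = Cvalue;
}

The shared-memory variant tiles A and B into __shared__ sub-blocks so each global-memory element is read only once per tile instead of once per thread.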

Neib test in 1D or 2D format

https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, and here is a fully worked example. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.

general, dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there is additional, non-trivial complexity
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers instead of 1
  • (note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration of that)

Also, here is a thrust method for building a general dynamically allocated 2D array.

flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
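A minimal sketch of the flattened / "simulated 2D" access pattern (illustrative kernel and names, not code from this repo):

// A rows x cols array stored as one contiguous 1D allocation; element (row, col)
// is addressed as data[row * cols + col], so no pointer tree is needed on the device.
__global__ void scale2D(float * __restrict__ data, int rows, int cols, float factor) {

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < rows && col < cols)
        data[row * cols + col] *= factor;
}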

general, dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so even less efficient. Here is a fully worked example (2nd code example).

special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case, with no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
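A short sketch of that special case, assuming the width is a compile-time constant (names and sizes are illustrative):

#define NCOLS 128   // array width known at compile time

// The parameter type carries the width, so arr[row][col] compiles down to a single
// flat access (row * NCOLS + col); no second pointer dereference is involved.
__global__ void addOne(float (*arr)[NCOLS], int nrows) {

    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < nrows)
        for (int col = 0; col < NCOLS; ++col)
            arr[row][col] += 1.0f;
}

// Host side (sketch): allocate nrows * NCOLS floats with cudaMalloc, then launch as
// addOne<<<grid, block>>>((float (*)[NCOLS])d_ptr, nrows);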

doubly-subscripted host code, singly-subscripted device code:

Finally another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree", however this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so less efficient, and there is some complexity associated with building the pointer "tree", for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.

Accessing arrays inside the kernel creates an error

void __global__ ThermalSolveKernel (double *dTdt,
                                    double3 *x, double *h,
                                    double *m, double *rho,
                                    double *T, double *k_T, double *cp,
                                    int *neib_part, int *neib_offs /*or count*/) {

    printf("searching nb..\n");
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    printf("dT: %f\n", dTdt[i]);

CUDA operator interaction errors

[ 11%] Building CUDA object src/cuda/CMakeFiles/cuda.dir/Interaction.cu.obj
Interaction.cu
D:/Luciano/Numerico/persiangpu/src/cuda/Interaction.cu(105): error: more than one operator "*" matches these operands:
            function "operator*(const float &, const float3 &)"
            function "operator*(double, const Vector &)"
            operand types are: double * float3

(The same ambiguous "operator*" error is reported a second time at Interaction.cu(105) and again at lines 142, 146, 155, 158, 162, 166 and 185.)

Some dTdt have wrong values

dTdt: 0.000000
dTdt: 0.000000
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: -nan(ind)
dTdt: 0.000000
dTdt: 0.000000

ambiguous operator

src\cuda\interaction.cu(142): error: more than one operator "*" matches these operands:
            function "operator*(double, const Vector &)"
                D:\Luciano\Numerico\persiangpu\src\Vector.h(99): here
            function "operator*(const float &, const float3 &)"
                D:\Luciano\Numerico\persiangpu\src\vector_math.h(511): here
            operand types are: double * float3
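The compiler sees two usable overloads for double * float3, operator*(const float &, const float3 &) and operator*(double, const Vector &), and neither is a better match. A minimal sketch of one way out, assuming the float3 overload is the intended one, is to make the scalar's type explicit at the call site:

// Sketch only: 'scalar' and 'v' are illustrative names, not identifiers from Interaction.cu.
double scalar = 0.5;
float3 v = make_float3(1.0f, 2.0f, 3.0f);

// float3 w = scalar * v;        // ambiguous: double * float3 matches both overloads
float3 w = (float)scalar * v;    // exact match for operator*(const float &, const float3 &)

The alternatives are to add an exact overload taking (double, const float3 &) to the math header, or to avoid including both conflicting operator sets in Interaction.cu.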

Memory allocation host2dev copy crash main test program

This crashes the main thermal test program:

	// double3 *x =  (double3 *)malloc(dom.Particles.size());
	// for (int i=0;i<dom.Particles.size();i++){
		// x[i] = make_double3(dom.Particles[i]->x);
	// }
	// int size = dom.Particles.size() * sizeof(double3);
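Two things stand out in the commented block: malloc is given dom.Particles.size() bytes instead of size() * sizeof(double3), and nothing is ever copied to the device. A hedged sketch of the intended staging (make_double3 taking a Particle's x is assumed to exist as used above; otherwise the three components have to be passed individually):

	// Sketch only, not the repo's actual code:
	int count = dom.Particles.size();
	size_t bytes = count * sizeof(double3);          // size in bytes, not element count

	double3 *h_x = (double3 *)malloc(bytes);
	for (int i = 0; i < count; i++)
		h_x[i] = make_double3(dom.Particles[i]->x);  // assumed overload, as in the block above

	double3 *d_x = NULL;
	cudaMalloc(&d_x, bytes);
	cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);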

Change Stress to float3*, symtensor or float2**

../gpusph/src/integrators/RepackingIntegrator.cc: BUFFER_DKDE | BUFFER_TAU |
../gpusph/src/define_worker_commands.h: (BUFFER_FORCES | BUFFER_XSPH | BUFFER_TAU | BUFFER_DKDE | BUFFER_CFL |
../gpusph/src/cuda/visc.cu: float2 **tau = bufwrite.getRawPtr<BUFFER_TAU>();
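For reference, the gpusph buffers above pack the symmetric tau tensor (6 unique components per particle) into three float2 arrays. A sketch of the same layout for a stress tensor, making no assumption about WeldFormGPU's current Stress type:

// Sketch: structure-of-arrays storage of a symmetric 3x3 tensor, gpusph-style.
// Component order per particle: s0 = (xx, xy), s1 = (xz, yy), s2 = (yz, zz).
struct SymTensorSoA {
    float2 *s0;
    float2 *s1;
    float2 *s2;
};

__device__ inline float trace(const SymTensorSoA t, int i) {
    return t.s0[i].x + t.s1[i].y + t.s2[i].y;   // xx + yy + zz
}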

Undefined references of Domain

/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `main':
main.cpp:(.text+0x1ec): undefined reference to `SPH::Domain::Kernel_Set(Kernels_Type const&)'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::WholeVelocity()':
main.cpp:(.text._ZN3SPH6Domain13WholeVelocityEv[_ZN3SPH6Domain13WholeVelocityEv]+0x2a8): undefined reference to `SPH::EOS(unsigned long const&, double const&, double const&, double const&, double const&)'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::Solve(double, double, double, char const*, unsigned long)':
main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x20e): undefined reference to `SPH::Domain::StartAcceleration(Vector const&)'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x2f0): undefined reference to `SPH::Domain::ClearNbData()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x605): undefined reference to `SPH::Domain::PrimaryComputeAcceleration()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0x64b): undefined reference to `SPH::Domain::LastComputeAcceleration()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0xa2e): undefined reference to `SPH::Domain::AdaptiveTimeStep()'
/usr/bin/ld: main.cpp:(.text._ZN3SPH6Domain5SolveEdddPKcm[_ZN3SPH6Domain5SolveEdddPKcm]+0xab0): undefined reference to `SPH::Domain::ClearNbData()'
/usr/bin/ld: CMakeFiles/gpuweldform.dir/main.cpp.o: in function `SPH::Domain::YZPlaneCellsNeighbourSearch(int)':
main.cpp:(.text._ZN3SPH6Domain27YZPlaneCellsNeighbourSearchEi[_ZN3SPH6Domain27YZPlaneCellsNeighbourSearchEi]+0x24): undefined reference to `omp_get_thread_num'
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/gpuweldform.dir/build.make:100: src/gpuweldform] Error 1
make[1]: *** [CMakeFiles/Makefile2:135: src/CMakeFiles/gpuweldform.dir/all] Error 2
make: *** [Makefile:130: all] Error 2
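The last undefined reference (omp_get_thread_num) means the gpuweldform target is compiled with OpenMP code but not linked against the OpenMP runtime; the SPH::Domain and SPH::EOS references suggest the library that defines Domain is not being linked either. A hedged CMake sketch for the OpenMP part (the target name is taken from the log, everything else is an assumption about the project's CMakeLists.txt):

# Sketch only, not the repository's actual CMakeLists.txt.
find_package(OpenMP REQUIRED)
target_link_libraries(gpuweldform PRIVATE OpenMP::OpenMP_CXX)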

Unresolved extern function '_ZN7tensor3clEii'

ptxas fatal : Unresolved extern function '_ZN7tensor3clEii'
make[2]: *** [src/cuda/CMakeFiles/cuda.dir/build.make:102: src/cuda/CMakeFiles/cuda.dir/SubDomain.cu.o] Error 255
make[1]: *** [CMakeFiles/Makefile2:162: src/cuda/CMakeFiles/cuda.dir/all] Error 2
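_ZN7tensor3clEii demangles to tensor3::operator()(int, int). ptxas reports an unresolved extern when a __device__ function called from SubDomain.cu is only defined in another translation unit; the usual remedies are to define the operator inline in the header (with __device__), or to enable relocatable device code. A hedged CMake sketch for the latter (the cuda target name is taken from the log above):

# Sketch only: enable relocatable device code (-rdc=true) for the cuda target
# so device functions defined in one .cu file can be linked from another.
set_target_properties(cuda PROPERTIES CUDA_SEPARABLE_COMPILATION ON)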

Vector unresolved externals

main.cu.obj : error LNK2019: unresolved external symbol "public: __cdecl Vector::Vector(double,double,double)" (??0Vector@@qeaa@NNN@Z) referenced in function "public: __cdecl SPH::Boundary::Boundary(void)" (??0Boundary@SPH@@qeaa@XZ)
main.cu.obj : error LNK2019: unresolved external symbol "public: class Vector & __cdecl Vector::operator=(float const &)" (??4Vector@@QEAAAEAV0@AEBM@Z) referenced in function "public: __cdecl SPH::Boundary::Boundary(void)
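These LNK2019 errors say main.cu.obj uses Vector::Vector(double,double,double) and Vector::operator=(const float &) but their definitions are never seen by the linker. One common remedy, sketched here under the assumption that the members currently live in a .cpp file that the CUDA target does not link, is to define them inline in Vector.h so both host and nvcc compilation pick them up (this is an illustrative fragment, not the repo's Vector class):

// Sketch only: minimal Vector fragment with inline, host/device-callable members.
class Vector {
public:
    double x, y, z;

    __host__ __device__ Vector() : x(0.0), y(0.0), z(0.0) {}
    __host__ __device__ Vector(double a, double b, double c) : x(a), y(b), z(c) {}

    __host__ __device__ Vector &operator=(const float &s) {
        x = s; y = s; z = s;   // assumed semantics: broadcast the scalar to all components
        return *this;
    }
};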
