sgemm_on_vega's Introduction

An Alternative SGEMM (DGEMM) on VEGA (MI25, MI50, MI60)

to Verify Power by LDS, SGPR, and Data Forwarding

 

1 Legacy DGEMM implementation

https://github.com/NervanaSystems/maxas/wiki/SGEMM gives a detailed explanation of SGEMM on the Maxwell architecture. Most SGEMM/DGEMM implementations on GPUs use similar algorithms. At the top level, a legacy SGEMM/DGEMM is implemented as follows:

  • Use a workgroup size of (64,1,1).
  • Each workgroup computes the region of Matrix C from (m, n) to (m+63, n+63), which we call the 64x64 macro-tile of the workgroup. Only the 64x64 macro-tile size is discussed in this example.
  • Each workgroup loads a 64 x K block of Matrix A and a K x 64 block of Matrix B, and performs 64 x K x 64 FMAs.
  • Matrix A and Matrix B are loaded into LDS.
  • Every thread multiplies an 8 x K block of Matrix A by a K x 8 block of Matrix B for SGEMM; that is, every thread computes an 8x8 micro-tile of Matrix C.
  • For each workgroup: Matrix A is read 8 x K times from LDS.
  • For each workgroup: Matrix B is read 8 x K times from LDS.
  • For each workgroup: Matrix A is read 64 x K x 64 times from VGPRs.
  • For each workgroup: Matrix B is read 64 x K x 64 times from VGPRs.
  • For each workgroup: Matrix C is read and written 64 x K x 64 times in VGPRs.
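As a rough illustration of the legacy tiling listed above, the following sketch maps each of the 64 threads to an 8x8 micro-tile inside the 64x64 macro-tile and counts the FMAs per workgroup. The names (`LegacyTileOrigin`, `legacy_micro_tile_origin`, `legacy_fma_per_workgroup`) and the simple row-major thread-to-tile mapping are assumptions made for illustration; real kernels often use a different mapping.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type, not taken from the original kernel.
struct LegacyTileOrigin {
    int row;  // first C row (M direction) of the micro-tile
    int col;  // first C column (N direction) of the micro-tile
};

constexpr int kLegacyMacroTile = 64;  // 64x64 C tile per workgroup
constexpr int kLegacyMicroTile = 8;   // 8x8 C tile per thread

// Map a thread id (0..63) to the origin of its 8x8 micro-tile inside the
// macro-tile, assuming micro-tiles are laid out row-major (an assumption).
inline LegacyTileOrigin legacy_micro_tile_origin(int thread_id) {
    assert(thread_id >= 0 && thread_id < 64);
    const int tiles_per_row = kLegacyMacroTile / kLegacyMicroTile;  // 8
    return LegacyTileOrigin{(thread_id / tiles_per_row) * kLegacyMicroTile,
                            (thread_id % tiles_per_row) * kLegacyMicroTile};
}

// FMAs performed by one workgroup for a given K: 64 * K * 64.
inline std::int64_t legacy_fma_per_workgroup(std::int64_t k) {
    return static_cast<std::int64_t>(kLegacyMacroTile) * k * kLegacyMacroTile;
}
```

With this mapping, thread 9 owns the micro-tile whose top-left corner is at (8, 8) of the macro-tile.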

 

Memory reads and writes account for a very high share of total energy. SGEMM/DGEMM computation on modern GPUs involves the following memory accesses:

  • External video memory reads from GDDR or HBM into the L2 cache
  • From L2 cache to L1 cache
  • From L1 cache to LDS
  • From LDS to VGPR
  • FMA reads VGPRs only, for the matrix sum

 

In general, LDS and VGPR accesses account for almost 50% of the total energy of SGEMM/DGEMM.

 

2 Very Low Power (VLP) SGEMM/DGEMM Algorithm

2.1 Macro Tile per Workgroup and Micro-Tile per Thread

The VLP SGEMM uses a workgroup size of 64 for macro-tile M=64, N=64.

A workgroup size of 128 uses macro-tile M=64, N=128.

A workgroup size of 256 uses macro-tile M=64, N=256.

The micro-tile for each thread is M=64, N=1. Each thread multiplies a 64 x K block of Matrix A by a K x 1 column of Matrix B, producing a 64 x 1 column of Matrix C.

Across the 64 threads, the Matrix C addresses are contiguous for each M.

In this paper, the algorithm is based on macro-tile M=64, N=64 unless noted otherwise.

To make the best use of Matrix A as SQC constants:

  • hipBlockIdx_x = N/64
  • hipBlockIdx_y = M/64
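The block indexing above can be sketched on the host as follows. This is a minimal sketch that assumes M and N are exact multiples of the macro-tile; `VlpGrid`, `vlp_grid` and `vlp_thread_column` are hypothetical names, not part of the original source.

```cpp
#include <cassert>

// Hypothetical helper type: number of workgroups in x (N) and y (M).
struct VlpGrid {
    int x;
    int y;
};

// Grid for an M x N result when each workgroup owns a 64 x macro_n
// macro-tile. Assumes M and N are exact multiples of the tile size.
inline VlpGrid vlp_grid(int m, int n, int macro_n = 64) {
    assert(macro_n > 0 && m % 64 == 0 && n % macro_n == 0);
    return VlpGrid{n / macro_n, m / 64};  // x walks N, y walks M
}

// Column of C owned by one thread: every thread handles a 64x1 micro-tile,
// so neighbouring threads own neighbouring columns.
inline int vlp_thread_column(int block_idx_x, int thread_idx_x,
                             int macro_n = 64) {
    assert(thread_idx_x >= 0 && thread_idx_x < macro_n);
    return block_idx_x * macro_n + thread_idx_x;
}
```

For M=N=4096 this gives a 64 x 64 grid with the 64x64 macro-tile, and a 16 x 64 grid with the 64x256 macro-tile.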

 

2.2  Matrix A  Base Offset Per Wave

Every  block has one base address for its Matrix A. 

matrix_A_base_offset  = hipBlockIdx_y *  64  * lda;

 

2.3 Matrix B Base Offset per Wave

Every  block has one base address for its Matrix B. 

matrix_B_base_offset  = hipBlockIdx_x *  64  * ldb;
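The two base offsets from sections 2.2 and 2.3 can be written as small host-side helpers. This sketch assumes the offsets are counted in elements (floats) rather than bytes, which the text does not state; the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kVlpTile = 64;  // macro-tile edge used in the formulas

// Offset of the first Matrix A element handled by a block (section 2.2).
// Assumed to be in elements; multiply by sizeof(float) for bytes.
inline std::size_t matrix_a_base_offset(std::size_t block_idx_y,
                                        std::size_t lda) {
    assert(lda > 0);
    return block_idx_y * kVlpTile * lda;
}

// Offset of the first Matrix B element handled by a block (section 2.3).
inline std::size_t matrix_b_base_offset(std::size_t block_idx_x,
                                        std::size_t ldb) {
    assert(ldb > 0);
    return block_idx_x * kVlpTile * ldb;
}
```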

2.4 Matrix A’s Offset for Each K

matrix_A_koffset = k * sizeof (float)

The algorithm reads Matrix A's data with the assembly instruction "s_load_dwordx8":

s_load_dwordx8 s[32:39], s[12:13], s18

The AMD GCN architecture has 96 SGPRs available. This algorithm uses SGPRs s32 to s95, so only 64 SGPRs are available to hold Matrix A's data.

Each group of s_load_dwordx8 instructions reads 64 values, covering 8 values of M by 8 values of K. The algorithm uses 8 such groups to cover all 64 values of M.
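To make the grouping concrete, the sketch below computes which SGPRs and which byte offset each s_load_dwordx8 of a group would use. It assumes Matrix A is laid out so that the 8 K values of one M row are consecutive and rows are lda elements apart, which is implied by the offset formulas but not stated outright; `ScalarLoadPlan` and `plan_scalar_load` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one s_load_dwordx8.
struct ScalarLoadPlan {
    int first_sgpr;           // destination is s[first_sgpr : first_sgpr + 7]
    std::size_t byte_offset;  // byte offset from the start of Matrix A
};

// Plan load number `row` (0..7) of group `group` (0..7).
// Each load fetches 8 consecutive K values of one M row, so the 8 loads of
// a group fill s32..s95 with 8 rows x 8 K values (64 SGPRs).
inline ScalarLoadPlan plan_scalar_load(int group, int row, std::size_t k,
                                       std::size_t a_base_offset,
                                       std::size_t lda) {
    assert(group >= 0 && group < 8 && row >= 0 && row < 8);
    // Row index inside the 64-row macro-tile.
    const std::size_t m = static_cast<std::size_t>(group) * 8 +
                          static_cast<std::size_t>(row);
    // Element index under the assumed layout, then converted to bytes.
    const std::size_t element = a_base_offset + m * lda + k;
    return ScalarLoadPlan{32 + row * 8, element * sizeof(float)};
}
```

Because all 64 data SGPRs are refilled by every group, the next group cannot be issued until the FMAs that consume the current one have finished.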

 

The AMD GCN architecture does not support in-order return of s_load_dword, so this algorithm does not double-buffer the loading of Matrix A.

We postpone the performance analysis of the limited SGPR count and of the latency left unhidden by out-of-order SGPR return.

 

2.5 Double Buffer Prefetch of Matrix B

Each thread uses a micro-tile of M=64, N=1, so it needs 8 VGPRs to hold the 8 K values of its single N. The algorithm uses global_load_dwordx4 for the best cache-line hit rate: the next memory read instruction reads the next 4 DWORDs of the same cache line.

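global_load_dwordx4 v[68:71], v[2:3], s[20:21]    // load 4 DWORDs of Matrix B
s_add_u32 s20, s20, 16                            // advance the low half of the base in s[20:21] by 16 bytes
s_addc_u32 s21, s21, 0                            // propagate the carry into the high half
global_load_dwordx4 v[72:75], v[2:3], s[20:21]    // load the next 4 DWORDs
s_add_u32 s20, s20, 16                            // advance the base by another 16 bytes
s_addc_u32 s21, s21, 0                            // propagate the carry into the high half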

Double buffering gives better latency hiding. It needs 16 VGPRs.
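The idea of double buffering can be emulated on the host. The following is only a sketch of the technique, not the kernel itself: two 8-element buffers stand in for the 16 VGPRs, and the next chunk of B is copied in before the current chunk is consumed. The function name and the use of std::vector are assumptions for illustration, and K is assumed to be a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of one A row with one B column, reading B through two
// alternating 8-element buffers (a host-side stand-in for 16 VGPRs).
inline float dot_double_buffered(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    assert(a.size() == b.size() && a.size() % 8 == 0);
    const std::size_t chunks = b.size() / 8;
    float buf[2][8] = {};
    float acc = 0.0f;
    if (chunks == 0) return acc;

    // Prologue: load the first chunk before the main loop starts.
    for (std::size_t i = 0; i < 8; ++i) buf[0][i] = b[i];

    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t cur = c % 2;
        const std::size_t nxt = 1 - cur;
        // Prefetch the next chunk into the idle buffer.
        if (c + 1 < chunks) {
            for (std::size_t i = 0; i < 8; ++i) buf[nxt][i] = b[(c + 1) * 8 + i];
        }
        // Consume the current chunk.
        for (std::size_t i = 0; i < 8; ++i) acc += a[c * 8 + i] * buf[cur][i];
    }
    return acc;
}
```

On the GPU the prefetch and the FMAs overlap in time; on the host they simply run back to back, so this only shows the buffer rotation.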

2.6 VGPRs Allocations

Every thread needs v[2:3] for its per-thread Matrix B offset.

Double-buffered loading of Matrix B needs 16 VGPRs.

The 64 values of M need 64 VGPRs.

Including one VGPR for hipThreadIdx_x, the total is 16 + 64 + 2 + 1 = 83 VGPRs.

83 VGPRs allows 3 waves per SIMD, or 3 workgroups per CU for 256-thread workgroups. That is enough for good performance.
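The occupancy figure can be reproduced with a short calculation. This sketch assumes 256 VGPRs per SIMD lane, a VGPR allocation granularity of 4 and a limit of 10 waves per SIMD, which are the usual GCN values but are not given in the text; the names are illustrative.

```cpp
#include <cassert>

// Assumed GCN limits (not stated in the text above).
constexpr int kVgprsPerSimdLane = 256;
constexpr int kVgprAllocGranule = 4;
constexpr int kMaxWavesPerSimd = 10;

// VGPR budget from section 2.6: B double buffer + accumulators +
// B offset pair + hipThreadIdx_x.
constexpr int vlp_vgpr_budget() { return 16 + 64 + 2 + 1; }

// Waves that fit on one SIMD for a given per-thread VGPR count.
inline int waves_per_simd(int vgprs_per_thread) {
    assert(vgprs_per_thread > 0 && vgprs_per_thread <= kVgprsPerSimdLane);
    // Round the request up to the allocation granularity.
    const int allocated =
        (vgprs_per_thread + kVgprAllocGranule - 1) / kVgprAllocGranule *
        kVgprAllocGranule;
    const int waves = kVgprsPerSimdLane / allocated;
    return waves < kMaxWavesPerSimd ? waves : kMaxWavesPerSimd;
}
```

Under these assumptions 83 VGPRs round up to 84, and 256 / 84 gives 3 waves per SIMD; 85 VGPRs would already drop to 2.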

 

2.7 NO LDS Operation At All

2.8 No Barrier At All

2.9 FMA with SGPR source and Data Forwarding to Saving SGPR

Modern GPUs usually have a constant cache that is independent of the texture/buffer L1 cache. SIMD FMA instructions allow one operand to come from constant data. The AMD GCN architecture goes further and promotes constants into scalar registers: data from the constant cache can be held in SGPRs. The GCN FMA instruction supports an SGPR operand with the following syntax:

v_fma_f32 v4, v68, s32, v4
v_fma_f32 v4, v69, s33, v4
v_fma_f32 v4, v70, s34, v4
v_fma_f32 v4, v71, s35, v4
v_fma_f32 v4, v72, s36, v4
v_fma_f32 v4, v73, s37, v4
v_fma_f32 v4, v74, s38, v4
v_fma_f32 v4, v75, s39, v4

 

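A v_fma_f32 or v_fma_f64 with one SGPR source touches three VGPR operands (two reads and one write) instead of four, which means 25% fewer VGPR read/write accesses. In other words, it is possible to save 25% of the dynamic power of VGPR access.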
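The arithmetic performed by the FMA sequence can be checked against a host-side emulation of one 64x64 macro-tile. This is a sketch under assumed data layouts (A stored as 64 rows of K values, B stored as 64 columns of K values each); `vlp_macro_tile` is a hypothetical name and the code says nothing about GPU performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a: 64 x k, a[m * k + kk]  (plays the role of the values held in SGPRs)
// b: 64 x k, b[n * k + kk]  (the K values of column n, held in VGPRs)
// returns c: 64 x 64, c[m * 64 + n]
inline std::vector<float> vlp_macro_tile(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         std::size_t k) {
    assert(a.size() == 64 * k && b.size() == 64 * k);
    std::vector<float> c(64 * 64, 0.0f);
    for (std::size_t n = 0; n < 64; ++n) {   // one "thread" per C column
        float acc[64] = {};                  // 64 accumulators, one per M
        for (std::size_t m = 0; m < 64; ++m) {
            for (std::size_t kk = 0; kk < k; ++kk) {
                // acc = b (per-thread operand) * a (shared scalar) + acc
                acc[m] += b[n * k + kk] * a[m * k + kk];
            }
        }
        for (std::size_t m = 0; m < 64; ++m) c[m * 64 + n] = acc[m];
    }
    return c;
}
```

Every "thread" reads the same A values, which is why they can be served from scalar registers instead of LDS.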

2.10 Matrix C Address

Matrix C address is very similar to Matrix B since every thread has different N value.

2.11 Theoretical Comparison of VGPR/L1 Cache/LDS Access

The following table gives an example for macro-tile M=64, N=256. The new SGEMM algorithm reduces VGPR accesses by about 70% through SQC constant loading and data forwarding of the accumulator.

| Costs for Matrix Multiply 64x1x256 (unit in FP64) | Legacy (LDS) | SQC (Non-LDS) |
|---------------------------------------------------|--------------|---------------|
| Matrix A L2-L1                                    | 64           | 64            |
| Matrix A VGPR Write                               | 576          | 64            |
| Matrix A VGPR Read                                | 16384        | 64            |
| Matrix A LDS Write                                | 64           | 0             |
| Matrix A LDS Read                                 | 512          | 0             |
| Matrix B L2-L1                                    | 256          | 256           |
| Matrix B L1 Read                                  | 256          | 256           |
| Matrix B VGPR Write                               | 2304         | 256           |
| Matrix B VGPR Read                                | 16384        | 16384         |
| Matrix B LDS Write                                | 256          | 0             |
| Matrix B LDS Read                                 | 2304         | 0             |
| Matrix C VGPR read/write+                         | 32768        | 4096          |
| SUM-L2-L1                                         | 320          | 320           |
| SUM-L1-Read                                       | 320          | 320           |
| VGPR Read/Write                                   | 68416        | 20864         |
| LDS Read/Write                                    | 3136         | 0             |
| Barrier                                           | 1            | 0             |
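The totals in the table can be re-derived from its individual rows. The sketch below only re-adds the numbers given above; the struct and function names are illustrative.

```cpp
#include <cassert>

// VGPR access counts copied from the table rows above.
struct VgprAccessCounts {
    long a_write;
    long a_read;
    long b_write;
    long b_read;
    long c_read_write;
};

constexpr VgprAccessCounts kLegacyCounts{576, 16384, 2304, 16384, 32768};
constexpr VgprAccessCounts kSqcCounts{64, 64, 256, 16384, 4096};

// Sum of all VGPR reads and writes for one column of the table.
constexpr long total_vgpr_accesses(const VgprAccessCounts& c) {
    return c.a_write + c.a_read + c.b_write + c.b_read + c.c_read_write;
}

// Fraction of VGPR accesses removed by the SQC variant.
inline double vgpr_access_reduction() {
    return 1.0 - static_cast<double>(total_vgpr_accesses(kSqcCounts)) /
                     static_cast<double>(total_vgpr_accesses(kLegacyCounts));
}
```

The sums match the "VGPR Read/Write" row (68416 and 20864), and the reduction works out to roughly 69.5%.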

 

 

However, several limits prevent this kernel from achieving more than 78% performance on the AMD GCN architecture:

  • AMD GCN provides only 96 SGPRs to a program. This limitation prevents the SGEMM kernel from double-buffering its Matrix A loads.
  • AMD GCN returns constants out of order. The SGEMM kernel has to use "s_waitcnt lgkmcnt(0)" to avoid reading dirty data, which makes latency hiding very hard.

3 Benchmark 

3.1 Performance Testing of SGEMM_64x256

 

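The following results were measured on an MI60 at different GPU engine frequencies, with the memory frequency fixed at 800 MHz. The figures appear to be throughput in TFLOPS, matching the "T" values quoted in section 3.2.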

| K=640     | GFX 1700 MHz | GFX 1500 MHz | GFX 1300 MHz | GFX 1100 MHz |
|-----------|--------------|--------------|--------------|--------------|
| M=N=256   | 0.423        | 0.378        | 0.329        | 0.282        |
| M=N=512   | 1.125        | 1.052        | 1.033        | 0.896        |
| M=N=768   | 2.458        | 2.264        | 2.092        | 1.853        |
| M=N=1024  | 4.368        | 3.903        | 3.622        | 3.331        |
| M=N=1280  | 5.687        | 5.213        | 4.753        | 4.241        |
| M=N=1536  | 7.058        | 6.435        | 5.739        | 4.995        |
| M=N=1792  | 6.493        | 5.972        | 5.463        | 4.807        |
| M=N=2048  | 8.13         | 7.448        | 6.797        | 6.047        |
| M=N=2304  | 8.366        | 7.63         | 6.828        | 5.95         |
| M=N=2560  | 8.561        | 7.856        | 7.11         | 6.226        |
| M=N=2816  | 9.35         | 8.558        | 7.711        | 6.741        |
| M=N=3072  | 9.825        | 8.918        | 8.048        | 7.071        |
| M=N=3328  | 9.758        | 8.896        | 8.026        | 6.998        |
| M=N=3584  | 9.66         | 8.875        | 7.966        | 6.968        |
| M=N=3840  | 9.868        | 9.002        | 8.139        | 7.089        |
| M=N=4096  | 9.954        | 9.145        | 8.226        | 7.185        |
| M=N=4352  | 9.821        | 9.07         | 8.192        | 7.229        |
| M=N=4608  | 9.8          | 9.074        | 8.203        | 7.245        |
| M=N=4864  | 9.856        | 9.088        | 8.252        | 7.258        |
| M=N=5120  | 9.781        | 9.088        | 8.228        | 7.281        |
| M=N=5376  | 9.76         | 9.101        | 8.285        | 7.304        |
| M=N=5632  | 9.8          | 9.122        | 8.285        | 7.346        |
| M=N=5888  | 9.737        | 9.13         | 8.37         | 7.372        |
| M=N=6144  | 9.678        | 9.092        | 8.302        | 7.347        |
| M=N=6400  | 9.672        | 9.121        | 8.328        | 7.383        |
| M=N=6656  | 9.674        | 9.173        | 8.343        | 7.414        |
| M=N=6912  | 9.684        | 9.166        | 8.375        | 7.408        |
| M=N=7168  | 9.638        | 9.18         | 8.359        | 7.413        |
| M=N=7424  | 9.657        | 9.155        | 8.377        | 7.452        |
| M=N=7680  | 9.655        | 9.16         | 8.4          | 7.444        |
| M=N=7936  | 9.67         | 9.168        | 8.398        | 7.466        |
| M=N=8192  | 9.61         | 9.133        | 8.414        | 7.42         |
| M=N=8448  | 9.666        | 9.211        | 8.413        | 7.489        |
| M=N=8704  | 9.662        | 9.236        | 8.417        | 7.465        |
| M=N=8960  | 9.651        | 9.217        | 8.471        | 7.511        |
| M=N=9216  | 9.608        | 9.199        | 8.459        | 7.477        |
| M=N=9472  | 9.643        | 9.234        | 8.454        | 7.509        |
| M=N=9728  | 9.689        | 9.227        | 8.449        | 7.527        |
| M=N=9984  | 9.682        | 9.258        | 8.484        | 7.517        |
| M=N=10240 | 9.605        | 9.258        | 8.453        | 7.498        |
| M=N=10496 | 9.716        | 9.297        | 8.493        | 7.518        |
| M=N=10752 | 9.664        | 9.299        | 8.523        | 7.539        |
| M=N=11008 | 9.672        | 9.299        | 8.521        | 7.537        |
| M=N=11264 | 9.62         | 9.253        | 8.517        | 7.527        |
| M=N=11520 | 9.672        | 9.297        | 8.5          | 7.532        |
| M=N=11776 | 9.652        | 9.275        | 8.497        | 7.548        |
| M=N=12032 | 9.675        | 9.318        | 8.515        | 7.534        |
| M=N=12288 | 9.634        | 9.277        | 8.493        | 7.521        |
| M=N=12544 | 9.681        | 9.339        | 8.531        | 7.556        |
| M=N=12800 | 9.675        | 9.326        | 8.524        | 7.553        |
| M=N=13056 | 9.675        | 9.362        | 8.54         | 7.567        |
| M=N=13312 | 9.666        | 9.344        | 8.57         | 7.581        |
| M=N=13568 | 9.698        | 9.403        | 8.552        | 7.556        |
| M=N=13824 | 9.714        | 9.392        | 8.565        | 7.581        |
| M=N=14080 | 9.703        | 9.429        | 8.57         | 7.591        |
| M=N=14336 | 9.604        | 9.353        | 8.559        | 7.58         |
| M=N=14592 | 9.674        | 9.391        | 8.558        | 7.605        |
| M=N=14848 | 9.657        | 9.312        | 8.545        | 7.587        |
| M=N=15104 | 9.601        | 9.266        | 8.495        | 7.535        |
| M=N=15360 | 9.61         | 9.322        | 8.499        | 7.516        |
| M=N=15616 | 9.661        | 9.351        | 8.541        | 7.554        |
| M=N=15872 | 9.663        | 9.363        | 8.562        | 7.591        |
| M=N=16128 | 9.71         | 9.426        | 8.575        | 7.583        |
| M=N=16384 | 9.532        | 9.228        | 8.508        | 7.532        |
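To put the measurements in context, the sketch below compares a measured value with the theoretical FP32 peak. It rests on two assumptions that the text does not confirm: that the table values are TFLOPS, and that the MI60 runs 64 compute units with 64 lanes each, one FMA (2 FLOP) per lane per clock. The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Theoretical FP32 peak in TFLOPS:
// compute units x 64 lanes x 2 FLOP per FMA x clock.
inline double peak_fp32_tflops(int compute_units, double engine_mhz) {
    assert(compute_units > 0 && engine_mhz > 0.0);
    return compute_units * 64.0 * 2.0 * engine_mhz * 1e6 / 1e12;
}

// Measured throughput as a fraction of the theoretical peak.
inline double fraction_of_peak(double measured_tflops, int compute_units,
                               double engine_mhz) {
    return measured_tflops / peak_fp32_tflops(compute_units, engine_mhz);
}
```

Under these assumptions the peak at 1700 MHz is about 13.93 TFLOPS, so the M=N=4096 entry (9.954) is roughly 71% of peak.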

 

3.2 Power Testing

GFX 1700 MHz, power with no workload = 42 watts

  • Data forwarding:

    • M=N=4096, K=640, max power = 265 watts, with 9.5T
  • No forwarding:

    • M=N=4096, K=640, max power = 284 watts, with 9.18T

GFX 1500 MHz, power with no workload = 36 watts

  • Data forwarding:

    • M=N=4096, K=640, max power = 223 watts, with 9.132T
  • No forwarding:

    • M=N=4096, K=640, max power = 240 watts, with 8.986T
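The power figures are easier to compare as energy efficiency. This small sketch assumes the "T" suffix means TFLOPS and simply divides throughput by maximum power; `gflops_per_watt` is an illustrative name.

```cpp
#include <cassert>
#include <cmath>

// Throughput per watt, in GFLOPS/W, from TFLOPS and watts.
inline double gflops_per_watt(double tflops, double watts) {
    assert(watts > 0.0);
    return tflops * 1000.0 / watts;
}
```

Under that assumption, at 1700 MHz data forwarding gives about 35.8 GFLOPS/W against about 32.3 GFLOPS/W without it, and at 1500 MHz about 41.0 against about 37.4.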

 

4 Run the test

4.1 Run the test

Hardware:  MI60/MI50

Software: ROCm

Command line to build the test:

hipcc sgemm_sqc_test.cpp -o sgemm_sqc_test.exe

Command line to run the test:

./sgemm_sqc_test.exe <M> <N> <K> 64 256 <iterations=10> <verify=0>

For example:

./sgemm_sqc_test.exe 16384 16384 640 64 256 10 0
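How the test program reports its result is not described here. As a sketch, the conventional way to turn a timed run into TFLOPS is to count 2 x M x N x K floating-point operations per GEMM; the function names below are illustrative and are not taken from sgemm_sqc_test.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floating-point operations in one GEMM: one multiply and one add per
// (m, n, k) triple.
inline double sgemm_flop_count(std::int64_t m, std::int64_t n,
                               std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Average throughput in TFLOPS for `iterations` runs taking `seconds`
// in total.
inline double sgemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                           int iterations, double seconds) {
    assert(iterations > 0 && seconds > 0.0);
    return sgemm_flop_count(m, n, k) * iterations / seconds / 1e12;
}
```

For the example command above (M=N=16384, K=640), one iteration is about 3.44e11 floating-point operations.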

4.2 Source Code

The GCN LLVM assembly is written as inline assembly in sgemm_64x256_sqc.cpp.

Command line to compile sgemm_64x256_sqc.cpp:

hipcc sgemm-64x256-sqc.cpp -o sgemm-64x256-sqc.out

Extract the kernel with the following command line, which generates sgemm-64x256-sqc.out-000-gfx906.isa:

extractkernel -i sgemm_64x256_sqc.out

Copy the correct kernel from sgemm-64x256-sqc.out-000-gfx906.isa into sgemm_64x256_sqc.s.

Compile sgemm_64x256_sqc.s into an LLVM code object:

/opt/rocm/hcc/bin/clang -x assembler -target amdgcn--amdhsa -mcpu=gfx906 -mno-code-object-v3 sgemm_64x256_sqc.s -o sgemm_sqc.co

 

 

 

 

 

 

 

 

 

 

 

 

 

sgemm_on_vega's Issues

Some questions on compiling

Following the instructions and using your .s file, I get:

 System minor 0
 System major 3
 agent prop name Device 66a1
hip Device prop succeeded 
*** Error in `/public/home/caspra120/sgemm_strided_batch_test/lib/./sgemm_strided_batched_test': free(): invalid size: 0x00007ffe096d8320 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x2b70d6ccd489]
/public/home/caspra120/sgemm_strided_batch_test/lib/./sgemm_strided_batched_test[0x405254]
/public/home/caspra120/sgemm_strided_batch_test/lib/./sgemm_strided_batched_test[0x405ec1]
/public/home/caspra120/sgemm_strided_batch_test/lib/./sgemm_strided_batched_test[0x40669e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b70d6c6e3d5]
/public/home/caspra120/sgemm_strided_batch_test/lib/./sgemm_strided_batched_test[0x402684]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 00:29 5856331270                         /public/home/caspra120/sgemm_strided_batch_test/lib/sgemm_strided_batched_test
00609000-0060a000 r-xp 00009000 00:29 5856331270                         /public/home/caspra120/sgemm_strided_batch_test/lib/sgemm_strided_batched_test
0060a000-0060c000 rwxp 0000a000 00:29 5856331270                         /public/home/caspra120/sgemm_strided_batch_test/lib/sgemm_strided_batched_test
0134c000-02b86000 rwxp 00000000 00:00 0                                  [heap]
2b70bfc3c000-2b70bfc5e000 r-xp 00000000 08:03 1320025                    /usr/lib64/ld-2.17.so
2b70bfc5e000-2b70bfc61000 rwxp 00000000 00:00 0 
2b70bfc61000-2b70bfc62000 rwxs 00000000 00:05 63663                      /dev/kfd
2b70bfc62000-2b70bfc63000 rwxp 00000000 00:00 0 
2b70bfc63000-2b70bfc64000 rwxs 00000000 00:05 63663                      /dev/kfd
2b70bfc65000-2b70bfc66000 rwxs 00000000 00:05 63663                      /dev/kfd
2b70bfc67000-2b70bfc68000 rwxs 00000000 00:05 63663                      /dev/kfd
2b70bfc69000-2b70bfc6a000 rwxp 00000000 00:00 0 
2b70bfc6c000-2b70bfc6e000 rwxp 00000000 00:00 0 
2b70bfc70000-2b70bfc78000 rwxs 100004000 00:05 53492                     /dev/dri/renderD128
2b70bfc79000-2b70bfc7a000 rwxp 00000000 00:00 0 
2b70bfc7c000-2b70bfc7e000 rwxp 00000000 00:00 0 
2b70bfc7f000-2b70bfc80000 rwxp 00000000 00:00 0 
2b70bfc82000-2b70bfc84000 rwxp 00000000 00:00 0 
2b70bfc85000-2b70bfc86000 rwxp 00000000 00:00 0 
2b70bfc88000-2b70bfc8a000 rwxp 00000000 00:00 0 
2b70bfc8b000-2b70bfc8c000 rwxp 00000000 00:00 0 
2b70bfc8d000-2b70bfc8e000 rwxp 00000000 00:00 0 
2b70bfc90000-2b70bfc9d000 rwxp 00000000 00:00 0 
2b70bfc9d000-2b70bfcd2000 r-xs 00000000 08:03 12977936                   /var/db/nscd/passwd
2b70bfcd4000-2b70bfcd6000 rwxp 00000000 00:00 0 
2b70bfcd7000-2b70bfcd8000 rwxp 00000000 00:00 0 
2b70bfcd9000-2b70bfcda000 ---p 14da5d000 00:05 53492                     /dev/dri/renderD128
2b70bfcdc000-2b70bfcde000 rwxs 00000000 00:05 63663                      /dev/kfd
2b70bfcdf000-2b70bfce0000 rwxp 00000000 00:00 0 
2b70bfce1000-2b70bfce2000 rwxp 00000000 00:00 0 
2b70bfce3000-2b70bfce4000 rwxp 00000000 00:00 0 
2b70bfce5000-2b70bfce6000 rwxp 00000000 00:00 0 
2b70bfd00000-2b70bfd80000 rwxp 00000000 00:00 0 
2b70bfdc0000-2b70bfe00000 rwxp 00000000 00:00 0 
2b70bfe5d000-2b70bfe5e000 r-xp 00021000 08:03 1320025                    /usr/lib64/ld-2.17.so
2b70bfe5e000-2b70bfe5f000 rwxp 00022000 08:03 1320025                    /usr/lib64/ld-2.17.so
2b70bfe5f000-2b70bfe60000 rwxp 00000000 00:00 0 
2b70bfe60000-2b70bff49000 r-xp 00000000 08:03 1320335                    /usr/lib64/libstdc++.so.6.0.19
2b70bff49000-2b70c0148000 ---p 000e9000 08:03 1320335                    /usr/lib64/libstdc++.so.6.0.19
2b70c0148000-2b70c0150000 r-xp 000e8000 08:03 1320335                    /usr/lib64/libstdc++.so.6.0.19
2b70c0150000-2b70c0152000 rwxp 000f0000 08:03 1320335                    /usr/lib64/libstdc++.so.6.0.19
2b70c0152000-2b70c0167000 rwxp 00000000 00:00 0 
2b70c0167000-2b70c0169000 r-xp 00000000 08:03 1320038                    /usr/lib64/libdl-2.17.so
2b70c0169000-2b70c0369000 ---p 00002000 08:03 1320038                    /usr/lib64/libdl-2.17.so
2b70c0369000-2b70c036a000 r-xp 00002000 08:03 1320038                    /usr/lib64/libdl-2.17.so
2b70c036a000-2b70c036b000 rwxp 00003000 08:03 1320038                    /usr/lib64/libdl-2.17.so
2b70c036b000-2b70c046c000 r-xp 00000000 08:03 1320040                    /usr/lib64/libm-2.17.so
2b70c046c000-2b70c066b000 ---p 00101000 08:03 1320040                    /usr/lib64/libm-2.17.so
2b70c066b000-2b70c066c000 r-xp 00100000 08:03 1320040                    /usr/lib64/libm-2.17.so
2b70c066c000-2b70c066d000 rwxp 00101000 08:03 1320040                    /usr/lib64/libm-2.17.so
2b70c066d000-2b70c0684000 r-xp 00000000 08:03 1320058                    /usr/lib64/libpthread-2.17.so
2b70c0684000-2b70c0883000 ---p 00017000 08:03 1320058                    /usr/lib64/libpthread-2.17.so
2b70c0883000-2b70c0884000 r-xp 00016000 08:03 1320058                    /usr/lib64/libpthread-2.17.so
2b70c0884000-2b70c0885000 rwxp 00017000 08:03 1320058                    /usr/lib64/libpthread-2.17.so
2b70c0885000-2b70c0889000 rwxp 00000000 00:00 0 
2b70c0889000-2b70c0894000 r-xp 00000000 08:03 6032106                    /opt/rocm/hcc/lib/libhc_am.so.2.9
2b70c0894000-2b70c0a93000 ---p 0000b000 08:03 6032106                    /opt/rocm/hcc/lib/libhc_am.so.2.9
2b70c0a93000-2b70c0a94000 r-xp 0000a000 08:03 6032106                    /opt/rocm/hcc/lib/libhc_am.so.2.9
2b70c0a94000-2b70c0afa000 rwxp 0000b000 08:03 6032106                    /opt/rocm/hcc/lib/libhc_am.so.2.9
2b70c0afa000-2b70c0b0f000 r-xp 00000000 08:03 6032109                    /opt/rocm/hcc/lib/libmcwamp.so.2.9
2b70c0b0f000-2b70c0d0e000 ---p 00015000 08:03 6032109                    /opt/rocm/hcc/lib/libmcwamp.so.2.9
2b70c0d0e000-2b70c0d0f000 r-xp 00014000 08:03 6032109                    /opt/rocm/hcc/lib/libmcwamp.so.2.9
2b70c0d0f000-2b70c0d54000 rwxp 00015000 08:03 6032109                    /opt/rocm/hcc/lib/libmcwamp.so.2.9
2b70c0d54000-2b70c0e9d000 r-xp 00000000 08:03 6033482                    /opt/rocm/hip/lib/libhip_hcc.so
2b70c0e9d000-2b70c109c000 ---p 00149000 08:03 6033482                    /opt/rocm/hip/lib/libhip_hcc.so
2b70c109c000-2b70c109e000 r-xp 00148000 08:03 6033482                    /opt/rocm/hip/lib/libhip_hcc.so
2b70c109e000-2b70c111d000 rwxp 0014a000 08:03 6033482                    /opt/rocm/hip/lib/libhip_hcc.so
2b70c111d000-2b70c155e000 rwxp 00000000 00:00 0 
2b70c155e000-2b70c1627000 r-xp 00000000 08:03 6031638                    /opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9
2b70c1627000-2b70c1826000 ---p 000c9000 08:03 6031638                    /opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9
2b70c1826000-2b70c182b000 r-xp 000c8000 08:03 6031638                    /opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9
2b70c182b000-2b70c182c000 rwxp 000cd000 08:03 6031638                    /opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9
2b70c182c000-2b70c182d000 rwxp 00000000 00:00 0 
2b70c182d000-2b70c187a000 r-xp 00000000 08:03 6033963                    /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so
2b70c187a000-2b70c1a79000 ---p 0004d000 08:03 6033963                    /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so
2b70c1a79000-2b70c1a7b000 r-xp 0004c000 08:03 6033963                    /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so
2b70c1a7b000-2b70c1a7c000 rwxp 0004e000 08:03 6033963                    /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so
2b70c1a7c000-2b70c1a7d000 r-xp 00000000 00:29 5586203398                 /public/home/caspra120/sgemm_strided_batch_test/lib/libcheckresult.so
2b70c1a7d000-2b70c1c7c000 ---p 00001000 00:29 5586203398                 /public/home/caspra120/sgemm_strided_batch_test/lib/libcheckresult.so
2b70c1c7c000-2b70c1c7d000 r-xp 00000000 00:29 5586203398                 /public/home/caspra120/sgemm_strided_batch_test/lib/libcheckresult.so
2b70c1c7d000-2b70c1c7e000 rwxp 00001000 00:29 5586203398                 /public/home/caspra120/sgemm_strided_batch_test/lib/libcheckresult.so
2b70c1c7e000-2b70c1c7f000 r-xp 00000000 00:29 5586203654                 /public/home/caspra120/sgemm_strided_batch_test/lib/libsgemm_strided_batched.so
2b70c1c7f000-2b70c1e7e000 ---p 00001000 00:29 5586203654                 /public/home/caspra120/sgemm_strided_batch_test/lib/libsgemm_strided_batched.so
2b70c1e7e000-2b70c1e7f000 r-xp 00000000 00:29 5586203654                 /public/home/caspra120/sgemm_strided_batch_test/lib/libsgemm_strided_batched.so
2b70c1e7f000-2b70c1e80000 rwxp 00001000 00:29 5586203654                 /public/home/caspra120/sgemm_strided_batch_test/lib/libsgemm_strided_batched.so
2b70c1e80000-2b70c20b4000 r-xp 00000000 08:03 1320697                    /usr/lib64/libcrypto.so.1.0.2k
2b70c20b4000-2b70c22b4000 ---p 00234000 08:03 1320697                    /usr/lib64/libcrypto.so.1.0.2k
2b70c22b4000-2b70c22d0000 r-xp 00234000 08:03 1320697                    /usr/lib64/libcrypto.so.1.0.2k
2b70c22d0000-2b70c22dd000 rwxp 00250000 08:03 1320697                    /usr/lib64/libcrypto.so.1.0.2k
2b70c22dd000-2b70c22e1000 rwxp 00000000 00:00 0 
2b70c22e1000-2b70d3141000 r-xp 00000000 08:03 6033990                    /opt/rocm/rocblas/lib/librocblas.so.0.1
2b70d3141000-2b70d3341000 ---p 10e60000 08:03 6033990                    /opt/rocm/rocblas/lib/librocblas.so.0.1
2b70d3341000-2b70d3368000 r-xp 10e60000 08:03 6033990                    /opt/rocm/rocblas/lib/librocblas.so.0.1
2b70d3368000-2b70d669c000 rwxp 10e87000 08:03 6033990                    /opt/rocm/rocblas/lib/librocblas.so.0.1
2b70d669c000-2b70d6a36000 rwxp 00000000 00:00 0 
2b70d6a36000-2b70d6a4b000 r-xp 00000000 08:03 1310739                    /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b70d6a4b000-2b70d6c4a000 ---p 00015000 08:03 1310739                    /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b70d6c4a000-2b70d6c4b000 r-xp 00014000 08:03 1310739                    /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b70d6c4b000-2b70d6c4c000 rwxp 00015000 08:03 1310739                    /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b70d6c4c000-2b70d6e0e000 r-xp 00000000 08:03 1320032                    /usr/lib64/libc-2.17.so
2b70d6e0e000-2b70d700e000 ---p 001c2000 08:03 1320032                    /usr/lib64/libc-2.17.so
2b70d700e000-2b70d7012000 r-xp 001c2000 08:03 1320032                    /usr/lib64/libc-2.17.so
2b70d7012000-2b70d7014000 rwxp 001c6000 08:03 1320032                    /usr/lib64/libc-2.17.so
2b70d7014000-2b70d7019000 rwxp 00000000 00:00 0 
2b70d7019000-2b70dc3ae000 r-xp 00000000 08:03 5905890                    /opt/rocm/lib/libamd_comgr.so.1.3
2b70dc3ae000-2b70dc5ad000 ---p 05395000 08:03 5905890                    /opt/rocm/lib/libamd_comgr.so.1.3
2b70dc5ad000-2b70dc989000 r-xp 05394000 08:03 5905890                    /opt/rocm/lib/libamd_comgr.so.1.3
2b70dc989000-2b70dc996000 rwxp 05770000 08:03 5905890                    /opt/rocm/lib/libamd_comgr.so.1.3
2b70dc996000-2b70dc9e7000 rwxp 00000000 00:00 0 
2b70dc9e7000-2b70dca06000 r-xp 00000000 08:03 5905898                    /opt/rocm/lib64/libhsakmt.so.1.0.6
2b70dca06000-2b70dcc06000 ---p 0001f000 08:03 5905898                    /opt/rocm/lib64/libhsakmt.so.1.0.6
2b70dcc06000-2b70dcc07000 r-xp 0001f000 08:03 5905898                    /opt/rocm/lib64/libhsakmt.so.1.0.6
2b70dcc07000-2b70dcc11000 rwxp 00020000 08:03 5905898                    /opt/rocm/lib64/libhsakmt.so.1.0.6
2b70dcc11000-2b70dcc26000 r-xp 00000000 08:03 1320344                    /usr/lib64/libz.so.1.2.7
2b70dcc26000-2b70dce25000 ---p 00015000 08:03 1320344                    /usr/lib64/libz.so.1.2.7
2b70dce25000-2b70dce26000 r-xp 00014000 08:03 1320344                    /usr/lib64/libz.so.1.2.7
2b70dce26000-2b70dce27000 rwxp 00015000 08:03 1320344                    /usr/lib64/libz.so.1.2.7
2b70dce27000-2b70dce2e000 r-xp 00000000 08:03 1320062                    /usr/lib64/librt-2.17.so
2b70dce2e000-2b70dd02d000 ---p 00007000 08:03 1320062                    /usr/lib64/librt-2.17.so
2b70dd02d000-2b70dd02e000 r-xp 00006000 08:03 1320062                    /usr/lib64/librt-2.17.so
2b70dd02e000-2b70dd02f000 rwxp 00007000 08:03 1320062                    /usr/lib64/librt-2.17.so
2b70dd02f000-2b70dd054000 r-xp 00000000 08:03 1320373                    /usr/lib64/libtinfo.so.5.9
2b70dd054000-2b70dd254000 ---p 00025000 08:03 1320373                    /usr/lib64/libtinfo.so.5.9
2b70dd254000-2b70dd258000 r-xp 00025000 08:03 1320373                    /usr/lib64/libtinfo.so.5.9
2b70dd258000-2b70dd259000 rwxp 00029000 08:03 1320373                    /usr/lib64/libtinfo.so.5.9
2b70dd259000-2b70dd263000 r-xp 00000000 08:03 1320901                    /usr/lib64/libnuma.so.1
2b70dd263000-2b70dd463000 ---p 0000a000 08:03 1320901                    /usr/lib64/libnuma.so.1
2b70dd463000-2b70dd464000 r-xp 0000a000 08:03 1320901                    /usr/lib64/libnuma.so.1
2b70dd464000-2b70dd465000 rwxp 0000b000 08:03 1320901                    /usr/lib64/libnuma.so.1
2b70dd465000-2b70dd471000 r-xp 00000000 08:03 1322493                    /usr/lib64/libpci.so.3.5.1
2b70dd471000-2b70dd670000 ---p 0000c000 08:03 1322493                    /usr/lib64/libpci.so.3.5.1
2b70dd670000-2b70dd671000 r-xp 0000b000 08:03 1322493                    /usr/lib64/libpci.so.3.5.1
2b70dd671000-2b70dd672000 rwxp 0000c000 08:03 1322493                    /usr/lib64/libpci.so.3.5.1
2b70dd672000-2b70dd688000 r-xp 00000000 08:03 1320060                    /usr/lib64/libresolv-2.17.so
2b70dd688000-2b70dd887000 ---p 00016000 08:03 1320060                    /usr/lib64/libresolv-2.17.so
2b70dd887000-2b70dd888000 r-xp 00015000 08:03 1320060                    /usr/lib64/libresolv-2.17.so
2b70dd888000-2b70dd889000 rwxp 00016000 08:03 1320060                    /usr/lib64/libresolv-2.17.so
2b70dd889000-2b70dd88b000 rwxp 00000000 00:00 0 
2b70dd88b000-2b70dd92e000 r-xp 00000000 08:03 6032115                    /opt/rocm/hcc/lib/libmcwamp_hsa.so.2.9
2b70dd92e000-2b70ddb2d000 ---p 000a3000 08:03 6032115                    /opt/rocm/hcc/lib/libmcwamp_hsa.so.2.9
2b70ddb2d000-2b70ddb30000 r-xp 000a2000 08:03 6032115                    /opt/rocm/hcc/lib/libmcwamp_hsa.so.2.9
2b70ddb30000-2b70ddf6f000 rwxp 000a5000 08:03 6032115                    /opt/rocm/hcc/lib/libmcwamp_hsa.so.2.9
2b70ddf6f000-2b70ddf70000 ---p 00000000 00:00 0 
2b70ddf70000-2b70de170000 rwxp 00000000 00:00 0 
2b70de170000-2b70de1c3000 r-xp 00000000 08:03 6031645                    /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
2b70de1c3000-2b70de3c3000 ---p 00053000 08:03 6031645                    /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
2b70de3c3000-2b70de3c4000 r-xp 00053000 08:03 6031645                    /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
2b70de3c4000-2b70de434000 rwxp 00054000 08:03 6031645                    /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
2b70de480000-2b70de500000 rwxp 00000000 00:00 0 
2b70de540000-2b70de5a0000 rwxp 00000000 00:00 0 
2b70de600000-2b70dea00000 rwxp 00000000 00:00 0 
2b70dea80000-2b70deb00000 rwxp 00000000 00:00 0 
2b70dec00000-2b70df000000 rwxp 00000000 00:00 0 
2b70df080000-2b70df100000 rwxp 00000000 00:00 0 
2b70df200000-2b70df600000 rwxp 00000000 00:00 0 
2b70df680000-2b70df700000 rwxp 00000000 00:00 0 
2b70df800000-2b70dfc00000 rwxp 00000000 00:00 0 
2b70dfc80000-2b70dfd00000 rwxp 00000000 00:00 0 
2b70dfd80000-2b70dfe00000 rwxp 00000000 00:00 0 
2b70dfe80000-2b70dff00000 rwxp 00000000 00:00 0 
2b70e0000000-2b70e0021000 rwxp 00000000 00:00 0 
2b70e0021000-2b70e4000000 ---p 00000000 00:00 0 
2b70e4200000-2b70e4600000 rwxp 00000000 00:00 0 
2b70e4800000-2b70e4c00000 rwxp 00000000 00:00 0 
2b70e4e00000-2b70e5200000 rwxp 00000000 00:00 0 
2b70e5400000-2b70e5800000 rwxp 00000000 00:00 0 
2b70e5a00000-2b70e5e00000 rwxp 00000000 00:00 0 
2b70e6000000-2b70e6400000 rwxp 00000000 00:00 0 
2b70e6600000-2b70e6a00000 rwxp 00000000 00:00 0 
2b70e6c00000-2b70e7000000 rwxp 00000000 00:00 0 
2b70e7200000-2b70e7600000 rwxp 00000000 00:00 0 
2b70e7800000-2b70e7c00000 rwxp 00000000 00:00 0 
2b70e7e00000-2b70e8200000 rwxp 00000000 00:00 0 
2b70e8400000-2b70e8800000 rwxp 00000000 00:00 0 
2b70e8a00000-2b70e8e00000 rwxp 00000000 00:00 0 
2b70e9000000-2b70e9400000 rwxp 00000000 00:00 0 
2b70e9600000-2b70e9a00000 rwxp 00000000 00:00 0 
2b70e9c00000-2b70ea000000 rwxp 00000000 00:00 0 
2b70ea200000-2b70ea600000 rwxp 00000000 00:00 0 
2b70ea800000-2b70eac00000 rwxp 00000000 00:00 0 
2b70eae00000-2b70eb200000 rwxp 00000000 00:00 0 
2b70eb400000-2b70eb800000 rwxp 00000000 00:00 0 
2b70eba00000-2b70ebe00000 rwxp 00000000 00:00 0 
2b70ec000000-2b70ec400000 rwxp 00000000 00:00 0 
2b70ec600000-2b70eca00000 rwxp 00000000 00:00 0 
2b70ecc00000-2b70ed000000 rwxp 00000000 00:00 0 
2b70ed200000-2b70ed600000 rwxp 00000000 00:00 0 
2b70ed800000-2b70edc00000 rwxp 00000000 00:00 0 
2b70ede00000-2b70ee200000 rwxp 00000000 00:00 0 
2b70ee400000-2b70ee800000 rwxp 00000000 00:00 0 
2b70ee800000-2b70f3a02000 rwxp 00000000 00:00 0 
2b70f3c00000-2b70f6600000 rwxs 10841b000 00:05 53492                     /dev/dri/renderD128
2b70f6800000-2b70f9200000 rwxs 10ae1b000 00:05 53492                     /dev/dri/renderD128
2b70f9400000-2b70fa945000 rwxp 00000000 00:00 0 
2b70fc000000-2b70fc021000 rwxp 00000000 00:00 0 
2b70fc021000-2b7100000000 ---p 00000000 00:00 0 
2b7105d6c000-2b710909f000 rwxp 00000000 00:00 0 
2b7109200000-2b7209200000 ---p 00000000 00:00 0 
2b7209400000-2b7309400000 ---p 00000000 00:00 0 
2b7309600000-2b7409600000 ---p 00000000 00:00 0 
2b7409800000-2b7509800000 ---p 00000000 00:00 0 
2b7509800000-2b7549901000 rwxp 00000000 00:00 0 
2b7549a00000-2b7589c00000 rwxs 10d81b000 00:05 53492                     /dev/dri/renderD128
7ffe096ba000-7ffe096dd000 rwxp 00000000 00:00 0                          [stack]
7ffe097b8000-7ffe097ba000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
host allocated 
device allocated 
Host to Device Copied
sgemm_sqc.o begin
sgemm_sqc.o loaded
srun: error: k03r1n14: task 0: Aborted
srun: Terminating job step 1504559.0

My rocminfo output should be the same as yours:

  Name:                    gfx906                             
  Marketing Name:          Device 66a1                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    6                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26273(0x66a1)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   17152                              
  Internal Node ID:        6                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32

Is that because memory is over-allocated on a different device? Or are there multiple memory types or VGPR registers on the same device?
Thanks for your tutorial, by the way.
