cloudcores / cuassembler Goto Github PK
View Code? Open in Web Editor NEWAn unofficial cuda assembler, for all generations of SASS, hopefully :)
License: MIT License
An unofficial cuda assembler, for all generations of SASS, hopefully :)
License: MIT License
This is a really cool and ambitious project!
I just wanted to share that I found an instruction description for all SASS instructions recently, see DocumentSASS. Perhaps the process of creating an assembler can be automated from these files.
Best,
Sebastian
Hi, I tried to use CuAssembler on the sample CuTest code. It works well on the build-in cubin files. However, when I recompiled it using NVCC11.6 and the same arguments, the new cubin file can not be disassembled correctly and gots the following error:
2022-05-12 23:33:29,857 - - Running CubinFile.loadCubin...
2022-05-12 23:33:29,857 - ENTRY - Loading cubin file cudatest.sm_75.cubin...
2022-05-12 23:33:29,858 - ERROR - Abnormal elf layout detected! No program header found!
Traceback (most recent call last):
File "a.py", line 4, in <module>
cf = CubinFile(binname)
File "/home//sync_test/CuAssembler/CuAsm/CubinFile.py", line 28, in __init__
self.loadCubin(cubinname)
File "/home//sync_test/CuAssembler/CuAsm/CuAsmLogger.py", line 143, in wrapper
ret = func(*args, **kwargs)
File "/home//sync_test/CuAssembler/CuAsm/CubinFile.py", line 72, in loadCubin
raise Exception(msg)
Exception: Abnormal elf layout detected! No program header found!
Any plan to support new compilers?
2023-01-19 00:42:36,214 - - Running CuAsmParser.parse...
2023-01-19 00:42:36,215 - ENTRY - Parsing file BuildSteps.sm_75.cuasm
2023-01-19 00:42:36,224 - - Running CuAsmParser.__preScan...
2023-01-19 00:42:37,001 - WARNING - Line 19808: Weak symbol found! The implementation is not complete, please be cautious...
2023-01-19 00:42:37,005 - - Running CuAsmParser.__gatherTextSectionSizeLabel...
2023-01-19 00:42:37,006 - - Running CuAsmParser.__buildInternalTables...
2023-01-19 00:42:37,007 - - Running CuAsmParser.__evalFixups...
2023-01-19 00:42:37,009 - - Running CuAsmParser.__parseKernels...
2023-01-19 00:42:37,013 - PROC - Parsing kernel text of ".text._Z10DoEatStepsILi3EEvi"...
2023-01-19 00:42:37,057 - ERROR - Assertion failed in:
File BuildSteps.sm_75.cuasm:18452 :
[B------:R-:W-:-:S02] /*0d80*/ @!P0 IADD3.X R2, P1, R2, R15, RZ, P1, !PT ;
Error when assembling instruction "[B------:R-:W-:-:S02] @!P0 IADD3.X R2, P1, R2, R15, RZ, P1, !PT ;":
Assembling failed (NewVals): Insufficient basis, try CuAsming more instructions!
Known Records:
IADD3.X R4, P1, R17, R4, RZ, P1, !PT ;
IADD3.X R16, P0, R5, R4, RZ, P0, !PT ;
IADD3.X R2, P0, RZ, R0, RZ, P0, !PT ;
IADD3.X R4, P0, RZ, R4, RZ, P0, !PT ;
IADD3.X RZ, P0, R4, R4, RZ, P0, !PT ;
IADD3.X R2, P0, R3.reuse, R3, RZ, P0, !PT ;
IADD3.X R13, P1, RZ, R4, RZ, P0, !PT ;
IADD3.X R44, P5, R5, ~R28, RZ, P5, !PT ;
IADD3.X R24, P3, ~R0, R37, RZ, P0, !PT ;
How do I "CuAsm more instructions?"
BuildSteps.sm_75.cuasm is the unmodified output from cuasm.cmd BuildSteps.sm_75.cubin
.
FYI: The 'weak' symbol is .weak $_Z11DoInitFirstj$__cuda_sm20_rem_u64
.
Hi, big god.
I found the "L2Bank" in your microbenchmark.
I don't know the principle and theory of your code. would you like to teach me?
I found some reference document about "L2Bank" from internet, but, the "L2Bank" is about Ampere.
and, I noticed that the ARCH is sm_50 and sm_75 in your makefile
So, I'm very confused, what is this "L2Bank"?
I run your microbenchmark on my SM_86, I cannot figure out what it want to demostrate,
So, how to understand it ?
libcublasLt_static.1028.sm_75.zip
libcublasLt_static.1028.sm_75_cubin.zip
I successfully create the assembly code from libcublasLt_static.1028.sm_75.cubin with cuasm libcublasLt_static.1028.sm_75.cubin.
When I try to regenerate the cubin with cuasm libcublasLt_static.1028.sm_75.cuasm -o new_libcublasLt_static.1028.sm_75.cubin it shows me the following error:
2023-01-27 16:55:59,995 - - Running CuAsmParser.parse...
2023-01-27 16:55:59,996 - ENTRY - Parsing file libcublasLt_static.1028.sm_75.cuasm
2023-01-27 16:56:00,025 - - Running CuAsmParser.__preScan...
2023-01-27 16:56:03,810 - - Running CuAsmParser.__gatherTextSectionSizeLabel...
2023-01-27 16:56:03,814 - - Running CuAsmParser.__buildInternalTables...
Traceback (most recent call last):
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 153, in
doProcess(infile, outfile, direction)
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 94, in doProcess
cuasm2cubin(src, dst)
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 72, in cuasm2cubin
cap.parse(asmname)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmLogger.py", line 210, in wrapper
ret = func(*args, **kwargs)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 766, in parse
self.__buildInternalTables()
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmLogger.py", line 241, in wrapper
ret = func(*args, **kwargs)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 929, in __buildInternalTables
self.__mSectionDict['.symtab'].getData())
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 147, in buildSymbolDict
raise Exception('Duplicate symbol @%#x with name %s!', p, name)
Exception: ('Duplicate symbol @%#x with name %s!', 24, '')
I am trying to interleave the instructions of LD and FFMA for maximum throughput. For LDG instruction, I notice the stall count is always 4, such as:
[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4] ;
[----:B------:R-:W3:-:S04] /0380/ LDG.E.SYS R52, [R2+0x4] ;
[----:B------:R-:W4:-:S04] /0390/ LDG.E.SYS R54, [R2+0x8] ;
[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ;
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ;
However, the gloabl load instruction actually holds a variable latency. The instruction using the loaded oprand by LDG has to be synchronized with wrtie barrier instead of stall count. So why does the stall count of LDG is set as 4.
Another confusion is about the stall count for fix latency instruction, such as:
[-R--:B------:R-:W-:-:S02] /0540/ FFMA R22, R54, R66.reuse, R22 ;
[-R--:B------:R-:W-:-:S02] /0550/ FFMA R20, R56, R66.reuse, R20 ;
[-R--:B------:R-:W-:-:S02] /0560/ FFMA R18, R58, R66.reuse, R18 ;
[-R--:B------:R-:W-:-:S02] /0570/ FFMA R16, R60, R66.reuse, R16 ;
[-R--:B------:R-:W-:-:S02] /0580/ FFMA R14, R62, R66.reuse, R14 ;
As we can see the stall count is set as 2. The author of Maxas declares the pipline of a arithmetic instruction is 6 cycles in maxwell arch. However, the author implementing dual issue as following:
--:-:-:-:0 FFMA cx02y00, j0Ax02, j0By00, cx02y00;
--:-:-:-:1 LDS.U.128 j1Ax00, [readAs + 4x<1128 + 00>];
--:-:-:-:1 FFMA cx02y01, j0Ax02, j0By01, cx02y01;
--:-:-:-:0 FFMA cx00y01, j0Ax00, j0By01, cx00y01;
--:-:-:-:1 LDS.U.128 j1By00, [readBs + 4x<1128 + 00>];
--:-:-:-:1 FFMA cx00y00, j0Ax00, j0By00, cx00y00;
--:-:-:-:0 FFMA cx03y00, j0Ax03, j0By00, cx03y00;
--:-:-:-:1 LDS.U.128 j1Ax64, [readAs + 4x<1128 + 64>];
--:-:-:-:1 FFMA cx03y01, j0Ax03, j0By01, cx03y01;
--:-:-:-:0 FFMA cx01y01, j0Ax01, j0By01, cx01y01;
--:2:1:-:2 LDS.U.128 j1By64, [readBs + 4x<1128 + 64>]; // Set Dep 1,2 Stall 2
02:-:-:-:1 IADD readAs, readAs, 16; // Wait Dep 2
--:-:-:-:1 IADD readBs, readBs, 16;
01:-:-:-:1 FFMA cx02y00, j1Ax02, j1By00, cx02y00; // Wait Dep 1
cited from Maxas
I understand the first 0 stall count is for dual issue. After a dual issue, there is a FFMA instruction(3rd instruction) followed by LDS instruction, and its stall count is 1, following by this FFMA another dual issue is triggered. My understanding is this 1 stall count is for issue this instruction and all the instruction shown in this code are stored in instruction buffer without wating them to finish. The doubtion is, in my assembly code, I don't use such strategy, so why the stall count is set as 2? Is this implies the latency of FFMA is 2 clock cycles. If this is not the case, how does a program ensure the FFMA computation is finish and safe to use in the following inscturion?
In some code, like:
[----:B------:R-:W-:-:S02] /0470/ IADD3 R44, R44, 0x8, RZ ;
[----:B------:R-:W-:-:S02] /0480/ ISETP.GE.AND P0, PT, R86, 0x80, PT ;
[----:B------:R-:W-:-:S01] /0490/ IADD3 R46, R46, 0x8, RZ ;
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;
You can see for a fix latency instruction, such as IADD3 in this code, the stall count of the instruction is also varied in different scenario. For stall count as 1, I would assume this different is to interleave the instruction issue. But agian, how should a program make sure the oprand is ready for following instruction.
Hello,
When I try cuassember followed your suggestion , I run 'make dump' but I met next issue , could you kindly help ?
HNVCC_OP=dump hnvcc test_war.cu -arch=sm_60 -o test_war
HNVCC_OP is %s HNVCC_OP
Error when calling run(args) with args=['cicc', '--c++14', '--gnu_version=70500', '--orig_src_file_name', 'test_war.cu', '--allow_managed', '-arch', 'compute_60', '-m64', '-ftz=0', '-prec_div=1', '-prec_sqrt=1', '-fmad=1', '--include_file_name', 'test_war.fatbin.c', '-tused', '-nvvmir-library', '/usr/local/cuda-11.0/bin/../nvvm/libdevice/libdevice.10.bc', '--gen_module_id_file', '--module_id_file_name', 'hnvcc_keep_dir/test_war.module_id', '--gen_c_file_name', 'hnvcc_keep_dir/test_war.cudafe1.c', '--stub_file_name', 'hnvcc_keep_dir/test_war.cudafe1.stub.c', '--gen_device_file_name', 'hnvcc_keep_dir/test_war.cudafe1.gpu', 'hnvcc_keep_dir/test_war.cpp1.ii', '-o', 'hnvcc_keep_dir/test_war.ptx']!
[Errno 2] No such file or directory: 'cicc'
C:\Python39>python "G:\CuAssembler\bin\cuasm.py" --bin2asm "G:\elf3.dat"
,023 - - Running CubinFile.loadCubin...
,023 - ENTRY - Loading cubin file G:\elf3.dat...
,044 - - Running CubinFile.disassembleCubin...
,044 - PROC - Disassembling G:\elf3.dat...
,779 - - Func CubinFile.disassembleCubin completed! Time= 12.7356 secs.
,044 - - Func CubinFile.loadCubin completed! Time= 13.0207 secs.
,044 - - Running CubinFile.saveAsCuAsm...
,044 - ENTRY - Saving to cuasm file G:\elf3.cuasm...
Traceback (most recent call last):
File "G:\CuAssembler\bin\cuasm.py", line 153, in
doProcess(infile, outfile, direction)
File "G:\CuAssembler\bin\cuasm.py", line 92, in doProcess
cubin2cuasm(src, dst)
File "G:\CuAssembler\bin\cuasm.py", line 68, in cubin2cuasm
cf.saveAsCuAsm(asmname)
File "G:\CuAssembler\CuAsm\CuAsmLogger.py", line 210, in wrapper
ret = func(*args, **kwargs)
File "G:\CuAssembler\CuAsm\CubinFile.py", line 526, in saveAsCuAsm
self.__writeImplicitSectionAsm(fout, secname)
File "G:\CuAssembler\CuAsm\CubinFile.py", line 445, in __writeImplicitSectionAsm
raise Exception('Unknown implicit section %s !'%secname)
Exception: Unknown implicit section .nv.info._ZNK45_GLOBAL__N__22_octane_kernels_cpp1_ii_kernel011Wavelengths33convertFilterSpectrumToLinearSrgbE6float4$1021 !
mhi, big god
my CuAssembler raise exception when I test the "TestData"
my nvcc is 11.3, and my arch is sm_86
it throw following exception when I executed "make hack":
`2023-09-16 23:27:38,221 - ERROR - Assertion failed in:
File hack.cudatest.sm_86.cuasm:2876 :
[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;
Error when assembling instruction "[B------:R-:W2:-:S01] LDG.E R5, [R22.64] ;":
Assembling failed (NewModi): Unknown modifiers: ({'2_R.64'})
Known Records:
LDG.E R6, [R4] ;
LDG.E R0, [R4] ;
LDG.E R25, [R18] ;
@P1 LDG.E R58, [R58] ;
@P2 LDG.E.128 R36, [R48] ;
@P2 LDG.E.128 R68, [R76+0x80] ;
@P1 LDG.E.64 R96, [R96] ;
@P0 LDG.E.U16 R38, [R38] ;
@P0 LDG.E.LTC128B R42, [R88] ;
@!P0 LDG.E.STRONG.GPU R40, [R34] ;
LDG.U16.CONSTANT R17, [R17] ;
LDG.U16.CONSTANT R9, [R9+-0x40] ;
LDG.U8.CONSTANT R17, [R17] ;
LDG.U16 R13, [R6] ;
@P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;
@P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;`
and the codes around line2876 in "hack.cudatest.sm_86.cuasm" as following:
`2871 [B------:R-:W-:Y:S02] /0000/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
2872 [B------:R-:W-:-:S01] /*0010*/ IMAD.MOV.U32 R22, RZ, RZ, c[0x0][0x160] ;
2873 [B------:R-:W-:-:S01] /*0020*/ ULDC.64 UR36, c[0x0][0x118] ;
2874 [B------:R-:W-:-:S01] /*0030*/ IMAD.MOV.U32 R23, RZ, RZ, c[0x0][0x164] ;
2875 [B------:R-:W-:Y:S04] /*0040*/ IADD3 R1, R1, -0x28, RZ ;
2876 [B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;
2877 [B------:R-:W-:-:S02] /*0060*/ MOV R2, 32@lo(flist) ;
2878 [B------:R-:W-:-:S01] /*0070*/ MOV R3, 32@hi(flist) ;
2879 [B------:R-:W0:-:S04] /*0080*/ S2R R17, SR_CTAID.X ;
2880 [B------:R-:W0:-:S01] /*0090*/ S2R R0, SR_TID.X ;
2881 [B------:R-:W-:-:S02] /*00a0*/ IMAD.MOV.U32 R18, RZ, RZ, 0x4 ;`
and line2876 in "hack.cudatest.sm_86.cuasm" as following:
2876 [B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;
it will be ok when I change arch to sm_60/sm_75。
it will throw above exceptions if I change arch to sm_80/86
but, my real hardware is sm_86, so, I cannot pass these exceptions.
I don't know how to fix this trouble, would you like to help me to fix this trouble?
or, would you like to tell me the reason?
Hi,
Many thanks for releasing this assembler! I was trying to disassemble pytorch's kernel, but and then I saw this error:
Error when assembling instruction "[B------:R3:W4:-:S06] @p3 LDG.E.EL.128 R12, desc[UR8][R16.64] ;":
Assembling failed (NewModi): Unknown modifiers: ({'0_EL'})
May I know is there any idea how to resolve this error?
Best,
Hi,
Just want to know status on tensor core support..
Thanks..
I found that there are many negative numbers in the weight matrixs. From what I see, this is because a certain modifier is omitted for some instruction . Such as the F2I_R_R instruction of SM_70, it has many data type modifers , for example "U64, S64, U32, S16, U8, S8", but the default "S32" type is omitted. If we add the "S32" modifier back,then after resolving the weight, we will get a matrix without negative number. Is there a reliable way to achieve this function?
我的python代码如下:
binname = 'D:\WorkSpace\cuda_workspace\instruction_test\simpletest\x64\Release\kernel.sm_61.cubin'
cf = CubinFile(binname)
asmname = binname.replace('.cubin', '.cuasm')
cf.saveAsCuAsm(asmname)
各种软件版本如下:
python is 3.7.4
cuda tool kit is 10.2
Sympy is 1.4
pyelftools is the latest
总是提示“Exception: The segment range (0x2a4, 0x4fc) doesnot align with sections!”
即便把sm_61.cubin换成相应的sm_50或者sm_75都是类似的错误提示,不知道大神能不能帮忙看看什么原因?
相应的cu文件和cubin文件在附件中
simpletest.zip
Hello, I am using CuAssembler's example.
__global__ void vectorAdd(const float* a, const float* b, float* c)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
c[idx] = a[idx] + b[idx];
}
The previous steps are normal. But when I tried to convert the cuasm file to cubin file, I get an error, as follows:
2022-01-28 05:55:42,506 - - Running CuAsmParser.parse...
2022-01-28 05:55:42,506 - ENTRY - Parsing file vecadd.sm_70.cuasm
2022-01-28 05:55:42,507 - - Running CuAsmParser.__preScan...
Traceback (most recent call last):
File "asm2bin.py", line 6, in <module>
cap.parse(asmname)
File "/CuAssembler/CuAsm/CuAsmLogger.py", line 143, in wrapper
ret = func(*args, **kwargs)
File "CuAssembler/CuAsm/CuAsmParser.py", line 734, in parse
self.__preScan()
File "CuAssembler/CuAsm/CuAsmLogger.py", line 174, in wrapper
ret = func(*args, **kwargs)
File "CuAssembler/CuAsm/CuAsmParser.py", line 842, in __preScan
self.__dirDict[cmd](args)
File "CuAssembler/CuAsm/CuAsmParser.py", line 652, in <lambda>
'.__elf_flags' : (lambda args: self.__dir_elfheader('flags' , args)),
File "CuAssembler/CuAsm/CuAsmParser.py", line 1406, in __dir_elfheader
self.__mCuInsAsmRepos.setToDefaultInsAsmDict()
File "/CuAssembler/CuAsm/CuInsAssemblerRepos.py", line 38, in setToDefaultInsAsmDict
fname = Config.getDefaultInsAsmReposFile(self.__mSMVersion.getVersionNumber())
File "/CuAssembler/CuAsm/config.py", line 52, in getDefaultInsAsmReposFile
raise IOError('File %s not found!' % repos_name)
OSError: File DefaultInsAsmRepos.sm_70.txt not found!
The software versions are as follows:
python 3.7
CUDA 11.2
GPU used is V100
How should this situation be resolved?
When I convert the kernel to sass assembly, I use this tool to convert cuasm to cubin, and call the cubin in the c++program. The direct execution result is correct, but an error occurs when using Nsight Compute to analyze:
==Error==LaunchFailed
==Error==LaunchFailed
==PROF==Trying to shutdown target application
==Error==The application returned an error code (9)
==Error==An error occurred while trying to profile
I encountered the same error on both the 2022 and 2024 versions of Nsight Compute, however, the cubin file generated directly using NVCC can be analyzed normally using Nsight Compute
Hi,
I ran into the following issue when trying to use CuAssembler:
Error when assembling instruction "[----:B------:R-:W-:Y:S04] SHF.L.W.U32.HI R11, R4, 0x13, R4 ;":("Assembling failed (NewModi): Unknown modifiers: ({'0_W'})",)
I am performing the following steps (without modifying the cuasm file):
CuAsm.CubinFile
CuAsm.CuAsmParser
I checked the repo file and it should contain 0_W
in the InsModiSet
for SHF
.
In addition, I also tried to merge the repos file generated from my cuda binary with the default repos file for the sm_75 architecture and then use the updated repository in the CuAsmParser
. But that doesn't seem to help.
Could you give me a hint on how to proceed with this error? Is this a limitation of the project in its current state?
Additional information on my setup:
- arch: sm_75
- OS: Ubuntu 20.04
- nvcc version: 10.1
Hi, big god
I meet a new trouble about "LDG",the assembler raise exception like this:
` File hack.main.sm_86.cuasm:795 :
[B------:R-:W2:-:S04] /*0080*/ LDG.E.LTC128B.CONSTANT R4, desc[UR4][R2.64] ;
Error when assembling instruction "[B------:R-:W2:-:S04] LDG.E.LTC128B.CONSTANT R4, desc[UR4][R2.64] ;":
Assembling failed (NewModi): Unknown modifiers: ({'0_LTC128B'})`
my LDG instruction in ptx like this:
` asm volatile (
"ld.global.nc.L2::128B.v4.b32 {%0, %1, %2, %3}, [%4];\n"
: "=r"(reg0.x), "=r"(reg0.y), "=r"(reg0.z), "=r"(reg0.w)
: "l"(src_ptr)
);`
this exception will be removed if I remove the "L2::128B"
so, would you like to teach me how to solve this trouble?
I used vectorAdd from cuda samples for experiment. Follow the steps in User Guide, I got my cuasm. Here is the critical part of my cuasm
[----:B------:R-:W-:-:S06] /0008/ MOV R1, c[0x0][0x20] ;
[----:B------:R-:W0:-:S01] /0010/ S2R R0, SR_CTAID.X ;
[----:B------:R-:W1:-:S15] /0018/ S2R R2, SR_TID.X ;
[R---:B0-----:R-:W-:-:S01] /0028/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ ;
[R---:B-1----:R-:W-:-:S06] /0030/ XMAD R2, R0.reuse, c[0x0] [0x8], R2 ;
[----:B------:R-:W-:-:S06] /0038/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 ;
[----:B------:R-:W-:Y:S13] /0048/ ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT ;
[----:B------:R-:W-:Y:S10] /0050/ NOP ;
[----:B------:R-:W-:-:S13] /0058/ @p0 EXIT ;
[R---:B------:R-:W-:-:S01] /0068/ SHL R6, R0.reuse, 0x2 ;
[----:B------:R-:W-:-:S05] /0070/ SHR R0, R0, 0x1e ;
[R---:B------:R-:W-:-:S06] /0078/ IADD R4.CC, R6.reuse, c[0x0][0x140] ;
[R---:B------:R-:W-:-:S02] /0088/ IADD.X R5, R0.reuse, c[0x0][0x144] ;
[----:B------:R-:W-:-:S00] /0090/ { IADD R2.CC, R6, c[0x0][0x148] ;
[----:B------:R-:W-:-:S06] /0098/ LDG.E R4, [R4] }
[----:B------:R-:W-:Y:S02] /00a8/ IADD.X R3, R0, c[0x0][0x14c] ;
[----:B------:R-:W5:-:S01] /00b0/ LDG.E R2, [R2] ;
[----:B------:R-:W-:-:S06] /00b8/ IADD R6.CC, R6, c[0x0][0x150] ;
[----:B------:R-:W-:-:S07] /00c8/ IADD.X R7, R0, c[0x0][0x154] ;
[----:B-----5:R-:W-:-:S06] /00d0/ FADD R0, R2, R4 ;
[----:B------:R-:W-:-:S02] /00d8/ FADD R0, R0, 1 ;
[----:B------:R-:W-:-:S01] /00e8/ STG.E [R6], R0 ;
[----:B------:R-:W-:Y:S04] /00f0/ NOP ;
[----:B------:R-:W-:-:S15] /00f8/ EXIT ;
I simply modified [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 1 ;
to [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 2 ;
, but the results didn't change. Is there any possible reason?
Here is my script for compile
fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm "--image3=kind=elf,sm=50,file=new_cudatest.sm_50.cubin" "--image3=kind=ptx,sm=50,file=cudatest.ptx" --embedded-fatbin="cudatest.fatbin.c"
gcc -E -x c++ -D__CUDACC__ -D__NVCC__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 -include "cuda_runtime.h" "cudatest.cu" -o "cudatest.cpp4.ii"
cudafe++ --c++14 --gnu_version=70500 --allow_managed --unsigned_chars --m64 --parse_templates --gen_c_file_name "cudatest.cudafe1.cpp" --stub_file_name "cudatest.cudafe1.stub.c" --module_id_file_name "cudatest.module_id" "cudatest.cpp4.ii"
gcc -D__CUDA_ARCH__=500 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" "cudatest.cudafe1.cpp" -o "cudatest.o"
nvlink --arch=sm_50 --register-link-binaries="cudatest_dlink.reg.c" -m64 "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -cpu-arch=AARCH64 "cudatest.o" -lcudadevrt -o "cudatest_dlink.sm_50.cubin"
fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm -link "--image3=kind=elf,sm=50,file=cudatest_dlink.sm_50.cubin" --embedded-fatbin="cudatest_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE=""cudatest_dlink.fatbin.c"" -DREGISTERLINKBINARYFILE=""cudatest_dlink.reg.c"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 "/usr/local/cuda-10.2/bin/crt/link.stub" -o "cudatest_dlink.o"
g++ -Wl,--start-group "cudatest_dlink.o" "cudatest.o" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "cudatest"
Hi,
Im researching the paper "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking". I follow the step of the first part of the article which is registers remapping. In the paper, the registers' indices used for holding the tiles of A and B matrix are interleved. As shown in following picture.
The regs mapping is generated by NVCC 9.0 as described by author.
As I use cuAssembler with NVCC 10.2, the generated regs mapping is as follow.
You can see all the regs used for tile A and B are with even indices, which means all the regs used for tile C storage should be with odd indices (for avoiding reg bank conflict as FFMA op). But, as detailed in the second pic, the regs for C storage are mixed with odd and even indices, indicating the odd regs are not enough to prevent bank conflict. Thus, I try two potential methods to solve such issue. I firstly try changing some odd regs for tile A and B storage, insted of tile C, however, this results in cuda instruction error (CUDA error at matrixMul.cu:144 code=715(cudaErrorIllegalInstruction)) as I run the app. Then, I try to modify some even reg indices to to odd, for instance, R42, R40, R38 to R95, R97, R99. Unfortunately, this modification give me same error msg.
So I have three questions:
Many thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.