Comments (9)
Adding oncall: pt2; perhaps we can chat a bit more about Volta coverage for torch.compile...
@RobertCsordas have you filed an issue against Triton? (As it does not have much to do with PyTorch.)
Triage review to discuss Volta coverage for torch.compile; this issue implies that it might not work.
Hi! @malfet: no, I haven't. It is unclear to me who is maintaining the PyTorch Triton version, as it seems to be different from the current official upstream. But I will open an issue there as well.
This example worked perfectly with 2.1. There was even a 2.2 dev version where it worked with my custom Triton kernel together with torch.compile().
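For context, the setup is roughly this shape: a custom @triton.jit kernel called from a Python wrapper, with the wrapper used inside a torch.compile'd function. A minimal self-contained sketch (with a trivial elementwise kernel standing in for the real one, so it does not itself exercise the broken tt.dot path):

import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n, scale, BLOCK: tl.constexpr):
    # Trivial stand-in kernel: out = x * scale.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * scale, mask=mask)

def scale(x, s):
    out = torch.empty_like(x)
    n = x.numel()
    scale_kernel[(triton.cdiv(n, 1024),)](x, out, n, s, BLOCK=1024)
    return out

@torch.compile
def f(x):
    return scale(x, 2.0) + 1.0

x = torch.randn(4096, device="cuda")
print(torch.allclose(f(x), x * 2.0 + 1.0))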
cc @bertmaher
OK, I've tried it on Colab and cannot reproduce the failure using 2.3 and the verbatim example from https://github.com/triton-lang/triton/blob/release/2.3.x/python/tutorials/03-matrix-multiplication.py
On T4: https://colab.research.google.com/drive/1GnTzDW7aKWofIYjKE0X5XgkzBETO3KhT?usp=sharing
Interesting. I'm able to reproduce this on 4 different machines (every single machine I tested on). I'm running drivers 530.41.03 with CUDA 12.1.1, 535.104.12 with CUDA 12.2.140, 535.113.01 with CUDA 11.5, and 535.129.03 with 11.5.117. One of the machines has a Titan V; the rest are V100s, both the 32 GB and 16 GB versions. I started from a clean Python install with just Triton installed, and the situation is identical. The Python versions are 3.10.10 and 3.10.12.
Do you have any suggestions on what I should try or how we can continue debugging?
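For reference, a quick way to collect the relevant toolchain and GPU details on each machine (a minimal sketch, assuming both torch and triton are installed in the environment):

import torch
import triton

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("triton:", triton.__version__)
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)  # (7, 0) = Volta, (7, 5) = Turing
    print(f"GPU {i}: {name}, compute capability {cap}")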
I simplified the matmul code a bit (removed the leaky ReLU) and left just one config to guarantee equivalence, then dumped the PTX with the working 2.1 and the broken 2.2 Triton. Maybe this can help with debugging. Code: https://pastebin.com/FAL22dH1
Triton 2.1 PTX (working): https://pastebin.com/6E0wiVbb
Triton 2.2 PTX (broken): https://pastebin.com/XMNJgZYB
I don't speak PTX, but to me it looks like the Triton 2.2 PTX is completely missing the code that should use the tensor cores (the 2.1 PTX has a bunch of mma.sync.aligned.m8n8k4 instructions, while the 2.2 one has none). The broken 2.2 code is also significantly shorter.
The Triton 2.3.1 PTX is identical to that of 2.2.
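For what it's worth, the count can be checked mechanically from the dumped PTX; a small sketch, with hypothetical file names for the two dumps:

# Count tensor-core MMA instructions in the dumped PTX files (hypothetical names).
for path in ("matmul_2.1.ptx", "matmul_2.2.ptx"):
    with open(path) as f:
        print(path, f.read().count("mma.sync.aligned"))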
EDIT: updated the code to dump the TTIR and TTGIR as well (a sketch of the dumping approach follows the links below): https://pastebin.com/3TtEEPiG
2.1 TTIR: https://pastebin.com/Lz6r02Ft
2.1 TTGIR: https://pastebin.com/0ukha35H
2.1 LLIR: https://pastebin.com/xzZSDs06
2.2 TTIR: https://pastebin.com/06mP1j1j
2.2 TTGIR: https://pastebin.com/k5FfpT7K
2.2 LLIR: https://pastebin.com/FAL22dH1
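For anyone who wants to reproduce these dumps, a minimal sketch of the approach, assuming the Triton 2.x behavior where launching a @triton.jit kernel returns a compiled-kernel handle whose .asm dict maps stage names to the generated code (the tiny tl.dot kernel below is only a placeholder for the tutorial matmul):

import torch
import triton
import triton.language as tl

@triton.jit
def tiny_dot(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # Minimal BLOCK x BLOCK matmul so the IR contains a tt.dot.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    c = tl.dot(a, b)  # accumulates in fp32
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)

a = torch.randn((16, 16), device="cuda", dtype=torch.float16)
b = torch.randn((16, 16), device="cuda", dtype=torch.float16)
c = torch.empty((16, 16), device="cuda", dtype=torch.float32)

compiled = tiny_dot[(1,)](a, b, c, BLOCK=16)
for stage in ("ttir", "ttgir", "llir", "ptx"):
    with open(f"dump.{stage}", "w") as f:
        f.write(compiled.asm[stage])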
The TTIRs seem identical, except for the register numbers and two instructions:
2.2 has:
%9 = arith.cmpi slt, %8, %c8_i32 : i32
%10 = arith.select %9, %8, %c8_i32 : i32
while 2.1 has:
%9 = arith.minsi %8, %c8_i32 : i32
The ordering and register numbers in the TTGIR are different, but the general gist seems similar (the cmpi/select pair is just an expanded form of minsi, so the two TTIRs compute the same thing). Whatever drops the tt.dot seems to happen after these IRs.
The 2.2 LLIR doesn't have any mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 instructions, while the 2.1 LLIR has them.
EDIT 2:
I ran the unit tests (branch release/2.2.x from GitHub); they fail after some time with a segmentation fault, but the results collected up to that point are here: https://pastebin.com/PSHDbUs0
The GEMM tests tend to fail.
The results of the lit tests: https://pastebin.com/MrNRwyqT
I can't run the Ninja tests because they look for CMake in /tmp and can't find it, and I haven't figured out how to fix that yet.
EDIT 3: added missing LLIRs
By the way, the success on the T4 is not a good test, because even though its compute capability is 7.5, it's actually Turing. Triton uses a different dot implementation for Turing and Volta:
It's even a different file: https://github.com/triton-lang/triton/blob/release/2.2.x/lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/MMAv1.cpp is used only for Volta, while everything else (including Turing) uses https://github.com/triton-lang/triton/blob/release/2.2.x/lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/MMAv2.cpp
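A quick way to check which path a given GPU would hit, based on its compute capability (Volta is sm_70, Turing is sm_75; the mapping to the two files follows the description above):

import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) == (7, 0):
    print("Volta (sm_70): MMAv1.cpp dot lowering (V100, Titan V)")
elif (major, minor) >= (7, 5):
    print("Turing or newer (sm_75+): MMAv2.cpp dot lowering (T4, ...)")
else:
    print("pre-Volta: no tensor-core MMA path")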
Actually, the bug is in this line: https://github.com/triton-lang/triton/blob/0e7b97bd47fc4beb21ae960a516cd9a7ae9bc060/lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/MMAv1.cpp#L74
ALayout should be BLayout. I then found that this is already fixed on main (triton-lang/triton@cc3aed4).
I can't build main because of the following error:
downloading and extracting https://tritonlang.blob.core.windows.net/llvm-builds/llvm-657ec732-ubuntu-x64.tar.gz ...
error: HTTP Error 403: This request is not authorized to perform this operation.
But if I change that character in 2.2, it works just fine.