🐛 Describe the bug I made a customized Triton operator with Trito

`torch.compile` fails with customized Triton Operator on Triton 2.2 (pytorch/pytorch)

Comments (8)

Luke20000429 commented on July 22, 2024

I am trying to create a minimal example; there is some code I cannot share right now.

Luke20000429 commented on July 22, 2024

FYI, the code also works with CUDAGraph, so I suspect a Triton version conflict.

oulgen commented on July 22, 2024

@Luke20000429 your torch version is 2.2.2. Support for user-defined Triton kernels was officially released in 2.3. Could you upgrade your PyTorch version and try again?
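A quick way to check this before compiling is to compare the installed version against 2.3. The following is only a minimal sketch: the helper names `parse_release` and `supports_user_triton_kernels` are hypothetical, and the 2.3 threshold is taken from the comment above rather than from any PyTorch API.

```python
import re


def parse_release(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.2.2+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_user_triton_kernels(torch_version: str) -> bool:
    # Assumption: 2.3 is the first release with official support,
    # as stated in this thread.
    return parse_release(torch_version) >= (2, 3)


print(supports_user_triton_kernels("2.2.2"))        # the version reported here
print(supports_user_triton_kernels("2.3.0+cu121"))  # an upgraded install
```

In a real script the argument would be `torch.__version__`.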

Luke20000429 commented on July 22, 2024

I've upgraded torch, flash-attn, and triton. Without the customized Triton operator, everything works fine.
Now torch.compile runs on the model with my custom Triton operator, but the results are incorrect. My code looks like this:

model = AutoModelForCausalLM.from_pretrained("luodian/llama-7b-hf", torch_dtype=torch.float16)
model = model.to(device).eval()
# replace mlp by my triton operator
# compiled model returns correct output without this 
# ['<s> My favourite condiment is ketchup. I love it on everything. I love it on my eggs, on my burg']
for layer in model.model.layers:
    layer.mlp = Triton_myMLP(layer.mlp)

tokenizer = AutoTokenizer.from_pretrained("luodian/llama-7b-hf")
prompt = "My favourite condiment is"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.max_new_tokens = 20
out = model.generate(input_ids, cache_implementation="static") # warmup ?
print(tokenizer.batch_decode(out.long()))
out = model.generate(input_ids, cache_implementation="static")
print(tokenizer.batch_decode(out.long())) # the result is like ['<s> My favourite condiment is plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus']

It looks as if the static cache did not advance.

I suspect that compile does not do a warmup and that the Triton JIT may run multiple times for autotuning. So I warmed up the operator manually with

for i in range(2):
    text = generate(prompt, model, tokenizer, max_length=20)

print(text) # this gives the correct results

and then compiled. This gives me another error:

/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py:124: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Traceback (most recent call last):
  File "/home/user/workarea/simple_static.py", line 100, in <module>
    out = model.generate(input_ids, cache_implementation="static") # warmup ?
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/transformers/generation/utils.py", line 1736, in generate
    result = self._sample(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2375, in _sample
    outputs = self(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1001, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1178, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1251, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1232, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/__init__.py", line 1731, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1102, in compile_fx
    return compile_fx(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
    return aot_autograd(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 119, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
    return inner_compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
    return self.compile_to_module().call
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1254, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2160, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/home/xueshen/tmp/torchinductor_xueshen/l2/cl2u6wpoycbpdp4aztrroprkgqtcvo5jswnhcemznnqhhe3td5yq.py", line 39, in <module>
    triton_red_fused__to_copy_add_embedding_mean_mul_pow_rsqrt_0 = async_compile.triton('triton_', '''
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2658, in triton
    future = self.process_pool().submit(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/concurrent/futures/process.py", line 707, in submit
    raise BrokenProcessPool(self._broken)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Finally, I tried giving my Triton operator only one configuration, warming it up manually, and then compiling and running the compiled model. It still gives the wrong output:

['<s> My favourite condiment is plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus plus']

oulgen commented on July 22, 2024

I will take a look at it tomorrow. Please share the exact code you're running.

Luke20000429 commented on July 22, 2024

Here is a simple triton operator

import triton 
from triton import language as tl
import torch

# `triton.jit`'ed functions can be auto-tuned by using the `triton.autotune` decorator, which consumes:
#   - A list of `triton.Config` objects that define different configurations of
#       meta-parameters (e.g., `BLOCK_SIZE_M`) and compilation options (e.g., `num_warps`) to try
#   - An auto-tuning *key* whose change in values will trigger evaluation of all the
#       provided configs
@triton.autotune(
    # configs=get_cuda_autotune_config(),
    configs= [
        triton.Config({'BLOCK_SIZE_M': 16, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 1}, num_stages=3,
                      num_warps=16),
        triton.Config({'BLOCK_SIZE_M': 16, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 1}, num_stages=3,
                      num_warps=16),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
        # Pointers to matrices
        a_ptr, b_ptr, c_ptr,
        # Matrix dimensions
        M, N, K,
        # The stride variables represent how much to increase the ptr by when moving by 1
        # element in a particular dimension. E.g. `stride_am` is how much to increase `a_ptr`
        # by to get the element one row down (A has M rows).
        stride_am, stride_ak,  #
        stride_bk, stride_bn,  #
        stride_cm, stride_cn,
        # Meta-parameters
        BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,  #
        GROUP_SIZE_M: tl.constexpr,  #
):
    """Kernel for computing the matmul C = A x B.
    A has shape (M, K), B has shape (K, N) and C has shape (M, N)
    """
    # -----------------------------------------------------------
    # Map program ids `pid` to the block of C it should compute.
    # This is done in a grouped ordering to promote L2 data reuse.
    # See above `L2 Cache Optimizations` section for details.
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # ----------------------------------------------------------
    # Create pointers for the first blocks of A and B.
    # We will advance this pointer as we move in the K direction
    # and accumulate
    # `a_ptrs` is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers
    # `b_ptrs` is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers
    # See above `Pointer Arithmetic` section for details
    offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)

    # -----------------------------------------------------------
    # Iterate to compute a block of the C matrix.
    # We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block
    # of fp32 values for higher accuracy.
    # `accumulator` will be converted back to fp16 after the loop.
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        # Load the next block of A and B, generate a mask by checking the K dimension.
        # If it is out of bounds, set it to 0.
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        # We accumulate along the K dimension.
        accumulator = tl.dot(a, b, accumulator)
        # Advance the ptrs to the next K block.
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk
    # You can fuse arbitrary activation functions here
    # while the accumulator is still in FP32!
    c = accumulator.to(tl.float16)

    # -----------------------------------------------------------
    # Write back the block of the output matrix C with masks.
    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, c, mask=c_mask)

def matmul(a, b, activation=""):
    # Check constraints.
    assert a.shape[1] == b.shape[0], "Incompatible dimensions"
    assert a.is_contiguous(), "Matrix A must be contiguous"
    M, K = a.shape
    K, N = b.shape
    # Allocates output.
    c = torch.zeros((M, N), device=a.device, dtype=a.dtype)
    # 1D launch kernel where each block gets its own program.
    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']), )
    matmul_kernel[grid](
        a, b, c,  #
        M, N, K,  #
        a.stride(0), a.stride(1),  #
        b.stride(0), b.stride(1),  #
        c.stride(0), c.stride(1),  #
    )
    return c
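As a sanity check that does not need a GPU, the grouped program-id mapping at the top of `matmul_kernel` can be replayed in plain Python. This sketch copies the index arithmetic from the kernel; the helper names `tile_for_pid` and `covers_every_tile_once` are hypothetical, and it only verifies that every output tile is assigned to exactly one program. It says nothing about the `torch.compile` behaviour discussed in this thread.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, mirroring tl.cdiv / triton.cdiv."""
    return (a + b - 1) // b


def tile_for_pid(pid, M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M):
    """Same arithmetic as the kernel: map a 1D program id to (pid_m, pid_n)."""
    num_pid_m = cdiv(M, BLOCK_SIZE_M)
    num_pid_n = cdiv(N, BLOCK_SIZE_N)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


def covers_every_tile_once(M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M):
    """Check that the launch grid maps one-to-one onto the tiles of C."""
    num_pid_m = cdiv(M, BLOCK_SIZE_M)
    num_pid_n = cdiv(N, BLOCK_SIZE_N)
    tiles = [
        tile_for_pid(pid, M, N, BLOCK_SIZE_M, BLOCK_SIZE_N, GROUP_SIZE_M)
        for pid in range(num_pid_m * num_pid_n)
    ]
    expected = {(m, n) for m in range(num_pid_m) for n in range(num_pid_n)}
    return len(tiles) == len(set(tiles)) and set(tiles) == expected


# Both autotune configs above, for an output of shape (7, 11008).
print(covers_every_tile_once(7, 11008, 16, 64, 1))
print(covers_every_tile_once(7, 11008, 16, 128, 1))
```

With `GROUP_SIZE_M=1`, as in both configs, the mapping reduces to a plain row-major walk over the tiles.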

And here is the inference code

from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional
from kernels.basic_gemm import matmul # change to your path
from torch import nn
device = "cuda"

class Triton_myMLP(nn.Module):
    def __init__(self, llama_mlp_layer):
        super().__init__()

        W1 = llama_mlp_layer.gate_proj.weight.cuda()
        self.W1T = torch.empty_like(W1.t()).copy_(W1.t()).contiguous().cuda()
        W2 = llama_mlp_layer.up_proj.weight.cuda()
        self.W2T = torch.empty_like(W2.t()).copy_(W2.t()).contiguous().cuda()
        W3 = llama_mlp_layer.down_proj.weight.cuda()
        self.W3T = torch.empty_like(W3.t()).copy_(W3.t()).contiguous().cuda()

    def forward(self, x):
        
        x = x.view(-1, 4096)

        gate = torch.nn.functional.silu(x @ self.W1T)
        up = matmul(x, self.W2T)
        c = gate*up
        output = c @ self.W3T

        return output[None, :, :]
            
model = AutoModelForCausalLM.from_pretrained("luodian/llama-7b-hf", torch_dtype=torch.float16)
model = model.to(device).eval()
# replace mlp
for layer in model.model.layers:
    layer.mlp = Triton_myMLP(layer.mlp)


tokenizer = AutoTokenizer.from_pretrained("luodian/llama-7b-hf")
prompt = "My favourite condiment is"


input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

model.generation_config.max_new_tokens = 20
out = model.generate(input_ids) # warmup ?
print(tokenizer.batch_decode(out.long())) # output is correct
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

out = model.generate(input_ids, cache_implementation="static")
print(tokenizer.batch_decode(out.long()))

I tried to reproduce my error with a simplified version, but this simple Triton operator gives me some new errors.
The same error is also mentioned in #126864:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: cannot extract sympy expressions from {'c_ptr': FakeTensor(..., device='cuda:0', size=(7, 11008), dtype=torch.float16)} <class 'torch.fx.immutable_collections.immutable_dict'>

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Luke20000429 commented on July 22, 2024

Sometimes it also returns this error:

/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py:124: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
LLVM ERROR: pthread_join failed: Invalid argument
Traceback (most recent call last):
  File "/home/user/workarea/simple_static.py", line 70, in <module>
    out = model.generate(input_ids, cache_implementation="static")
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/transformers/generation/utils.py", line 1736, in generate
    result = self._sample(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2375, in _sample
    outputs = self(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1001, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1178, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1251, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1232, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/__init__.py", line 1731, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1102, in compile_fx
    return compile_fx(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
    return aot_autograd(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 119, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
    return inner_compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
    return self.compile_to_module().call
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1254, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2160, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/home/xueshen/tmp/torchinductor_xueshen/ey/ceyqnfqif4gc2yy5lfv3qtmpyxru5dgpfigrveo5bqepci2b473c.py", line 1219, in <module>
    async_compile.wait(globals())
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2715, in wait
    scope[key] = result.result()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2522, in result
    self.future.result()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/user/anaconda3/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

oulgen commented on July 22, 2024

Can you run with TORCHINDUCTOR_COMPILE_THREADS=1? It will give you better error messages instead of process pool nonsense.
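For reference, the variable Inductor reads is spelled `TORCHINDUCTOR_COMPILE_THREADS` (plural). One way to apply it, together with the two logging variables that the error messages above recommend, is to set them at the top of the script. This is a minimal sketch: the helper name `enable_inductor_debug_env` is hypothetical, and setting the variables before importing torch is a cautious assumption about when the configuration is read.

```python
import os

# Variables named in this thread: single-threaded Inductor compilation
# plus the verbose Dynamo logging suggested by the error message.
DEBUG_ENV = {
    "TORCHINDUCTOR_COMPILE_THREADS": "1",
    "TORCH_LOGS": "+dynamo",
    "TORCHDYNAMO_VERBOSE": "1",
}


def enable_inductor_debug_env(env=os.environ):
    """Write the debug variables into `env` (defaults to this process)."""
    for key, value in DEBUG_ENV.items():
        env[key] = value
    return env


enable_inductor_debug_env()
# Assumption: import torch only after this point so the settings are seen.
# import torch
```

The same effect can be had from the shell by prefixing the command, for example `TORCHINDUCTOR_COMPILE_THREADS=1 python simple_static.py`.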
