triton-lang / triton Goto Github PK

View Code? Open in Web Editor NEW

12.3K 184.0 1.5K 333.87 MB

Development repository for the Triton language and compiler

Home Page: https://triton-lang.org/

License: MIT License

CMake 0.89% C++ 40.37% Python 30.28% C 0.28% MLIR 28.14% LLVM 0.03%

triton's Introduction

The Triton Conference is happening again on September 17th, 2024 in Fremont (CA)!

If you are interested in attending, please fill up this form.

`Documentation`	`Nightly Wheels`

Triton

This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.

The foundations of this project are described in the following MAPL2019 publication: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Please consider citing this work if you use Triton!

The official documentation contains installation instructions and tutorials. See also these third-party Triton puzzles, which can all be run using the Triton interpreter -- no GPU required.

Quick Installation

You can install the latest stable release of Triton from pip:

pip install triton

Binary wheels are available for CPython 3.8-3.12 and PyPy 3.8-3.9.

And the latest nightly release:

pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

Install from source

git clone https://github.com/triton-lang/triton.git;
cd triton;

pip install ninja cmake wheel; # build-time dependencies
pip install -e python

Or with a virtualenv:

git clone https://github.com/triton-lang/triton.git;
cd triton;

python -m venv .venv --prompt triton;
source .venv/bin/activate;

pip install ninja cmake wheel; # build-time dependencies
pip install -e python

Building with a custom LLVM

Triton uses LLVM to generate code for GPUs and CPUs. Normally, the Triton build downloads a prebuilt LLVM, but you can also build LLVM from source and use that.

LLVM does not have a stable API, so the Triton build will not work at an arbitrary LLVM version.

Find the version of LLVM that Triton builds against. Check cmake/llvm-hash.txt to see the current version. For example, if it says: 49af6502c6dcb4a7f7520178bd14df396f78240c

This means that the version of Triton you have builds against LLVM 49af6502.
git checkout LLVM at this revision. Optionally, make additional modifications to LLVM.

Build LLVM. For example, you might run

$ cd $HOME/llvm-project  # your clone of LLVM.
$ mkdir build
$ cd build
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ../llvm -DLLVM_ENABLE_PROJECTS="mlir;llvm" -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
$ ninja

Grab a snack, this will take a while.

Build Triton as above, but set the following environment variables.

# Modify as appropriate to point to your LLVM build.
$ export LLVM_BUILD_DIR=$HOME/llvm-project/build

$ cd <triton install>
$ LLVM_INCLUDE_DIRS=$LLVM_BUILD_DIR/include \
  LLVM_LIBRARY_DIR=$LLVM_BUILD_DIR/lib \
  LLVM_SYSPATH=$LLVM_BUILD_DIR \
  pip install -e python

Tips for building

Set TRITON_BUILD_WITH_CLANG_LLD=true as an environment variable to use clang and lld. lld in particular results in faster builds.
Set TRITON_BUILD_WITH_CCACHE=true to build with ccache.
Set TRITON_HOME=/some/path to change the location of the .triton directory where Triton's cache is located and downloads are stored during the build. By default, this is the user's home directory. It can be changed anytime.
Pass --no-build-isolation to pip install to make nop builds faster. Without this, every invocation of pip install uses a different symlink to cmake, and this forces ninja to rebuild most of the .a files.
vscode intellisense has some difficulty figuring out how to build Triton's C++ (probably because, in our build, users don't invoke cmake directly, but instead use setup.py). Teach vscode how to compile Triton as follows.
- Do a local build.
- Get the full path to the compile_commands.json file produced by the build: find python/build -name 'compile_commands.json | xargs readlink -f'
- In vscode, install the C/C++ extension, then open the command palette (Shift + Command + P on Mac, or Shift + Ctrl + P on Windows/Linux) and open C/C++: Edit Configurations (UI).
- Open "Advanced Settings" and paste the full path to compile_commands.json into the "Compile Commands" textbox.

Running tests

There currently isn't a turnkey way to run all the Triton tests, but you can follow the following recipe.

# One-time setup.  Note we have to reinstall local Triton because torch
# overwrites it with the public version.
$ pip install scipy numpy torch pytest lit pandas matplotlib && pip install -e python

# Run Python tests using your local GPU.
$ python3 -m pytest python/test/unit

# Move to builddir.  Fill in <...> with the full path, e.g.
# `cmake.linux-x86_64-cpython-3.11`.
$ cd python/build/cmake<...>

# Run C++ unit tests.
$ ctest -j32

# Run lit tests.
$ lit test

You may find it helpful to make a symlink to the builddir and tell your local git to ignore it.

$ ln -s python/build/cmake<...> build
$ echo build >> .git/info/exclude

Then you can e.g. rebuild and run lit with the following command.

$ ninja -C build && ( cd build ; lit test )

Tips for hacking

For detailed instructions on how to debug Triton's frontend, please refer to this tutorial. The following includes additional tips for hacking on Triton's backend.

Helpful environment variables

MLIR_ENABLE_DUMP=1 dumps the IR before every MLIR pass Triton runs, for all kernels. Use MLIR_ENABLE_DUMP=kernelName to dump for a specific kernel only.
LLVM_IR_ENABLE_DUMP=1 dumps the IR before every pass run over the LLVM IR.
TRITON_INTERPRET=1 uses the Triton interpreter instead of running on the GPU. You can insert Python breakpoints in your kernel code!
TRITON_ENABLE_LLVM_DEBUG=1 passes -debug to LLVM, printing a lot of debugging information to stdout. If this is too noisy, run with just TRITON_LLVM_DEBUG_ONLY instead to limit the output.

An alternative way to reduce output noisiness is running with LLVM_IR_ENABLE_DUMP=1, extract the IR before the LLVM pass of interest, and then run LLVM's opt standalone, perhaps passing -debug-only=foo on the command line.
TRITON_LLVM_DEBUG_ONLY=<comma-separated> is the equivalent of LLVM's -debug-only command-line option. This limits the LLVM debug output to specific pass or component names (which are specified using #define DEBUG_TYPE throughout LLVM and Triton) in order to allow the debug output to be less noisy. TRITON_LLVM_DEBUG_ONLY allows for one or more comma separated values to be specified (eg TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions or TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc").
USE_IR_LOC={ttir,ttgir} reparses the IR such that the location information will be the line number of the IR file with that particular extension, instead of line number of the python file. This can provide a direct mapping from the IR to llir/ptx. When used with performance tools, it can provide a breakdown on IR instructions.
TRITON_PRINT_AUTOTUNING=1 prints out the best autotuning config and total time spent for each kernel after autotuning is complete.
DISABLE_LLVM_OPT will disable llvm optimizations for make_llir and make_ptx if its value is true when parsing as Bool. Otherwise, it will be parsed as a list of flags to disable llvm optimizations. One usage case is DISABLE_LLVM_OPT="disable-lsr" Loop strength reduction is known to cause up to 10% performance changes for certain kernels with register pressure.
TRITON_ALWAYS_COMPILE=1 forces to compile kernels regardless of cache hit.
MLIR_ENABLE_TIMING dumps the timing information for each MLIR pass.
LLVM_ENABLE_TIMING dumps the timing information for each LLVM pass.
TRITON_DEFAULT_FP_FUSION overrides the default behavior of allowing fp fusion (mul+add->fma).
MLIR_ENABLE_REMARK enables the performance warnings that are emitted as remarks.

Changelog

Version 2.0 is out! New features include:

Many, many bug fixes
Performance improvements
Backend rewritten to use MLIR
Support for kernels that contain back-to-back matmuls (e.g., flash attention)

Contributing

Community contributions are more than welcome, whether it be to fix bugs or to add new features at github. For more detailed instructions, please visit our contributor's guide.

Compatibility

Supported Platforms:

Linux

Supported Hardware:

NVIDIA GPUs (Compute Capability 7.0+)
AMD GPUs (ROCm 5.2+)
Under development: CPUs

triton's People

Contributors

Stargazers

Watchers

Forkers

xianyi liyancas ngaloppo benjamesbabala laureatian intel leliaonvidia naibaf7 lukezhezhu hbcbh1999 coxlab davidcox yxftju mkolod afcarl strategist922 glee1228 binbinmeng huzelin hardware-alchemy isaactheboi bangtianliu benjaminxiang sakuraxiamf lovejoan99 soumith virajdeshwal santhosh-ks peiyanflying wade1990 timemao dhaval18k litianjian hinature abhinishanand ruslanmlnkv charlie-x dlutwh ghosthamlet arashashari zhangboyue jeffra zeta1999 daadaada nottombrown jareddk jansel mrapplexz aray-openai dvntaudio zhaojp-frank beyondliyang norabelrose young768 tigerff justinjaywang srimouli04 shafiahmed sinsixx mrcodechef ai-hub-deep-learning-fundamental ebelliveau stephennfernandes lagait ameerhamza111 shauheen jingdujingdu nouretienne shyamalschandra reiddraper deepandy dim25 kalvin807 pkurainbow thisguy726 fireeyesgit yqnt418 devbox10 mohan-zhang-u jason-leeo runningphoton knowledgehacker neveroldmilk mucaoshen shanggdlk dumpmemory sycomix surajitdb trantorrepository randomprassanna ghzytp duzhanyuan gelargew i-spark astrotuna201 huangwgang danielschulz afilimonov ulandz andrei-pokrovsky

triton's Issues

Support for HIP backend / AMD GPUs

Do you plan to add support for a HIP backend in addition to the CUDA backend? HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code.

Adding support for the HIP backend would enable the triton library to also support AMD GPUs. The HIP API very closely resembles the CUDA API, and we have tools such as hipify to allow us to easily "translate" most of the CUDA sources into HIP sources.

Add permute 2/3/4D permute to triton.ops

The repertoire of available Triton ops has grown, but there are still important ones missing. In particular, a generic permute for 2/3/4D tensors would be nice, not only for practitioners (e.g., layout conversion in CNNs) but also as a benchmark/test for the compiler.

Overhaul symbolic engine

The symbolic engine is outdated. Parts are still written in C++03, and the dirty code structure makes it hard to add new kernel templates.
In particular, the new symbolic engine should allow for the transparent handling of index modifiers (row, col, trans, reshape, diag, etc.), the support of which is too sloppy for now -- thereby leading to many bugs in the C++ API.

Setup fails when llvm-config doesn't have version number

In setup.py the llvm version is checked here.

But my llvm-config is just called: llvm-config (no version number).

Plain llvm-config is listed as an option here.

Changing the lines in setup.py to:

versions = ['','-9.0', '-9', '-90', '-8.0', '-8', '-80']
supported = ['llvm-config{v}'.format(v=v) for v in versions]

worked for me. This maybe defeats the point of having the check, but you could also throw a warning asking the user to check that their llvm version is >=8?

If that fix is ok, I'm happy to open a PR ?

Why python operator defines does not use these in test/common/src?

https://github.com/ptillet/triton/blob/triton-ops/python/triton/ops/matmul.py
https://github.com/ptillet/triton/blob/triton-ops/python/examples/tutorials/matmul.py
https://github.com/ptillet/triton/blob/triton-ops/tests/common/src/dot.h
I think these are same operator defines, right? If they are, Why did use same operator defines?

compatibility with beignet

Hi,

I have installed ISSAC and clcaffe (https://github.com/01org/caffe/wiki/clCaffe) with beignet 1.1.1.2

bit throws errors with ISSAC when forwarding alexnet.

./tools/caffe time -model ../models/bvlc_alexnet/deploy.prototxt -gpu 0
ISAAC: unknow Intel CPU ID:      Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
ISAAC: use SKYLAKE by default.
Device : Intel(R) HD Graphics Bay Trail-TBuild Status = -2
Build Log = stringInput.cl:5827:1: warning: implicit declaration of function 'intel_sub_group_block_read8' is invalid in C99
stringInput.cl:5812:39: note: expanded from macro 'GEMM_NN'
stringInput.cl:5674:50: note: expanded from macro 'SUBGROUP_BLOCK_READ8'
stringInput.cl:5827:1: error: call to 'as_float8' is ambiguous
stringInput.cl:5812:28: note: expanded from macro 'GEMM_NN'
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:492:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:502:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:512:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:522:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:533:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:544:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:555:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//
terminate called after throwing an instance of 'isaac::exception::ocl::invalid_program_executable'
  what():  OpenCL: Error- invalid program executable
*** Aborted at 1516592801 (unix time) try "date -d @1516592801" if you are using GNU date ***
PC: @     0x7f5f7765a428 gsignal
*** SIGABRT (@0x3e800004d42) received by PID 19778 (TID 0x7f5f79919ac0) from PID 19778; stack trace: ***
    @     0x7f5f7765a4b0 (unknown)
    @     0x7f5f7765a428 gsignal
    @     0x7f5f7765c02a abort
    @     0x7f5f77c9484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7f5f77c926b6 (unknown)
    @     0x7f5f77c92701 std::terminate()
    @     0x7f5f77c92919 __cxa_throw
    @     0x7f5f755f05de isaac::driver::check()
    @     0x7f5f755f317d isaac::driver::Kernel::Kernel()
    @     0x7f5f756285cb isaac::templates::intelblas_gemm_image::enqueue()
    @     0x7f5f756b3098 isaac::runtime::profiles::value_type::execute()
    @     0x7f5f756be949 isaac::runtime::execute()
    @     0x7f5f75699e6e execute
    @     0x7f5f756aaa86 clblasSgemm
    @     0x7f5f7910f2e5 caffe::greentea_gpu_gemm<>()
    @     0x7f5f78ee4efa caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
    @     0x7f5f78feacaa caffe::ConvolutionLayerSpatial<>::Forward_gpu()
    @     0x7f5f79195cd1 caffe::Net<>::ForwardFromTo()
    @     0x7f5f79195dc7 caffe::Net<>::Forward()
    @     0x5596234ec338 time()
    @     0x5596234e6652 main
    @     0x7f5f77645830 __libc_start_main
    @     0x5596234e70f9 _start
    @                0x0 (unknown)
Aborted (core dumped)

Here is the steps to install beignet

sudo apt-get install beignet

The attachment is my clinfo.
beignet_clinfo.txt

Any suggestion?

thank you

all tests got segmentation fault after launch

Building process looks correct except some warnings. But after that all tests fail.

$ ./bench/bench-blas
Devices available:
------------------
[x] - gfx803 on AMD Accelerated Parallel Processing
------------------
BENCH   M       N       K       AT      BT      ISAAC
Segmentation fault (core dumped)

$ gdb $_
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
......
Reading symbols from ./bench/bench-blas...(no debugging symbols found)...done.
(gdb) run
Starting program: /tmp/isaac/build/bench/bench-blas
Devices available:
------------------
Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error
warning: File "/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /lib/x86_64-linux-gnu/libthread_db-1.0.so
line to your configuration file "/var/data0/sandbox/user1/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/var/data0/sandbox/user1/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
[x] - gfx803 on AMD Accelerated Parallel Processing
------------------
BENCH   M       N       K       AT      BT      ISAAC

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7750a1d in isaac::runtime::profiles::value_type::init(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
(gdb) where
#0  0x00007ffff7750a1d in isaac::runtime::profiles::value_type::init(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#1  0x00007ffff7751440 in isaac::runtime::profiles::value_type::execute(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#2  0x00007ffff775b71e in isaac::runtime::execute(isaac::runtime::execution_handler const&, std::map<std::pair<isaac::expression_type, isaac::numeric_type>, std::shared_ptr<isaac::runtime::profiles::value_type>, std::less<std::pair<isaac::expression_type, isaac::numeric_type> >, std::allocator<std::pair<std::pair<isaac::expression_type, isaac::numeric_type> const, std::shared_ptr<isaac::runtime::profiles::value_type> > > >&) () from /tmp/isaac/build/lib/libisaac.so
#3  0x00007ffff7741200 in isaac::array_base::operator=(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#4  0x00007ffff77419a7 in isaac::array_base::operator=(isaac::expression_tree const&) () from /tmp/isaac/build/lib/libisaac.so
#5  0x0000000000409c7e in double bench<void bench<float>(isaac::numeric_type, std::string)::{lambda()#8}, void bench<float>(isaac::numeric_type, std::string)::{lambda()#1}>(void bench<float>(isaac::numeric_type, std::string)::{lambda()#8} const&, void bench<float>(isaac::numeric_type, std::string)::{lambda()#1} const&) ()
#6  0x000000000040d4f8 in void bench<float>(isaac::numeric_type, std::string) ()
#7  0x0000000000405caf in main ()
(gdb)

Build Error: "undefined reference to `pthread_rwlock_{destroy,rdlock,init,unlock,wrlock}'

I follow the build instructions on the Readme:

git clone https://github.com/ptillet/isaac.git
mkdir -p isaac/build && cd isaac/build
cmake ../ && make -j4

..and I get this build error:

cathal@thinkum:~/Projects/isaac/build$ make -j4
[  2%] Built target bin2cpp
[  5%] Built target headers
[  5%] Built target database_amd
[  8%] Built target database_unknown
[  8%] Built target database_intel
[ 14%] Built target database_nvidia
[ 71%] Built target isaac
[ 77%] Built target test-element-1d
[ 77%] Built target test-blas-3
[ 83%] Built target test-blas-2
[ 83%] Built target test-blas-1
[ 85%] Built target test-element-2d
[ 88%] Built target test-reduce-2d
[ 91%] Built target test-reduce-1d
[ 94%] Built target test-fusion
[ 98%] Built target example-indexing
[ 98%] Linking CXX executable bench-blas
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_destroy'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_rdlock'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_init'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_unlock'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_wrlock'
collect2: error: ld returned 1 exit status
bench/CMakeFiles/bench-blas.dir/build.make:96: recipe for target 'bench/bench-blas' failed
make[2]: *** [bench/bench-blas] Error 1
CMakeFiles/Makefile2:1663: recipe for target 'bench/CMakeFiles/bench-blas.dir/all' failed
make[1]: *** [bench/CMakeFiles/bench-blas.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2

I don't see a reference in a non-binary file to 'pthread' when I sift for it, but it's matched in lots of object files. So, perhaps it's just missing somewhere from the Makefiles...but I don't grok the C build system. :/

problems using triton kernel in TensorFlow custom op (illegal memory access)

hi! :) Thank you for such an amazing project!
I have been trying to use your sparse matmul kernels in tensorflow by making a custom op and calling the function instance in the overrided Compute function.
However, after successfully launching the kernel, I encounter illegal memory access errors when I try to print the result (matrix c) in python level.
Considering this may be due to conflicts with tensorflow itself, I referenced the tutorials/01-matmul.cc file and tried using the aforementioned kernel only by using c++ and triton api's, but I still have the same 'illegal memory access' error when trying to read the result into a cpu vector by using stream->read.
Is there some reason why I may be encountering such an error only with the aforementioned kernel and not the one in 01-matmul.cc (src::dot) ?
It would be great if you could give some advice about what may have gone wrong, or maybe a potential reason why using a tf custom op is inappropriate.
Thank you :)

//--------------------- c++ only file -------------------------
// tried to read from stream at the end, but gets illegal memory access error

void compute(drv::context* context,
             drv::stream* stream,
             bool trans_a,  // false
             bool trans_b,  // true
             bool trans_c,  // false
             int block,
             int num_blocks)
{
  size_t dt_nbytes = sizeof(float);
  drv::device* device = context->device();

  std::vector<half_float::half> ha(1 * 4 * 128 * 64);
  std::vector<half_float::half> hb(1 * 4 * 128 * 64);
  for(size_t i = 0; i < ha.size(); i++) {
    ha[i] = (half_float::half)rand()/RAND_MAX;
  }
  for(size_t i = 0; i < hb.size(); i++) {
    hb[i] = (half_float::half)rand()/RAND_MAX;
  }

  std::vector<float> widths = {24, 40};
  std::vector<float> spdims = {4, 8, 8};
  std::vector<float> packs = {2,1};

  int AS0 = 1; int AS1 = 4 ; int AS2 = 128 ; int AS3 = 64;
  int BS0 = 1; int BS1 = 4 ; int BS2 = 64 ; int BS3 = 128;

  bool is_16_multiple = (AS3 % 16 == 0);
  bool is_32_multiple = (AS3 % 32 == 0);
  bool is_64_multiple = (AS3 % 64 == 0);

  int total_width = 0;
  int num_elements =  widths.size();

  for(int i = 0 ; i < num_elements ; i++) {
    total_width += widths[i] * packs[i] * packs[i];
  }

  std::vector<half_float::half> hc(AS0 * total_width * block * block);

  auto da = std::shared_ptr<drv::buffer>(drv::buffer::create(context, AS0*AS1*AS2*AS3*dt_nbytes));
  auto db = std::shared_ptr<drv::buffer>(drv::buffer::create(context, BS0*BS1*BS2*BS3*dt_nbytes));
  auto dc = std::shared_ptr<drv::buffer>(drv::buffer::create(context, AS0*total_width*block*block*dt_nbytes));

  stream->write(&*da, true, 0, ha);
  stream->write(&*db, true, 0, hb);
  
  int pack = packs[0];
  
  rt::options_t opt;
  opt.defines["TM"] = "32"; //std::to_string(block * pack);
  opt.defines["TN"] = "32"; //std::to_string(block * pack);
  opt.defines["TMN"] = "1024"; //std::to_string(block * block * pack * pack);
  opt.defines["BLOCK"] = "16"; //std::to_string(block);
  opt.defines["TK"] = "32";
  opt.defines["TYPE"] = "float"; 
  opt.defines["STRIDE_AM"] = "lda"; //trans_a ? "1" : "lda";  //TODO:
  opt.defines["STRIDE_AK"] = "1"; //trans_a ? "lda" : "1";
  opt.defines["STRIDE_BN"] = "ldb"; //trans_b ? "ldb" : "1";
  opt.defines["STRIDE_BK"] = "1"; //trans_b ? "1" : "ldb";
  opt.defines["STRIDE_CM"] = "ldc";
  opt.defines["STRIDE_CN"] = "1";
  opt.defines["SDD"] = "True"; 
  opt.defines["TZ"] = "1";
  opt.defines["NAME"] = "sdd_kernel";
  opt.num_warps = 4;

  int num_lock = 1;
  int width = widths[0];
  std::vector<int> locks(2 * width * AS0 * num_lock, 0);
  auto locks_buf = std::shared_ptr<drv::buffer>(drv::buffer::create(context, 2*width*AS0*num_lock*sizeof(int)));
  stream->write(&*locks_buf, true, 0, locks);

  
  std::vector<int> lut = {
    0,   0,   0,   0,   0,   0,   1,   1,   0,   1,   0,   8,   0,   1,
    1,   9,   0,   0,   2,   2,   0,   0,   3,   3,   0,   2,   2,  13,
    0,   2,   3,  14,   0,   3,   0,  15,   0,   3,   3,  17,   0,   4,
    0,  19,   0,   4,   3,  20,   0,   0,   4,   4,   0,   0,   5,   5,
    0,   4,   4,  21,   0,   4,   5,  22,   0,   5,   0,  23,   0,   5,
    5,  25,   0,   6,   0,  27,   0,   6,   5,  28,   0,   0,   6,   6,
    0,   0,   7,   7,   0,   6,   6,  29,   0,   6,   7,  30,   1,   0,
    0,  34,   1,   0,   1,  35,   1,   1,   0,  42,   1,   1,   1,  43,
    1,   0,   2,  36,   1,   0,   3,  37,   1,   2,   2,  47,   1,   2,
    3,  48,   1,   3,   0,  49,   1,   3,   3,  51,   1,   4,   0,  53,
    1,   4,   3,  54,   1,   0,   4,  38,   1,   0,   5,  39,   1,   4,
    4,  55,   1,   4,   5,  56,   1,   5,   0,  57,   1,   5,   5,  59,
    1,   6,   0,  61,   1,   6,   5,  62,   1,   0,   6,  40,   1,   0,
    7,  41,   1,   6,   6,  63,   1,   6,   7,  64,   2,   0,   0,  68,
    2,   0,   1,  69,   2,   1,   0,  76,   2,   1,   1,  77,   2,   0,
    2,  70,   2,   0,   3,  71,   2,   2,   2,  81,   2,   2,   3,  82,
    2,   3,   0,  83,   2,   3,   3,  85,   2,   4,   0,  87,   2,   4,
    3,  88,   2,   0,   4,  72,   2,   0,   5,  73,   2,   4,   4,  89,
    2,   4,   5,  90,   2,   5,   0,  91,   2,   5,   5,  93,   2,   6,
    0,  95,   2,   6,   5,  96,   2,   0,   6,  74,   2,   0,   7,  75,
    2,   6,   6,  97,   2,   6,   7,  98,   3,   0,   0, 102,   3,   0,
    1, 103,   3,   1,   0, 110,   3,   1,   1, 111,   3,   0,   2, 104,
    3,   0,   3, 105,   3,   2,   2, 115,   3,   2,   3, 116,   3,   3,
    0, 117,   3,   3,   3, 119,   3,   4,   0, 121,   3,   4,   3, 122,
    3,   0,   4, 106,   3,   0,   5, 107,   3,   4,   4, 123,   3,   4,
    5, 124,   3,   5,   0, 125,   3,   5,   5, 127,   3,   6,   0, 129,
    3,   6,   5, 130,   3,   0,   6, 108,   3,   0,   7, 109,   3,   6,
    6, 131,   3,   6,   7, 132
  };
  auto lut_buf = std::shared_ptr<drv::buffer>(drv::buffer::create(context, lut.size()*sizeof(int)));
  stream->write(&*lut_buf, true, 0, luts_i);

  std::stringstream oss;
  rt::add_arg(oss, *da->cu());
  rt::add_arg(oss, *db->cu());
  rt::add_arg(oss, *dc->cu());
  rt::add_arg(oss, 64);
  rt::add_arg(oss, 64);
  rt::add_arg(oss, block);
  rt::add_arg(oss, 32768);
  rt::add_arg(oss, 32768);
  rt::add_arg(oss, 34816);
  rt::add_arg(oss, 8192);
  rt::add_arg(oss, 8192);
  rt::add_arg(oss, 34816);
  rt::add_arg(oss, AS2);
  rt::add_arg(oss, AS2);
  rt::add_arg(oss, AS3);
  rt::add_arg(oss, 0);
  rt::add_arg(oss, *lut_buf->cu());
  rt::add_arg(oss, *locks_buf->cu());
  rt::add_arg(oss, num_lock);

  rt::function function(src::matmul, opt, device);  // src::matmul is source code in matmul.c, passed over like the tutorial

  int max_width = 49152;

  auto grid = [max_width, width, AS0](const rt::options_t& x) {
    return rt::grid_t{(size_t)x.D<int>("TZ"),
                (size_t)std::min(max_width, width),
                (size_t)AS0};
  };

  function((void**)oss.str().data(), oss.str().size(), grid, stream);
  
  stream->read(&*dc, true, 0, hc);   // ERROR HAPPENS HERE!!!!
}

int main() {
  auto context = triton::driver::backend::contexts::get_default();
  triton::driver::stream* stream = triton::driver::stream::create(context->backend());
  compute(context, stream, false, true, false, 16, 0);
}

The below shows the tf-custom op version of using the sparse matmul kernel,
and the second code block shows how this op is called in python level,
the error happens when printing tensor c

//--------------------------------------- using tf custom op ----------------------------------------
using namespace tensorflow;

// op without caching ver
//************************ OP INTERFACE *********************************
REGISTER_OP("SddMatmul")
  .Attr("T: {int32, float, float16}")
  .Input("a: T")
  .Input("b: T")
  .Input("c: T")
  .Attr("trans_a: bool")  
  .Attr("trans_b: bool")
  .Attr("trans_c: bool")
  .Attr("block: int")
  .Input("luts_i: T")
  .Attr("widths_i: int")
  .Attr("packs_i: int")
  .Input("locks: T")

  .Attr("a_stride0: int")
  .Attr("a_stride1: int")
  .Attr("a_stride2: int")
  .Attr("b_stride0: int")
  .Attr("b_stride1: int")
  .Attr("b_stride2: int")
  .Attr("c_stride0: int");


//****************************** KERNEL **********************************
template <typename T>
class SddMatmulOp : public OpKernel {

 private:
  bool trans_a, trans_b, trans_c;
  int block, width, pack;
  int a_stride0, a_stride1, a_stride2, b_stride0, b_stride1, b_stride2, c_stride0;

 public:
  explicit SddMatmulOp(OpKernelConstruction* context) : OpKernel(context) {
    OP_REQUIRES_OK(context, context->GetAttr("trans_a", &trans_a));
    OP_REQUIRES_OK(context, context->GetAttr("trans_b", &trans_b));
    OP_REQUIRES_OK(context, context->GetAttr("trans_c", &trans_c));
    OP_REQUIRES_OK(context, context->GetAttr("block", &block));
    OP_REQUIRES_OK(context, context->GetAttr("widths_i", &width));
    OP_REQUIRES_OK(context, context->GetAttr("packs_i", &pack));

    OP_REQUIRES_OK(context, context->GetAttr("a_stride0", &a_stride0));
    OP_REQUIRES_OK(context, context->GetAttr("a_stride1", &a_stride1));
    OP_REQUIRES_OK(context, context->GetAttr("a_stride2", &a_stride2));
    OP_REQUIRES_OK(context, context->GetAttr("b_stride0", &b_stride0));
    OP_REQUIRES_OK(context, context->GetAttr("b_stride1", &b_stride1));
    OP_REQUIRES_OK(context, context->GetAttr("b_stride2", &b_stride2));
    OP_REQUIRES_OK(context, context->GetAttr("c_stride0", &c_stride0));
  }

  void Compute(OpKernelContext* context) override {
    auto triton_ctx = drv::backend::contexts::get_default();                      // get context
    drv::stream* stream = triton::driver::stream::create(triton_ctx->backend());  // get stream
    drv::device* device = triton_ctx->device();                                   // get device

    // get inputs
    OpInputList luts;
    const Tensor& a = context->input(0);
    const Tensor& b = context->input(1);
    const Tensor& c = context->input(2);
    const Tensor& lut = context->input(3);
    const Tensor& locks = context->input(4);

    // change to pointers 
    auto a_ = reinterpret_cast<const T*>(a.tensor_data().data());
    auto b_ = reinterpret_cast<const T*>(b.tensor_data().data());
    auto c_ = reinterpret_cast<const T*>(c.tensor_data().data());
    auto lut_ = reinterpret_cast<const T*>(lut.tensor_data().data());
    auto locks_ = reinterpret_cast<const T*>(locks.tensor_data().data());

    uintptr_t a_uintptr = (uintptr_t)a_;
    uintptr_t b_uintptr = (uintptr_t)b_;
    uintptr_t c_uintptr = (uintptr_t)c_;
    uintptr_t lut_uintptr = (uintptr_t)lut_;
    uintptr_t locks_uintptr = (uintptr_t)locks_;

    uint64 a_addr = a_uintptr;
    uint64 b_addr = b_uintptr;
    uint64 c_addr = c_uintptr;
    uint64 lut_addr = lut_uintptr;
    uint64 locks_addr = locks_uintptr;

    std::cout << "a addr: " << a_addr << std::endl;
    std::cout << "b addr: " << b_addr << std::endl;
    std::cout << "c addr: " << c_addr << std::endl;
    std::cout << "lut addr: " << lut_addr << std::endl;
    std::cout << "locks addr: " << locks_addr << std::endl;

    // define macros
    rt::options_t opt;
    opt.defines["TM"] = std::to_string(block * pack);
    opt.defines["TN"] = std::to_string(block * pack);
    opt.defines["TMN"] = std::to_string(block * block * pack * pack);
    opt.defines["BLOCK"] = std::to_string(block);
    opt.defines["TK"] = "32";   
    opt.defines["TYPE"] = "half"; //std::to_string(dtype);  //std::to_string<T>::value; TODO:
    opt.defines["STRIDE_AM"] = trans_a ? "1" : "lda";
    opt.defines["STRIDE_AK"] = trans_a ? "lda" : "1";
    opt.defines["STRIDE_BN"] = trans_b ? "ldb" : "1";
    opt.defines["STRIDE_BK"] = trans_b ? "1" : "ldb";
    opt.defines["STRIDE_CM"] = "ldc";
    opt.defines["STRIDE_CN"] = "1";
    opt.defines["SDD"] = "True"; // TODO: bool type
    opt.defines["TZ"] = "1";
    opt.defines["NAME"] = "sdd_kernel";
    opt.num_warps = 4;  

    // adding arguments
    std::stringstream oss;
    rt::add_arg(oss, a_addr);
    rt::add_arg(oss, b_addr);
    rt::add_arg(oss, c_addr);
    rt::add_arg(oss, a_stride2);
    rt::add_arg(oss, b_stride2);
    rt::add_arg(oss, block);
    rt::add_arg(oss, a_stride0);
    rt::add_arg(oss, b_stride0);
    rt::add_arg(oss, c_stride0);
    rt::add_arg(oss, a_stride1);
    rt::add_arg(oss, b_stride1);
    rt::add_arg(oss, c_stride0);
    rt::add_arg(oss, 128);  // AS2
    rt::add_arg(oss, 128);  // AS2
    rt::add_arg(oss, 64);   // AS3
    rt::add_arg(oss, 0);
    rt::add_arg(oss, lut_addr);
    rt::add_arg(oss, locks_addr);
    rt::add_arg(oss, 1);  // num_lock

    rt::function function(src::matmul, opt, device);

    auto grid = [this](const rt::options_t& x) {
      return rt::grid_t{(size_t)x.D<int>("TZ"),
                  (size_t)std::min(49152, width),
                  (size_t)1};
    };

    function((void**)oss.str().data(), oss.str().size(), grid, stream);
    std::cout << "a addr after: " << a_addr << std::endl;
    std::cout << "b addr after: " << b_addr << std::endl;
    std::cout << "c addr after: " << c_addr << std::endl;
    std::cout << "lut addr after: " << lut_addr << std::endl;
    std::cout << "locks addr after: " << locks_addr << std::endl;
    

  } // Compute

};  // SddMatmulOp class


REGISTER_KERNEL_BUILDER(Name("SddMatmul").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"), SddMatmulOp<Eigen::half>);

sdd_matmul = _load_library('c_lib' + get_ext_suffix()).sdd_matmul
# go around for loop once
with tf.device("/GPU:0"):
    a = tf.random.uniform([1, 4, 128, 64], dtype=tf.float16)
    b = tf.random.uniform([1, 4, 128, 64], dtype=tf.float16)
    c = tf.ones([1, 136, 16, 16], dtype=tf.float16)

    luts_i = tf.random.uniform([384], minval=0, maxval=130, dtype=tf.float16)
    locks = tf.zeros([2,24,1,1], dtype=tf.float16)


sdd_matmul(a=a,
            b=b,
            c=c,
            trans_a=False,
            trans_b=True,
            trans_c=False,
            block=16,
            luts_i=luts_i,
            widths_i=24,
            packs_i=2,
            locks=locks,
            a_stride0=32768,
            a_stride1=8192,
            a_stride2=64,
            b_stride0=32768,
            b_stride1=8192,
            b_stride2=64,
            c_stride0=34816
            )

print("C: ", c)    # ERROR HAPPENS HERE

The following is the specification of my environment:

Ubuntu 18.04
one Tesla-V100 GPU (GCP n1-standard-8 instance)
CUDA 11.0
TensorFlow v2.4.0 (from source build with GCC 7.5.0)
Triton : the latest master branch version
LLVM 10.0.0

undefined symbol: compress2

I installed the triton by pip install triton==0.2.3; but when I run the code
it return Import Error :
triton/_C/libtriton.so: undefined symbol: compress2

I have libz.so.1 in /usr/lib64 and /usr/lib64 in my $LD_LIBRARY_PATH.

Installation guide

Please can you provide detailed instructions on how to install the program and how to test it?

compiler requirement

hi -

Can you please clarify compiler requirement? I am mainly interested in python module, but it needs build extension anyway. is LLVM a must, version requirement? I ran into bunch of errors, I just want to confirm the basics before I post the logs.

Thanks
Oliver

Build failed with CMAKE_BUILD_TYPE=Debug

Hi,

I tried on Intel OCL only platform, and found that from commit f1a636f the master branch will build failed with CMAKE_BUILD_TYPE=Debug but can pass with Release or RelWithDebInfo, looks cause by introduced with cublas libraries, the build errors show belw:

../lib/libisaac.so: undefined reference to `cublasGetStream_v2'
../lib/libisaac.so: undefined reference to `cublasSetStream_v2'
collect2: error: ld returned 1 exit status
tests/CMakeFiles/test-fusion.dir/build.make:95: recipe for target 'tests/test-fusion' failed
make[2]: *** [tests/test-fusion] Error 1
CMakeFiles/Makefile2:1608: recipe for target 'tests/CMakeFiles/test-fusion.dir/all' failed
make[1]: *** [tests/CMakeFiles/test-fusion.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
../lib/libisaac.so: undefined reference to `cublasGetStream_v2'
../lib/libisaac.so: undefined reference to `cublasSetStream_v2'
collect2: error: ld returned 1 exit status

Unable to install via pip

Hi all,

I tried to install the versions 0.3.0 and 0.2.0 and got some build errors. Any ideas what might be going on? The error points me to a CMake log, but that log doesn't exist on my computer.

For what its worth, I have Cuda 11.1 and Python 3.9.1.

Thanks!

     
      -- Configuring incomplete, errors occurred!
      See also "/tmp/pip-req-build-ll5zoboi/build/temp.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/tmp/pip-req-build-ll5zoboi/setup.py", line 106, in <module>
          setup(
        File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/__init__.py", line 163, in setup
          return distutils.core.setup(**attrs)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/command/install.py", line 61, in run
          return orig.install.run(self)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/install.py", line 546, in run
          self.run_command('build')
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/tmp/pip-req-build-ll5zoboi/setup.py", line 55, in run
          self.build_extension(ext)
        File "/tmp/pip-req-build-ll5zoboi/setup.py", line 93, in build_extension
          subprocess.check_call(['cmake', sourcedir] + cmake_args, cwd=self.build_temp, env=env)
        File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/subprocess.py", line 373, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-req-build-ll5zoboi/src', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-req-build-ll5zoboi/build/lib.linux-x86_64-3.9/triton/_C', '-DBUILD_TESTS=OFF', '-DBUILD_PYTHON_MODULE=ON', '-DPYT
HON_INCLUDE_DIRS=/home/fishy/.pyenv/versions/3.9.1/include/python3.9;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/si
te-packages/torch/include/torch/csrc/api/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/TH;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch
/include/THC;/usr/local/cuda-11.1/include', '-DPYTHON_LINK_DIRS=/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/lib;/usr/local/cuda-11.1/lib64', '-DTORCH_LIBRARIES=c10;c10_cuda;torch;torch_cuda;torch_c
pu;torch_python;triton', '-DLLVM_CONFIG=/usr/bin/llvm-config', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
      ----------------------------------------
  ERROR: Command errored out with exit status 1: /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ll5zoboi/setup.py'"'"'; __file__='"'"'/tmp/pip-r
eq-build-ll5zoboi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8fxaw_pr/install-reco
rd.txt --single-version-externally-managed --compile --install-headers /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/include/site/python3.9/triton Check the logs for full command output.

Unable to install via pip (different issue)

sort of a continuation from this issue https://github.com/ptillet/triton/issues/99
For a while the fix above worked fine, and I was able to use triton version 0.4.0, but recently I haven't been able to install it.
I still see it as the correct version on PyPi, but when I tried to install using pip I get the error

>>> pip3 install triton==0.4.0
ERROR: Could not find a version that satisfies the requirement triton==0.4.0 (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 0.4.1, 1.0.0.dev20210502, 1.0.0.dev20210506, 1.0.0.dev20210508, 1.0.0.dev20210509, 1.0.0.dev20210510, 1.0.0.dev20210515, 1.0.0.dev20210516, 1.0.0.dev20210520, 1.0.0.dev20210521, 1.0.0.dev20210525, 1.0.0.dev20210601)
ERROR: No matching distribution found for triton==0.4.0

I also can't install the version that's listed, 0.4.1, getting the following error

>>> pip3 install triton==0.4.1
Collecting triton==0.4.1
  Downloading triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl (14.9 MB)
     |████████████████████████████████| 14.9 MB 13.1 MB/s 
WARNING: Discarding https://files.pythonhosted.org/packages/74/00/990e44a74bba0735f55780ede9c717271887e7b48e4cedb1ab72086af169/triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl#sha256=ced9481babfc33f4abfcdc983be20e12085dd396030fa147074b6c7202df78c0 (from https://pypi.org/simple/triton/). Requested triton==0.4.1 from https://files.pythonhosted.org/packages/74/00/990e44a74bba0735f55780ede9c717271887e7b48e4cedb1ab72086af169/triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl#sha256=ced9481babfc33f4abfcdc983be20e12085dd396030fa147074b6c7202df78c0 has inconsistent version: filename has '0.4.1', but metadata has '0.4.0'
ERROR: Could not find a version that satisfies the requirement triton==0.4.1 (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 0.4.1, 1.0.0.dev20210502, 1.0.0.dev20210506, 1.0.0.dev20210508, 1.0.0.dev20210509, 1.0.0.dev20210510, 1.0.0.dev20210515, 1.0.0.dev20210516, 1.0.0.dev20210520, 1.0.0.dev20210521, 1.0.0.dev20210525, 1.0.0.dev20210601)
ERROR: No matching distribution found for triton==0.4.1

Possible memory out-of-bound in OpenCL sgemm

Hello,

I am running Isaac on an AMD Fiji Nano GPU, using ROCm 1.6
https://github.com/RadeonOpenCompute/ROCm

A memory error is being detected for sgemm problem
tA = no
tB = no
colMaj = yes
m = 35
n = 8457
k = 4096
lda = 35
ldb = 4096
ldc = 35

this corresponds to one of the problems in the DeepBench suite.

The error reported by the runtime is

Memory access fault by GPU node-1 on address 0x901e22000. Reason: Page not present or supervisor privilege.

I'm getting the same error on one of the DeepBench problems in Isaac's bench-blas.

AMD's previous Catalyst drivers just let these memory errors through.

I don't think this is a compiler issue, as there is still a problem when
export AMD_OCL_BUILD_OPTIONS_APPEND="-cl-opt-disable"

but I'm not certain.

Thanks.

3090/a100 support

Hi,

I build triton over 3090, and find cannot execute the generated binary:

root@2841f7a92187:/data/triton/build/tests/bench# ./bench_conv
terminate called after throwing an instance of 'triton::driver::exception::cuda::illegal_address'
what(): CUDA: Error- illegal address
Aborted (core dumped)

root@2841f7a92187:/data/triton/build/tests/bench# ./bench_copy
// {16777216}, {0}, {0}terminate called after throwing an instance of 'triton::driver::exception::cuda::launch_out_of_resources'
what(): CUDA: Error- launch out of resources
Aborted (core dumped)

So current 3090/a100 is not supported?

Thx,
Lei

einsum() got an unexpected keyword argument 'arrays'

I got a probem when I run python einsum.py, how to fix this error?

python einsum.py
Traceback (most recent call last):
File "einsum.py", line 195, in
triton.ops.einsum(expr, ta, tb, tc, arrays = arrays, bench = True)
TypeError: einsum() got an unexpected keyword argument 'arrays'

Memory corruption of Pascal Titan X for some shapes

It seems like some kernels tend to corrupt memory. This problem does not show on the benchmarks but breaks the BLAS3-test (despite computing the correct matrix).

Regression for caffe opencl branch.

The latest isaac code triggers many test failures with caffe's opencl branch. The good commit is:
Templates/Reduce1D: now properly loading 2D scalars commit 6ac5e1f

Since that commit, both "General: Internal code generator overhaul" and "JIT: No longer using fallbacks for stride[0] > 1" introduce some regressions.

It's easy to build the Caffe's opencl branch as below:

mkdir build

cmake -DUSE_GREENTEA=ON -DUSE_ISAAC=ON ..

cd build

make -j8

make runtest

Then you will see many new failures with the above two commit.

BTW It's better to use latest beignet driver as the OCL compiler. The good commit works great with beignet.

@ptillet Could you look at this issue? Thanks.

Changing num_warps results in different numerics

We found that if we change the num_warps on this line to any value besides 16, then the numerics become incorrect:
https://github.com/ptillet/triton/pull/53/files#diff-e5b8f91da04da23cb0d29b0d334fb41a185033e4974531de7494cfa38dd97880R58

My understanding is that num_warps should not affect the numerics at all, but it could be that we have a bug in our code, which violates a constraint needed for that guarantee!

(we are running on an A100 GPU)

Unable to build on arch linux

Many files, for example https://github.com/ptillet/triton/blob/master/lib/ir/constant.cc raises compilation error: 25:16: error: no member named 'runtime_error' in namespace 'std' on arch linux. When I manually add #include it starts working, but fails with same error on another files

Support of half version of isaac blas function

Hi @ptillet ,

fp16 of isaac blas functionality are added and under review at intel#5
please feel free to come up with your opinion about the patch:) thank you!

is this support the tensorflow？

If not, why can this repo support the tensorflow? Thanks !

Performance Evaluation methods? it's not very clear.

I'd like to do some performance evaluation of isaac blas library.

For example, clBLAS provides clBLAS-client that produces the following output.
./clBLAS-client
StatisticalTimer:: Pruning 0 samples from clfunc
StatisticalTimer:: Pruning 0 samples from clGemm
BLAS kernel execution time < ns >: 116170
BLAS kernel execution Gflops < 2.0_M_N*K/time >: 36.1049

On the otherhand, it is not very obvious how to do quick performance evaluation using ISAAC.

Even if there is no tool like clBLAS-client, can you provide info on how to do some evaluations on SGEMM, DGEMM for some sizes ? If you have some CPP code that does, would you provide the code or send it ? [email protected]

Thank you, I appreciate it

Matmul example (error: tile with more than one element cannot be casted to scalar)

Hi,

Apologies for this kind of question but I can't get the first matmul example to work: from here.

Here is a full compilable example:

import torch
import triton

class _dot(torch.autograd.Function):
    src = """
    __global__ void dot(float * A, float *B, float *C, int M, int N, int K,
                        int lda __multipleof(8), int ldb __multipleof(8), int ldc __multipleof(8)) {
        int pm = get_program_id(0);
        int pn = get_program_id(1);

        // ranges
        int rm[TM] = pm * TM + 0 ... TM;
        int rn[TN] = pn * TN + 0 ... TN;
        int rk[TK] = 0 ... TK;

        // accumulator
        float c[TM, TN] = 0;

        //pointers
        float* pa[TM, TK] = A + rk[newaxis, :] * 1 + rm[:, newaxis] * lda;
        float* pb[TK, TN] = B + rk[:, newaxis] * ldb + rn[newaxis, :] * 1;

        for(int k=K; k>0; k-=TK) {
            float a[TM, TK] = *pa;
            float b[TK, TN] = *pb;

            c += dot(a,b);

            pa = pa + TK * 1;
            pb = pb + TK * ldb;
        }

        float* pc[TM,TN] = C + rn[newaxis, :] + rm[:,newaxis] * ldc;
        *pc = c;
    }
    """

    @staticmethod
    def forward(ctx, a, b):
        c = _dot._call(a,b)
        return c


    @staticmethod
    def _call(a, b):
        M, K = a.shape
        K, N = b.shape

        lda = M
        ldb = K
        ldc = M

        dtype = a.dtype
        c = triton.empty([M,N], dtype=dtype)
        grid = lambda opt: [triton.cdiv(M, opt.d('TM')), triton.cdiv(N, opt.d('TN'))]
        defines= {
            'TYPE' : dtype,
            'TM'   : [32,64,128],
            'TN'   : [32,64,128],
            'TK'   : [8],
        }

        _dot.kernel = triton.kernel(_dot.src, defines=defines)
        _dot.kernel(a, b, c, M, N, K, lda, ldb, ldc,
                        grid=grid, num_warps=4, defines=defines)
        return c

dot = _dot.apply
torch.manual_seed(0)

M, N, K = 128, 512, 256
a = torch.rand((M, K)).cuda()
b = torch.rand((K, N)).cuda()

zc  = torch.matmul(a,b)
zc_ = dot(a,b)

print(zc)
print(zc_)

print(zc == zc_)

The use of dot throws an error:

(null):50:22: error: tile with more than one element cannot be casted to scalar
    c += dot(a,b);
             ^

But a @ b also seems wrong.

I think I'm possibly misunderstanding the use of leading dimensions and/or missing something simple ! Any help much appreciated, but since this isn't really an issue feel free to close if you don't have time at the moment.

Thanks,

Jack

Undefined symbol error when import

I installed the project via pip, but failed to import it.

Python 3.7.3 (default, Dec 20 2019, 18:57:59)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import triton
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rd/.local/lib/python3.7/site-packages/triton/__init__.py", line 1, in <module>
    from .kernel import *
  File "/home/rd/.local/lib/python3.7/site-packages/triton/kernel.py", line 1, in <module>
    import triton._C.libtriton as libtriton
ImportError: /home/rd/.local/lib/python3.7/site-packages/triton/_C/libtriton.so: undefined symbol: setupterm

However, ldd shows every dependent libraries of this .so are resolved.

ldd /home/rd/.local/lib/python3.7/site-packages/triton/_C/libtriton.so
        linux-vdso.so.1 (0x00007ffc7c375000)
        libc10.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fbae6190000)
        libc10_cuda.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fbae5f5f000)
        libtorch.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fbae5d49000)
        libtorch_cuda.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fbaab149000)
        libtorch_cpu.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fba9af88000)
        libtorch_python.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so (0x00007fba99dd6000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fba99c42000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fba99abf000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fba99aa3000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fba998e2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbae929b000)
        libgomp-7c85b1e2.so.1 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1 (0x00007fba996b8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fba99697000)
        libcudart-80664282.so.10.2 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libcudart-80664282.so.10.2 (0x00007fba99416
000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fba9940a000)
        libnvToolsExt-3965bdd0.so.1 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libnvToolsExt-3965bdd0.so.1 (0x00007fba992
00000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fba991fb000)
        libtensorpipe.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtensorpipe.so (0x00007fba98d32000)
        libshm.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libshm.so (0x00007fba98b2a000)

can this project be built in Windows?

I'd like to build it in windows, with vs2015, but I got hundreds errors, so can this projects be built in widows?

roofline model

Hi,

Whether there is test script to generate test data for comparing cublas/triton/tvm as your paper state?

And what I care is whether triton could provide some benefit that tensorRT couldn't provide, like more efficient conv kernel.
Or auto generate kernel for memory movement like nn.shuffle in pytorch?

I think more examples would benefit for this project.

Thx,
Lei

using clBLAS API (sgemm)

Hello,
I would like to use Isaac in OpenCL. After installing isaac and linking to libisaac.so, there is a link failure
undefined reference to `clblasSgemm'

From README

Link against libisaac.so instead of libcublas.so or libclblas.so, and you're good to go!

but grepping about the source file I can't find implementations of clBLAS.h functions. Isaac GEMM executes on my machine via the example bench/blas.cpp, but it is not clear in that example how an end user is supposed to use Isaac's GEMM (there are calls to function dot (?)).

So my question is : can I just link to libisaac.so and get sgemm working, or do I need to use a different API?

Thanks.

Support for Sparse-Dense Matrix Mulitplication

Hi, All

Is there any support for using GPU tensor core in Sparse-Dense Matrix Multiplication (SpMM) or Sampled Dense-Dense Matrix Multiplication (SDDMM)?

Thanks!

python example error

Hi,

I try to run the trans.py in python/examples/tutorials, but meet below error:
The docker environment I use is nvcr.io/nvidia/pytorch:20.12-py3

Traceback (most recent call last):
File "trans.py", line 1, in
import torch
File "/opt/conda/lib/python3.8/site-packages/torch/init.py", line 484, in
from .serialization import save, load
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 8, in
import tarfile
File "/opt/conda/lib/python3.8/tarfile.py", line 47, in
import copy
File "/data/tmp/triton/python/examples/tutorials/copy.py", line 4, in
class _copy(torch.autograd.Function):
AttributeError: partially initialized module 'torch' has no attribute 'autograd' (most likely due to a circular import)

torch.utils.ffi is deprecated

If I run,
python imagenet.py --arch resnet152 /data/imagenet_data

Then,

raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.

it only supports https://github.com/pytorch/pytorch/tree/v0.4.1

Compilation failure

Hi I am trying to reproduce your SC results.

I got the following errors among others:

In file included from /home/flavio/isaac/lib/templates/gemm.cpp:26:0:                                                                                                   │···································
/home/flavio/isaac/include/isaac/templates/common.hpp:82:16: error: ‘vector’ in namespace ‘std’ does not name a template type                                           │···································
   virtual std::vector<param_t> tuning_params() const = 0;                                                                                                              │···································
                ^~~~~~                                                                                                                                                  │···································
In file included from /home/flavio/isaac/lib/templates/gemm.cpp:27:0:                                                                                                   │···································
/home/flavio/isaac/include/isaac/templates/gemm.h:60:8: error: ‘vector’ in namespace ‘std’ does not name a template type                                                │···································
   std::vector<param_t> tuning_params() const;                                                                                                                          │···································
        ^~~~~~                                                                                                                                                          │···································
In file included from /home/flavio/isaac/lib/templates/pool.cpp:26:0:                                                                                                   │···································
/home/flavio/isaac/include/isaac/templates/common.hpp:82:16: error: ‘vector’ in namespace ‘std’ does not name a template type                                           │···································
   virtual std::vector<param_t> tuning_params() const = 0;                                                                                                              │···································
                ^~~~~~                                                                                                                                                  │···································
In file included from /home/flavio/isaac/lib/templates/pool.cpp:27:0:                                                                                                   │···································
/home/flavio/isaac/include/isaac/templates/pool.h:42:34: error: ‘std::vector’ has not been declared                                                                     │···································
   void init_constant_memory(std::vector<int32_t>& delta, std::vector<uint32_t> &masks, size_t nlut, int32_t strideIc, int32_t strideIw, int32_t strideIh, int32_t strid│···································
eId);                                                                                                                                                                   │···································
                                  ^~~~~~  ````

I am using CUDA-9 and gcc/g++ -7.

Fail case when running caffe opencl branch with isaac

Hi,
I am running caffe opencl branch (https://github.com/BVLC/caffe/tree/opencl) with isaac master branch on INTEL BROADWELL platform with below command:
./build/test/test.testbin --gtest_filter=NetTest/2.TestLossWeight, that will bring fail.
While comment out the line 94 to 141 on file https://github.com/ptillet/isaac/blob/master/lib/runtime/profiles.cpp, that will pass the test case.

Can you reproduce the fail case, seems there is some problem with the copy operation on predict_ logic?

Get PTX for kernel

Hi!

Is there a way to get the underlying PTX for a given kernel? I recall there being an API for this at one point, but it looks like it no longer exists in the latest Triton.

Thanks!
Trevor

Support AutoDiff for one operator by use source-code-transform ?

When I develop a new operator, the time of backward always cost many time. If triton can support autodiff, It will be very nice!

question about the predictor machanism

Hi @ptillet ,

I want to add kernels on isaac library, but have some doubt about the implementation of the tuning machanism:

1)  there are some json file in the database folder, how do you generate them, and how can i modify them to fit the new kernels?
2)  what is the mechanism about the predictor and its relationship among expression_tree, random forest, feature, threashold and values, how can I use this mechanism?

Looking forward for your replay, thx!

Inference on K80/M60

Hi, I'm trying to do some inference on M60/K80 while always get issue like this:
"triton.code_gen.OutOfResources: out of resource: shared memoryRequired: 65536Hardware limit: 49152"

While training on A100 is ok because of bigger shared memory.

Just want to know is there any hard limit on GPU hardware? Is there any suggestions that we can skip this issue and enable inference on M60/K80/T4 ?

Custom operation tutorial: module 'triton' has no attribute 'Function'

If I just copy and paste the program in the tutorial: https://docs.triton-lang.org/tutorials/custom-operation.html

$ python custom-op.py 
Traceback (most recent call last):
  File "custom-op.py", line 4, in <module>
    class _add(triton.Function):
AttributeError: module 'triton' has no attribute 'Function'

Changing:

class _add(triton.Function):

class _add(torch.autograd.Function):

Seems to work (print(diff) outputs tensor(0., device='cuda:0')). Is this correct? (P.S. sorry for opening multiple issues in one day ;) )

How about consider to provide a conda install package for python users.

I suffer from installing old version of this which other package need.
If you need extra env to compile this, pip is not a good idea for users.

question on blocksparse.

Hi thanks for open-sourced contribution. I see there is block sparse implementation in the test.
Is this the block sparse attention mechanism specific for transformer? Or it is the general spMM?

Thanks

Can we rewrite pytorch backend in triton?

I have just checked out your blog and I must congratulate on the release!

May be I'm asking a dumb question here but how feasible is to replace cuda in pytorch with triton kernels. Your tutorials seem to indicate auto-optimized code seems to be better than hand-optimized torch cuda code. So, I wonder if we can actually rewrite torch cuda kernels in triton.

Installation from source instructions are out of date

The instructions for installing from source say to test "by running the einsum.py example". However, this example no longer exists

triton.kernel 1.1 requires argument "device"

After I upgrade from triton 0.3 to 1.1, triton.kernel is throwing the error that it requires "device" argument. Is it the same as torch.device? Thanks!.

Memory leak found

I was trying to use the GEMM for a program that has to multiply a large amount of matrices and I found a memory leak. Then I profiled test-gemm and found that the leak is also there as you can see on this Valgrind Memcheck output:

==11320== LEAK SUMMARY:
==11320==    definitely lost: 323,624 bytes in 166 blocks
==11320==    indirectly lost: 300 bytes in 16 blocks
==11320==      possibly lost: 121,368 bytes in 1,088 blocks
==11320==    still reachable: 14,228,175 bytes in 17,865 blocks
==11320==         suppressed: 0 bytes in 0 blocks
==11320== Reachable blocks (those to which a pointer was found) are not shown.
==11320== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==11320==

The problem seems to be here:

==11320== 26,144 bytes in 6 blocks are definitely lost in loss record 1,429 of 1,488
==11320==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11320==    by 0x4E93112: std::vector<float, std::allocator<float> >::_M_default_append(unsigned long) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E8EFAA: isaac::runtime::Dense::Dense(unsigned char*&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E8F094: isaac::runtime::Layer::read(unsigned char*&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E8F16F: isaac::runtime::Network::Network(unsigned char*) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E8F661: isaac::runtime::Profile::Profile(unsigned char*, unsigned long) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E8D20E: std::__shared_ptr<isaac::runtime::GEMMProfile, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<isaac::runtime::GEMMProfile>, unsigned char*>(std::_Sp_make_shared_tag, std::allocator<isaac::runtime::GEMMProfile> const&, unsigned char*&&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4E6A0EB: __static_initialization_and_destruction_0(int, int) [clone .constprop.78] (in /home/rogerpt32/isaac/build/libisaac.so)
==11320==    by 0x4010732: call_init (dl-init.c:72)
==11320==    by 0x4010732: _dl_init (dl-init.c:119)
==11320==    by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)

With Valgrind massif we can also see the memory leakage but this time it points to:

isaac::driver::Kernel::Kernel(isaac::driver::Module const&, char const*)

The memory leaks for every call to GEMM. It's not a problem for small programs but if anybody wants to run a program using ISAAC with a very large amount of calls to GEMM, it could be problematic.

Recompile Torch C++ binding when version has changed

Triton's caching mechanism fails to detect when Torch has been upgraded. This means that ~/.triton/torch/*.so will be compiled for an old version of torch and trigger a link error on import. This can be worked aroung by deleting ~/.triton/torch, but needs to be fixed properly.

Full support of S/DBLAS for both clBLAS + cuBLAS

Many symbols of the BLAS ABI are currently not supported, although the current ABI is enough to ensure compatibility with e.g. caffe.

Currently supported functions are:

xAXPY
xCOPY
xSCAL
xDOT
xASUM

xGEMV
xGER

xGEMM

x = {S, D}

Blocksparse matmul error under specific block size

I was looking at this test function: https://github.com/ptillet/triton/blob/master/python/test/test_blocksparse.py#L13,
when the blocksize is 16 or greater, it works fine. However, when the block size is set as 4 or 8, the error message looks like this:

Traceback (most recent call last):
  File "test_blocksparse.py", line 41, in <module>
    test_matmul("dds", TRANS_A=False, TRANS_B=False, BLOCK=8, DTYPE="float16", Z=3, H=2, M=512, N=384, K=256)
  File "test_blocksparse.py", line 30, in test_matmul
    rc = op(ra, rb)
  File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 651, in __call__
    db_lut, db_num_locks, db_width, db_packs = self.make_lut(a.dtype, a.device)
  File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 595, in make_lut
    c_lut, c_num_locks, c_width, c_packs = _matmul.make_dxx_lut(layout, block, step, self.trans_b, device)
  File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 391, in make_dxx_lut
    xincs[:, 0] -= (div - 1) * step
IndexError: index 0 is out of bounds for dimension 1 with size 0

Any ideas about what's the reason? To reproduce the error, can run

test_matmul("dds", TRANS_A=False, TRANS_B=False, BLOCK=4, DTYPE="float16", Z=3, H=2, M=512, N=384, K=256)

triton==1.0.0.dev20210329 no longer installable via pip

Hi - would it be possible to reinstantiate triton==1.0.0.dev20210329 on pip, or make it clear how to update to the latest dev branch?
The api seems to have changed significantly in the latest nightlies, and some function in https://github.com/microsoft/DeepSpeed rely on that particular interface.