
Status

This documentation about an unstable feature is UNMAINTAINED and was written over a year ago. Things may have drastically changed since then; read this at your own risk! If you are interested in modern Rust on GPU development, check out https://github.com/rust-cuda/wg

-- @japaric, 2018-12-08


nvptx

How to: Run Rust code on your NVIDIA GPU

First steps

Since 2016-12-31, rustc can compile Rust code to PTX (Parallel Thread Execution) code, which is like GPU assembly, via --emit=asm and the right --target argument. This PTX code can then be loaded and executed on a GPU.

However, a few days later, 128-bit integer support landed in rustc and broke compilation of the core crate for the NVPTX targets (it triggered LLVM assertions). Furthermore, there was no nightly release between these two events, so it was not possible to use the NVPTX backend with a nightly compiler.

Just recently (2017-05-18) I realized (thanks to this blog post) that we can work around the problem by compiling a fork of the core crate that doesn't contain code that involves 128-bit integers. Which is a bit unfortunate but, hey, if it works then it works.

Targets

The required targets are not built into the compiler (they don't appear in the output of rustc --print target-list) but are available as JSON files in this repository.

If the host is running a 64-bit OS, you should use the nvptx64 target. Otherwise, use the nvptx target.
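For reference, a rustc target specification is a JSON file along these lines. This is an abridged, illustrative sketch of what such a file contains, not the exact contents of the files in this repository; consult those files for the authoritative field values:

```json
{
  "arch": "nvptx64",
  "llvm-target": "nvptx64-nvidia-cuda",
  "target-endian": "little",
  "target-pointer-width": "64",
  "os": "cuda",
  "data-layout": "e-i64:64-v16:16-v32:32-n16:32:64",
  "panic-strategy": "abort"
}
```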

Minimal example

Here's a minimal example of emitting PTX from a Rust crate:

$ cargo new --lib kernel && cd $_

$ cat src/lib.rs
#![no_std]

fn foo() {}
# emitting debuginfo is not supported for the nvptx targets
$ edit Cargo.toml && tail -n2 $_
[profile.dev]
debug = false

# The JSON file must be in the current directory
$ test -f nvptx64-nvidia-cuda.json && echo OK
OK

# You'll need to use Xargo to build the `core` crate "on the fly"
# Install it if you don't already have it
$ cargo install xargo || true

# Then instruct Xargo to compile a fork of the core crate that contains no
# 128-bit integers
$ edit Xargo.toml && cat Xargo.toml
[dependencies.core]
git = "https://github.com/japaric/core64"

# Xargo has the exact same CLI as Cargo
$ xargo rustc --target nvptx64-nvidia-cuda -- --emit=asm
   Compiling core v0.0.0 (file://$SYSROOT/lib/rustlib/src/rust/src/libcore)
    Finished release [optimized] target(s) in 18.74 secs
   Compiling kernel v0.1.0 (file://$PWD)
    Finished debug [unoptimized] target(s) in 0.4 secs

The PTX code will be available as a .s file in the target directory:

$ find -name '*.s'
./target/nvptx64-nvidia-cuda/debug/deps/kernel-e916cff045dc0eeb.s

$ cat $(find -name '*.s')
.version 3.2
.target sm_20
.address_size 64

.func _ZN6kernel3foo17h24d36fb5248f789aE()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

Global functions

Although this PTX module (the whole file) can be loaded on the GPU, the function foo contained in it can't be "launched" by the host because it's a device function. Only global functions (AKA kernels) can be launched by the host.

To turn foo into a global function, its ABI must be changed to "ptx-kernel":

#![feature(abi_ptx)]
#![no_std]

extern "ptx-kernel" fn foo() {}

With that change the PTX of the foo function will now look like this:

.entry _ZN6kernel3foo17h24d36fb5248f789aE()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

foo is now a global function because it has the .entry directive instead of the .func one.

Avoiding mangling

With the CUDA API, one can retrieve functions from a PTX module by name. foo's final name in the PTX module has been mangled and looks like this: _ZN6kernel3foo17h24d36fb5248f789aE.

To avoid mangling the foo function, add the #[no_mangle] attribute to it:

#![feature(abi_ptx)]
#![no_std]

#[no_mangle]
extern "ptx-kernel" fn foo() {}

This will result in the following PTX code:

.entry foo()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

With this change you can now refer to the foo function using the "foo" (C) string from within the CUDA API.
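For illustration, here's a rough sketch of what the host side of that lookup might look like using the CUDA driver API. The extern declarations below are hand-written FFI bindings for this sketch only (a real program would use a binding crate); it assumes libcuda is available to link against and that the PTX file is named kernel.s:

```rust
// Sketch: loading a PTX module and retrieving the `foo` kernel by name
// via the CUDA driver API (cuModuleLoad / cuModuleGetFunction / cuLaunchKernel).
use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_uint, c_void};

type CUresult = c_int;
enum CUctxOpaque {}
enum CUmodOpaque {}
enum CUfuncOpaque {}

#[link(name = "cuda")]
extern "C" {
    fn cuInit(flags: c_uint) -> CUresult;
    fn cuDeviceGet(device: *mut c_int, ordinal: c_int) -> CUresult;
    fn cuCtxCreate_v2(ctx: *mut *mut CUctxOpaque, flags: c_uint, dev: c_int) -> CUresult;
    fn cuModuleLoad(module: *mut *mut CUmodOpaque, fname: *const c_char) -> CUresult;
    fn cuModuleGetFunction(
        func: *mut *mut CUfuncOpaque,
        module: *mut CUmodOpaque,
        name: *const c_char,
    ) -> CUresult;
    fn cuLaunchKernel(
        f: *mut CUfuncOpaque,
        grid_x: c_uint, grid_y: c_uint, grid_z: c_uint,
        block_x: c_uint, block_y: c_uint, block_z: c_uint,
        shared_mem_bytes: c_uint,
        stream: *mut c_void,
        kernel_params: *mut *mut c_void,
        extra: *mut *mut c_void,
    ) -> CUresult;
}

fn main() {
    unsafe {
        assert_eq!(cuInit(0), 0);
        let mut dev = 0;
        assert_eq!(cuDeviceGet(&mut dev, 0), 0);
        let mut ctx = std::ptr::null_mut();
        assert_eq!(cuCtxCreate_v2(&mut ctx, 0, dev), 0);

        // Load the PTX file produced by `xargo rustc -- --emit=asm`
        let path = CString::new("kernel.s").unwrap();
        let mut module = std::ptr::null_mut();
        assert_eq!(cuModuleLoad(&mut module, path.as_ptr()), 0);

        // Thanks to `#[no_mangle]` we can look the kernel up as plain "foo"
        let name = CString::new("foo").unwrap();
        let mut func = std::ptr::null_mut();
        assert_eq!(cuModuleGetFunction(&mut func, module, name.as_ptr()), 0);

        // Launch a 1x1x1 grid of 1x1x1 blocks; `foo` takes no arguments
        assert_eq!(
            cuLaunchKernel(func, 1, 1, 1, 1, 1, 1, 0,
                           std::ptr::null_mut(),
                           std::ptr::null_mut(),
                           std::ptr::null_mut()),
            0
        );
    }
}
```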

Optimization

So far we have been compiling the crate using the (default) "debug" profile which normally results in debuggable but slow code. Given that we can't emit debuginfo when using the nvptx targets, it makes more sense to build the crate using the "release" profile.

The catch is that we'll have to mark global functions as public; otherwise, the compiler will "optimize them away" and they won't make it into the final PTX file.

#![feature(abi_ptx)]
#![no_std]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}
$ cargo clean

$ xargo rustc --release --target nvptx64-nvidia-cuda -- --emit=asm

$ cat $(find -name '*.s')
.visible .entry foo()
{
        ret;
}

Examples

This repository contains runnable examples of executing Rust code on the GPU. Note that no effort has gone into ergonomically integrating both the device code and the host code :-).

There's a kernel directory, which is a Cargo project as well, that contains Rust code that's meant to be executed on the GPU. That's the "device" code.

You can convert that Rust code into a PTX module using the following command:

$ xargo rustc \
    --manifest-path kernel/Cargo.toml \
    --release \
    --target nvptx64-nvidia-cuda \
    -- --emit=asm

The PTX file will be available in the kernel/target directory.

$ find kernel/target -name '*.s'
kernel/target/nvptx64-nvidia-cuda/release/deps/kernel-bb52137592af9c8c.s

The examples directory contains the "host" code. Inside that directory, there are three files; each file is an example program:

  • add - Add two (mathematical) vectors on the GPU
  • memcpy - memcpy on the GPU
  • rgba2gray - Convert a color image to grayscale
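As a taste of what the device code in the kernel directory looks like, a vector-add kernel could be written roughly as below. This is a sketch: the special-register accessors are hypothetical hand-written declarations standing in for whatever bindings the actual kernel crate uses (e.g. something like nvptx-builtins):

```rust
#![feature(abi_ptx)]
#![no_std]

extern "C" {
    // Hypothetical bindings to the PTX special registers holding the
    // block/thread indices; the real kernel crate provides its own.
    fn block_idx_x() -> i32;
    fn block_dim_x() -> i32;
    fn thread_idx_x() -> i32;
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *const f32, b: *const f32, c: *mut f32, n: i32) {
    // One element per thread: compute this thread's global index
    let i = block_idx_x()
        .wrapping_mul(block_dim_x())
        .wrapping_add(thread_idx_x());
    if i < n {
        *c.offset(i as isize) = *a.offset(i as isize) + *b.offset(i as isize);
    }
}
```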

Each example program expects the path to the PTX file we generated previously as its first argument. You can run each example with a command like this:

$ cargo run --example add -- $(find kernel/target -name '*.s')

The rgba2gray example additionally expects a second argument: the path to the image that will be converted to grayscale. That example also compares the runtime of converting the image on the GPU vs the runtime of converting the image on the CPU. Be sure to run that example in release mode to get a fair comparison!

$ cargo run --release --example rgba2gray -- $(find kernel/target -name '*.s') ferris.png
Image size: 1200x800 - 960000 pixels - 3840000 bytes

RGBA -> grayscale on the GPU
    Duration { secs: 0, nanos: 602024 } - `malloc`
    Duration { secs: 0, nanos: 718481 } - `memcpy` (CPU -> GPU)
    Duration { secs: 0, nanos: 1278006 } - Executing the kernel
    Duration { secs: 0, nanos: 306315 } - `memcpy` (GPU -> CPU)
    Duration { secs: 0, nanos: 322648 } - `free`
    ----------------------------------------
    Duration { secs: 0, nanos: 3227474 } - TOTAL

RGBA -> grayscale on the CPU
    Duration { secs: 0, nanos: 12299 } - `malloc`
    Duration { secs: 0, nanos: 4171570 } - conversion
    Duration { secs: 0, nanos: 493 } - `free`
    ----------------------------------------
    Duration { secs: 0, nanos: 4184362 } - TOTAL
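For reference, the CPU side of an RGBA → grayscale conversion boils down to a weighted sum of the color channels per pixel. Here's a minimal sketch using the common Rec. 601 luma weights (the actual example may use different coefficients; the alpha channel is ignored, as discussed in the issues below):

```rust
// Convert RGBA pixels (4 bytes each) to grayscale (1 byte each) on the CPU
// using the Rec. 601 luma weights. The alpha channel (px[3]) is ignored.
fn rgba_to_gray(rgba: &[u8]) -> Vec<u8> {
    rgba.chunks(4)
        .map(|px| {
            let (r, g, b) = (px[0] as f32, px[1] as f32, px[2] as f32);
            (0.299 * r + 0.587 * g + 0.114 * b) as u8
        })
        .collect()
}

fn main() {
    // A black pixel and a pure-green pixel
    let rgba = [0, 0, 0, 255, 0, 255, 0, 255];
    let gray = rgba_to_gray(&rgba);
    assert_eq!(gray[0], 0);   // black stays black
    assert_eq!(gray[1], 149); // green weighted by 0.587
}
```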

Problems?

If you encounter any problem with the Rust -> PTX feature in the compiler, report it to this meta issue.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

nvptx's People

Contributors

antoinewdg, homunkulus, japaric, seeker14491, vadixidav


nvptx's Issues

Rgba2gray: Alpha handling

Currently the alpha channel from the input is not used anywhere, but it's still sent to the GPU. We could either discard the alpha when the input is loaded, or actually use the alpha channel and output a grayscale image with alpha.

Unnecessary wrapping_add?

I noticed that you're manually calling wrapping_* for arithmetic operations in the PTX kernels. Shouldn't overflow wrap around already?

Note that I also noticed that the intrinsics appear to use signed integers for these variables (for which overflow would be undefined in C, although defined as wraparound in RFC 560 for Rust), whereas the CUDA programming manual defines them as unsigned (wraparound in both C and Rust). I filed a ticket in that library about this.

https://github.com/nox/rust-rfcs/blob/master/text/0560-integer-overflow.md
japaric-archived/nvptx-builtins#1

Feel free to close the ticket if it isn't relevant - I don't typically do GPGPU programming and this was the easiest way for me to provide feedback.

Trying to work through the README, can't find crate

╭ ➜ kimockb@gator3:~/kernel
╰ ➤ cat Xargo.toml 
[dependencies.core]
git = "https://github.com/japaric/core64"
╭ ➜ kimockb@gator3:~/kernel
╰ ➤ xargo rustc --target nvptx64-nvidia-cuda -- --emit=asm
   Compiling kernel v0.1.0 (file:///home/kimockb/kernel)
'bdver1' is not a recognized processor for this target (ignoring processor)
'bdver1' is not a recognized processor for this target (ignoring processor)
'bdver1' is not a recognized processor for this target (ignoring processor)
'bdver1' is not a recognized processor for this target (ignoring processor)
'bdver1' is not a recognized processor for this target (ignoring processor)
'bdver1' is not a recognized processor for this target (ignoring processor)
error[E0463]: can't find crate for `core`
  |
  = note: the `nvptx64-nvidia-cuda` target may not be installed

error: aborting due to previous error

error: Could not compile `kernel`.

What am I doing wrong here? It almost seems like xargo isn't picking up the crate URL.

Error compiling the minimal example.

Hi.
I have successfully installed your cuda crate (with cuda 8.0).
My system: Ubuntu 16.04, Intel® Core™ i7-5500U CPU @ 2.40GHz × 4, GeForce 940M/PCIe/SSE2
(Lenovo Yoga 500 notebook)

rustup show
Default host: x86_64-unknown-linux-gnu

nightly-x86_64-unknown-linux-gnu (default)
rustc 1.17.0-nightly (29dece1c8 2017-02-08)

When trying to compile the minimal examples in the README I run into the following error:

RUST_BACKTRACE=1  xargo rustc --release --target nvptx64-nvidia-cuda -- --emit=asm --verbose


+ "rustc" "--print" "sysroot"
+ "rustc" "--print" "target-list"
+ "cargo" "build" "--release" "--manifest-path" "/tmp/xargo.YgwtgUACXnwK/Cargo.toml" "--target" "nvptx64-nvidia-cuda" "-v" "-p" "core"
   Compiling core v0.0.0 (file:///home/georg/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore)
     Running `rustc --crate-name core /home/georg/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=9513f6a24932021b -C extra-filename=-9513f6a24932021b --out-dir /tmp/xargo.YgwtgUACXnwK/target/nvptx64-nvidia-cuda/release/deps --target nvptx64-nvidia-cuda -L dependency=/tmp/xargo.YgwtgUACXnwK/target/nvptx64-nvidia-cuda/release/deps -L dependency=/tmp/xargo.YgwtgUACXnwK/target/release/deps --sysroot /home/georg/.xargo`
rustc: /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp:976: std::string llvm::NVPTXTargetLowering::getPrototype(const llvm::DataLayout&, llvm::Type*, const ArgListTy&, const llvm::SmallVectorImpl<llvm::ISD::OutputArg>&, unsigned int, const llvm::ImmutableCallSite*) const: Assertion `(getValueType(DL, Ty) == Outs[OIdx].VT || (getValueType(DL, Ty) == MVT::i8 && Outs[OIdx].VT == MVT::i16)) && "type mismatch between callee prototype and arguments"' failed.
error: Could not compile `core`.

Caused by:
  process didn't exit successfully: `rustc --crate-name core /home/georg/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=9513f6a24932021b -C extra-filename=-9513f6a24932021b --out-dir /tmp/xargo.YgwtgUACXnwK/target/nvptx64-nvidia-cuda/release/deps --target nvptx64-nvidia-cuda -L dependency=/tmp/xargo.YgwtgUACXnwK/target/nvptx64-nvidia-cuda/release/deps -L dependency=/tmp/xargo.YgwtgUACXnwK/target/release/deps --sysroot /home/georg/.xargo` (exit code: 1)
error: `"cargo" "build" "--release" "--manifest-path" "/tmp/xargo.YgwtgUACXnwK/Cargo.toml" "--target" "nvptx64-nvidia-cuda" "-v" "-p" "core"` failed with exit code: Some(101)
stack backtrace:
   0:     0x55ae51a8dd1d - backtrace::backtrace::trace::h0d0ee87a30cd6975
   1:     0x55ae51a8e202 - backtrace::capture::Backtrace::new::hb5a725a088a2a2fc
   2:     0x55ae51a82b66 - error_chain::make_backtrace::h3d048cb120b8c1ea
   3:     0x55ae51a82c18 - <error_chain::State as core::default::Default>::default::h872828f66aa5352f
   4:     0x55ae51a7314e - xargo::sysroot::build::hc3480546d64ef68b
   5:     0x55ae51a785b2 - xargo::sysroot::update::h876ac002099ed432
   6:     0x55ae51a80285 - xargo::run::h7a335e8f14257f91
   7:     0x55ae51a7d1ed - xargo::main::h219698d15c956fce
   8:     0x55ae51acb8aa - panic_unwind::__rust_maybe_catch_panic
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
   9:     0x55ae51ac51a6 - std::panicking::try<(),fn()>
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:436
                         - std::panic::catch_unwind<fn(),()>
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panic.rs:361
                         - std::rt::lang_start
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/rt.rs:57
  10:     0x7f8c1d9bb82f - __libc_start_main
  11:     0x55ae51a638e8 - _start
  12:                0x0 - <unknown>

I also tried older nightly versions - same error.

active toolchain
----------------

nightly-2017-01-04-x86_64-unknown-linux-gnu (directory override for '/home/georg/stoff/rust/kernel')
rustc 1.16.0-nightly (468227129 2017-01-03)

Is Linux a supported platform or do you have any hints on how to get the minimal example to compile?
(next step would be compiling your examples - I am interested in matrix multiplication on the GPU)

Thanks a lot for your work!

kind regards georg

for the record: the command from the xargo repo works

xargo build --target thumbv6m-none-eabi
  Compiling core v0.0.0 (file:///home/georg/.rustup/toolchains/nightly-2017-01-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore)
   Finished release [optimized] target(s) in 14.45 secs
  Compiling kernel v0.1.0 (file:///home/georg/stoff/rust/kernel)
warning: function is never used: `foo`, #[warn(dead_code)] on by default
--> src/lib.rs:3:1
 |
3 | fn foo() {}
 | ^^^^^^^^^^^

   Finished debug [unoptimized] target(s) in 0.8 secs


master fails to compile

https://travis-ci.org/gnzlbg/nvptx#L484

main
+targets=(nvptx-nvidia-cuda nvptx64-nvidia-cuda)
+local targets
+local toml=kernel/Cargo.toml
+for target in '${targets[@]}'
+cargo clean --manifest-path kernel/Cargo.toml
+xargo rustc --manifest-path kernel/Cargo.toml --release --target nvptx-nvidia-cuda -- --emit=asm
    Updating git repository `https://github.com/japaric/core64`
   Compiling core v0.0.0 (https://github.com/japaric/core64#e2433188)
error[E0522]: definition of an unknown language item: `send`.
  --> /home/travis/.cargo/git/checkouts/core64-98c99607e3b29655/e243318/marker.rs:44:1
   |
44 | / pub unsafe trait Send {
45 | |     // empty.
46 | | }
   | |_^
error: aborting due to previous error
error: Could not compile `core`.

core64 cannot be compiled due to the lack of `str_eq` on current nightly

The procedure in README cannot be compiled using current nightly (2017/11/05):

%xargo rustc --target nvptx64-nvidia-cuda -- --emit=asm
    Updating git repository `https://github.com/japaric/core64`
   Compiling core v0.0.0 (https://github.com/japaric/core64#202f5ca8)
error[E0522]: definition of an unknown language item: `str_eq`.
    --> /home/myname/.cargo/git/checkouts/core64-98c99607e3b29655/202f5ca/str/mod.rs:1379:1
     |
1379 | / fn eq_slice(a: &str, b: &str) -> bool {
1380 | |     a.as_bytes() == b.as_bytes()
1381 | | }
     | |_^

error: aborting due to previous error

This seems to be due to rust-lang/rust#44658
Actually, we can compile using nightly-2017-09-01, for example.

Examples throw Error E0225

Compiling the kernels in the examples directory using

xargo rustc \
    --manifest-path kernel/Cargo.toml \
    --release \
    --target nvptx64-nvidia-cuda \
    -- --emit=asm

I get the following error during the compilation of core64:

error[E0225]: only Send/Sync traits can be used as additional traits in a trait object
   --> C:\Users\Reiner\.cargo\git\checkouts\core64-98c99607e3b29655\0e29675\any.rs:133:27
    |
133 | impl fmt::Debug for Any + Send {
    |                           ^^^^ non-Send/Sync additional trait

error[E0225]: only Send/Sync traits can be used as additional traits in a trait object
   --> C:\Users\Reiner\.cargo\git\checkouts\core64-98c99607e3b29655\0e29675\any.rs:244:10
    |
244 | impl Any+Send {
    |          ^^^^ non-Send/Sync additional trait

error: aborting due to 2 previous errors

Any idea what I am doing wrong?
