rocm / hip Goto Github PK

View Code? Open in Web Editor NEW

3.6K 143.0 521.0 20.58 MB

HIP: C++ Heterogeneous-Compute Interface for Portability

Home Page: https://rocmdocs.amd.com/projects/HIP/

License: MIT License

Shell 1.03% CMake 9.07% C 12.84% C++ 75.52% Vim Script 1.55%

hip hip-runtime hip-portability hip-kernel-language hipify cuda

hip's Introduction

What is this repository for?

HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code.

Key features include:

HIP is very thin and has little or no performance impact over coding directly in CUDA mode.
HIP allows coding in a single-source C++ programming language including features such as templates, C++11 lambdas, classes, namespaces, and more.
HIP allows developers to use the "best" development environment and tools on each target platform.
The HIPIFY tools automatically convert source from CUDA to HIP.
Developers can specialize for the platform (CUDA or AMD) to tune for performance or handle tricky cases.

New projects can be developed directly in the portable HIP C++ language and can run on either NVIDIA or AMD platforms. Additionally, HIP provides porting tools which make it easy to port existing CUDA codes to the HIP layer, with no loss of performance as compared to the original CUDA application. HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port.

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard versionchanges, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated.AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.THIS INFORMATION IS PROVIDED ‘AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Repository branches

The HIP repository maintains several branches. The branches that are of importance are:

develop branch: This is the default branch, on which the new features are still under development and visible. While this maybe of interest to many, it should be noted that this branch and the features under development might not be stable.
Main branch: This is the stable branch. It is up to date with the latest release branch, for example, if the latest HIP release is rocm-4.3, main branch will be the repository based on this release.
Release branches. These are branches corresponding to each ROCM release, listed with release tags, such as rocm-4.2, rocm-4.3, etc.

Release tagging

HIP releases are typically naming convention for each ROCM release to help differentiate them.

rocm x.yy: These are the stable releases based on the ROCM release. This type of release is typically made once a month.*

More Info

Installation
HIP FAQ
HIP Kernel Language
HIP Porting Guide
HIP Porting Driver Guide
HIP Programming Guide
HIP Logging
Building HIP From Source
HIP Debugging
HIP RTC
HIP Terminology (including Rosetta Stone of GPU computing terms across CUDA/HIP/OpenCL)
HIPIFY
Supported CUDA APIs:
- Runtime API
- Driver API
- cuComplex API
- Device API
- cuBLAS
- cuRAND
- cuDNN
- cuFFT
- cuSPARSE
Developer/CONTRIBUTING Info
Release Notes

How do I get set up?

See the Installation notes.

Simple Example

The HIP API includes functions such as hipMalloc, hipMemcpy, and hipFree. Programmers familiar with CUDA will also be able to quickly learn and start coding with the HIP API. Compute kernels are launched with the "hipLaunchKernelGGL" macro call. Here is simple example showing a snippet of HIP API code:

hipMalloc(&A_d, Nbytes);
hipMalloc(&C_d, Nbytes);

hipMemcpy(A_d, A_h, Nbytes, hipMemcpyHostToDevice);

const unsigned blocks = 512;
const unsigned threadsPerBlock = 256;
hipLaunchKernelGGL(vector_square,   /* compute kernel*/
                dim3(blocks), dim3(threadsPerBlock), 0/*dynamic shared*/, 0/*stream*/,     /* launch config*/
                C_d, A_d, N);  /* arguments to the compute kernel */

hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost);

The HIP kernel language defines builtins for determining grid and block coordinates, math functions, short vectors, atomics, and timer functions. It also specifies additional defines and keywords for function types, address spaces, and optimization controls (See the HIP Kernel Language for a full description). Here's an example of defining a simple 'vector_square' kernel.

template <typename T>
__global__ void
vector_square(T *C_d, const T *A_d, size_t N)
{
    size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
    size_t stride = blockDim.x * gridDim.x;

    for (size_t i=offset; i<N; i+=stride) {
        C_d[i] = A_d[i] * A_d[i];
    }
}

The HIP Runtime API code and compute kernel definition can exist in the same source file - HIP takes care of generating host and device code appropriately.

HIP Portability and Compiler Technology

HIP C++ code can be compiled with either,

On the NVIDIA CUDA platform, HIP provides header file in the repository hipother which translate from the HIP runtime APIs to CUDA runtime APIs. The header file contains mostly inlined functions and thus has very low overhead - developers coding in HIP should expect the same performance as coding in native CUDA. The code is then compiled with nvcc, the standard C++ compiler provided with the CUDA SDK. Developers can use any tools supported by the CUDA SDK including the CUDA profiler and debugger.
On the AMD ROCm platform, HIP provides a header and runtime library built on top of HIP-Clang compiler in the repository Common Language Runtime (CLR). The HIP runtime implements HIP streams, events, and memory APIs, and is a object library that is linked with the application. The source code for all headers and the library implementation is available on GitHub. HIP developers on ROCm can use AMD's ROCgdb for debugging and profiling.

Thus HIP source code can be compiled to run on either platform. Platform-specific features can be isolated to a specific platform using conditional compilation. Thus HIP provides source portability to either platform. HIP provides the hipcc compiler driver which will call the appropriate toolchain depending on the desired platform.

Examples and Getting Started

A sample and blog that uses any of HIPIFY tools to convert a simple app from CUDA to HIP:

cd samples/01_Intro/square
# follow README / blog steps to hipify the application.

Guide to Porting a New Cuda Project

More Examples

The GitHub repository HIP-Examples contains a hipified version of benchmark suite. Besides, there are more samples in Github HIP samples, showing how to program with different features, build and run.

Tour of the HIP Directories

include:
- hip_runtime_api.h : Defines HIP runtime APIs and can be compiled with many standard Linux compilers (GCC, ICC, CLANG, etc), in either C or C++ mode.
- hip_runtime.h : Includes everything in hip_runtime_api.h PLUS hipLaunchKernelGGL and syntax for writing device kernels and device functions. hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions.
- amd_detail/** , nvidia_detail/** : Implementation details for specific platforms. HIP applications should not include these files directly.
bin: Tools and scripts to help with hip porting
- hipcc : Compiler driver that can be used to replace nvcc in existing CUDA code. hipcc will call nvcc or HIP-Clang depending on platform and include appropriate platform-specific headers and libraries.
- hipconfig : Print HIP configuration (HIP_PATH, HIP_PLATFORM, HIP_COMPILER, HIP_RUNTIME, CXX config flags, etc.)
docs: Documentation - markdown and doxygen info.

Reporting an issue

Use the GitHub issue tracker. If reporting a bug, include the output of "hipconfig --full" and samples/1_hipInfo/hipInfo (if possible).

hip's People

Contributors

Stargazers

Watchers

Forkers

sutt0n kumsh wiltonlazary mrg7 leimingyu yan-ming codeaudit adakite zjudzl merlinzone varunnagpaal guacamoleo jkelling raxopsrc jcbaratault rjmcguire waqasm86 chenminacm ekondis neelmcw steipete kshitijl jackbro yanlinaung ss0mes yc-collection templeblock fredlang xixihaha369300 kukiman010 frostmn brainiarc7 soledad89 maxkarel benjamesbabala hoojaoh havenbreaker erazerbrecht wanjinchang minganlin ezhangle caomw amos-zq alexvlx huamichaelchen lunochod fireae 4skynet honggui iamsile gjtjx rossanag pfultz2 hongyunnchen msaoyleh apanqasem alexeyab tsingjinyun vrajeshlm mlebeane momo-iah yaochx kalaluthien yhoazk homercourage patflick jgoldsamd jakirkham sriharikarnam nagyistge mangupta gargrahul sunway513 bensander scchan syifan mikalaidrabovich jameslinus wsttiger ebosenbacher boradeanup nixworks phani544 quadcannon aaronenyeshi solertis parallelo harishkool praveenmunagapati victorv pzins atifs wahello cheinger subashp jjykh zhyj3038 quanfuw forkbackups nazpyro

hip's Issues

kernel oops when running hip kernel with dev branch ROCR/ROCK

I was able to do the tutorial on gpuopen.com but found that hipGetDeviceCount was only returning 1 so the examples would only run on my primary GPU a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here:

/*
  * Build a table of valid compute devices.
  */
 auto accs = hc::accelerator::get_all();
 int deviceCnt = 0;
 for (int i=0; i<accs.size(); i++) {
     if (! accs[i].get_is_emulated()) {
         deviceCnt++;
     }
 };
 -
 +    printf("actual device count is %d\n", deviceCnt);
 // Make sure the hip visible devices are within the deviceCnt range
 for (int i = 0; i < g_hip_visible_devices.size(); i++) {
     if(g_hip_visible_devices[i] >= deviceCnt){
         // Make sure any DeviceID after invalid DeviceID will be erased.
         g_hip_visible_devices.resize(i);
         break;
     }
 }

But I can't even get it to compile:
~/devel/HIP2$ make
./bin/hipcc -I/opt/hcc/include -std=c++11 -I/opt/hsa/include src/hip_hcc.cpp -c -O3 -o src/hip_hcc.o
src/hip_hcc.cpp:52:2: error: #error (USE_AM_TRACKER requries HCC version of 16074 or newer)
#error (USE_AM_TRACKER requries HCC version of 16074 or newer)
^
Died at ./bin/hipcc line 208.
Makefile:20: recipe for target 'src/hip_hcc.o' failed
make: *** [src/hip_hcc.o] Error 1

I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc:
mmacy@pandemonium:~/devel/HIP2$ hipcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

hipify tensorflow?

Hi.

I'm curious if anyone has tried doing this, or is working on doing so...

If not, how, generally, could I accomplish this?

What are the requirements for using `shfl` operations on AMD GPU?

There are requirements to use shfl operations on nVidia GPU: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia

requirement for nvidia

please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add -gencode arch=compute=30, code=sm_30 nvcc flag in the Makefile while using this application.

Also noted that HIP supports shfl for 64 wavesize (WARP-size) on AMD: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly

In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.

But which of AMD GPUs does support functions shfl, or does any AMD GPU support shfl because on AMD GPU it implemented by using Local-memory without hardware instruction register-to-register?

hipMemcpyAsync works incorrectly

Seems like the update yesterday brings some incorrect behaviors to hipMemcpyAsync and streams?

system configurations:

HIP: 276ca7c4
HCC:  HCC 0.10.16102-064fdb1-f002ffd 
ROCK: 2126ace1317 (dev)
ROCR: 85ad07b (dev)

The following test code will fail with mismatched results.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include "hip_runtime.h"

#define HIP_ASSERT(x) (assert((x)==hipSuccess))


#define WIDTH     1024
#define HEIGHT    1024

#define NUM       (WIDTH*HEIGHT)

#define THREADS_PER_BLOCK_X  16
#define THREADS_PER_BLOCK_Y  16
#define THREADS_PER_BLOCK_Z  1

int main() {

  int *hostA;
  int *hostB;

  int *deviceA;
  int *deviceB;

  int i;
  int errors;

  hostA = (int *)malloc(NUM * sizeof(int));
  hostB = (int *)malloc(NUM * sizeof(int));

  // initialize the input data
  for (i = 0; i < NUM; i++) {
    hostB[i] = i;
  }

  HIP_ASSERT(hipMalloc((void**)&deviceA, NUM * sizeof(int)));
  HIP_ASSERT(hipMalloc((void**)&deviceB, NUM * sizeof(int)));

  hipStream_t s;
  hipStreamCreate(&s);


  // hostB -> deviceB -> hostA
#define ASYNC 1
#if ASYNC
  HIP_ASSERT(hipMemcpyAsync(deviceB, hostB, NUM*sizeof(int), hipMemcpyHostToDevice, s));
  HIP_ASSERT(hipMemcpyAsync(hostA, deviceB, NUM*sizeof(int), hipMemcpyDeviceToHost, s));
#else
  HIP_ASSERT(hipMemcpy(deviceB, hostB, NUM*sizeof(int), hipMemcpyHostToDevice));
  HIP_ASSERT(hipMemcpy(hostA, deviceB, NUM*sizeof(int), hipMemcpyDeviceToHost));
#endif

  hipStreamSynchronize(s);
  hipDeviceSynchronize();

  // verify the results
  errors = 0;
  for (i = 0; i < NUM; i++) {
    if (hostA[i] != (hostB[i])) {
      errors++;
    }
  }
  if (errors!=0) {
    printf("FAILED: %d errors\n",errors);
  } else {
    printf ("PASSED!\n");
  }

  hipStreamDestroy(s);

  HIP_ASSERT(hipFree(deviceA));
  HIP_ASSERT(hipFree(deviceB));

  free(hostA);
  free(hostB);

  //hipResetDefaultAccelerator();

  return errors;
}

hipEventDestroy(NULL) crashes

The CUDA runtime returns invalid-resource-handle. (It does not crash.)

(In retrospect, as Tommy Thorn pointed out back in the day, we screwed up the interfaces for the CUDA runtime. free() returns void and free(NULL) is valid and has no side effects. Really we should have echoed those API semantics in the CUDA runtime.)

Incorrect behavior of hipCreateChannelDesc and hipMallocArray

Current HIP-HCC path doesn't support channelDesc with compound types like float2 or float4 [1]. Consequently, the memory allocation and initialization of a hipArray bound on a float4 texture failed.

Consider the following code:

#include <hip_runtime.h>                                                        
#include <stdio.h>                                                              

template<typename T>                                                            
void f(void) {                                                                  
  hipChannelFormatDesc desc = hipCreateChannelDesc<T>();                        
  printf("%d ", desc.x);                                                        
  printf("%d ", desc.y);                                                        
  printf("%d ", desc.z);                                                        
  printf("%d ", desc.w);                                                        
  printf("%d \n", desc.f);                                                      
}                                                                               

int main () {                                                                   
  f<float>();                                                                   
  f<float2>();                                                                  
  f<float4>();                                                                  
  return 0;                                                                     
}

On cuda platform, it prints

32 0 0 0 2 
32 32 0 0 2 
32 32 32 32 2

On HSA platform, it prints

32 0 0 0 2 
0 0 0 0 3  
0 0 0 0 3

In hipMallocArray, it only consider the hipChannelFormatKind of the array but not the dimensions of the channeldesc, this might be incorrect [2].

[1] https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/developer-preview/include/hcc_detail/hip_texture.h#L226
[2] https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/developer-preview/src/hip_memory.cpp#L227

Implement stream-level control over blocking/active sync

Support this flag: hipDeviceScheduleBlockingSync

Add flags definitions to HCC and NVCC paths.

On HCC side:
save flag with the ihipDevice_t structure (in _flags) defined in include/hcc_detail.h

Inside streamSynchronize, use hc::hcWaitModeBlocked : hc::hcWaitModeActive based on state of the parent device associated with this stream.

functions that differ only in their return type cannot be overloaded * 5

Hi,

I'm running Crunchbang++ (debian jessie kernel 4.7pf)

I'm trying out hcc built from this script (build_hcc_and_hip.sh - can't remember where I found it now)
It's probable I'm doing something wrong or have forgot to do something, but with hcc/hip/llvm trying to build a simple "hipified" (slightly modified) cuda example, you can see here

I get the following error

build error of fp16 in developer-preview branch

I am building the "developer-preview branch" of HIP. My hcc compiler is "HCC clang version 3.5.0 (based on HCC 0.10.16501-81f0a2f-02246a0 LLVM 3.5.0svn)" based on rocm-1.4.

I see the compiler error:
"/home/tim/HIP/src/hip_fp16.cpp:37:12: error: invalid operands to binary expression ('__half' and '__half')"
return a + b;
~ ^ ~

Kernel with no parameters

A kernel with no parameters (from cudahandbook/concurrency/nullKernelSync.cu, for example) generates the appended compile error.

Adding a dummy int parameter fixes the problem.

/opt/rocm/hip/bin/hipcc -stdlib=libc++ -Wnull-character -I ../chLib -D NO_CUDA nullKernelSync.cpp -o nullKernelSync
nullKernelSync.cpp:67:9: error: expected expression
hipLaunchKernel(HIP_KERNEL_NAME(NullKernel), dim3(1), dim3(1), 0, 0, );
^
/opt/rocm/hip/include/hip/hcc_detail/hip_runtime.h:541:33: note: expanded from
macro 'hipLaunchKernel'
_kernelName (lp, ##VA_ARGS);
^
1 error generated.
Died at /opt/rocm/hip/bin/hipcc line 269.

cuda math intrinsic functions support

Hello,

I found that the math intrinsics on cuda, such as __float2int_rn, are not supported by hipcc. Can you confirm this? If not, how to make it work?

Best,
Leiming

p.s. Here is a snippet of the error message I got by running the following command with rocm.

$ /opt/rocm/hip/bin/hipcc  -I/opt/rocm/hip/include  -std=c++11 -fopenmp -DUSE_ATOMIC  -DSAVE_DETECTORS -DUSE_CACHEBOX  -DMCX_TARGET_NAME='"Fermi MCX"' -o mcx_core.o  mcx_core.cu
mcx_core.cu:182:30: error: use of undeclared identifier '__float2int_rn'
      xi[0] = mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f));

hipLaunchKernel alone does not work

The following simple test fails with no device found.
a.out: /srv/git/HIP/src/hip_hcc.cpp:1019: ihipDevice_t *ihipGetTlsDefaultDevice(): Assertion `ihipIsValidDevice(tls_defaultDevice)' failed.
Aborted (core dumped)

Adding some other code, for example:
void* mem;
hipMalloc(&mem, 1024);
to the beginnig makes it work. The problem seems to be that hipLaunchKernel does not initialize HIP runtime.

include "hip_runtime.h"

global void empty_hip_kernel(hipLaunchParm lp, int param)
{
}

int main(int argc, const char** argv)
{
hipLaunchKernel(HIP_KERNEL_NAME(empty_hip_kernel), dim3(1), dim3(1), 0, 0, 0);
hipDeviceSynchronize();
return 0;
}

function "hipDeviceGetPCIBusId" has already been defined

Error building HIP/samples/2_Cookbook/0_MatrixTranspose:
# make
results in
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(778): error: function "hipDeviceGetPCIBusId" has already been defined

Dockerfile for developer image:

FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04

RUN \
	apt-get update && \
	apt-get install -y wget

RUN \
	wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | apt-key add - && \
	sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list' && \
	apt-get update && \
	DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y hip_nvcc
ENV PATH /opt/rocm/bin/:$PATH

Unions passed as kernel parameters have incorrect values

Here is my test case:
structs-test.cpp.txt

hipcc structs-test.cpp -o structs-test
./structs-test

Output:

Expected output:

I pass 3 structs: V1 - simple, V2 and V3 - with unions. V1 works as expected, fields of V2 and V3 contain incorrect values (in my case at least, they contain value of other parameter p0).

These structs are based on this type: https://github.com/g-truc/glm/blob/master/glm/detail/type_vec3.hpp#L43
It works correctly on CUDA (I mean, not HIP-nvcc), and IIRC worked on ROCm 1.4 with a few modifications of glm.
Perhaps, my test case will work on 1.4 too, but I can't check right now.

If this is not a bug (in case HIP/HCC do not support unions as kernel params), then perhaps a warning should be shown.

Java bindings

Hi, I am cross-posting this request in the JCUDA repo as well. (jcuda/jcuda#5)

JCUDA is a set of Java bindings for the CUDA API (both Driver API and the parts of Runtime API) and associated libraries (cuBLAS, cuDNN, etc.). It's simple, straight-forward, and works very well with no performance impact.

I'd love to jump into HIP by converting my existing JCUDA code. Is there any chance you could collaborate with the JCUDA project or create an analogous library?

Add performance comparison with OpenCL

I've been looking for a performance comparison of using using HCC (through HIP) vs OpenCL. But I haven't been able to find any 😟

I see there is an issue for porting the PARBOIL benchmark suite, but that is not completed yet.

The Rodinia benchmark suite seems to be ported (as mentioned in the README), but I can't find any performance graphs. The same goes for mixbench and GPU-Stream mentioned in the HIP-Examples repo.

It would be great to have performance graphs readily available. It could also be fun to have a server run such benchmarks every week or so, as rust does: http://perf.rust-lang.org/

keep up the good work 👍

sqrtf and sqrt in device code cannot be compiled

To reproduce replace line 46 (https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/samples/0_Intro/square/square.hipref.cpp#L46)
with

        C_d[i] = sqrtf(A_d[i]);

hipcc square.hipref.cpp -o square.hip.out fails with

Referencing function in another module!
  %34 = call float @sqrtf(float %33) #5
LLVM ERROR: Broken function found, compilation aborted!
Generating HSAIL BRIG kernel failed

Full log: hipcc-sqrt-log.txt

I don't know about other functions but the same error for sqrt(double).
But, for example, this can be compiled:

        C_d[i] = __fsqrt_rn(A_d[i]);

Checked on:

HIP version: 1.0.16503
HCC clang version 3.5.0  (based on HCC 0.10.16501-81f0a2f-02246a0 LLVM 3.5.0svn)

BLAS and other Maths library support?

HIP is a really nice effort to have a transparent C++ code for both Nvidia and AMD graphic cards. One important feature, however, for GPGPU is BLAS routines. I saw that there is an hcBLAS implementation. I was wondering if you guys intend to support a hipBLAS, which to dispatch the BLAS call to the appropriate hardware calling the corresponding library, or that is not the case?

Transition HCC path to use async mem copy APIs

hsa_memory_copy_async and friends which allow specifying input signal dependency. This is also the stepping stone towards HCC support for async streams.

hipFree returns error code when 0/null-pointer is passed.

I don't know how common the practice is. But Udacity homework code calls cudaFree(0) before doing anything else and hipified version returns error code and cause the program to exit with an error. If this practice is common among CUDA software, hipFree should return success code and make it behave more like cudaFree. In C, calling free with 0/null-pointer does not cause error either.

See line 27 of https://github.com/udacity/cs344/blob/master/Problem%20Sets/Problem%20Set%201/HW1.cpp

no matching function for call to 'hipHostGetDevicePointer'

Hello,

Is hipHostGetDevicePointer supported in hipcc?

Here is my experiment.
The progress is allocated on the host using pinned memory.

CUDA_ASSERT(hipHostMalloc((void **)&progress, sizeof(int), hipHostMallocMapped))

However, when I try to get the device pointer, the following error shows.

mcx_core.cu:1301:18: error: no matching function for call to 'hipHostGetDevicePointer'
     CUDA_ASSERT(hipHostGetDevicePointer((int **)&gprogress, (int *)progress, 0));

Hipifying a Cuda file that has a call to a reduction function

Hi!

I'm using a CUDA code that calls a function from the NVIDIA CUB library.

This is the code:

    // Determine temporary device storage requirements
    void     *d_temp_storage = NULL;
    size_t   temp_storage_bytes = 0;

    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);

    // Allocate temporary storage
    CHECK(cudaMalloc(&d_temp_storage, temp_storage_bytes));

    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in_d, sum_d, N);

'in_d' is the input array with the values to be reduced, 'sum_d' is an array with a single position to include the result of the complete reduction of the array, and 'N' is the number of elements in the array that are to be reduced.

Is there any equivalent HIP library function for Sum reduction that I can use on an AMD Fiji card (AMD R9 Nano)?

Equivalent to mad24 call in opencl ?

Is run-time compilation supported/planned?

Basically I'm wondering if HIP has an alternative to NVRTC for compiling at runtime? Although NVRTC is somewhat hacky, it is an immensely useful tool at times, and an alternative for HIP would be great for targeting AMD / portable code.

hipStreamAddCallback compile error

Hi,

I got the following error when trying to compile the sample square code:

/opt/rocm/hip/bin/hipcc square.hipref.cpp -o square.hip.out
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(588): error: type name is not allowed

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(588): error: expected a ")"

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(588): error: too few arguments in function call

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(589): error: type name is not allowed

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(589): error: expected a ")"

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(589): error: too many arguments in function call

/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h(589): error: expected a ";"

7 errors detected in the compilation of "/tmp/tmpxft_0000396e_00000000-9_square.hipref.cpp1.ii".
Died at /opt/rocm/hip/bin/hipcc line 365.
make: *** [square.hip.out] Error 2

The line 588 and 589 of hip_runtime_api.h is:

inline static hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void *userData, unsigned int flags)
{
return *hipCUDAErrorTohipError(cudaStreamAddCallback(cudaStream_t stream,
cudaStreamCallback_t callback, void userData, unsigned int flags));
}

Shouldn't brackets to be add to each type name here?

Thanks,

Missing API hipEventDisableTiming for nvcc path

hipEventDesableTiming is detined in hcc_detail/hip_runtime_api.h but not in nvcc_detail/hip_runtime_api.h. Need a substitution from hipEventDisableTiming to cudaEventDisableTiming

Compilation error for oceanFFT + HIP

Hello! I have compilation error after code hiplification for the sample oceanFFT ( /usr/local/cuda/samples/5_Simulations/oceanFFT ):

../../common/inc/helper_cuda.h:990:81: error: no matching function for call to ‘_cudaGetErrorEnum(hipError_t&)’
file, line, static_cast(result), _cudaGetErrorEnum(result), func);

What am i doing wrong or how fix it manually?

Missing update documentation

There is build_hcc_and_hip.sh and good installation notes, but for updating there is a lack of documentation. Just read https://github.com/ROCm-Developer-Tools/HIP/releases/tag/preview_1.0.17241 and you see all is oriented towards new users for installation.

We don't want to get to a situation where we should had done something different. And while you're on it, please also explain how to test a new version - switching between multiple versions is really required.

Add a list of supported hardware to README.md

This would be handy to have.

Implement async streams

Requires async copy.

missing hipArray APIs in nvcc path

hipArray APIs, such as hipArray, hipMallocArray, ... etc are missing in nvcc path. Need to map them in include/nvcc_details/hip_runtime_api.h.

Finding good flags for the 'hipcc' compiler

What parameters should I pass to 'hipcc' so that it will most likely result in the fastest possible device code being generated?

For instance, with 'nvcc' (NVIDIA's compiler) I use '--use-fast-math' wich sets '--prec-div=false' and '--prec-sqrt=false', so that the resulting compiled device code is faster as the cost of precision.

Does the 'hipcc' '-ffast-math' accomplish the same thing?

hipblas.h not found

Hi, I just wanted to play around with the hip samples that come with rocm 1.4. I went into samples/7_Advanced/hipblas_saxpy and called make and received this:

saxpy.hipblasref.cpp:9:10: fatal error: 'hipblas.h' file not found
#include <hipblas.h>
         ^
1 error generated.
Died at /opt/rocm/hip/bin/hipcc line 378.

related to this, I was wondering how to use hipBlas, hipFFT, hipRNG?

error: expected top-level entity

Hello,
I am trying to compile monte carlo photon migration in cuda (mcx) on rocm.

I have cleared all the warnings, but am confronted by the following error. 

Any suggestions?

leiming@lowfreq:~/mcx_no_symbol/src$ /opt/rocm/hip/bin/hipcc -I/opt/rocm/hip/include  -std=c++11  -DUSE_ATOMIC  -DSAVE_DETECTORS -DUSE_CACHEBOX  -DMCX_TARGET_NAME='"Fermi MCX"'  mcx_core.cpp  -o mcx_core.o 
/opt/rocm/hcc-lc/compiler/bin/llvm-as: /tmp/tmp.U4yAIWtXHr/mcx_core-c8f8d4.o.host_redirect.ll:1:1: error: expected top-level entity
/opt/rocm/hcc-lc/compiler/bin/opt: <stdin>:5768:9: error: stored value and pointer type do not match
^
clang-3.5: error: no such file or directory: '/tmp/tmp.U4yAIWtXHr/mcx_core-c8f8d4.o.host_redirect.bc'
clang-3.5: error: no input files
objdump: 'mcx_core.o': No such file
objdump: '/tmp/mcx_core-c8f8d4.o': No such file
ld: cannot find /tmp/mcx_core-c8f8d4.o: No such file or directory
clang-3.5: error: linker command failed with exit code 1 (use -v to see invocation)
Died at /opt/rocm/hip/bin/hipcc line 378.

High-impact application of HIP

Google has contributed to a beta version of Eigen (3.3) that uses CUDA to implement their tensor operations in Tensorflow. If I understand the HIP marketing literature, it allows one to port a CUDA codebase to run on AMD GPUs whilst keeping a mostly CUDA-type codebase.

There is an effort underway to port tensorflow to an opencl back-end, but it appears to have mostly stalled out. At least from what I can tell.

Some of us tensorflowers have found that the fp16 support in Maxwell and Pascal (GP102) GPUs (maybe not GP100, but who has $120K lying around?) is crippled to some extent; runs slower than fp32. fp16, even if it's not faster, allows larger models to fit in memory. I think Vega and Polaris support native fp16. It might be a big win for those of us who would want to run fp16 to have a hippified Eigen backend in Tensorflow.

Sorry, I'm not offering to help. But in case this is news to anyone here, I thought I would pass along the thought.

Compilation problem

Hello! I try to compile caffe 1.0.0-rc3 with HIP. There are some errors like:

/home/beamoflight/caffe/caffe/src/caffe/util/im2col.cu(57): error: no instance of function template "caffe::im2col_gpu_kernel" matches the argument list
argument types are: (int, int, const float *, const int, const int, const int, const int, const int, const int, const int, const int, const int, const int, int, int, float *)
detected during instantiation of "void caffe::im2col_gpu(const Dtype *, int, int, int, int, int, int, int, int, int, int, int, Dtype *) [with Dtype=float]"
(65): here
Can you help me?

P.S. See files in attachment
im2col_cu.txt
im2col_cu_diff.txt

HIP with NVIDIA

When compiling on a machine with an AMD card I do the following and it works:
gcc -c main.c misc.c -DTYPE=float
hipcc -O2 Kernel.hip.cpp main.o misc.o -DTYPE=float -o kernel_hip_float

But when compiling a machine with an NVIDIA card (and with theCUDA toolkit installed) I get lots of 'warning: null character(s) ignored' and 'error: unrecognized token' errors (reported for the 'main.o' object file) when executing the second command.

What is the correct way of compiling and linking multiple files, where only a single file includes the HIP runtime library (in this example 'Kernel.hip.cpp')?

Hip error

I just cloned this repo. The previous versions were working but now I get an error that it requires a different version of HCC. However the HCC link given in the github page for HIP has deb/tar packages that were last updated in January. Could anyone please point me to a recent package of HCC because the ones there are outdated and I cannot find a newer version. As a result my HIP is not running. Please help.

Converting CUDA sample that uses OpenCV to HIP

I'm trying to convert Udacity parallel programming class problem sets to HIP. It uses OpenCV and it looks like it is not compatible with hcc. I reported the problem to hcc project and was told that HIP team may have a solution. Is there a way to use hcc/HIP to compile/link OpenCV proejcts?
See ROCm/hcc#47 for more information.

The issue seems to be that OpenCV requires libstdc++ and hcc requires libc++.

Thanks

Hipify PARBOIL benchmark suite

Port PARBOIL benchmark suite from CUDA to HIP.

http://impact.crhc.illinois.edu/parboil/parboil.aspx
http://impact.crhc.illinois.edu/parboil/parboil_download_page.aspx

Use logic-defined or in hipcc and hipconfig script for better readability

Hi,

In hipcc and hipconfig script, I see a lot of unless defined idioms used for setting default values of flags, such as the following:

    $HSA_PATH=$ENV{'HSA_PATH'};
    $HSA_PATH="/opt/rocm/hsa" unless defined $HSA_PATH;

This can be replaced by Perl's logic-defined-or operator //, which literally means setting the default value for Perl people.

    $HSA_PATH = $ENV{'HSA_PATH'} // "/opt/rocm/hsa";

Reference:

http://perldoc.perl.org/perlop.html#Logical-Defined-Or
http://perlmaven.com/how-to-set-default-values-in-perl

Port to windows

List of supported GPUs

There seems that in the primary release only the

Fury Series GPUs were supported.

Is there a updated list of supported GPUs?

Cannot call device class member from kernel function on HCC platform

I am not sure if this issue is a genuine HIP issue or if it originates in HCC. Because I am only using HCC through HIP, I am reporting this here.

$ hipcc --version

HCC clang version 3.5.0 (based on HCC 0.10.16186-d14f969-7461349 LLVM 3.5.0svn)
Target: x86_64-unknown-linux-gnu
Thread model: posix

The following code does not compile:

#include "hip/hip_runtime.h"

struct Foo
{
    int m_bar = 5;

    __device__ int bar(int a)
    {
        return a+m_bar;
    }
};

__global__ void kernel(hipLaunchParm lp, int a)
{
    Foo foo;
    foo.bar(a);
}

int main()
{
    hipLaunchKernel(HIP_KERNEL_NAME(kernel), dim3(1), dim3(1), 0, 0, 5);

    return 0;
}

test_direct.cpp:16:6: error: 'Foo::bar': no overloaded function has restriction specifiers that are compatible with the ambient context 'kernel'
So, it looks like the compiler either thinks, that foo.bar is not a __device__ function, or it mistakes the kernel for a host function.

Strangely, calling a non-member __device__ function, which in turn calls the member works. So the following code compiles:

#include "hip/hip_runtime.h"

struct Foo
{
    int m_bar = 5;

    __device__ int bar(int a)
    {
        return a+m_bar;
    }
};

__device__ int bar_wrapper(Foo& foo, int a)
{
    return foo.bar(a);
}

__global__ void kernel(hipLaunchParm lp, int a)
{
    Foo foo;
    bar_wrapper(foo, a);
}

int main()
{
    hipLaunchKernel(HIP_KERNEL_NAME(kernel), dim3(1), dim3(1), 0, 0, 5);

    return 0;
}

can not build hip from source

it is rocm-1.5 master branch

/home/tim/HIP/src/device_util.cpp:1219:10: error: call to 'sqrt' is ambiguous
return hc::fast_math::sqrt(x);
^~~~~~~~~~~~~~~~~~~
/opt/rocm/include/kalmar_math.h:654:16: note: candidate function
__fp16 sqrt(__fp16 x) { return __hc_sqrt_native_half(x); }
^
/opt/rocm/include/kalmar_math.h:657:15: note: candidate function
float sqrt(float x) { return fast_math::sqrtf(x); }
^
/usr/include/x86_64-linux-gnu/bits/mathcalls.h:156:13: note: candidate function
__MATHCALL (sqrt,, (Mdouble __x));
^
/usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/cmath:482:3: note: candidate function
sqrt(float __x)
^
/usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/cmath:486:3: note: candidate function
sqrt(long double __x)

HIPCC not defined in HIP

The __HIPCC__ predefined macro seems not to be defined during compilation. Here is a test case with the following source:

#include <stdio.h>

int main(int argc, char* argv[]) {
#ifdef __HIPCC__
    printf("__HIPCC__ is defined\n");
#endif
#ifdef __CUDACC__
    printf("__CUDACC__ is defined\n");
#endif
#ifdef __HCC__
    printf("__HCC__ is defined\n");
#endif
    return 0;
}

After compiling with hipcc (/opt/rocm/bin/hipcc -o test-macro -O2 -I/opt/rocm/hip/include test-macro.cpp) and running, it returns just the following output:

__HCC__ is defined

Support VISIBLE_DEVICES environment var

use an environment variable to control which devices are made visible to the application.

support comma-separated list of device IDs.
order of device ids in the env var control order in the application, ie 2,1,0 will cause HW device 2 to be listed at device0 in the app.

Implementation:

modify src/hip_hcc.cpp ihipInit to use env variable to control which of the detected devices get added to the "g_devices" list, and also control the order they get added in.

Forcing HIP runtime initialization before starting time counter

When using CUDA, and measuring time for the execution of a kernel, I'm used to call 'cudaFree(0)' before starting to count time.
The call of 'cudaFree(0)' forces the CUDA runtime to initialize before starting the time counter.

E.g.:

// So that the CUDA runtime is initialized before starting measuring time
cudaFree(0);

struct timespec t_start, t_stop;
clock_gettime( CLOCK_MONOTONIC, &t_start);
deviceReduceKernel <<<numBlocks_red, threadsPerBlock_red>>> (distmat_d, sum_perblock_d, totalNumElementsDistmat);
clock_gettime( CLOCK_MONOTONIC, &t_stop);

How could I achuive the same (force runtime intialization) when using HIP?

Using 'hipFree(0)' instead of 'cudaFree(0)' does not result in the same behavior, as the elapsed time as reported by the timer remais the same as when not calling 'hipFree(0)'.

Implement hipGetErrorString

[thanks! updated to hipGetErrorString]
Currently hipGetErrorString returns the name of the error (ie "hipErrorInvalidDevice") but really should return a full message string explaining the error.

Reference:
See src/hip_hcc.cpp for implemention of hipGetErrorString function.
See include/hip_runtime_api.h for a doxygen-format comments that explain each error.- these could form the text of the error-string message set by this function.

Mismatched API hipMemsetAsync between hcc and nvcc

The nvcc version doesn't take the last argument stream.

// nvcc/detail/hip_runtime_api.h
inline static hipError_t hipMemsetAsync(void* devPtr,int value, size_t count)

// hcc_detail/hip_runtime_api.h
#if __cplusplus
hipError_t hipMemsetAsync(void* dst, int  value, size_t sizeBytes, hipStream_t = 0 );
#else
hipError_t hipMemsetAsync(void* dst, int value, size_t sizeBytes, hipStream_t stream);
#endif

https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/hcc_detail/hip_runtime_api.h#L903

https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/nvcc_detail/hip_runtime_api.h#L222