jitify's Issues

Can't build with NVCC option '--Werror cross-execution-space-call' on Windows

Tested under Visual Studio 2015 and 2019, with CUDA 10.1/10.2.

If you try to build the example below with --Werror cross-execution-space-call enabled in Visual Studio, it will fail.

This flag promotes the warning "calling a __host__ function from a __host__ __device__ function is not allowed" to an error.

#include <cstdlib>  // for EXIT_SUCCESS

#include "jitify/jitify.hpp"

int main() {
    return EXIT_SUCCESS;
}

Under Visual Studio 2015, it produces the compilation error shown at the bottom of this issue.
TL;DR: the offending line is _ranked_keys.erase(rank);

I'm not actually sure why it's triggered in this instance, but it's upsetting one of our continuous integration builds, so it would be nice if it could be fixed without having to disable that compiler argument.

1>  C:\Users\rob\Documents\Visual Studio 2015\Projects\Jitifytest>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\nvcc.exe" -gencode=arch=compute_61,code=\"sm_61,compute_61\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -x cu -rdc=true -I"C:\Program Files (x86)\dlfcn-win32\include" -I"C:\Program Files (x86)\Visual Leak Detector\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include"  -G   --keep-dir x64\Debug -maxrregcount=0  --machine 64 --compile -cudart static --Werror cross-execution-space-call -g   -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc140.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\kernel.cu.obj "C:\Users\rob\Documents\Visual Studio 2015\Projects\Jitifytest\kernel.cu"
1>C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\xutility(2636): error : calling a __host__ function from a __host__ __device__ function is not allowed
1>            detected during:
1>              instantiation of "_OutIt std::move(_InIt, _InIt, _OutIt) [with _InIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>, _OutIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1660): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator, std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1628): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(246): here
1>              instantiation of "void jitify::ObjectCache<KeyType, ValueType>::touch(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(254): here
1>              instantiation of "jitify::ObjectCache<KeyType, ValueType>::value_type &jitify::ObjectCache<KeyType, ValueType>::get(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(2861): here
1>
1>C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\xutility(2636): error : calling a __host__ function from a __host__ __device__ function is not allowed
1>            detected during:
1>              instantiation of "_OutIt std::move(_InIt, _InIt, _OutIt) [with _InIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>, _OutIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1660): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator, std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1628): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(246): here
1>              instantiation of "void jitify::ObjectCache<KeyType, ValueType>::touch(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(254): here
1>              instantiation of "jitify::ObjectCache<KeyType, ValueType>::value_type &jitify::ObjectCache<KeyType, ValueType>::get(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(2861): here
1>
1>C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\xutility(2637): error : calling a __host__ function from a __host__ __device__ function is not allowed
1>            detected during:
1>              instantiation of "_OutIt std::move(_InIt, _InIt, _OutIt) [with _InIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>, _OutIt=std::_Deque_iterator<std::_Deque_val<std::_Deque_simple_types<jitify::JitCache_impl::key_type>>>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1660): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator, std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\deque(1628): here
1>              instantiation of "std::deque<_Ty, _Alloc>::iterator std::deque<_Ty, _Alloc>::erase(std::deque<_Ty, _Alloc>::const_iterator) [with _Ty=jitify::JitCache_impl::key_type, _Alloc=std::allocator<jitify::JitCache_impl::key_type>]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(246): here
1>              instantiation of "void jitify::ObjectCache<KeyType, ValueType>::touch(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(254): here
1>              instantiation of "jitify::ObjectCache<KeyType, ValueType>::value_type &jitify::ObjectCache<KeyType, ValueType>::get(const jitify::ObjectCache<KeyType, ValueType>::key_type &) [with KeyType=jitify::JitCache_impl::key_type, ValueType=jitify::detail::CUDAKernel]"
1>  c:\users\rob\documents\visual studio 2015\projects\jitifytest\jitify/jitify.hpp(2861): here
1>
1>  3 errors detected in the compilation of "C:/Users/rob/AppData/Local/Temp/tmpxft_000033b4_00000000-7_kernel.cpp1.ii".
1>  kernel.cu

(Note: jitify.hpp(2861), shown in the log above, is an incorrect line number; the length of jitify.hpp confuses the Visual Studio compiler for some reason. I previously looked into this and couldn't find an explicit cause, beyond observing that breaking the file up into smaller files corrects the line numbers.)

Questions: what can be done asynchronously?

Hey folks,

This issue is a question; feel free to close it once answered, no code needed.

I'm trying to use Jitify asynchronously. I have a pipeline that asynchronously creates program templates to hand off to Jitify (or Cling), launches a generic version of the to-be-jitted kernels while waiting, and then swaps in the NVRTC products when they're available. Jitify prints the PTX it asynchronously generates (yay) and then errors with
CUDA_ERROR_INVALID_CONTEXT (aww).

I'm assuming this is because I'm creating a KernelLauncher on a different thread than the one in which I wish to execute it. std::async launches tasks on another thread, and if Jitify picks up the default CUDA context for that thread, I don't know what happens when the KernelLauncher gets returned to a different thread with a different CUDA context. My questions are:

  1. Is there a way to pass a CUDA context to the KernelInstantiation to create the KernelLauncher under that context?
  2. If not, how far down the Jitify stack can I go on another thread before CUDA contexts become relevant? My intuition is that I can instantiate a kernel, I just can't configure it, but let me know if I'm wrong there.

Thanks again for your help!

GTC Talk link is dead

The GTC portal no longer has a link to the jitify GTC talk (404). This is a shame, as it's a great intro to the tool.

Jitify should not print warnings to stdout by default

Jitify prints logs to stdout by default because JITIFY_PRINT_LOG is defined to 1 by default.

jitify/jitify.hpp

Lines 2817 to 2821 in 3e96bcc

#if JITIFY_PRINT_LOG
  std::cout << include_parent << "(" << line_num
            << "): warning: " << include_name << ": [jitify] File not found"
            << std::endl;
#endif

I think it really should print warnings and errors to stderr by default, and ideally provide a way to override the log destination so output can be sent to a file if desired.

See rapidsai/cudf#6117

Failure to compile with XLC

XLC isn't the most robust C++11 compiler, so perhaps not surprising.

jitify.hpp:2115:8: error: expression contains unexpanded parameter pack
      'args'
return (this->launch(std::vector< void *> ({((void *)(&args))}), {reflection::reflect< ArgTypes> ()...})); 

@alexstrel

Does jitify support CUDA 11.x?

I really appreciate you all providing this nice header; it's really helpful. But it seems to have stopped receiving updates for a long time. Does it support the latest CUDA version? Thanks!

(jitify2) Serialisation post-NVRTC

Within jitify2, the latest point at which serialisation is implemented appears to be PreprocessedProgram. Hence, after deserialisation it still requires compilation via NVRTC. In contrast, jitify1 serialises the PTX blob output by NVRTC.

The jitify2 approach is presumably more portable; however, in our use case we are serialising to memory and/or /tmp/, so there is no portability requirement. Furthermore, with our large compilation units, the serialised output is ~2x bigger and the time to load it is ~50x slower.

A quick test with CUDA 12.3

         Serialised size (KB)   Serialised load time (ms)
jitify1  1200-1800              12-31
jitify2  3200-3300              685-1372

Is there a reason post-NVRTC serialisation is no longer present, or am I mistaken?
Is it on your roadmap, and would you be happy with me submitting a PR?

Clarification of implicit conversion requirements

Hey folks,

Sorry for the bug-a-day pace; you just made too useful a product. This is the issue I'm iffiest about filing. If nothing immediately pops out at you, let me know and I'll try to get you a reproducer; it's entirely possible this is on our side.

I'm reading this, which mentions that implicit conversions with variadic packs will likely result in a segfault. I'm seeing such a segfault, though my stack trace goes through this version of launch, which I believe is the one it's supposed to use.

I can generate one of two functions:

header
__global__ void jitify_example_cu112_0(int specialization, int* d_array,
                                       int debug_do_not_merge = 0) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < 1024) {
    for (int k = 0; k < 8; k++) {
      d_array[i] += specialization * specialization * specialization;
    }
  }
}

Or I can elide specialization as a constant:

header
__global__ void jitify_example_cu112_0(int* d_array, int debug_do_not_merge = 0) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < 1024) {
    for (int k = 0; k < 8; k++) {
      d_array[i] += 512;
    }
  }
}

The launch happens a little like this:

static jitify::KernelLauncher* launcher;
template<class... Args>
void launch_assist(Args... args){
  launcher->launch(args...);
}

From that, I successfully create a program, instantiation, ..., launcher. I'm invoking the launcher with an "invoke" pattern: I pack my args into a tuple and then call camp::invoke (a lot like std::apply, but more nvcc-friendly). The first path, in which two real arguments are passed, crashes. The second succeeds. An equivalent operation piped through Cling seems to work.

Any experience with expanding variadic packs causing you grief? Again, if nothing jumps out, I'll drill down myself a bit.

Thanks!

Question: Linking RTC code with executing program

If the executing program has been compiled with -rdc=true, is it possible to link RTC compiled code with this so that it has access to shared symbols (__device__, __constant__)?

Small example of what I mean here: (Tested in Visual Studio 2015)
https://gist.github.com/Robadob/b2b7704a36e2f679942e854a0f41082a

I've found that cuLinkAddFile() would need to be used.

However, it's not clear how that could be used to link with the executing program (or whether also building the executing program to a fatbin or similar for linking would be useful).

Cleanup use of `std::map` and `std::unordered_map`

Jitify uses both std::map and std::unordered_map with no real thought as to which is more suitable. This is a request to clean this up and use the most appropriate map consistently, ideally based on benchmarked performance.

Possibility of cross-pollination/collaboration with cuda-api-wrappers?

Hello Jitify authors/maintainers,

I am the author of the CUDA Modern-C++ API Wrappers library, cuda-api-wrappers. While originally focused on the runtime API, I've more recently undertaken an expansion of the wrappers to cover the driver API as well and, no less importantly, NVRTC.

While this is still technically an unreleased development branch, it has been maturing for a while now and I will release it soon. Perhaps not surprisingly, it covers a significant part, though definitely not all, of what jitify offers. Specifically, there's an adaptation of jitify_example.cpp which uses the wrappers.

I would be happy if one, or some, of you would be interested in a chat sometime about the aspects of the library's design, mangling and when/how to do it, and perhaps even the potential for some sort of collaboration between us.

I would have liked to reach out to you by email, but couldn't quite find any of your addresses, so I'm trying this way instead.

Parallel calls to Jitify

Hey folks,

Really appreciate the consistent support. I'm continuing down the path I started in #9 and am back in a state where I hit problems. In that issue we solved the problem of "I want to use Jitify from one other thread," and that's working well.

Today I find myself needing to use Jitify concurrently on different threads, and it's segfaulting. I've verified that the source I'm passing to Jitify is valid. If I don't synchronize between threads, I get an error in the compile:

---------------------------------------------------
--- JIT compile log for header ---
---------------------------------------------------
Command-line error: cannot open source file "jitify_preinclude.h"

The error comes from here.

Note that the compilation failure doesn't seem to crash my program; I actually crash when I try to configure my KernelInstantiation (I think it's trying to configure something that's invalid). If I synchronize after each invocation of Jitify, I get no such error.

A valid reproducer would be something along the lines of

std::vector<std::string> program_string_set = {/* insert valid program strings */};
for (int iter = 0; iter < 5; iter++) {
  std::thread([=]() {
    // invoke jitify on program_string_set[iter]
  }).detach();
}

Anyway, this isn't game-breaking; I can always serialize my calls to Jitify, but I'd love to be able to make them in parallel.

Thanks as always!

catching polymorphic type ‘class std::runtime_error’ by value

When compiling with g++ 8.3, a "catching polymorphic type 'class std::runtime_error' by value" warning is emitted for line 503 of jitify.hpp.

$ cd jitify && make
g++ -o stringify stringify.cpp -O3 -Wall
./stringify example_headers/my_header1.cuh > example_headers/my_header1.cuh.jit
g++ -o jitify_example jitify_example.cpp -O3 -Wall -g -fmessage-length=80 -pthread -std=c++11 -D LINUX -I/usr/local/cuda/include -rdynamic -Wl,-b,binary,example_headers/my_header2.cuh,-b,default -ldl -L/usr/local/cuda/lib64 -lcuda -lcudart -lnvrtc
In file included from jitify_example.cpp:40:
jitify.hpp: In function ‘bool 
   jitify::detail::load_source(std::__cxx11::string, 
   std::map<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> 
   >&, std::__cxx11::string, std::vector<std::__cxx11::basic_string<char> >, 
   jitify::file_callback_type, std::map<std::__cxx11::basic_string<char>, 
   std::__cxx11::basic_string<char> >*)’:
jitify.hpp:503:21: warning: catching polymorphic 
   type ‘class std::runtime_error’ by value [
   -Wcatch-value=]
       } catch (std::runtime_error) {
                     ^~~~~~~~~~~~~

Compiling cudf, which uses jitify, requires -Werror, and changing catch (std::runtime_error) to catch (std::runtime_error&) solved the problem.

load_program() performance (with large include hierarchies)

When I include the main glm header within an RTC source, the runtime of load_program() increases to over 60 seconds (from what was already ~10 seconds), so I started investigating the problem.

I found that, to discover include files, Jitify calls nvrtcCompileProgram(), catches the failure, parses the missing include out of the error log, and then repeats the process until compilation succeeds or fails for a different reason.

GLM's include hierarchy adds ~109 separate glm header files, and ~20 internal library headers of ours are included too (we recently pruned those 20 down to ~14 dynamically for a small improvement). I calculated these values from pcpp output, so they might be a little off, as I haven't tested the flattened file it produced.

The problem is that each call to nvrtcCompileProgram() reparses the include hierarchy, so the cost grows from an initial ~100 ms to ~600 ms as each header is added. I logged 198 failed calls to nvrtcCompileProgram(). This is highly inefficient, leading to load_program() taking 60+ seconds while the final successful NVRTC call takes around 1 second.

In comparison, pcpp was able to pre-process the full include hierarchy in 3 seconds, so it's fair to assume the program could theoretically be compiled in ~4 seconds with appropriate include pre-processing: 15x faster.

In our use case, we perform multiple individual RTC compilations with the same include hierarchy, so we pay this unnecessary cost repeatedly. The worst case is our test suite: 85 minutes total with glm included in all agent functions, 25 minutes with glm only included where used, and 11 minutes with glm tests and the include disabled. Even in that last case, probably 10 minutes are spent doing RTC for the small number of RTC tests (most of our RTC tests are in our Python test suite).

I'd be interested to know your thoughts on whether Jitify can address this, or even whether NVRTC could be extended by its developers to actually load include files from disk (given that implementing a full pre-processor within Jitify is impractical), and perhaps also to mimic Jitify's behaviour of ignoring missing includes.

The optimal performance in our case (FLAMEGPU2) would probably be achieved by pre-processing and flattening the header hierarchy we pass to all RTC sources (to avoid even the ~3 s of header parsing/loading each time). But the high compile costs due to include processing might be scaring other users away from RTC if they find it via Jitify and build things with larger include hierarchies. (In our case, we've always had the ~20 library headers and assumed the ~10 s expense was just an unfortunate byproduct of RTC for our non-trivial use case.)

Irrespective of NVRTC, Jitify might be able to improve the cost of multiple compiles by caching loaded headers and always passing them to NVRTC, regardless of whether they're required. This would still leave the first compilation costly, but would make any subsequent compilations with the same headers much cheaper.


As a quick test, I was able to flatten the majority of our includes using pcpp, which reduced the number of failed calls to nvrtcCompileProgram() from 198 down to 32. I believe the remaining 32 are due to our dynamic header, which has a delayed include that can't be flattened the same way, plus the various system includes I haven't flattened yet (there are around 60, but I think they're mostly repeats). But the RTC time was already down to ~8 seconds.

Kernels running progressively slower

Hi.
I have a simple kernel that's run in a draw loop, and I've noticed it takes longer and longer to execute (my frames per second drop). If I comment out the kernel launch, my program does not get progressively slower and maintains a steady frame rate. The kernel is compiled once and then uses the cache. The only bit of my code I need to comment out to maintain a steady frame rate is the following:

static_cast<jitify::Program*>(m_program)->kernel("my_kernel")
    .instantiate(count, jitify::reflection::type_of(*indata))
    .configure(grid, block)
    .launch(indata, mag);

So if I run my program, in the first 1000 frames I get around 314 fps; by frame 3427 I'm down to 121 fps.

Have you tried running a jitify kernel like this, effectively calling a kernel in a loop many thousands of times?

I noticed that new is called in KernelInstantiation; if I'm running the same kernel again and again, can the cache handle this? I have a feeling I'm doing this wrong.
Any help appreciated.
Cheers

Simon

Will cupti be notified of a kernel launch using the driver api?

I am profiling kernels using the CUPTI API. All of my kernels were using the CUDA runtime API.
I converted my kernels to the string format that jitify wants and verified that they compile correctly, launch correctly, and return the correct values.

Jitify uses the CUDA driver API. When I try to profile my kernels launched via jitify, I'm no longer notified of the kernel launch through CUPTI.

So, my question is: "Do you know for certain that kernels launched with jitify can be profiled using CUPTI, given that jitify uses the CUDA driver API?"

Integration with Occupancy Calculator

Hey Folks,

I've been working on using Jitify; before getting to the request, I just wanted to say the software has been really excellent.

I know NVIDIA has an Occupancy Calculator; there's a bit of code that uses it here. The code is a little dense, but essentially it gives you the ideal occupancy for a kernel. I believe your kernels are represented by your KernelInstantiations here, and we interface with that most closely here. I'd like to be able to call something like myKernelInstantiation.configureViaCalculator() and have the calculator handle that logic and return a KernelLauncher.

Any opposition? If not, would it be easier for you to accept a pull request or to write the code yourselves?

Thanks again, really excited to get JIT capabilities.

Memory leak within nvrtcCompileProgram()

This isn't a bug within Jitify, but likely of interest to Jitify users.

Right before Christmas we ran a large, complicated benchmark that results in ~500 RTC compilations on HPC. We noticed that while executing it, the memory footprint grew to over 10 GB. A re-execution, where most compilations were skipped and instead loaded from the disk cache we create, used significantly less memory.

Playing with valgrind identified a leak within nvrtcCreateProgram().

Further testing on Windows suggests the leak is actually caused by nvrtcCompileProgram(). Larger source inputs appear to create a larger leak, but the leak size isn't a 1:1 mapping.

The leak appears to be present in at least CUDA 11.2, 11.4 and 11.5, on both Linux and Windows.

Our original issue: FLAMEGPU/FLAMEGPU2#756
A short explanation of the valgrind run: https://gist.github.com/ptheywood/ca81e19dd8fe79e75ae3269014389009
Trivial NVRTC Loop that produces small memory increases: https://gist.github.com/Robadob/77ef11e6e8fdd72d6431218da4ff7da4
NV Bug Report Link: https://developer.nvidia.com/nvidia_bug/3488772

Author email address(es) unavailable

There's no obvious way, short of opening an issue, to get in touch with the authors of jitify. README.md doesn't have mailto links, nor do the two main contributors' GitHub profiles.

Why doesn't JITIFY support CUDA 10?

Hello, I've been chasing a jitify bug for 2 days. I am using VC++ 2017, CUDA 10 and Win7/64.
I was seeing weird things happen in release mode that didn't happen in debug mode.
Then I went back to the jitify home page and saw that it only claims to work on CUDA 7, 8 and 9.
Why doesn't it work on CUDA 10? What is the issue?
--Bob

More examples?

Could you please add a few more examples so that people can better understand how to use your JIT tool in scientific and benchmark applications?

Thank you

Is it possible to create program from cuBin or PTX?

Hi guys, I used jitify.hpp in my project and it works just fine. However, I still have a question.
According to this:

jitify/jitify.hpp

Line 3507 in b22bf7d

inline Program program(std::string source,

it seems there's no way to create a program from a cubin or PTX? Or how can I create a CUfunction given a cubin or PTX?
Thank you very much!

Request for nicer compiler error logs

Would it be possible to have a nicer mechanism for compiler error logs? E.g. being able to query the compile log as a string, rather than having this switched via a macro?

It is upsetting to see the jitify logs in my test suite's output when checking running tests for compilation failures. ;-)

C++ header "functional" not supported?

I am testing jitify with a simple Thrust code in which I just include <thrust/count.h>. It ends up looking at my /usr/local/cuda-10.2/include/thrust/functional.h, in which <functional> is included:
https://github.com/NVIDIA/thrust/blob/deac895a041fc5fc6443b49f6846f6bbdcb60756/thrust/functional.h#L25
But it looks like <functional> is not part of jitify's get_jitsafe_headers_map? The error I get is:

/usr/local/cuda-10.2/include/thrust/functional.h not in loaded sources! This may be due to a header being loaded by NVRTC without Jitify's knowledge.

Is there any workaround that I can do as a user? Thanks.

Cannot use `<limits>` and `<cuda/std/limits>` in the same source file

Invoking jitify with the following source file:

#include <limits>
#include <cuda/std/limits>

as follows:

jitify2_preprocess -std=c++11 -D__CUDACC_RTC__ test.hpp

results in:

Error processing source file test.hpp
Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++11 -D__CUDACC_RTC__ -include=jitify_preinclude.h -default-device"
detail/libcxx/include/limits(211): error: identifier "__CHAR_BIT__" is undefined

detail/libcxx/include/limits(312): error: identifier "__FLT_MANT_DIG__" is undefined

detail/libcxx/include/limits(313): error: identifier "__FLT_DIG__" is undefined

detail/libcxx/include/limits(321): error: identifier "__FLT_RADIX__" is undefined

detail/libcxx/include/limits(325): error: identifier "__FLT_MIN_EXP__" is undefined

<many more similar errors>

As a workaround I can do:

#include <limits>
#include <cuda/std/climits>
#include <cuda/std/limits>

jitify2_preprocess require -lpthread in linkage stage for safety

In jitify, the app uses dlopen() to load libnvrtc.so rather than linking it dynamically at build time. This can cause a hidden problem with the pthread and stdc++ libraries.
This story is described in detail here:
https://stackoverflow.com/questions/51209268/using-stdthread-in-a-library-loaded-with-dlopen-leads-to-a-sigsev

It can cause a pthread_create call to jump through a null pointer somewhere, because libnvrtc.so depends on pthread but jitify2_preprocess does not.

For safety, jitify2_preprocess should depend on pthread at the linkage stage. For example:

./CMakeLists.txt:84: target_link_libraries(jitify2_preprocess PRIVATE ${CMAKE_DL_LIBS})
=> ./CMakeLists.txt:84: target_link_libraries(jitify2_preprocess PRIVATE ${CMAKE_DL_LIBS} pthread)

Jitify doesn't forward `-I` options to NVRTC.

We have a Linux box running Ubuntu 16.04.

It has the Ubuntu-packaged version of CUDA installed (a very old 7.5 or similar), which leaves some directories on the path.
It also has all of the modern CUDA versions, which must be enabled via module load.

When testing our RTC code on this machine (using CUDA 10.1), all runtime compilation failed with:

---------------------------------------------------
--- JIT compile log for rtc_test_func_program ---
---------------------------------------------------
/usr/include/cuda.h(229): error: invalid redeclaration of type name "CUuuid_st"
__nv_nvrtc_builtin_header.h(1549): here
/usr/include/cuda.h(231): error: invalid redeclaration of type name "CUuuid"
__nv_nvrtc_builtin_header.h(1552): here
2 errors detected in the compilation of "rtc_test_func_program".

After some debugging, it appears that NVRTC loads cuda.h, not Jitify.

When Jitify processes options, it detects passed include directories and removes them from the options:
https://github.com/NVIDIA/jitify/blob/master/jitify.hpp#L2579

As such, when NVRTC tries to locate cuda.h it can only look on the path, and hence finds the really old one rather than the one in the provided include dir.

The workaround for this is to pass the CUDA include dir to Jitify with --include-path= rather than -I, so Jitify doesn't intercept and remove it.

Is there a reason that all include directories are not forwarded to NVRTC?

It's unclear to me how NVRTC finds cuda.h on Windows, as this file isn't on the path.

Constant memory in jitified kernels

Hi!
I have a template kernel that also uses constant memory. Can I somehow make this constant memory visible to the kernel if I use jitify for its instantiation?

Thank you

#include <limits> fails on Windows

Visual Studio 2019
CUDA 11.2

Adding #include <limits> to any RTC kernel causes a compilation error at runtime.

---------------------------------------------------
--- JIT compile log for my_program ---
---------------------------------------------------
limits(124): error: invalid narrowing conversion from "int" to "wchar_t": constant value does not fit in destination type

limits(124): error: invalid narrowing conversion from "int" to "wchar_t": constant value does not fit in destination type

2 errors detected in the compilation of "my_program".

---------------------------------------------------

Appears to be unhappy about this line:

jitify/jitify.hpp, line 1628 (commit 5f8f0a1):

__jitify_detail::IntegerLimits<wchar_t, WCHAR_MIN, WCHAR_MAX> {};

Update: Based on further debugging in #86. There are two issues at play.

  1. Jitify defines WCHAR_MIN/WCHAR_MAX as signed values rather than unsigned on Windows.
  2. CUDA versions prior to 11.0.x do not predefine _WIN64 during NVRTC compilation, so those CUDA versions fall through to the Linux definitions of WCHAR_MIN/WCHAR_MAX.
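Given those two findings, a possible user-side workaround (an untested sketch, assuming Windows' wchar_t is an unsigned 16-bit type, and that Jitify's shim respects pre-existing definitions) is to supply the correct definitions via compiler options:

```cpp
// Hypothetical workaround: predefine _WIN64 (missing on pre-11.0 NVRTC) and
// force the Windows-correct unsigned 16-bit range for wchar_t.
jitify::Program program = kernel_cache.program(kernel_source, 0,
    {"-D_WIN64", "-DWCHAR_MIN=0", "-DWCHAR_MAX=65535"});
```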

Support for curand

I have been making lots of good progress with jitify. Thanks for the excellent tool. One issue which I am currently unable to resolve however is the use of curand.

If I include curand in my jitify kernel (e.g. #include <curand_kernel.h>) and correctly set the compiler to add the CUDA include directory (from CUDA_PATH), then there are a whole bunch of errors from cuda.h relating to ambiguous definitions of size_t,

e.g.

cuda.h(1773): error: "size_t" is ambiguous

detail::path_base() doesn't handle Linux path separators on Windows

Was attempting to test whether a kernel with GLM would build with NVRTC (I'm not hopeful).

Ran into a failure, where all of GLM's nested #include statements were failing to resolve.

glm/glm.hpp(105): warning: detail/_fixes.hpp: [jitify] File not found
glm/glm.hpp(107): warning: detail/setup.hpp: [jitify] File not found
glm/glm.hpp(114): warning: fwd.hpp: [jitify] File not found
glm/glm.hpp(116): warning: vec2.hpp: [jitify] File not found
glm/glm.hpp(117): warning: vec3.hpp: [jitify] File not found
glm/glm.hpp(118): warning: vec4.hpp: [jitify] File not found
glm/glm.hpp(119): warning: mat2x2.hpp: [jitify] File not found
glm/glm.hpp(120): warning: mat2x3.hpp: [jitify] File not found
glm/glm.hpp(121): warning: mat2x4.hpp: [jitify] File not found
glm/glm.hpp(122): warning: mat3x2.hpp: [jitify] File not found
glm/glm.hpp(123): warning: mat3x3.hpp: [jitify] File not found
glm/glm.hpp(124): warning: mat3x4.hpp: [jitify] File not found
glm/glm.hpp(125): warning: mat4x2.hpp: [jitify] File not found
glm/glm.hpp(126): warning: mat4x3.hpp: [jitify] File not found
glm/glm.hpp(127): warning: mat4x4.hpp: [jitify] File not found
etc

Added a breakpoint here to catch where detail::load_source() was being called.

Found these variable values

include_parent_fullpath = "C:/Users/Robadob/fgpu2/build/FLAMEGPU2/_deps/flamegpu2_visualiser-build/glm-src\\glm/glm.hpp" (note the \\ is the debugger's escaping of a single backslash, not two characters)
include_path = "C:/Users/Robadob/fgpu2/build/FLAMEGPU2/_deps/flamegpu2_visualiser-build/glm-src"

This means that detail::path_base() is handling the path improperly.

The method is fairly simple

inline std::string path_base(std::string p) {
  // "/usr/local/myfile.dat" -> "/usr/local"
  // "foo/bar"  -> "foo"
  // "foo/bar/" -> "foo/bar"
#if defined _WIN32 || defined _WIN64
  char sep = '\\';
#else
  char sep = '/';
#endif
  size_t i = p.find_last_of(sep);
  if (i != std::string::npos) {
    return p.substr(0, i);
  } else {
    return "";
  }
}

Seems quite clear that it's neglecting the fact that either path separator is supported under Windows (as far as I understand).

Should be an easy fix, I'll make a PR soon.

Edit: According to people on StackOverflow, MSVC used to require a backslash, but no longer does. It's unclear when that requirement was lifted; regardless, I don't expect it will affect NVRTC.
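The fix can be sketched as follows (a sketch of what the PR will do, not the merged code): std::string::find_last_of also accepts a set of characters, so on Windows we can search for the last occurrence of either separator.

```cpp
#include <string>

// Sketch of the proposed fix: on Windows accept either path separator by
// searching for the last occurrence of '/' or '\\' rather than '\\' alone.
inline std::string path_base(std::string p) {
  // "/usr/local/myfile.dat" -> "/usr/local"
  // "foo/bar"  -> "foo"
  // "foo/bar/" -> "foo/bar"
#if defined _WIN32 || defined _WIN64
  size_t i = p.find_last_of("/\\");
#else
  size_t i = p.find_last_of('/');
#endif
  if (i != std::string::npos) {
    return p.substr(0, i);
  }
  return "";
}
```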

Debugging NVRTC_ERROR_INVALID_OPTION Windows

Hi there,

I am trying to integrate this into a project that I'm building using CMake on Windows. As a first step, I took one of your example kernels to see if I could compile it, but an NVRTC_ERROR_INVALID_OPTION exception keeps being thrown. I have tried #define JITIFY_PRINT_ALL 1, but nothing is printed except the compiler options before the crash. I am building on Windows 10 with CUDA 11.0 and Visual Studio 2019 Community Edition, with the CUDA compiler set to C++14 and my CXX compiler set to C++17. I have also tried stepping through it with the Visual Studio debugger; however, it skips over the code that does the compilation, jumps straight to where it prints the source code (with no console output), and then aborts the load_program function.

I made sure to link my code with the cuda, cudart, and nvrtc libraries through CMake.

Any advice on how to go about debugging this issue?
Thanks

Compiler options: -std=c++14 -arch=compute_30
due to unexpected exception with message:
  NVRTC error: NVRTC_ERROR_INVALID_OPTION
thread_local static jitify::JitCache kernel_cache;
const char* program_source =
                "template<int N, typename T>\n"
                "__global__\n"
                "void my_kernel(T* data) {\n"
                "    T data0 = data[0];\n"
                "    for( int i=0; i<N-1; ++i ) {\n"
                "        data[0] *= data0;\n"
                "    }\n"
                "}\n";
jitify::Program program = kernel_cache.program(program_source, 0, { "-std=c++14",});
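A likely cause (an educated guess from the options line above, not confirmed): CUDA 11.0 dropped support for Kepler sm_30, so its NVRTC rejects the -arch=compute_30 being passed and returns NVRTC_ERROR_INVALID_OPTION. Explicitly requesting an architecture the toolkit still supports may avoid the failure:

```cpp
// Sketch: override the default architecture with one CUDA 11.0 supports
// (compute_52 here is illustrative; match it to your GPU).
jitify::Program program = kernel_cache.program(program_source, 0,
    {"-std=c++14", "-arch=compute_52"});
```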

Building on windows

Hi.
I'm currently getting these errors when running the simple example:
--- JIT compile log for my_program ---

type_traits(18): error: expected an identifier

type_traits(19): error: expected a ")"

type_traits(19): error: a template argument list is not allowed in a declaration of a primary template

3 errors detected in the compilation of "my_program".

Any thoughts?

Using __half with NVRTC and jitify

Hi!
I want to use the fp16 data type in kernels compiled with NVRTC. However, when I try to do so, I get the following errors:

warning: cuda_fp16.h: File not found
---------------------------------------------------
--- JIT compile log for ...---
---------------------------------------------------
error: identifier "__half" is undefined


Can you tell me, what is the best way to use cuda_fp16.h with jitify?

Thank you.
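One approach that may help (a sketch; the install path shown is illustrative): hand NVRTC the CUDA toolkit's include directory so cuda_fp16.h can be located, using the --include-path form so Jitify does not strip it from the options.

```cpp
// Sketch: JIT-compiling a kernel that uses __half, assuming the CUDA
// toolkit lives at the illustrative path below.
const char* source =
    "my_half_program\n"
    "#include <cuda_fp16.h>\n"
    "__global__ void scale(__half* data) {\n"
    "    data[threadIdx.x] = __hmul(data[threadIdx.x], __float2half(2.0f));\n"
    "}\n";
jitify::Program program = kernel_cache.program(source, 0,
    {"--include-path=/usr/local/cuda/include"});
```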

[Help] How to include <cuda_fp16.h> in jitify?

I tried including it in my kernel string (e.g. #include <cuda_fp16.h>), and I also tried making a header named "JITFP16.cuh" and passing it into the jitify::Program functions, but neither works.

So how can I include standard headers like cuda_fp16.h?

Please help!
