drjit's Introduction

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering

About this project

Dr.Jit is a just-in-time (JIT) compiler for ordinary and differentiable computation. It was originally created as the numerical foundation of Mitsuba 3, a differentiable Monte Carlo renderer. However, Dr.Jit is a general-purpose tool that can also help with various other types of embarrassingly parallel computation.

Dr.Jit helps with three steps:

  • Vectorization and tracing: When Dr.Jit encounters an operation (e.g. an addition a + b), it does not execute it right away: instead, it remembers that an addition will be needed at some later point by recording it into a graph representation (this is called tracing). Eventually, it will just-in-time (JIT) compile the recorded operations into a fused kernel using either LLVM (when targeting the CPU) or CUDA (when targeting the GPU). The values a and b will typically be arrays with many elements, and the system parallelizes their evaluation using multi-core parallelism and vector instruction sets like AVX512 or ARM Neon.

    Dr.Jit works particularly well for Monte Carlo methods, which perform the same computation for millions of random samples. Dr.Jit dynamically generates specialized parallel code for the target platform. As a fallback, Dr.Jit can also be used without JIT compilation, which turns the project into a header-only vector library without external dependencies.

  • Differentiation: If desired, Dr.Jit can compute derivatives using automatic differentiation (AD), using either forward or reverse-mode accumulation. Differentiation and tracing go hand-in-hand to produce specialized derivative evaluation code.

  • Python: Dr.Jit types are accessible within C++17 and Python. Code can be developed in either language, or even both at once. Combinations of Python and C++ code can be jointly traced and differentiated.

Dr.Jit handles large programs with custom data structures, side effects, and polymorphism. It also ships with a mathematical support library that provides transcendental functions and types such as vectors, matrices, complex numbers, and quaternions. The short sketch below illustrates the tracing and differentiation steps.
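
To make this concrete, here is a minimal Python sketch (an editorial illustration assuming the LLVM backend; the variable names are not from the official documentation):

import drjit as dr
from drjit.llvm.ad import Float

x = dr.arange(Float, 1024)   # traced, not yet evaluated
dr.enable_grad(x)            # attach x to the AD graph

y = dr.sin(x) * x            # still only recorded into the trace
dr.backward(y)               # reverse-mode AD: compiles and runs fused kernels

print(dr.grad(x))            # dy/dx = sin(x) + x*cos(x)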

Difference to machine learning frameworks

Why did we create Dr.Jit, when dynamic derivative compilation is already possible using Python-based ML frameworks like JAX, TensorFlow, and PyTorch along with backends like XLA and TorchScript?

The reason is related to the typical workloads: machine learning involves relatively small computation graphs that are, however, made of arithmetically intensive operations like convolutions, matrix multiplications, etc. The application motivating Dr.Jit (differentiable rendering) creates giant and messy computation graphs consisting of hundreds of thousands to millions of "trivial" nodes (elementary arithmetic operations). In our experience, ML compilation backends use internal representations and optimization passes that are too rich for this type of input, causing them to crash or time out during compilation. If you have encountered such issues, you may find Dr.Jit useful.

Cloning

Dr.Jit recursively depends on two other repositories: nanobind, which provides the Python bindings, and drjit-core, which provides the core components of the JIT compiler.

To fetch the entire project including these dependencies, clone the project using the --recursive flag as follows:

$ git clone --recursive https://github.com/mitsuba-renderer/drjit

Documentation

Please see Dr.Jit's page on readthedocs.io for example code and reference documentation.

References, citations

Please see the paper Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering for the nitty-gritty details and the problem motivating this project. There is also a video presentation explaining the design decisions at a higher level.

If you use Dr.Jit in your own research, please cite it using the following BibTeX entry:

@article{Jakob2022DrJit,
  author = {Wenzel Jakob and S{\'e}bastien Speierer and Nicolas Roussel and Delio Vicini},
  title = {Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering},
  journal = {Transactions on Graphics (Proceedings of SIGGRAPH)},
  volume = {41},
  number = {4},
  year = {2022},
  month = jul,
  doi = {10.1145/3528223.3530099}
}

Logo and history

The Dr.Jit logo was generously created by Otto Jakob. The "Dr." prefix simultaneously abbreviates differentiable rendering with the stylized partial derivative D, while also conveying a medical connotation that is emphasized by the Rod of Asclepius. Differentiable rendering algorithms are growing beyond our control in terms of conceptual and implementation-level complexity. A doctor is a person who can offer help in such a time of great need. Dr.Jit tries to fill this role to improve the well-being of differentiable rendering researchers.

Dr.Jit is the successor of the Enoki project, and its high-level API still somewhat resembles that of Enoki. The system evolved towards a different approach and has an all-new implementation, hence the decision to switch to a different project name.

drjit's Issues

LLVM 14 invalid IR

Here is a tiny reproducer that generates invalid LLVM IR according to LLVM 14 but works fine with LLVM 13.

import drjit as dr
from drjit.llvm import Float, UInt, Loop  # imports assumed (LLVM backend, per the IR below)

x = dr.arange(Float, 4)
i = UInt(0)
loop = Loop("Loop", lambda: (x, i))
while loop(i < 10):
    x += 1
    i += 1
print(x)

which fails with the error:

scope must have two or three operands
!1 = distinct !{!1}
scope must have two or three operands
!1 = distinct !{!1}

This can easily be reproduced and tested on different LLVM versions using the online Compiler Explorer by copy-pasting the following IR:

define void @drjit_48eecf3af2bbee94e3a6133c4f03f6e8(i64 %start, i64 %end, i8** noalias %params) #0 {
entry:
    br label %body

body:
    %index = phi i64 [ %index_next, %suffix ], [ %start, %entry ]
    %r1_0 = trunc i64 %index to i32
    %r1_1 = insertelement <4 x i32> undef, i32 %r1_0, i32 0
    %r1_2 = shufflevector <4 x i32> %r1_1, <4 x i32> undef, <4 x i32> zeroinitializer
    %r1 = add <4 x i32> %r1_2, <i32 0, i32 1, i32 2, i32 3>
    %f2 = uitofp <4 x i32> %r1 to <4 x float>
    %r3_1 = insertelement <4 x i32> undef, i32 0, i32 0
    %r3 = shufflevector <4 x i32> %r3_1, <4 x i32> undef, <4 x i32> zeroinitializer
    br label %l_4_start

l_4_start:
    br label %l_4_cond

l_4_cond: ; Loop (Loop)
    %f5 = phi <4 x float> [ %f5_final, %l_4_tail ], [ %f2, %l_4_start ]
    %r6 = phi <4 x i32> [ %r6_final, %l_4_tail ], [ %r3, %l_4_start ]
    %r7_1 = insertelement <4 x i32> undef, i32 10, i32 0
    %r7 = shufflevector <4 x i32> %r7_1, <4 x i32> undef, <4 x i32> zeroinitializer
    %p8 = icmp ult <4 x i32> %r6, %r7
    %p4 = call i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1> %p8)
    br i1 %p4, label %l_4_body, label %l_4_done

l_4_body:
    %f10 = phi <4 x float> [ %f5, %l_4_cond ]
    %r11 = phi <4 x i32> [ %r6, %l_4_cond ]
    %f12_1 = insertelement <4 x float> undef, float 0x3ff0000000000000, i32 0
    %f12 = shufflevector <4 x float> %f12_1, <4 x float> undef, <4 x i32> zeroinitializer
    %f13 = fadd <4 x float> %f10, %f12
    %r14_1 = insertelement <4 x i32> undef, i32 1, i32 0
    %r14 = shufflevector <4 x i32> %r14_1, <4 x i32> undef, <4 x i32> zeroinitializer
    %r15 = add <4 x i32> %r11, %r14
    br label %l_4_tail

l_4_tail:
    %f5_final = select <4 x i1> %p8, <4 x float> %f13, <4 x float> %f5
    %r6_final = select <4 x i1> %p8, <4 x i32> %r15, <4 x i32> %r6
    br label %l_4_cond;

l_4_done:
    %f17_p1 = getelementptr inbounds i8*, i8** %params, i32 3
    %f17_p2 = load i8*, i8** %f17_p1, align 8, !alias.scope !1
    %f17_p3 = bitcast i8* %f17_p2 to float*
    %f17_p4 = getelementptr inbounds float, float* %f17_p3, i64 %index
    %f17_p5 = bitcast float* %f17_p4 to <4 x float>*
    %f17 = bitcast <4 x float> %f5 to <4 x float>
    store <4 x float> %f17, <4 x float>* %f17_p5, align 16, !noalias !1, !nontemporal !{i32 1}

    br label %suffix

suffix:
    %index_next = add i64 %index, 4
    %cond = icmp uge i64 %index_next, %end
    br i1 %cond, label %done, label %body, !llvm.loop !2

done:
    ret void
}

declare i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1>)

!0 = !{!0}
!1 = !{!1, !0}
!2 = !{!"llvm.loop.unroll.disable", !"llvm.loop.vectorize.enable", i1 0}

attributes #0 = { norecurse nounwind "frame-pointer"="none" "no-builtins" "no-stack-arg-probe" "target-features"="" }

Compiler intrinsic not supported

Executing one of the code snippets from the tutorial with the LLVM backend triggers the following error:

Critical Dr.Jit compiler failure: jit_llvm_compile(): a global offset table was generated by LLVM, which typically means that a compiler intrinsic was not supported by the target architecture. DrJit cannot handle this case and will terminate the application now. For reference, the following kernel code was responsible for this problem:

define void @drjit_2a9fb1a3d6277c99f9e36112fe3a83c2(i64 %start, i64 %end, i8** noalias %params) #0 {
entry:
    br label %body

body:
    %index = phi i64 [ %index_next, %suffix ], [ %start, %entry ]
    %f1_1 = insertelement <8 x float> undef, float 0xbff0000000000000, i32 0
    %f1 = shufflevector <8 x float> %f1_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f2_1 = insertelement <8 x float> undef, float 0x3f80732600000000, i32 0
    %f2 = shufflevector <8 x float> %f2_1, <8 x float> undef, <8 x i32> zeroinitializer
    %r3_0 = trunc i64 %index to i32
    %r3_1 = insertelement <8 x i32> undef, i32 %r3_0, i32 0
    %r3_2 = shufflevector <8 x i32> %r3_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r3 = add <8 x i32> %r3_2, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %r4_1 = insertelement <8 x i32> undef, i32 103079216, i32 0
    %r4 = shufflevector <8 x i32> %r4_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r5_1 = insertelement <8 x i32> undef, i32 32, i32 0
    %r5 = shufflevector <8 x i32> %r5_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r6_0 = zext <8 x i32> %r4 to <8 x i64>
    %r6_1 = zext <8 x i32> %r3 to <8 x i64>
    %r6_2 = zext <8 x i32> %r5 to <8 x i64>
    %r6_3 = mul <8 x i64> %r6_0, %r6_1
    %r6_4 = lshr <8 x i64> %r6_3, %r6_2
    %r6 = trunc <8 x i64> %r6_4 to <8 x i32>
    %r7 = sub <8 x i32> %r3, %r6
    %r8_1 = insertelement <8 x i32> undef, i32 1, i32 0
    %r8 = shufflevector <8 x i32> %r8_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r9 = lshr <8 x i32> %r7, %r8
    %r10 = add <8 x i32> %r9, %r6
    %r11_1 = insertelement <8 x i32> undef, i32 7, i32 0
    %r11 = shufflevector <8 x i32> %r11_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r12 = lshr <8 x i32> %r10, %r11
    %r13 = sub <8 x i32> zeroinitializer, %r12
    %r14_1 = insertelement <8 x i32> undef, i32 250, i32 0
    %r14 = shufflevector <8 x i32> %r14_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r15_0 = mul <8 x i32> %r13, %r14
    %r15 = add <8 x i32> %r15_0, %r3
    %f16 = uitofp <8 x i32> %r12 to <8 x float>
    %f17 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f16, <8 x float> %f2, <8 x float> %f1)
    %f18 = uitofp <8 x i32> %r15 to <8 x float>
    %f19 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f18, <8 x float> %f2, <8 x float> %f1)
    %f20_1 = insertelement <8 x float> undef, float 0xc000000000000000, i32 0
    %f20 = shufflevector <8 x float> %f20_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f21_1 = insertelement <8 x float> undef, float 0x3ff0000000000000, i32 0
    %f21 = shufflevector <8 x float> %f21_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f22 = fmul <8 x float> %f19, %f19
    %f23 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f17, <8 x float> %f17, <8 x float> %f22)
    %f24 = fadd <8 x float> %f21, %f23
    %f25 = fdiv <8 x float> %f21, %f24
    %f26 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f25)
    %f27 = fmul <8 x float> %f19, %f26
    %f28 = fmul <8 x float> %f17, %f26
    %f29 = fadd <8 x float> %f26, %f20
    %f30 = fmul <8 x float> %f27, %f27
    %f31 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f28, <8 x float> %f30)
    %f32 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f29, <8 x float> %f29, <8 x float> %f31)
    %f33 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f32)
    %f34 = fsub <8 x float> %f33, %f21
    %f35 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f34, <8 x float> %f27)
    %f36 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f34, <8 x float> %f28)
    %f37 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f34, <8 x float> %f29)
    %f38 = fmul <8 x float> %f35, %f35
    %f39 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f36, <8 x float> %f36, <8 x float> %f38)
    %f40 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f37, <8 x float> %f37, <8 x float> %f39)
    %f41 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f40)
    %f42 = fsub <8 x float> %f41, %f21
    %f43 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f42, <8 x float> %f35)
    %f44 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f42, <8 x float> %f36)
    %f45 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f42, <8 x float> %f37)
    %f46 = fmul <8 x float> %f43, %f43
    %f47 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f44, <8 x float> %f44, <8 x float> %f46)
    %f48 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f45, <8 x float> %f45, <8 x float> %f47)
    %f49 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f48)
    %f50 = fsub <8 x float> %f49, %f21
    %f51 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f50, <8 x float> %f43)
    %f52 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f50, <8 x float> %f44)
    %f53 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f50, <8 x float> %f45)
    %f54 = fmul <8 x float> %f51, %f51
    %f55 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f52, <8 x float> %f52, <8 x float> %f54)
    %f56 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f53, <8 x float> %f53, <8 x float> %f55)
    %f57 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f56)
    %f58 = fsub <8 x float> %f57, %f21
    %f59 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f58, <8 x float> %f51)
    %f60 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f58, <8 x float> %f52)
    %f61 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f58, <8 x float> %f53)
    %f62 = fmul <8 x float> %f59, %f59
    %f63 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f60, <8 x float> %f60, <8 x float> %f62)
    %f64 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f61, <8 x float> %f61, <8 x float> %f63)
    %f65 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f64)
    %f66 = fsub <8 x float> %f65, %f21
    %f67 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f66, <8 x float> %f59)
    %f68 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f66, <8 x float> %f60)
    %f69 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f66, <8 x float> %f61)
    %f70 = fmul <8 x float> %f67, %f67
    %f71 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f68, <8 x float> %f68, <8 x float> %f70)
    %f72 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f69, <8 x float> %f69, <8 x float> %f71)
    %f73 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f72)
    %f74 = fsub <8 x float> %f73, %f21
    %f75 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f74, <8 x float> %f67)
    %f76 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f74, <8 x float> %f68)
    %f77 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f74, <8 x float> %f69)
    %f78 = fmul <8 x float> %f75, %f75
    %f79 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f76, <8 x float> %f76, <8 x float> %f78)
    %f80 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f77, <8 x float> %f77, <8 x float> %f79)
    %f81 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f80)
    %f82 = fsub <8 x float> %f81, %f21
    %f83 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f82, <8 x float> %f75)
    %f84 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f82, <8 x float> %f76)
    %f85 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f82, <8 x float> %f77)
    %f86 = fmul <8 x float> %f83, %f83
    %f87 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f84, <8 x float> %f84, <8 x float> %f86)
    %f88 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f85, <8 x float> %f85, <8 x float> %f87)
    %f89 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f88)
    %f90 = fsub <8 x float> %f89, %f21
    %f91 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f90, <8 x float> %f83)
    %f92 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f90, <8 x float> %f84)
    %f93 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f90, <8 x float> %f85)
    %f94 = fmul <8 x float> %f91, %f91
    %f95 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f92, <8 x float> %f92, <8 x float> %f94)
    %f96 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f93, <8 x float> %f93, <8 x float> %f95)
    %f97 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f96)
    %f98 = fsub <8 x float> %f97, %f21
    %f99 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f27, <8 x float> %f98, <8 x float> %f91)
    %f100 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f28, <8 x float> %f98, <8 x float> %f92)
    %f101 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f26, <8 x float> %f98, <8 x float> %f93)
    %f102_1 = insertelement <8 x float> undef, float 0x0, i32 0
    %f102 = shufflevector <8 x float> %f102_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f103_1 = insertelement <8 x float> undef, float 0x3f50624de0000000, i32 0
    %f103 = shufflevector <8 x float> %f103_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f104 = fadd <8 x float> %f99, %f103
    %f105 = fmul <8 x float> %f104, %f104
    %f106 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f100, <8 x float> %f100, <8 x float> %f105)
    %f107 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f101, <8 x float> %f101, <8 x float> %f106)
    %f108 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f107)
    %f109 = fsub <8 x float> %f108, %f21
    %f110 = fsub <8 x float> %f99, %f103
    %f111 = fmul <8 x float> %f110, %f110
    %f112 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f100, <8 x float> %f100, <8 x float> %f111)
    %f113 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f101, <8 x float> %f101, <8 x float> %f112)
    %f114 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f113)
    %f115 = fsub <8 x float> %f114, %f21
    %f116 = fsub <8 x float> %f109, %f115
    %f117 = fadd <8 x float> %f100, %f103
    %f118 = fmul <8 x float> %f99, %f99
    %f119 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f117, <8 x float> %f117, <8 x float> %f118)
    %f120 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f101, <8 x float> %f101, <8 x float> %f119)
    %f121 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f120)
    %f122 = fsub <8 x float> %f121, %f21
    %f123 = fsub <8 x float> %f100, %f103
    %f124 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f123, <8 x float> %f123, <8 x float> %f118)
    %f125 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f101, <8 x float> %f101, <8 x float> %f124)
    %f126 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f125)
    %f127 = fsub <8 x float> %f126, %f21
    %f128 = fsub <8 x float> %f122, %f127
    %f129 = fadd <8 x float> %f101, %f103
    %f130 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f100, <8 x float> %f100, <8 x float> %f118)
    %f131 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f129, <8 x float> %f129, <8 x float> %f130)
    %f132 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f131)
    %f133 = fsub <8 x float> %f132, %f21
    %f134 = fsub <8 x float> %f101, %f103
    %f135 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f134, <8 x float> %f134, <8 x float> %f130)
    %f136 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f135)
    %f137 = fsub <8 x float> %f136, %f21
    %f138 = fsub <8 x float> %f133, %f137
    %f139_1 = insertelement <8 x float> undef, float 0x407f400000000000, i32 0
    %f139 = shufflevector <8 x float> %f139_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f140 = fmul <8 x float> %f116, %f139
    %f141 = fmul <8 x float> %f128, %f139
    %f142 = fmul <8 x float> %f138, %f139
    %f143 = fmul <8 x float> %f140, %f102
    %f144 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f141, <8 x float> %f1, <8 x float> %f143)
    %f145 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f142, <8 x float> %f1, <8 x float> %f144)
    %f146 = call <8 x float> @llvm.maxnum.v8f32(<8 x float> %f102, <8 x float> %f145)
    %f147 = call <8 x float> @llvm.fma.v8f32(<8 x float> %f101, <8 x float> %f101, <8 x float> %f130)
    %f148 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f147)
    %f149 = fsub <8 x float> %f148, %f21
    %f150_1 = insertelement <8 x float> undef, float 0x3fb99999a0000000, i32 0
    %f150 = shufflevector <8 x float> %f150_1, <8 x float> undef, <8 x i32> zeroinitializer
    %p151 = fcmp ogt <8 x float> %f149, %f150
    %f152 = select <8 x i1> %p151, <8 x float> %f102, <8 x float> %f146
    %f153_1 = insertelement <8 x float> undef, float 0x3fc99999a0000000, i32 0
    %f153 = shufflevector <8 x float> %f153_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f154_1 = insertelement <8 x float> undef, float 0x3fd99999a0000000, i32 0
    %f154 = shufflevector <8 x float> %f154_1, <8 x float> undef, <8 x i32> zeroinitializer
    %f155 = fmul <8 x float> %f154, %f152
    %f156 = fmul <8 x float> %f153, %f152
    %f157 = fadd <8 x float> %f150, %f155
    %f158 = fadd <8 x float> %f153, %f156
    %r159_1 = insertelement <8 x i32> undef, i32 3, i32 0
    %r159 = shufflevector <8 x i32> %r159_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r160 = mul <8 x i32> %r3, %r159
    %r161 = add <8 x i32> %r160, %r8
    %r162_1 = insertelement <8 x i32> undef, i32 2, i32 0
    %r162 = shufflevector <8 x i32> %r162_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r163 = add <8 x i32> %r160, %r162
    %rd164_p1 = getelementptr inbounds i8*, i8** %params, i32 3
    %rd164 = load i8*, i8** %rd164_p1, align 8, !alias.scope !1
    %r165_0 = trunc i64 %index to i32
    %r165_1 = insertelement <8 x i32> undef, i32 %r165_0, i32 0
    %r165_2 = shufflevector <8 x i32> %r165_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %r165 = add <8 x i32> %r165_2, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %p166_0 = trunc i64 %end to i32
    %p166_1 = insertelement <8 x i32> undef, i32 %p166_0, i32 0
    %p166_2 = shufflevector <8 x i32> %p166_1, <8 x i32> undef, <8 x i32> zeroinitializer
    %p166 = icmp ult <8 x i32> %r165, %p166_2
    %u167_0 = bitcast i8* %rd164 to float*
    %u167_1 = getelementptr float, float* %u167_0, <8 x i32> %r160
    call void @llvm.masked.scatter.v8f32(<8 x float> %f157, <8 x float*> %u167_1, i32 4, <8 x i1> %p166)
    %u168_0 = bitcast i8* %rd164 to float*
    %u168_1 = getelementptr float, float* %u168_0, <8 x i32> %r161
    call void @llvm.masked.scatter.v8f32(<8 x float> %f157, <8 x float*> %u168_1, i32 4, <8 x i1> %p166)
    %u169_0 = bitcast i8* %rd164 to float*
    %u169_1 = getelementptr float, float* %u169_0, <8 x i32> %r163
    call void @llvm.masked.scatter.v8f32(<8 x float> %f158, <8 x float*> %u169_1, i32 4, <8 x i1> %p166)

    br label %suffix

suffix:
    %index_next = add i64 %index, 8
    %cond = icmp uge i64 %index_next, %end
    br i1 %cond, label %done, label %body, !llvm.loop !2

done:
    ret void
}

declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>)

declare <8 x float> @llvm.sqrt.v8f32(<8 x float>)

declare <8 x float> @llvm.maxnum.v8f32(<8 x float>, <8 x float>)

declare void @llvm.masked.scatter.v8f32(<8 x float>, <8 x float*>, i32, <8 x i1>)

!0 = !{!0}
!1 = !{!1, !0}
!2 = !{!"llvm.loop.unroll.disable", !"llvm.loop.vectorize.enable", i1 0}

attributes #0 = { norecurse nounwind "frame-pointer"="none" "no-builtins" "no-stack-arg-probe" "target-cpu"="ivybridge" "target-features"="-vzeroupper,-avx512pf,-tsxldtrk,+cx16,+sahf,-tbm,-avx512ifma,-sha,+crc32,-fma4,-vpclmulqdq,-prfchw,-bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-uintr,-gfni,+popcnt,-widekl,+aes,-avx512bitalg,-movdiri,-xsaves,-avx512er,-avxvnni,-avx512fp16,-avx512vnni,-amx-bf16,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-xsavec,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,-rdseed,-waitpkg,-kl,-movdir64b,-sse4a,-avx512bw,-clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,-serialize,-hreset,-invpcid,-avx512cd,+avx,-vaes,-avx512bf16,+cx8,-fma,-rtm,-bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,-avx2,+fxsr,-wbnoinvd,+sse,-lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,-amx-int8,-movbe,-avx512vp2intersect,+xsaveopt,-avx512dq,+sse2,-adx,+sse3" }
Aborted

The CPU is a Xeon E5-2680 (Sandy Bridge).
Is there a way to explicitly restrict the instruction set from Python?

Test files not compiling (sphere.cpp, dynamic.cpp, ...)

Hi,

Congratulations on the making of Dr.Jit.

As a first experiment, I was trying to compare SoA vs. AoS implementations of the same numerical method, but I haven't succeeded so far. Since the C++ documentation is still TBD, I used Enoki's documentation together with the files available in the tests directory to get an idea of the general API design. Doing so, I noticed that several files in the tests directory do not compile (vector.cpp, dynamic.cpp, sphere.cpp).

The commented-out tests seem to use an AoS style of coding, and I wonder whether you plan to restore them, possibly by re-implementing currently missing features, or whether it was decided not to go in this direction.

Regards,

Unexpected behavior of drjit.shape

I am playing with drjit in Python and seem to have found undesirable behavior of the shape function in router.py:
If you define a ragged 2D array whose number of entries in the second dimension decreases, the output of dr.shape is None, which is the expected result for ragged arrays.

a = mi.Color3f([0, 1, 2], [0, 1], [0, 1])
dr.shape(a)

out:

None

But if you define it in a way where the number of entries in the second dimension increases, the output is not None; instead, it behaves as if the array were extended to a uniform length:

a = mi.Color3f([0, 1], [0, 1, 2], [0, 1, 2, 3])
dr.shape(a)

out:

[3, 4]

What should I do before loading drjit variables into OptiX to ensure sync?

Hi,
I have a brief question.
I have a function called ray_intersect which takes a ray class as input and loads the ray data into OptiX to obtain ray tracing results.
I load the ray data into m_accel like this:

    m_accel->params.ray_o_x         = ray.o.x().data();
    m_accel->params.ray_o_y         = ray.o.y().data();
    m_accel->params.ray_o_z         = ray.o.z().data();
    m_accel->params.ray_d_x         = ray.d.x().data();
    m_accel->params.ray_d_y         = ray.d.y().data();
    m_accel->params.ray_d_z         = ray.d.z().data();
    m_accel->params.ray_tmax        = ray.tmax.data();
    m_accel->params.tri_index       = m_its.triangle_id.data();
    m_accel->params.shape_index     = m_its.shape_id.data();
    m_accel->params.barycentric_u   = m_its.uv.x().data();
    m_accel->params.barycentric_v   = m_its.uv.y().data();

    CUDA_CHECK(
        cudaMemcpyAsync(
            reinterpret_cast<void*>( m_accel->d_params ),
            &m_accel->params, sizeof( Params ),
            cudaMemcpyHostToDevice, m_accel->stream
        )
    );
    std::cout << ray.o << std::endl;
    std::cout << ray.d << std::endl;
    std::cout << ray.tmax << std::endl;
    OPTIX_CHECK(
        optixLaunch(
            m_accel->pipeline,
            m_accel->stream,
            reinterpret_cast<CUdeviceptr>( m_accel->d_params ),
            sizeof( Params ),
            &m_accel->sbt,
            m,              // launch size
            1,              // launch height
            1               // launch depth
        )
    );
    CUDA_SYNC_CHECK();

However, without printing the ray data via std::cout, OptiX does not return correct results such as m_its.shape_id or m_its.triangle_id.
Is there a possible reason for this?
When I was using enoki, I simply called cuda_eval(); before sending enoki's data to OptiX. Should I do something similar in drjit? (By the way, I know drjit has its own drjit->optix methods, but I am not using them yet.)
Thank you very much for answering my questions, and please let me know whether asking on GitHub is appropriate as well...

Calling some functions decorated with `wrap_ad` results in a crash

Hi,

I discovered a pair of settings where wrapping a function with wrap_ad(source='torch', target='drjit') and calling it results in a crash; you might be interested in looking into it. Some of it might be expected behavior (?), although I don't see why it should be. Here are minimal reproducers.

  1. The identity function
import drjit as dr
import torch

device = torch.device('cuda:0')

@dr.wrap_ad(source='torch', target='drjit')
def do_something(u):
    return u

u = torch.tensor([0], dtype=torch.float32, device=device)
v = do_something(u) # crashes here
  2. Taking the gradient of a simple sum involving the input and any number
import drjit as dr
import torch

device = torch.device('cuda:0')

@dr.wrap_ad(source='torch', target='drjit')
def do_something(u):
    return u + 1

u = torch.tensor([0], dtype=torch.float32, device=device)
u.requires_grad = True

v = do_something(u)

du = torch.autograd.grad(v.mean(), u)[0] # crashes here

Transforming the input with an arbitrary drjit function fixes the crashes, for example by defining do_something as

def do_something(u):
    return dr.abs(u) + 1

Here is some information about my system:

  • OS: Windows 10
  • GPU: RTX 3080 Ti
  • Python 3.8.10
    • drjit 0.3.2
    • pytorch 1.12.1
    • cudatoolkit 11.6.0

Query/Feature request - pairwise broadcasting?

Hi,

Thanks for this great tool - I've been finding it very useful for programming parts of a 3D processing pipeline for plant geometry, and so far it is much more convenient than similar libraries.

My question is: is it possible to compute pairwise interactions with a function, for example to compute a distance matrix between Array3f entries, or operations like large (larger than 4x4) matrix multiplication? Is there a way, for example, to implement a small MLP like this?

My approach thus far is to loop over one of the dimensions - would that be the recommended way to go about this, or am I abusing the tool for something it probably isn't meant for?
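
(A minimal sketch of that loop-over-one-dimension approach, as an editorial illustration; the point data and backend choice are made up:)

import drjit as dr
from drjit.llvm import Array3f, Float

p = Array3f(dr.arange(Float, 4), 0.0, 0.0)   # 4 points in SoA layout

rows = []
for i in range(dr.width(p)):                 # Python loop over one dimension
    p_i = Array3f(p.x[i], p.y[i], p.z[i])    # extract point i; broadcasts below
    rows.append(dr.norm(p - p_i))            # distances from point i to all points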

Thanks,
Oliver

enoki-jit submodule missing

Hello!
It appears that enoki-jit has either been removed or made private in such a way that I am unable to build enoki!

Forgive me if this is my mistake; I've searched for a few hours for quick fixes, but I don't see any.

On a recursive submodule pull, I am prompted to log into GitHub, which I don't have the password for; I assume this is because enoki-jit has been made private!

thanks,

-Will

Second order derivative?

Is it possible to do backward_to of a Float variable with respect to itself?
Is there a standard way to do something like JAX's .hessian, .jacrev, or .jacfwd?

Tensor::entry() missing

It seems the Tensor class has no entry() method, which is used for indexing.

DRJIT_TEST(tensor)
{
    using Real = dr::Array<float>;
    using Tensor = dr::Tensor<Real>;
    using UInt32 = dr::uint32_array_t<Real>;

    float data[] = {1, 2, 3,  4, 5, 6};
    size_t shape[] = {2, 3};
    Tensor t(data, 2, shape);

    auto x = t[0];    // <----- Compilation error: no entry function on array_base.h:182: return derived().entry(i)
}

drjit.make_opaque fails with ad tensor

I am trying to make an AD tensor type opaque:

import drjit as dr
from drjit.llvm.ad import TensorXf

t = dr.zeros(TensorXf, (10, 10, 3))
dr.make_opaque(t)

I got AttributeError: 'drjit.cuda.ad.TensorXf' object has no attribute 'detach_ref_'

However, for non-AD tensors, it works fine.

[🐛 bug report] Incompatible function arguments (probably related to int32 variable indexing)

Hi,
I think I have found a bug related to the recent change allowing more than 4 billion variables.
When optimizing a large scene, I get the following error after 910 optimization steps:

File "/home/doeringc/workspace/python/zed/ddtar/zed/test01/out/final/../../difr.py", line 95, in step
    img = mi.render(
  File "/home/doeringc/workspace/cpp/mitsuba3-test/build/python/mitsuba/python/util.py", line 518, in render
    return dr.custom(_RenderOp, scene, sensor, params, integrator,
  File "/home/doeringc/workspace/cpp/mitsuba3-test/build/python/drjit/router.py", line 5613, in custom
    Type.ad_add_edge_(index, tmp_in.index_ad)
TypeError: ad_add_edge_(): incompatible function arguments. The following argument types are supported:
    1. (src_index: int, dst_index: int, cb: handle = None) -> None

Invoked with: 2148013987, 2148014840

This seems pretty ironic since 2148013987 and 2148014840 are both integers in the Python sense.
In the ad_add_edge interface, however, the arguments are defined as int32_t C++ types here or here. I'm not quite sure which definition is the correct one for this issue, but I imagine that after the 4-billion-variables update, not all bindings were updated to use int64_t types.
Currently I don't have the time, but otherwise I would try to fix it myself since it seems easy at first glance.
Thanks for your help.

Suboptimal Codegen with enoki::Array for shuffling

Enoki generates suboptimal code when using the enoki array type for shuffling:

template<typename T, size_t num>
enoki::Complex<enoki::Packet<T, num>> val2(std::complex<T> const * ptr) {

    using packetT = enoki::Packet<T, num>;

    auto ld = enoki::load_unaligned<enoki::Array<float, 2*num>>(ptr);

    enoki::Array<int, 2*num> indices;
    for(size_t index = 0; index < num; ++index) {
        indices[index] = 2*index;
        indices[num + index] = 2*index + 1;
    }

    auto shuffled = enoki::shuffle(ld, indices);

    return enoki::Complex<packetT>(
        enoki::head<num>(shuffled),
        enoki::tail<num>(shuffled)    
    );
}

Gives

push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 96
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymm1, ymmword ptr [rsi + 32]
        vmovaps ymm5, ymmword ptr [rip + .LCPI0_1] # ymm5 = [1,3,5,7,9,11,13,15]
        vmovaps ymm4, ymmword ptr [rip + .LCPI0_2] # ymm4 = <u,u,u,u,1,3,5,7>
        vmovaps ymm2, ymmword ptr [rip + .LCPI0_0] # ymm2 = [0,2,4,6,8,10,12,14]
        mov     rax, rdi
        vshufps ymm3, ymm0, ymm1, 136           # ymm3 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]
        vpermps ymm0, ymm5, ymm0
        vpermps ymm1, ymm4, ymm1
        vmovaps ymmword ptr [rsp], ymm2
        vmovaps ymmword ptr [rsp + 32], ymm5
        vpermpd ymm3, ymm3, 216                 # ymm3 = ymm3[0,2,1,3]
        vblendps        ymm0, ymm0, ymm1, 240           # ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
        vmovaps ymmword ptr [rdi], ymm3
        vmovaps ymmword ptr [rdi + 32], ymm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

(demo https://godbolt.org/z/7e1Pj3joo)

whereas (thanks to @willwray)

template<typename T, size_t num>
constexpr enoki::Complex<enoki::Packet<T, num>> val2(std::complex<T> const * ptr) {

    using packetT = enoki::Packet<T, num>;

    auto ld = enoki::load_unaligned<enoki::Array<float, 2*num>>(ptr);

    int indices[2*num];
    for(size_t index = 0; index < num; ++index) {
        indices[index] = 2*index;
        indices[num + index] = 2*index + 1;
    }

    auto shuffled = enoki::shuffle(ld, __builtin_bit_cast(enoki::Array<int, 2*num>,indices));

    return enoki::Complex<packetT>(
        enoki::head<num>(shuffled),
        enoki::tail<num>(shuffled)    
    );
}

gives

        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymm1, ymmword ptr [rsi + 32]
        mov     rax, rdi
        vshufps ymm2, ymm0, ymm1, 136           # ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]
        vshufps ymm0, ymm0, ymm1, 221           # ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
        vpermpd ymm2, ymm2, 216                 # ymm2 = ymm2[0,2,1,3]
        vpermpd ymm0, ymm0, 216                 # ymm0 = ymm0[0,2,1,3]
        vmovaps ymmword ptr [rdi], ymm2
        vmovaps ymmword ptr [rdi + 32], ymm0
        vzeroupper
        ret

(demo https://godbolt.org/z/KTs9TT7Pn)

Given that in the first codegen clang knows ymm5 = [1,3,5,7,9,11,13,15] and ymm2 = [0,2,4,6,8,10,12,14], it seems to have figured out that it can generate the indices at compile time. We also see vshufps ymm3, ymm0, ymm1, 136 and vpermpd ymm3, ymm3, 216...

Well, one way around is with template metaprogramming:
https://godbolt.org/z/eGfhKxhP3

(base by me, improved by @willwray)
Perhaps a nice snippet to include in Enoki?

Update:
Initialising the array
enoki::Array<int, 2*num> indices(0);
gives proper codegen!

Update 2:
That was nonsense; it's a compiler issue.
Clang 12 generates good code, Clang 11 fails.
GCC always generates bad code.

Generate stubs

Since drjit loads its Python functions into the __init__.py module, I'm not able to get linting and code completion to work.
Is it possible to generate stub files for drjit like the ones that exist for mitsuba3?

I'm currently trying to adapt mitsuba3's generate_stub_files.py to work for drjit, so maybe I'll open a pull request once I get it working.

`ModuleNotFoundError: No module named 'drjit.drjit_ext'` when compiling `drjit` myself

When I compile drjit myself using

git clone --recursive https://github.com/mitsuba-renderer/drjit
cd drjit
pip install .

I get the following error when I try to import drjit:

>>> import drjit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/drjit/drjit/__init__.py", line 24, in <module>
    import drjit.drjit_ext as drjit_ext  # noqa
ModuleNotFoundError: No module named 'drjit.drjit_ext'

Am I missing a dependency or a flag?

Memory leak in PyTorch and Mitsuba interoperability tutorial

Hi, I'm having some issues with the PyTorch-Mitsuba interoperability tutorial. I'm following the tutorial step by step, yet I end up with errors.

If I rescale the textures to something smaller, e.g. (256, 256), I get the message below:

Traceback (most recent call last):
  File "bp_playground.py", line 154, in <module>
    loss.backward()
  File "/vol/research/relighting/conda/envs/torch181_cuda111/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/vol/research/relighting/conda/envs/torch181_cuda111/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 23.69 GiB total capacity; 16.00 GiB already allocated; 6.21 GiB free; 16.01 GiB reserved in total by PyTorch)
drjit-autodiff: variable leak detected (5 variables remain in use)!
 - variable a121 (1 references)
 - variable a1 (1 references)
 - variable a9 (1 references)
 - variable a13 (1 references)
 - variable a5 (1 references)

If I use a larger image size, e.g. (1024, 1024), the code fails earlier:

  File "bp_playground.py", line 130, in <module>
    model = Model1()
  File "bp_playground.py", line 119, in __init__
    nn.Linear(res**2, res**2),
  File "/vol/research/relighting/conda/envs/torch181_cuda111/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 79, in __init__
    self.weight = Parameter(torch.Tensor(out_features, in_features))
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4398046511104 bytes. Error code 12 (Cannot allocate memory)

Do you have any advice?

Many thanks,
NK

Possible bug with Loop

I'm not sure if it's a bug or reproducible, but when I try this code snippet, it randomly gets stuck while compiling the kernel (in the worst case it takes around 10 seconds to finish, strangely).

import drjit as dr
from drjit.llvm.ad import Int, UInt, Float, Loop

i = dr.zeros(Int, 10)
j = dr.zeros(Int, 10)

dr.make_opaque(i, j)  # commenting this line, it runs super fast

loop = Loop("MyLoop", lambda: (i, j))
while loop(i < 10):
    j += i
    i += 1

# dr.all reduces the element-wise comparison masks to a single bool
assert dr.all(i == Int([10]*10))
assert dr.all(j == Int([45]*10))

It could be an issue related to using Loop and make_opaque.

Any suggestion on when to use drjit::eval()?

I am writing a path tracer using drjit; however, it seems I have to call drjit::eval() VERY often to ensure it gives me the correct result, like this:

BSDFArray bsdf_array = its.shape->bsdf();
drjit::eval(bsdf_array);
...
drjit::eval(its);
bsdf_val2 = bsdf_array->eval(its, wo_local, active_direct);
...
auto rs = throughput * emitter_val * bsdf_val2 * weight1;
drjit::eval(rs);
result[active_direct] +=  rs; // this will have correct result
// result[active_direct] += throughput * emitter_val * bsdf_val2 * weight1; // this will have incorrect result!!

Is there anything I might have forgotten that causes this issue? It seems quite abnormal to need so many eval calls to get the correct result, compared to enoki...
Thanks!

[❔ other question] How to suppress import warning when no CUDA-enabled device is available?

Description

I'm using Dr.Jit on a machine without a CUDA-enabled device. I made sure to set export CUDA_VISIBLE_DEVICES="", but I'm still getting the following warning message upon import:

jit_cuda_init(): cuInit failed, disabling CUDA backend.
There are two common explanations for this type of failure:

 1. your computer simply does not contain a graphics card that supports CUDA.

 2. your CUDA kernel module and CUDA library are out of sync. Try to see if you
    can run a utility like 'nvida-smi'. If not, a reboot will likely fix this
    issue. Otherwise reinstall your graphics driver.

 The specific error message produced by cuInit was
   "no CUDA-capable device is detected"

So I have two questions:

  • Is this behaviour intended?
  • If so, how can I fix my system or hide this warning message?

System configuration

  • Dr.Jit version: 0.2.1 (from PyPI)
  • OS: Ubuntu 20.04
  • Python version: 3.10.5

Use `@dr.wrap_ad` with a method of a class

@Speierers Is it currently possible to use the @dr.wrap_ad decorator for a method of a class? A prominent example would be the forward method of a torch.nn.Module.

If it is currently not supported, is there an elegant workaround then?

Thanks.

The location of cached kernel

Hi,

I am wondering where I can find the cached kernels on disk. Is it possible to delete cached kernels or redirect the storage location? Are the cached kernels stored as PTX files or as lower-level machine code?

Cheng

Loop Variables

Hello drjit & mitsuba team!

This is an absolutely fantastic release, thank you so much for making this incredible API even better.

I tried looking through the examples and documentation for an answer to the following but couldn't find anything yet, so I'm asking here for clarification. Let's say I have a loop that does ray marching using path replay backpropagation:

        <...>
        loop = mi.Loop(name=f"PRB ({mode.name})",
                       state=lambda: (sampler, ray, L, t, δL, β, active))
        while loop(active):
            p = ray(t)
            <...>

Now, say I have two grids (A and B), and I want to use p.z to make a binary decision about which of them I look up a value at p in. So, in pseudo-code:

if p.z < 0.5:
    <look up values in grid A>
else:
    <look up values in grid B>

This decision is fixed, but I do want gradients for the values in grids A and B. In PyTorch this could be written using masking, but what is the correct way to write this using drjit / mitsuba's recorded loops?
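
(Editorial note: a sketch of the masking idiom with dr.select; the grid lookup helpers are hypothetical placeholders, not part of the original question:)

mask = p.z < 0.5                           # fixed per-lane decision
value_a = lookup_grid_a(p)                 # hypothetical grid-A query
value_b = lookup_grid_b(p)                 # hypothetical grid-B query
value = dr.select(mask, value_a, value_b)  # blend per lane; both sides stay differentiable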

Many thanks for your time.

Best,
Stephan

How to convert a tensor to a Python builtin type?

Hi,

I'm trying to convert a drjit.cuda.ad.Float to a Python builtin float. I tried calling .Scalar(); however, it always returns 0.

some_var = dr.cuda.Float(1)
print(some_var)          # [1.0]
print(some_var.Scalar()) # 0.0

What should be the correct way to do this?
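
(Editorial sketch of one approach, based on the assumption that indexing a Dr.Jit array returns a Python scalar; not a confirmed answer from this thread:)

value = some_var[0]   # index entry 0, yielding a Python float (1.0 here)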

Best Regards,
Daxuan

Reading back to Scalar types floods cache dir with kernels

Depending on how I read back values from wide device arrays to scalar arrays, I see a weird issue with unexpected kernel regeneration on every read: using .numpy() to access values generates new kernels every time, whereas dr.slice(...) does not result in additional kernels (only one kernel is cached in ~/.drjit).

Interestingly, this behavior is only observed for more complex types such as Array3f, Matrix4f, and Transform4f, but not for Float. Are there some kind of uncached reshuffling kernels involved in the multi-dimensional .numpy() accessors?

Repro

Tested on Ubuntu 20.04 LTS with Python 3.8:

import drjit as dr
from drjit.scalar import Array3f as ScalarArray3f
from drjit.cuda.ad import Array3f

device = Array3f(3.0)

for i in range(100):
    host = ScalarArray3f(device.numpy()) # generates many kernels in ~/.drjit
    #host = ScalarArray3f(dr.slice(device)) # generates one kernel in ~/.drjit
    host[0] += i
    device.assign(host)

print(device)

Maximum order for `drjit.sh_eval`

The order parameter of drjit.sh_eval is limited to 10. I guess the reason is that the code becomes increasingly verbose for higher orders, because the parameters are basically hard-coded. Would it be interesting to you to add a version that calculates the parameters of higher orders on the fly? I'd like to come up with a PR, but I'm afraid I lack the required skills.

[Question] How to convert static arrays like Vector3f to tensors inside recorded loops?

I want to convert static arrays like Vector3f to TensorXf inside recorded loops with something like mi.TensorXf(a), but when I tried, the following errors occurred:

jit_var_eval(): variable r51154 remains dirty after evaluation!
jit_var_eval(): variable r51170 remains dirty after evaluation!
TypeError: TensorXf: expect an array that implements the array interface protocol!

And I couldn't even ravel them:

Critical Dr.Jit compiler failure: jit_var_new_scatter(): variable remains dirty after evaluation!

Is there a way to do that?

Importing `torch` breaks `mitsuba`/`drjit` (again)

Since #62 has been merged, importing torch breaks mitsuba/drjit

# isort: off
import torch
import mitsuba as mi
import drjit as dr
# isort: on

print(dr.sum([1, 2, 3]))
mi.set_variant("llvm_ad_rgb")

test_scene = mi.load_file("./tests/fixtures/scenes/cbox.xml", res=128, integrator="prb")
print("Success")

I encountered this problem before (see #62 (comment)), but there existed a workaround, i.e. importing torch first. Unfortunately, this no longer works (at least for me).

@Speierers Can you reproduce this?

Missing overload for Arm Neon intrinsics

Hi! I am evaluating whether it is worth switching some code from enoki to drjit. I noticed that there is currently no packet overload for Neon intrinsics, i.e. something corresponding to enoki's array_neon.h.
Are you planning to add this at a later point, or would you welcome some PRs adding it back?

Differentiating through loops

I've been playing around with Dr.Jit. It seems to work very well, but I think there's something fairly fundamental that I'm not understanding.

I tried to implement a very basic differentiable renderer based on the sphere tracing example in the docs. (I know that I'm not handling discontinuities properly and this won't work well).
I actually get fairly decent results if I use a regular Python loop in the sphere tracing routine, but whenever I try to use the "recorded loop" construct and then ask it to compute gradients, I get the message:

loop_process_state(): one of the supplied loop state variables of type Float is attached to the AD graph (i.e., grad_enabled(..) is true). However, propagating derivatives through multiple iterations of a recorded loop is not supported (and never will be). Please see the documentation on differentiating loops for details and suggested alternatives.

I haven't found any documentation on "differentiating loops". Is there something you could point me to that explains how I would differentiate through a loop? I did look at the differentiable sphere tracing code, but it's hard for me to disentangle the discontinuity handling from the basic "differentiable loop" code.

Thanks!

question: feasibility of separating vectorization into separate library

Hi, I have a C++ codebase that I'm modifying to vectorize with enoki, and I've been really impressed with the speedups and how straightforward it is to adopt. I noticed that development on enoki has slowed in favor of drjit, which sounds like an amazing project!

My question is: would it make sense to separate the vectorization tools of these libraries (e.g. enoki::Array and its implementations for different SIMD instruction sets) into a separate repo? The ease of use and features (vectorization of special functions, masked operations, etc.) make enoki/drjit's vectorization capability much nicer than other SIMD libraries, and making it an independent module could help make it more accessible.

Thanks again for developing and sharing these wonderful libraries-- drjit seems like an incredibly ambitious project.

Unable to Optimize Vertex Positions

I wanted to optimize the vertex positions of a mesh and tried the following code, but encountered a strange error.

import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

key = 'redwall.vertex_positions'

# Loading the Cornell Box scene shipped along with Mitsuba 3
scene = mi.load_file('tutorials/scenes/cbox.xml', res=128, integrator='prb_reparam')

params = mi.traverse(scene)

opt = mi.ad.Adam(lr=0.02)
opt[key] = params[key]
params.update(opt)

This fails with:

Critical failure in Dr.Jit AD backend: referenced an unknown variable a2494425056!
Aborted (core dumped)

What is the problem here?

[Question] Is there a way to concatenate two Dr.Jit arrays in Python?

Is there a way to concatenate two arrays in Python, like a = dr.concat(b, c), where a, b, and c are Dr.Jit arrays of the same type, for example Vector3f?
The only way I can think of is to convert them to, for example, NumPy arrays, concatenate them there, and then convert the result back into a Dr.Jit array. Is there a better way?
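
(A sketch of the NumPy round trip described above, as an editorial illustration; shown with a flat Float array for simplicity, LLVM backend assumed:)

import numpy as np
import drjit as dr
from drjit.llvm import Float

b = Float([1, 2, 3])
c = Float([4, 5])

# device -> NumPy -> concatenate -> device
a = Float(np.concatenate([b.numpy(), c.numpy()]))
print(a)  # [1, 2, 3, 4, 5]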

Bug: tensor multi index broadcasting fails.

I tried to do multi-index tensor broadcasting and the indices go out of range. I went through the unit tests and found only single-index broadcasting, so I guess multi-index tensor broadcasting has not been tested.

I'm currently adding an else before size = size_next in the file tensor.h, which temporarily solves the problem.

An issue with dr.repeat and dr.tile

As the title suggests, I am encountering some issues with dr.repeat and dr.tile.

The following fails with the error AttributeError: module 'drjit.llvm.ad' has no attribute 'UInt8':

import drjit as dr
import mitsuba as mi
mi.set_variant('llvm_ad_rgb')

b = mi.Bool([True, False])
dr.repeat(b, 2)

The error message is the same for dr.tile.

Kernel runtime/op-count increases gradually during optimization loop

Hi everyone!

I've run into an issue where the runtime of my kernel increases steadily as my optimization runs. It also seems to recompile the kernel on every iteration (I get thousands of files in the cache directory after an hour or two). When I enable Info output, the number of ops also seems to creep up steadily.
I don't think I'm doing anything too unusual, but I haven't reduced my program to a minimal example that exhibits the issue (the behavior seems fairly random; removing or adding a few lines of code can drastically shift the point at which the slowdown occurs).
Is this something you have encountered before?

Thanks!

More than 4 billion variables error

Hi! I did a large optimization using Mitsuba and after many iterations received

'Critical Dr.Jit compiler failure: DrJit has created more than 2^32 (4 billion) variables, which is currently the limit. Bug Wenzel to fix this (it will involve sorting scheduled variables by scope ID instead of variable ID and making the counter big enough that it will never overflow..)'.

Could you please tell me whether this is fine / will be fixed / ...?

How to do enoki.psum() in drjit

Here is an example of what I am trying to do for a discrete distribution:

using FloatC    = CUDAArray<float>;
m_pmf = ....
m_cmf = psum(m_pmf);

I see mitsuba3 uses a loop to calculate the CMF. Can Dr.Jit not support Enoki's psum feature?
Let me know if there is a better way to calculate the CDF using drjit on the GPU.
Thanks!

Weird variable leak issue

Prior to the following commit: 6cf418c

I'd get

drjit-autodiff: variable leak detected (2 variables remain in use)!
 - variable a8760 (1 references)
 - variable a8761 (1 references)

Now it has become a scarier error:

jit_shutdown(): detected variable leaks:
 - variable r57096 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=uint32, size=1, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r52 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=uint32, size=1, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r21 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=float32, size=95, stmt="$r0 = call <$w x $t0> @llvm.fma.v$w$a1(<$w x $t1> $r1, <$w x $t2> $r2, <$w x $t3> $r3)$[declare <$w x $t0> @llvm.fma.v$w$a1(<$w x $t1>, <$w x $t2>, <$w x $t3>)$]", dep=[18, 3, 15, 0])
 - variable r17838 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=float32, size=1, stmt="$r0 = fneg <$w x $t0> $r1", dep=[17426, 0, 0, 0])
 - variable r40 is still being referenced! (int_ref=0, ext_ref=9, se_ref=0, type=float32, size=1, stmt="<literal>", dep=[0, 0, 0, 0])
 - variable r2 is still being referenced! (int_ref=3, ext_ref=1, se_ref=0, type=float32, size=95, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r20633 is still being referenced! (int_ref=0, ext_ref=2, se_ref=0, type=float32, size=110592, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r57067 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=float32, size=4096, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r57095 is still being referenced! (int_ref=0, ext_ref=1, se_ref=0, type=uint32, size=1, stmt="<null>", dep=[0, 0, 0, 0])
 - variable r10 is still being referenced! (int_ref=1, ext_ref=0, se_ref=0, type=float32, size=95, stmt="$r0 = fmul <$w x $t0> $r1, $r2", dep=[7, 1, 0, 0])
 - (skipping remainder)
jit_shutdown(): 68 variables are still referenced!
jit_registry_shutdown(): LLVM registry leaked 2 forward and 1 reverse mappings!
jit_registry_shutdown(): LLVM registry leaked 2 attributes!
jit_malloc_shutdown(): leaked
 - host-async memory: 4.097 MiB in 31 allocations
libc++abi: terminating with uncaught exception of type std::runtime_error: jit_init_thread_state(): the LLVM backend is inactive because the LLVM shared library ("libLLVM.dylib") could not be found! Set the DRJIT_LIBLLVM_PATH environment variable to specify its path.
[1]    18938 abort      python -m text_to_3d_mistuba.train --save-model $(pwd)/models/dummy_test  1

Should I be worried?

Unfortunately, since I don't even know why I had a leak even before the commit, it's quite hard to produce a small reproducible codebase.

Update

I reran the tutorial on NeRF (concatenating everything into a single script) and got the same error.

How to shuffle a Tensor

Hi,
I have rendered a scene using an aov integrator with normals as an output. In my case, this returns a tensor of shape (256, 256, 6).
Now I want to extract just the normal output from the 6 channels; to do so, I tried using the dr.shuffle([2, 0, 1], aov) function.
Unfortunately, this fails with the following error:

Traceback (most recent call last):
  File "/home/doeringc/workspace/python/mitsuba3/test-denoise.py", line 19, in <module>
    img = dr.shuffle([2, 0, 1], aov)
  File "/home/doeringc/workspace/cpp/mitsuba3-main/build/python/drjit/router.py", line 3196, in shuffle
    raise Exception("shuffle(): incompatible input!")
drjit.Exception: shuffle(): incompatible input!

When looking at the implementation of shuffle I noticed that len(perm) has to be equal to value.Size.
aov.Size in my case is 18446744073709551615 (i.e. 2^64 - 1, the sentinel value for dynamically sized arrays).

Example code

import matplotlib.pyplot as plt
import mitsuba as mi
import drjit as dr
mi.set_variant("cuda_rgb")

scene = mi.cornell_box()
print(f"{scene=}")
scene['integrator'] = {
    'type': 'aov',
    'aovs': 'nn:sh_normal',
}
scene = mi.load_dict(scene)

aov = mi.render(scene)

print(f"{aov.Size=}")
print(f"{aov=}")

img = dr.shuffle([2, 0, 1], aov)

How can I shuffle this tensor to get at the channels of my output? Alternatively,
is there another way to get only the normal channels without shuffling the tensor?
Would you recommend using dr.gather?
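For what it's worth, slicing the tensor directly avoids shuffle entirely. A minimal sketch continuing from the example above, assuming the aov integrator appends the normal AOV after the three RGB channels (i.e. channels 3 to 5):

# ... continuing after aov = mi.render(scene) from the example above

# Tensors support NumPy-style slicing:
normals = aov[:, :, 3:6]  # TensorXf of shape (256, 256, 3)

# Equivalent explicit gather from the flat storage; channel c of
# pixel i lives at index i * 6 + c:
i = dr.arange(mi.UInt32, 256 * 256)
nx = dr.gather(mi.Float, aov.array, i * 6 + 3)  # x component only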

Thanks for your help.

dr.round does not work for TensorXf

Hello!

I'm playing around with Dr.Jit and noticed that dr.round() does not work for TensorXf objects... Is this expected?

Minimal reproducible example:

import drjit as dr
from drjit.cuda.ad import TensorXf, Array3f

array = Array3f([1.1, 2.1, 3.1])
tensor = TensorXf([1.1, 2.2, 3.3], shape=(3,))

print(dr.round(array))
print(dr.round(tensor))

=> The last line will raise an error:

IndexError: slice_tensor(): too many indices specified!
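In the meantime, a workaround that behaves as expected for me, assuming it is fine to operate on the tensor's flat storage and rewrap it via the TensorXf(array, shape) constructor:

# continuing from the example above:
rounded = TensorXf(dr.round(tensor.array), tensor.shape)
print(rounded)  # [1, 2, 3]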

Thanks for taking the time to help me!
Pj

Problem with `dr.backward` when using `@dr.wrap_ad`

When porting the gradient-based optimization tutorial to PyTorch using @dr.wrap_ad:

import drjit as dr
import mitsuba as mi
import numpy as np
import torch


@dr.wrap_ad(source="drjit", target="torch")
def criterion(image, image_ref):
    return torch.mean(torch.square(image - image_ref))


def test_llvm_ad_rgb_torch():
    rendering_seed = 0
    comparison_spp = 512
    image_size = 128

    mi.set_variant("llvm_ad_rgb")

    scene_dictionary = mi.cornell_box()
    scene_dictionary["sensor"]["film"]["width"] = image_size
    scene_dictionary["sensor"]["film"]["height"] = image_size
    scene_dictionary["integrator"] = dict(type="prb")
    test_scene = mi.load_dict(scene_dictionary)
    image_reference = mi.render(test_scene, seed=rendering_seed, spp=comparison_spp)

    parameters = mi.traverse(test_scene)

    key = "red.reflectance.value"

    color_reference = parameters[key]

    # Set another color value and update the scene
    parameters[key] = mi.Color3f(0.01, 0.2, 0.9)
    parameters.update()

    image_init = mi.render(test_scene, seed=rendering_seed, spp=128)
    mi.util.convert_to_bitmap(image_init)

    optimizer = mi.ad.Adam(lr=0.05)
    optimizer[key] = parameters[key]
    parameters.update(optimizer)

    for _ in range(50):
        # Perform a (noisy) differentiable rendering of the scene
        image = mi.render(test_scene, parameters, spp=4)

        # Evaluate the objective function from the current rendered image
        loss = criterion(image, image_reference)

        # Backpropagate through the rendering process
        dr.backward(loss)

        # Optimizer: take a gradient descent step
        optimizer.step()

        # Post-process the optimized parameters to ensure legal color values.
        optimizer[key] = dr.clamp(optimizer[key], 0.0, 1.0)

        # Update the scene state to the new optimized values
        parameters.update(optimizer)

    image_final = mi.render(test_scene, seed=rendering_seed, spp=comparison_spp)

    color_restored = parameters[key]

    np.testing.assert_allclose(color_reference, color_restored, atol=0.01)
    np.testing.assert_allclose(image_final, image_reference, atol=0.01)
    print("Success!")


if __name__ == "__main__":
    test_llvm_ad_rgb_torch()

I receive the following error:

Exception has occurred: TypeError
__init__(): incompatible constructor arguments. The following argument types are supported:
    1. drjit.llvm.TensorXf()
    2. drjit.llvm.TensorXf(array: object)
    3. drjit.llvm.TensorXf(array: drjit.llvm.Float)
    4. drjit.llvm.TensorXf(array: drjit.llvm.Float, shape: List[int])
    5. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXi)
    6. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXu)
    7. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXi64)
    8. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXu64)
    9. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXf)
    10. drjit.llvm.TensorXf(arg0: drjit.llvm.TensorXf64)

Invoked with: None, (128, 128, 3)
  File "/src/mitsuba3/build/python/drjit/router.py", line 4055, in accum_grad
    src = t(src, dst.shape)
  File "/src/mitsuba3/build/python/drjit/router.py", line 4066, in accum_grad
    accum_grad(dst[i], src[i] if vs else src)
  File "/src/mitsuba3/build/python/drjit/router.py", line 5490, in set_grad_in
    _dr.accum_grad(self.inputs[name], value)
  File "/src/mitsuba3/build/python/drjit/router.py", line 5868, in backward
    self.set_grad_in('args', args_grad)
  File "/src/mitsuba3/build/python/drjit/router.py", line 4341, in traverse
    dtype.traverse_(mode, flags)
  File "/src/mitsuba3/build/python/drjit/router.py", line 4460, in backward_from
    traverse(ta, _dr.ADMode.Backward, flags)
  File "/src/mitsuba3/build/python/drjit/router.py", line 4514, in backward
    backward_from(arg, flags)
  File "/workspace/development/test_llvm_ad_rgb_torch.py", line 51, in test_llvm_ad_rgb_torch
    dr.backward(loss)
  File "/workspace/development/test_llvm_ad_rgb_torch.py", line 72, in <module>
    test_llvm_ad_rgb_torch()

This is strange, since it used to work with previous versions of the wrap_ad branch. Has something changed about the usage?
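For reference, a stripped-down repro attempt without the renderer (hypothetical and untested on this exact branch, but it should exercise the same backward path through wrap_ad):

import drjit as dr
import mitsuba as mi
import torch

mi.set_variant("llvm_ad_rgb")

@dr.wrap_ad(source="drjit", target="torch")
def criterion(image, image_ref):
    return torch.mean(torch.square(image - image_ref))

image = dr.full(mi.TensorXf, 0.5, (8, 8, 3))
image_ref = dr.zeros(mi.TensorXf, (8, 8, 3))
dr.enable_grad(image)

loss = criterion(image, image_ref)
dr.backward(loss)      # fails the same way if the bug is in wrap_ad itself
print(dr.grad(image))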

GCC 12.1.1 - bmiintrin.h vs x86gprintrin.h

Hi,

I tried to compile Dr.Jit on my Arch Linux machine and got this type of error (several of them, all the same):

/usr/bin/c++ -DDRJIT_ENABLE_AUTODIFF=1 -DDRJIT_ENABLE_CUDA=1 -DDRJIT_ENABLE_JIT=1 -DDRJIT_UNROLL=" " -Ddrjit_python_EXPORTS -I/home/beltegeuse/tmp/drjit/include -I/home/beltegeuse/tmp/drjit/ext/drjit-core/include -I/home/beltegeuse/tmp/drjit/ext/drjit-core/ext/nanothread/include -isystem /usr/include/python3.10 -isystem /home/beltegeuse/tmp/drjit/ext/pybind11/include -fdiagnostics-color=always -O3 -DNDEBUG -fPIC -fvisibility=hidden   -fno-math-errno -ffp-contract=fast -fno-trapping-math -march=native -Wall -Wextra -Wno-unused-local-typedefs -flto -fno-fat-lto-objects -Os -fno-strict-aliasing -MD -MT src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o -MF src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o.d -o src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o -c /home/beltegeuse/tmp/drjit/src/python/implicit.cpp
In file included from /home/beltegeuse/tmp/drjit/include/drjit/packet_intrin.h:44,
                 from /home/beltegeuse/tmp/drjit/include/drjit/packet.h:16,
                 from /home/beltegeuse/tmp/drjit/src/python/common.h:8,
                 from /home/beltegeuse/tmp/drjit/src/python/implicit.cpp:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/12.1.1/include/bmiintrin.h:25:3: error: #error "Never use <bmiintrin.h> directly; include <x86gprintrin.h> instead."
   25 | # error "Never use <bmiintrin.h> directly; include <x86gprintrin.h> instead."
      |   ^~~~~
In file included from /home/beltegeuse/tmp/drjit/include/drjit/packet_intrin.h:48:
/usr/lib/gcc/x86_64-pc-linux-gnu/12.1.1/include/bmi2intrin.h:25:3: error: #error "Never use <bmi2intrin.h> directly; include <x86gprintrin.h> instead."
   25 | # error "Never use <bmi2intrin.h> directly; include <x86gprintrin.h> instead."
      |   ^~~~~

Replacing the bmiintrin.h/bmi2intrin.h includes with x86gprintrin.h inside packet_intrin.h:44 solves the issue (i.e., Dr.Jit compiles). I did not have this problem on another Linux machine (but it had a different CPU and Linux distro).

Bug report with Loop

A case where I found a possible bug with the drjit Loop feature:

import drjit as dr
from drjit.llvm.ad import UInt, Float, Loop

def f(x):  # returns x + 0 + 1 + ... + 9
    i = UInt(0)
    loop = Loop("Loop", lambda: (x, i))
    while loop(i < 10):
        x = x + i
        i += 1
    return x

a = dr.zeros(Float, 10)
dr.make_opaque(a)
a_old_index = a.index

print(f'{a.index=}, {a}')
b = f(a)
print(f'{b.index=}, {b}')
assert dr.allclose(b, [45] * 10)
assert a.index != a_old_index  # drjit.loop seems to change the index of the captured variable to that of a placeholder
print(f'{a.index=}, {a}')  # runtime error: a is the placeholder, which cannot be evaluated

Adding a copy in the first line of def f(x) solves this issue (see the sketch below), but it seems strange to me that after instantiating Loop, the original JIT variable silently becomes a placeholder.
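For completeness, the copy-based workaround looks like this:

def f(x):
    x = Float(x)  # copy: the loop rewrites the copy's index and the
                  # caller's variable stays valid
    i = UInt(0)
    loop = Loop("Loop", lambda: (x, i))
    while loop(i < 10):
        x = x + i
        i += 1
    return x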

Question about `drjit::Array`

Hi,

I'm porting some existing simulation code to Enoki / drjit::Array, and some of the original code uses bool-to-double conversions to implement piecewise functions in a branchless way:

double foo(double x) {
  return (x > 0) * 4 + sin(2 * x);
}

When moving this code to drjit::Array, it fails because there is no implicit conversion from the mask type to a drjit::Array:

using namespace drjit;
using double8 = drjit::Array<double, 8>;

double8 foo(double8 x) {
  return (x > 0) * 4 + sin(2 * x);
  //     ^^^^^^^
  // error: no viable conversion from returned value of type 
  // 'drjit::Mask<double, 8>' to function return type 'double8' (aka 'Array<double, 8>')
}

One way to address this would be to manually transform the entire codebase to adopt drjit::select( ... ), or drjit::masked( ... ) where appropriate.

My question is: would it make sense to define this kind of drjit::Mask -> drjit::Array conversion (or some basic arithmetic operations between masks and arrays), so that drjit::Array could support this kind of branchless pattern? The potential benefit is that more regular C++ code would be directly compatible with drjit::Array without modification.
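For reference, here is the select-based rewrite of the same function, shown in Python for brevity (drjit::select(...) is the C++ analogue):

import drjit as dr
from drjit.llvm import Float

def foo(x: Float) -> Float:
    # explicit branchless selection instead of a mask-to-number cast
    return dr.select(x > 0, 4.0, 0.0) + dr.sin(2 * x)

print(foo(Float([-1.0, 0.5, 2.0])))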

Fail to compile with GCC 12

My OS:

$ uname -a
Linux x 5.19.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.6-1 (2022-09-01) x86_64 GNU/Linux

drjit version: dbdb578
My compiler version:

$ gcc --version
gcc (Debian 12.2.0-3) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang --version
Debian clang version 14.0.6-2
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

When building with GCC, I got errors as follows:

[12/22] Building CXX object src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o
FAILED: src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o
/usr/bin/c++ -DDRJIT_ENABLE_AUTODIFF=1 -DDRJIT_ENABLE_CUDA=1 -DDRJIT_ENABLE_JIT=1 -DDRJIT_UNROLL=" " -Ddrjit_python_EXPORTS -I/tmp/drjit/include -I/tmp/drjit/ext/drjit-core/include -I/tmp/drjit/ext/drjit-core/ext/nanothread/include -isystem /usr/include/python3.10 -isystem /tmp/drjit/ext/pybind11/include -fdiagnostics-color=always -O3 -DNDEBUG -fPIC -fvisibility=hidden   -fno-math-errno -ffp-contract=fast -fno-trapping-math -march=native -Wall -Wextra -Wno-unused-local-typedefs -flto -fno-fat-lto-objects -Os -fno-strict-aliasing -MD -MT src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o -MF src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o.d -o src/python/CMakeFiles/drjit-python.dir/implicit.cpp.o -c /tmp/drjit/src/python/implicit.cpp
In file included from /tmp/drjit/include/drjit/packet_intrin.h:44,
                 from /tmp/drjit/include/drjit/packet.h:16,
                 from /tmp/drjit/src/python/common.h:8,
                 from /tmp/drjit/src/python/implicit.cpp:1:
/usr/lib/gcc/x86_64-linux-gnu/12/include/bmiintrin.h:25:3: error: #error "Never use <bmiintrin.h> directly; include <x86gprintrin.h> instead."
   25 | # error "Never use <bmiintrin.h> directly; include <x86gprintrin.h> instead."
      |   ^~~~~
In file included from /tmp/drjit/include/drjit/packet_intrin.h:48:
/usr/lib/gcc/x86_64-linux-gnu/12/include/bmi2intrin.h:25:3: error: #error "Never use <bmi2intrin.h> directly; include <x86gprintrin.h> instead."
   25 | # error "Never use <bmi2intrin.h> directly; include <x86gprintrin.h> instead."
      |   ^~~~~
[13/22] reinterpret.cpp.o and [14/22] dlpack.cpp.o fail with the identical bmiintrin.h / bmi2intrin.h errors (output omitted).

It seems that GCC dislikes some of the intrinsic header files. But when building with Clang (I use cmake .. -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++), there is no such error.
Any ideas on how to fix this?
