
sppark's Introduction

sppark

sppark (pronounced 'spark') is Supranational's performance primitives for arguments of knowledge such as SNARKs and STARKs. The library focuses on accelerating the most computationally expensive pieces of zero-knowledge proof generation, such as multi-scalar multiplication (MSM), number theoretic transform (NTT), arithmetic hashes, and more. The library is a collection of CUDA/C++ templates that can be instantiated for a range of finite fields and elliptic curves.

Table of Contents

Status

This library is under active development, with a list of planned significant improvements.

General notes on implementation

The goal of the sppark library is to provide foundational components for applications and other libraries that require high-performance operations for zero-knowledge proof generation.

Platform and Language Compatibility

This library primarily supports x86_64 platforms with Nvidia Volta+ GPU hardware on Linux and Windows operating systems. The non-GPU portions can also be used on ARM64, including on macOS.

We show how to interface with Rust and are open to discussions on how to interface with Go. Caveat lector. Achieving the highest possible GPU performance requires interfacing with the target language's memory management, possibly its async facilities, and might even require changes to an object's data layout. These concerns are hard to generalize and are consequently also a matter of discussion, likely on a case-by-case basis.

Introductory Integration Tutorial

[TBD]

Multi-scalar Multiplication (MSM)

[TBD]

Repository Structure

Root - Contains various configuration files, documentation, licensing.

  • conversion - Data conversion utilities.
  • ec - Contains templates for elliptic curve operations such as addition and doubling for different point representations.
  • ff - Contains CUDA templates for finite field operations and instantiations of a variety of fields.
  • hash - Arithmetic hash implementations.
  • memory - Memory management utilities.
  • merkle - Merkle tree construction kernels.
  • msm - Contains multi-scalar multiplication templates that can be instantiated for a variety of elliptic curves, such as BLS12-381 and the Pasta curves.
  • ntt - Contains NTT CUDA kernels.
  • poc - Proof-of-concept implementations, including benchmarking.
  • rust - Houses the Rust crate definition.
  • util - General-purpose helper classes.

Performance

Simplified benchmark results can be collected by end users by exercising the proof-of-concept applications. "Simplified" refers to the fact that there is always room for application-specific tuning; the intention is to give a general "taste." Note that the benchmarks are likely to require high-end GPUs and cannot be expected to run unmodified on a laptop.

Caveat lector. As you compile the PoC applications you might get warnings about types that are not FFI-safe. Alarming as they appear, the fact that the tests pass means that things work out nevertheless. However, this is not to say that the warnings should be ignored, or that one can proceed to build production code upon them. We intend to work with external software maintainers to resolve these warnings.

License

The sppark library is licensed under the Apache License Version 2.0 software license.

sppark's People

Contributors

dot-asm, sandsentinel


sppark's Issues

Data copy time fluctuates during concurrent NTT invocation

I am testing the gl64 NTT with log_n_size=17 in a concurrent environment.
I observed that the data copy from host to device ranges from 20us to 6ms. I think the underlying code does not utilize async transfers, and the last line, gpu.sync(), blocks the CPU.

  gpu.select();
  size_t domain_size = (size_t)1 << lg_domain_size;
  dev_ptr_t<fr_t> d_inout{domain_size, gpu};
  gpu.HtoD(&d_inout[0], inout, domain_size);
  NTT_internal(&d_inout[0], lg_domain_size, order, direction, type, gpu,
               coset_ext_pow);
  gpu.DtoH(inout, &d_inout[0], domain_size);
  gpu.sync();

Alternatively, it would be better to provide a batch NTT function.
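
A possible mitigation, sketched below as plain CUDA runtime calls rather than sppark API: pin the host buffer and keep every transfer on a private stream. Copies from pageable host memory go through an internal staging buffer and fluctuate under contention, while page-locked (pinned) transfers do not, and cudaStreamSynchronize() blocks only this caller's stream rather than the whole device.

#include <cuda_runtime.h>

// Hypothetical caller-side pattern (not sppark API): pin the host buffer,
// then issue all transfers and kernel launches on one private stream.
template<typename T>
void ntt_roundtrip_pinned(T* inout, size_t domain_size, cudaStream_t stream)
{
    size_t bytes = domain_size * sizeof(T);
    cudaHostRegister(inout, bytes, cudaHostRegisterPortable);

    T* d_inout;
    cudaMallocAsync((void**)&d_inout, bytes, stream);
    cudaMemcpyAsync(d_inout, inout, bytes, cudaMemcpyHostToDevice, stream);
    // ... launch the NTT kernels on |stream| here ...
    cudaMemcpyAsync(inout, d_inout, bytes, cudaMemcpyDeviceToHost, stream);
    cudaFreeAsync(d_inout, stream);

    cudaStreamSynchronize(stream);   // blocks this stream only, not the device
    cudaHostUnregister(inout);
}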

Sppark error

I am pretty new to this. I was trying to set up a repo that has sppark as a dependency, and I am getting the error below:

error: failed to run custom build command for `sppark v0.1.2 (https://github.com/supranational/sppark?rev=48f6ac6#48f6ac60)`

Caused by:
  process didn't exit successfully: `/home/aditya/aleo-prover/target/release/build/sppark-329937d95720d5f1/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/ec
  cargo:rerun-if-changed=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/ff
  cargo:rerun-if-changed=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/ntt
  cargo:rerun-if-changed=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/msm
  cargo:rerun-if-changed=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util
  cargo:ROOT=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("3")
  HOST = Some("x86_64-unknown-linux-gnu")
  cargo:rerun-if-env-changed=CXX_x86_64-unknown-linux-gnu
  CXX_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_x86_64_unknown_linux_gnu
  CXX_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = None
  cargo:rerun-if-env-changed=NVCC_x86_64-unknown-linux-gnu
  NVCC_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=NVCC_x86_64_unknown_linux_gnu
  NVCC_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_NVCC
  HOST_NVCC = None
  cargo:rerun-if-env-changed=NVCC
  NVCC = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64-unknown-linux-gnu
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64_unknown_linux_gnu
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/home/aditya/aleo-prover/target/release/build/sppark-ce7f4abd18d08224/out/src/lib.o" "-c" "--device-c" "src/lib.cpp"
  cargo:warning=In file included from /home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh:9,
  cargo:warning=                 from src/lib.cpp:2:
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh: In member function ‘void* stream_t::Dmalloc(size_t) const’:
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh:29:17: error: ‘cudaMallocAsync’ was not declared in this scope; did you mean ‘cudaMallocArray’?
  cargo:warning=   29 |         CUDA_OK(cudaMallocAsync(&d_ptr, sz, stream));
  cargo:warning=      |                 ^~~~~~~~~~~~~~~
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/exception.cuh:32:24: note: in definition of macro ‘CUDA_OK’
  cargo:warning=   32 |     cudaError_t code = expr;                                \
  cargo:warning=      |                        ^~~~
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh: In member function ‘void stream_t::Dfree(void*) const’:
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh:33:17: error: ‘cudaFreeAsync’ was not declared in this scope; did you mean ‘cudaFreeArray’?
  cargo:warning=   33 |     {   CUDA_OK(cudaFreeAsync(d_ptr, stream));   }
  cargo:warning=      |                 ^~~~~~~~~~~~~
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/exception.cuh:32:24: note: in definition of macro ‘CUDA_OK’
  cargo:warning=   32 |     cudaError_t code = expr;                                \
  cargo:warning=      |                        ^~~~
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh: In constructor ‘pin_t<T>::pin_t(const T*, size_t)’:
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/gpu_t.cuh:230:59: error: ‘cudaHostRegisterReadOnly’ was not declared in this scope; did you mean ‘cudaHostRegisterDefault’?
  cargo:warning=  230 |                                  cudaHostRegisterPortable|cudaHostRegisterReadOnly));
  cargo:warning=      |                                                           ^~~~~~~~~~~~~~~~~~~~~~~~
  cargo:warning=/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6/util/exception.cuh:32:24: note: in definition of macro ‘CUDA_OK’
  cargo:warning=   32 |     cudaError_t code = expr;                                \
  cargo:warning=      |                        ^~~~
  exit status: 1

  --- stderr


  error occurred: Command "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "/home/aditya/.cargo/git/checkouts/sppark-766e0a6fee265121/48f6ac6" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/home/aditya/aleo-prover/target/release/build/sppark-ce7f4abd18d08224/out/src/lib.o" "-c" "--device-c" "src/lib.cpp" with args "nvcc" did not execute successfully (status code exit status: 1).


warning: build failed, waiting for other jobs to finish...
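
For context, the undeclared symbols pinpoint the toolchain: cudaMallocAsync and cudaFreeAsync first shipped with CUDA 11.2's stream-ordered allocator, and cudaHostRegisterReadOnly is similarly recent, so this build is running against an older CUDA toolkit; upgrading the toolkit resolves the error. Alternatively, a version guard along these lines (a sketch, not the upstream fix) lets such a header degrade to synchronous allocation:

#include <cuda_runtime.h>

// Hypothetical guard: fall back to the synchronous allocator on toolkits
// older than CUDA 11.2, where cudaMallocAsync/cudaFreeAsync do not exist.
static void* dmalloc(size_t sz, cudaStream_t stream)
{
    void* d_ptr = nullptr;
#if CUDART_VERSION >= 11020
    cudaMallocAsync(&d_ptr, sz, stream);
#else
    (void)stream;  // no stream-ordered allocator available
    cudaMalloc(&d_ptr, sz);
#endif
    return d_ptr;
}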

scalar is not constant

Hi!
I have noticed that the scalars argument in the pippenger function has a type of const scalar_t*

sppark/msm/pippenger.cuh

Lines 44 to 45 in d967007

void pippenger(const affine_t* points, size_t npoints,
               const scalar_t* scalars, bool mont,

but the elements of scalars are converted to scalar_t and written back in place inside the pippenger function.

sppark/msm/pippenger.cuh

Lines 158 to 176 in d967007

if (mont) {
    uint32_t np = (npoints+WARP_SZ-1) & (0U-WARP_SZ);
#pragma unroll 1
    for (uint32_t i = NTHREADS*bid + tid; i < np; i += NTHREADS*NWINS) {
        scalar_t s = scalars_[i];
        s.from();
        scalars[i] = s;
    }
    cooperative_groups::this_grid().sync();
} else { // if (typeid(scalars) != typeid(scalars_)) {
    uint32_t np = (npoints+WARP_SZ-1) & (0U-WARP_SZ);
#pragma unroll 1
    for (uint32_t i = NTHREADS*bid + tid; i < np; i += NTHREADS*NWINS) {
        scalar_t s = scalars_[i];
        __syncwarp();
        scalars[i] = s;
    }
    cooperative_groups::this_grid().sync();
}

After the pippenger function has executed, the scalars are not the same as before. The type of the scalars parameter may therefore be a little misleading.
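
A caller-side workaround sketch (hypothetical helper, not sppark API): give the kernel a private copy of the scalars so the caller's buffer survives the in-place Montgomery conversion.

#include <cstddef>
#include <vector>

// Hypothetical helper: pippenger() rewrites the (nominally const) scalars
// in place, so forward a scratch copy instead of the caller's buffer.
template<typename scalar_t, typename Launch>
void msm_with_scalar_copy(const scalar_t* scalars, size_t npoints,
                          Launch&& launch)
{
    std::vector<scalar_t> scratch(scalars, scalars + npoints);
    launch(scratch.data());  // e.g. forwards scratch to pippenger()
    // |scalars| is untouched; |scratch| now holds the converted values.
}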

A question about Montgomery mult (form ff/mont_t.cuh)

While studying the implementation of Montgomery multiplication over finite fields, I was very confused by

static inline void mad_n_redc(uint32_t even[], uint32_t odd[],
                              const uint32_t a[], uint32_t bi, bool first=false)
{
    if (first) {
        mul_n(odd, a+1, bi);
        mul_n(even, a,  bi);
    } else {
        asm("add.cc.u32 %0, %0, %1;" : "+r"(even[0]) : "r"(odd[1]));
        madc_n_rshift(odd, a+1, bi);
        cmad_n(even, a, bi);
        asm("addc.u32 %0, %0, 0;" : "+r"(odd[n-1]));
    }

    uint32_t mi = even[0] * M0;

    cmad_n(odd, MOD+1, mi);
    cmad_n(even, MOD,  mi);
    asm("addc.u32 %0, %0, 0;" : "+r"(odd[n-1]));
}

specifically the third-to-last line,

cmad_n(odd, MOD+1, mi);

As you commented on the last line of the cmad_n function ("// return carry flag"), cmad_n produces a carry. But the carry out of the first cmad_n call is evidently not consumed by the second cmad_n call. Why is this overflow not accounted for?

If you believe overflow cannot happen here, please explain why as well. Thanks!

how to understand the implementation of `reduce(uint32_t temp[4])`

The Rust code below is quite easy to understand, i.e., reduce x_hi_hi, then x_hi_lo.

fn reduce128(x: u128) -> GoldilocksField {
    let (x_lo, x_hi) = split(x); // This is a no-op
    let x_hi_hi = x_hi >> 32;
    let x_hi_lo = x_hi & EPSILON;

    let (mut t0, borrow) = x_lo.overflowing_sub(x_hi_hi);
    if borrow {
        branch_hint(); // A borrow is exceedingly rare. It is faster to branch.
        t0 -= EPSILON; // Cannot underflow.
    }
    let t1 = x_hi_lo * EPSILON;
    let t2 = unsafe { add_no_canonicalize_trashing_input(t0, t1) };
    GoldilocksField(t2)
}
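
For reference, the identity both implementations rely on (a sketch of the standard Goldilocks reduction, stated here for convenience rather than taken from the repo's docs): with p = 2^64 - 2^32 + 1 and EPSILON = 2^32 - 1, one has 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p), hence

% Goldilocks reduction of a 128-bit value x
\begin{aligned}
x &= x_{\text{lo}} + 2^{64}\,x_{\text{hi,lo}} + 2^{96}\,x_{\text{hi,hi}} \\
  &\equiv x_{\text{lo}} + (2^{32}-1)\,x_{\text{hi,lo}} - x_{\text{hi,hi}} \pmod{p}
\end{aligned}

which is exactly t0 = x_lo - x_hi_hi followed by adding t1 = x_hi_lo * EPSILON.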

However, the CUDA inline-assembly implementation of the 128-bit reduction below is not easy to understand. Would it be possible to provide some docs? Thanks.

inline void reduce(uint32_t temp[4])
{
    uint32_t carry;
# if __CUDA_ARCH__ >= 700
    asm("sub.cc.u32 %0, %0, %3; subc.cc.u32 %1, %1, %4; subc.u32 %2, 0, 0;"
        : "+r"(temp[0]), "+r"(temp[1]), "=r"(carry)
        : "r"(temp[2]), "r"(temp[3]));
    asm("add.cc.u32 %0, %0, %2; addc.u32 %1, %1, %3;"
        : "+r"(temp[1]), "+r"(carry)
        : "r"(temp[2]), "r"(temp[3]));

    asm("mad.lo.cc.u32 %0, %3, %4, %0; madc.hi.cc.u32 %1, %3, %4, %1; addc.u32 %2, 0, 0;"
        : "+r"(temp[0]), "+r"(temp[1]), "=r"(temp[2])
        : "r"(carry), "r"(gl64_device::W));
    asm("mad.lo.cc.u32 %0, %2, %3, %0; madc.hi.u32 %1, %2, %3, %1;"
        : "+r"(temp[0]), "+r"(temp[1])
        : "r"(temp[2]), "r"(gl64_device::W));
# else
    uint32_t b0, b1;
    asm("add.cc.u32 %0, %2, %3; addc.u32 %1, 0, 0;"
        : "=r"(b0), "=r"(b1)
        : "r"(temp[2]), "r"(temp[3]));
    asm("sub.cc.u32 %0, %0, %3; subc.cc.u32 %1, %1, %4; subc.u32 %2, 0, 0;"
        : "+r"(temp[0]), "+r"(temp[1]), "=r"(carry)
        : "r"(b0), "r"(b1));
    asm("add.cc.u32 %0, %0, %2; addc.u32 %1, %1, %3;"
        : "+r"(temp[0]), "+r"(temp[1])
        : "r"(-carry), "r"(carry));
    asm("add.cc.u32 %0, %0, %1; addc.u32 %1, 0, 0;"
        : "+r"(temp[1]), "+r"(temp[2]));

#  if __CUDA_ARCH__ >= 700
    asm("mad.lo.cc.u32 %0, %2, %3, %0; madc.hi.u32 %1, %2, %3, %1;"
        : "+r"(temp[0]), "+r"(temp[1])
        : "r"(temp[2]), "r"(gl64_device::W));
#  else
    asm("add.cc.u32 %0, %0, %2; addc.u32 %1, %1, 0;"
        : "+r"(temp[0]), "+r"(temp[1])
        : "r"(-temp[2]));
#  endif
# endif
    asm("mov.b64 %0, {%1, %2};" : "=l"(val) : "r"(temp[0]), "r"(temp[1]));
}

Documentation of CUDA MSM limitations

When I call mult_pippenger() from Rust with <= 512 bases I get a segfault: signal: 11, SIGSEGV: invalid memory reference. It would be cool if such limitations were documented.
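
Until then, a defensive caller-side guard is possible; the sketch below uses the 512-base threshold from this report, and both function signatures are hypothetical stand-ins rather than sppark's actual API.

#include <cstddef>

// Hypothetical signatures, for illustration only.
template<typename point_t, typename affine_t, typename scalar_t>
void msm_host(point_t& out, const affine_t* points,
              const scalar_t* scalars, size_t npoints);
template<typename point_t, typename affine_t, typename scalar_t>
void mult_pippenger(point_t& out, const affine_t* points, size_t npoints,
                    const scalar_t* scalars);

// Route small batches to a host-side MSM, since the GPU path reportedly
// segfaults with <= 512 bases.
template<typename point_t, typename affine_t, typename scalar_t>
void msm_dispatch(point_t& out, const affine_t* points,
                  const scalar_t* scalars, size_t npoints)
{
    if (npoints <= 512)
        msm_host(out, points, scalars, npoints);
    else
        mult_pippenger(out, points, npoints, scalars);
}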

The MSM results differ between VariableBaseMSM::multi_scalar_mul and multi_scalar_mult_arkworks when the generated scalars' lengths are not divisible by 32

@dot-asm

I noticed the sppark code for accelerating MSM on GPU; it's great. But I ran into a problem: the msm_correctness test case fails after changing the length of the generated points.

The command is: cargo test --features=bls12_377

The result is: [screenshot of the failing test elided]

The code is:

fn msm_correctness() {
    let test_npow = std::env::var("TEST_NPOW").unwrap_or("15".to_string());
    let npoints_npow = i32::from_str(&test_npow).unwrap();

    let (points, scalars) =
        //util::generate_points_scalars::<G1Affine>(1usize << npoints_npow);
        util::generate_points_scalars::<G1Affine>(237);

    let msm_result = multi_scalar_mult_arkworks(points.as_slice(), unsafe {
        std::mem::transmute::<&[_], &[BigInteger256]>(scalars.as_slice())
    })
    .into_affine();

    let arkworks_result =
        VariableBaseMSM::multi_scalar_mul(points.as_slice(), unsafe {
            std::mem::transmute::<&[_], &[BigInteger256]>(scalars.as_slice())
        })
        .into_affine();

    assert_eq!(msm_result, arkworks_result);
}
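
If the kernel indeed assumes input sizes padded to a multiple of 32 (the warp size), one caller-side workaround is to pad with zero scalars, which cannot change the result because 0·P contributes the group identity. A sketch under that assumption:

#include <cstddef>
#include <vector>

// Hypothetical padding helper: round the input count up to a multiple of
// 32. The duplicated points are paired with zero scalars, so they add the
// group identity and leave the MSM result unchanged.
template<typename affine_t, typename scalar_t>
void pad_msm_inputs(std::vector<affine_t>& points,
                    std::vector<scalar_t>& scalars)
{
    size_t n = points.size();
    size_t padded = (n + 31) & ~size_t(31);
    points.resize(padded, points.back());  // any valid curve point works
    scalars.resize(padded, scalar_t{});    // zero scalar => identity term
}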

fork safety issue

I am trying to use Ray (https://github.com/ray-project/ray) to wrap the RISC Zero prover, which internally uses sppark. The error I encountered is as follows:

(async_join_segment_receipts pid=2928) *** SIGSEGV received at time=1706179489 on cpu 0 *** [repeated 3x across cluster]
(async_join_segment_receipts pid=2928) PC: @     0x7f79a0fd870c  (unknown)  select_gpu() [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)     @     0x7f7edbcb2090  (unknown)  (unknown) [repeated 2x across cluster]
(async_join_segment_receipts pid=2928)     @     0x7f79a0fd57d0        160  sppark_init [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)     @     0x7f79a09a6529  (unknown)  risc0_zkp::hal::cuda::CudaHal$LT$CH$GT$::new::hfb16181478125328 [repeated 4x across cluster]
(async_join_segment_receipts pid=2928) [2024-01-25 02:44:49,148 E 2928 2928] logging.cc:361: *** SIGSEGV received at time=1706179489 on cpu 0 *** [repeated 3x across cluster]
(async_join_segment_receipts pid=2928) [2024-01-25 02:44:49,148 E 2928 2928] logging.cc:361: PC: @     0x7f79a0fd870c  (unknown)  select_gpu() [repeated 3x across cluster]
(async_join_segment_receipts pid=2928) [2024-01-25 02:44:49,148 E 2928 2928] logging.cc:361:     @     0x7f7edbcb2090  (unknown)  (unknown) [repeated 2x across cluster]
(async_join_segment_receipts pid=2928) [2024-01-25 02:44:49,148 E 2928 2928] logging.cc:361:     @     0x7f79a0fd57d0        160  sppark_init [repeated 3x across cluster]
(async_join_segment_receipts pid=2928) [2024-01-25 02:44:49,148 E 2928 2928] logging.cc:361:     @     0x7f79a09a6529  (unknown)  risc0_zkp::hal::cuda::CudaHal$LT$CH$GT$::new::hfb16181478125328 [repeated 4x across cluster]
(async_join_segment_receipts pid=2928) Fatal Python error: Segmentation fault [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)  [repeated 3x across cluster]
(async_join_segment_receipts pid=2928) Stack (most recent call first): [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)   File "ray_test.py", line 29 in async_join_segment_receipts [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)   File "/home/ubuntu/.pyenv/versions/pyo3/lib/python3.8/site-packages/ray/_private/worker.py", line 847 in main_loop [repeated 3x across cluster]
(async_join_segment_receipts pid=2928)   File "/home/ubuntu/.pyenv/versions/pyo3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 282 in <module> [repeated 3x across cluster]

The expectation is that Ray forks processes, and therefore, if a parent process has already initialized the GPUs, the child would use the state already initialized by the parent. However, NVIDIA's stack does not work when this happens; the contexts and many other data structures are specific to the process that created them.

I am still checking whether this is the reason for the error.
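
For what it's worth, a common mitigation pattern (a sketch; it does not reproduce sppark's actual sppark_init/select_gpu logic) is to defer all CUDA initialization until after the fork and to re-initialize per process:

#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical per-process guard: a CUDA context does not survive fork(),
// so track the pid and initialize lazily in whichever process actually
// does GPU work. This only helps if the parent makes no CUDA calls before
// forking; a context created pre-fork is unusable in the child.
static pid_t cuda_owner_pid = 0;

static bool ensure_cuda_initialized()
{
    pid_t pid = getpid();
    if (cuda_owner_pid != pid) {
        if (cudaFree(nullptr) != cudaSuccess)  // idiom to force context init
            return false;
        cuda_owner_pid = pid;
    }
    return true;
}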

Race condition in msm/sort.cuh

There is a race condition between lines 267 and 278 in the msm/sort.cuh file. If a warp with a larger warpid executes line 278 before the warp with warpid=0 executes line 267, the calculation result will be wrong. In practice, on the Ampere architecture, the warp scheduler's scheduling strategy makes it impossible for a warp with id>0 to execute line 278 before the warp with id=0 executes line 267, so the test always passes. However, there is indeed a logical error.

Compilation error (observed from pasta-msm)

TL;DR: compilation error on a cargo check

I upgraded nvcc to 12.3 according to the hints in supranational/pasta-msm@8ccdc45 and got a compilation error on a simple cargo check:

The output contains several instances of the following warning:

warning: [email protected]: /usr/include/c++/13.2.1/bits/std_function.h: In instantiation of ‘std::function<_Res(_ArgTypes ...)>::_Requires<std::function<_Res(_ArgTypes ...)>::_Callable<_Functor>, std::function<_Res(_ArgTypes ...)>&> std::function<_Res(_ArgTypes ...)>::operator=(_Functor&&) [with _Functor = std::function<void()>&; _Res = void; _ArgTypes = {}; _R
equires<_Callable<_Functor>, std::function<_Res(_ArgTypes ...)>&> = std::function<void()>&; typename std::enable_if<(! std::is_same<typename std::remove_cv<typename std::remove_reference<_Tuple>::type>::type, std::function<_Res(_ArgTypes ...)> >::value), std::decay<_Func> >::type::type = std::function<void()>; typename std::enable_if<(! std::is_same<typename s
td::remove_cv<typename std::remove_reference<_Tuple>::type>::type, std::function<_Res(_ArgTypes ...)> >::value), std::decay<_Func> >::type = std::decay<std::function<void()>&>; typename std::remove_cv<typename std::remove_reference<_Tuple>::type>::type = std::function<void()>&; typename std::remove_reference<_Tuple>::type = std::function<void()>&]’:
warning: [email protected]: /home/huitseeker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sppark-0.1.5/sppark/util/thread_pool_t.hpp:164:24:   required from here
warning: [email protected]: /usr/include/c++/13.2.1/bits/std_function.h:534:44: error: call of overloaded ‘forward<std::function<void()>&>(std::function<void()>&)’ is ambiguous
warning: [email protected]:   534 |           function(std::forward<_Functor>(__f)).swap(*this);
warning: [email protected]:       |                   ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
warning: [email protected]: /usr/include/c++/13.2.1/bits/move.h:70:1: note: candidate: ‘constexpr _Tp&& std::forward(typename remove_reference<_Functor>::type&) [with _Tp = function<void()>&; typename remove_reference<_Functor>::type = function<void()>&]’
warning: [email protected]:    70 |     forward(typename std::remove_reference<_Tp>::type& __t) noexcept
warning: [email protected]:       | ^   ~~~
warning: [email protected]: /usr/include/c++/13.2.1/bits/move.h:82:1: note: candidate: ‘constexpr _Tp&& std::forward(typename remove_reference<_Functor>::type&&) [with _Tp = function<void()>&; typename remove_reference<_Functor>::type = function<void()>&]’
warning: [email protected]:    82 |     forward(typename std::remove_reference<_Tp>::type&& __t) noexcept
warning: [email protected]:       | ^   ~~~
warning: [email protected]: /usr/include/c++/13.2.1/bits/vector.tcc: In instantiation of ‘std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::emplace_back(_Args&& ...) [with _Args = {std::thread&}; _Tp = std::thread; _Alloc = std::allocator<std::thread>; reference = std::thread&]’:
warning: [email protected]: /usr/include/c++/13.2.1/bits/stl_vector.h:1296:15:   required from ‘void std::vector<_Tp, _Alloc>::push_back(value_type&&) [with _Tp = std::thread; _Alloc = std::allocator<std::thread>; value_type = std::thread]’
warning: [email protected]: /home/huitseeker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sppark-0.1.5/sppark/util/thread_pool_t.hpp:79:20:   required from here
...

Full output below.

Details

Version info
huitseeker@sleipnir➜huitseeker/tmp/pasta-msm(main)» nvcc --version                                                                                                                                                                   [10:04:07]
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
huitseeker@sleipnir➜huitseeker/tmp/pasta-msm(main)» gcc --version                                                                                                                                                                    [10:06:00]
gcc (GCC) 13.2.1 20230730
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
huitseeker@sleipnir➜huitseeker/tmp/pasta-msm(main)» uname -a                                                                                                                                                                         [10:06:30]
Linux sleipnir 6.1.24-1-lts #1 SMP PREEMPT_DYNAMIC Thu, 13 Apr 2023 17:22:35 +0000 x86_64 GNU/Linux

Trying to combine MSM and NTT into single kernel

I'm trying to combine MSM and NTT into a single kernel, using the code from the PoCs as inspiration. The problem I'm currently hitting is that if I include ec/jacobian_t.hpp, I can no longer compile the NTT code. See below for the errors I'm getting.

It's easy to reproduce: just add #include <ec/jacobian_t.hpp> to the NTT PoC.

   Compiling ntt-cuda v0.1.0 (/home/vmx/src/pl/filecoin/upstream/sppark/poc/ntt-cuda)
The following warnings were emitted during compilation:

warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2144): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__hdiv") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2145): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__hdiv") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2259): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_trig_reduction_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2260): error: calling a __device__ function("__float_as_uint") from a __host__ function("__internal_trig_reduction_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2261): error: calling a __device__ function("__fsub_rn") from a __host__ function("__internal_trig_reduction_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2262): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_trig_reduction_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2263): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_trig_reduction_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2297): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_sin_cos_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2298): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_sin_cos_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2299): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_sin_cos_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2300): error: calling a __device__ function("__fmaf_rn") from a __host__ function("__internal_sin_cos_kernel") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/driver_abi.h(60): error: calling a __device__ function("__trap") from a __host__ function("load_env_reg") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(69): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690814__barrier_syncE1?") from a __host__ function("sync_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(80): error: calling a __device__ function("__threadfence") from a __host__ function("sync_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(82): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069089atomicAddE1?1?") from a __host__ function("sync_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(86): error: calling a __device__ function("__threadfence") from a __host__ function("sync_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(98): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690814__barrier_syncE1?") from a __host__ function("sync_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(107): error: calling a __device__ function("__threadfence_block") from a __host__ function("atom_or_acq_rel_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(108): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069088atomicOrE1?1?") from a __host__ function("atom_or_acq_rel_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(118): error: calling a __device__ function("__threadfence_block") from a __host__ function("red_or_release_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(119): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069088atomicOrE1?1?") from a __host__ function("red_or_release_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(128): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069089atomicAndE1?1?") from a __host__ function("red_and_relaxed_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(138): error: calling a __device__ function("__threadfence_block") from a __host__ function("red_and_release_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(139): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069089atomicAndE1?1?") from a __host__ function("red_and_release_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(150): error: calling a __device__ function("__threadfence_block") from a __host__ function("ld_acquire_cta") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(175): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync_warps") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(187): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync_warps") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(198): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync_warps_last_releases") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(204): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690811__shfl_syncE1?1?1?1?") from a __host__ function("sync_warps_last_releases") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(229): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync_warps_arrive") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/sync.h(255): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync_warps_reset") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(87): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690814__barrier_syncE1?") from a __host__ function("sync") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(283): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690811__shfl_syncE1?1?1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(292): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690816__shfl_down_syncE1?1?1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(301): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690814__shfl_up_syncE1?1?1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(310): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690815__shfl_xor_syncE1?1?1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(418): error: calling a __device__ function("__clusterGridDimInClusters()") from a __host__ function("dim_clusters") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(427): error: calling a __device__ function("__clusterIdx()") from a __host__ function("cluster_index") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(453): error: calling a __device__ function("cudaCGGetIntrinsicHandle") from a __host__ function("get_intrinsic_handle") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(458): error: calling a __device__ function("cudaCGSynchronize") from a __host__ function("sync") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(464): error: calling a __device__ function("cudaCGGetSize") from a __host__ function("size") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(471): error: calling a __device__ function("cudaCGGetRank") from a __host__ function("thread_rank") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(478): error: calling a __device__ function("cudaCGGetRank") from a __host__ function("grid_rank") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(485): error: calling a __device__ function("cudaCGGetSize") from a __host__ function("num_grids") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(501): error: a static "__constant__" variable declaration is not allowed inside a host function body
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(521): error: calling a __device__ function("__clusterDimIsSpecified()") from a __host__ function("isReal") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(526): error: calling a __device__ function("__cluster_barrier_arrive()") from a __host__ function("barrier_arrive") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(531): error: calling a __device__ function("__cluster_barrier_wait()") from a __host__ function("barrier_wait") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(542): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690827__cluster_query_shared_rankE1?") from a __host__ function("query_shared_rank") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(553): error: calling a __device__ function("__clusterRelativeBlockIdx()") from a __host__ function("block_index") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(558): error: calling a __device__ function("__clusterRelativeBlockRank()") from a __host__ function("block_rank") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(568): error: calling a __device__ function("__clusterDim()") from a __host__ function("dim_blocks") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(573): error: calling a __device__ function("__clusterSizeInBlocks()") from a __host__ function("num_blocks") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/helpers.h(616): error: calling a __device__ function("__trap") from a __host__ function("abort") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(262): error: calling a __device__ function("__trap") from a __host__ function("sync") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(340): error: calling a __device__ function("__trap") from a __host__ function("sync") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(598): error: calling a __device__ function("__popc") from a __host__ function("_get_tiled_threads") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(749): error: calling a __device__ function("__popc") from a __host__ function("coalesced_group") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(772): error: calling a __device__ function("__popc") from a __host__ function("thread_rank") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(786): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(792): error: calling a __device__ function("__ffs") from a __host__ function("shfl") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(793): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(806): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl_down") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(822): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl_up") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(862): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("any") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(865): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("all") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(869): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("ballot") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(871): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("ballot") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(903): error: calling a __device__ function("__activemask()") from a __host__ function("coalesced_threads") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(1410): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__any_syncE1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(1421): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__all_syncE1?1?") from a __host__ function("operator()") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(66): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("_binary_partition") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(85): error: calling a __device__ function("__ffs") from a __host__ function("_labeled_partition") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(93): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("_labeled_partition") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(94): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("_labeled_partition") is not allowed
warning: 
warning: /usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(96): error: calling a __device__ function("__popc") from a __host__ function("_labeled_partition") is not allowed
warning: 
warning: /home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(14): error: calling a __device__ function("__brev") from a __host__ function("bit_rev") is not allowed
warning: 
warning: /home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(16): error: calling a __device__ function("__brevll") from a __host__ function("bit_rev") is not allowed
warning: 
warning: /home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(32): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690815__shfl_xor_syncE1?1?1?1?") from a __host__ function("shfl_bfly") is not allowed
warning: 
warning: 79 errors detected in the compilation of "cuda/ntt_api.cu".

error: failed to run custom build command for `ntt-cuda v0.1.0 (/home/vmx/src/pl/filecoin/upstream/sppark/poc/ntt-cuda)`

Caused by:
  process didn't exit successfully: `/home/vmx/src/pl/filecoin/upstream/sppark/poc/ntt-cuda/target/debug/build/ntt-cuda-e41d1a0ded9a429a/build-script-build` (exit status: 1)
  --- stdout
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("0")
  HOST = Some("x86_64-unknown-linux-gnu")
  cargo:rerun-if-env-changed=CXX_x86_64-unknown-linux-gnu
  CXX_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_x86_64_unknown_linux_gnu
  CXX_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = None
  cargo:rerun-if-env-changed=NVCC_x86_64-unknown-linux-gnu
  NVCC_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=NVCC_x86_64_unknown_linux_gnu
  NVCC_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_NVCC
  HOST_NVCC = None
  cargo:rerun-if-env-changed=NVCC
  NVCC = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64-unknown-linux-gnu
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64_unknown_linux_gnu
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("true")
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O0" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-G" "-Xcompiler" "-gdwarf-4" "-Xcompiler" "-fno-omit-frame-pointer" "-m64" "-I" "/home/vmx/.cargo/registry/src/github.com-1ecc6299db9ec823/blst-0.3.10/blst/src" "-I" "/home/vmx/src/pl/filecoin/upstream/sppark" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-arch=sm_70" "-Xcompiler" "-Wno-unused-parameter" "-Xcompiler" "-Wno-subobject-linkage" "-DTAKE_RESPONSIBILITY_FOR_ERROR_MESSAGE" "-DFEATURE_BLS12_381" "-o" "/home/vmx/src/pl/filecoin/upstream/sppark/poc/ntt-cuda/target/debug/build/ntt-cuda-5f054fafddbbbff8/out/cuda/ntt_api.o" "-c" "cuda/ntt_api.cu"
  [cargo:warning= lines elided; they repeat, verbatim, the compiler errors already listed above]
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(262): error: calling a __device__ function("__trap") from a __host__ function("sync") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(340): error: calling a __device__ function("__trap") from a __host__ function("sync") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(598): error: calling a __device__ function("__popc") from a __host__ function("_get_tiled_threads") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(749): error: calling a __device__ function("__popc") from a __host__ function("coalesced_group") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(772): error: calling a __device__ function("__popc") from a __host__ function("thread_rank") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(786): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__syncwarpE1?") from a __host__ function("sync") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(792): error: calling a __device__ function("__ffs") from a __host__ function("shfl") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(793): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(806): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl_down") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(822): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("shfl_up") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(862): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("any") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(865): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("all") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(869): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("ballot") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(871): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("ballot") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(903): error: calling a __device__ function("__activemask()") from a __host__ function("coalesced_threads") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(1410): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__any_syncE1?1?") from a __host__ function("operator()") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups.h(1421): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690810__all_syncE1?1?") from a __host__ function("operator()") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(66): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("_binary_partition") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(85): error: calling a __device__ function("__ffs") from a __host__ function("_labeled_partition") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(93): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690813__ballot_syncE1?1?") from a __host__ function("_labeled_partition") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(94): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc069085__fnsE1?1?1?") from a __host__ function("_labeled_partition") is not allowed
  cargo:warning=
  cargo:warning=/usr/local/cuda/bin/../targets/x86_64-linux/include/cooperative_groups/details/partitioning.h(96): error: calling a __device__ function("__popc") from a __host__ function("_labeled_partition") is not allowed
  cargo:warning=
  cargo:warning=/home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(14): error: calling a __device__ function("__brev") from a __host__ function("bit_rev") is not allowed
  cargo:warning=
  cargo:warning=/home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(16): error: calling a __device__ function("__brevll") from a __host__ function("bit_rev") is not allowed
  cargo:warning=
  cargo:warning=/home/vmx/src/pl/filecoin/upstream/sppark/ntt/kernels.cu(32): error: calling a __device__ function("_ZN41_INTERNAL_f2530303_10_ntt_api_cu_fcc0690815__shfl_xor_syncE1?1?1?1?") from a __host__ function("shfl_bfly") is not allowed
  cargo:warning=
  cargo:warning=79 errors detected in the compilation of "cuda/ntt_api.cu".
  exit status: 2

  --- stderr


  error occurred: Command "nvcc" "-ccbin=c++" "-Xcompiler" "-O0" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-G" "-Xcompiler" "-gdwarf-4" "-Xcompiler" "-fno-omit-frame-pointer" "-m64" "-I" "/home/vmx/.cargo/registry/src/github.com-1ecc6299db9ec823/blst-0.3.10/blst/src" "-I" "/home/vmx/src/pl/filecoin/upstream/sppark" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-arch=sm_70" "-Xcompiler" "-Wno-unused-parameter" "-Xcompiler" "-Wno-subobject-linkage" "-DTAKE_RESPONSIBILITY_FOR_ERROR_MESSAGE" "-DFEATURE_BLS12_381" "-o" "/home/vmx/src/pl/filecoin/upstream/sppark/poc/ntt-cuda/target/debug/build/ntt-cuda-5f054fafddbbbff8/out/cuda/ntt_api.o" "-c" "cuda/ntt_api.cu" with args "nvcc" did not execute successfully (status code exit status: 2).

vec256 type declaration

Hi,
I am trying to build the NTT part of sppark as an independent C/C++ project. However, the type declaration/definition for vec256 is nowhere to be found under the sppark/ tree. I even searched the codebase of [blst](https://github.com/supranational/blst/tree/master) and couldn't find it there either.

Building sppark with cargo is no problem at all. However, as mentioned, I am trying to build the NTT portion independently as a C/C++/CUDA project, and without a definition for vec256 I cannot get a successful build.
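
For what it's worth, a minimal stand-in, assuming vec256 is simply a 256-bit value stored as four 64-bit limbs (blst's limb layout on 64-bit targets), might look like the Rust sketch below; in C this would be typedef uint64_t vec256[4];. This is an illustration only, not the authoritative upstream definition.

// Hypothetical stand-in: a 256-bit value as four 64-bit limbs.
// Illustration only, not the authoritative upstream definition.
#[allow(non_camel_case_types)]
pub type vec256 = [u64; 4];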

I would greatly appreciate your response.

Linker issue with blst-msm

I get an error like

  = note: /usr/bin/ld: /tmp/blstmsmlinkerissue/target/debug/deps/libblst_msm-f154847bd3acdcfa.rlib(blst_msm-f154847bd3acdcfa.blst_msm.14511a1f-cgu.6.rcgu.o): in function `blst_msm::multi_scalar_mult':
          /home/vmx/.cargo/git/checkouts/sppark-766e0a6fee265121/f939f36/poc/blst-cuda/src/lib.rs:49: undefined reference to `mult_pippenger'
          collect2: error: ld returned 1 exit status
          
  = note: some `extern` functions couldn't be found; some native libraries may need to be installed or have their path specified
  = note: use the `-l` flag to specify native libraries to link
  = note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)

when trying to use blst-msm as a dependency. Steps to reproduce (you can copy and paste this into your terminal and execute it directly):

cargo init blstmsmlinkerissue
cd blstmsmlinkerissue
cargo add blst
cargo add blst-msm --git https://github.com/supranational/sppark --features bls12_381
cat > src/main.rs << EOF
use blst::{blst_scalar, blst_p1_affine, blst_fp};

fn main() {
    // multi_scalar_mult takes the point slice first and the scalars second.
    let scalars = vec![blst_scalar { b: [0; 32] }];
    let points = vec![blst_p1_affine { x: blst_fp { l: [0; 6] }, y: blst_fp { l: [0; 6] } }];
    let point = blst_msm::multi_scalar_mult(&points, &scalars);
    println!("point: {:?}", point);
}
EOF
cargo build 

I'm on Debian.

Clean up MSM poc

This repository contains a proof of concept for using the MSM functionality from Rust. It lives in a directory called blst-cuda, although the crate is called blst-msm, and that PoC only works with arkworks types, not with blst(rs) types. All this is confusing. Hence I propose:

  1. renaming blst-cuda to arkworks-msm
  2. renaming the crate from blst-msm to arkworks-msm
  3. creating a new crate called blst-msm within a directory of the same name; that crate would work with blst types.

This issue was triggered by #4.

what is the definition of `const uint32_t& M0` in mont_t.cuh

Is there any documentation on how to understand M0 in the Montgomery reduction, or on which Montgomery reduction algorithm is implemented? To my shallow understanding of the usual Montgomery implementation, there is no M0 involved. Maybe you are using another trick or a different algorithm here?
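
If the code follows the common word-by-word (CIOS-style) Montgomery reduction, M0 would be the precomputed constant -p^(-1) mod 2^32, where p is the modulus and 2^32 is the word size: each reduction step multiplies the current low word by M0 to find the multiple of p that zeroes it out. Below is a sketch of how such a constant is typically derived, assuming that convention (mont_m0 is a hypothetical helper name, not sppark's API):

// Sketch: derive M0 = -p^{-1} mod 2^32 for an odd low modulus word p0,
// via Newton iteration; each step doubles the number of correct bits.
fn mont_m0(p0: u32) -> u32 {
    assert!(p0 & 1 == 1);           // the modulus must be odd
    let mut inv = p0;               // correct modulo 2^3 for any odd p0
    for _ in 0..4 {                 // 3 -> 6 -> 12 -> 24 -> 48 bits
        inv = inv.wrapping_mul(2u32.wrapping_sub(p0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()              // negate to get -p^{-1} mod 2^32
}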

WebGPU support

Instead of building for one platform alone, WebGPU provides a wrapper that is compatible with Vulkan, Metal, and more. It would also allow the library to be called from the web, supporting more web-based GPU acceleration and opening up more edge-processing setups.

Is this something that would be worthwhile to build for sppark?

sppark/ff/mont_t.cuh(721): error: identifier "i" is undefined

  • c++ -DSECTOR_SIZE_512MiB -g -O2 -Ideps/spdk-v22.09/include -Ideps/spdk-v22.09/isa-l/.. -Ideps/spdk-v22.09/dpdk/build/include -D__ADX__ -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -fno-strict-aliasing -march=native -Wformat -Wformat-security -D_GNU_SOURCE -fPIC -fstack-protector -fno-common -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -DSPDK_GIT_COMMIT=4be6d3043 -pthread -Ideps/spdk-v22.09/include -Ideps/spdk-v22.09/isa-l/.. -Ideps/spdk-v22.09/dpdk/build/include -Ideps/sppark -Ideps/sppark/util -Ideps/blst/src -c sealing/supra_seal.cpp -o obj/supra_seal.o -Wno-subobject-linkage
    deps/sppark/ff/mont_t.cuh(721): error: identifier "i" is undefined
    even[i] = even[i+1] = 0;
    ^

deps/sppark/ff/mont_t.cuh(729): error: expected a ">"
for (i = 0; i < n; i += 2) {
^

deps/sppark/ff/mont_t.cuh(735): error: expression must be a modifiable lvalue
: "+r"(odd[n+i-1]), "+r"(carry));
^

deps/sppark/ff/mont_t.cuh(735): error: an asm operand must have scalar type
: "+r"(odd[n+i-1]), "+r"(carry));
^

deps/sppark/ff/mont_t.cuh(742): error: expression must be a modifiable lvalue
: "+r"(even[n+i+1]), "+r"(carry));
^

deps/sppark/ff/mont_t.cuh(742): error: an asm operand must have scalar type
: "+r"(even[n+i+1]), "+r"(carry));
^

6 errors detected in the compilation of "pc2/cuda/pc2.cu".

  • ar rvs obj/libsupraseal.a obj/pc1.o obj/pc2.o obj/pc2_link.o obj/ring_t.o obj/streaming_node_reader_nvme.o obj/supra_seal.o obj/sector_parameters.o obj/sha_ext_mbx2.o
    ar: creating obj/libsupraseal.a
    a - obj/pc1.o
    ar: obj/pc2.o: No such file or directory

Rust binding planned?

This is more of a question than an issue, but I ask it here to prevent duplicate effort.

Are there plans to create a Rust binding for sppark, similar to the binding blst has? Or is the plan that Rust users would always create their own custom crate similar to the PoCs or pasta-msm?

Support halo2curves

May I ask whether sppark plans to support [halo2curves](https://github.com/privacy-scaling-explorations/halo2curves)?

Upcoming error in rust crate

Revealed in 06564db, though the error might have been there before (perhaps the macro was unused).

huitseeker@tencendur➜tmp/sppark/rust(main)» cargo +1.60.0 check                                                                                                                                                                     [16:42:30]
   Compiling libc v0.2.147
   Compiling either v1.8.1
   Compiling cc v1.0.79
   Compiling which v4.4.0
   Compiling sppark v0.1.3 (/home/huitseeker/tmp/sppark/rust)
error[E0308]: mismatched types
  --> src/lib.rs:26:59
   |
26 |             let c_str = unsafe { std::ffi::CStr::from_ptr(str.as_ptr()) };
   |                                                           ^^^^^^^^^^^^ expected `u8`, found `i8`
   |
   = note: expected raw pointer `*const u8`
              found raw pointer `*mut i8`

For more information about this error, try `rustc --explain E0308`.
error: could not compile `sppark` due to previous error
huitseeker@tencendur➜tmp/sppark/rust(main)» cargo +1.68.2 check                                                                                                                                                                     [16:42:49]
   Compiling sppark v0.1.3 (/home/huitseeker/tmp/sppark/rust)
error[E0308]: mismatched types
   --> src/lib.rs:26:59
    |
26  |             let c_str = unsafe { std::ffi::CStr::from_ptr(str.as_ptr()) };
    |                                  ------------------------ ^^^^^^^^^^^^ expected `u8`, found `i8`
    |                                  |
    |                                  arguments to this function are incorrect
    |
    = note: expected raw pointer `*const u8`
               found raw pointer `*mut i8`
note: associated function defined here
   --> /home/huitseeker/.rustup/toolchains/1.68.2-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ffi/c_str.rs:256:25
    |
256 |     pub const unsafe fn from_ptr<'a>(ptr: *const c_char) -> &'a CStr {
    |                         ^^^^^^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `sppark` due to previous error
huitseeker@tencendur➜tmp/sppark/rust(main)» cargo +1.70.0 check                                                                                                                                                                     [16:42:58]
   Compiling sppark v0.1.3 (/home/huitseeker/tmp/sppark/rust)
error[E0308]: mismatched types
  --> src/lib.rs:26:59
   |
26 |             let c_str = unsafe { std::ffi::CStr::from_ptr(str.as_ptr()) };
   |                                  ------------------------ ^^^^^^^^^^^^ expected `*const u8`, found `*mut i8`
   |                                  |
   |                                  arguments to this function are incorrect
   |
   = note: expected raw pointer `*const u8`
              found raw pointer `*mut i8`
note: associated function defined here
  --> /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ffi/c_str.rs:260:25

For more information about this error, try `rustc --explain E0308`.
error: could not compile `sppark` (lib) due to previous error
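
The failing toolchains above are aarch64, where c_char is u8 rather than the i8 it is on x86_64, which is why the *mut i8 pointer no longer matches. One portable fix, offered as a sketch rather than necessarily how upstream resolved it, is to cast through the platform's c_char:

// Sketch of a portable cast; `str` is the i8 buffer from the
// surrounding code. c_char is i8 on x86_64 and u8 on aarch64, so
// casting through it compiles on both.
let c_str = unsafe {
    std::ffi::CStr::from_ptr(str.as_ptr() as *const std::os::raw::c_char)
};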

NTT curve support

// Maximum domain size supported. Can be adjusted at will, but with the
// target field in mind. Most fields handle up to 2^32 elements, BLS12-377
// can handle up to 2^47, alt_bn128 - 2^28...

I noted that there are comments about an NTT implementation for the alt_bn128 curve, but it is currently not implemented. Is there anything special (parameters) to cope with compared with BLS12-377 and BLS12-381? Will this part of the implementation be published soon? (See the sketch below for where the quoted limits come from.)
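
For context, the ceilings quoted in that comment come from the two-adicity of each scalar field: the largest s such that 2^s divides p - 1, which bounds the power-of-two NTT domain sizes the field supports (s = 47 for BLS12-377, 32 for BLS12-381, 28 for alt_bn128). A toy sketch of the computation, using the 64-bit Goldilocks modulus purely for illustration:

// Two-adicity of a field: the number of trailing zero bits of p - 1.
// For the Goldilocks prime p = 2^64 - 2^32 + 1, we have
// p - 1 = 2^32 * (2^32 - 1), hence two-adicity 32.
fn two_adicity(p_minus_1: u64) -> u32 {
    p_minus_1.trailing_zeros()
}

fn main() {
    assert_eq!(two_adicity(0xffffffff_00000000), 32);
}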

gl64, the + result is incorrect when both operands > p

For example, with GL64_PARTIALLY_REDUCED not defined:

let mut a: u64 = 0xffffffff_f0000fff;
let mut b: u64 = 0xffffffff_f0000fff;

The expected output should be 0x1e0001ffc; however, the actual result is 0xffffffffe0001ffe, which after reduction becomes 0xe0001ffd, so the results do not match.
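
The numbers check out: with p = 2^64 - 2^32 + 1, the carry out of a plain 64-bit add is lost, and a single conditional subtraction of p cannot recover it. A small reference check (a sketch using 128-bit arithmetic, not sppark's code):

// Reference check of the report above; the reduction is done in 128-bit
// arithmetic so the carry out of bit 63 survives.
const P: u128 = 0xffffffff_00000001;

fn main() {
    let a: u64 = 0xffffffff_f0000fff;
    let b: u64 = 0xffffffff_f0000fff;

    // Correct sum modulo p: 0x1e0001ffc, as expected in the report.
    let expected = ((a as u128 + b as u128) % P) as u64;
    assert_eq!(expected, 0x1_e0001ffc);

    // A wrapping 64-bit add drops the carry; one subtraction of p then
    // yields the mismatched 0xe0001ffd from the report.
    let wrapped = a.wrapping_add(b);
    assert_eq!(wrapped, 0xffffffff_e0001ffe);
    assert_eq!(wrapped - P as u64, 0xe0001ffd);
}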

failed to run custom build command for `sppark v0.1.5`

error occurred: Command "nvcc" "-ccbin=C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64\cl.exe" "-Xcompiler" "-nologo" "-Xcompiler" "-MD" "-G" "-Xcompiler" "-Z7" "-Xcompiler" "-Brepro" "-I" "D:\hash\aleo\sppark-0.1.5" "-Xcompiler" "-W4" "-o" "D:\hash\aleo\sppark-0.1.5\rust\target\debug\build\sppark-9d1ea8dff2ab2e4d\out\src/lib.o" "-c" "src/lib.cpp" with args "nvcc" did not execute successfully (status code exit code: 2).
Windows 10, nvcc = 12.3, gcc = 13.2, rustc = 1.75.0, GPU = RTX 4060.

pasta_curve parameters

I am wondering how I can get the parameters of the Pasta curves, such as forward_roots_of_unity, inverse_roots_of_unity, and domain_size_inverse, since my project is based on the Pasta curves.
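
For reference, such tables are conventionally derived from a generator g of the field's multiplicative group: forward_roots_of_unity[i] = g^((p-1)/2^i), inverse_roots_of_unity[i] is its modular inverse, and domain_size_inverse[i] = (2^i)^(-1) mod p. The following toy sketch uses the small prime 257 with primitive root 3; it illustrates the convention only and is not sppark's actual Pasta tables (the real parameters are 255-bit and need a bignum type):

// Toy derivation over p = 257 (two-adicity 8, primitive root 3).
fn pow_mod(mut b: u64, mut e: u64, p: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % p; }
        b = b * b % p;
        e >>= 1;
    }
    acc
}

fn main() {
    let (p, g) = (257u64, 3u64);
    for i in 0..=8u64 {
        let root = pow_mod(g, (p - 1) >> i, p);             // forward_roots_of_unity[i]
        let root_inv = pow_mod(root, p - 2, p);             // inverse_roots_of_unity[i]
        let size_inv = pow_mod(pow_mod(2, i, p), p - 2, p); // domain_size_inverse[i]
        assert_eq!(root * root_inv % p, 1);
        println!("i={i}: root={root} root_inv={root_inv} size_inv={size_inv}");
    }
}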
