ec-gpu's Introduction

ec-gpu & ec-gpu-gen

crates.io Documentation Build Status minimum rustc 1.51 dependency status

CUDA/OpenCL code generator for finite-field arithmetic over prime fields and elliptic curve arithmetic, written in Rust.

Notes:

  • Limbs are 32 or 64 bits long, as you choose (on CUDA only 32-bit limbs are supported).
  • The library assumes that the most significant bit of your prime-field modulus is unset. This allows for cheap reductions.
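To see why an unset top bit makes reductions cheap, consider a toy 32-bit field: if p < 2^31, the sum of two reduced elements never overflows the limb, so a single conditional subtraction suffices. A hypothetical sketch (the modulus below is illustrative, not one of the library's fields):

```rust
// Toy modulus with the most significant bit of a u32 unset (p < 2^31).
// The value is illustrative, not one of the library's fields.
const P: u32 = 0x7fff_fff1;

// Because a, b < p < 2^31, the sum a + b < 2^32 never overflows a u32
// limb, and at most one subtraction of p brings it back into [0, p).
fn add_mod(a: u32, b: u32) -> u32 {
    let sum = a + b; // cannot overflow: a + b < 2p < 2^32
    if sum >= P { sum - P } else { sum }
}

fn main() {
    assert_eq!(add_mod(P - 1, 1), 0);
    assert_eq!(add_mod(P - 1, 2), 1);
    assert_eq!(add_mod(5, 7), 12);
}
```

If the top bit were set, a + b could exceed the limb width, and every addition would need carry handling before the reduction.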

Usage

Quickstart

Generating CUDA/OpenCL code for blstrs Scalar elements:

use blstrs::Scalar;
use ec_gpu_gen::SourceBuilder;

let source = SourceBuilder::new()
    .add_field::<Scalar>()
    .build_64_bit_limbs();

Integration into your library

This crate usually creates the GPU kernels at compile-time. For CUDA it generates a fatbin; for OpenCL it only generates the source code, which is then compiled at run-time.

In order to make things easier to use, there are helper functions available. You put some code into build.rs that generates the kernels, and some code into your library that consumes those generated kernels. The kernels are embedded directly into your program/library, so if something goes wrong you will get an error at compile-time.

In this example we will make use of the FFT functionality. Add to your build.rs:

use blstrs::Scalar;
use ec_gpu_gen::SourceBuilder;

fn main() {
    let source_builder = SourceBuilder::new().add_fft::<Scalar>();
    ec_gpu_gen::generate(&source_builder);
}

ec_gpu_gen::generate() takes care of the actual code generation/compilation. It automatically creates a CUDA and/or OpenCL kernel and defines two environment variables, which are meant for internal use: _EC_GPU_CUDA_KERNEL_FATBIN, which points to the compiled CUDA kernel, and _EC_GPU_OPENCL_KERNEL_SOURCE, which points to the generated OpenCL source.

Those variables are then picked up by the ec_gpu_gen::program!() macro, which generates a program for a given GPU device. Using the FFT within your library would then look like this:

use blstrs::Scalar;
use ec_gpu_gen::fft::FftKernel;
use ec_gpu_gen::rust_gpu_tools::Device;

let devices = Device::all();
let programs = devices
    .iter()
    .map(|device| ec_gpu_gen::program!(device))
    .collect::<Result<_, _>>()
    .expect("Cannot create programs!");

let mut kern = FftKernel::<Scalar>::create(programs).expect("Cannot initialize kernel!");
kern.radix_fft_many(&mut [&mut coeffs], &[omega], &[log_d]).expect("GPU FFT failed!");
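For readers unfamiliar with what each FFT in radix_fft_many computes, here is a hypothetical CPU sketch of a radix-2 NTT over a toy prime field (p = 17, size 4), cross-checked against naive polynomial evaluation; it is not the library's GPU implementation:

```rust
const P: u64 = 17; // toy prime; 4 is a 4th root of unity mod 17

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Recursive radix-2 Cooley-Tukey NTT: maps coefficients to the
// evaluations of the polynomial at omega^0, ..., omega^(n-1).
fn radix2_ntt(a: &mut Vec<u64>, omega: u64) {
    let n = a.len();
    if n == 1 { return; }
    let mut even: Vec<u64> = a.iter().step_by(2).copied().collect();
    let mut odd: Vec<u64> = a.iter().skip(1).step_by(2).copied().collect();
    radix2_ntt(&mut even, omega * omega % P);
    radix2_ntt(&mut odd, omega * omega % P);
    let mut w = 1u64;
    for k in 0..n / 2 {
        let t = w * odd[k] % P;
        a[k] = (even[k] + t) % P;
        a[k + n / 2] = (even[k] + P - t) % P;
        w = w * omega % P;
    }
}

// Naive O(n^2) evaluation for cross-checking the NTT.
fn naive_eval(coeffs: &[u64], omega: u64) -> Vec<u64> {
    (0..coeffs.len() as u64)
        .map(|k| {
            let x = pow_mod(omega, k);
            coeffs.iter().rev().fold(0u64, |acc, &c| (acc * x + c) % P)
        })
        .collect()
}

fn main() {
    let coeffs = vec![1u64, 2, 3, 4];
    let omega = 4; // order 4 mod 17, matching the input length
    let mut evals = coeffs.clone();
    radix2_ntt(&mut evals, omega);
    assert_eq!(evals, naive_eval(&coeffs, omega));
}
```

As in the library's API, omega must be a root of unity whose order matches the (power-of-two) input length.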

Feature flags

This crate supports CUDA and OpenCL, which can be enabled with the cuda and opencl feature flags.

Environment variables

  • EC_GPU_CUDA_NVCC_ARGS

    By default the CUDA kernel is compiled for several architectures, which may take a long time. EC_GPU_CUDA_NVCC_ARGS can be used to override those arguments. The input and output file will still be automatically set.

    // Example for compiling the kernel for only the Turing architecture.
    EC_GPU_CUDA_NVCC_ARGS="--fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75"
  • EC_GPU_FRAMEWORK

    When the library is built with both CUDA and OpenCL support, you can choose which one to use at run-time. The default is cuda, which is also used when the variable is unset or set to any other (invalid) value. The other possible value is opencl.

    // Example for setting it to OpenCL.
    EC_GPU_FRAMEWORK=opencl
  • EC_GPU_NUM_THREADS

    Restricts the number of threads used in the library. The default is set to the number of logical cores reported on the machine.

    // Example for setting the maximum number of threads to 6.
    EC_GPU_NUM_THREADS=6

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

ec-gpu's People

Contributors

cryptonemo, dignifiedquire, drpetervannostrand, huitseeker, keyvank, porcuquine, vmx


ec-gpu's Issues

Remove 64-bit limb support

Currently there is support for 64-bit limbs on NVIDIA devices, but it does not seem to improve performance significantly. Hence, remove that support and only support 32-bit limbs.

  • Verify again that there is no performance difference between 32-bit and 64-bit limbs
  • Remove the support for 64-bit limbs
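For background on the trade-off: the two limb sizes are just different radix-2^32 / radix-2^64 decompositions of the same big integer (sketched here for 128 bits; a hypothetical illustration, unrelated to the generated kernels). Halving the limb count does not necessarily help on NVIDIA GPUs, where 64-bit integer operations are emulated with 32-bit instructions.

```rust
// The same 128-bit integer decomposed into little-endian limbs of
// both widths; the limb size changes the representation, not the value.
fn to_limbs_32(x: u128) -> [u32; 4] {
    [x as u32, (x >> 32) as u32, (x >> 64) as u32, (x >> 96) as u32]
}

fn to_limbs_64(x: u128) -> [u64; 2] {
    [x as u64, (x >> 64) as u64]
}

fn from_limbs_32(l: [u32; 4]) -> u128 {
    l.iter().rev().fold(0u128, |acc, &w| (acc << 32) | w as u128)
}

fn from_limbs_64(l: [u64; 2]) -> u128 {
    ((l[1] as u128) << 64) | l[0] as u128
}

fn main() {
    let x: u128 = 0x0123_4567_89ab_cdef_0011_2233_4455_6677;
    assert_eq!(from_limbs_32(to_limbs_32(x)), x);
    assert_eq!(from_limbs_64(to_limbs_64(x)), x);
    // Each 64-bit limb packs a pair of adjacent 32-bit limbs.
    let (l32, l64) = (to_limbs_32(x), to_limbs_64(x));
    assert_eq!(((l32[1] as u64) << 32) | l32[0] as u64, l64[0]);
}
```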

Questions about the POINT_multiexp kernel in the CUDA code

In the CUDA code, I noticed that you use an optimization in the POINT_multiexp kernel, as shown below:

  // O_o, weird optimization, having a single special case makes it
  // tremendously faster!
  // 511 is chosen because it's half of the maximum bucket len, but
  // any other number works... Bigger indices seems to be better...
    if(ind == 511) buckets[510] = G1_add_mixed(buckets[510], bases[i]);
    else if(ind--) buckets[ind] = G1_add_mixed(buckets[ind], bases[i]);

but when I remove the special case, so it becomes

     if(ind--) buckets[ind] = G1_add_mixed(buckets[ind], bases[i]);

Testing on an RTX 3060: with an input scale of 2^21 and G2Affine input points, your optimization has an obvious effect. With an input scale of 2^21 and G1Affine input points, it has no effect and even leads to a longer execution time.
Could you explain in detail the role of this optimization?
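For context, the surrounding loop implements the bucket method for multi-scalar multiplication; `ind--` exists because a zero window contributes nothing, so window value k lands in bucket k - 1. A hypothetical CPU sketch with the curve group replaced by plain u64 addition (not the kernel code itself):

```rust
// Bucket-method (Pippenger) multiexp sketch: computes sum(s_i * b_i)
// using c-bit windows, with the curve group replaced by u64 addition.
// Names and the window size are illustrative.
fn bucket_multiexp(scalars: &[u64], bases: &[u64], c: u32) -> u64 {
    let windows = (64 + c - 1) / c;
    let mask = (1u64 << c) - 1;
    let mut result: u64 = 0;
    for w in (0..windows).rev() {
        // Shift the accumulator up by one window: "double" c times.
        for _ in 0..c {
            result = result.wrapping_add(result);
        }
        let mut buckets = vec![0u64; (1usize << c) - 1];
        for (&s, &b) in scalars.iter().zip(bases) {
            let ind = ((s >> (w * c)) & mask) as usize;
            // Mirrors the kernel's `if(ind--)`: window value 0 adds
            // nothing, and window value k goes into bucket k - 1.
            if ind != 0 {
                buckets[ind - 1] = buckets[ind - 1].wrapping_add(b);
            }
        }
        // Running-sum trick: bucket k - 1 ends up weighted by k.
        let mut running: u64 = 0;
        for bucket in buckets.iter().rev() {
            running = running.wrapping_add(*bucket);
            result = result.wrapping_add(running);
        }
    }
    result
}

fn main() {
    let scalars = [3u64, 5, 1000, 123_456_789];
    let bases = [7u64, 11, 13, 17];
    let naive: u64 = scalars.iter().zip(&bases).map(|(&s, &b)| s.wrapping_mul(b)).sum();
    assert_eq!(bucket_multiexp(&scalars, &bases, 4), naive);
}
```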

Is it possible to use it with bls12_377?

Hi!

I'm trying to use the generated GPU code for bls12_377 and stumbled across the function Fq_mul_nvidia. It doesn't work for me (of course I have an Nvidia GPU), while Fq_mul_default works fine. Are there any possible pitfalls with this function? bls12_377 has a 377-bit modulus; could that be the crux of the problem? Or does it depend on the limb type? I'm using Limb64, since the code I'm trying to optimize uses 64-bit limbs.

Targeted finite fields

I am doing a state-of-the-art survey of the existing GPU libraries, and I stumbled upon your project.
Which finite fields exactly are supported? Only fields of characteristic 2?
Extension fields or only prime fields?
Single- or multi-word fields?
Are there optimisations for some of them?
In the FFT, the size parameter is often a power of two. Do you only handle these?
The description looks very vague to me.
Do you have benchmark plots comparing NVIDIA vs AMD?

Subtraction on bls12-381

Given a G1_affine point G = (x, y), -G should be (x, Fq_P - y). Let P be a G1_projective point; then G1_sub_mixed(P, G) = G1_add_mixed(P, -G). I want to know: what is wrong with this understanding? Or can you provide some help with the calculation of G1_sub_mixed?
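The negation identity in the question can be sanity-checked on a toy curve in affine coordinates (y^2 = x^3 + 2x + 2 over F_17, a hypothetical stand-in; the library's G1_projective representation is not modeled here):

```rust
// Toy curve y^2 = x^3 + 2x + 2 over F_17; affine coordinates only.
const P: u64 = 17;
const A: u64 = 2;

// None represents the point at infinity.
type Pt = Option<(u64, u64)>;

// Modular inverse via Fermat's little theorem: x^(p-2) mod p.
fn inv(x: u64) -> u64 {
    let (mut b, mut e, mut acc) = (x, P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Affine point addition (handles doubling and inverse points).
fn add(p: Pt, q: Pt) -> Pt {
    let (x1, y1) = match p { Some(v) => v, None => return q };
    let (x2, y2) = match q { Some(v) => v, None => return p };
    if x1 == x2 && (y1 + y2) % P == 0 {
        return None; // q == -p
    }
    let lambda = if p == q {
        (3 * x1 * x1 + A) % P * inv(2 * y1 % P) % P
    } else {
        (y2 + P - y1) % P * inv((x2 + P - x1) % P) % P
    };
    let x3 = (lambda * lambda % P + 2 * P - x1 - x2) % P;
    let y3 = (lambda * ((x1 + P - x3) % P) % P + P - y1) % P;
    Some((x3, y3))
}

// Negation: -(x, y) = (x, p - y), as stated in the question.
fn neg(g: Pt) -> Pt {
    g.map(|(x, y)| (x, (P - y) % P))
}

fn main() {
    let g = Some((5, 1)); // a point on the toy curve
    let two_g = add(g, g);
    // Subtraction as addition of the negation: (2G) + (-G) == G.
    assert_eq!(add(two_g, neg(g)), g);
}
```

Mathematically, the same identity holds on any short Weierstrass curve: P - Q = P + (-Q) with -(x, y) = (x, p - y).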
