
plonky2's Introduction

Plonky2 & more


This repository was originally for Plonky2, a SNARK implementation based on techniques from PLONK and FRI. It has since expanded to include tools such as Starky, a highly performant STARK implementation.

Documentation

For more details about the Plonky2 argument system, see this writeup.

Polymer Labs has written up a helpful tutorial here!

Examples

The included examples are a good starting point for using Plonky2 in simple applications:

  • factorial: Proving knowledge of 100!
  • fibonacci: Proving knowledge of the hundredth Fibonacci number
  • range_check: Proving that a field element is in a given range
  • square_root: Proving knowledge of the square root of a given field element

To run an example, use

cargo run --example <example_name>

Building

Plonky2 requires a recent nightly toolchain, although we plan to transition to stable in the future.

To use a nightly toolchain for Plonky2 by default, you can run

rustup override set nightly

in the Plonky2 directory.

Running

To see recursion performance, one can run this bench, which generates a chain of three recursion proofs:

RUSTFLAGS=-Ctarget-cpu=native cargo run --release --example bench_recursion -- -vv

Jemalloc

Plonky2 prefers the Jemalloc memory allocator due to its superior performance. To use it, include jemallocator = "0.5.0" in your Cargo.toml and add the following lines to your main.rs:

use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

Jemalloc is known to cause crashes when a binary compiled for x86 is run on an Apple silicon-based Mac under Rosetta 2. If you are experiencing crashes on your Apple silicon Mac, run rustc --print target-libdir. The output should contain aarch64-apple-darwin. If the output contains x86_64-apple-darwin, then you are running the Rust toolchain for x86; we recommend switching to the native ARM version.

Contributing guidelines

See CONTRIBUTING.md.

Licenses

All crates of this monorepo are licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Security

This code has not yet been audited, and should not be used in any production systems.

While Plonky2 is configurable, its defaults generally target 100 bits of security. The default FRI configuration targets 100 bits of conjectured security based on the conjecture in ethSTARK.

Plonky2's default hash function is Poseidon, configured with 8 full rounds, 22 partial rounds, a width of 12 field elements (each ~64 bits), and an S-box of x^7. BBLP22 suggests that this configuration may have around 95 bits of security, falling a bit short of our 100 bit target.

Links

Actively maintained

No longer maintained

  • System Zero, a zkVM built on top of Starky
  • Waksman, Plonky2 gadgets for permutation checking using Waksman networks
  • Insertion, Plonky2 gadgets for insertion into a list
  • u32, Plonky2 gadgets for u32 arithmetic
  • ECDSA, Plonky2 gadgets for the ECDSA algorithm


plonky2's Issues

More efficient blinding scheme for Z

We might want to consider a method of blinding Z that requires fewer blinding factors, along the lines of Mina's scheme. It's particularly relevant for small signature circuits.

Subtraction can be wrong

I've run into the following bug: when a field element is not reduced, subtracting this element can produce incorrect results.
Here is a test that is failing with the current Sub implementation:

fn test_sub() {
    type F = CrandallField;
    let (a, b) = (
        F::from_canonical_u64(2762315674048163),
        F::from_canonical_u64(6678),
    );
    let x = a * b; // = 1 in the field.
    assert_eq!(x, F::ONE);
    assert_eq!(F::ZERO - x, F::NEG_ONE);
}

I think simply changing the Sub implementation to use rhs.to_canonical_u64() instead of rhs.0 should solve this, but I'll defer to you if you think there is a better way.
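A minimal, self-contained sketch of that proposed fix (the Fp struct, ORDER constant, and reduction here are illustrative stand-ins, not the actual CrandallField code):

use std::ops::Sub;

/// The Crandall prime 2^64 - 9 * 2^28 + 1.
const ORDER: u64 = 18_446_744_071_293_632_513;

/// Toy stand-in for CrandallField: the inner u64 may be non-canonical (>= ORDER).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fp(u64);

impl Fp {
    /// Since 2^64 - ORDER < ORDER, one conditional subtraction canonicalizes any u64.
    fn to_canonical_u64(self) -> u64 {
        if self.0 >= ORDER {
            self.0 - ORDER
        } else {
            self.0
        }
    }
}

impl Sub for Fp {
    type Output = Self;

    fn sub(self, rhs: Self) -> Self {
        // The fix: canonicalize rhs first, instead of subtracting the raw
        // (possibly unreduced) rhs.0.
        let (lhs, rhs) = (self.to_canonical_u64(), rhs.to_canonical_u64());
        let (diff, underflow) = lhs.overflowing_sub(rhs);
        Fp(if underflow { diff.wrapping_add(ORDER) } else { diff })
    }
}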

Final coefficients are zero-padded

In fri_committed_trees, our PolynomialCoeffs is zero-padded to the LDE size, leading to a final_poly that's also zero-padded, e.g.

final_poly: PolynomialCoeffs {
    coeffs: [
        11138935987459839938 + 15860308905970075097*a + 4209094335074549309*a^2 + 1390825533366268468*a^3,
        15988777520974773963 + 9820978433862994999*a + 8053082307224957654*a^2 + 15841019660058770957*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
        0 + 0*a + 0*a^2 + 0*a^3,
    ],
},

Fixing this in a really performant way might be nontrivial (we'd want to keep the coeffs unpadded and use Hamish's fast LDE code). To avoid throwaway work, it might be best to just switch to FRI with interpolation, where we don't need coeffs until the very end.

Verify batch FRI soundness error with our parameters

The proximity gaps paper includes a soundness analysis of batch FRI. Hopefully soundness error will already be sufficiently small with our current parameters (a batch of 2-3 hundred purported codewords, over a ~192 or ~256 bit field). If not, we can reduce it by adding back multiple challenge points.

Remove openings at frob(zeta)

As per our discussions, it turns out they're unnecessary for ensuring witness polynomials are polynomials in the base field.

Implement interpolation over arbitrary domains

By this I mean going from an arbitrary set of (point, value) pairs to a list of coefficients. There are a couple places where this would be useful. One example:

To verify openings of f at multiple challenge points, we run FRI on (f(x) - interpolant(x)) / vanishing(x), where interpolant is the polynomial of minimal degree that agrees with f at the challenge points, and vanishing is the polynomial of minimal degree that vanishes at the challenge points. We could do Lagrange interpolation, but we want this to be as simple and efficient as possible since it needs to be done by the recursive verifier, so I think it's better for the prover to give purported coefficients and the verifier to evaluate from those.
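For example, with two challenge points z_1 and z_2 (notation mine, just to make the quotient concrete):

$$\mathrm{vanishing}(x) = (x - z_1)(x - z_2), \qquad \mathrm{interpolant}(x) = f(z_1)\,\frac{x - z_2}{z_1 - z_2} + f(z_2)\,\frac{x - z_1}{z_2 - z_1},$$

and FRI is run on (f(x) - interpolant(x)) / vanishing(x), which is itself a polynomial precisely when the claimed evaluations f(z_1), f(z_2) are correct (assuming f is a polynomial of the claimed degree).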

I'm not sure what a "good" method for this sort of interpolation would be, but I think performance is not really important here, as these polynomials should be small. Also, this doesn't have to be fully general interpolation since our fields are 2-adic. So we could potentially round up to a power of two and leverage the subgroup of that size. I.e., we could evaluate the Lagrange polynomial at each point in that subgroup, then call our IFFT code. There might be better methods, that's just the first thought I had.
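For reference, here is a self-contained sketch of the fully general baseline (plain Lagrange interpolation to coefficient form), over a toy prime field with p = 97 rather than a real plonky2 field; it is O(n^2), which is fine for the small interpolants discussed above:

// Toy prime field (p = 97) so the sketch is self-contained; real plonky2 fields are ~64-bit.
const P: u64 = 97;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P } // assumes a, b < P
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

fn pow(mut a: u64, mut e: u64) -> u64 {
    let mut r = 1;
    while e > 0 {
        if e & 1 == 1 {
            r = mul(r, a);
        }
        a = mul(a, a);
        e >>= 1;
    }
    r
}

fn inv(a: u64) -> u64 { pow(a, P - 2) } // Fermat inverse; assumes a != 0

// Coefficients (lowest degree first) of the minimal-degree polynomial through the given
// distinct (point, value) pairs.
fn interpolate(pairs: &[(u64, u64)]) -> Vec<u64> {
    let n = pairs.len();
    let mut coeffs = vec![0u64; n];
    for (i, &(xi, yi)) in pairs.iter().enumerate() {
        // Build the i-th Lagrange basis polynomial prod_{j != i} (x - x_j) in coefficient
        // form, together with its denominator prod_{j != i} (x_i - x_j).
        let mut basis = vec![1u64];
        let mut denom = 1u64;
        for (j, &(xj, _)) in pairs.iter().enumerate() {
            if j == i {
                continue;
            }
            let mut next = vec![0u64; basis.len() + 1];
            for (k, &c) in basis.iter().enumerate() {
                next[k] = sub(next[k], mul(c, xj)); // -x_j * c goes to the x^k term
                next[k + 1] = add(next[k + 1], c); //  c goes to the x^(k+1) term
            }
            basis = next;
            denom = mul(denom, sub(xi, xj));
        }
        // Accumulate y_i * basis / denom into the result.
        let scale = mul(yi, inv(denom));
        for (k, &c) in basis.iter().enumerate() {
            coeffs[k] = add(coeffs[k], mul(c, scale));
        }
    }
    coeffs
}

For instance, interpolate(&[(1, 5), (2, 8)]) returns [2, 3], i.e. 2 + 3x. The subgroup-plus-IFFT idea above would replace the inner basis-building loop with evaluations over a power-of-two subgroup followed by an IFFT.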

Safe division

We're using div_unsafe[_extension] (which permits 0/0 = <anything>) in several places, and it seems unclear whether they're all safe. E.g. interpolate2 calls it, so it has the same unsafe property, but doesn't warn its callers. I think the interpolate2 call in fri_combine_initial might be unsafe.

Let's add a safe division function, and replace calls with that if there's any doubt. We can always switch back later after writing a safety argument.

I think we can also simplify div_unsafe[_extension] using virtual targets that we can pass to mul etc.

Remove unstable features

We should work towards making Plonky2 compile on stable.

Unstable features that need to either be removed from Plonky2 or that need to be stabilized in Rust:

  • asm_sym Used in one asm! block but I can easily remove it and just pass a pointer.
  • generic_const_exprs Used by PackedField (in from_arr/to_arr); these methods can be removed, although having to do so will make me curse Rust's half-baked type system. I think it's also used in the plonky2 crate, but I'm not sure where.
  • specialization There are two traits that use this: Square and Packable. The default implementation of the former is just for convenience and can be removed. The latter has a default implementation so Rust can prove that every Field is Packable. We can remove the default impl if we merge both traits.
  • stdsimd Used for Neon support in Poseidon. Also used for packed AVX-512 arithmetic in #400. ARM Poseidon can be done in pure asm, removing the need for intrinsics. We can replace AVX-512 with AVX2 when compiling on stable, at a performance penalty.

Implement native Poseidon

I don't mean to suggest that we should necessarily replace GMiMC with Poseidon; might be good to wait for more clarity around the security of both. But in the meantime, it would be good to have both in the codebase so that we can get a clearer picture of their costs.

Note that implementing Poseidon efficiently involves a bit of trickery; there are some suggestions in Appendix B of the paper. The reference implementation might also be useful.

There are some variations to consider, but I think a good candidate would be

  • permutation width of 12 elements (as we use for GMiMC)
  • S-box of x7 (so we can keep our constraint degree at 8, just barely)
  • over the Goldilocks field mentioned in #1

This leads to a recommendation of 8 full rounds and 22 partial rounds, for a total of 8 × 12 + 22 = 118 S-boxes.

It could also be worth considering x3, and trying to arithmetize it in a clever way that doesn't involve a wire for every S-box (see #2). However, our gate would then have degree 9 constraints, which would mean (since we use radix-2 FFTs) a bunch of intermediate polynomials would need to be degree 16n.

Reducing proof size

  • Compress Merkle paths. Done in #255.
  • Remove the element that can be inferred in the FRI reduction step. We already did that before f2c423e, but it was removed when we replaced the InsertionGate with the RandomAccessGate. We can add it back to save some field elements. Done in #298.
  • Replace Merkle caps with Merkle roots in the compressed proofs. In #170 we replace Merkle roots with Merkle caps to save some gates in the recursive verifier. However this is a bit wasteful in compressed proofs, so replacing the caps with the roots would save some space there.
  • The FRI verifier queries the oracle at many indices that are not necessarily distinct. We could check for duplicates and remove the duplicate query proofs in the compressed proofs. Done in #279.
  • We could implement our own proof serialization. Right now serde_cbor wastes quite a bit of space. For example a vector of 100 CrandallField elements is serialized with 902 bytes, whereas a simple byte array would take 800 bytes. Done in #280.
  • Reduce num_query_rounds. The proof size is currently given by 12.4 + 6.7*num_query_rounds KB, so simply reducing num_query_round can save quite a bit of memory. Note that we'd need to increase proof_of_work_bits or rate_bits to maintain security, which would in turn increase the proving time.

Refactor `eval_vanishing_poly_base`

@dlubarov:

Right now we evaluate every gate's constraint poly at every point in the extended domain
Or more specifically, for each point in the LDE domain, we evaluate each gate poly, then take a random combination of those values
Maybe instead, for each gate, we could evaluate it at each point in the LDE domain. I figure it'd be fewer allocations, fewer vtable lookups, and a smaller amount of code being executed in a loop (not sure if that matters, maybe all the eval code is in L1 anyway?)
And that way for lower-degree gates, we could potentially evaluate them on a smaller domain and then use an FFT to get evals over the LDE domain. I'm not sure if it's faster or not but maybe worth a try

test_arithmetic for extension fields

test_arithmetic! doesn't currently work for them because it uses the (u64) ORDER field. I think we should make ORDER a BigUInt. If there's any performance-critical code that depends on it (I can't think of any), we can add a Field64 with an ORDER_u64 or something.

Cache-friendly transpose

Our transpose function is taking about 6-7% of our proving time. Not a huge bottleneck, so it's probably not worth investing too much effort in this. But it seems like there may be straightforward ways to improve it, like the ones mentioned in this SO question.
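For illustration, a cache-blocked transpose along those lines (block size and element type are arbitrary choices for the sketch, not plonky2's actual code):

/// Transpose a row-major `rows x cols` matrix, working in BLOCK x BLOCK tiles so that
/// the source and destination rows touched by one tile stay in cache.
fn transpose_blocked(src: &[u64], rows: usize, cols: usize) -> Vec<u64> {
    const BLOCK: usize = 32; // tile side length; tune for the target cache
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0u64; rows * cols];
    for row_block in (0..rows).step_by(BLOCK) {
        for col_block in (0..cols).step_by(BLOCK) {
            for r in row_block..(row_block + BLOCK).min(rows) {
                for c in col_block..(col_block + BLOCK).min(cols) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
    dst
}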

FFT optimizations

The existing implementation does the minimal number of field operations -- i.e. it uses radix-2 layers only, and precomputes the roots of unity. At the time, field operations were the bottleneck, so I didn't give much thought to cache efficiency or anything.

Now that we use a smaller field, we should probably start over with a more efficient implementation. As @unzvfu mentioned, the NTT code in SEAL would be a good reference.

Use PrimeField trait in extension field code

The current extension field code seems quite a lot more general than we need. If we replace uses of BaseField there with PrimeField then many simplifications and efficiency improvements will result.

`PackedField` with NEON

We have a PackedField trait representing a vector of field elements, but it doesn't have a NEON implementation currently. cc @recmo who might be interested in helping with this.

From talking with @nbgl just now, it sounds like 64-bit multiplication won't be very efficient with NEON. So it won't be easy to actually get a speedup over scalar code, though it may be doable by doing scalar and vector operations in parallel.

Public inputs

Just a tracking ticket to add support for public inputs. We plan to add a PI polynomial which the verifier evaluates at zeta, similar to standard Plonk.

An alternative scheme is here. It may be cheaper for the verifier, as it doesn't require evaluating Lagrange bases, but I'm not sure that'd be a big deal to us.

Test `plonk::recursive_verifier::tests::test_size_optimized_recursion` fails

Test plonk::recursive_verifier::tests::test_size_optimized_recursion fails (and has been for a while, it seems) with:

$ cargo test plonk::recursive_verifier::tests::test_size_optimized_recursion -- --ignored
   Compiling plonky2 v0.1.0 (/home/hlaw/src/plonky2)
    Finished test [unoptimized + debuginfo] target(s) in 5.76s
     Running unittests (target/debug/deps/plonky2-5a582d6d2807f96f)

running 1 test
test plonk::recursive_verifier::tests::test_size_optimized_recursion ... FAILED

failures:

---- plonk::recursive_verifier::tests::test_size_optimized_recursion stdout ----
thread 'plonk::recursive_verifier::tests::test_size_optimized_recursion' panicked at 'assertion failed: `(left == right)`
  left: `12`,
 right: `8`', src/fri/recursive_verifier.rs:265:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    plonk::recursive_verifier::tests::test_size_optimized_recursion

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 198 filtered out; finished in 23.86s

error: test failed, to rerun pass '--lib'

The line in question is

        debug_assert_eq!(
            degree_log,
            common_data.config.cap_height + proof.evals_proofs[0].1.siblings.len()
                - config.rate_bits)

where

degree_log: 12
common_data.config.cap_height: 3
proof.evals_proofs[0].1.siblings.len(): 12
config.rate_bits: 7

I don't properly understand how these parameters relate to each other yet, but I think the config.rate_bits value might be causing the problem. There are two CircuitConfig objects in play with different data, both defined in the failing test. common_data.config is:

CircuitConfig {
  num_wires: 135,
  num_routed_wires: 80,
  constant_gate_size: 5,
  use_base_arithmetic_gate: true,
  security_bits: 93,
  rate_bits: 3,
  num_challenges: 2,
  zero_knowledge: false,
  cap_height: 3,
  fri_config: FriConfig {
    proof_of_work_bits: 16,
    reduction_strategy: ConstantArityBits(3, 5),
    num_query_rounds: 26
  }
}

and self.config:

CircuitConfig {
  num_wires: 135,
  num_routed_wires: 80,
  constant_gate_size: 5,
  use_base_arithmetic_gate: true,
  security_bits: 93,
  rate_bits: 7,
  num_challenges: 2,
  zero_knowledge: false,
  cap_height: 3,
  fri_config: FriConfig {
    proof_of_work_bits: 16,
    reduction_strategy: ConstantArityBits(3, 5),
    num_query_rounds: 11
  }
}

One way to fix this failure would be to replace config.rate_bits in the failing expression with common_data.config.rate_bits, but I'm not sure if that's valid.

@dlubarov or @wborgeaud, could you enlighten me/us as to the proper relationships between these different values?

Add gate filters

Implement eval_filtered, so that gates are "activated" only at the proper row indices.

cannot use constants which depend on generic parameters in types

This compiler warning pops up a bunch of times. It seems to only affect things that call Poseidon.

warning: cannot use constants which depend on generic parameters in types
  --> src/field/field_types.rs:19:47
   |
19 | pub trait RichField: PrimeField + GMiMC<12> + Poseidon<12> {}
   |                                               ^^^^^^^^^^^^
   |
   = note: `#[warn(const_evaluatable_unchecked)]` on by default
   = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
   = note: for more information, see issue #76200 <https://github.com/rust-lang/rust/issues/76200>
warning: cannot use constants which depend on generic parameters in types
  --> src/hash/hashing.rs:83:26
   |
83 |             state = self.permute(state);
   |                          ^^^^^^^
warning: cannot use constants which depend on generic parameters in types
   --> src/hash/hashing.rs:142:33
    |
142 |         HashFamily::Poseidon => F::poseidon(inputs),
    |                                 ^^^^^^^^^^^

The Rust tracking issue is rust-lang/rust#76200.

Do path compression upfront

PartitionWitness::find does path compression, but there are various places where we skip find and just use parent as the representative, assuming path compression has already occurred. If we explicitly compressed all paths upfront (during preprocessing, just after all nodes are added), I think it would be more clear that the parent uses are correct.
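As a toy sketch of the proposal (an illustrative union-find, not the actual PartitionWitness code):

struct Partition {
    parents: Vec<usize>,
}

impl Partition {
    /// `find` with path compression, as used during normal operation.
    fn find(&mut self, x: usize) -> usize {
        if self.parents[x] == x {
            return x;
        }
        let root = self.find(self.parents[x]);
        self.parents[x] = root;
        root
    }

    /// The proposed upfront pass: run once after all nodes have been added and merged.
    /// Afterwards every `parents[i]` points directly at a root, so later code can read
    /// `parents[i]` as the representative without calling `find`.
    fn compress_all_paths(&mut self) {
        for i in 0..self.parents.len() {
            self.find(i);
        }
    }
}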

Combine preprocessed polynomials into one Merkle tree

We currently have one Merkle tree for sigma polynomials (used by Plonk's permutation argument), and another for constant polynomials (used for configuring gates). Since they're both created during preprocessing, it seems like they could be combined into one, to reduce the number of authentication paths we need.

Parallelize `merkle_root`

We'll call this once to build a big Merkle tree containing (LDEs of) all wire data, so that's where the bulk of the hashing will happen. We'll also call it for a few smaller Merkle trees.

A simple approach would be to split the tree into n sections, and spawn n threads (or async calls or what not) to build those sections, then aggregate those n section roots with a single thread (i.e. at the end of the merkle_root call). n could be something like the number of CPUs rounded up to a power of two, or just a number like 128.

This has a couple minor downsides: some CPUs may be idle toward the end, while the last sections are being built, and the final aggregation is single-threaded. In theory, it might be better to use a fork-join approach to avoid these inefficiencies. In practice, though, I think these inefficiencies should be insignificant if we choose n wisely, and I'm not sure what kind of overheads a fork-join library might add.
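A rough sketch of that scheme using scoped threads and a toy 2-to-1 compression function (both stand-ins; the real code hashes field elements and would choose n based on the CPU count):

use std::thread;

/// Toy 2-to-1 compression function standing in for the real hash.
fn compress(a: u64, b: u64) -> u64 {
    a.rotate_left(17) ^ b.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Single-threaded Merkle root over `leaves` (length a power of two).
fn merkle_root_serial(leaves: &[u64]) -> u64 {
    let mut layer = leaves.to_vec();
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|pair| compress(pair[0], pair[1])).collect();
    }
    layer[0]
}

/// Split the leaves into `n_sections` contiguous sections, build each section's subtree
/// on its own thread, then aggregate the section roots on the calling thread.
fn merkle_root_parallel(leaves: &[u64], n_sections: usize) -> u64 {
    assert!(leaves.len().is_power_of_two());
    assert!(n_sections.is_power_of_two() && n_sections <= leaves.len());
    let section_len = leaves.len() / n_sections;
    let section_roots: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = leaves
            .chunks(section_len)
            .map(|section| s.spawn(move || merkle_root_serial(section)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    merkle_root_serial(&section_roots)
}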

Add zero knowledge

My understanding is that this requires these changes:

  • Send evaluations on a coset of the subgroup, rather than a subgroup itself. Otherwise the FRI verifier might query some points on H, which would directly leak witness data. Alternatively, I think we could have the prover rewind, and retry with new randomness, until it gets a query set disjoint from H (should take just a couple hundred attempts). That might be simpler, but it's not the standard approach.
  • Salt each leaf (with 2 field elements ~= 128 bits, I think?) in Merkle trees containing witness data, to make them hiding commitments. (Ideally we would not salt the Merkle tree that contains public, precomputed circuit data like sigma polynomials. But if it's simpler to salt everything, that might not be a big deal.)
  • Add a bunch of random blinding factors to each witness polynomial (need to calculate how many). We also need to blind Plonk's Z polynomials; see here for one method of doing that.
  • I noticed that facebook/winterfell#9 also mentioned "Combine composition polynomial with a random low-degree polynomial before running it through FRI protocol." I vaguely recall the Aurora and/or Fractal papers mentioning something similar. I'm not clear on the purpose of it though.

Plonk tweaks to support more routed wires

In "standard Plonk", the vanishing polynomial contains a product of all routed wire polynomials. We plan to have quite a few routed wires (probably at least 16, maybe more), but we wouldn't want such a high-degree vanishing polynomial. We want to keep all polynomials within degree 8n.

I think we can get around this by introducing a few intermediate polynomials. I.e. we could use one to hold the product of the first 8 routed wire polynomials, then another for the next 8, etc.

The vanishing polynomial would use these intermediate polynomials in its permutation term. It would have additional terms to check that these intermediate polynomials agree with the product of the 8 associated polynomials.
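To make the splitting concrete (notation mine, not the actual plonky2 polynomial names): with 16 routed wires contributing terms t_0(x), ..., t_15(x) to the permutation product, the prover would commit to two extra polynomials pi_0 and pi_1 that agree on the subgroup with

$$\pi_0(x) = \prod_{i=0}^{7} t_i(x), \qquad \pi_1(x) = \prod_{i=8}^{15} t_i(x).$$

The permutation term of the vanishing polynomial then only multiplies the low-degree pi_0(x) * pi_1(x), and two additional terms check each pi_j against the product of its eight associated factors, so no single constraint involves a product of more than 8 committed polynomials.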

Support quadratic extensions in recursion code

It seems like D=2 should actually work for up to ~128 bit security, under a conjecture, based on 5.10.1 of the ethSTARK paper.

The one thing (that I'm aware of) that we'll need to change is RandomAccessGate and/or how it's used by verify_merkle_proof_with_cap_index. (See requires D = 4 in the code.)

Perhaps we could change RandomAccessGate to do a "batch" (c.f. SwitchGate etc) of RAM operations, where the list items are single field elements. Each operation would have its own list, index, and output. RAM on a list of extension field elements would be implemented with D of these smaller RAM operations.

This would be worse in a way, since there would be some redundancy in indices (in our use cases we'd repeat the same index either D or 4 times). But also better in a way, since RandomAccessGate would be more flexible, and could take advantage of a large num_routed_wires, potentially reducing the number of gates.

`PoseidonGate` could use fewer wires

Let original be the 12 original input wires, and let swapped be the 12 wires representing S-box inputs of the very first round.

It seems swapped is storing more info than we really need, in a couple of ways:

  • Instead of swapped[0..8], we could store delta[i] = swap * (original[i + 4] - original[i]) for i in 0..4. Then for i in 0..4, swapped[i] = original[i] + delta[i], and swapped[i + 4] = original[i + 4] - delta[i]. Note that these derivations are still degree 1, so they can be S-box inputs without increasing our constraint degree.
  • original[8..12] will always match swapped[8..12], so swapped[8..12] can simply be omitted.

With both changes, it seems we could save 8 wires.
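For concreteness, the derivation above as plain code (integers standing in for field elements; illustrative only):

/// With swap in {0, 1}, only the four delta values need their own wires: the swapped
/// state is a degree-1 expression in `original` and `delta`, so it can feed the first
/// round's S-boxes without raising the constraint degree.
fn conditional_swap(original: [i64; 12], swap: i64) -> [i64; 12] {
    let mut swapped = original;
    for i in 0..4 {
        let delta = swap * (original[i + 4] - original[i]);
        swapped[i] = original[i] + delta; // original[i + 4] if swap == 1, else original[i]
        swapped[i + 4] = original[i + 4] - delta; // original[i] if swap == 1, else original[i + 4]
    }
    // original[8..12] is never touched, so those swapped wires can be omitted entirely.
    swapped
}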

Use of quartic-quartic field extension in `InterpolationGate`

I had a closer look at how quartic-quartic fields are used in the eval_* methods of the InterpolationGate, and I think there is an issue.
I think it's easier to understand by just looking at how multiplication of quartic field extension elements works in the circuit.
This multiplication can be represented as a multivariate polynomial over the base field

$$M(x_0, x_1, x_2, x_3, y_0, y_1, y_2, y_3) = (z_0, z_1, z_2, z_3), \qquad z_0 = x_0 y_0 + 3\,(x_1 y_3 + x_2 y_2 + x_3 y_1), \;\ldots$$

(where 3 is the W for F_4/F).

So if the xs and ys are base field elements representing two quartic field elements, the zs will represent their product.

If the xs and ys are quartic field elements, as will be the case when we evaluate the gate at a random zeta, we can still use the same multivariate polynomial M seen as a polynomial over the quartic field this time. In this case, the resulting zs don't have a clear meaning (which isn't an issue).

Currently, the InterpolationGate interprets the xs and ys as representations of quartic-quartic field elements, and the zs as the representation of their multiplication in this field. However, this isn't (in general) the same as the correct z=M(x,y).
Concretely, the issue is that by working in the quartic-quartic field, we use the W of F_16/F_4 instead of the W of F_4/F as in M(x,y).

I see two possible ways of fixing this:

  • Change the current QuarticQuarticField to a QuarticQuarticAlgebra which would represent a 4-dimensional algebra over QuarticField with a multiplication given by the multivariate polynomial M above, aka F_4[X]/(X^4-3).
  • Implement gate constraints directly as multivariate polynomials over the base field. This would make the code a bit clumsier, but then we wouldn't have to juggle between fields when evaluating constraints, and we would get the constraint degrees for free. Note that this is the approach I took in Cobia where the code for gate constraints is directly compiled to multivariate polynomials.

PS: Here is a failing test

#[test]
fn test_gate_constraint() {
        type F = CrandallField;
        type FF = QuarticCrandallField;
        const D: usize = 4;

        /// Returns the local wires for an interpolation gate for given coeffs, points and eval point.
        fn get_wires(
            num_points: usize,
            coeffs: PolynomialCoeffs<FF>,
            points: Vec<F>,
            eval_point: FF,
        ) -> Vec<FF> {
            let mut v = vec![F::ZERO; num_points * 5 + (coeffs.len() + 3) * D];
            for j in 0..num_points {
                v[j] = points[j];
            }
            for j in 0..num_points {
                for i in 0..D {
                    v[num_points + D * j + i] = <FF as FieldExtension<D>>::to_basefield_array(
                        &coeffs.eval(points[j].into()),
                    )[i];
                }
            }
            for i in 0..D {
                v[num_points * 5 + i] =
                    <FF as FieldExtension<D>>::to_basefield_array(&eval_point)[i];
            }
            for i in 0..D {
                v[num_points * 5 + D + i] =
                    <FF as FieldExtension<D>>::to_basefield_array(&coeffs.eval(eval_point))[i];
            }
            for i in 0..coeffs.len() {
                for (j, input) in
                    (0..D).zip(num_points * 5 + (2 + i) * D..num_points * 5 + (3 + i) * D)
                {
                    v[input] = <FF as FieldExtension<D>>::to_basefield_array(&coeffs.coeffs[i])[j];
                }
            }
            v.iter().map(|&x| x.into()).collect::<Vec<_>>()
        }

        // Get a working row for InterpolationGate.
        let coeffs = PolynomialCoeffs::new(vec![FF::rand(), FF::rand()]);
        let points = vec![F::rand(), F::rand()];
        let eval_point = FF::rand();
        let gate = InterpolationGate::<F, D> {
            num_points: 2,
            _phantom: PhantomData,
        };
        let vars = EvaluationVars {
            local_constants: &[],
            local_wires: &get_wires(2, coeffs, points, eval_point),
        };

        // !!! This fails !!!
        assert!(
            gate.eval_unfiltered(vars.clone())
                .iter()
                .all(|x| x.is_zero()),
            "Gate constraints are not satisfied."
        );
    }

Make Poseidon a trait on prime field arrays (instead of prime fields)?

Consider making Poseidon a trait not on GoldilocksField et al., with a const generic for width, but a trait on a fixed-width array.

For example,

impl Poseidon for [GoldilocksField; 12] {
    ...

This would permit us to get rid of the WIDTH const generic, hopefully silencing some warnings. It would also be more conceptually correct: you can’t apply Poseidon to a single GoldilocksField. There would be no more need for the awkward <GoldilocksField as Poseidon<12>>.

Are there any difficulties that I’ve missed?

Tree-like `batch_multiplicative_inverse`?

We spend ~3% of proof time in Field::batch_multiplicative_inverse (not counting Field::try_inverse). This routine computes the multiplicative inverse for every element of a vector. In order to do so, it computes the running product (think np.cumprod) of all those elements.

The running product creates a long dependency chain that is likely a bottleneck. It should be broken up. I propose something like:

let mut running_prods = vec![F::ONE, F::ONE, F::ONE, F::ONE];
for (i, &x) in input.iter().enumerate() {
    running_prods.push(running_prods[i] * x);
}

let n = running_prods.len();
let t0 = running_prods[n - 4] * running_prods[n - 3];
let t1 = running_prods[n - 2] * running_prods[n - 1];
let u = t0 * t1;
let u_inv = u.inverse();

// Unwind analogously.

This would break up the dependency chain and increase instruction-level parallelism. (It would also permit us to try vectorizing this code at a later time.)

Cache FFT root tables

Our current FFT re-computes its root tables at every invocation. It means:

  1. More numerical calculations
  2. More allocations/deallocations
  3. More page faults (kernel zeroing new pages)
  4. Memory latency (new root table not in cache)
  5. Cluttered cache (multiple copies of root table in cache)

Try caching the tables once computed.
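One possible shape for such a cache, sketched with a HashMap behind a OnceLock (the table contents and compute_root_table are stand-ins, not the real FFT precomputation):

use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide cache of root tables, keyed by lg(n).
fn cached_root_table(lg_n: usize) -> Arc<Vec<u64>> {
    static CACHE: OnceLock<Mutex<HashMap<usize, Arc<Vec<u64>>>>> = OnceLock::new();
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    cache
        .lock()
        .unwrap()
        .entry(lg_n)
        .or_insert_with(|| Arc::new(compute_root_table(lg_n)))
        .clone()
}

/// Stand-in for computing the 2^lg_n roots of unity used by an FFT of that size.
fn compute_root_table(lg_n: usize) -> Vec<u64> {
    (0..(1u64 << lg_n)).collect()
}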

Support Keccak-256

This would be good to have just for bridge proofs, since Keccak-256 is much cheaper than Poseidon on the EVM.

The sha3 crate has a Keccak256 hasher we can use. That part is easy.
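For reference, a standalone usage sketch of that hasher (assuming the RustCrypto sha3 crate's Digest API; not plonky2 integration code):

use sha3::{Digest, Keccak256};

fn keccak256(data: &[u8]) -> [u8; 32] {
    let mut hasher = Keccak256::new();
    hasher.update(data);
    let digest = hasher.finalize();
    let mut out = [0u8; 32];
    out.copy_from_slice(digest.as_slice());
    out
}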

However, there are a few complications:

  • HASH_FAMILY is currently a compile-time constant. We'll now need the same build to generate some proofs with Poseidon hashes and other proofs with Keccak-256 hashes. We could include the family in CircuitConfig, though checking it at runtime could add overhead in hashing. It might be best if we could control it with const generics instead, like <H: HashFamily>, or a more general <C: GenericConfig> which could contain other config too.
  • The bridge proof circuit will still be verifying inner proofs that used Poseidon. So we'll need separate config for inner proof and outer proof hash families. Maybe the former could be an argument to add_recursive_verifier.
  • A lot of our hashing code assumes that hashes are sponge-based, so we'll need to rework that a bit.

Add more benchmarks

  • Basic field arithmetic (might be nontrivial to measure such small operations accurately?)
  • Inversion
  • A single permutation or hash, for GMiMC, and later Poseidon
  • Merklize some data (MerkleTree::new)
  • FFTs
  • Transpose

Investigate other field options

We're currently using p = 2^64 - 9 * 2^28 + 1, which I figured would enable a simple and fast Crandall reduction. It's reasonably fast, but seems like there's room for improvement.

We could probably get some speedup by switching to a slightly smaller field and delaying some reductions, although I expect that this would be fairly minor, since GMiMC doesn't do many additions in between multiplications. My code might also have room for improvement independent of field choice.

@unzvfu suggested looking into binary fields, which should be very fast on CPUs. There may be some complications with using FRI with binary fields, but it seems worth looking into.

A much more speculative idea is to use a field that isn't 2-adic, like p = 2^61 - 1. This would complicate LDEs and FRI. For LDEs, I suppose we could either use mixed-arity FFTs (might be expensive to do the high-arity layers), or use some other DFT algorithm with no smoothness requirement (not sure how practical they are). For FRI, we typically want pretty high-arity reductions anyway (like 4-to-1 or 8-to-1 layers), so it might not affect efficiency that much, but it would be a bit more complex.

It would be interesting to see what ~64 bit fields other projects have used, but I found very little online. AFAIK they don't come up in any widely used crypto primitives.

Is there a better way to arithmetize GMiMC?

We have a GMiMCGate which evaluates a full GMiMC permutation. The current code uses a wire for each S-box, so constraints have degree 3. We could go to a higher degree, which would increase costs in some ways, but would probably be worth it if it lets us decrease the number of wires substantially.

With a permutation like MiMC, we could just have a wire for every other S-box. Instead of degree-3 constraints like wire_2 = wire_1^3 + constant_2, we'd have degree-9 constraints like wire_3 = (wire_1^3 + constant_2)^3 + constant_3 (just inlining wire_2).

With a permutation like Rescue, reducing the wire count is similarly easy. We could have wires for every other layer, resulting in constraints of degree alpha^2.

GMiMC uses a Feistel network, so its dependency graph has a different structure. I don't see an obvious way to reduce the wire count, but I might just not be seeing it; maybe a fresh pair of eyes would help.

Make ZK optional

I think CircuitConfig should have a zero_knowledge flag, since we won't need ZK e.g. for proof aggregation.

The main benefit would be not needing blinding gates. That seems like a significant cost even for large circuits.

We could also send function evaluations on the subgroup itself instead of a coset, and/or not salt Merkle leaves. I think those are fairly minor costs though, especially the coset shift.

Need public API for creating proof targets and populating them using proof objects

As @Sladuca pointed out,

How do I construct recursive circuit outside the plonky2 crate? I see in the test suite for recursive_verifier.rs that recursive_proof uses proof_to_proof_target and set_proof_target, which turns a proof into a target and adds the proof as an input in a PartialWitness. Is there a reason these two functions are only in the test suite?

proof_to_proof_target seems to me like it would fit as a CircuitBuilder method (it takes a &mut CircuitBuilder) and set_proof_target seems similar to other PartialWitness methods.

Also, does one need to build a new recursive circuit for each proof, or is it possible to build one recursive 'outer' circuit and feed it different proofs for the same 'inner' circuit as input?

I'm not sure about the proof_to_proof_target API though; let's think about some alternatives.

GMiMC, Rescue aren't compatible with `GoldilocksField`

Our GMiMC and Rescue code is using x^3, since when we wrote it we were using p = 2^64 - 9 * 2^28 + 1, but we're now using p = 2^64 - 2^32 + 1.

We could update it to x^7 (at least Rescue; GMiMC specifies x^3 though others might be okay). That said, maybe we should just delete them, since we have no plans to use GMiMC or Rescue.

Thanks to Vijay who noticed this incompatibility.
