
Plonky3's Introduction


Plonky3 is a toolkit for implementing polynomial IOPs (PIOPs), such as PLONK and STARKs. It aims to support several polynomial commitment schemes, such as Brakedown.

This is the "core" repo, but the plan is to move each crate into its own repo once APIs stabilize.

Status

Fields:

  • Mersenne31
    • "complex" extension field
    • ~128 bit extension field
    • AVX2
    • AVX-512
    • NEON
  • BabyBear
    • ~128 bit extension field
    • AVX2
    • AVX-512
    • NEON
  • Goldilocks
    • ~128 bit extension field

Generalized vector commitment schemes

  • generalized Merkle tree

Polynomial commitment schemes

  • FRI-based PCS
  • tensor PCS
  • univariate-to-multivariate adapter
  • multivariate-to-univariate adapter

PIOPs

  • univariate STARK
  • multivariate STARK
  • PLONK

Codes

  • Brakedown
  • Reed-Solomon

Interpolation

  • Barycentric interpolation
  • radix-2 DIT FFT
  • radix-2 Bowers FFT
  • four-step FFT
  • Mersenne circle group FFT

Hashes

  • Rescue
  • Poseidon
  • Poseidon2
  • BLAKE3
    • modifications to tune BLAKE3 for hashing small leaves
  • Keccak-256
  • Monolith

Benchmark

We sometimes use a Keccak AIR to compare Plonky3's performance to other libraries like Plonky2. Several variations are possible here, with different fields and so forth, but here is one example:

RUST_LOG=info cargo run --example prove_baby_bear_keccak --release --features parallel

CPU features

Plonky3 contains optimizations that rely on newer CPU instructions that are not available in older processors. These instruction sets include x86's BMI1 and 2, AVX2, and AVX-512. Rustc does not emit those instructions by default; they must be explicitly enabled through the target-feature compiler option (or implicitly by setting target-cpu). To enable all features that are supported on your machine, you can set target-cpu to native. For example, to run the tests:

RUSTFLAGS="-Ctarget-cpu=native" cargo test

Support for some instructions, such as AVX-512, is still experimental. They are only available in the nightly build of Rustc and are enabled by the nightly-features feature flag. To use them, you must enable the flag in Rustc (e.g. by setting target-feature) and you must also enable the nightly-features feature.

Nightly-only optimizations

Some optimizations (in particular, AVX-512-optimized math) rely on features that are currently available only in the nightly build of Rustc. To use them, you need to enable the nightly-features feature. For example, to run the tests:

cargo test --features nightly-features
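
If you also want the AVX-512 code paths compiled in, combine this with the target-cpu/target-feature flags from the previous section. A combined invocation might look like the following (illustrative; it assumes a nightly toolchain is installed and that your CPU actually supports the relevant instructions):

RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly test --features nightly-features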

Known issues

The verifier might panic upon receiving certain invalid proofs.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Guidance for external contributors

Do you feel keen and able to help with Plonky3? That's great! We encourage external contributions!

We want to make it easy for you to contribute, but at the same time we must manage the burden of reviewing external contributions. We are a small team, and the time we spend reviewing external contributions is time we are not developing ourselves.

We also want to help you to avoid inadvertently duplicating work that is already underway, or building something that we will not want to incorporate.

First and foremost, please keep in mind that this is a highly technical piece of software and contributing is only suitable for experienced mathematicians, cryptographers and software engineers.

The Polygon Zero Team reserves the right to accept or reject any external contribution for any reason, including a simple lack of time to maintain it (now or in the future); we may even decline to review something that is not considered a sufficiently high priority for us.

To avoid disappointment, please communicate your intention to contribute openly, while respecting the limited time and availability we have to review and provide guidance for external contributions. It is a good idea to drop a note in our public Discord #development channel of your intention to work on something, whether an issue, a new feature, or a performance improvement. This is probably all that's really required to avoid duplication of work with other contributors.

What follows are some more specific requests for how to write PRs in a way that will make them easy for us to review. Deviating from these guidelines may result in your PR being rejected, ignored or forgotten.

General guidance for your PR

Obviously PRs will not be considered unless they pass our Github CI. The Github CI is not executed for PRs from forks, but you can simulate the Github CI by running the commands in .github/workflows/ci.yml.

Under no circumstances should a single PR mix different purposes: Your PR is either a bug fix, a new feature, or a performance improvement, never a combination. Nor should you include, for example, two unrelated performance improvements in one PR. Please just submit separate PRs. The goal is to make reviewing your PR as simple as possible, and you should be thinking about how to compose the PR to minimise the burden on the reviewer.

Plonky3 uses stable Rust, so any PR that depends on unstable features is likely to be rejected. It's possible that we may relax this policy in the future, but we aim to minimize the use of unstable features; please discuss with us before enabling any.

Here are a few specific guidelines for the three main categories of PRs that we expect:

The PR fixes a bug

In the PR description, please clearly but briefly describe

  1. the bug (could be a reference to a GH issue; if it is from a discussion (on Discord/email/etc. for example), please copy in the relevant parts of the discussion);
  2. what turned out to be the cause of the bug; and
  3. how the PR fixes the bug.

Wherever possible, PRs that fix bugs should include additional tests that (i) trigger the original bug and (ii) pass after applying the PR.

The PR implements a new feature

If you plan to contribute an implementation of a new feature, please double-check with the Polygon Zero team that it is a sufficient priority for us that it will be reviewed and integrated.

In the PR description, please clearly but briefly describe

  1. what the feature does
  2. the approach taken to implement it

All PRs for new features must include a suitable test suite.

The PR improves performance

Performance improvements are particularly welcome! Please note that it can be quite difficult to establish true improvements for the workloads we care about. To help filter out false positives, the PR description for a performance improvement must clearly identify

  1. the target bottleneck (only one per PR to avoid confusing things!)
  2. how performance is measured
  3. characteristics of the machine used (CPU, OS, #threads if appropriate)
  4. performance before and after the PR

Licensing

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.


Plonky3's Issues

Portable serialization

With some upcoming BabyBear code from @nbgl, the internal encoding of BabyBear will no longer be portable; it will differ for different targets. This may happen with other fields too. This is problematic since we currently derive Serialize and Deserialize, which uses these internal encodings.

It might be cleanest to just change all fields (but at least BabyBear for now) to use canonical encodings (as_canonical_u32 etc.) for serialization. It will involve custom impls of Serialize and Deserialize.
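
As a rough sketch of what such a custom impl could look like for one field (illustrative only; it assumes BabyBear implements PrimeField32 and that the impl lives in the baby-bear crate, with Deserialize handled analogously via the canonical value):

    use p3_field::PrimeField32;
    use serde::{Serialize, Serializer};

    impl Serialize for BabyBear {
        fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
            // Serialize the canonical (architecture-independent) value rather
            // than the internal, possibly Montgomery-form, representation.
            serializer.serialize_u32(self.as_canonical_u32())
        }
    }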

Optional Trait for Two Adic Field

An extension field can be a two-adic field, containing a multiplicative subgroup of order 2^k.

A safe way to implement a two-adic field trait for the extension field is to record the degree and generator on the base field's XXExtendable trait, so that the developer can check the actual order this subgroup can have.
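
A hypothetical sketch of the shape this could take (names are illustrative, not existing Plonky3 APIs):

    use p3_field::Field;

    /// Data recorded on the base field so that a degree-D extension built on
    /// top of it can safely expose a two-adic field implementation.
    pub trait TwoAdicExtendable<const D: usize>: Field {
        /// Two-adicity of the degree-D extension's multiplicative group.
        const EXT_TWO_ADICITY: usize;

        /// Generator of the order-2^bits subgroup of the extension, expressed
        /// in coordinates over the base field; callers can check `bits`
        /// against EXT_TWO_ADICITY.
        fn ext_two_adic_generator(bits: usize) -> [Self; D];
    }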

Do we need nightly for CI?

CI uses a specific nightly from a few months ago, which can be slightly annoying when fmt or clippy behaves differently than a local stable toolchain. AFAICT everything works fine on stable. OK to change CI to stable?

Poseidon2: use specialized matrix for external rounds

I think we initially misunderstood Poseidon2 as requiring an MDS matrix for external (a.k.a. full) rounds. After looking at it more carefully, it seems the (external) linear layer they propose is not MDS (although it uses a 4x4 MDS matrix as a building block), but has a slightly weaker diffusion property. It seems like it should probably be faster than any MDS layer.

cc @jtguibas @vivekvpandya

`MulAir` verification fails with degree 2 constraints

I had to artificially increase MulAir's constraint degree to 3 to avoid the error for now; see

// TODO: Temporarily added this silly degree 3 constraint because we're getting an
// OodEvaluationMismatch when log_quotient_degree = 0.
builder.assert_zero(a * b * c - c * b * a);

Delayed-reduction arithmetic API

As @unzvfu mentioned on a call, our current packed field types reduce very aggressively, after each addition and multiplication. However, it is possible to make some calculations faster by delaying the reduction. (For example, consider the problem of summing n Mersenne-31 field elements; we can accumulate into a 64-bit register and only reduce right before returning, thus replacing n - 1 simple reductions with one complex reduction.) We would like to expose an API that lets users exploit these efficiencies without relying on the implementation details of packed field arithmetic (e.g. the field order, valid representation range, Montgomery width+constants).

I am not in favor of adding a low-level API that defines a type for a wide accumulator (which would contain an unreduced sum). Our packed arithmetic is already complex to understand, and I am skeptical that untrained users would be able to use such an API effectively.

Instead, I propose implementing common operations that benefit from delayed reduction. For now, this means a sum of many values and a dot product of two long sequences.

There are two main settings into which our solution must fit: scalar code and packed code. I propose separating those interfaces out.

Scalar code

We should implement a sum of many values returning a scalar result; and a dot product of many pairs, also returning a scalar result. Although the operations would (should) be ultimately implemented in SIMD, it does not make sense to mention any packed field types in the API.

The easiest place to put it would be in the AbstractField trait, which also permits a trivial default implementation.

pub trait AbstractField: ... {
    ...

    fn sum(xs: &[Self]) -> Self {
        // default implementation
    }

    fn dp(xs: &[Self], ys: &[Self]) -> Self {
        // default implementation
    }
}
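
For illustration, naive default bodies could look like the following (a sketch only; the real AbstractField has more supertraits than shown here, and a SIMD-backed field would override these defaults):

    use core::ops::{Add, Mul};

    pub trait AbstractFieldSketch: Clone + Add<Output = Self> + Mul<Output = Self> + Sized {
        fn zero() -> Self;

        fn sum(xs: &[Self]) -> Self {
            // Naive fallback: plain left-to-right accumulation.
            xs.iter().cloned().fold(Self::zero(), |acc, x| acc + x)
        }

        fn dp(xs: &[Self], ys: &[Self]) -> Self {
            // Naive fallback: pairwise products, then a plain sum.
            xs.iter()
                .zip(ys)
                .map(|(x, y)| x.clone() * y.clone())
                .fold(Self::zero(), |acc, t| acc + t)
        }
    }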

The inputs are in the form of contiguous vectors, as that's the most natural for SIMD code (a vector of contiguous elements can be loaded from memory in one instruction). It would be more Rust-y to accept iterators of scalars instead, but this would hide the performance implications: we would have to move all values into the SIMD domain and gather them in vector registers; this can be a surprising amount of work.

Packed code

A function catering to packed workflows should accept packed values and return packed values. Everything would happen element-wise (e.g. if a PackedField has width 4, the sum function performs four separate sums).

Again, we can decide exactly where to put it, but the function API could be put in PackedField, in which case it would look something like:

pub trait PackedField: ... {
    ...

    fn sum(xs: impl Iterator<Item = Self>) -> Self {
        // default implementation
    }

    fn dp(xs: impl Iterator<Item = (Self, Self)>) -> Self {
        // default implementation
    }
}

We can accept iterators here, because the elements they yield are already loaded into SIMD types.

Matrices?

Although the dp function could be used to implement matrix-vector and matrix-matrix multiplication, more specialized code can achieve better cache performance (and, in the matrix-matrix case, better time complexity). I suggest omitting it for now, however, while we wait for our matrix APIs to mature.

Simplify implementation of `split_32()`

The current implementation of the split_32() function seems unnecessarily complicated:

    pub fn split_32<SF: PrimeField, TF: PrimeField32>(val: SF, n: usize) -> Vec<TF> {
        let po2 = BigUint::from(1u128 << 64);
        let mut val = val.as_canonical_biguint();
        let mut result = Vec::new();
        for _ in 0..n {
            let mask: BigUint = po2.clone() - BigUint::from(1u128);
            let digit: BigUint = val.clone() & mask;
            let digit_u64s = digit.to_u64_digits();
            if !digit_u64s.is_empty() {
                result.push(TF::from_wrapped_u64(digit_u64s[0]));
            } else {
                result.push(TF::zero())
            }
            val /= po2.clone();
        }
        result
    }

Is the following straightforward implementation not equivalent?

    let mut result: Vec<TF> = val
        .as_canonical_biguint()
        .to_u64_digits()
        .iter()
        .take(n)
        .map(|d| TF::from_wrapped_u64(*d))
        .collect();
    result.resize_with(n, || TF::zero());
    result

Support of `ExtensionField` in proof generation

I'm trying to run the keccak_air prover example on an ExtensionField:

type Val = Goldilocks;
type Domain = BinomialExtensionField<Goldilocks, 2>;
type Challenge = BinomialExtensionField<Goldilocks, 2>;

I'm seeing a type mismatch error arising from a bound required by QuotientMmcs.


I think another blocker is that TwoAdicField is not yet implemented for BinomialExtensionField.

Cache-friendly transpose in `RowMajorMatrix`

Besides being a generally useful utility, this would give us an easy way to parallelize a batch FFT, by splitting it into single-column FFTs as Plonky2 does. I think this probably isn't the best way to parallelize in general, but it would be interesting to have the option and compare it with other methods.

Poseidon hash compatibility with Plonky2?

I didn't find the correct round constants in the codebase, so I tried to set them manually, but unfortunately I still get a different result from Plonky2.

Replace references to OEFs in the extension field with something more accurate

The extension field code introduced in #95 refers to extensions defined by binomial irreducible polynomials as Optimal Extension Fields (OEFs) (e.g. QuadraticOef, CubicOef, OptimallyExtendable, etc.). This naming was inherited from Plonky2; however, it is not really accurate and could cause confusion. Moreover, depending on a particular architecture, we may want to consider implementations of field extensions with irreducible polynomials that are not binomial.

We should try to find more accurate names for the current traits and structs, and have a system of names that will not clash with choosing irreducible polynomials of other forms in the future.

Test coverage seems a bit low?

$ cargo tarpaulin --all-features
[...]
2.65% coverage, 29/1094 lines covered, +0.00% change in coverage

Note, running cargo tarpaulin requires cargo install tarpaulin and the fixes from #12 because tarpaulin wants to build all targets by default, and some of them are currently broken.

FRI grinding (proof of work)

Grinding can boost security without adding queries. I think the idea was first documented in the ethSTARK paper (though it was tribal knowledge before that).

There are some subtly different ways of integrating grinding into the protocol. This is what we settled on in Plonky2:

  • In between FRI's commit and query phases, have the prover send an arbitrary message (Plonky2 called it pow_witness).
  • Observe that message, i.e. treat it as part of the transcript for the purpose of Fiat-Shamir.
  • Have the verifier sample a random challenge, using CanSample as usual.
  • Have the verifier reject unless the challenge is sufficiently small.

I thought treating the witness and challenge as part of the transcript like that was a nice and clean approach. However, it does make it potentially harder for the prover to conduct its search efficiently. See this prover code from Plonky2.

There's also a question of how large the PoW witness should be. In Plonky2 it was a single field element, so there was a non-negligible chance that no valid PoW witness existed. Plonky2 had to deal with non-negligible completeness errors anyway, given the way it used cumulative product arguments over small fields. Plonky3 doesn't have any such non-negligible completeness errors built in, so to maintain that property, it might be best to use several field elements for the PoW witness, or make it configurable.
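
To make the last two steps concrete, a verifier-side sketch could look like this (hypothetical; the name check_grinding, the pow_bits parameter, and the exact "sufficiently small" condition are all illustrative, not Plonky3 API):

    use p3_challenger::{CanObserve, CanSample};
    use p3_field::PrimeField64;

    fn check_grinding<F, C>(challenger: &mut C, pow_witness: F, pow_bits: u32) -> bool
    where
        F: PrimeField64,
        C: CanObserve<F> + CanSample<F>,
    {
        // The witness becomes part of the Fiat-Shamir transcript.
        challenger.observe(pow_witness);
        // The verifier samples a challenge as usual...
        let challenge: F = challenger.sample();
        // ...and rejects unless it is rare enough; here, "sufficiently small" is
        // interpreted as the low `pow_bits` bits of the canonical value being zero.
        challenge.as_canonical_u64().trailing_zeros() >= pow_bits
    }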

Poseidon2 for AbstractField<F>

Focusing on F = BabyBear here but the same idea applies for other fields.

In #303 we changed the impl's for DiffusionMatrixBabyBear to allow for architecture specific implementations. This should allow for fast vectorized code to be written for Poseidon2. Unfortunately, this also comes with the downside that DiffusionMatrixBabyBear no longer works for arbitrary abstract fields over BabyBear such as SymbolicExpression.

Ideally we would have a solution where we can use a general implementation for an arbitrary BabyBear abstract field but also a specialized implementation for specific abstract fields (BabyBearNeon, BabyBearAVX2, BabyBearAVX512). Unfortunately, this sort of specialization is not something Rust supports in a natural way.

Just going to leave this issue here for now, and come back to it in the future.

Parallelized FFTs

#50 could lead to one method, but it might not be ideal, certainly not when the number of columns is less than the number of cores. Need to experiment with other approaches.

Allow building without rayon

Currently p3-maybe-rayon has parallel enabled by default; it should probably not be the default, and all consumer crates need feature flags.

As well, p3-matrix, p3-dft, p3-maybe-rayon, p3-uni-stark, and others need fixes to import the right traits to work seamlessly.

DFT base is reversed.

After some painful debugging, I found that the current DFT implementation evaluates over a different sequence of points than the one implied by the generator provided by the field.

In NaiveDft and all other DFT implementations consistent with it, including Mersenne31Dft:

let g = F::two_adic_generator(log_h);

let mut res = RowMajorMatrix::new(vec![F::zero(); w * h], w);
for (res_r, point) in g.powers().take(h).enumerate() {
    for (src_r, point_power) in point.powers().take(h).enumerate() {
        for c in 0..w {
            res.values[res_r * w + c] += point_power * mat.values[src_r * w + c]
        }
    }
}

Usually we expect the DFT to be over {1, g, g^2, g^3, ...}, but here it is over {1, g^-1, g^-2, g^-3, g^-4, ...}. This can be a trap for users and developers (including me). Is there any specific reason to do that?

Re-enable univariate STARK verifier divisibility check

The univariate STARK verifier finally checks the proof by rejecting unless folded_constraints == z_h * quotient, i.e. unless the claimed quotient polynomial times the zerofier equals the claimed folded constraints polynomial at the pseudo-random challenge point. Currently this check is disabled. This seems to be because the verifier is not successfully verifying the proofs generated by the prover. In order to solve this issue, some debugging of the prover may be needed.

Optimize MDS Vectors for faster MDS convolutional layers

Currently, the MDS layers of size 8, 12, 16, 24, 32, 64 are implemented by doing a convolution with an MDS vector (Meaning a vector whose associated circulant matrix is MDS) v of appropriate size. As of #248, these convolutions are now done using a divide and conquer strategy.

This lets us replace a convolution of size N with a convolution and a negacyclic convolution of size N/2. Similarly, we can replace a negacyclic convolution of size N with three negacyclic convolutions of size N/2.

Right now, the MDS vectors have been chosen to optimise the speed of the naive approach to convolutions (which essentially means maximizing the number of 1's in the vector). Hence we should adjust our choices to reflect the change of strategy.

Unfortunately, finding optimal MDS vectors in our new setting is complicated. Broadly speaking we want to find vectors with the following properties ranked in order of importance/ease of doing:

  1. Small Entries -- (To avoid ever needing to pass through i128's)
  2. Lots of 1/-1 in the innermost layers
  3. Faster innermost layers.

For sizes 24, 32, 64 we currently cannot even do step 1, as checking whether a vector of that size is MDS is, for now, too expensive. Long term, size 24 should be doable, 32 might be if we focus enough resources, and 64 is almost certainly out of reach without a serious mathematical advance.

On the other hand, the MDS vectors of sizes 8, 12, 16 are definitely optimisable. For an example let's look at the vector of length 8.

Currently we use the vector: v = (4, 1, 2, 9, 10, 5, 1, 1) which means that the inner convolution/negacyclic convolutions are with the vectors: v_pos[x] = v[x] + v[x + 4] = (14, 6, 3, 10) and v_neg[x] = v[x] - v[x + 4] = (-6, -4, 1, 8).

On the other hand, we could use the vector: v' = (1, 2, -4, -4, 2, 3, 5, -3) which makes the inner vectors v'_pos[x] = (3, 5, 1, -7) and v'_neg[x] = (-1, -1, -9, -1). Thanks to all the 1's and -1's this should be somewhat (At a guess ~5-15%) faster.

Godbolt code for comparison.
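
For concreteness, the even/odd split described above can be written as follows (an illustrative helper over plain integers, not the repo's actual convolution code):

    /// Split a length-N circulant vector into the length-N/2 vectors for the
    /// cyclic (v_pos) and negacyclic (v_neg) sub-convolutions.
    fn split_mds_vector(v: &[i64]) -> (Vec<i64>, Vec<i64>) {
        let half = v.len() / 2;
        let v_pos: Vec<i64> = (0..half).map(|x| v[x] + v[x + half]).collect();
        let v_neg: Vec<i64> = (0..half).map(|x| v[x] - v[x + half]).collect();
        (v_pos, v_neg)
    }

For v = (4, 1, 2, 9, 10, 5, 1, 1), this returns exactly the (14, 6, 3, 10) and (-6, -4, 1, 8) pair given above.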

A similar strategy should work in the N = 12, 16 cases though these cases may be more complicated as there are multiple folding steps before we reach the base cases of 3, 4. It may also be worth stopping some of the recursion at size 8, 6 depending on the optimal vectors we find. Once we have chosen all the vectors we may also want to hardcode specific addition/multiplication chains to speed up the innermost layers further. (See Poseidon2 for an example of this sort of hard coding)

I'm just leaving this here for now, as this will probably require a bunch of engineering both in the search and in rewriting the code, so we should leave it until later, once we know what the big bottlenecks are. It does seem likely, though, that this could give a significant speedup to the MDS layers.

Make partial functions total

Various Plonky3 functions might exhibit partiality; on some inputs, they might fail to terminate with a result of the output type. In all the cases I know of, this is due to the use of functions which may panic. Instead of panicking, functions should return a result indicating an error. The functions which are currently partial should be modified to return a Result type.

It is okay to use functions which may panic in unreachable code sections and in cases where it is impossible for the functions to get inputs which result in a panic. However, attainable conditions should never result in a panic.

The reason for this issue is that Plonky3 is going to be used in services, and those services can't panic. They have to stay running even when errors occur.
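
As a minimal illustration of the requested change (the names here are placeholders, not existing Plonky3 items):

    #[derive(Debug)]
    pub struct NotAPowerOfTwo(pub usize);

    /// Total variant of a helper like `log2_strict_usize`: instead of
    /// panicking on inputs that are not powers of two, it reports the error.
    pub fn log2_strict_checked(n: usize) -> Result<usize, NotAPowerOfTwo> {
        if n.is_power_of_two() {
            Ok(n.trailing_zeros() as usize)
        } else {
            Err(NotAPowerOfTwo(n))
        }
    }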

The following code sections (as far as I know) may potentially result in partiality. Line numbers are relative to commit 000681e190eb0f7fc8523f47b8bd06de075460d8.

There are 10 instances of panic!() in non-test code:

  1. In the MatrixRows<T>::row implementation for TwoRowMatrixView<'_, T>, in air/src/two_matrix.rs.
  2. In the MatrixRowSlices<T>::row_slice implementation for TwoRowMatrixView<'_, T>, in air/src/two_matrix.rs.
  3. In the unsafe implementation of PackedField::interleave for PackedBabyBearNeon in baby-bear/src/aarch64_neon.rs.
  4. In the implementation of PackedField::interleave for PackedMersenne31Neon in mersenne-31/src/aarch64_neon.rs.
  5. In the implementation of get_max_height in the trait definition of Mmcs<T>, in commit/src/mmcs.rs.
  6. In the implementation of PackedField::interleave for F: Field in field/src/packed.rs.
  7. In the implementation of prove in fri/src/prover.rs.
  8. In the implementation of AirBuilder::is_transition_window for ConstraintFolder<'a, F, Challenge, PackedChallenge> in multi-stark/src/folder.rs.
  9. In the implementation of AirBuilder::is_transition_window for ProverConstraintFolder<'a, SC> in uni-stark/src/folder.rs.
  10. In the implementation of AirBuilder::is_transition_window for VerifierConstraintFolder<'a, Challenge> in uni-stark/src/folder.rs.

There are 9 instances of .expect() in non-test code:

  1. In the implementation of CanSample<F>::sample for DuplexChallenger<F, P, WIDTH> in challenger/src/duplex_challenger.rs.
  2. In the implementation of CanSample<F>::sample for HashChallenger<F, H, OUT_LEN> in challenger/src/hash_challenger.rs.
  3. In the implementation of AbstractExtensionField<F>::from_base_slice for BinomialExtensionField<F, D> in fields/src/extension/binomial_extension.rs.
  4. In the default implementation of Field::inverse in field/src/field.rs.
  5. In the implementation of sum_vecs in field/src/helpers.rs.
  6. In the implementation of get_repeated in ldt/src/quotient.rs.
  7. In the implementation of AbstractField::from_canonical_u64 for Mersenne31 in mersenne-31/src/lib.rs.
  8. In the implementation of AbstractField::from_canonical_usize for Mersenne31 in mersenne-31/src/lib.rs.
  9. In the implementation of get_inverse in rescue/src/util.rs.

There are 20 instances of .unwrap() in non-test code:

  1. In the implementation of batch_multiplicative_inverse in field/src/batch_inverse.rs.
  2. In the implementation of Permutation::permute for KeccakF in keccak/src/lib.rs.
  3. In the implementation of Mccs<EF>::open_batch for QuotientMmcs<F, EF, Inner> in ldt/src/quotient.rs.
  4. In the implementation of Mccs<EF>::verify_batch for QuotientMmcs<F, EF, Inner> in ldt/src/quotient.rs.
  5. In the implementation of Default::default for CosetMds<F, N> in mds/src/coset_mds.rs.
  6. In the implementation of apply_circulant_fft in mds/src/util.rs.
  7. In the implementation of FieldMerkleTree<F, DIGEST_ELEMS>::new, in merkle-tree/src/merkle-tree.rs.
  8. In the implementation of FieldMerkleTree<F, DIGEST_ELEMS>::root, in merkle-tree/src/merkle-tree.rs.
  9. In the implementation of Mmcs<P::Scalar>::verify_batch for FieldMerkleTreeMmcs<P, H, C, DIGEST_ELEMS>, in merkle-tree/src/mmcs.rs.
  10. In the implementation of Permutation::permute for MonolithMdsMatrixMersenne31<NUM_ROUNDS> in monolith/src/monolith.rs.
  11. In the implementation of Rescue<F, Mds, Sbox, WIDTH>::num_rounds in rescue/src/rescue.rs.
  12. In the implementation of get_alpha in rescue/src/util.rs.
  13. In the implementation of get_inverse in rescue/src/util.rs.
  14. In the implementation of PseudoCompressionFunction<[T; CHUNK], N>::compress for TruncatedPermutation<InnerP, N, CHUNK, WIDTH> in symmetric/src/compression.rs.
  15. In the implementation of CryptographicHasher<F, [F; 4]>::hash_iter for SerializingHasher<F, Inner> in symmetric/serializing_hasher.rs.
  16. In the implementation of CryptographicHasher<F, [F; 8]>::hash_iter for SerializingHasher<F, Inner> in symmetric/serializing_hasher.rs.
  17. In the implementation of CryptographicHasher<T, [T; OUT]>::hash_iter for PaddingFreeSponge<P, WIDTH, RATE, OUT> in symmetric/src/sponge.rs.
  18. In the implementation of optimal_wraps in tensor_pcs/src/reshape.rs.
  19. In the implementation of Iterator::next for WrappedMatrixRow<'a, T, M> in tensor-pcs/src/wrapped_matrix.rs.
  20. In the implementation of prove in uni-stark/src/prover.rs.

The following files contain instances of assert!() and/or assert_eq!() in non-test code:

  1. baby-bear/src/aarch64_neon.rs
    a. In PackedField::from_slice for PackedBabyBearNeon, line 572
    b. In PackedField::from_slice_mut for PackedBabyBearNeon, line 582
  2. field/src/field.rs
    a. In AbstractExtensionField<AF>::from_base_slice for AF : AbstractField, line 309
  3. field/src/helpers.rs
    a. In add_vecs, line 36
  4. keccak-air/src/columns.rs
    a. In Borrow<KeccakCols<T>> for [T], line 126
    b. In Borrow<KeccakCols<T>> for [T], line 127
    c. In Borrow<KeccakCols<T>> for [T], line 128
    d. In BorrowMut<KeccakCols<T>> for [T], line 137
    e. In BorrowMut<KeccakCols<T>> for [T], line 138
    f. In BorrowMut<KeccakCols<T>> for [T], line 139
  5. matrix/src/mul.rs
    a. In mul_csr_dense, line 19
  6. matrix/src/stack.rs
    a. In VerticalPair<T, First, Second>::new, line 14
  7. mersenne-31/src/aarch64_neon.rs
    a. In PackedField::from_slice for PackedMersenne31Neon, line 522
    b. In PackedField::from_slice_mut for PackedMersenne31Neon, line 533
  8. mersenne-31/src/complex.rs
    a. In TwoAdicField::two_adic_generator for Mersenne31Complex<Mersenne31>, line 275
    b. In AbstractExtensionField<AF>::from_base_slice for Mersenne31Complex<AF> where AF: AbstractField<F = Mersenne31>
  9. monolith/src/monolith.rs
    a. In MonolithMersenne31<Mds, WIDTH, NUM_FULL_ROUNDS>::new where Mds: MdsPermutation<Mersenne31, WIDTH>, line 38
    b. In MonolithMersenne31<Mds, WIDTH, NUM_FULL_ROUNDS>::new where Mds: MdsPermutation<Mersenne31, WIDTH>, line 39
    c. In MonolithMersenne31<Mds, WIDTH, NUM_FULL_ROUNDS>::new where Mds: MdsPermutation<Mersenne31, WIDTH>, line 40
  10. poseidon/src/lib.rs
    a. In Poseidon<F, Mds, WIDTH, ALPHA>::new, line 40
  11. tensor-pcs/src/wrapped_matrix.rs
    a. In WrappedMatrix::new, line 16
  12. uni-stark/src/prover.rs
    a. In prove, line 45
  13. util/src/lib.rs
    a. In log2_strict_usize, line 32

The following files contain unsafe blocks, which should be assumed pending further review to potentially result in partiality:

  1. baby-bear/src/aarch64_neon.rs
  2. dft/src/butterflies.rs
  3. dft/src/util.rs
  4. field/src/packed.rs
  5. goldilocks/src/lib.rs
  6. keccak-air/src/columns.rs
  7. matrix/src/dense.rs
  8. mersenne-31/src/aarch64_neon.rs
  9. monolith/src/monolith.rs
  10. util/src/lib.rs

The following arithmetic operations may potentially result in a division by zero, which would cause a panic:

  1. In the implementation of Matrix<T>::height for RowMajorMatrix<T>, in matrix/src/dense.rs, line 179.
  2. In the implementation of Matrix<T>::height for RowMajorMatrixViewM<'_, T> in matrix/src/dense.rs, line 256.
  3. In the implementation of Matrix<T>::height for VerticallyStridedMatrixView<Inner> in matrix/src/dense.rs, line 16.
  4. In the implementation of Mmcs<EF>::open_batch for QuotientMmcs<F, EF, Inner> in ldt/src/quotient.rs, on line 71.
  5. In the implementation of Matrix<T>::height for WrappedMatrix<T, M> in tensor-pcs/src/wrapped_matrix.rs, line 34.

The following instances of unchecked array indexing may potentially result in an index out of range error and a panic:

  1. In FieldMerkleTree<F, DIGEST_ELEMS>::root, in merkle_tree/src/merkle_tree.rs, line 53.
  2. In first_digest_layer, in merkle_tree/src/merkle_tree.rs, line 107.
  3. In compress_and_inject, in merkle_tree/src/merkle_tree.rs, lines 160, 171, 172, 188, 189, 198, and 199.
  4. In compress, in merkle_tree/src/merkle_tree.rs, lines 230, 231, 241, 242.
  5. In prove, in uni-stark/src/prover.rs, lines 75, 76, and 77.

Optimized delayed-reduction loops

I investigated delayed-reduction sums and dot products [1] for Mersenne-31 and Baby Bear arithmetic on Neon, AVX2 and AVX-512, and [2] they are really bloody fast!

Here are the core techniques and tricks, together with calculated throughput results. The core loops are the same for Mersenne-31 and Baby Bear, since the reduction doesn't happen until the end [3].

As before, the Neon code is optimized for the Apple M1 (Firestorm) [4], and the AVX-512 code is optimized for Ice Lake [5]. My previous AVX2 code was optimized for Skylake, but I've decided it's high time to write code optimized for Alder Lake-P [6]; it does not support AVX-512, so AVX2 performance is critical.

Sums

Neon

Let's start with the easy one.

Remember that we are given a sequence of vectors where each vector holds four 31-bit values. At the end, we want the result to be one length-4 vector, where each element is a sum of the corresponding elements in the sequence [7].

We do this by keeping a running 64-bit total of the sum and only reducing at the end [8]. Since we are accumulating vectors of length 4, we need two 128-bit vectors to fit four 64-bit accumulators.

We actually begin by taking two vectors from our input sequence and performing a 32-bit sum, which cannot overflow; this is cheap, as it only needs one instruction. It is this vector of sums that gets added to our accumulators; for that we use the uaddw and uaddw2 instructions, which take care of both the type conversion (32 to 64-bit) and addition.

In the code below, the accumulators are in v0 and v1. The pointer to the buffer containing the inputs is in x0.

    // Load 2 vectors
    ldp     q16, q17, [x0]

    // 32-bit add
    add     v16.4s, v16.4s, v17.4s
    
    // 64-bit accumulate
    uaddw   v0.2d, v0.2d, v16.2s
    uaddw2  v1.2d, v1.2d, v16.4s

The vector units are the bottleneck here. The M1 only has 4 vector units, so we can accumulate 2.67 vectors per cycle if we have enough accumulators. Without delayed reductions, we could only accumulate 1.33 vectors per cycle, so we obtain a 2x improvement in throughput.
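
For reference, the same loop body can be expressed with Rust's aarch64 intrinsics (an illustrative rendering of the assembly above, not the actual Plonky3 code):

    #[cfg(target_arch = "aarch64")]
    use core::arch::aarch64::*;

    /// One step of the delayed-reduction sum: 32-bit add of two input vectors,
    /// then widen-and-accumulate each half into a 64-bit accumulator.
    #[cfg(target_arch = "aarch64")]
    #[inline]
    unsafe fn accumulate_pair_neon(
        acc_lo: uint64x2_t,
        acc_hi: uint64x2_t,
        a: uint32x4_t,
        b: uint32x4_t,
    ) -> (uint64x2_t, uint64x2_t) {
        let s = vaddq_u32(a, b); // 32-bit add; cannot overflow for 31-bit inputs
        (
            vaddw_u32(acc_lo, vget_low_u32(s)), // like uaddw
            vaddw_high_u32(acc_hi, s),          // like uaddw2
        )
    }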

Note that the dependency chain on v0 and v1 still requires breaking. One solution is to have eight, not two, accumulators, and to sum them at the end. Another solution is to perform more intermediate sums before accumulating:

    // Load 8 vectors
    ldp     q16, q17, [x0]
    ldp     q18, q19, [x0, #32]
    ldp     q20, q21, [x0, #64]
    ldp     q22, q23, [x0, #96]
    
    // 32-bit sums
    add     v16.4s, v16.4s, v17.4s
    add     v18.4s, v18.4s, v19.4s
    add     v20.4s, v20.4s, v21.4s
    add     v22.4s, v22.4s, v23.4s

    // 32->64-bit widening sums
    uaddl   v24.2d, v16.2s, v18.2s
    uaddl2  v25.2d, v16.4s, v18.4s
    uaddl   v26.2d, v20.2s, v22.2s
    uaddl2  v27.2d, v20.4s, v22.4s

    // 64-bit sums
    add     v24.2d, v24.2d, v26.2d
    add     v25.2d, v25.2d, v27.2d

    // 64-bit accumulate
    add     v0.2d, v0.2d, v24.2d
    add     v1.2d, v1.2d, v25.2d

AVX2/AVX-512

AVX is somewhat trickier, because it does not have "widening add" instructions that can add 32-bit values to 64-bit values.

We begin with the same trick as Neon, performing a 32-bit add of two values; call this sum s. We will accumulate s into our two accumulators c_even and c_odd.

On AVX2 with 256-bit vectors, s has the form [ s_7, s_6, s_5, s_4, s_3, s_2, s_1, s_0 ] [9]. If we obtain s_even = [ 0, s_6, 0, s_4, 0, s_2, 0, s_0 ] and s_odd = [ 0, s_7, 0, s_5, 0, s_3, 0, s_1 ], we can immediately add them to c_even and c_odd, respectively. This can be done with two instructions: masking out the odd indices for s_even [10], and a 64-bit right shift for s_odd.

But there is a trick! Suppose that we do accumulate s_odd into c_odd, but instead of having a c_even, we just accumulate all of s (without masking!) into some third thing, call it c_tmp. c_odd can be reduced to obtain the results at odd positions. c_tmp is not meaningful in and of itself because it combines values from multiple lanes.

Notice, however, that c_tmp = (c_even + 2^32 * c_odd) mod 2^64, where we treat all variables as 4-vectors of 64-bit values. To recover c_even we just have to subtract 2^32 * c_odd from c_tmp. This lets us save one instruction.

In the end, we end up with the following code, where ymm0 and ymm1 are the accumulators and rdi contains the pointer to the input buffer.

    // Load from memory and perform 32-bit add
    vmovdqu  ymm2, [rdi]
    vpaddd   ymm2, ymm2, [rdi + 32]

    // Extract values at odd indices
    vpsrlq   ymm3, ymm2, 32

    // Accumulate
    vpaddq   ymm0, ymm0, ymm2
    vpaddq   ymm1, ymm1, ymm3

The bottleneck is the vector ports, of which we have three; we can accumulate 1.5 vectors per cycle (on Alder Lake-P). Without delayed reductions, the throughput is one vector per cycle.
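
The same loop body in Rust's x86 intrinsics, for reference (an illustrative rendering of the assembly above, not the actual Plonky3 code):

    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::*;

    /// One step of the delayed-reduction sum: 32-bit add of two input vectors,
    /// then accumulate the whole vector into c_tmp and its odd lanes into c_odd.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn accumulate_pair_avx2(
        c_tmp: __m256i,
        c_odd: __m256i,
        a: __m256i,
        b: __m256i,
    ) -> (__m256i, __m256i) {
        let s = _mm256_add_epi32(a, b);         // 32-bit add; cannot overflow for 31-bit inputs
        let s_odd = _mm256_srli_epi64::<32>(s); // move odd lanes into the low halves
        (_mm256_add_epi64(c_tmp, s), _mm256_add_epi64(c_odd, s_odd))
    }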

For AVX-512, the code is analogous:

    // Load from memory and perform 32-bit add
    vmovdqu  zmm2, [rdi]
    vpaddd   zmm2, zmm2, [rdi + 32]

    // Extract values at odd indices
    vpsrlq   zmm3, zmm2, 32

    // Accumulate
    vpaddq   zmm0, zmm0, zmm2
    vpaddq   zmm1, zmm1, zmm3

The throughput is 1 vector accumulation per cycle (on Ice Lake), against .67 without delayed reductions.

Alternative

As an alternative, we could replace each 64-bit accumulator with two 32-bit accumulators and perform additions with carry.

    // Load from memory and perform 32-bit add
    vmovdqu   ymm2, [rdi]
    vpaddd    ymm2, ymm2, [rdi + 32]

    // Accumulate low
    vpaddd    ymm2, ymm0, ymm2

    // Check for carry
    vpcmpgtd  ymm3, ymm0, ymm2
    vmovdqa   ymm0, ymm2

    // Add carry to high accumulator
    vpsubq    ymm1, ymm1, ymm3

We use the vpcmpgtd instruction to detect overflow [11]; it returns −1 if overflow is detected, which is why its result is subtracted, not added. Note that the vpcmpgtd operation is signed, so at the beginning of the procedure we must set the accumulator's sign bit to turn it into an unsigned operation [12]; the reduction code must undo this trick.

This method has the same throughput as our 64-bit accumulator method. However, it's a little bit more complicated to get right and has higher latency, so it's just a little bit worse.

Dot products

Neon

For dot products, we are working with two buffers. We want to multiply elements of one buffer with the corresponding elements of the other buffer and accumulate.

Since our inputs are 31-bit, the products are 62-bit. We can add four of these together without overflowing a 64-bit integer. This takes advantage of Neon's multiply-accumulate instructions.

It is that 64-bit sum which we will accumulate. We split each 64-bit sum into two 32-bit halves and accumulate them separately, letting the reduction code combine the accumulators into one 96-bit sum; the widening add instructions are useful here.

In the code below, v0, …, v3 are accumulators, while x0 and x1 hold the pointers to our two buffers.

    // Load 4 vectors from each buffer
    ldp     q16, q18, [x0]
    ldp     q17, q19, [x1]
    ldp     q20, q22, [x0, #32]
    ldp     q21, q23, [x1, #32]

    // Multiply-accumulate low and high halves of the vectors
    umull   v24.2d, v16.2s, v17.2s
    umull2  v25.2d, v16.4s, v17.4s
    umlal   v24.2d, v18.2s, v19.2s
    umlal2  v25.2d, v18.4s, v19.4s
    umlal   v24.2d, v20.2s, v21.2s
    umlal2  v25.2d, v20.4s, v21.4s
    umlal   v24.2d, v22.2s, v23.2s
    umlal2  v25.2d, v22.4s, v23.4s

    // Split the 64-bit sums into 32-bit halves and add them into 64-bit accumulators.
    uaddw   v0.2d, v0.2d, v24.2s
    uaddw2  v1.2d, v1.2d, v24.4s
    uaddw   v2.2d, v2.2d, v25.2s
    uaddw2  v3.2d, v3.2d, v25.4s

The vector execution units are again the bottleneck. We can process 1.33 vector terms per cycle (on the Apple M1), whereas the naive method achieves .5 terms per cycle for Mersenne-31 (2.67x speedup) and .36 terms per cycle for Baby Bear (3.67x speedup).

Note that the umull/umlal instructions do form a rather long dependency chain [13], so aggressive loop unrolling may be beneficial here.

AVX2/AVX-512

On AVX, we use the same general technique as on Neon, but with a few differences.

Firstly, AVX does not have multiply-accumulate instructions [14], so we end up with a few more add instructions. No biggie. Another difference is that, as with sums, AVX does not have widening adds. So we utilize the same trick as with sums for our two accumulators per result, although with a vmovshdup instead of a right shift to move some of the work to port 5.

The somewhat tricky difference is in AVX's handling of vpmuludq inputs. The instruction treats each input as a vector of four (eight for AVX-512) 64-bit values, but it completely ignores the upper 32 bits. The result is a vector of four(/eight) 64-bit products.

What this means is that we can load two vectors of eight 32-bit ints and pass them to vpmuludq to obtain the products of all the even positions. We don't have to clear the odd indices or anything; they just get ignored.

To get the products of the odd indices, we first have to move them into even positions (again, we don't have to worry about what gets left in the odd positions). One's first thought might be right shifts, but these compete with multiplication (they run on the same ports), so we should use swizzle instructions. One swizzle instruction, vmovshdup [15], is particularly interesting here, because when its operand is in memory, it executes entirely within a memory port, relieving pressure from our vector ports.

Without further ado, on AVX2 we end up with the following code, where ymm0, …, ymm3 are accumulators, and rdi and rsi hold the pointers to the input buffers:

    // ymm4 = x[0].even * y[0].even
    vmovdqu    ymm6, [rdi]
    vpmuludq   ymm4, ymm6, [rsi]
    // ymm5 = x[0].odd * y[0].odd
    vmovshdup  ymm6, [rdi]
    vmovshdup  ymm7, [rsi]
    vpmuludq   ymm5, ymm6, ymm7

    // ymm4 += x[1].even * y[1].even
    vmovdqu    ymm6, [rdi + 32]
    vpmuludq   ymm6, ymm6, [rsi + 32]
    vpaddq     ymm4, ymm4, ymm6
    // ymm5 += x[1].odd * y[1].odd
    vmovshdup  ymm6, [rdi + 32]
    vmovshdup  ymm7, [rsi + 32]
    vpmuludq   ymm6, ymm6, ymm7
    vpaddq     ymm5, ymm5, ymm6

    // ymm4 += x[2].even * y[2].even
    vmovdqu    ymm6, [rdi + 64]
    vpmuludq   ymm6, ymm6, [rsi + 64]
    vpaddq     ymm4, ymm4, ymm6
    // ymm5 += x[2].odd * y[2].odd
    vmovshdup  ymm6, [rdi + 64]
    vmovshdup  ymm7, [rsi + 64]
    vpmuludq   ymm6, ymm6, ymm7
    vpaddq     ymm5, ymm5, ymm6

    // ymm4 += x[3].even * y[3].even
    vmovdqu    ymm6, [rdi + 96]
    vpmuludq   ymm6, ymm6, [rsi + 96]
    vpaddq     ymm4, ymm4, ymm6
    // ymm5 += x[3].odd * y[3].odd
    vmovshdup  ymm6, [rdi + 96]
    vmovshdup  ymm7, [rsi + 96]
    vpmuludq   ymm6, ymm6, ymm7
    vpaddq     ymm5, ymm5, ymm6

    // Duplicate high 32 bits of 64-bit sum into low.
    vmovshdup  ymm6, ymm4
    vmovshdup  ymm7, ymm5

    // Accumulate
    vpaddq     ymm0, ymm0, ymm4
    vpaddq     ymm1, ymm1, ymm5
    vpaddq     ymm2, ymm2, ymm6
    vpaddq     ymm3, ymm3, ymm7

On Alder Lake-P, the vector ports are still the bottleneck. We achieve a throughput of .6 vector terms per cycle. The naive method has a throughput of .21 for Mersenne-31 (for a 2.8x speedup), and a throughput of .2 for Baby Bear (3x speedup).

The AVX-512 code is analogous; I won't bother reproducing it here. Again, the vector ports are the bottleneck. We get a throughput of .4 vector terms per cycle. The naive method has a throughput of .15 for Mersenne-31 (2.6x speedup) and .12 for Baby Bear (3.2x speedup).

Misaligned data

The methods I've described assume that the input buffers are aligned to the vector size [16]. This is, of course, not always the case with user-provided data.

As noted by Eli, a sum can handle misalignment by shifting the pointer and special-casing the leftovers. This technique does not work with dot products, as the buffers may be out of alignment with one another; we should still align at least one of the pointers.

Such misalignment changes nothing for our Neon dot product procedure. The M1 has 128-byte cachelines and 16-byte vectors, so only 1 out of 16 memory accesses [17] would cross cache boundaries. If my assumptions about the M1's L1 cache are correct [18], misalignment barely increases our L1 transactions and should be unnoticeable.

Our AVX dot product code requires more thorough analysis, because of the trick where we use the memory ports to execute the vmovshdup instruction: we end up reading all data twice.

Alder Lake-P has three read ports, so the L1 cache can perform three read transactions per cycle [19]. Loads that cross the cache boundary count as two transactions [20], so if one of the buffers is misaligned, we have 25% more transactions. With 20 transactions per iteration, this is likely the new bottleneck, but due to the nondeterministic nature of caches, I'd have to confirm this with a benchmark. The remedy would be to either perform some of the vmovshdup on port 5 again, or to perform aligned reads and use valignd to shift the data. On Ice Lake and AVX-512, the situation is similar.

More concerning is this claim that misaligned memory accesses affect throughput from further down in the cache hierarchy. This is something we would definitely want to benchmark.

Cache sizes

A stupidly high compute throughput is nice and all, but eventually we do bump up against memory bandwidth. If the data is already in L2 we'll be fine, but expect a significant slowdown if we have to fetch it from L3 or main memory, even with aggressive prefetching. Realistically, if we're doing a single pass over the data, like with a standard sum or dot product, there's nothing we can do about that. Still, L2 caches tend to be quite large these days, so this won't come up for a lot of use cases.

Cache hierarchies are an excellent reason for specialized algorithms for data structures more complex than a vector. I am thinking in particular of matrices and matrix-vector and matrix-matrix multiplication. Matrix-vector multiplication can be built out of a dot product procedure, but a blockwise method has significantly better cache locality, which will result in a significant speedup for large matrices. The dot product code above still applies; it's just the loop structure that becomes more complicated. The difference is even greater with matrix-matrix multiplication, where not only do we get better cache locality, but can also take advantage of algorithms with a better cache complexity [21], such as Strassen's.

Footnotes

  1. See #238.

  2. Spoiler!

  3. unless we have a lot of values

  4. It's the last Apple microarchitecture that I have the instruction tables for.

  5. My most important x86 optimization resources are: Agner Fog's microarchitecture guide, uops.info, and Intel's optimization manual.

  6. a.k.a. Golden Cove

  7. For example, res.1 == seq[0].1 + seq[1].1 + ... + seq[len - 1].1.

  8. Unless the sequence has a length of more than 2^33, in which case we need to perform intermediate reductions to avoid overflow. This is a somewhat contrived problem, but it needs to be taken into account for correctness, since Rust does permit long sequences.

  9. Remember that x86 is little-endian.

  10. With a vpand or vpblendd or whatever.

  11. Hacker's Delight, §2-13.

  12. Hacker's Delight, §2-12. Also see this comment in my AVX2 Goldilocks code.

  13. Each of those instructions has a latency of 3 cycles, whereas the adds have a latency of 2 cycles. We want to be looking 4 iterations ahead, be it through CPU reordering or loop unrolling.

  14. Yes, AVX-512 IFMA exists, but it's limited to 52-bit limbs, so it's not the same. This limitation makes it not useful here.

  15. It's a floating-point instruction that (strangely) does not have an integer variant. This distinction doesn't actually matter anymore (very old hardware had higher latencies if you got it wrong), but Intel kept the two classes of instructions in AVX-512 (what else would they do with all that opcode space?). I think this instruction is meant for complex multiplication.

  16. 16 bytes for Neon, 32 bytes for AVX2, 64 bytes for AVX-512.

  17. 1 out of 8 reads for one out of two buffers.

  18. We don't know much about it; I'm assuming it behaves the same as recent Intel.

  19. Presumably. I've not seen any documentation on this, but it's a safe bet that it behaves the same as previous generations.

  20. In lieu of documentation, I base this on educated speculation.

  21. It still blows my mind that matrix multiplication is o(n^3).

Tests: mersenne-31 aarch64_neon test fails to compile

There seems to be a typo here that breaks this test. In the current version, Poseidon2 is instantiated as a Perm16 rather than a Perm24. This is not what is intended, as the test seems to be for a width of 24, not 16.

    // Our Poseidon2 implementation.
    let poseidon2 = Perm16::new_from_rng_128(
        Poseidon2ExternalMatrixGeneral,
        DiffusionMatrixMersenne31,
        &mut rng,
    );

This causes type conflicts with the compiler when trying to permute in the following lines:

https://github.com/Plonky3/Plonky3/blob/abce7369f2f089f7b3a4f97f527fb32b395e33b4/mersenne-31/src/aarch64_neon/poseidon2.rs#L82C1-L88C48

since permute_mut expects an input whose width matches the width of the Poseidon2 instance.

Suggested fix

Simply changing the instantiation to use a Perm24 rather than a Perm16 fixes this and seems to match the intent. I am happy to make this change if it is welcome and my understanding here is correct.

Reconcile with Starkware on the quartic extension polynomial

It appears that this repo creates M31's degree-4 extension by extending the complex extension with

y^2 - i - 2

See https://github.com/Plonky3/Plonky3/blob/main/mersenne-31/src/extension.rs#L20

However, Starkware's Stwo uses a different polynomial

y^2 - 1 - 2i

See https://github.com/starkware-libs/stwo/blob/dev/src/core/fields/qm31.rs#L13C46-L13C48

It would definitely be beneficial if Plonky3 and Stwo could reconcile on this. The difficulty is that both repositories have been working heavily on CPU acceleration (with, for example, AVX and Neon) on top of their current choices.

Potential bugs from `*_base_slice()` on different architectures

AbstractFieldExtension::{as,from}_base_slice() are not really well-defined since they expose the internal choice of representation of the extension. If we have two different machine architectures whose arithmetic for a particular field extension uses two different representations (easy to imagine with CPU vs GPU or vector vs serial), then calling u = fe.as_base_slice() on one machine and then v = FE::from_base_slice(u) on another could result in garbage in v.

IMHO, the guiding principle should be: Any operation that depends on the internal representation of elements of a field extension needs to be defined inside the field extension code. Exposing the internal representation is asking for trouble.

Thoughts?

DuplexChallenger updates the capacity of the sponge_state on observe()

It seems that the code here leads to the whole sponge_state being updated by the duplexing() function.

In detail, the code below updates the whole sponge_state, whose capacity portion should not be touched:

    if self.input_buffer.len() == WIDTH {
        self.duplexing();
    }

For reference, the relevant DuplexChallenger code:

impl<F, P, const WIDTH: usize> DuplexChallenger<F, P, WIDTH>
where
    F: Copy,
    P: CryptographicPermutation<[F; WIDTH]>,
{
    pub fn new(permutation: P) -> Self
    where
        F: Default,
    {
        Self {
            sponge_state: [F::default(); WIDTH],
            input_buffer: vec![],
            output_buffer: vec![],
            permutation,
        }
    }

    fn duplexing(&mut self) {
        assert!(self.input_buffer.len() <= WIDTH);

        // Overwrite the first r elements with the inputs.
        for (i, val) in self.input_buffer.drain(..).enumerate() {
            self.sponge_state[i] = val;
        }

        // Apply the permutation.
        self.permutation.permute_mut(&mut self.sponge_state);

        self.output_buffer.clear();
        self.output_buffer.extend(self.sponge_state);
    }
}

impl<F, P, const WIDTH: usize> FieldChallenger<F> for DuplexChallenger<F, P, WIDTH>
where
    F: PrimeField64,
    P: CryptographicPermutation<[F; WIDTH]>,
{
}

impl<F, P, const WIDTH: usize> CanObserve<F> for DuplexChallenger<F, P, WIDTH>
where
    F: Copy,
    P: CryptographicPermutation<[F; WIDTH]>,
{
    fn observe(&mut self, value: F) {
        // Any buffered output is now invalid.
        self.output_buffer.clear();

        self.input_buffer.push(value);

        if self.input_buffer.len() == WIDTH {
            self.duplexing();
        }
    }
}

PCSs should support different opening locations for each matrix

Currently we have

    fn open_multi_batches(
        &self,
        prover_data_and_points: &[(&Self::ProverData, &[EF])],
        challenger: &mut Challenger,
    ) -> (OpenedValues<EF>, Self::Proof);

Each ProverData corresponds to a batch of (potentially multiple) committed matrices, but each such batch can only have one set of opening points, &[EF].

In typical usage, the opening locations within a batch are almost uniform, like all "round 1" traces are typically opened at {zeta, g * zeta}. However, the shift g depends on trace length, and within a batch there may be traces of different lengths.

So I think we need to generalize it to something like

prover_data_and_points: &[(&Self::ProverData, &[Vec<EF>])],
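
In full, the generalized signature would become something like the following (a sketch of the proposal, not an existing API):

    fn open_multi_batches(
        &self,
        prover_data_and_points: &[(&Self::ProverData, &[Vec<EF>])],
        challenger: &mut Challenger,
    ) -> (OpenedValues<EF>, Self::Proof);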

This isn't an issue for uni-stark, where each batch contains a single matrix, but it's a blocker for the Valida prover integration (cc @maxgillett).

Use codespell to check for possible typos, excluding typical Rust identifiers (e.g. crate, ser)

Using https://github.com/codespell-project/codespell with -L to exclude crate and ser (not actually typos, e.g. serde::ser::SerializeSeq):

➜  Plonky3 git:(main) codespell -L crate,ser
./challenger/src/duplex_challenger.rs:109: valuess ==> values
./challenger/src/duplex_challenger.rs:110: valuess ==> values
./challenger/src/multi_field_challenger.rs:143: valuess ==> values
./challenger/src/multi_field_challenger.rs:144: valuess ==> values
./mersenne-31/benches/bench_field.rs:20: repitions ==> repetitions
./baby-bear/benches/bench_field.rs:20: repitions ==> repetitions

Multi batch opening with different max heights in FRI

When attempting to make an open_multi_batches call with batches that have different maximum heights, the code panics due to accessing invalid rows.

Here is an example of a test failure:
https://github.com/Plonky3/Plonky3/blob/multi-batch-fail/fri/tests/pcs.rs#L108

In the context of FRI, the problem seems to be that the shifts of degrees on the tree assume the log_max_degree is the same for all batches in https://github.com/Plonky3/Plonky3/blob/main/fri/src/two_adic_pcs.rs#L299:

        let query_openings = query_indices
            .into_iter()
            .map(|index| {
                prover_data_and_points
                    .iter()
                    .map(|(data, _)| {
                        let (opened_values, opening_proof) = self.mmcs.open_batch(index, data);
                        BatchOpening {
                            opened_values,
                            opening_proof,
                        }
                    })
                    .collect()
            })
            .collect();

That is because query_indices are computed using the maximum height across all batches as the basis, whereas the shifts in open_batch, computed by the following code:

        let openings = prover_data
            .leaves
            .iter()
            .map(|matrix| {
                let log2_height = log2_ceil_usize(matrix.height());
                let bits_reduced = log_max_height - log2_height;
                let reduced_index = index >> bits_reduced;
                matrix.row(reduced_index).collect()
            })
            .collect_vec();

are computed with respect to the maximum height of that batch only. This ends up overflowing the row index when max_height_for_batch < max_height_on_all_batches.

I have also checked an earlier commit of FRI, and had the same failure, so this behavior did not originate from recent fixes.

Is this the intended behavior of open_multi_batches? If so, we should probably add some comments in the trait as the errors were a little confusing to debug. If it's not the intended behavior, happy to explore a fix (some naive fixes I tried didn't work).

Avoid unchecked arithmetic overflow / underflow

In Rust, the default signed and unsigned n-bit integer types exhibit different behavior on under/overflow in debug vs release mode. In debug mode, the arithmetic operations panic on under/overflow. In release mode, under/overflow wraps around modulo 2^n. This inconsistent behavior in debug vs release mode is undesirable; in any instance, either under/overflow is an error condition and should be treated as such in both debug and release mode, or it is not an error condition and it should wrap around in both debug and release mode.

  1. The multiplication in PrimeField64::linear_combination_u64 for Mersenne31, on line 257 of mersenne-31/src/lib.rs.
  2. The multiplication in PrimeField64::linear_combination_u64 for Mersenne31, on line 259 of mersenne-31/src/lib.rs.
  3. In the implementation of dit_butterfly_inner in mersenne/src/radix_2_dit.rs, lines 133-148.
  4. In the default implementation of TwoAdicSubgroupDft::coset_lde_batch, on line 93 of dft/src/traits.rs.
  5. In the default implementation of SystematicCode::parity_len, on line 15 of code/src/systematic.rs.
  6. In the implementation of interpolate_coset, on line 55 of interpolation/src/lib.rs.
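
As a minimal illustration of making the intended behavior explicit in either direction (placeholder code, not from the repo):

    fn overflow_examples(a: u64, b: u64) -> (u64, u64) {
        // Wrap-around is intended: same behavior in debug and release.
        let wrapped = a.wrapping_mul(b);
        // Overflow is an error condition: checked identically in debug and release.
        let checked = a.checked_add(b).expect("overflow is a bug here");
        (wrapped, checked)
    }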

`BabyBear` Extension Fields

For BabyBear, we need an extension instance for QuinticExtension (154-bit sec) or TesseracticExtension (123-bit sec).
At these higher degrees, the Frobenius algorithm becomes noticeably faster for the inverse. If we only use this extension over the prime field (i.e., we force the base field to be the prime field and the extension field to be a generalized OEF), then we can embed the Frobenius algorithm inside these particular extension field structs.

The quintic binomial polynomial can be x^5 - 2, with generator i + 8.
The tesseractic binomial polynomial can be x^4 - 11, and the generator is also i + 8.

Refactor BabyBear and KoalaBear Code

In PR #329, we introduced KoalaBear, a field similar to BabyBear but with some numerical advantages.

Currently, the code in the KoalaBear crate is almost entirely copy-pasted from the BabyBear crate, with the only changes being updates to the constants.

Clearly this level of code duplication is undesirable, so we should either find a way to refactor the code to remove as much duplication as possible or, in the long term, only support one of these two fields.
