
pqm4's Introduction

pqm4

Collection of post-quantum cryptographic algorithms for the ARM Cortex-M4

Introduction

The pqm4 library, benchmarking and testing framework started as a result of the PQCRYPTO project funded by the European Commission in the H2020 program. It currently contains implementations of post-quantum key-encapsulation mechanisms and post-quantum signature schemes targeting the ARM Cortex-M4 family of microcontrollers. The design goals of the library are to offer

  • automated functional testing on a widely available development board;
  • automated generation of test vectors and comparison against output of a reference implementation running host-side (i.e., on the computer the development board is connected to);
  • automated benchmarking for speed, stack usage, and code-size;
  • automated profiling of cycles spent in symmetric primitives (SHA-2, SHA-3, AES);
  • integration of clean implementations from PQClean; and
  • easy integration of new schemes and implementations into the framework.

Previous NIST PQC

The master branch of pqm4 contains schemes that were either selected for standardization by NIST, are part of the 4th round of the NIST PQC standardization process, or are part of the first round of additional signatures of the NIST PQC standardization process.

Implementations for previous NIST PQC rounds are available here:

Changes in Round 2

For the second round of the NIST PQC process, pqm4 was extended (see #78) with the following features:

  • common code was moved to mupq for reuse in pqriscv,
  • much simpler build process,
  • automated profiling of cycles spent in symmetric primitives (SHA-2, SHA-3, AES),
  • reporting of code-size,
  • integration of clean implementations from PQClean.

Changes in Round 3

For the third round of the NIST PQC process, pqm4 was extended with the following features:

  • overhaul of the build process to support multiple target boards, and
  • use of the QEMU simulator to measure stack usage of larger schemes.

Changes in Round 4 / Round 1 of Additional signatures

For the fourth round of the NIST PQC process, pqm4 was extended with the following features:

  • switch to the Nucleo-L4R5ZI board as the default board for measurements, and
  • an overhaul of the console output.

Schemes included in pqm4

For most of the schemes there are multiple implementations. The naming scheme for these implementations is as follows:

  • clean: clean reference implementation from PQClean,
  • ref: the reference implementation submitted to NIST (will be replaced by clean in the long term),
  • opt: an optimized implementation in plain C (e.g., the optimized implementation submitted to NIST),
  • m4: an implementation with Cortex-M4 specific optimizations (typically in assembly),
  • m4f: an implementation with Cortex-M4F specific optimizations (typically assembly using floating-point registers).

Setup/Installation

The testing and benchmarking framework of pqm4 targets several development boards, all featuring an ARM Cortex-M4 chip:

  • nucleo-l4r5zi (default): The NUCLEO-L4R5ZI board featuring 2MB of Flash and 640KB of RAM. This board does not require a separate USB serial interface converter.
  • stm32f4discovery: The STM32F4 Discovery board featuring 1MB of Flash and 192KB of RAM. Connecting the development board to the host computer requires a mini-USB cable and a USB-TTL converter together with a 2-pin dupont / jumper cable.
  • nucleo-l476rg: The NUCLEO-L476RG board featuring 1MB of Flash and 128KB of RAM. This board does not require a separate USB serial interface converter.
  • cw308t-stm32f3: The ChipWhisperer CW308-STM32F3 target board (in the F3 configuration) featuring 256KB of Flash and 40KB of RAM.
  • mps2-an386: The ARM MPS2(+) FPGA prototyping board when used with the ARM-Cortex M4 bitstream (see ARM AN386) featuring two 4MB RAM blocks, one used in lieu of Flash one as RAM. This board can also be simulated with the QEMU 5.2 simulator (the cycle counts are, however, meaningless in this case).

Installing the ARM toolchain

The pqm4 build system assumes that you have the arm-none-eabi toolchain installed. All benchmarks are performed using this toolchain. On most Linux systems, the correct toolchain gets installed when you install the arm-none-eabi-gcc (or gcc-arm-none-eabi) package.
On some Linux distributions, you will also have to explicitly install libnewlib-arm-none-eabi .

Installing stlink

To flash binaries onto most development boards, pqm4 uses stlink. Depending on your operating system, stlink may be available in your package manager -- if not, please refer to the stlink GitHub page for instructions on how to compile it from source (in that case, be careful to use libusb-1.0.0-dev, not libusb-0.1).

Installing OpenOCD

For the nucleo-l4r5zi board OpenOCD (tested with version 0.12) is used for flashing binaries. Depending on your operating system, OpenOCD may be available in your package manager -- if not, please refer to the OpenOCD README for instructions on how to compile it from source.

Python3

The benchmarking scripts used in pqm4 require Python >= 3.8.

Installing pyserial

The host-side Python code for most platforms requires the pyserial module. Your package repository might offer python3-serial (Debian, Ubuntu) or python-pyserial (Arch) or python3-pyserial (Fedora, openSUSE) or pyserial (Slack, CentOS, Gentoo) or py3-pyserial (Alpine) directly. Alternatively, it can be installed from PyPI by calling pip3 install -r requirements.txt. If you do not have pip3 installed yet, you can typically find it as python3-pip (Debian, Ubuntu) or python-pip (Arch) using your package manager.

Installing ChipWhisperer

The host-side Python code for the cw308t-stm32f3 board requires the chipwhisperer module. If you don't target this board, you can skip the installation.

Installing QEMU >=5.2

The mps2-an386 platform is simulated with the QEMU ARM system emulator. You'll need at least version 5.2, which is fairly recent at the time of writing and may not be available on your favourite Linux distro. If you don't target this platform, you can skip the installation.

Connecting the STM32F4 Discovery board to the host

Connect the board to your host machine using the mini-USB port. This provides it with power, and allows you to flash binaries onto the board. It should show up in lsusb as STMicroelectronics ST-LINK/V2.

If you are using a UART-USB connector that has a PL2303 chip on board (which appears to be the most common), the driver should be loaded in your kernel by default. If it is not, it is typically called pl2303. On macOS, you will still need to install it (and reboot). When you plug in the device, it should show up as Prolific Technology, Inc. PL2303 Serial Port when you type lsusb.

Using dupont / jumper cables, connect the TX/TXD pin of the USB connector to the PA3 pin on the board, and connect RX/RXD to PA2. Depending on your setup, you may also want to connect the GND pins.

Downloading pqm4 and libopencm3

Finally, obtain the pqm4 library and the submodules:

git clone --recursive https://github.com/mupq/pqm4.git

Now you may pick your platform and compile the code (adapt the PLATFORM variable to your chosen platform and the number of threads in -j4 to your PC accordingly):

make -j4 PLATFORM=stm32f4discovery

API documentation

The pqm4 library uses the NIST/SUPERCOP/PQClean API. It is mandated for all included schemes.

KEMs need to define CRYPTO_SECRETKEYBYTES, CRYPTO_PUBLICKEYBYTES, CRYPTO_BYTES, and CRYPTO_CIPHERTEXTBYTES and implement

int crypto_kem_keypair(unsigned char *pk, unsigned char *sk);
int crypto_kem_enc(unsigned char *ct, unsigned char *ss, const unsigned char *pk);
int crypto_kem_dec(unsigned char *ss, const unsigned char *ct, const unsigned char *sk);
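
As a rough illustration of how a caller exercises this API, the sketch below runs one key generation, encapsulation, and decapsulation and then compares the two shared secrets, which is the basic check the test binaries perform. The toy XOR-based "KEM" and its sizes are placeholders invented here so that the example compiles on its own; it offers no security and is not part of pqm4. kem_roundtrip_ok is likewise a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder sizes for a toy scheme; a real scheme defines these in api.h. */
#define CRYPTO_SECRETKEYBYTES 16
#define CRYPTO_PUBLICKEYBYTES 16
#define CRYPTO_BYTES 16
#define CRYPTO_CIPHERTEXTBYTES 16

/* Toy, INSECURE stand-ins so that the sketch is self-contained. */
int crypto_kem_keypair(unsigned char *pk, unsigned char *sk) {
    size_t i;
    for (i = 0; i < CRYPTO_SECRETKEYBYTES; i++) {
        sk[i] = (unsigned char)(i * 7 + 1);      /* fixed "secret" bytes */
        pk[i] = sk[i];                           /* toy: pk equals sk */
    }
    return 0;
}

int crypto_kem_enc(unsigned char *ct, unsigned char *ss, const unsigned char *pk) {
    size_t i;
    for (i = 0; i < CRYPTO_BYTES; i++) {
        ss[i] = (unsigned char)(0xA5 ^ i);       /* fixed "shared secret" */
        ct[i] = (unsigned char)(ss[i] ^ pk[i]);  /* mask it with the key */
    }
    return 0;
}

int crypto_kem_dec(unsigned char *ss, const unsigned char *ct, const unsigned char *sk) {
    size_t i;
    for (i = 0; i < CRYPTO_BYTES; i++) {
        ss[i] = (unsigned char)(ct[i] ^ sk[i]);  /* unmask with the key */
    }
    return 0;
}

/* Returns 1 if both parties derive the same shared secret, 0 otherwise. */
int kem_roundtrip_ok(void) {
    unsigned char pk[CRYPTO_PUBLICKEYBYTES], sk[CRYPTO_SECRETKEYBYTES];
    unsigned char ct[CRYPTO_CIPHERTEXTBYTES];
    unsigned char ss_a[CRYPTO_BYTES], ss_b[CRYPTO_BYTES];

    if (crypto_kem_keypair(pk, sk) != 0) return 0;
    if (crypto_kem_enc(ct, ss_a, pk) != 0) return 0;
    if (crypto_kem_dec(ss_b, ct, sk) != 0) return 0;
    return memcmp(ss_a, ss_b, CRYPTO_BYTES) == 0;
}
```

In a real scheme the buffers are sized by the same four macros, so the calling code stays unchanged when one implementation is swapped for another.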

Signature schemes need to define CRYPTO_SECRETKEYBYTES, CRYPTO_PUBLICKEYBYTES, and CRYPTO_BYTES and implement

int crypto_sign_keypair(unsigned char *pk, unsigned char *sk);
int crypto_sign(unsigned char *sm, size_t *smlen, 
                const unsigned char *msg, size_t len,
                const unsigned char *sk);
int crypto_sign_open(unsigned char *m, size_t *mlen,
                     const unsigned char *sm, size_t smlen,
                     const unsigned char *pk);

Running tests and benchmarks

The build system compiles six binaries for each implementation, which can be used to test and benchmark the schemes. For example, for the m4 implementation of Kyber768 the following binaries are assembled:

  • bin/crypto_kem_kyber768_m4_test.bin tests if the scheme works as expected. For KEMs this tests if Alice and Bob derive the same shared key and for signature schemes it tests if a generated signature can be verified correctly. Several failure cases are also checked, see mupq/crypto_kem/test.c and mupq/crypto_sign/test.c for details.
  • bin/crypto_kem_kyber768_m4_speed.bin measures the runtime of crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec for KEMs and crypto_sign_keypair, crypto_sign, and crypto_sign_open for signatures. See mupq/crypto_kem/speed.c and mupq/crypto_sign/speed.c.
  • bin/crypto_kem_kyber768_m4_hashing.bin measures the cycles spent in SHA-2, SHA-3, and AES of crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec for KEMs and crypto_sign_keypair, crypto_sign, and crypto_sign_open for signatures. See mupq/crypto_kem/hashing.c and mupq/crypto_sign/hashing.c.
  • bin/crypto_kem_kyber768_m4_stack.bin measures the stack consumption of each of the procedures involved. The memory allocated outside of the procedures (e.g., public keys, private keys, ciphertexts, signatures) is not included. See mupq/crypto_kem/stack.c and mupq/crypto_sign/stack.c.
  • bin/crypto_kem_kyber768_m4_testvectors.bin uses a deterministic random number generator to generate testvectors for the implementation. These can be used to cross-check different implementations of the same scheme. See mupq/crypto_kem/testvectors.c and mupq/crypto_sign/testvectors.c.
  • bin-host/crypto_kem_kyber768_m4_testvectors uses the same deterministic random number generator to create the testvectors on your host. See mupq/crypto_kem/testvectors-host.c and mupq/crypto_sign/testvectors-host.c.
  • An elf file for each binary is generated in the elf/ folder if desired.

The elf files or binaries can be flashed to your board using an appropriate tool. For example, the stm32f4discovery platform uses st-flash, e.g., st-flash write bin/crypto_kem_kyber768_m4_test.bin 0x8000000. To receive the output, run python3 hostside/host_unidirectional.py.

If you target the mps2-an386 platform, you can also run the elf file using the QEMU ARM emulator:

qemu-system-arm -M mps2-an386 -nographic -semihosting -kernel elf/crypto_kem_kyber512_m4_test.elf

The emulator should exit automatically when the test / benchmark completes. If you run into an error, you can exit QEMU by pressing CTRL+A and then X.

The pqm4 framework automates testing and benchmarking for all schemes using Python3 scripts:

  • python3 test.py: flashes all test binaries to the boards and checks that no errors occur.
  • python3 testvectors.py: flashes all testvector binaries to the boards and writes the testvectors to testvectors/. Additionally, it executes the reference implementations on your host machine. Afterwards, it checks the testvectors of different implementations of the same scheme for consistency.
  • python3 benchmarks.py: flashes the stack and speed binaries and writes the results to benchmarks/stack/ and benchmarks/speed/. You may want to execute this several times for certain schemes for which the execution time varies significantly.

The scripts take a number of command line arguments, which you'll need to adapt:

  • --platform <platformname> or -p <platformname>: Sets the target platform (default stm32f4discovery).
  • --opt {speed,size,debug} or -o {speed,size,debug}: Sets optimization flags for compilation (default speed).
  • --lto or -l: Use link-time optimization during compilation.
  • --no-aio: Disable all-in-one (AIO) compilation.

If you change any of these values, you'll need to run make clean (the build system will remind you).

In case you don't want to include all schemes, pass a list of schemes you want to include to any of the scripts, e.g., python3 test.py kyber768 sphincs-shake256-128f-simple. In case you want to exclude certain schemes pass --exclude, e.g., python3 test.py --exclude saber.

The benchmark results (in benchmarks/) created by python3 benchmarks.py can be automatically converted to a markdown table using python3 convert_benchmarks.py md or to csv using python3 convert_benchmarks.py csv.

Benchmarks

The current benchmark results can be found in benchmarks.csv or benchmarks.md.

All cycle counts were obtained at 24MHz to avoid wait cycles due to the speed of the memory controller. For most schemes we report minimum, maximum, and average cycle counts of 100 executions. For some particularly slow schemes we reduce the number of executions; the number of executions is reported in parentheses.
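
The minimum, maximum, and average described above could be computed from raw per-run cycle counts along the following lines. This is only a sketch: cycle_stats and summarize_cycles are hypothetical names, and the conversion scripts shipped with pqm4 may compute these values differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of per-execution cycle counts. */
typedef struct {
    uint64_t min;
    uint64_t max;
    uint64_t avg;   /* integer average, rounded down */
} cycle_stats;

/* Summarize n cycle counts; n must be at least 1. */
cycle_stats summarize_cycles(const uint64_t *cycles, size_t n) {
    cycle_stats s;
    uint64_t sum = 0;
    size_t i;

    s.min = cycles[0];
    s.max = cycles[0];
    for (i = 0; i < n; i++) {
        if (cycles[i] < s.min) s.min = cycles[i];
        if (cycles[i] > s.max) s.max = cycles[i];
        sum += cycles[i];
    }
    s.avg = sum / n;
    return s;
}
```

For schemes whose runtime varies a lot between executions, a large gap between min and max is a hint that more runs are worth collecting.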

The numbers were obtained with arm-none-eabi-gcc (Arm GNU Toolchain 11.3.Rel1) 11.3.1 20220712 from Arm.

The code-size measurements only include the code that is provided by the scheme implementation, i.e., exclude common code like hashing or C standard library functions. The measurements are performed with arm-none-eabi-size. The size contributions to the .text, .data, and .bss sections are also listed separately.

Adding new schemes and implementations

The pqm4 build system is designed to make it very easy to add new schemes and implementations, if these implementations follow the NIST/SUPERCOP/PQClean API.

In case you want to contribute a reference implementation, please open a pull request to PQClean. In case you want to contribute an optimized C implementation, please open a pull request to mupq. In case you want to add an implementation optimized for the Cortex-M4, please open a pull request here.

In the following we consider the example of adding an M4-optimized implementation of NewHope-512-CPA-KEM to pqm4:

  1. Create a subdirectory for the new scheme under crypto_kem/; in the following we assume that this subdirectory is called newhope512cpa.
  2. Create a subdirectory m4 under crypto_kem/newhope512cpa/.
  3. Copy all files of the implementation into this new subdirectory crypto_kem/newhope512cpa/m4/, except for the file implementing the randombytes function (typically PQCgenKAT_kem.c).

The procedure for adding a signature scheme is the same, except that it starts with creating a new subdirectory under crypto_sign/.

Using optimized FIPS202 (Keccak, SHA3, SHAKE)

Many schemes submitted to NIST use SHA-3, SHAKE or cSHAKE for hashing. This is why pqm4 comes with highly optimized Keccak code that is accessible from all KEM and signature implementations. Functions from the FIPS202 standard are defined in mupq/common/fips202.h as follows:

void shake128_absorb(shake128ctx *state, const uint8_t *input, size_t inlen);
void shake128_squeezeblocks(uint8_t *output, size_t nblocks, shake128ctx *state);
void shake128(uint8_t *output, size_t outlen, const uint8_t *input, size_t inlen);

void shake128_inc_init(shake128incctx *state);
void shake128_inc_absorb(shake128incctx *state, const uint8_t *input, size_t inlen);
void shake128_inc_finalize(shake128incctx *state);
void shake128_inc_squeeze(uint8_t *output, size_t outlen, shake128incctx *state);

void shake256_absorb(shake256ctx *state, const uint8_t *input, size_t inlen);
void shake256_squeezeblocks(uint8_t *output, size_t nblocks, shake256ctx *state);
void shake256(uint8_t *output, size_t outlen, const uint8_t *input, size_t inlen);

void shake256_inc_init(shake256incctx *state);
void shake256_inc_absorb(shake256incctx *state, const uint8_t *input, size_t inlen);
void shake256_inc_finalize(shake256incctx *state);
void shake256_inc_squeeze(uint8_t *output, size_t outlen, shake256incctx *state);

void sha3_256_inc_init(sha3_256incctx *state);
void sha3_256_inc_absorb(sha3_256incctx *state, const uint8_t *input, size_t inlen);
void sha3_256_inc_finalize(uint8_t *output, sha3_256incctx *state);

void sha3_256(uint8_t *output, const uint8_t *input, size_t inlen);

void sha3_512_inc_init(sha3_512incctx *state);
void sha3_512_inc_absorb(sha3_512incctx *state, const uint8_t *input, size_t inlen);
void sha3_512_inc_finalize(uint8_t *output, sha3_512incctx *state);

void sha3_512(uint8_t *output, const uint8_t *input, size_t inlen);

Functions from the related publication SP 800-185 (cSHAKE) are defined in mupq/common/sp800-185.h:

void cshake128_inc_init(shake128incctx *state, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen);
void cshake128_inc_absorb(shake128incctx *state, const uint8_t *input, size_t inlen);
void cshake128_inc_finalize(shake128incctx *state);
void cshake128_inc_squeeze(uint8_t *output, size_t outlen, shake128incctx *state);

void cshake128(uint8_t *output, size_t outlen, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen, const uint8_t *input, size_t inlen);

void cshake256_inc_init(shake256incctx *state, const uint8_t *name, size_t namelen, const uint8_t *cstm, size_t cstmlen);
void cshake256_inc_absorb(shake256incctx *state, const uint8_t *input, size_t inlen);
void cshake256_inc_finalize(shake256incctx *state);
void cshake256_inc_squeeze(uint8_t *output, size_t outlen, shake256incctx *state);

void cshake256(uint8_t *output, size_t outlen, const uint8_t *name, size_t namelen, const uint8_t* cstm, size_t cstmlen, const uint8_t *input, size_t inlen);

Implementations that want to make use of these optimized routines simply include fips202.h (or sp800-185.h). The API for sha3_256 and sha3_512 follows the SUPERCOP hash API. The API for shake128 and shake256 is very similar, except that it supports variable-length output. The SHAKE functions are also accessible via the absorb-squeezeblocks functions, which offer incremental output generation (but not incremental input handling). The variants with _inc_ offer both incremental input handling and output generation.

Using optimized SHA-2

Some schemes submitted to NIST use SHA-224, SHA-256, SHA-384, or SHA-512 for hashing. We've experimented with assembly-optimized SHA-512, but found that the speed-up achievable with this compared to the C implementation from SUPERCOP is negligible when compiled using arm-none-eabi-gcc-8.3.0. For older compiler versions (e.g. 5.4.1) hand-optimized assembly implementations were significantly faster. We've therefore decided to only include a C version of the SHA-2 variants. The available functions are:

void sha224_inc_init(sha224ctx *state);
void sha224_inc_blocks(sha224ctx *state, const uint8_t *in, size_t inblocks);
void sha224_inc_finalize(uint8_t *out, sha224ctx *state, const uint8_t *in, size_t inlen);
void sha224(uint8_t *out, const uint8_t *in, size_t inlen);

void sha256_inc_init(sha256ctx *state);
void sha256_inc_blocks(sha256ctx *state, const uint8_t *in, size_t inblocks);
void sha256_inc_finalize(uint8_t *out, sha256ctx *state, const uint8_t *in, size_t inlen);
void sha256(uint8_t *out, const uint8_t *in, size_t inlen);

void sha384_inc_init(sha384ctx *state);
void sha384_inc_blocks(sha384ctx *state, const uint8_t *in, size_t inblocks);
void sha384_inc_finalize(uint8_t *out, sha384ctx *state, const uint8_t *in, size_t inlen);
void sha384(uint8_t *out, const uint8_t *in, size_t inlen);

void sha512_inc_init(sha512ctx *state);
void sha512_inc_blocks(sha512ctx *state, const uint8_t *in, size_t inblocks);
void sha512_inc_finalize(uint8_t *out, sha512ctx *state, const uint8_t *in, size_t inlen);
void sha512(uint8_t *out, const uint8_t *in, size_t inlen);

Implementations can use these by including sha2.h.

Using optimized AES

Some schemes submitted to NIST make use of AES as a subroutine. We included assembly-optimized implementations of AES-128 and AES-256 in ECB mode and in CTR mode.

Up until January 2021, pqm4 relied on the t-table implementation by Schwabe and Stoffelen published at SAC2016. On Cortex-M4 platforms with a data cache, this implementation may be vulnerable to cache attacks. Hence, pqm4 is now using the bitsliced implementation by Adomnicai and Peyrin published in TCHES2021/1.

The functions that can be used are stated in common/aes.h as follows:

void aes128_ecb_keyexp(aes128ctx *r, const unsigned char *key);
void aes128_ctr_keyexp(aes128ctx *r, const unsigned char *key);
void aes128_ecb(unsigned char *out, const unsigned char *in, size_t nblocks, const aes128ctx *ctx);
void aes128_ctr(unsigned char *out, size_t outlen, const unsigned char *iv, const aes128ctx *ctx);

void aes256_ecb_keyexp(aes256ctx *r, const unsigned char *key);
void aes256_ctr_keyexp(aes256ctx *r, const unsigned char *key);
void aes256_ecb(unsigned char *out, const unsigned char *in, size_t nblocks, const aes256ctx *ctx);
void aes256_ctr(unsigned char *out, size_t outlen, const unsigned char *iv, const aes256ctx *ctx);

Implementations can use these by including aes.h.

Some post-quantum schemes use AES with only public inputs (e.g., Kyber and FrodoKEM) and, consequently, do not need a constant-time AES implementation. As those schemes would be unfairly penalized by switching to a slower constant-time implementation, we additionally provide the t-table implementation. The functions that can be used are stated in common/aes-publicinputs.h as follows:

 void aes128_ecb_keyexp_publicinputs(aes128ctx_publicinputs *r, const unsigned char *key);
 void aes128_ctr_keyexp_publicinputs(aes128ctx_publicinputs *r, const unsigned char *key);
 void aes128_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes128ctx_publicinputs *ctx);
 void aes128_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes128ctx_publicinputs *ctx);

 void aes192_ecb_keyexp_publicinputs(aes192ctx_publicinputs *r, const unsigned char *key);
 void aes192_ctr_keyexp_publicinputs(aes192ctx_publicinputs *r, const unsigned char *key);
 void aes192_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes192ctx_publicinputs *ctx);
 void aes192_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes192ctx_publicinputs *ctx);

 void aes256_ecb_keyexp_publicinputs(aes256ctx_publicinputs *r, const unsigned char *key);
 void aes256_ctr_keyexp_publicinputs(aes256ctx_publicinputs *r, const unsigned char *key);
 void aes256_ecb_publicinputs(unsigned char *out, const unsigned char *in, size_t nblocks, const aes256ctx_publicinputs *ctx);
 void aes256_ctr_publicinputs(unsigned char *out, size_t outlen, const unsigned char *iv, const aes256ctx_publicinputs *ctx);

Bibliography

When referring to this framework in academic literature, please consider using the following bibTeX excerpt:

@misc{PQM4,
  title = {{PQM4}: Post-quantum crypto library for the {ARM} {Cortex-M4}},
  author = {Matthias J. Kannwischer and Richard Petri and Joost Rijneveld and Peter Schwabe and Ko Stoffelen},
  note = {\url{https://github.com/mupq/pqm4}}
}

Please note however, that pqm4 does not author the implementations that are included in pqm4. Most of the implementations that are included in the collection originate from original research projects. Moreover, many implementations have been swapped out over the years. When comparing or improving implementations, please consider not only pqm4, but also cite the publication corresponding to the implementation.

Sometimes it might not be entirely clear which paper to cite. Feel free to open an issue so that we can help you find it.

License

Different parts of pqm4 have different licenses. Each subdirectory containing implementations contains a LICENSE or COPYING file stating under what license that specific implementation is released. The files in common contain licensing information at the top of the file (and are currently either public domain or MIT).

All other code in this repository is dual-licensed under Apache-2.0 and under the conditions of CC0.

pqm4's People

Contributors

37eex9, alperbilgin, cryptojedi, dean3154, devillegna, dop-amin, dsprenkels, erdemalkim, fragerar, joostrijneveld, jowlo, junhaohuang, ko-, leonbotros, marco-palumbi, mjosaarinen, mkannwischer, mmoeller23, mvanbeirendonck, neuromncr, prasanna-ravi, rpls, trista5658321, vincentvbh, vmatsec


pqm4's Issues

Measure time spent in SHA2, SHA3, and randombytes

As in our recent paper, it would be nice to have some profiling information on how much time is spent in hashing and randombytes.
This should also give an intuition how the schemes sample randomness.

We implemented this by adding/subtracting the current cycle count to a global variable when entering/leaving the respective function. This should be done only for profiling, but not in the benchmarks since it can have a huge performance impact. It would be nice to port that to PQM4.

In case we want to use systick for this, then we will run into the problem that if the overflow occurs within one of the profiled functions the cycle counts are useless. In our paper experiments, we just detected this case and ran it several times until we had no overflow there - this won't work for SPHINCS.
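
The add/subtract bookkeeping described above can be sketched as follows. All names are hypothetical, and read_cycle_counter is simulated by a plain variable so that the example runs on a host; on the board it would read a hardware cycle counter instead. The sketch uses a 64-bit count and therefore sidesteps the overflow problem rather than solving it.

```c
#include <stdint.h>

/* Simulated cycle counter; on a real board this would be a hardware counter. */
static uint64_t sim_cycles = 0;
static uint64_t read_cycle_counter(void) { return sim_cycles; }

/* Accumulated cycles spent inside profiled hash calls. */
static uint64_t hash_cycles = 0;

/* Stand-in for a hash function that takes `cost` cycles to run. */
static void profiled_hash(uint64_t cost) {
    /* Entering: subtract the current count (unsigned wrap-around is fine,
     * because the matching addition below undoes it). */
    hash_cycles -= read_cycle_counter();
    sim_cycles += cost;                    /* the "work" of the hash */
    hash_cycles += read_cycle_counter();   /* leaving: add the current count */
}

/* Stand-in for a scheme operation: 50 cycles of other work, two hash calls. */
uint64_t run_profiled_operation(void) {
    hash_cycles = 0;
    sim_cycles += 20;
    profiled_hash(300);
    sim_cycles += 30;
    profiled_hash(200);
    return hash_cycles;   /* only the cycles spent in profiled_hash */
}
```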

AES requires word-aligned parameters

Consider the following MWE:

#include <stdio.h>
#include "aes.h"
#include "api.h"
#include "hal.h"

int crypto_kem_keypair(unsigned char *pk, unsigned char *sk)
{
    (void)pk;
    (void)sk;

    uint8_t key[20] = {0};
    aes128ctx ctx;
    char buf[128];

    sprintf(buf, "key: %p, key+1: %p, key+2: %p", key, key+1, key+2);
    hal_send_str(buf);

    aes128_keyexp(&ctx, key);
    hal_send_str("This works fine.");

    aes128_keyexp(&ctx, key+1);
    hal_send_str("But this is not printed anymore.");

    aes128_keyexp(&ctx, key+2);
    hal_send_str("This would also not have been okay.");


    return 0;
}

int crypto_kem_enc(unsigned char *ct, unsigned char *ss, const unsigned char *pk)
{
    (void)ct;
    (void)ss;
    (void)pk;
    return 0;
}

int crypto_kem_dec(unsigned char *ss, const unsigned char *ct, const unsigned char *sk)
{
    (void)ss;
    (void)ct;
    (void)sk;
    return 0;
}

Calling the M4-assembly AES key expansion with a key that is not word-aligned results in a fault. This is because it uses the ldm and stm instructions, which require the pointer to be word-aligned.

According to StackOverflow the only proper solution is to copy the value to a local variable that will always be word-aligned.

We should consider changing the assembly implementation such that it can handle unaligned pointers or hide this copying in mupq/common/aes.c.
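
The copy-to-aligned-buffer workaround could look roughly like this. keyexp_requires_aligned is a hypothetical stand-in for the assembly key expansion (it only checks the alignment of its argument and reports it), and keyexp_any_alignment is an invented wrapper name; neither is part of pqm4.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for an assembly routine that needs a word-aligned key.
 * Returns 1 if the pointer is 4-byte aligned, 0 otherwise. */
static int keyexp_requires_aligned(const uint32_t *key_words) {
    return ((uintptr_t)key_words % 4) == 0;
}

/* Wrapper that accepts a key at any alignment by copying it into a
 * uint32_t array, which the compiler aligns for word access. */
int keyexp_any_alignment(const unsigned char *key) {
    uint32_t aligned_key[4];                        /* 16 bytes = AES-128 key */
    memcpy(aligned_key, key, sizeof aligned_key);   /* byte-wise copy, safe for any source alignment */
    return keyexp_requires_aligned(aligned_key);
}
```

The extra 16-byte copy is cheap next to the key expansion itself, and callers such as the key+1 and key+2 cases in the MWE no longer need to care about alignment.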

This behavior was observed when dealing with NTRU Prime. (More specifically: ntrulpr761 and ntrulpr857).

Some implementations cannot be compiled using c89 standard

I made a (now closed) issue (#6) on this earlier, but the newly added qTesla128 also has some issues. The following two schemes are now affected:

  • qTesla128
  • Frodo640

Most errors amount to for-loops with internal variable declarations, but one issue in qTesla128 shows the following:

fixed_llrint.c: In function ‘llrint_fixed’:
fixed_llrint.c:77:24: error: ‘__int32_t’ undeclared (first use in this function)
           if ((i0 & ~((__int32_t)1 << 31)) == 0)

Which is an unfamiliar issue to me. I replaced it with int32_t and that seems to hold up. I verified both the adjusted qTesla128 and Frodo640 code using test.py.
I have, however, no permission to push my changes to the branch I created locally. What would be the best way to provide my code? (My aim was to push it to a branch and create a pull request)

Proof of successful tests:
image
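
For the loop-declaration errors, the usual C89-compatible rewrite moves the counter declaration to the top of the enclosing block. A minimal sketch with an invented function name, not taken from either scheme:

```c
#include <stddef.h>

/* C99 style would be: for (size_t i = 0; i < n; i++) ...
 * C89 requires declarations at the start of a block instead. */
long sum_bytes_c89(const unsigned char *buf, size_t n) {
    size_t i;          /* declared before any statement */
    long total = 0;

    for (i = 0; i < n; i++) {
        total += buf[i];
    }
    return total;
}
```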

Stack usage results not useful for some schemes

Schemes like Kindi use malloc/calloc extensively.
While this still works on the M4, the stack usage benchmarks no longer represent the entire memory consumption of the implementation.
The cleanest solution would be to refactor those implementations not to use malloc etc.
Any other suggestions?

.m4ignore does not work as indicated in readme

If the implementation you're adding is such a host-side-only reference implementation, place a file called .m4ignore in the subdirectory containing the implementation. In that case the Makefile is not required to contain the libpqm4 target.

I am currently adjusting a project using __uint128_t in the reference version, and the functionality mentioned would be beneficial. The compiler gets angry when compiling the ref for the m4. However, the make script seems to still want the libpqm4.a target.

Add new candidate Round5 ( = Round2 + Hila5 )

Round5 should meet your inclusion requirements -- having a version at Category 3 and being in the NIST competition (NIST has approved the merger of Round2 and Hila5 teams).

Code at https://github.com/round5/r5nd_tiny should be easy enough to incorporate to your testing framework. I actually used your implementations in benchmarking (thanks!) so APIs are equivalent. I simply had a different vendor board (with different flashing and serial comms mechanisms) so can't integrate it directly.

Please see the paper "Shorter Messages and Faster Post-Quantum Encryption with Round5 on Cortex M" at https://round5.org/doc/r5m4text.pdf for more information. It should pop up in IACR ePrint shortly.

Cheers,

  • markku

Python version requirement problem

The current ReadMe file states “The benchmarking scripts used in pqm4 require Python >= 3.6.”
However, the mupq.py file has the line
subprocess.check_output(args, *, stdin=None, stderr=None, shell=False, cwd=None, encoding=None, errors=None, universal_newlines=None, timeout=None, text=None)
The “text” option was only added in Python 3.7 (“New in version 3.7: text was added as a more readable alias for universal_newlines.”).
A solution is either requiring Python >= 3.7 or using the universal_newlines option instead.

Namespacing

Some people showed interest in bundling more than one scheme from pqm4 together which means that we would need to namespace everything.

Since we have this in https://github.com/PQClean/PQClean/ anyway it should be straightforward to just keep it.

Any strong opinions?

Can't connect to serial interface

When I type the "make" command, some errors are shown as follows:
Makefile:13: recipe for target 'frodo640.o' failed
make[1]: *** [frodo640.o] Error 1
make[1]: Leaving directory '/home/abc/Documents/pqm4/crypto_kem/frodo640-cshake/opt'
Makefile:84: recipe for target 'crypto_kem/frodo640-cshake/opt' failed
make: *** [crypto_kem/frodo640-cshake/opt] Error 2
What should I do to solve them?

When flashing fails, test.py still tries to test.

When running sudo python3 test.py the following messages repeat.

Flashing bin/crypto_kem_kindi256342_ref_test.bin..
Flashed, now running tests...
timed out while waiting for the markers

Running st-flash write bin/crypto_kem_newhope1024cca_ref_test.bin 0x8000000 shows that:

st-flash: error while loading shared libraries: libstlink.so.1: cannot open shared object file: No such file or directory

In my case this could be fixed by running sudo ldconfig, after which both the manual flash and test.py ran as expected.

The problem here is that test.py should not proceed to the tests when flashing fails: nothing was flashed to the board, so no output will ever be generated.
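
One way to get the behaviour described above is to check the exit status of the flashing command before waiting for any output. The sketch below assumes that st-flash exits with a non-zero status when it fails (a missing shared library makes the dynamic loader abort the process, which also yields a non-zero status). The function name flash is hypothetical and not taken from test.py.

```python
import subprocess
import sys


def flash(cmd):
    """Run a flashing command; return True only if it exited with status 0."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # The tool could not be started at all (e.g. it is not installed).
        print("could not run {}: {}".format(cmd[0], exc), file=sys.stderr)
        return False
    if result.returncode != 0:
        # Show the tool's own error message instead of silently continuing.
        print(result.stderr.decode(errors="replace"), file=sys.stderr)
        return False
    return True
```

A caller would then skip the test when flashing fails, for example: if not flash(["st-flash", "write", binary, "0x8000000"]): continue.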

testvectors.py seems to time out but does not recover like test.py

First, to confirm that flashing works and to get an idea of the speed at which a test runs, I ran sudo python3 test.py kyber768
which outputs:

Flashing bin/crypto_kem_kyber768_ref_test.bin..
Flashed, now running tests...
  .. found output marker..
Testing if tests were successful..
passed.
Flashing bin/crypto_kem_kyber768_m4_test.bin..
Flashed, now running tests...
timed out while waiting for the markers
Flashing bin/crypto_kem_kyber768_m4_test.bin..
Flashed, now running tests...
  .. found output marker..
Testing if tests were successful..
passed.

Having an idea about the speed, I decided to run sudo python3 testvectors.py kyber768
The first time it ran for over 10 minutes and then I decided it had probably hung and interrupted it:

This script flashes the test vector binaries onto the board, and then
 writes the resulting output to the testvectors directory. It then
 checks if these are internally consistent and match the test vectors
 when running the reference code locally.
Flashing bin/crypto_kem_kyber768_m4_testvectors.bin..
Flashed, now computing test vectors..
^CTraceback (most recent call last):
  File "testvectors.py", line 54, in <module>
    x = dev.read()
  File "/home/nevernown/.local/lib/python3.5/site-packages/serial/serialposix.py", line 483, in read
    ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
KeyboardInterrupt

Sure enough, the second run got results almost immediately, which strengthens my suspicion that the first run had indeed hung:

This script flashes the test vector binaries onto the board, and then
 writes the resulting output to the testvectors directory. It then
 checks if these are internally consistent and match the test vectors
 when running the reference code locally.
Flashing bin/crypto_kem_kyber768_m4_testvectors.bin..
Flashed, now computing test vectors..
  .. found output marker..
  .. wrote test vectors!
Flashing bin/crypto_kem_kyber768_ref_testvectors.bin..
Flashed, now computing test vectors..
  .. found output marker..

But this run stops after finding the second output marker (It's been stuck at the above output for about 6 minutes now).

My question is twofold: 1. Did it really hang? 2. How can I have more verbose output on what is going on?
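
The traceback above shows the script blocked inside dev.read(), which would be consistent with a serial port opened without a read timeout. As a rough sketch of how a read loop could both give up after a deadline and report progress, the function below reads one byte at a time from a callable. The name read_until_marker, the marker byte b"#", and the 10-second default are assumptions made for illustration; they are not taken from testvectors.py.

```python
import time


def read_until_marker(read_byte, marker=b"#", timeout_s=10.0, log=print):
    """Collect bytes from read_byte() until marker is seen or the deadline passes.

    read_byte must return one byte, or b"" when nothing arrived in time
    (pyserial's Serial.read() behaves this way when opened with a timeout).
    Returns the bytes collected before the marker, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    collected = bytearray()
    while time.monotonic() < deadline:
        chunk = read_byte()
        if not chunk:
            # Verbose progress line, so a hang is distinguishable from slow output.
            log("  .. still waiting, {} bytes so far".format(len(collected)))
            continue
        if chunk == marker:
            return bytes(collected)
        collected += chunk
    log("timed out while waiting for the marker")
    return None
```

With pyserial this could be used as read_until_marker(dev.read), where dev is opened with a read timeout such as serial.Serial("/dev/ttyUSB0", 115200, timeout=1); the port name and baud rate here are placeholders.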

Possible minor bug in verify function of NewHopeKEM

The verify function in verify.c in the NewHope implementation compares two byte arrays for equality in constant time. It should return r = 0 if the arrays are equal and r = 1 otherwise. On executing a decapsulation with a faulty ciphertext, it should therefore return r = 1, but I think it is returning r = 0xFF. As a result, the cmov function executed afterwards during decapsulation in kem.c does not correctly swap the byte arrays x and r. I think it can be fixed by replacing line 25 in verify.c, i.e.,

r = (-(int64_t)r) >> 63;

with

r = (-r) >> 63; or r = ((uint64_t)(-(int64_t)r)) >> 63;

I believe the most recent implementation of NewHope no longer calls this verify function, although it is still defined in verify.c. Previous implementations of NewHope in the pqm4 library did use it.
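
The effect can be reproduced with a small Python model of the 64-bit arithmetic. This is only a sketch under two assumptions: that the C compiler implements right shift of a negative signed value as an arithmetic shift, and that cmov has the usual form b = -b; r[i] ^= b & (x[i] ^ r[i]) with b an unsigned char. The function names below are hypothetical.

```python
MASK64 = (1 << 64) - 1


def _diff(a, b):
    """OR together the XOR of all byte pairs; 0 means the arrays are equal."""
    r = 0
    for x, y in zip(a, b):
        r |= x ^ y
    return r


def verify_buggy(a, b):
    """Model of r = (-(int64_t)r) >> 63, then conversion to unsigned char."""
    r = (-_diff(a, b)) >> 63  # Python's >> on a negative int is an arithmetic shift
    return r & 0xFF


def verify_fixed(a, b):
    """Model of r = (-r) >> 63 on uint64_t, which is a logical shift."""
    return ((-_diff(a, b)) & MASK64) >> 63


def cmov(r, x, b):
    """Model of cmov: copy x over r if b is 1, using the byte mask -b."""
    mask = (-b) & 0xFF
    return bytes(ri ^ (mask & (xi ^ ri)) for ri, xi in zip(r, x))


if __name__ == "__main__":
    good, bad = bytes([1, 2, 3]), bytes([1, 2, 4])
    print(hex(verify_buggy(good, bad)), verify_fixed(good, bad))
    print(cmov(bytes(2), bytes([0xAB, 0xCD]), 0xFF).hex())
```

In this model a flag of 0xFF turns into the mask 0x01 inside cmov, so only the lowest bit of each byte is copied, while a flag of 1 yields the mask 0xFF and a full copy.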

Round5

I've noticed that the CCAKEM variant of Round5 that PQM4 implemented is probably not compatible with the CCA KEM contained in the Round5 specification (or the one in r5embed).

Apparently PQM4 enabled Fujisaki-Okamoto for the "ephemeral KEM" parameter sets rather than using the actual CCA parameter sets (i.e. "PKE" parameter sets). I recall suggesting stripping away the DEM from the PKE parameter sets -- as is done in r5embed -- but this suggestion was apparently not followed. At the code level, the correct CCA parameters are selected by using R5ND_3PKE_5d instead of R5ND_3KEM_5d etc. In the Round5 specification, "PKE" is essentially a synonym for CCA.

The main difference between CPA and CCA parameter sets at a given security level is in failure probability; it is now higher than it should be for CCA. The CPA KEM parameters have slightly shorter message and key lengths than the actual CCA KEM variants.

I'm deeply sorry not to have checked earlier whether the variant used in PQM4 was sensible; I was given the opportunity to check the pull request. However, we did ask PQM4 to adopt at least some variants of Round5 as they were proposed to NIST.

Work has been (slowly) progressing on the spec and we will be rolling out an official CCA KEM soonish. Preliminary plans are also to drop the _0d ring variants -- and there are some other modifications as well. So I'm fine if you choose to wait until the CCA KEM is officially proposed to NIST, as none of this was ever compliant with the proposal anyway. Of course the SNEIK stuff is currently just a placeholder in case NIST decides to standardize a lightweight XOF after all (the SNEIK algorithm itself is out of the competition).

Makefile does not account for older toolchains

Edits; Markdown fix.

I am trying to get the standard benchmarks to work by following the README, and I am currently at the "Running tests and benchmarks" part.
On Linux Mint 17.3 (Rosa) with GCC/G++ 5.5, calling make produces the following output:

make -C crypto_kem/frodo640-cshake/opt libpqm4.a
make[1]: Entering directory '/home/nevy/Research/pqm4/crypto_kem/frodo640-cshake/opt'
arm-none-eabi-gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -c -o frodo640.o frodo640.c
In file included from frodo640.c:30:0:
kem.c: In function 'crypto_kem_dec':
kem.c:129:5: error: 'for' loop initial declarations are only allowed in C99 mode
     for (int i = 0; i < PARAMS_N*PARAMS_NBAR; i++) BBp[i] = BBp[i] & ((1 << PARAMS_LOGQ)-1);
     ^
kem.c:129:5: note: use option -std=c99 or -std=gnu99 to compile your code
In file included from frodo640.c:32:0:
frodo_macrify.c: In function 'frodo_add':
frodo_macrify.c:222:5: error: 'for' loop initial declarations are only allowed in C99 mode
     for (int i = 0; i < (PARAMS_NBAR*PARAMS_NBAR); i++) {
     ^
frodo_macrify.c: In function 'frodo_sub':
frodo_macrify.c:233:5: error: 'for' loop initial declarations are only allowed in C99 mode
     for (int i = 0; i < (PARAMS_NBAR*PARAMS_NBAR); i++) {
     ^
make[1]: *** [frodo640.o] Error 1
make[1]: Leaving directory '/home/nevy/Research/pqm4/crypto_kem/frodo640-cshake/opt'
make: *** [crypto_kem/frodo640-cshake/opt] Error 2

Which is supposedly remedied by:
make CFLAGS="-std=gnu99"
Which results in:

...
newhope_asm.S:734: Error: attempt to use an ARM instruction on a Thumb-only processor -- 'pop {r0-r12,pc}'
make[1]: *** [newhope_asm.o] Error 1
rm kem.o cpapke.o
make[1]: Leaving directory '/home/nevy/Research/pqm4/crypto_kem/newhope1024cca/m4'
make: *** [crypto_kem/newhope1024cca/m4] Error 2

On the first run.

Then we run make again, and then make CFLAGS="-std=gnu99" again to get

...
make[1]: Leaving directory '/home/nevy/Research/pqm4/crypto_sign/sphincs-shake256-128s/ref'
make -C crypto_sign/sphincs-shake256-128s/ref libpqhost.a
make[1]: Entering directory '/home/nevy/Research/pqm4/crypto_sign/sphincs-shake256-128s/ref'
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o hash_shake256_host.o hash_shake256.c
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o hash_address_host.o hash_address.c
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o wots_host.o wots.c
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o utils_host.o utils.c
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o fors_host.o fors.c
gcc -I/home/nevy/Research/pqm4/common -Wall -Wextra -O3 -c -o sign_host.o sign.c
gcc-ar rcs libpqhost.a hash_shake256_host.o hash_address_host.o wots_host.o utils_host.o fors_host.o sign_host.o
make[1]: Leaving directory '/home/nevy/Research/pqm4/crypto_sign/sphincs-shake256-128s/ref'
mkdir -p obj 
arm-none-eabi-gcc -std=gnu99 -o obj/stm32f4_wrapper.o -c common/stm32f4_wrapper.c
In file included from common/stm32f4_wrapper.c:1:0:
common/stm32wrapper.h:4:34: fatal error: libopencm3/stm32/rcc.h: No such file or directory
 #include <libopencm3/stm32/rcc.h>
                                  ^
compilation terminated.
make: *** [obj/stm32f4_wrapper.o] Error 1

After which a regular make results in:

...
arm-none-eabi-gcc -O3 -Wall -Wextra -Wimplicit-function-declaration -Wredundant-decls -Wmissing-prototypes -Wstrict-prototypes -Wundef -Wshadow -I./libopencm3/include -fno-common -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -std=gnu99 -MD -DSTM32F4 -std=gnu99 -o obj/stm32f4_wrapper.o -c common/stm32f4_wrapper.c
In file included from ./libopencm3/include/libopencm3/cm3/nvic.h:133:0,
                 from common/stm32wrapper.h:7,
                 from common/stm32f4_wrapper.c:1:
./libopencm3/include/libopencm3/dispatch/nvic.h:14:39: fatal error: libopencm3/stm32/f4/nvic.h: No such file or directory
 # include <libopencm3/stm32/f4/nvic.h>
                                       ^
compilation terminated.
make: *** [obj/stm32f4_wrapper.o] Error 1

At this point, alternating between the two methods no longer changes anything.

Calling make clean will revert the situation to the way it was at the beginning of this message.

Note that I originally updated from GCC 4.8 (which showed similar behaviour) to GCC 5.5 because I read that gnu11 is the new default for GCC 5.5, which would make messing with CFLAGS unnecessary. Sadly, GCC 5.5 did not deliver what I thought it would.

It looks like some libraries need the "-std=gnu99" flag, but either something else gets misconfigured because of it, or it was misconfigured from the start. Can someone help me debug this? I am not an expert when it comes to Makefiles, so it may well be that I just forgot something.

[refactor] Hidden fields in benchmarkclock struct

I'm not sure if this is intended, but this struct hides some fields:

/* 24 MHz */
const struct rcc_clock_scale benchmarkclock = {
    .pllm = 8,   // VCOin = HSE / PLLM = 1 MHz
    .plln = 192, // VCOout = VCOin * PLLN = 192 MHz
    .pllp = 8,   // PLLCLK = VCOout / PLLP = 24 MHz (low to have 0WS)
    .pllq = 4,   // PLL48CLK = VCOout / PLLQ = 48 MHz (required for USB, RNG)
    .hpre = RCC_CFGR_HPRE_DIV_NONE,
    .ppre1 = RCC_CFGR_PPRE_DIV_2,
    .ppre2 = RCC_CFGR_PPRE_DIV_NONE,
    .flash_config = FLASH_ACR_DCEN | FLASH_ACR_ICEN | FLASH_ACR_LATENCY_0WS,
    .apb1_frequency = 12000000,
    .apb2_frequency = 24000000,
};

I found this because I copy-pasted it into my Rust code and got the following error:

error[E0063]: missing fields `ahb_frequency`, `pllr`, `power_save` in initializer of `libopencm3_sys::rcc_clock_scale`

Feel free to close this immediately as a "wontfix" if you don't care about these things.
