dstebila / rlwekex
DEPRECATED – See NOTICE below about migration to new repository – Post-quantum key exchange from the ring learning with errors problem
License: The Unlicense
http://files.douglas.stebila.ca/files/research/papers/BCNS14.pdf, section 4.2, mentions the generation of Vj at random, i.e. in the range [0, 2^192-1]. The first entry of the lookup table T is T[0] = 2^189, and the last entry T[51] should be 2^192. The unique integer indj should be such that T[indj] <= Vj < T[indj+1]. The paper is unclear about the case where the randomly computed Vj is less than T[0]. Should such randomness be thrown out? Or should Vj be computed as (random + T[0]) and truncated to 192 bits?
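As a scaled-down illustration of the question (a hypothetical 4-entry analogue of the 192-bit table; the names `T` and `lookup_ind` are mine, not from the paper or the repository):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scaled-down analogue of the paper's table T: entries
 * strictly increasing, T[0] > 0, last entry acting as the upper bound. */
static const uint32_t T[] = { 32u, 64u, 128u, 256u };

/* Return ind such that T[ind] <= v < T[ind+1], or -1 for the cases the
 * paper leaves unspecified: v < T[0], and v >= the last entry. */
static int lookup_ind(uint32_t v) {
    if (v < T[0])
        return -1; /* randomness below T[0]: unspecified in section 4.2 */
    for (size_t i = 0; i + 1 < sizeof(T) / sizeof(T[0]); i++)
        if (T[i] <= v && v < T[i + 1])
            return (int)i;
    return -1; /* v >= last entry: also unspecified */
}
```

Any v below T[0] falls through to the unspecified branch, which is exactly the gap the question points at.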
Since we multiply our randomness in the FFT, do we really need to swizzle it at the start of the NTT and at the end of the inverse NTT? Why don't we just pretend that our encoding of the input polynomials (which we get directly from randomness) already has exactly the coefficient order suitable for an NTT? The same applies to the output.
Compatibility of the polynomial representation would only be required if we were using an algebraic tool such as Magma, Sage, or MATLAB inside the key exchange to verify the polynomial correctness of the operations.
E.g. Z[i] as expected in a textbook can be found in the output of sample_ct() at position i/32 + (i % 32)*32, or in the public key at the same offset.
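To make the claimed offset concrete, here is a tiny sketch (the helper name `swizzle` is mine): the mapping i -> i/32 + (i % 32)*32 is a 32x32 transpose of the 1024 coefficient slots, and a transpose is its own inverse, so the same formula recovers the textbook order.

```c
/* Map a textbook coefficient index i (0 <= i < 1024) to the offset at
 * which it is stored: a 32x32 transpose, which is an involution. */
static unsigned swizzle(unsigned i) {
    return i / 32 + (i % 32) * 32;
}
```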
Dead-code elimination shows that more than 90% of the input randomness has no impact on the sample() algorithm. The reason is the 192-bit number comparison, where approximately 96% of the decisions are taken when the first byte is compared; the remaining 23 bytes have no impact on the algorithm. A non-constant-time version could be more than 20 times faster than the current version, if less entropy were fetched from the randomness source.
Equivalently, many bytes, bits and words of the entropy can be modified without having an impact on the output of the sample() function.
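A minimal sketch of the mechanism (my own illustration, not code from the repository): an early-exit, most-significant-byte-first comparison returns as soon as a byte differs, so in most invocations the trailing bytes are never inspected and their entropy is wasted.

```c
#include <stdint.h>

/* Non-constant-time comparison of two 192-bit numbers stored as 24
 * big-endian bytes.  It returns at the first differing byte, so when
 * the leading bytes already differ, the remaining bytes never affect
 * the result -- which is why most of the input entropy has no impact. */
static int cmp192(const uint8_t a[24], const uint8_t b[24]) {
    for (int i = 0; i < 24; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}
```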
Regardless of the algorithmic speed, is the security level as high as expected? The effective entropy of the sample() function is only 2095 bytes. The effective entropy of rlwe_kex_generate_keypair() is about 4090 bytes.
Proposition 2, section 4.3, mentions that there is a failure probability of 2^-13072. How does it manifest? The final secret computed by Bob and Alice is not the same? Is there a way to detect it earlier? Has this probability been verified?
Does it depend on the randomness generation? Apparently, I get many "keys don't match" failures with a generator biased towards zeroes. Although that is not an ideal condition, any output is valid. Does rlwe require a minimum and maximum Hamming weight in the randomness that must be enforced?
Hi,
I see from the commit you made after merging my work that the /dev/urandom backend is quite slow.
I don't understand the overall structure of your code well enough yet to see how much of a saving can be made, but if we know how much randomness is required ahead of time, we can simply buffer it to avoid the syscall overhead which is presumably what is making performance suck (because the current functions don't return very much randomness at a time).
I'm also thinking about what other backends might be desirable. I've completely lost trust in the ability of openssl to get anything right, but there must be other options for secure userspace random generation. Any recommendations?
There are multiple issues in the existing code (downloaded 26 Jul 2015 from the zip file).
The following code highlights a bug
uint64_t rnd[3];
rnd[0] = 0xffffffffffffffffull;
rnd[1] = 0xffffffffffffffffull;
rnd[2] = 0xffffffffffffffffull;
i = single_sample_ct(rnd);
j = single_sample(rnd);
printf("%u, %u\n", (unsigned)i, (unsigned)j);
assert(i == j);
Same for
rnd[0] = 0xfffffffffffffffeull;
rnd[1] = 0xffffffffffffffffull;
rnd[2] = 0xffffffffffffffffull;
Maybe the constant-time search should better define what the expected output is when the parameter is equal to the table's last entry. Should it be 51, 52, or 53?
Maybe the "non-constant time" binary search should search between 0 and sizeof(rlwe_table)/sizeof(rlwe_table[0]), not between 0 and (hardcoded) 64.
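A sketch of a binary search that derives its bound from the table itself rather than hardcoding 64 (the array contents and the name `find_ind` are mine, for illustration only):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for rlwe_table: strictly increasing entries. */
static const uint64_t tbl[] = { 10u, 20u, 30u, 40u };

/* Binary search for ind with tbl[ind] <= v < tbl[ind+1]; the bound is
 * sizeof(tbl)/sizeof(tbl[0]) instead of a hardcoded constant. */
static size_t find_ind(uint64_t v) {
    size_t lo = 0, hi = sizeof(tbl) / sizeof(tbl[0]);
    while (lo + 1 < hi) {
        size_t mid = (lo + hi) / 2;
        if (tbl[mid] <= v)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

Note that for v below tbl[0] this still returns 0, the same unspecified edge case raised above for the sampling table.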
Why isn't the binary search a constant-time one? Do you think scanning the whole table is constant time? It is a "constant number of instructions", which is different from "constant time". (That is more of an endless philosophical issue; one needs to consider memory, page and cache alignment and plenty of other hardware features, together with register spills onto the stack, which depend on compiler optimizations.) The binary search you wrote is "constant number of instructions".
cmplt_ct seems to mean compare_less_than(). Considering the order of the entries in rlwe_table, it is extremely unclear to me why the comparison of 192-bit numbers starts with what is apparently the least significant limb. Comparison of multiprecision numbers should start from the most significant limb.
rdtsc() code in rlwe_benchmark does not work on 64-bit Intel platforms. Correct code for benchmarking purposes should look like (rdtscp also writes TSC_AUX into ECX, hence the clobber):
unsigned long long lo, hi;
asm volatile( "rdtscp" : "=a" (lo), "=d" (hi) : : "rcx" );
return( lo | (hi << 32) );
The output of gettimeofday() is not properly handled when the seconds field changes between two readings (microseconds wraparound).
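For reference, a delta computation that stays correct across a seconds boundary (a sketch; the name `elapsed_us` is mine):

```c
#include <sys/time.h>
#include <stdint.h>

/* Elapsed microseconds between two gettimeofday() readings.  Handling
 * the seconds and microseconds fields separately keeps the result
 * correct even when tv_usec of the later reading is smaller than that
 * of the earlier one. */
static int64_t elapsed_us(struct timeval t0, struct timeval t1) {
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
         + (int64_t)(t1.tv_usec - t0.tv_usec);
}
```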
The gcc compiler (I use 4.9) has many difficulties optimizing through your cmplt_ct() code. I guess you can get a significant boost with this code:
static int cmplt_ct(uint64_t *a, uint64_t *b) {
    uint64_t ai = a[2];
    uint64_t bi = b[2];
    int m = (ai < bi);
    int r = (ai == bi);
    ai = a[1];
    bi = b[1];
    m = m || (r && (ai < bi));
    r &= (ai == bi);
    ai = a[0];
    bi = b[0];
    m = m || (r && (ai < bi));
    return m;
}
Execution time will be improved by using the rdrand instruction to fetch entropy (rdseed is compliant with NIST SP800-90B and NIST SP800-90C).
This has a serious impact on execution time (measured on a Xeon(R) CPU E5-2680 v3, turbo enabled):
sample_ct                   3000  236821  774747
sample                      3000    2059    6736
FFT_mul                     3000  107019  350109
FFT_add                     3000     665    2177
crossround2_ct              3000    8917   29171
crossround2                 3000    4954   16208
round2_ct                   3000    2185    7149
round2                      3000    1350    4418
rec_ct                      3000    5015   16406
rec                         3000    1641    5369
rlwe_kex_generate_keypair   3000  110267  360733  (no ct)
rlwe_kex_compute_key_bob    3000  114900  375890  (no ct)
rlwe_kex_compute_key_alice  3000  107631  352110  (no ct)
Simple code to use rdrand (and similarly rdseed) looks like ("cc" is the correct clobber name for the flags in gcc, and volatile keeps the compiler from caching the result):
static inline uint64_t random_rd(void)
{
    uint64_t rd;
    __asm__ volatile ("1: rdrand %0\n\tjnc 1b\n" : "=r" (rd) : : "cc");
    return rd;
}
What is the expected output of dbl(x) when x == 0 or x == 0xffffffff? The documentation mentions 2*x - e, with e being (-1, 0, 0, 1) chosen at random.
dbl(0,0) returns 0
dbl(0,1) returns 1
dbl(0,2) returns 0xffffffffffffffff <-- ???
dbl(0,3) returns 0
What is the intended behaviour?
Edited (Pierre), using closed interval notation as in http://mathworld.wolfram.com/ClosedInterval.html
For x in the range [0, q-1], the function y = dbl(x) creates the following output y:
the value -1 with probability 1/(4*q) (the 0xfffff.... above)
2*q-1 values in the range [0, 2*(q-1)], each with an equal probability 1/(2*q)
the value 2*q-1 with probability 1/(4*q)
The purpose of crossround2() is to select enough elements with a total probability of exactly 1/2.
The range [(q+1)/2, q], plus the range [3*(q-1)/2 + 1, 2*q-2] and the value 2*q-1, have an accumulated probability of 1/2 + 1/(4*q).
With q being 0xffffffff, that is a small bias selected by the test
b = (ct_ge_u64(dd, 2147483648ULL) & ct_le_u64(dd, 4294967295ULL)) |
(ct_ge_u64(dd, 6442450942ULL) & ct_le_u64(dd, 8589934590ULL));
Moreover, strict bound checking shows that no element is ever equal to 2*q (8589934590).
Wouldn't crossround2 be unbiased when selecting the values from the range [(q+1)/2, q] plus the range [3*(q-1)/2+1, 2*q-2], which have an exact probability of 1/2?
b = (ct_ge_u64(dd, 2147483648ULL) & ct_le_u64(dd, 4294967295ULL)) |
(ct_ge_u64(dd, 6442450942ULL) & ct_le_u64(dd, 8589934588ULL));
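The bias claim can be checked by brute force for a small odd q, using the distribution above. Counting in units of 1/(4*q), the endpoint values -1 and 2*q-1 get weight 1 and every value in [0, 2*q-2] gets weight 2. This is my own scaled-down check, not repository code:

```c
/* Sum, in units of 1/(4*q), the probability mass of the set selected by
 * the crossround2 test: [(q+1)/2, q] u [3*(q-1)/2 + 1, 2*q-2],
 * optionally together with the value 2*q-1 (which has weight 1). */
static long selected_weight(long q, int include_top) {
    long w = 0;
    for (long y = -1; y <= 2 * q - 1; y++) {
        long weight = (y == -1 || y == 2 * q - 1) ? 1 : 2;
        int sel = (y >= (q + 1) / 2 && y <= q) ||
                  (y >= 3 * (q - 1) / 2 + 1 && y <= 2 * q - 2) ||
                  (include_top && y == 2 * q - 1);
        if (sel)
            w += weight;
    }
    return w; /* selected probability is w / (4*q) */
}
```

Including 2*q-1 gives weight 2*q+1, i.e. probability 1/2 + 1/(4*q); excluding it gives weight 2*q, i.e. exactly 1/2.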
Let me know if I missed something
FFT output is not always exactly the expected one, due to the lack of normalization after the last additions.
The output of FFT_mul and FFT_add is used in crossround2_ct and rec_ct, which expect an input in the range [0, q-1] instead of the range [0, 2^32-1]. Wouldn't this omission be a source of bias?
One of the possibilities could be to change the last loops of the FFT to look like this:
for (i = 0; i < 32; i++) {
    uint32_t t;
    sub(t, Z1[i][0], Z1[32 + i][32 - 1]);
    normalize(z[i], t);
    for (j = 1; j < 32; j++) {
        add(t, Z1[i][j], Z1[32 + i][j - 1]);
        normalize(z[32 * j + i], t);
    }
}
The cleanest would be to add a new function FFT_normalize(z), which could look like this:
for (i = 0; i < 1024; i++) {
    normalize(z[i], z[i]); // mod q, output is in the range [0, q-1]
}
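For q = 2^32-1, that normalization is cheap; a branch-free sketch (my code, assuming normalize reduces [0, 2^32-1] to [0, q-1]):

```c
#include <stdint.h>

/* Reduce x in [0, 2^32-1] to [0, q-1] with q = 2^32-1: only x == q
 * needs folding, down to 0, so adding the comparison result suffices
 * (the addition wraps 0xffffffff + 1 to 0). */
static uint32_t normalize_u32(uint32_t x) {
    return x + (x == 0xffffffffu);
}
```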
There is a suspiciously unnecessary normalization within moddiv2(). A division by 2 mod q is an unconditional rotation by 1 bit, which compilers like gcc can optimize. There is another suspiciously unnecessary normalization in the neg() macro. The macro bodies reduce to:
/* moddiv2: division by 2 mod q, expressed as a rotate-right by 1 bit */
do {
    uint32_t _ta = a;
    c = (_ta >> 1) ^ (_ta << 31);
} while (0)

/* neg: negation mod q, no extra normalization needed */
do {
    uint32_t _ta = 0xffffffff - a;
    c = _ta;
} while (0)
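The rotation claim is easy to verify: with q = 2^32-1, doubling the rotated value mod q recovers the original. This is my standalone check, not repository code:

```c
#include <stdint.h>

#define Q 0xffffffffULL /* q = 2^32 - 1 */

/* Division by 2 mod q expressed as a rotate-right by one bit: the low
 * bit, worth (q+1)/2 = 2^31 mod q, moves to the top position. */
static uint32_t moddiv2_rot(uint32_t a) {
    return (a >> 1) ^ (a << 31);
}
```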
The following code takes advantage of the fact that (A*B + C*D)%E has always been faster than ((A*B)%E + (C*D)%E)%E. The FFT cost is reduced by more than 20%, assuming n < 2^15 (in which case no overflow will occur).
static void naive(DATATYPE *z, const DATATYPE *x, const DATATYPE *y, unsigned int n) {
    unsigned int i, j, k;
    uint64_t A, B, t;
    uint32_t ta, tb;
    for (i = 0; i < n; i += 1) {
        A = 0;
        for (j = 0; j <= i; j++) {
            t = (uint64_t)x[j] * (uint64_t)y[i - j];
            A += (uint32_t)t;
            A += t >> 32;
        }
        B = 0;
        for (k = 1; j < n; j++, k++) {
            t = (uint64_t)x[j] * (uint64_t)y[n - k];
            B += (uint32_t)t;
            B += t >> 32;
        }
        modmul(ta, A, 1);
        modmul(tb, B, 1);
        modsub(z[i], tb, ta);
    }
}
It is not necessary to multiply a variable by 2 before comparing it to even constants.
The following code shows these unnecessary operations.
coswi = (((uint64_t)w[i]) << (uint64_t)1);
B = (ct_eq_u64(getbit(b, i), 0) & ct_ge_u64(coswi, 3221225472ULL) &
ct_le_u64(coswi, 7516192766ULL)) |
(ct_eq_u64(getbit(b, i), 1) & ct_ge_u64(coswi, 1073741824ULL) &
ct_le_u64(coswi, 5368709118ULL));
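Since all four bounds in that test are even, the doubling of w[i] can be dropped by halving the constants instead. A sketch of the equivalence for the first pair of bounds (plain comparisons here for brevity; the ct_* wrappers would stay in real code; both helper names are mine):

```c
#include <stdint.h>

/* Original form: double w, then compare against the even constants. */
static int in_range_doubled(uint32_t w) {
    uint64_t c = (uint64_t)w << 1;
    return c >= 3221225472ULL && c <= 7516192766ULL;
}

/* Equivalent form: compare w directly against the halved constants.
 * Exact because both bounds are even. */
static int in_range_halved(uint32_t w) {
    return w >= 1610612736u && w <= 3758096383u;
}
```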