
Comments (3)

bleichenbacher-daniel commented on July 23, 2024

I don't think any changes are necessary here.

The test generates signatures, selects a subset of those signatures based on timing information, and then checks whether the k's used to generate the selected signatures are biased. If it is possible to select DSA signatures with small k's by choosing the signatures that were generated faster than the others, then the implementation has a weakness.
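For concreteness, here is a minimal sketch in Java of that procedure. It is an illustration written for this discussion, not the actual Wycheproof test code; the curve, digest, iteration count, DER helper, and the "keep the fastest half" rule are all assumptions made for the example.

```java
import java.math.BigInteger;
import java.security.*;
import java.security.interfaces.ECPrivateKey;
import java.util.*;

// Minimal sketch (not the actual Wycheproof code): sign the same message many
// times, keep the fastest signatures, and recover each nonce k from (r, s)
// using the test's own private key.
public class NonceTimingSketch {

  // ECDSA/DSA signing uses s = k^{-1} (z + r*x) mod q, so with the private
  // key x known, the nonce can be recovered as k = s^{-1} (z + r*x) mod q.
  static BigInteger recoverK(BigInteger r, BigInteger s, BigInteger z,
                             BigInteger x, BigInteger q) {
    return s.modInverse(q).multiply(z.add(r.multiply(x))).mod(q);
  }

  // Parse a DER SEQUENCE { INTEGER r, INTEGER s } (short-form lengths only,
  // which is enough for P-256 signatures).
  static BigInteger[] parseDer(byte[] der) {
    int i = 2;                                   // skip SEQUENCE tag + length
    int rLen = der[i + 1];
    BigInteger r = new BigInteger(Arrays.copyOfRange(der, i + 2, i + 2 + rLen));
    i += 2 + rLen;
    int sLen = der[i + 1];
    BigInteger s = new BigInteger(Arrays.copyOfRange(der, i + 2, i + 2 + sLen));
    return new BigInteger[] {r, s};
  }

  public static void main(String[] args) throws Exception {
    KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");
    kpg.initialize(256);                         // P-256
    ECPrivateKey priv = (ECPrivateKey) kpg.generateKeyPair().getPrivate();
    BigInteger x = priv.getS();
    BigInteger q = priv.getParams().getOrder();

    byte[] msg = "test message".getBytes();
    // z is the message hash interpreted as an integer (no truncation needed
    // for SHA-256 with a 256-bit group order).
    BigInteger z = new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(msg));

    int n = 50_000;                              // as in the discussion below
    long[] nanos = new long[n];
    BigInteger[] ks = new BigInteger[n];
    Signature signer = Signature.getInstance("SHA256withECDSA");
    for (int i = 0; i < n; i++) {
      signer.initSign(priv);
      long t0 = System.nanoTime();
      signer.update(msg);
      byte[] der = signer.sign();
      nanos[i] = System.nanoTime() - t0;
      BigInteger[] rs = parseDer(der);
      ks[i] = recoverK(rs[0], rs[1], z, x, q);
    }

    // Keep the nonces of the fastest half. If the implementation is constant
    // time, these are still uniformly distributed no matter how noisy the
    // machine was during the run.
    Integer[] idx = new Integer[n];
    for (int i = 0; i < n; i++) idx[i] = i;
    Arrays.sort(idx, Comparator.comparingLong(i -> nanos[i]));
    List<BigInteger> selected = new ArrayList<>();
    for (int i = 0; i < n / 2; i++) selected.add(ks[idx[i]]);
    // ... feed `selected` into a bias statistic (see the z-score sketch below).
  }
}
```

Recovering k directly with the test's own private key is what makes the bias check possible; the only thing the timing is used for is choosing which signatures end up in the selected subset.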

If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should follow a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.

If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.
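As a concrete illustration of "close to a normal distribution", here is a simplified statistic. It is an assumption made for this example, not necessarily the exact statistic Wycheproof computes: the mean of m uniform nonces on [0, q) is approximately normal with mean q/2 and standard deviation q / sqrt(12·m), so the z-score below should stay within a few sigma for a correct implementation, however noisy the machine was.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Simplified bias statistic for illustration. Under a correct implementation
// the selected k's are (close to) uniform on [0, q), so their mean is about
// q/2 with standard deviation q / sqrt(12 * m), and the z-score is roughly
// standard normal regardless of how noisy the timings were.
public class BiasZScore {

  static double zScore(List<BigInteger> ks, BigInteger q) {
    int m = ks.size();
    BigInteger sum = BigInteger.ZERO;
    for (BigInteger k : ks) sum = sum.add(k);
    double mean = new BigDecimal(sum)
        .divide(BigDecimal.valueOf(m), MathContext.DECIMAL64).doubleValue();
    double qd = new BigDecimal(q).doubleValue();
    double sigmaOfMean = qd / Math.sqrt(12.0 * m);
    return (mean - qd / 2.0) / sigmaOfMean;
  }

  public static void main(String[] args) {
    // Demo with genuinely uniform nonces: the z-score stays within a few sigma.
    SecureRandom rnd = new SecureRandom();
    BigInteger q = BigInteger.ONE.shiftLeft(256);  // stand-in for a 256-bit group order
    List<BigInteger> ks = new ArrayList<>();
    for (int i = 0; i < 25_000; i++) ks.add(new BigInteger(256, rnd));
    System.out.printf("z = %.2f%n", zScore(ks, q));
  }
}
```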

There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also make it possible to improve the test.


ascarpino commented on July 23, 2024

> If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should follow a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.

I would absolutely disagree with the premise that external factors, like warmup, GC, server load, etc., do not change the distribution. With the randomness of K, noise could be introduced at unfortunate times. That does not show a weakness in the implementation; it shows a weakness in the test. The test does try to mitigate some of this with a large allowance for sigma, but as the above results show, it could be hard for that allowance to overcome a 10x performance difference if certain lengths of K occur at the wrong time. For example, there may be 100 small K values in a test run, or 1000, and many of those small K's may fall in the first half of the test run or in the latter half.

The lack of a warmup also fails to take intrinsics into consideration. Once the C2 compiler decides the method is hot, the intrinsic will change the performance and disrupt the distribution of the results. That is not a weakness in the implementation; it's a failure to test during normal system operation.
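The kind of warmup being argued for could look like the following sketch; the iteration count is a guess for illustration, not a measured value, and the key and algorithm choices are assumptions.

```java
import java.security.*;

// Sketch of a warmup phase: run an unmeasured burst of signatures first so
// that HotSpot's C2 compiler and any intrinsics have already kicked in before
// the measured loop starts.
public class WarmupSketch {
  public static void main(String[] args) throws Exception {
    KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");
    kpg.initialize(256);
    KeyPair kp = kpg.generateKeyPair();
    byte[] msg = "warmup".getBytes();
    Signature signer = Signature.getInstance("SHA256withECDSA");

    // Unmeasured warmup phase (count chosen arbitrarily for the example).
    for (int i = 0; i < 10_000; i++) {
      signer.initSign(kp.getPrivate());
      signer.update(msg);
      signer.sign();
    }
    // ... only now start the measured 50,000-signature loop.
  }
}
```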

> If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.

The results were generated with the 50,000 iterations that the wycheproof test uses. It's true that more iterations will reduce the influence of noise, but a warmup would reduce the biggest source of noise without resulting in a significantly longer test run.

> There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also make it possible to improve the test.

Whether you accept what I suggested or not is your decision.


bleichenbacher-daniel commented on July 23, 2024

The point I wanted to make is that there cannot be a test failure because of noise. If the implementation is correct, then the expected result will be close to a normal distribution with variance 1.

Too much noise can of course hide timing leaks. By selecting the signatures with the shortest timing, the test eliminates the biggest influences of noise without needing to examine the environment. Slow signatures generated during startup or during garbage collection are most likely not used. As long as their number is small, they don't have a significant influence. Also, if the test becomes slower in the middle because of other heavy load on the machine, then the result will be computed just from the 25,000 signatures generated during the quiet time. This can miss a bias, but it can't lead to false positives.

The current setup of the test is intended for continuous testing. If a randomized test is repeated many times, then it is important that the probability of false positives is small; hence the large threshold. For other use cases it might be reasonable to use a smaller threshold.
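A back-of-the-envelope check of why a large threshold keeps a continuously running randomized test from failing spuriously, using the standard normal tail bound P(Z > t) ≤ exp(−t²/2). The 7-sigma threshold and the run count below are illustrative assumptions, not Wycheproof's actual settings.

```java
// Rough bound on spurious failures of a repeatedly run randomized test.
public class FalsePositiveBound {
  public static void main(String[] args) {
    double threshold = 7.0;              // threshold in sigmas (assumption)
    long runsPerYear = 365L * 1_000;     // e.g. 1000 CI runs per day (assumption)
    // One-sided tail bound for a standard normal: P(Z > t) <= exp(-t^2 / 2).
    double perRun = Math.exp(-threshold * threshold / 2.0);
    System.out.printf("per-run false positive bound:         %.2e%n", perRun);
    System.out.printf("expected false positives per year: <= %.2e%n", perRun * runsPerYear);
  }
}
```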

