Should we prune the corpus to get rid of large slow inputs that reach code that can be

Consider pruning corpus to speed up the Bitcoin Core CI fuzzing job? (qa-assets issue, closed)

practicalswift commented on May 24, 2024

Consider pruning corpus to speed up the Bitcoin Core CI fuzzing job?

from qa-assets.

Comments (13)

practicalswift commented on May 24, 2024

@darosior -merge=1 will never delete any files, so it cannot replace an existing big seed file with a smaller seed file that achieves the same coverage. Thus the first input in the repo that achieved a certain coverage will be kept.

What needs to be done is to create a new seed corpus from scratch by -merge:ing in the existing set of seeds together with other (potentially smaller) seeds. The idea is that this process will give a more optimal set of seeds. It will also remove duplicates (two seeds achieving the same coverage) in our repo.

darosior commented on May 24, 2024

Sounds good!

How would you proceed? I would have expected -merge=1 to choose the smaller input when choosing between 2 seeds with equal coverage, but maybe it doesn't?

practicalswift commented on May 24, 2024

Seeking input from fellow fuzzing enthusiasts: 👍 , 👎 or neutral? :)

I'm willing to do the work! :)

maflcko commented on May 24, 2024

How are you going to ensure that coverage is measured across different architectures/operating systems when merging the inputs?

sipa commented on May 24, 2024

One reason not to (in general) is that the fuzzing corpus is built up through the many versions the code went through, and thus may contain implicit knowledge about bugs that once existed but were fixed (not things that caused a crash; just inputs that were relevant in triggering code paths related to bugs) - which may help find them later if accidentally reintroduced.

If we'd do this, I'd suggest only doing it for fuzzers whose corpus size has grown very large.
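
To see which fuzzers that would be, a minimal sketch along these lines could rank the targets by seed count. It assumes the corpus root holds one subdirectory per fuzz target; the function name `corpus_sizes` is made up for illustration.

```shell
# corpus_sizes ROOT: print "<file count> <target>" per subdirectory, largest first.
# Assumes one subdirectory per fuzz target under ROOT (hypothetical helper).
corpus_sizes() {
  root=${1%/}
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    # Count regular files only; tr strips the padding some wc versions add.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$count" "$name"
  done | sort -rn
}
```

Something like `corpus_sizes qa-assets/fuzz_seed_corpus | head` would then list the largest targets first.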

I'd also suggest doing the minimization (create empty dir, -merge=1 into it) using both binaries with and without sanitizers - they tend to have different inputs that matter to them. EDIT: as Marco mentions, multiple architectures would be useful too.

So what you'd do, starting from corpus dir C:

Create new empty dir N
For each binary/platform/...:
- ./fuzzer -merge=1 -use_value_profile=1 N C
Replace C with N
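
The steps above could be scripted roughly as follows. This is only a sketch: the function name `minimize_corpus`, the `.old` backup suffix and the example binary paths are assumptions, and each binary is assumed to be a libFuzzer build of the same target.

```shell
# minimize_corpus CORPUS BIN...: merge CORPUS into a fresh directory once per
# fuzz binary, then swap the result in. The old corpus is kept as CORPUS.old.
minimize_corpus() {
  corpus=${1%/}
  shift
  old="$corpus.old"
  # Refuse to run if a previous backup is still in place.
  [ -e "$old" ] && { echo "backup $old already exists" >&2; return 1; }
  new=$(mktemp -d "$corpus.new.XXXXXX") || return 1
  for bin in "$@"; do
    # Each pass adds the inputs from $corpus that matter to this particular build.
    "$bin" -merge=1 -use_value_profile=1 "$new" "$corpus" || { rm -rf "$new"; return 1; }
  done
  mv "$corpus" "$old" && mv "$new" "$corpus"
}
```

A call might look like `minimize_corpus C ./fuzzer_plain ./fuzzer_asan` (hypothetical binary names). Keeping the old directory around makes it possible to compare coverage before throwing anything away.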

maflcko commented on May 24, 2024

Another issue to consider is how to deal with stateful logic when pruning seeds based on (line/branch) coverage. For example, the script interpreter has a stack to keep state during script execution. However, coverage-based fuzzers are, to the best of my knowledge, not aware of state and might thus prune away seeds that are the only ones exercising a specific code path through the interpreter, because that path is already (line/branch) covered by different seed snippets.

Edit: Discussion about inputs with useful features that get removed because they don't increase coverage: https://reviews.llvm.org/D86577

Crypt-iQ commented on May 24, 2024

I agree with @MarcoFalke and @sipa - there may be interesting code paths with sanitizers or state during script execution that are pruned (I don't think any fuzzers track path coverage either!). Though maybe specifically for CI fuzzing jobs, this qa-assets/fuzz_seed_corpus directory can be slightly pruned and put in a different directory? And fuzzing enthusiasts can run all the data at home without compromising any coverage.
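
One way the separate-directory idea might look, as a sketch only: copy everything up to a size limit into a second tree and leave the full corpus alone. Pruning by size is an assumption here (the comment does not say how to prune), and `make_ci_corpus` and the destination path are made-up names.

```shell
# make_ci_corpus SRC DST KB: copy files of at most about KB KiB from SRC to DST,
# keeping the per-target subdirectories. SRC itself is never modified.
make_ci_corpus() {
  src=${1%/}
  dst=${2%/}
  kb=$3
  # List paths relative to SRC so the same layout can be recreated under DST.
  (cd "$src" && find . -type f ! -size +"${kb}k" -print) | while IFS= read -r f; do
    mkdir -p "$dst/$(dirname "$f")" && cp "$src/$f" "$dst/$f"
  done
}
```

For example, `make_ci_corpus qa-assets/fuzz_seed_corpus ci_corpus 100` would give CI a tree without the largest files, while the original directory stays complete for longer runs.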

maflcko commented on May 24, 2024

Though practically, everyone is using libFuzzer to generate inputs (or similar fuzz engines that can't fully track path coverage), so I'd be surprised if path coverage decreases after cleaning the seed files.

Now that the seed files change format occasionally, and the CI starts to reach its timeout limit when iterating over all historic seeds, we should start looking into cleaning the folder.

Keeping the historic seeds in a separate folder for now would be an option, but I don't think it matters in practice, because we all use pretty much the same fuzz engine with the same coverage statistics.

maflcko commented on May 24, 2024

Another thought: Ideally the pruning should be reproducible, which is currently impossible because the fuzz targets don't achieve reproducible coverage.

Crypt-iQ commented on May 24, 2024

I've noticed that with a larger in-memory corpus, libFuzzer will take up more RSS, and in some cases grind to a near halt with ~1 exec/s (process_messages harness).

maflcko commented on May 24, 2024

We can no longer iterate over all seeds in less than 2 CPU hours, so we might have to do this soon. Otherwise, CI checks will need to be disabled or otherwise reduced.

practicalswift commented on May 24, 2024

We have accumulated quite a large number of extremely large inputs that take ages to process due to sheer size. A lot of these inputs do not add any meaningful coverage AFAICT.

In order to get meaningful fuzzing run times I usually do …

find qa-assets/fuzz_seed_corpus/ -type f -size +1000k -delete

… before fuzzing, where 1000k can be 100k, 10k or even 4k (libFuzzer default) depending on how much fuzzing runtime I want to spend.

It is basically impossible to process qa-assets/ with, say, -fsanitize=integer without doing so.
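
Before deleting anything, it may be worth checking how many files each threshold would hit. A small sketch, where `report_large_seeds` is a made-up helper name and file names are assumed not to contain newlines:

```shell
# report_large_seeds DIR KB...: for each threshold, print how many files
# `find -size +<KB>k` would match. Nothing is deleted.
report_large_seeds() {
  dir=$1
  shift
  for kb in "$@"; do
    # Same test as the -delete command, but only counting the matches.
    n=$(find "$dir" -type f -size +"${kb}k" | wc -l | tr -d ' ')
    printf '%sk: %s files\n' "$kb" "$n"
  done
}
```

For instance, `report_large_seeds qa-assets/fuzz_seed_corpus/ 1000 100 10 4` prints one count per threshold. Note that find rounds file sizes up to whole 1 KiB blocks for this test, so a file of 1025 bytes already counts as larger than 1k.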

Regardless of the pruning I suggested in this issue I think we should at least kill off the "large-for-no-reason" seeds :)

WDYT? :)

maflcko commented on May 24, 2024

Would be nice if someone could test #44
