Comments (13)
@darosior -merge=1 will never delete any files, so it cannot replace an existing big seed file with a smaller seed file that achieves the same coverage. Thus the first input in the repo that achieved a certain coverage will be kept.
What needs to be done is to create a new seed corpus from scratch by -merge:ing in the existing set of seeds together with other (potentially smaller) seeds. The idea is that this process will give a more optimal set of seeds. It will also remove duplicates (two seeds achieving the same coverage) in our repo.
from qa-assets.
Sounds good!
How would you proceed? I would have expected -merge=1 to choose the smaller input when choosing between two seeds with equal coverage, but maybe it doesn't?
Seeking input from fellow fuzzing enthusiasts: 👍, 👎 or neutral? :)
I'm willing to do the work! :)
How are you going to ensure that coverage is measured across different architectures/operating systems when merging the inputs?
One reason not to (in general) is that the fuzzing corpus is built up through the many versions the code went through, and thus may contain implicit knowledge about bugs that once existed but were fixed (not things that caused a crash; just inputs that were relevant in triggering code paths related to bugs), which may help find them later if accidentally reintroduced.
If we did this, I'd suggest only doing it for fuzzers whose corpus size has grown very large.
I'd also suggest doing the minimization (create empty dir, -merge=1 into it) using binaries both with and without sanitizers, since they tend to have different inputs that matter to them. EDIT: as Marco mentions, multiple architectures would be useful too.
So what you'd do, starting from corpus dir C:
- Create new empty dir N
- For each binary/platform/...:
  - ./fuzzer -merge=1 -use_value_profile=1 N C
- Replace C with N
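The steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `minimize_corpus` and its argument order are made up for illustration, each argument after the corpus directory is assumed to be a libFuzzer binary built for the same fuzz target (for example one with and one without sanitizers), and the corpus is replaced only if every merge succeeds.

```shell
# minimize_corpus CORPUS_DIR FUZZER [FUZZER...]
# Hypothetical helper: rebuilds CORPUS_DIR (C) by merging it into a fresh
# empty directory (N) once per fuzz binary, then swaps N in for C.
minimize_corpus() {
    local corpus="${1%/}"; shift
    local new_dir fuzzer

    # Create N next to C so the final mv is a plain rename
    new_dir="$(mktemp -d "${corpus}.min.XXXXXX")" || return 1

    for fuzzer in "$@"; do
        # With -merge=1, libFuzzer copies into the first directory only those
        # inputs from the following directories that add new coverage
        if ! "$fuzzer" -merge=1 -use_value_profile=1 "$new_dir" "$corpus"; then
            rm -rf -- "$new_dir"
            return 1
        fi
    done

    # Replace C with N
    rm -rf -- "$corpus" && mv -- "$new_dir" "$corpus"
}
```

A call might look like `minimize_corpus path/to/corpus ./fuzzer_plain ./fuzzer_sanitized`, where both binary names are placeholders. Running the merges one after another into the same N means a seed that matters to any one of the binaries is kept.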
Another issue to consider is how to deal with stateful logic when pruning seeds based on (line/branch) coverage. For example, the script interpreter has a stack to keep state during script execution. However, coverage-based fuzzers are, to the best of my knowledge, not aware of state and might thus prune away seeds that are the only ones exercising a specific code path through the interpreter, because each part of that path is already (line/branch) covered by different seed snippets.
Edit: Discussion about inputs with useful features that get removed because they don't increase coverage: https://reviews.llvm.org/D86577
I agree with @MarcoFalke and @sipa: there may be interesting code paths with sanitizers or state during script execution that are pruned (I don't think any fuzzers track path coverage either!). Though maybe, specifically for CI fuzzing jobs, this qa-assets/fuzz_seed_corpus directory can be slightly pruned and put in a different directory? Fuzzing enthusiasts could then run all the data at home without compromising any coverage.
Though practically, everyone is using libFuzzer to generate inputs (or similar fuzz engines that can't fully track path coverage), so I'd be surprised if path coverage decreases after cleaning the seed files.
Now that the seed files change format occasionally, and the CI is starting to reach its timeout limit when iterating over all historic seeds, we should start looking into cleaning the folder.
Keeping the historic seeds in a separate folder for now would be an option, but I don't think it matters in practice, because we all use pretty much the same fuzz engine with the same coverage statistics.
Another thought: Ideally the pruning should be reproducible, which is currently impossible because the fuzz targets don't achieve reproducible coverage.
I've noticed that with a larger in-memory corpus, libFuzzer will take up more RSS, and in some cases grind to a near halt with ~1 exec/s (process_messages harness).
We can no longer iterate over all seeds in less than 2 CPU hours, so we might have to do this soon. Otherwise, CI checks will need to be disabled or otherwise reduced.
We have accumulated quite a large number of extremely large inputs that take ages to process due to sheer size. A lot of these inputs do not add any meaningful coverage AFAICT.
In order to get meaningful fuzzing run times I usually do …
find qa-assets/fuzz_seed_corpus/ -type f -size +1000k -delete
… before fuzzing, where 1000k can be 100k, 10k or even 4k (libFuzzer default) depending on how much fuzzing runtime I want to spend.
It is basically impossible to process qa-assets/ with, say, -fsanitize=integer without doing so.
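Since the find command above deletes files immediately, it can help to look at what a given threshold would remove first. Below is a minimal dry-run sketch; the helper name `list_large_seeds` is hypothetical, and it uses only the standard find options already shown, with -print in place of -delete.

```shell
# list_large_seeds CORPUS_DIR SIZE_KIB
# Hypothetical helper: prints every regular file under CORPUS_DIR that find
# considers larger than SIZE_KIB kibibytes. Nothing is deleted.
list_large_seeds() {
    local corpus="$1" size_kib="$2"
    # find rounds file sizes up to whole KiB units before comparing,
    # so "+4k" matches files that occupy more than 4 such units
    find "$corpus" -type f -size "+${size_kib}k" -print
}
```

For example, `list_large_seeds qa-assets/fuzz_seed_corpus 1000 | wc -l` counts the seeds that the 1000k variant of the command would delete.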
Regardless of the pruning I suggested in this issue I think we should at least kill off the "large-for-no-reason" seeds :)
WDYT? :)
Would be nice if someone could test #44