
Comments (13)

practicalswift commented on May 24, 2024

@darosior -merge=1 will never delete any files, so it cannot replace an existing big seed file with a smaller seed file that achieves the same coverage. Thus the first input in the repo that achieved a certain coverage will be kept.

What needs to be done is to create a new seed corpus from scratch by -merge:ing in the existing set of seeds together with other (potentially smaller) seeds. The idea is that this process will give a more optimal set of seeds. It will also remove duplicates (two seeds achieving the same coverage) in our repo.

from qa-assets.

darosior commented on May 24, 2024

Sounds good!

How would you proceed? I would have expected -merge=1 to pick the smaller input when choosing between two seeds with equal coverage, but maybe it doesn't?

practicalswift commented on May 24, 2024

Seeking input from fellow fuzzing enthusiasts: 👍, 👎 or neutral? :)

I'm willing to do the work! :)

maflcko commented on May 24, 2024

How are you going to ensure that coverage is measured across different architectures/operating systems when merging the inputs?

sipa commented on May 24, 2024

One reason not to (in general) is that the fuzzing corpus is built up through the many versions the code went through, and thus may contain implicit knowledge about bugs that once existed but were fixed (not things that caused a crash; just inputs that were relevant in triggering code paths related to bugs) - which may help find them later if accidentally reintroduced.

If we'd do this, I'd suggest only doing it for fuzzers whose corpus size has grown very large.
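To see which corpora have actually grown large, a rough per-target count could look like the sketch below. It assumes one subdirectory per fuzz target under the corpus root; the function name corpus_sizes is made up for illustration.

```shell
# Sketch: print "<file count> <target dir>" for every target directory,
# largest first. Assumes a layout of <root>/<target>/<seed files>.
corpus_sizes() {
  local root="$1" dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # tr strips the padding some wc implementations add
    printf '%s %s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "$(basename "$dir")"
  done | sort -rn
}
```

For example, `corpus_sizes qa-assets/fuzz_seed_corpus | head` would list the ten targets with the most seeds.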

I'd also suggest doing the minimization (create empty dir, -merge=1 into it) using both binaries with and without sanitizers - they tend to have different inputs that matter to them. EDIT: as Marco mentions, multiple architectures would be useful too.

So what you'd do, starting from corpus dir C:

  • Create new empty dir N
  • For each binary/platform/...:
    • ./fuzzer -merge=1 -use_value_profile=1 N C
  • replace C with N
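The steps above could be sketched as a small shell function. This is only an illustration: the function name minimize_corpus, the ".min" suffix and the binary names in the usage line are hypothetical; only the -merge=1 and -use_value_profile=1 flags come from the comment above.

```shell
# Sketch: rebuild corpus dir C as a freshly merged dir N, using several
# fuzz binaries (e.g. with and without sanitizers) in turn.
# Usage: minimize_corpus <corpus_dir> <fuzz_binary>...
minimize_corpus() {
  local corpus="${1%/}"             # C, trailing slash stripped
  shift
  local new_corpus="${corpus}.min"  # N (hypothetical naming)
  local fuzzer

  rm -rf "$new_corpus"
  mkdir -p "$new_corpus" || return 1

  for fuzzer in "$@"; do
    # libFuzzer merge: inputs from C that add coverage are copied into N.
    # Abort before touching C if any binary fails.
    "$fuzzer" -merge=1 -use_value_profile=1 "$new_corpus" "$corpus" || return 1
  done

  # Replace C with N only after every merge succeeded.
  rm -rf "$corpus"
  mv "$new_corpus" "$corpus"
}
```

It might be called as `minimize_corpus corpus/some_target ./fuzz_plain ./fuzz_asan`, where both binary names are placeholders. Binaries built for other architectures would have to run on their own machines against the same N.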

maflcko commented on May 24, 2024

Another issue to consider is how to deal with stateful logic when pruning seeds based on (line/branch) coverage. For example, the script interpreter has a stack to keep state during script execution. However, coverage-based fuzzers are, to the best of my knowledge, not aware of state and might thus prune away seeds that are the only ones exercising a specific path through the interpreter, because each part of that path is already (line/branch) covered by other seeds.

Edit: Discussion about inputs with useful features that get removed because they don't increase coverage: https://reviews.llvm.org/D86577

Crypt-iQ commented on May 24, 2024

I agree with @MarcoFalke and @sipa - there may be interesting code paths with sanitizers or state during script execution that are pruned (I don't think any fuzzers track path coverage either!). Though maybe specifically for CI fuzzing jobs, this qa-assets/fuzz_seed_corpus directory can be slightly pruned and put in a different directory? And fuzzing enthusiasts can run all the data at home without compromising any coverage.

maflcko commented on May 24, 2024

Though practically, everyone is using libFuzzer to generate inputs (or similar fuzz engines that can't fully track path coverage), so I'd be surprised if path coverage decreases after cleaning the seed files.

Now that the seed files change format occasionally, and the CI is starting to reach its timeout limit when iterating over all historic seeds, we should start looking into cleaning the folder.

Keeping the historic seeds in a separate folder for now would be an option, but I don't think it matters in practice, because we all use pretty much the same fuzz engine with the same coverage statistics.

maflcko commented on May 24, 2024

Another thought: Ideally the pruning should be reproducible, which is currently impossible because the fuzz targets don't achieve reproducible coverage.

Crypt-iQ commented on May 24, 2024

I've noticed that with a larger in-memory corpus, libFuzzer will take up more RSS, and in some cases grind to a near halt at ~1 exec/s (process_messages harness).

maflcko commented on May 24, 2024

We can no longer iterate over all seeds in less than 2 CPU hours, so we might have to do this soon. Otherwise, CI checks will need to be disabled or otherwise reduced.

practicalswift commented on May 24, 2024

We have accumulated quite a large number of extremely large inputs that take ages to process due to sheer size. A lot of these inputs do not add any meaningful coverage AFAICT.

In order to get meaningful fuzzing run times I usually do …

find qa-assets/fuzz_seed_corpus/ -type f -size +1000k -delete

… before fuzzing. Where 1000k can be 100k, 10k or even 4k (libFuzzer default) depending on how much fuzzing runtime I want to spend.
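Since -delete is irreversible, one could first list what a given threshold would remove. A minimal sketch, with list_large_seeds as a made-up helper name and the threshold passed in find's own size syntax:

```shell
# Sketch: dry run of the size-based pruning. Prints the seeds that
# "find ... -size +<threshold> -delete" would remove, without deleting.
# Usage: list_large_seeds <corpus_dir> [threshold, default 1000k]
list_large_seeds() {
  local corpus_dir="$1"
  local threshold="${2:-1000k}"
  find "$corpus_dir" -type f -size "+${threshold}" -print
}
```

For example, `list_large_seeds qa-assets/fuzz_seed_corpus/ 100k | wc -l` gives the number of seeds a 100k cut-off would drop.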

It is basically impossible to process qa-assets/ with say -fsanitize=integer without doing so.

Regardless of the pruning I suggested in this issue I think we should at least kill off the "large-for-no-reason" seeds :)

WDYT? :)

maflcko commented on May 24, 2024

Would be nice if someone could test #44
