Comments (30)
How much would be saved on top, if the `-set_cover_merge=1` merge algorithm was used?
from qa-assets.
> I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.
Sgtm
I guess one way to implement this would be to have a "massive submit repo", which is append-only; each submitter may or may not merge into an empty corpus first.
Then, there is a regular task to cherry-pick the "minimal" qa-assets folder used for CI.
> How much would that be with libFuzzer with and without `-use_value_profile`?
8.1GB with, 1.9GB without.
So perhaps dropping `use_value_profile` from the merge script is the best low-effort solution for now?
That'd be 6.0G, which still seems a bit large, given that we may want to re-think the merge process (always merge into a clean folder). I think we also want to keep the append-only aspect of pull requests to simplify review.
As a next step, one could compare the runtime of the result of `set_cover_merge` vs `merge`.
Could you also test this with `-use_value_profile=1 -set_cover_merge=1`? Maybe that could be a nice middle ground...
I wonder if we did the wrong measurement, since we ran on a corpus generated with `-merge=1`. It may be better to re-run the measurements on a "dirty" corpus.
@MarcoFalke thoughts?
I guess cloning with `--depth 1` already works quite well.
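As a self-contained illustration of what `--depth 1` buys (the tiny demo repo below stands in for qa-assets): only the tip commit is transferred, not the history.

```shell
# Build a tiny two-commit repo, then shallow-clone it over the git
# transport (file:// forces the network code path, so --depth applies).
set -e
tmp="$(mktemp -d)" && cd "$tmp"
git init -q src
git -C src -c user.email=t@e.st -c user.name=t commit -q --allow-empty -m first
git -C src -c user.email=t@e.st -c user.name=t commit -q --allow-empty -m second
git clone -q --depth 1 "file://$tmp/src" dst
git -C dst rev-list --count HEAD   # prints 1; a full clone would print 2
```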
> Prune the git history, .git is currently at 4GB. (we don't really need the history / we could archive the history to a separate repo)

There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we'll be able to host the history in a repo. Someone could put up a tar.gz of the `.git` folder on their personal website, maybe?
> Compress corpora (~6GB gzip)

This will make everything worse. It will make it impossible for git to track single fuzz inputs and de-duplicate them in the git history. Adding a single fuzz input would require a full copy of all fuzz inputs of the same fuzz target. Also, it makes it harder to browse and use.
> Avoid large inputs / have separate repo for those

Maybe, but this will also make it harder to browse, use and contribute.
> The biggest downside to the size currently is that we pull this repo in our CI jobs (oss-fuzz as well) which is a big overhead.

Not sure if this is a problem or whether it can be fixed. As you say, `--depth=1` is already used, and CI machines generally have a fast connection.
> There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we'll be able to host the history in a repo. Someone could put up a tar.gz of the `.git` folder on their personal website, maybe?

An alternative to squashing the history at that point may be to move the repo to GitLab, which has a premium plan with 50 GB or 250 GB of storage.
Is `git` really the right tool to manage a data collection with so many files? Unfortunately, I don’t have a better idea (yet), but it does seem terribly slow in interactions with this repository.
I've recently been using the afl++ tooling more and noticed that `afl-cmin` (the afl++ corpus minimizer) produces much smaller corpora. This is explained by libFuzzer using more than coverage as feedback, which ends up bloating the corpora with inputs that achieve the same coverage but have otherwise interesting features (interesting according to libFuzzer). So we could consider using `afl-cmin`, but I'm not sure how we would evaluate whether or not this is a good idea (besides the corpora size).
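For concreteness, a typical `afl-cmin` invocation looks roughly like this (sketch; `./fuzz_target` and the directory names are placeholders, and the target binary must be afl++-instrumented):

```shell
# afl-cmin replays every input through the instrumented target and
# keeps only a subset that preserves the observed edge coverage.
# @@ is substituted with the path of each candidate input file.
afl-cmin -i fuzz_seed_corpus/ -o fuzz_seed_corpus.min/ -- ./fuzz_target @@
```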
It might be worth fuzzing with either `afl++` or `libfuzzer`, but then only uploading what `afl-cmin` considers to be increasing coverage?
One could also remove the `-use_value_profile=1` setting from the merge script?
I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort. However, the better alternative would result in far more churn, and as long as we're tied to git for storage, we probably don't want that.
I believe that when one calls `fuzz -merge=1 DIR1 DIR2 DIR3`, assets in DIR2 and DIR3 which add coverage or features w.r.t. DIR1 get added to it. However, assets in DIR2 or DIR3 which are only smaller, but don't increase coverage beyond what DIR1 combined already has, do not. This means that reductions found by local fuzzing ("REDUCE" lines) don't actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation. To get reductions in, you'd need to use a new empty DIR1 rather than the existing qa-assets dir as DIR1, but that will likely cause merging to throw out existing entries often too.
I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus. Once the corpus gets too big, or on a regular basis (e.g. before or after a release), the project compacts them using -merge into a new empty directory. However, I suspect this will cause more churn in the git repo than we want, so perhaps we should think about whether we can come up with an alternative that isn't git-based but still allows people to submit additions.
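The difference can be made concrete with two invocation sketches (binary and directory names are placeholders):

```shell
# Incremental merge: DIR1 is the existing corpus, so a smaller input
# whose coverage DIR1 already provides is NOT copied in.
./fuzz -merge=1 qa-assets-dir/ local_fuzzing_dir/

# Compaction: DIR1 starts empty, so the selection is redone from
# scratch and reduced inputs can displace their larger equivalents.
mkdir -p empty_dir/
./fuzz -merge=1 empty_dir/ qa-assets-dir/ local_fuzzing_dir/
```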
> I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort.
I think there are different procedures, which serve different purposes:
- The qa-assets folder, which has the purpose to provide deterministic, reasonable coverage fuzz inputs to CI tasks
- Continuous fuzzing, which may or may not use the qa-assets folder, and may or may not provide fuzz inputs back to the folder. The purpose here is to ever extend coverage, to find rare issues, and to protect against fuzz input format changes.
> This means that reductions found by local fuzzing ("REDUCE" lines) don't actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation.
Good point. I guess having non-determinism in the fuzz targets and the `-use_value_profile=1` bloat may cause small inputs to make it in regardless right now.
> I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.
Sgtm
> Once the corpus gets too big, or on a regular basis (e.g. before or after release), the project compacts them using -merge into a new empty directory.
This is already done and will be done this month again.
Just noting that the size of `fuzz_seed_corpus` is reduced to 1.6GB from 14GB using `afl-cmin`; coverage report: https://dergoegge.github.io/bitcoin-coverage/afl-cmin/fuzz.coverage/src/index.html
How much would that be with libFuzzer with and without `-use_value_profile`?
TIL `-set_cover_merge=1`. Not documented on https://llvm.org/docs/LibFuzzer.html?
Yes, most options are not mentioned in the html help.
TIL that there is anything else than the html help. (`-help=1` works...)
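For anyone else looking: the full flag list comes from the instrumented binary itself rather than the online docs (binary name illustrative):

```shell
# Prints every libFuzzer flag with a one-line description, including
# ones missing from the html docs, such as -set_cover_merge.
./fuzz -help=1
```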
```
MERGE-INNER: 686831 total files; 0 processed earlier; will process 686831 files now
...
#686831 DONE cov: 4419 exec/s: 1064 rss: 182Mb
MERGE-OUTER: successful in 1 attempt(s)
MERGE-OUTER: the control file has 4141390298 bytes
==3695960== ERROR: libFuzzer: out-of-memory (used: 2079Mb; limit: 2048Mb)
```
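Assuming this is libFuzzer's standard RSS watchdog (default limit 2048Mb), the merge run can be given more headroom with a stock flag (binary and directory names illustrative):

```shell
# -rss_limit_mb=0 disables the 2048Mb default memory limit, which a
# merge over ~686k inputs (4GB control file) can otherwise trip.
./fuzz -merge=1 -rss_limit_mb=0 new_corpus/ old_corpus/
```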
> How much would be saved on top, if the `-set_cover_merge=1` merge algorithm was used?

I am getting 1.6G vs 1.9G.
> I am getting 1.6G vs 1.9G

Could you also test this with `-use_value_profile=1 -set_cover_merge=1`? Maybe that could be a nice middle ground...
Running `/usr/bin/time -f '%M KB, %S + %U' ./test/fuzz/test_runner.py` on the result folder gives:
- `set_cover_merge_dir`: 458140 KB, 39.68 + 1149.42
- `merge_dir`: 556452 KB, 64.96 + 1699.33

Which seems like a massive speed up for the same coverage?
Benchmark with a big `miniscript_smart` corpus (228210 files, 1.2G), though accidentally including all of `fuzz_seed_corpus` into the merging:
- `-use_value_profile=0 -merge=1`: 4064 files, 21M
- `-use_value_profile=1 -merge=1`: 4869 files, 30M (by extending the result from the previous line)
- `-use_value_profile=0 -set_cover_merge=1`: 605 files, 3.1M
- `-use_value_profile=1 -set_cover_merge=1`: 1037 files, 7.7M (by extending the result from the previous line)
Interesting find. So I guess `-use_value_profile=1 -set_cover_merge=1` is doable for some targets. Though, the massive targets `addrman`, `banman`, `block`, ... probably set the overall result when it comes to storage and compute used.
For creating a new corpus at the branch-off point, would it perhaps make sense to at least combine the crème de la crème? I.e. if each of us merged their active fuzzing directory to a new directory with `-set_cover_merge` and pushed that branch to their own repos, someone could combine the old bloated set and our individual best sets to create the new starting point?
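A rough sketch of that workflow (binary name, paths, and the remote are all placeholders):

```shell
# 1) Each contributor distills their active fuzzing directory into a
#    small high-value set using the set cover algorithm.
mkdir -p my_best_set/
./fuzz -set_cover_merge=1 my_best_set/ my_active_fuzzing_dir/

# 2) They push that set as a branch of their own fork.
git add my_best_set/
git commit -m "Distilled corpus"
git push my_fork HEAD:distilled-corpus

# 3) Someone combines the old bloated set and everyone's best sets
#    into the new starting point.
mkdir -p new_corpus/
./fuzz -set_cover_merge=1 new_corpus/ old_bloated_set/ my_best_set/ their_best_set/
```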
That would certainly be an interesting measurement as well
Ok, I am getting that `-set_cover_merge=1` can produce a smaller result if many "active/dirty" folders are used as input. So likely with higher coverage now, and even smaller:
- `-set_cover_merge=1 -use_value_profile=0`: 1.0G
- `-set_cover_merge=1 -use_value_profile=1`: 5.3G

Though, that still seems a bit large, given that the maximum repo size is apparently 10G on GitHub and 25G on GitLab. So I guess we can keep using `-set_cover_merge=1 -use_value_profile=0` in the merge script here.
Oh, that’s an interesting thought. That might also explain why, for me, the `set_cover_merge` from my active fuzzing directory was bigger than the merge from active fuzzing + qa-assets/main.