Comments (30)
How much would be saved on top, if the `-set_cover_merge=1` merge algorithm was used?
from qa-assets.
> I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.
Sgtm
I guess one way to implement this would be to have a "massive submit repo", which is append-only; each submitter may or may not merge into an empty corpus first.
Then, there is a regular task to cherry-pick the "minimal" qa-assets folder used for CI.
> How much would that be with libFuzzer with and without `-use_value_profile`?
8.1GB with, 1.9GB without.
So perhaps dropping `use_value_profile` from the merge script is the best low-effort solution for now?
That'd be 6.0G, which still seems a bit large, given that we may want to re-think the merge process (always merge into a clean folder). I think we also want to keep the append-only aspect of pull requests to simplify review.
As a next step, one could compare the runtime of the result of `set_cover_merge` vs `merge`.
Could you also test this with `-use_value_profile=1 -set_cover_merge=1`? Maybe that could be a nice middle ground...
I wonder if we did the wrong measurement, since we ran on a corpus generated with `-merge=1`. It may be better to re-run the measurements on a "dirty" corpus.
@MarcoFalke thoughts?
I guess cloning with `--depth 1` already works quite well.
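As a self-contained illustration of what `--depth 1` buys (the tiny demo repo below stands in for qa-assets): only the tip commit is transferred, not the history.

```shell
# Build a tiny two-commit repo, then shallow-clone it over the git
# transport (file:// forces the network code path, so --depth applies).
set -e
tmp="$(mktemp -d)" && cd "$tmp"
git init -q src
git -C src -c user.email=t@e.st -c user.name=t commit -q --allow-empty -m first
git -C src -c user.email=t@e.st -c user.name=t commit -q --allow-empty -m second
git clone -q --depth 1 "file://$tmp/src" dst
git -C dst rev-list --count HEAD   # prints 1; a full clone would print 2
```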
> Prune the git history, .git is currently at 4GB. (we don't really need the history / we could archive the history to a separate repo)

There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we'll be able to host the history in a repo. Someone could put up a tar.gz of the `.git` folder on their personal website, maybe?
> Compress corpora (~6GB gzip)

This will make everything worse. It will make it impossible for git to track single fuzz inputs and de-duplicate them in the git history. Adding a single fuzz input would require a full copy of all fuzz inputs of the same fuzz target. Also, it makes it harder to browse and use.
> Avoid large inputs / have separate repo for those

Maybe, but this will also make it harder to browse, use and contribute.
> The biggest downside to the size currently is that we pull this repo in our CI jobs (oss-fuzz as well) which is a big overhead.

Not sure if this is a problem or whether it can be fixed. As you say, `--depth=1` is already used, and CI machines generally have a fast connection.
> There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we'll be able to host the history in a repo. Someone could put up a tar.gz of the `.git` folder on their personal website, maybe?

An alternative to squashing the history at that point may be to move the repo to GitLab, which has a premium plan with 50 GB or 250 GB of storage.
Is `git` really the right tool to manage a data collection with so many files? Unfortunately, I don’t have a better idea (yet), but it does seem terribly slow in interactions with this repository.
I've recently been using the afl++ tooling more and noticed that `afl-cmin` (the afl++ corpus minimizer) produces much smaller corpora. This is explained by libFuzzer using more than coverage as feedback, which ends up bloating the corpora with inputs that achieve the same coverage but have otherwise interesting features (interesting according to libFuzzer). So we could consider using `afl-cmin`, but I'm not sure how we would evaluate whether or not this is a good idea (besides the corpora size).
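For concreteness, a typical `afl-cmin` invocation looks roughly like this (sketch; `./fuzz_target` and the directory names are placeholders, and the target binary must be afl++-instrumented):

```shell
# afl-cmin replays every input through the instrumented target and
# keeps only a subset that preserves the observed edge coverage.
# @@ is substituted with the path of each candidate input file.
afl-cmin -i fuzz_seed_corpus/ -o fuzz_seed_corpus.min/ -- ./fuzz_target @@
```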
It might be worth fuzzing with either `afl++` or `libfuzzer`, but then only uploading what `afl-cmin` considers to be increasing coverage?
One could also remove the `-use_value_profile=1` setting from the merge script?
I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort. However, the better alternative would result in far more churn, and as long as we're tied to git for storage, we probably don't want that.
I believe that when one calls `fuzz -merge=1 DIR1 DIR2 DIR3`, assets in DIR2 and DIR3 which add coverage or features w.r.t. DIR1 get added to it. However, assets in DIR2 or DIR3 which are only smaller, but don't increase coverage beyond what DIR1 combined already has, do not. This means that reductions found by local fuzzing ("REDUCE" lines) don't actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation. To get reductions in, you'd need to use a new empty DIR1 rather than the existing qa-assets dir as DIR1, but that will likely cause merging to throw out existing entries often too.
I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus. Once the corpus gets too big, or on a regular basis (e.g. before or after a release), the project compacts them using -merge into a new empty directory. However, I suspect this will cause more churn in the git repo than we want, so perhaps we should think about whether we can come up with an alternative that isn't git-based but still allows people to submit additions.
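The difference can be made concrete with two invocation sketches (binary and directory names are placeholders):

```shell
# Incremental merge: DIR1 is the existing corpus, so a smaller input
# whose coverage DIR1 already provides is NOT copied in.
./fuzz -merge=1 qa-assets-dir/ local_fuzzing_dir/

# Compaction: DIR1 starts empty, so the selection is redone from
# scratch and reduced inputs can displace their larger equivalents.
mkdir -p empty_dir/
./fuzz -merge=1 empty_dir/ qa-assets-dir/ local_fuzzing_dir/
```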
> I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort.
I think there are different procedures, which serve different purposes:
- The qa-assets folder, which has the purpose to provide deterministic, reasonable coverage fuzz inputs to CI tasks
- Continuous fuzzing, which may or may not use the qa-assets folder, and may or may not provide fuzz inputs back to the folder. The purpose here is to ever extend coverage, to find rare issues, and to protect against fuzz input format changes.
> This means that reductions found by local fuzzing ("REDUCE" lines) don't actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation.
Good point. I guess having non-determinism in the fuzz targets and the `-use_value_profile=1` bloat may cause small inputs to make it in regardless right now.
> I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.
Sgtm
> Once the corpus gets too big, or on a regular basis (e.g. before or after release), the project compacts them using -merge into a new empty directory.
This is already done and will be done this month again.
Just noting that the size of `fuzz_seed_corpus` is reduced to 1.6GB from 14GB using `afl-cmin`; coverage report: https://dergoegge.github.io/bitcoin-coverage/afl-cmin/fuzz.coverage/src/index.html
How much would that be with libFuzzer with and without `-use_value_profile`?
TIL `-set_cover_merge=1`. Not documented on https://llvm.org/docs/LibFuzzer.html?
Yes, most options are not mentioned in the html help.
TIL that there is anything else than the html help. (`-help=1` works...)
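For anyone else looking: the full flag list comes from the instrumented binary itself rather than the online docs (binary name illustrative):

```shell
# Prints every libFuzzer flag with a one-line description, including
# ones missing from the html docs, such as -set_cover_merge.
./fuzz -help=1
```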
```
MERGE-INNER: 686831 total files; 0 processed earlier; will process 686831 files now
...
#686831 DONE cov: 4419 exec/s: 1064 rss: 182Mb
MERGE-OUTER: successful in 1 attempt(s)
MERGE-OUTER: the control file has 4141390298 bytes
==3695960== ERROR: libFuzzer: out-of-memory (used: 2079Mb; limit: 2048Mb)
```
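Assuming this is libFuzzer's standard RSS watchdog (default limit 2048Mb), the merge run can be given more headroom with a stock flag (binary and directory names illustrative):

```shell
# -rss_limit_mb=0 disables the 2048Mb default memory limit, which a
# merge over ~686k inputs (4GB control file) can otherwise trip.
./fuzz -merge=1 -rss_limit_mb=0 new_corpus/ old_corpus/
```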
> How much would be saved on top, if the `-set_cover_merge=1` merge algorithm was used?

I am getting 1.6G vs 1.9G.
> I am getting 1.6G vs 1.9G

Could you also test this with `-use_value_profile=1 -set_cover_merge=1`? Maybe that could be a nice middle ground...
Running `/usr/bin/time -f '%M KB, %S + %U' ./test/fuzz/test_runner.py` on the result folder gives:
- `set_cover_merge_dir`: 458140 KB, 39.68 + 1149.42
- `merge_dir`: 556452 KB, 64.96 + 1699.33

Which seems like a massive speed up for the same coverage?
Benchmark with a big `miniscript_smart` corpus (228210 files, 1.2G), though accidentally including all of `fuzz_seed_corpus` into the merging:
- `-use_value_profile=0 -merge=1`: 4064 files, 21M
- `-use_value_profile=1 -merge=1`: 4869 files, 30M (by extending the result from the previous line)
- `-use_value_profile=0 -set_cover_merge=1`: 605 files, 3.1M
- `-use_value_profile=1 -set_cover_merge=1`: 1037 files, 7.7M (by extending the result from the previous line)
Interesting find. So I guess `-use_value_profile=1 -set_cover_merge=1` is doable for some targets. Though, the massive targets `addrman`, `banman`, `block`, ... probably set the overall result when it comes to storage and compute used.
For creating a new corpus at the branch-off point, would it perhaps make sense to at least combine the crème de la crème? I.e. if each of us merged their active fuzzing directory to a new directory with `-set_cover_merge` and pushed that branch to their own repos, someone could combine the old bloated set and our individual best sets to create the new starting point?
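A rough sketch of that workflow (binary name, paths, and the remote are all placeholders):

```shell
# 1) Each contributor distills their active fuzzing directory into a
#    small high-value set using the set cover algorithm.
mkdir -p my_best_set/
./fuzz -set_cover_merge=1 my_best_set/ my_active_fuzzing_dir/

# 2) They push that set as a branch of their own fork.
git add my_best_set/
git commit -m "Distilled corpus"
git push my_fork HEAD:distilled-corpus

# 3) Someone combines the old bloated set and everyone's best sets
#    into the new starting point.
mkdir -p new_corpus/
./fuzz -set_cover_merge=1 new_corpus/ old_bloated_set/ my_best_set/ their_best_set/
```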
That would certainly be an interesting measurement as well
Ok, I am getting that `-set_cover_merge=1` can produce a smaller result if many "active/dirty" folders are used as input. So likely with higher coverage now, and even smaller:
- `-set_cover_merge=1 -use_value_profile=0`: 1.0G
- `-set_cover_merge=1 -use_value_profile=1`: 5.3G

Though, that still seems a bit large, given that the maximum repo size is apparently 10G on GitHub and 25G on GitLab. So I guess we can keep using `-set_cover_merge=1 -use_value_profile=0` in the merge script here.
Oh, that’s an interesting thought. That might also explain why, for me, the `set_cover_merge` from my active fuzzing directory was bigger than the merge from active fuzzing + qa-assets/main.