Git Product home page Git Product logo

Comments (42)

geofflangdale avatar geofflangdale commented on May 4, 2024 1

I have fixed the issue that caused ids 0/1/2/4 in 3d30fd5

Leaving issue open, but we should split off the remaining crashers as a separate issue (once I can replicate) as adding fuzzing is a legit issue in itself regardless of whether we fix these particular sets.

from simdjson.

lemire avatar lemire commented on May 4, 2024 1

@everestmz We might take you up on it.

from simdjson.

lemire avatar lemire commented on May 4, 2024

We do not support any input: the input must be aligned and padded (as happens if you use get_corpus). However, we need to support fuzz testing.

from simdjson.

StephanDollberg avatar StephanDollberg commented on May 4, 2024

" " even crashes when using get_corpus as per the single header example code.

#0  find_structural_bits (buf=buf@entry=0x55d5260d8140 ' ' <repeats 40 times>, len=len@entry=40, pj=...) at simdjson.cpp:801
#1  0x000055d5253c4532 in json_parse (buf=buf@entry=0x55d5260d8140 ' ' <repeats 40 times>, len=len@entry=40, pj=..., 
    reallocifneeded=reallocifneeded@entry=true) at simdjson.cpp:343
#2  0x000055d5253c4c81 in build_parsed_json (buf=0x55d5260d8140 ' ' <repeats 40 times>, len=40, 
    reallocifneeded=<optimized out>) at simdjson.cpp:365
#3  0x000055d5253b890e in build_parsed_json (reallocifneeded=true, len=40, buf=0x55d5260d8140 ' ' <repeats 40 times>)
    at simdjson.h:37309
#4  build_parsed_json (reallocifneeded=true, Python Exception <class 'gdb.error'> value has been optimized out: 
s=<synthetic pointer>) at simdjson.h:37309
#5  main (argc=<optimized out>, argv=<optimized out>) at main.cpp:7

from simdjson.

StephanDollberg avatar StephanDollberg commented on May 4, 2024

Running afl for around 15 minutes finds another 10 unique crashes:
crashes.zip

This is all with the single header example code.

from simdjson.

lemire avatar lemire commented on May 4, 2024

Great. Let us fix these issues.

from simdjson.

lemire avatar lemire commented on May 4, 2024

Added the 'bug' label.

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

What's the path into the code being used here - i.e. which main? I can replicate ids 0,1,2 and 4 as crashers by running run benchmark/parse on these files, but the rest are simply failing out in "stage 34" (which I will fix to say "stage 2").

from simdjson.

lemire avatar lemire commented on May 4, 2024

Fantastic.

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

Oops - sorry, apparently I can't read: "as per the single header example code" should have answered my question regarding "which main".

from simdjson.

lemire avatar lemire commented on May 4, 2024

The single-header version was updated as well.

from simdjson.

StephanDollberg avatar StephanDollberg commented on May 4, 2024

stopped afl after around 20h now. Full set of unique crashes (70):

crashes_20h.zip

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

I tried all of these with amalgamation_demo and all are flagged as "not_valid" (i.e. not crashers) with amalgamation_demo compiled both normally and with optimizations switched off. Can you upload a precis of your procedure for running afl so we can redo it at will, as well as verifying that the current set of crashers is fixed in your setup as well as my own? I don't want to close this just in case there's something wacky about one of our setups that is producing inconsistent results.

I'd also like to figure out a regular procedure to run fuzz testing, even just for the rudimentary "checking for crashers" task (better still: finding departures from 'ground truth').

from simdjson.

StephanDollberg avatar StephanDollberg commented on May 4, 2024

@geofflangdale I assume that's with your fixes applied? That's a good sign :)

Really just very basic afl usage:

afl-g++ -std=c++17 -g -march=native -o main single_header_example.cpp
afl-fuzz -i in -o out ./main @@

where the in directory contains:

$ more in/test.json 
{'hello':'bye'}

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

I did a couple 1 hour runs, one with a similar 'in' file to yours, and one with a copy of our 'short' JSON examples. Neither found any crashers. I'm hoping you can verify that on your platform that the crashers are also gone away.

from simdjson.

StephanDollberg avatar StephanDollberg commented on May 4, 2024

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

Removed the 'bug' label as there is no longer an actual bug associated with this issue to my knowledge. As per Stephan's advice, am running a longer test - it is currently at 27 hours with no crashes or hangs.

from simdjson.

lemire avatar lemire commented on May 4, 2024

@geofflangdale Any chance we can get fuzz testing as part of our official workflow... so that we can run it periodically?

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

I've been fuzz testing for 2 days with no further effect, but the 'cycles done' hasn't gone green (no crashes either). So this is CPU intensive in a fashion that suggests that we aren't going to be able to lean on unpaid public infrastructure (rightly).

I think we could create a 'make fuzz' target that seeds a few reasonable files in an 'in' directory and then fires up afl-fuzz in the fashion that Stephan describes, above.

A more serious approach would ground-truth test us against RapidJSON, but that's a significant project.

from simdjson.

everestmz avatar everestmz commented on May 4, 2024

Hey! We just launched our fuzzing as a service tool the other day (https://fuzzbuzz.io), and we're providing free team accounts with a bunch of free compute power to open-source projects that need it. I'd be happy to help get this project set up there - let me know if that sounds interesting!

from simdjson.

lemire avatar lemire commented on May 4, 2024

Assigned to @ioioioio.

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

@geofflangdale

A more serious approach would ground-truth test us against RapidJSON, but that's a significant project.

Are you thinking of letting the two parsers work on the same input, then comparing the parsed result? That is differential fuzzing and it is done for some crypto library integral operations on oss-fuzz. I think it has paid off well there!

For "normal use", is RapidJSON and simdjson supposed to give the same answer? I am thinking of how floating point is handled. Comparing anything else than bit exact is tricky to implement correctly, especially since the structure of the json document can be different (assuming I understand correctly what you want to do).

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

Thanks for the prompt reply, I agree with your reasoning!

I think there is some low hanging fruit to pick first. @Yzoni would you mind sharing your fuzzer? I recently made pull request #348 , and you may have used an api entry point I missed.

from simdjson.

lemire avatar lemire commented on May 4, 2024

@pauldreik @geofflangdale

These questions point to why I said it was a "significant project". Ensuring that we have the "same" result as RapidJSON would require us to be able to programmatically inspect what RapidJSON did vs what we did and compare things to ensure that they were "equivalent".

This problem has been solved and we already have the code.

https://github.com/ioioioio/fuzzyjson

The work was never integrated into the simdjson project, but it works. We used it successfully to gain confidence in simdjson.

The short story is that it did not catch bugs in simdjson... short of a few minor issues. It did catch a couple of minor issues with RapidJSON and it allowed us to find out that sajson does not do genuine UTF-8 validation.

from simdjson.

geofflangdale avatar geofflangdale commented on May 4, 2024

from simdjson.

lemire avatar lemire commented on May 4, 2024

Agreed, this is marked as "help needed". We need help.

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

Thanks for the links to the differential json fuzzer, I will check it out!

Agreed, this is marked as "help needed". We need help.

I volunteer to try get simdjson on to oss-fuzz. The bar for getting it accepted according to
https://google.github.io/oss-fuzz/getting-started/accepting-new-projects/
is

To be accepted to OSS-Fuzz, an open-source project must have a significant user base and/or be critical to the global IT infrastructure.

The great thing with oss-fuzz is that they provide all the boiler plate for building/running/bug handling/reproduce. They also provide abstraction for the fuzzing engine, meaning they can switch out the fuzzing engine without having to modify the source for the fuzzed library.

I have already gotten fmt into oss-fuzz so I know the tools/process and it would not be much work - the fuzzers added in #348 already run and the problems they revealed are already fixed.

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

Thanks for the links to the differential json fuzzer, I will check it out!

I had a look, very interesting project! It generates input randomly, but there is no feedback loop if I understand correctly. For efficient fuzzing, that is necessary. I think parts can be reused for writing a mutational (feedback) differential fuzzer, in particular how to compare the output.

Dropping files to disk each iteration is also not good for performance (or the life of my SSD :-)). Things gets easier with using libFuzzer, it will write files to disk only when there is a problem or an interesting input is found.

from simdjson.

lemire avatar lemire commented on May 4, 2024

@pauldreik

One thing it does offer is a clean way to compare the parsed trees between simdjson and RapidJSON.

Regarding infrastructure, I have Internet-accessible servers that I could dedicate to fuzz testing for simdjson. I have idle servers as it is.

from simdjson.

lemire avatar lemire commented on May 4, 2024

I volunteer to try get simdjson on to oss-fuzz

Good!

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

I volunteer to try get simdjson on to oss-fuzz

Good!

Nice, will give it a go shortly!

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

I have made a first attempt. To try it out, checkout https://github.com/pauldreik/oss-fuzz/tree/simdjson_step1

The full documentation is at https://google.github.io/oss-fuzz/getting-started/new-project-guide/

git checkout https://github.com/pauldreik/oss-fuzz/tree/simdjson_step1
export PROJECT_NAME=simdjson
cd oss-fuzz
python infra/helper.py build_image $PROJECT_NAME # <-- answer yes
python infra/helper.py build_fuzzers --sanitizer  address $PROJECT_NAME
python infra/helper.py check_build $PROJECT_NAME
# now try running one of the fuzzers
python infra/helper.py run_fuzzer $PROJECT_NAME fuzz_parser

I had to disable the minify fuzzer, because it did not pass the check build test. Is it perhaps using so much simd magic that libFuzzer can't understand what is going on? I have no idea. This is what the check_build step says above, if it is still built:

BAD BUILD: /out/fuzz_minify seems to have only partial coverage instrumentation.

@lemire could you please review the https://github.com/pauldreik/oss-fuzz/tree/simdjson_step1 branch and https://github.com/pauldreik/simdjson/tree/paul/ossfuzz_step1 in case there is something you do not want to merge? We do not need to merge at this time, I think it is clever to start small and let oss-fuzz point to my branch so I can fix build issues etc myself, and then once it is working and it can be merged to master, we will repoint ossfuzz to the master branch in this repo instead of my fork. I used this approach for libfmt and it worked perfectly.

My intent is that, as per the fuzzing best practices, the fuzzers will be built by default (but without fuzzing engine and coverage) so that they do not bit rot.
Ideally, the fuzzers should be executed in CI just to make sure they still work and don't segfault etc. Is there a sanitizer build in CI?

I also wonder about the choice of "simd backend" (sorry if I use the wrong word), is it a good idea to build the fuzzers in flavours that use different backends?

from simdjson.

lemire avatar lemire commented on May 4, 2024

@pauldreik

Looks great to me.

If you are fuzzing the SIMD minifier, then I would be that your troubles come from the super large lookup table we are using (this is going away in the future). It may just be that the code is too reliant on a lookup table. I don't expect that the SIMD instructions themselves are the problem.

Can you have a look at #186 ?

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

I made an experiment with a CI job, running as a Github action.

It does the following

  • build the fuzzers in several variants: with/without sanitizers, with/without avx.
  • build the fuzzers in reproduce mode (no fuzzing instrumentation)
  • download the corpus from persistent storage
  • run the fastest fuzzer build (release mode, without sanitizers) to expand the corpus, for a short time. (The fastest fuzzer runs at approx. 200 k iterations per second)
  • run the other fuzzer builds on the expanded corpus, each for a short while. Any new test cases are accumulated on to the next fuzzer. These fuzzers are slower, around 10-20 k iterations per second.
  • run all the corpus on the fuzzers built in reproduce mode, executed with valgrind
  • minimize the corpus
  • if the branch is master, replace the corpus in persistent storage

All this happens in approximately 5 minutes, these fuzz targets are very fast.

The approach above will catch easy to find errors. Since the corpus is accumulated over time, it will improve over time. From my experience with fuzzing, a majority of bugs are found either within minutes, or not at all.
I believe this CI job should run on pull requests, and regularly on master (each 12 hours or so)

An example of one of these runs can be seen here:
https://github.com/pauldreik/simdjson/commit/abfb2ea4711a369fbfad6c64bced487320da026f/checks?check_suite_id=293127642

The action is here: https://github.com/pauldreik/simdjson/blob/paul/workflows-try1/.github/workflows/fuzzers.yml

What remains to do is do ensure that any detected errors actually mark the build as failed. For instance, the undefined behaviour sanitizer may need the environment variable set to make it abort on error.

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

The pull request has been merged on oss-fuzz!

from simdjson.

lemire avatar lemire commented on May 4, 2024

@pauldreik thank you!

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

I moved on to the next step in oss-fuzz integration, moving the build scripts into the simdjson repo instead of in oss-fuzz, so it is easier to change and maintain.
google/oss-fuzz#3013

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

It is now possible to see the coverage: https://storage.googleapis.com/oss-fuzz-coverage/simdjson/reports/20191106/linux/src/simdjson/report.html

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

Now when there is pull request fuzz testing in place, and oss-fuzz runs, should we close this issue?

I see several things to improve

  • write more fuzzers and/or builds, guided by the coverage measurement on oss-fuzz (update: see #368)
  • add/reintroduce differential fuzzing as described earlier in this thread(update: see #369)
  • add the initial corpus to the CI job, making sure it works ok (update: see #370)

but I think these tasks should go in separate github issues/pull requests.

And a big thanks to @lemire for fixing all issues, questions and pull requests so quickly, so the fuzzers could be developed without waiting!

from simdjson.

lemire avatar lemire commented on May 4, 2024

@pauldreik Can you close this one and open the distinct issues?

from simdjson.

pauldreik avatar pauldreik commented on May 4, 2024

Fixed, closing.

from simdjson.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.