Comments (20)
If this is useful, here's a real version of that algorithm:
import sre_parse
import re
import pytest
def extract_literals(r):
ops = sre_parse.parse(r.pattern)
results = []
extract_literals_from_ops(ops, results)
return results
def extract_literals_from_ops(ops, results):
i = 0
while i < len(ops):
op, val = ops[i]
if op == sre_parse.LITERAL:
start_i = i
while i < len(ops) and ops[i][0] == sre_parse.LITERAL:
i += 1
results.append("".join(chr(c) for _, c in ops[start_i:i]))
continue
elif op == sre_parse.BRANCH:
_, branches = val
for branch in branches:
extract_literals_from_ops(branch, results)
elif op == sre_parse.SUBPATTERN:
_, _, _, sub_ops = val
extract_literals_from_ops(sub_ops, results)
elif op == sre_parse.MAX_REPEAT:
_, _, sub_ops = val
extract_literals_from_ops(sub_ops, results)
elif op == sre_parse.ASSERT or op == sre_parse.ASSERT_NOT:
_, sub_ops = val
extract_literals_from_ops(sub_ops, results)
i += 1
return results
@pytest.mark.parametrize(
("r", "expected"),
[
(r"^abc$", ["abc"]),
(r"abc|def", ["abc", "def"]),
(r"(abc|\d+)", ["abc"]),
(r"(?:abc){3,}", ["abc"]),
(r"(?:abc){,3}", ["abc"]),
(r"(?=abc)", ["abc"]),
(r"(?!abc)", ["abc"]),
(r"(?<=abc)", ["abc"]),
(r"(?<!abc)", ["abc"]),
]
)
def test_extract_literals(r, expected):
actual = extract_literals(re.compile(r))
assert actual == expected
from atheris.
hypothesis.strategies.from_regex()
can help if you want to write a fuzz harness which generates strings matching some regexp, or you could just use it to generate a seed corpus for the fuzzer.
from atheris.
Hi Alex,
That does seem useful!
How do you propose it be integrated into libFuzzer?
from atheris.
In the Tracer
(https://github.com/google/atheris/blob/master/tracer.cc#L269) have it grab self
off the frame, if isinstance(self, regexp)
then run something akin to the logic I suggested.
from atheris.
Oh, I'm asking about the second part: once we have the data, how do we inject it into libFuzzer? To do it as you suggest, I think I'd have to modify libFuzzer to expose the ability to programmatically inject new dictionary events. That's possible, but means that only users with up-to-date Clang will get the feature. Do you have any other ideas?
from atheris.
Oh, sorry I misread you. Can we inject it via memcmp(value, value)
? That basically causes it to go into teh dictionary, right?
from atheris.
Looks like that might provide some benefit. Such comparisons end up in the "TableOfRecentCompares" (max size 32), which based on randomness might end up in a dictionary.
Will probably have to call that every time the regexp is invoked - but can probably cache the extracted data.
from atheris.
As a stop-gap, it's pretty easy to build just the _sre module with coverage. Use the provided patches for linking libFuzzer into CPython, but also do the following.
First, configure it:
CC=clang CXX=clang++ ./configure
Then, find the right line in the generated Makefile:
Modules/_sre.o: $(srcdir)/Modules/_sre.c; $(CC) $(PY_BUILTIN_MODULE_CFLAGS) -c $(srcdir)/Modules/_sre.c -o Modules/_sre.o
Add -fsanitize=fuzzer-no-link
right before -c
Then, run:
make LIBFUZZER_VERSION=path/to/libclang_rt.fuzzer_no_main-x86_64.a -j 100
Finally, use atheris_no_libfuzzer
instead of atheris
.
from atheris.
Seems like it'd also be sensible to send a feature request to libFuzzer to allow dynamic additions to the dictionary.
from atheris.
Already in progress! I've got a patch prepared, which I'll send to LLVM as soon as I've confirmed it works well for our use case.
From my experience, simulating a memcmp(foo, foo) was not effective. It really requires dynamic dictionary injection.
Atheris will be programmed to detect whether it was using a sufficiently new version of libFuzzer. If not, it will print a warning in the event that regular expressions are encountered.,
from atheris.
from atheris.
This has proved to be somewhat effective. Here's the LLVM change: https://reviews.llvm.org/D93879
While somewhat effective, many regular expressions still fail to be reasonably navigated with this strategy. I think this can be improved by computing a string that will match the whole regular expression, and adding that to the dictionary.
from atheris.
from atheris.
If Python used true regular expressions, where everything is representable as a DFA, then it would be as simple as a Dijkstra's walk over the DFA. But, Python supports features like negative lookbehind, which are O(n^2) and cannot be represented in a DFA. It's even possible to write Python regular expressions that can never match a string. For example, a(?<!a)b
matches "the string ab
where b
is not proceeded by a
".
If we exclude lookahead, lookbehind, and a few rarely used features, it should be easy. That might be good for an initial version.
from atheris.
Alright, generating a string that matches a regex and adding it to the dictionary is highly effective.
from atheris.
Fantastic! What benchmark are you using to establish that?
from atheris.
So far, I've been testing whether Atheris can fuzz past a wide variety of regular expressions. With this latest version that generates matches, it can get past any expression I've come up with that doesn't use a zero-width match (e.g. lookahead, lookbehind, or word boundaries).
I've also realized, I can use Atheris to fuzz itself on this. I can write a fuzzer that ensures Atheris can take a regular expression and generate a matching string for any set of regular expression features I support.
However, it appears LLVM is reluctant to add LLVMFuzzerAddToDictionary()
support to libFuzzer. https://reviews.llvm.org/D93879 I'm reluctant to work on this more until that is resolved.
from atheris.
from atheris.
I just pushed a new version that includes experimental support for this. It's opt-in for now, because there are some performance downsides, but it now works!
from atheris.
Let's close this issue, and make it enabled by default at some point :)
from atheris.
Related Issues (20)
- publish wheels for macOS Python 3.10 HOT 1
- Installation doesn't work with clang 15
- WARNING: Failed to find function "__sanitizer_acquire_crash_state". HOT 2
- Coverage for Tensorflow not show up HOT 1
- GCOV instrumentation `.gcda` fails when run by atheris HOT 2
- Hall of fame addition HOT 1
- Help with installation on ARM HOT 3
- Please add a tag for the latest release HOT 3
- Potential file conflict with other Python packages HOT 3
- Instrumenting time long HOT 3
- Aggregate all string literals during instrumentation HOT 1
- Unable to cast Python instance to C++ HOT 8
- Unable to execute target in fork mode HOT 2
- Generate wheels for all platforms via cibuildwheel HOT 3
- "Permission denied" when using -merge=1
- Preloaded libFuzzer doesn't allow to use custom mutator.
- Wrong crash line in new python version
- Atheris 2.2.2 fails to instrument a while loop in Python 3.11.0 HOT 1
- Support RegisterPostProcessor in lib protofuf mutator
- Make use of Github Actions
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from atheris.