RiskyBird

A regular expression author's best friend

Overview

Regular expressions are notoriously hard to get right. When you write a new expression, it is hard for reviewers to read it and assert with confidence that it is correct. Tweaking an existing expression often has unintended consequences.

RiskyBird tries to mitigate this by offering a set of tools for software engineers:

  1. A parser: guarantees the expression is well formed.
  2. A pretty printer: helps interpret the regular expression.
  3. A lint engine: catches common mistakes.
  4. A unittest engine: prevents future mishaps.
  5. A collaboration platform: reviewers can add tests and provide feedback.

This project also provides a reusable regular expression parser.
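
As a rough illustration of what the parser and unittest tools check, here is a minimal sketch built on Python's standard `re` module. It is not RiskyBird's implementation (RiskyBird is written in Opa), and the helper names `is_well_formed` and `run_cases` are made up for this example.

```python
import re


def is_well_formed(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def run_cases(pattern, should_match, should_not_match):
    """Return the test strings whose result differs from the expectation."""
    compiled = re.compile(pattern)
    failures = [s for s in should_match if compiled.fullmatch(s) is None]
    failures += [s for s in should_not_match if compiled.fullmatch(s) is not None]
    return failures


print(is_well_formed("(a|b"))                        # unbalanced parenthesis
print(run_cases(r"\d{3}", ["123"], ["12", "1234"]))  # empty list: all cases pass
```

Keeping the expected matches and non-matches next to the expression means a later tweak that changes behavior shows up as a non-empty failure list.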

Why RiskyBird

We love AngryBirds and we wanted a name that starts with R.

See it in action

https://www.quaxio.com/regexp_lint/

Some notes

I haven't found a nice place to put these, so leaving these notes here.

Here are some tips to help you write better regular expressions:

  1. Is the language regular(*)? We have often tried to write regular expressions for languages which are not regular! This always leads to issues down the road. If the language is not regular, you will need to use a Lexer/Grammar.

    (*) regular expression engines actually implement some features which cannot be described by regular languages (in the formal sense), but you get my point.

  2. Can I use a less powerful but faster library (i.e. pattern matching instead of regular expressions)?

  3. Am I trying to match a URI (or part of one)? It is extremely hard to get URI parsing right, and different browsers interpret URIs differently. The only way to get this right is to split the URI into parts (protocol, user, password, domain, port, path, etc.), run the desired checks on the parts, and then rebuild a new URI with the proper escaping applied to each part. Again, we have libraries to do this!

    If you aren't convinced this is required, go read the browser security handbook or the Tangled Web.

  4. Don't be lazy. If you know your expression should match the beginning of a string, add the ^ anchor. If you are expecting a literal ".", use \. instead of the dot metacharacter. Use non-capturing groups when you don't need to capture a group. Etc.

  5. Different engines / different programming languages behave in slightly different ways (what were you expecting?). Don't just copy-paste regular expressions from one language into another!

    Proof:

    • in JavaScript: new RegExp(/^[\\abc]+$/).test('abc\\'); → true
    • in PHP: preg_match("/^[\\abc]+$/", "abc\\"); → false
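
The difference in the proof above most likely comes from string escaping rather than from the regex engines themselves: the JavaScript version uses a regex literal, while the PHP version passes through a double-quoted string that collapses `\\` into a single backslash before the engine sees it. The same effect can be reproduced in Python alone, as a small sketch using raw and plain string literals:

```python
import re

subject = "abc\\"  # the four characters a, b, c and one backslash

# Raw string: the engine receives [\\abc], a class holding \, a, b and c.
raw_pattern = r"^[\\abc]+$"

# Plain string: Python collapses \\ into one backslash first, so the engine
# receives [\abc], where \a is the bell character (0x07).
plain_pattern = "^[\\abc]+$"

raw_result = re.search(raw_pattern, subject) is not None
plain_result = re.search(plain_pattern, subject) is not None
print(raw_result, plain_result)  # True False
```

The two patterns look identical in source code, yet only the first one accepts the letter "a".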

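The split, check, and rebuild approach from tip 3 can be sketched with Python's standard `urllib.parse` module. This is a simplified illustration, not a complete URI validator: the `ALLOWED_HOSTS` set and the `rebuild_if_allowed` helper are hypothetical, and the policy (http/https only, no credentials, fragment dropped) is an assumption made for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

ALLOWED_HOSTS = {"example.com"}  # hypothetical allow-list


def rebuild_if_allowed(uri):
    """Split the URI, check each part, and rebuild it; return None if rejected."""
    try:
        parts = urlsplit(uri)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https"):
        return None
    if parts.username is not None or parts.password is not None:
        return None
    if parts.hostname not in ALLOWED_HOSTS:
        return None
    netloc = parts.hostname if port is None else f"{parts.hostname}:{port}"
    # Re-escape each part; "%" stays safe so existing escapes are not doubled.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, netloc, path, query, ""))


print(rebuild_if_allowed("https://example.com/a b?x=1"))
print(rebuild_if_allowed("https://example.com@evil.test/"))
```

The second call returns None because the parser reports `example.com` as the user name and `evil.test` as the host, a distinction that a hand-written expression such as `^https://example\.com` would miss.
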
Code Layout

  • riskybird.opa: web code
  • riskybird_parser.opa: regular expression parser
  • riskybird_string_printer.opa: pretty printer
  • riskybird_xhtml_printer.opa: pretty printer
  • riskybird_lint.opa: lint engine
  • riskybird_eval.opa: evaluation engine
  • riskybird_unittest.opa: unittests

License

RiskyBird is distributed under the AGPL license.

Contributors

  • alokmenghrajani
  • pikatchu

riskybird's Issues

Improve handling of back reference

  • \1(x) should trigger a warning saying that \1 is used before it exists.
  • (\1) should trigger a warning saying that \1 refers to itself.
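
One possible shape for these two checks is a single left-to-right scan that tracks which capturing groups are currently open and which are already closed. The Python sketch below is an assumption about how such a lint could work, not code from `riskybird_lint.opa`; the name `lint_backrefs` is hypothetical, only single-digit back references are handled, and every group starting with `(?` is treated as non-capturing.

```python
def lint_backrefs(pattern):
    """Warn about back references that precede or sit inside their own group."""
    warnings = []
    opened = 0        # number of capturing groups opened so far
    stack = []        # group number per open "(", or None if non-capturing
    closed = set()    # numbers of capturing groups that are already closed
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "123456789" and not in_class:
                n = int(nxt)
                if n in stack:
                    warnings.append(f"\\{n} refers to itself")
                elif n not in closed:
                    warnings.append(f"\\{n} is used before it exists")
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            if pattern.startswith("(?", i):
                stack.append(None)
            else:
                opened += 1
                stack.append(opened)
        elif ch == ")" and stack:
            n = stack.pop()
            if n is not None:
                closed.add(n)
        i += 1
    return warnings


print(lint_backrefs(r"\1(x)"))
print(lint_backrefs(r"(\1)"))
```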

Display ranges from low to high

Currently, ranges are shown in what IMO is a counter-intuitive way. For example: ? is shown as 1-0, * is shown as ∞-0, and {3,5} is shown as 5-3.

I think such ranges should instead be shown as 0-1, 0-∞, and 3-5, respectively.
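
The proposed display could be produced by normalizing every quantifier to a (low, high) pair and always printing the low bound first. A minimal Python sketch, with hypothetical helper names `quantifier_bounds` and `format_range`:

```python
import math
import re

_BRACES = re.compile(r"\{(\d+)(,(\d*))?\}")


def quantifier_bounds(q):
    """Map a quantifier such as ?, *, + or {3,5} to a (low, high) pair."""
    simple = {"?": (0, 1), "*": (0, math.inf), "+": (1, math.inf)}
    if q in simple:
        return simple[q]
    m = _BRACES.fullmatch(q)
    if m is None:
        raise ValueError(f"not a quantifier: {q!r}")
    low = int(m.group(1))
    if m.group(2) is None:   # {n}
        return (low, low)
    if m.group(3) == "":     # {n,}
        return (low, math.inf)
    return (low, int(m.group(3)))


def format_range(q):
    """Render the bounds from low to high, using ∞ for an open upper bound."""
    low, high = quantifier_bounds(q)
    return f"{low}-{'∞' if high == math.inf else high}"


for q in ["?", "*", "{3,5}"]:
    print(q, format_range(q))
```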

Split short-but-hard-to-read warning for character range

Currently the "non optimal character range" warning is triggered both for ranges that can be written more compactly and for ranges that are hard to read in their most compact form:

A shorter/cleaner way to write {s2} is {s1}

This can be confusing, since those two goals can be opposed to each other. I'd suggest splitting that warning into two: one for unnecessarily verbose ranges that can be compressed, like

[A-NM-Z] --> [A-Z]

...and another for excessively compact ranges that should be expanded for readability, like

[	 ] --> [\x09\x20]
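
Both rewrites can be expressed as separate passes over a list of (low, high) character pairs, which is one way to see why they deserve separate warnings. The Python sketch below is an illustration under that assumption; `merge_ranges`, `show`, and `render_class` are hypothetical names, and characters that are special inside a class (such as `]` or `-`) are not escaped here.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (low, high) character ranges."""
    merged = []
    for low, high in sorted((ord(a), ord(b)) for a, b in ranges):
        if merged and low <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], high)
        else:
            merged.append([low, high])
    return [(chr(a), chr(b)) for a, b in merged]


def show(ch):
    """Spell out spaces and non-printable characters as \\xNN escapes."""
    if ch.isprintable() and ch != " ":
        return ch
    return f"\\x{ord(ch):02x}"


def render_class(ranges):
    """Render a character class in compressed, readable form."""
    parts = []
    for low, high in merge_ranges(ranges):
        parts.append(show(low) if low == high else f"{show(low)}-{show(high)}")
    return "[" + "".join(parts) + "]"


print(render_class([("A", "N"), ("M", "Z")]))    # [A-Z]
print(render_class([("\t", "\t"), (" ", " ")]))  # [\x09\x20]
```

A lint could compare the original text against the output of each pass and report "can be compressed" when only merging changed it, and "should be expanded for readability" when only the escaping changed it.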
