tom-lord / regexp-examples

Generate strings that match a given regular expression

License: MIT License

Ruby 100.00%

ruby regexp mri data-generation random-string

regexp-examples's Introduction

regexp-examples

Extends the Regexp class with the methods: Regexp#examples and Regexp#random_example

Regexp#examples generates a list of all* strings that will match the given regular expression.

Regexp#random_example returns one random string (from all possible strings!!) that matches the regex.

* If the regex has an infinite number of possible strings that match it, such as /a*b+c{2,}/, or a huge number of possible matches, such as /.\w/, then only a subset of these will be listed. For more detail on this, see configuration options.

If you'd like to understand how/why this gem works, please check out my blog post about it.

Usage

Regexp#examples

/a*/.examples #=> ['', 'a', 'aa']
/ab+/.examples #=> ['ab', 'abb', 'abbb']
/this|is|awesome/.examples #=> ['this', 'is', 'awesome']
/https?:\/\/(www\.)?github\.com/.examples #=> ['http://github.com',
  # 'http://www.github.com', 'https://github.com', 'https://www.github.com']
/(I(N(C(E(P(T(I(O(N)))))))))*/.examples #=> ["", "INCEPTION", "INCEPTIONINCEPTION"]
/\x74\x68\x69\x73/.examples #=> ["this"]
/what about (backreferences\?) \1/.examples
  #=> ['what about backreferences? backreferences?']
/
  \u{28}\u2022\u{5f}\u2022\u{29}
  |
  \u{28}\u{20}\u2022\u{5f}\u2022\u{29}\u{3e}\u2310\u25a0\u{2d}\u25a0\u{20}
  |
  \u{28}\u2310\u25a0\u{5f}\u25a0\u{29}
/x.examples #=> ["(•_•)", "( •_•)>⌐■-■ ", "(⌐■_■)"]
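
One way to sanity-check lists like the ones above is to confirm that every listed string matches its pattern in full, rather than merely containing a match. The following is a minimal sketch using only the standard library; `full_match?` is a hypothetical helper introduced here for illustration and is not part of the gem.

```ruby
# Hypothetical helper (not part of regexp-examples): true only if the
# whole string matches the pattern, not just a substring of it.
def full_match?(regexp, string)
  anchored = Regexp.new("\\A(?:#{regexp.source})\\z", regexp.options)
  anchored.match?(string)
end

# Patterns and example lists copied from the usage section above.
listed = [
  [/a*/, ['', 'a', 'aa']],
  [/ab+/, ['ab', 'abb', 'abbb']],
  [/this|is|awesome/, ['this', 'is', 'awesome']],
  [/what about (backreferences\?) \1/, ['what about backreferences? backreferences?']]
]

all_valid = listed.all? do |regexp, strings|
  strings.all? { |string| full_match?(regexp, string) }
end

puts all_valid # prints true
```

The pattern is wrapped in a non-capturing group before anchoring, so that alternations such as /this|is|awesome/ are anchored as a whole and backreference numbering is left unchanged.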

Regexp#random_example

Obviously, you will get different (random) results if you try these yourself!

/\w{10}@(hotmail|gmail)\.com/.random_example #=> "[email protected]"
/5[1-5][0-9]{14}/.random_example #=> "5224028604559821" (A valid MasterCard number)
/\p{Greek}{80}/.random_example
  #=> "ΖΆΧͷᵦμͷηϒϰΟᵝΔ΄θϔζΌψΨεκᴪΓΕπι϶ονϵΓϹᵦΟπᵡήϴϜΦϚϴϑ͵ϴΉϺ͵ϹϰϡᵠϝΤΏΨϹϊϻαώΞΰϰΑͼΈΘͽϙͽξΆΆΡΡΉΓς"
/written by tom lord/i.random_example #=> "WrITtEN bY tOM LORD"

Supported ruby versions

MRI 2.4.0 (oldest non-EOL version) --> 3.0.0 (latest stable version)

MRI 2.0.0 --> 2.3.x were supported until version 1.5.0 of this library. Support was dropped primarily because of the need to use RbConfig::CONFIG['UNICODE_VERSION'], which was added to ruby version 2.4.0.

MRI versions ≤ 1.9.3 were never supported by this library. This is primarily because MRI 2.0.0 introduced a new regexp engine (Oniguruma was replaced by Onigmo). For example, named properties like /\p{Alpha}/ are illegal syntax on MRI 1.9.3. Whilst most of this gem could be made to work with MRI 1.9.x (or even 1.8.x), I considered the changes too significant to implement backwards compatibility (especially since long-term support for MRI 1.9.3 has long ended).

Other implementations, such as JRuby, could probably work fine - but I haven't fully tried/tested it. Pull requests are welcome.

Installation

Add this line to your application's Gemfile:

gem 'regexp-examples'

And then execute:

$ bundle

Or install it yourself as:

$ gem install regexp-examples

Supported syntax

Short answer: Everything is supported, apart from "irregular" aspects of the regexp language -- see impossible features.

Long answer:

All forms of repeaters (quantifiers), e.g. /a*/, /a+/, /a?/, /a{1,4}/, /a{3,}/, /a{,2}/
- Reluctant and possessive repeaters work fine, too, e.g. /a*?/, /a*+/
Boolean "Or" groups, e.g. /a|b|c/
Character sets, e.g. /[abc]/ - including:
- Ranges, e.g. /[A-Z0-9]/
- Negation, e.g. /[^a-z]/
- Escaped characters, e.g. /[\w\s\b]/
- POSIX bracket expressions, e.g. /[[:alnum:]]/, /[[:^space:]]/
  - ...Taking the current ruby version into account - e.g. the definition of /[[:punct:]]/ changed in version 2.4.0.
- Set intersection, e.g. /[[a-h]&&[f-z]]/
Escaped characters, e.g. /\n/, /\w/, /\D/ (and so on...)
Capture groups, e.g. /(group)/
- Including named groups, e.g. /(?<name>group)/
- And backreferences(!!!), e.g. /(this|that) \1/ /(?<name>foo) \k<name>/
- ...even for the more "obscure" syntax, e.g. /(?<future>the) \k'future'/, /(a)(b) \k<-1>/
- ...and even if nested or optional, e.g. /(even(this(works?))) \1 \2 \3/, /what about (this)? \1/
- Non-capture groups, e.g. /(?:foo)/
- Comment groups, e.g. /foo(?#comment)bar/
- Absent operator groups, e.g. /(?~exp)/. This feature is available in ruby version >= 2.4.1. However, support in this gem is limited.
Control characters, e.g. /\ca/, /\cZ/, /\C-9/
Escape sequences, e.g. /\x42/, /\x5word/, /#{"\x80".force_encoding("ASCII-8BIT")}/
Unicode characters, e.g. /\u0123/, /\uabcd/, /\u{789}/
Octal characters, e.g. /\10/, /\177/
Named properties, e.g. /\p{L}/ ("Letter"), /\p{Arabic}/ ("Arabic character") , /\p{^Ll}/ ("Not a lowercase letter"), /\P{^Canadian_Aboriginal}/ ("Not not a Canadian aboriginal character")
- ...Even between different ruby versions!! (e.g. /\p{Arabic}/.examples(max_group_results: 999) will give you a different answer in ruby v2.1.x and v2.2.x)
Arbitrarily complex combinations of all the above!
Regexp options can also be used:
- Case insensitive examples: /cool/i.examples #=> ["cool", "cooL", "coOl", "coOL", ...]
- Multiline examples: /./m.examples #=> ["\n", "a", "b", "c", "d"]
- Extended form examples: /line1 #comment \n line2/x.examples #=> ["line1line2"]
- Options toggling supported: /before(?imx-imx)after/, /before(?imx-imx:subexpr)after/
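
The ordering shown for the case insensitive example above ("cool", "cooL", "coOl", "coOL", ...) is what you get from taking every lower/upper combination, letter by letter. The sketch below reproduces that ordering with the standard library only; `case_variants` is a hypothetical helper, and the gem's own implementation may work differently.

```ruby
# Hypothetical helper (not part of regexp-examples): every combination of
# lower and upper case for each character of the given word.
def case_variants(word)
  return [''] if word.empty?

  # For each character, the distinct choices (digits etc. have only one).
  choices = word.chars.map { |char| [char.downcase, char.upcase].uniq }
  head, *tail = choices
  head.product(*tail).map(&:join)
end

variants = case_variants('cool')
puts variants.first(4).inspect # ["cool", "cooL", "coOl", "coOL"]
puts variants.length           # 16 (2 choices for each of 4 letters)
```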

Configuration Options

When generating examples, the gem uses 3 configurable values to limit how many examples are listed:

max_repeater_variance (default = 2) restricts how many examples to return for each repeater. For example:
- .* is equivalent to .{0,2}
- .+ is equivalent to .{1,3}
- .{2,} is equivalent to .{2,4}
- .{,3} is equivalent to .{0,2}
- .{3,8} is equivalent to .{3,5}
max_group_results (default = 5) restricts how many characters to return for each "set". For example:
- \d is equivalent to [01234]
- \w is equivalent to [abcde]
- [h-s] is equivalent to [hijkl]
- (1|2|3|4|5|6|7|8) is equivalent to [12345]
max_results_limit (default = 10000) restricts the maximum number of results that can possibly be generated. For example:
- /c+r+a+z+y+ * B+I+G+ * r+e+g+e+x+/i.examples.length <= 10000 -- Attempting this will NOT freeze your system, even though (by the above rules) this "should" attempt to generate 117546246144 examples.

Regexp#examples makes use of all these options; Regexp#random_example only uses max_repeater_variance, since the other options are redundant.
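
The equivalences listed above for max_repeater_variance and max_group_results can be written down as two small rules. This is only a sketch of the documented behaviour with the default values; `capped_repeats` and `capped_set` are hypothetical names, and the gem's internals may differ.

```ruby
# Hypothetical helper: the repeat counts that get used for a repeater with
# the given minimum and (optional) maximum.
def capped_repeats(min, max = nil, max_repeater_variance: 2)
  upper = min + max_repeater_variance
  upper = [upper, max].min unless max.nil?
  min..upper
end

# Hypothetical helper: the characters that get used for a "set".
def capped_set(chars, max_group_results: 5)
  chars.first(max_group_results)
end

puts capped_repeats(0).inspect     # .*     -> 0..2
puts capped_repeats(1).inspect     # .+     -> 1..3
puts capped_repeats(2).inspect     # .{2,}  -> 2..4
puts capped_repeats(0, 3).inspect  # .{,3}  -> 0..2
puts capped_repeats(3, 8).inspect  # .{3,8} -> 3..5

puts capped_set(('0'..'9').to_a).join # \d    -> "01234"
puts capped_set(('h'..'s').to_a).join # [h-s] -> "hijkl"
```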

Defining custom configuration values

To use an alternative value, you can either pass the configuration option as a parameter:

/a*/.examples(max_repeater_variance: 5)
  #=> ['', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa']
/[F-X]/.examples(max_group_results: 10)
  #=> ['F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']
/[ab]{10}/.examples(max_results_limit: 64).length == 64 # NOT 1024
/[slow]{9}/.examples(max_results_limit: 9999999).length == 4**9 # i.e. 262144. Warning - this will take a while!
/.*/.random_example(max_repeater_variance: 50)
  #=> "A very unlikely result!"
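
The counts quoted in the comments above follow from simple arithmetic: a set of k characters repeated exactly n times gives k ** n distinct strings. The sketch below checks this using the standard library only, so it does not depend on the gem at all.

```ruby
# Number of distinct strings for a fixed-length repeat of a character set:
# (size of the set) ** (number of repeats).
puts 2**10 # /[ab]{10}/  -> 1024
puts 4**9  # /[slow]{9}/ -> 262144

# The same count for /[ab]{10}/, obtained by actually enumerating the strings.
all_ab = %w[a b].repeated_permutation(10).map(&:join)
puts all_ab.length           # 1024
puts all_ab.first(3).inspect # ["aaaaaaaaaa", "aaaaaaaaab", "aaaaaaaaba"]
```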

Or, set an alternative value within a block:

RegexpExamples::Config.with_configuration(max_repeater_variance: 5) do
  # ...
end

Or, globally set a different default value:

# e.g. In a Rails project, you may wish to place this in
# config/initializers/regexp_examples.rb
RegexpExamples::Config.max_repeater_variance = 5
RegexpExamples::Config.max_group_results = 10
RegexpExamples::Config.max_results_limit = 20000

A sensible use case might be, for example, to generate all 1-5 digit strings:

/\d{1,5}/.examples(max_repeater_variance: 4, max_group_results: 10, max_results_limit: 100000)
  #=> ['0', '1', '2', ..., '99998', '99999']

Configuration Notes

Due to code optimisation, Regexp#random_example runs pretty fast even on very complex patterns. (I.e. it's typically a lot faster than using /pattern/.examples.sample(1).) For instance, the following takes no more than ~ 1 second on my machine:

/.*\w+\d{100}/.random_example(max_repeater_variance: 1000)
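
To see why picking one random example can be much cheaper than listing every example and sampling from the list, consider making a single random choice at each point in a pattern. The sketch below does this by hand for /\w{10}@(hotmail|gmail)\.com/, using only the standard library. It illustrates the general idea and is not a description of the gem's actual algorithm; `random_email_like` and `WORD_CHARS` are hypothetical names.

```ruby
# The ASCII characters matched by \w.
WORD_CHARS = [*('a'..'z'), *('A'..'Z'), *('0'..'9'), '_'].freeze

# Hypothetical helper: builds one string matching
# /\w{10}@(hotmail|gmail)\.com/ without enumerating all possible matches.
def random_email_like
  local_part = Array.new(10) { WORD_CHARS.sample } # one choice per repeat
  provider = %w[hotmail gmail].sample              # one choice per "or" group
  "#{local_part.join}@#{provider}.com"
end

sample = random_email_like
puts sample # different on every run
puts(/\A\w{10}@(hotmail|gmail)\.com\z/.match?(sample)) # prints true
```

Only 11 random choices are made here, whereas the full list of matches would contain 2 * 63**10 strings.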

All forms of configuration mentioned above are thread safe.
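
One common way to make block-scoped configuration thread safe is to keep the overrides in thread-local storage and restore the previous value when the block exits. The sketch below shows that technique in isolation. It is an assumption about how such a feature could be built, not a copy of the gem's implementation, and `SketchConfig` is a hypothetical module.

```ruby
# Hypothetical module (not the gem's RegexpExamples::Config).
module SketchConfig
  DEFAULTS = {
    max_repeater_variance: 2,
    max_group_results: 5,
    max_results_limit: 10_000
  }.freeze
  KEY = :sketch_config_overrides

  # Defaults merged with whatever overrides are active in this thread.
  def self.current
    DEFAULTS.merge(Thread.current[KEY] || {})
  end

  # Applies the overrides for the duration of the block only.
  def self.with_configuration(**overrides)
    previous = Thread.current[KEY]
    Thread.current[KEY] = (previous || {}).merge(overrides)
    yield
  ensure
    Thread.current[KEY] = previous # restored even if the block raises
  end
end

inside = nil
other_thread = nil
SketchConfig.with_configuration(max_repeater_variance: 5) do
  inside = SketchConfig.current[:max_repeater_variance]
  other_thread = Thread.new { SketchConfig.current[:max_repeater_variance] }.value
end

puts inside                                        # 5
puts other_thread                                  # 2 (override not visible elsewhere)
puts SketchConfig.current[:max_repeater_variance]  # 2 (restored after the block)
```

Note that Thread.current[] is, strictly speaking, fiber-local in Ruby, so the override is also invisible to other fibers on the same thread.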

Bugs and TODOs

There are no known major bugs with this library. However, there are a few obscure issues that you may encounter.

All known bugs/missing features are documented in GitHub. Please discuss known issues there, or raise a new issue if required. Pull requests are welcome!

Some of the most obscure regexp features are not even mentioned in the ruby docs. However, full documentation on all the intricate obscurities in the ruby (version 2.x) regexp parser can be found here.

Impossible features ("illegal syntax")

The following features in the regex language can never be properly implemented into this gem because, put simply, they are not technically "regular"! If you'd like to understand this in more detail, check out what I had to say in my blog post about this gem.

Using any of the following will raise a RegexpExamples::IllegalSyntax exception:

Lookarounds, e.g. /foo(?=bar)/, /foo(?!bar)/, /(?<=foo)bar/, /(?<!foo)bar/
Anchors (\b, \B, \G, ^, \A, $, \z, \Z), e.g. /\bword\b/, /line1\n^line2/
- Anchors are really just special cases of lookarounds!
- However, a special case has been made to allow ^, \A and \G at the start of a pattern, and to allow $, \z and \Z at the end of a pattern. In such cases, the characters are effectively just ignored.
Subexpression calls (\g), e.g. /(?<name> ... \g<name>* )/

(Note: Backreferences are not really "regular" either, but I got these to work with a bit of hackery.)

Contributing

Fork it ( https://github.com/tom-lord/regexp-examples/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request
Don't forget to add tests!!

regexp-examples's Issues

Empty regexp examples

//.examples or Regexp.new('').examples results in:

NoMethodError: undefined method `empty?' for nil:NilClass

Should this be returning [''] instead?

Just trying to cover all scenarios! Thanks for the great work.

Parsing of nested repeat operators

Nested repeat operators are incorrectly parsed, e.g. /b{2}{3}/ - which "should" be interpreted like /b{6}/.

However, there is probably no reason to ever write regexes like this! This behaviour is undocumented by both the language and regexp engine; there is no test for its expected behaviour at the time of writing - therefore such behaviour should not be relied upon.

Support for use without class extension

It would be nice to offer an option to use this gem without actually adding methods to the Regexp class. Something along the lines of:

require 'regexp-examples/module'

RegexpExamples.examples(/.../)
RegexpExamples.random_example(/.../)

As opposed to the current, "default" behaviour:

require 'regexp-examples'

/.../.examples
/.../.random_example

[bug] Cannot generate examples for /__.__/

Hi,

I found an issue: the #examples method cannot generate examples for the regexp /__.__/.

irb(main):017:0> /__.__/.examples
=> []

Note:

Examples for the following regexps can be generated.

/____/
/_._/
/_.__/
/__._/

irb(main):018:0> /____/.examples
=> ["____"]
irb(main):019:0> /_._/.examples
=> ["_a_", "_b_", "_c_", "_d_", "_e_"]
irb(main):020:0> /_.__/.examples
=> ["_a__", "_b__", "_c__", "_d__", "_e__"]
irb(main):021:0> /__._/.examples
=> ["__a_", "__b_", "__c_", "__d_", "__e_"]

My environment:

ruby 2.5.3p105 (2018-10-18 revision 65156) [x64-mingw32]
regexp-examples (1.4.4)

Support for conditional capture groups

Conditional capture groups, e.g. /(group1)? (?(1)yes|no)/.examples are not yet supported.

This example should return: ["group1 yes", " no"].

This has been a feature of the ruby language since v2.0.0, although it is a little-known/rarely used language feature.

Note that unfortunately, a common use case is to use conditional capture groups in conjunction with look-arounds - which this library cannot support.

Missing named properties

The official ruby documentation does not include a comprehensive list of all named properties supported by the language. Some examples:

/\p{Age=6.0}/
/\p{In Miscellaneous Mathematical Symbols-B}/
/\p{Transport and Map Symbols}/
/\p{Emoji}/ # <-- A valid unicode property name, but NOT(!!) supported by ruby

Thankfully, the onigmo docs do provide a full list (but not all of these are supported by the latest ruby!)

Possible paths to take:

Refresh the db/*.pstore files with a more comprehensive list
Has this problem been solved before? Research other libraries.
Consider directly referencing RFCs or similar, rather than dynamically generating the lists? (Is this practical?)

Also worth noting:

This library does not yet "officially" support jruby, because the test suite fails in relation to named properties. (The list supported by this implementation differs to MRI.) Maybe try wrapping the tests in a rescue SyntaxError... with caution!! (Define a list of known errors; don't just rescue blindly.)
Arbitrary whitespace, underscores and hyphens can be included in unicode property names. This library does not yet account for this.
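
As a rough illustration of the last point, property names could be compared after stripping whitespace, underscores and hyphens. The sketch below is hypothetical: `normalize_property_name` is not part of this library, and treating names as case-insensitive is an extra assumption made here.

```ruby
# Hypothetical helper: reduces a unicode property name to a canonical form
# so that differently punctuated spellings compare as equal.
def normalize_property_name(name)
  name.downcase.gsub(/[\s_-]/, '')
end

spellings = ['Canadian_Aboriginal', 'canadian aboriginal', 'Canadian-Aboriginal']
puts spellings.map { |name| normalize_property_name(name) }.uniq.inspect
# ["canadianaboriginal"]
```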

Randomness regression?

The CDDL tool uses regexp-examples as part of its instance-example generation.

A typical regex might turn up in the CDDL expression

nai = tstr .regexp "\\w+@\\w+(\\.\\w+)+"

A while ago, the CDDL tool generated

"[email protected]"

from that. Now I get instances such as

"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"

Obviously, this is less satisfying as a set of examples.

Any reason why the entropy vanishes at the end of the RE?

(The same is true with

nai = tstr .regexp "[A-Za-z0-9]+@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+"

except that in this case the upper case wins:

"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"
"[email protected]"

)

Predictable behaviour of `max_results_limit` option

Ideally, Regexp#examples should always return up to max_results_limit.

Currently, it usually "aborts" before this limit is reached. (I.e. the exact number of examples generated can be hard to predict, for complex patterns.)

An enhancement would be to modify the algorithm such that results are always generated up to max_results_limit if possible.

Allow the use of empty look-arounds

Due to how this library's algorithm works, look-arounds cannot be supported. As explained in the README, a RegexpExamples::IllegalSyntax will be raised if they are used in a pattern you attempt to generate examples for.

However, similar to how the library currently "ignores" \G, ^ and \A anchors at the beginning of a pattern, or $, \z and \Z at the end, it would also be safe to ignore empty look-arounds!

For example, the following could be considered safe:

/foo|bar(?=)/.random_example #=> "foo"

One motivation for this stems from the fact that Regexp.union returns an empty look-ahead if no patterns are given:

Regexp.union([]) #=> /(?!)/

Therefore one would expect the following to work as follows, rather than raising an exception:

Regexp.union([]).random_example #=> ""
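
The behaviour of Regexp.union with no patterns can be checked with the standard library alone. When weighing up this proposal, it may be worth keeping in mind that /(?!)/ is a negative look-ahead with an empty body, so the regexp engine itself never matches it against any string, whereas an empty positive look-ahead such as (?=) always succeeds.

```ruby
empty_union = Regexp.union([])

puts empty_union.inspect       # /(?!)/
puts empty_union.match?('')    # false
puts empty_union.match?('foo') # false

# An empty positive look-ahead, by contrast, never prevents a match.
puts(/foo|bar(?=)/.match?('foo')) # true
puts(/foo|bar(?=)/.match?('bar')) # true
```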

[\p{Nd}] — unicode categories in character classes

\p{Nd}                 "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢", "٣", "٤", "٥", "٦", "٧", "٨", "٩"
[\p{Nd}]              "p", "{", "N", "d", "}"

These should be equivalent.

Hello, I am trying to use regexp-examples, but I have an irregular expression and the gem says it is not compatible with look-behinds and look-aheads. Can you help me? Is there any solution, or do I have to modify the regex?

Originally posted by @anderstroker in #22 (comment)

use regexp_parser?

hi!

nice gem! and a really good blog post (which is how i found it).

would you be interested in using regexp_parser for the parsing part?

i know you've probably invested a lot of love into the parsing functions.

on the other hand, more people might be able to benefit from the knowledge you've acquired along the way if you're interested in contributing to regexp_parser -- and perhaps some other gems that can be used on their own.

this could also improve regexp-examples a bit. i had a quick look around, and here are just a few things that regexp_parser handles more correctly or would allow to implement more easily:

/\u{10FFFF}/.examples     # => NoMethodError; should be ["\u{10FFFF}"]
/\u{61 62}/.examples      # => NoMethodError; should be ["ab"]
/[[:^ascii]]/.examples    # => []; should be ["\u0080", "\u0081", ...] or so
/\X/.examples             # => ["X"]; should be all kinds of stuff [1]
/(a)\g<1>/.examples       # => easy with regexp_parser's #referenced_expression
/(a)(?(1)b|c)/.examples   # => NoMethodError; doable but complicated [2]
/\0/.examples             # => []; should be ["\u0000"]
/[a-&&]/.examples         # => ["a", "&"]; should be []
/(?u:\w)/.examples        # => NoMethodError; should be unicode word chars
/(?a)[[:word:]]/.examples # => NoMethodError; should be ascii word chars

[1] [2]

then there are some other gems (cough by me cough) that might be helpful and would benefit from contributors:

regexp_property_values reads out the codepoints matched by property or posix expressions directly from Ruby via a C API. might allow getting rid of the versioned codepoint databases in this gem. also works with old Rubies.

character_set calculates matched codepoints, e.g. of bracket expressions, in C. could be a performance boost or at least abstract away that part.

all three of these gems can be seen in use in js_regex.

i'll understand if you want to keep regexp-examples without dependencies, but feel free to take a look around this stuff.

symlink unimplemented when installing

I get this message when installing it on windows 10:
C:/Ruby22/lib/ruby/site_ruby/2.2.0/rubygems/package.rb:388:in `symlink': symlink() function is unimplemented on this machine (NotImplementedError)

Enhancements to absent operator examples

The "best" way to replicate generating examples for absent operator groups is with a negative lookbehind, e.g:

(?~abc) --> (?:.(?<!abc))*

But since look-behinds are irregular, this library cannot support that! A possible workaround would be to replace the group with a repetition of the first letter negated, e.g:

(?~abc) --> (?:[^a]*)

However (!!) this generalisation is not always possible:

(?~\wa|\Wb) --> ???

Therefore, the only 100% reliable option - which is what this gem currently does - is just to match "nothing"!

/(?~abc)/.examples #=> [""]

However, as shown above, this library could, at least, be enhanced to deal with specific scenarios in better ways. But such a strategy needs to be optimised for generalisation and reliability.
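
The claims above can be checked against the regexp engine directly, without the gem. The sketch below assumes ruby >= 2.4.1 (older versions reject the (?~...) syntax) and confirms that the empty string, and the strings that the [^a]* workaround would produce, are all valid matches for the absent operator group.

```ruby
# Anchored so that the whole string must be matched by the absent operator.
ABSENT = /\A(?~abc)\z/

puts ABSENT.match?('')      # true:  what the gem currently generates
puts ABSENT.match?('bbb')   # true:  also producible by (?:[^a]*)
puts ABSENT.match?('ab')    # true:  valid, but NOT producible by (?:[^a]*)
puts ABSENT.match?('xabcx') # false: contains "abc"
```

The third line shows what the workaround gives up: it only ever generates a subset of the valid strings.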

Can a reverse ability be made?

Not an issue, just something I thought about: can a reverse application be made, which takes a string and creates a regex for it?

P.S
Your gem is really cool, thanks for the hard work !

enhancement: negative examples?

Is there a way to generate counter-examples to a regex pattern? This would be incredibly useful for writing tests for format validators. Think validates_format_of for shoulda-matchers.

Conceptually, it would enumerate all strings that don't match the pattern.

Ideal behavior:

/a+/.counter_examples #=> ["", "b", "c", "d", ...]
/hello/.counter_examples #=> ["", "a", "b", "c", ...]
/.*/.counter_examples #=> [] (or nil)

it seems like this is computationally possible at least. the complement of a regular language is also regular, which means that you can enumerate the counter-examples by "negating" the regular expression's deterministic finite automaton representation, and then enumerating examples from that.

https://math.stackexchange.com/questions/685182/complement-of-a-regular-expression
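
For short strings over a small alphabet, the "ideal behavior" above can be approximated by brute force, without building the complement automaton: enumerate every candidate string and keep the ones that fail to match. This is a deliberately simpler technique than the DFA approach described above, and `counter_examples` here is a hypothetical stand-alone method, not something the gem provides.

```ruby
# Hypothetical brute-force sketch: all strings up to max_length over the
# given alphabet that do NOT fully match the pattern.
def counter_examples(regexp, alphabet: %w[a b c], max_length: 2)
  anchored = Regexp.new("\\A(?:#{regexp.source})\\z", regexp.options)
  candidates = (0..max_length).flat_map do |length|
    # repeated_permutation(0) yields one empty combination, i.e. "".
    alphabet.repeated_permutation(length).map(&:join)
  end
  candidates.reject { |candidate| anchored.match?(candidate) }
end

puts counter_examples(/a+/, max_length: 1).inspect    # ["", "b", "c"]
puts counter_examples(/hello/, max_length: 1).inspect # ["", "a", "b", "c"]
puts counter_examples(/.*/).inspect                   # []
```

The number of candidates grows exponentially with max_length, so this only scales to very short strings.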

negative unicode property via \P is not supported

/\P{Word}/.examples causes an error, but it should in theory be easy to handle, because /\p{^Word}/.examples works ok.