MetaRegexp

Note: this project has not been released as a gem, but might be someday.

What is it

A regular expression “pre-processor” that performs the following transformations:

  • Repetition Expansion: expands repeated grouped expressions to enable repeated captures, without manual duplication.

  • Aliases: recognizes and substitutes registered regular expression aliases at any depth (i.e. aliases can contain aliases), including the expansion of any grouped expressions within them.

The following sections cover these transformations in more detail.

Repetition Expansion

With the built-in Regexp class, a regular expression that contains repeated (i.e. quantified) capturing groups only keeps the last match of each group in the resulting MatchData. For example, given the following pattern and input text:

regex = /([a-z]+\s?){1,3}/
input  = "one two three"

The built-in Regexp class produces:

Regexp.new( regex ).match( input ).captures
#=> ["three"]

With MetaRegexp’s repetition expansion, the same expression captures all the grouped expressions in the MatchData:

MetaRegexp.new( regex ).match( input ).captures
#=> ["one ", "two ", "three"]

The example above is deliberately simple (much the same result can be obtained with scan(/\w+/)); it was chosen to illustrate the benefit of expansion quickly. A more elaborate example is presented below.
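For comparison, both the built-in behavior and the scan alternative can be reproduced with plain Ruby, without MetaRegexp. This is only a small illustration of the standard library:

```ruby
regex = /([a-z]+\s?){1,3}/
input = "one two three"

# A quantified capturing group keeps only its final iteration.
last_only = regex.match(input).captures

# String#scan collects every match of a simpler pattern instead.
words = input.scan(/\w+/)

p last_only  # ["three"]
p words      # ["one", "two", "three"]
```

Note that scan returns the words without the trailing whitespace that the grouped pattern captures.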

How Expansion Works

Expansion is accomplished by parsing the given expression (a string or a Regexp object) and collecting information about the grouped expressions within it, along with their minimum and maximum repetitions, into a “tree” structure. This is then recompiled into a new string that is used to create a new Regexp object.

So, given the following regular expression:

/([a-z]+){1,3}/

MetaRegexp transforms it into the following:

/([a-z]+)([a-z]+)?([a-z]+)?/

Now the pattern contains actual groups that will be captured during the matching process. Note how the last two instances of the original pattern are marked as optional with the zero-or-one quantifier ‘?’. The minimum is 1 (required) and the maximum is 3, i.e. the last two matches are optional.

This type of expansion is usually done ‘manually’ by interpolating or typing the desired capture patterns multiple times. In addition to not being DRY, this approach makes the resulting expressions harder to edit and read.
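The transformation itself can be sketched in a few lines of plain Ruby. The helper name expand_group is hypothetical and this is not how MetaRegexp is implemented (it parses the expression into a tree); the sketch only shows the shape of the output for a single group with known bounds:

```ruby
# Hypothetical helper: repeat a group `min` times as required copies,
# then add (max - min) optional copies.
def expand_group(group, min, max)
  required = group * min
  optional = "#{group}?" * (max - min)
  required + optional
end

expanded = expand_group('([a-z]+\s?)', 1, 3)
captures = Regexp.new(expanded).match("one two three").captures

puts expanded
p captures  # ["one ", "two ", "three"]
```

Each copy is a separate capturing group, which is why every repetition shows up in the captures.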

Notes

* The "or more" quantifiers, zero or more (*), and one or more (+) use a
  default value of 12 for the maximum number of repetitions. A different
  value can be specified as the second argument to #new of MetaRegexp.

* To avoid over expansion with * and +, use the interval (a.k.a range)
  quantifier ({min,max}) for repeating groups.

* Only greedy and reluctant (lazy) quantifiers are supported and tested
  so far.

* Possible matches that do not occur appear as nil values in the captures
  array. See the MatchData class in meta_re.rb for more info and ways to
  deal with these.
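As a rough illustration of the last note, here is how nil captures arise with a hand-expanded pattern and the built-in Regexp class. It uses the standard Array#compact rather than the helpers in meta_re.rb, which are not shown here:

```ruby
# Hand-expanded form of /([a-z]+\s?){1,3}/
expanded = /([a-z]+\s?)([a-z]+\s?)?([a-z]+\s?)?/

captures = expanded.match("one two").captures
present  = captures.compact  # drop optional groups that did not match

p captures  # ["one ", "two", nil]
p present   # ["one ", "two"]
```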

An Example

Here’s a more elaborate example, this time for matching HTML tags. It is somewhat simplified in terms of the tag and attribute names to make it easier to follow.

Given the following input and regular expression:

# input text
tag = '<input type="text" value="hello" id="body" onclick="click_text();" />'

# regular expression
rex = /<[a-z]+\s+(([a-z]+)="([^"]+)"\s*){1,8}\s*\/>/

With Regexp:

Regexp.new( rex ).match( tag ).captures

# captures
["onclick=\"click_text();\" ", "onclick", "click_text();"]

Now with MetaRegexp:

MetaRegexp.new( rex ).match( tag ).captures

# captures, formatted for readability:
[
  "type=\"text\" ", "type", "text",
  "value=\"hello\" ", "value", "hello",
  "id=\"body\" ", "id", "body",
  "onclick=\"click_text();\" ", "onclick", "click_text();",
  nil, nil, nil,
  nil, nil, nil,
  nil, nil, nil,
  nil, nil, nil
]

The nil values that appear at the end are for possible matches that did not occur. See the notes (above) for more on this.
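One possible way to turn such a flat captures array into name/value pairs is to walk it in slices of three. The sketch below assumes a hand-expanded pattern and the built-in Regexp class so that it runs without MetaRegexp; the variable names are illustrative only:

```ruby
tag = '<input type="text" value="hello" id="body" onclick="click_text();" />'

# Hand-expanded equivalent of the {1,8} group:
# one required copy followed by seven optional copies.
attr = '(([a-z]+)="([^"]+)"\s*)'
rex  = Regexp.new('<[a-z]+\s+' + attr + "#{attr}?" * 7 + '\s*/>')

captures = rex.match(tag).captures

# Each attribute contributes three captures: full text, name, value.
attributes = captures
  .each_slice(3)
  .reject { |full, _name, _value| full.nil? }
  .map    { |_full, name, value| [name, value] }
  .to_h

p captures.size  # 24
p attributes     # maps "type", "value", "id" and "onclick" to their values
```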

Aliases

Aliases are like macros, a way to define named regular expression patterns that can be used on their own or within larger patterns.

This type of pattern reuse is usually done with constants and simple string concatenation, but it is only effective for one level of depth. With MetaRegexp aliases, new aliases can be defined in terms of other aliases. This allows for building more granular “libraries” of reusable patterns.

Aliasing Example

Aliasing is off by default. It can be enabled with:

MetaRegexp.aliasing true

Next, define some aliases. Smaller patterns are used to build larger ones:

MetaRegexp.alias :name,      /[a-z][-a-z0-9_]+/
MetaRegexp.alias :scheme,    /https?|ftps?/
MetaRegexp.alias :tld,       'com|net|org|edu'
MetaRegexp.alias :domain,    /((?:(@name)\.){1,4}(@tld))/
MetaRegexp.alias :uri,       '@scheme://@domain'

To use an alias, prepend its name with an at sign (@):

>> MetaRegexp.new /@domain/
=> /((?:([a-z][-a-z0-9_]+)\.)(?:([a-z][-a-z0-9_]+)\.)?(?:([a-z][-a-z0-9_]+)
   \.)?(?:([a-z][-a-z0-9_]+)\.)?(com|net|org|edu))/

>> MetaRegexp.new '@uri' 
=> /(?:https?|ftps?):\/\/(?:((?:([a-z][-a-z0-9_]+)\.)(?:([a-z][-a-z0-9_]+)
   \.)?(?:([a-z][-a-z0-9_]+)\.)?(?:([a-z][-a-z0-9_]+)\.)?(com|net|org|edu)))/
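The nested substitution can be approximated with a small recursive resolver. This is a simplified sketch, not MetaRegexp's implementation: the ALIASES table and resolve_aliases are hypothetical names, every alias is always wrapped in a non-capturing group (so the output is not identical to the expansions shown above), repetitions are left unexpanded, and there is no protection against aliases that refer to themselves:

```ruby
# Hypothetical alias table mirroring the definitions above.
ALIASES = {
  name:   '[a-z][-a-z0-9_]+',
  scheme: 'https?|ftps?',
  tld:    'com|net|org|edu',
  domain: '((?:(@name)\.){1,4}(@tld))',
  uri:    '@scheme://@domain'
}.freeze

# Replace every @alias with its definition, resolving nested aliases
# recursively. Unknown names are left untouched.
def resolve_aliases(source, aliases = ALIASES)
  source.gsub(/@[a-z_]+/) do |token|
    key = token[1..-1].to_sym
    aliases.key?(key) ? "(?:#{resolve_aliases(aliases[key], aliases)})" : token
  end
end

scheme = resolve_aliases('@scheme')
uri    = Regexp.new(resolve_aliases('@uri'))

puts scheme                                 # (?:https?|ftps?)
p uri.match?("https://www.example.com")     # true
```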

Naming Alias Captures

In addition to the normal alias resolution, where aliases are replaced with their defined values as is, they can also be wrapped with named groups that use the same name as the alias. To enable named alias groups use:

MetaRegexp.name_groups true

>> MetaRegexp.new '@scheme'
=> /(?<scheme>https?|ftps?)/

Now matches for that alias can be accessed by name from the match data.
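Access by name works the same way as with any named group in the built-in Regexp class. A short illustration using the expanded pattern shown above:

```ruby
rex   = /(?<scheme>https?|ftps?)/
match = rex.match("ftp://example.org")

scheme = match[:scheme]  # look up the capture by group name
names  = match.names

p scheme  # "ftp"
p names   # ["scheme"]
```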

Known Problem

Aliases that contain repetitions will result in multiple groups having the same name, so these matches overwrite one another in the captures list. This will be corrected in a future version.
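The effect of duplicate group names can be seen with the built-in Regexp class alone. The pattern below is a made-up stand-in for an expanded alias, and MetaRegexp's own MatchData may behave differently:

```ruby
# Two groups share the name "name", as an expanded alias would produce.
rex   = Regexp.new('(?<name>[a-z]+)\.(?<name>[a-z]+)\.')
match = rex.match("www.example.")

by_name    = match[:name]          # lookup by name yields the last matched group
named      = match.named_captures  # one entry per name
positional = match.captures        # positional captures keep both values

p by_name     # "example"
p positional  # ["www", "example"]
```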

Copyright © 2010 Ammar Ali. See LICENSE file for details.
