Git Product home page Git Product logo

llvm-regexp's Introduction

JIT-compiling code to match regular expressions
===============================================

This is a small demo, showing how to generate regular expression matchers using LLVM.
It is inspired by the great Weighted Regular Expressions paper (http://sebfisch.github.com/haskell-regexp/).

There is no documentation so far, only copies of mailing list communication.

Introduction (28.7.2010)
------------------------
Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell.

The attached prototype does not scale for larger regexp, is mostly untested and probably unoptimized, but is quite a bit faster than ruby and python's re, with the code generation core only spanning 25 lines.

Here is a biased (because the LLVM stuff needs to use bytestrings) and completely unrepresentative comparison for the "even number of c's" regular expressions:

> ruby -e 'print("accbccacbc" * 10000000)' > test.in

> # Using weighted-regexp-0.2.0.0 and the RE from the paper
> time ./TestWeightedRegexp < test.in # using the RE from the paper
       58.34 real        56.65 user         1.10 sys

> time ruby -e 'gets =~ /\A(?:[ab]*c[ab]*c)*[ab]*\Z/' < test.in
        7.42 real         6.23 user         1.03 sys

> time ./Grep '([ab]*c[ab]*c)*[ab]*' < test.in
        0.96 real         0.79 user         0.16 sys

For large regular expressions, the generated bitcode serves as a good stress test for LLVM's backend ;) 

Rough Functionality (30.7.2010)
-------------------------------

The current implementation works as follows:

It generates an LLVM function which matches a bytestring against a given  regular expression. The state of the 'automaton' consists of one bit for
each leaf of the regexp AST, corresponding to the marks in the article's figure. It then generates straight-line code updating the state given an input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM).

For example, for matching '.a' , the generated code looks like this in a simplified pseudo code notation:

> ... next ~ first character matched
> ... ch   ~ input character
>   next1 = bitmask[0]              -- was '.' marked ?
>   bitmask[0] = next               -- mark '.' if initial
>   next2 = bitmask[1]              -- was 'a' marked ?
>   bitmask[1] = ch == 'a' && next1 -- mark 'a' if '.' was marked                                          
                                    -- and input is 'a'

At the end of the string, code is generated to check whether the automaton is in a final state (genFinalStateCheck). In the example above, this corresponds to

>   final = bitmask[1]

The rest is either LLVM boilerplate (regexMatcher) or adaptions from weighed-regexp. Additionally, I've adapted the Parser.y from weighted-regexp, but some things (e.g. character classes like \w) are not implemented.

Generating one big basic block for the whole regular expressions does not work well when there are more than a few thousand nodes in the AST. Using functions for large regular expressions and loop for repititions would be one solution. The other problem is that the matcher only operates on Word8s at the moment.

Trying it out (if you have llvm+haskell bindings) is also easy (do not cabal-easy):

> $ git clone [email protected]:visq/llvm-regexp.git && cd llvm-regexp
> $ make
> $ ./Grep '.*spotlight.*' < /etc/passwd 
< Line 45 matches

llvm-regexp's People

Contributors

visq avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.