Git Product home page Git Product logo

frak's Introduction

frak

frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible.

"Installation"

Add frak as a dependency to your project.clj file.

[frak "0.1.2"]

Usage

user> (require 'frak)
nil
user> (frak/pattern ["foo" "bar" "baz" "quux"])
#"(?:ba[rz]|foo|quux)"
user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
#"Cloj(?:ure(?:Script)?|ars)"

How?

A frak pattern is constructed from a trie of characters. As characters are added to it, meta data is stored in it's branches containing information such as which branches are terminal and a record of characters which have "visited" the branch.

During the rendering process frak will prefer branch characters that have "visited" the most. In the example above, you will notice the ba[rz] branch takes precedence over foo even though "foo" was the first to enter the trie. This is because the character \b has frequented the branch more than \f and \q. The example below illustrates this behavior on the second character of each input.

user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
#"b(?:a[tnr]|o[tx]|it)"

Why?

Here's why. Also because.

Next

While the patterns currently generated by frak are correct, there is potential for improvement; the word trie could be converted to a directed acyclic word graph (DAWG).

By using a DAWG it might be possible to produce more efficient patterns which consider common prefixes and suffixes. This could reduce backtracking and overall pattern size.

And now for something completely different

Let's build a regular expression for matching any word in /usr/share/dict/words.

user> (require '[clojure.java.io :as io])
nil
user> (def words
           (-> (io/file "/usr/share/dict/words")
               io/reader
               line-seq))
#'user/words
user> (def word-re (frak/pattern words))
#'user/word-re
user> (every? #(re-matches word-re %) words)
true

The last two operations will take a moment since there are about 235,886 words to consider.

You can view the full expression here (it's approximately 1.5M!).

Benchmarks

(use 'criterium.core)

(def words
  (-> (io/file "/usr/share/dict/words")
      io/reader
      line-seq))

(defn naive-pattern
  "Create a naive regular expression pattern for matching every string
   in strs."
  [strs]
  (->> strs
       (clojure.string/join "|")
       (format "(?:%s)")
       re-pattern))

;; Shuffle 10000 words and build a naive and frak pattern from them.
(def ws (shuffle (take 10000 words)))
(def n-pat (naive-pattern ws))
(def f-pat (frak/pattern ws))

;; Verify the naive pattern matches everything it was constructed from.
(every? #(re-matches n-pat %) ws)
;; => true

;; Shuffle the words again since the naive pattern is built in the
;; same order as it's inputs.
(def ws' (shuffle ws))

;;;; Benchmarks

;; Naive pattern

(bench (doseq [w ws'] (re-matches n-pat w)))
;;             Execution time mean : 1.499489 sec
;;    Execution time std-deviation : 181.365166 ms
;;   Execution time lower quantile : 1.337817 sec ( 2.5%)
;;   Execution time upper quantile : 1.828733 sec (97.5%)

;; frak pattern

(bench (doseq [w ws'] (re-matches f-pat w)))
;;             Execution time mean : 155.515855 ms
;;    Execution time std-deviation : 5.663346 ms
;;   Execution time lower quantile : 148.168855 ms ( 2.5%)
;;   Execution time upper quantile : 164.164294 ms (97.5%)

frak's People

Contributors

noprompt avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.