intxeger's Introduction

IntXeger

IntXeger (pronounced "integer") is a Python library for generating strings from regular expressions. Some of its core features include:

  • Support for most common regular expression operations.
  • Array-like indexing for mapping integers to matching strings.
  • Generator interface for sequentially sampling matching strings.
  • Sampling-without-replacement for generating a set of unique strings.

Compared to popular alternatives such as xeger and exrex, IntXeger is roughly 3 to 14 times faster at generating strings in the benchmark below, and it offers unique functionality such as array-like indexing and sampling without replacement.

Installation

You can install the latest stable release of IntXeger by running:

pip install intxeger

Quick Start

Let's start with a simple example where our regex specifies a two-character string that only contains lowercase letters.

import intxeger
x = intxeger.build("[a-z]{2}")

You can check the number of strings that can be generated from this regex using the length attribute and generate the ith matching string using the get(i) method.

assert x.length == 26**2 # there are 676 unique strings which match this regex
assert x.get(15) == 'ap' # index 15 (zero-based) maps to the string 'ap'
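
One way to picture the integer-to-string mapping is to read the index as a base-26 number with one digit per character. The sketch below reproduces the assertion above in plain Python without importing IntXeger. `index_to_string` is a hypothetical helper for illustration, not part of the library, and the ordering (rightmost character varies fastest) is inferred from `get(15) == 'ap'`.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 characters matched by [a-z]

def index_to_string(index, width=2):
    """Map an integer to a fixed-width string by reading it as a base-26 number."""
    if not 0 <= index < len(ALPHABET) ** width:
        raise IndexError("index out of range")
    chars = []
    for _ in range(width):
        # Peel off the least significant base-26 digit.
        index, digit = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))  # most significant character first

print(index_to_string(0))    # aa
print(index_to_string(15))   # ap
print(index_to_string(675))  # zz
```

Because every index in the range decodes to a different digit sequence, no two indices produce the same string for this regex.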

Furthermore, you can generate N unique strings which match this regex using the sample(N) method. Note that N must be less than or equal to the length.

print(x.sample(N=10))
# ['xt', 'rd', 'jm', 'pj', 'jy', 'sp', 'cm', 'ag', 'cb', 'yt']
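
Sampling without replacement follows naturally from integer indexing: draw N distinct indices, then map each one to its string. The sketch below illustrates the idea for the two-letter regex in plain Python. `sample_unique` and its `seed` parameter are hypothetical, and this is not necessarily how IntXeger implements `sample`.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def sample_unique(n, seed=None):
    """Return n distinct two-letter strings by sampling indices without replacement."""
    total = len(ALPHABET) ** 2
    if n > total:
        raise ValueError("N must be less than or equal to the length")
    rng = random.Random(seed)
    indices = rng.sample(range(total), n)  # n distinct integers in [0, total)
    # Decode each index as two base-26 digits.
    return [ALPHABET[i // 26] + ALPHABET[i % 26] for i in indices]

strings = sample_unique(10, seed=0)
print(len(strings), len(set(strings)))  # 10 10
```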

Here's a more complicated regex which specifies a timestamp.

x = intxeger.build(r"(1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M")
print(x.sample(N=2))
# ['11:57:12 AM', '01:16:01 AM']
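
As a sanity check, the number of strings this regex can produce can be worked out by hand and the sample output above verified with the standard `re` module. The arithmetic assumes that `\d` expands to the ten ASCII digits; the value IntXeger itself reports for `x.length` is not asserted here.

```python
import re

pattern = r"(1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M"

hours = 3 + 9             # 10-12 plus 01-09
minute_or_second = 6 * 10  # [0-5] followed by one digit
meridiems = 2              # AM or PM
total = hours * minute_or_second ** 2 * meridiems
print(total)  # 86400

# The two strings from the sample output above are full matches.
for s in ["11:57:12 AM", "01:16:01 AM"]:
    assert re.fullmatch(pattern, s) is not None
```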

You can also print matches on the command line.

$ intxeger --order=desc "[a-c]"
c
b
a
$ python3 -m intxeger -0 'base/[ab]/[12]' | xargs -0 mkdir -p
$ tree base/
base
├── a
│   ├── 1
│   └── 2
└── b
    ├── 1
    └── 2

To learn more about the functionality provided by IntXeger, check out our documentation!

Benchmark

This table, generated by benchmark.py, shows the time in milliseconds required to generate N examples of each regular expression using xeger, exrex, and intxeger.

regex                                             N      xeger (ms)   exrex (ms)   intxeger (ms)
[a-zA-Z]+                                         100    7.36         3.17         1.09
[0-9]{3}-[0-9]{3}-[0-9]{4}                        100    11.59        6.25         0.8
[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}   1000   208.62       91.3         18.28
/json/([0-9]{4})/([a-z]{4})                       1000   133.36       107.01       12.18
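
To read the table as speedups rather than raw timings, each alternative's time can be divided by the intxeger time for the same row. The snippet below does that arithmetic on the numbers copied from the table; it does not rerun the benchmark.

```python
# (regex, N, xeger_ms, exrex_ms, intxeger_ms), copied from the table above
ROWS = [
    ("[a-zA-Z]+", 100, 7.36, 3.17, 1.09),
    ("[0-9]{3}-[0-9]{3}-[0-9]{4}", 100, 11.59, 6.25, 0.8),
    ("[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}", 1000, 208.62, 91.3, 18.28),
    ("/json/([0-9]{4})/([a-z]{4})", 1000, 133.36, 107.01, 12.18),
]

def speedups(rows):
    """Return (regex, speedup over xeger, speedup over exrex) for each row."""
    return [(regex, xeger / ix, exrex / ix) for regex, _n, xeger, exrex, ix in rows]

for regex, vs_xeger, vs_exrex in speedups(ROWS):
    print(f"{regex}: {vs_xeger:.1f}x vs xeger, {vs_exrex:.1f}x vs exrex")
```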

Have a regular expression that isn't represented here? Check out our Contributing Guide and submit a pull request!

intxeger's People

Contributors

k15z, moreati

intxeger's Issues

Support countably infinite regular expressions

The goal is to support array-based indexing for regular expressions with unbounded repeats. Currently, the max_repeats parameter limits the number of times any sequence can be repeated, making it so that there are always a finite number of strings which can be generated from a regex.

After this change, the user will be able to choose between (1) specifying max_repeats and having a finite set of strings or (2) not specifying max_repeats and having an infinite set of strings they can iterate over and/or index into.

Repeat

Modify the Repeat class to apply Cantor's pairing function when max_repeat is not specified.

x -> (a, b) # decompose the index into two values
b -> interpret this as the length of repeated sequence
a -> (a1, a2, a3, ... a_b) # convert it into `b` values
a_i -> the integer index of the `i`th element in the sequence

Note that the length attribute will be set to float("-inf").
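
For reference, here is a minimal sketch of Cantor's pairing function and its inverse, which is the decomposition x -> (a, b) described above. The function names are hypothetical, and which component is interpreted as the sequence length is a design choice the issue leaves to the implementation.

```python
import math

def cantor_pair(a, b):
    """Encode two non-negative integers as a single non-negative integer."""
    return (a + b) * (a + b + 1) // 2 + b

def cantor_unpair(x):
    """Invert cantor_pair: recover (a, b) from x using exact integer arithmetic."""
    w = (math.isqrt(8 * x + 1) - 1) // 2  # index of the diagonal containing x
    t = w * (w + 1) // 2                  # first value on that diagonal
    b = x - t
    a = w - b
    return a, b

for x in range(6):
    print(x, cantor_unpair(x))
```

Because the pairing is a bijection between the non-negative integers and pairs of non-negative integers, every index x yields exactly one (a, b), so no index is skipped or duplicated.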

Choice

Modify the Choice class to handle both finite nodes and infinite nodes. It should assign the smallest integers to the finite nodes; then, once those are all assigned, it should start handling the infinite nodes by rotating between them.
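
The index assignment described above could look roughly like the following. This is a sketch of the proposal only: `route` is a hypothetical helper, finite children are represented just by their lengths, and infinite children are represented only by how many there are.

```python
def route(index, finite_lengths, num_infinite):
    """Map a Choice index to (kind, child position, index within that child)."""
    if index < 0:
        raise IndexError("index must be non-negative")
    # The smallest integers go to the finite children, in order.
    for position, length in enumerate(finite_lengths):
        if index < length:
            return ("finite", position, index)
        index -= length
    if num_infinite == 0:
        raise IndexError("index out of range")
    # The remaining integers rotate between the infinite children.
    return ("infinite", index % num_infinite, index // num_infinite)

# Two finite children (lengths 2 and 3) followed by two infinite children.
for i in range(9):
    print(i, route(i, [2, 3], 2))
```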

Strings may not be unique if the regex is ambiguous

For example, if your regex is:

(abc)|(abc)

Then it will report length=2 and generate ["abc", "abc"], since the two strings are generated by different nodes in the tree. It's not clear what the solution is, but this problem is not unique to intxeger; other libraries such as exrex have it too.
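
Until there is a fix inside the library, one caller-side workaround is to deduplicate the generated strings while preserving their order. The sketch below uses a plain list as a stand-in for the output of the two branches; `dedupe` is a hypothetical helper, and note that after deduplication fewer than N strings may remain.

```python
def dedupe(strings):
    """Drop repeated strings while keeping the first occurrence of each."""
    return list(dict.fromkeys(strings))

branches = ["abc", "abc"]  # one string per alternative of (abc)|(abc)
print(len(branches))       # 2
print(dedupe(branches))    # ['abc']
```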

Using a regex `.` causes newline breaks to be sampled

When IntXeger samples from a regex that includes a . (which should match any character except a newline), \n can show up in the samples:

import re
import intxeger

regex = r"."
x = intxeger.build(regex)
samples = x.sample(min(100, x.length))
non_matches = [item for item in samples if re.fullmatch(regex, item) is None]
print(non_matches)
# ['\n']
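
A likely cause is visible in the `_to_node` listing quoted in the ValueError report on this page: the ANY operator is expanded to every character in `string.printable`, and that set includes `\n`. The check below confirms this with the standard library alone and shows one possible filter; the `dot_alphabet` name is hypothetical and this is not a patch taken from the project.

```python
import re
import string

# Which printable characters fail to match "." under the default flags?
non_matches = [c for c in string.printable if re.fullmatch(r".", c) is None]
print(non_matches)  # ['\n']

# One possible alphabet for "." that agrees with the re module's default behaviour.
dot_alphabet = [c for c in string.printable if c != "\n"]
assert all(re.fullmatch(r".", c) for c in dot_alphabet)
```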

Expand user API

  • Add an intxeger.sample(regex, N) method which builds the tree, optimizes it, and uses it to generate N samples.
  • Add an intxeger.iterator(regex, ordered=False) generator which yields random or ordered samples.
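
A rough sketch of how the two proposed helpers might behave is shown below. To stay runnable without the library, it operates on an already built object rather than a regex: `LowercasePairs` is a stand-in that mimics the `length` and `get(i)` interface from the Quick Start, and the `seed` parameter is hypothetical. The proposal does not say whether random iteration may repeat values; this sketch assumes it may not.

```python
import itertools
import random

class LowercasePairs:
    """Stand-in for a built [a-z]{2} tree, exposing length and get(i)."""
    length = 26 ** 2

    def get(self, i):
        first, second = divmod(i, 26)
        return chr(ord("a") + first) + chr(ord("a") + second)

def iterator(node, ordered=False, seed=None):
    """Yield every string of node exactly once, in index order or shuffled."""
    indices = list(range(node.length))
    if not ordered:
        random.Random(seed).shuffle(indices)
    for i in indices:
        yield node.get(i)

def sample(node, n, seed=None):
    """Return n unique strings by taking the first n items of a shuffled iteration."""
    return list(itertools.islice(iterator(node, ordered=False, seed=seed), n))

print(list(itertools.islice(iterator(LowercasePairs(), ordered=True), 3)))  # ['aa', 'ab', 'ac']
```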

ValueError raised

Hi there, I'm evaluating using this library instead of the alternatives since it looks quite nice. But I am encountering some issues.

For example, given this input:

from intxeger import build

regex = "a$"
result = build(regex)

I am getting this:

op = AT, args = AT_END, max_repeat = 10

    def _to_node(op, args, max_repeat):
        if op == sre_parse.IN:
            nodes = []
            for op, args in args:
                nodes.append(_to_node(op, args, max_repeat))
            if nodes[0] == "NEGATE":
                values = [c[i] for c in nodes[1:] for i in range(c.length)]
                nodes = [Constant(c) for c in string.printable if c not in values]
            return Choice(nodes)
        elif op == sre_parse.RANGE:
            min_value, max_value = args
            return Choice(
                [Constant(chr(value)) for value in range(min_value, max_value + 1)]
            )
        elif op == sre_parse.LITERAL:
            return Constant(chr(args))
        elif op == sre_parse.NEGATE:
            return "NEGATE"
        elif op == sre_parse.CATEGORY:
            return Choice([Constant(c) for c in CATEGORY_MAP[args]])
        elif op == sre_parse.ANY:
            return Choice([Constant(c) for c in string.printable])
        elif op == sre_parse.ASSERT:
            nodes = []
            for op, args in args[1]:
                nodes.append(_to_node(op, args, max_repeat))
            return Concatenate(nodes)
        elif op == sre_parse.BRANCH:
            nodes = []
            for group in args[1]:
                subnodes = []
                for op, args in group:
                    subnodes.append(_to_node(op, args, max_repeat))
                nodes.append(Concatenate(subnodes))
            return Choice(nodes)
        elif op == sre_parse.SUBPATTERN:
            nodes = []
            ref_id = args[0]
            for op, args in args[3]:
                nodes.append(_to_node(op, args, max_repeat))
            return Group(Concatenate(nodes), ref_id)
        elif op == sre_parse.GROUPREF:
            return GroupRef(ref_id=args)
        elif op == sre_parse.MAX_REPEAT or op == sre_parse.MIN_REPEAT:
            min_, max_, args = args
            op, args = args[0]
            if max_ == sre_parse.MAXREPEAT:
                max_ = max_repeat
            return Repeat(_to_node(op, args, max_repeat), min_, max_)
        elif op == sre_parse.NOT_LITERAL:
            return Choice([Constant(c) for c in string.printable if c != chr(args)])
        else:
>           raise ValueError(f"{op} {args}")
E           ValueError: AT AT_END
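
The traceback shows that the parser emits an AT token with the argument AT_END for the `$` anchor, and that `_to_node` has no branch for AT. The sketch below uses Python's internal regex parser (the same module the listing calls `sre_parse`, which lives at `re._parser` from Python 3.11 onward) to detect anchors before calling `build`. `find_anchors` is a hypothetical helper, and how the library should treat anchors is left open here.

```python
try:
    import re._parser as sre_parse  # Python 3.11 and later
except ImportError:
    import sre_parse  # earlier Python versions

def find_anchors(regex):
    """Return the (op, args) pairs for top-level anchor tokens in regex."""
    return [(op, args) for op, args in sre_parse.parse(regex) if op == sre_parse.AT]

anchors = find_anchors("a$")
print(anchors)
if anchors:
    print("pattern contains anchors that the quoted _to_node does not handle")
```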
