google / sre_yield


Python module to generate all regular expression matches

License: Apache License 2.0

Python 96.91% Shell 2.00% Makefile 1.08%

sre_yield's Introduction

sre_yield

ARCHIVED: See https://github.com/sre-yield/sre-yield for continuing development.

sre_yield's Issues

slice len() raises TypeError even for small values

len(sre_yield.AllMatches('\d+')) raises OverflowError: cannot fit 'int' into an index-sized integer

len(sre_yield.AllMatches('\d+', max_count=19)) raises OverflowError: cannot fit 'int' into an index-sized integer, but lower max_count is ok.

But using slices it gets interesting.

len(sre_yield.AllMatches('\d+')[:16]) raises TypeError: 'float' object cannot be interpreted as an integer. AllStrings gives the same result, while [:15] is ok. In this case, converting to a list is a workaround: len(list(sre_yield.AllStrings('\d+')[:16])) returns 16.

len(sre_yield.AllMatches('\d+', max_count=1)[:16]) is ok, but len(sre_yield.AllMatches('\d+', max_count=2)[:16]) and higher is not.

Looking at the code, I see a note about using .__len__() instead, and the solution for most cases becomes obvious.
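The OverflowError itself comes from CPython rather than from sre_yield: the built-in len() requires the value returned by __len__ to fit in an index-sized integer (sys.maxsize), while calling .__len__() directly returns the Python int unchanged. A minimal sketch of that difference, using a hypothetical HugeSequence class as a stand-in for a lazily computed sequence (it is not part of sre_yield):

```python
import sys


class HugeSequence:
    """Hypothetical stand-in for a sequence whose size is known but huge."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size


seq = HugeSequence(10 ** 30)  # far larger than sys.maxsize on 64-bit builds

error = None
try:
    len(seq)  # the built-in insists on an index-sized result
except OverflowError as exc:
    error = str(exc)

direct = seq.__len__()  # bypasses the built-in's range check
print(error)
print(direct > sys.maxsize)
```

This only explains the OverflowError cases; the TypeError on slices reported above is a separate problem.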

How to limit the numbers generated?

k=list(sre_yield.AllStrings('[a-zA-Z]\d{7}')) Is there a way to limit the number of strings generated? Generating 9-digit values this way would take a lot of time.
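One generic way to cap the output, assuming the object being consumed can be iterated lazily, is to take a bounded prefix with itertools.islice instead of calling list() on the whole thing. The sketch below uses a hypothetical generator, letter_then_digits, as a stand-in for the pattern [a-zA-Z]\d{7} so that it runs without sre_yield installed:

```python
import itertools
import string


def letter_then_digits(digit_count):
    """Hypothetical stand-in: lazily yield one ASCII letter followed by digits."""
    for letter in string.ascii_letters:
        for digits in itertools.product(string.digits, repeat=digit_count):
            yield letter + "".join(digits)


# Take only the first five values; the rest of the 520,000,000 are never built.
first_five = list(itertools.islice(letter_then_digits(7), 5))
print(first_five)
```

Slicing, as used in the issue above ([:16]), is the other obvious option when the object supports it.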

Request for sre_yield.oneString method

For a big regex I am seeing the error below, probably because too many combinations are generated. I need just one of the strings.

Something like sre_yield.oneString probably needs to be developed.

list(sre_yield.AllStrings(r'((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}((([0-9a-fA-F]{0,4}:)?(:|[0-9a-fA-F]{0,4}))|(((25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])))(%[\p{N}\p{L}]+)?'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: cannot fit 'int' into an index-sized integer

Implement thinning consistently

When the user passes in a charset, it is currently only used for dot. I'm expanding this to use the intersection between the charset passed in and categories like \w\s\d as well, but I don't intend to do so for literals.

Should it apply to character classes? I'm not sure.

For some, like [^\w] it's pretty clear it should (once Unicode support lands), but others like [a-z_] are already fairly limited.

Support flags=re.LOCALE

This seems unnecessarily difficult; please leave a comment (with an example) if it's useful to you.

repr is slow

#32 test case testBenchInputSlow shows repr can be very slow.

It looks like repeat sequence objects are the cause, but #30 fixes it for the PR 32 test cases (only). My gut feeling is that a wrapper concat/combin object uses repr(self.list_lengths), which hits list.__repr__, and that this causes a full expansion to occur. I am pretty confident that PR doesn't really solve it; it only solves the most basic cases.

IndexError if max_count lower than seq{x}

sre_yield.AllStrings(r"\d{2}", max_count=1)

causes:

>>> list(sre_yield.AllStrings(r"\d{2}", max_count=1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sre_yield/__init__.py", line 581, in AllStrings
    return RegexMembershipSequence(
  File "sre_yield/__init__.py", line 556, in __init__
    self.raw = self.sub_values(pattern)
  File "sre_yield/__init__.py", line 417, in sub_values
    elements = [self.sub_values(p) for p in parsed]
  File "sre_yield/__init__.py", line 417, in <listcomp>
    elements = [self.sub_values(p) for p in parsed]
  File "sre_yield/__init__.py", line 427, in sub_values
    return self.backends[matcher](*arguments)
  File "sre_yield/__init__.py", line 378, in max_repeat_values
    return RepetitiveSequence(self.sub_values(items), min_count, max_count)
  File "sre_yield/__init__.py", line 275, in __init__
    if self.offsets[-1][0] > OFFSET_BREAK_THRESHOLD:
  File "sre_yield/cachingseq.py", line 33, in __getitem__
    raise IndexError()
IndexError

Reverse slices can produce empty sets

>>> "12345678"[99:-99:-1]
'87654321'
>>> AllStrings("[abcdef]")[99:-99:-1]
[]

Pass through unexpanded

I am using sre_yield to expand regexes where the result is sane, and want the ridiculous parts of the regex passed through unexpanded in the results.

I am currently achieving it by pre-processing known ridiculous bits to be '~~', and I intend to improve that by subclassing, then catching and replacing them dynamically. I think this could be a common need, as a way to allow usage when some cases are too complex for sre_yield to generate all possibilities, but most times it is ok.

Capture group lost for single pattern sequence

sre_yield.AllMatches("z([ab]{2})") has a capture group, but sre_yield.AllMatches("([ab]{2})") does not.

Slices should not raise IndexError

It is a nice oddity of slices that they never raise an IndexError. I've found two uses of slices which do raise one here.

>>> [0, 1, 2, 3][slice(99,-99)]
[]
>>> AllStrings("[abcdef]")[slice(99,-99)]
Traceback (most recent call last):
  File "sre_yield/__init__.py", line 137, in __getitem__
    result = SlicedSequence(self, slicer=i)
  File "sre_yield/__init__.py", line 167, in __init__
    self.start, self.stop, self.step = slice_indices(slicer, raw.__len__())
  File "sre_yield/__init__.py", line 97, in slice_indices
    stop = _adjust_index(stop, size)
  File "sre_yield/__init__.py", line 107, in _adjust_index
    raise IndexError("Out of range")
IndexError: Out of range

>>> "abcdef"[99::-1]
'fedcba'
>>> AllStrings("[abcdef]")[99::-1]
Traceback (most recent call last):
  File "sre_yield/__init__.py", line 140, in __getitem__
    result = [item for item in result]
  File "sre_yield/__init__.py", line 140, in <listcomp>
    result = [item for item in result]
  File "sre_yield/__init__.py", line 148, in __iter__
    yield self.get_item(i)
  File "sre_yield/__init__.py", line 176, in get_item
    return self.raw[j]
  File "sre_yield/__init__.py", line 144, in __getitem__
    return self.get_item(i)
  File "sre_yield/__init__.py", line 405, in get_item
    return super().get_item(i)
  File "sre_yield/__init__.py", line 126, in get_item
    return self.raw.get_item(i, d)
  File "sre_yield/__init__.py", line 217, in get_item
    raise IndexError("Index %d out of bounds" % (i,))
IndexError: Index 6 out of bounds
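The built-in behaviour being compared against can be reproduced with slice.indices(), which clamps out-of-range bounds to the sequence length instead of raising. A minimal sketch of a clamping lookup follows; lazy_slice is a hypothetical helper written for this illustration and is not sre_yield's own slice_indices:

```python
def lazy_slice(get_item, length, slicer):
    """Hypothetical helper: apply a slice to an indexable source of known length."""
    # slice.indices() clamps start and stop to the valid range for the given
    # length, so every index produced by range() below is in bounds.
    start, stop, step = slicer.indices(length)
    return [get_item(i) for i in range(start, stop, step)]


letters = "abcdef"
forward = lazy_slice(letters.__getitem__, len(letters), slice(99, -99))
backward = lazy_slice(letters.__getitem__, len(letters), slice(99, None, -1))
both_ends = lazy_slice(letters.__getitem__, len(letters), slice(99, -99, -1))
print(forward, backward, both_ends)
```

With clamped bounds, slice(99, -99) yields nothing, while both reversed forms yield the whole sequence backwards, matching what str and list do.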

Random values

There are lots of regex expanders which provide only one feature, and it is a feature missing from this library: Random values.

The result is that other similar codebases, typically not as well built (often broken or incomplete sre handling that is "good enough" for MVP), are getting more brain power invested in them.

No doubt this library can be adapted to this easily, since it provides rather efficient slicing, so it would be simple to do a random slice into the sequence to get a random value.

IMO that is worth building into this library, heralding it, and over time improving the performance by providing additional slicers that obtain a less-random value that is known to be easier to obtain.

If the random slicer can be used repeatedly, it can also serve as a mechanism for thinning a large result space (see #2).

fwiw, I am not suggesting that the larger use case of "fake data" is included in this library. I think that there should be many libraries which approach that type of problem. I see the objective as adding to this library the tools they would need to generate fake data values with high performance using an almost complete regex syntax.
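The random-index idea in this issue can be sketched without sre_yield. ProductSequence below is a hypothetical stand-in for an indexable sequence of matches: it decodes item i directly from the index, so picking a random element never generates the others. It illustrates the approach only and is not how the library is implemented:

```python
import random
import string


class ProductSequence:
    """Hypothetical stand-in for an indexable sequence of pattern matches.

    Item i is decoded from i as a mixed-radix number, one digit per charset,
    so no other items need to be generated.
    """

    def __init__(self, charsets):
        self.charsets = charsets
        self.length = 1
        for charset in charsets:
            self.length *= len(charset)

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("index %d out of bounds" % index)
        chars = []
        # Peel off the least significant position first.
        for charset in reversed(self.charsets):
            index, position = divmod(index, len(charset))
            chars.append(charset[position])
        return "".join(reversed(chars))


# One lowercase letter followed by three digits.
seq = ProductSequence([string.ascii_lowercase] + [string.digits] * 3)
rng = random.Random(0)  # seeded only to make the sketch repeatable
sample = seq[rng.randrange(seq.length)]
print(sample)
```

random.randrange accepts arbitrarily large integers, so the same call works when the length exceeds sys.maxsize.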

README: obsolete comment about null bytes

README says:

The re module docs say "Regular expression pattern strings may not contain null bytes" yet this appears to work fine.

But this sentence was removed in Python 3.6 (python/cpython@69ed5b6), and sre_yield doesn't support older versions.

Problem with lookaheads

Positive and negative lookaheads behave the same. This is as of [d997adf]

>>> x = sre_yield.AllStrings("(?!a)x?")
>>> len(x)
0
>>> x.raw.list_lengths
[([], 0), ({repeat base=1 low=0 high=1}, 2)]
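For reference, the standard re module gives different answers for the two forms under full-match semantics. The check below uses only re.fullmatch against a few candidate strings; it shows the expected results and says nothing about how sre_yield handles lookaheads internally:

```python
import re

candidates = ["", "x", "a", "ax"]

# Negative lookahead: the next character must not be "a".
negative = [s for s in candidates if re.fullmatch(r"(?!a)x?", s)]
# Positive lookahead: the next character must be "a", which x? can never consume.
positive = [s for s in candidates if re.fullmatch(r"(?=a)x?", s)]

print(negative)
print(positive)
```

By this measure the expected length for (?!a)x? is 2 (the empty string and "x"), not 0, while the positive form (?=a)x? matches nothing.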

upload a new version up to pypi?

version 1.0 (published 2014/02/14) up on pypi doesn't work with python3.

It appears that the code here in 'master' has been updated to do so, although I haven't tried running/testing it in both py2 and 3.

If it does work for both, could someone upload a new version up to pypi?

Support flags=re.UNICODE

This will make the \w\s\d categories support unicode, as well as . (dot).

Properties (from the regex module rather than re) probably won't be supported, and \b still won't be (it isn't currently either).

This is somewhat dependent on the thinning question in #2

Support flags = re.IGNORECASE

Right now regexes with this flag raise an exception with a link to this issue.

Supporting this is dependent on some decisions for unicode/thinning support in #2 and #3

[feature request] Provide an iterable object

To prevent memory overflow errors, please provide an iterable object so each string can be processed separately.

Example:
Consider the following code:

import sre_yield
matches = list(sre_yield.AllStrings("[a-z]{1,20}"))

The above raises OverflowError, but if an iterable object is provided then we can prevent this error. The expected usage would be as follows:

import sre_yield
for combination in sre_yield.getIteratableCombinations("[a-z]{1,20}"):
    print(combination)

In this way we won't be needing memory for storing huge lists.

Slice step 0 causes ZeroDivisionError

>>> slice(0,1,0)
slice(0, 1, 0)
>>> [0, 1, 2, 3][slice(0,1,0)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: slice step cannot be zero
>>> AllStrings("[abcdef]")[0:1:0]
Traceback (most recent call last):
  File "sre_yield/__init__.py", line 137, in __getitem__
    result = SlicedSequence(self, slicer=i)
  File "sre_yield/__init__.py", line 170, in __init__
    self.length = (
ZeroDivisionError: integer division or modulo by zero
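One way to get the built-in error for free, assuming the slice bounds are resolved before any length arithmetic, is to go through slice.indices(), which rejects a zero step itself. checked_indices below is a hypothetical helper for illustration, not a function from sre_yield:

```python
def checked_indices(slicer, length):
    """Hypothetical helper: resolve a slice against a known length."""
    # slice.indices() raises ValueError("slice step cannot be zero") on its
    # own, before any code that would divide by the step gets to run.
    return slicer.indices(length)


message = None
try:
    checked_indices(slice(0, 1, 0), 6)
except ValueError as exc:
    message = str(exc)
print(message)
```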

max_count is confusing

IMO max_count in AllStrings/AllMatches should be renamed max_repeat, gradually of course, by introducing a new kwarg max_repeat and deprecating use of max_count.
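A gradual rename of this kind is usually done by accepting both keyword arguments for a while and warning on the old one. The sketch below shows that pattern in isolation; resolve_max_repeat and DEFAULT_MAX_REPEAT are hypothetical names invented for this illustration, and the default value is a placeholder rather than a confirmed library constant:

```python
import warnings

DEFAULT_MAX_REPEAT = 65535  # placeholder default chosen for this sketch


def resolve_max_repeat(max_repeat=None, max_count=None):
    """Hypothetical helper: accept the new kwarg while deprecating the old one."""
    if max_count is not None:
        warnings.warn(
            "max_count is deprecated; use max_repeat instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # The new name wins if a caller passes both.
        if max_repeat is None:
            max_repeat = max_count
    return DEFAULT_MAX_REPEAT if max_repeat is None else max_repeat


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_max_repeat(max_count=3)
print(value, len(caught))
```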

Supporting "obvious" anchors

Some regexes like ^(foo|bar)$ or ^^^ contain anchors that aren't strictly necessary (since it's fullmatch). It would be nice to accept these and not raise ParseError.
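The redundancy can be checked with the standard re module: under re.fullmatch, the anchored and unanchored forms of the example accept the same strings. This is a small check on a few candidates, not a proof for every pattern:

```python
import re

results = []
for candidate in ["foo", "bar", "foobar", ""]:
    with_anchors = re.fullmatch(r"^(foo|bar)$", candidate) is not None
    without_anchors = re.fullmatch(r"(foo|bar)", candidate) is not None
    results.append((candidate, with_anchors, without_anchors))

# A pattern made only of anchors matches exactly the empty string.
only_anchors = re.fullmatch(r"^^^", "") is not None
print(results, only_anchors)
```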

IndexError not raised causing infinite loop

sre_yield.AllStrings("x?").get_item(2) doesn't raise IndexError and never returns.

It goes into divmod_iter(1, 1) -> divmod_iter_basic(1, 1) and never gets out.

Four tests ignored on pytest 5

Output in pytest 5.2.4

sre_yield/tests/test_bigrange.py:34
  sre_yield/tests/test_bigrange.py:34: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_all will be ignored
    def test_all():

sre_yield/tests/test_fastdivmod.py:139
  sre_yield/tests/test_fastdivmod.py:139: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_correctness_big_numbers will be ignored
    def test_correctness_big_numbers():

sre_yield/tests/test_fastdivmod.py:168
  sre_yield/tests/test_fastdivmod.py:168: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_powersum will be ignored
    def test_powersum():

sre_yield/tests/test_slicing.py:52
  sre_yield/tests/test_slicing.py:52: PytestCollectionWarning: yield tests were removed in pytest 4.0 - test_parity will be ignored
    def test_parity():
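pytest 4.0 removed support for generator-style (yield) tests, which is why these four functions are collected but never run. One minimal conversion that needs no pytest features is to run the checks in a loop inside an ordinary test function; pytest.mark.parametrize is the usual alternative when each case should be reported separately. The names below (check_divmod, test_divmod_cases) are illustrative and are not taken from the sre_yield test suite:

```python
def check_divmod(numerator, denominator):
    """Illustrative check: quotient and remainder must rebuild the numerator."""
    quotient, remainder = divmod(numerator, denominator)
    assert quotient * denominator + remainder == numerator


# Old style, ignored with a warning by pytest >= 4.0:
#     def test_divmod_cases():
#         for n in (10, 10 ** 20):
#             yield check_divmod, n, 7


def test_divmod_cases():
    """Plain test function: every case runs, and the first failure stops it."""
    for numerator in (10, 10 ** 20, 10 ** 40):
        check_divmod(numerator, 7)
```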

Allow pre-parsed patterns

I'm processing ~20,000 patterns, and I would rather not have them parsed/compiled a few times.

So I sre_parse them, and then use sre_compile.compile(p) to create the compiled pattern when needed. re.compile does those two steps anyway - the only difference is whether the compiled regex has the pattern attribute as a string containing the original regex.

The parsed (not compiled) version seems to be more suitable for keeping in memory for longer periods, as its size is more closely related to the string pattern length, while the compiled regex can be 8x the input string size.
