google / re2j

linear time regular expression matching in Java

License: Other

Java 63.58% HTML 36.07% Perl 0.35%

re2j's Introduction

RE2/J: linear time regular expression matching in Java

RE2 is a regular expression engine that runs in time linear in the size of the input. RE2/J is a port of the C++ library RE2 to pure Java.

Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.

If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
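The growth described above can be observed with java.util.regex itself. The following is a minimal sketch, assuming the pattern (a+)+b (not taken from the README) as a nested-quantifier example and an input of n 'a' characters that can never match. The class and method names are hypothetical, and the timings depend on the JDK version and machine, so none are claimed here.

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
    // Nested quantifiers: a classic worst case for backtracking engines.
    // This pattern is an illustrative assumption, not one taken from the README.
    static final Pattern NESTED = Pattern.compile("(a+)+b");

    // Builds a string of n 'a' characters with no 'b', so the match must fail.
    static String input(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    static boolean matches(int n) {
        return NESTED.matcher(input(n)).matches();
    }

    public static void main(String[] args) {
        // Input sizes are kept small on purpose; larger values can take very long.
        for (int n = 10; n <= 22; n += 4) {
            long start = System.nanoTime();
            boolean matched = matches(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + matched + " time=" + micros + "us");
        }
    }
}
```

Each run reports a failed match; how quickly the time grows with n is what distinguishes a backtracking engine from a linear-time one.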

In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
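To make the single-pass idea concrete, here is a toy sketch that tracks the set of all live NFA states while reading each input character exactly once. It assumes a hand-built automaton for (a|b)*abb and is not RE2/J's actual implementation; the class name and state numbering are hypothetical.

```java
import java.util.BitSet;

public class NfaSimulationSketch {
    // Hand-built NFA for (a|b)*abb with states 0..3; state 3 is accepting.
    //   0 --a|b--> 0,  0 --a--> 1,  1 --b--> 2,  2 --b--> 3
    private static final int ACCEPT = 3;

    // Advances every live state by one input character.
    static BitSet step(BitSet current, char c) {
        BitSet next = new BitSet(ACCEPT + 1);
        for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
            switch (s) {
                case 0:
                    if (c == 'a' || c == 'b') next.set(0);
                    if (c == 'a') next.set(1);
                    break;
                case 1:
                    if (c == 'b') next.set(2);
                    break;
                case 2:
                    if (c == 'b') next.set(3);
                    break;
                default:
                    break; // state 3 has no outgoing transitions
            }
        }
        return next;
    }

    // One pass over the input; work per character is bounded by the number of states.
    static boolean matches(String input) {
        BitSet current = new BitSet(ACCEPT + 1);
        current.set(0);
        for (int i = 0; i < input.length(); i++) {
            current = step(current, input.charAt(i));
        }
        return current.get(ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(matches("aababb"));
        System.out.println(matches("abba"));
    }
}
```

Because the state set never holds more entries than the automaton has states, the total work is proportional to input length times automaton size, with no backtracking.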

There are certain features of PCRE or Perl regular expressions that cannot be implemented in linear time, for example, backreferences, but the vast majority of regular expression patterns in practice avoid such features.

Why should I switch?

If you use regular expression patterns with a high degree of alternation, your code may run faster with RE2/J. In the worst case, the java.util.regex matcher may run forever, or exceed the available stack space and fail; this will never happen with RE2/J.

Caveats

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.

RE2/J is not a drop-in replacement for java.util.regex. Aside from the different package name, it doesn't support the following parts of the interface:

the MatchResult class
Matcher.hasAnchoringBounds()
Matcher.hasTransparentBounds()
Matcher.hitEnd()
Matcher.region(int, int)
Matcher.regionEnd()
Matcher.regionStart()
Matcher.requireEnd()
Matcher.toMatchResult()
Matcher.useAnchoringBounds(boolean)
Matcher.usePattern(Pattern)
Matcher.useTransparentBounds(boolean)
CANON_EQ
COMMENTS
LITERAL
UNICODE_CASE
UNICODE_CHARACTER_CLASS
UNIX_LINES
PatternSyntaxException.getMessage()

It also doesn't have parity with the full set of Java's character classes and special regular expression constructs.

Getting RE2/J

If you're using Maven, you can use the following snippet in your pom.xml to get RE2/J:

<dependency>
  <groupId>com.google.re2j</groupId>
  <artifactId>re2j</artifactId>
  <version>1.6</version>
</dependency>

You can use the same artifact details in any build system compatible with the Maven Central repositories (e.g. Gradle, Ivy).

You can also download RE2/J the old-fashioned way: go to the RE2/J release tag, download the RE2/J JAR and add it to your CLASSPATH.

Discussion and contribution

We have set up a Google Group for discussion. Please join the RE2/J discussion list if you'd like to get in touch.

If you would like to contribute patches, please see the instructions for contributors.

Who wrote this?

RE2 was designed and implemented in C++ by Russ Cox. The C++ implementation includes both NFA and DFA engines and numerous optimisations. Russ also ported a simplified version of the NFA to Go. Alan Donovan ported the NFA-based Go implementation to Java. Afroz Mohiuddin wrapped the engine in a familiar Java Matcher / Pattern API. James Ring prepared the open-source release and has been its primary maintainer since then.


re2j's Issues

New release?

Is it possible to do a new release to release the bugfixes?

java.lang.StringIndexOutOfBoundsException in parseUnicodeClass

We were doing some fuzzing internally, and the fuzzer happened upon a String that gets a StringIndexOutOfBoundsException when our code calls Pattern.compile. I managed to extract and escape the String and it's:

"i\u9b47$ \u55ed54@\u864e\ub117\ua575\uad27\uc80cv^\u9fe8\u15ab\u6e2de\u67a5X+\u2d09\u649fn%\ua91e[Y\u6d7da\u9586\ua2bb\u7170nZ\u1476\u2a9eA\u4720\u1edb.\ua0603\u1754\u4915n\u2b90"\u0fe5pC4\u117c\uc94bt\u580d\uad90\u3930\u1b92u\ub00e\u7361/\u234b\u8e5f\uaac6\u0647y\u9ca5\u0092\u3396\u1775@[q:\u3abb\p"

Adding it to the test suite (PR incoming) gives:

java.lang.StringIndexOutOfBoundsException: String index out of range: 80
at java.lang.String.codePointAt(String.java:689)
at com.google.re2j.Parser$StringIterator.pop(Parser.java:747)
at com.google.re2j.Parser.parseUnicodeClass(Parser.java:1542)
at com.google.re2j.Parser.parseClass(Parser.java:1637)
at com.google.re2j.Parser.parseInternal(Parser.java:852)
at com.google.re2j.Parser.parse(Parser.java:784)
at com.google.re2j.RE2.compileImpl(RE2.java:179)
at com.google.re2j.RE2.compile(RE2.java:152)

Re2j runs 5x slower than java.util.regex

Here is my benchmark:

import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@Warmup(iterations = 2, time = 3)
@Measurement(iterations = 2, time = 3)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(3)
@Fork(1)
/*
 * <pre>
 * Benchmark                Mode  Cnt     Score   Error  Units
 * Re2jBenchmark.javaMatch  avgt    2   175.926          ns/op
 * Re2jBenchmark.re2jMatch  avgt    2  1067.591          ns/op
 * </pre>
 */
public class Re2jBenchmark {

  private static final String PATTERN = "[0-5][0-6]{1,3}[1-7]{1,2}[2580]{1,3}[127]{0,8}";
  private static final Pattern p1 = Pattern.compile(PATTERN);
  private static final com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile(PATTERN);

  private List<String> inputs;

  private Iterator<String> iterator;

  @Setup
  public void setup() {
    inputs =
        IntStream.rangeClosed(0, 999)
            .mapToObj(n -> RandomStringUtils.randomNumeric(32))
            .collect(Collectors.toList());
    iterator = inputs.listIterator();
  }

  private String nextInput() {
    if (!iterator.hasNext()) {
      iterator = inputs.listIterator();
    }
    return iterator.next();
  }

  @Benchmark
  public boolean javaMatch() {
    return p1.matcher(nextInput()).find();
  }

  @Benchmark
  public boolean re2jMatch() {
    return p2.matcher(nextInput()).find();
  }
}

Here is the result:

Benchmark                Mode  Cnt     Score   Error  Units
Re2jBenchmark.javaMatch  avgt    2   175.926          ns/op
Re2jBenchmark.re2jMatch  avgt    2  1067.591          ns/op

Pattern Leading to NullPointerException

When I compile the pattern: "..|.#|..":

Pattern.compile("..|.#|..")

I get a java.lang.NullPointerException:

== Java Exception: java.lang.NullPointerException: Cannot read the array length because "subMax.runes" is null
	at com.google.re2j.Parser.factor(Parser.java:552)
	at com.google.re2j.Parser.collapse(Parser.java:344)
	at com.google.re2j.Parser.factor(Parser.java:512)
	at com.google.re2j.Parser.collapse(Parser.java:344)
	at com.google.re2j.Parser.alternate(Parser.java:294)
	at com.google.re2j.Parser.parseInternal(Parser.java:975)
	at com.google.re2j.Parser.parse(Parser.java:788)
	at com.google.re2j.RE2.compileImpl(RE2.java:187)
	at com.google.re2j.Pattern.compile(Pattern.java:149)
	at com.google.re2j.Pattern.compile(Pattern.java:107)

I applied the Jazzer Fuzzer to your lib. Maybe an integration is worth it.

com.google.re2j.PatternSyntaxException: error parsing regexp: invalid or unsupported Perl syntax: `(?!`

Hi, because of a Sonar issue I am migrating my Java 1.8 regexes to re2j 1.6.
FYI: I am using it inside my Spring Boot project.

The code throwing the error:
Pattern.compile("^(?!."xml":)(.)", Pattern.CASE_INSENSITIVE)

Here is the stack trace:
com.google.re2j.PatternSyntaxException: error parsing regexp: invalid or unsupported Perl syntax: (?!
at com.google.re2j.Parser.parsePerlFlags(Parser.java:1139) ~[re2j-1.6.jar:?]
at com.google.re2j.Parser.parseInternal(Parser.java:811) ~[re2j-1.6.jar:?]
at com.google.re2j.Parser.parse(Parser.java:788) ~[re2j-1.6.jar:?]
at com.google.re2j.RE2.compileImpl(RE2.java:187) ~[re2j-1.6.jar:?]
at com.google.re2j.Pattern.compile(Pattern.java:149) ~[re2j-1.6.jar:?]
at com.google.re2j.Pattern.compile(Pattern.java:137) ~[re2j-1.6.jar:?]
at com.ficostudio.framework.config.XSSRequestInterceptor.(XSSRequestInterceptor.java:59) ~[classes/:?]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_291]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_291]
at org.springframework.cglib.core.ReflectUtils.defineClass(ReflectUtils.java:591) ~[spring-core-5.3.9.jar:5.3.9]
at org.springframework.cglib.core.AbstractClassGenerator.generate(AbstractClassGenerator.java:363) ~[spring-core-5.3.9.jar:5.3.9]
at org.springframework.cglib.proxy.Enhancer.generate(Enhancer.java:585) ~[spring-core-5.3.9.jar:5.3.9]
at org.springframework.cglib.core.AbstractClassGenerator$ClassLoaderData$3.apply(AbstractClassGenerator.java:110) ~[spring-core-5.3.9.jar:5.3.9]
at org.springframework.cglib.core.AbstractClassGenerator$ClassLoaderData$3.apply(AbstractClassGenerator.java:108) ~[spring-core-5.3.9.jar:5.3.9]
at org.springframework.cglib.core.internal.LoadingCache$2.call(LoadingCache.java:54) ~[spring-core-5.3.9.jar:5.3.9]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_291]
at org.springframework.cglib.core.internal.LoadingCache.createEntry(LoadingCache.java:61) ~[spring-core-5.3.9.jar:5.3.9]
... 20 more
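Since `(?!` lookahead is rejected by the parser, one common workaround is to move the negative condition out of the regex and into code. The sketch below assumes a hypothetical rule (accept any non-empty input that does not contain "xml": in any letter case); the reporter's exact pattern is garbled above, so this is not a translation of it. It uses java.util.regex so that it runs without extra dependencies; only compile, CASE_INSENSITIVE, matcher and find are used, which are assumed to carry over when the import is changed to com.google.re2j.Pattern.

```java
import java.util.regex.Pattern;

public class NoLookaheadSketch {
    // Hypothetical forbidden fragment: the literal text "xml": in any letter case.
    private static final Pattern FORBIDDEN =
        Pattern.compile("\"xml\":", Pattern.CASE_INSENSITIVE);

    // Replaces a negative lookahead with an explicit "must not contain" check.
    static boolean accept(String input) {
        return !input.isEmpty() && !FORBIDDEN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("{\"json\":1}"));
        System.out.println(accept("{\"XML\":1}"));
    }
}
```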

StackOverflowError for Machine.add

Hey, I was combining many (>10,000) similar file names into a single regexp (all escaped and joined with |).
In Java it compiles (though it is extremely slow), but in RE2/J it fails with a stack overflow:

Caused by: java.lang.StackOverflowError
	at com.google.re2j.Machine.add(Machine.java:358)
	at com.google.re2j.Machine.add(Machine.java:358)
etc ...
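If the goal is only to test whether a name is one of the known file names, a set lookup avoids building a very large alternation at all. This is a minimal sketch under that assumption (exact, whole-string matches only); the class name is hypothetical and it does not address substring or partial matching.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FileNameSet {
    private final Set<String> names;

    public FileNameSet(List<String> fileNames) {
        this.names = new HashSet<>(fileNames);
    }

    // Equivalent to matching the whole input against name1|name2|... with every name escaped.
    public boolean matches(String candidate) {
        return names.contains(candidate);
    }

    public static void main(String[] args) {
        FileNameSet set = new FileNameSet(Arrays.asList("report-1.csv", "report-2.csv"));
        System.out.println(set.matches("report-2.csv"));
        System.out.println(set.matches("report-3.csv"));
    }
}
```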

Support for binary-mode matching?

This might be related to issue #17. The C++ implementation of re2 supports matching binary data (using a Latin1 encoding, I believe) rather than Unicode strings, including supporting \C as "match any byte".

It would be awesome to have support for this in this Java port as well. Are there any plans to ever add support for it? I've hacked in support for these in my own branch to meet my requirements, but I'm not sure how well it would work in the general case.

OSS-Fuzz integration

OSS-Fuzz now offers support for fuzzing Java projects with Jazzer. If you are interested, I could set up re2j in OSS-Fuzz.

By default, Jazzer would detect undeclared exceptions (i.e. those that are not PatternSyntaxExceptions) as well as more serious, potentially DoSable issues such as OutOfMemoryErrors. In order to come up with a good fuzz target, it would be helpful for me to get a better understanding of the security guarantees re2j intends to offer. The parent project's fuzzer could serve as a starting point for that discussion. Depending on your particular security goals, it could also make sense to perform differential fuzzing, i.e., to use a fuzzer to confirm that re2 and re2j behave identically on the common subset of their features.

API to list named-capture-groups

At the moment it is a flaw of Java's regex package that it is not possible to enumerate named capture groups. Support for this alone would be reason enough for me to switch to re2j. For example, something like:

Map<String,String> getNamedGroups()
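Until such an API is available, the names can be approximated by scanning the pattern text. The sketch below is a naive illustration, not the proposed getNamedGroups() method: it finds `(?<name>` and `(?P<name>` openers with a helper regex and ignores escaping and character classes, so it can report false positives for unusual patterns. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupNames {
    // Matches "(?<name>" and "(?P<name>". Lookbehind openers such as "(?<=" and "(?<!"
    // are skipped because the name must start with a letter.
    private static final Pattern NAMED_GROUP =
        Pattern.compile("\\(\\?P?<([A-Za-z][A-Za-z0-9]*)>");

    // Returns group names in order of appearance in the pattern text.
    static List<String> names(String regex) {
        List<String> result = new ArrayList<>();
        Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names("(?<year>\\d{4})-(?P<month>\\d{2})"));
    }
}
```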

There is a vulnerability in ICU for C/C++/Java 4.8.2,upgrade recommended

re2j/unicode/build.gradle

Line 15 in 66840ce

compile 'com.ibm.icu:icu4j:4.8.2'

CVE-2017-17484 CVE-2014-9654 CVE-2017-14952 CVE-2016-6293 CVE-2016-7415

Recommended upgrade version: 67.1

RE2J failed to match easy pattern

Using a simple regexp with the re2j library:

`^[0-9a-fA-F:]+$`

It compiled, but failed to recognize the text:

123ABC:DEF456
0:1:2:3:4:5
12345
90210-1234

regexp: considers "\Q\E*" as valid regexp

From golang/go#11187:

regexp/syntax: fix handling of \Q...\E

It's not a group: must handle the inside as a sequence of literal chars,
not a single literal string.

That is, \Qab\E+ is the same as ab+, not (ab)+.

Repro:

Add {"\\Qab\\E+", "cat{lit{a}plus{lit{b}}}"}, to https://github.com/google/re2j/blob/master/javatests/com/google/re2j/ParserTest.java#L88
Test output is parse/dump of \Qab\E+ expected cat{lit{a}plus{lit{b}}}, got plus{str{ab}} expected:<[cat{lit{a}plus{lit{b}]}}> but was:<[plus{str{ab]}}>

Future work

Dear Community,

As some of you know, we at Teradata are planning to incorporate RE2J into the Presto database (https://github.com/facebook/presto), since it is based on the lightning-fast RE2 and has potential for huge performance improvements. As part of our research on RE2J we found that it currently lacks many of the algorithms and optimizations that make RE2 so fast. Those optimizations require byte matching rather than rune matching as in RE2J. Therefore, as part of our work, we would like to make RE2J work internally on UTF8 bytes. Additionally, we would like input data to be represented by Slice (https://github.com/airlift/slice). Slice is basically a wrapper around a byte[] array that can be sliced. It would allow us to match raw bytes directly. We believe that our changes will make RE2J one of the fastest matching solutions for Java in big data, high-performance scenarios. However, our approach might break the current functionality of matching Java UTF16 Strings. We might add support for it later, but it will be less efficient than raw byte matching because conversion from Strings to Slices is required. Our primary focus would be on matching UTF8 sequences. Our POC branch with RE2J on Slices is here: https://github.com/Teradata/re2j/tree/re2j_on_bytes. What do you think about our plan?

Here are some useful FAQs that additionally explain our motivations:

Why replacing Strings with Slices?

Slices are effectively a wrapper around a memory fragment. Since we would like RE2J to match UTF8 bytes instead of runes it is logical to use a structure that provides fast access to bytes and allows slicing. It happens that Slice is such a structure and is also used in Presto.

Why not use raw byte[] array?

Because raw byte[] array cannot be sliced without memory copying. We aim for maximum performance.

Why not use ByteBuffer?

Because ByteBuffer is an interface and will introduce virtual method calls when RE2J reads input bytes. This will hurt performance.

Why not leave abstraction of MachineInput as it is now and support both Strings and Slices?

Having MachineInput abstracted with multiple implementations will introduce virtual method calls and will hurt RE2J performance.

Is there a way to support UTF16 efficiently on Slices?

We could compile matching program specifically for UTF16 byte sequences and convert Strings to Slices. Matching will be efficient but would require one memory copying (String to Slice).

Why do we want RE2J to work primarily on UTF8?

Presto uses UTF8. Additionally I would risk saying that most of the text in big data is either stored as ANSI or UTF8. That makes UTF8 our primary target. See http://utf8everywhere.org/.

Why do we want RE2J to match bytes instead of runes?

Original RE2 matches bytes. Byte matching allows us to port algorithms and optimizations from RE2. Without those algorithms we won’t be able to match RE2 performance.

RE2J with Slices is not a drop in replacement for Java Regular Expressions.

If you don’t care about performance then Java Regular Expressions is probably your number one choice. If you want high performance matching then you won't choose Java Regular Expressions. RE2J is not a drop in replacement anyway since it doesn't support backreferences. Therefore I don’t think we should target Java Regular Expressions users.

Would RE2J with Slices be useful for community?

Absolutely. Slice is a small Airlift subproject, so it does not add much of a dependency. Additionally, I think the big data and high-performance community will care more about a fast UTF8 byte matching solution.

.equals() and .hashcode() missing for Pattern class

The Pattern class should implement some (simple) .equals(Object) and .hashCode() methods to be able to be identified equal when being put into a Set.

Equality in this context can be as simple as the equality of the pattern string.
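Until Pattern provides these methods, callers can wrap patterns in a small key class. This is a minimal sketch of the idea above, assuming equality is defined by the pattern string plus the flags; PatternKey is a hypothetical name. It is written against java.util.regex.Pattern (which also does not override equals) so that it runs without extra dependencies; it relies only on pattern() and flags().

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public final class PatternKey {
    private final Pattern pattern;

    public PatternKey(Pattern pattern) {
        this.pattern = pattern;
    }

    public Pattern pattern() {
        return pattern;
    }

    // Two keys are equal when their pattern strings and flags are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PatternKey)) return false;
        PatternKey other = (PatternKey) o;
        return pattern.pattern().equals(other.pattern.pattern())
            && pattern.flags() == other.pattern.flags();
    }

    @Override
    public int hashCode() {
        return 31 * pattern.pattern().hashCode() + pattern.flags();
    }

    public static void main(String[] args) {
        Set<PatternKey> set = new HashSet<>();
        set.add(new PatternKey(Pattern.compile("a+b")));
        set.add(new PatternKey(Pattern.compile("a+b")));
        System.out.println(set.size());
    }
}
```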

error parsing regexp: missing closing )

When I use Pattern.compile() to compile a regex, the error described below occurs:

regex = "^#connected,all connect count: 1{\"event\":\"device_status\",\"data\":{\"wifi_name\":\"([^\"]+)\",\"wifi_signal\":\d+,\"battery\":\d+,\"batterycharging\":\w+,\"gsm_signal\":\d+,\"sms_unread\":\d+,\"sdcard\":\d+,\"updateinfo\":null}}"

Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)

Error:

error parsing regexp: missing closing )

I tested the same regex in C++ RE2 and it works well.
Is the regex wrong?
If so, could you offer some advice on converting PCRE regexes to RE2 with re2j? For example, the following:

PCRE regex samples:

regex1:

^\x99\xf3\0\0\0\0\0\0\xff\xff\xff\xff$

^\0\0\0T\0\0\0\x03\0\0\0\0\0\0\0\x01\x1b\xde\x83B\xca\xc0\xf3\?\0\0\0\x06aomSrv\0\0\0\0\0\x01\*\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\r[\d.]+\0\0\0\0\0\0\x04root\0\0\x06\(\0\0\0J$

The regexes above contain some binary strings. Is re2j compatible with them?

question about a possibility

Infinite loop in `Pattern#compile` on certain case-insensitive patterns

Pattern.compile("(?i)ᲀ") hangs in an infinite loop with JDK 11 and 17, but not 8. Other characters from the Cyrillic Extended-C block trigger the same behavior.

Stack trace captured with jcmd:

at java.lang.Character.toLowerCase([email protected]/Character.java:10710)
at com.google.re2j.Characters.toLowerCase(Characters.java:14)
at com.google.re2j.Unicode.simpleFold(Unicode.java:118)
at com.google.re2j.Parser.minFoldRune(Parser.java:200)
at com.google.re2j.Parser.newLiteral(Parser.java:187)
at com.google.re2j.Parser.literal(Parser.java:211)
at com.google.re2j.Parser.parseInternal(Parser.java:807)
at com.google.re2j.Parser.parse(Parser.java:790)
at com.google.re2j.RE2.compileImpl(RE2.java:185)
at com.google.re2j.Pattern.compile(Pattern.java:151)
at com.google.re2j.Pattern.compile(Pattern.java:109)

I first submitted this as a security issue since I understood the README to indicate that re2j should be considered safe to use on untrusted regexes of bounded length, but was told that this is not the case and that there are many known ways to "crash" re2j with crafted regular expressions.

I would thus welcome a clear statement on the security guarantees provided by re2j in the README.

Edit: Looks like https://groups.google.com/g/re2j-discuss/c/9XkVsnxngjc could resolve this issue.

\" is again getting escaped and making the pattern invalid

Hi Team,

I'm creating the regex as - "^[\w\d!@#$%^(){}:;<>".|/?=+-]+$"

Here the double quote is already escaped. But when I try to build the pattern using re2j, the string is escaped again as shown below, making the pattern invalid (that is, the pattern ends at the " itself):

com.google.re2j.Pattern GLOBAL_TRADE_ITEM_NUMBER__PATTERN = com.google.re2j.Pattern.compile("^[\w\d!@#$%^(){}:;<>.|/?=\"+-]+$");

The escape is now treated as applying to the backslash rather than to the ", so the " is taken as the end of the pattern and the following characters become invalid. I think GitHub is unescaping here. I'll attach a screenshot as well.

Regards,
Sharath
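For reference, the two layers of escaping involved here can be sketched as follows: inside a Java string literal a regex backslash is written as `\\` and a double quote as `\"`, and the regex engine never sees the backslash in front of the quote. The example uses java.util.regex so that it runs without extra dependencies, and the class name is hypothetical; whether re2j accepts this exact character class is not verified here.

```java
import java.util.regex.Pattern;

public class QuoteInCharacterClass {
    // Java source text:        "^[\\w\\d!@#$%^(){}:;<>\".|/?=+-]+$"
    // Regex the engine sees:    ^[\w\d!@#$%^(){}:;<>".|/?=+-]+$
    static final Pattern ALLOWED =
        Pattern.compile("^[\\w\\d!@#$%^(){}:;<>\".|/?=+-]+$");

    static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Printing the pattern shows the regex after Java has processed the string escapes.
        System.out.println(ALLOWED.pattern());
        System.out.println(isAllowed("say\"hi\""));
    }
}
```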

Serialization breaks compiled pattern

Hi,

I'm trying to serialize/deserialize an instance of com.google.re2j.Pattern

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.re2j.Pattern;
import org.apache.commons.lang3.SerializationUtils;
import org.junit.Test;

import java.util.Map;

public class Re2SerializationTest {
    @Test
    public void test() throws Exception {
        JsonParser parser = new JsonParser();
        JsonObject obj = parser.parse("{\"1\":\"\\\\Atest\\\\z\"}").getAsJsonObject();

        for (Map.Entry<String,JsonElement> t : obj.entrySet()) {
            String rgxp = t.getValue().getAsString();
            System.out.println("rgxp" + " -> " + rgxp);

            // java.util.regex
            java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(rgxp);
            System.out.println(pattern + " -> " + pattern.matcher("test").matches());
            byte[] serializedData = SerializationUtils.serialize(pattern);
            java.util.regex.Pattern pattern2 = SerializationUtils.deserialize(serializedData);
            System.out.println(pattern2 + " -> " + pattern2.matcher("test").matches());

            // com.google.re2j
            Pattern pattern3 = Pattern.compile(rgxp);
            System.out.println(pattern3 + " -> " + pattern3.matches("test"));
            byte[] serializedData2 = SerializationUtils.serialize(pattern3);
            Pattern pattern4 = SerializationUtils.deserialize(serializedData2);
            System.out.println(pattern4 + " -> " + pattern4.matches("test"));
        }
    }
}

...and getting the following error:

rgxp -> \Atest\z
\Atest\z -> true
\Atest\z -> true
\Atest\z -> true

com.google.re2j.PatternSyntaxException: error parsing regexp: invalid escape sequence: `\A`

	at com.google.re2j.Parser.parseEscape(Parser.java:1439)
	at com.google.re2j.Parser.parseInternal(Parser.java:966)
	at com.google.re2j.Parser.parse(Parser.java:802)
	at com.google.re2j.RE2.compileImpl(RE2.java:183)
	at com.google.re2j.Pattern.readObject(Pattern.java:274)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:224)
	at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:268)
	at Re2SerializationTest.test(Re2SerializationTest.java:31)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:117)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:262)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:84)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)


Process finished with exit code 255

Incorrect match found when capturing groups are not used

This test passes when using java.util.regex, but fails with re2j:

  @Test
  public void test() {
    Pattern p1 = Pattern.compile("(a.*?c)|a.*?b");
    Pattern p2 = Pattern.compile("a.*?c|a.*?b");

    Matcher m1 = p1.matcher("abc");
    m1.find();
    Matcher m2 = p2.matcher("abc");
    m2.find();

    assertEquals(m1.group(), m2.group());
  }

Both expressions should match abc. The second one only matches ab.

Fixed in go: golang/go#13812

proposal: port RE2's CoalesceWalker to RE2/J

google/re2@b77e1a4 added an optimisation to RE2. I'm offering to port this to RE2/J for the sake of feature parity.

Support (?<name>expr) named capturing groups format

I'm a kafka developer and I maintain some kafka clusters in my company. I often expose an interface to my kafka users, so that they can post a regular expression and analyse kafka records by using kafka transform. The analysed records can be used in kafka connect and send to other sink services.

Nowadays I think the (?<name>expr) capturing group syntax is becoming more popular, and my kafka users prefer it to (?P<name>expr). But I found that re2j only supports the (?P<name>expr) named group format, as explained in the code comment below:

In both the open source world (via Code Search) and the
Google source tree, (?P<name>expr) is the dominant form,
so that's the one we implement. One is enough.

So should we support the (?<name>expr) syntax? I think that to support this feature, all we need to do is add this prefix and change the length of the skipped substring in the parsePerlFlags function in com/google/re2j/Parser.java. If this syntax is welcome, I can make a PR.

10-60X slower compared to java regexes in some cases

Original report: https://groups.google.com/forum/#!topic/re2j-discuss/8c3L06m6wbY

pattern is : ".d|e|cart|jinjian|kk."
String is: "aajinjianaksdjflaajinjianaksdjflaajinjianaksdjfl"

Java takes: 330ms
re2j takes: 4257ms

Thanks for your time to take a look at it.

This came up also on prometheus/jmx_exporter#23 @bbaja42

Pattern doesn't deserialize correctly

Pattern claims Serializable, but it has a transient field "re2", and nothing re-initializes that field on deserialization, so using a deserialized pattern eventually raises an NPE.
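The usual remedy for a transient field is a private readObject method that rebuilds it after the default fields are read. The sketch below illustrates that technique on a stand-alone holder class; it is not the actual re2j fix. RecompilingPattern is a hypothetical name, and java.util.regex.Pattern stands in for the compiled engine so that the example runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.regex.Pattern;

public class RecompilingPattern implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String regex;
    private final int flags;
    private transient Pattern compiled; // not serialized; rebuilt in readObject

    public RecompilingPattern(String regex, int flags) {
        this.regex = regex;
        this.flags = flags;
        this.compiled = Pattern.compile(regex, flags);
    }

    public boolean matches(String input) {
        return compiled.matcher(input).matches();
    }

    // Called by Java serialization: restore the plain fields, then rebuild the transient one.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        compiled = Pattern.compile(regex, flags);
    }

    // Serializes and deserializes in memory, for demonstration only.
    static RecompilingPattern roundTrip(RecompilingPattern p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RecompilingPattern) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RecompilingPattern copy = roundTrip(new RecompilingPattern("a+b", 0));
        System.out.println(copy.matches("aaab"));
    }
}
```

Without the readObject method, the copy's transient field would be null and the first call to matches would throw a NullPointerException, which mirrors the failure described above.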

Question about supported features

Are lookaround operators supported by re2j?
Possessive quantifiers? (e.g.: ++)
Regards

Few minor bugs

Hi,

While investigating a slowness issue with re2j compared to Java regular expressions, I found a few issues that I am planning to create a PR for:

See these 2 commits I have in my branch (don't worry about the System.out.print lines):

Question:
The private void free(Thread t) method is not used in Machine.java. What was the intention?

Regards,

Jay

What about a new release?

Hi 👋🏻

Sorry if this is not the proper channel.

It has been a while since the last release and I'd be very much interested in the support of the (?<name>expr) syntax added on the master branch over 6 months ago.

9b3f052

Thoughts on synchronized access on RE2 instances

We've recently been doing some profiling on one of our applications that uses re2j quite heavily, and we're noticing a decent amount of thread contention coming from RE2 when doing concurrent pattern matching (i.e. multiple threads using the same Pattern instance).

Both RE2#get and RE2#put synchronize access on the monitor to obtain either a cached Machine instance or instantiate a new one if there are none that already exist.

Here's a small example that when run demonstrates the blocking thread behavior:

public class ThreadContentionExample {

  private static final int THREADS = 20;
  private static final Pattern PATTERN = Pattern.compile("\\w+://.*");

  public static void main(String ...args) throws InterruptedException {
    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < THREADS; i++) {
      Thread thread = new Thread() {
        @Override public void run() {
          while (true) {
            PATTERN.matcher("https://www.google.com").matches();
          }
        }
      };
      thread.start();
      threads[i] = thread;
    }

    for (Thread thread : threads) {
      thread.join();
    }
  }
}

The thread profile (via YourKit) looks as follows:

I did some JMH profiling with various other concurrent data structures, but it seems like all of them perform slightly worse than the current implementation at lower thread counts (within error bounds anyway). A ConcurrentLinkedQueue or Deque seems to be slightly more performant at higher thread counts. However, even if they were better, many of these more "exotic" classes are not available in GWT anyway, or they require a higher JDK language level (re2j uses 1.6 features now), so I think that precludes them from use.

That said, I was wondering if there were any thoughts how this synchronization bottleneck could be removed, or at least improved a little, in the case that there are many threads accessing the same Pattern instance?

Our current approach is to just use ThreadLocal Patterns in our application code (as we have bounded thread pools). I see that the ThreadLocal approach was adopted in re2j too, but it led to some memory leaks, so it was reverted.
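The application-level ThreadLocal approach mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the reporter's code: the class name is hypothetical, and java.util.regex.Pattern is used as a stand-in so the example runs without extra dependencies (it has no shared Machine pool, so the contention itself is not reproduced). As the report notes, this only makes sense with a bounded set of long-lived threads.

```java
import java.util.regex.Pattern;

public class PerThreadPattern {
    private static final String REGEX = "\\w+://.*";

    // Each thread lazily compiles and then reuses its own Pattern instance.
    private static final ThreadLocal<Pattern> PATTERN =
        ThreadLocal.withInitial(() -> Pattern.compile(REGEX));

    static boolean isUrl(String s) {
        return PATTERN.get().matcher(s).matches();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int n = 0; n < 1_000; n++) {
                    isUrl("https://www.google.com");
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(isUrl("https://www.google.com"));
    }
}
```

The trade-off is one compiled pattern per thread, which costs memory and compile time in exchange for avoiding a shared lock.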

This definitely isn't a dealbreaker for us, and it's not really a bug either, just food for thought.

Re2j has been great! Thanks!

Quoted codepoint is not matched while unquoted is matched

I am using re2j (thx for library!) and use randomly generated strings to test that patterns and logic that I wrote works correctly. I recently found a weird case and I am not sure if it is a bug but feels so because golang regexp (also re2, right?) behavior is different.

Example

import com.google.re2j.Pattern;

public class Main {
    public static void main(String[] args) {
        String source = Character.toString(110781);
        System.out.println("unquoted: " + Pattern.matches(source, source));  // true
        System.out.println("quoted: " + Pattern.matches("\\Q" + source + "\\E", source));  // false
    }
}

package main

import (
	"fmt"
	"regexp"
)

func main() {
	source := string([]rune{110781})
	matched, _ := regexp.MatchString(source, source)
	fmt.Printf("unquoted: %v\n", matched)  // true
	matched, _ = regexp.MatchString(`\Q` + source + `\E`, source)
	fmt.Printf("quoted: %v\n", matched)  // true
}

(I hope I did right with that rune to string conversion)

(link: https://play.golang.org/p/EPbFTmzsZm4)
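One detail that may be relevant here: code point 110781 (U+1B0BD) lies outside the Basic Multilingual Plane, so a Java String stores it as a surrogate pair of two chars. Whether that is the actual cause of the RE2/J result is not established by this report. The sketch below only demonstrates the property and shows what java.util.regex does with the same quoted pattern; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Sketch only: shows properties of the code point and the java.util.regex result,
// not RE2/J behavior. SupplementaryCodePoint is a hypothetical name.
final class SupplementaryCodePoint {
    // Builds the one-code-point string from the report. Character.toChars works on
    // older JDKs, whereas Character.toString(int) requires Java 11.
    static String source() {
        return new String(Character.toChars(110781));
    }

    // Same quoted match as in the report, but evaluated by java.util.regex.
    static boolean jdkQuotedMatch(String s) {
        return Pattern.matches("\\Q" + s + "\\E", s);
    }

    public static void main(String[] args) {
        String s = source();
        System.out.println("supplementary: " + Character.isSupplementaryCodePoint(110781));
        System.out.println("UTF-16 length: " + s.length());
        System.out.println("code points:   " + s.codePointCount(0, s.length()));
        System.out.println("JDK quoted:    " + jdkQuotedMatch(s));
    }
}
```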

Pattern.compile("\\d{16}").namedGroups() throws NullPointerException

Pattern.compile("\\d{16}").namedGroups() throws a NullPointerException instead of returning an empty map.

re2j vs java.util.regex PatternSyntaxException behavior discrepancy

Not a huge issue, but when I tested out re2j, some of my unit tests that expect a PatternSyntaxException failed: re2j did not consider the regex invalid, whereas the same tests pass with java.util.regex.

An example regex that causes the PatternSyntaxException in java.util.regex but not in re2j is:

.*field1=([^|]*)\\|field2=([^|]*).*(
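The java.util.regex half of this report can be reproduced with a few lines. This is a sketch under one assumption: the regex above is read as the contents of a Java string literal, so `\\|` becomes an escaped pipe. The class and method names are hypothetical. Note also that RE2/J reports syntax errors with its own com.google.re2j.PatternSyntaxException class, not the java.util.regex one.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Sketch only: checks the java.util.regex side of the report. SyntaxCheck and
// jdkRejects are hypothetical names.
final class SyntaxCheck {
    // Returns true when java.util.regex refuses to compile the pattern.
    static boolean jdkRejects(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The regex from the report; the trailing "(" opens a group that is never closed.
        String regex = ".*field1=([^|]*)\\|field2=([^|]*).*(";
        System.out.println(jdkRejects(regex));
    }
}
```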

\b does not behave like it does with java.util.regex.Pattern

To behave like the default regex implementation in Java, word boundaries should use \p{L}, not just A-Za-z. I added some tests showing the issue and fixed it in PR #100. (However, I had to disable a large unit test that I don't know how to adapt to this change.)

Exception trying to run benchmarks

I just checked out the source (master, 7e6e57a). When I run ./gradlew benchmarks, I get the following. Looks like somehow it's bringing in a version of Guava that is too new to still have the FutureFallback class?

> Task :benchmarks FAILED
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/FutureFallback
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
        at java.lang.Class.getDeclaredConstructors(Class.java:2020)
        at com.google.inject.spi.InjectionPoint.forConstructorOf(InjectionPoint.java:243)
        at com.google.inject.internal.ConstructorBindingImpl.create(ConstructorBindingImpl.java:96)
        at com.google.inject.internal.InjectorImpl.createUninitializedBinding(InjectorImpl.java:629)
        at com.google.inject.internal.InjectorImpl.createJustInTimeBinding(InjectorImpl.java:845)

        at com.google.inject.internal.InjectorImpl.createJustInTimeBindingRecursive(InjectorImpl.java:772)
An unexpected exception has been thrown by the caliper runner.
        at com.google.inject.internal.InjectorImpl.createJustInTimeBindingRecursive(InjectorImpl.java:761)
Please see https://sites.google.com/site/caliperusers/issues
        at com.google.inject.internal.InjectorImpl.getJustInTimeBinding(InjectorImpl.java:256)
        at com.google.inject.internal.InjectorImpl.getBindingOrThrow(InjectorImpl.java:205)
        at com.google.inject.internal.InjectorImpl.getInternalFactory(InjectorImpl.java:853)
        at com.google.inject.internal.FactoryProxy.notify(FactoryProxy.java:46)
        at com.google.inject.internal.ProcessedBindingData.runCreationListeners(ProcessedBindingData.java:50)
        at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:133)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
        at com.google.inject.internal.InjectorImpl.createChildInjector(InjectorImpl.java:217)
        at com.google.inject.internal.InjectorImpl.createChildInjector(InjectorImpl.java:224)
        at com.google.caliper.runner.CaliperMain.exitlessMain(CaliperMain.java:120)
        at com.google.caliper.runner.CaliperMain.main(CaliperMain.java:81)
        at com.google.caliper.runner.CaliperMain.main(CaliperMain.java:69)
        at com.google.re2j.Benchmarks.main(Benchmarks.java:206)
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.FutureFallback
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 22 more

regexp: (|a)* matches more text than (|a)+ does

From golang/go#46123 (the Go implementation of the algorithm):
@BurntSushi noticed that in Rust regex (rust-lang/regex#779):

(|a)* matching aaa matches the entire text.
(|a)+ matching aaa matches only the empty string at the start of the text.
This is a bit of an unlikely corner case, but it's certainly a significant violation of the rules of regexps for * to match more than +. The correct answer is for (|a)* to match an empty string too, because each iteration prefers the empty string, so even an infinite number of those matches will never fall back to matching an a. (This behavior is what Perl and PCRE do.)

Go's regexp package and RE2 have the same bug, which ultimately derives from a mismatch between the e* picture in my first regexp article and the backtracking prioritization matching in the second article. The implementation of e* needs to be a little different to get the match prioritization correct in the case where e has a preferred empty match.

I found this mismatch between RE2 and Perl in the e* case when running Glenn Fowler's regexp test cases against RE2 long ago, but at the time I thought the mismatch was unavoidable in the proper NFA approach. @BurntSushi's observation that e+ handles the empty match correctly proves otherwise. The correct fix is to leave the compilation of e+ alone and then compile e* as (e+)?. (I will add a note to the article as well.)

This is a tiny bug fix but may of course result in different matches for certain programs, causing problems for programs that rely on the current buggy behavior. That is always the case for any bug fix, of course. Fixing the bug only in Go would also make Go diverge from RE2 and Rust.

@junyer, @BurntSushi, and I discussed both the potential for breakage and the goal of keeping Go, RE2, and Rust aligned. We decided the potential breakage is minimal and outweighed by the benefits of fixing the bug, to better match Perl and PCRE, and to reestablish the properties that e* never matches more than e+ and that e+ and ee* are always the same. We agreed to make the change in all of our implementations.
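The property being restored, that e* never matches more than e+, can be checked mechanically. The sketch below does so with java.util.regex, which is a backtracking engine; it is used here only as a convenient reference and is an assumption of this example, since the text above names Perl and PCRE as the references. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: compares the first match of (|a)* and (|a)+ using java.util.regex.
// StarPlusInvariant and firstMatch are hypothetical names.
final class StarPlusInvariant {
    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String star = firstMatch("(|a)*", "aaa");
        String plus = firstMatch("(|a)+", "aaa");
        System.out.println("star matched " + star.length() + " chars");
        System.out.println("plus matched " + plus.length() + " chars");
        // The invariant from the discussion: e* never matches more than e+.
        System.out.println("invariant holds: " + (star.length() <= plus.length()));
    }
}
```

Swapping the import for another engine's Pattern and Matcher classes gives a quick regression check for the same invariant.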

SHA-1's of Maven and GitHub jars don't match

$ for jar in \
  "https://search.maven.org/remotecontent?filepath=com/google/re2j/re2j/1.2/re2j-1.2.jar" \
  "https://github.com/google/re2j/releases/download/re2j-1.2/re2j-1.2.jar"; do \
  curl -Ls "$jar" | shasum; done                                                                         
4361eed4abe6f84d982cbb26749825f285996dd2  -
499d5e041f962fefd0f245a9325e8125608ebb54  -

After extracting the jars and doing a recursive diff, I get:

$ diff -r maven/META-INF/MANIFEST.MF github/META-INF/MANIFEST.MF
1a2,5
> Archiver-Version: Plexus Archiver
> Built-By: sjr
> Created-By: Apache Maven 3.5.0
> Build-Jdk: 1.8.0_151
Only in github/META-INF: maven
Binary files maven/com/google/re2j/Matcher.class and github/com/google/re2j/Matcher.class differ
Binary files maven/com/google/re2j/Parser$1.class and github/com/google/re2j/Parser$1.class differ
Binary files maven/com/google/re2j/Parser$Pair.class and github/com/google/re2j/Parser$Pair.class differ
Binary files maven/com/google/re2j/Parser$Stack.class and github/com/google/re2j/Parser$Stack.class differ
Binary files maven/com/google/re2j/Parser$StringIterator.class and github/com/google/re2j/Parser$StringIterator.class differ
Binary files maven/com/google/re2j/Parser.class and github/com/google/re2j/Parser.class differ
Binary files maven/com/google/re2j/RE2$1.class and github/com/google/re2j/RE2$1.class differ
Binary files maven/com/google/re2j/RE2$10.class and github/com/google/re2j/RE2$10.class differ
Binary files maven/com/google/re2j/RE2$2.class and github/com/google/re2j/RE2$2.class differ
Binary files maven/com/google/re2j/RE2$3.class and github/com/google/re2j/RE2$3.class differ
Binary files maven/com/google/re2j/RE2$4.class and github/com/google/re2j/RE2$4.class differ
Binary files maven/com/google/re2j/RE2$5.class and github/com/google/re2j/RE2$5.class differ
Binary files maven/com/google/re2j/RE2$6.class and github/com/google/re2j/RE2$6.class differ
Binary files maven/com/google/re2j/RE2$7.class and github/com/google/re2j/RE2$7.class differ
Binary files maven/com/google/re2j/RE2$8.class and github/com/google/re2j/RE2$8.class differ
Binary files maven/com/google/re2j/RE2$9.class and github/com/google/re2j/RE2$9.class differ
Binary files maven/com/google/re2j/RE2.class and github/com/google/re2j/RE2.class differ
Binary files maven/com/google/re2j/Regexp$1.class and github/com/google/re2j/Regexp$1.class differ
Binary files maven/com/google/re2j/Regexp$Op.class and github/com/google/re2j/Regexp$Op.class differ
Binary files maven/com/google/re2j/Regexp.class and github/com/google/re2j/Regexp.class differ

Doing a sample decompilation of corresponding class files:

$ diff <(javap maven/com/google/re2j/Matcher.class) <(javap github/com/google/re2j/Matcher.class)
9a10
>   public int start(java.lang.String);
10a12
>   public int end(java.lang.String);
12a15
>   public java.lang.String group(java.lang.String);

So the GitHub jar has some APIs that the Maven jar doesn't. I also confirmed this via the Maven source jar for 1.2.

Hopefully this is just a hiccup in the release process (I notice the prior release was from 2015). Can someone double-check my work, verify that nothing serious is going on here, and possibly re-publish one or both artifacts? Thanks!

Unclear licensing

I would like to package RE2/J for Debian Linux, but it is not clear whether or not it is distributable.
The license file acknowledges the original work by the Go authors, but it does not provide any information on the actual license/copyright of the Java port.
In addition, some files, such as java/com/google/re2j/Matcher.java, do not even refer to a license, only to a copyright.
I am afraid the packaging cannot go any further until this is clarified.

Weird incorrect match for named group

I am trying to grab two numbers followed by an optional space followed by another optional number or one optional letter. No other letter afterwards. This is the test code:

@Test
public void test() {
  com.google.re2j.Pattern p1 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2} ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
  com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2}(?U) ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
  String input = "22 bored";
  com.google.re2j.Matcher m1 = p1.matcher(input);
  com.google.re2j.Matcher m2 = p2.matcher(input);
  while (m1.find()) {
    System.out.println(m1.group("elem")); // 22 b
  }
  while (m2.find()) {
    System.out.println(m2.group("elem")); // 22
  }
}

Both should print 22; instead, the first one (without the non-greedy flag) prints 22 b.
Works in Go: https://play.golang.org/p/Q0gCyLD013V
Regex: https://regex101.com/r/lF9aG7/6

Matcher.find throws StringIndexOutOfBoundsException when input truncated

When Matcher is initialized with a mutable CharSequence, truncating the sequence can result in an exception.

    Pattern p = Pattern.compile("b(an)*(.)");
    StringBuilder b = new StringBuilder("by, band, banana");
    Matcher m = p.matcher(b);
    assertTrue(m.find(0));
    int start = b.indexOf("ban");
    b.replace(b.indexOf("ban"), start + 3, "b");  // truncate string to "by, bd, banana"
    assertTrue(m.find(b.indexOf("ban")));
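Until this is addressed, one possible workaround is to match against an immutable snapshot of the sequence. The sketch below shows the idea with java.util.regex as a stand-in for the RE2/J classes (an assumption made so that the example runs on a bare JDK); SnapshotInput and matcherOnSnapshot are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for com.google.re2j.
// SnapshotInput and matcherOnSnapshot are hypothetical names.
final class SnapshotInput {
    // Matches against an immutable copy, so later edits to the builder cannot
    // shrink the text underneath the Matcher.
    static Matcher matcherOnSnapshot(Pattern p, CharSequence input) {
        return p.matcher(input.toString());
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("b(an)*(.)");
        StringBuilder b = new StringBuilder("by, band, banana");
        Matcher m = matcherOnSnapshot(p, b);

        int start = b.indexOf("ban");
        b.replace(start, start + 3, "b"); // builder is now "by, bd, banana"

        // The matcher still searches the original, longer text.
        System.out.println(m.find(b.indexOf("ban")));
        // To search the edited text, take a fresh snapshot.
        System.out.println(matcherOnSnapshot(p, b).find());
    }
}
```

The cost is one copy of the input per matcher, and offsets reported by the matcher refer to the snapshot rather than to the edited builder.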

whitespace character class ('\s') matching different characters than java Pattern

The Java Pattern API defines \s as matching [ \t\n\x0B\f\r].

Unfortunately, the current RE2/J implementation (based on RE2 syntax) does not match \x0B (vertical tab) with the \s character class.

Fixing it directly in CharGroup.java would seem straightforward, but this file is generated by the make_perl_groups.pl script, which resolves the classes directly from Perl.

Instead of using make_perl_groups.pl, which resolves classes from Perl, it might be preferable to have a script in Java that resolves the classes similarly (but from Java).
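In the meantime, a pattern can avoid depending on the engine's definition of \s by adding the vertical tab explicitly. The sketch below runs against java.util.regex so that it is self-contained. It assumes, without having been tested here, that RE2/J accepts the same `[\s\x0B]` class, since RE2 syntax documents both two-digit hex escapes and Perl classes inside brackets. The names WhitespaceClass, PORTABLE_WS and isWhitespace are made up.

```java
import java.util.regex.Pattern;

// Sketch only: evaluated with java.util.regex; RE2/J support for this exact
// class is an assumption. All names here are hypothetical.
final class WhitespaceClass {
    // \s widened explicitly with vertical tab (U+000B), so the result does not
    // depend on which characters the engine includes in \s.
    static final String PORTABLE_WS = "[\\s\\x0B]";

    static boolean isWhitespace(String oneChar) {
        return Pattern.matches(PORTABLE_WS, oneChar);
    }

    public static void main(String[] args) {
        System.out.println(isWhitespace("\u000B"));
        System.out.println(isWhitespace(" "));
        System.out.println(isWhitespace("x"));
    }
}
```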

RFC on unicode tables management

Hello there :D

Thanks for this great lib. I ran into some issues with recent Unicode characters that are not covered by re2j's Unicode tables.

Is there a plan to make these tables updatable?

I ran some tests with icu4j 71.1 (Unicode 14.0) and managed to fix some of the issues I had with the out-of-date Unicode range table definitions.

// Generated at 2022-09-29T11:43:59.963000520Z by Java 17.0.2 using Unicode version 14.0.0.0.

I ran the benchmarks, but I have no results from a previous re2j version to compare against, so I can't tell whether these numbers are bad or not.

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                                 (binary)  (impl)  (regex)  (repeats)  Mode  Cnt      Score      Error  Units
BenchmarkBacktrack.matched                                     N/A     JDK      N/A          5  avgt    5      0.409 ±    0.010  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         10  avgt    5     11.181 ±    1.372  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         15  avgt    5    421.301 ±    5.390  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         20  avgt    5  14601.885 ±  585.185  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A          5  avgt    5      0.725 ±    0.162  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         10  avgt    5      2.198 ±    0.174  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         15  avgt    5      4.858 ±    0.359  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         20  avgt    5      8.456 ±    0.091  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true     JDK      N/A        N/A  avgt    5    206.578 ±    4.268  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true    RE2J      N/A        N/A  avgt    5    594.817 ±    3.601  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false     JDK      N/A        N/A  avgt    5    208.050 ±    5.570  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false    RE2J      N/A        N/A  avgt    5    622.479 ±   33.410  us/op
BenchmarkCompile.compile                                       N/A     JDK     DATE        N/A  avgt    5    606.043 ±   43.850  ns/op
BenchmarkCompile.compile                                       N/A     JDK    EMAIL        N/A  avgt    5    155.496 ±   13.520  ns/op
BenchmarkCompile.compile                                       N/A     JDK    PHONE        N/A  avgt    5    295.075 ±   15.263  ns/op
BenchmarkCompile.compile                                       N/A     JDK   RANDOM        N/A  avgt    5   1087.170 ±  115.227  ns/op
BenchmarkCompile.compile                                       N/A     JDK   SOCIAL        N/A  avgt    5    251.806 ±   28.439  ns/op
BenchmarkCompile.compile                                       N/A     JDK   STATES        N/A  avgt    5   1120.005 ±   80.479  ns/op
BenchmarkCompile.compile                                       N/A    RE2J     DATE        N/A  avgt    5   3081.459 ±  438.926  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    EMAIL        N/A  avgt    5   1056.468 ±  146.337  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    PHONE        N/A  avgt    5   1683.852 ±  173.123  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   RANDOM        N/A  avgt    5   5299.210 ±  735.038  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   SOCIAL        N/A  avgt    5   1167.235 ±   28.007  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   STATES        N/A  avgt    5   7610.245 ± 1019.406  ns/op
BenchmarkFullMatch.matched                                    true     JDK      N/A        N/A  avgt    5     77.149 ±    3.385  ns/op
BenchmarkFullMatch.matched                                    true    RE2J      N/A        N/A  avgt    5    478.138 ±   48.828  ns/op
BenchmarkFullMatch.matched                                   false     JDK      N/A        N/A  avgt    5     68.403 ±    3.378  ns/op
BenchmarkFullMatch.matched                                   false    RE2J      N/A        N/A  avgt    5    462.312 ±   86.473  ns/op
BenchmarkFullMatch.notMatched                                 true     JDK      N/A        N/A  avgt    5     57.041 ±    4.191  ns/op
BenchmarkFullMatch.notMatched                                 true    RE2J      N/A        N/A  avgt    5    411.035 ±   39.113  ns/op
BenchmarkFullMatch.notMatched                                false     JDK      N/A        N/A  avgt    5     48.511 ±    3.818  ns/op
BenchmarkFullMatch.notMatched                                false    RE2J      N/A        N/A  avgt    5    405.113 ±   54.661  ns/op
BenchmarkSplit.benchmarkSplit                                  N/A     JDK      N/A        N/A  avgt    5     10.665 ±    0.301  us/op
BenchmarkSplit.benchmarkSplit                                  N/A    RE2J      N/A        N/A  avgt    5     40.286 ±    6.797  us/op
BenchmarkSubMatch.findPhoneNumbers                            true     JDK      N/A        N/A  avgt    5      3.868 ±    0.538  ms/op
BenchmarkSubMatch.findPhoneNumbers                            true    RE2J      N/A        N/A  avgt    5     11.425 ±    0.257  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false     JDK      N/A        N/A  avgt    5      3.164 ±    0.334  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false    RE2J      N/A        N/A  avgt    5     11.907 ±    0.295  ms/op

I read https://groups.google.com/g/re2j-discuss/c/9XkVsnxngjc and I wonder whether there is a plan to move off Unicode 6.0.

I could even try to help, if you need it or already have plans for it 👍

NB: icu4j 71.1 is the latest official release, but the next version, 72 (Unicode 15.0), is planned for 8 October.

Let me know what you think,

Kidi

StringIndexOutOfBoundsException in parseEscape

jshell> com.google.re2j.Pattern.compile("\\x")
|  Exception com.google.re2j.PatternSyntaxException: error parsing regexp: invalid escape sequence: `\x`
|        at Parser.parseEscape (Parser.java:1440)
|        at Parser.parseInternal (Parser.java:962)
|        at Parser.parse (Parser.java:788)
|        at RE2.compileImpl (RE2.java:184)
|        at Pattern.compile (Pattern.java:136)
|        at Pattern.compile (Pattern.java:96)
|        at (#1:1)

jshell> com.google.re2j.Pattern.compile("\\xv")
|  Exception java.lang.StringIndexOutOfBoundsException: index 3,length 3
|        at String.checkIndex (String.java:3278)
|        at String.codePointAt (String.java:723)
|        at Parser$StringIterator.pop (Parser.java:751)
|        at Parser.parseEscape (Parser.java:1414)
|        at Parser.parseInternal (Parser.java:962)
|        at Parser.parse (Parser.java:788)
|        at RE2.compileImpl (RE2.java:184)
|        at Pattern.compile (Pattern.java:136)
|        at Pattern.compile (Pattern.java:96)
|        at (#2:1)

Unused Field

re2j/java/com/google/re2j/Parser.java

Line 31 in 66840ce

private static final String ERR_INVALID_CHAR_CLASS = "invalid character class";

This field is never used.

\C not implemented

After #130, binary mode matching is now supported. However, \C is not a supported metacharacter, which limits its usefulness.

What about a new release?

Incorrect behavior for Pattern::split

How to reproduce

jshell> com.google.re2j.Pattern.compile("x*").split("foo")
$1 ==> String[4] { "", "f", "o", "o" }

jshell> java.util.regex.Pattern.compile("x*").split("foo")
$2 ==> String[3] { "f", "o", "o" }

Versions

java 11.0.10
re2j 1.5

Should I try for a PR?
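Until the behavior is aligned, callers who need the java.util.regex result could post-process the array. The following is a rough sketch, not a proposed patch: it drops a leading empty string only when the pattern has a zero-width match at index 0, which is the case where Java 8 and later omit it. java.util.regex types are used as stand-ins for the RE2/J ones, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for com.google.re2j.
// SplitCompat and its methods are hypothetical names.
final class SplitCompat {
    // Drops a leading "" only when the pattern matched zero characters at index 0.
    // A single-element result is left alone, mirroring split on an empty input.
    static String[] dropLeadingEmpty(String[] parts, boolean zeroWidthMatchAtStart) {
        if (zeroWidthMatchAtStart && parts.length > 1 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    static boolean zeroWidthMatchAtStart(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() && m.start() == 0 && m.end() == 0;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("x*");
        String[] reported = {"", "f", "o", "o"}; // the RE2/J 1.5 result quoted above
        String[] fixed = dropLeadingEmpty(reported, zeroWidthMatchAtStart(p, "foo"));
        System.out.println(Arrays.toString(fixed));
    }
}
```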

how to get matched names

For a pattern like (?P<volume>\d+)sf(d|e), how can I get the captured text by the group name "volume"?
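A minimal sketch of a lookup by group name follows. It uses java.util.regex so that it runs without extra dependencies, which means the group is spelled `(?<volume>...)` here; RE2/J uses the `(?P<volume>...)` spelling shown above, and the javap output quoted in an earlier report on this page lists a `group(java.lang.String)` method on the GitHub build of its Matcher. NamedGroupLookup and volumeOf are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for com.google.re2j.
// NamedGroupLookup and volumeOf are hypothetical names.
final class NamedGroupLookup {
    // java.util.regex spells the group (?<volume>...); RE2/J uses (?P<volume>...).
    private static final Pattern VOLUME = Pattern.compile("(?<volume>\\d+)sf(d|e)");

    // Returns the text captured by the "volume" group, or null when there is no match.
    static String volumeOf(String input) {
        Matcher m = VOLUME.matcher(input);
        return m.find() ? m.group("volume") : null;
    }

    public static void main(String[] args) {
        System.out.println(volumeOf("12sfd"));
        System.out.println(volumeOf("no match here"));
    }
}
```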

Publish API documentation

Hi. Is any documentation published for the Java library API?

C++ library has partial documentation on its wiki https://github.com/google/re2/wiki/CplusplusAPI
Go regexp package has comprehensive documentation at https://pkg.go.dev/regexp

Is re2j able to combine multiple regex into one?

Basically, if there is a large number of regular expressions, matching target strings against each of them can be quite slow. So I am looking for a way to "compile" many regular expressions into one.

https://github.com/fulmicoton/multiregexp provides a solution to build a large DFA based on each DFA built from each regex. However, it has many limitations - only a subset of regex is supported.

Is there a way to achieve what I want using re2j? I tried hard but could not find related info.

Thanks in advance.
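One generic technique, independent of any particular library, is to join the expressions with alternation and wrap each one in a named group so that the matching alternative can be identified. The sketch below shows this with java.util.regex; it is not a statement about what RE2/J offers, and with RE2/J the group syntax would be `(?P<name>...)`. The class CombinedPatterns, the method whichMatches and the group names p0, p1, ... are made up, and the approach assumes the input expressions contain no named groups or backreferences of their own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex is used for illustration.
// CombinedPatterns and whichMatches are hypothetical names.
final class CombinedPatterns {
    private final Pattern combined;
    private final int count;

    CombinedPatterns(List<String> regexes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regexes.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            // One named group per alternative records which regex matched.
            sb.append("(?<p").append(i).append('>').append(regexes.get(i)).append(')');
        }
        this.combined = Pattern.compile(sb.toString());
        this.count = regexes.size();
    }

    // Returns the index of the first regex that matches the whole input, or -1.
    int whichMatches(String input) {
        Matcher m = combined.matcher(input);
        if (!m.matches()) {
            return -1;
        }
        for (int i = 0; i < count; i++) {
            if (m.group("p" + i) != null) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        CombinedPatterns c = new CombinedPatterns(Arrays.asList("\\d+", "[a-z]+"));
        System.out.println(c.whichMatches("42"));
        System.out.println(c.whichMatches("abc"));
        System.out.println(c.whichMatches("4a"));
    }
}
```

With a backtracking engine this does not by itself make matching faster; the appeal with an automaton-based engine is that all alternatives are explored in a single pass over the input.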

GPG verify

Hi,

Could you please add the PGP public key that can be used to verify this artifact on Maven Central to a GPG key server or to this repo, so we can use it for verification?

Kind regards

Fabian

LICENSE selection is mildly unclear

Hello awesome people!

I got roped into looking at this library because it is a transitive dependency of another dependency I'm using, and I wanted to describe a bit of the experience of figuring out exactly what license applies to this project:

https://github.com/google/re2j/blob/master/build.gradle#L116-L119

The Go License, as best I can tell, is not a known SPDX identifier, which makes things a bit unclear at first glance. Once you dig a bit, it becomes clear it's a BSD3-style license (only one word really changed from the BSD3 license, HOLDER -> OWNER; screencap attached).

At any rate, the LICENSE file itself is on this repo, and then you link to an external LICENSE, the Go License (which is more or less BSD3). It is a little difficult right from the get go figuring out which applies, etc... yada yada yada, I'm not a lawyer, no interest in becoming one :)

It would seem like at least using an SPDX identifier would be a good step, or seeing if there are truly any differences to the BSD3 license (English is tough and HOLDER -> OWNER is probably a big change if you ask the right person).