gliwka / hyperscan-java

Match tens of thousands of regular expressions within milliseconds - Java bindings for Intel's hyperscan 5

License: BSD 3-Clause "New" or "Revised" License

hyperscan-java's Introduction

hyperscan-java

hyperscan is a high-performance multiple regex matching library.

It uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.

This project is a third-party developed wrapper for the hyperscan project to enable developers to integrate hyperscan in their java (JVM) based projects.

Because the latest hyperscan release is now under a proprietary license and ARM-support has never been integrated, this project utilizes the vectorscan fork.

Add it to your project

This project is available on maven central.

The version number consists of two parts (e.g. 5.4.11-3.0.0). The first part specifies the vectorscan version (5.4.11), the second part the version of this library, which follows semantic versioning (3.0.0).

Maven

<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>5.4.11-3.0.0</version>
</dependency>

Gradle

implementation group: 'com.gliwka.hyperscan', name: 'hyperscan', version: '5.4.11-3.0.0'

sbt

libraryDependencies += "com.gliwka.hyperscan" % "hyperscan" % "5.4.11-3.0.0"

Usage

If you want to utilize the whole power of the Java Regex API / full PCRE syntax and are fine with sacrificing some performance, use the PatternFilter. It takes a large list of java.util.regex.Pattern instances and uses hyperscan to filter it down to the few Patterns that have a high probability of matching. You can then use the regular Java API to confirm those matches. This is similar to chimera, only using the standard Java API instead of libpcre.

If you need the highest performance, you should use the hyperscan API directly. Be aware that only a subset of the PCRE syntax is supported. Missing features include backreferences, capture groups and backtracking verbs. The matching behaviour is also a little bit different, see the semantics chapter of the hyperscan docs.

Examples

Use of the PatternFilter

List<Pattern> patterns = asList(
        Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE),
        Pattern.compile("The color is (blue|red|orange)")
        // and thousands more
);

//not thread-safe, create per thread
PatternFilter filter = new PatternFilter(patterns);

//this list now only contains the probably matching patterns, in this case the first one
List<Matcher> matchers = filter.filter("The number is 7 the NUMber is 27");

//now we use the regular java regex api to check for matches - this is not hyperscan specific
for(Matcher matcher : matchers) {
    while (matcher.find()) {
        // will print 7 and 27
        System.out.println(matcher.group(1));
    }
}
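The two-stage idea behind the PatternFilter (a cheap prefilter, then confirmation with the regular Java API) can be sketched with the standard library alone. This is an illustration under stated assumptions, not how the PatternFilter is implemented: the class `TwoStageFilterSketch` and its caller-supplied literals are hypothetical, and a plain substring check stands in for hyperscan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: the real PatternFilter uses hyperscan as the
// prefilter. Here a caller-supplied literal plays that role.
final class TwoStageFilterSketch {
    // Maps each full pattern to a lower-case literal that any match must contain.
    private final Map<Pattern, String> requiredLiteral = new LinkedHashMap<>();

    void add(Pattern pattern, String literal) {
        requiredLiteral.put(pattern, literal.toLowerCase(Locale.ROOT));
    }

    // Stage 1: cheap check that discards patterns which cannot match.
    List<Matcher> filter(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        List<Matcher> candidates = new ArrayList<>();
        for (Map.Entry<Pattern, String> entry : requiredLiteral.entrySet()) {
            if (lower.contains(entry.getValue())) {
                candidates.add(entry.getKey().matcher(input));
            }
        }
        return candidates;
    }

    // Stage 2: confirm the candidates with the regular Java regex API.
    static List<String> firstGroups(List<Matcher> candidates) {
        List<String> groups = new ArrayList<>();
        for (Matcher matcher : candidates) {
            while (matcher.find()) {
                groups.add(matcher.group(1));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        TwoStageFilterSketch filter = new TwoStageFilterSketch();
        filter.add(Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE), "the number is ");
        filter.add(Pattern.compile("The color is (blue|red|orange)"), "the color is ");
        // Only the first pattern survives stage 1; stage 2 yields [7, 27].
        System.out.println(firstGroups(filter.filter("The number is 7 the NUMber is 27")));
    }
}
```

The prefilter may let through a pattern that does not actually match, which is why stage 2 is always required.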

Direct use of hyperscan

import com.gliwka.hyperscan.wrapper.*;

...

//we define a list containing all of our expressions
LinkedList<Expression> expressions = new LinkedList<Expression>();

//the first argument in the constructor is the regular pattern, the latter one is an expression flag
//make sure you read the original hyperscan documentation to learn more about flags
//or browse the ExpressionFlag.java in this repo.
expressions.add(new Expression("[0-9]{5}", EnumSet.of(ExpressionFlag.SOM_LEFTMOST)));
expressions.add(new Expression("Test", ExpressionFlag.CASELESS));


//we precompile the expression into a database.
//you can compile single expression instances or lists of expressions

//since we're interacting with native handles always use try-with-resources or call the close method after use
try(Database db = Database.compile(expressions)) {
    //initialize scanner - one scanner per thread!
    //same here, always use try-with-resources or call the close method after use
    try(Scanner scanner = new Scanner())
    {
        //allocate scratch space matching the passed database
        scanner.allocScratch(db);


        //provide the database and the input string
        //returns a list with matches
        //synchronized method, only one execution at a time (use more scanner instances for multithreading)
        List<Match> matches = scanner.scan(db, "12345 test string");

        //matches always contain the expression causing the match and the end position of the match
        //the start position and the matched string itself are only part of a match if the
        //SOM_LEFTMOST flag is set (for more details refer to the original hyperscan documentation)
    }

    // Save the database to the file system for later use
    try(OutputStream out = new FileOutputStream("db")) {
        db.save(out);
    }

    // Later, load the database back in. This is useful for large databases that take a long time to compile.
    // You can compile them offline, save them to a file, and then quickly load them in at runtime.
    // The load has to happen on the same type of platform as the save.
    try (InputStream in = new FileInputStream("db");
         Database loadedDb = Database.load(in)) {
        // Use the loadedDb as before.
    }
}
catch (CompileErrorException ce) {
    //gets thrown during compile in case something with the expression is wrong
    //you can retrieve the expression causing the exception like this:
    Expression failedExpression = ce.getFailedExpression();
}
catch(IOException ie) {
  //IO during serializing / deserializing failed
}

Native libraries

This library ships with pre-compiled vectorscan binaries for Linux (glibc >=2.17) and macOS for x86_64 and arm64 CPUs.

Windows is no longer supported (last supported version is 5.4.0-2.0.0) due to vectorscan dropping Windows support.

You can find the repository with the native libraries here

Documentation

The developer reference explains vectorscan. The javadoc is located here.

Changelog

See here.

Contributing

Feel free to raise issues or submit a pull request.

Credits

Shoutout to @eliaslevy, @krzysztofzienkiewicz, @swapnilnawale, @mmimica and @Jiar for all the great contributions.

Thanks to Intel for opensourcing hyperscan and @VectorCamp for actively maintaining the fork!

License

BSD 3-Clause License

hyperscan-java's Issues

Compile new binaries for hyperscan v4.6.0

Windows builds of native library based on Visual Studio

Currently, Windows developers working on projects that include hyperscan-java as a dependency cannot run the whole project on their local machine because Windows support is missing. It would be great to make development with hyperscan-java, and with projects based on it, a cross-platform experience.

To achieve this, the following things need to be implemented:

Compile hyperscan as a dynamic library on Windows
Use Travis as CI environment to build new hyperscan libraries for windows automatically and test hyperscan on windows.

Since there is no "fat library" support from the hyperscan team for windows, the windows library would be compiled with SSE3 support only (no AVX-512 or other advanced instruction sets).

Bundle precompiled hyperscan library for linux

Compile 64-bit "fat" runtime and add it to the appropriate directory so jna will load and extract it (see here)

Implement extended expression syntax using hs_compile_ext_multi()

Trying to compile a database with an empty list should throw an exception

A JMH Benchmark of Replacing Char

Hi,

Following the conversation in #1, I have implemented a benchmark using sbt-jmh.

It originated from a string-replacement use case in my Spark app, where I found that hyperscan-java is no faster than Java Regex. The use case is as simple as the following:

"abc123".replaceAll("""\d""", "d")
// should return "abcddd"

To find out what is slowing down hyperscan-java, here's the result of sbt-jmh, on a benchmark simulating my use case.

[info] Benchmark                        Mode  Cnt    Score   Error   Units
[info] ReplaceCharBenchmark.HyperScan  thrpt   20    3.368 ± 0.225  ops/ms
[info] ReplaceCharBenchmark.Native     thrpt   20  560.507 ± 4.527  ops/ms

It clearly shows the native Java Regex outperformed HyperScan by a wide margin.

My benchmark code can be found on my fork of hyperscan-java. Please refer to my README.md for how to run the benchmark.

Separate java library and precompiled shared libraries

Separate shared libraries into their own maven artifacts.
One artifact for each os/architecture/instruction set combination

Use https://github.com/trustin/os-maven-plugin and classifiers.

This allows us to support new instruction sets (like AVX512, see #36), which need a newer toolchain and thus a newer GLIBC, while still maintaining compatibility with older Linux systems.

Add benchmark to detect performance regressions

utf8 match

Hi, I'm new to hyperscan.

Does hyperscan support matching Chinese characters?

I use hyperscan-java and set the expression flag to UTF8, but it doesn't return any match.

Example:
expression: 测试
scan: 这是一个测试

Thanks.

Add 32-bit Linux support

Crosscompile ~~for Windows and~~ 32-bit Linux support.

database looking is lower and lower when thread num increases

Create maven release containing hyperscan v4.7.0

Change version in pom.xml after travis finished building the binaries

Hyperscan 5.3

Could you please update hs.dll to Hyperscan 5.3 version?

Evaluate enabling AVX512-support for fat library

See https://github.com/intel/hyperscan/blob/master/doc/dev-reference/getting_started.rst

Compile new binaries for HyperScan v4.7.0

Improve error handling

Remove throwable from method signatures. Use unchecked exceptions for everything but compile exceptions and io exceptions.

Specify processor capabilities at db compile time

Allow the processor capabilities to be chosen explicitly at db compilation time, instead of inferring them from the db compilation host as is done currently.
This would avoid the "compiled for a different platform" warning that causes a runtime performance penalty.
It would also let us work around the SIGSEGV issue (see: #71).

Check if direct mapping can be applied

See https://github.com/java-native-access/jna/blob/master/www/DirectMapping.md

Update to Hyperscan 5.2

get the index id of Expressions

0x00 getMatchedExpressionId

For example, a getMatchedExpressionId function,

like hyperscan python:

def done(eid, start, end, flags, context):
    """
    On a found result:
        -Expression ID
        -Start Offset
        -End Offset
        -Flags (Currently Unused)
        -Context object passed to scan
    """
    print(eid)

db.scan(log_strings, match_event_handler=done, context={'offset': offset, "rawlog": rawlog})

Error due to reading of structs after call to free

Bug report by @eliaslevy in #28:

A bit more on the error I was seeing. I've reproduced it on my MBP and narrowed it down to a call I make to Expression.validate on start up to check the expressions. It only seems to trigger if there are sufficient concurrent threads, each of which run the same code. In my case I need at least four threads for it to trigger, and it won't trigger if I run the code in the debugger. The error goes away if I synchronize the call to Expression.validate.

The error report generated by the JVM shows two threads calling into hs_free_compile_error from validate with the top of the stack being:
Stack: [0x00007000055c4000,0x00007000056c4000],  sp=0x00007000056c2440,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libsystem_c.dylib+0x1b52]  strlen+0x12
C  [jna4845073464406839331.tmp+0x4073]  Java_com_sun_jna_Native_getStringBytes+0x23
j  com.sun.jna.Native.getStringBytes(Lcom/sun/jna/Pointer;JJ)[B+0
j  com.sun.jna.Native.getString(Lcom/sun/jna/Pointer;JLjava/lang/String;)Ljava/lang/String;+6
j  com.sun.jna.Pointer.getString(JLjava/lang/String;)Ljava/lang/String;+3
j  com.sun.jna.Structure.readField(Lcom/sun/jna/Structure$StructField;)Ljava/lang/Object;+134
j  com.sun.jna.Structure.read()V+100
j  com.sun.jna.Structure.autoRead()V+8
j  com.sun.jna.Function.invoke(Ljava/lang/reflect/Method;[Ljava/lang/Class;Ljava/lang/Class;[Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+395
j  com.sun.jna.Library$Handler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+344
j  com.sun.proxy.$Proxy3.hs_free_compile_error(Lcom/gliwka/hyperscan/jna/CompileErrorStruct;)I+24
j  com.gliwka.hyperscan.wrapper.Expression.validate()Lcom/gliwka/hyperscan/wrapper/Expression$ValidationResult;+77
Looks like it is dying in the call to strlen from JNA's Java_com_sun_jna_Native_getStringBytes.

I am not familiar with JNA, but I believe the issue is that CompileErrorStruct is created by passing it a native struct that has been allocated by hyperscan. The call to hs_free_compile_error deallocates the struct. But JNA's Structure has a feature enabled by default that will read the contents of the native struct into the Java Structure after every native call. So after calling hs_free_compile_error, JNA is trying to read the fields of the struct that hyperscan just freed.

In the single threaded case this is usually not a problem, as nothing will have modified that memory. But in a multi-threaded environment, the memory could be allocated and modified by another thread, leading to an error when the original thread tried to read the fields. Something changed between hyperscan 4.5.2 and 4.6.0 to make this more likely, but the problem was present before we upgraded to 4.6.0.

If I modify hyperscan-java to call errorStruct.setAutoRead(false); before every call to hs_free_compile_error, thus disabling read of the struct after the call to hs_free_compile_error, the error goes away.

Implement stream mode

Check for null values as expression strings

Passing null values as expression currently results in a JVM crash.

Throw NullPointerException if expression string is null.
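A minimal sketch of such a check, assuming a simplified expression holder. `CheckedExpression` is a hypothetical class written for illustration and is not the library's `Expression`; the point is only that the null check runs in Java before any native call could see the value.

```java
import java.util.Objects;

// Hypothetical, simplified expression holder (not the library's Expression class).
final class CheckedExpression {
    private final String expression;

    CheckedExpression(String expression) {
        // Fail fast in Java instead of handing a null pointer to native code.
        this.expression = Objects.requireNonNull(expression, "expression must not be null");
    }

    String getExpression() {
        return expression;
    }

    public static void main(String[] args) {
        try {
            new CheckedExpression(null);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```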

Add Travis CI

Implement new APIs (Hamming distance, Levenshtein distance)

Change links pointing to the intel documentation to the new url

Old URL: https://01org.github.io/hyperscan/
New URL: https://intel.github.io/hyperscan/

Related hyperscan commit:
intel/hyperscan@aff7242

Replace jna bindings with jni-javacpp

Migrate the wrapper to use the javacpp bindings instead of our own JNA bindings: https://github.com/bytedeco/javacpp-presets/tree/master/hyperscan

We still keep our own docker based approach for linux to target centos 6 (#69).

This will improve the performance by minimizing the overhead calling to C from java.

Array Out of Bound Exception (String to Byte Position Mapping)

The current implementation of the string to byte position mapping, in combination with issue #6, causes an Array Out of Bound Exception.

Support simple hasMatch operation for faster processing

Wondering if in the Scanner class we could simply have a matches method that returns boolean depending on whether or not there are matches. It would be similar to scan, but not bother building the List of match expressions to return.

hyper scan out of memory bug

I create a new Scanner for every call, but memory usage grows step by step until the process finally runs out of memory. How do I close a scanner? Thanks.

The code is as follows:

public static List<Match> match(Database db, String query) throws Throwable {
    // a new Scanner is created on every call and never closed
    Scanner scanner = new Scanner();
    scanner.allocScratch(db);
    List<Match> results = scanner.scan(db, query);
    return results;
}

public static EnumSet<ExpressionFlag> expressionFlags = EnumSet
        .of(ExpressionFlag.CASELESS, ExpressionFlag.UTF8, ExpressionFlag.SINGLEMATCH,
                ExpressionFlag.MULTILINE);

public static void main(String[] args) throws Throwable {
    LinkedList<Expression> expressions = new LinkedList<>();
    expressions.add(new Expression("song", expressionFlags));
    Database db = Database.compile(expressions);
    while (true) {
        match(db, "xxx");
    }
}
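The README advises using try-with-resources or calling close on anything that wraps a native handle. The pattern can be sketched as follows, under the assumption that the scanner is `AutoCloseable`. `FakeScanner` is a hypothetical stand-in that only counts live instances; no native memory is involved.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ScannerLifecycleSketch {
    // Hypothetical stand-in for a scanner that owns native scratch memory.
    static final class FakeScanner implements AutoCloseable {
        static final AtomicInteger LIVE = new AtomicInteger();

        FakeScanner() {
            LIVE.incrementAndGet(); // "allocate"
        }

        int scan(String input) {
            return input.length(); // placeholder for a real scan
        }

        @Override
        public void close() {
            LIVE.decrementAndGet(); // "free"
        }
    }

    // The scanner is closed when the try block exits, even if scan throws.
    static int scanOnce(String input) {
        try (FakeScanner scanner = new FakeScanner()) {
            return scanner.scan(input);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            scanOnce("xxx");
        }
        System.out.println("live scanners: " + FakeScanner.LIVE.get());
    }
}
```

Creating one scanner per thread and reusing it avoids the repeated allocation altogether; the sketch only shows how to release a scanner that is created per call.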

Upgrade to Java 11 (current LTS)

Implement git flow and snapshot releases

Serialization and deserialization of databases

Implement serialization and deserialization to allow for advanced use cases (see https://intel.github.io/hyperscan/dev-reference/serialization.html)

SIGSEGV during scan execution on Scanner class

[This issue might be related to #8 ]

We observe a segmentation fault when running an engine (a collection of hyperscan compiled regexes) on a specific host.
SIGSEGV can manifest on the same host that performed the db compilation so it doesn't appear to be a matter of mismatched architectures.
Full log: hs_err_pid26287.log

Issues with returned match (utf-8)

The returned match specifies the byte position, which can differ from the string position in the case of UTF-8. It's necessary to replace substring with code that extracts the string at the exact byte positions.
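One way to translate byte offsets into string positions is a lookup table from UTF-8 byte offset to Java char index. The sketch below is an assumption-laden illustration, not the library's implementation: `Utf8Offsets` and its methods are hypothetical names. The table has one extra entry so that an exclusive end offset, including the `from = 0, to = 0` case of an empty match, can be looked up without subtracting one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, written for illustration only.
final class Utf8Offsets {
    // Number of bytes String.getBytes(UTF_8) produces for one code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        // An unpaired surrogate is replaced by '?' (one byte) when encoding.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return 1;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Entry b is the char index at which the code point containing byte
    // offset b starts; the final entry (b == byte length) is input.length().
    static int[] byteToCharIndex(String input) {
        int byteLength = input.getBytes(StandardCharsets.UTF_8).length;
        int[] mapping = new int[byteLength + 1];
        int bytePos = 0;
        int charPos = 0;
        while (charPos < input.length()) {
            int codePoint = input.codePointAt(charPos);
            for (int i = utf8Length(codePoint); i > 0; i--) {
                mapping[bytePos++] = charPos;
            }
            charPos += Character.charCount(codePoint);
        }
        mapping[byteLength] = input.length();
        return mapping;
    }

    // Extracts the text for the byte range [fromByte, toByte).
    static String slice(String input, int[] mapping, int fromByte, int toByte) {
        return input.substring(mapping[fromByte], mapping[toByte]);
    }

    public static void main(String[] args) {
        String input = "h\u00e9llo"; // the second character takes two bytes in UTF-8
        int[] mapping = byteToCharIndex(input);
        System.out.println(slice(input, mapping, 3, 6)); // prints llo
    }
}
```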

Add syntactic sugar to Expression constructors

Add constructors to create expressions with only a single ExpressionFlag.

End of match position index is off by one

Add simple api which encapsulates the scanner logic

Embedded Linux shared library won't work with more conservative Linux distros

Conservative Linux distros, such as Amazon Linux, have older versions of gcc/libstdc++. That leads to an error when JNA tries to load the embedded shared library (java.lang.UnsatisfiedLinkError: /usr/lib64/libstdc++.so.6: version "GLIBCXX_3.4.20" not found), which has been compiled with a newer ABI. As libstdc++ is forward compatible, using versioned symbols, to maximize compatibility, it would be best for the embedded libhs.so to be compiled using an older version of gcc/libstdc++. Then it could be used in older and newer systems.

Mark junit as test dependency

See comment in #9

BTW, any chance of adding test to the JUnit dependency and pinning it to a version that is not RELEASE? Maven 3 no longer recognizes that tag.

Split binaries into separate artifacts

We currently ship a fat jar with binaries for linux and OS X combined.

It would be great to separate those into their own artifacts (see netty-tcnative).

Occasional SIGSEGV during Finalizer execution on Scanner class

In some cases (usually under high memory pressure), a Java application using hyperscan-java can segfault. The stack trace indicates that the problem is related to the finalize() method on the Scanner class.

Built-in Linux hyperscan shared library fails in Amazon Linux AMI

I am not sure what it is about the hyperscan Linux shared library bundled with the jar, but the update in 0.4.10 broke my app in EC2. Since then hs_alloc_scratch returns HS_DB_VERSION_ERROR even when passed a database created right before the call. I built my own libhs.so and repackaged the jar and that fixed the issue.

The Linux version is:

Linux version 4.4.11-23.53.amzn1.x86_64 (mockbuild@gobi-build-60009) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Wed Jun 1 22:22:50 UTC 2016

Publish to maven central

Figure out AVX/non-AVX situation #37
Update pom.xml to adhere to the standards
Add script to build, sign and push artefacts to repository
Raise ticket with Sonatype to create repository on OSSRH (ticket)
Notify folks at Sonatype as soon as first artefact has been promoted to 'Release' repository
Update README

Create travis build matrix for Linux and OS X to automate hyperscan builds

ToDo:

Split the travis config into multiple build stages (test of java code, building of hyperscan artifacts, deploy)
Create a build matrix containing latest OS X 64-bit, Linux Docker Image with oldest possible GLIBC (32-bit and 64-bit) and a MINGW cross compile for Windows (32-bit and 64-bit)

This would also fix #18.

Getting a hard crash on osx

com.gliwka.hyperscan.wrapper.Database

Database.compile(scanExpression);

hs_err_pid15410.log

seems to be this line:
int hsError = HyperscanLibrary.INSTANCE.hs_compile(expression.getExpression(),
Util.bitEnumSetToInt(expression.getFlags()), HS_MODE_BLOCK, Pointer.NULL, database, error);

OK, this was because "hyperscan" was not installed. Perhaps add a check for the library first?

Drop 32-bit support

CentOS 7 is now being used so that hyperscan can be built with AVX512 support.

Devtoolset-7 is used to get a modern gcc-based toolchain. Since Red Hat dropped 32-bit support, we're no longer able to build 32-bit versions and are therefore dropping support for them.

scanner ArrayIndexOutOfBoundsException bug

In the Scanner class, there is an ArrayIndexOutOfBoundsException bug when regex = "a|", because a match with "from = 0" and "to = 0" can exist.

  private HyperscanLibrary.match_event_handler matchHandler = new HyperscanLibrary.match_event_handler() {
        public int invoke(int id, long from, long to, int flags, Pointer context) {
            long[] tuple = { id, from, to };
            // from and to may both be 0, for example with regex "a|"
            matchedIds.add(tuple);
            return 0;
        }
    };

   public List<Match> scan(final Database db, final String input) throws Throwable {
        Pointer dbPointer = db.getPointer();

        scratch = scratchReference.getValue();

        final LinkedList<Match> matches = new LinkedList<Match>();

        final byte[] utf8bytes =input.getBytes(StandardCharsets.UTF_8);
        int bytesLength = utf8bytes.length;
        final int[] byteToIndex = Util.utf8ByteIndexesMapping(input, bytesLength);

        matchedIds.clear();
        int hsError = HyperscanLibraryDirect.hs_scan(dbPointer, input, bytesLength,
                0, scratch, matchHandler, Pointer.NULL);

        if(hsError != 0)
            throw Util.hsErrorIntToException(hsError);

        for (long[] tuple : matchedIds) {
            int id = (int)tuple[0];
            long from = tuple[1];
            long to = tuple[2];
            String match = "";
            Expression matchingExpression = db.getExpression(id);

            if(matchingExpression.getFlags().contains(ExpressionFlag.SOM_LEFTMOST)) {
                int startIndex = byteToIndex[(int)from];
                int endIndex = byteToIndex[(int)to - 1];
                match = input.substring(startIndex, endIndex + 1);
            }

            matches.add(new Match(byteToIndex[(int)from], byteToIndex[(int)to - 1], match, matchingExpression));
            // the line above throws ArrayIndexOutOfBoundsException when from = 0 and to = 0, for example with regex "a|"
        }
        return matches;
    }

thank you
mzn

Double free issue using Scanner::close and Database::close

First, thanks a lot for these bindings! I was able to get a pretty significant performance improvement in my use-case quite easily by switching from the naive "loop over all the regexes" algorithm to using these hyperscan bindings :)

In my use-case I was creating a lot of Scanners in try-with-resources blocks one after the other. (I know it's not ideal to be re-allocating them unnecessarily, but it was the quickest way to a working prototype in my case). And sometimes I was getting an InvalidParameterException("An invalid parameter has been passed. Is scratch allocated?") from scanning.

I'm pretty sure I've tracked this down to the implementation of close() and finalize()

The problem is, when close() is called (via try-with-resources or otherwise) the scratch space is freed, and then at some later point finalize() might run and trigger a double free. My working theory is that in my case, it sometimes happened that between closing the first Scanner and executing the finalizer a new Scanner was allocated at the same address, and the second free in the finalizer deallocated the new one I was just about to use!
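A common way to rule out this kind of double free is to make close() idempotent, so that any later call, whether from a finalizer or from user code, becomes a no-op. The following is a minimal sketch of that technique and not the fix that was applied in this library. `GuardedHandle` and `freeNative` are hypothetical names, and a counter stands in for the native free call.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a wrapper around a native handle.
final class GuardedHandle implements AutoCloseable {
    static final AtomicInteger FREE_CALLS = new AtomicInteger();

    private final AtomicBoolean closed = new AtomicBoolean(false);

    @Override
    public void close() {
        // compareAndSet lets exactly one caller through, even across threads.
        if (closed.compareAndSet(false, true)) {
            freeNative();
        }
    }

    boolean isClosed() {
        return closed.get();
    }

    // Placeholder for the native free call (for example freeing scratch space).
    private void freeNative() {
        FREE_CALLS.incrementAndGet();
    }

    public static void main(String[] args) {
        GuardedHandle handle = new GuardedHandle();
        handle.close();
        handle.close(); // second call does nothing
        System.out.println("free calls: " + FREE_CALLS.get());
    }
}
```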

Excessive String Creation

When using the SOM_LEFTMOST expression flag, all matching strings are returned within the returned list of match objects. For large strings (1+MB) and certain regexps this could cause orders of magnitude more strings to be created. If the scan method could return these strings as a lazy collection such as a Stream instead of a List, this could avoid consuming too much heap memory via the string pool. Alternatively, just the start and end offsets could be returned so the caller can control the number of strings created.
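Both suggestions, a lazy stream and offsets-only results, can be sketched with the standard library. In this illustration `java.util.regex` stands in for the scanner, and `OffsetMatches` and `Span` are hypothetical names that are not part of this library. `Matcher.results()` requires Java 9 or later.

```java
import java.util.regex.Pattern;
import java.util.stream.Stream;

final class OffsetMatches {
    // Holds offsets only; the substring is created when text() is called.
    static final class Span {
        final int start;
        final int end;
        private final String source;

        Span(String source, int start, int end) {
            this.source = source;
            this.start = start;
            this.end = end;
        }

        String text() {
            return source.substring(start, end);
        }
    }

    // Matches are produced lazily as the stream is consumed.
    static Stream<Span> spans(Pattern pattern, String input) {
        return pattern.matcher(input).results()
                .map(result -> new Span(input, result.start(), result.end()));
    }

    public static void main(String[] args) {
        String input = "12345 test 67890";
        // Only the first match is visited and only one substring is created.
        spans(Pattern.compile("[0-9]{5}"), input)
                .limit(1)
                .forEach(span -> System.out.println(span.start + ".." + span.end + " " + span.text()));
    }
}
```

With this shape the caller decides how many strings are materialized, at the cost of keeping the input string reachable for as long as any `Span` is.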