mariomka / regex-benchmark Goto Github PK

It's just a simple regex benchmark of different programming languages.

License: MIT License

Crystal 2.04% JavaScript 2.83% PHP 19.50% Python 2.76% Rust 3.65% Ruby 2.13% C# 4.46% Java 4.31% Kotlin 3.87% Go 3.97% C 7.36% Perl 2.87% D 3.43% Dockerfile 20.82% C++ 5.60% Dart 3.09% Nim 5.07% Julia 2.26%

regex bench benchmark regexp languages

regex-benchmark's Introduction

Languages Regex Benchmark

It's just a simple regex benchmark for different programming languages.

Measures how long it takes to find and count non-overlapping occurrences with default settings.

All benchmarks are wrong, but some are useful - Szilard, benchm-ml

I hope this benchmark can be helpful, but it's not only about performance, but each language also has its engine and offers different features (like UTF support, backreferences, capturing groups ...)

Input text

The input text is a concatenation of Learn X in Y minutes repository.

Maybe isn't the best representative text. I'm searching other texts to add to the benchmark.

Regex patterns

Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])

The above regex patterns aren't the best or the optimal. The focus is the benchmark, not the matching.

The patterns are applied to the whole file.

Measure

Measuring is done inside the programs to avoid include startup, reading and writing times on results.

Elapsed time include pattern compilation, find and count occurrences.

Performance

Docker image was run on: MacBook Pro (16-inch, 2019), 2.4 GHz Intel Core i9, 32 GB 2667 Mhz DDR4 with macOS Big Sur 11.2.3.

Language	Email(ms)	URI(ms)	IP(ms)	Total(ms)
Nim Regex	1.32	26.92	7.84	36.09
Nim	22.70	21.49	6.75	50.94
Rust	26.66	25.70	5.28	57.63
PHP	42.87	46.30	5.17	94.33
C++ Boost	44.97	44.13	15.13	104.23
Javascript	59.00	47.23	1.50	107.73
Perl	94.92	63.96	20.37	179.25
Julia	104.58	86.55	5.01	196.14
C PCRE2	126.10	112.17	13.10	251.37
Crystal	128.19	112.70	13.18	254.07
C# .Net Core	115.05	106.05	42.71	263.81
Dart	104.10	107.64	76.51	288.25
D ldc	165.46	165.20	4.85	335.51
D dmd	187.94	189.92	5.32	383.18
Ruby	233.88	208.85	43.14	485.86
Python PyPy2	158.34	139.70	253.77	551.81
Dart Native	278.54	307.54	5.77	591.85
Python 2	197.92	131.74	294.42	624.08
Kotlin	186.20	223.05	287.49	696.74
Java	198.33	221.87	287.81	708.01
Python PyPy3	258.78	221.89	257.35	738.03
Python 3	273.86	190.79	319.13	783.78
Go	248.14	241.28	360.90	850.32
C++ STL	433.09	344.74	245.66	1023.49
C# Mono	2859.05	2431.87	145.82	5436.75

Optimized

The following results are for the optimized version.

Language	Email(ms)	URI(ms)	IP(ms)	Total(ms)
Rust	11.43	11.40	5.11	27.94
Nim Regex	1.37	25.51	7.27	34.15
Nim	22.79	21.64	6.77	51.21
C PCRE2	46.22	36.92	4.73	87.87
PHP	43.18	46.71	5.23	95.12
C++ Boost	44.68	44.50	15.10	104.28
Javascript	59.20	47.67	1.61	108.48
C# .Net Core	61.76	47.86	11.63	121.25
Perl	96.00	63.39	20.59	179.99
Julia	104.31	87.98	5.16	197.45
Crystal	129.52	116.33	13.12	258.97
Dart	105.82	107.78	78.18	291.78
D ldc	167.60	165.71	5.07	338.37
D dmd	187.66	192.16	5.55	385.37
Ruby	236.93	206.51	43.70	487.14
Python PyPy2	161.33	143.56	258.06	562.96
Dart Native	273.17	306.14	5.89	585.20
Python 2	200.54	132.89	290.26	623.69
Kotlin	184.13	220.31	273.76	678.21
Java	190.74	223.77	275.24	689.75
Python PyPy3	268.41	226.74	261.17	756.32
Python 3	273.70	194.09	322.09	789.88
Go	244.14	238.40	365.27	847.81
C++ STL	433.18	341.07	246.85	1021.10
C# Mono	1400.04	1189.50	145.73	2735.28

Language: Indicates the language.
Email(ms), URI(ms), IP(ms): Indicates the time elapsed in milliseconds for finding and counting non-overlapping occurrences for the pattern.
Total(ms): Indicates the sum of the above times.

Versions and notes

C: gcc 7.5.0 & PCRE2 10.36-2
Crystal: crystal 0.35.1 - LLVM: 8.0.0
C++: g++ 7.5.0 | Boost 1.65.1.0
C#: dotnet 5.0.201 | Mono 6.12.0.122
D: DMD v2.089.0 | LDC 1.8.0
Dart: Dart 2.12.2
Go: go 1.16.2
Java: OpenJDK 11.0.10
Javascript: node v15.13.0
Julia: Julia 1.6.0
Kotlin: kotlinc-jvm 1.4.32
Nim: Nim 1.4.4
Perl: perl v5.26.1
PHP: PHP 8.0.3
Python: Python 2.7.17 | Python 3.6.9 | PyPy 7.3.3
Ruby: ruby 2.5.1p57
Rust: rustc 1.51.0 & regex 1.4.5

How to run

The easiest way to run the benchmark is by using Docker.

git clone https://github.com/mariomka/regex-benchmark.git
cd regex-benchmark
docker run --rm -v $(pwd):/var/regex mariomka/regex-benchmark:1.6

Contributing

All contributions are welcome, from tiny optimizations to new implementations.

There are only a few requirements:

Follow the style of the current implementations
Use the default settings for the regex engine
Update Dockerfile if it's necessary

Kudos

Heng Li's for his work on Benchmark of Regex Libraries.
A "challenge" on Madrid Devs group inspired me.
Programming subreddit, helped me to improve the benchmark.

License

MIT © Mario Juárez.

regex-benchmark's People

Contributors

Stargazers

Watchers

regex-benchmark's Issues

Use compiled regexes in C#

While theoretically possible, in practice I've never found anyone using C# regexes via Regex.Matches. A standard practice here is to use var reg = new Regex(@"[^a-zA-Z&-]+", RegexOptions.Compiled);. Another case is the fact, that there seems to be no warmup phase for JITer compiler.

Use your code and repository for research article

Hi guys 👋. Can I use your code for my research article?)

C# Benchmark with RegexOptions.ECMAScript

It might be worth using RegexOptions.ECMAScript in C# benchmark, or just add another bench for c# with this option. It drops some 'advanced' features like Unicode support, which increases performance in ~1.5-2 times.

I understand that it's a questionable decision, and I could mention some pros and cons here:

We are competing with different regex engines, and some of them, like JS one, as far as I'm aware, does not support UTF. So it's a bit unfair to compare motorbike with a truck.
It looks like we compare default regex usage patterns, and I have never seen anyone use RegexOptions.ECMAScript in a real life.

My results (Windows 10, dotnet core 2.1.4):

Original

1824.7035 - 92
1560.8753 - 5301
109.0058 - 5

With ECMAScript

1072.2842 - 92
908.0764 - 5301
110.9607 - 5

It will be nice to see here TCL regex

TCL regex engine is different from other regex engines. It will be nice to see here in benchmark. More about TCL regex engine: https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

Use BenchmarkDotNet for C# Benchmark

I'd like to suggest that BenchmarkDotNet be used to perform the C# benchmark. There's a lot of overhead to running the C# executable once, as the JIT is doing a lot of upfront work coupled with the RegexOptions.Compiled flag.

Once the pump is primed, you'll notice that performance is better. I suppose it depends on what is trying to be measured, though. If it's to show the performance of running an executable that is trying to match the 3 regexes once and terminate, then the current approach works. However, this doesn't represent more typical usage in an application.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i5-8500 CPU 3.00GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.NET Core SDK=3.1.101
  [Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Email	1,357.77 ms	2.984 ms	2.791 ms	-	-	-	335.63 KB
Uri	1,175.25 ms	3.274 ms	3.063 ms	-	-	-	1402.2 KB
Ip	83.62 ms	0.301 ms	0.281 ms	-	-	-	19.65 KB
EmailCompiled	964.28 ms	2.583 ms	2.416 ms	-	-	-	230.55 KB
UriCompiled	827.55 ms	2.150 ms	2.011 ms	-	-	-	1393.54 KB
IpCompiled	63.70 ms	0.164 ms	0.153 ms	-	-	-	15.11 KB

void Main()
{
    BenchmarkRunner.Run<RegexBenchmark>();
}

[MemoryDiagnoser]
public class RegexBenchmark
{
    private string _data;
    
    [GlobalSetup]
    public void GlobalSetup()
    {
        _data = File.ReadAllText(@"D:\Benchmarks\input-text.txt");
    }
    
    [Benchmark]
    public int Email()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+");
        return matches.Count;
    }
    
    [Benchmark]
    public int Uri()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?");
        return matches.Count;
    }
    
    [Benchmark]
    public int Ip()
    {
        MatchCollection matches = Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])");
        return matches.Count;
    }

    [Benchmark]
    public int EmailCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled);
        return matches.Count;
    }

    [Benchmark]
    public int UriCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled);
        return matches.Count;
    }

    [Benchmark]
    public int IpCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])", RegexOptions.Compiled);
        return matches.Count;
    }
}

Compile time regex for C++

I've just tried CTRE on these benchmarks. On my machine, it beats rust.

Here's the code:

#include "ctre.hpp"
#include <array>
#include <chrono>
#include <fstream>
#include <iostream>
#include <numeric>

template<size_t N>
void measure(const std::string& data) {
  using clock = std::chrono::high_resolution_clock;
  const auto start = clock::now();
  unsigned count = 0;
  if constexpr ( N == 0 )
    // Email
    for(auto match : ctre::range<"[\\w.+\\-]+@[\\w.\\-]+\\.[\\w.\\-]+">(data)) count++;
  else if constexpr ( N == 1 )
    // URI
    for(auto match : ctre::range<"[\\w]+:\\/\\/[^\\/\\s?#]+[^\\s?#]+(?:\\?[^\\s#]*)?(?:#[^\\s]*)?">(data)) count++;
  else
    // IP
    for(auto match : ctre::range<"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])">(data)) count++;

  const auto end = clock::now();
  const double elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() * 1e-6;
  std::cout << elapsed << " - " << count << "\n";
}

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "Usage: benchmark <filename>\n";
    return 1;
  }

  std::ifstream file{argv[1]};
  if (!file) {
    std::cerr << "unable to open " << argv[1] << "\n";
    return 1;
  }

  const std::string data{std::istreambuf_iterator<char>{file}, std::istreambuf_iterator<char>{}};

  measure<0>(data);
  measure<1>(data);
  measure<2>(data);
}

CTRE is different from other C++ regex libraries in that it computes the state machine at compile time. The hard part about submitting a CTRE pull request for this repo is that it isn't packaged for Ubuntu. On my machine I have just downloaded the header straight from raw.githubuserconent.com with wget. The API is also completely different, so I don't know if I should make a new file for the PR or...

Also, it needs GCC 9.

Pattern compilation should not be included in elapsed time

A compiled regular expression is almost always used rather than recompiling each time.
And the measurement should be an average of multiple times.
A benchmark is only useful if it is close enough to real applications.

Missing PHP version

There isn't any note in the README pointing to the used php version

Update readme with .NET 5?

.NET 5 has been released for several months, but the readme is still showing data from .NET Core 3.1. Could it be updated with numbers for .NET 5 instead? Thanks!

Suggest adding Spencer regex libraries

Henry Spencer wrote multiple regex libraries which were historically important and are still relevant today: https://garyhouston.github.io/regex/

The one that survives in production code today are the Tcl library which is used in both Tcl and Postgres.

It would be very interesting to know how they fare as they have a number of different approaches from the other historical lineages of regex library, many of which have not been picked up by modern implementations.

The Postgres regex engine supports a lot of modern features like being able to turn on/off greediness per capture group and supporting unicode (well whatever the OS puts in wchar_t).

Email Regex using RFC

https://www.regular-expressions.info/email.html

dart2native is obsolete

https://dart.dev/tools/dart-compile

Instead of dart2native the following command should be used:

   mkdir -p dart/bin && dart compile exe dart/benchmark.dart -o dart/bin/benchmark

Ruby benchmark compiles regexp before measuring

A Regexp literal in Ruby, e.g. /foo/ results in a call to Regexp::compile, which means the Regexp is already created by the time measure is invoked.

Instead, pass a String pattern that is compiled with Regexp::compile, which makes the test more analogous to the Rust one.

"C# benchmark" is incorrect: "RegexOptions.Compiled" "Flags" should not be set.

The csharp code in the benchmark is incorrect: RegexOptions.Compiled Flags should not be set.
LINK: csharp/Benchmark.cs#L33

Because:
In many other languages, Compile is used as the Parser process for Regex. But on .NET, a Compiled option is provided to dynamically create IL objects, and the vast majority of cases should not used. It is easy to be used incorrectly.

In addition, the javascript code used is unfair.
The /...regex.../ syntax in javascript creates a Regex instance object that has been Parser, but the Parser consumption in other languages Benchmark is counted.

Code and setting optimizations

Hello everyone,

Language/engine features seem to be very controversial and give us a lot of different opinions. Because of that, I have decided to use default settings for every language/engine but I also decided to create a new branch where code and settings optimizations will be allowed. It includes disabling some features (Unicode support for example) or enabling specific mechanics(JIT for example), but the measure, style, and regex patters will be the same.

At this point, we have two benchmarks, the first for comparing default settings and the second to push the benchmark until the limit for this specific scenario.

Results for both benchmarks will be in master readme.

I will work on it the next days but I accept PR for the new branch optimized from now.

Why Python is faster than Go and C#?

Python is a dynamic language. Go is a static language. There is no way that Go is slower than Python.

And C# is similar to Java in the runtime. So C# should at least faster than Python too.

C++ benchmark improvements

This repository came up in a recent C++ discussion in /r/cpp and once thing was immediately pointed out as wrong.

https://github.com/mariomka/regex-benchmark/blob/master/cpp/benchmark.cpp#L8-L16

You're measuring the time it takes to create the regex too, not just the loop.

Second, std::regex is known to be terribly slow and boost::regex is API compatible with much better performance. On top of that CTRE is even faster.

Benchmarks on my computer, with C/PCRE for reference:

PCRE
bstaletic@Gallifrey c  (git)-[master]-% ./bin/benchmark ../input-text.txt
31.998263 - 92
26.271349 - 5301
7.516541 - 5

std::regex
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
631.659 - 92
512.522 - 5301
391.858 - 5

boost::regex
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
84.5532 - 92
86.0452 - 5301
34.0546 - 5

CTRE
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
62.5399 - 92
52.8931 - 5301
12.9586 - 5

some benchmarks in perl

--> perl perl/benchmark.pl input-text.txt
213.690996 - 92
165.706158 - 5301
29.313803 - 5

--> perl perl/benchmark_bytes.pl input-text.txt
95.394135 - 92
62.829971 - 5301
24.369955 - 5

--> perl perl/benchmark_bytes_ascii.pl input-text.txt
57.080984 - 92
62.375784 - 5301
24.430990 - 5

--> perl perl/benchmark_bytes_ascii_nocapture.pl input-text.txt
57.039022 - 92
62.216997 - 5301
24.269104 - 5

--> uname -v
FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017     [email protected]:/usr/obj/usr/src/sys/GENERIC

--> perl -E 'say $^V'
v5.26.1

--> diff -u perl/benchmark.pl perl/benchmark_bytes.pl
--- perl/benchmark.pl   2017-10-22 14:08:18.000000000 +0900
+++ perl/benchmark_bytes.pl     2017-10-22 14:30:27.000000000 +0900
@@ -18,7 +18,7 @@

 my ($filename) = @ARGV;

-open my $fh, '<:encoding(UTF-8)', $filename or die 'Could not open file.';
+open my $fh, '<', $filename or die 'Could not open file.';
 my $text;
 read $fh, $data, -s $filename;
 close $fh;

--> diff -u perl/benchmark_bytes.pl perl/benchmark_bytes_ascii.pl
--- perl/benchmark_bytes.pl     2017-10-22 14:30:27.000000000 +0900
+++ perl/benchmark_bytes_ascii.pl       2017-10-22 14:30:33.000000000 +0900
@@ -5,7 +5,7 @@

     my $start = Time::HiRes::gettimeofday();

-    my $count = () = $data =~ /$pattern/g;
+    my $count = () = $data =~ /$pattern/ga;

     my $elapsed = (Time::HiRes::gettimeofday() - $start) * 1e3;


--> diff -u perl/benchmark_bytes_ascii.pl perl/benchmark_bytes_ascii_nocapture.pl
--- perl/benchmark_bytes_ascii.pl       2017-10-22 14:30:33.000000000 +0900
+++ perl/benchmark_bytes_ascii_nocapture.pl     2017-10-22 14:52:44.183470000 +0900
@@ -5,7 +5,7 @@

     my $start = Time::HiRes::gettimeofday();

-    my $count = () = $data =~ /$pattern/ga;
+    my $count = () = $data =~ /$pattern/gan;

     my $elapsed = (Time::HiRes::gettimeofday() - $start) * 1e3;

Nim optimization

For Nim, you should also use the -d:danger flag to get the fastest speed.

There is a way to make rust regex compatible with normal regex syntax ?

@BurntSushi

fn measure(pattern: &str, data: &str) {
    let start = Instant::now();

    let regex = Regex::new(pattern);
    match regex {
        Ok(re) => {
            let count = re.find_iter(data).count();

            let elapsed = Instant::now().duration_since(start);
            println!("{} - {}", elapsed.as_secs_f64() * 1e3, count);
        }
        Err(err) => {
            println!("{:?}", err);
        }
    }
}

prints

regex parse error:
    .* x-x[[0-9]+] time="[0-9\.\:\-T]+" level=error msg="x x x x ([a-z0-9]{64}): x to x device ([a-z0-9]{64})-init:devicemapper: x x x x failed""
                                           ^^
error: unrecognized escape sequence

while other languages and online regex compile it just fine
without changing the regex