Git Product home page Git Product logo

mariomka / regex-benchmark Goto Github PK

View Code? Open in Web Editor NEW
313.0 14.0 58.0 2.31 MB

It's just a simple regex benchmark of different programming languages.

License: MIT License

Crystal 2.04% JavaScript 2.83% PHP 19.50% Python 2.76% Rust 3.65% Ruby 2.13% C# 4.46% Java 4.31% Kotlin 3.87% Go 3.97% C 7.36% Perl 2.87% D 3.43% Dockerfile 20.82% C++ 5.60% Dart 3.09% Nim 5.07% Julia 2.26%
regex bench benchmark regexp languages

regex-benchmark's Introduction

Languages Regex Benchmark

It's just a simple regex benchmark for different programming languages.

Measures how long it takes to find and count non-overlapping occurrences with default settings.

All benchmarks are wrong, but some are useful - Szilard, benchm-ml

I hope this benchmark can be helpful, but it's not only about performance, but each language also has its engine and offers different features (like UTF support, backreferences, capturing groups ...)

Input text

The input text is a concatenation of Learn X in Y minutes repository.

Maybe isn't the best representative text. I'm searching other texts to add to the benchmark.

Regex patterns

  • Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
  • URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
  • IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])

The above regex patterns aren't the best or the optimal. The focus is the benchmark, not the matching.

The patterns are applied to the whole file.

Measure

Measuring is done inside the programs to avoid include startup, reading and writing times on results.

Elapsed time include pattern compilation, find and count occurrences.

Performance

Docker image was run on: MacBook Pro (16-inch, 2019), 2.4 GHz Intel Core i9, 32 GB 2667 Mhz DDR4 with macOS Big Sur 11.2.3.

Language Email(ms) URI(ms) IP(ms) Total(ms)
Nim Regex 1.32 26.92 7.84 36.09
Nim 22.70 21.49 6.75 50.94
Rust 26.66 25.70 5.28 57.63
PHP 42.87 46.30 5.17 94.33
C++ Boost 44.97 44.13 15.13 104.23
Javascript 59.00 47.23 1.50 107.73
Perl 94.92 63.96 20.37 179.25
Julia 104.58 86.55 5.01 196.14
C PCRE2 126.10 112.17 13.10 251.37
Crystal 128.19 112.70 13.18 254.07
C# .Net Core 115.05 106.05 42.71 263.81
Dart 104.10 107.64 76.51 288.25
D ldc 165.46 165.20 4.85 335.51
D dmd 187.94 189.92 5.32 383.18
Ruby 233.88 208.85 43.14 485.86
Python PyPy2 158.34 139.70 253.77 551.81
Dart Native 278.54 307.54 5.77 591.85
Python 2 197.92 131.74 294.42 624.08
Kotlin 186.20 223.05 287.49 696.74
Java 198.33 221.87 287.81 708.01
Python PyPy3 258.78 221.89 257.35 738.03
Python 3 273.86 190.79 319.13 783.78
Go 248.14 241.28 360.90 850.32
C++ STL 433.09 344.74 245.66 1023.49
C# Mono 2859.05 2431.87 145.82 5436.75

Optimized

The following results are for the optimized version.

Language Email(ms) URI(ms) IP(ms) Total(ms)
Rust 11.43 11.40 5.11 27.94
Nim Regex 1.37 25.51 7.27 34.15
Nim 22.79 21.64 6.77 51.21
C PCRE2 46.22 36.92 4.73 87.87
PHP 43.18 46.71 5.23 95.12
C++ Boost 44.68 44.50 15.10 104.28
Javascript 59.20 47.67 1.61 108.48
C# .Net Core 61.76 47.86 11.63 121.25
Perl 96.00 63.39 20.59 179.99
Julia 104.31 87.98 5.16 197.45
Crystal 129.52 116.33 13.12 258.97
Dart 105.82 107.78 78.18 291.78
D ldc 167.60 165.71 5.07 338.37
D dmd 187.66 192.16 5.55 385.37
Ruby 236.93 206.51 43.70 487.14
Python PyPy2 161.33 143.56 258.06 562.96
Dart Native 273.17 306.14 5.89 585.20
Python 2 200.54 132.89 290.26 623.69
Kotlin 184.13 220.31 273.76 678.21
Java 190.74 223.77 275.24 689.75
Python PyPy3 268.41 226.74 261.17 756.32
Python 3 273.70 194.09 322.09 789.88
Go 244.14 238.40 365.27 847.81
C++ STL 433.18 341.07 246.85 1021.10
C# Mono 1400.04 1189.50 145.73 2735.28
  • Language: Indicates the language.
  • Email(ms), URI(ms), IP(ms): Indicates the time elapsed in milliseconds for finding and counting non-overlapping occurrences for the pattern.
  • Total(ms): Indicates the sum of the above times.

Versions and notes

  • C: gcc 7.5.0 & PCRE2 10.36-2
  • Crystal: crystal 0.35.1 - LLVM: 8.0.0
  • C++: g++ 7.5.0 | Boost 1.65.1.0
  • C#: dotnet 5.0.201 | Mono 6.12.0.122
  • D: DMD v2.089.0 | LDC 1.8.0
  • Dart: Dart 2.12.2
  • Go: go 1.16.2
  • Java: OpenJDK 11.0.10
  • Javascript: node v15.13.0
  • Julia: Julia 1.6.0
  • Kotlin: kotlinc-jvm 1.4.32
  • Nim: Nim 1.4.4
  • Perl: perl v5.26.1
  • PHP: PHP 8.0.3
  • Python: Python 2.7.17 | Python 3.6.9 | PyPy 7.3.3
  • Ruby: ruby 2.5.1p57
  • Rust: rustc 1.51.0 & regex 1.4.5

How to run

The easiest way to run the benchmark is by using Docker.

git clone https://github.com/mariomka/regex-benchmark.git
cd regex-benchmark
docker run --rm -v $(pwd):/var/regex mariomka/regex-benchmark:1.6

Contributing

All contributions are welcome, from tiny optimizations to new implementations.

There are only a few requirements:

  • Follow the style of the current implementations
  • Use the default settings for the regex engine
  • Update Dockerfile if it's necessary

Kudos

License

MIT © Mario Juárez.

regex-benchmark's People

Contributors

altunyurt avatar bstaletic avatar burntsushi avatar danmoseley avatar data-man avatar dougpuob avatar e-kwsm avatar mariomka avatar nitely avatar szarnyasg avatar tristan971 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

regex-benchmark's Issues

Use compiled regexes in C#

While theoretically possible, in practice I've never found anyone using C# regexes via Regex.Matches. A standard practice here is to use var reg = new Regex(@"[^a-zA-Z&-]+", RegexOptions.Compiled);. Another case is the fact, that there seems to be no warmup phase for JITer compiler.

C# Benchmark with RegexOptions.ECMAScript

It might be worth using RegexOptions.ECMAScript in C# benchmark, or just add another bench for c# with this option. It drops some 'advanced' features like Unicode support, which increases performance in ~1.5-2 times.

I understand that it's a questionable decision, and I could mention some pros and cons here:

  • We are competing with different regex engines, and some of them, like JS one, as far as I'm aware, does not support UTF. So it's a bit unfair to compare motorbike with a truck.
  • It looks like we compare default regex usage patterns, and I have never seen anyone use RegexOptions.ECMAScript in a real life.

My results (Windows 10, dotnet core 2.1.4):

Original

1824.7035 - 92
1560.8753 - 5301
109.0058 - 5

With ECMAScript

1072.2842 - 92
908.0764 - 5301
110.9607 - 5

Use BenchmarkDotNet for C# Benchmark

I'd like to suggest that BenchmarkDotNet be used to perform the C# benchmark. There's a lot of overhead to running the C# executable once, as the JIT is doing a lot of upfront work coupled with the RegexOptions.Compiled flag.

Once the pump is primed, you'll notice that performance is better. I suppose it depends on what is trying to be measured, though. If it's to show the performance of running an executable that is trying to match the 3 regexes once and terminate, then the current approach works. However, this doesn't represent more typical usage in an application.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i5-8500 CPU 3.00GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.NET Core SDK=3.1.101
  [Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Email 1,357.77 ms 2.984 ms 2.791 ms - - - 335.63 KB
Uri 1,175.25 ms 3.274 ms 3.063 ms - - - 1402.2 KB
Ip 83.62 ms 0.301 ms 0.281 ms - - - 19.65 KB
EmailCompiled 964.28 ms 2.583 ms 2.416 ms - - - 230.55 KB
UriCompiled 827.55 ms 2.150 ms 2.011 ms - - - 1393.54 KB
IpCompiled 63.70 ms 0.164 ms 0.153 ms - - - 15.11 KB
void Main()
{
    BenchmarkRunner.Run<RegexBenchmark>();
}

[MemoryDiagnoser]
public class RegexBenchmark
{
    private string _data;
    
    [GlobalSetup]
    public void GlobalSetup()
    {
        _data = File.ReadAllText(@"D:\Benchmarks\input-text.txt");
    }
    
    [Benchmark]
    public int Email()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+");
        return matches.Count;
    }
    
    [Benchmark]
    public int Uri()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?");
        return matches.Count;
    }
    
    [Benchmark]
    public int Ip()
    {
        MatchCollection matches = Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])");
        return matches.Count;
    }

    [Benchmark]
    public int EmailCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled);
        return matches.Count;
    }

    [Benchmark]
    public int UriCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled);
        return matches.Count;
    }

    [Benchmark]
    public int IpCompiled()
    {
        MatchCollection matches = Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])", RegexOptions.Compiled);
        return matches.Count;
    }
}

Compile time regex for C++

I've just tried CTRE on these benchmarks. On my machine, it beats rust.

Here's the code:

#include "ctre.hpp"
#include <array>
#include <chrono>
#include <fstream>
#include <iostream>
#include <numeric>

template<size_t N>
void measure(const std::string& data) {
  using clock = std::chrono::high_resolution_clock;
  const auto start = clock::now();
  unsigned count = 0;
  if constexpr ( N == 0 )
    // Email
    for(auto match : ctre::range<"[\\w.+\\-]+@[\\w.\\-]+\\.[\\w.\\-]+">(data)) count++;
  else if constexpr ( N == 1 )
    // URI
    for(auto match : ctre::range<"[\\w]+:\\/\\/[^\\/\\s?#]+[^\\s?#]+(?:\\?[^\\s#]*)?(?:#[^\\s]*)?">(data)) count++;
  else
    // IP
    for(auto match : ctre::range<"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])">(data)) count++;

  const auto end = clock::now();
  const double elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() * 1e-6;
  std::cout << elapsed << " - " << count << "\n";
}

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "Usage: benchmark <filename>\n";
    return 1;
  }

  std::ifstream file{argv[1]};
  if (!file) {
    std::cerr << "unable to open " << argv[1] << "\n";
    return 1;
  }

  const std::string data{std::istreambuf_iterator<char>{file}, std::istreambuf_iterator<char>{}};

  measure<0>(data);
  measure<1>(data);
  measure<2>(data);
}

CTRE is different from other C++ regex libraries in that it computes the state machine at compile time. The hard part about submitting a CTRE pull request for this repo is that it isn't packaged for Ubuntu. On my machine I have just downloaded the header straight from raw.githubuserconent.com with wget. The API is also completely different, so I don't know if I should make a new file for the PR or...

Also, it needs GCC 9.

Update readme with .NET 5?

.NET 5 has been released for several months, but the readme is still showing data from .NET Core 3.1. Could it be updated with numbers for .NET 5 instead? Thanks!

Suggest adding Spencer regex libraries

Henry Spencer wrote multiple regex libraries which were historically important and are still relevant today: https://garyhouston.github.io/regex/

The one that survives in production code today are the Tcl library which is used in both Tcl and Postgres.

It would be very interesting to know how they fare as they have a number of different approaches from the other historical lineages of regex library, many of which have not been picked up by modern implementations.

The Postgres regex engine supports a lot of modern features like being able to turn on/off greediness per capture group and supporting unicode (well whatever the OS puts in wchar_t).

Ruby benchmark compiles regexp before measuring

A Regexp literal in Ruby, e.g. /foo/ results in a call to Regexp::compile, which means the Regexp is already created by the time measure is invoked.

Instead, pass a String pattern that is compiled with Regexp::compile, which makes the test more analogous to the Rust one.

"C# benchmark" is incorrect: "RegexOptions.Compiled" "Flags" should not be set.

The csharp code in the benchmark is incorrect: RegexOptions.Compiled Flags should not be set.
LINK: csharp/Benchmark.cs#L33

Because:
In many other languages, Compile is used as the Parser process for Regex. But on .NET, a Compiled option is provided to dynamically create IL objects, and the vast majority of cases should not used. It is easy to be used incorrectly.

In addition, the javascript code used is unfair.
The /...regex.../ syntax in javascript creates a Regex instance object that has been Parser, but the Parser consumption in other languages Benchmark is counted.

Code and setting optimizations

Hello everyone,

Language/engine features seem to be very controversial and give us a lot of different opinions. Because of that, I have decided to use default settings for every language/engine but I also decided to create a new branch where code and settings optimizations will be allowed. It includes disabling some features (Unicode support for example) or enabling specific mechanics(JIT for example), but the measure, style, and regex patters will be the same.

At this point, we have two benchmarks, the first for comparing default settings and the second to push the benchmark until the limit for this specific scenario.

Results for both benchmarks will be in master readme.

I will work on it the next days but I accept PR for the new branch optimized from now.

Why Python is faster than Go and C#?

Python is a dynamic language. Go is a static language. There is no way that Go is slower than Python.

And C# is similar to Java in the runtime. So C# should at least faster than Python too.

C++ benchmark improvements

This repository came up in a recent C++ discussion in /r/cpp and once thing was immediately pointed out as wrong.

https://github.com/mariomka/regex-benchmark/blob/master/cpp/benchmark.cpp#L8-L16

You're measuring the time it takes to create the regex too, not just the loop.

Second, std::regex is known to be terribly slow and boost::regex is API compatible with much better performance. On top of that CTRE is even faster.

Benchmarks on my computer, with C/PCRE for reference:

PCRE
bstaletic@Gallifrey c  (git)-[master]-% ./bin/benchmark ../input-text.txt
31.998263 - 92
26.271349 - 5301
7.516541 - 5

std::regex
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
631.659 - 92
512.522 - 5301
391.858 - 5

boost::regex
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
84.5532 - 92
86.0452 - 5301
34.0546 - 5

CTRE
bstaletic@Gallifrey cpp  (git)-[master]-% ./bin/benchmark ../input-text.txt
62.5399 - 92
52.8931 - 5301
12.9586 - 5

some benchmarks in perl

--> perl perl/benchmark.pl input-text.txt
213.690996 - 92
165.706158 - 5301
29.313803 - 5

--> perl perl/benchmark_bytes.pl input-text.txt
95.394135 - 92
62.829971 - 5301
24.369955 - 5

--> perl perl/benchmark_bytes_ascii.pl input-text.txt
57.080984 - 92
62.375784 - 5301
24.430990 - 5

--> perl perl/benchmark_bytes_ascii_nocapture.pl input-text.txt
57.039022 - 92
62.216997 - 5301
24.269104 - 5

--> uname -v
FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017     [email protected]:/usr/obj/usr/src/sys/GENERIC

--> perl -E 'say $^V'
v5.26.1

--> diff -u perl/benchmark.pl perl/benchmark_bytes.pl
--- perl/benchmark.pl   2017-10-22 14:08:18.000000000 +0900
+++ perl/benchmark_bytes.pl     2017-10-22 14:30:27.000000000 +0900
@@ -18,7 +18,7 @@

 my ($filename) = @ARGV;

-open my $fh, '<:encoding(UTF-8)', $filename or die 'Could not open file.';
+open my $fh, '<', $filename or die 'Could not open file.';
 my $text;
 read $fh, $data, -s $filename;
 close $fh;

--> diff -u perl/benchmark_bytes.pl perl/benchmark_bytes_ascii.pl
--- perl/benchmark_bytes.pl     2017-10-22 14:30:27.000000000 +0900
+++ perl/benchmark_bytes_ascii.pl       2017-10-22 14:30:33.000000000 +0900
@@ -5,7 +5,7 @@

     my $start = Time::HiRes::gettimeofday();

-    my $count = () = $data =~ /$pattern/g;
+    my $count = () = $data =~ /$pattern/ga;

     my $elapsed = (Time::HiRes::gettimeofday() - $start) * 1e3;


--> diff -u perl/benchmark_bytes_ascii.pl perl/benchmark_bytes_ascii_nocapture.pl
--- perl/benchmark_bytes_ascii.pl       2017-10-22 14:30:33.000000000 +0900
+++ perl/benchmark_bytes_ascii_nocapture.pl     2017-10-22 14:52:44.183470000 +0900
@@ -5,7 +5,7 @@

     my $start = Time::HiRes::gettimeofday();

-    my $count = () = $data =~ /$pattern/ga;
+    my $count = () = $data =~ /$pattern/gan;

     my $elapsed = (Time::HiRes::gettimeofday() - $start) * 1e3;

Nim optimization

For Nim, you should also use the -d:danger flag to get the fastest speed.

There is a way to make rust regex compatible with normal regex syntax ?

@BurntSushi

fn measure(pattern: &str, data: &str) {
    let start = Instant::now();

    let regex = Regex::new(pattern);
    match regex {
        Ok(re) => {
            let count = re.find_iter(data).count();

            let elapsed = Instant::now().duration_since(start);
            println!("{} - {}", elapsed.as_secs_f64() * 1e3, count);
        }
        Err(err) => {
            println!("{:?}", err);
        }
    }
}

prints

regex parse error:
    .* x-x[[0-9]+] time="[0-9\.\:\-T]+" level=error msg="x x x x ([a-z0-9]{64}): x to x device ([a-z0-9]{64})-init:devicemapper: x x x x failed""
                                           ^^
error: unrecognized escape sequence

while other languages and online regex compile it just fine
without changing the regex

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.