slevithan / xregexp

Extended JavaScript regular expressions

Home Page: http://xregexp.com/

License: MIT License

JavaScript 96.58% HTML 1.10% TypeScript 2.32%

regex regular-expression regexp

xregexp's Introduction

XRegExp 5.1.1

XRegExp provides augmented (and extensible) JavaScript regular expressions. You get modern syntax and flags beyond what browsers support natively. XRegExp is also a regex utility belt with tools to make your grepping and parsing easier, while freeing you from regex cross-browser inconsistencies and other annoyances.

XRegExp supports all native ES6 regular expression syntax. It supports ES5+ browsers, and you can use it with Node.js or as a RequireJS module. Over the years, many of XRegExp's features have been adopted by new JavaScript standards (named capturing, Unicode properties/scripts/categories, flag s, sticky matching, etc.), so using XRegExp can be a way to extend these features into older browsers.

Performance

XRegExp compiles to native RegExp objects. Therefore regexes built with XRegExp perform just as fast as native regular expressions. There is a tiny extra cost when compiling a pattern for the first time.

Named capture breaking change in XRegExp 5

XRegExp 5 introduced a breaking change where named backreference properties now appear on the result's groups object (following ES2018), rather than directly on the result. To restore the old handling so you don't need to update old code, run the following line after importing XRegExp:

XRegExp.uninstall('namespacing');

XRegExp 4.1.0 and later allow introducing the new behavior without upgrading to XRegExp 5 by running XRegExp.install('namespacing').

Following is the most commonly needed change to update code for the new behavior:

// Change this
const name = XRegExp.exec(str, regexWithNamedCapture).name;

// To this
const name = XRegExp.exec(str, regexWithNamedCapture).groups.name;

See below for more examples of using named capture with XRegExp.exec and XRegExp.replace.

Usage examples

// Using named capture and flag x for free-spacing and line comments
const date = XRegExp(
    `(?<year>  [0-9]{4} ) -?  # year
     (?<month> [0-9]{2} ) -?  # month
     (?<day>   [0-9]{2} )     # day`, 'x');

// XRegExp.exec provides named backreferences on the result's groups property
let match = XRegExp.exec('2021-02-22', date);
match.groups.year; // -> '2021'

// It also includes optional pos and sticky arguments
let pos = 3;
const result = [];
while (match = XRegExp.exec('<1><2><3>4<5>', /<(\d+)>/, pos, 'sticky')) {
    result.push(match[1]);
    pos = match.index + match[0].length;
}
// result -> ['2', '3']

// XRegExp.replace allows named backreferences in replacements
XRegExp.replace('2021-02-22', date, '$<month>/$<day>/$<year>');
// -> '02/22/2021'
XRegExp.replace('2021-02-22', date, (...args) => {
    // Named backreferences are on the last argument
    const {day, month, year} = args.at(-1);
    return `${month}/${day}/${year}`;
});
// -> '02/22/2021'

// XRegExps compile to RegExps and work with native methods
date.test('2021-02-22');
// -> true
// However, named captures must be referenced using numbered backreferences
// if used with native methods
'2021-02-22'.replace(date, '$2/$3/$1');
// -> '02/22/2021'

// Use XRegExp.forEach to extract every other digit from a string
const evens = [];
XRegExp.forEach('1a2345', /\d/, (match, i) => {
    if (i % 2) evens.push(+match[0]);
});
// evens -> [2, 4]

// Use XRegExp.matchChain to get numbers within <b> tags
XRegExp.matchChain('1 <b>2</b> 3 <B>4 \n 56</B>', [
    XRegExp('<b>.*?</b>', 'is'),
    /\d+/
]);
// -> ['2', '4', '56']

// You can also pass forward and return specific backreferences
const html =
    `<a href="https://xregexp.com/">XRegExp</a>
     <a href="https://www.google.com/">Google</a>`;
XRegExp.matchChain(html, [
    {regex: /<a href="([^"]+)">/i, backref: 1},
    {regex: XRegExp('(?i)^https?://(?<domain>[^/?#]+)'), backref: 'domain'}
]);
// -> ['xregexp.com', 'www.google.com']

// Merge strings and regexes, with updated backreferences
XRegExp.union(['m+a*n', /(bear)\1/, /(pig)\1/], 'i', {conjunction: 'or'});
// -> /m\+a\*n|(bear)\1|(pig)\2/i

These examples give the flavor of what's possible, but XRegExp has more syntax, flags, methods, options, and browser fixes that aren't shown here. You can also augment XRegExp's regular expression syntax with addons (see below) or write your own. See xregexp.com for details.

Addons

You can either load addons individually, or bundle all addons with XRegExp by loading xregexp-all.js from https://unpkg.com/xregexp/xregexp-all.js.

Unicode

If not using xregexp-all.js, first include the Unicode Base script and then one or more of the addons for Unicode categories, properties, or scripts.

Then you can do this:

// Test some Unicode scripts
// Can also use the Script= prefix to match ES2018: \p{Script=Hiragana}
XRegExp('^\\p{Hiragana}+$').test('ひらがな'); // -> true
XRegExp('^[\\p{Latin}\\p{Common}]+$').test('Über Café.'); // -> true

// Test the Unicode categories Letter and Mark
// Can also use the short names \p{L} and \p{M}
const unicodeWord = XRegExp.tag()`^\p{Letter}[\p{Letter}\p{Mark}]*$`;
unicodeWord.test('Русский'); // -> true
unicodeWord.test('日本語'); // -> true
unicodeWord.test('العربية'); // -> true

By default, \p{…} and \P{…} support the Basic Multilingual Plane (i.e. code points up to U+FFFF). You can opt-in to full 21-bit Unicode support (with code points up to U+10FFFF) on a per-regex basis by using flag A. This is called astral mode. You can automatically add flag A for all new regexes by running XRegExp.install('astral'). When in astral mode, \p{…} and \P{…} always match a full code point rather than a code unit, using surrogate pairs for code points above U+FFFF.

// Using flag A to match astral code points
XRegExp('^\\p{S}$').test('💩'); // -> false
XRegExp('^\\p{S}$', 'A').test('💩'); // -> true
// Using surrogate pair U+D83D U+DCA9 to represent U+1F4A9 (pile of poo)
XRegExp('^\\p{S}$', 'A').test('\uD83D\uDCA9'); // -> true

// Implicit flag A
XRegExp.install('astral');
XRegExp('^\\p{S}$').test('💩'); // -> true

Opting in to astral mode disables the use of \p{…} and \P{…} within character classes. In astral mode, use e.g. (\pL|[0-9_])+ instead of [\pL0-9_]+.

XRegExp uses Unicode 14.0.0.

XRegExp.build

Build regular expressions using named subpatterns, for readability and pattern reuse:

const time = XRegExp.build('(?x)^ {{hours}} ({{minutes}}) $', {
    hours: XRegExp.build('{{h12}} : | {{h24}}', {
        h12: /1[0-2]|0?[1-9]/,
        h24: /2[0-3]|[01][0-9]/
    }),
    minutes: /^[0-5][0-9]$/
});

time.test('10:59'); // -> true
XRegExp.exec('10:59', time).groups.minutes; // -> '59'

Named subpatterns can be provided as strings or regex objects. A leading ^ and trailing unescaped $ are stripped from subpatterns if both are present, which allows embedding independently-useful anchored patterns. {{…}} tokens can be quantified as a single unit. Any backreferences in the outer pattern or provided subpatterns are automatically renumbered to work correctly within the larger combined pattern. The syntax ({{name}}) works as shorthand for named capture via (?<name>{{name}}). Named subpatterns cannot be embedded within character classes.

XRegExp.tag (included with XRegExp.build)

Provides tagged template literals that create regexes with XRegExp syntax and flags:

XRegExp.tag()`\b\w+\b`.test('word'); // -> true

const hours = /1[0-2]|0?[1-9]/;
const minutes = /(?<minutes>[0-5][0-9])/;
const time = XRegExp.tag('x')`\b ${hours} : ${minutes} \b`;
time.test('10:59'); // -> true
XRegExp.exec('10:59', time).groups.minutes; // -> '59'

const backref1 = /(a)\1/;
const backref2 = /(b)\1/;
XRegExp.tag()`${backref1}${backref2}`.test('aabb'); // -> true

XRegExp.tag does more than just interpolation. You get all the XRegExp syntax and flags, and since it reads patterns as raw strings, you no longer need to escape all your backslashes. XRegExp.tag also uses XRegExp.build under the hood, so you get all of its extras for free. Leading ^ and trailing unescaped $ are stripped from interpolated patterns if both are present (to allow embedding independently useful anchored regexes), interpolating into a character class is an error (to avoid unintended meaning in edge cases), interpolated patterns are treated as atomic units when quantified, interpolated strings have their special characters escaped, and any backreferences within an interpolated regex are rewritten to work within the overall pattern.

XRegExp.matchRecursive

A robust and flexible API for matching recursive constructs using XRegExp pattern strings as left and right delimiters:

const str1 = '(t((e))s)t()(ing)';
XRegExp.matchRecursive(str1, '\\(', '\\)', 'g');
// -> ['t((e))s', '', 'ing']

// Extended information mode with valueNames
const str2 = 'Here is <div> <div>an</div></div> example';
XRegExp.matchRecursive(str2, '<div\\s*>', '</div>', 'gi', {
    valueNames: ['between', 'left', 'match', 'right']
});
/* -> [
{name: 'between', value: 'Here is ',       start: 0,  end: 8},
{name: 'left',    value: '<div>',          start: 8,  end: 13},
{name: 'match',   value: ' <div>an</div>', start: 13, end: 27},
{name: 'right',   value: '</div>',         start: 27, end: 33},
{name: 'between', value: ' example',       start: 33, end: 41}
] */

// Omitting unneeded parts with null valueNames, and using escapeChar
const str3 = '...{1}.\\{{function(x,y){return {y:x}}}';
XRegExp.matchRecursive(str3, '{', '}', 'g', {
    valueNames: ['literal', null, 'value', null],
    escapeChar: '\\'
});
/* -> [
{name: 'literal', value: '...',  start: 0, end: 3},
{name: 'value',   value: '1',    start: 4, end: 5},
{name: 'literal', value: '.\\{', start: 6, end: 9},
{name: 'value',   value: 'function(x,y){return {y:x}}', start: 10, end: 37}
] */

// Sticky mode via flag y
const str4 = '<1><<<2>>><3>4<5>';
XRegExp.matchRecursive(str4, '<', '>', 'gy');
// -> ['1', '<<2>>', '3']

// Skipping unbalanced delimiters instead of erroring
const str5 = 'Here is <div> <div>an</div> unbalanced example';
XRegExp.matchRecursive(str5, '<div\\s*>', '</div>', 'gi', {
    unbalanced: 'skip'
});
// -> ['an']

By default, XRegExp.matchRecursive throws an error if it scans past an unbalanced delimiter in the target string. Multiple alternative options are available for handling unbalanced delimiters.
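
To illustrate the depth tracking that recursive matching relies on, here is a minimal sketch in plain JavaScript. It is not how XRegExp.matchRecursive is implemented: matchBalanced is a hypothetical helper that only handles single-character delimiters, has no valueNames, escapeChar, or sticky options, and always throws on unbalanced input.

```javascript
// Hypothetical helper: returns the text between outermost delimiter pairs
function matchBalanced(str, open, close) {
    const results = [];
    let depth = 0;
    let start = -1;
    for (let i = 0; i < str.length; i++) {
        if (str[i] === open) {
            // Remember where the outermost pair's content begins
            if (depth === 0) start = i + 1;
            depth++;
        } else if (str[i] === close) {
            if (depth === 0) throw new Error(`Unbalanced delimiter at index ${i}`);
            depth--;
            // Back at depth 0: an outermost pair has just closed
            if (depth === 0) results.push(str.slice(start, i));
        }
    }
    if (depth !== 0) throw new Error('Unbalanced delimiter: missing closing delimiter');
    return results;
}

const balanced = matchBalanced('(t((e))s)t()(ing)', '(', ')');
// balanced -> ['t((e))s', '', 'ing']
```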

Installation and usage

In browsers (bundle XRegExp with all of its addons):

<script src="https://unpkg.com/xregexp/xregexp-all.js"></script>

Using npm:

npm install xregexp

In Node.js:

const XRegExp = require('xregexp');

Contribution guide

1. Fork the repository and clone the forked version locally.
2. Ensure you have the typescript module installed globally.
3. Run npm install.
4. Ensure all tests pass with npm test.
5. Add tests for the new functionality, or tests that fail because of the bug you are fixing.
6. Implement the functionality or bug fix so that the tests pass.

Credits

XRegExp project collaborators are:

Thanks to all contributors and others who have submitted code, provided feedback, reported bugs, and inspired new features.

XRegExp is released under the MIT License. Learn more at xregexp.com.

xregexp's Issues

Cache the regex copies made by XRegExp.exec/replace

For various reasons, the XRegExp.exec and XRegExp.replace functions make copies of their provided regexes, sometimes with the addition or removal of flags /g and/or /y. For improved performance, these copies should be cached on a regex's xregexp object. The cached copies can be shared by all XRegExp functions that benefit from their use.

XRegExp.test, XRegExp.forEach, XRegExp.split, and the new XRegExp.replaceEach all rely on XRegExp.exec or XRegExp.replace, so they will share the performance improvements.

This should also make XRegExp.exec fast enough to allow the private and performance-sensitive runTokens function to take advantage of XRegExp.exec's sticky-mode matching, rather than reinventing the sticky wheel.
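
A rough sketch of the caching idea, using only native RegExp objects. The issue proposes storing copies on the regex's xregexp object; this sketch swaps in a WeakMap instead, and getCachedCopy is a hypothetical name, not part of XRegExp.

```javascript
// Hypothetical cache keyed by the original regex. A WeakMap lets entries be
// garbage-collected together with the regex they belong to.
const copyCache = new WeakMap();

function getCachedCopy(regex, extraFlags) {
    let copies = copyCache.get(regex);
    if (!copies) {
        copies = new Map();
        copyCache.set(regex, copies);
    }
    if (!copies.has(extraFlags)) {
        // Deduplicate flags so that e.g. 'g' + 'g' does not throw
        const flags = [...new Set(regex.flags + extraFlags)].join('');
        copies.set(extraFlags, new RegExp(regex.source, flags));
    }
    return copies.get(extraFlags);
}

const digit = /\d/;
const globalDigit = getCachedCopy(digit, 'g');
const sameCopy = getCachedCopy(digit, 'g') === globalDigit; // -> true
```

One consequence of sharing copies is that a cached regex with flag g or y carries lastIndex state between calls, so any code using it would need to reset lastIndex before each use.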

Add a function for performing multiple replacements

Proposed name: XRegExp.replaceSet. Edit: New name: XRegExp.replaceEach.

Create a new function called XRegExp.replaceSet that provides sugar for performing multiple sequential replacements. It will accept two arguments, str {String} and replacements {Array}, and return a new string with all replacements applied.

Details:

Uses the existing XRegExp replacement text syntax, with support for ${name}, $0, etc.
Later replacements operate on the output of earlier replacements, rather than the original string.
Allows specifying scope as 'one' or 'all' via the third item in a replacement array. This follows the XRegExp.replace function, where the optional scope argument overrides the state of /g.

Usage example:

XRegExp.replaceSet(str, [
  [XRegExp('(?<z>z)'), 'a${z}'],
  [/y/gi, 'b'],
  [/x/g, 'c', 'one'], // scope 'one' overrides /g
  [/w/, 'd', 'all'],  // scope 'all' overrides lack of /g
  ['v', 'e', 'all'],  // scope 'all' allows replace-all for strings
  [/u/g, function ($0) {
    return 'f' + $0.toUpperCase();
  }]
]);

Rationale:

To get the same functionality with XRegExp v2.0.0 (without any custom sugar), you'd have to write a pyramid of doom:

XRegExp.replace(
  XRegExp.replace(
    XRegExp.replace(
      XRegExp.replace(
        XRegExp.replace(
          XRegExp.replace(
            str, XRegExp('(?<z>z)'), 'a${z}'
          ), /y/gi, 'b'
        ), /x/g, 'c', 'one'
      ), /w/, 'd', 'all'
    ), 'v', 'e', 'all'
  ), /u/g, function ($0) {
    return 'f' + $0.toUpperCase();
  }
)

You could avoid this by extending String.prototype with a method that calls XRegExp.replace, but using XRegExp.replaceSet would still be cleaner and shorter.

Implementation:

Something simple like this:

XRegExp.replaceSet = function (str, replacements) {
  var i, r;
  for (i = 0; i < replacements.length; ++i) {
    r = replacements[i];
    str = XRegExp.replace(str, r[0], r[1], r[2]);
  }
  return str;
};
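
For readers who want to try the sequencing and scope rules without XRegExp loaded, here is an approximation built on native String.prototype.replace and replaceAll. replaceEachNative is a hypothetical name, and the sketch uses native replacement syntax ($<name>, $&) rather than XRegExp's ${name} and $0.

```javascript
// Hypothetical native approximation of the proposed function
function replaceEachNative(str, replacements) {
    for (const [search, replacement, scope] of replacements) {
        if (typeof search === 'string') {
            // Scope 'all' allows replace-all for strings
            str = scope === 'all'
                ? str.replaceAll(search, replacement)
                : str.replace(search, replacement);
        } else {
            // An explicit scope overrides the presence or absence of flag g
            const all = scope ? scope === 'all' : search.global;
            const flags = search.flags.replace('g', '') + (all ? 'g' : '');
            str = str.replace(new RegExp(search.source, flags), replacement);
        }
    }
    return str;
}

const replaced = replaceEachNative('xxyyzz', [
    [/x/g, 'c', 'one'], // scope 'one' overrides /g
    [/y/, 'b', 'all'],  // scope 'all' overrides lack of /g
    ['z', 'e', 'all']
]);
// replaced -> 'cxbbee'
```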

Support the \pL shorthand for \p{L}

PHP allows the short form \pL instead of \p{L} but it isn't working in xregexp 2.
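
As a stopgap, the shorthand can be expanded before the pattern is compiled. The sketch below is a plain-JavaScript preprocessing step that targets the native u flag rather than XRegExp; expandPropertyShorthand is a hypothetical helper, and it does not account for escaped backslashes in front of the token.

```javascript
// Hypothetical helper: rewrites \pL, \PL, \pN, etc. to \p{L}, \P{L}, \p{N}
function expandPropertyShorthand(pattern) {
    return pattern.replace(/\\([pP])([CLMNPSZ])/g, '\\$1{$2}');
}

const expanded = expandPropertyShorthand('^\\pL+$');
// expanded is the pattern ^\p{L}+$
const letters = new RegExp(expanded, 'u');
const matchesLetters = letters.test('Русский'); // -> true
const matchesDigits = letters.test('123'); // -> false
```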

\p and \P and mixed astral/BMP within character classes

Hey Steve!

Just a thought that although astral characters cannot be directly supported within character classes, I think they can be simulated by the likes of:

(<high1><low1>|<high2><low2>)

Even ranges could be calculated by joining appropriate ranges of surrogates, e.g.:

(<high1a>[<low1a>-<low1a>]|[<high1b>-<high1b>][<low1b>-<low1b>]|<high1c>[<low1c>-<low1c>])

whereby the first and third (a,c) alternates might not be necessary if the entire range of surrogates on either end is requested.

Negation would I guess need to compute all non-astral, non-excluded characters/ranges and join that to the inverse of the surrogate pattern above.
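
The alternation of surrogate pairs described above can be sketched as follows. This covers only individual code points, not ranges or negation, and the helper names are hypothetical.

```javascript
// Split an astral code point (above U+FFFF) into its two UTF-16 code units
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Format one code unit as a \uXXXX regex escape
function escapeCodeUnit(unit) {
    return '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
}

// Build (?:<high1><low1>|<high2><low2>) for a list of astral code points
function astralAlternation(codePoints) {
    const alternatives = codePoints.map(
        (cp) => toSurrogatePair(cp).map(escapeCodeUnit).join('')
    );
    return `(?:${alternatives.join('|')})`;
}

const astralPattern = astralAlternation([0x1F4A9, 0x1D11E]);
// astralPattern is (?:\uD83D\uDCA9|\uD834\uDD1E)
const astralRegex = new RegExp(`^${astralPattern}$`);
const matchesPoo = astralRegex.test('\uD83D\uDCA9'); // -> true
```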

[Bug] Astral mode tests true for surrogate pairs, but not symbols themselves

The following example uses the symbol GClef (U+1D11E):

Unfortunately, I don't think this web form allows me to enter those symbols into the text... but under Linux, holding CTRL+Shift while typing "1D11E" results in the symbol appearing in the text. I think you actually need to use the character map in Windows, and something similar on a Mac...

XRegExp.install('astral');
XRegExp('^\\pS$').test('\uD834\uDD1E'); //--> true
XRegExp('^\\pS$').test('<G_Clef_Here>'); //--> false

String.replace polyfill throws SyntaxError for valid replacement values

The polyfill for String.prototype.replace incorrectly throws a SyntaxError: Invalid token for valid replacement values when natives are installed in XRegExp version 3.0.0-pre.

Source: xregexp.js, line 1433

Sample Code

The following code throws a SyntaxError: Invalid token $% error in XRegExp 3.0.0-pre:

XRegExp.install('natives');
'abc'.replace('b', '$%'); // throws "SyntaxError: Invalid token $%"

If you run the above code without installing natives, modern browsers (Chrome and Firefox) return the correct result without throwing an error:

'abc'.replace('b', '$%'); // returns "a$%c"

Expected Behavior

When XRegExp encounters a $ character in the replacement string that is not followed by $, &, `, ', n, or nn, it should leave that $ in the output as-is instead of throwing a SyntaxError.

According to ECMA-262 (PDF) § 15.5.4.11:

If replaceValue is [not] a function ... let newstring denote the result of converting replaceValue to a string. The result is a string value derived from the original input string by replacing each matched substring with a string derived from newstring by replacing characters in newstring by replacement text as specified in the following table. These $ replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements. For example, "$1,$2".replace(/(\$(\d))/g, "$$1-$1$2") returns "$1-$11,$1-$22".

A $ in newstring that does not match any of the forms below is left as is.

Characters	Replacement text
$$	$
$&	The matched substring.
$`	The portion of string that precedes the matched substring.
$'	The portion of string that follows the matched substring.
$n	The nth capture, where n is a single digit 1-9 and $n is not followed by a decimal digit. If n≤m and the nth capture is undefined, use the empty string instead. If n>m, the result is implementation-defined.
$nn	The nnth capture, where nn is a two-digit decimal number 01-99. If nn≤m and the nnth capture is undefined, use the empty string instead. If nn>m, the result is implementation-defined.
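
The native behavior described by the spec text can be checked directly, without XRegExp or installed natives. The variable names are only for illustration.

```javascript
// A $ that does not start a recognized token is left as-is
const unknownToken = 'abc'.replace('b', '$%'); // -> 'a$%c'

// Recognized tokens are substituted
const wholeMatch = 'abc'.replace('b', '[$&]'); // -> 'a[b]c'
const literalDollar = 'abc'.replace('b', '$$'); // -> 'a$c'

// The example from the spec: replacements are not reprocessed
const specExample = '$1,$2'.replace(/(\$(\d))/g, '$$1-$1$2');
// specExample -> '$1-$11,$1-$22'
```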

Can I match a string with a fromIndex?

I need something like regex.match("1a45", "\d", 1), where the third parameter is a fromIndex: matching starts at that index of "1a45" (here the second character, 'a'). Unlike lastIndex on a native JS RegExp, the match must begin exactly at fromIndex; otherwise the match fails.

Is this feature supported by XRegExp? Thanks!
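
The README's usage examples show XRegExp.exec accepting optional pos and 'sticky' arguments, which appears to cover this case. For comparison, the same "must match exactly at this index" behavior can be sketched with the native y flag; matchAt is a hypothetical helper.

```javascript
// Hypothetical helper: succeed only if the match starts exactly at fromIndex
function matchAt(str, regex, fromIndex) {
    const flags = regex.flags.includes('y') ? regex.flags : regex.flags + 'y';
    const sticky = new RegExp(regex.source, flags);
    sticky.lastIndex = fromIndex;
    return sticky.exec(str);
}

const atLetter = matchAt('1a45', /\d/, 1); // -> null ('a' is not a digit)
const atDigit = matchAt('1a45', /\d/, 2);  // -> match for '4' at index 2
```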

[Feature request] Add Support for \Q .. \E block escapes.

From page 29 of O’Reilly's 2nd edition of Regular Expressions Cookbook by Levithan and Goyvaerts:

\Q suppresses the meaning of all metacharacters, including the backslash, until \E. If you omit \E, all characters after the \Q until the end of the regex are treated as literals.

Example:

/\QI *love* donuts (and pizza).\E/

instead of

/I \*love\* donuts \(and pizza\)\./

This feature is available in Java, PCRE, and Perl, and would make a useful addition to XRegExp. Some client-side JavaScript may receive regexes that include block quotes from server-side code using one of those regex flavors, or may simply contain literal text that would otherwise need a lot of manual escaping, as in the example sentence above.

Thanks.
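
Until something like this exists in the library, one workaround is to expand \Q...\E spans before compiling the pattern. The sketch below is a plain-JavaScript preprocessing step with hypothetical helper names; it does not detect a \Q that is itself preceded by an escaping backslash.

```javascript
// Escape every regex metacharacter in a literal string
function escapeRegexLiteral(text) {
    return text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

// Replace each \Q...\E span (or \Q to the end of the pattern) with escaped text
function expandQuotes(pattern) {
    return pattern.replace(
        /\\Q([\s\S]*?)(?:\\E|$)/g,
        (whole, literal) => escapeRegexLiteral(literal)
    );
}

const quoted = expandQuotes('\\QI *love* donuts (and pizza).\\E');
// quoted is the pattern I \*love\* donuts \(and pizza\)\.
const donutsMatch = new RegExp(quoted).test('I *love* donuts (and pizza).'); // -> true
```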

Support for (?-i) mode modifiers in the middle of the regex?

Any plans on supporting negative and positive mode modifiers (?letters), such as (?i) and (?-i), in the middle of the regex?

For example: Input : (?i)te(?-i)st
Matches: test, TEst, but not teST or TEST.

http://www.regular-expressions.info/refmodifiers.html

Thanks!
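
For simple literal segments, the effect of (?i)te(?-i)st can be approximated by expanding each letter of the case-insensitive part into a character class. This is only a sketch for ASCII letters in literal text, and caseInsensitiveLiteral is a hypothetical helper.

```javascript
// Hypothetical helper: 'te' becomes '[tT][eE]'
function caseInsensitiveLiteral(text) {
    return text.replace(/[a-z]/gi, (ch) => `[${ch.toLowerCase()}${ch.toUpperCase()}]`);
}

// Equivalent in spirit to (?i)te(?-i)st
const partialInsensitive = new RegExp('^' + caseInsensitiveLiteral('te') + 'st$');

const lower = partialInsensitive.test('test');   // -> true
const upperTe = partialInsensitive.test('TEst'); // -> true
const upperSt = partialInsensitive.test('teST'); // -> false
const allUpper = partialInsensitive.test('TEST'); // -> false
```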

[Edit:] Use \P{Cn} data for \p{Assigned}

Edit: Original title: Remove \p{Assigned}

The combined BMP and astral data for \p{Assigned} is nearly 7 KB uncompressed, making it easily one of the heftiest Unicode properties supported by XRegExp. However, it adds no value since the Unicode Categories addon already supports \p{Cn} (and its full name, \p{Unassigned}), which is the exact inverse of \p{Assigned}. In other words, you can match the same characters as \p{Assigned} by using \P{Cn} or \p{^Cn}.

UTS #18 includes \p{Assigned} as one of the properties required for Level 1 Unicode support. However, unlike all other Level 1 properties, the UnicodeSet application on unicode.org does not support \p{Assigned}.

This change breaks backward compatibility, but is not expected to affect many, if any, XRegExp users. The Unicode Properties addon which includes \p{Assigned} was only added very recently in the XRegExp 2.0.0 release, and most people are more familiar with \P{Cn}, which will continue to work. Java, .NET, and PCRE all support \P{Cn} but not \p{Assigned}. (Perl and Oniguruma support both \P{Cn} and \p{Assigned}.)

(Note that XRegExp cannot support \p{Assigned} via a scripted inversion of the data used by \p{Cn} because of the complexity of the surrogate-pair-based ranges in the astral data.)
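
The inverse relationship between Assigned and Cn can be spot-checked with native Unicode property escapes (flag u), which support both names. This uses the JavaScript engine's own Unicode data, not XRegExp's, so results for individual code points depend on the runtime's Unicode version.

```javascript
const assigned = /^\p{Assigned}$/u;
const notUnassigned = /^\P{Cn}$/u;

// A letter, a currency sign, a reserved code point, and a noncharacter
const samples = ['a', '\u20BA', '\u0378', '\uFFFF'];
const categoriesAgree = samples.every(
    (ch) => assigned.test(ch) === notUnassigned.test(ch)
);
// categoriesAgree -> true
```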

\s => s

Slight difficulty I found.

This string:

(?:“|"|")([\s\S]+?)(?:”|″|"|")\s_?:\s_?(?:“|"|")([\s\S]+?)(?:”|″|"|")

passed as a parameter to the XRegExp constructor generates the following RegExp:

/(?:“|"|")([\s\S]+?)(?:”|″|"|")s_?:s_?(?:“|"|")([\s\S]+?(?:”|″|"|")/gi

Extracted from the first string:

\s_?:\s_?

and from the second:

s_?:s_?

My current workaround as a replacement for the previous string (brackets for clarity):

[ ]?:[ ]?

Test text which should be matched:

"Lucky charm" : "22.7"

A compatibility bug

This code triggers a bug: m.index should be a number, but it is overwritten by the value captured by the group named index.

r = XRegExp('(?<index>\\w)(?<input>\\d)', 'g')
m = XRegExp.exec('a1b2', r)
console.log(m.index)

I hope the XRegExp.exec method can return a plain array rather than an object polluted by custom properties, for example an array like [match, index, input]. Even prefixing the property names with an underscore, such as _index and _input, would be better.
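
This kind of collision is what a separate groups object avoids, which is the ES2018 design that the README says XRegExp 5 adopted. The behavior can be seen with native named capture, where the named values live on match.groups and the standard index and input properties stay intact.

```javascript
const collision = /(?<index>\w)(?<input>\d)/.exec('a1b2');

collision.index;        // -> 0 (still the match position)
collision.input;        // -> 'a1b2' (still the subject string)
collision.groups.index; // -> 'a'
collision.groups.input; // -> '1'
```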

Add component support

XRegExp should support component.

All that is needed is a component.json file.

Mixing named and unnamed captures

Hi slevithan,

first thanks for your great library. Really, really appreciating your work!

Don't know if I'm doing something wrong, but consider the following:

var url = 'page/edit/en/4f55fbbab51bda0df1000001/unnamed'
var re = XRegExp('^page/edit/(?<language>[a-z]{2})/(?<entityId>[a-z0-9]{24})/(?:.*)$');
var match = XRegExp.exec(url, re);
console.log(match);
/**
output:

0: "page/edit/en/4f55fbbab51bda0df1000001/unnamed",
1: "en",
2:"4f55fbbab51bda0df1000001",
entityId:"4f55fbbab51bda0df1000001",    
index:0,
input:"page/edit/en/4f55fbbab51bda0df1000001/unnamed",
language:"en"
*/

My problem is that I have to capture the "unnamed" part of the URL, which is matched by (?:.*). The match object doesn't hold the value "unnamed"...

Is XRegExp not able to mix named and unnamed captures, or am I missing something?

Thanks!
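
Note that (?:…) is a non-capturing group, so nothing is stored for it; a capturing group needs plain parentheses or a name. The sketch below shows both options with native named capture. XRegExp's numbering is expected to follow the same left-to-right order, judging by the output quoted above, but that is an assumption rather than something tested here.

```javascript
const pageUrl = 'page/edit/en/4f55fbbab51bda0df1000001/unnamed';

// Option 1: a plain (unnamed) capturing group gets the next number
const mixed = new RegExp('^page/edit/(?<language>[a-z]{2})/(?<entityId>[a-z0-9]{24})/(.*)$');
const mixedMatch = mixed.exec(pageUrl);
mixedMatch[3]; // -> 'unnamed'

// Option 2: name the last group as well (rest is a hypothetical name)
const named = new RegExp('^page/edit/(?<language>[a-z]{2})/(?<entityId>[a-z0-9]{24})/(?<rest>.*)$');
const namedMatch = named.exec(pageUrl);
namedMatch.groups.rest; // -> 'unnamed'
```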

Change XRegExp.matchAll to XRegExp.match

XRegExp 2.1.0-dev (pre-release) added XRegExp.matchAll (see #16). However, before the release of v2.1.0 final, I plan to change both the name and semantics of the function. XRegExp.matchAll will be removed. In its place, a new XRegExp.match function will offer both match-all and match-first modes. The mode will be set via an optional third scope argument, which works like the scope argument of XRegExp.replace. It will accept the values 'one' (default) or 'all'. Also like XRegExp.replace, the presence or absence of flag /g can be used to set the scope, but an explicitly specified scope will always override /g.

When scope is 'one', XRegExp.match will return the first match as a string, or null if no match is found. (If you want backreference properties, etc., that's what XRegExp.exec is for.) When scope is 'all', XRegExp.match will return an array of strings, or an empty array if no match is found.

This is essentially a more convenient re-implementation of String.prototype.match that gives you the result types you actually want (string instead of exec-style array in match-first mode, and an empty array instead of null when no matches are found in match-all mode), and lets you override/ignore flag /g and lastIndex.
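
The proposed semantics can be sketched on top of native methods. matchScoped is a hypothetical stand-in for the planned function, written only to make the return types concrete; it always starts from the beginning of the string and so ignores lastIndex.

```javascript
// Hypothetical sketch of the planned match-first / match-all behavior
function matchScoped(str, regex, scope) {
    // An explicit scope overrides the presence or absence of flag g
    const all = scope ? scope === 'all' : regex.global;
    const flags = regex.flags.replace('g', '');
    if (all) {
        // Match-all mode: an array of strings, empty when nothing matches
        return str.match(new RegExp(regex.source, flags + 'g')) || [];
    }
    // Match-first mode: a string, or null when nothing matches
    const first = new RegExp(regex.source, flags).exec(str);
    return first ? first[0] : null;
}

const firstDigit = matchScoped('a1b2', /\d/);        // -> '1'
const allDigits = matchScoped('a1b2', /\d/, 'all');  // -> ['1', '2']
const noDigits = matchScoped('abc', /\d/, 'all');    // -> []
const forcedOne = matchScoped('a1b2', /\d/g, 'one'); // -> '1'
```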

better toString()

in loveencounterflow@a81f8b2, i try to make it so that, when printed to the console (in NodeJS, using ( require 'utils' ).inspect), an XRegExp object is represented by its input pattern, not its compiled regular-JS representation.

simple reason: i have to print out a lot of values that may contain XRegExp objects. when you combine a few advanced features, the output quickly gets incredibly long. for example, var x = new XRegExp '^\\p{L}+$'; console.log( x ) will cause an output several hundred characters long that contains characters from scripts all around the world:

{ /^[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ ... ... ... ... ... ... ... ... ... ...

which not only hides the intention of the pattern, it also makes the console (and the textarea i'm writing this in) grind to a halt (almost; i shortened the above quote for fear it could render this very page unusable).

my uninformed patch seems to work when you do console.log( x + '' ), but not without adding that string literal. i think it would be much more helpful to have the input pattern displayed; as it stands,

most people will be unable to check the compiled pattern for correctness anyway, and
the present output already differs structurally from the output of a plain RegExp, so making it more readable and sensible would be a great idea IMHO.
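
In current versions of Node.js, one way to control how an object prints is the util.inspect.custom symbol. The sketch below attaches it to a plain RegExp; withSourceLabel and the output format are hypothetical, and this is not something the issue or XRegExp is known to use.

```javascript
const util = require('util');

// Hypothetical helper: make console.log/util.inspect show the original
// pattern source instead of the compiled regex
function withSourceLabel(regex, source) {
    regex[util.inspect.custom] = () => `XRegExp(${JSON.stringify(source)})`;
    return regex;
}

// Stand-in for a long compiled pattern
const compiled = withSourceLabel(/^[A-Za-zªµºÀ-ÖØ-ö]+$/, '^\\p{L}+$');
const printed = util.inspect(compiled);
// printed shows XRegExp("^\\p{L}+$") rather than the expanded character class
```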

XRegExp fails under Rhino

I am trying to use XRegExp 3.0.0-pre with Rhino 1.6r2 (which is the version of Rhino shipping with Java 6).

Compiling the regex below (taken from http://xregexp.com/ ):

date = XRegExp('(?<year>  [0-9]{4} ) -?  # year  \n' +
               '(?<month> [0-9]{2} ) -?  # month \n' +
               '(?<day>   [0-9]{2} )     # day     ', 'x');

triggers the following error message:

"Invalid quantifier ?" at script line 517 (which is the line: "return augment(new RegExp(key.pattern, key.flags), key.captures, /*addProto*/ true);")

Inspecting key.pattern reveals that the ?<...> are not being stripped out:

(?<year>(?:)[0-9]{4}(?:))(?:)-?(?:)year(?:)(?<month>(?:)[0-9]{2}(?:))(?:)-?(?:)month(?:)(?<day>(?:)[0-9]{2}(?:))(?:)day

Does anyone have a workaround?

Unable to read Japanese filename (Need HELP)

Hi all,

I need your help!

I am currently working on a file uploader that accepts alphanumeric, Hiragana, and Katakana characters, as well as . - _ in filenames. The funny part is that the test function returns true when I paste a Japanese string, but when I try to upload a file with the same string as its filename, it returns false.

Here's my regex: XRegExp("^[\p{Hiragana}\p{Katakana}\p{L}\p{N}._-]+$")

Anyone knows what is the issue? =/

Thanks for your time!
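
One thing worth checking, assuming the pattern was written in an ordinary string literal exactly as shown: JavaScript drops the backslash from unrecognized escapes such as \p, so the library never sees \p{…}. The README examples double the backslashes ('\\p{Hiragana}') or use a raw template via XRegExp.tag. Whether this explains the upload failure cannot be determined from the report; the sketch only demonstrates the string-escaping behavior.

```javascript
// As written in the issue: each \p collapses to a plain p
const unescaped = "^[\p{Hiragana}\p{Katakana}\p{L}\p{N}._-]+$";

// With doubled backslashes, or as a raw template, the backslashes survive
const escaped = "^[\\p{Hiragana}\\p{Katakana}\\p{L}\\p{N}._-]+$";
const raw = String.raw`^[\p{Hiragana}\p{Katakana}\p{L}\p{N}._-]+$`;

const lostBackslashes = !unescaped.includes('\\'); // -> true
const sameWhenEscaped = escaped === raw;           // -> true
```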

3.0 on npm

It seems that npm now has only 2.0., are you going to publish 3.0?

Accessing captureNames array, is it ok?

This is more a question than an issue.

I have some regex that's customizable and I want to dynamically grab capture names from the configured regex. I see that all captured names are stored in the captureNames array in xregexp. I assume it should be OK to access that field, right? I.e., it's not expected to change anytime soon.

Add RequireJS support for xregexp-all.js

This will require wrapping the concatenated source files using an intro.js and outro.js file, to avoid creating the XRegExp global variable when loaded as a RequireJS module.

Use __proto__ to set the prototype chain, when possible

Background info:

Because the XRegExp function returns a nonprimitive value, ES rules dictate that it can't be used as a constructor. I.e., the returned regexes inherit from RegExp.prototype rather than XRegExp.prototype, regardless of whether new is used. XRegExp v2.0.0 attaches XRegExp.prototype methods directly to regex objects when they are created or copied by XRegExp.

Going forward:

In browsers that support __proto__ (all but IE), XRegExp will set the prototype object of regexes created or copied by XRegExp to XRegExp.prototype. The main benefit of this is performance, especially when many properties are added to XRegExp.prototype (the Prototype Methods addon currently adds six, which isn't so bad, but users can add as many as they want). There are minor secondary benefits for instanceof, getPrototypeOf, isPrototypeOf, etc.

Because XRegExp.prototype will itself be a regex created by new RegExp(), regexes with swapped prototype objects will continue to inherit from RegExp, in addition to XRegExp. In browsers that don't support __proto__, regexes will continue to inherit from RegExp and have XRegExp.prototype properties assigned as own properties.

In all cases, XRegExp.isRegExp will continue to work. instanceof, constructor, and Object.prototype.toString tests against RegExp will also continue to work just fine for all regexes, regardless of whether they are native, XRegExp-augmented, or XRegExp-created. In other words, there should be no backward-compatibility issues in any browser.
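
The prototype swap can be sketched with the standard Object.setPrototypeOf instead of __proto__. In this sketch the swapped-in prototype is an ordinary object created with Object.create(RegExp.prototype) rather than a regex, and extendedProto, countMatches, and augment are hypothetical names.

```javascript
// A prototype object that adds methods while still inheriting from RegExp
const extendedProto = Object.create(RegExp.prototype);
extendedProto.countMatches = function (str) {
    const all = str.match(new RegExp(this.source, 'g'));
    return all ? all.length : 0;
};

// Swap the prototype instead of copying methods onto each regex
function augment(regex) {
    return Object.setPrototypeOf(regex, extendedProto);
}

const digits = augment(/\d/);
const digitCount = digits.countMatches('a1b2c3'); // -> 3
const stillRegExp = digits instanceof RegExp;     // -> true
const nativeWorks = digits.test('x9');            // -> true
```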

XRegExp.build strips trailing unescaped $ in subpatterns when leading ^ not present

The XRegExp.build addon is only supposed to strip a leading ^ and trailing unescaped $ from subpatterns when both are present.

This is an edge case that is not known to affect any code in use, but nevertheless, I will fix this immediately and add tests.
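
The intended rule (strip only when both anchors are present, and only when the trailing $ is unescaped) can be sketched as below. stripAnchors is a hypothetical helper for illustration and is not the addon's actual code; it treats a $ as escaped when it is preceded by an odd number of backslashes.

```javascript
// Hypothetical helper illustrating the intended stripping rule
function stripAnchors(pattern) {
    const hasLeadingCaret = pattern.startsWith('^');
    // Capture the run of backslashes directly before a final $
    const trailing = /(\\*)\$$/.exec(pattern);
    const hasUnescapedDollar = trailing !== null && trailing[1].length % 2 === 0;
    return hasLeadingCaret && hasUnescapedDollar ? pattern.slice(1, -1) : pattern;
}

const bothAnchors = stripAnchors('^[0-5][0-9]$'); // -> '[0-5][0-9]'
const onlyDollar = stripAnchors('[0-5][0-9]$');   // -> '[0-5][0-9]$' (unchanged)
const escapedDollar = stripAnchors('^a\\$');      // unchanged: the $ is escaped
```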

npm module has not been updated

to reflect the changes made in the readme.

The readme says to use require('xregexp'); however, to access the functions you still have to use require('xregexp').XRegExp.

Upgrade to Unicode 6.2.0

Unicode 6.2.0 is currently in beta and won't be released until late September or early October. However, the changes that will affect XRegExp are already well defined (see: What's new in Unicode 6.2?). Specifically, the changes are as follows:

Turkish Lira Sign (U+20BA):

Add U+20BA to categories \p{S} and \p{Sc}.
Remove U+20BA from categories \p{C} and \p{Cn}.
Add U+20BA to property \p{Assigned} (no longer relevant since XRegExp defines Assigned as the inverse of Cn, without separate data).

Arabic Wavy Hamza Below (U+065F):

Move U+065F from script \p{Inherited} to \p{Arabic}.

IMO, it makes sense to go ahead and add these early because XRegExp 3.0.0 is almost ready for release. RegexBuddy 4 will include XRegExp as a supported regex flavor, and in future versions (v4.1?) RegexBuddy will add astral support and treat changes in the supported Unicode version as a separate regex flavor. Including Unicode upgrades in major releases of XRegExp (as with any other nonbugfix syntax changes) would therefore be ideal.

If there are any changes between the Unicode 6.2.0 beta and final release data (this seems unlikely), they can be added in an XRegExp bugfix release and will not require a new major version.

Transfer ownership of NPM package to slevithan

Right now I'm the owner of the NPM package. This is not very practical in the long run. I suggest the following procedure:

I get the NPM username of @slevithan (need to npm adduser).
I add him as owner of the xregexp package (npm owner add).
He publishes the next version of xregexp on NPM (npm publish . in root dir).
When he feels comfortable publishing NPM packages, I'll remove myself as owner, thereby transferring all ownership to @slevithan. In the meanwhile I'll happily try to sort out potential issues (hopefully none).
I make a notice on https://github.com/walling/xregexp saying that the project is deprecated and people should go here to look for the source of the NPM package.

Comments are welcome.

Addons: Support reparsing the output of syntax/flag tokens

Add an option to the XRegExp.addToken options object (perhaps a boolean called reparseOutput) that lets a token's output be reprocessed. This would allow chaining new syntax/flag tokens, and provide greater flexibility and simplicity. E.g., [:alnum:] within character classes could return \p{L}\p{M}\p{Nd}, and the actual code point range generation could be deferred to the Unicode Categories addon.

Example usage:

// Allow \pL (etc.) as shorthand for \p{L}
XRegExp.addToken(
  /\\([pP])([CLMNPSZ])/,
  function (m) {
    return '\\' + m[1] + '{' + m[2] + '}';
  },
  {
    scope: 'all',
    reparseOutput: true
  }
);

Remove XRegExp.prototype.apply/call?

Back story:

XRegExp 0.5.0 added the methods RegExp.prototype.apply/call. XRegExp 2.0.0-beta moved them to XRegExp.prototype, added XRegExp.install('methods') to copy them back to RegExp.prototype, and made XRegExp(regexp)/XRegExp.globalize augment copied regexes with apply and call methods.

Going forward:

I'm considering removing the regex apply and call methods altogether in XRegExp 2.0.0 final. Since the built-in array collection methods (such as Array.prototype.filter) don't use duck-typed apply or call, adding these methods to regexes just doesn't seem useful often enough to justify them.

Make the use of groups with the same name a SyntaxError

Currently, backreferences to a group, when multiple groups use the same name, refer to the last (rightmost) group with that name. See my named capture comparison page to see how this compares to other libraries. Notably, using multiple groups with the same name is an error in PCRE, Python, and Java.

.NET, Perl, and Oniguruma give useful semantics to multiple groups with the same name, but the behavior is different in each case, and XRegExp's current behavior is different than all of them. XRegExp's current behavior is not very useful, so I will change this to a SyntaxError in XRegExp v2.1.0.

Generally, nonbugfix syntax changes are delayed until v3.0.0. This is being treated as a syntax bugfix, even though it is not technically a bug. The current behavior was intentional, but it was chosen without detailed information (recently provided by Jan Goyvaerts) about all the different and incompatible ways that this is handled in other regex flavors.

Add opt-in astral support to Unicode addons, without separate files

@mathiasbynens, @walling, this issue picks up from #25, since that's now a closed/merged pull request.

Prior to merging the opt-in astral support from the Unicode Categories Astral addon into the (default) Unicode Categories addon (which is automatically included in xregexp-all.js, and therefore in the npm package), following are the changes I think would be beneficial:

Prerequisite:

Data for base categories like \p{L} needs to be added, not just \p{Ll}, \p{Lu}, etc.

Changes:

The XRegExp.addUnicodePackage function in Unicode Base should change from accepting an object with BMP data and an object with optional aliases to instead accept the following array:

[
    {
        name: 'Ll',
        alias: 'Lowercase_Letter', // optional; used to support full category names
        bmp: '0000-FFFF', // compressed BMP data or null
        astral: '010000-10FFFF' // compressed astral data or null
    },
    …
]

The above data will be stored in the private unicode object, without any preprocessing. Two new private lookup objects will be added: bmp and astral. These won't be populated automatically, but instead augmented on first use of each Unicode name in a regex. In other words, these are used to cache generated data.

Astral ranges with surrogate pairs will be built and cached in JavaScript code, on first use.

For scripts and blocks that exist only within astral planes, the bmp property of the objects accepted by addUnicodePackage should be set to null. For addons that include astral support, the astral property should always be included, with null as the value for properties that have no astral code points. The astral property shouldn't be included by addons that don't yet include astral support.

The \p{…} (etc.) syntax token handler in Unicode Base should be updated to check XRegExp.isInstalled("astral") in its handler (main) function. If true, combine data from the bmp and astral objects, and throw a SyntaxError if the match scope is "class". The trigger function currently used in unicode-categories-astral.js will no longer be necessary.

Since the BMP and astral data will be split up, these changes shouldn't inflate unicode-categories.js too much. At least, BMP data will not be included twice.

With these changes in place, separate BMP and all-plane addons won't be needed, and users can opt in or out of astral support at any time.
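The surrogate-pair generation mentioned above could look roughly like the following. This is only a sketch in plain JavaScript, not XRegExp's actual code; the helper names toSurrogates and astralRangeToPattern are hypothetical, and the output is not minimized.

```javascript
// Split an astral code point (U+10000..U+10FFFF) into its UTF-16 surrogates.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Format a UTF-16 code unit as a \uXXXX escape for use in a pattern string.
const esc = (unit) => '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');

// A single escape if both ends are equal, otherwise a character class range.
const unitRange = (a, b) => (a === b ? esc(a) : '[' + esc(a) + '-' + esc(b) + ']');

// Build a pattern that matches any code point in an astral range without
// relying on flag u (hypothetical helper for illustration).
function astralRangeToPattern(start, end) {
  const [h1, l1] = toSurrogates(start);
  const [h2, l2] = toSurrogates(end);
  if (h1 === h2) {
    return esc(h1) + unitRange(l1, l2);
  }
  // First high surrogate (partial), middle high surrogates (full), last (partial)
  const parts = [esc(h1) + unitRange(l1, 0xDFFF)];
  if (h2 - h1 > 1) {
    parts.push(unitRange(h1 + 1, h2 - 1) + unitRange(0xDC00, 0xDFFF));
  }
  parts.push(esc(h2) + unitRange(0xDC00, l2));
  return '(?:' + parts.join('|') + ')';
}

console.log(astralRangeToPattern(0x1F600, 0x1F64F));
```

Caching the generated string per Unicode name on first use, as proposed, would avoid repeating this work for every regex.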

Question: will "new XRegExp()" be permitted syntax in the future?

From what I can tell, "new XRegExp" will behave correctly for current versions of XRegExp. I was just curious if anyone is using this behavior, and if it will be protected in the future? Are there any plans to override "new XRegExp" or introduce incompatible code?

Consider adding package.json + npm support

This would take the burden off @walling, who maintains https://github.com/walling/xregexp.

We could write a simple script that concatenates all files together in the right order. It could be used as a post-commit hook.

The only thing that would need to change in the XRegExp source code is the way XRegExp is being exposed. @walling simply uses module.exports for this, which works fine in Node.js — but with just a few more lines we could support exporting to Narwhal, RingoJS, Rhino, and AMD loaders like RequireJS as well. I do this in Punycode.js as follows: https://github.com/bestiejs/punycode.js/blob/a6e30c4e2ce7a9a569bc2c84a3435bd5612be59f/punycode.js#L493-510

Is there any way to accomplish this PCRE expression to XRegexp?

I'm trying to convert this expression '$((?:(?>[^()]+)|(?R))*)$' from PCRE (PHP 5.4) to XRegExp. I'm aware that XRegExp doesn't support lookaheads or the recursive (?R). It doesn't matter if I need some extra code to get it working, but I'm failing to find a substitute for this.
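JavaScript regexes have no recursion, so some extra code is needed. The XRegExp.matchRecursive addon listed for this project is aimed at this kind of problem; as a dependency-free illustration, here is a minimal sketch that finds outermost balanced parenthesized substrings with a depth counter. The function name matchBalanced is hypothetical, and the sketch assumes single-character delimiters.

```javascript
// Return every outermost balanced (...) substring, including the delimiters.
function matchBalanced(str, open = '(', close = ')') {
  const results = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < str.length; i++) {
    if (str[i] === open) {
      if (depth === 0) start = i; // remember where the outermost group begins
      depth++;
    } else if (str[i] === close && depth > 0) {
      depth--;
      if (depth === 0) results.push(str.slice(start, i + 1));
    }
  }
  return results; // unbalanced openers are silently ignored
}

console.log(matchBalanced('f(a(b)c) + g(d)'));
```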

Add installable feature 'astral'

This will be used with XRegExp.install and XRegExp.uninstall to enable full 21-bit Unicode support in XRegExp's Unicode addons (which must be loaded separately). See #25 for related information.

[Edit:] Add XRegExp.match

Edit: Original title: Add XRegExp.matchAll

Create a new function called XRegExp.matchAll. This will work the same as String.prototype.match with /g except for the following details:

Returns an empty array (rather than null) if no match is found.
Doesn't require /g. In other words, it works the same for regexes with or without the global flag, and never acts as an alias of exec.
Does not implicitly convert provided non-RegExp search values to regex objects. Instead, a TypeError is thrown.
Fixes any cross-browser bugs for the setting of lastIndex, compared to the native String.prototype.match. In other words, global regexes always have their lastIndex set to 0 upon completion, and non-global regexes never have their lastIndex modified from its original value. When using the native String.prototype.match with /g, IE (<= 8 ?) does not reset lastIndex to 0 upon completion.

This new function should also be mapped/aliased as XRegExp.prototype.matchAll in the XRegExp Prototype Methods addon.
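The behavior listed above can be sketched with native RegExp objects as follows. This is an illustration of the proposal, not the eventual XRegExp implementation; the name matchAllSketch is hypothetical, and it relies on the flags property of modern engines.

```javascript
// Sketch of the proposed semantics using a private copy of the regex.
function matchAllSketch(str, regex) {
  if (!(regex instanceof RegExp)) {
    // No implicit conversion of non-RegExp search values
    throw new TypeError('Expected a RegExp');
  }
  // Always search globally, whether or not the caller used /g
  const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
  const copy = new RegExp(regex.source, flags);
  const result = String(str).match(copy) || []; // [] rather than null
  // Global regexes end with lastIndex 0; non-global ones are never touched
  if (regex.global) regex.lastIndex = 0;
  return result;
}

console.log(matchAllSketch('a1b22c333', /\d+/));
```

Working on a copy is what keeps a non-global regex's lastIndex at its original value.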

Background details:

XRegExp v2.0.0 already includes a version of String.prototype.match with cross-browser lastIndex fixes, but it cannot be used without first running XRegExp.install('natives'). It does not include the other differences mentioned above. All other fixed/extended natives already have a corresponding XRegExp function that does not require overriding natives (XRegExp.exec/test/replace/split).

XRegExp doesn't need an equivalent of String.prototype.match without /g (i.e., match instead of matchAll), because that is already provided by XRegExp.exec. More details on the rationale for adding matchAll but not match can be found here.

String.prototype.match with /g is the last place where XRegExp users need to use flag /g or fiddle with lastIndex. With XRegExp.matchAll in place, XRegExp really will live up to its claim that it "frees you from worrying about pesky inconsistencies in cross-browser regex handling and the dubious lastIndex property."

named backreference failing

Running this code with Firefox 29 and XRegExp 3.0pre shows true for both alerts. The \k<sep> backreference doesn't work as expected, whereas \2 in dateOK does. If we remove the "| \s* (?<fullmonth> August )" part, the bug no longer shows up.

var dateOK = XRegExp.build(' \
( \
 (?<day> [0-3]?\\d) \
\\s* ((?<sep> [./\\s]) ) \\s*  \
(?<month> (1[012]|0?\\d))  \
| \\s* (?<fullmonth> August ) )  \
( \\s* \\2 \\s*  \
(?<year> (20)?[012]\\d)  \
)? ',{},'xni');

var dateBUG = XRegExp.build(' \
( \
 (?<day> [0-3]?\\d) \
\\s* ((?<sep> [./\\s]) ) \\s*  \
(?<month> (1[012]|0?\\d))  \
| \\s* (?<fullmonth> August ) )  \
( \\s* \\k<sep> \\s*  \
(?<year> (20)?[012]\\d)  \
)? ',{},'xni');

alert(XRegExp.exec("05/07 09",dateOK).year == "");
alert(XRegExp.exec("05/07 09",dateBUG).year == "09");

\\bβ not found

Hi, we are trying to make a JavaScript search function handle regexes. We included xregexp-all in our library and tried the following.

regex = XRegExp('\\bβ', 'gi'),
  str = 'The the test data has βa:ŋi in it.',
  parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bñ', 'gi'),
  str = 'The the test data has ña:ŋi in it.',
  parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bç', 'gi'),
  str = 'The the test data has ça:ŋi in it.',
  parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bg', 'gi'),
  str = 'The the test data has ga:ŋi in it.',
  parts;
undefined
regex.test(str);
true
regex = XRegExp('\\bあ', 'gi'),
  str = 'The the test data has あa:ŋi in it.',
  parts;
undefined
regex.test(str);
false

It works with the plain English alphabet, but other characters don't seem to be recognised.
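The results above are consistent with how \b is defined in JavaScript: it only treats [A-Za-z0-9_] as word characters, so there is no boundary between a space and a letter such as β. One possible workaround, sketched here with native syntax (lookbehind and flag u, which require a modern engine), is to replace the leading \b with a negative lookbehind for Unicode letters, marks, and digits. The helper name wordStartRegex is hypothetical.

```javascript
// Build a regex that matches `literal` only when it is not preceded by a
// Unicode letter, mark, digit, or underscore (a stand-in for a leading \b).
function wordStartRegex(literal) {
  const escaped = literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('(?<![\\p{L}\\p{M}\\p{N}_])' + escaped, 'iu');
}

const str = 'The the test data has βa:ŋi in it.';
console.log(/\bβ/i.test(str));              // native ASCII-only boundary
console.log(wordStartRegex('β').test(str)); // Unicode-aware emulation
```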

Incorrect encoding

All non-minified JS in this project is incorrectly encoded and is breaking tools such as sstephenson/sprockets. Please ensure that it's all valid UTF-8.

Remove the 'all' shortcut used by XRegExp.install/uninstall

Providing the string 'all' to XRegExp.install or XRegExp.uninstall currently serves as a shortcut to add or remove all optional features. This shortcut is future hostile since new versions of XRegExp may include new optional features that current users do not mean to install or uninstall. The shortcut should therefore be removed. Users will still be able to add or remove optional features by explicitly naming them.

Feature: Partial matching

Would it be possible to implement partial matching in XRegExp?

This would make real-time validation on web-forms far more user-friendly, as described in comments I posted on this page.

Java's Matcher class apparently supports it, as does this Java library. A number of other libraries for Perl and C++ have this feature, but I was unable to find an implementation in JS.

One possible implementation strategy would be to break the expression down into its individual component expressions, then progressively compare a larger part of the total expression plus a ^ at the end of the expression. If you find a match, as far as I can figure, that should be a partial match. I don't know how difficult it would be to parse and break up the expression into component expressions...

Does this feature seem like a good fit for this library?

Script to auto-generate Unicode ranges

How do you generate and update the Unicode ranges when a new version of Unicode comes along?

Would you be interested in a script that parses UnicodeData.txt and generates the ranges for you?
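As a rough idea of what such a script might do, the sketch below collapses the code points of one general category into hex ranges. It assumes the semicolon-separated UnicodeData.txt layout (field 0 is the code point, field 2 the general category), uses a few inline sample lines instead of reading the real file, and ignores the "First>"/"Last>" range markers the real file contains. The function name and the comma-separated output format are assumptions.

```javascript
// A few sample lines in the UnicodeData.txt layout.
const sample = [
  '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
  '0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;',
  '0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;',
  '0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041',
  '00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;'
].join('\n');

// Collect the code points of one category and merge consecutive ones.
function rangesForCategory(data, category) {
  const codePoints = data
    .split('\n')
    .filter(Boolean)
    .map((line) => line.split(';'))
    .filter((fields) => fields[2] === category)
    .map((fields) => parseInt(fields[0], 16))
    .sort((a, b) => a - b);

  const ranges = [];
  for (const cp of codePoints) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current run
    } else {
      ranges.push([cp, cp]); // start a new run
    }
  }

  const hex = (n) => n.toString(16).toUpperCase().padStart(4, '0');
  return ranges
    .map(([a, b]) => (a === b ? hex(a) : hex(a) + '-' + hex(b)))
    .join(',');
}

console.log(rangesForCategory(sample, 'Lu'));
```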

XRegExp.matchRecursive swaps value/name properties

When you use the valueNames option to enable the detailed match information mode, XRegExp.matchRecursive 0.2.0 is outputting the name value in the value property, and vice versa. Will fix and add tests immediately.

Where to download XRegExp v1.5.1

It appears the download link is gone on xregexp.com and I cannot find a tag, so where can we download it? We need it to provide backward compatibility.

Add XRegExp.execLast

Idea: Add a new function that works similarly to String.prototype.lastIndexOf, except that it accepts a regex to search for and returns a match array (with index and backreference properties) like that returned by exec.

You can sometimes get the last match by pop-ing the array returned by String.prototype.match when provided a regex with /g, but that's sub-optimal for a variety of reasons:

The approach with match often doesn't work at all since matches can overlap. The last match may be entirely different if it's forced to start after all prior matches.
Finding all matches and then popping the last match is inefficient, especially for slow-matching regexes and/or long target strings.
The results of match with /g are simple strings, without any extended info (no match index or backreferences).

The proposed execLast function would essentially loop backward from the end of the string, adding one prior character on each iteration, and testing the regex against the starting position of the sliced string. Efficiency would be improved by wrapping the regex in ^(?:...), and by using flag /y in browsers that support it. Alternatively, I could leave off the anchor and /y, and perform something akin to a binary search. Either approach would, in effect, make all quantifiers nongreedy.

Feedback is very welcome, even if just to say that this would or would not be useful to you. Alternative name suggestions are also welcome.
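The backward loop described above can be sketched with native regexes and flag /y, assuming an engine that supports sticky matching and the flags property. This is an illustration of the idea rather than a proposed implementation, and it ignores surrogate pairs.

```javascript
// Try the regex at each start position, from the end of the string backward,
// and return the first (i.e. rightmost-starting) match array, or null.
function execLast(str, regex) {
  // Work on a sticky, non-global copy so the caller's regex is untouched
  const flags = regex.flags.replace(/[gy]/g, '') + 'y';
  const sticky = new RegExp(regex.source, flags);
  for (let pos = str.length; pos >= 0; pos--) {
    sticky.lastIndex = pos; // with /y, exec only matches exactly at pos
    const match = sticky.exec(str);
    if (match) return match;
  }
  return null;
}

console.log(execLast('aaa', /aa/).index);
```

As noted above, this approach in effect makes quantifiers nongreedy with respect to where the match starts: searching 'c333' for \d+ yields the final '3', not '333'.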

Very poor performance with unicode addon

Using XRegExp v2.0.0 installed through npm, with Node.js v0.10.22 on a recent MacBook Pro. Doing the following:

var XRegExp     = require("xregexp").XRegExp;

var str = "Bonjour, comment allez-vous ? Moi, ça ne va pas très bien à cause d'un gros bug dans l'exécution de mon programme";
var rule = XRegExp("\\P{L}+");

var a = Date.now(); 
var words = XRegExp.split(str, rule);
console.log(Date.now()-a);

gives me results above 350 ms! I saw in the docs that XRegExp is supposed to compile to native regular expressions with little or no performance hit, so I'm surprised to see such poor performance.

Did I do anything wrong? Is the performance better with 3.0?

Thanks

Let XRegExp overwrite itself when loaded twice

Back story:

Loading XRegExp v1.5.x twice in the same frame causes a descriptive error to be thrown.

XRegExp v2.0.0 does not throw the error (since the error could be frustrating in the case of browser plugins or libraries that bundle XRegExp), but it still avoids loading twice. It does this by checking whether XRegExp is defined, and if so, it does not overwrite the variable. The script is silently skipped.

Going forward:

I think it would be better to stop guarding against running twice. v2.0.0 already made it easier to avoid the related issues since native methods are no longer overridden by default, and it's easy to rename XRegExp for user-scripts or to include an older version without conflicts, etc.

To go with this change, I'll need to ensure that everything works correctly if you load the script a second time after running XRegExp.install('natives'). In such cases, XRegExp.uninstall('natives') will simply revert to the versions of native methods that were present when XRegExp last loaded.

Note that the private list of added syntax/flag tokens will be tracked per XRegExp load. In other words, if you load XRegExp twice and both instances use the global name XRegExp, you might lose previously added tokens. The tokens can be re-added by reloading the relevant addons. Separate instances of XRegExp that use modules or different global names to avoid bashing each other will be able to use independent token lists (this already works in v2.0.0).

Add new regex flag A, for per-regex astral mode opt-in

See #28 for related info.

Can be provided via the flags argument or provided inline. Can be combined with any other XRegExp flags. Examples:

// Via flags argument
XRegExp('^\\p{L}+$', 'Am');

// Inline
XRegExp('(?Am)^\\p{L}+$');

XRegExp.escape() whitespace behavior

Using:

> XRegExp.escape("\n");
\

The output is a literal \ followed by the original whitespace character. I'm not sure what the proper behavior for escaping whitespace is and was expecting a literal \n string.
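For comparison, here is a sketch of an escaping function that emits the readable two-character sequences (\n, \t, and so on) for common whitespace and a backslash prefix for everything else. The set of escaped characters is an assumption chosen for illustration, and escapeWithVisibleWhitespace is a hypothetical name, not part of XRegExp.

```javascript
// Readable escapes for common whitespace characters
const visible = { '\n': '\\n', '\r': '\\r', '\t': '\\t', '\f': '\\f', '\v': '\\v' };

// Escape regex metacharacters and whitespace; whitespace with a well-known
// escape becomes that escape instead of a backslash plus the raw character.
function escapeWithVisibleWhitespace(str) {
  return str.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, (ch) => visible[ch] || '\\' + ch);
}

console.log(escapeWithVisibleWhitespace('a.b\n'));
```

Both forms match the same character when used in a pattern; the difference is only in how the escaped string reads.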

wrong test result with Firefox Aurora 30.0a2 (2014-04-08)

I stripped my problem down to this short test.html:

<!DOCTYPE HTML>
<html>
  <body>
    <script type="text/javascript" src="xregexp-all.js"></script> 
    <script>
      XRegExp.addToken( /é/, function () {return "[eé]"} ); 
      alert(XRegExp.build("élé",null,"").test("élé"));// true
      alert(XRegExp.build(" é",null,"").test(" é"));// true
      alert(XRegExp.build(" élé",null,"").test(" élé"));// false
    </script>
  </body>
</html>

The third regexp tests false on Firefox Aurora 30.0a2, while it returns true as expected on Firefox 26.

Using named group object is difficult

I need to loop over the named captures after executing a regex, and this seems to be pretty problematic. Why are you augmenting the array with properties? Shouldn't you provide a way to access a clean object of named captures?

Given the regex "(?<type>[a-z]+)/(?<id>\\d+)/(?<tab>[a-z]*)?", I'd like to get this object (after exec):

{ 'type' : 'bar', 'id' : 666, 'tab' : 'ifany' }

That would be trivial to loop over. Right now I can't loop over the properties normally because there is crap preceding them (and even worse, if XRegExp is updated I might get new crap to skip over in the loop).
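With named captures placed on a separate groups object (the ES2018 layout that XRegExp 5 follows), a clean object is straightforward to obtain. The sketch below uses a native regex for illustration; namedCaptures is a hypothetical helper, and note that captured values are strings, so id comes back as '666' rather than a number.

```javascript
// Copy the named captures of a match into a plain object.
function namedCaptures(match) {
  return match && match.groups ? { ...match.groups } : {};
}

const route = /(?<type>[a-z]+)\/(?<id>\d+)\/(?<tab>[a-z]*)?/;
const captures = namedCaptures(route.exec('bar/666/ifany'));

// Only the named captures are enumerated
for (const [name, value] of Object.entries(captures)) {
  console.log(name, value);
}
```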
