tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard

Home Page: http://tc39.es/proposal-regex-escaping/

License: Creative Commons Zero v1.0 Universal

JavaScript 100.00%

proposal-regex-escaping's Introduction

RegExp Escaping Proposal

This ECMAScript proposal seeks to investigate the problem area of escaping a string for use inside a Regular Expression.

Formal specification

Champions:

Status

This proposal is a stage 2 proposal and is awaiting implementation and more input. Please see the issues to get involved.

Motivation

We often want to build a regular expression out of a string without treating special characters from the string as special regular expression tokens. For example, if we want to replace all occurrences of the string let text = "Hello." which we got from the user, we might be tempted to do ourLongText.replace(new RegExp(text, "g")). However, this would match . against any character rather than matching it against a dot.
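
The pitfall described above can be reproduced in a few lines. This is only an illustrative sketch; the sample text and variable names are made up for the example.

```javascript
const ourLongText = "Hello. Hello! HelloX";
const text = "Hello."; // pretend this came from the user

// Unescaped: "." acts as a wildcard, so "Hello.", "Hello!" and "HelloX" all match.
const naiveMatches = ourLongText.match(new RegExp(text, "g"));
console.log(naiveMatches.length);

// Literal occurrences of the user's string, counted without a RegExp.
const literalCount = ourLongText.split(text).length - 1;
console.log(literalCount);
```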

This is commonly-desired functionality, as can be seen from this years-old es-discuss thread. Standardizing it would be very useful to developers, and avoid subpar implementations they might create that could miss edge cases.

Chosen solutions:

`RegExp.escape` function

This would be a RegExp.escape static function, such that strings can be escaped in order to be used inside regular expressions:

const str = prompt("Please enter a string");
const escaped = RegExp.escape(str);
const re = new RegExp(escaped, 'g'); // handles reg exp special tokens with the replacement.
console.log(ourLongText.replace(re));

Note the double backslashes in the example string contents, which render as a single backslash.

RegExp.escape("The Quick Brown Fox"); // "The\\ Quick\\ Brown\\ Fox"
RegExp.escape("Buy it. use it. break it. fix it.") // "Buy\\ it\\.\\ use\\ it\\.\\ break\\ it\\.\\ fix\\ it\\."
RegExp.escape("(*.*)"); // "\\(\\*\\.\\*\\)"
RegExp.escape("｡^･ｪ･^｡") // "｡\\^･ｪ･\\^｡"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊\\ \\*_\\*\\ \\+_\\+\\ \\.\\.\\.\\ 👍"
RegExp.escape("\\d \\D (?:)"); // "\\\\d\\ \\\\D\\ \\(\\?\\:\\)"

Cross-cutting concerns

Per https://gist.github.com/bakkot/5a22c8c13ce269f6da46c7f7e56d3c3f, we now escape anything that could possibly cause a “context escape”.

This would be a commitment to only entering/exiting new contexts using whitespace or ASCII punctuators. That seems like it will not be a significant impediment to language evolution.
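
To make the "whitespace or ASCII punctuators" idea concrete, here is a rough userland sketch. It is not the proposal's algorithm: the helper name escapeContextChars is hypothetical, and it uses hex escapes (\xHH, or \uHHHH for non-ASCII whitespace) instead of the backslash identity escapes shown in the README examples, because hex escapes already parse in every flag mode without grammar changes. Like the README examples, it leaves "_" alone.

```javascript
// Hypothetical helper, for illustration only: hex-escape whitespace and
// ASCII punctuators (except "_") so they cannot open or close a context.
function escapeContextChars(input) {
  return String(input).replace(/[\s!-\/:-@\[-\^`{-~]/g, (ch) => {
    const code = ch.charCodeAt(0);
    return code <= 0xff
      ? "\\x" + code.toString(16).padStart(2, "0")
      : "\\u" + code.toString(16).padStart(4, "0");
  });
}

const input = "a-z] (*.*)";
const escaped = escapeContextChars(input);

// The escaped text matches the input literally, with and without the "u" flag.
const plain = new RegExp("^" + escaped + "$");
const unicode = new RegExp("^" + escaped + "$", "u");
console.log(plain.test(input), unicode.test(input));

// Inside a character class, "-" and "]" stay literal instead of forming a
// range or closing the class.
const cls = new RegExp("^[" + escapeContextChars("a-z]") + "]+$");
console.log(cls.test("-]az"), cls.test("b"));
```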

In other languages

Note that the languages differ in what they do (e.g. Perl does something different from C#), but they all have the same goal.

We've had a meeting about this subject, whose notes include a more detailed writeup of what other languages do, and the pros and cons thereof.

FAQ

Why is each escaped character escaped?

See https://gist.github.com/bakkot/5a22c8c13ce269f6da46c7f7e56d3c3f.
How is Unicode handled?

This proposal deals with code points and not code units, so Unicode input is handled and further extensions remain possible.
What about RegExp.unescape?

While some other languages provide an unescape method we choose to defer discussion about it to a later point, mainly because no evidence of people asking for it has been found (while RegExp.escape is commonly asked for).
How does this relate to the EscapeRegExpPattern AO?

EscapeRegExpPattern (as the name implies) takes a pattern and escapes it so that it can be represented as a string. What RegExp.escape does is take a string and escapes it so it can be literally represented as a pattern. The two do not need to share an escaped set and we can't use one for the other. We're discussing renaming EscapeRegExpPattern in the spec in the future to avoid confusion for readers.
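
The two directions can be seen side by side in a small sketch. The source and String(...) results below are standard engine behavior; escapeForRegExp is a hypothetical userland stand-in for RegExp.escape, not the proposed algorithm.

```javascript
// Pattern -> string: the engine escapes "/" when producing `source`, so the
// result can be written back out as a regular expression literal.
const re = new RegExp("a/b");
console.log(re.source);  // a\/b
console.log(String(re)); // /a\/b/

// String -> pattern: a hypothetical userland escaper that makes arbitrary
// text match literally when used as a pattern.
function escapeForRegExp(s) {
  return String(s).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}
const literal = new RegExp("^" + escapeForRegExp("a.b") + "$");
console.log(literal.test("a.b"), literal.test("aXb")); // true false
```
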
Why not RegExp.tag or another tagged template based proposal?

When this proposal was first presented, an edge case was brought up and tagged templates were suggested as an alternative. We believe a simple function is a much better and simpler alternative to tagged templates here:
- Users have consistently been asking for RegExp.escape over the past 5 years - both in this repo and elsewhere. Packages providing this functionality are very popular (see escape-string-regexp and escape-regexp). For comparison there are no downloads and virtually zero issues or interest when I initiated work on a tag proposal.
- When interviewing users regarding RegExp.tag when trying to get motivating use cases for the API - users spoken with were very confused because of the tagged templates. The feedback was negative enough and they found the API confusing and awkward enough for me to stop pursuing it.
- Virtually every other programming language offers .escape (see "in other languages") and made the trade-off to ship .escape even though most of these could have shipped a tagged template API (equivalent, per language).
- This proposal does not block effort on a tag proposal, the two proposals are not mutually exclusive and both APIs can eventually land. See this issue for discussion.
Why don't you do X?

If you believe there is a concern that was not addressed yet, please open an issue.

proposal-regex-escaping's Issues

Control Character Escapes

Checking interest in escaping the whole A-Za-z range at the start of escaped strings in order to support ControlCharacter escapes:

> new RegExp('\\cJ').test('\n') // true
> new RegExp("\\c" + RegExp.escape('J')); // matches "\n" but not the string "\cJ"

Are we interested in these escaped? Personally I never even knew these were a thing before, let alone in scenarios where .escape would be used. I definitely see the appeal for safety though.

Summoning @mathiasbynens @anba who are knowledgeable on the topic, @bergus and @nikic who led hex escapes and @allenwb @cscott and @domenic for the spec's PoV on the subject.

Complement with an instance method (RegExp.prototype.escape)

(Sorry if this is written like an essay; I developed tunnel-vision halfway through writing it…)

Overview

Authors may need to escape a string for piecewise construction. RegExp.escape(…).source is insufficient, because input may not necessarily be a complete, syntactically valid regular expression. Ergo, I suggest providing an instance method that returns an escaped string following the same logic as RegExp.escape:

/regex/.escape(".") === "\\.";

Rationale

The reason I suggest adding an instance method (as opposed to another class method) is so authors can fine-tune how/where characters are escaped (possibly influenced by a well-known @@escape symbol, à la @@replace).

The definition of RegExp.prototype.escape is more-or-less along the lines of:

RegExp.escape = function(input){
	return new RegExp(this.prototype.escape(...arguments));
};

RegExp.prototype.escape = function(input){
	if(this && "function" === typeof this[Symbol.escape])
		return this[Symbol.escape](...arguments);
	return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
};

Motivation

Subclasses of RegExp may have different expectations about what characters need escaping (and where). A realistic example is a third-party regular expression library imported as a set of functions, which are wrapped inside a subclass for more idiomatic (object-oriented) use.

Some actual code might make this clearer…

Example 1: Oniguruma

Oniguruma uses &&[…] to denote an intersection range within a character class, meaning that [a-z&&[aeiou]] has two different interpretations depending on the engine that's parsing it.

class OnigurumaExpr extends RegExp {
	escape(input){
		input = RegExp.prototype.escape(input);
		return input.replaceAll("&&", "\\&&");
	}
}

/** Return true if input contains an alphabetic character. */
function hasAlphaChars(input, additionalLetters = ""){
	return new OnigurumaExpr(`[A-Z${
		OnigurumaExpr.escape(additionalLetters)
	}a-z]+`).test(input);
}

hasAlphaChars("Café", "éñøüğȟ");   // Harmless
hasAlphaChars("Café", "&&[^a-z]"); // Problematic

Example 2: Basic POSIX regular expressions (BREs)

In legacy POSIX syntax, \(…\) and \{…\} have opposite meanings to (…) and {…}, respectively.

class BRE extends RegExp {
	[Symbol.escape](input){
		return input.replace(/\\[({})\\1-9]/g, "\\$&");
	}
}
BRE.prototype.escape("\\(A\\)-(Z)+?") === String.raw `\\(A\\)-(Z)+?`;

Readability vs Context-Sensitive Validity

Currently I see two opposites:

Minimal and readable output - We provide an implementation that only escapes what it must via RegExp.escape to support passing in a context-free way to the RegExp constructor. This means dropping ] and } from the escaped set and telling people that they should be aware of context. This would provide a very readable output.
Maximal and safe - We provide an implementation that deals with context and is context sensitive, we additionally cater for eval cases and returns of regular expressions from the server. This would provide a much less readable but safer RegExp. Doing this would require escaping numerics to literals to avoid capturing groups as @nikic points out, escaping / to allow eval** as @allenwb has suggested and escaping capturing group identifiers : and !.

It sounds like more people tend to prefer the latter. I'm not convinced, because there is no usage data indicating that the problems it solves happen in real code (data indicates otherwise), but a single counter-example would go a long way to convince me that this is a problem we need to address. I'm very tempted by the safety guarantees it provides for context sensitivity.

Escaping everything was ruled out as every single system that did it moved away from it.

** but not whitespace as there is no "ignore whitespace mode" in JS.

"Let" is sometimes not capitalized

It should be

Ensure the result works with the u flag

I recently enabled the ESLint rule which encourages always using the u flag. When doing so, I found out that my custom regexp escaper, which was

function escapeRegExp(str) {
  return str.replace(/[-[\]/{}()*+?.\\^$|]/ug, "\\$&");
}

was overzealous, and would cause new RegExp(escapeRegExp(input), "ug") to fail when the input string contained a -.

This probably has some intersection with discussions in other threads, e.g. if some delegates require that the result escape - so that it can work in situations like new RegExp("[" + RegExp.escape(input) + "]", "ug"), such a requirement prohibits the result from working in situations like new RegExp(RegExp.escape(input), "ug").
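
The conflict can be checked directly, without any escaper involved. In this sketch the helper name compiles is made up for the example; the results rely on the "u" flag rejecting identity escapes of non-syntax characters outside character classes, and on the lenient non-"u" parsing that web-compatible engines implement.

```javascript
// Hypothetical helper: report whether a pattern compiles under the given flags.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(compiles("\\-", ""));    // true: "\-" is tolerated without "u"
console.log(compiles("\\-", "u"));   // false: SyntaxError at the top level
console.log(compiles("[\\-]", "u")); // true: "\-" is allowed inside a class
```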

Regarding EscapeRegExpPattern.md

In ES6 EscapeRegExpPattern is defined in 21.2.3.2.4. In ES5.1 it wasn't a named abstract operation but its semantics for escaping and setting the source property was specified in 15.10.4.1.

In both ES5.1 15.10.6.4 and ES6 21.2.5.14 RegExp.prototype.toString is specified to use the value of the source property. Both the ES5.1 and ES6 specs include a note stating that the value returned should be in the form of a RegularExpressionLiteral that would evaluate to a RegExp object that would have the same matching behavior as the original object.

Escape `-`?

Currently the proposal escapes ] and } which isn't really required since we escape [ and {. The only reason we might need to escape ] is because we need to support the case that the pattern is inserted inside the [.

A preliminary GH code search shows that this pattern is actually used. In particular, underscore's string model does "[" + escapedRegExp + "]".

So, in my opinion we should support escaping inside matched sets to be on the safe side. This is somewhat wasteful (because there are other characters that do not need to be escaped in sets), but the pattern is very common.

Opinions?

Alternative solutions

If anyone has any alternatives to this solution that they think should be considered instead - this is the place to talk about it.

`/u` flag vs. `/v` flag

There is no way of escaping that works in all scenarios:

	`[<]`	`[\<]`
Neither `/u` nor `/v`	✅	✅
`/u`	✅	❌
`/v`	❌	✅

What’s the best approach then? Only support /v?

More information: tc39/proposal-regexp-v-flag#71

RegExp.escape escaping SyntaxCharacter alone is insufficient

If the idea for RegExp.escape is to allow injection in any context, - needs to be escaped in character class context. - is not part of SyntaxCharacter. This is just the first character I thought of needing escaping, and it wasn't escaped, so there's probably others.

Consider leading flags instead of `/${string}/flags` for RegExp.tag

The RegExp modifiers proposal originally included an unbounded (?ims-ims) operator, but that was recently dropped from the proposal. Some RegExp engines support a version of (?ims-ims) that can only appear at the start of a regular expression and applies to the entire pattern.

I've been considering re-introducing this prefix-only form in a follow-on proposal, and it could be helpful here as well:

RegExp.tag`(?ux)
  # allows x-mode comments
  \u{000a} # and unicode-mode code points
`;

const pattern = "(?i)test";
const re = new RegExp(pattern);
re.ignoreCase; // true
re.test("TeST"); // true

The prefix-flags form could theoretically allow all RegExp flags, not just the restricted subset in the Modifiers proposal. It would also remove the necessity for RegExp.tag to require leading and trailing / characters, and could even improve composability:

RegExp.tag`(?x)
  # escaped, case sensitive
  ${string} 
  
  # nested RegExp is *not* escaped. Supported flags are preserved. Unsupported flags are ignored.
  ${caseSensitive ? /Z/ : /Z/i} 
`;

In this case, flags from nested RegExp objects such as i, m, and s could be preserved in the resulting RegExp using a modified group (i.e., (?-i:Z) or (?i:Z) based on the condition above).

Incorrect example usage

I'm talking about this one:

RegExp.escape("\d \D (?:)"); // "\\d \\D \(\?\:\)"

The string "\d \D (?:)" has 8 characters and no backslashes. Isn't this supposed to be either "\\d \\D (?:)" or String.raw`\d \D (?:)` ?

Perhaps other examples can also use some help in this manner:

RegExp.escape("Buy it. use it. break it. fix it.") // "Buy it\\. use it\\. break it\\. fix it\\."
RegExp.escape("(*.*)"); // "\\(\\*\\.\\*\\)"
RegExp.escape("｡^･ｪ･^｡") // "｡\\^･ｪ･\\^｡"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊 \\*_\\* \\+_\\+ \\.\\.\\. 👍"
RegExp.escape("\\d \\D (?:)"); // "\\\\d \\\\D \\(\\?\\:\\)"

I agree, multiple backslashes are indeed horrible, so something like this would also suffice:

console.log(RegExp.escape("Buy it. use it. break it. fix it."))  // Buy it\. use it\. break it\. fix it\.
console.log(RegExp.escape("(*.*)"))  // \(\*\.\*\)
console.log(RegExp.escape("｡^･ｪ･^｡"))  // ｡\^･ｪ･\^｡
console.log(RegExp.escape("😊 *_* +_+ ... 👍"))  // 😊 \*_\* \+_\+ \.\.\. 👍
console.log(RegExp.escape(String.raw`\d \D (?:)`))  // \\d \\D \(\?\:\)

But again, maybe I'm just being too pedantic.

Make the algorithm non-destructive

Instead of appending characters to cuList, can't you create a new list? Basically, make the algorithm non-destructive at each iteration.

I guess it's more so that the algorithm can be recursive/more elegant [citation needed]. I'm not sure whether it's an implementation detail, nor how the spec should be written.

Do NUL bytes have to be escaped?

Some regex flavors (like PCRE) truncate on NUL bytes, so languages using those also escape NUL bytes to \000 or similar. Are JS regexes guaranteed to be binary safe, thus making this unnecessary?

ChatGPT says this exists in JavaScript (aka hurry up)

ChatGPT just told me to use RegExp.escape() but it was throwing error in my console. ChatGPT said it was added in ECMAScript 2019 which I found no evidence for, so I asked ChatGPT and it apologized and said it was actually added in ECMAScript 2021.

Upon Googling, I found this repo, so is it stagnant?

This functionality appears to stem from Ruby, or probably other languages too, and it would be the most efficient solution for me to escape while doing new RegExp(searchTerms, 'i').test(value) when searchTerms is | or \.

ChatGPT gave me a custom solution that looks reasonable:

const escapedTerms = searchTerms.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const regex = new RegExp(escapedTerms, 'i');
const isMatch = regex.test(value);

In my current application, I found this third party dependency (Oruga UI) that is using a custom solution that is the same but oddly different:

/**
 * Escape regex characters
 * http://stackoverflow.com/a/6969486
 */
export function escapeRegExpChars(value) {
    if (!value) return value;
    return value.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, '\\$&'); // eslint-disable-line no-useless-escape
}

After probing ChatGPT further, an issue there seems to stem from its knowledge cutoff of 2021, so my final takeaway is that this proposal looks simple but possibly stagnant and should be kicked.

Given that it involves a final code as simple as ''.replace(/[\-\[\]\/\{\}\*\+\?\.\\\^\$\|]/g, '\\$&'), I would say hurry up and deploy it, and if we're too scared, maybe start with an options object that features include and exclude to opt-in characters or opt-out characters beyond the reasonable default.

I want to have the following code:

new RegExp(RegExp.escape(searchTerms), 'i').test(value)

Does / need to be escaped?

/ is not a character matched by the RegExp grammar SyntaxCharacter production. However, it seems like it might be a character that should be escaped. Particularly, if a string is being converted to a RegExp literal:

let str = "3/4/1972";
let rx = eval("/"+RegExp.escape(str)+"/");
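
A quick check of what happens with and without escaping the slash. This is a sketch only: it does not call RegExp.escape, and the inline replace that escapes "/" is an assumption standing in for whatever the final escaped set would do.

```javascript
const date = "3/4/1972";

// Unescaped: the first "/" inside the string ends the literal early, and the
// remaining text does not parse, so eval throws.
let unescapedParses = true;
try {
  (0, eval)("/" + date + "/");
} catch (e) {
  unescapedParses = false;
}
console.log(unescapedParses); // false

// With "/" escaped (done inline here, as an assumption), the literal parses
// and matches the original string.
const slashEscaped = date.replace(/\//g, "\\/");
const rxFromLiteral = (0, eval)("/" + slashEscaped + "/");
console.log(rxFromLiteral.test("born on 3/4/1972")); // true
```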

Prepare me for presenting this to TC39

I'd like a list of what should be presented to TC39. Maybe even a quick slid.es deck or something. Here's my initial set of questions that I have:

What issues need TC39's input, and could feasibly be decided in a final manner at the meeting? Versus, which are still open for general discussion?
Is the API final, or is it going to grow some second parameter?
What stage do you think is appropriate to ask for? 0, 1, or maybe 2?
Please do #31

Need better example

Current example in README is:

text.replace(new RegExp(RegExp.escape(str), "g"), newSubstr)

But because we now have replaceAll, it could be simply ourLongText.replaceAll(str, newSubstr) now.
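
For comparison, here is a sketch of the two forms next to a case that replaceAll with a plain string does not cover. The case-insensitive scenario is my own suggestion for a candidate example, and escapeForRegExp is a hypothetical userland stand-in for RegExp.escape.

```javascript
// Hypothetical stand-in for RegExp.escape.
function escapeForRegExp(s) {
  return String(s).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

const prices = "1.5 and 2.5";

// Plain literal replacement: both forms give the same result.
const viaRegExp = prices.replace(new RegExp(escapeForRegExp("."), "g"), ",");
const viaReplaceAll = prices.replaceAll(".", ",");
console.log(viaRegExp === viaReplaceAll); // true

// Case-insensitive literal replacement still needs a RegExp, and therefore
// still needs the user's string to be escaped.
const titles = "Mr. Smith met MR. Jones";
const caseInsensitive = titles.replace(
  new RegExp(escapeForRegExp("mr."), "gi"),
  "Dr."
);
console.log(caseInsensitive); // Dr. Smith met Dr. Jones
```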

LICENSE file

What would be the correct way to license this repository? I consider this a free-no-strings-attached contribution to the language and would not want the license to ever be a problem in that regard.

Unescape?

A lot of languages also provide a second unescape method like C#.

Should this be considered?
If so, as a part of this proposal?

Present the proposal again?

This proposal got rejected five and a half years ago based on concerns that, from the outside, seem fairly hard to understand. Perhaps it would lead to the same result, but I feel like it's at least worth discussing it again.

I know there's at least one question I'd like to ask of people who previously rejected it: are your concerns worth five years of incorrect manual escaping in users' code? Given that this feature borders on security, similar to SQL injections, it's not a rhetorical question: it seems important for this to be answered.

Identification of potential “cross-cutting” concerns

As far as I'm aware, escape does not interfere with any other APIs, and it uses the list of identifiers from the spec rather than defining its own, so that the two stay in sync. If anyone is able to identify interference this would be a good place to report it.

Escape (whitespace) control characters

Improves readability
Avoids issues with eval (and many other functions based on common assumptions)

Specifically, I'm looking at linebreaks:

> /\n/g.toString()
"/\n/"
> new RegExp("\\n").toString()
"/\n/"
> new RegExp("\n").toString()
"/
/"

I'd love RegExp.escape("\n") to yield "\\n", not the linebreak "\n". Same for all other control character code units like \r and \t. The algorithm might be based on JSON string escaping, though not using the short form for backspace (\b).

Motivation for the current set of escaped characters

The sets of escaped characters in the spec draft and in the motivation section differ significantly. It may make sense to explain the current choice and update the docs.

Alternate proposal: RegExp.fromString

Instead of new RegExp(RegExp.escape(s)), you do RegExp.fromString(s).

Pros:

Shorter
Escaping is simpler since we don't worry about cases like new RegExp(s1 + RegExp.escape(s2) + s3)

Cons:

Maybe doesn't meet some important use cases?
Do other languages do this? They all do escape...
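
A rough sketch of what such a function could look like in userland. The name regExpFromString, the optional flags parameter, and the escaped set are all assumptions made for illustration; nothing here is specified anywhere.

```javascript
// Hypothetical userland version of the suggested RegExp.fromString:
// the whole input is treated as literal text, so no concatenation
// context has to be considered.
function regExpFromString(s, flags = "") {
  const source = String(s).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
  return new RegExp(source, flags);
}

const sum = regExpFromString("1+1=2");
console.log(sum.test("so 1+1=2, obviously")); // true
console.log(sum.test("11=2"));                // false: "+" is not a quantifier here
```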

Returning escaped input as a string

It should be possible to escape metacharacters without instantiating a RegExp. Writing RegExp.escape(…).source isn't sufficient, because it assumes the result is a well-formed ECMAScript regular expression. Users wanting to construct regexes piecemeal, or those forced to use string values to accommodate foreign regex syntaxes (e.g., Oniguruma/TextMate-compatible grammars), are left with two partial workarounds:

Do something hacky like [...string].map(char => RegExp.escape(char).source).join("")
Write a helper function that reimplements logic defined by the ECMAScript standard, and keep it in-sync with future revisions

Nobody wants to resort to either of those things, but there doesn't appear to be a better way (that won't throw a SyntaxError, that is).

(I originally brought this up in #50, but probably put too much emphasis on the need to escape individual characters, as opposed to having more control over the escaped result. I've rewritten this to be clearer)

/cc @ljharb

Interaction with backreferences / variable-width escape sequences

Using something like new RegExp("(foo)\\1" + RegExp.escape(input)), if input were to start with a number, this would extend the backreference \1 to something like \11. Does this need to be accounted for?
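
The effect is easy to reproduce with a literal digit in place of RegExp.escape(input). Two assumptions in this sketch: the tab match relies on the legacy (Annex B) reading of \11 as an octal escape in non-"u" patterns, which web-compatible engines implement, and writing the leading digit as \x31 is just one possible mitigation shown for contrast.

```javascript
const input = "1";

// Plain concatenation yields the pattern (foo)\11, so the backreference \1
// is no longer followed by a literal "1".
const naive = new RegExp("(foo)\\1" + input);
console.log(naive.test("foofoo1")); // false
console.log(naive.test("foo\t"));   // true: \11 is read as octal 11, a tab

// If the leading digit is emitted as a hex escape, \1 stays intact.
const digitEscaped = new RegExp("(foo)\\1" + "\\x31");
console.log(digitEscaped.test("foofoo1")); // true
```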

editorial: _escapedList_ mixes code units and code points

It'd probably be best to create a string through repeated string-concatenations and call UTF16EncodeCodePoint directly.

Advance to stage 2

From the tc39 process:

Step 1 Criteria
Initial spec text

Available at http://benjamingr.github.io/RexExp.escape/ and at the readme file.

tc39 approval of the advancement of the spec to the next level.

Why not just escape every character?

Is there any reason to only escape a specific subset? It's harmless to add slashes, right?

change escaping to hex escape sequences

There's no need to add complexity of single-character identity escapes for every ASCII punctuator. I would prefer escaping using hex escape sequences instead, as discussed in #58. The only argument given against this is that you'd have to copy-paste any RegExp constructed using this function into a RegExp explainer to understand it, but let's be honest, you were going to have to do that anyway. @sophiebits also points out that by not modifying the grammar, we allow this feature to be polyfilled in older browsers.

Discussion of key algorithms, abstractions and semantics

If anyone has anything to say about the semantics of the algorithm used here, the algorithm itself or anything else - please do.

Results of TC39 presentation

I'm sorry to say that the committee declined to accept this proposal as-is. In the end, the concern (largely driven by @erights, although others were sympathetic) was that escaping cannot be done in a way that is totally safe in all cases, even with the extended safe set. For example,

new RegExp("\\" + RegExp.escape("w"))

is a hazard. (It does not matter that "\\" by itself is not a valid regex fragment. The above does not error to indicate that; it just silently creates a bug.)

Note that even if you attempted to correct this by escaping all initial characters, you then have

new RegExp("\\\\" + RegExp.escape("w"))

as a bug. @erights called this the "even-odd problem."
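
Both halves of the even-odd problem can be shown with plain strings. In this sketch, escapedW stands for what a hypothetical "always backslash-escape the first character" rule would return for "w"; it is not the output of any real implementation.

```javascript
// Odd number of preceding backslashes: the trailing "\" swallows the "w".
const oddCase = new RegExp("\\" + "w"); // pattern: \w
console.log(oddCase.test("x")); // true: matches any word character

// Hypothetical fix: always emit a backslash before the first character.
const escapedW = "\\w";

// That repairs the odd case (pattern \\w: a literal backslash, then "w")...
const oddFixed = new RegExp("\\" + escapedW);
console.log(oddFixed.test("\\w"), oddFixed.test("\\x")); // true false

// ...but breaks the even case (pattern \\\w: a literal backslash, then \w).
const evenCase = new RegExp("\\\\" + escapedW);
console.log(evenCase.test("\\x")); // true, although a literal "w" was intended
```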

The general feeling was that to be completely safe you need a context-dependent join operation. The feeling was then that if author code wants to do unsafe escaping, the function is easy to write, but if something is going to be standardized, it must be completely safe. The idea that other languages are not held to this standard did not convince them, I'm sorry to say.

The committee recognized that you might not be willing to do work on a different, more complicated proposal. But, if you were interested, they think that a template string tag solution would be the best to investigate. Such a solution would be able to use the context of each insertion to do a safe join operation between any two fragments, instead of depending on string concatenation. Template strings can also be twisted to work in dynamic situations (e.g. those that this proposal would cover via new RegExp(pieces.map(RegExp.escape).join("/"))) by directly calling the tag function, probably with an adapter to work through the awkwardness of the parameter form for template tags. So this would be strictly more powerful. This was also preferred (for reasons I don't really remember) to a building-block approach of e.g. RegExp.concat plus RegExp.escape (used as RegExp.concat("\\", RegExp.escape(x))).

I'm pretty disappointed by this, and am sorry you and others sunk so much work into it with such an outcome. But, what can we do.

At what point is it safe to say "The committee is not interested in .escape"?

Hey,

In the past, WHATWG has expressed interest in standardizing RegExp.escape as a utility as part of the web platform. We (Node) want this and I'm happy to push this through WHATWG and other standards bodies I'm involved with (like WinterCG) to other server platforms.

As was suggested to me by committee members, and as a show of good faith, I pushed back on this idea in order to standardize through TC39, as that was the nice thing to do, to the disappointment of users.

That was over 2 years ago.

When would be a good time to say "OK, it looks like the committee is blocked on the solution the community is asking for" and pursue it in another way the platform I help maintain and other platforms can use it?

Consider RegExp template tag instead

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature. This version would solve the user-facing problem more directly, avoiding the need to concatenate RegExps in the result, and can help engines avoid re-parsing (by just parsing the shared part once). If any part of RegExp syntax ends up requiring context-dependent escaping, a template constructor could resolve that in a way that context-independent RegExp.escape cannot.

Such a feature could look like this, to find the value of the variable x followed by the value of y with arbitrary whitespace in between:

RegExp.build`${x}\s+${y}/m`

Here, flags are provided after a / within the RegExp (which is of course optional).
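
A minimal userland sketch of such a tag, to make the shape concrete. Everything here is an assumption for illustration: the standalone function build, the escaped set, and the rule that a trailing /flags in the last template chunk is read as flags. It escapes every interpolation the same way and is not context-aware, so it does not deliver the context-dependent escaping mentioned above.

```javascript
// Hypothetical escaper for interpolated values.
function escapeForRegExp(s) {
  return String(s).replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
}

// Hypothetical tag: raw template chunks are pattern text, interpolated
// values are escaped, and a trailing "/flags" in the last chunk sets flags.
function build(strings, ...values) {
  const raw = [...strings.raw];
  const last = raw.length - 1;
  let flags = "";
  const m = /\/([a-z]*)$/.exec(raw[last]);
  if (m) {
    flags = m[1];
    raw[last] = raw[last].slice(0, m.index);
  }
  let source = raw[0];
  values.forEach((value, i) => {
    source += escapeForRegExp(value) + raw[i + 1];
  });
  return new RegExp(source, flags);
}

const x = "a.b";
const y = "(c)";
const built = build`${x}\s+${y}/m`;
console.log(built.source);            // a\.b\s+\(c\)
console.log(built.multiline);         // true
console.log(built.test("a.b   (c)")); // true
console.log(built.test("aXb (c)"));   // false
```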

Explore other solutions

So, the results are in, and it seems they declined RegExp.escape and this is a final decision.

What alternatives do we have? Of course we could just fall back on a user-land implementation (like ljharb/regexp.escape, are there others?) as mentioned, but what if we want to press on?

RegExp.fromString, #33, with regex combinators as RegExp.prototype methods, to provide a fluent interface for regex construction/combination
RegExp.tag, for template strings, avoids double-escaping by accessing the .raw template parts, and solves the problem with trailing backslashes (example)

Both do have a problem with flags though, which are difficult to express. Is there anything else?

cuList vs. cpList confusion

Append c to cpList.

should be cuList.

code points vs code units

You need to be careful about whether you are processing code points or code units. In particular, when you create the result string the elements must be code units, so you need to reference one of the algorithms used by the spec to UTF-16 encode code points.

Path to Stage 4!

Stage 4

Stage 3

committee approval
merge test262 tests
write test262 tests

Stage 2.7

Stage 2

committee approval
spec reviewers selected
spec text written

Stage 1

committee approval

Which leading characters should be escaped?

As noted in #58 (comment) , an unescaped non-digit leading character could still be interpreted as part of an escape sequence spanning concatenated RegExp.escape output.

consider e.g. new RegExp("\\c" + RegExp.escape("J")) in a web browser implementing |ExtendedAtom|, where the result should match "\\cJ" and [unlike /\cJ/] fail to match "\n"

Escaping of first decimal digits in the current form - is this what we expect?

If I understand correctly, with the current spec draft logic, we have:

const string = '40';
const escaped = RegExp.escape(string); // => '\\x40'
const re = new RegExp(escaped); // => /\x40/

re.test('40'); // => false
re.test('@'); // => true

Is this expected behavior?
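
The difference can be checked with hand-written patterns, without relying on any RegExp.escape implementation. The second pattern, which escapes only the first digit as its own two-digit hex escape, is shown as one possible alternative and is an assumption of this sketch.

```javascript
// Reading "40" as a single hex escape gives the code unit 0x40, which is "@".
const wholeHex = new RegExp("\\x40");
console.log(wholeHex.test("40"), wholeHex.test("@")); // false true

// Escaping only the leading "4" as \x34 and keeping "0" literal matches "40".
const firstDigitOnly = new RegExp("\\x34" + "0");
console.log(firstDigitOnly.test("40"), firstDigitOnly.test("@")); // true false
```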

Spec Review

Should we allow implementations to add to the escaped set?

While researching Perl today we've found an interesting issue:

As an interesting point, RegExp grammar can add flags (like /x in the future) but RegExp.escape will not be able to react to these changes.

In both cases - this means that RegExp.escape can't guarantee the length of the returned string between versions of ECMAScript that make extensions to the grammar.

To me, this is a huge deal. I'm wondering - if implementations are allowed to extend the regular expressions grammar - should they be allowed to escape additional characters and we should make it clear that there is no length guarantee on the returned string?

(No other language has a length guarantee but some have a fixed character set which implies a length guarantee)

Create a table of every escaped character and "why"

I've asked for this a couple times, but I think it's crucial to forward progress, especially as people debate additions to the current spec.

Here is a start:

char	reason
^	new RegExp("^") => syntax error; new RegExp("^") => matches the string "^"

Crawl the internet

This is a self assigned issue for me to talk to either Professor Dani Dolev, Professor Sarah Coden or Professor Dror Feitelson regarding acquiring the university resources needed to crawl the top 1M websites to find eval / usages.

Add ECMarkup

Should use http://bterlson.github.io/ecmarkup/ which uses https://github.com/domenic/ecmarkdown to generate HTML.

Would be nice to make a GH page of it too.

Advance to stage 1

From the tc39 process

Identified “champion” who will advance the addition

This would be me, with the much appreciated help of @domenic . Others (Uri Shaked and Elad Kats) have offered help with the process.

Prose outlining the problem or need and the general shape of a solution

Done in the readme.

Illustrative examples of usage

Done in the readme.

High-level API

Done in the readme, including semantics and a polyfill and a spec to JS file. The semantics spec will likely be moved to ECMarkup format.

Discussion of key algorithms, abstractions and semantics

This is a relatively simple proposal - covered in #3 and #4. A template tag is still considered but as it looks from #4 it seems inferior.

Identification of potential “cross-cutting” concerns and implementation challenges/complexity.

Done, and updated the readme file to address these.

Get TC39 to agree that we have advanced to stage 1, after meeting all the above requirements.

EncodeForRegExpEscape should not return results that require particular flags

EncodeForRegExpEscape step 4.e (which would be reached if input c were a Space_Separator supplementary code point in [U+10000, U+10FFFF]) results in a return value like \u{…}. The interpretation of such pattern text is dependent upon regular expression flags—specifically, it is interpreted as a |RegExpUnicodeEscapeSequence| that will match a code point with the contained hexadecimal value in the presence of a "u" or "v" flag, but otherwise is interpreted as either a syntax error or (only in a host supporting Annex B and only when the hexadecimal representation of the code point consists only of decimal digits) as a quantified |ExtendedAtom| "u" with the specified decimal count of repetitions (e.g., /^\u{10000}$/.test("u".repeat(10000)) is true).

Rather than returning results subject to conditional interpretation, EncodeForRegExpEscape should return a \u…\u… surrogate pair |RegExpUnicodeEscapeSequence| for such inputs (which work in both Unicode and non-Unicode regular expressions, e.g. /^\uD834\uDF06$/u.test("𝌆") and /^\uD834\uDF06$/v.test("𝌆") and /^\uD834\uDF06$/.test("𝌆") are all true).

Or alternatively (and preferably IMO), EncodeForRegExpEscape should not escape all white space. I'm not certain why it does so right now, but looking back I suspect it is due to a misinterpretation of #30 (which requests escaping of control characters, and even more specifically line terminators—and even that isn't necessary).
