
proposal-regex-escaping's Introduction

RegExp Escaping Proposal

This ECMAScript proposal seeks to investigate the problem area of escaping a string for use inside a Regular Expression.

Formal specification

Champions:

Status

This proposal is a stage 2 proposal and is awaiting implementation and more input. Please see the issues to get involved.

Motivation

We often want to build a regular expression out of a string without treating special characters in the string as special regular expression tokens. For example, if we want to replace all occurrences of the string let text = "Hello." which we got from the user, we might be tempted to do ourLongText.replace(new RegExp(text, "g")). However, this would match . against any character rather than matching it against a dot.
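A minimal sketch of the pitfall described above. The sample text and the replacement string are made-up values for illustration, and the escaped pattern is written by hand rather than produced by any API:

```javascript
// Hypothetical sample data for illustration.
const text = "Hello.";
const ourLongText = "Hello. Hello! HelloX";

// Unescaped: "." matches any character, so all three greetings are replaced.
const naive = ourLongText.replace(new RegExp(text, "g"), "Bye");

// Hand-escaped: "\\." matches only a literal dot.
const literal = ourLongText.replace(new RegExp("Hello\\.", "g"), "Bye");

console.log(naive);   // "Bye Bye Bye"
console.log(literal); // "Bye Hello! HelloX"
```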

This is commonly-desired functionality, as can be seen from this years-old es-discuss thread. Standardizing it would be very useful to developers, and avoid subpar implementations they might create that could miss edge cases.

Chosen solutions:

RegExp.escape function

This would be a RegExp.escape static function, such that strings can be escaped in order to be used inside regular expressions:

const str = prompt("Please enter a string");
const escaped = RegExp.escape(str);
const re = new RegExp(escaped, 'g'); // handles reg exp special tokens with the replacement.
console.log(ourLongText.replace(re));

Note the double backslashes in the example string contents, which render as a single backslash.

RegExp.escape("The Quick Brown Fox"); // "The\\ Quick\\ Brown\\ Fox"
RegExp.escape("Buy it. use it. break it. fix it.") // "Buy\\ it\\.\\ use\\ it\\.\\ break\\ it\\.\\ fix\\ it\\."
RegExp.escape("(*.*)"); // "\\(\\*\\.\\*\\)"
RegExp.escape("。^・ェ・^。") // "。\\^・ェ・\\^。"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊\\ \\*_\\*\\ \\+_\\+\\ \\.\\.\\.\\ 👍"
RegExp.escape("\\d \\D (?:)"); // "\\\\d\\ \\\\D\\ \\(\\?\\:\\)"

Cross-cutting concerns

Per https://gist.github.com/bakkot/5a22c8c13ce269f6da46c7f7e56d3c3f, we now escape anything that could possibly cause a “context escape”.

This would be a commitment to only entering/exiting new contexts using whitespace or ASCII punctuators. That seems like it will not be a significant impediment to language evolution.

Other solutions considered:

Template tag function

This would be, for example, a template tag function RegExp.tag, used to produce a complete regular expression instead of potentially a piece of one:

const str = prompt("Please enter a string");
const re = RegExp.tag`/${str}/g`;
console.log(ourLongText.replace(re));

In other languages

Note that the languages differ in what they do (e.g. Perl does something different from C#), but they all have the same goal.

We've had a meeting about this subject, whose notes include a more detailed writeup of what other languages do, and the pros and cons thereof.

FAQ

  • Why is each escaped character escaped?

    See [https://gist.github.com/bakkot/5a22c8c13ce269f6da46c7f7e56d3c3f].

  • How is Unicode handled?

    This proposal operates on code points rather than code units, so Unicode input and further extensions are already accounted for.

  • What about RegExp.unescape?

    While some other languages provide an unescape method we choose to defer discussion about it to a later point, mainly because no evidence of people asking for it has been found (while RegExp.escape is commonly asked for).

  • How does this relate to the EscapeRegExpPattern AO?

    EscapeRegExpPattern (as the name implies) takes a pattern and escapes it so that it can be represented as a string. RegExp.escape takes a string and escapes it so that it can be represented literally as a pattern. The two do not need to share an escaped set, and we can't use one for the other. We're discussing renaming EscapeRegExpPattern in the spec in the future to avoid confusion for readers.

  • Why not RegExp.tag or another tagged template based proposal?

    When this proposal was first presented, an edge case was brought up and tagged templates were suggested as an alternative. We believe a simple function is a much better and simpler alternative to tagged templates here:

    • Users have consistently been asking for RegExp.escape over the past 5 years, both in this repo and elsewhere. Packages providing this functionality are very popular (see escape-string-regexp and escape-regexp). For comparison, there were no downloads and virtually zero issues or interest when I initiated work on a tag proposal.
    • When interviewing users regarding RegExp.tag when trying to get motivating use cases for the API - users spoken with were very confused because of the tagged templates. The feedback was negative enough and they found the API confusing and awkward enough for me to stop pursuing it.
    • Virtually every other programming language offers .escape (see "in other languages") and made the trade-off to ship .escape even though most of these could have shipped a tagged template API (equivalent, per language).
    • This proposal does not block effort on a tag proposal, the two proposals are not mutually exclusive and both APIs can eventually land. See this issue for discussion.
  • Why don't you do X?

    If you believe there is a concern that was not addressed yet, please open an issue.

proposal-regex-escaping's People

Contributors

abalam666, bakkot, benjamingr, bergus, brettz9, domenic, eladrkatz, gibson042, ljharb, madarauchiha, mathiasbynens, nikfrank, oliverfoster, poke, qm3ster, thefifthsetpin, urish


proposal-regex-escaping's Issues

Control Character Escapes

Checking interest in escaping the whole A-Za-z range at the start of escaped strings in order to support ControlCharacter escapes:

> new RegExp('\\cJ').test('\n') // true
> new RegExp("\\c" + RegExp.escape('J')); // matches "\n" but not the string "\cJ"

Are we interested in these escaped? Personally I never even knew these were a thing before, let alone in scenarios where .escape would be used. I definitely see the appeal for safety though.
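A small sketch of the concern, written without RegExp.escape so it runs anywhere. The hex escape \x4a for "J" is typed by hand as one possible way an escaper could protect a leading letter; that is an assumption for illustration, not something this thread has settled. The last two results rely on web-compatible (Annex B) parsing of a dangling \c in non-Unicode mode:

```javascript
// \cJ is a control escape for U+000A (line feed).
const control = new RegExp("\\cJ");
console.log(control.test("\n")); // true

// Plain concatenation keeps the control escape alive: "\\c" + "J" is still \cJ.
const joined = new RegExp("\\c" + "J");
console.log(joined.test("\n")); // true

// Assumption: the leading letter is hex-escaped (written by hand as \x4a),
// so it can no longer complete the \c escape.
const guarded = new RegExp("\\c" + "\\x4a");
console.log(guarded.test("\n"));   // false
console.log(guarded.test("\\cJ")); // true: backslash, "c", "J" as literal text
```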

Summoning @mathiasbynens @anba who are knowledgeable on the topic, @bergus and @nikic who led hex escapes and @allenwb @cscott and @domenic for the spec's PoV on the subject.

Complement with an instance method (RegExp.prototype.escape)

(Sorry if this is written like an essay; I developed tunnel-vision halfway through writing it…)

Overview

Authors may need to escape a string for piecewise construction. RegExp.escape(…).source is insufficient, because input may not necessarily be a complete, syntactically valid regular expression. Ergo, I suggest providing an instance method that returns an escaped string following the same logic as RegExp.escape:

/regex/.escape(".") === "\\.";

Rationale

The reason I suggest adding an instance method (as opposed to another class method) is so authors can fine-tune how/where characters are escaped (possibly influenced by a well-known @@escape symbol, à la @@replace).

The definition of RegExp.prototype.escape is more-or-less along the lines of:

RegExp.escape = function(input){
	return new RegExp(this.prototype.escape(...arguments));
};

RegExp.prototype.escape = function(input){
	if(this && "function" === typeof this[Symbol.escape])
		return this[Symbol.escape](...arguments);
	return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
};

Motivation

Subclasses of RegExp may have different expectations about what characters need escaping (and where). A realistic example is a third-party regular expression library imported as a set of functions, which are wrapped inside a subclass for more idiomatic (object-oriented) use.

Some actual code might make this clearer…

Example 1: Oniguruma

Oniguruma uses &&[…] to denote an intersection range within a character class, meaning that [a-z&&[aeiou]] has two different interpretations depending on the engine that's parsing it.

class OnigurumaExpr extends RegExp {
	escape(input){
		input = RegExp.prototype.escape(input);
		return input.replaceAll("&&", "\\&&");
	}
}

/** Return true if input contains an alphabetic character. */
function hasAlphaChars(input, additionalLetters = ""){
	return new OnigurumaExpr(`[A-Z${
		OnigurumaExpr.escape(additionalLetters)
	}a-z]+`).test(input);
}

hasAlphaChars("Café", "éñøüğȟ");   // Harmless
hasAlphaChars("Café", "&&[^a-z]"); // Problematic

Example 2: Basic POSIX regular expressions (BREs)

In legacy POSIX syntax, \(…\) and \{…\} have opposite meanings to (…) and {…}, respectively.

class BRE extends RegExp {
	[Symbol.escape](input){
		return input.replace(/\\[({})\\1-9]/g, "\\$&");
	}
}
BRE.prototype.escape("\\(A\\)-(Z)+?") === String.raw `\\(A\\)-(Z)+?`;

Readability vs Context-Sensitive Validity

Currently I see two opposites:

  • Minimal and readable output - We provide an implementation that only escapes what it must via RegExp.escape to support passing in a context-free way to the RegExp constructor. This means dropping ] and } from the escaped set and telling people that they should be aware of context. This would provide a very readable output.
  • Maximal and safe - We provide an implementation that deals with context and is context sensitive, we additionally cater for eval cases and returns of regular expressions from the server. This would provide a much less readable but safer RegExp. Doing this would require escaping numerics to literals to avoid capturing groups as @nikic points out, escaping / to allow eval** as @allenwb has suggested and escaping capturing group identifiers : and !.

It sounds like more people tend to prefer the latter. I'm not convinced, because we lack usage data indicating that the problems it solves happen in real code (data indicates otherwise), but a single counter-example would go a long way to convince me that this is a problem we need to address. I'm very tempted by the safety guarantees it provides for context sensitivity.

  • Escaping everything was ruled out as every single system that did it moved away from it.

** but not whitespace as there is no "ignore whitespace mode" in JS.

Ensure the result works with the u flag

I recently enabled the ESLint rule which encourages always using the u flag. When doing so, I found out that my custom regexp escaper, which was

function escapeRegExp(str) {
  return str.replace(/[-[\]/{}()*+?.\\^$|]/ug, "\\$&");
}

was overzealous, and would cause new RegExp(escapeRegExp(input), "ug") to fail when the input string contained a -.

This probably has some intersection with discussions in other threads, e.g. if some delegates require that the result escape - so that it can work in situations like new RegExp("[" + RegExp.escape(input) + "]", "ug"), such a requirement prohibits the result from working in situations like new RegExp(RegExp.escape(input), "ug").
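The tension can be reproduced without RegExp.escape. In this sketch, compiles is a hypothetical helper name and the escaped sources are written by hand:

```javascript
// Hypothetical helper: true when the pattern source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(compiles("a\\-b", "g"));    // true: identity escapes are lenient without "u"
console.log(compiles("a\\-b", "ug"));   // false: "\-" is rejected outside a character class
console.log(compiles("[a\\-b]", "ug")); // true: "\-" is allowed inside a character class
```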

Regarding EscapeRegExpPattern.md

In ES6 EscapeRegExpPattern is defined in 21.2.3.2.4. In ES5.1 it wasn't a named abstract operation but its semantics for escaping and setting the source property was specified in 15.10.4.1.

In both ES5.1 15.10.6.4 and ES6 21.2.5.14, RegExp.prototype.toString is specified to use the value of the source property. Both the ES5.1 and ES6 specs include a note stating that the value returned should be in the form of a RegularExpressionLiteral that would evaluate to a RegExp object that would have the same matching behavior as the original object.

Escape `-`?

Currently the proposal escapes ] and } which isn't really required since we escape [ and {. The only reason we might need to escape ] is because we need to support the case that the pattern is inserted inside the [.

A preliminary GH code search shows that this pattern is actually used. In particular, underscore's string model does "[" + escapedRegExp + "]".

So, in my opinion we should support escaping inside matched sets to be on the safe side. This is somewhat wasteful (because there are other characters that do not need to be escaped in sets), but the pattern is very common.

Opinions?

Alternative solutions

If anyone has any alternative to these solutions that they think should be considered instead, this is the place to talk about it.

RegExp.escape escaping SyntaxCharacter alone is insufficient

If the idea for RegExp.escape is to allow injection in any context, - needs to be escaped in character class context. - is not part of SyntaxCharacter. This is just the first character I thought of needing escaping, and it wasn't escaped, so there's probably others.
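To illustrate the character class case, here is a sketch with a hand-escaped input; the value "a-z" is a made-up example:

```javascript
const input = "a-z"; // meant as the three literal characters "a", "-" and "z"

// Inserted as-is, "-" forms a range, so any lowercase ASCII letter matches.
const unescaped = new RegExp("[" + input + "]");
console.log(unescaped.test("m")); // true

// With "-" escaped by hand, only "a", "-" and "z" match.
const escaped = new RegExp("[" + "a\\-z" + "]");
console.log(escaped.test("m")); // false
console.log(escaped.test("-")); // true
```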

Consider leading flags instead of `/${string}/flags` for RegExp.tag

The RegExp modifiers proposal originally included an unbounded (?ims-ims) operator, but that was recently dropped from the proposal. Some RegExp engines support a version of (?ims-ims) that can only appear at the start of a regular expression and applies to the entire pattern.

I've been considering re-introducing this prefix-only form in a follow-on proposal, and it could be helpful here as well:

RegExp.tag`(?ux)
  # allows x-mode comments
  \u{000a} # and unicode-mode code points
`;

const pattern = "(?i)test";
const re = new RegExp(pattern);
re.ignoreCase; // true
re.test("TeST"); // true

The prefix-flags form could theoretically allow all RegExp flags, not just the restricted subset in the Modifiers proposal. It would also remove the necessity for RegExp.tag to require leading and trailing / characters, and could even improve composability:

RegExp.tag`(?x)
  # escaped, case sensitive
  ${string} 
  
  # nested RegExp is *not* escaped. Supported flags are preserved. Unsupported flags are ignored.
  ${caseSensitive ? /Z/ : /Z/i} 
`;

In this case, flags from nested RegExp objects such as i, m, and s could be preserved in the resulting RegExp using a modified group (i.e., (?-i:Z) or (?i:Z) based on the condition above).

Incorrect example usage

I'm talking about this one:

RegExp.escape("\d \D (?:)"); // "\\d \\D \(\?\:\)"

The string "\d \D (?:)" has 8 characters and no backslashes. Isn't this supposed to be either "\\d \\D (?:)" or String.raw`\d \D (?:)` ?
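The difference between the cooked and raw forms can be checked directly; the variable names here are only for illustration:

```javascript
const cooked = "\d \D (?:)";        // "\d" and "\D" collapse to "d" and "D"
const raw = String.raw`\d \D (?:)`; // backslashes are preserved

console.log(cooked);                 // d D (?:)
console.log(cooked.length);          // 8
console.log(raw.length);             // 10
console.log(raw === "\\d \\D (?:)"); // true
```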

Perhaps other examples can also use some help in this manner:

RegExp.escape("Buy it. use it. break it. fix it.") // "Buy it\\. use it\\. break it\\. fix it\\."
RegExp.escape("(*.*)"); // "\\(\\*\\.\\*\\)"
RegExp.escape("。^・ェ・^。") // "。\\^・ェ・\\^。"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊 \\*_\\* \\+_\\+ \\.\\.\\. 👍"
RegExp.escape("\\d \\D (?:)"); // "\\\\d \\\\D \\(\\?\\:\\)"

I agree, multiple backslashes are indeed horrible, so something like this would also suffice:

console.log(RegExp.escape("Buy it. use it. break it. fix it."))  // Buy it\. use it\. break it\. fix it\.
console.log(RegExp.escape("(*.*)"))  // \(\*\.\*\)
console.log(RegExp.escape("。^・ェ・^。"))  // 。\^・ェ・\^。
console.log(RegExp.escape("😊 *_* +_+ ... 👍"))  // 😊 \*_\* \+_\+ \.\.\. 👍
console.log(RegExp.escape(String.raw`\d \D (?:)`))  // \\d \\D \(\?\:\)

But again, maybe I'm just being too pedantic.

Make the algorithm non-destructive

Instead of appending characters to cuList, can't you create a new list? Basically, make the algorithm non-destructive at each iteration.

I guess it's more so that the algorithm can be recursive/more elegant (citation needed). I'm not sure whether it's an implementation detail, nor how the spec should be written.

Do NUL bytes have to be escaped?

Some regex flavors (like PCRE) truncate on NUL bytes, so languages using those also escape NUL bytes to \000 or similar. Are JS regexes guaranteed to be binary safe, thus making this unnecessary?

ChatGPT says this exists in JavaScript (aka hurry up)

ChatGPT just told me to use RegExp.escape(), but it was throwing an error in my console. ChatGPT said it was added in ECMAScript 2019, which I found no evidence for, so I asked ChatGPT and it apologized and said it was actually added in ECMAScript 2021.

Upon Googling, I found this repo, so is it stagnant?

This functionality appears to stem from Ruby, or probably other languages too, and it would be the most efficient solution for me to escape while doing new RegExp(searchTerms, 'i').test(value) when searchTerms is | or \.

ChatGPT gave me a custom solution that looks reasonable:

const escapedTerms = searchTerms.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const regex = new RegExp(escapedTerms, 'i');
const isMatch = regex.test(value);

In my current application, I found this third party dependency (Oruga UI) that is using a custom solution that is the same but oddly different:

/**
 * Escape regex characters
 * http://stackoverflow.com/a/6969486
 */
export function escapeRegExpChars(value) {
    if (!value) return value;
    return value.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, '\\$&'); // eslint-disable-line no-useless-escape
}

After probing ChatGPT further, an issue there seems to stem from its knowledge cutoff of 2021, so my final takeaway is that this proposal looks simple but possibly stagnant and should be kicked.

Given that it involves a final code as simple as ''.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, '\\$&'), I would say hurry up and deploy it, and if we're too scared, maybe start with an options object that features include and exclude to opt-in characters or opt-out characters beyond the reasonable default.

I want to have the following code:

new RegExp(RegExp.escape(searchTerms), 'i').test(value)

Does / need to be escaped?

/ is not a character matched by the RegExp grammar SyntaxCharacter production. However, it seems like it might be a character that should be escaped. Particularly, if a string is being converted to a RegExp literal:

let str = "3/4/1972";
let rx = eval("/"+RegExp.escape(str)+"/");
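For comparison, a sketch of what engines already do when serializing a RegExp: the constructor accepts an unescaped /, and the source property escapes it so the text is safe between / delimiters. This avoids eval and only inspects the serialized form:

```javascript
const str = "3/4/1972";
const rx = new RegExp(str);

// The engine escapes "/" when producing source text for a literal.
console.log(rx.source);    // 3\/4\/1972
console.log(String(rx));   // /3\/4\/1972/
console.log(rx.test(str)); // true
```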

Prepare me for presenting this to TC39

I'd like a list of what should be presented to TC39. Maybe even a quick slid.es deck or something. Here's my initial set of questions that I have:

  • What issues need TC39's input, and could feasibly be decided in a final manner at the meeting? Versus, which are still open for general discussion?
  • Is the API final, or is it going to grow some second parameter?
  • What stage do you think is appropriate to ask for? 0, 1, or maybe 2?
  • Please do #31

Need better example

Current example in README is:

text.replace(new RegExp(RegExp.escape(str), "g"), newSubstr)

But because we now have replaceAll, it could be simply ourLongText.replaceAll(str, newSubstr) now.
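A short sketch of the replaceAll form, using made-up values in which the search string contains characters that would be special in a pattern:

```javascript
// Hypothetical sample values for illustration.
const ourLongText = "Price: $5.00 (was $5.00)";
const str = "$5.00"; // "$" and "." would be special in a pattern
const newSubstr = "€4.50";

// A string argument is matched literally, so no escaping is needed.
console.log(ourLongText.replaceAll(str, newSubstr)); // "Price: €4.50 (was €4.50)"
```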

LICENSE file

What would be the correct way to license this repository? I consider this a free-no-strings-attached contribution to the language and would not want the license to ever be a problem in that regard.

Unescape?

A lot of languages also provide a second unescape method like C#.

Should this be considered?
If so, as a part of this proposal?

Present the proposal again?

This proposal got rejected five and a half years ago based on concerns that, from the outside, seem fairly hard to understand. Perhaps it would lead to the same result, but I feel like it's at least worth discussing it again.

I know there's at least one question I'd like to ask of people who previously rejected it: are your concerns worth five years of incorrect manual escaping in users' code? Given that this feature borders on security, similar to SQL injections, it's not a rhetorical question: it seems important for this to be answered.

Identification of potential “cross-cutting” concerns

As far as I'm aware escape does not interfere with any other APIs and uses the list of identifiers from the spec rather than define it to keep it in track. If anyone is able to identify interference this would be a good place to report it.

Escape (whitespace) control characters

  • Improves readability
  • Avoids issues with eval (and many other functions based on common assumptions)

Specifically, I'm looking at linebreaks:

> /\n/g.toString()
"/\n/"
> new RegExp("\\n").toString()
"/\n/"
> new RegExp("\n").toString()
"/
/"

I'd love RegExp.escape("\n") to yield "\\n", not the linebreak "\n". Same for all other control character code units like \r and \t. The algorithm might be based on JSON string escaping, though not using the short form for backspace (\b).

Alternate proposal: RegExp.fromString

Instead of new RegExp(RegExp.escape(s)), you do RegExp.fromString(s).

Pros:

  • Shorter
  • Escaping is simpler since we don't worry about cases like new RegExp(s1 + RegExp.escape(s2) + s3)

Cons:

  • Maybe doesn't meet some important use cases?
  • Do other languages do this? They all do escape...

Returning escaped input as a string

It should be possible to escape metacharacters without instantiating a RegExp. Writing RegExp.escape(…).source isn't sufficient, because it assumes the result is a well-formed ECMAScript regular expression. Users wanting to construct regexes piecemeal, or those forced to use string values to accommodate foreign regex syntaxes (e.g., Oniguruma/TextMate-compatible grammars), are left with two partial workarounds:

  1. Do something hacky like [...string].map(char => RegExp.escape(char).source).join("")
  2. Write a helper function that reimplements logic defined by the ECMAScript standard, and keep it in-sync with future revisions

Nobody wants to resort to either of those things, but there doesn't appear to be a better way (that won't throw a SyntaxError, that is).

(I originally brought this up in #50, but probably put too much emphasis on the need to escape individual characters, as opposed to having more control over the escaped result. I've rewritten this to be clearer)

/cc @ljharb

change escaping to hex escape sequences

There's no need to add complexity of single-character identity escapes for every ASCII punctuator. I would prefer escaping using hex escape sequences instead, as discussed in #58. The only argument given against this is that you'd have to copy-paste any RegExp constructed using this function into a RegExp explainer to understand it, but let's be honest, you were going to have to do that anyway. @sophiebits also points out that by not modifying the grammar, we allow this feature to be polyfilled in older browsers.

Results of TC39 presentation

I'm sorry to say that the committee declined to accept this proposal as-is. In the end, the concern (largely driven by @erights, although others were sympathetic) was that escaping cannot be done in a way that is totally safe in all cases, even with the extended safe set. For example,

new RegExp("\\" + RegExp.escape("w"))

is a hazard. (It does not matter that "\\" by itself is not a valid regex fragment. The above does not error to indicate that; it just silently creates a bug.)

Note that even if you attempted to correct this by escaping all initial characters, you then have

new RegExp("\\\\" + RegExp.escape("w"))

as a bug. @erights called this the "even-odd problem."
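The parity effect can be seen with plain concatenation. This sketch does not call RegExp.escape; it only shows how the meaning of the same trailing "w" flips with the number of preceding backslashes:

```javascript
// Odd number of backslashes: "\\" + "w" becomes the class escape \w.
const odd = new RegExp("\\" + "w");
console.log(odd.test("x")); // true: any word character matches

// Even number of backslashes: "\\\\" + "w" is an escaped backslash, then "w".
const even = new RegExp("\\\\" + "w");
console.log(even.test("x"));   // false
console.log(even.test("\\w")); // true: a literal backslash followed by "w"
```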

The general feeling was that to be completely safe you need a context-dependent join operation. The feeling was then that if author code wants to do unsafe escaping, the function is easy to write, but if something is going to be standardized, it must be completely safe. The idea that other languages are not held to this standard did not convince them, I'm sorry to say.

The committee recognized that you might not be willing to do work on a different, more complicated proposal. But, if you were interested, they think that a template string tag solution would be the best to investigate. Such a solution would be able to use the context of each insertion to do a safe join operation between any two fragments, instead of depending on string concatenation. Template strings can also be twisted to work in dynamic situations (e.g. those that this proposal would cover via new RegExp(pieces.map(RegExp.escape).join("/"))) by directly calling the tag function, probably with an adapter to work through the awkwardness of the parameter form for template tags. So this would be strictly more powerful. This was also preferred (for reasons I don't really remember) to a building-block approach of e.g. RegExp.concat plus RegExp.escape (used as RegExp.concat("\\", RegExp.escape(x))).

I'm pretty disappointed by this, and am sorry you and others sunk so much work into it with such an outcome. But, what can we do.

At what point is it safe to say "The committee is not interested in .escape"?

Hey,

In the past, WHATWG has expressed interest in standardizing RegExp.escape as a utility as part of the web platform. We (Node) want this and I'm happy to push this through WHATWG and other standards bodies I'm involved with (like WinterCG) to other server platforms.

As committee members suggested to me, and as a show of good faith, I pushed back on this idea in favor of standardizing through TC39, as that was the nice thing to do, to the disappointment of users.

That was over 2 years ago.

When would be a good time to say "OK, it looks like the committee is blocked on the solution the community is asking for" and pursue it in another way the platform I help maintain and other platforms can use it?

Consider RegExp template tag instead

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature. This version would solve the user-facing problem more directly, avoiding the need to concatenate RegExps in the result, and can help engines avoid re-parsing (by just parsing the shared part once). If any part of RegExp syntax ends up requiring context-dependent escaping, a template constructor could resolve that in a way that context-independent RegExp.escape cannot.

Such a feature could look like this, to find the value of the variable x followed by the value of y with arbitrary whitespace in between:

RegExp.build`${x}\s+${y}/m`

Here, flags are provided after a / within the RegExp (which is of course optional).
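A rough userland sketch of the idea, under stated assumptions: regex and escapeLiteral are hypothetical names, the escaper covers only RegExp syntax characters, and the trailing /flags form is not parsed. It is not context-aware, so it illustrates the shape of the API rather than the safe join a real design would need:

```javascript
// Hypothetical helper: backslash-escapes RegExp syntax characters only.
function escapeLiteral(value) {
  return String(value).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical tag: static parts are used raw, interpolated values are escaped.
function regex(strings, ...values) {
  const source = strings.raw.reduce(
    (acc, chunk, i) => acc + escapeLiteral(values[i - 1]) + chunk
  );
  return new RegExp(source);
}

const x = "a.b";
const y = "c*";
const re = regex`${x}\s+${y}`;

console.log(re.source);           // a\.b\s+c\*
console.log(re.test("a.b   c*")); // true
console.log(re.test("aXb c"));    // false
```

Because the tag reads strings.raw, the static \s+ does not need double backslashes.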

Explore other solutions

So, the results are in, and it seems they declined RegExp.escape and this is a final decision.

What alternatives do we have? Of course we could just fall back on a user-land implementation (like ljharb/regexp.escape, are there others?) as mentioned, but what if we want to press on?

  • RegExp.fromString, #33, with regex combinators as RegExp.prototype methods, to provide a fluent interface for regex construction/combination
  • RegExp.tag, for template strings, avoids double-escaping by accessing the .raw template parts, and solves the problem with trailing backslashes (example)

Both do have a problem with flags though, which are difficult to express. Is there anything else?

code points vs code units

You need to be careful about whether you are processing code points or code units. In particular, when you create the result string the elements must be code units, so you need to reference one of the algorithms used by the spec to UTF-16 encode code points.
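A small sketch of the distinction in plain JavaScript; the sample string is arbitrary:

```javascript
const s = "a😊";

console.log(s.length);      // 3 code units
console.log([...s].length); // 2 code points

// Iterate by code point, then re-encode each one as UTF-16 code units.
const units = [];
for (const ch of s) {
  for (let i = 0; i < ch.length; i++) {
    units.push(ch.charCodeAt(i));
  }
}
console.log(units.map((u) => u.toString(16))); // [ '61', 'd83d', 'de0a' ]
```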

Path to Stage 4!

Stage 4

  • committee approval
  • two implementations
  • significant in-the-field experience
  • ecma262 PR approved
  • prepare ecma262 PR

Stage 3

  • committee approval
  • merge test262 tests
  • write test262 tests

Stage 2.7

Stage 2

  • committee approval
  • spec reviewers selected
  • spec text written

Stage 1

  • committee approval

Which leading characters should be escaped?

As noted in #58 (comment) , an unescaped non-digit leading character could still be interpreted as part of an escape sequence spanning concatenated RegExp.escape output.

Consider, e.g., new RegExp("\\c" + RegExp.escape("J")) in a web browser implementing |ExtendedAtom|, where the result should match "\\cJ" and [unlike /\cJ/] fail to match "\n".

Spec Review

  • Why do we need a "ASCII punctuators that need escaping" phrase?
  • Nit: rename cuList to something else. Maybe escapedList?
  • > If c is the first code point in cpList and c is a DecimalDigit, then
    • I don't see the phrase "is the first … in" anywhere else. I think we can update that to state "if cuList is empty" and use a defined "is empty" phrase and keep the same behavior.
  • There are several references to cuList that need to become _cuList_
  • > Append the elements of the UTF16Encoding (10.1.1) of c to cuList
    • What is UTF16Encoding? That 10.1.1 is 10.1.1 [[GetPrototypeOf]] ( )
  • There are several 0x00XY "unicode" references; shouldn't they all be U+00XY?
  • DecimalDigit and WhiteSpace should be |DecimalDigit| and |WhiteSpace|
  • > and c is a DecimalDigit
    • c is a code point, can you compare it to a DecimalDigit production?
  • > and c is a WhiteSpace
    • c is a code point, can you compare it to a WhiteSpace production?
  • > c is a CharSetElement of toEscape
    • c is a code point needs to be cast to a CharSetElement before we can check if it's an element of the set.

Should we allow implementations to add to the escaped set?

While researching Perl today we've found an interesting issue:

As an interesting point, RegExp grammar can add flags (like /x in the future) but RegExp.escape will not be able to react to these changes.

In both cases - this means that RegExp.escape can't guarantee the length of the returned string between versions of ECMAScript that make extensions to the grammar.

To me, this is a huge deal. I'm wondering - if implementations are allowed to extend the regular expressions grammar - should they be allowed to escape additional characters and we should make it clear that there is no length guarantee on the returned string?

(No other language has a length guarantee but some have a fixed character set which implies a length guarantee)

Create a table of every escaped character and "why"

I've asked for this a couple times, but I think it's crucial to forward progress, especially as people debate additions to the current spec.

Here is a start:

char reason
^ new RegExp("^") => syntax error; new RegExp("^") => matches the string "^"

Crawl the internet

This is a self assigned issue for me to talk to either Professor Dani Dolev, Professor Sarah Coden or Professor Dror Feitelson regarding acquiring the university resources needed to crawl the top 1M websites to find eval / usages.

Advance to stage 1

From the tc39 process

  • Identified “champion” who will advance the addition

This would be me, with the much appreciated help of @domenic . Others (Uri Shaked and Elad Kats) have offered help with the process.

  • Prose outlining the problem or need and the general shape of a solution

Done in the readme.

  • Illustrative examples of usage

Done in the readme.

  • High-level API

Done in the readme, including semantics and a polyfill and a spec to JS file. The semantics spec will likely be moved to ECMarkup format.

  • Discussion of key algorithms, abstractions and semantics

This is a relatively simple proposal - covered in #3 and #4. A template tag is still considered but as it looks from #4 it seems inferior.

  • Identification of potential “cross-cutting” concerns and implementation challenges/complexity.

Done, and updated the readme file to address these.

  • Get TC39 to agree that we have advanced to stage 1, after meeting all the above requirements.

EncodeForRegExpEscape should not return results that require particular flags

EncodeForRegExpEscape step 4.e (which would be reached if input c were a Space_Separator supplementary code point in [U+10000, U+10FFFF]) results in a return value like \u{…}. The interpretation of such pattern text is dependent upon regular expression flags—specifically, it is interpreted as a |RegExpUnicodeEscapeSequence| that will match a code point with the contained hexadecimal value in the presence of a "u" or "v" flag, but otherwise is interpreted as either a syntax error or (only in a host supporting Annex B and only when the hexadecimal representation of the code point consists only of decimal digits) as a quantified |ExtendedAtom| "u" with the specified decimal count of repetitions (e.g., /^\u{10000}$/.test("u".repeat(10000)) is true).

Rather than returning results subject to conditional interpretation, EncodeForRegExpEscape should return a \u…\u… surrogate pair |RegExpUnicodeEscapeSequence| for such inputs (which work in both Unicode and non-Unicode regular expressions, e.g. /^\uD834\uDF06$/u.test("𝌆") and /^\uD834\uDF06$/v.test("𝌆") and /^\uD834\uDF06$/.test("𝌆") are all true).
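A sketch of the surrogate pair form suggested above. toCodeUnitEscapes is a hypothetical helper for illustration, not the spec's EncodeForRegExpEscape:

```javascript
// Hypothetical helper: spell a code point as \uXXXX escapes, one per UTF-16 code unit.
function toCodeUnitEscapes(codePoint) {
  const s = String.fromCodePoint(codePoint);
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += "\\u" + s.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

const escaped = toCodeUnitEscapes(0x1d306); // U+1D306 is "𝌆"
console.log(escaped); // \uD834\uDF06

// The same source text matches with and without the "u" flag.
const target = String.fromCodePoint(0x1d306);
console.log(new RegExp("^" + escaped + "$").test(target));      // true
console.log(new RegExp("^" + escaped + "$", "u").test(target)); // true
```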

Or alternatively (and preferably IMO), EncodeForRegExpEscape should not escape all white space. I'm not certain why it does so right now, but looking back I suspect it is due to a misinterpretation of #30 (which requests escaping of control characters, and even more specifically line terminators—and even that isn't necessary).
