tenko's Introduction

Tenko

A "pixel perfect" 100% spec compliant JavaScript parser written in JavaScript, parsing ES6/ES2015 - ES2021.

REPL: https://pvdz.github.io/tenko/repl

  • Supports:
    • Anything stage 4 up to ES2021
    • Regex syntax (deep)
    • Parsing modes:
      • Sloppy / non-strict
      • Web compat / AnnexB
      • Strict
      • Module
  • AST
    • Is optional, enabled by default
    • Estree (default)
    • (Optional chaining AST works but AST spec seems still in flux)
    • Acorn
    • Babel (anything stage 4, except comments)
    • Supports location data (matching Acorn/Babel for reference)
  • Tests
    • 33k input syntax tests
    • Passes test262 suite (at least as of March 2020), without exception

Name

The name is short for "The Parser Formerly Known As ZeParser3".

It's also an anagram for "Token", perfectly fitting this project.

In Japanese it's a divine beast ("heavenly fox" or "celestial fox"), playing into my nicknames.

REPL

You can find the REPL in repl/index.html, github link: https://pvdz.github.io/tenko/repl

The REPL runs on the dev master branch and needs a very recent browser because it uses ES module syntax.

Usage

import {Tenko, GOAL_MODULE, COLLECT_TOKENS_ALL} from 'src/index.mjs';
const {
  ast,                 // estree compatible AST
  tokens,              // array of numbers (see Lexer)
  tokenCountSolid,     // number of non-whitespace tokens
  tokenCountAny,       // number of tokens of any kind
} = Tenko(
  inputCode,           // string
  {
    // Parse with script or module goal (module allows import/export)
    goalMode = GOAL_MODULE, // GOAL_MODULE | GOAL_SCRIPT | "module" | "script"
    // Do you want to collect generated tokens at all?
    collectTokens = COLLECT_TOKENS_ALL, // COLLECT_TOKENS_ALL | COLLECT_TOKENS_SOLID | COLLECT_TOKENS_NONE | COLLECT_TOKENS_TYPES | "all" | "solid" | "none" | "types"
    // Apply Annex B rules? (Only works in sloppy mode)
    webCompat = true,
    // Start parsing as if in strict mode? (Works with script goal)
    strictMode = false,
    // Output a Babel compatible AST? Note: comment nodes are not properly mirrored
    babelCompat = false,
    // Add a loc (with `{start: {line, column}, stop: {line, column}}`) to each token?
    babelTokenCompat = false,
    // Pass on a reference that will be used as the AST root
    astRoot = null,
    // Should it normalize \r and \r\n to \n in the .raw of template nodes?
    // Estree spec but makes it hard to serialize template nodes losslessly
    templateNewlineNormalization = true,
    // Pass on a reference to store the tokens
    tokenStorage = [],
    // Callback to receive the lexer instance once it's created
    getLexer = null, // getLexer(lexer)
    // You use this to parse `eval` code
    allowGlobalReturn = false,
    // Target a very specific ecmascript version (like, reject async). Number; 6 - 12, or 2015 - 2021, or Infinity.
    targetEsVersion = lastVersion, // (Last supported version is currently ES2021)
    // Leave built up scope information in the ASTs (good luck)
    exposeScopes = false,
    // Assign each node a unique incremental id
    astUids = false,
    // Do you want to print a code frame with error messages? (Part of the input around the point of error)
    errorCodeFrame = true,
    // For the code frame, do you want to always show the entire input, regardless of size? Or just a small context
    truncCodeFrame = true,
    // You can override the logging functions to catch or squash all output
    $log = console.log,
    $warn = console.warn,
    $error = console.error,
    // Value to use for the `source` field of each `loc` object
    sourceField = '',
    // Generate a `range: {start: number, end: number}` property on all loc objects (does not require `locationTracking`)
    ranges = false,
    // Generate a `range: [start: number, end: number]` property on all nodes. `input.slice(range[0], range[1])` should get you the text for a node.
    nodeRange = false,
    // Do not populate loc properties on AST nodes (property will be undefined). Since v<unpublished>
    locationTracking = true,
  }
);
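
The listing above uses `name = default` notation to document each option's default, which is not valid object-literal syntax. In an actual call the options are `name: value` pairs. Below is a minimal sketch of such a call. The helper name `parseAsModule` is hypothetical, and the parser is passed in as an argument (it is expected to be the imported `Tenko` function) so that the snippet stays self-contained.

```javascript
// Hypothetical helper (not part of Tenko): parse `code` with the module goal
// and return the AST plus the token counts described in the listing above.
// `parser` is expected to be the imported `Tenko` function.
function parseAsModule(parser, code) {
  const { ast, tokenCountSolid, tokenCountAny } = parser(code, {
    goalMode: 'module',      // string form of GOAL_MODULE, per the listing
    collectTokens: 'solid',  // string form of COLLECT_TOKENS_SOLID
    nodeRange: true,         // ask for `range` on all nodes
  });
  return { ast, tokenCountSolid, tokenCountAny };
}
```

With the import from the listing in place, a call would look like `parseAsModule(Tenko, 'export const x = 1;')`.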

Development

There is a single entry point in the project root, called t. It calls tests/t.sh, which in turn calls out to various development-related scripts.

ES modules

Note that the files use import and export declarations and import(), which requires node 10+ or a cutting edge browser.

At the time of writing node requires the experimental --experimental-modules flag.

It's a burden in some ways and nice in others. A prod build would not have any modules.

Test cases

All test cases are in "special" plain-text .md files. See tests/testcases/README.md for details on formatting those.

Entry point

Some interesting usages of ./t:

# Show help
./t --help

# Run and update (inline) all tests. 
# Use git diff to see changes. Will bail fast on unexpected or assertion errors.
# This tests four modes (sloppy, strict, module, and sloppy-web-compat)
# This also tests the printer on the first successful parse
./t u
# Run all tests step-by-step (same as above) and ask what to do for any changes
./t m

# Same as `./t u` but compare it against Babel or Acorn. Recorded changes should be discarded afterwards.
# Use this to test against AST differences. If there are any they will be printed explicitly.
# Acorn:
./t a
# Babel:
./t b

# Test a particular input from cli
./t i "some.input()"
# Test a particular test file
./t f "tests/testcases/regexes/foo.md"
# Use entire contents of given file as input
./t F "test262/test/annexB/built-ins/foo.js"

# Generate prod builds
# Generate a build. Strips ASSERT*, inlines many constants
./t z
# Same as above but explicitly set `acornCompat` and `babelCompat` to `false`.
./t z --no-compat
# Generate pretty builds for debugging without asserts:
./t --pretty
# Minified build with Terser (will lower performance due to inlining)
./t --min

# Run test262 tests (requires cloning https://github.com/tc39/test262 into tenko/ignore/test262)
./t t

# Fuzz the parser
./t fuzz

# Regenerate all autogen test files. Regenerated files still need to be updated (`./t u`).
# All files, regardless:
./t g
# Only create new files:
./t G

# Find out which tests execute a particular code branch in the parser
# Add `HIT()` to any part of the code in src
# Reports (only) all inputs that trigger a `HIT()` call in Tenko
./t s

Some tooling requires additional setup:

# Benchmarks (requires benchmark files in projroot/ignore/perf):
# Simply spawn new node process and run test:
./t p
# Run benchmarks repeatedly and report results
./t p6
# Configure machine to be as stable as possible (DANGEROUS, read the script before using it, requires root). All 
# changes should be reset after reboot. Then run the benchmarks in the shielded cpus at RT prio (also requires root).
./t stable
./t p6 --stabled
# Same as above but without running `./t stable` previously, and tries to undo certain (but not all) things afterwards
./t p6 --stable

# Investigate v8 perf regressions with deoptigate:
./t deoptigate

# Profile the parser in Chrome devtools (open the tab through `about://inspect`)
./t devtools

# Run a visual heatmap profiler for counts based investigation (private)
./t hf

There are many flags. Some are specific to an action, others are generic. Some examples:

--sloppy             Run in non-strict mode (but non-web compat!)
--strict             Run with script goal but consider the code strict
--module             Run with module goal (enabling strict mode by default)
--web                Run with script goal, non-strict, and enable web compat (AnnexB rules)
--annexb             Force enable AnnexB rules, regardless of mode

6                    Run as close to the rules as of ES6  / ES2015 as possible
7                    Run as close to the rules as of ES7  / ES2016 as possible
8                    Run as close to the rules as of ES8  / ES2017 as possible
9                    Run as close to the rules as of ES9  / ES2018 as possible
10                   Run as close to the rules as of ES10 / ES2019 as possible
11                   Run as close to the rules as of ES11 / ES2020 as possible
12                   Run as close to the rules as of ES12 / ES2021 as possible
2015
2016
2017
2018
2019
2020
2021

--min                Given a broken input, brute force minify the input while maintaining the same error message
--acorn              Output an Acorn compatible AST
--babel              Output a Babel compatible AST
--test-acorn         Compare the `--acorn` output to the actual output of Acorn on same input (./t a)
--test-babel         Compare the `--babel` output to the actual output of Babel on same input (./t b)
--test-node          Compile input in a `Function()` and report whether that throws when Tenko throws, for fuzzer
--build              Use a prod build (from standard output location), instead of dev sources, for all actions that support it
--nb                 Do not build (many actions will kick off a build before doing their thing, this prevents that)

And many more. For details, ./t --help should give you an up to date list of all actions and options.
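
The version flags above accept either an edition number (6 to 12) or a year (2015 to 2021), and the `targetEsVersion` option additionally accepts Infinity. The two notations differ by a fixed offset, since ES2015 is edition 6. A minimal sketch of that mapping follows; `normalizeEsVersion` is a hypothetical helper for illustration and not Tenko's own code.

```javascript
// Hypothetical helper (not Tenko's own code): map a version given as an
// edition (6..12), a year (2015..2021), or Infinity to the edition number.
function normalizeEsVersion(version) {
  if (version === Infinity) return Infinity; // no specific target
  if (Number.isInteger(version) && version >= 6 && version <= 12) {
    return version;
  }
  if (Number.isInteger(version) && version >= 2015 && version <= 2021) {
    return version - 2009; // ES2015 is edition 6, so year - 2009 = edition
  }
  throw new RangeError('Unsupported ES version: ' + version);
}
```

For example, `normalizeEsVersion(2018)` and `normalizeEsVersion(9)` both yield 9.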

Building

While the parser runs perfectly fine in dev mode it will be a bit slow. A build:

  • will remove non-assert dev artifacts
  • can remove inline asserts (lines that start with ASSERT)
  • can remove all the AST generation from the build (lines that start with AST)
  • inlines many constants used by the parser as enums or bitwise fields

To generate a build run this in the project root, flags can be combined:

./t z                      # Regular build with everything
./t z --no-ast             # Strip most AST related code from the build (~50% faster, but obviously no AST)
./t z --no-compat          # Strip acorn/babel compatibility
./t z --min                # Run Terser on result (will decrease performance due to inlining)
./t z --pretty             # Run Prettier on build result

Note that this (initially) uses my own printer to print the AST.

The build script writes an ESM and a CJS file to ./build

Validation without AST

The no-AST build can validate JS almost as well as the regular build, except for a few validation cases that require the AST:

  • Binary op after arrow with block body (()=>{}*x is illegal)
  • Regular expression on new line after arrow with block body (()=>{} \n /foo/g, prohibited by ASI rules and can't be a division)
  • Update operator on anything that's writable but not a valid var or member expression (++[])
  • Delete with an ident that is wrapped in parentheses (delete (foo) is illegal); trivial cases (delete foo;) should be fine

Testing

Each test is individually encapsulated in an .md file in tests/testcases/**. This file will contain the input code and the output as expected for sloppy mode, strict mode (script goal), module goal, and web compat mode (only works in sloppy mode, script goal).
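
The four modes can be approximated through the options from the Usage section. The sketch below runs one input through all four and records whether it was accepted. The option combinations are an assumption derived from the flag descriptions, `runAllModes` is a hypothetical helper, and `parser` is expected to be the imported `Tenko` function, which throws on invalid input.

```javascript
// Option sets for the four modes, using option names from the Usage section.
// The exact combinations are an assumption based on the flag descriptions.
const MODES = {
  sloppy: { goalMode: 'script', strictMode: false, webCompat: false },
  strict: { goalMode: 'script', strictMode: true, webCompat: false },
  module: { goalMode: 'module', strictMode: false, webCompat: false },
  web: { goalMode: 'script', strictMode: false, webCompat: true },
};

// Hypothetical helper: run `parser` once per mode and record whether the
// input was accepted or which error message was thrown.
function runAllModes(parser, code) {
  const results = {};
  for (const [name, options] of Object.entries(MODES)) {
    try {
      parser(code, options);
      results[name] = { ok: true };
    } catch (e) {
      results[name] = { ok: false, message: String(e && e.message) };
    }
  }
  return results;
}
```

This is only a sketch of the idea. The actual test runner also records the AST and token types and re-checks the printer output.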

If a run passes then the AST and types of tokens are printed in the output. This AST is also printed with src/tools/printer and its output is checked to produce the same AST.

If a run does not pass, the error message and a pointer to where the error occurred are stored in the file.

The files can be auto-updated with ./t u or ./t m. This makes it easy to update something in the parser and use git to confirm whether anything changed, and if so what.

There are also autogen.md files, which generate a bunch of combinatorial tests (./t g or ./t G), similar to the other tests.

To create a new test simply add a new file, start it with @, a description, a line with only ### and the rest will be considered the test case verbatim. When you run the test runner this file will automatically be converted to a proper test case.
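
The layout of a fresh, not-yet-converted test file described above can be sketched as a small generator. This is an illustration only: `newTestCaseSource` is a hypothetical helper, and the single space between `@` and the description is an assumption. See tests/testcases/README.md for the authoritative format.

```javascript
// Hypothetical helper: build the text of a new, not-yet-converted test file.
// Assumption: the description follows "@" on the first line after one space.
function newTestCaseSource(description, input) {
  return '@ ' + description + '\n###\n' + input;
}
```

Writing the returned string to a new .md file under tests/testcases and then running the test runner should turn it into a proper test case, per the description above.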

tenko's People

Contributors

0xflotus, coderaiser, pvdz

tenko's Issues

realpath: command not found

➜  tenko git:(master) ✗ ./t
./t: line 8: realpath: command not found
usage: dirname path
./tests/t.sh: line 6: realpath: command not found
usage: dirname path
./tests/t.sh: line 258: syntax error near unexpected token `;'

os: macos 11.2 Beta (20D5029f)

copyPiggies needs to be less confusing

The copyPiggies calls should not pass an assignable/destructible value. Should create a getpiggies call instead.

The call sites may be simplified by doing so since they're implying that the returned value is affected by the current value being assignable/destructible, which is not the case.

Tokens format not compatible with Esprima and other parsers

Esprima, Babel, ESTree, and Acorn use a token format that has a loc field:

{
    "loc": {
        "start": {
            "line": 8,
            "column": 4
        },
        "end": {
            "line": 8,
            "column": 8
        }
    }
}

It would be great if Tenko had it too. Recast falls back to parsing with Esprima when it can't find a tokens field, but it works well with the parsers mentioned above.

So a compatible tokens field is crucial to avoid double parsing.
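
For illustration, a loc of this shape can be derived from character offsets. The sketch below assumes the consumer knows each token's start and end offset into the source, which is an assumption since the token representation is not described here. Both helper names are hypothetical. Lines are 1-based and columns 0-based, as in ESTree, and only `\n` is treated as a line break to keep it short.

```javascript
// Hypothetical helper: convert a 0-based offset into a {line, column} pair.
// Lines are 1-based and columns 0-based. Only "\n" counts as a line break.
function offsetToPosition(code, offset) {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < offset && i < code.length; i++) {
    if (code[i] === '\n') {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: offset - lineStart };
}

// Hypothetical helper: build a loc object for the span [start, end).
function locFor(code, start, end) {
  return {
    start: offsetToPosition(code, start),
    end: offsetToPosition(code, end),
  };
}
```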

Block scoped func decl id scoping

I think Tenko currently doesn't properly scope the id of a function declaration that's inside a block (or switch).

If I understand it correctly now, in strict mode it becomes a let that is hoisted within the block, and in sloppy mode it becomes a var that is hoisted within the block.

So:

{
  f();
  function f(){}
}

In context of where f binds, is really

{
  var f = function(){}
  f();
}

(Note the discrepancies with .name and some other things. This is only supposed to illustrate how f binds.)

While

"use strict";
{
  function f(){}
  f();
}

Is really

"use strict";
{
  let f = function(){}
  f();
}

Works the same in global, in functions, and in switches.

Some correctness issues

I ran a fuzzer for a while and found the following cases which are incorrectly rejected. Other than the first two they don't make sense as programs, although that's mostly because I've tried to reduce them as much as possible.

Let me know if you would prefer that I not file bugs found with a fuzzer.

async(0,...a)
export default async=null
(class extends async function(){}{})
let[{}=class{}]=null
a[{...()=>{}}.m()]
--{_:()=>null}._
(class{static get[()=>null](){}}())

add an option to generate range

To play well with ESLint, can you add an option to generate a range property?

refs: https://eslint.org/docs/developer-guide/working-with-custom-parsers#all-nodes

The AST specification

The AST that custom parsers should create is based on ESTree. The AST requires some additional properties about detail information of the source code.

All nodes

All nodes must have range property.

  • range (number[]) is an array of two numbers. Both numbers are a 0-based index which is the position in the array of source code characters. The first is the start position of the node, the second is the end position of the node. code.slice(node.range[0], node.range[1]) must be the text of the node. This range does not include spaces/parentheses which are around the node.
  • loc (SourceLocation) must not be null. The loc property is defined as nullable by ESTree, but ESLint requires this property. On the other hand, SourceLocation#source property can be undefined. ESLint does not use the SourceLocation#source property.

The parent property of all nodes must be rewritable. ESLint sets each node's parent property to its parent node while traversing, before any rules have access to the AST.
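
The range contract quoted above is easy to check in code. This is a minimal sketch in which the node is written by hand for illustration and is not actual parser output.

```javascript
// Return the source text for a node that carries an ESLint-style `range`.
function nodeText(code, node) {
  const [start, end] = node.range;
  return code.slice(start, end);
}

// Hand-written node for illustration only; not actual parser output.
const code = 'let x = 1 + 2;';
const node = { type: 'BinaryExpression', range: [8, 13] };
console.log(nodeText(code, node)); // prints "1 + 2"
```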

REPL wrong color code on valid / invalid cases

This case

'"use strict"; (async yield => {})'

in the REPL doesn't mark 'web compat' and 'sloppy' as green. Instead they are red, indicating that this case should also fail in sloppy mode.

Nothing is wrong with Tenko itself, and if you parse this one in the REPL the colors are correct:

'(async yield => {})'

Lhs check for `let` left of `in` and `instanceof` is only checking `in`

Silly oversight. Also need to write a test for it.

if ($tp_next_type === $ID_in || $tp_next_type === $ID_in) {
  return THROW_RANGE('Cannot use `let` as a regular var name as the lhs of `in` or `instanceof` in a toplevel expression statement', tok_getStart(), tok_getStop()); // And why would you.
}

yarn install fails

yarn install
Using globally installed version of Yarn
yarn install v1.12.1
warning package.json: No license field
info No lockfile found.
warning [email protected]: No license field
[1/4] 🔍  Resolving packages...
error Can't add "acorn": invalid package version undefined.

If I try npm, it hangs at:

⸨    ░░░░░░░░░░░░░░⸩ ⠋ fetchMetadata: sill resolveWithNewModule [email protected] checking installable status

I wonder whether this could be something in my environment (FB laptop), but other things seem to work.

can't do destructuring assign with a compound operator

I was playing around with Tenko and discovered that the parser accepts cases like [...{x} /= y].

Use of a compound operator in this case should trigger an error.

Note that this only applies to array and object literals. Cases like '[...a /= y]' should not trigger any errors.
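
One way to cross-check such cases against a JavaScript engine is to compile the input with `Function()` and see whether that throws, which is the idea the `--test-node` flag describes. The sketch below is not Tenko's implementation, and `compilesInEngine` is a hypothetical helper. It only covers sloppy-mode script code, so inputs that need the module goal cannot be checked this way.

```javascript
// Returns true when the engine can compile `code` as a sloppy-mode function
// body. `Function()` only compiles here; the code is never executed.
function compilesInEngine(code) {
  try {
    new Function(code);
    return true;
  } catch (e) {
    return false;
  }
}
```

Here `compilesInEngine('[...a /= y]')` returns true, while `compilesInEngine('[...{x} /= y]')` returns false because the object literal is not a valid target for a compound assignment.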
