
parse-css's Introduction

Standards-Based CSS Parser

This project implements a standards-based CSS parser. I'm the editor of the CSS Syntax spec (https://drafts.csswg.org/css-syntax/) and need an implementation of it for testing purposes.

This parser is not designed to be fast, but users tell me it's actually rather speedy. (I suppose it's faster than running a ton of regexes over a bunch of text!) Its structure and coding style are instead meant to be very close to the spec, so that it's easy to verify that the code matches the spec (and vice versa) and to make it easy, when the spec changes, to make the same change in the parser.

It is intended to fully and completely match browser behavior (at least, as much as the final spec does).

There's a dingus for testing it out, or just quickly checking what some CSS parses into.

Using the Library

Include parse-css.js in your page. Then just call the desired parsing function, named after the algorithms in the spec: parseAStylesheet(), etc. You can pass a string or a list of tokens (such as what's produced by the tokenize() function). It'll return an appropriate object, as specified by the parsing function.

If you want to get access to the tokens directly, call tokenize() with a string; it'll return a list of tokens.

Note that the Syntax spec, and thus this parser, is extremely generic. It doesn't have any specific knowledge of CSS rules, just the core syntax, so it won't throw out invalid or unknown things.

Parsing Functions

Here's the full list of parsing functions. They do exactly what they say in their name, because they're named exactly the same as the corresponding section of the Syntax spec:

  • parseAStylesheet()
  • parseAStylesheetsContents()
  • parseABlocksContents()
  • parseARule()
  • parseADeclaration()
  • parseAComponentValue()
  • parseAListOfComponentValues()
  • parseACommaSeparatedListOfComponentValues()

Node Integration

parse-css.js uses the UMD module pattern, exporting the parser functions, the tokenize() function, and all of the classes used by the parser and tokenizer.

parse-css's People

Contributors

arnog, bkardell, danny0838, espadrine, fremycompany, gregable, kolodny, mathiasbynens, mkrhere, raynos, romainmenke, simonsapin, tabatkins, tromey


parse-css's Issues

DELIM parsing confusion?

I noticed that the asterisk in the following style declaration is parsed as a DELIM, whereas I had expected an IDENT:

dl { grid: * 10em "a b" "c d" 4em; }

I based this assumption on the following rule, in which the letter "a" is parsed as an IDENT:

sym1 { flow: a; }

Apologies if I'm off base here. I did look at the CSS3 Syntax spec, but I wasn't clear on the intended purpose of DELIM.

CommonJS support

Nice work on the CSS parser!

I'd love some sort of CommonJS support so I can make a Node.js-based CSS compressor. :)

Refactoring: consumeA<Function|...>(...) should be <Func|...>.consumeFrom(...)

I'm willing to make this change in my version of the codebase and open a PR on your project backporting only this change, but only if you accept the idea.

The reason is that I don't want to maintain a fork too different from your codebase, and this looks like a rather important change; I therefore won't take the risk of making the update on my side unless you commit to the idea first.

Would you be ok to merge such a pull request if I made one?

parseerror is defined two times, with conflicting signatures

function parseerror(s, msg) {
    console.log("Parse error at token " + s.i + ": " + s.token + ".\n" + msg);
    return true;
}

and

    var parseerror = function() {
        console.log("Parse error at index " + i + ", processing codepoint 0x" + code.toString(16) + ".");
        return true;
    };

`HashToken.type` is incorrect when it is an `id`

if(wouldStartAnIdentifier(next(1), next(2), next(3))) token.type = "id";

This clashes with the type that all tokens have and that distinguishes them.

Other HashToken instances have instance.type === 'HASH', but when the hash is an id, type is set to 'id' instead.


Before the recent refactor HASH was stored in tokenType and it didn't clash.

UMD broke example.html

The example.html page fails after updating to the latest. It looks like the breaking change was the introduction of UMD.

EOF in rule mode: behavior doesn't match IE/Chrome

The following code: "a { } b" outputs one CSS rule in IE, Chrome but two using your algorithm.

Proposed fix: defer the create/push call from line 112 (default: create(new SelectorRule) && switchto('selector') && reprocess();) to line 118 (case "{": switchto('declaration'); break;).

Update or remove package.json

The package on npm is 10 months old, as is the package.json in this repo. There have been numerous changes since then, so please either update the package or remove it from npm (and the package.json).

Thanks a lot!

Generating CSS source code from the rule tree.

Hey Tab,

I believe you're aware of vminpoly, which adds vw, vh, vmin & vmax unit functionality as a prollyfill.

After talking a bit with Brian Kardell, I'm working on extending the basic architecture to support pluggable prollyfills on top of your parser/tokenizer. To make it suit my needs, I modified tokenizer.js a bit so that it can generate CSS source code, adding a toSourceString() function to the various tokens' prototypes. For example (some Git diff output):

 DimensionToken.prototype.toString = function() { return "DIM("+this.num+","+this.unit+")"; }
 DimensionToken.prototype.toSourceString = function() { return this.num+this.unit; }
// ...
NumberToken.prototype.toString = function() {
    if(this.type == "integer")
        return "INT("+this.value+")";
    return "NUMBER("+this.value+")";
}
NumberToken.prototype.toSourceString = function() {
    if(this.type == "integer")
        return this.value;
    return this.value;
}

Perhaps we can add and maintain .toSourceString() directly in your repository. This addition could be generally useful so perhaps it does belong inside tokenizer.js

What do you think?

EDIT:
Well, it has been a while since I've looked at vminpoly and, after refamiliarizing myself with it, I've realized part of the CSS source code generation is in vminpoly itself. So .toSourceString() won't generate complete source code.

More so, I'm hooking into parts of the code generator to filter out unneeded parts of the rule tree, so a code generator embedded in your library won't be useful to me unless it can call back at different points during the code generation.

Still, perhaps it could be worthwhile to put a simplified version of vminpoly's CSS code generator into your parser for other people. Let me know if you're interested and I'll show you the relevant parts in my code.

And then again, perhaps a hookable code generator could be neat as well, though it would be more complex to implement in a generic reusable way.

Numeric values become "DIM(undefined,px)"

var tokenize = require('./tokenizer').tokenize;
var parse = require('./parser').parse;

str = 'div { left: 22px; }'
tokens = tokenize(str);
sheet = parse(tokens);

console.log(JSON.stringify(sheet,null,2));
{
  "type": "stylesheet",
  "value": [
    {
      "type": "selector",
      "selector": [
        "IDENT(div)",
        "WS"
      ],
      "value": [
        {
          "type": "declaration",
          "name": "left",
          "value": [
            "WS",
            "DIM(undefined,px)" /* <-- right here */
          ]
        }
      ]
    }
  ]
}

HASH - IDENT multiple matches

Giving the following rule:

#id {top:1px}

The tokenizer returns a HashToken for the "#id" selector. This does not seem to be the correct behavior. It should return an IDENT token, since according to http://www.w3.org/TR/css3-syntax/#tokenization :
"In case of multiple matches, the longest match determines the token."

Also, a HashToken may be invalid in a selector, because it can contain a digit as the first character, which is illegal in an ID selector.

I think that this should be returned:

#id {top:1px} - IDENT

#1id {top:1px} - HASH

Thank you.

Throw parse errors instead of just logging them to console

This way the parser could be used to validate properties and attributes, e.g.:

let isValidStyleAttributeValue = (value) => {
  try {
    CSSParser.parseAListOfDeclarations(value.trim());  
  }
  catch (error) {
    return false;
  }

  return true;
};

parser or example.html issues with nested at-rules

The parser token and JSON results appear different to me when parsing nested at-rules. I'm unsure whether the parser itself has issues parsing nested at-rules, or whether this is a limitation of the example.html demo.

Here is the CSS that I parsed:

@media screen {
	body {
		background-color: blue;
	}
}

@media all {
	@media screen {
		body {
			background-color: blue;
		}
	}
}

The tokens output looks correct, labeling two root at-rules and one nested at-rule:

AT(media) WS IDENT(screen) WS {
	WS IDENT(body) WS {
		WS IDENT(background-color) : WS IDENT(blue) ;
	WS }
WS }

WS AT(media) WS IDENT(all) WS {
	WS AT(media) WS IDENT(screen) WS {
		WS IDENT(body) WS {
			WS IDENT(background-color) : WS IDENT(blue) ;
		WS }
	WS }
WS }

The JSON output looks incorrect: the final nested at-rule is identified as a plain AT-KEYWORD token in the value of the 2nd root at-rule:

{
	"type": "STYLESHEET",
	"value": [
		{
			"type": "AT-RULE",
			"name": "media",
			"prelude": [
				{
					"token": "WHITESPACE"
				},
				{
					"token": "IDENT",
					"value": "screen"
				},
				{
					"token": "WHITESPACE"
				}
			],
			"value": {
				"type": "BLOCK",
				"name": "{",
				"value": [
					{
						"token": "WHITESPACE"
					},
					{
						"token": "IDENT",
						"value": "body"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"type": "BLOCK",
						"name": "{",
						"value": [
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "background-color"
							},
							{
								"token": ":"
							},
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "blue"
							},
							{
								"token": ";"
							},
							{
								"token": "WHITESPACE"
							}
						]
					},
					{
						"token": "WHITESPACE"
					}
				]
			}
		},
		{
			"type": "AT-RULE",
			"name": "media",
			"prelude": [
				{
					"token": "WHITESPACE"
				},
				{
					"token": "IDENT",
					"value": "all"
				},
				{
					"token": "WHITESPACE"
				}
			],
			"value": {
				"type": "BLOCK",
				"name": "{",
				"value": [
					{
						"token": "WHITESPACE"
					},
					{
						"token": "AT-KEYWORD",
						"value": "media"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"token": "IDENT",
						"value": "screen"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"type": "BLOCK",
						"name": "{",
						"value": [
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "body"
							},
							{
								"token": "WHITESPACE"
							},
							{
								"type": "BLOCK",
								"name": "{",
								"value": [
									{
										"token": "WHITESPACE"
									},
									{
										"token": "IDENT",
										"value": "background-color"
									},
									{
										"token": ":"
									},
									{
										"token": "WHITESPACE"
									},
									{
										"token": "IDENT",
										"value": "blue"
									},
									{
										"token": ";"
									},
									{
										"token": "WHITESPACE"
									}
								]
							},
							{
								"token": "WHITESPACE"
							}
						]
					},
					{
						"token": "WHITESPACE"
					}
				]
			}
		}
	]
}

Duplicate test case

The following two tests are identical:

parse-css/tests.js

Lines 331 to 335 in c7859c4

{
  parser: "",
  css: "null\\0",
  expected: [{type: "IDENT", value: "null\uFFFD"}, {type: "EOF"}],
},

parse-css/tests.js

Lines 336 to 340 in c7859c4

{
  parser: "",
  css: "null\\0",
  expected: [{type: "IDENT", value: "null\uFFFD"}, {type: "EOF"}],
},

I think this is an oversight of sorts; I see no value in having both. Am I missing something?

Inquiry: Should unicode escapes be idents

So I'm wanting to build a CSS injection validator and this seemed the best starting place.

So looking at these:
https://code.google.com/p/browsersec/wiki/Part1#Cascading_stylesheets

This:

color: expression(alert(1))

and:

color: expression\028 alert\028 1\029 \029

should be roughly equivalent; however, the output isn't:

    {
            "type": "FUNCTION",
            "value": [
              {
                "type": "FUNCTION",
                "value": [
                  {
                    "token": "NUMBER",
                    "value": 1,
                    "type": "integer",
                    "repr": "1"
                  }
                ],
                "name": "alert"
              }
            ],
            "name": "expression"
          },
          {
            "token": ";"
          },

and

          {
            "token": "IDENT",
            "value": "color"
          },
          {
            "token": ":"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "expression(alert"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "(1"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "))"
          },
          {
            "token": ";"
          }

Ultimately, I'm looking for uniformity to simplify the injection-mitigation code. Is that something you would prefer me to fix before passing the CSS through the parser, or would it be simpler to resolve unicode escapes within the parser?

Parser Question

Firstly, thanks for releasing this. It, along with https://github.com/NV/CSSOM are the most readable JavaScript CSS parsers I've seen. Pardon my ignorance but I have two questions:

  • Why use charCodeAt in the tokenizer instead of just charAt? Wouldn't that make the code easier to read?
  • Why the separate tokenizer and parser? I know the aim is not speed at this stage; however, by keeping the two together there would be no need for two loops, and it would also be more obvious what is going on.

Again, thanks for a great library.
