
parse-css's Introduction

Standards-Based CSS Parser

This project implements a standards-based CSS parser. I'm the editor of the CSS Syntax spec (https://drafts.csswg.org/css-syntax/) and need an implementation of it for testing purposes.

This parser is not designed to be fast, but users tell me it's actually rather speedy. (I suppose it's faster than running a ton of regexes over a bunch of text!) Its structure and coding style are instead meant to be very close to the spec, so that it's easy to verify that the code matches the spec (and vice versa) and to make it easy, when the spec changes, to make the same change in the parser.

It is intended to fully and completely match browser behavior (at least, as much as the final spec does).

There's a dingus for testing it out, or just quickly checking what some CSS parses into.

Using the Library

Include parse-css.js in your page. Then just call the desired parsing function, named after the algorithms in the spec: parseAStylesheet(), etc. You can pass a string or a list of tokens (such as what's produced by the tokenize() function). It'll return an appropriate object, as specified by the parsing function.

If you want to get access to the tokens directly, call tokenize() with a string; it'll return a list of tokens.

Note that the Syntax spec, and thus this parser, is extremely generic. It doesn't have any specific knowledge of CSS rules, just the core syntax, so it won't throw out invalid or unknown things.

Parsing Functions

Here's the full list of parsing functions. They do exactly what they say in their name, because they're named exactly the same as the corresponding section of the Syntax spec:

  • parseAStylesheet()
  • parseAStylesheetsContents()
  • parseABlocksContents()
  • parseARule()
  • parseADeclaration()
  • parseAComponentValue()
  • parseAListOfComponentValues()
  • parseACommaSeparatedListOfComponentValues()

Node Integration

parse-css.js uses the UMD module pattern, exporting the parser functions, the tokenize() function, and all of the classes used by the parser and tokenizer.

parse-css's People

Contributors

arnog, bkardell, danny0838, espadrine, fremycompany, gregable, kolodny, mathiasbynens, mkrhere, raynos, romainmenke, simonsapin, tabatkins, tromey


parse-css's Issues

DELIM parsing confusion?

I noticed that the asterisk in the following style declaration is parsed as a DELIM, whereas I had expected an IDENT:

dl { grid: * 10em "a b" "c d" 4em; }

I based this assumption on the following rule, in which the letter "a" is parsed as an IDENT:

sym1 { flow: a; }

Apologies if I'm off base here. I did look at the CSS3 Syntax spec, but I wasn't clear on the intended purpose of DELIM.

CommonJS support

Nice work on the CSS parser!

I'd love some sort of CommonJS support so I can make a Node.js-based CSS compressor. :)

Refactoring: consumeA<Function|...>(...) should be <Func|...>.consumeFrom(...)

I'm willing to make this change in my version of the codebase and open a PR on your project backporting only this change, but only if you accept the idea.

The reason is that I don't want to maintain a fork too different from your codebase, and this looks like a rather important change; I therefore won't take the risk of making the update on my side unless you commit to the idea first.

Would you be ok to merge such a pull request if I made one?

parseerror is defined two times, with conflicting signatures

function parseerror(s, msg) {
    console.log("Parse error at token " + s.i + ": " + s.token + ".\n" + msg);
    return true;
}

and

    var parseerror = function() {
        console.log("Parse error at index " + i + ", processing codepoint 0x" + code.toString(16) + ".");
        return true;
    };

`HashToken.type` is incorrect when it is an `id`

if(wouldStartAnIdentifier(next(1), next(2), next(3))) token.type = "id";

This clashes with the type that all tokens have and that distinguishes them.

Other HashToken instances have instance.type === 'HASH', but when the hash is an id, type is set to 'id' instead.


Before the recent refactor HASH was stored in tokenType and it didn't clash.

UMD broke example.html

The example.html page fails after updating to the latest. It looks like the breaking change was the introduction of UMD.

EOF in rule mode: behavior doesn't match IE/Chrome

The following code: "a { } b" outputs one CSS rule in IE, Chrome but two using your algorithm.

Proposed fix: defer the create/push call from line 112 (default: create(new SelectorRule) && switchto('selector') && reprocess();) to line 118 (case "{": switchto('declaration'); break;).

Update or remove package.json

The package on npm is 10 months old, as is the package.json in this repo. There have been numerous changes since then, so please either update the package or remove it from npm (and the package.json).

Thanks a lot!

Generating CSS source code from the rule tree.

Hey Tab,

I believe you're aware of vminpoly, which adds vw, vh, vmin & vmax unit functionality as a prollyfill.

After talking a bit with Brian Kardell, I'm working on extending the basic architecture to support pluggable prollyfills on top of your parser/tokenizer. To make it suit my needs, I modified tokenizer.js a bit so that it can generate CSS source code, adding a toSourceString() function to the various tokens' prototypes. For example (some Git diff output):

 DimensionToken.prototype.toString = function() { return "DIM("+this.num+","+this.unit+")"; }
 DimensionToken.prototype.toSourceString = function() { return this.num+this.unit; }
// ...
NumberToken.prototype.toString = function() {
    if(this.type == "integer")
        return "INT("+this.value+")";
    return "NUMBER("+this.value+")";
}
NumberToken.prototype.toSourceString = function() {
    if(this.type == "integer")
        return this.value;
    return this.value;
}

Perhaps we can add and maintain .toSourceString() directly in your repository. This addition could be generally useful so perhaps it does belong inside tokenizer.js

What do you think?

EDIT:
Well, it has been a while since I've looked at vminpoly and, after refamiliarizing myself with it, I've realized part of the CSS source code generation is in vminpoly itself. So .toSourceString() won't generate complete source code.

More so, I'm hooking into parts of the code generator to filter out unneeded parts of the rule tree, so a code generator embedded in your library won't be useful to me unless it can call back at different points during the code generation.

Still, perhaps it could be worthwhile to put a simplified version of vminpoly's CSS code generator into your parser for other people. Let me know if you're interested and I'll show you the relevant parts in my code.

And then again, perhaps a hookable code generator could be neat as well, though it would be more complex to implement in a generic reusable way.

Numeric values become "DIM(undefined,px)"

var tokenize = require('./tokenizer').tokenize;
var parse = require('./parser').parse;

str = 'div { left: 22px; }'
tokens = tokenize(str);
sheet = parse(tokens);

console.log(JSON.stringify(sheet,null,2));
{
  "type": "stylesheet",
  "value": [
    {
      "type": "selector",
      "selector": [
        "IDENT(div)",
        "WS"
      ],
      "value": [
        {
          "type": "declaration",
          "name": "left",
          "value": [
            "WS",
            "DIM(undefined,px)" /* <-- right here */
          ]
        }
      ]
    }
  ]
}

HASH - IDENT multiple matches

Giving the following rule:

#id {top:1px}

The tokenizer returns a HashToken for the "#id" selector. This does not seem to be the correct behavior. It should return an IDENT token, since according to http://www.w3.org/TR/css3-syntax/#tokenization :
"In case of multiple matches, the longest match determines the token."

Also, a HashToken may be invalid in a selector, because it can contain a digit as the first character, which is illegal in an ID selector.

I think that this should be returned:

#id {top:1px} - IDENT

#1id {top:1px} - HASH

Thank you.

Throw parse errors instead of just logging them to console

This way the parser could be used to validate properties and attributes, e.g.:

let isValidStyleAttributeValue = (value) => {
  try {
    CSSParser.parseAListOfDeclarations(value.trim());  
  }
  catch (error) {
    return false;
  }

  return true;
};

parser or example.html issues with nested at-rules

The parser token and JSON results appear different to me when parsing nested at-rules. I'm unsure whether the parser itself has issues parsing nested at-rules, or whether this is a limitation of the example.html demo.

Here is the CSS that I parsed:

@media screen {
	body {
		background-color: blue;
	}
}

@media all {
	@media screen {
		body {
			background-color: blue;
		}
	}
}

The tokens output looks correct, labeling two root at-rules and one nested at-rule:

AT(media) WS IDENT(screen) WS {
	WS IDENT(body) WS {
		WS IDENT(background-color) : WS IDENT(blue) ;
	WS }
WS }

WS AT(media) WS IDENT(all) WS {
	WS AT(media) WS IDENT(screen) WS {
		WS IDENT(body) WS {
			WS IDENT(background-color) : WS IDENT(blue) ;
		WS }
	WS }
WS }

The JSON output looks incorrect: the final nested at-rule is identified as a plain AT-KEYWORD token in the value of the 2nd root at-rule:

{
	"type": "STYLESHEET",
	"value": [
		{
			"type": "AT-RULE",
			"name": "media",
			"prelude": [
				{
					"token": "WHITESPACE"
				},
				{
					"token": "IDENT",
					"value": "screen"
				},
				{
					"token": "WHITESPACE"
				}
			],
			"value": {
				"type": "BLOCK",
				"name": "{",
				"value": [
					{
						"token": "WHITESPACE"
					},
					{
						"token": "IDENT",
						"value": "body"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"type": "BLOCK",
						"name": "{",
						"value": [
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "background-color"
							},
							{
								"token": ":"
							},
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "blue"
							},
							{
								"token": ";"
							},
							{
								"token": "WHITESPACE"
							}
						]
					},
					{
						"token": "WHITESPACE"
					}
				]
			}
		},
		{
			"type": "AT-RULE",
			"name": "media",
			"prelude": [
				{
					"token": "WHITESPACE"
				},
				{
					"token": "IDENT",
					"value": "all"
				},
				{
					"token": "WHITESPACE"
				}
			],
			"value": {
				"type": "BLOCK",
				"name": "{",
				"value": [
					{
						"token": "WHITESPACE"
					},
					{
						"token": "AT-KEYWORD",
						"value": "media"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"token": "IDENT",
						"value": "screen"
					},
					{
						"token": "WHITESPACE"
					},
					{
						"type": "BLOCK",
						"name": "{",
						"value": [
							{
								"token": "WHITESPACE"
							},
							{
								"token": "IDENT",
								"value": "body"
							},
							{
								"token": "WHITESPACE"
							},
							{
								"type": "BLOCK",
								"name": "{",
								"value": [
									{
										"token": "WHITESPACE"
									},
									{
										"token": "IDENT",
										"value": "background-color"
									},
									{
										"token": ":"
									},
									{
										"token": "WHITESPACE"
									},
									{
										"token": "IDENT",
										"value": "blue"
									},
									{
										"token": ";"
									},
									{
										"token": "WHITESPACE"
									}
								]
							},
							{
								"token": "WHITESPACE"
							}
						]
					},
					{
						"token": "WHITESPACE"
					}
				]
			}
		}
	]
}

Duplicate test case

The following two tests are identical:

parse-css/tests.js

Lines 331 to 335 in c7859c4

{
  parser: "",
  css: "null\\0",
  expected: [{type: "IDENT", value: "null\uFFFD"}, {type: "EOF"}],
},

parse-css/tests.js

Lines 336 to 340 in c7859c4

{
  parser: "",
  css: "null\\0",
  expected: [{type: "IDENT", value: "null\uFFFD"}, {type: "EOF"}],
},

I think this is an oversight of sorts; I see no value in having both. Am I missing something?

Inquiry: Should unicode escapes be idents

So I'm wanting to build a CSS injection validator and this seemed the best starting place.

So looking at these:
https://code.google.com/p/browsersec/wiki/Part1#Cascading_stylesheets

This:

color: expression(alert(1))

and:

color: expression\028 alert\028 1\029 \029

should be roughly equivalent; however, the output isn't:

    {
            "type": "FUNCTION",
            "value": [
              {
                "type": "FUNCTION",
                "value": [
                  {
                    "token": "NUMBER",
                    "value": 1,
                    "type": "integer",
                    "repr": "1"
                  }
                ],
                "name": "alert"
              }
            ],
            "name": "expression"
          },
          {
            "token": ";"
          },

and

          {
            "token": "IDENT",
            "value": "color"
          },
          {
            "token": ":"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "expression(alert"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "(1"
          },
          {
            "token": "WHITESPACE"
          },
          {
            "token": "IDENT",
            "value": "))"
          },
          {
            "token": ";"
          }

Ultimately, I'm looking for uniformity to simplify the injection-mitigation code. Is that something you would prefer me to fix before passing the CSS through the parser, or would it be simpler to resolve unicode escapes within the parser?

Parser Question

Firstly, thanks for releasing this. It, along with https://github.com/NV/CSSOM are the most readable JavaScript CSS parsers I've seen. Pardon my ignorance but I have two questions:

  • Why use charCodeAt in the tokenizer instead of just charAt? Wouldn't that make the code easier to read?
  • Why the separate tokenizer and parser? I know the aim is not speed at this stage; however, by keeping the two together there would be no need for two loops, and it would also be more obvious what is going on.

Again, thanks for a great library.
