
gosub-engine's Introduction

Gosub: Gateway to Optimized Searching and Unlimited Browsing

This repository holds the Gosub browser engine. It will become a standalone library that can be used by other projects but will ultimately be used by the Gosub browser user-agent. See the About section for more information.

Join us at our development Zulip chat!

For more general information you can also join our Discord server.

If you are interested in contributing to Gosub, please check out the contribution guide!

                       _
                      | |
  __ _  ___  ___ _   _| |__
 / _` |/ _ \/ __| | | | '_ \
| (_| | (_) \__ \ |_| | |_) |
 \__, |\___/|___/\__,_|_.__/
  __/ |  The Gateway to
 |___/   Optimized Searching and
         Unlimited Browsing

About

This repository is part of the Gosub browser project. This is the main engine that holds the following components:

  • HTML5 tokenizer / parser
  • CSS3 tokenizer / parser
  • Document tree
  • Several APIs for connecting to JavaScript
  • Configuration store
  • Networking stack
  • Rendering engine
  • JS bridge
  • C Bindings

The idea is that this engine will receive some kind of stream of bytes (most likely from a socket or file) and parse it into a valid HTML5 document tree. From that point, the tree can be fed to a rendering engine that draws the document in a window, or to a simpler engine that renders it in a terminal. JS can be executed on the document tree, and the document tree can be modified by JS.

Status

This project is in its infancy. There is no usable browser yet. However, you can take simple HTML pages and parse them into a document tree.

We can parse HTML5 and CSS3 files into a document tree or the corresponding CSS tree. This tree can be shown in the terminal or rendered by a very unfinished renderer. Our renderer cannot render everything yet, but it can handle simple HTML pages.

We have already implemented other parts of the engine, such as a JS engine, a networking stack, and a configuration store, but these aren't integrated yet. You can try them out by running the respective binaries.

We can render part of our own site:

Gosub.io

Note: the borders are broken because of an issue with taffy (the layout engine we use). This will be fixed in the future.

How to run

Installing dependencies

This project uses cargo and rustup. First, install rustup. After installing rustup, run:

$ rustup toolchain install 1.73
$ rustc --version
rustc 1.73.0 (cc66ad468 2023-10-03)

Once Rust is installed, run this command to pre-build the dependencies:

$ cargo build --release

You can run the following binaries:

  • cargo run -r --bin gosub-parser (bin): the actual HTML5 parser/tokenizer, which converts HTML5 into a document tree.
  • cargo run -r --bin parser-test (test): a test suite for the parser that runs specific tests. It will be removed once the parser is completely finished, as this tool is for development only.
  • cargo run -r --bin html5-parser-test (test): a test suite that runs all html5lib tests for tree building.
  • cargo run -r --bin test-user-agent (bin): a simple placeholder user agent for testing purposes.
  • cargo run -r --bin config-store (bin): a simple test application of the config store, for testing purposes.
  • cargo run -r --bin css3-parser (bin): shows the parsed CSS tree.
  • cargo run -r --bin renderer (bin): renders an HTML page (WIP).
  • cargo run -r --bin run-js (bin): runs a JS file (note: console and event loop are not yet implemented).
  • cargo run -r --bin style-parser (bin): displays the HTML page's text with basic styles in the terminal.

You can then run the binaries like so:

cargo run -r --bin renderer file://src/bin/resources/gosub.html

To run the tests and benchmark suite, do:

make test
cargo bench
ls target/criterion/report
index.html

Wasm

Our engine can also be compiled to WebAssembly. You need wasm-pack for this. To build the Wasm version, run:

wasm-pack build

Browser in browser

Contributing to the project

We welcome contributions to this project, but at its current stage we spend a lot of time researching, building small proofs of concept, and figuring out what needs to be done next. Much of a contributor's time at this point will be spent on non-coding work.

We'd love to hear from you if you are interested in contributing; you can currently join us at our Zulip chat!

gosub-engine's People

Contributors

3kh0, blaumeise20, celisium, charleschen0823, cnsky1103, dependabot[bot], dibashthapa, eltociear, emwalker, ernest-rudnicki, glomdom, jaytaph, kaigidwani, kiyoshika, koopa1338, lucalewin, manish7017, mihna123, neuodev, psuedomoth, sarroutbi, sharktheone, ynewmark


gosub-engine's Issues

Implement a tree iterator

After prototyping a user-agent, we definitely need a tree iterator that will traverse the nodes in tree order.

We could rip out what I have in the DocumentHandle.query() method to create a separate TreeIterator, and replace the logic in DocumentHandle.query() with said iterator.

Then in the user agent, we would only need to do something like

while let Some(current_node) = tree_iterator.next() {
    // ...
}

instead of manually writing the tree traversal code.
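
The traversal itself can be prototyped independently of the engine. Below is a minimal sketch of a pre-order (tree-order) iterator over an arena-style tree; the `Tree` and `Node` shapes here are hypothetical stand-ins, not the engine's actual document types.

```rust
// Arena-style tree: nodes live in a Vec, children are indices into it.
struct Node {
    name: String,
    children: Vec<usize>, // indices into the arena
}

struct Tree {
    nodes: Vec<Node>,
}

// Pre-order (tree-order) iterator using an explicit stack of node ids.
struct TreeIterator<'a> {
    tree: &'a Tree,
    stack: Vec<usize>, // node ids still to visit
}

impl<'a> TreeIterator<'a> {
    fn new(tree: &'a Tree) -> Self {
        Self { tree, stack: vec![0] } // start at the root (id 0)
    }
}

impl<'a> Iterator for TreeIterator<'a> {
    type Item = &'a Node;

    fn next(&mut self) -> Option<Self::Item> {
        let id = self.stack.pop()?;
        let node = &self.tree.nodes[id];
        // Push children in reverse so the first child is visited next.
        for &child in node.children.iter().rev() {
            self.stack.push(child);
        }
        Some(node)
    }
}

fn main() {
    let tree = Tree {
        nodes: vec![
            Node { name: "html".into(), children: vec![1, 2] },
            Node { name: "head".into(), children: vec![] },
            Node { name: "body".into(), children: vec![] },
        ],
    };
    let order: Vec<&str> = TreeIterator::new(&tree).map(|n| n.name.as_str()).collect();
    assert_eq!(order, ["html", "head", "body"]);
}
```

Implementing Iterator also gives the `while let Some(...) = tree_iterator.next()` loop above for free, plus all the standard adapters (map, filter, collect).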

Trying to make the code more Rust idiomatic

Due to my lack of knowledge of Rust, a lot of code is probably not idiomatic. For instance, I use a lot of unwrap(), which I've since learned is not the best way to make things robust.

Since the codebase is small, we should try and see where we can fix these things.
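
As an illustration of the unwrap() point (the function names below are made up for this example), returning a Result instead of calling unwrap() moves the failure decision to the caller instead of panicking:

```rust
use std::num::ParseIntError;

// Hypothetical example: unwrap() turns bad input into a panic...
#[allow(dead_code)]
fn parse_width_unwrap(input: &str) -> u32 {
    input.trim_end_matches("px").parse().unwrap() // panics on "abc"
}

// ...while returning a Result lets the caller decide how to recover.
// Inside another Result-returning function, propagation is just `parse()?`.
fn parse_width(input: &str) -> Result<u32, ParseIntError> {
    input.trim_end_matches("px").parse()
}

fn main() {
    assert_eq!(parse_width("42px"), Ok(42));
    assert!(parse_width("abc").is_err()); // no panic; the error is a value
}
```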

parser: replace macros with functions

The parser uses a lot of macros which all take self as an argument. What do you think about replacing the macros with functions in the Html5Parser struct?
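
For illustration only (the engine's real macros differ), this is the general shape of such a replacement: a macro that takes self can usually become a plain method, which gains type checking, rustdoc, and normal compiler diagnostics.

```rust
// A macro that takes `self` as an argument...
macro_rules! current_token {
    ($self:expr) => {
        $self.tokens[$self.pos].clone()
    };
}

// Simplified stand-in for the parser struct mentioned above.
struct Html5Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Html5Parser {
    // ...can become a method with the same body and an explicit signature.
    fn current_token(&self) -> String {
        self.tokens[self.pos].clone()
    }
}

fn main() {
    let p = Html5Parser { tokens: vec!["<html>".to_string()], pos: 0 };
    // Both forms produce the same result; only the method is type-checked
    // at its definition site.
    assert_eq!(current_token!(p), p.current_token());
}
```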

Consider parsing expression grammars

I'm not sure if you are aware (you might be) of parsing expression grammars for parsing HTML or other types of code; they offer better performance and less room for errors than using regexes.

One library you can check is pest.

You could also consider html5ever and rust-cssparser. They are part of the servo project and are very mature already.

Specifying the MSRV

I was just wondering: what is the policy on the Minimum Supported Rust Version (MSRV) for this project? Is the plan to use the latest stable release, or would you rather be conservative for the sake of those using older toolchains? Would you welcome PRs fixing clippy::pedantic lints (which ones apply would depend on the MSRV)?

The Cargo.toml file doesn't seem to have a rust-version field in it.
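
For reference, declaring an MSRV is a one-line addition to the manifest. The version below is just the toolchain already mentioned in this README, not an official policy:

```toml
[package]
# ...existing fields...

# With this field set, cargo on an older toolchain fails fast with a clear
# "requires rustc 1.73 or newer" error instead of a confusing compile error.
rust-version = "1.73"
```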

There is still an issue where the tokenizer needs to call the parser

There is still an issue where the tokenizer needs to call the parser (it needs to know the namespace of the adjusted current node). Connecting the parser back to the tokenizer would create a cyclic dependency and makes this very hard to get right with the borrow checker. This, I think, is one of the reasons many implementations of an HTML5 parser actually have 3 actors: the tokenizer, the parser, and a tree builder (or tree sink). There, the parser tells the tree builder to add/retrieve nodes, and the tokenizer can ask the tree builder for additional information.

Now, in order to not completely rewrite the parser (yet), I opt for just adding the information (the adjusted current node's namespace) to the tokenizer call in the parser. This is a bit of an ugly hack, but it works for now, since we know that no other data from the parser is needed, and it saves a lot of time rewriting the parser.

Originally posted by @jaytaph in #182 (reply in thread)

research spidermonkey / mozjs-rust

  • can mozjs be easily compiled into the current gosub project
  • do a simple eval of inline code
  • load a js file and execute it
  • create a simple "document" object that has a "write" function, that will output any content to the output
  • catch errors (incl syntax errors) from the javascript engine so we can use them in gosub

Improve Node initialization

As discussed in this PR, the initialization of Nodes is kind of tedious because there is no default implementation, and it's not obvious what the default implementation should look like. For example, there is no obvious candidate for NodeData.

My suggestion would be to add a new constructor with parameters that can't be decided in a default implementation and then we could use short struct initialization as follows:

impl Node {
    pub fn new(data: NodeData, document_handle: &DocumentHandle) -> Self {
        Self {
            id: NodeId::default(),
            parent: None,
            children: vec![],
            data,
            name: String::new(),
            namespace: None,
            document: Document::clone(document_handle),
            is_registered: false,
        }
    }

    /// Create a new element node with the given name and attributes and namespace
    #[must_use]
    pub fn new_element(
        document: &DocumentHandle,
        name: &str,
        attributes: HashMap<String, String>,
        namespace: &str,
    ) -> Self {
        let data = NodeData::Element(Box::new(ElementData::with_name_and_attributes(
                NodeId::default(),
                Document::clone(document),
                name,
                attributes,
            )));

        Node {
            name: name.to_owned(),
            namespace: Some(namespace.into()),
            ..Self::new(data, document)
        }
    }
}

As there are not many developers around (yet?), input of everybody is appreciated.

I have set up a Mattermost server for testing at https://chat.developer.gosub.io, to check if this is a viable option. If not, we can always check other systems. I have heard good stories about Mattermost, but I have no experience with Matrix. It might be an alternative as well, but we have to choose something.

Originally posted by @jaytaph in #75 (reply in thread)

change file structure for node related files

I've noticed that there are 4 different files starting with node in the html5_parser module. It seems like these could be grouped into a folder called node or nodes.

I would suggest something like this:

html5_parser/
├─ .../
├─ node/
│  ├─ data/
│  │  ├─ comment.rs
│  │  ├─ document.rs
│  │  ├─ element.rs
│  │  ├─ mod.rs
│  │  ├─ text.rs
│  ├─ arena.rs
│  ├─ mod.rs

If you want, I could create a PR for this.

Handling class names in a node

Creating an issue for myself to work on this early this week (with the suggestions by @emwalker to tweak the interface a bit)

Discussed in #73

Originally posted by Kiyoshika October 1, 2023
I know there was someone interested in working on the CSS parser (not sure at what point, but also low priority right now) which made me think about the current ability to handle class names in an element.

I think the next contribution I planned to make was adding a couple utilities for checking, adding/removing classes from an element.

For instance,

<div class="one two three"></div>

This element has three classes, but we could also programmatically (through javascript later) add/remove/toggle classes.

My proposal is adding a new struct

pub struct ElementClass {
    /// a map of classes: key = name, value = is_active
    /// the is_active is used to toggle a class (JavaScript API)
    class_map: HashMap<String, bool>
}

impl ElementClass {
    /// Check if a class name exists
    pub fn contains(&self, name: String) -> bool;

    /// Add a new class (does nothing if it already exists)
    pub fn add(&mut self, name: String);

    /// Remove a class (does nothing if it doesn't exist)
    pub fn remove(&mut self, name: String);

    /// Toggle a class active/inactive
    pub fn toggle(&mut self, name: String);

    /// Explicitly set whether a class is active
    pub fn set_active(&mut self, name: String, is_active: bool);
}

If I'm not mistaken, in the current implementation of the parser (as of 01 Oct 23), class would be an attribute. My thought is that after a tag is closed (in the above example, when </div> is reached), we take the value of the class attribute and use it to construct the ElementClass instance, which can be stored in the Node as a property:

pub struct Node {
    // ...
    pub classes: ElementClass,
    // ...
}

Then we'd be able to use it when we're trying to apply CSS classes, something like:

if node.classes.contains("one".to_string()) {
    // apply class logic
}

Are there any thoughts/suggestions/issues on this before I start the work on it?
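
A runnable sketch of the proposed ElementClass (using &mut self on the mutating methods and &str parameters for convenience; the from_attribute helper is a hypothetical way to build it from a class attribute value, not part of the proposal above):

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct ElementClass {
    /// a map of classes: key = name, value = is_active
    class_map: HashMap<String, bool>,
}

impl ElementClass {
    /// Check if a class name exists
    pub fn contains(&self, name: &str) -> bool {
        self.class_map.contains_key(name)
    }

    /// Add a new class (does nothing if it already exists)
    pub fn add(&mut self, name: &str) {
        // entry() makes "insert if absent, otherwise leave alone" one line
        self.class_map.entry(name.to_owned()).or_insert(true);
    }

    /// Remove a class (does nothing if it doesn't exist)
    pub fn remove(&mut self, name: &str) {
        self.class_map.remove(name);
    }

    /// Toggle a class active/inactive
    pub fn toggle(&mut self, name: &str) {
        if let Some(active) = self.class_map.get_mut(name) {
            *active = !*active;
        }
    }

    /// Build from a space-separated `class` attribute value like "one two three"
    pub fn from_attribute(value: &str) -> Self {
        let mut classes = Self::default();
        for name in value.split_whitespace() {
            classes.add(name);
        }
        classes
    }
}

fn main() {
    let mut classes = ElementClass::from_attribute("one two three");
    assert!(classes.contains("one"));
    classes.remove("two");
    assert!(!classes.contains("two"));
}
```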

prior art

I fully support you making a new browser, but are you really not reusing anything? Surely you could use an existing HTML parser and build your own browser on top of that? Can you explain your rationale here?

How will this browser "push back" against negative news?

After reading this post, I feel like something is missing.

"...I do like to follow tech news...but I started to get the same feeling I had ten years ago: it’s all negative." "I am going to write a browser. For two reasons: I want this to be a way to push back. Just a tiny amount."

I am not getting the meat of the vision here: how this new browser will push back against negative news. Will it be a positive-news-only browser? Or will it block access to deceptive, negative, or otherwise unhealthy content? Or something else?

Use Rust's built-in support for integration tests

Currently this project is compiling binaries for integration tests under src/bin. It turns out Rust has a mechanism for dealing with integration tests. More here.

Suggestion: create a top-level tests directory and move the parser_test and tokenizer_test files over to it.


Good strategy for a frontend

I was wondering what the initial architecture for the frontend/UI could look like.

Should we create a native GUI for each major OS, or write once, run everywhere like Electron (you said you didn't want that)?

I could have a look at a simple macOS GUI in the next few days, so there could be something clickable.

Consider forking Chromium

Writing a fully standards-compliant browser is no trivial task. It's probably a 20+ year task even with Google's resources; you can see how that's going.

Consider forking Chromium and making something like Vivaldi if privacy is your goal.

Splitting up `Node.data` into structs

Discussed in #108

Originally posted by Kiyoshika October 6, 2023
After some discussion with @emwalker, we briefly talked about the idea of redesigning current structure of Node.data.

Currently we are storing data directly in the NodeData enum like so:

/// Different type of node data
#[derive(Debug, PartialEq, Clone)]
pub enum NodeData {
    Document,
    Text {
        value: String,
    },
    Comment {
        value: String,
    },
    Element {
        name: String,
        attributes: HashMap<String, String>,
    },
}

This gets a little messy when we start adding methods that only apply to a particular type of node. For example, when I introduced all the attribute methods, we got these nasty checks in every method:

if self.type_of() != NodeType::Element {
    return Err(ATTRIBUTE_NODETYPE_ERR_MSG.into());
}

This will only get worse as we add more methods specific to different node types (text nodes, elements, and any others in the future).

I did some brainstorming tonight and have a proposal:

We create different structs for each node type and wrap that in the enum:

pub enum NodeData {
    Document(DocumentNodeData),
    Text(TextNodeData),
    Comment(CommentNodeData),
    Element(ElementNodeData)
}

and the construction will be changed to (for example on the Element type):

pub fn new_element(args) -> Self {
    // ... other stuff
    data: NodeData::Element(ElementNodeData::new(args)),
    // ... other stuff
}

The dedicated structs will have their specific methods (this has the advantage of not polluting the Node struct as well):

use std::collections::HashMap;

#[derive(Debug, PartialEq, Clone)]
pub struct ElementNodeData {
    pub name: String,
    pub attributes: HashMap<String, String>,
}

impl Default for ElementNodeData {
    fn default() -> Self {
        Self::new()
    }   
}

impl ElementNodeData {
    pub fn new() -> Self {
        ElementNodeData {
            name: "".to_string(),
            attributes: HashMap::new(),
        }   
    }   

    // note that this no longer returns a Result<> like it does currently since it's no longer needed.
    pub fn insert_attribute(&mut self, name: &str, value: &str)  {
        // implementation without the nasty type check shown earlier
    }   

    // other methods specific to Element
}

Then when it comes to actual usage (for example, fetching a node and adding an attribute; side note: I'm writing this by hand and not actually compiling, so there are likely errors in the syntax below):

if let Some(node) = document.get_node_by_id_mut(NodeId::from(4)).expect("node") {
    // if fetched node is not Element type, nothing happens.
    // optionally, we could log a warning in an else clause
    if let NodeData::Element(element) = node.data {
        element.insert_attribute("class", "hello world");
    }
}

In current state, it looks more like the following:

if let Some(node) = document.get_node_by_id_mut(NodeId::from(4)).expect("node") {
    let result = node.insert_attribute("class", "hello world");
    // result = Err() if type is not Element. Could probably use "if let Ok(_) = node.insert_..."
}

The current implementation of insert_attribute is:

/// Add or update an attribute
pub fn insert_attribute(&mut self, name: &str, value: &str) -> Result<(), String> {
    if self.type_of() != NodeType::Element {
        return Err(ATTRIBUTE_NODETYPE_ERR_MSG.into());
    }   

    if let NodeData::Element { attributes, .. } = &mut self.data {
        attributes.insert(name.to_owned(), value.to_owned());
    }   

    Ok(())
} 

But with the proposed approach could be simplified to:

// NOTE: this method would be inside ElementNodeData and no longer Node
pub fn insert_attribute(&mut self, name: &str, value: &str) { // <-- no longer returning Result<> because it's now unneeded
    self.attributes.insert(name.to_owned(), value.to_owned());
}

I think this would help remove bloat in the Node struct both now and in the future as well as significantly simplify the methods by removing the boilerplate type checks.

This would require a bit of rework so I wanted to have an open discussion before I started any serious work on it. If we are good with this idea, I will open an issue based off this discussion and assign it to myself.

License discussion

Basically, because I have no ideas about the legal stuff, I pretty much always use the MIT license for all projects I create. I'm wondering if this is the right license, or if there are things to consider?

My goals for this project:

  • source should be available for anyone to see and use and modify to their own needs.
  • source should be available to use in their own custom user-agent (ie: the gosub-engine being used in a 3rd party browser)
  • I would not like for companies to make money off the code (thank you for your free stuff, now we're gonna repackage it and make lots of money from it).

The problem with the last part, I guess, is that it would be difficult to establish this without restricting goals 1 and 2 too much. What about a (paid) application that uses the gosub-engine for displaying HTML markup?

I'm curious what would be the best way to deal with this, as I can imagine it would be much harder to change the licensing later on if the project gets a lot bigger.

Split browser components into their separate repos?

Currently the "browser" is nothing more than an HTML5-to-node-tree parser, or at least, that is what it will be. The library it generates is gosub-engine, but this repo also contains a few tools (the `[parser|tokenizer]-test` binaries) and a "browser" that should do nothing more than read a URL and parse the HTML5 tree.

If we move to a separate organisation (see #17), we could rename this repo to "gosub-engine", and have a separate "gosub-browser", "gosub-text-browser" etc, to split the different components. This would also mean that it would be easier to use the gosub-engine for anyone else for their own usage.

I'd like to get some input from others about this.

Move plotting of trees into a separate tree plotter structure

At the moment there is code inside the nodes / arena that displays node trees. This is not the place for this.
Instead, we should have a separate tree plotter that gets a document tree as input, and can output the tree itself. This separates the output from the functionality.

Set up an issue list for contributors

For (new) contributors, it would be helpful if there was a list of work that needs to be done. We should add labels to give a sense of how complex each issue is and what area it touches.

Compute expected_y in render_tree_test

It'll be good at some point in the future, in gosub-bindings/tests/render_tree_test.c, to use the margins to compute the expected_y position instead of doing the math manually. This will make the tests more robust if we change margins etc. in the engine.

Implement box layouts

This needs a bit more research, but need to find a way to incorporate the CSS box layout system into the render tree (margin -> border -> padding -> content)

Can maybe take a look at taffy for this

Splitting of textTokens on special chars

The parser sometimes needs to know whether a given text token is "whitespace", NULL, or any other special char.

However, to improve performance, text tokens are captured as long as possible. So the literal text FOOBAR will become TextToken("FOOBAR"), and not 6 TextTokens: ("F", "O", "O", "B", "A", "R").

The problem is when we need to check if the given text token contains NULL or whitespace, we need to split these tokens so we can actually test this.

So for instance, with the text "FOO_BAR", the tokenizer should emit 3 tokens: Text["FOO"], Text[" "], Text["BAR"].

Multiple whitespaces can be grouped together, and multiple NULL values can be grouped together.

Token.is_null() should return true when the token ONLY consists of 1 or more NULL values, and Token.is_whitespace() must do the same for whitespaces.

Note that it might be possible that both whitespaces and NULL values should be separate, so three spaces in the text will result in 3 tokens with each 1 space. We have to check to see if the parser handles this properly.
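
A minimal sketch of the grouping-split described above (the token type is a hypothetical stand-in, and whether whitespace/NULL runs should stay grouped or be emitted one per char is the open question noted above; this version groups them):

```rust
// Classify each char so consecutive chars of the same kind can be grouped.
#[derive(Debug, PartialEq)]
enum Kind {
    Whitespace,
    Null,
    Text,
}

fn kind_of(c: char) -> Kind {
    match c {
        '\u{0}' => Kind::Null,
        c if c.is_whitespace() => Kind::Whitespace,
        _ => Kind::Text,
    }
}

/// Split a text run into grouped pieces: "FOO BAR" -> ["FOO", " ", "BAR"],
/// and runs of the same kind stay together: "A   B" -> ["A", "   ", "B"].
fn split_text(input: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut last: Option<Kind> = None;
    for c in input.chars() {
        let k = kind_of(c);
        if last.as_ref() == Some(&k) {
            // Same kind as the previous char: extend the current piece.
            out.last_mut().unwrap().push(c);
        } else {
            // Kind changed: start a new piece.
            out.push(c.to_string());
            last = Some(k);
        }
    }
    out
}

fn main() {
    assert_eq!(split_text("FOO BAR"), vec!["FOO", " ", "BAR"]);
    assert_eq!(split_text("A   B"), vec!["A", "   ", "B"]);
}
```

With this shape, Token::is_null() / is_whitespace() only need to check the kind of a whole piece, since a piece never mixes kinds.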

`TreeBuilder.insert_attribute` issues

While going through some code cleanup, I noticed a couple of issues with the current insert_attribute implementation:

  1. The attribute value validation is happening in the wrong spot. Below are the first few lines of the method:
/// Inserts an attribute to an element node.
/// If node is not an element or if passing an invalid attribute value, returns an Err()
fn insert_attribute(&mut self, key: &str, value: &str, element_id: NodeId) -> Result<()> {
    if !self.get().validate_id_attribute_value(value) {
        return Err(Error::DocumentTask(format!(
            "Attribute value '{}' did not pass validation",
            value
        )));
    }
// ...
// ...

It's validating ALL attribute values as if they were ID attributes, which may not be correct...

  2. When inserting an ID attribute into the DOM tree, it currently does not check whether the current element_id already has an ID attribute that would need to be removed from the DOM tree (to prevent a dangling ID reference).

Ideally, if we have <div id="myid">, the ID myid is tied to this node ID. But if this changes (e.g., through JavaScript) to something like otherid, the old myid is still pointing to this node when it should be removed.

One thing to be careful about is that the element's attributes are updated before modifying the DOM. I think the order of these operations needs to switch. Below is the current implementation:

if let Some(node) = self.get_mut().get_node_by_id_mut(element_id) {
    if let NodeData::Element(element) = &mut node.data {
        // TODO: don't update the element's attributes until AFTER making changes in DOM.
        element.attributes.insert(key.to_owned(), value.to_owned());
    } else {
        return Err(Error::DocumentTask(format!(
            "Node ID {} is not an element",
            element_id
        )));
    }
} else {
    return Err(Error::DocumentTask(format!(
        "Node ID {} not found",
        element_id
    )));
}

// special cases that need to sync with DOM
match key {
    "id" => {
        // if ID is already in use, ignore
        if !self.get().named_id_elements.contains_key(value) {
            // TODO: fetch the current element's ID value (if any) and remove it from the DOM.
            // Then we can insert value.to_owned() as below and also update the element's attribute afterwards
            self.get_mut()
                .named_id_elements
                .insert(value.to_owned(), element_id);
        }
    }

Incorporate CSSOM into RenderTree

This is more long term, since the CSSOM doesn't exist yet, but when it does, we'll need to leverage it to compute some of the styles and manipulate the box layout.

Error when slicing a string at an arbitrary byte index without considering character boundaries

I ran into an issue while running the parser on a Wikipedia page.

thread 'main' panicked at src/html5_parser/tokenizer/character_reference.rs:331:49:
byte index 32 is not a char boundary; it is inside '' (bytes 30..33) of `amp; Company</a>. pp.&#160;338–3`

Reproduction Link:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=be21cf401750d068dbeea6327310388b

I was able to fix this issue by using the .char_indices() method.

fn find_entity(&mut self) -> Option<String> {
    let s = self.stream.look_ahead_slice(*LONGEST_ENTITY_LENGTH);
    for (i, _) in s.char_indices().rev() {
        if TOKEN_NAMED_CHARS.contains_key(&s[0..i]) {
            // Move forward with the number of chars matching
            // self.stream.skip(i);
            return Some(String::from(&s[0..i]));
        }
    }
    None
}

However, that failed 30 test cases.

failures:
  html5_parser::tokenizer::character_reference::tests::entity_100
  html5_parser::tokenizer::character_reference::tests::entity_102
  html5_parser::tokenizer::character_reference::tests::entity_103
  html5_parser::tokenizer::character_reference::tests::entity_104
  html5_parser::tokenizer::character_reference::tests::entity_106
  html5_parser::tokenizer::character_reference::tests::entity_109
  html5_parser::tokenizer::character_reference::tests::entity_200
  html5_parser::tokenizer::character_reference::tests::entity_204
  html5_parser::tokenizer::character_reference::tests::entity_208
  html5_parser::tokenizer::character_reference::tests::entity_209
  html5_parser::tokenizer::character_reference::tests::entity_210
  html5_parser::tokenizer::character_reference::tests::entity_211
  html5_parser::tokenizer::character_reference::tests::entity_214
  html5_parser::tokenizer::character_reference::tests::entity_217
  html5_parser::tokenizer::character_reference::tests::entity_220
  html5_parser::tokenizer::character_reference::tests::entity_222
  html5_parser::tokenizer::character_reference::tests::entity_224
  html5_parser::tokenizer::character_reference::tests::entity_226
  html5_parser::tokenizer::character_reference::tests::entity_228
  html5_parser::tokenizer::character_reference::tests::entity_230
  html5_parser::tokenizer::character_reference::tests::entity_232
  html5_parser::tokenizer::character_reference::tests::entity_234
  html5_parser::tokenizer::character_reference::tests::entity_236
  html5_parser::tokenizer::character_reference::tests::entity_238
  html5_parser::tokenizer::character_reference::tests::entity_240
  html5_parser::tokenizer::character_reference::tests::entity_242
  html5_parser::tokenizer::character_reference::tests::entity_244
  html5_parser::tokenizer::character_reference::tests::entity_246
  html5_parser::tokenizer::character_reference::tests::entity_248
  html5_parser::tokenizer::character_reference::tests::entity_250

I want to fix this issue and submit a PR, and would appreciate any suggestions or advice.

Thank You

Short term plans

My idea of the short term actions would be this:

  • Make sure the tokenizer can tokenize html5 input into tokens (completed, most tests passing)
  • Make sure the parser can parse the tokens into a node-tree (in progress, some tests passing)
  • Have a different system (the user-agent (UA)) that receives the node-tree in some form, and prints this onto the screen (text mode, maybe graphical).
  • Analyze the current code and see if we can fix all the beginner Rust mistakes I've made.
  • Get the CI tests up and running, and make sure they pass the standard tests (clippy, fmt, tests etc) (completed)
  • Generate a DOM tree from the given node tree (or maybe better, convert the node-tree so it can be used as the DOM tree).
  • Optional: start with the CSS parser (relatively easy compared to the html5 parser?)

Adding more / better tests

One thing I have noticed is that functions do not always do what you expect them to do.

For instance, the is_in_scope() function should return true or false depending on whether an element is present on the open_elements stack. This should be unit-tested, as there is currently no guarantee that the function works properly (and apparently, it does not).

Add all tokenizer tests from `css-parsing-tests`

To ensure that our CSS3 tokenizer adheres to the CSS3 specifications, we should include all tests from the css-parsing-tests repo.

I have already added a lot of tests, and the tokenizer is passing them all, but it is best to add them all to ensure we are covering all bases.

I will add some here and there, but for the time being, I will focus on the CSS3 parser.

Writing tests should be simple: translating a test from `css-parsing-tests` into our own test suite is mostly mechanical.

For the tokenizer, we are only interested in the tests in the `component_value_list.json` file.

Create Engine API to expose

Currently the RenderTree is exposed directly, but we should instead expose a general Engine API that user agents can use. This should use the DNS system recently implemented by @jaytaph.

Cargo build is broken

Doing a fresh clone + `cargo build` on the latest macOS results in the following errors:

error[E0609]: no field `root` on type `(&gosub_engine::html5_parser::parser::document::Document, Vec<ParseError>)`
  --> src/bin/gosub-browser.rs:28:50
   |
28 |     println!("Generated tree: \n\n {}", document.root);
   |                                                  ^^^^

For more information about this error, try `rustc --explain E0609`.
error: could not compile `gosub-engine` (bin "gosub-browser") due to previous error
warning: build failed, waiting for other jobs to finish...
warning: unused variable: `parse_errors`
   --> src/bin/parser_test.rs:150:20
    |
150 |     let (document, parse_errors) = parser.parse();
    |                    ^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_parse_errors`
    |
    = note: `#[warn(unused_variables)]` on by default

warning: variants `Success`, `Failure`, and `PositionFailure` are never constructed
   --> src/bin/parser_test.rs:224:5
    |
223 | enum ErrorResult {
    |      ----------- variants in this enum
224 |     Success,            // Found the correct error
    |     ^^^^^^^
225 |     Failure,            // Didn't find the error (not even with incorrect position)
    |     ^^^^^^^
226 |     PositionFailure,    // Found the error, but on an incorrect position
    |     ^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: function `match_error` is never used
   --> src/bin/parser_test.rs:304:4
    |
304 | fn match_error(got_err: &Error, expected_err: &Error) -> ErrorResult {
    |    ^^^^^^^^^^^

error[E0027]: pattern does not mention fields `is_self_closing`, `attributes`
   --> src/bin/tokenizer_test.rs:251:9
    |
251 |         Token::EndTagToken{name} => {
    |         ^^^^^^^^^^^^^^^^^^^^^^^^ missing fields `is_self_closing`, `attributes`
    |
help: include the missing fields in the pattern
    |
251 |         Token::EndTagToken{name, is_self_closing, attributes } => {
    |                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
help: if you don't care about these missing fields, you can explicitly ignore them
    |
251 |         Token::EndTagToken{name, .. } => {
    |                                ~~~~~~

error[E0308]: mismatched types
   --> src/bin/tokenizer_test.rs:247:53
    |
247 |             if check_match_starttag(expected, name, attributes, is_self_closing).is_err() {
    |                --------------------                 ^^^^^^^^^^ expected `Vec<Attribute>`, found `HashMap<String, String>`
    |                |
    |                arguments to this function are incorrect
    |
    = note: expected struct `Vec<Attribute>`
               found struct `HashMap<std::string::String, std::string::String>`
note: function defined here
   --> src/bin/tokenizer_test.rs:276:4
    |
276 | fn check_match_starttag(expected: &[Value], name: String, attributes: Vec<Attribute>, is_self_closing: bool) -> Result<(), ()> {
    |    ^^^^^^^^^^^^^^^^^^^^                                   --------------------------

Some errors have detailed explanations: E0027, E0308.
For more information about an error, try `rustc --explain E0027`.
error: could not compile `gosub-engine` (bin "tokenizer_test") due to 2 previous errors
warning: `gosub-engine` (bin "parser_test") generated 3 warnings (run `cargo fix --bin "parser_test"` to apply 1 suggestion)

Cargo version

 ~/Projects/gosub-browser/ [main] cargo --version
cargo 1.72.1 (103a7ff2e 2023-08-15)
