htmlfastparse's Introduction

HTMLFastParse

When you need to transform complex HTML into rich NSAttributedStrings right now.

This library, written mostly in C with the NSAttributedString functions in Obj-C, is specially tuned for Reddit's HTML output and blows all other parsers out of the water. A warning: the parser itself takes no shortcuts, but I didn't implement any HTML tags that Reddit doesn't use, so this isn't a complete formatter. If you are using it with Reddit, you're in luck, because you'll have access to the following at lightning-fast speed:

  • Bold, italics, superscript, strikethrough, code blocks, quotes, links, etc
  • Tables (via the data URI)
  • Inline HTML entity decoding
  • An extensible architecture which allows you to add and remove tags as necessary for your application

Using it

HTMLFastParse is available as a Swift Package. Simply add this repo and import the library with #import <HTMLFastParse/HTMLFastParse.h>, and you're good to go! If you do not wish to use SPM, you may instead copy everything in the HTMLFastParse folder into your project and then import HTMLFastParse.h.

Once you've imported everything, initialize the engine with HFPFormatToAttributedString * formatter = [[HFPFormatToAttributedString alloc]init];

Note that you'll want to store this somewhere. Do not recreate the formatter every time! The formatter generates and caches a ton of fonts and colors when it is initialized, so your performance will be dismal if you recreate it every time you use it.

After that you can simply use [formatter attributedStringForHTML:<string>]; to get an attributed string out.

Benchmarks

Parsing and formatting this one thousand times on an iPhone X running iOS 11.2 took just 477.956 ms. The nearest neighbor, CocoaMarkdown, took 8497 ms, roughly 18 times as long. For a summary and comparisons against other engines, see my write-up.
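
As a rough illustration of how a figure like this could be measured, here is a minimal timing-loop sketch in C. It is not the harness used for the numbers above. The names sketch_workload, sketch_time_ms, and sketch_count_call are hypothetical and not part of HTMLFastParse, and clock() reports CPU time rather than wall-clock time, so treat the result as an approximation.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload signature: one parse-and-format pass. */
typedef void (*sketch_workload)(void *context);

/* Runs the workload `iterations` times and returns the elapsed CPU time
   in milliseconds, as reported by the standard clock() function. */
static double sketch_time_ms(sketch_workload work, void *context, int iterations)
{
    assert(work != NULL && iterations >= 0);

    clock_t begin = clock();
    for (int i = 0; i < iterations; i++) {
        work(context);
    }
    clock_t end = clock();

    return 1000.0 * (double)(end - begin) / CLOCKS_PER_SEC;
}

/* Placeholder workload so the sketch is complete: it only counts calls.
   A real benchmark would parse and format a fixed HTML string here. */
static void sketch_count_call(void *context)
{
    (*(int *)context)++;
}
```

Calling sketch_time_ms(sketch_count_call, &calls, 1000) with an int calls set to 0 runs the placeholder one thousand times. Swapping in a real parse-and-format pass is what would make the number meaningful.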

How it all fits together

A good way to get insight into the process and algorithms is to comment out #define printf(fmt, ...) (0) in C_HTML_Parser.c. This preprocessor definition disables printing, which matters for speed because printf is slow.
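
To show what that definition does, here is a small standalone sketch. Only the #define line comes from the README. The function sketch_count_tag_starts is a hypothetical example and does not exist in C_HTML_Parser.c.

```c
#include <assert.h>
#include <stdio.h>

/* The definition mentioned above: every printf call in this file is
   replaced by the constant expression (0), so nothing is printed and
   the arguments are never evaluated. Comment this line out to see the
   trace output again. */
#define printf(fmt, ...) (0)

/* Hypothetical helper, not part of the library: counts '<' bytes and
   traces each one it finds. */
static int sketch_count_tag_starts(const char *html)
{
    int count = 0;

    assert(html != NULL);
    for (int i = 0; html[i] != '\0'; i++) {
        if (html[i] == '<') {
            count++;
            printf("tag start at byte %d\n", i);
        }
    }
    return count;
}
```

Because the macro throws its arguments away, a trace call should not carry side effects the code relies on. Some compilers may warn about an unused expression where the silenced calls used to be.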

Normally you will only ever work with FormatToAttributedString. This class calls the much faster C functions below it, then takes the flattened style array they produce and applies it to a string to create the final attributed string. It is also where you configure the attributed string's appearance (font, color of quotes/code, etc.).

C_HTML_Parser: this file has two main functions.

  1. tokenizeHTML: This function takes a C string, an output buffer for human-readable text, and a tag buffer. In essence it reads through the input, separates tags from displayed text, and puts each into its respective buffer, decoding HTML entities along the way. The tags written to the tag buffer are of type t_tag, a C struct holding the contents of the first tag and the start and end positions of the tag. Note that the start and end positions are anchored to visible characters, not bytes. This doesn't matter if you're using pure ASCII, but certain characters like 'â' are actually made up of multiple bytes yet render as only one character. NSAttributedString treats them as single characters, so the ranges in the tags reflect that.
  2. makeAttributesLinear: This function takes a bunch of overlapping t_tags and flattens them into a one-dimensional set of t_format structs. The algorithm applies each tag's format to the characters it covers and then runs back over that formatted array to generate a final style state that can be fed easily into NSAttributedString, which doesn't really allow overlapping font styles. This function, along with t_format and addAttributeToString:(NSMutableAttributedString *)string forFormat:(struct t_format)format, is what you'd modify if you want to add new styles.
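
The two ideas above, visible-character positions and flattening overlapping tags, can be sketched in a few lines of C. This is a simplified illustration and not the library's code. The sketch_tag and sketch_run structs are hypothetical stand-ins for t_tag and t_format, whose real fields are not shown here. Counting one visible character per UTF-8 code point is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CHARS 256

/* Hypothetical style bits; the real t_format has its own fields. */
enum { SKETCH_BOLD = 1 << 0, SKETCH_ITALIC = 1 << 1 };

/* Stand-in for t_tag: a style and a range in visible characters
   (start inclusive, end exclusive). */
struct sketch_tag { unsigned style; size_t start; size_t end; };

/* Stand-in for t_format: one run of characters sharing one style. */
struct sketch_run { unsigned style; size_t start; size_t end; };

/* Counts visible characters in a UTF-8 string by skipping continuation
   bytes (10xxxxxx), so a multi-byte character counts once. */
static size_t sketch_visible_length(const char *utf8)
{
    size_t count = 0;

    assert(utf8 != NULL);
    for (size_t i = 0; utf8[i] != '\0'; i++) {
        if (((unsigned char)utf8[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Flattens overlapping tags into non-overlapping runs.
   Returns the number of runs written, or 0 if length is too large. */
static size_t sketch_flatten(const struct sketch_tag *tags, size_t ntags,
                             size_t length,
                             struct sketch_run *runs, size_t max_runs)
{
    unsigned per_char[SKETCH_MAX_CHARS] = {0};
    size_t nruns = 0;

    if (length > SKETCH_MAX_CHARS) {
        return 0;
    }

    /* Pass 1: stamp each tag's style onto every character it covers. */
    for (size_t t = 0; t < ntags; t++) {
        for (size_t c = tags[t].start; c < tags[t].end && c < length; c++) {
            per_char[c] |= tags[t].style;
        }
    }

    /* Pass 2: walk the stamped array and merge neighbours that share
       the same combined style into a single run. */
    for (size_t c = 0; c < length; c++) {
        if (nruns > 0 && runs[nruns - 1].style == per_char[c]) {
            runs[nruns - 1].end = c + 1;
        } else {
            if (nruns == max_runs) {
                break;
            }
            runs[nruns].style = per_char[c];
            runs[nruns].start = c;
            runs[nruns].end = c + 1;
            nruns++;
        }
    }
    return nruns;
}
```

For example, a bold tag over characters 0 to 4 and an italic tag over characters 2 to 6 in a seven-character string flatten into four runs: bold, bold plus italic, italic, and unstyled. Each run maps to a single attribute range, which is the shape NSAttributedString expects.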

If you have questions about implementing a new styling feature for your project and don't know what you need to change, submit an issue.

htmlfastparse's People

Contributors

ezhes

htmlfastparse's Issues

Support for Img tags

The library looks promising. However, I tested it with the <img> tags and it just failed. It returned empty results. Can you look into this please? @ezhes

The HTML I tested against is:

NSString *tableTest = @" \
<html> \
<head> \
<title>Page Title</title> \
</head> \
<body> \
 \
<h1>This is a Heading</h1> \
<p>This is a paragraph.</p> \
<p>Edit the code in the window to the left, and click \"Run\" to view the result.</p> \
<img src=\"https://static.scientificamerican.com/sciam/cache/file/4E0744CD-793A-4EF8-B550B54F7F2C4406.jpg\" alt=\"Avatar\" style=\"width:200px\"> \
 \
</body> \
</html> \
 \
";

Tables

HTML tables are not converted into anything at all

Lists

Lists are not parsed properly at all
