Git Product home page Git Product logo

Comments (10)

svgeesus avatar svgeesus commented on August 20, 2024

We should explain, and justify, each of these things if we can. Also discuss alternatives and why they are better/worse.

from ift.

garretrieger avatar garretrieger commented on August 20, 2024

We've documented the rationale for most of the technical decisions in the patch subset design doc found here

I'll add some more specific commentary on the things called out in this issue:

Set Encoding

Having a high efficiency integer set encoding method is a key technology that is required to enable incremental transfer. Since every request must describe two sets, the set of codepoints the client has and the set that they need. Consider a CJK font with 20 thousand codepoints. If the client currently has half of the codepoints then on each request it would need to encode a set of codepoints with 10,000 elements. If these were naively encoded with a basic variable length integer encoding (let's optimistically assume only 1 byte per element) that would take 10kb to encode the request. This is not acceptable for the request size.

So given that I did a thorough investigation of how to most efficiently encode integer sets here: https://docs.google.com/document/d/1yG6EycLLHHjtlT6mAeoO3s99M9cxIR7eavGvtxWBlKk/edit?usp=sharing
and followup improvements here:
https://github.com/w3c/PFE-analysis/blob/main/results/set_encoding_branch_factor.md

Performance was determined by way of simulations of encoding randomly generated codepoint sets (following character frequency distribution per language).

This study looks at many different possible encodings, including range coding an entropy encoding. Ultimately it found the novel sparse bit set encoding significantly outperforms the other considered methods. Given that set encoding sizes has a very direct impact on performance of the protocol it was decided to proceed with the novel, but more efficient encoding. I am not aware of any other existing standardized integer set encodings that give similar levels of performance as our currently chosen sparse bit set encoding method.

Checksum Algorithm

The checksum algorithm is not novel, as mentioned in the spec we are using an existing algorithm called fast hash. fast hash hasn't been standardized though, which is why the spec includes code for the fast hash algorithm. fast hash is in use in several other places, see widely used in industry and academia.

Fast hash was chosen for it's performance. Incxfer does not require a cryptographic grade checksum algorithm and commonly used cryptographic hashes like SHA are too slow for our purposes. Fast hash provides a good quality hash, while being very fast. The rationale for choosing fast hash is found here. Note this is a bit outdated and mentions that we chose FarmHash, however, we ultimately switched to fast hash as it has very similar characteristics but has a much simpler implementation.

from ift.

svgeesus avatar svgeesus commented on August 20, 2024

@martinthomson does that resolve your comments?

from ift.

martinthomson avatar martinthomson commented on August 20, 2024

Not entirely. The integer set encoding stuff seems solid (though I would prefer to see that integrated into an existing format as the content of a binary field.

The checksum algorithm here seems to be a hash function (that is, a function that is optimized for collision avoidance, usually used in hash tables) and not a checksum function. Is there any analysis that suggests that this algorithm is good at detecting the sort of corruption that might occur? Is there any evidence to suggest that you need this sort of detection, given that most data is exchanged over TLS, which provides very strong protection against corruption?

from ift.

garretrieger avatar garretrieger commented on August 20, 2024

The checksums we use aren't to detect data corruption but instead to make sure the client and server have the same state and are operating on the same original font file. For example the original font checksum is sent by the server to the client on the first response and then subsequent requests from that client will include the checksum. The server will use this to make sure that the original font has not changed (eg. updated to a new version of the font) since the clients first request. Like wise the base_checksum sent by the client is used to ensure the server recreates the same state that the client has. This guards against changes to the font subsetting library run by the server (for example after an update to the subsetter library it now produces a different binary result for a particular font subset).

So collision avoidance (and speed) is the main property we're interested in. fasthash has no collisions in the smhasher test suite: https://github.com/rurban/smhasher/blob/master/doc/fasthash64.txt and the two documented quality issues aren't a problem for this use case:

  • Requiring word alignment of the input: just needs to be handled by the implementer.
  • Seed exposure: seed is already known as it's hardcoded in the specification.

from ift.

svgeesus avatar svgeesus commented on August 20, 2024

@martinthomson does that clarification on the use of checksums in IFT resolve your comments?

from ift.

svgeesus avatar svgeesus commented on August 20, 2024

@martinthomson is this comment now resolved to your satisfaction?

from ift.

svgeesus avatar svgeesus commented on August 20, 2024

I see that this remaining issue from @martinthomson is that "the hash function is good at collision avoidance but poor at detecting corruption, which is rare" and our answer is "this is not for detecting corruption, it is for collision avoidance" so I believe we are actually in agreement and the comment arises from misunderstanding of the intent of the specification.

from ift.

martinthomson avatar martinthomson commented on August 20, 2024

OK, then I'm OK with where you are. I just wanted to ensure that these items were clear. They are now clearer to me and I am no longer a good judge of whether they are clear to a new reader :)

from ift.

svgeesus avatar svgeesus commented on August 20, 2024

Thanks!

from ift.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.