Make `parseUrlHttp[s]` take in a `Text` value about req HOT 3 CLOSED

mrkkrp commented on August 28, 2024

Make `parseUrlHttp[s]` take in a `Text` value

from req.

Comments (3)

tysonzero commented on August 28, 2024

The current implementation seems to already basically do the right thing on UTF-8 encoded ByteString, trying to leave the URL alone except when it's truly invalid (a raw π half-way through) and then URL-encoding those invalid characters.

It should just be more transparent about the fact that it works on unicode/utf-8 by working on Text directly, helping avoid a variety of errors in the process.

from req.

mrkkrp commented on August 28, 2024

ByteString semantically represents a sequence of 8 bit octets, not a series of ASCII characters, and URLs are semantically sequences of [ASCII] characters. You can see this distinction when looking at how Char8 is too big to store only ASCII, and with how often ByteString is used to store non-ASCII data, and with the non-ASCII encoding typically used when outputting to stdout.

Even so, Text is composed of elements which are even "bigger" allowing all sorts of Unicode in it. Thus it's hardly a better option?

The type which we take as the argument of parseUrlHttp(s) function is dictated by the http-types package because we build on functions like decodePathSegments and parseQueryText, both of which take ByteStrings and make some Text pieces out of it. If you want to change this, then perhaps the right place to start is http-types issue tracker.

Anyway, AFAIU, both Text and ByteString are not perfect:

In case of ByteString it's 1) have elements which are too big 2) encoding issues may introduce bugs.
In case of Text it's 1) even bigger with Unicode symbols which cannot appear un-escaped in URLs (correct me if I'm wrong, I may have forgotten the subtleties).

Since both options are not ideal and http-types uses ByteString I think we should stick to it.

from req.

tysonzero commented on August 28, 2024

The flaws in the types are more or less what you said, but I would adjust the phrasing slightly to show why Text really is much less problematic than ByteString:

The two main issues are:

Encoding issues: It's relatively easy to have bugs that slip through the type checker if you just use the wrong encoding/decoding function. This issue is unique to ByteString.
Too many codepoints: Suffered by both ByteString and Text, but I would actually argue that ByteString is even worse in this regard. At least with Text, these overly large codepoints have a pretty sane default behavior, namely UTF-8 encode + percent encode (e.g. what your browser does when you type arbitrary unicode in the URL).

The http-types aspect is interesting, and my follow up to that is that http-types is also doing the wrong thing, as they are currently worrying about encoding and assuming it to be utf8, when they should really only be dealing with charpoints. Occasionally it makes sense to do both such as in XML when you need to try and guess the encoding based on the encoding parameter and change the parser accordingly, but usually it does not make sense.

from req.

Make `parseUrlHttp[s]` take in a `Text` value about req HOT 3 CLOSED

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent