This library fails on one of the parsing tests <a href="https://github.com/ada-url/ada

cc <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Handle unusual unicode escape characters about ada-python HOT 15 CLOSED

bbayles commented on July 21, 2024

Handle unusual unicode escape characters

from ada-python.

Comments (15)

anonrig commented on July 21, 2024 1

The JSON file shared by web-platform tests does not contain valid UTF-8. I recommend skipping invalid UTF-8 tests.

from ada-python.

anonrig commented on July 21, 2024

cc @lemire

from ada-python.

bbayles commented on July 21, 2024

In case it wasn't clear, this is just a problem with the Python bindings and not the source C++ application.

Here is where ada does the relevant tests.

Here is the branch where I've added tests and am attempting fixes.

from ada-python.

bbayles commented on July 21, 2024

This might be due to different handling of the surrogate pairs between Python's json module and ada's simdjson.

I tried plugging in pysimdjson, but it has trouble even parsing the input file. I'll file a report in that repo.

from ada-python.

lemire commented on July 21, 2024

@bbayles Ok. So we must take an invalid Unicode sequence and convert it to UTF-8. The UTF-8 should be valid. The simdjson library also supports invalid UTF-8 transcoding, called WTF-8, but the ada URL library assumes that the input is UTF-8 so I wanted to give it UTF-8. There is no mention of WTF-8 in the URL spec.

How to do it? In JavaScript, we do it by replacement typically...

Try it out in Node...

const encoder = new TextEncoder();
console.log(encoder.encode('\ud800\ud801\ud811'))

You will notice that it uses the default replacement character U+FFFD.

That's what simdjson does in this instance.

I have not investigated but it is standard enough as a practice that it should be standard in Python too. Although I admit I don't know how to do it right now.

from ada-python.

lemire commented on July 21, 2024

I tried plugging in pysimdjson, but it has trouble even parsing the input file.

Effectively, the input is not portable JSON because of the bad Unicode sequence, so you should expect the file to fail in several instances.

It is unfortunate.

from ada-python.

bbayles commented on July 21, 2024

Agreed. I will add the same test as a properly encoded byte sequence, outside of the JSON.

from ada-python.

lemire commented on July 21, 2024

@bbayles If you find out how to do decoding with replacement in Python, I'd love to know. I am sure it is covered by the standard libraries.

from ada-python.

bbayles commented on July 21, 2024

I think we'll find that the issue is with decoding the JSON; it works fine when I encode the string myself.

I'll compare the outputs of various JSON decoders to see if I should file bugs against them.

from ada-python.

bbayles commented on July 21, 2024

For reference, here are some things I found when testing...

This code (tests/wpt_tests.cpp in the ada repo) emits logs during testing.

The output for the test in question is:

That string represent these bytes:

b'http://example.com/\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'

Those work fine with ada_parse. That is, this test of the Python bindings works with no changes:

    def test_surrogates(self):
        s = b'http://example.com/\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'.decode('utf-8')
        expected = 'http://example.com/%EF%BF%BD%F0%90%9F%BE%EF%BF%BD%EF%B7%90%EF%B7%8F%EF%B7%AF%EF%B7%B0%EF%BF%BE%EF%BF%BF?%EF%BF%BD%F0%90%9F%BE%EF%BF%BD%EF%B7%90%EF%B7%8F%EF%B7%AF%EF%B7%B0%EF%BF%BE%EF%BF%BF'
        actual = URL(s).href
        self.assertEqual(actual, expected)

We'll try to reproduce these bytes below.

Python's json module does this:

# Parse the file with the Python stdlib json module
>>> with open('urltestdata.json', 'rb') as f:
...     data = load(f)

# Inspect the REPL representation
>>> data[409]['input']
'http://example.com/\ud800\U000107fe\udfff\ufdd0﷏\ufdefﷰ\ufffe\uffff?\ud800\U000107fe\udfff\ufdd0﷏\ufdefﷰ\ufffe\uffff'

# Try to encode the string to bytes
>>> data[409]['input'].encode('utf-8')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 19: surrogates not allowed

# Encode to bytes with the the 'replace' errors strategy
>>> data[409]['input'].encode('utf-8', 'replace')
b'http://example.com/?\xf0\x90\x9f\xbe?\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf??\xf0\x90\x9f\xbe?\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'

# Encode to bytes with the the 'backslashreplace' errors strategy
>>> data[409]['input'].encode('utf-8', 'backslashreplace')
b'http://example.com/\\ud800\xf0\x90\x9f\xbe\\udfff\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\\ud800\xf0\x90\x9f\xbe\\udfff\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'

# Try to encode to bytes with the the 'surrogateescape' errors strategy
>>> data[409]['input'].encode('utf-8', 'surrogateescape')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 19: surrogates not allowed

# Encode to bytes with the '' strategy
>>> data[409]['input'].encode('utf-8', 'surrogatepass')
b'http://example.com/\xed\xa0\x80\xf0\x90\x9f\xbe\xed\xbf\xbf\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xed\xa0\x80\xf0\x90\x9f\xbe\xed\xbf\xbf\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'

None of those matches the target.

If we forget the Python standard library's json module entirely, and start with the input string:

"http://example.com/\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF?\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF"

the all of the above problems are still evident. Thefore, the JSON decoding isn't really the issue.

So I think the question reduces to: "how does simdjson (which is used by the ada tests) get the target byte representation from the JSON input?"

That is, how do we transform:

"http://example.com/\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF?\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF"

into

http://example.com/�𐟾��﷏�ﷰ��?�𐟾��﷏�ﷰ��

I'm not sure of the answer yet!

from ada-python.

lemire commented on July 21, 2024

@bbayles The core issue is that \uD800\uD801 cannot be represented as UTF-8. In fact, it is invalid Unicode. The JSON specification warns about such cases:

When all the strings represented in a JSON text are composed entirely of Unicode characters [UNICODE] (however escaped), then that JSON text is interoperable in the sense that all software implementations that parse it will agree on the contents of names and of string values in objects and arrays. However, this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions. rfc8259

So it is bad JSON that can produce unspecified outcomes.

In JavaScript, these cases are converted using replacement. That is, the exact codes (\uD800\uD801) are ignored and effectively treated as if they were the replacement character \uFFFD. JavaScript is generally very forgiving.

By default, simdjson will just reject such data outright but, as you have learned, we can force simdjson to accept it. There are two options in simdjson. You can load it as WTF-8, a non-standard non-Unicode format, or you can use replacement characters. In the main ada library, we use replacement characters...

In truth, the WPT cases should be written with portable JSON.

from ada-python.

bbayles commented on July 21, 2024

There are two options in simdjson. You can load it as WTF-8, a non-standard non-Unicode format, or you can use replacement characters. In the main ada library, we use replacement characters...

I think the open question for me is: can we make Python do the same thing that simdjson does when ada uses it with replacement characters?

So far I don't know this answer.

from ada-python.

lemire commented on July 21, 2024

So far I don't know this answer.

I also don't know the answer and I am surprised at how difficult it is.

from ada-python.

lemire commented on July 21, 2024

This answer https://stackoverflow.com/a/38565489 suggests that Python will generate the replacement character \ufffd as expected. Oddly enough, that's not what my Python does.

from ada-python.

bbayles commented on July 21, 2024

For what it's worth, the ftfy package implements the same algorithm as simdjson, and produces the same byte string from the input.

from ada-python.

Handle unusual unicode escape characters about ada-python HOT 15 CLOSED

Comments (15)

Related Issues (14)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent