Comments (15)
The JSON file shared by web-platform tests does not contain valid UTF-8. I recommend skipping invalid UTF-8 tests.
from ada-python.
cc @lemire
from ada-python.
In case it wasn't clear, this is just a problem with the Python bindings and not the source C++ application.
Here is where ada does the relevant tests.
Here is the branch where I've added tests and am attempting fixes.
from ada-python.
This might be due to different handling of the surrogate pairs between Python's json module and ada's simdjson.
I tried plugging in pysimdjson, but it has trouble even parsing the input file. I'll file a report in that repo.
from ada-python.
@bbayles Ok. So we must take an invalid Unicode sequence and convert it to UTF-8. The UTF-8 should be valid. The simdjson library also supports invalid UTF-8 transcoding, called WTF-8, but the ada URL library assumes that the input is UTF-8 so I wanted to give it UTF-8. There is no mention of WTF-8 in the URL spec.
How to do it? In JavaScript, we do it by replacement typically...
Try it out in Node...
const encoder = new TextEncoder();
console.log(encoder.encode('\ud800\ud801\ud811'))
You will notice that it uses the default replacement character U+FFFD.
That's what simdjson does in this instance.
I have not investigated but it is standard enough as a practice that it should be standard in Python too. Although I admit I don't know how to do it right now.
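One possible way to get the same replacement behavior in Python is a round trip through UTF-16. This is only a sketch using the standard codecs; the function name to_utf8_with_replacement is made up for illustration and is not part of ada or simdjson. Encoding with 'surrogatepass' emits surrogates as raw code units, and decoding with 'replace' recombines valid pairs and substitutes U+FFFD for the unpaired ones.

```python
def to_utf8_with_replacement(text: str) -> bytes:
    """Encode text as UTF-8, replacing unpaired surrogates with U+FFFD.

    Hypothetical helper for illustration only.
    """
    # 'surrogatepass' lets the UTF-16 encoder write surrogates as raw code units.
    utf16 = text.encode("utf-16", "surrogatepass")
    # 'replace' turns every unpaired surrogate into U+FFFD while decoding.
    cleaned = utf16.decode("utf-16", "replace")
    return cleaned.encode("utf-8")


# Same input as the Node example above.
print(to_utf8_with_replacement("\ud800\ud801\ud811"))
```

The printed bytes should be three copies of the UTF-8 encoding of U+FFFD (EF BF BD), the same values TextEncoder reports.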
from ada-python.
I tried plugging in pysimdjson, but it has trouble even parsing the input file.
Effectively, the input is not portable JSON because of the bad Unicode sequence, so you should expect the file to fail in several instances.
It is unfortunate.
from ada-python.
Agreed. I will add the same test as a properly encoded byte sequence, outside of the JSON.
from ada-python.
@bbayles If you find out how to do decoding with replacement in Python, I'd love to know. I am sure it is covered by the standard libraries.
from ada-python.
I think we'll find that the issue is with decoding the JSON; it works fine when I encode the string myself.
I'll compare the outputs of various JSON decoders to see if I should file bugs against them.
from ada-python.
For reference, here are some things I found when testing...
This code (tests/wpt_tests.cpp in the ada repo) emits logs during testing.
The output for the test in question is:
That string represents these bytes:
b'http://example.com/\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'
Those work fine with ada_parse. That is, this test of the Python bindings works with no changes:
def test_surrogates(self):
    s = b'http://example.com/\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xef\xbf\xbd\xf0\x90\x9f\xbe\xef\xbf\xbd\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'.decode('utf-8')
    expected = 'http://example.com/%EF%BF%BD%F0%90%9F%BE%EF%BF%BD%EF%B7%90%EF%B7%8F%EF%B7%AF%EF%B7%B0%EF%BF%BE%EF%BF%BF?%EF%BF%BD%F0%90%9F%BE%EF%BF%BD%EF%B7%90%EF%B7%8F%EF%B7%AF%EF%B7%B0%EF%BF%BE%EF%BF%BF'
    actual = URL(s).href
    self.assertEqual(actual, expected)
We'll try to reproduce these bytes below.
Python's json module does this:
# Parse the file with the Python stdlib json module
>>> from json import load
>>> with open('urltestdata.json', 'rb') as f:
...     data = load(f)
# Inspect the REPL representation
>>> data[409]['input']
'http://example.com/\ud800\U000107fe\udfff\ufdd0﷏\ufdefﷰ\ufffe\uffff?\ud800\U000107fe\udfff\ufdd0﷏\ufdefﷰ\ufffe\uffff'
# Try to encode the string to bytes
>>> data[409]['input'].encode('utf-8')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 19: surrogates not allowed
# Encode to bytes with the 'replace' errors strategy
>>> data[409]['input'].encode('utf-8', 'replace')
b'http://example.com/?\xf0\x90\x9f\xbe?\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf??\xf0\x90\x9f\xbe?\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'
# Encode to bytes with the 'backslashreplace' errors strategy
>>> data[409]['input'].encode('utf-8', 'backslashreplace')
b'http://example.com/\\ud800\xf0\x90\x9f\xbe\\udfff\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\\ud800\xf0\x90\x9f\xbe\\udfff\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'
# Try to encode to bytes with the 'surrogateescape' errors strategy
>>> data[409]['input'].encode('utf-8', 'surrogateescape')
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 19: surrogates not allowed
# Encode to bytes with the 'surrogatepass' errors strategy
>>> data[409]['input'].encode('utf-8', 'surrogatepass')
b'http://example.com/\xed\xa0\x80\xf0\x90\x9f\xbe\xed\xbf\xbf\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf?\xed\xa0\x80\xf0\x90\x9f\xbe\xed\xbf\xbf\xef\xb7\x90\xef\xb7\x8f\xef\xb7\xaf\xef\xb7\xb0\xef\xbf\xbe\xef\xbf\xbf'
None of those matches the target.
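For comparison, here is a sketch that does reach the target bytes from the json module's output. It relies on two things seen in the REPL session above: the json module already merges the valid pair \uD801\uDFFE into U+107FE, and encode's 'replace' strategy substitutes '?' rather than U+FFFD. The helper name encode_with_replacement and the LONE_SURROGATE pattern are made up for illustration; the substitution is done by hand before encoding.

```python
import json
import re

# Matches any surrogate code point left in a str. After json decoding,
# valid pairs have already been merged, so whatever remains is unpaired.
LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def encode_with_replacement(text: str) -> bytes:
    """Replace unpaired surrogates with U+FFFD, then encode as UTF-8."""
    return LONE_SURROGATE.sub("\ufffd", text).encode("utf-8")


# The path portion of the WPT input, as it appears in the JSON text.
raw = r'"http://example.com/\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF"'
decoded = json.loads(raw)
print(encode_with_replacement(decoded))
```

The result should match the path portion of the bytes logged by the ada tests. Note that this only works on strings where pairs are already merged; a Python string literal such as '\ud801\udffe' keeps the two surrogates separate, and the pattern above would replace both.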
If we forget the Python standard library's json module entirely, and start with the input string:
"http://example.com/\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF?\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF"
then all of the above problems are still evident. Therefore, the JSON decoding isn't really the issue.
So I think the question reduces to: "how does simdjson (which is used by the ada tests) get the target byte representation from the JSON input?"
That is, how do we transform:
"http://example.com/\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF?\uD800\uD801\uDFFE\uDFFF\uFDD0\uFDCF\uFDEF\uFDF0\uFFFE\uFFFF"
into
http://example.com/���﷏�ﷰ��?���﷏�ﷰ��
I'm not sure of the answer yet!
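One way to think about the transformation is as a walk over the 16-bit code units named by the \uXXXX escapes: a high surrogate followed by a low surrogate becomes one code point, and any other surrogate becomes U+FFFD. The sketch below spells that out in plain Python. It is an illustration of replacement-style decoding under that assumption, not simdjson's actual code, and decode_utf16_units is a hypothetical name.

```python
def decode_utf16_units(units):
    """Turn 16-bit code units into a str, replacing unpaired surrogates with U+FFFD.

    Hypothetical helper for illustration only.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            # Valid pair: combine into a single supplementary code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(code_point))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Unpaired surrogate: substitute the replacement character.
            out.append("\ufffd")
            i += 1
        else:
            out.append(chr(unit))
            i += 1
    return "".join(out)


# The first five escapes from the input: \uD800 \uD801 \uDFFE \uDFFF \uFDD0
units = [0xD800, 0xD801, 0xDFFE, 0xDFFF, 0xFDD0]
print(decode_utf16_units(units).encode("utf-8"))
```

Here \uD800 and \uDFFF are unpaired and become U+FFFD, while \uD801\uDFFE combine into U+107FE, which matches the start of the byte string logged by the ada tests.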
from ada-python.
@bbayles The core issue is that \uD800\uD801 cannot be represented as UTF-8. In fact, it is invalid Unicode. The JSON specification warns about such cases:
When all the strings represented in a JSON text are composed entirely of Unicode characters [UNICODE] (however escaped), then that JSON text is interoperable in the sense that all software implementations that parse it will agree on the contents of names and of string values in objects and arrays. However, this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions. rfc8259
So it is bad JSON that can produce unspecified outcomes.
In JavaScript, these cases are converted using replacement. That is, the exact codes (\uD800\uD801) are ignored and effectively treated as if they were the replacement character \uFFFD. JavaScript is generally very forgiving.
By default, simdjson will just reject such data outright but, as you have learned, we can force simdjson to accept it. There are two options in simdjson. You can load it as WTF-8, a non-standard non-Unicode format, or you can use replacement characters. In the main ada library, we use replacement characters...
In truth, the WPT cases should be written with portable JSON.
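The difference between the two options can be illustrated in Python with a lone surrogate. This is a rough analogy only: Python's 'surrogatepass' handler produces the same three-byte form that WTF-8 uses for an unpaired surrogate, but it is not a full WTF-8 implementation, and is_valid_utf8 is a helper made up for this sketch.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone = "\ud800"

# WTF-8-like: the surrogate is kept, so the bytes are not valid UTF-8.
wtf8_like = lone.encode("utf-8", "surrogatepass")

# Replacement: the surrogate is swapped for U+FFFD, which is valid UTF-8.
replaced = lone.replace("\ud800", "\ufffd").encode("utf-8")

print(wtf8_like, is_valid_utf8(wtf8_like))
print(replaced, is_valid_utf8(replaced))
```

Only the replacement form can be handed to a library that assumes UTF-8 input, which is consistent with the choice described above for the main ada library.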
from ada-python.
There are two options in simdjson. You can load it as WTF-8, a non-standard non-Unicode format, or you can use replacement characters. In the main ada library, we use replacement characters...
I think the open question for me is: can we make Python do the same thing that simdjson does when ada uses it with replacement characters?
So far I don't know this answer.
from ada-python.
So far I don't know this answer.
I also don't know the answer and I am surprised at how difficult it is.
from ada-python.
This answer https://stackoverflow.com/a/38565489 suggests that Python will generate the replacement character \ufffd as expected. Oddly enough, that's not what my Python does.
from ada-python.
For what it's worth, the ftfy package implements the same algorithm as simdjson, and produces the same byte string from the input.
from ada-python.