gsnedders / python-webencodings


Character encoding for the web.

License: Other

Python 100.00%

python-webencodings's Introduction

python-webencodings

This is a Python implementation of the WHATWG Encoding standard.

Latest documentation: http://packages.python.org/webencodings/
Source code and issue tracker: https://github.com/gsnedders/python-webencodings
PyPI releases: http://pypi.python.org/pypi/webencodings
License: BSD
Python 2.6+ and 3.3+

To be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.

This module has encoding labels and BOM detection, but the actual implementation for encoders and decoders is Python’s.
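
The two rules above (label aliasing and BOM precedence) can be sketched with the standard library alone. This is only an illustration, not the module's code: the alias table is a tiny hand-picked subset of what the Encoding standard defines, and the helper names `lookup_label`, `sniff_bom` and `decode_web` are hypothetical.

```python
import codecs

# Illustrative subset only: web label -> Python codec name.
# The Encoding standard defines many more labels than these.
_LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "us-ascii": "cp1252",
    "ascii": "cp1252",
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "windows-1252": "cp1252",
}

# Byte order marks that override any declared encoding.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def lookup_label(label):
    """Return the Python codec name for a web label, or None if unknown."""
    return _LABELS.get(label.strip("\t\n\f\r ").lower())


def sniff_bom(data):
    """Return (codec, data without BOM) if a BOM is present, else (None, data)."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec, data[len(bom):]
    return None, data


def decode_web(data, declared_label):
    """Decode bytes, letting a BOM take precedence over the declared label."""
    codec, body = sniff_bom(data)
    if codec is None:
        codec = lookup_label(declared_label)
        if codec is None:
            raise LookupError("unknown encoding label: %r" % declared_label)
    return body.decode(codec, "replace"), codec


# "latin1" is treated as windows-1252, so byte 0x80 decodes as the euro sign.
print(decode_web(b"\x80", "latin1"))
# A UTF-8 BOM wins over the declared label.
print(decode_web(codecs.BOM_UTF8 + b"hi", "latin1"))
```

As in the module itself, the actual byte-to-text conversion is delegated to Python's codecs; only the label and BOM handling is specific to the web.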

python-webencodings's Issues

Please make a release on pypi.

Hey,

The version on PyPI is still an old one with the wrong utf8 encoding declaration in it, so xgettext fails to extract any strings from it.

Please release the fixed version; I saw the commit for it.

Cheers,
László

Support getstate and setstate on IncrementalEncoder/Decoder

Python 3 introduces a getstate/setstate method pair on the incremental encoder/decoders. It would be nice to expose this, even if only on Py3.
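
For reference, this is how the pair behaves on Python 3's own incremental decoders. It is only a sketch of the standard-library behaviour; how webencodings would expose it on its own wrapper classes is left open here.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed only the first two bytes of the three-byte sequence for the euro sign (E2 82 AC).
partial = decoder.decode(b"\xe2\x82")  # no complete character yet
state = decoder.getstate()             # (buffered bytes, codec-specific flags)

# A second decoder can resume from the saved state.
resumed = codecs.getincrementaldecoder("utf-8")()
resumed.setstate(state)
text = resumed.decode(b"\xac", final=True)

print(repr(partial), state, repr(text))
```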

BSD License Clarification

Which of the BSD Licenses described in this SPDX repository is actually being used?

New release

Could you please issue a new release on PyPI.org?
In the very last commit you added the LICENSE file to the package metadata, but this has not yet reached PyPI.org.
I maintain a project that depends on your package, and I need licenses to be included with its dependencies. (BSD in particular explicitly requires a copy of the license to be bundled with each redistribution of the code; the package currently on PyPI violates that.)
Thank you for your work and time.
Pax et bonum.

Copyright notice and BSD licence text are missing in pypi package

I would like to distribute the webencodings pypi sources along with a commercial product. However the pypi package (https://pypi.python.org/pypi/webencodings) does not include the LICENSE file. Therefore it is actually not legal to distribute the pypi package.

I would appreciate if someone would make sure that future versions of the pypi package also include the LICENSE file in order to be compliant with the BSD license.

'utf8' encoding not found

Pip version: 9.0.0
Python version: 2.7.12
Operating System: Ubuntu 16.04.1 LTS xenial
Description:

I tried to run Django makemessages with the latest version of pip and it gives me the following error:
./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:1: Unknown encoding "utf8". Proceeding with ASCII instead.

xgettext: ./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:1: Unknown encoding "utf8". Proceeding with ASCII instead.
xgettext: Non-ASCII string at ./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:64.
          Please specify the source encoding through --from-code or through a comment
          as specified in http://www.python.org/peps/pep-0263.html.
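
Python itself accepts "utf8" as an alias of "utf-8", which is presumably why the file imports without trouble while xgettext rejects the same declaration. A quick check with the standard library (the `names` variable is just for illustration):

```python
import codecs

# Python normalises all of these spellings to the same codec.
names = {label: codecs.lookup(label).name for label in ("utf8", "utf-8", "UTF8")}
print(names)
```

Spelling the source declaration as "utf-8" avoids depending on a tool knowing the alias.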

Include LICENSE in the generated wheels

See https://github.com/cython/cython/pull/3982/files for an example of how to do so.

This is necessary since pip’s vendoring logic is being changed to allow using wheels preferentially. This means it would need to use a hard-coded URL for the license of this package, until this package adds a license directly.

Variants of CJK encodings do not match variants specified by WHATWG, even when a more similar Python codec exists.

This is a confusing topic, since most people when learning that Shift JIS is a thing do not want to have to learn about multiple different competing Shift JIS versions.

However:

WHATWG's index jis0208 includes "formerly proprietary extensions from IBM and NEC". Python's codec for Shift JIS including these extensions is "cp932", aka "ms-kanji". Python's "shift_jis" codec excludes these extensions. Sadly, Python does not offer EUC-JP or ISO-2022-JP codecs including these extensions.
WHATWG's index Big5 includes "the Hong Kong Supplementary Character Set and other common extensions". Python's "big5" codec follows BIG5.TXT, which does not include these extensions but does include a less common extension for hiragana and katakana. That extension is incompatible with (and actually collides with) the hiragana and katakana extension included by the ETEN, IBM and WHATWG versions of Big5. Python's "big5hkscs" codec is not exactly the same as the WHATWG behaviour: there are a small number of edge cases, its decoder does not accept absolutely every code that WHATWG's does, and it does not treat codes with lead bytes below 0xA1 as decode-only. It is nonetheless far closer to the WHATWG behaviour than Python's "big5" codec, especially in the decoders. The encoders still differ considerably in which codes they exclude, but the output of Python's "big5hkscs" encoder will basically always be interpreted correctly by WHATWG's "big5" decoder, which cannot be said of the output of Python's "big5" encoder.
WHATWG's index EUC-KR consists of "the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949". Python's codec for exactly this is "cp949", aka "uhc". By contrast, Python's "euc-kr" codec does not include the Unified Hangul Code extensions, and instead transforms the characters in question to and from KS X 1001 combining sequences (which work differently to Unicode combining sequences; hence, the characters in question do not exhibit combining behaviour when decoded one-by-one to Unicode). The WHATWG decoder for EUC-KR does not recognise or transform back these sequences.

Some illustrative examples where differences occur:

>>> webencodings.decode(b'\x87\x82\x87@ \xedB', "windows-31j") # Should be "№①  鍈"
('�ｇ@ �B', <Encoding shift_jis>)
>>> webencodings.decode(b'\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd', "big5-hkscs") # Should be "むかしむかし"
('ハろウハろウ', <Encoding big5>)
>>> webencodings.decode(b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "windows-949") # Should be "똠방각하"
('�c방각하', <Encoding euc-kr>)
>>>

A number of other differences exist, and a fully conformant implementation of the WHATWG Encoding Standard in Python is not possible without re-implementing several of the encodings (most of the CJK ones, as well as e.g. KOI8-U). Even so, conformance, and in particular compatibility, would improve considerably for much less effort by:

Using Python's "ms-kanji" codec for WHATWG's Shift JIS, not Python's "shift_jis" codec.
Using Python's "big5hkscs" codec for WHATWG's Big5, not Python's "big5" codec.
Using Python's "uhc" codec for WHATWG's EUC-KR, not Python's "euc-kr" codec.
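
The effect of the three substitutions can be checked against Python's own codecs without going through webencodings. The sketch below reuses the sample bytes from the examples above and decodes each with the codec currently used and with the proposed one; `SAMPLES` and `compare` are hypothetical names, and "cp932" and "cp949" are the canonical Python names for "ms-kanji" and "uhc".

```python
# (sample bytes, codec currently used, codec proposed in this issue)
SAMPLES = [
    (b"\x87\x82\x87@ \xedB", "shift_jis", "cp932"),
    (b"\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd", "big5", "big5hkscs"),
    (b"\x8cc\xb9\xe6\xb0\xa2\xc7\xcf", "euc_kr", "cp949"),
]


def compare(data, current, proposed):
    """Decode the same bytes with both codecs, replacing undecodable input."""
    return data.decode(current, "replace"), data.decode(proposed, "replace")


results = [compare(*sample) for sample in SAMPLES]

for (data, current, proposed), (old, new) in zip(SAMPLES, results):
    print("%-10s %r" % (current, old))
    print("%-10s %r" % (proposed, new))
```

This only compares Python codecs with each other; it does not show that the proposed codecs match the WHATWG indexes in every case.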

LICENSE is not packaged

The installed package does not contain a LICENSE file:

$ ls lib/python3.10/site-packages/webencodings-0.5.1.dist-info/
DESCRIPTION.rst  INSTALLER  METADATA  metadata.json  RECORD  REQUESTED  top_level.txt  WHEEL

We should probably replace license_file -> license_files in setup.cfg or simply remove that line (since LICENSE is a standard name and should be packaged by default).
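
Whether an installed distribution ships a license file can also be checked programmatically. This is a minimal sketch assuming Python 3.8+ for `importlib.metadata`; the helper names and the list of file-name prefixes are hypothetical choices, not a packaging standard.

```python
from importlib import metadata

# File-name prefixes commonly used for license texts (an assumption for this sketch).
LICENSE_NAMES = ("LICENSE", "LICENCE", "COPYING", "NOTICE")


def find_license_files(paths):
    """Return the paths whose file name looks like a license file."""
    found = []
    for path in paths:
        name = str(path).replace("\\", "/").rsplit("/", 1)[-1]
        if name.upper().startswith(LICENSE_NAMES):
            found.append(str(path))
    return found


def installed_license_files(dist_name):
    """List license-like files recorded for an installed distribution.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    return find_license_files(metadata.files(dist_name) or [])


# The dist-info listing shown above contains no license file:
listing = ["DESCRIPTION.rst", "INSTALLER", "METADATA", "metadata.json",
           "RECORD", "REQUESTED", "top_level.txt", "WHEEL"]
print(find_license_files(listing))
```

Calling `installed_license_files("webencodings")` in an environment where the package is installed would run the same check against its RECORD.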

test failures

=============================================== FAILURES ===============================================
___________________________________________ test_all_labels ____________________________________________
[gw6] linux -- Python 3.4.3 /usr/bin/python3.4
def test_all_labels():
        for label in LABELS:
>           assert decode(b'', label) == ''
E           assert ('', <Encoding iso-8859-4>) == ''
E            +  where ('', <Encoding iso-8859-4>) = decode(b'', 'iso88594')

webencodings/tests.py:50: AssertionError
_________________________________________ test_x_user_defined __________________________________________
[gw1] linux -- Python 3.4.3 /usr/bin/python3.4
def test_x_user_defined():
        encoded = b'2,\x0c\x0b\x1aO\xd9#\xcb\x0f\xc9\xbbt\xcf\xa8\xca'
        decoded = '2,\x0c\x0b\x1aO\uf7d9#\uf7cb\x0f\uf7c9\uf7bbt\uf7cf\uf7a8\uf7ca'
        encoded = b'aa'
        decoded = 'aa'
>       assert decode(encoded, 'x-user-defined') == decoded
E       assert ('aa', <Encoding x-user-defined>) == 'aa'
E        +  where ('aa', <Encoding x-user-defined>) = decode(b'aa', 'x-user-defined')

webencodings/tests.py:152: AssertionError
_____________________________________________ test_decode ______________________________________________
[gw5] linux -- Python 3.4.3 /usr/bin/python3.4
def test_decode():
>       assert decode(b'\x80', 'latin1') == '€'
E       assert ('€', <Encoding windows-1252>) == '€'
E        +  where ('€', <Encoding windows-1252>) = decode(b'\x80', 'latin1')

webencodings/tests.py:77: AssertionError
================================== 3 failed, 5 passed in 1.67 seconds ==================================

sdist is missing tox.ini

The sdist package on PyPI is missing the tox.ini file. Please add tox.ini to the sdist to make downstream testing easier. Thanks.
