python-webencodings's Introduction

python-webencodings

This is a Python implementation of the WHATWG Encoding standard.

To be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.

This module has encoding labels and BOM detection, but the actual implementation for encoders and decoders is Python’s.
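The two ideas above (label aliases and BOM precedence) can be illustrated with a minimal standard-library sketch. This is not the module's own code: `LABELS`, `lookup_label`, `sniff_bom` and `decode_web` are hypothetical names, the label table is a tiny illustrative subset, and Python's `cp1252` codec is only an approximation of the standard's windows-1252.

```python
import codecs

# Illustrative subset of the label table (hypothetical; the standard defines many more).
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "windows-1252": "windows-1252",
}

# Python codec names for the canonical names used in this sketch.
PYTHON_CODECS = {
    "windows-1252": "cp1252",
    "utf-16be": "utf_16_be",
    "utf-16le": "utf_16_le",
}

# Byte order marks paired with the encoding they select.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def lookup_label(label):
    """Return the canonical encoding name for a label, or None if unknown."""
    # Labels are matched case-insensitively after stripping ASCII whitespace.
    return LABELS.get(label.strip("\t\n\f\r ").lower())


def sniff_bom(data):
    """Return (encoding name or None, data with any BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def decode_web(data, fallback_label):
    """Decode bytes, letting a BOM override the declared label."""
    bom_name, payload = sniff_bom(data)
    name = bom_name or lookup_label(fallback_label)
    if name is None:
        raise LookupError("unknown encoding label: %r" % fallback_label)
    codec = PYTHON_CODECS.get(name, name)
    return payload.decode(codec, "replace"), name
```

With this sketch, `decode_web(b"\x80", "latin1")` decodes the byte as windows-1252 rather than as ISO-8859-1, and input starting with a UTF-8 BOM is decoded as UTF-8 whatever label is passed.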

python-webencodings's People

Contributors

adamchainz, di, fkrull, gsnedders, jdufresne, simonsapin

python-webencodings's Issues

Please make a release on pypi.

Hey,

On PyPI, you still have an old version of webencodings, which has the wrong utf8 encoding in it. Hence, xgettext fails to extract any strings from it.

Please release the fixed version, which I saw the commit for.

Cheers,
László

New release

Could you please issue a new release on PyPI.org?
In the very last commit you added the LICENSE file to the package metadata, but this has not yet been applied on PyPI.org.
I maintain a project that depends on your package, and I need to have licenses added to its dependencies. (BSD in particular explicitly requires one copy to be bundled with each code redistribution; [your] package currently on PyPI violates that.)
Thank you for your work and time.
Pax et bonum.

'utf8' encoding not found

Pip version: 9.0.0
Python version: 2.7.12
Operating System: Ubuntu 16.04.1 LTS xenial
Description:

I tried to run Django makemessages with the latest version of pip, and it gives me the following error:
./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:1: Unknown encoding "utf8". Proceeding with ASCII instead.

xgettext: ./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:1: Unknown encoding "utf8". Proceeding with ASCII instead.
xgettext: Non-ASCII string at ./lib/python2.7/site-packages/pip/_vendor/webencodings/__init__.py:64.
          Please specify the source encoding through --from-code or through a comment
          as specified in http://www.python.org/peps/pep-0263.html.

Variants of CJK encodings do not match variants specified by WHATWG, even when a more similar Python codec exists.

This is a confusing topic, since most people when learning that Shift JIS is a thing do not want to have to learn about multiple different competing Shift JIS versions.

However:

  • WHATWG's index jis0208 includes "formerly proprietary extensions from IBM and NEC".  Python's codec for Shift JIS including these extensions is "cp932", aka "ms-kanji".  Python's "shift_jis" codec excludes these extensions.  Sadly, Python does not offer EUC-JP or ISO-2022-JP codecs including these extensions.
  • WHATWG's index Big5 includes "the Hong Kong Supplementary Character Set and other common extensions".  Python's "big5" codec follows BIG5.TXT, which does not include these extensions but does include a less common extension for hiragana and katakana.  That extension is incompatible with (and actually collides with) the extension for hiragana and katakana included by the ETEN, IBM and WHATWG versions of Big5.  Python's "big5hkscs" codec is not exactly the same as WHATWG's Big5 either: there are a small number of edge cases, it does not treat codes with lead bytes below 0xA1 as decode-only, and its decoder does not accept absolutely all codes that WHATWG's does.  It is nevertheless much, much closer to the WHATWG behaviour than the "big5" codec, especially in the decoder.  The encoders are still quite different in terms of which codes they exclude, but the output of Python's "big5hkscs" encoder will basically always be correctly interpreted by WHATWG's "big5" decoder, while the same cannot be said of the output of Python's "big5" encoder.
  • WHATWG's index EUC-KR consists of "the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949".  Python's codec for exactly this is "cp949", aka "uhc".  By contrast, Python's "euc-kr" codec does not include the Unified Hangul Code extensions, and instead transforms the characters in question to and from KS X 1001 combining sequences (which work differently to Unicode combining sequences; hence, the characters in question do not exhibit combining behaviour when decoded one-by-one to Unicode).  The WHATWG decoder for EUC-KR does not recognise or transform back these sequences.

Some illustrative examples where differences occur:

>>> webencodings.decode(b'\x87\x82\x87@ \xedB', "windows-31j") # Should be "№①  鍈"
('�g@ �B', <Encoding shift_jis>)
>>> webencodings.decode(b'\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd', "big5-hkscs") # Should be "むかしむかし"
('ハろウハろウ', <Encoding big5>)
>>> webencodings.decode(b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "windows-949") # Should be "똠방각하"
('�c방각하', <Encoding euc-kr>)
>>> 

Although a number of other differences exist, and it is not possible to create a fully conformant implementation of the WHATWG Encoding Standard in Python without re-implementing several of the encodings (including most of the CJK ones, as well as e.g. KOI8-U) to actually conform to it, the degree of conformance and in particular compatibility with it would be considerably improved for much less effort by:

  • Using Python's "ms-kanji" codec for WHATWG's Shift JIS, not Python's "shift_jis" codec.
  • Using Python's "big5hkscs" codec for WHATWG's Big5, not Python's "big5" codec.
  • Using Python's "uhc" codec for WHATWG's EUC-KR, not Python's "euc-kr" codec.
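The three substitutions above can be tried out with the standard library alone. The sketch below is only an illustration, not a patch to webencodings: `CLOSER_PYTHON_CODECS`, `python_codec_for` and `SAMPLES` are hypothetical names, and `cp932` and `cp949` are used as the canonical Python names of the "ms-kanji" and "uhc" codecs. The sample bytes are taken from the examples earlier in this report.

```python
# Hypothetical override table: WHATWG encoding name -> closer Python codec.
CLOSER_PYTHON_CODECS = {
    "shift_jis": "cp932",    # Python's "ms-kanji" codec, with IBM/NEC extensions
    "big5": "big5hkscs",     # includes HKSCS and the ETEN kana extension
    "euc-kr": "cp949",       # Python's "uhc" codec, with Unified Hangul Code
}


def python_codec_for(whatwg_name):
    """Return the Python codec to use for a WHATWG encoding name."""
    return CLOSER_PYTHON_CODECS.get(whatwg_name, whatwg_name)


# Byte strings that exercise the extensions discussed above.
SAMPLES = {
    "shift_jis": b"\x87\x40",
    "big5": b"\xc7g\xc6\xf1\xc6\xfd",
    "euc-kr": b"\x8cc\xb9\xe6\xb0\xa2\xc7\xcf",
}

for name, raw in SAMPLES.items():
    # Decode once with the same-named Python codec and once with the override.
    naive = raw.decode(name, "replace")
    closer = raw.decode(python_codec_for(name), "replace")
    print(name, repr(naive), repr(closer))
```

Names without an entry in the table fall through unchanged, so such a table could sit in front of an existing name-to-codec lookup without affecting the other encodings.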

LICENSE is not packaged

The installed package does not contain the LICENSE file:

$ ls lib/python3.10/site-packages/webencodings-0.5.1.dist-info/
DESCRIPTION.rst  INSTALLER  METADATA  metadata.json  RECORD  REQUESTED  top_level.txt  WHEEL

We should probably replace license_file with license_files in setup.cfg, or simply remove that line (since LICENSE is a standard name and should be packaged by default).

test failures

=============================================== FAILURES ===============================================
___________________________________________ test_all_labels ____________________________________________
[gw6] linux -- Python 3.4.3 /usr/bin/python3.4
def test_all_labels():
        for label in LABELS:
>           assert decode(b'', label) == ''
E           assert ('', <Encoding iso-8859-4>) == ''
E            +  where ('', <Encoding iso-8859-4>) = decode(b'', 'iso88594')

webencodings/tests.py:50: AssertionError
_________________________________________ test_x_user_defined __________________________________________
[gw1] linux -- Python 3.4.3 /usr/bin/python3.4
def test_x_user_defined():
        encoded = b'2,\x0c\x0b\x1aO\xd9#\xcb\x0f\xc9\xbbt\xcf\xa8\xca'
        decoded = '2,\x0c\x0b\x1aO\uf7d9#\uf7cb\x0f\uf7c9\uf7bbt\uf7cf\uf7a8\uf7ca'
        encoded = b'aa'
        decoded = 'aa'
>       assert decode(encoded, 'x-user-defined') == decoded
E       assert ('aa', <Encoding x-user-defined>) == 'aa'
E        +  where ('aa', <Encoding x-user-defined>) = decode(b'aa', 'x-user-defined')

webencodings/tests.py:152: AssertionError
_____________________________________________ test_decode ______________________________________________
[gw5] linux -- Python 3.4.3 /usr/bin/python3.4
def test_decode():
>       assert decode(b'\x80', 'latin1') == '€'
E       assert ('€', <Encoding windows-1252>) == '€'
E        +  where ('€', <Encoding windows-1252>) = decode(b'\x80', 'latin1')

webencodings/tests.py:77: AssertionError
================================== 3 failed, 5 passed in 1.67 seconds ==================================

sdist is missing tox.ini

The sdist package on PyPI is missing the tox.ini file. Please add tox.ini to the sdist to make downstream testing easier. Thanks.
