cchardet's Introduction

cChardet

cChardet is a high-speed universal character encoding detector, implemented as a binding to uchardet.

Supported Languages/Encodings

  • International (Unicode)
    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
  • Arabic
    • ISO-8859-6
    • WINDOWS-1256
  • Bulgarian
    • ISO-8859-5
    • WINDOWS-1251
  • Chinese
    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-2312
  • Croatian
    • ISO-8859-2
    • ISO-8859-13
    • ISO-8859-16
    • Windows-1250
    • IBM852
    • MAC-CENTRALEUROPE
  • Czech
    • Windows-1250
    • ISO-8859-2
    • IBM852
    • MAC-CENTRALEUROPE
  • Danish
    • ISO-8859-1
    • ISO-8859-15
    • WINDOWS-1252
  • English
    • ASCII
  • Esperanto
    • ISO-8859-3
  • Estonian
    • ISO-8859-4
    • ISO-8859-13
    • Windows-1252
    • Windows-1257
  • Finnish
    • ISO-8859-1
    • ISO-8859-4
    • ISO-8859-9
    • ISO-8859-13
    • ISO-8859-15
    • WINDOWS-1252
  • French
    • ISO-8859-1
    • ISO-8859-15
    • WINDOWS-1252
  • German
    • ISO-8859-1
    • WINDOWS-1252
  • Greek
    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew
    • ISO-8859-8
    • WINDOWS-1255
  • Hungarian
    • ISO-8859-2
    • WINDOWS-1250
  • Irish Gaelic
    • ISO-8859-1
    • ISO-8859-9
    • ISO-8859-15
    • WINDOWS-1252
  • Italian
    • ISO-8859-1
    • ISO-8859-3
    • ISO-8859-9
    • ISO-8859-15
    • WINDOWS-1252
  • Japanese
    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean
    • ISO-2022-KR
    • EUC-KR / UHC
  • Lithuanian
    • ISO-8859-4
    • ISO-8859-10
    • ISO-8859-13
  • Latvian
    • ISO-8859-4
    • ISO-8859-10
    • ISO-8859-13
  • Maltese
    • ISO-8859-3
  • Polish
    • ISO-8859-2
    • ISO-8859-13
    • ISO-8859-16
    • Windows-1250
    • IBM852
    • MAC-CENTRALEUROPE
  • Portuguese
    • ISO-8859-1
    • ISO-8859-9
    • ISO-8859-15
    • WINDOWS-1252
  • Romanian
    • ISO-8859-2
    • ISO-8859-16
    • Windows-1250
    • IBM852
  • Russian
    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MAC-CYRILLIC
    • IBM866
    • IBM855
  • Slovak
    • Windows-1250
    • ISO-8859-2
    • IBM852
    • MAC-CENTRALEUROPE
  • Slovene
    • ISO-8859-2
    • ISO-8859-16
    • Windows-1250
    • IBM852
    • M

Example

import cchardet as chardet

# Read the sample as raw bytes; detect() works on bytes, not str.
with open(r"tests/samples/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()

result = chardet.detect(msg)
print(result)
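
The result is a dictionary with "encoding" and "confidence" keys, and the issue reports further down show that "encoding" can be None. A minimal sketch of decoding with a fallback follows. The helper name decode_detected and the UTF-8 fallback are assumptions made for illustration and are not part of cChardet.

```python
from typing import Callable, Optional


def decode_detected(raw: bytes,
                    detect: Callable[[bytes], dict],
                    fallback: str = "utf-8") -> str:
    """Decode raw bytes using a chardet-style detect() result.

    `detect` is any callable returning a dict with an "encoding" key,
    such as cchardet.detect. `fallback` is a hypothetical default used
    when the detector returns None or a name Python has no codec for.
    """
    encoding: Optional[str] = detect(raw).get("encoding")
    try:
        return raw.decode(encoding or fallback, errors="replace")
    except LookupError:
        # Python has no codec registered under the detected name.
        return raw.decode(fallback, errors="replace")
```

With cChardet installed, this could be called as decode_detected(msg, chardet.detect).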

Benchmark

$ python setup.py build_ext -i -f
$ python tests/bench.py

Results

CPU: AMD Ryzen 9 7950X3D

RAM: DDR5-5600MT/s 96GB

Platform: Ubuntu 24.04 amd64

Python 3.12.3

Library              Request (call/s)
chardet v5.2.0       1.1
cchardet v2.2.0a1    2263.6
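
The figures are detect calls per second, so the two rows differ by a factor of roughly 2,000 on this sample. The sketch below shows one way such a number could be measured. It is not the code in tests/bench.py; the function name and the repeat count are assumptions.

```python
import time
from typing import Callable


def calls_per_second(detect: Callable[[bytes], object],
                     payload: bytes,
                     repeat: int = 100) -> float:
    """Return how many detect(payload) calls complete per second."""
    start = time.perf_counter()
    for _ in range(repeat):
        detect(payload)
    elapsed = time.perf_counter() - start
    return repeat / elapsed if elapsed > 0 else float("inf")


# Ratio of the two throughput figures reported in the table.
speedup = 2263.6 / 1.1
```

Passing chardet.detect and cchardet.detect with the same payload would give two comparable numbers.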

LICENSE

See COPYING file.

Contact

Supported Platforms

  • Windows i686, x86_64
  • Linux i686, x86_64
  • macOS x86_64

cchardet's People

Contributors

asapo, craigds, decaz, denik, dependabot[bot], mcepl, meshy, moden-py, nirzak, pyyoshi

cchardet's Issues

macOS wheels

Thanks for a great library.

I see there are no macOS wheels. macOS jobs can be run on Travis CI and Azure Pipelines, IIRC, to deploy to PyPI or to upload artifacts to another location for manual uploading to PyPI.

Version 1.0?

The version of this library on PyPI is 0.3.5. It's been a long time since the last commit, and with no open issues, that would seem to indicate that the library is very stable.

Also, it seems like libcharsetdetect + related things are all in a settled or stable state as well -- so it doesn't seem like libcharsetdetect needs updating... References: uchardet-enhanced - 2012, libcharsetdetect - 2010, Mozilla universalchardet - 2012...

What is missing, to take this library to "1.0"? Is there, perhaps, a list of improvements that could be contributed?

cchardet-2.1.0.tar.gz doesn't contain COPYING file

OS/Arch

any

Python version

any

cChardet version

2.1.0

What is the problem?

cchardet-2.1.0.tar.gz downloaded from https://pypi.python.org/ doesn't contain the COPYING file.

Expected behavior

The COPYING file is included in the archive (README.rst refers to the COPYING file for the license).

Actual behavior

The archive does not contain it.

Steps to reproduce the behavior

fetch tarball from https://pypi.python.org/packages/source/c/cchardet/cchardet-2.1.0.tar.gz and then execute
gzip -d -c cchardet-2.1.0.tar.gz | tar tvf -
etc.
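
The same check can be scripted with the standard library instead of gzip and tar. This is a sketch only; the function name is hypothetical and it simply looks for a member with the given file name at any depth of the archive.

```python
import tarfile


def sdist_has_file(archive_path: str, filename: str) -> bool:
    """Return True if the .tar.gz archive has a member named `filename`."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Member names look like "cchardet-2.1.0/COPYING"; compare the last part.
        return any(name.split("/")[-1] == filename for name in tar.getnames())
```

For the report above, sdist_has_file("cchardet-2.1.0.tar.gz", "COPYING") would be expected to return False.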

Installation on OSX fails

$ sudo pip install cchardet
$ python -c "import cchardet"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/cchardet/__init__.py", line 4, in <module>
    from cchardet import _cchardet
ImportError: dlopen(/Library/Python/2.7/site-packages/cchardet/_cchardet.so, 2): no suitable image found.  Did find:
    /Library/Python/2.7/site-packages/cchardet/_cchardet.so: mach-o, but wrong architecture

When running the install manually without pip, it works.

System Info: OSX 10.10.5

Segmentation fault when fed `\x8f`

This code:

from __future__ import unicode_literals
import cchardet
msg = b'\x8f'
d = cchardet.Detector()
d.feed(msg)
d.close()
print(d.result)

results in a segmentation fault (during d.result)

(Python 2.7.6, 2.7.12, 3.4.3)

Can't install cChardet on my ARM64 machine

OS/Arch

('Linux', 'S905X', '5.9.8-meson64', '#trunk SMP PREEMPT Thu Nov 12 23:40:19 UTC 2020', 'aarch64', 'aarch64')

Python version

Python 3.8.8

cChardet version

2.1.7 and earlier versions

What is the problem?

Can't install cchardet. Maybe the cause is in setup.py.

Expected behavior

It installs normally.

Actual behavior

It shows this error:

Collecting cchardet
Using cached cchardet-2.1.7.tar.gz (653 kB)
Building wheels for collected packages: cchardet
Building wheel for cchardet (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"'; file='"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-gfswq87_
cwd: /tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/
Complete output (24 lines):
cythonize: ['src/cchardet/_cchardet.pyx']
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-aarch64-3.8
creating build/lib.linux-aarch64-3.8/cchardet
copying src/cchardet/version.py -> build/lib.linux-aarch64-3.8/cchardet
copying src/cchardet/init.py -> build/lib.linux-aarch64-3.8/cchardet
running build_ext
building 'cchardet._cchardet' extension
creating build/temp.linux-aarch64-3.8
creating build/temp.linux-aarch64-3.8/src
creating build/temp.linux-aarch64-3.8/src/cchardet
creating build/temp.linux-aarch64-3.8/src/ext
creating build/temp.linux-aarch64-3.8/src/ext/uchardet
creating build/temp.linux-aarch64-3.8/src/ext/uchardet/src
creating build/temp.linux-aarch64-3.8/src/ext/uchardet/src/LangModels
aarch64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Isrc/ext/uchardet/src -I/usr/include/python3.8 -c src/cchardet/_cchardet.cpp -o build/temp.linux-aarch64-3.8/src/cchardet/_cchardet.o
src/cchardet/_cchardet.cpp:4:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
ERROR: Failed building wheel for cchardet
Running setup.py clean for cchardet
Failed to build cchardet
Installing collected packages: cchardet
Running setup.py install for cchardet ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"'; file='"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-okdztqew/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/cchardet
cwd: /tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/
Complete output (24 lines):
cythonize: ['src/cchardet/_cchardet.pyx']
running install
running build
running build_py
creating build
creating build/lib.linux-aarch64-3.8
creating build/lib.linux-aarch64-3.8/cchardet
copying src/cchardet/version.py -> build/lib.linux-aarch64-3.8/cchardet
copying src/cchardet/init.py -> build/lib.linux-aarch64-3.8/cchardet
running build_ext
building 'cchardet._cchardet' extension
creating build/temp.linux-aarch64-3.8
creating build/temp.linux-aarch64-3.8/src
creating build/temp.linux-aarch64-3.8/src/cchardet
creating build/temp.linux-aarch64-3.8/src/ext
creating build/temp.linux-aarch64-3.8/src/ext/uchardet
creating build/temp.linux-aarch64-3.8/src/ext/uchardet/src
creating build/temp.linux-aarch64-3.8/src/ext/uchardet/src/LangModels
aarch64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Isrc/ext/uchardet/src -I/usr/include/python3.8 -c src/cchardet/_cchardet.cpp -o build/temp.linux-aarch64-3.8/src/cchardet/_cchardet.o
src/cchardet/_cchardet.cpp:4:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"'; file='"'"'/tmp/pip-install-hxlmfkru/cchardet_ec84530a651448e8b7d54bd420b69736/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-okdztqew/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/cchardet Check the logs for full command output.

Steps to reproduce the behavior

python3.8 -m pip install cchardet OR pip3 install cchardet
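
The fatal error in the log is "Python.h: No such file or directory", which means the compiler could not find the CPython development headers in the include directory it was given. On many Linux distributions these headers come from a separate development package; which package applies to this machine is not stated in the report. A small sketch for checking whether the header is present for the running interpreter (the function name is hypothetical):

```python
import os
import sysconfig


def python_header_available() -> bool:
    """Return True if Python.h exists in this interpreter's include dir."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))
```

If this returns False, building any C extension from source, cchardet included, is likely to fail in the same way.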

ImportError: dynamic module does not define init function (init_cchardet)

OS/Arch

An error occurred while importing the module

Python version

# /usr/local/python2.7/bin/python
Python 2.7.13 (default, Dec  8 2017, 00:52:09) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2

cChardet version

# cat cchardet/version.py
__version__ = '2.1.1'

What is the problem?

import cchardet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python2.7/lib/python2.7/site-packages/cchardet/__init__.py", line 3, in <module>
    from cchardet import _cchardet
ImportError: dynamic module does not define init function (init_cchardet)

I installed the cchardet package with the command "python setup.py install".

Expected behavior

Actual behavior

Steps to reproduce the behavior

Installation error: Visual C++ is required

OS/Arch

C:\Users\eight04\dev>python -c "import platform;print(platform.uname())"
uname_result(system='Windows', node='***', release='10', version='10.0.18362', machine='AMD64', processor='Intel64 Family 6 Model 69 Stepping 1, GenuineIntel')

Python version

C:\Users\eight04\dev>python --version
Python 3.8.1

cChardet version

2.1.5

What is the problem?

pip tries to compile the source code and fails.

Expected behavior

Install the wheel.

Actual behavior

C:\Users\eight04\dev>pip install cchardet
Collecting cchardet
  Using cached cchardet-2.1.5.tar.gz (653 kB)
Installing collected packages: cchardet
    Running setup.py install for cchardet ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\users\eight04\appdata\local\programs\python\python38-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\eight04\\AppData\\Local\\Temp\\pip-install-l1qsb7ur\\cchardet\\setup.py'"'"'; __file__='"'"'C:\\Users\\eight04\\AppData\\Local\\Temp\\pip-install-l1qsb7ur\\cchardet\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\eight04\AppData\Local\Temp\pip-record-d2kzk51y\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\eight04\appdata\local\programs\python\python38-32\Include\cchardet'
         cwd: C:\Users\eight04\AppData\Local\Temp\pip-install-l1qsb7ur\cchardet\
    Complete output (11 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.8
    creating build\lib.win32-3.8\cchardet
    copying src\cchardet\version.py -> build\lib.win32-3.8\cchardet
    copying src\cchardet\__init__.py -> build\lib.win32-3.8\cchardet
    running build_ext
    building 'cchardet._cchardet' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\eight04\appdata\local\programs\python\python38-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\eight04\\AppData\\Local\\Temp\\pip-install-l1qsb7ur\\cchardet\\setup.py'"'"'; __file__='"'"'C:\\Users\\eight04\\AppData\\Local\\Temp\\pip-install-l1qsb7ur\\cchardet\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\eight04\AppData\Local\Temp\pip-record-d2kzk51y\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\eight04\appdata\local\programs\python\python38-32\Include\cchardet' Check the logs for full command output.
C:\Users\eight04\dev>pip install cchardet --only-binary :all:
ERROR: Could not find a version that satisfies the requirement cchardet (from versions: none)
ERROR: No matching distribution found for cchardet
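
The paths in this log (python38-32, build\lib.win32-3.8) suggest a 32-bit interpreter running on a 64-bit OS, in which case pip would look for a win32 wheel rather than a win_amd64 one. Whether such a wheel was published for 2.1.5 is not stated here. A sketch for confirming the interpreter's word size and platform tag (the names are illustrative):

```python
import struct
import sysconfig


def interpreter_bits() -> int:
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize("P") * 8


# Platform string used when naming builds, e.g. "win32" or "win-amd64" on Windows.
platform_tag = sysconfig.get_platform()
```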

Is this a detection error?

Python 2.7.11
cchardet-1.1.1-cp27-cp27m-win_amd64.whl (md5)
Windows Server 2008 R2 Enterprise
System language: Simplified Chinese
test.py is the Python code.
testfile.cs is the input file (when opened in Notepad, the Save As dialog shows the encoding as ANSI).
testfile.chardet.cs is the output file decoded with chardet.detect(raw)['encoding'], then encoded as UTF-8.
testfile.cchardet.cs is the output file decoded with cchardet.detect(raw)['encoding'], then encoded as UTF-8.
testfile.GB2312.cs is the output file decoded as 'GB2312', then encoded as UTF-8.

testfile.GB2312.cs is the correct one.

test_chardet.zip

UniversalDetector.reset() does not reset the detector

OS/Arch

$ python -c 'import platform;print(platform.uname())'

uname_result(system='Linux', node='testserver.mimeanalytics.com', release='4.18.0-240.10.1.el8_3.x86_64', version='#1 SMP Mon Jan 18 17:05:51 UTC 2021', machine='x86_64', processor='x86_64')

Python version

$ python --version

Python 3.6.8

cChardet version

$ python -c 'import cchardet;print(cchardet.__version__)'

2.1.7

What is the problem?

ud = cchardet.UniversalDetector()
ud.reset() does not reset the values of ud.done or ud.result after the first file has had its encoding detected.

Expected behavior

ud.done == False
ud.result == None

Actual behavior

ud.done == True
ud.result == (the last result)

Steps to reproduce the behavior

#!/usr/bin/env python3
import cchardet

files = ['file1', 'file2', 'file3']
ud = cchardet.UniversalDetector()
for file in files:
    ud.reset()
    print(f'Before: {ud.done}, {ud.result}')
    with open(file, 'rb') as ifh:
        for line in ifh.readlines():
            ud.feed(line)
            if ud.done:
                break
    ud.close()
    print(f'After: {ud.done}, {ud.result}')
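
A possible workaround, assuming that constructing a new detector per file is acceptable, is to avoid reset() entirely. The sketch below takes the detector class as a parameter; the function name is hypothetical and it relies only on the feed, done, close and result members already used in the script above.

```python
from typing import Callable, Dict, Iterable, Optional


def detect_files(paths: Iterable[str],
                 make_detector: Callable[[], object]) -> Dict[str, Optional[dict]]:
    """Detect each file's encoding with a brand-new detector per file."""
    results = {}
    for path in paths:
        detector = make_detector()  # fresh state instead of calling reset()
        with open(path, "rb") as fh:
            for line in fh:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        results[path] = detector.result
    return results
```

With cChardet installed, this could be called as detect_files(files, cchardet.UniversalDetector).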

Cannot detect some encodings

OS/Arch

('Linux', 'scan05', '4.4.0-105-generic', '#128-Ubuntu SMP Thu Dec 14 12:42:11 UTC 2017', 'x86_64', 'x86_64')

Python version

3.7.2

cChardet version

2.1.4

What is the problem?

Some inputs cannot be detected.
eg: b'HTTP/1.0 200 Ok\r\nServer: Gnway Web Server\r\nDate: \xd6\xdc\xc8\xfd, 02 \xd2\xbb\xd4\xc2 2019 18:51:54 GMT\r\nContent-Type: text/html\r\nContent-Length: 243\r\n\r\n'

Expected behavior

import chardet
chardet.detect(info)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

Actual behavior

import cchardet
cchardet.detect(info)
{'encoding': None, 'confidence': None}

Steps to reproduce the behavior

info = b'HTTP/1.0 200 Ok\r\nServer: Gnway Web Server\r\nDate: \xd6\xdc\xc8\xfd, 02 \xd2\xbb\xd4\xc2 2019 18:51:54 GMT\r\nContent-Type: text/html\r\nContent-Length: 243\r\n\r\n'
cchardet.detect(info)
cchardet returns nothing, but chardet reports the right encoding.
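
When the detector returns None, one fallback is to try a short list of candidate encodings with strict decoding and keep the first that succeeds. This is a sketch, not a fix for the detector: the function name and the candidate list are assumptions, and because many byte sequences decode under several legacy encodings, the order of candidates matters.

```python
from typing import Iterable, Optional, Tuple


def first_decodable(raw: bytes,
                    candidates: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (encoding, text) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)  # strict: raises on bad bytes
        except (UnicodeDecodeError, LookupError):
            continue
    return None


# Header bytes taken from the report above (shortened to the first lines).
info = (b"HTTP/1.0 200 Ok\r\nServer: Gnway Web Server\r\n"
        b"Date: \xd6\xdc\xc8\xfd, 02 \xd2\xbb\xd4\xc2 2019 18:51:54 GMT\r\n")
match = first_decodable(info, ["utf-8", "gb18030"])
```

Here the UTF-8 attempt fails on the date field and the GB18030 attempt succeeds, which agrees with the GB2312 answer chardet gave.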

Remove bundled uchardet

What is the problem?

uchardet is bundled. This is a major concern for including the package in distribution package collections, because uchardet-devel may already be present on the system.

Expected behavior

Use the distro package (e.g., uchardet-devel) instead of the bundled copy of uchardet during the build process. Please consider removing the bundled uchardet or providing an option to use the installed package.

Actual behavior

Uses uchardet in src/ext.

Steps to reproduce the behavior

Install the package.

error: command '/usr/bin/clang' failed with exit status 1

OS 10.15

$ pip3 install cchardet 

Python version

$ python 3.6

When I try to install cChardet, it produces the following error. Could anyone help? Thank you.

What is the problem?

    ERROR: Command errored out with exit status 1:
     command: /Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/setup.py'"'"'; __file__='"'"'/private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-record-pyoqp_k9/install-record.txt --single-version-externally-managed --compile --install-headers /Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m/cchardet
         cwd: /private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/
    Complete output (23 lines):
    cythonize: ['src/cchardet/_cchardet.pyx']
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/src/cchardet/_cchardet.pyx
      tree = Parsing.p_module(s, pxd, full_module_name)
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.6-intel-3.6
    creating build/lib.macosx-10.6-intel-3.6/cchardet
    copying src/cchardet/version.py -> build/lib.macosx-10.6-intel-3.6/cchardet
    copying src/cchardet/__init__.py -> build/lib.macosx-10.6-intel-3.6/cchardet
    running build_ext
    building 'cchardet._cchardet' extension
    creating build/temp.macosx-10.6-intel-3.6
    creating build/temp.macosx-10.6-intel-3.6/src
    creating build/temp.macosx-10.6-intel-3.6/src/cchardet
    creating build/temp.macosx-10.6-intel-3.6/src/ext
    creating build/temp.macosx-10.6-intel-3.6/src/ext/uchardet
    creating build/temp.macosx-10.6-intel-3.6/src/ext/uchardet/src
    creating build/temp.macosx-10.6-intel-3.6/src/ext/uchardet/src/LangModels
    /usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch i386 -arch x86_64 -g -Isrc/ext/uchardet/src -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c src/cchardet/_cchardet.cpp -o build/temp.macosx-10.6-intel-3.6/src/cchardet/_cchardet.o
    xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
    error: command '/usr/bin/clang' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/setup.py'"'"'; __file__='"'"'/private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-install-3vloyusd/cchardet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/_g/2rrxbrnj1xb_h5k70xgtdss00000gn/T/pip-record-pyoqp_k9/install-record.txt --single-version-externally-managed --compile --install-headers /Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m/cchardet Check the logs for full command output.

Failed to Build on Python 3.11.0

OS/Arch

$ python -c 'import platform;print(platform.uname())'

Python version

3.11.0

$ python --version

cChardet version

Latest

$ python -c 'import cchardet;print(cchardet.__version__)'

What is the problem?

Failed to build.

Expected behavior

Supposed to build.

Actual behavior

fatal error: longintrepr.h: No such file or directory

Steps to reproduce the behavior

Update to Python 3.11 and attempt to build the code; the build fails with the error shown under Actual behavior.

AARCH64 Support?

Hello!

I'm using an ODROID C2 (http://www.hardkernel.com/main/products/prdt_info.php), which has AARCH64 architecture. The operating system is Ubuntu MATE 16.04.

When running pip install cchardet, I get this error complaining about the architecture:

cchardet_error.txt

The same happens when I try to install manually from this repository. I found the following, which seems like what I need: http://tardis.tiny-vps.com/aarm/packages/p/python-cchardet/ but I'm not sure how to install it, as it doesn't contain a setup.py file.

Any help is appreciated!

Thanks,
Oliver

Windows curly quotes trip up cchardet

Forgive me if this is the wrong place for this - I'm somewhat ignorant of the internal workings of cchardet.

\x92 seems to cause strings to be interpreted as this Central European encoding:

>>> cchardet.detect('Bob\x92s Burgers')
{'confidence': 0.8183978796005249, 'encoding': u'MacCentralEurope'}
>>> print 'Bob\x92s Burgers'.decode('MacCentralEurope')
Bobís Burgers

I don't know enough about Central European languages to comment on whether that's a good choice. I do know that curly quotes as produced by MS Word are quite common, so interpreting them badly seems like a fairly obvious bug.

A correct choice for that string might be windows-1252 which renders curly quotes correctly.

To be fair, this same string trips up chardet too (a bit differently). So I guess this must not be a trivially-obvious situation:

>>> chardet.detect('Bob\x92s Burgers')
{'confidence': 0.846643894804694, 'encoding': 'ISO-8859-2'}
>>> print 'Bob\x92s Burgers'.decode('ISO-8859-2')
Bobs Burgers
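
The three interpretations of byte 0x92 discussed here can be compared directly. The sketch below uses Python 3 and the standard codecs; "mac_latin2" is Python's codec name for MacCentralEurope.

```python
raw = b"Bob\x92s Burgers"

# In windows-1252, 0x92 is RIGHT SINGLE QUOTATION MARK (U+2019).
as_cp1252 = raw.decode("windows-1252")

# In MacCentralEurope (Python codec "mac_latin2"), 0x92 is "í" (U+00ED).
as_mac_ce = raw.decode("mac_latin2")

# In ISO-8859-2, 0x92 maps to an invisible C1 control character (U+0092),
# which is why the chardet result prints as "Bobs Burgers".
as_latin2 = raw.decode("iso-8859-2")
```

Only the windows-1252 reading yields a curly apostrophe, which supports the suggestion above that it is the more plausible choice for this string.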

Failed to build cchardet - osx 10.13.3 with brew python3

Failed building wheel for cchardet
Running setup.py clean for cchardet
Failed to build cchardet
Installing collected packages: cchardet, ccxt
Running setup.py install for cchardet ... error
Complete output from command /usr/local/opt/python/bin/python3.6 -u -c "import setuptools, tokenize;file='/private/var/folders/c6/t3s_sm6906j2lm2rcp2bdmqh0000gp/T/pip-build-ni2b6d85/cchardet/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/c6/t3s_sm6906j2lm2rcp2bdmqh0000gp/T/pip-470mvm70-record/install-record.txt --single-version-externally-managed --compile:
cythonize: ['src/cchardet/_cchardet.pyx']
running install
running build
running build_py
creating build
creating build/lib.macosx-10.13-x86_64-3.6
creating build/lib.macosx-10.13-x86_64-3.6/cchardet
copying src/cchardet/version.py -> build/lib.macosx-10.13-x86_64-3.6/cchardet
copying src/cchardet/init.py -> build/lib.macosx-10.13-x86_64-3.6/cchardet
running build_ext
building 'cchardet._cchardet' extension
creating build/temp.macosx-10.13-x86_64-3.6
creating build/temp.macosx-10.13-x86_64-3.6/src
creating build/temp.macosx-10.13-x86_64-3.6/src/cchardet
creating build/temp.macosx-10.13-x86_64-3.6/src/ext
creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet
creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet/src
creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet/src/LangModels
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Isrc/ext/uchardet/src -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c src/cchardet/_cchardet.cpp -o build/temp.macosx-10.13-x86_64-3.6/src/cchardet/_cchardet.o
src/cchardet/_cchardet.cpp:1093:15: error: use of undeclared identifier 'assert'
__pyx_t_1 = PyBytes_GET_SIZE(__pyx_v_msg); if (unlikely(__pyx_t_1 == ((Py_ssize_t)-1))) __PYX_ERR(1, 15, __pyx_L1_error)
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:89:32: note: expanded from macro 'PyBytes_GET_SIZE'
#define PyBytes_GET_SIZE(op) (assert(PyBytes_Check(op)),Py_SIZE(op))
^
src/cchardet/_cchardet.cpp:1116:15: error: use of undeclared identifier 'assert'
__pyx_t_2 = __Pyx_PyBytes_AsWritableString(__pyx_v_msg); if (unlikely((__pyx_t_2 == ((const char*)NULL)) && PyErr_Occurred())) __PYX_ERR(1, 19, __pyx_L1_error)
^
src/cchardet/_cchardet.cpp:596:56: note: expanded from macro '__Pyx_PyBytes_AsWritableString'
#define __Pyx_PyBytes_AsWritableString(s) ((char*) PyBytes_AS_STRING(s))
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:87:32: note: expanded from macro 'PyBytes_AS_STRING'
#define PyBytes_AS_STRING(op) (assert(PyBytes_Check(op)),
^
src/cchardet/_cchardet.cpp:1206:57: error: use of undeclared identifier 'assert'
__pyx_t_3 = (__pyx_v_detected_charset != Py_None) && (PyBytes_GET_SIZE(__pyx_v_detected_charset) != 0);
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:89:32: note: expanded from macro 'PyBytes_GET_SIZE'
#define PyBytes_GET_SIZE(op) (assert(PyBytes_Check(op)),Py_SIZE(op))
^
src/cchardet/_cchardet.cpp:1553:15: error: use of undeclared identifier 'assert'
__pyx_t_2 = PyBytes_GET_SIZE(__pyx_v_msg); if (unlikely(__pyx_t_2 == ((Py_ssize_t)-1))) __PYX_ERR(1, 64, __pyx_L1_error)
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:89:32: note: expanded from macro 'PyBytes_GET_SIZE'
#define PyBytes_GET_SIZE(op) (assert(PyBytes_Check(op)),Py_SIZE(op))
^
src/cchardet/_cchardet.cpp:1577:17: error: use of undeclared identifier 'assert'
__pyx_t_3 = __Pyx_PyBytes_AsWritableString(__pyx_v_msg); if (unlikely((__pyx_t_3 == ((const char*)NULL)) && PyErr_Occurred())) __PYX_ERR(1, 66, __pyx_L1_error)
^
src/cchardet/_cchardet.cpp:596:56: note: expanded from macro '__Pyx_PyBytes_AsWritableString'
#define __Pyx_PyBytes_AsWritableString(s) ((char*) PyBytes_AS_STRING(s))
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:87:32: note: expanded from macro 'PyBytes_AS_STRING'
#define PyBytes_AS_STRING(op) (assert(PyBytes_Check(op)),
^
src/cchardet/_cchardet.cpp:1940:15: error: use of undeclared identifier 'assert'
__pyx_t_2 = PyBytes_GET_SIZE(__pyx_t_1); if (unlikely(__pyx_t_2 == ((Py_ssize_t)-1))) __PYX_ERR(1, 93, __pyx_L1_error)
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytesobject.h:89:32: note: expanded from macro 'PyBytes_GET_SIZE'
#define PyBytes_GET_SIZE(op) (assert(PyBytes_Check(op)),Py_SIZE(op))
^
src/cchardet/_cchardet.cpp:3817:19: error: use of undeclared identifier 'assert'
*length = PyByteArray_GET_SIZE(o);
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytearrayobject.h:54:37: note: expanded from macro 'PyByteArray_GET_SIZE'
#define PyByteArray_GET_SIZE(self) (assert(PyByteArray_Check(self)), Py_SIZE(self))
^
src/cchardet/_cchardet.cpp:3818:16: error: use of undeclared identifier 'assert'
return PyByteArray_AS_STRING(o);
^
/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/bytearrayobject.h:52:6: note: expanded from macro 'PyByteArray_AS_STRING'
(assert(PyByteArray_Check(self)),
^
8 errors generated.
error: command 'clang' failed with exit status 1

----------------------------------------

Command "/usr/local/opt/python/bin/python3.6 -u -c "import setuptools, tokenize;file='/private/var/folders/c6/t3s_sm6906j2lm2rcp2bdmqh0000gp/T/pip-build-ni2b6d85/cchardet/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/c6/t3s_sm6906j2lm2rcp2bdmqh0000gp/T/pip-470mvm70-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/c6/t3s_sm6906j2lm2rcp2bdmqh0000gp/T/pip-build-ni2b6d85/cchardet/

multiple licenses

Hi,
Can I ask why you mention multiple licenses in License.txt? For reference, chardet used only the GNU LGPL. It just seemed unusual.
Thanks!

pyinstaller issue with cchardet 2.1.6

OS/Arch

uname_result(system='Windows', release='10', version='10.0.17763', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 10, GenuineIntel')

Python version

Python 3.6.5

cChardet version

2.1.6

What is the problem?

An application frozen with PyInstaller fails to start and raises this error. Using cChardet 2.1.5 does not cause this problem.

ImportError: DLL load failed: The specified module could not be found. 

Expected behavior

Actual behavior

Steps to reproduce the behavior

confidence=0 when scanning ASCII text

cchardet reports a confidence of 0.0 when scanning ASCII text. This is inconsistent with chardet's behaviour:

>>> import chardet
>>> import cchardet
>>> import string
>>> chardet.detect(string.ascii_letters*1000)
{'confidence': 1.0, 'encoding': 'ascii'}
>>> cchardet.detect(string.ascii_letters*1000)
{'confidence': 0.0, 'encoding': u'ASCII'}

The size of the document makes no difference; the returned confidence is always 0.0.

1.0 would seem to be more correct since the document has no non-ASCII characters.
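
Until the binding reports a meaningful confidence for ASCII, calling code could post-process the result itself. The following is a minimal sketch under that assumption; `normalize_ascii_confidence` is a hypothetical helper name and is not part of cchardet. It treats input as ASCII with confidence 1.0 when a strict ASCII decode succeeds.

```python
def normalize_ascii_confidence(data, result):
    """Return a copy of a detect()-style result, with confidence 1.0 for pure ASCII.

    `result` is assumed to be a dict such as
    {'encoding': 'ASCII', 'confidence': 0.0}.
    """
    fixed = dict(result)
    if not data:
        # Empty input carries no evidence, so leave the result untouched.
        return fixed
    try:
        # A strict ASCII decode succeeds only if every byte is below 0x80.
        data.decode("ascii")
    except UnicodeDecodeError:
        return fixed
    fixed["encoding"] = "ASCII"
    fixed["confidence"] = 1.0
    return fixed
```

The original result is never mutated, so the raw detector output remains available for logging.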

Build wheels with LTO enabled

Not sure if this is already being done, but since I couldn't find a reference to it, I thought I'd bring up the suggestion.

uchardet is a non-trivial external library, and it looks like it's directly linked in by cchardet. This setup usually benefits from enabling -flto in GCC, to allow for cross-module link-time optimisations, such as inlining. It's a cheap thing to do, but can have quite a visible impact on the overall performance.
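
As a rough illustration of the suggestion, the flags could be collected in one place and passed to the extension definition. This is only a sketch: `lto_build_kwargs` is a hypothetical helper, and how cChardet's actual setup.py is organised is not shown in this issue. The keyword names `extra_compile_args` and `extra_link_args` are the ones accepted by setuptools' `Extension`.

```python
def lto_build_kwargs(enable=True):
    """Extra keyword arguments for a setuptools Extension when LTO is wanted."""
    if not enable:
        return {"extra_compile_args": [], "extra_link_args": []}
    # GCC needs -flto at compile time and again at the final link step
    # for link-time optimisation to take effect.
    return {"extra_compile_args": ["-flto"], "extra_link_args": ["-flto"]}


# Hypothetical use inside setup.py (source list elided on purpose):
#   Extension("cchardet._cchardet", sources=[...], **lto_build_kwargs())
```

Keeping the switch behind a parameter would make it easy to disable LTO on toolchains that do not support it.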

Support setuptools 0.7 series

I'm using setuptools 0.7.1.

easy_install cchardet

or

pip install cchardet

output

ValueError: A 0.7-series setuptools cannot be installed with distribute. Found one at /home/azureuser/.pyenv/versions/2.7.5/lib/python2.7/site-packages/setuptools-0.7.1-py2.7.egg
/tmp/easy_install-WQPHBO/cchardet-0.3.3/distribute-0.6.14-py2.7.egg

LookupError: unknown encoding: EUC-TW

This seems similar in nature to #8, but unfortunately, I do not know what to recommend as an alternative to EUC-TW.

One can see that there is nothing that quite matches in Python's list of standard encodings.

I also thought that I should look through the other encodings mentioned in the readme, and found that there are a number of other codecs that did not come up in the list:

Do you have any recommendations for how I could decode strings that are detected as these types in python?
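
One way to avoid the LookupError at decode time is to ask Python's codec registry first. The sketch below uses only `codecs.lookup` from the standard library; `python_codec_for` is a hypothetical helper name. It does not supply a substitute codec for EUC-TW; it only reports whether Python knows the detected name.

```python
import codecs


def python_codec_for(detected_name):
    """Return Python's canonical codec name, or None if Python has no such codec."""
    if detected_name is None:
        return None
    try:
        return codecs.lookup(detected_name).name
    except LookupError:
        # The detector named an encoding that Python's registry does not ship.
        return None


for name in ("UTF-8", "GB18030", "EUC-TW"):
    print(name, "->", python_codec_for(name))
```

What to do with a None result is an application decision, for example skipping the document or handing the bytes to an external converter.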

failed to build on CentOS6

gcc: /tmp/cChardet/src/cchardet/cchardet.c: No such file or directory
gcc: no input files
error: command 'gcc' failed with exit status 1

Does running "cython /tmp/cChardet/src/cchardet/cchardet.pyx" not work?

Raising Exception

Hi,
I was just wondering whether cchardet.detect() can raise an exception and, if so, in which scenarios.
Thanks.

cchardet upgrade fails in Python 3.5.2

OS/Arch

MacOS

Python version

3.5.2

cChardet version

2.1.4

What is the problem?

Instead of getting a wheel, pip tries to compile cchardet from source, which doesn't work on my users' machines.

Expected behavior

"pip install --upgrade cchardet" should install the latest working version of cchardet.

Actual behavior

pip tries to build cchardet from source, and fails. See attached terminal log.

Steps to reproduce the behavior

"pip install --upgrade cchardet" on a Mac running Python 3.5.2
Terminal_Info.txt

Incorrect detection of TIS-620 instead of UTF-8

cchardet incorrectly detects TIS-620 encoding for some UTF-8 encoded cyrillic strings:

>>> cchardet.__version__
'2.1.1'

>>> cchardet.detect(u"тест".encode("utf-8"))
{'confidence': 0.9900000095367432, 'encoding': u'TIS-620'}
>>> cchardet.detect(u"тост".encode("utf-8"))
{'confidence': 0.9900000095367432, 'encoding': u'TIS-620'}

>>> chardet.__version__
'3.0.4'
>>> chardet.detect(u"тест".encode("utf-8"))
{'confidence': 0.938125, 'language': '', 'encoding': 'utf-8'}
>>> chardet.detect(u"тост".encode("utf-8"))
{'confidence': 0.938125, 'language': '', 'encoding': 'utf-8'}
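
A possible caller-side mitigation, sketched here as an assumption rather than as a fix inside cchardet, is to try a strict UTF-8 decode before trusting the detector. `detect_prefer_utf8` is a hypothetical wrapper; its `detect` parameter stands for any callable shaped like `cchardet.detect`, and the confidence of 1.0 it reports is an arbitrary choice made for this example.

```python
def detect_prefer_utf8(data, detect):
    """Prefer UTF-8 whenever the bytes are strictly valid UTF-8.

    `detect` is any callable that takes bytes and returns a dict with
    'encoding' and 'confidence' keys.
    """
    if data:
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            pass
        else:
            # Confidence value chosen for this sketch, not computed.
            return {"encoding": "UTF-8", "confidence": 1.0}
    return detect(data)
```

Note that pure ASCII input is also valid UTF-8, so this wrapper reports UTF-8 for it; callers who need to distinguish the two would have to add their own check.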

Build error on Mojave

OS/Arch

uname_result(system='Darwin', node='Horizon.local', release='18.0.0', version='Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64', machine='x86_64', processor='i386')

Python version

Python 3.6.5, Python 3.7.0

cChardet version

2.1.4

What is the problem?

After upgrading to macOS Mojave, I started getting errors when compiling the module.

Expected behavior

Successful build.

Actual behavior

Collecting cchardet==2.1.4 ▉▉▉▉▉▉▉▉▉▉ 0/4 — 00:00:00
  Using cached https://files.pythonhosted.org/packages/74/64/3988d388315c1af3e24f447689dadf30edead43366fb2041cb103380b57f/cchardet-2.1.4.tar.gz
Building wheels for collected packages: cchardet
  Running setup.py bdist_wheel for cchardet: started
  Running setup.py bdist_wheel for cchardet: finished with status 'error'
  Complete output from command /Users/xen/.local/share/virtualenvs/notify-1dEpo5ac/bin/python3.6m -u -c "import setuptools, tokenize;__file__='/private/var/folders/bb/l1m9ls5925s0tqfjb_fhhw6h0000gp/T/pip-install-e9qcie_r/cchardet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/bb/l1m9ls5925s0tqfjb_fhhw6h0000gp/T/pip-wheel-d3z98u1t --python-tag cp36:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.13-x86_64-3.6
  creating build/lib.macosx-10.13-x86_64-3.6/cchardet
  copying src/cchardet/version.py -> build/lib.macosx-10.13-x86_64-3.6/cchardet
  copying src/cchardet/__init__.py -> build/lib.macosx-10.13-x86_64-3.6/cchardet
  warning: build_py: byte-compiling is disabled, skipping.
  
  running build_ext
  building 'cchardet._cchardet' extension
  creating build/temp.macosx-10.13-x86_64-3.6
  creating build/temp.macosx-10.13-x86_64-3.6/src
  creating build/temp.macosx-10.13-x86_64-3.6/src/cchardet
  creating build/temp.macosx-10.13-x86_64-3.6/src/ext
  creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet
  creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet/src
  creating build/temp.macosx-10.13-x86_64-3.6/src/ext/uchardet/src/LangModels
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Isrc/ext/uchardet/src -I/Users/xen/.pyenv/versions/3.6.5/include/python3.6m -c src/cchardet/_cchardet.cpp -o build/temp.macosx-10.13-x86_64-3.6/src/cchardet/_cchardet.o
  In file included from src/cchardet/_cchardet.cpp:4:
  /Users/xen/.pyenv/versions/3.6.5/include/python3.6m/Python.h:14:2: error: "Something's broken.  UCHAR_MAX should be defined in limits.h."
  #error "Something's broken.  UCHAR_MAX should be defined in limits.h."
   ^
  /Users/xen/.pyenv/versions/3.6.5/include/python3.6m/Python.h:18:2: error: "Python's source code assumes C's unsigned char is an 8-bit type."
  #error "Python's source code assumes C's unsigned char is an 8-bit type."
   ^
  In file included from src/cchardet/_cchardet.cpp:4:
  In file included from /Users/xen/.pyenv/versions/3.6.5/include/python3.6m/Python.h:30:
  /usr/local/include/string.h:49:1: error: unknown type name 'HASHKIT_API'
  HASHKIT_API
  ^
  /usr/local/include/string.h:50:1: error: expected unqualified-id
  void hashkit_string_free(hashkit_string_st *ptr);
  ^
  /usr/local/include/string.h:53:1: error: unknown type name 'HASHKIT_API'
  HASHKIT_API
  ^
  /usr/local/include/string.h:54:7: error: expected ';' after top level declarator
  size_t hashkit_string_length(const hashkit_string_st *self);
        ^
        ;
  /usr/local/include/string.h:56:1: error: unknown type name 'HASHKIT_API'
  HASHKIT_API
  ^
  /usr/local/include/string.h:57:1: error: expected unqualified-id
  const char *hashkit_string_c_str(const hashkit_string_st* self);
  ^
  In file included from src/cchardet/_cchardet.cpp:4:
  In file included from /Users/xen/.pyenv/versions/3.6.5/include/python3.6m/Python.h:34:
  In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/stdlib.h:94:
  In file included from /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/stdlib.h:66:
  In file included from /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/wait.h:110:
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:196:2: error: unknown type name 'uint8_t'
          uint8_t  ri_uuid[16];
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:197:2: error: unknown type name 'uint64_t'
          uint64_t ri_user_time;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:198:2: error: unknown type name 'uint64_t'
          uint64_t ri_system_time;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:199:2: error: unknown type name 'uint64_t'
          uint64_t ri_pkg_idle_wkups;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:200:2: error: unknown type name 'uint64_t'
          uint64_t ri_interrupt_wkups;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:201:2: error: unknown type name 'uint64_t'
          uint64_t ri_pageins;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:202:2: error: unknown type name 'uint64_t'
          uint64_t ri_wired_size;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:203:2: error: unknown type name 'uint64_t'
          uint64_t ri_resident_size;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:204:2: error: unknown type name 'uint64_t'
          uint64_t ri_phys_footprint;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:205:2: error: unknown type name 'uint64_t'
          uint64_t ri_proc_start_abstime;
          ^
  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/sys/resource.h:206:2: error: unknown type name 'uint64_t'
          uint64_t ri_proc_exit_abstime;
          ^
  fatal error: too many errors emitted, stopping now [-ferror-limit=]
  20 errors generated.
  error: command 'clang' failed with exit status 1
  
  ----------------------------------------
  Running setup.py clean for cchardet
Failed to build cchardet

Steps to reproduce the behavior

pipenv install aiohttp

Incorrect detection of GB18030 as ISO-8859-16

OS/Arch

$ python -c 'import platform;print(platform.uname())'
('Linux', 'linux-lwww', '5.1.7-1-default', '#1 SMP Tue Jun 4 07:56:54 UTC 2019 (55f2451)', 'x86_64', 'x86_64')

Python version

$ python3 --version
Python 3.7.2

cChardet version

$ python -c 'import cchardet;print(cchardet.__version__)'
2.1.4

What is the problem?

b'\xc4\xe3\xba\xc3' is detected as {'encoding': 'ISO-8859-16', 'confidence': 0.3758675158023834} which renders as ÄășĂ

Expected behavior

It should be detected as GB18030, which decodes to 你好.
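
To illustrate why such a short input is ambiguous, the sketch below decodes the same four bytes with several candidate codecs using only the standard library. `decodings` is a hypothetical helper; the candidate list is chosen for this example.

```python
def decodings(data, candidates):
    """Map each candidate codec name to the decoded text, skipping codecs that fail."""
    out = {}
    for name in candidates:
        try:
            out[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Either the bytes are invalid for this codec or Python lacks it.
            continue
    return out


sample = b"\xc4\xe3\xba\xc3"
for codec, text in decodings(sample, ["gb18030", "iso8859_16", "utf-8"]).items():
    print(codec, repr(text))
```

Because more than one codec accepts these bytes without error, a statistical detector has very little evidence to work with on a four-byte sample.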

Error building cchardet

My system is CentOS release 5.8 (Final), gcc 4.1
python setup.py build

// some code
g++ -pthread -shared build/temp.linux-x86_64-2.7/src/cchardet/_cchardet.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/charsetdetect.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangFrenchModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsCharSetProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsEUCTWProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangHebrewModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangPolishModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsMBCSSM.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsUniversalDetector.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsSJISProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangHungarianModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsMBCSGroupProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsSBCharSetProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsEscSM.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsSBCSGroupProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsUTF8Prober.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsEUCJPProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangBulgarianModel.o 
build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsBig5Prober.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsGB2312Prober.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangFinnishModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangSpanishModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsEscCharsetProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangCzechModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsLatin1Prober.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/JpCntx.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/CharDistribution.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsHebrewProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangGreekModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangGermanModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangSwedishModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangTurkishModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/nsEUCKRProber.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangThaiModel.o build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangCyrillicModel.o -o 
build/lib.linux-x86_64-2.7/cchardet/_cchardet.so
/usr/bin/ld: build/temp.linux-x86_64-2.7/src/ext/libcharsetdetect/charsetdetect.o: relocation R_X86_64_PC32 against `nsUniversalDetector::nsUniversalDetector(unsigned int)' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: ld returned 1 exit status
error: command 'g++' failed with exit status 1

Result equals None

OS/Arch

$ python -c 'import platform;print(platform.uname())'

win10/ubuntu18.03

Python version

python3.7

$ python --version

cChardet version

$ python -c 'import cchardet;print(cchardet.__version__)'

What is the problem?

import cchardet
x=cchardet.detect('我是菜鸡'.encode(encoding='GBK'))
print(x)

Expected behavior

{'encoding': 'GB18030', 'confidence': ***}

Actual behavior

{'encoding': None, 'confidence': None}

Why

Why is the result None?
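
Independently of why the detector returns None here, calling code can guard against that result. The following sketch is an assumption about how a caller might do so; `decode_detected` and its `fallback` parameter are hypothetical names, not part of cchardet.

```python
def decode_detected(data, result, fallback="utf-8"):
    """Decode data using a detect()-style result, tolerating encoding=None."""
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Last resort: decode with the fallback and mark bad bytes with U+FFFD.
        return data.decode(fallback, errors="replace")
```

With this guard, a result of {'encoding': None, 'confidence': None} no longer leads to a TypeError from bytes.decode(None).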

DLL load failed while importing _cchardet

OS/Arch

win10/x64

$ python -c 'import cchardet as chardet'

Python version

python 3.8.2 x64

cChardet version

cchardet 2.1.6

What is the problem?

  File "C:\DevPy\DnsOverHTTPS\test.py", line 2, in <module>
    import cchardet as chardet
  File "C:\Py3\lib\site-packages\cchardet\__init__.py", line 1, in <module>
    from cchardet import _cchardet
ImportError: DLL load failed while importing _cchardet

I switched to the build from

https://www.lfd.uci.edu/~gohlke/pythonlibs/#cchardet
and got no errors, so the bug appears to be in your imports.

Clarify the intent of COPYING

The COPYING text file mentions three (incompatible) licenses:

  • GPL 2
  • LGPL 2.1
  • MPL 1.1

Is cChardet multi-licensed? That is, can I freely choose between the licenses?
or
Does each license apply to some parts (files) of cChardet? If so, which licenses apply to which parts?

In any case, I'd like to see the intent clarified within the COPYING file itself. A good example of how to do so is moodycamel::concurrentqueue's LICENSE.

Thanks for your work on cChardet.

Installation fails

OS/Arch

OSX 10.15.2
uname_result(system='Darwin', release='19.2.0', version='Darwin Kernel Version 19.2.0: Sat Nov 9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64', machine='x86_64', processor='i386')

Python version

Python 3.6.0

What is the problem?

pip installation fails with the following error

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
    clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
    ld: library not found for -lstdc++
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command '/usr/bin/clang++' failed with exit status 1

Steps to reproduce the behavior

pip install cchardet

cChardet returns {'encoding': None, 'confidence': None} on very large file

OS/Arch


Python version

cChardet version

What is the problem?

When passing a very large file to the .feed() function line by line, cchardet is unable to determine the encoding.

head -X on my file shows that it is tab-delimited text, while file -i reports application/octet-stream; charset=binary. This may be my issue, but I've read that file is bad at determining encoding, so I'm not sure.

Expected behavior

cchardet should return some non-None, non-Unicode result (e.g., ascii or win-1252), either before the end of the file or once it finishes.

Actual behavior

cchardet seemingly consumes the whole file but in the end returns {'encoding': None, 'confidence': None}

Steps to reproduce the behavior

Sample data cannot be provided due to PII in the file, but I'm using this form:

import pathlib
import cchardet as chardet

target = pathlib.Path('\\\\path\\to\\file')

detector = chardet.UniversalDetector()
detector.reset()

i = 0
with open(target, "rb") as f:
    print(f'Reading {target}')
    for row in f:
        result = detector.feed(row)
        i +=1
        if i%10000 == 0:
            print(f'Line {i}')
        if detector.done:
            break
    print(detector.result)
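
In chardet's UniversalDetector protocol, which cchardet's class of the same name appears to mirror, the final result is only published once close() is called. The reproduction above never calls close(), which may be why result stays None; this is an assumption and has not been confirmed against the reporter's file. A sketch of the loop with that step added, where `detect_stream` is a hypothetical helper that works with any object offering reset(), feed(), close(), done and result:

```python
def detect_stream(chunks, detector):
    """Feed an iterable of byte chunks to a UniversalDetector-style object."""
    detector.reset()
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            break
    # Finalize detection; without this the detector may never set its result.
    detector.close()
    return detector.result


# Hypothetical usage with cchardet installed:
#   import cchardet
#   with open(path, "rb") as f:
#       print(detect_stream(f, cchardet.UniversalDetector()))
```

Iterating a file opened in binary mode yields one bytes object per line, so the file object can be passed directly as `chunks`.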

Incorrect detection of utf-8-sig encoding

Test data:
data = "some text".encode('utf-8-sig')

Detection result:
cchardet 1.1.3: {'encoding': 'UTF-8', 'confidence': 0.9900000095367432}
cchardet 2.0.0: {'encoding': 'UTF-8', 'confidence': 0.9900000095367432}
chardet 3.0.2: {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
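
The distinction matters when decoding: Python's utf-8 codec keeps the byte order mark as U+FEFF at the start of the text, while utf-8-sig strips it. A caller can recover the chardet behaviour with a small check on top of the detector's answer. This is a sketch; `refine_utf8_sig` is a hypothetical helper, and `codecs.BOM_UTF8` is the standard-library constant for the three BOM bytes.

```python
import codecs


def refine_utf8_sig(data, encoding):
    """Return 'UTF-8-SIG' when a UTF-8 result comes with a leading BOM."""
    if encoding and encoding.upper() == "UTF-8" and data.startswith(codecs.BOM_UTF8):
        return "UTF-8-SIG"
    return encoding


data = "some text".encode("utf-8-sig")
print(refine_utf8_sig(data, "UTF-8"))
```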

crashed

if line.startswith('<ALT_TEXT>'):
    alt = line[10:-12]
    charset = cchardet.detect(alt)['encoding']
    try:
        alt = alt.decode(charset)
    except:
        alt = ''
        print('alt except')

The crashed program seems to use third-party or local libraries:

/usr/local/lib/python2.7/dist-packages/cchardet-0.3.1-py2.7-linux-i686.egg/cchardet/_cchardet.so

It is highly recommended to check if the problem persists without those first.

Do you want to continue the report process anyway?

misdetecting windows-1251 for mac-cyrillic

Dialog.tra is in fact windows-1251.

import sys
import cchardet as chardet

# Path of the file to inspect, taken from the command line.
path = sys.argv[1]
with open(path, "rb") as f:
    mess = f.read()
result = chardet.detect(mess)
print(result)
/1.py dialog.tra
{'confidence': 0.8082455396652222, 'encoding': u'MAC-CYRILLIC'}

dialog.tra.zip

Wheel support for linux aarch64[arm64]

Problem Description:
On aarch64, pip install cChardet builds the wheel from source, which takes longer than downloading and extracting a prebuilt wheel from PyPI. Building the wheel also requires a development environment to be installed on the user's system.

Resolution:
On aarch64, pip install cChardet should be able to download a prebuilt wheel from PyPI.

@PyYoshi, please let me know if I can help with building and uploading aarch64 wheels to PyPI by adding aarch64 support to build-linux.yml.

cChardet is not working in Docker python:slim

Installing collected packages: cchardet
  Running setup.py install for cchardet
    Complete output from command /usr/local/bin/python3.4 -c "import setuptools, tokenize;__file__='/tmp/pip-build-hwermwsb/cchardet/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-2z18myrw-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.4
    creating build/lib.linux-x86_64-3.4/cchardet
    copying src/cchardet/__init__.py -> build/lib.linux-x86_64-3.4/cchardet
    running build_ext
    building 'cchardet._cchardet' extension
    creating build/temp.linux-x86_64-3.4
    creating build/temp.linux-x86_64-3.4/src
    creating build/temp.linux-x86_64-3.4/src/cchardet
    creating build/temp.linux-x86_64-3.4/src/ext
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect/mozilla
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect/mozilla/extensions
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect/mozilla/extensions/universalchardet
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src
    creating build/temp.linux-x86_64-3.4/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base
    gcc -pthread -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Isrc/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/ -Isrc/ext/libcharsetdetect/nspr-emu/ -Isrc/ext/libcharsetdetect/ -I/usr/local/include/python3.4m -c src/cchardet/_cchardet.cpp -o build/temp.linux-x86_64-3.4/src/cchardet/_cchardet.o
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
Command "/usr/local/bin/python3.4 -c "import setuptools, tokenize;__file__='/tmp/pip-build-hwermwsb/cchardet/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-2z18myrw-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-hwermwsb/cchardet
The command '/bin/sh -c pip install cchardet' returned a non-zero code: 1

Probably some required packages are missing. It would be helpful if setup could install them or at least provide a clear error message.
