Comments (5)
Same problem here.
from compact_enc_det.
Unfortunately I don't have plans to improve the detection quality. Could you share the data you get poor results with? I can take a look. Hopefully there is something we can do to work around the issue.
Thanks.
See the attached file, which is encoded in Windows-1252 but detected as GB18030. Thanks for helping!
I'm attaching test files along with the results I get. As you can see, I'm satisfied with the Unicode detection (and mostly with the Asian encodings), but really disappointed by the ISO-8859 results, particularly for Western European languages (ISO-8859-1 and ISO-8859-15, for instance, which are really widespread; see
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html).
Thank you in any case!
big5-hkscs.txt → BIG5_HKSCS, reliable: false → OK
big5.txt → BIG5, reliable: false → OK
BIG5.txt → BIG5, reliable: true → OK
euc-jp.txt → GB (=GBA8030?), reliable: false → ~OK
euc-kr.txt → KSC (=?), reliable: false → OK
gbk.txt → GB (=GBA8030?), reliable: false → OK
IBM855.txt → CP-1256, reliable: false → Not OK
ISO-8859-15-CRLF.srt → ASCII, reliable: true → Not OK
ISO-8859-15 euro.txt → ASCII, reliable: true → Not OK
ISO-8859-15 petit test.txt → CP1250, reliable: true → Not OK
ISO-8859-15.srt → ASCII, reliable: true → Not OK
ISO-8859-1.srt → ASCII, reliable: true → Not OK
ISO-8859-6.srt → Arabic, reliable: true → OK
shift_jis.txt → SJC, reliable: true → OK
UTF16BE.srt → UTF16BE, reliable: false → OK
UTF16LE.srt → UTF16LE, reliable: false → OK
UTF-7.txt → ASCII-7 bits, reliable: true → OK
UTF8BOM.srt → UTF8, reliable: true → OK
utf-8 CN.txt → UTF8, reliable: true → OK
UTF8CRLF.srt → UTF8, reliable: true → OK
UTF8CR.srt → UTF8, reliable: true → OK
UTF8LF.srt → UTF8, reliable: true → OK
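As a rough illustration of why the Western European cases above are hard to tell apart, here is a minimal Python sketch (not part of compact_enc_det, and not how it works internally). It simply tries to decode the same bytes with several codecs and reports which ones succeed. The candidate list, the `decodable_as` helper, and the sample strings are assumptions made up for this example.

```python
# Illustrative candidate list; this is NOT the set of encodings
# that compact_enc_det considers.
CANDIDATES = ["ascii", "utf-8", "cp1252", "iso8859-1", "iso8859-15", "gb18030"]


def decodable_as(data: bytes, candidates=CANDIDATES) -> dict:
    """Return {codec name: decoded text} for every codec that accepts `data`."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence outright.
            continue
    return results


if __name__ == "__main__":
    # Byte 0xA4 is the currency sign in ISO-8859-1 and Windows-1252,
    # but the euro sign in ISO-8859-15.
    for name, text in decodable_as(b"prix: 5 \xa4").items():
        print(name, repr(text))

    # Accented letters followed by ASCII letters can also form
    # byte pairs that a multi-byte codec accepts.
    print(sorted(decodable_as("élève".encode("cp1252"))))
```

Strict decoding alone cannot separate these encodings, since the single-byte Latin codecs accept almost any byte sequence. A detector has to rely on statistics about which bytes and byte pairs are likely, which is where short or mostly-ASCII files give it little to work with.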
Just on the off chance: do you have any idea how this might be tackled, in case somebody else wants to take a crack at it?