roytam1 / rtoss

Automatically exported from code.google.com/p/rtoss

C++ 11.95% C 81.76% PHP 0.67% CSS 0.02% VCL 0.04% Shell 0.71% NSIS 0.04% Makefile 2.90% Java 0.59% HTML 0.13% Perl 0.60% Smarty 0.08% Objective-C 0.14% TeX 0.25% Logos 0.01% XSLT 0.01% C# 0.01% JavaScript 0.01% R 0.01% Python 0.08%

rtoss's People

Contributors

Stargazers

Watchers

Forkers

laerm0 h0kd33 prcuvu xingyuncaoxyz ramonunch theaibotos beiklive

rtoss's Issues

[16Edit] Properly open files >4GB

Opening files >4GB will still require a 64-bit build, because 16Edit uses memory mapping to open files.

For now such a file opens in the 64-bit build, but only (filesize % 4GB) bytes are displayed, so many places in the code still need to be changed for 64-bit.
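
The (filesize % 4GB) symptom is what one would expect if the 64-bit file size passes through a 32-bit variable somewhere along the way. A minimal sketch of that arithmetic follows; it is not 16Edit's actual code, and `truncated_size` is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/* Hypothetical illustration, not taken from 16Edit: the value that survives
 * when a 64-bit file size is stored in a 32-bit variable. The conversion to
 * uint32_t keeps only the low 32 bits, i.e. file_size % 4 GiB. */
static uint32_t truncated_size(uint64_t file_size)
{
    return (uint32_t)file_size;
}
```

With this, a 5 GiB file would be reported as 1 GiB, while any size below 4 GiB passes through unchanged. Presumably every size, offset and length that touches the mapped view would need a 64-bit type to avoid this.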

GB18030 support for GreenPad

What steps will reproduce the problem?
1. Open any text file encoded in GB18030 with 4-byte char sequences.
2. Auto detect file encoding.


What is the expected output? What do you see instead?
It is expected to decode the file as GB18030 or GBK.
GreenPad decodes the file as Shift-JIS or Turkish.


What version of the product are you using? On what operating system?
SVN revision 166. XP SP3.


Please provide any additional information below.
I have made local modifications to use CP54936 instead of CP936 for Chinese text, and implemented my own CharNext and detection for GB18030, because the Windows API handles it poorly. The patch is attached.

It seems to work in my test cases, but I hope someone can run more thorough tests so that it becomes robust enough to merge into mainline.

However, the patch has one known problem: the 1-byte Euro sign (0x80) in CP936 no longer works in CP54936. Maybe a better solution would be to separate the GBK and GB18030 handling routines.
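
For readers unfamiliar with why GB18030 needs its own CharNext, here is a minimal sketch of a character-length routine based on the GB18030 byte ranges (1-byte, 2-byte with a trail byte in 0x40-0x7E or 0x80-0xFE, and 4-byte with digits 0x30-0x39 in the second and fourth positions). This is not the attached patch; `gb18030_char_len` is a hypothetical name, and treating malformed input as a single byte is an assumption made so that a caller always advances.

```c
#include <stddef.h>

/* Hypothetical helper, not the code from the patch: returns the byte length
 * (1, 2 or 4) of the GB18030 character starting at s, where n bytes are
 * available. Returns 0 only for an empty buffer. Malformed or truncated
 * input is treated as one byte so a caller stepping through always advances. */
static size_t gb18030_char_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] < 0x81 || s[0] > 0xFE || n < 2)
        return 1;   /* ASCII, 0x80, 0xFF, or a lead byte at end of buffer */
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;   /* two-byte form, shared with GBK */
    if (s[1] >= 0x30 && s[1] <= 0x39 && n >= 4 &&
        s[2] >= 0x81 && s[2] <= 0xFE &&
        s[3] >= 0x30 && s[3] <= 0x39)
        return 4;   /* four-byte form, specific to GB18030 */
    return 1;       /* malformed sequence */
}
```

A CharNext replacement could then advance by `gb18030_char_len(p, remaining)` bytes. The sketch steps over a lone 0x80 as one byte, but whether it decodes as the Euro sign still depends on the code page used for conversion, which is the known problem described above.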

Original issue reported on code.google.com by [email protected] on 1 Jan 2011 at 6:37

[libchardet] Improper UTF-8 detection of some files (regression?)

Try with this random gibberish I typed in Windows-1252:

notU8.txt

It is now auto-detected as UTF-8 by the latest chardet.dll (both when I build it myself and when I take it from your latest GreenPad-1.08.3 build), even though it contains invalid sequences.

In fact, this sequence alone is enough to be detected as UTF-8:
é"èr in cp1252, i.e. E9 22 E8 72, which is invalid UTF-8. E9 is the lead byte of a 3-byte sequence, so the next two bytes should be of the form 10xxxxxx, but 22 is not. E8 is also a 3-byte lead byte and it is followed by 72, so it cannot be there either.
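
The byte-level argument above can be checked with a strict validator. The following is a minimal sketch, not libchardet's detection logic (which is statistical); `is_valid_utf8` is a hypothetical name, and rejecting overlong forms and surrogates follows the UTF-8 definition rather than anything stated in this report.

```c
#include <stddef.h>

/* Hypothetical strict check, not part of libchardet: returns 1 if
 * buf[0..n) is well-formed UTF-8 and 0 otherwise. */
static int is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = buf[i];
        size_t need;            /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded code point and smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation byte or invalid lead */

        if (n - i - 1 < need)
            return 0;           /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;       /* not of the form 10xxxxxx */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;           /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return 1;
}
```

Fed the four bytes E9 22 E8 72, this returns 0 at the second byte, since 22 is not a continuation byte. A detector that wanted to tolerate a minority of bad sequences could count failures instead of returning at the first one.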

Maybe some invalid sequences could be tolerated but they should be a minority.

If I use an older chardet.dll version I do not have the problem.
