rtoss's People

Contributors

roytam1, ramonunch

rtoss's Issues

[16Edit] properly open files >4GB

Opening files >4GB will still require a 64-bit build, because 16Edit uses memory mapping to open files.

For now such a file opens in the 64-bit build, but only (filesize % 4GB) bytes are displayed, so many places in the code still need to be changed for 64-bit.
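
The (filesize % 4GB) symptom is what one would expect if a 64-bit file size ends up in a 32-bit variable somewhere along the way. A minimal Python sketch of that arithmetic, assuming this is the cause (the helper name `size_seen_by_32bit_code` is hypothetical and not part of 16Edit):

```python
GIB = 1 << 30  # one gibibyte in bytes


def size_seen_by_32bit_code(file_size: int) -> int:
    """Return what remains of file_size when only its low 32 bits are kept."""
    return file_size & 0xFFFFFFFF  # equivalent to file_size % (4 * GIB)


# A 5 GiB file would appear to be only 1 GiB long.
print(size_seen_by_32bit_code(5 * GIB) // GIB)
```

Files below 4 GiB are unaffected by the truncation, which would explain why the problem only shows up with very large files.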

GB18030 support for GreenPad

What steps will reproduce the problem?
1. Open any text file encoded in GB18030 with 4-byte char sequences.
2. Auto detect file encoding.


What is the expected output? What do you see instead?
It is expected to decode the file as GB18030 or GBK.
GreenPad decodes the file as Shift-JIS or Turkish.


What version of the product are you using? On what operating system?
SVN revision 166. XP SP3.


Please provide any additional information below.
I have made local modifications to use CP54936 instead of CP936 for Chinese 
text, and implemented CharNext and detection for GB18030, as the M$ Windoze API 
sucks. Here is the patch attached.

It seems to work in my test cases, but I hope someone can conduct more
thorough tests so that it becomes robust enough to merge into mainstream.

However, the patch has one known problem: the 1-byte Euro sign (0x80) in CP936
no longer works in CP54936. A better solution might be to separate the GBK
and GB18030 handling routines.
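
For readers unfamiliar with the encoding, a CharNext-style routine for GB18030 has to tell 1-, 2- and 4-byte sequences apart by looking at the first two bytes. The following Python sketch is not the attached patch; it only illustrates the byte ranges of GB18030, and the names `gb18030_char_len` and `split_chars` are hypothetical:

```python
def gb18030_char_len(buf: bytes, i: int) -> int:
    """Return the byte length of the GB18030 character starting at buf[i].

    Malformed or truncated sequences are treated as a single byte so that
    a caller stepping through the buffer always makes progress.
    """
    lead = buf[i]
    if not 0x81 <= lead <= 0xFE or i + 1 >= len(buf):
        return 1  # ASCII, 0x80/0xFF, or a lead byte at the end of the buffer
    second = buf[i + 1]
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2  # GBK-compatible two-byte form
    if 0x30 <= second <= 0x39 and i + 3 < len(buf):
        if 0x81 <= buf[i + 2] <= 0xFE and 0x30 <= buf[i + 3] <= 0x39:
            return 4  # four-byte form added by GB18030
    return 1


def split_chars(buf: bytes):
    """Split buf into per-character byte strings using gb18030_char_len."""
    out, i = [], 0
    while i < len(buf):
        n = gb18030_char_len(buf, i)
        out.append(buf[i:i + n])
        i += n
    return out
```

For example, `split_chars("a中\U00020000".encode("gb18030"))` yields three pieces of 1, 2 and 4 bytes. Note that this sketch treats a lone 0x80 as a single byte without assigning it a meaning, so the Euro sign question raised above remains a separate decision.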

Original issue reported on code.google.com by [email protected] on 1 Jan 2011 at 6:37

[libchardet] Improper UTF-8 detection of some files (regression?)

Try it with this random gibberish I typed in Windows-1252:

notU8.txt

It is now auto-detected as UTF-8 by the latest chardet.dll (whether I build it myself or take it from your latest GreenPad-1.08.3 build).
However, it contains invalid sequences.

In fact, this sequence alone is detected as UTF-8:
é"èr in cp1252, i.e. E9 22 E8 72, which is invalid UTF-8. E9 starts a 3-byte sequence, so the following byte should be of the form 10xxxxxx, but 22 is not; E8 is also a 3-byte lead byte and cannot be followed by 72 either.

Maybe some invalid sequences could be tolerated but they should be a minority.
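
The byte-level argument can be checked with a short Python sketch that counts well-formed multi-byte sequences against bytes that cannot be part of one. This is not libchardet's detection logic, only an illustration; the function names are hypothetical, and the check is simplified (it does not reject overlong forms or encoded surrogates):

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a value never used as a lead byte


def count_utf8_sequences(data: bytes):
    """Return (valid_multibyte, invalid) counts for data."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        tail = data[i + 1:i + n]
        if n == 1:
            i += 1  # plain ASCII, counts as neither
        elif n >= 2 and len(tail) == n - 1 and all(0x80 <= b <= 0xBF for b in tail):
            valid += 1
            i += n
        else:
            invalid += 1  # bad lead byte, or missing continuation bytes
            i += 1
    return valid, invalid


print(count_utf8_sequences(bytes([0xE9, 0x22, 0xE8, 0x72])))  # (0, 2)
```

With zero valid multi-byte sequences and two invalid bytes, a detector that tolerates only a minority of invalid sequences would have no grounds to report this sample as UTF-8.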

If I use an older chardet.dll version I do not have the problem.
