Comments (9)

fdegros commented on May 3, 2024

Thanks for the report.

This works for me on Linux. But there is a subtlety with std::iscntrl() which might explain the difference.

Please check if 823c952 fixes it for you.

from mount-zip.

0-wiz-0 commented on May 3, 2024

Thanks for looking at this.
With the same test files, it's a bit better now:

# ./work/mount-zip/mount-zip  /tmp/test-cp437.zip /mnt/
mount-zip[15268]: Bad file name: '


'
mount-zip[15268]: Skipped File [0]: Cannot normalize path
mount-zip[15268]: Bad file name: '
ount-zip[15268]: Skipped File [1]: Cannot normalize path
# ./work/mount-zip/mount-zip  /tmp/test-utf8.zip  /mnt/
# ls /mnt/
ÄÖÜäöüßćçĉéèêëē

fdegros commented on May 3, 2024

There are a couple of things happening with test-cp437.zip.

First, libicu erroneously detects the filename encoding as GB18030:

$ mount-zip --verbose test-cp437.zip mnt
mount-zip[855602]: Detected encoding GB18030 with 32% confidence

$ tree mnt
mnt
├── !"#$%&'()*+,-.
│   └── 0
├── 、¥ウЖ┆
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~�
├── 亗儎厗噲墛媽崕彁
├── 岩釉罩棕仝圮蒉哙
├── 徕沅彐玷殛腱眍镳
├── 憭摂晼棙櫄洔潪煚
├── 耱篝貊鼬��
├── 谅媚牌侨墒颂臀闲
└── 辈炒刀犯购患骄坷

1 directory, 14 files

This filename encoding detection can be overridden by providing the -o encoding=cp437 option:

$ mount-zip -o encoding=cp437 test-cp437.zip mnt

$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñѪº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■  
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσμτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α

1 directory, 14 files
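
To illustrate why the detection can go wrong, here is a minimal Python sketch. It is not mount-zip code: Python's built-in codecs stand in for ICU, and the two may differ in details. The byte run 0x81..0x90 (which appears to correspond to one of the test file names) is valid in both encodings, so a detector can only guess from statistics:

```python
# Sketch only: Python's built-in codecs are used here instead of ICU.
raw = bytes(range(0x81, 0x91))  # 16 bytes, 0x81 through 0x90

as_cp437 = raw.decode("cp437")      # single-byte encoding: one character per byte
as_gb18030 = raw.decode("gb18030")  # here every byte pair forms one character

print(len(as_cp437), as_cp437)
print(len(as_gb18030), as_gb18030)
```

Both decodings succeed without error, which is why an explicit -o encoding=... option is useful when the detector guesses wrong.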

Second, files at indices [0] and [1] contain control codes in their names. For security reasons, these are considered invalid and are simply skipped. This explains why the mounted ZIP only contains 14 files instead of the 16 files initially present in the ZIP:

$ unzip -l test-cp437.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2012-02-18 06:51   ^A^B^C^D^E^F^G^I^H^J^K^L^M^N^O^P
        0  2012-02-18 06:51   ^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ 
        0  2012-02-18 06:51   !"#$%&'()*+,-./0
        0  2012-02-18 06:52   123456789:;<=>?@
        0  2012-02-18 06:52   ABCDEFGHIJKLMNOP
        0  2012-02-18 06:52   QRSTUVWXYZ[\]^_`
...
---------                     -------
        0                     16 files

Note that the file at index [2] contains a slash in its name, which is correctly interpreted as a directory separator.
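
As a small illustration of that interpretation (a Python sketch, not how mount-zip itself splits paths), the entry name breaks into a directory component and a file component at the slash:

```python
from pathlib import PurePosixPath

# Name of the entry at index [2], as listed by unzip.
entry_name = "!\"#$%&'()*+,-./0"

# ZIP entry names use '/' as the separator, so POSIX path rules apply.
parts = PurePosixPath(entry_name).parts
print(parts)
```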

At the moment, mount-zip doesn't try to escape or modify filenames that are considered invalid. Invalid filenames are simply skipped and ignored.
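
The skip-instead-of-escape policy can be sketched as follows. This is a hypothetical Python illustration, not mount-zip's actual check: the helper names has_control_chars and keep_valid are made up, and the Unicode category Cc is assumed as the definition of "control code".

```python
import unicodedata


def has_control_chars(name: str) -> bool:
    """Return True if any character is in Unicode category Cc (control)."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


def keep_valid(names):
    """Drop names containing control characters; keep the rest unchanged."""
    return [name for name in names if not has_control_chars(name)]


names = ["\x01\x02\x03", "abcdefghijklmnop", "tab\there"]
print(keep_valid(names))
```

Names are either kept verbatim or dropped; nothing is rewritten, which mirrors the behaviour described above.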

fdegros commented on May 3, 2024

Now, passing -o encoding=libzip uses libzip's internal filename encoding detection and conversion (which only works with UTF-8 and CP437). This completely short-circuits the detection and conversion normally provided by ICU.

$ mount-zip -o encoding=libzip test-cp437.zip mnt

$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── ◄↕‼¶§▬↨↑↓→←∟↔▲▼ 
├── ☺☻♥♦♣♠•○◘◙♂♀♪♫☼►
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñѪº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■  
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσµτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α

1 directory, 16 files

Note that libzip treats control characters in the range 1–31 differently from ICU. It converts them to graphical characters ☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼.
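
A rough Python sketch of that mapping, for illustration only. It is not libzip's implementation: it decodes with Python's cp437 codec (which keeps bytes 1–31 as control characters) and then substitutes the glyphs listed above. The function name decode_cp437_graphical is made up.

```python
# Glyphs for CP437 codes 1..31, in order, as listed in the comment above.
CP437_GRAPHICS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"

# Translation table: control code point -> graphical character.
TABLE = {code: glyph for code, glyph in enumerate(CP437_GRAPHICS, start=1)}


def decode_cp437_graphical(raw: bytes) -> str:
    """Decode CP437 bytes, showing codes 1..31 as graphical characters."""
    return raw.decode("cp437").translate(TABLE)


# Sixteen bytes 0x01..0x10 with 0x08 and 0x09 swapped.
name = b"\x01\x02\x03\x04\x05\x06\x07\x09\x08\x0a\x0b\x0c\x0d\x0e\x0f\x10"
print(decode_cp437_graphical(name))
```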

0-wiz-0 commented on May 3, 2024

Thanks.
I don't know where ICU gets its conversion tables from, but libzip's output matches https://en.wikipedia.org/wiki/Code_page_437

fdegros commented on May 3, 2024

Right. And these characters bring back memories from using and programming on MS-DOS... 🙂

The file test-cp437.zip seems to check every character from the CP437 charset, but I noticed two irregularities and I'm curious about them.

The file [0] is named \x01\x02\x03\x04\x05\x06\x07\x09\x08\x0a\x0b\x0c\x0d\x0e\x0f\x10. Note that the control codes \x08 and \x09 are swapped. Is there any particular reason for that?

The file [7] is named qrstuvwxyz{|}~~\x80. Note that the character ~ is repeated and the control code \x7f is skipped. Is there any particular reason for that?

fdegros commented on May 3, 2024

I added a test and noticed another difference between the way ICU and libzip treat CP437 code 0xE6.

ICU maps this code to μ U+03BC (Greek Small Letter Mu) whereas libzip maps this code to µ U+00B5 (Micro Sign).
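
The two characters look identical in most fonts but are distinct code points. A short Python sketch shows the difference; it only inspects the code points named above and does not call ICU or libzip. As far as I can tell, Python's own cp437 codec picks the Micro Sign for 0xE6.

```python
import unicodedata

micro = "\u00b5"  # libzip's choice for CP437 0xE6, per the comment above
mu = "\u03bc"     # ICU's choice for CP437 0xE6, per the comment above

print(unicodedata.name(micro))
print(unicodedata.name(mu))

# The strings compare unequal unless they are normalized first.
print(micro == mu)
print(unicodedata.normalize("NFKC", micro) == mu)

# Python's built-in codec, for comparison.
py_choice = bytes([0xE6]).decode("cp437")
print(unicodedata.name(py_choice))
```

Compatibility normalization (NFKC) folds the Micro Sign into the Greek letter, so a byte-for-byte comparison of file names produced by the two libraries would differ while a normalized comparison would not.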

0-wiz-0 commented on May 3, 2024

I think the reason for swapping \x08 and \x09 is that the 'raw' test case is easier to read if the character is not at the end of the line. But I've changed this for consistency.
I don't remember why I didn't test 0x7f - I've changed that now.
As for the mu character -- Wikipedia says it's a Micro Sign. I don't know why ICU uses the Greek Mu instead. Do you know?
Thanks for the suggestions!

fdegros commented on May 3, 2024

A few code points of the CP437 charset can have multiple interpretations. I think this is OK. And we now have automated tests asserting both libzip's and ICU's conversions in mount-zip. All good.
