Cannot normalize path about mount-zip HOT 9 CLOSED

google commented on May 3, 2024

Cannot normalize path

Comments (9)

fdegros commented on May 3, 2024

Thanks for the report.

This works for me on Linux, but there is a subtlety with std::iscntrl() that might explain the difference.

Please check if 823c952 fixes it for you.

0-wiz-0 commented on May 3, 2024

Thanks for looking at this.
With the same test files, it's a bit better now:

# ./work/mount-zip/mount-zip  /tmp/test-cp437.zip /mnt/
mount-zip[15268]: Bad file name: '


'
mount-zip[15268]: Skipped File [0]: Cannot normalize path
mount-zip[15268]: Bad file name: '
ount-zip[15268]: Skipped File [1]: Cannot normalize path
# ./work/mount-zip/mount-zip  /tmp/test-utf8.zip  /mnt/
# ls /mnt/
ÄÖÜäöüßćçĉéèêëē

fdegros commented on May 3, 2024

There are a couple of things happening with test-cp437.zip.

First, libicu erroneously detects the filename encoding as GB18030:

$ mount-zip --verbose test-cp437.zip mnt
mount-zip[855602]: Detected encoding GB18030 with 32% confidence

$ tree mnt
mnt
├── !"#$%&'()*+,-.
│   └── 0
├── 、￥ウЖ┆
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~�
├── 亗儎厗噲墛媽崕彁
├── 岩釉罩棕仝圮蒉哙
├── 徕沅彐玷殛腱眍镳
├── 憭摂晼棙櫄洔潪煚
├── 耱篝貊鼬��
├── 谅媚牌侨墒颂臀闲
└── 辈炒刀犯购患骄坷

1 directory, 14 files

This filename encoding detection can be overridden by providing the -o encoding=cp437 option:

$ mount-zip -o encoding=cp437 test-cp437.zip mnt

$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñÑªº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■  
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσμτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α

1 directory, 14 files
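
To see why the detected encoding matters, the same raw bytes can be decoded under both encodings. The sketch below uses Python's built-in codecs rather than ICU, and it assumes that the file listed as üéâäàåçêëèïîìÄÅÉ stores the bytes 0x81 to 0x90 as its name, so the GB18030 result may not match ICU's output exactly.

```python
# Assumption: the raw name of this test file is the 16 bytes 0x81..0x90.
raw_name = bytes(range(0x81, 0x91))

# CP437 maps every byte to exactly one character.
as_cp437 = raw_name.decode("cp437")

# GB18030 reads most of these bytes in pairs; errors="replace" keeps the
# sketch from raising if a byte sequence is not valid in that encoding.
as_gb18030 = raw_name.decode("gb18030", errors="replace")

print(as_cp437)
print(as_gb18030)
```

The bytes are identical in both cases; only the interpretation changes, which is why a wrong guess by the detector produces plausible-looking but unrelated names.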

Second, files at indices [0] and [1] contain control codes in their names. For security reasons, these are considered invalid and are simply skipped. This explains why the mounted ZIP only contains 14 files instead of the 16 files initially present in the ZIP:

$ unzip -l test-cp437.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2012-02-18 06:51   ^A^B^C^D^E^F^G^I^H^J^K^L^M^N^O^P
        0  2012-02-18 06:51   ^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ 
        0  2012-02-18 06:51   !"#$%&'()*+,-./0
        0  2012-02-18 06:52   123456789:;<=>?@
        0  2012-02-18 06:52   ABCDEFGHIJKLMNOP
        0  2012-02-18 06:52   QRSTUVWXYZ[\]^_`
...
---------                     -------
        0                     16 files

Note that the file at index [2] contains a slash in its name, which is correctly interpreted as a directory separator.

At the moment, mount-zip doesn't try to escape or modify filenames that are considered invalid. Invalid filenames are simply skipped and ignored.
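
The skip-on-control-code policy described above can be sketched in a few lines of Python. This is only an illustration, not mount-zip's actual C++ implementation: the helper names `has_control_code` and `keep_valid_names` are hypothetical, and the sketch assumes that "control code" means any character in Unicode category Cc.

```python
import unicodedata


def has_control_code(name: str) -> bool:
    """Return True if any character of the name is in Unicode category Cc."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


def keep_valid_names(names: list[str]) -> list[str]:
    """Drop names containing control codes; keep the others unchanged."""
    return [name for name in names if not has_control_code(name)]


# Hypothetical sample: one name made of control codes, two ordinary names.
names = ["\x01\x02\x03", "123456789:;<=>?@", "ABCDEFGHIJKLMNOP"]
print(keep_valid_names(names))
```

Rejected names are dropped rather than escaped, mirroring the behaviour described in this comment.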

fdegros commented on May 3, 2024

Now, passing -o encoding=libzip uses libzip's internal filename encoding detection and conversion (which only works with UTF-8 and CP437). This completely short-circuits the detection and conversion normally provided by ICU.

$ mount-zip -o encoding=libzip test-cp437.zip mnt

$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── ◄↕‼¶§▬↨↑↓→←∟↔▲▼ 
├── ☺☻♥♦♣♠•○◘◙♂♀♪♫☼►
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñÑªº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■  
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσµτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α

1 directory, 16 files

Note that libzip treats control characters in the range 1–31 differently from ICU. It converts them to graphical characters ☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼.
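
The graphical interpretation of codes 1–31 can be reproduced with a small lookup table. The sketch below is not libzip's code: the function name `decode_cp437_graphical` is hypothetical, the table is copied from the character list above, and Python's built-in cp437 codec (which leaves codes 1–31 as control characters) is assumed for every other byte.

```python
# Graphical glyphs for CP437 codes 1..31, in code order.
CP437_GRAPHICS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"


def decode_cp437_graphical(raw: bytes) -> str:
    """Decode CP437 bytes, showing codes 1..31 as graphical characters."""
    chars = []
    for byte in raw:
        if 1 <= byte <= 31:
            chars.append(CP437_GRAPHICS[byte - 1])
        else:
            chars.append(bytes([byte]).decode("cp437"))
    return "".join(chars)


# Name of file [0]: codes 0x01..0x10 with 0x08 and 0x09 swapped.
file0 = bytes([1, 2, 3, 4, 5, 6, 7, 9, 8, 10, 11, 12, 13, 14, 15, 16])
print(decode_cp437_graphical(file0))
```

Because 0x08 and 0x09 are swapped in the name of file [0], the glyphs ○ and ◘ appear in swapped order in the directory listing compared with the table.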

0-wiz-0 commented on May 3, 2024

Thanks.
I don't know where ICU gets their conversion tables from, but libzip's output matches https://en.wikipedia.org/wiki/Code_page_437

fdegros commented on May 3, 2024

Right. And these characters bring back memories from using and programming on MS-DOS... 🙂

The file test-cp437.zip seems to check every character from the CP437 charset, but I noticed two irregularities and I'm curious about them.

The file [0] is named \x01\x02\x03\x04\x05\x06\x07\x09\x08\x0a\x0b\x0c\x0d\x0e\x0f\x10. Note that the control codes \x08 and \x09 are swapped. Is there any particular reason for that?

The file [7] is named qrstuvwxyz{|}~~\x80. Note that the character ~ is repeated and the control code \x7f is skipped. Is there any particular reason for that?

fdegros commented on May 3, 2024

I added a test and noticed another difference between the way ICU and libzip treat CP437 code 0xE6.

ICU maps this code to μ U+03BC (Greek Small Letter Mu) whereas libzip maps this code to µ U+00B5 (Micro Sign).
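
The two characters look alike but are distinct code points, which a quick check makes visible. The sketch below uses Python's standard library instead of ICU or libzip: its cp437 codec happens to pick the Micro Sign for 0xE6, and Unicode compatibility normalization (NFKC) folds the Micro Sign into the Greek letter. Whether mount-zip applies any such normalization is not stated here.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # the character libzip produces for 0xE6
GREEK_MU = "\u03bc"    # the character ICU produces for 0xE6

for ch in (MICRO_SIGN, GREEK_MU):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Python's built-in cp437 codec maps 0xE6 to the Micro Sign.
decoded = bytes([0xE6]).decode("cp437")

# NFKC maps the Micro Sign to Greek Small Letter Mu; NFC leaves it alone.
folded = unicodedata.normalize("NFKC", MICRO_SIGN)
unchanged = unicodedata.normalize("NFC", MICRO_SIGN)
```

A byte-for-byte comparison of the two conversions will therefore report a mismatch for this one name even though the listings look the same on screen.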

0-wiz-0 commented on May 3, 2024

I think the reason for swapping \x08 and \x09 is that the 'raw' test case is easier to read if the character is not at the end of the line. But I've changed this for consistency.
I don't remember why I didn't test 0x7f - I've changed that now.
As for the mu character -- Wikipedia says it's a Micro Sign. I don't know why ICU uses the Greek Mu instead. Do you know?
Thanks for the suggestions!

fdegros commented on May 3, 2024

A few code points of the CP437 charset can have multiple interpretations. I think this is OK, and we now have automated tests asserting both libzip's and ICU's conversions in mount-zip. All good.