Thanks for the report.
This works for me on Linux. But there is a subtlety with std::iscntrl()
which might explain the difference.
Please check if 823c952 fixes it for you.
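The comment above doesn't say which subtlety is meant. Two well-known ones are that std::iscntrl() takes an int that must be representable as unsigned char (or be EOF), so passing a plain char holding a byte above 0x7F is undefined behavior on platforms where char is signed, and that its result depends on the current C locale. Below is a minimal sketch of a check that avoids both. The helper names are hypothetical and this is not taken from the mount-zip sources or from commit 823c952.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper: true for ASCII control codes (0x00-0x1F and 0x7F),
// regardless of the current C locale and of the signedness of char.
constexpr bool IsAsciiControl(char c) {
  const unsigned char u = static_cast<unsigned char>(c);
  return u < 0x20 || u == 0x7F;
}

// If std::iscntrl() is used anyway, the argument should be converted to
// unsigned char first, so that bytes >= 0x80 never reach it as negative ints.
inline bool IsControlInCurrentLocale(char c) {
  return std::iscntrl(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: true if any byte of the name is an ASCII control code.
constexpr bool HasControlCode(std::string_view name) {
  for (const char c : name) {
    if (IsAsciiControl(c)) return true;
  }
  return false;
}
```

With this kind of check, the bytes of a UTF-8 name such as ÄÖÜ (all above 0x7F) are never classified as control codes, whatever the platform's char type or locale.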
from mount-zip.
Thanks for looking at this.
With the same test files, it's a bit better now:
# ./work/mount-zip/mount-zip /tmp/test-cp437.zip /mnt/
mount-zip[15268]: Bad file name: '
'
mount-zip[15268]: Skipped File [0]: Cannot normalize path
mount-zip[15268]: Bad file name: '
mount-zip[15268]: Skipped File [1]: Cannot normalize path
# ./work/mount-zip/mount-zip /tmp/test-utf8.zip /mnt/
# ls /mnt/
ÄÖÜäöüßćçĉéèêëē
from mount-zip.
There are a couple of things happening with test-cp437.zip.
First, libicu erroneously detects the filename encoding as GB18030:
$ mount-zip --verbose test-cp437.zip mnt
mount-zip[855602]: Detected encoding GB18030 with 32% confidence
$ tree mnt
mnt
├── !"#$%&'()*+,-.
│   └── 0
├── 、¥ウЖ┆
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~�
├── 亗儎厗噲墛媽崕彁
├── 岩釉罩棕仝圮蒉哙
├── 徕沅彐玷殛腱眍镳
├── 憭摂晼棙櫄洔潪煚
├── 耱篝貊鼬��
├── 谅媚牌侨墒颂臀闲
└── 辈炒刀犯购患骄坷
1 directory, 14 files
This filename encoding detection can be overridden by providing the -o encoding=cp437 option:
$ mount-zip -o encoding=cp437 test-cp437.zip mnt
$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñѪº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσμτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α
1 directory, 14 files
Second, files at indices [0] and [1] contain control codes in their names. For security reasons, these are considered invalid and are simply skipped. This explains why the mounted ZIP only contains 14 files instead of the 16 files initially present in the ZIP:
$ unzip -l test-cp437.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2012-02-18 06:51   ^A^B^C^D^E^F^G^I^H^J^K^L^M^N^O^P
        0  2012-02-18 06:51   ^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_
        0  2012-02-18 06:51   !"#$%&'()*+,-./0
        0  2012-02-18 06:52   123456789:;<=>?@
        0  2012-02-18 06:52   ABCDEFGHIJKLMNOP
        0  2012-02-18 06:52   QRSTUVWXYZ[\]^_`
...
---------                     -------
        0                     16 files
Note that the file at index [2] contains a slash in its name, which is correctly interpreted as a directory separator.
At the moment, mount-zip doesn't try to escape or modify filenames that are considered invalid. Invalid filenames are simply skipped and ignored.
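The two behaviors described above (entries with control codes are skipped, and a slash is a directory separator) can be sketched roughly as follows. SplitEntryName is a hypothetical function written for illustration only. mount-zip's actual path normalization may apply different or additional rules, and treating 0x7F as a control code here is an assumption.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch: returns the path components of a ZIP entry name, or
// std::nullopt if the entry should be skipped.
std::optional<std::vector<std::string>> SplitEntryName(std::string_view name) {
  std::vector<std::string> parts;
  std::string current;
  for (const char c : name) {
    const unsigned char u = static_cast<unsigned char>(c);
    // Assumption: any ASCII control code makes the whole name invalid.
    if (u < 0x20 || u == 0x7F) return std::nullopt;
    if (c == '/') {
      // A slash ends the current component; empty components are dropped.
      if (!current.empty()) parts.push_back(std::move(current));
      current.clear();
    } else {
      current += c;
    }
  }
  if (!current.empty()) parts.push_back(std::move(current));
  if (parts.empty()) return std::nullopt;  // Nothing left to mount.
  return parts;
}
```

Under these assumptions, the entry at index [2] yields two components (a directory and a file named 0), while the entries at indices [0] and [1] yield std::nullopt and are skipped.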
from mount-zip.
Now, passing -o encoding=libzip uses libzip's internal filename encoding detection and conversion (which only works with UTF-8 and CP437). This completely short-circuits the detection and conversion normally provided by ICU.
$ mount-zip -o encoding=libzip test-cp437.zip mnt
$ tree mnt
mnt
├── ┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
├── ▒▓│┤╡╢╖╕╣║╗╝╜╛┐└
├── ◄↕‼¶§▬↨↑↓→←∟↔▲▼
├── ☺☻♥♦♣♠•○◘◙♂♀♪♫☼►
├── !"#$%&'()*+,-.
│   └── 0
├── 123456789:;<=>?@
├── abcdefghijklmnop
├── ABCDEFGHIJKLMNOP
├── æÆôöòûùÿÖÜ¢£¥₧ƒá
├── íóúñѪº¿⌐¬½¼¡«»░
├── ±≥≤⌠⌡÷≈°∙·√ⁿ²■
├── QRSTUVWXYZ[\]^_`
├── qrstuvwxyz{|}~~Ç
├── ßΓπΣσµτΦΘΩδ∞φε∩≡
├── üéâäàåçêëèïîìÄÅÉ
└── ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀α
1 directory, 16 files
Note that libzip treats control characters in the range 1–31 differently from ICU. It converts them to graphical characters ☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼.
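For illustration, here is a minimal sketch of that graphical interpretation of CP437 bytes 0x01 to 0x1F, using the same 31 characters listed above. This is not libzip's code: the table and function names are hypothetical, and bytes outside that range are passed through unchanged, which is only correct for ASCII.

```cpp
#include <string>
#include <string_view>

// Unicode code points for CP437 bytes 0x01..0x1F, graphical interpretation.
constexpr char32_t kCp437Low[31] = {
    0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022, 0x25D8,
    0x25CB, 0x25D9, 0x2642, 0x2640, 0x266A, 0x266B, 0x263C, 0x25BA,
    0x25C4, 0x2195, 0x203C, 0x00B6, 0x00A7, 0x25AC, 0x21A8, 0x2191,
    0x2193, 0x2192, 0x2190, 0x221F, 0x2194, 0x25B2, 0x25BC};

// Appends a code point from the Basic Multilingual Plane as UTF-8.
inline void AppendUtf8(std::string& out, char32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: replaces bytes 0x01..0x1F with their CP437 graphical
// characters and leaves every other byte as it is.
inline std::string ControlsToGraphics(std::string_view raw) {
  std::string out;
  for (const char c : raw) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (u >= 0x01 && u <= 0x1F) {
      AppendUtf8(out, kCp437Low[u - 1]);
    } else {
      out += c;
    }
  }
  return out;
}
```

Applied to the name of file [0], this kind of mapping gives a name made of printable characters only, which would explain why that file is no longer skipped with -o encoding=libzip.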
from mount-zip.
Thanks.
I don't know where ICU gets their conversion tables from, but libzip's output matches https://en.wikipedia.org/wiki/Code_page_437
from mount-zip.
Right. And these characters bring back memories from using and programming on MS-DOS... 🙂
The file test-cp437.zip seems to check every character from the CP437 charset, but I noticed two irregularities and I'm curious about them.
The file [0] is named \x01\x02\x03\x04\x05\x06\x07\x09\x08\x0a\x0b\x0c\x0d\x0e\x0f\x10. Note that the control codes \x08 and \x09 are swapped. Is there any particular reason for that?
The file [7] is named qrstuvwxyz{|}~~\x80. Note that the character ~ is repeated and the control code \x7f is skipped. Is there any particular reason for that?
from mount-zip.
I added a test and noticed another difference between the way ICU and libzip treat CP437 code 0xE6.
ICU maps this code to μ U+03BC (Greek Small Letter Mu) whereas libzip maps this code to µ U+00B5 (Micro Sign).
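The two characters look alike but are distinct code points with different UTF-8 encodings, so the resulting filenames differ at the byte level. In Unicode, U+00B5 has a compatibility decomposition to U+03BC, so the two only compare equal after a compatibility normalization such as NFKC. A small sketch of the difference, with hypothetical names (this is neither ICU's nor libzip's API):

```cpp
#include <string>

// Which interpretation of CP437 byte 0xE6 to use.
enum class MuStyle {
  kGreekMu,    // U+03BC, the mapping reported above for ICU
  kMicroSign,  // U+00B5, the mapping reported above for libzip
};

constexpr char32_t Cp437E6(MuStyle style) {
  return style == MuStyle::kGreekMu ? char32_t{0x03BC} : char32_t{0x00B5};
}

// Both code points are below U+0800, so each encodes to two UTF-8 bytes.
inline std::string Utf8ForE6(MuStyle style) {
  const char32_t cp = Cp437E6(style);
  std::string out;
  out += static_cast<char>(0xC0 | (cp >> 6));
  out += static_cast<char>(0x80 | (cp & 0x3F));
  return out;
}
```

Here Utf8ForE6(MuStyle::kGreekMu) is the byte pair CE BC and Utf8ForE6(MuStyle::kMicroSign) is C2 B5, so a test comparing mounted filenames byte for byte has to know which conversion was used.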
from mount-zip.
I think the reason for swapping \x08 and \x09 is that the 'raw' test case is more easily readable if the character is not at the end of the line. But I've changed this for consistency.
I don't remember why I didn't test 0x7f - I've changed that now.
As for the mu character -- Wikipedia says it's a Micro Sign. I don't know why ICU uses the Greek Mu instead. Do you know?
Thanks for the suggestions!
from mount-zip.
A few code points of the CP437 charset can have multiple interpretations. I think this is OK. And we now have automated tests asserting both libzip's and ICU's conversions in mount-zip. All good.
from mount-zip.