Git Product home page Git Product logo

Comments (22)

comicfans avatar comicfans commented on August 25, 2024

I take a quick hack here, is this acceptable ?

from unarr.

selmf avatar selmf commented on August 25, 2024

I'm not sure if adding format-specific functions like this is a good idea. Most of unarr's code wasn't written by me, so I have to study the issue myself first. If unarr gives you the wrong encoding, that is a bug and that bug should be fixed.

from unarr.

comicfans avatar comicfans commented on August 25, 2024

I agree that unarr should keep interface generic and avoid format-specific function. unarr currently only try to convert from cp437 (seems this is from zip spec?), but in practice this is not the case , zip file encoding is a long standing problem , there're many zip file with non-standard encoding. afaik it's impossible to know actual encoding without some 'guess' work .

from unarr.

selmf avatar selmf commented on August 25, 2024

I don't think unarr needs to know the actual encoding. According to the zip specification the file headers must contain a flag indicating if the file name is encoded in UTF-8. So if this flag is set, it should be enough to just return the raw UTF-8 string as you suggested.

from unarr.

comicfans avatar comicfans commented on August 25, 2024

seems that problem happend if flag is not set to utf8. it takes

 zip->entry.name = ar_conv_dos_to_utf8(name);

which do cp437 => utf8 convert, but this gives 'incorrect' result . taken from previous blog link

The Zip spec says that the only supported encodings are CP437 and UTF-8, 
but everyone has ignored that. 
Implementers just encode file names however they want 

there is also other zip tool which suffer from such problem, for example unzip.

because unarr will replace slash in filename before return result to user, so convert back 'incorrect' result back to raw is also not possible (maybe possible ?)

from unarr.

comicfans avatar comicfans commented on August 25, 2024

with some checking, I think it is possible to convert back 'incorrect decoded' utf8 back to raw string( cp437 table used by unarr has unique utf8 mapping, and slash is compatible for both). the only problem is that cp437 has two dialects , unarr used the one from wiki (with wingdings) , not the DOSLatinUS . by using uchardet to guess correct encoding of raw string , I finally got correct result.

integrate uchardet into unarr may be infeasible, some API comment for zip filename (the problem and solution) will be helpful for developers. since caller didn't know the filename is converted from cp437 or just raw string as utf8, the problem still exists.

from unarr.

selmf avatar selmf commented on August 25, 2024

I see. This is more complicated than I thought. I agree that unarr should provide a mechanism that allows access to the raw filename, but this has to be implemented in a more generic way. Ideally, unarr would not do any character decoding at all and leave this step to the caller. Instead of adding a raw_filename entry we could add a character_encoding entry and keep the filename raw. That way the caller could decide to use it raw, handle conversion himself or use unarr's conversion.

from unarr.

comicfans avatar comicfans commented on August 25, 2024

That's better approach , I hope new API keeps clean and easy as previously. I'm afraid it may break ABI and introduces extra work for simple usecases ,since encoding may not be a problem for other formats.

from unarr.

selmf avatar selmf commented on August 25, 2024

I'm not interested in making the API unnecessary complicated or breaking ABI. I will do some research on how other libraries or unpackers are handling this and then decide what I am going to do.

My current idea is to have something like this:

const char *ar_entry_get_raw_name(ar_archive *ar);
unarr_encoding ar_entry_get_encoding(ar_archive *ar);

Unarr_encoding will be an enum, NULL will indicate unknown encoding. The old ar_entry_get_name function will stay. But that is my idea now. If I have a better idea I will do it differently :)

from unarr.

selmf avatar selmf commented on August 25, 2024

I recently had some time to think about this issue again and I would like to include your solution in the next unarr release. There is no use in waiting for a more general approach when a simple solution exists and I can always map the zip specific raw name function to a more generic solution if I ever implement it.

For the PR, it would be great if you could implement the following changes:

  • replace strdup() with malloc + strcpy or memcpy for better compability (strdup() is not part of the standard library, at least for C99)
  • move the ar_entry_zip_get_raw_name function to the zip section of the API, as it is format specific
  • increase the API version in unarr.h
  • add yourself to the AUTHORS file and add a CHANGELOG entry :)

We also should discuss how to describe this function in the header. There should be information on why this function exists and how to handle the raw filenames (uchardet or other method).

from unarr.

comicfans avatar comicfans commented on August 25, 2024

Thanks for consider including this as solution, but currently I have very limited spare time to work on this, I'll reconstruct my commit as your suggestion in next few days.

from unarr.

selmf avatar selmf commented on August 25, 2024

No problem. I just picked up work again recently, so there is no hurry and I let you code sit long enough, so a few days or even a bit longer won't be an issue.

from unarr.

selmf avatar selmf commented on August 25, 2024

Actually, if you're a bit short on time right now and rather would like me to finish the code I'd be fine with that too ;)

from unarr.

comicfans avatar comicfans commented on August 25, 2024

Appreciated that if you can finish the code! seems that I still don't have too much spare time on this

from unarr.

selmf avatar selmf commented on August 25, 2024

Great! Just create a PR from your master and I'll take over from that. If you have the time, a quick rebase would also help ;)

from unarr.

comicfans avatar comicfans commented on August 25, 2024

a little bit late... here it is

from unarr.

selmf avatar selmf commented on August 25, 2024

Question: How do you build/use the code and the function for the raw filename?

I've added a call to it to the unarr-test sample code and I get warnings about implicit function declaration and cast from int to pointer, followed by a segfault when I run it.

Note that this is not a complaint, I'm not really a C wizard myself and I'm just trying to understand why my sample application is complaining while it probably works for you :)

from unarr.

comicfans avatar comicfans commented on August 25, 2024

@selmf

Seems that problem is due to incomplete CMakeLists.

in main CMakeLists unarr didn't define target public include directories, so under test dir example code looks into system for unarr.h (which is old one) instead of project's.

setting

target_include_directories(unarr PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})

in main CMakeLists resolve this. I suggest that moving unarr.h into a individual include dir as a complete solution.

from unarr.

selmf avatar selmf commented on August 25, 2024

Good catch. I tested if it was linking to the correct library but forgot to check if it includes the wrong header.
A complete solution needs to take into account both local builds and system-installed headers, but now that I know what I am looking for this should be solvable :)

from unarr.

selmf avatar selmf commented on August 25, 2024

Ok, I finally finished this. I ended up modifying your code more than I was initially planning, but I think the end result is worth it.

To be specific, I had to add at least some boilerplate code to the other archive formats to prevent users from running the raw zip filename function on non-zip functions, so I added a "raw" parameter to the internal function for retrieving filenames and defaulted it to NULL on non-zip archives. Doing so also allowed me to merge the raw filename stuff into the regular zip name function, so I am quite happy with the outcome.

Feel free to test it a bit and give me some feedback :)

from unarr.

comicfans avatar comicfans commented on August 25, 2024

Great!

from unarr.

Dennis-Zhang-SH avatar Dennis-Zhang-SH commented on August 25, 2024

I don't think unarr needs to know the actual encoding. According to the zip specification the file headers must contain a flag indicating if the file name is encoded in UTF-8. So if this flag is set, it should be enough to just return the raw UTF-8 string as you suggested.

Is it able to provide a flag indicates that filename and file content may not encode in UTF-8?

from unarr.

Related Issues (16)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.