Comments (8)

selmf commented on May 26, 2024

No, it just means that much more work is involved to make this work. SzArEx_Extract and probably other parts of 7zDec.c would need to be rewritten to decompress directly into a caller-supplied buffer while keeping the decompression state and dictionary between calls, instead of decompressing everything into a private buffer and then exposing that buffer.

The question is whether it is worth it. It might actually be a better idea to rebuild the parsing code from scratch in C99 and only use the low-level parts of the 7z SDK. That way we have better control over what is happening, and it is easier to implement full support for 7z archives rather than only the limited subset the SDK provides.
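
For illustration, here is a rough sketch (not unarr code) of what decompressing directly into a caller-supplied buffer could look like with the SDK's low-level LZMA decoder. CLzmaDec keeps the dictionary and decoder state between calls, so output can be drained in small chunks instead of one huge allocation. read_more() and write_out() are hypothetical I/O stand-ins, and the allocator signature assumes newer (19.00-era) SDK headers:

  /* Sketch only: chunked decode with the 7z SDK's low-level LZMA decoder.
     Error handling is elided for brevity. */
  #include <stdlib.h>
  #include "LzmaDec.h"

  extern size_t read_more(Byte *buf, size_t cap);      /* hypothetical input */
  extern void write_out(const Byte *buf, size_t len);  /* hypothetical output */

  static void *AllocFn(ISzAllocPtr p, size_t size) { (void)p; return malloc(size); }
  static void FreeFn(ISzAllocPtr p, void *addr) { (void)p; free(addr); }
  static ISzAlloc g_Alloc = { AllocFn, FreeFn };

  /* props: the 5-byte LZMA properties block for the stream */
  int decode_in_chunks(const Byte *props, unsigned propsSize)
  {
      CLzmaDec dec;
      Byte in[1 << 16], out[1 << 16];
      size_t inLen = 0, inPos = 0;
      int eof = 0;

      LzmaDec_Construct(&dec);
      if (LzmaDec_Allocate(&dec, props, propsSize, &g_Alloc) != SZ_OK)
          return -1;
      LzmaDec_Init(&dec);

      for (;;) {
          ELzmaStatus status;
          SizeT outLen = sizeof(out), srcLen;

          if (inPos == inLen && !eof) {          /* refill the input window */
              inLen = read_more(in, sizeof(in));
              inPos = 0;
              eof = (inLen == 0);
          }
          srcLen = inLen - inPos;

          /* Decode at most one output chunk; the decoder state carries over
             to the next iteration, so no all-at-once buffer is needed. */
          if (LzmaDec_DecodeToBuf(&dec, out, &outLen, in + inPos, &srcLen,
                                  LZMA_FINISH_ANY, &status) != SZ_OK)
              break;
          inPos += srcLen;
          write_out(out, outLen);                /* e.g. flush to disk */

          if (status == LZMA_STATUS_FINISHED_WITH_MARK || (outLen == 0 && eof))
              break;
      }
      LzmaDec_Free(&dec, &g_Alloc);
      return 0;
  }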

r0ptr commented on May 26, 2024

Oh shoot, after looking into it a bit more, it seems not even the LZMA SDK supports archive entries whose size exceeds 4294967295 bytes when compiled for 32 bits (at least not the C version of the 7z SDK).

selmf commented on May 26, 2024

The 7z SDK decompresses all data into memory before returning it. This will of course fail on systems that can't address that much memory.

selmf commented on May 26, 2024

The problem with size_t and huge file entries is that size_t can only represent the largest chunk of memory your system can address. This means that even if you changed the uncompress function to always use a 64-bit buffer size, the allocation would still fail on a 32-bit system, for the simple reason that the data is too large to address. There is a simple way around this, though: use a smaller buffer and call uncompress repeatedly, decompressing the data in chunks that you write to disk.
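
A minimal sketch of that chunked approach, using unarr's public ar_entry_get_size/ar_entry_uncompress calls (error handling elided, output path illustrative):

  /* Chunked extraction: decompress a fixed-size buffer at a time and
     stream it to disk, so the whole entry never has to fit in memory. */
  unsigned char buffer[64 * 1024];
  size_t remaining = ar_entry_get_size(ar);   /* ar: an open ar_archive */
  FILE *out = fopen("entry.bin", "wb");
  while (remaining > 0) {
      size_t count = remaining < sizeof(buffer) ? remaining : sizeof(buffer);
      if (!ar_entry_uncompress(ar, buffer, count))
          break;                              /* decompression error */
      fwrite(buffer, 1, count, out);
      remaining -= count;
  }
  fclose(out);

Note that remaining is still a size_t here, which is exactly the entry-size problem described next.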

This still leaves the problem with the entry size. size_t is mainly used in this context because it indicates a size. If the data type prevents us from reporting the true size, that is a bug and it should be fixed. The issue I see is that changing the return type will require digging deeper into the respective archive implementations, and we need to take care to change size_t to a 64-bit type only in the correct contexts.

I also need to consider how to handle the (minor) API breakage this might incur for 32-bit users.
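
For illustration, one possible shape for such a change (the _raw name is hypothetical, not an actual unarr function):

  /* Keep the size_t accessor for source compatibility and add a
     fixed-width variant alongside it, so 32-bit callers can opt in
     without an immediate API/ABI break. Illustrative only. */
  size_t   ar_entry_get_size(ar_archive *ar);      /* existing API */
  uint64_t ar_entry_get_size_raw(ar_archive *ar);  /* hypothetical addition */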

r0ptr commented on May 26, 2024

The 7z SDK has this piece of code in the SzArEx_Extract function:

  if (*tempBuf == NULL || *blockIndex != folderIndex)
  {
    UInt64 unpackSizeSpec = SzAr_GetFolderUnpackSize(&p->db, folderIndex);
    /*
    UInt64 unpackSizeSpec =
        p->UnpackPositions[p->FolderToFile[(size_t)folderIndex + 1]] -
        p->UnpackPositions[p->FolderToFile[folderIndex]];
    */
    size_t unpackSize = (size_t)unpackSizeSpec;

    /* On a 32-bit build the cast above truncates anything over 4 GiB,
       so this check rejects such folders with SZ_ERROR_MEM. */
    if (unpackSize != unpackSizeSpec)
      return SZ_ERROR_MEM;
This is clearly designed to fail on 32-bit systems. I'm guessing they wanted to fail early here because the implementation also has problems deeper down...

Regarding the problem you mention, that addressing such a large piece of memory is impossible: that is true, but I work around it by using memory mapping. I map the largest possible free chunk of memory, unpack that much data into it, unmap that chunk, map the next chunk, unpack into that, and so on.
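
That scheme might look roughly like this POSIX sketch, where decompress_into() stands in for the chunked decoder and the build is assumed to use -D_FILE_OFFSET_BITS=64 so off_t is 64-bit even on 32-bit targets:

  /* Windowed extraction via mmap: the output file holds the full 64-bit
     size, but only one window is mapped into the address space at a time. */
  #include <stdint.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define WINDOW (64u << 20)   /* 64 MiB per mapping; page-aligned offsets */

  extern size_t decompress_into(void *dst, size_t cap);  /* hypothetical */

  int extract_mapped(const char *path, uint64_t totalSize)
  {
      int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
      if (fd < 0 || ftruncate(fd, (off_t)totalSize) != 0)
          return -1;
      for (uint64_t off = 0; off < totalSize; off += WINDOW) {
          size_t len = totalSize - off < WINDOW ? (size_t)(totalSize - off)
                                                : WINDOW;
          void *win = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                           fd, (off_t)off);
          if (win == MAP_FAILED)
              return -1;
          decompress_into(win, len);   /* fill this window, then move on */
          munmap(win, len);
      }
      return close(fd);
  }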

selmf commented on May 26, 2024

Don't bother too much with the 7z SDK. This memory limitation is the main reason I have marked 7z support as experimental. The underlying decompression code should be able to handle large files just fine, but the C archive code insists on decompressing huge blocks into a memory cache instead of returning the files as they are decompressed. Fixing this would require a partial rewrite of the SDK.

r0ptr commented on May 26, 2024

I see. Does this mean that extraction of large files (>4 GB) with 7z will never be supported on 32-bit systems?

selmf commented on May 26, 2024

Hey, just a quick heads-up. I recently rechecked this issue to see if I could include a fix in the upcoming unarr release, but the problem goes deeper than just the 7z SDK and the API signature. The bad pattern of using size_t for file sizes is present in a lot of the internal code and structures. Working on this without proper unit tests to catch regressions is asking for trouble.

I will check if I can improve the situation in the next development cycle. By then I should have a proper test system set up.
