Comments (8)

selmf commented on May 26, 2024

No, it just means that much more work is involved to make this work. SzArEx_Extract and probably other parts of 7zDec.c would need to be rewritten to decompress directly into a caller-supplied buffer while keeping the decompression state and dictionary between calls, instead of decompressing everything into a private buffer and then exposing that buffer.

The question is whether it is worth it. It might actually be a better idea to rebuild the parsing code from scratch in C99 and only use the low-level parts of the 7z SDK. That way we have better control over what is happening, and it is easier to implement full support for 7z archives rather than only the limited subset the SDK provides.
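
For illustration, here is a rough sketch (not unarr code) of what decompressing directly into a caller-supplied buffer could look like with the SDK's low-level LZMA decoder. CLzmaDec keeps the dictionary and decoder state between calls, so output can be drained in small chunks instead of one huge allocation. read_more() and write_out() are hypothetical I/O stand-ins, and the allocator signature assumes newer (19.00-era) SDK headers:

  /* Sketch only: chunked decode with the 7z SDK's low-level LZMA decoder.
     Error handling is elided for brevity. */
  #include <stdlib.h>
  #include "LzmaDec.h"

  extern size_t read_more(Byte *buf, size_t cap);      /* hypothetical input */
  extern void write_out(const Byte *buf, size_t len);  /* hypothetical output */

  static void *AllocFn(ISzAllocPtr p, size_t size) { (void)p; return malloc(size); }
  static void FreeFn(ISzAllocPtr p, void *addr) { (void)p; free(addr); }
  static ISzAlloc g_Alloc = { AllocFn, FreeFn };

  /* props: the 5-byte LZMA properties block for the stream */
  int decode_in_chunks(const Byte *props, unsigned propsSize)
  {
      CLzmaDec dec;
      Byte in[1 << 16], out[1 << 16];
      size_t inLen = 0, inPos = 0;
      int eof = 0;

      LzmaDec_Construct(&dec);
      if (LzmaDec_Allocate(&dec, props, propsSize, &g_Alloc) != SZ_OK)
          return -1;
      LzmaDec_Init(&dec);

      for (;;) {
          ELzmaStatus status;
          SizeT outLen = sizeof(out), srcLen;

          if (inPos == inLen && !eof) {          /* refill the input window */
              inLen = read_more(in, sizeof(in));
              inPos = 0;
              eof = (inLen == 0);
          }
          srcLen = inLen - inPos;

          /* Decode at most one output chunk; the decoder state carries over
             to the next iteration, so no all-at-once buffer is needed. */
          if (LzmaDec_DecodeToBuf(&dec, out, &outLen, in + inPos, &srcLen,
                                  LZMA_FINISH_ANY, &status) != SZ_OK)
              break;
          inPos += srcLen;
          write_out(out, outLen);                /* e.g. flush to disk */

          if (status == LZMA_STATUS_FINISHED_WITH_MARK || (outLen == 0 && eof))
              break;
      }
      LzmaDec_Free(&dec, &g_Alloc);
      return 0;
  }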

r0ptr commented on May 26, 2024

Oh shoot, after looking into it a bit more, it seems not even the LZMA SDK supports archive entries whose size exceeds 4294967295 bytes when compiled for 32 bits (at least not the C version of the 7z SDK).

selmf commented on May 26, 2024

The 7z SDK decompresses all data into memory before returning it. This will of course fail on systems that can't address that much memory.

selmf commented on May 26, 2024

The problem with size_t and huge file entries is that size_t can only represent the largest chunk of memory your system can address. This means that even if you changed the uncompress function to always use a 64-bit buffer size, the allocation would still fail on a 32-bit system, for the simple reason that the data is too large to address. There is a simple way around this, though: use a smaller buffer and call uncompress repeatedly, decompressing the data in chunks that you write to disk.
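
A minimal sketch of that chunked approach, using unarr's public ar_entry_get_size/ar_entry_uncompress calls (error handling elided, output path illustrative):

  /* Chunked extraction: decompress a fixed-size buffer at a time and
     stream it to disk, so the whole entry never has to fit in memory. */
  unsigned char buffer[64 * 1024];
  size_t remaining = ar_entry_get_size(ar);   /* ar: an open ar_archive */
  FILE *out = fopen("entry.bin", "wb");
  while (remaining > 0) {
      size_t count = remaining < sizeof(buffer) ? remaining : sizeof(buffer);
      if (!ar_entry_uncompress(ar, buffer, count))
          break;                              /* decompression error */
      fwrite(buffer, 1, count, out);
      remaining -= count;
  }
  fclose(out);

Note that remaining is still a size_t here, which is exactly the entry-size problem described next.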

This still leaves the problem with the entry size. size_t is mainly used in this context because it indicates a size. If the data type prevents us from reporting the true size, that is a bug and it should be fixed. The issue I see is that changing the return type will require digging deeper into the respective archive implementations, and we need to take care to change size_t to a 64-bit type only in the correct contexts.

I also need to consider how to handle the (minor) API breakage this might incur for 32-bit users.
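
For illustration, one possible shape for such a change (the _raw name is hypothetical, not an actual unarr function):

  /* Keep the size_t accessor for source compatibility and add a
     fixed-width variant alongside it, so 32-bit callers can opt in
     without an immediate API/ABI break. Illustrative only. */
  size_t   ar_entry_get_size(ar_archive *ar);      /* existing API */
  uint64_t ar_entry_get_size_raw(ar_archive *ar);  /* hypothetical addition */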

r0ptr commented on May 26, 2024

The 7z SDK has this piece of code in the SzArEx_Extract function:

  if (*tempBuf == NULL || *blockIndex != folderIndex)
  {
    UInt64 unpackSizeSpec = SzAr_GetFolderUnpackSize(&p->db, folderIndex);
    /*
    UInt64 unpackSizeSpec =
        p->UnpackPositions[p->FolderToFile[(size_t)folderIndex + 1]] -
        p->UnpackPositions[p->FolderToFile[folderIndex]];
    */
    size_t unpackSize = (size_t)unpackSizeSpec;

    /* On a 32-bit build the cast above truncates anything over 4 GiB,
       so this check rejects such folders with SZ_ERROR_MEM. */
    if (unpackSize != unpackSizeSpec)
      return SZ_ERROR_MEM;
This is clearly designed to fail on 32-bit systems. I'm guessing they wanted to fail early here because the implementation also has problems deeper down...

Regarding the problem you mention, that addressing such a large piece of memory is impossible: that is true, but I work around it by using memory mapping. I map the largest possible free chunk of memory, unpack that much data into it, unmap that chunk, map the next chunk, unpack into that, and so on.
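
That scheme might look roughly like this POSIX sketch, where decompress_into() stands in for the chunked decoder and the build is assumed to use -D_FILE_OFFSET_BITS=64 so off_t is 64-bit even on 32-bit targets:

  /* Windowed extraction via mmap: the output file holds the full 64-bit
     size, but only one window is mapped into the address space at a time. */
  #include <stdint.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define WINDOW (64u << 20)   /* 64 MiB per mapping; page-aligned offsets */

  extern size_t decompress_into(void *dst, size_t cap);  /* hypothetical */

  int extract_mapped(const char *path, uint64_t totalSize)
  {
      int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
      if (fd < 0 || ftruncate(fd, (off_t)totalSize) != 0)
          return -1;
      for (uint64_t off = 0; off < totalSize; off += WINDOW) {
          size_t len = totalSize - off < WINDOW ? (size_t)(totalSize - off)
                                                : WINDOW;
          void *win = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                           fd, (off_t)off);
          if (win == MAP_FAILED)
              return -1;
          decompress_into(win, len);   /* fill this window, then move on */
          munmap(win, len);
      }
      return close(fd);
  }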

selmf commented on May 26, 2024

Don't bother too much with the 7z SDK. This memory limitation is the main reason I have marked 7z support as experimental. The underlying decompression code should be able to handle large files just fine, but the C archive code insists on decompressing huge blocks into a memory cache instead of returning the files as they are decompressed. Fixing this would require a partial rewrite of the SDK.

r0ptr commented on May 26, 2024

I see. Does this mean that extraction of large files (>4 GB) with 7z will never be supported on 32-bit systems?

selmf commented on May 26, 2024

Hey, just a quick heads-up. I recently rechecked this issue to see if I could include a fix in the upcoming unarr release, but the problem goes deeper than just the 7z SDK and the API signature. The bad pattern of using size_t for file sizes is present in a lot of the internal code and structures. Working on this without proper unit tests to catch regressions is asking for trouble.

I will check if I can improve the situation in the next development cycle. By then I should have a proper test system set up.
