karton-archive-extractor's Introduction

Extractor karton service

Performs extraction of known archive types and e-mail attachments. Produces "raw" artifacts for further classification.

Author: CERT.pl

Maintainers: psrok1, nazywam

Consumes:

{
    "type":  "sample",
    "stage": "recognized",
    "kind":  "archive",
    "payload": {
        "sample": <Resource>,
        "extraction_level": <int, default: 0>,
        "password": <archive password>
    }
}

Produces:

{
    "type": "sample",
    "kind": "raw",
    "payload": {
        "sample": <Resource>,
        "parent": <Resource>,
        "extraction_level": <int++>
    }
}
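
As a rough illustration of how the two schemas above relate, the sketch below derives a "raw" child task from a consumed "archive" task. It uses plain dictionaries rather than the real karton Task and Resource classes, and the names `matches_filter`, `make_child_task` and the `headers`/`payload` split are assumptions made for this example only.

```python
import hashlib
from typing import Any, Dict

# Header values taken from the "Consumes" schema above.
CONSUMED_HEADERS = {"type": "sample", "stage": "recognized", "kind": "archive"}


def matches_filter(headers: Dict[str, str]) -> bool:
    """True when every consumed header is present with the expected value."""
    return all(headers.get(key) == value for key, value in CONSUMED_HEADERS.items())


def make_child_task(parent: Dict[str, Any], name: str, content: bytes) -> Dict[str, Any]:
    """Build the 'raw' task for one extracted file (hypothetical helper)."""
    parent_payload = parent["payload"]
    level = parent_payload.get("extraction_level", 0)  # schema default: 0
    return {
        "headers": {"type": "sample", "kind": "raw"},
        "payload": {
            # A dict stands in for a karton Resource here.
            "sample": {
                "name": name,
                "content": content,
                "sha256": hashlib.sha256(content).hexdigest(),
            },
            "parent": parent_payload["sample"],
            "extraction_level": level + 1,  # the "<int++>" in the schema
        },
    }
```

Each extracted file gets its own task, and the incremented `extraction_level` is what lets a depth limit stop nested archives from being unpacked indefinitely.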

Usage

First of all, make sure you have set up the core system: https://github.com/CERT-Polska/karton

To unpack all supported formats, you'll also need a few native dependencies that sflock relies on. The installation method recommended by sflock is:

RUN sed -i 's/ main/ main non-free/' /etc/apt/sources.list \
    && apt-get update && apt-get install -y \
    p7zip-full \
    rar \
    unace \
    cabextract \
    lzip

Then install karton-archive-extractor from PyPI and run it:

$ pip install karton-archive-extractor

$ karton-archive-extractor

Configuration

There are several configuration options you can tweak to your liking.

[archive-extractor]
# Maximum levels of nested extraction
max_depth = 5
# Maximum unpacked child filesize, larger files are not reported
max_size = 26214400
# Maximum number of children files for further analysis
max_children = 1000

To learn more about configuring your karton services, take a look at the karton configuration docs.
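
To make the three limits concrete, here is a minimal sketch that reads the `[archive-extractor]` section with the standard library and applies the limits to a list of extracted files. It does not use karton's own configuration loader. The `Limits` class, `load_limits`, `select_children` and the exact comparison rules are assumptions for illustration, not the service's actual implementation.

```python
import configparser
from dataclasses import dataclass
from typing import List, Tuple

# Defaults mirror the sample configuration above (max_size is 25 MiB).
DEFAULTS = {"max_depth": 5, "max_size": 25 * 1024 * 1024, "max_children": 1000}


@dataclass
class Limits:
    max_depth: int
    max_size: int
    max_children: int


def load_limits(ini_text: str) -> Limits:
    """Read the [archive-extractor] section, falling back to the defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("archive-extractor"):
        return Limits(**DEFAULTS)
    section = parser["archive-extractor"]
    return Limits(**{key: section.getint(key, fallback=default)
                     for key, default in DEFAULTS.items()})


def select_children(children: List[Tuple[str, int]], level: int,
                    limits: Limits) -> List[str]:
    """Return names of (name, size_in_bytes) children that pass the limits."""
    if level > limits.max_depth:
        return []  # nested too deeply, report nothing
    small_enough = [name for name, size in children if size <= limits.max_size]
    return small_enough[: limits.max_children]
```

Oversized children are skipped rather than truncated, which matches the "larger files are not reported" comment in the sample configuration.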

Running in Docker

Sflock uses ZipJail as a usermode syscall filtering mechanism. As a result, in our experience, the container running the karton service needs the SYS_PTRACE capability for ptrace to work correctly. Make sure it's enabled if you run into problems extracting certain archive types.
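
One way to check from inside the container whether the capability is actually present is to inspect the effective capability mask that Linux exposes in /proc. The sketch below is a diagnostic aid only and is not part of karton-archive-extractor; the function names are made up for this example.

```python
from typing import Optional

CAP_SYS_PTRACE = 19  # capability number from <linux/capability.h>


def effective_caps(status_text: str) -> Optional[int]:
    """Parse the CapEff bitmask out of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    return None


def has_sys_ptrace(status_text: str) -> bool:
    """True when the CAP_SYS_PTRACE bit is set in the effective mask."""
    caps = effective_caps(status_text)
    return caps is not None and bool(caps & (1 << CAP_SYS_PTRACE))


def current_process_has_sys_ptrace() -> bool:
    """Linux only; returns False when /proc is unavailable."""
    try:
        with open("/proc/self/status", encoding="ascii") as handle:
            return has_sys_ptrace(handle.read())
    except OSError:
        return False
```

With Docker, the capability is usually granted by passing `--cap-add SYS_PTRACE` when starting the container.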

Supported archive/compression formats*

.7z
.ace
.bup
.cab
.daa
.eml
.gz
.gzip
.iso
.lha
.lz
.lzh
.msg
.mso
.pdf
.rar
.tar
.tar.bz2
.tar.gz
.udf
.vhd
.vhdx
.xz
.zip

* Assuming you are running Linux. Please see sflock's README for more information.
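
If you want to pre-filter files by name before submitting them, the list above can be turned into a simple suffix check. This is only an illustrative sketch: the extractor itself may identify formats by content rather than by file name, and `is_supported` is a hypothetical helper.

```python
# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = (
    ".7z", ".ace", ".bup", ".cab", ".daa", ".eml", ".gz", ".gzip",
    ".iso", ".lha", ".lz", ".lzh", ".msg", ".mso", ".pdf", ".rar",
    ".tar", ".tar.bz2", ".tar.gz", ".udf", ".vhd", ".vhdx", ".xz", ".zip",
)


def is_supported(filename: str) -> bool:
    """Case-insensitive suffix match; str.endswith accepts a tuple of suffixes."""
    return filename.lower().endswith(SUPPORTED_EXTENSIONS)
```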

PE files debloating

Some malicious PE files contain intentionally added junk to make them too big for processing. Starting from v1.4.0, archive extractor supports optional debloating of these files using the debloat tool made by Squiblydoo.

The certpl/karton-archive-extractor Docker image debloats PE files by default. To enable debloating in karton-archive-extractor installed from PyPI, you need to install the extra dependencies:

pip install karton-archive-extractor[debloat]
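
To verify that the optional dependency ended up in your environment, you can check whether the module is importable. The sketch assumes the extra installs a top-level module named `debloat`; adjust the name if your installation differs.

```python
import importlib.util


def debloat_available(module_name: str = "debloat") -> bool:
    """True when the given top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("debloat installed:", debloat_available())
```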

Co-financed by the Connecting Europe Facility of the European Union

karton-archive-extractor's People

Contributors

74wny0wl, alex-ilgayev, bonusplay, chivay, conitrade-as, doomedraven, michaelweiser, msm-code, nazywam, psrok1

karton-archive-extractor's Issues

Crash on some files due to sflock shellcode detection

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350258688) at ./nptl/pthread_kill.c:44
44    ./nptl/pthread_kill.c: No such file or directory.
(gdb) x/i $rip
=> 0x7ffff7ce4a7c <__GI___pthread_kill+300>:    mov    %eax,%r13d
(gdb) info threads
  Id   Target Id                                    Frame 
* 1    Thread 0x7ffff7c4d000 (LWP 217993) "python3" __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350258688) at ./nptl/pthread_kill.c:44
(gdb) info stack
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350258688) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350258688) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350258688, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c90476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c767f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff391d249 in temp_load[cold] () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#6  0x00007ffff395c459 in tcg_gen_code_x86_64 () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#7  0x00007ffff3983fbc in tb_gen_code_x86_64 () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#8  0x00007ffff396832b in cpu_exec_x86_64 () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#9  0x00007ffff392ded4 in resume_all_vcpus_x86_64 () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#10 0x00007ffff392189e in uc_emu_start () from /home/psrok1/karton-archive-extractor/venv/lib/python3.10/site-packages/unicorn/lib/libunicorn.so.2
#11 0x00007ffff5e10e2e in ?? () from /lib/x86_64-linux-gnu/libffi.so.8
#12 0x00007ffff5e0d493 in ?? () from /lib/x86_64-linux-gnu/libffi.so.8
#13 0x00007ffff5e31451 in ?? () from /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
#14 0x00007ffff5e3ace2 in ?? () from /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
#15 0x00005555556a630b in _PyObject_MakeTpCall ()
#16 0x000055555569ec67 in _PyEval_EvalFrameDefault ()
#17 0x00005555556affbc in _PyFunction_Vectorcall ()
#18 0x00005555556985c9 in _PyEval_EvalFrameDefault ()

Probably caused by sflock.ident.detect_shellcode (https://github.com/doomedraven/sflock/blob/master/sflock/ident.py#L137)

  1. Maybe we should run sflock in a separate process to properly handle crashes like that (e.g. caused by a missing syscall on the whitelist)
  2. Why is every file evaluated as shellcode when it doesn't match a handled format? Isn't that overkill?
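
The first suggestion can be sketched with the standard library alone: run the risky step in a child interpreter so that a native abort (such as the SIGABRT in the trace above) kills only the child. `run_isolated` is a hypothetical helper, and the snippet passed to it stands in for the real sflock call.

```python
import subprocess
import sys
from typing import Optional, Tuple


def run_isolated(code: str, timeout: float = 30.0) -> Tuple[bool, Optional[int]]:
    """Run Python code in a child interpreter.

    Returns (succeeded, return_code); return_code is None on timeout.
    On POSIX a child killed by a signal reports a negative return code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return proc.returncode == 0, proc.returncode


if __name__ == "__main__":
    # os.abort() simulates a crash inside a native library.
    print(run_isolated("import os; os.abort()"))
    print(run_isolated("print('extracted fine')"))
```

The parent keeps running in both cases and can mark the task as failed instead of crashing the whole service.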

Debloat versions

Greetings!
First, thanks for including my debloat tool as part of karton-archive-extractor.

Second, I've just released 1.4.1, which processes much better and faster, and fixes a few bugs affecting some samples.

What is the best way for me to contribute to karton-archive-extractor when I release larger debloat updates? Should I build karton locally, test locally to ensure new versions work correctly through karton-archive-extractor, and then submit Pull Requests? Or would you prefer to primarily do your own testing in regards to included tools like debloat?

Try "infected" as the password because why not

Many archive files are protected with the common "infected" password. I suggest that if an archive is password-protected, the karton can try to extract it using the "infected" password.
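
A minimal sketch of that fallback, assuming the extraction routine signals a wrong password with an exception. `WrongPassword`, `extract_with_fallback` and the `extract` callable are hypothetical names for this example and are not sflock or karton APIs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

DEFAULT_PASSWORDS = ("infected",)


class WrongPassword(Exception):
    """Raised by the extract callable when a password does not fit."""


def extract_with_fallback(
    extract: Callable[[str], List[bytes]],
    task_password: Optional[str],
    fallback: Sequence[str] = DEFAULT_PASSWORDS,
) -> Optional[Tuple[str, List[bytes]]]:
    """Try the password from the task first, then the common fallbacks."""
    candidates: List[str] = []
    if task_password is not None:
        candidates.append(task_password)
    candidates.extend(p for p in fallback if p not in candidates)
    for password in candidates:
        try:
            return password, extract(password)
        except WrongPassword:
            continue  # try the next candidate
    return None  # nothing worked
```

Trying the task-supplied password first keeps the existing behaviour intact, and the fallback list only comes into play when that password is missing or wrong.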

List supported formats

Supported formats:

.ace
.bup
.cab
.daa
.eml
.gz
.gzip
.iso
.lha
.lz
.lzh
.msg
.mso
.pdf
.rar
.tar
.tar.bz
.tar.gz
.vhd
.vhdx
.xz
.z
.zip
.7z

Read additional extractor values from config/argv

# Maximum levels of nested extraction
max_depth = 5
# Maximum unpacked child filesize, larger files are not reported
max_size = 25 * 1024 * 1024
# Maximum number of children for further analysis
max_children = 1000
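
For the argv half of this request, a sketch with argparse could look like the following. The flag names `--max-depth`, `--max-size` and `--max-children` are assumptions chosen to mirror the values above, not options the service is known to accept.

```python
import argparse
from typing import List, Optional


def parse_limits(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse extractor limits; unspecified flags keep the defaults above."""
    parser = argparse.ArgumentParser(description="Archive extractor limits")
    parser.add_argument("--max-depth", type=int, default=5,
                        help="maximum levels of nested extraction")
    parser.add_argument("--max-size", type=int, default=25 * 1024 * 1024,
                        help="maximum unpacked child size in bytes")
    parser.add_argument("--max-children", type=int, default=1000,
                        help="maximum number of children for further analysis")
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_limits(["--max-depth", "2"]))
```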

Rewrite the extractor to use 7zip instead of zipfile

Issue: we don't support newer encryption methods, because Python's zipfile can't handle them:

karton.archive_extractor.sflock.exception.UnpackException: Unknown zipfile error: That compression method is not supported

Solution: use 7z to extract files. The downside is that we lose the sandboxing provided by sflock, but a 7z exploit is IMO highly unlikely (I know, famous last words).

Other solutions: find a secure pure python library for zip files? Something else?
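
A rough sketch of the 7z approach, shelling out to the `7z` binary from p7zip-full. It only shows the command construction and invocation, and it does not restore the sandboxing mentioned above. The helper names are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_7z_command(archive: str, out_dir: str, password: Optional[str] = None,
                     binary: str = "7z") -> List[str]:
    """Build a '7z x' command line: extract with full paths into out_dir."""
    cmd = [binary, "x", "-y", f"-o{out_dir}"]  # -y: assume "yes" on all queries
    if password is not None:
        cmd.append(f"-p{password}")
    cmd += ["--", archive]  # "--" stops switch parsing before the file name
    return cmd


def extract_with_7z(archive: str, out_dir: str, password: Optional[str] = None,
                    timeout: float = 60.0) -> bool:
    """Run 7z; True on exit code 0. Raises subprocess.TimeoutExpired on timeout."""
    binary = shutil.which("7z")
    if binary is None:
        raise FileNotFoundError("7z binary not found on PATH")
    proc = subprocess.run(
        build_7z_command(archive, out_dir, password, binary),
        stdin=subprocess.DEVNULL,  # never block on an interactive prompt
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode == 0
```

Passing the arguments as a list (no shell) avoids quoting problems with attacker-controlled file names.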
