trailofbits / polyfile
A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer
License: Apache License 2.0
Apache Tika's example of a .docx file with recursively embedded zip files takes roughly an hour to process.
I confirmed privately with a key committer that this is unexpected.
File is available here: https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 243, in parse_pdf
    object = parser.GetObject()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 426, in GetObject
    self.token = self.oPDFTokenizer.TokenIgnoreWhiteSpace()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 396, in TokenIgnoreWhiteSpace
    token = self.Token()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 373, in Token
    self.oPDF.unget(self.byte)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 240, in unget
    assert isinstance(byte, PDFByte)
AssertionError
File: govdocs/176/176361.pdf
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
TypeError: 'NoneType' object is not subscriptable
File: govdocs/706/706211.pdf
Let me know if I should attach the triggering file.
398129.pdf
File from govdocs1.
hex editor
The StatusLogHandler defaults to using sys.stderr.buffer, but that attribute is not guaranteed to exist when sys.stderr has been replaced with a custom IO object.
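One defensive option is to fall back to wrapping the text stream when .buffer is missing. This is a minimal sketch, not polyfile's actual API; get_binary_stderr and _EncodingWriter are hypothetical names:

```python
import sys


def get_binary_stderr():
    """Return a binary stream for status output.

    Prefers sys.stderr.buffer, but falls back to wrapping sys.stderr
    when a replacement stream (e.g. io.StringIO) has no .buffer
    attribute. Hypothetical helper for illustration.
    """
    stderr = sys.stderr
    buffer = getattr(stderr, "buffer", None)
    if buffer is not None:
        return buffer

    class _EncodingWriter:
        """Adapter that lets callers write bytes to a text-only stream."""

        def __init__(self, text_stream):
            self._stream = text_stream

        def write(self, data):
            if isinstance(data, bytes):
                data = data.decode("utf-8", errors="replace")
            return self._stream.write(data)

        def flush(self):
            self._stream.flush()

    return _EncodingWriter(stderr)
```

The handler could then call get_binary_stderr() at emit time instead of caching sys.stderr.buffer at import time, so later replacement of sys.stderr is also picked up.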
Add support for the undocumented ternary operator in libmagic that is apparently only used in the ELF definition:
https://github.com/file/file/blob/03b6dcb4a24455207ef4094560c334fbc38875bd/magic/Magdir/elf#L61-L63
The x operand tests whether the execute bits are set on the input file: https://github.com/file/file/blob/master/src/softmagic.c#L529-L566
This will probably require modifying the API to maintain information about the execute bits.
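The filesystem side of that check is straightforward; the harder part is plumbing the flag into the matcher, which currently sees only in-memory bytes. A hypothetical helper (not polyfile's API) for the stat portion:

```python
import os
import stat


def is_executable(path: str) -> bool:
    """Check whether any execute bit is set on the input file,
    mirroring what libmagic's x ternary operand tests.

    Hypothetical helper: polyfile's matcher would need to carry this
    flag alongside the file contents, since matching on bytes alone
    cannot answer it.
    """
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```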
On Windows with python 3.11 and polyfile 0.5.2, processing the following files as demonstrated in the README seems to take forever:
For comparison, python-magic, using a years-old libmagic v4, spits out an error mentioning regex and memory.
import polyfile

def from_file(file):
    print(file)
    with open(file, "rb") as f:
        # the default instance automatically loads all file definitions
        for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
            for mimetype in match.mimetypes:
                print(f"Matched MIME: {mimetype}", flush=True)
            print(f"Match string: {match!s}", flush=True)

from_file("test3.py")
from_file("memblock.txt")
from_file("whisper.cpp/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java")
09/05/2023 01:45:27 C:\Users\WDAGUtilityAccount\Desktop> python.exe .\test3.py
test3.py
Matched MIME: text/plain
Match string: ascii text
memblock.txt
Matched MIME: text/x-c
Match string: C source text
Traceback (most recent call last):
File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 13, in <module>
from_file("memblock.txt")
File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 7, in from_file
for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2742, in match
if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in __bool__
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in <genexpr>
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 44, in __iter__
yield self[i]
~~~~^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
self._items.append(next(self._source_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 54, in unique
for t in iterator:
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2493, in <genexpr>
return LazyIterableSet((
^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2543, in __iter__
yield self[i]
~~~~^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2527, in __getitem__
result = next(self._result_iter)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 928, in _match
yield from child._match(context=context, parent_match=m)
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 917, in _match
m = self.test(context.data, absolute_offset, parent_match)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2103, in test
match = self.data_type.match(data[absolute_offset:], self.constant)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 1767, in match
m = expected.search(data[:self.length])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
09/05/2023 01:50:32 C:\Users\WDAGUtilityAccount\Desktop>
Dynamically load jinja2 at runtime so it is not required for installation via setup.py.
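The usual pattern is to defer the import into the function that needs it. A sketch under the assumption that a single template entry point exists (render_template is a hypothetical name, not polyfile's real API):

```python
def render_template(template_source: str, **context) -> str:
    """Render a template, importing jinja2 only when actually needed.

    Sketch of the deferred-import pattern: the import cost (and the
    install-time dependency) is paid only by users who invoke a
    jinja2-backed feature.
    """
    try:
        import jinja2  # deferred: not required at install time
    except ImportError as e:
        raise RuntimeError(
            "this feature requires jinja2; install it with `pip install jinja2`"
        ) from e
    return jinja2.Template(template_source).render(**context)
```

jinja2 could then move from install_requires to an extras_require group so plain installs stay lightweight.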
Add a command line option to only do filetype/polyglot matching and not parse subtypes.
Since commit a98992c, polyfile uses a narrower version specifier for the expected version of chardet. In the meantime, chardet 5.1.0 has been released (and packaged on Arch Linux), which leads to a DistributionNotFound error, since pkg_resources is explicitly imported at runtime:
Traceback (most recent call last):
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 581, in _build_master
ws.require(__requires__)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 909, in require
needed = self.resolve(parse_requirements(requirements))
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 800, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (chardet 5.1.0 (/usr/lib/python3.10/site-packages), Requirement.parse('chardet~=5.0.0'), {'polyfile'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/polyfile", line 33, in <module>
sys.exit(load_entry_point('polyfile==0.5.0', 'console_scripts', 'polyfile')())
File "/usr/bin/polyfile", line 25, in importlib_load_entry_point
return next(matches).load()
File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
module = import_module(match.group('module'))
File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/lib/python3.10/site-packages/polyfile/__init__.py", line 1, in <module>
from . import nes, pdf, jpeg, zipmatcher, nitf, kaitaimatcher, languagematcher, polyfile
File "/usr/lib/python3.10/site-packages/polyfile/nes.py", line 6, in <module>
from .polyfile import register_parser, InvalidMatch, Submatch
File "/usr/lib/python3.10/site-packages/polyfile/polyfile.py", line 8, in <module>
import pkg_resources
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3260, in <module>
def _initialize_master_working_set():
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3234, in _call_aside
f(*args, **kwargs)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3272, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 583, in _build_master
return cls._build_from_requirements(__requires__)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 596, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 795, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'chardet~=5.0.0' distribution was not found and is required by polyfile
This was observed on Arch Linux with the AUR package for polyfile at version 0.5.
Can the version specifier be safely loosened to chardet>=5.0.0 again?
In the file 123165.pdf, object 37 contains the dictionary <</Filter [/FlateDecode]\n/Length 1813>>. The PDF spec explicitly allows this (the value associated with the /Filter key may be "an array of zero, one or several names"), but polyfile treats that dict as having three key-value pairs: /Filter paired with [, /FlateDecode paired with ]\n, and /Length paired with 1813. The same issue appears in other PDFs with arrays as values (e.g. 123324.pdf).
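The fix amounts to treating a bracketed array as a single value while walking the dictionary tokens. A toy sketch (deliberately much cruder than pdfparser.py: it assumes balanced brackets and no nested dictionaries):

```python
import re


def parse_pdf_dict(data: str) -> dict:
    """Parse a flat PDF dictionary body, grouping [...] arrays as one value.

    Illustrative toy parser: tokenizes names, delimiters, and bare
    values, then collects everything between [ and ] into a list
    instead of pairing the brackets with keys.
    """
    tokens = re.findall(r"<<|>>|\[|\]|/[^\s/\[\]<>]+|[^\s/\[\]<>]+", data)
    # strip the enclosing << >>
    if tokens and tokens[0] == "<<":
        tokens = tokens[1:]
    if tokens and tokens[-1] == ">>":
        tokens = tokens[:-1]
    result = {}
    i = 0
    while i < len(tokens):
        key = tokens[i]
        i += 1
        if i < len(tokens) and tokens[i] == "[":
            # collect everything up to the matching close bracket
            j = tokens.index("]", i)
            result[key] = tokens[i + 1:j]
            i = j + 1
        else:
            result[key] = tokens[i]
            i += 1
    return result
```

On the dictionary from object 37, this yields two pairs: /Filter mapped to the one-element array [/FlateDecode], and /Length mapped to 1813.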
I ran polyfile on yes.png
(https://github.com/corkami/collisions/blob/master/workshop/yes.png).
There's no parser for PNG, so the JSON is quite empty, which is OK; on the other hand, the standalone HTML output shows binary data at subsequent offsets even though the JSON defines no value there.
Sometimes in the FAW, we'll find an error message that refers to specific bytes (such as a font or key name) within a set of files. It would be very handy to be able to reliably search for these byte sequences within files using the Polyfile report. While Polyfile has a search function, the current version fails to find many specific byte sequences within files. Presumably this is due to the fact that it doesn't recurse into streams, though there may be another cause.
Let me know if it would be helpful to attach a specific file / search string.
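Recursing into FlateDecode'd stream bodies would catch most of these misses. A rough sketch (regex-based stream location is far cruder than real PDF parsing, and search_with_streams is a hypothetical name, not Polyfile's search API):

```python
import re
import zlib


def search_with_streams(pdf_bytes: bytes, needle: bytes):
    """Search for a byte sequence in the raw file and inside any
    FlateDecode'd stream bodies.

    Returns a list of ("raw" | "stream", offset) hits; stream hits
    are offsets within the decompressed data.
    """
    hits = [("raw", m.start()) for m in re.finditer(re.escape(needle), pdf_bytes)]
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            decoded = zlib.decompress(m.group(1))
        except zlib.error:
            continue  # not FlateDecode, or corrupt
        for hit in re.finditer(re.escape(needle), decoded):
            hits.append(("stream", hit.start()))
    return hits
```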
When I try to parse https://github.com/corkami/pocs/blob/master/pdf/PDFSecrets/deniable%20removal%202%20with%20incremental%20update.pdf
I get:
Found a subregion of type Value at byte offset 111
Parsing PDF obj 4 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 136, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 53, in _emit_dict
    yield from _emit_dict(value, pair, pdf_offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 35, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Hi, it looks like issue #30 was closed but the pull request associated with it was never merged into master. Was this intentional?
I am running PolyFile version 0.4.2 on Python 3.9.7 on Windows. I get unexpected exceptions of type NotImplementedError. Consider the following test:
def test_polyfile(self):
    from polyfile.magic import MagicMatcher
    data = zlib.decompress(base64.b64decode(
        'eNptUsFO4zAQvVvyPwyHSnAgtpukpRJCKtBuJbqkanxZbRAy1C2BkqDYRbv79YydRGm7WLJlv3me9zzj'
        '3uJ2ei4CQYkADuXTKyWXl0AJMPn3QwO7UVZty40DFmqjDfSRtqTk6ooSXaz8BUr6R3fv8pWB3xA6Ljw4'
        '5KbcFRbEXuY63XGmsMt0SK2TFFYX1kBUmwA2HoMnAkuaDbApnGY4xKgfiMFF0I8DkWVWG5tl63yrz2rW'
        'LfrjSM6tN4hICuxHKcuJOzlT9YLiFWq27wa21KbcVc/ovVGeoqtOXLTb1rwLN0C6e7IecxHRgNfKaJ+C'
        'zfT2U9v8WfmIV++MHJYpOir4XBcb+wKC85pJibGVVu+UXEtKmBSPHIsv19hmdxUPEZIDzjkM4zAYDQcg'
        'kYwItLPCpp8mSbJIT+AXvhju5fwnzMbpDF6UgdedsTDX6k2vggDOQKKZifQeW+nO7p9KozSHGJduwCCO'
        'wxjWe6BAbR8q9sDhN6CIov/BKBx1ICW2Utjvqv1Ly7J0P7BpY5r/0xDV1TJWVbb2OBCI9XqTZPoFx5+0'
        'nw=='
    ))
    types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
    self.assertIn('application/pdf', types)
In my setup, this crashes with the following exception:
Traceback (most recent call last):
File "X:\test\test_polyfile.py", line 82, in test_polyfile
types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
File "X:\test\test_polyfile.py", line 82, in <listcomp>
types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2184, in match
if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in __bool__
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in <genexpr>
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 44, in __iter__
yield self[i]
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
self._items.append(next(self._source_iter))
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 54, in unique
for t in iterator:
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2005, in <genexpr>
return LazyIterableSet((
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2047, in __iter__
yield self[i]
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2031, in __getitem__
result = next(self._result_iter)
File "X:\venv\lib\site-packages\polyfile\magic.py", line 760, in _match
m = self.test(context.data, absolute_offset, parent_match)
File "X:\venv\lib\site-packages\polyfile\magic.py", line 1953, in test
raise NotImplementedError(
NotImplementedError: TODO: Implement support for the DER test (e.g., using the Kaitai asn1_der.py parser)
From my limited understanding, I would expect that this exception should not propagate to the caller; instead, when a test raises an exception, it should be silently discarded as having produced no match.
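The expected behavior can be expressed as a small wrapper around the test loop. This is a sketch of the reporter's proposal against a hypothetical interface (an iterable of test callables), not polyfile's internal matcher API:

```python
def safe_matches(tests, data):
    """Yield results from magic tests, treating any test that raises
    as a non-match instead of propagating the error.

    Each element of `tests` is assumed to be a callable taking the
    data and returning a match object or None (hypothetical
    interface for illustration).
    """
    for test in tests:
        try:
            result = test(data)
        except Exception:
            continue  # an erroring test simply produced no match
        if result is not None:
            yield result
```

A refinement would be to log the swallowed exception at debug level so unimplemented tests (like the DER one above) remain discoverable.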
Govdocs -
Parsing PDF obj 62 0
Traceback (most recent call last):
File "/usr/local/bin/polyfile", line 11, in <module>
load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
yield from submatch_iter
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 38, in _emit_dict
value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Parsing PDF obj 424 0
Traceback (most recent call last):
File "/usr/local/bin/polyfile", line 11, in <module>
load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
yield from submatch_iter
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in _emit_dict
''.join(v.token for v in value),
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in <genexpr>
''.join(v.token for v in value),
AttributeError: 'str' object has no attribute 'token'
If I run polyfile on an off-the-shelf PDF, it will generally detect all streams. If I use Mac's Preview to highlight/annotate the file, in this case 5 times, a number of streams are added. (Many of them appear to be ICC color spaces, and oddly, many are identical: seemingly 20 in this case.) polyfile is unable to detect most of the newly-added streams:
> ls -lh optician*
-rw-r--r--@ 1 sam staff 515K Jan 8 10:41 optician-with-annots.pdf
-rw-r--r--@ 1 sam staff 370K Jan 8 10:41 optician.pdf
> strings optician.pdf | grep -c endstream
158
> polyfile -q optician.pdf | jq | grep -c "\"type\": \"EndStream\""
158
> strings optician-with-annots.pdf | grep -c endstream
198
> polyfile -q optician-with-annots.pdf | jq | grep -c "\"type\": \"EndStream\""
158
The files in question:
optician-with-annots.pdf
optician.pdf
Downloading https://github.com/trailofbits/polyfile/archive/refs/tags/v0.3.3.tar.gz and building yields an error.
❯ python setup.py build
Traceback (most recent call last):
File "/home/dkasak/code/dkasak/packages/aur_packages/polyfile/src/polyfile-0.3.3/setup.py", line 47, in <module>
if not MANIFEST_PATH.exists() or newest_definition > MANIFEST_PATH.stat().st_mtime:
TypeError: '>' not supported between instances of 'NoneType' and 'float'
Looking at setup.py, the cause is here:
# see if any of the files are out of date and need to be recompiled
newest_definition: Optional[float] = None
for definition in KAITAI_FORMAT_LIBRARY.glob("**/*.ksy"):
    mtime = definition.stat().st_mtime
    if newest_definition is None or newest_definition < mtime:
        newest_definition = mtime
It turns out that the KAITAI_FORMAT_LIBRARY: Path = POLYFILE_DIR / "kaitai_struct_formats" dir is empty, so the loop above never executes and newest_definition remains None:
❯ ls -l kaitai_struct_formats
total 0
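A guard for the empty-glob case would avoid the TypeError. This is a sketch of one possible fix, with needs_recompile as a hypothetical refactoring of the setup.py logic:

```python
from pathlib import Path
from typing import Optional


def needs_recompile(manifest: Path, format_dir: Path) -> bool:
    """Decide whether the Kaitai definitions need recompiling.

    If no .ksy files are found, newest_definition stays None and there
    is nothing to compile from, so we must not compare None against
    the manifest's mtime (the source of the TypeError above).
    """
    newest_definition: Optional[float] = None
    for definition in format_dir.glob("**/*.ksy"):
        mtime = definition.stat().st_mtime
        if newest_definition is None or newest_definition < mtime:
            newest_definition = mtime
    if newest_definition is None:
        return False  # empty checkout: skip compilation
    return not manifest.exists() or newest_definition > manifest.stat().st_mtime
```

An arguably better behavior when the directory is empty would be to raise a clear error telling the user to initialize the kaitai_struct_formats git submodule, since the release tarball evidently ships without it.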
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 881, in ParseDictionary
    value, tokens = self.ParseDictionary(tokens)
TypeError: cannot unpack non-iterable NoneType object
triggering file: govdocs/810/810124.pdf
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 208, in parse_object
    if is_dct_decode and raw_content[:1] == b'\xff':
UnboundLocalError: local variable 'is_dct_decode' referenced before assignment
triggering file: common_crawl/6fb/9cc/eb3/6fb9cceb33a2bf98749e895e43840e90f4bcbf4a631b512c010675e3763f5433
Trying to install this fresh on a Windows box today.
Traceback (most recent call last):
File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Spencer\Apps\python\Scripts\polyfile.exe\__main__.py", line 4, in <module>
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__init__.py", line 2, in <module>
from .__main__ import main
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__main__.py", line 18, in <module>
from .debugger import Debugger
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\debugger.py", line 12, in <module>
from .repl import ANSIColor, ANSIWriter, arg_completer, command, ExitREPL, log, REPL, SetCompleter
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\repl.py", line 7, in <module>
import readline
ModuleNotFoundError: No module named 'readline'
It looks like readline is a new requirement and isn't listed in the requirements:
Lines 123 to 134 in 9c2d20b
Pip version: pip 21.3.1 from C:\Users\Spencer\Apps\python\lib\site-packages\pip (python 3.9)
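Since readline is a POSIX-only stdlib module (it cannot be pip-installed on Windows), a guarded import in repl.py is the usual fix. A sketch; pyreadline3 is a commonly used third-party Windows substitute, and history_available is a hypothetical helper name:

```python
# Make readline optional: it is absent from Windows's stdlib.
try:
    import readline  # POSIX stdlib line-editing support
except ImportError:
    try:
        import pyreadline3 as readline  # third-party Windows port, if installed
    except ImportError:
        readline = None  # REPL still works, just without history/completion


def history_available() -> bool:
    """Report whether line-editing support was found on this platform."""
    return readline is not None
```

The REPL would then gate its readline-specific calls (history, completers) on readline not being None.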
Any thoughts on doing something (see below) to add a way to skip the base64 output of the scanned file in JSON format? I recognize that having it in there is part of SBuD and I can definitely see the benefit/convenience (having a more-or-less "self-contained" format with the file data is great for say later security/virus/malware analysis...) -- but it also makes the JSON output absolutely gigantic (which scales up with the size of the input file scanned, of course).
Options could be:
- a new output format that omits the contents, or
- a command line flag (--no-contents or something like that?)

Also a second question becomes: what should happen to the b64contents key?
- remove the b64contents key entirely (this is probably a bad idea...)
- set the b64contents key to an empty string (or even None)
- set the b64contents key to some string actually encoded in base64 ... say base64("null") ...

Ultimately, the idea is to not introduce a breaking change into the default behavior - arguably, either a new output format or a --no-contents flag preserves existing functionality. As to removing the key entirely, I suppose it's also arguable which is better/worse: removing the b64contents key, replacing the data in the key with None/null, or setting the key to a short base64-encoded string of "null".
In my experience, at least in the Python world, developers often don't check for the existence of a key in a dict (or they don't use the dict.get() method, which gracefully handles a non-existing key, unlike mydict['noKey']). I suppose the concern is somewhat moot, since the default behavior won't change.
With either option, it seems prudent to add an optional parameter to the polyfile.Analyzer.sbud method (see below) to skip encoding the data to base64 - there doesn't appear to be a reason to waste CPU cycles (and memory) converting the data to base64 if it will be stripped from the output.
Line 372 in 438628f
Line 383 in 438628f
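The shape of that optional parameter could look like the sketch below. This is hypothetical: the real polyfile.Analyzer.sbud method has a different signature and emits many more fields; the point is that when contents are skipped, the base64 encoding is never computed at all:

```python
import base64


def sbud(data: bytes, include_contents: bool = True) -> dict:
    """Build a minimal SBuD-style dict, optionally omitting the
    base64-encoded file contents.

    Hypothetical sketch of the proposed parameter. Omitting the key
    (rather than setting it to None) means the encoding cost is never
    paid, and consumers can distinguish "not requested" from "empty".
    """
    result = {"length": len(data)}
    if include_contents:
        result["b64contents"] = base64.b64encode(data).decode("ascii")
    return result
```

A --no-contents CLI flag would then simply pass include_contents=False through to this method, leaving the default output byte-for-byte unchanged.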
In this example file: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testZipEncrypted.zip, one of the streams is encrypted, and the other isn't.
It might be useful to report an encrypted stream in the output json and then process the unencrypted file type rather than throwing an exception on the file and not processing it at all.
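Detecting which entries are encrypted is cheap: bit 0 of each entry's general-purpose flag marks traditional PKWARE encryption, and Python's zipfile exposes it without decrypting anything. A sketch of the per-entry classification (classify_zip_entries is a hypothetical helper, not polyfile's API):

```python
import io
import zipfile


def classify_zip_entries(zip_bytes: bytes):
    """Report which entries in a zip archive are encrypted, so the
    unencrypted ones can still be parsed.

    Bit 0 of ZipInfo.flag_bits is the encryption flag from the zip
    spec; reading the central directory does not require any password.
    """
    report = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            encrypted = bool(info.flag_bits & 0x1)
            report.append({"name": info.filename, "encrypted": encrypted})
    return report
```

polyfile could emit this report into the output JSON and recurse only into the entries marked unencrypted, instead of aborting on the whole archive.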
Evan do you think it would be possible to create a signal handler for SIGTERM to dump whatever polyfile has learned during its analysis?
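A sketch of what such a handler could look like; get_partial_results is a hypothetical hook returning whatever analysis state has accumulated so far (polyfile would need to expose one), and 143 is the conventional exit code for a SIGTERM death:

```python
import json
import signal
import sys


def install_dump_on_sigterm(get_partial_results):
    """Install a SIGTERM handler that dumps partial analysis results
    as JSON to stdout before exiting.

    `get_partial_results` is a hypothetical zero-argument callable
    returning a JSON-serializable snapshot of the analysis so far.
    """
    def handler(signum, frame):
        json.dump(get_partial_results(), sys.stdout)
        sys.stdout.flush()
        sys.exit(143)  # 128 + SIGTERM(15), the conventional exit code

    signal.signal(signal.SIGTERM, handler)
```

One caveat: the handler runs at an arbitrary point in the analysis, so the snapshot callable should only read from data structures that are safe to observe mid-update.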