
PolyFile


A utility to identify and map the semantic and syntactic structure of files, including polyglots, chimeras, and schizophrenic files. It has a pure-Python implementation of libmagic and can act as a drop-in replacement for the file command. However, unlike file, PolyFile can recursively identify embedded files, like binwalk.

PolyFile can be used in conjunction with its sister tool PolyTracker for Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of collectively referring to the tools as The ALAN Parsers Project.

Quickstart

You can install the latest stable version of PolyFile from PyPI:

pip3 install polyfile

To install PolyFile from source, in the same directory as this README, run:

pip3 install .

Important: Before installing from source, make sure Java is installed. Java is used to run the Kaitai Struct compiler, which compiles the file format definitions.

This will automatically install the polyfile and polymerge executables in your path.

Usage

Running polyfile on a file with no additional arguments will mimic the behavior of file --keep-going:

$ polyfile png-polyglot.png
PNG image data, 256 x 144, 8-bit/color RGB, non-interlaced
Brainfu** Program
Malformed PDF
PDF document, version 1.3,  1 pages
ZIP end of central directory record Java JAR archive 

To generate an interactive hex viewer for the file, use the --html option:

$ polyfile --html output.html png-polyglot.png
Found a file of type application/pdf at byte offset 0
Found a file of type application/x-brainfuck at byte offset 0
Found a file of type image/png at byte offset 0
Found a file of type application/zip at byte offset 0
Found a file of type application/java-archive at byte offset 0
Saved HTML output to output.html

Run polyfile --help for full usage instructions.

Interactive Debugger

PolyFile has an interactive debugger for both its file matching and its parsing. It can be used to debug a libmagic pattern definition, to determine why a specific file fails to be classified as the expected MIME type, or to step through a parser. You can run PolyFile with the debugger enabled using the -db option.

File Support

PolyFile has a cleanroom, pure-Python implementation of the libmagic file classifier, and supports all 263 MIME types that libmagic can identify.

It currently has support for parsing and semantically mapping a number of file formats.

For an example that exercises all of these file formats, run:

curl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html -

Prior to version 0.3.0, PolyFile used the TrID database for file identification rather than the libmagic file definitions. This proved to be very slow (since TrID has many duplicate entries) and prone to false positives (since TrID's file definitions are much simpler than libmagic's). The original TrID matching code is still shipped with PolyFile and can be invoked programmatically, but it is not used by default.

Output Format

PolyFile has several options for outputting its results, specified by its --format option. For computer-readable output, PolyFile can emit an extension of the SBuD JSON format described in the documentation. Prior to version 0.5.0, SBuD was the default output format; the default is now to mimic the output of the file command. To get the original SBuD behavior, use the --format sbud option.
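
For scripted consumption, the SBuD output can be captured and parsed as ordinary JSON. The following is a minimal sketch: it assumes only that polyfile is on the PATH and that --format sbud writes a JSON document to standard output; the specific keys in the report are not assumed here.

import json
import subprocess

# Run PolyFile with machine-readable output and capture stdout.
# png-polyglot.png is the example file used elsewhere in this README.
completed = subprocess.run(
    ["polyfile", "--format", "sbud", "png-polyglot.png"],
    capture_output=True,
    check=True,
)

# The SBuD report is JSON, so it can be loaded directly.
report = json.loads(completed.stdout)

# Inspect whichever top-level keys the report provides.
print(sorted(report.keys()))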

libmagic Implementation

PolyFile has a cleanroom implementation of libmagic (used in the file command). It can be invoked programmatically by running:

from polyfile.magic import MagicMatcher

with open("file_to_test", "rb") as f:
    # the default instance automatically loads all file definitions
    for match in MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
        for mimetype in match.mimetypes:
            print(f"Matched MIME: {mimetype}")
        print(f"Match string: {match!s}")

To load a specific or custom file definition:

list_of_paths_to_definitions = ["def1", "def2"]
matcher = MagicMatcher.parse(*list_of_paths_to_definitions)
with open("file_to_test", "rb") as f:
    for match in matcher.match(f.read()):
        ...
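
As an end-to-end illustration, the sketch below writes a small custom definition in the standard libmagic DSL (which PolyFile's cleanroom implementation is designed to parse) to a temporary file, then matches an in-memory buffer against it. The definition, MIME type, and test buffer are invented for this example; only MagicMatcher.parse and match are taken from the usage shown above.

import tempfile

from polyfile.magic import MagicMatcher

# A made-up definition in standard libmagic syntax: match the literal bytes
# "HELLO" at offset 0 and report a custom MIME type.
definition = b"0\tstring\tHELLO\tHello container\n!:mime\tapplication/x-hello\n"

with tempfile.NamedTemporaryFile(suffix=".magic", delete=False) as tmp:
    tmp.write(definition)
    definition_path = tmp.name

matcher = MagicMatcher.parse(definition_path)

for match in matcher.match(b"HELLO world"):
    for mimetype in match.mimetypes:
        print(f"Matched MIME: {mimetype}")  # should print application/x-hello if the definition matches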

Extending PolyFile

Instructions on extending PolyFile to support more file formats with new matchers and parsers are described in the documentation.

License and Acknowledgements

This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. © 2019, Trail of Bits.

Contributors

apstickler, artemdinaburg, danieldjewell, dependabot[bot], esultanik, facutuesca, kaoudis, mike-myers-tob, oldsj, pombredanne, samcowger, woodruffw

polyfile's Issues

Runtime exception: 'NoneType' object is not subscriptable

Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
TypeError: 'NoneType' object is not subscriptable

govdocs/706/706211.pdf

Let me know if I should attach the triggering file.

UnboundLocalError: local variable 'is_dct_decode' referenced before assignment

Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 208, in parse_object
    if is_dct_decode and raw_content[:1] == b'\xff':
UnboundLocalError: local variable 'is_dct_decode' referenced before assignment

triggering file: common_crawl/6fb/9cc/eb3/6fb9cceb33a2bf98749e895e43840e90f4bcbf4a631b512c010675e3763f5433

Some PDF streams not detected

If I run polyfile on an off-the-shelf PDF, it will generally detect all streams. If I use macOS Preview to highlight/annotate the file (five annotations, in this case), a number of streams are added. (Many of them appear to be ICC color spaces, and oddly, many are identical: seemingly 20 in this case.) polyfile is unable to detect most of the newly added streams:

> ls -lh optician*
-rw-r--r--@ 1 sam  staff   515K Jan  8 10:41 optician-with-annots.pdf
-rw-r--r--@ 1 sam  staff   370K Jan  8 10:41 optician.pdf
> strings optician.pdf | grep -c endstream
158
> polyfile -q optician.pdf | jq | grep -c "\"type\": \"EndStream\""
158
> strings optician-with-annots.pdf | grep -c endstream
198
> polyfile -q optician-with-annots.pdf | jq | grep -c "\"type\": \"EndStream\""
158

The files in question:
optician-with-annots.pdf
optician.pdf

Example files that cause an unexpected delay iterating (infinite?) matches

On Windows with Python 3.11 and polyfile 0.5.2, processing the following files as demonstrated in the README seems to take forever:

For comparison, python-magic (using libmagic v4 from years ago) spits out an error mentioning regex and memory.

Code:
import polyfile

def from_file(file):
    print(file)
    with open(file, "rb") as f:
        # the default instance automatically loads all file definitions
        for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
            for mimetype in match.mimetypes:
                print(f"Matched MIME: {mimetype}", flush=True)
            print(f"Match string: {match!s}", flush=True)

from_file("test3.py")
from_file("memblock.txt")
from_file("whisper.cpp/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java")
Output, including the stack trace after pressing Ctrl+C after waiting ~5 minutes:
09/05/2023 01:45:27 C:\Users\WDAGUtilityAccount\Desktop> python.exe .\test3.py
test3.py
Matched MIME: text/plain
Match string: ascii text
memblock.txt
Matched MIME: text/x-c
Match string: C source text
Traceback (most recent call last):
  File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 13, in <module>
    from_file("memblock.txt")
  File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 7, in from_file
    for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2742, in match
    if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in __bool__
    return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in <genexpr>
    return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 44, in __iter__
    yield self[i]
          ~~~~^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
    self._items.append(next(self._source_iter))
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 54, in unique
    for t in iterator:
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2493, in <genexpr>
    return LazyIterableSet((
                           ^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2543, in __iter__
    yield self[i]
          ~~~~^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2527, in __getitem__
    result = next(self._result_iter)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 928, in _match
    yield from child._match(context=context, parent_match=m)
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 917, in _match
    m = self.test(context.data, absolute_offset, parent_match)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2103, in test
    match = self.data_type.match(data[absolute_offset:], self.constant)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 1767, in match
    m = expected.search(data[:self.length])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
09/05/2023 01:50:32 C:\Users\WDAGUtilityAccount\Desktop>

Feature request: improved search w/ stream support

Sometimes in the FAW, we'll find an error message that refers to specific bytes (such as a font or key name) within a set of files. It would be very handy to be able to reliably search for these byte sequences within files using the Polyfile report. While Polyfile has a search function, the current version fails to find many specific byte sequences within files. Presumably this is due to the fact that it doesn't recurse into streams, though there may be another cause.

Let me know if it would be helpful to attach a specific file / search string.

AssertionError in unget: assert isinstance(byte, PDFByte)

Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 243, in parse_pdf
    object = parser.GetObject()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 426, in GetObject
    self.token = self.oPDFTokenizer.TokenIgnoreWhiteSpace()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 396, in TokenIgnoreWhiteSpace
    token = self.Token()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 373, in Token
    self.oPDF.unget(self.byte)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 240, in unget
    assert isinstance(byte, PDFByte)
AssertionError

File: govdocs/176/176361.pdf

Build of 0.3.3 fails with `TypeError: '>' not supported between instances of 'NoneType' and 'float'`

Downloading https://github.com/trailofbits/polyfile/archive/refs/tags/v0.3.3.tar.gz and building yields an error.

❯ python setup.py build

Traceback (most recent call last):
  File "/home/dkasak/code/dkasak/packages/aur_packages/polyfile/src/polyfile-0.3.3/setup.py", line 47, in <module>
    if not MANIFEST_PATH.exists() or newest_definition > MANIFEST_PATH.stat().st_mtime:
TypeError: '>' not supported between instances of 'NoneType' and 'float'

Looking at setup.py, the cause is here:

# see if any of the files are out of date and need to be recompiled
newest_definition: Optional[float] = None
for definition in KAITAI_FORMAT_LIBRARY.glob("**/*.ksy"):
    mtime = definition.stat().st_mtime
    if newest_definition is None or newest_definition < mtime:
        newest_definition = mtime

It turns out that the KAITAI_FORMAT_LIBRARY: Path = POLYFILE_DIR / "kaitai_struct_formats" directory is empty, so the loop above never executes and newest_definition remains None:

❯ ls -l kaitai_struct_formats
total 0
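
One possible guard (purely an illustrative sketch, not the project's actual fix; the recompilation call below is a hypothetical placeholder) would be to skip the freshness comparison when no .ksy definitions were found:

# Illustrative only: treat "no definitions found" as "nothing to recompile"
# rather than comparing None against the manifest's mtime.
if newest_definition is not None and (
    not MANIFEST_PATH.exists() or newest_definition > MANIFEST_PATH.stat().st_mtime
):
    recompile_kaitai_definitions()  # hypothetical placeholder for the existing compile step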

Two crashes when parsing PDFs

From Govdocs:

000899.pdf
001940.pdf

Parsing PDF obj 62 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 38, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Parsing PDF obj 424 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in _emit_dict
    ''.join(v.token for v in value),
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in <genexpr>
    ''.join(v.token for v in value),
AttributeError: 'str' object has no attribute 'token'

'List index out of range' on one of Ange's POC files

When I try to parse https://github.com/corkami/pocs/blob/master/pdf/PDFSecrets/deniable%20removal%202%20with%20incremental%20update.pdf

I get:

Found a subregion of type Value at byte offset 111
Parsing PDF obj 4 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 136, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 53, in _emit_dict
    yield from _emit_dict(value, pair, pdf_offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 35, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range

Cannot unpack non-iterable NoneType

Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 881, in ParseDictionary
    value, tokens = self.ParseDictionary(tokens)
TypeError: cannot unpack non-iterable NoneType object

triggering file: govdocs/810/810124.pdf

Weird Zip parsing in Evan's resume

It's pretty cool to have NES and PDF dissection of the polyglot resume, but what's with the weird ZIP structure?
[screenshot]

Something went wrong:
[screenshot]

Readline missing from setup.py?

Trying to install this fresh on a Windows box today.

Traceback (most recent call last):
  File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Spencer\Apps\python\Scripts\polyfile.exe\__main__.py", line 4, in <module>
  File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__init__.py", line 2, in <module>
    from .__main__ import main
  File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__main__.py", line 18, in <module>
    from .debugger import Debugger
  File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\debugger.py", line 12, in <module>
    from .repl import ANSIColor, ANSIWriter, arg_completer, command, ExitREPL, log, REPL, SetCompleter
  File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\repl.py", line 7, in <module>
    import readline
ModuleNotFoundError: No module named 'readline'

It looks like readline is a new requirement and isn't listed in the requirements:

polyfile/setup.py

Lines 123 to 134 in 9c2d20b

install_requires=[
'cint',
'graphviz',
'intervaltree',
'jinja2',
'kaitaistruct>=0.7',
'networkx',
'pdfminer.six', # currently just for ascii85decode
'Pillow>=5.0.0',
'pyyaml>=3.13',
'setuptools'
],

Pip version: pip 21.3.1 from C:\Users\Spencer\Apps\python\lib\site-packages\pip (python 3.9)

Difficulty handling some key-value pairs

In the file 123165.pdf, object 37 contains the dictionary <</Filter [/FlateDecode]\n/Length 1813>>. The PDF spec explicitly allows this (the value associated with the /Filter key may be "an array of zero, one or several names"), but polyfile treats that dict as having three key-value pairs: the first is /Filter and [, the second is /FlateDecode and ]\n, and the third is /Length and 1813. This issue seems to be shared in other PDFs with arrays as values (e.g. 123324.pdf).

Incompatible with latest version of chardet

Since commit a98992c, polyfile uses a narrower version specifier for chardet. In the meantime, chardet 5.1.0 has been released (and packaged on Arch Linux), which leads to a DistributionNotFound error, since pkg_resources is explicitly imported at runtime:

Traceback (most recent call last):
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 581, in _build_master
    ws.require(__requires__)
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 909, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 800, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (chardet 5.1.0 (/usr/lib/python3.10/site-packages), Requirement.parse('chardet~=5.0.0'), {'polyfile'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/polyfile", line 33, in <module>
    sys.exit(load_entry_point('polyfile==0.5.0', 'console_scripts', 'polyfile')())
  File "/usr/bin/polyfile", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3.10/site-packages/polyfile/__init__.py", line 1, in <module>
    from . import nes, pdf, jpeg, zipmatcher, nitf, kaitaimatcher, languagematcher, polyfile
  File "/usr/lib/python3.10/site-packages/polyfile/nes.py", line 6, in <module>
    from .polyfile import register_parser, InvalidMatch, Submatch
  File "/usr/lib/python3.10/site-packages/polyfile/polyfile.py", line 8, in <module>
    import pkg_resources
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3260, in <module>
    def _initialize_master_working_set():
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3234, in _call_aside
    f(*args, **kwargs)
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3272, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 596, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 795, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'chardet~=5.0.0' distribution was not found and is required by polyfile

This was observed on Arch Linux with the AUR package for polyfile at version 0.5.

Can the version specifier be safely loosened to chardet>=5.0.0 again?
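
For reference, the requested change would amount to relaxing the specifier in setup.py's install_requires, roughly like this (sketch only; surrounding entries elided):

install_requires=[
    # ...
    'chardet>=5.0.0',  # was 'chardet~=5.0.0', which rejects chardet 5.1.0
    # ...
],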

[Feature Request/Improvement] Alternate JSON Output w/o b64contents

Any thoughts on adding a way (see below) to skip the base64-encoded copy of the scanned file in the JSON output? I recognize that having it there is part of SBuD, and I can definitely see the benefit: a more or less self-contained report that includes the file data is great for, say, later security/virus/malware analysis. But it also makes the JSON output absolutely gigantic, and the size scales with the size of the scanned input file.

Options could be:

  • Add a new output format (like "json-nob64") that doesn't include it
  • Add a command line switch to skip it (--no-contents or something like that?)

Also a second question becomes:

  • Change the schema of the JSON output and remove the b64contents key entirely (this is probably a bad idea...)
  • Just set the b64contents key to an empty string (or even None)
  • Set the b64contents key to some string actually encoded in base64 ... say base64("null")...

Ultimately, the idea is to not introduce a breaking change into the default behavior - arguably, either a new output format or a --no-contents flag preserves existing functionality. As to removing the key entirely, I suppose it's also arguable about which is better/worse: removing the b64contents key, replacing the data in the key with None/null, or setting the key to a short base64 encoded string of "null".

In my experience, at least in the Python world, developers often don't check for the existence of a key in a dict (or they don't use the dict.get() method, which gracefully handles a missing key, unlike mydict['noKey']). I suppose the concern is somewhat moot, though, since the default behavior won't change.

With either option, it seems prudent to add an optional parameter to the polyfile.Analyzer.sbud method (see below) to skip the encoding of the data to base64 - there doesn't appear to be a reason to waste CPU cycles (and memory) to convert the data to base64 if it will be stripped from the output.

def sbud(self, matches: Optional[Iterable[Match]] = None) -> Dict[str, Any]:

b64contents = base64.b64encode(data)
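
An illustrative, self-contained sketch of what the requested flag could look like (this is not the actual polyfile.Analyzer.sbud API; the function name, keys, and defaults below are invented to show the idea of skipping the base64 encoding entirely when contents are not wanted):

import base64
from typing import Any, Dict

def sbud_report(data: bytes, include_contents: bool = True) -> Dict[str, Any]:
    """Hypothetical stand-in for Analyzer.sbud, illustrating an opt-out flag."""
    report: Dict[str, Any] = {"length": len(data)}
    if include_contents:
        # Only pay the CPU/memory cost of base64-encoding when the caller wants it.
        report["b64contents"] = base64.b64encode(data).decode("ascii")
    return report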

Fails to identify some ZIP files

For example, this one:

bggp

In fact, the only filetype identified is PNG.

Not detecting the PDF is understandable, since it's out-of-spec - the header appears too far into the file - although Firefox can still render it.

Polyfile crashes with NotImplementedError

I am running PolyFile version 0.4.2 on Python 3.9.7 on Windows. I get unexpected exceptions of type NotImplementedError. Consider the following test:

    def test_polyfile(self):
        from polyfile.magic import MagicMatcher
        data = zlib.decompress(base64.b64decode(
            'eNptUsFO4zAQvVvyPwyHSnAgtpukpRJCKtBuJbqkanxZbRAy1C2BkqDYRbv79YydRGm7WLJlv3me9zzj'
            '3uJ2ei4CQYkADuXTKyWXl0AJMPn3QwO7UVZty40DFmqjDfSRtqTk6ooSXaz8BUr6R3fv8pWB3xA6Ljw4'
            '5KbcFRbEXuY63XGmsMt0SK2TFFYX1kBUmwA2HoMnAkuaDbApnGY4xKgfiMFF0I8DkWVWG5tl63yrz2rW'
            'LfrjSM6tN4hICuxHKcuJOzlT9YLiFWq27wa21KbcVc/ovVGeoqtOXLTb1rwLN0C6e7IecxHRgNfKaJ+C'
            'zfT2U9v8WfmIV++MHJYpOir4XBcb+wKC85pJibGVVu+UXEtKmBSPHIsv19hmdxUPEZIDzjkM4zAYDQcg'
            'kYwItLPCpp8mSbJIT+AXvhju5fwnzMbpDF6UgdedsTDX6k2vggDOQKKZifQeW+nO7p9KozSHGJduwCCO'
            'wxjWe6BAbR8q9sDhN6CIov/BKBx1ICW2Utjvqv1Ly7J0P7BpY5r/0xDV1TJWVbb2OBCI9XqTZPoFx5+0'
            'nw=='
        ))
        types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
        self.assertIn('application/pdf', types)

In my setup, this crashes with the following exception:

Traceback (most recent call last):
  File "X:\test\test_polyfile.py", line 82, in test_polyfile
    types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
  File "X:\test\test_polyfile.py", line 82, in <listcomp>
    types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2184, in match
    if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in __bool__
    return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in <genexpr>
    return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
  File "X:\venv\lib\site-packages\polyfile\iterators.py", line 44, in __iter__
    yield self[i]
  File "X:\venv\lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
    self._items.append(next(self._source_iter))
  File "X:\venv\lib\site-packages\polyfile\iterators.py", line 54, in unique
    for t in iterator:
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2005, in <genexpr>
    return LazyIterableSet((
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2047, in __iter__
    yield self[i]
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 2031, in __getitem__
    result = next(self._result_iter)
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 760, in _match
    m = self.test(context.data, absolute_offset, parent_match)
  File "X:\venv\lib\site-packages\polyfile\magic.py", line 1953, in test
    raise NotImplementedError(
NotImplementedError: TODO: Implement support for the DER test (e.g., using the Kaitai asn1_der.py parser)

From my limited understanding, I would expect that this exception should not be propagated to me; instead, I would expect that when a test raises an exception, it is silently discarded as having produced no match.
