trailofbits / polyfile
A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer
License: Apache License 2.0
Apache Tika's example of a .docx file with recursively embedded zip files takes roughly an hour to process.
I confirmed privately with a key committer that this is unexpected.
File is available here: https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 243, in parse_pdf
    object = parser.GetObject()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 426, in GetObject
    self.token = self.oPDFTokenizer.TokenIgnoreWhiteSpace()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 396, in TokenIgnoreWhiteSpace
    token = self.Token()
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 373, in Token
    self.oPDF.unget(self.byte)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 240, in unget
    assert isinstance(byte, PDFByte)
AssertionError
File: govdocs/176/176361.pdf
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
TypeError: 'NoneType' object is not subscriptable
File: govdocs/706/706211.pdf
Let me know if I should attach the triggering file.
398129.pdf
File from govdocs1.
hex editor
The StatusLogHandler defaults to using sys.stderr.buffer, but that attribute is not guaranteed to exist when sys.stderr has been replaced with a custom IO object.
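One defensive option is to fall back to wrapping the text stream when .buffer is missing. This is a minimal sketch, not polyfile's actual API; get_binary_stderr and _EncodingWriter are hypothetical names:

```python
import sys


def get_binary_stderr():
    """Return a binary stream for status output.

    Prefers sys.stderr.buffer, but falls back to wrapping sys.stderr
    when a replacement stream (e.g. io.StringIO) has no .buffer
    attribute. Hypothetical helper for illustration.
    """
    stderr = sys.stderr
    buffer = getattr(stderr, "buffer", None)
    if buffer is not None:
        return buffer

    class _EncodingWriter:
        """Adapter that lets callers write bytes to a text-only stream."""

        def __init__(self, text_stream):
            self._stream = text_stream

        def write(self, data):
            if isinstance(data, bytes):
                data = data.decode("utf-8", errors="replace")
            return self._stream.write(data)

        def flush(self):
            self._stream.flush()

    return _EncodingWriter(stderr)
```

The handler could then call get_binary_stderr() at emit time instead of caching sys.stderr.buffer at import time, so later replacement of sys.stderr is also picked up.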
Add support for the undocumented ternary operator in libmagic that is apparently only used in the ELF definition:
https://github.com/file/file/blob/03b6dcb4a24455207ef4094560c334fbc38875bd/magic/Magdir/elf#L61-L63
The x operand tests whether the execute bits are set on the input file: https://github.com/file/file/blob/master/src/softmagic.c#L529-L566
This will probably require modifying the API to maintain information about the execute bits.
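The filesystem side of that check is straightforward; the harder part is plumbing the flag into the matcher, which currently sees only in-memory bytes. A hypothetical helper (not polyfile's API) for the stat portion:

```python
import os
import stat


def is_executable(path: str) -> bool:
    """Check whether any execute bit is set on the input file,
    mirroring what libmagic's x ternary operand tests.

    Hypothetical helper: polyfile's matcher would need to carry this
    flag alongside the file contents, since matching on bytes alone
    cannot answer it.
    """
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```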
On Windows with python 3.11 and polyfile 0.5.2, processing the following files as demonstrated in the README seems to take forever:
For comparison, python-magic, using a years-old libmagic v4, spits out an error mentioning regex and memory.
import polyfile

def from_file(file):
    print(file)
    with open(file, "rb") as f:
        # the default instance automatically loads all file definitions
        for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
            for mimetype in match.mimetypes:
                print(f"Matched MIME: {mimetype}", flush=True)
            print(f"Match string: {match!s}", flush=True)

from_file("test3.py")
from_file("memblock.txt")
from_file("whisper.cpp/bindings/java/src/test/java/io/github/ggerganov/whispercpp/WhisperCppTest.java")
09/05/2023 01:45:27 C:\Users\WDAGUtilityAccount\Desktop> python.exe .\test3.py
test3.py
Matched MIME: text/plain
Match string: ascii text
memblock.txt
Matched MIME: text/x-c
Match string: C source text
Traceback (most recent call last):
File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 13, in <module>
from_file("memblock.txt")
File "C:\Users\WDAGUtilityAccount\Desktop\test3.py", line 7, in from_file
for match in polyfile.magic.MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2742, in match
if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in __bool__
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2513, in <genexpr>
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 44, in __iter__
yield self[i]
~~~~^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
self._items.append(next(self._source_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\iterators.py", line 54, in unique
for t in iterator:
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2493, in <genexpr>
return LazyIterableSet((
^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2543, in __iter__
yield self[i]
~~~~^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2527, in __getitem__
result = next(self._result_iter)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 928, in _match
yield from child._match(context=context, parent_match=m)
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 917, in _match
m = self.test(context.data, absolute_offset, parent_match)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 2103, in test
match = self.data_type.match(data[absolute_offset:], self.constant)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WDAGUtilityAccount\scoop\apps\python\current\Lib\site-packages\polyfile\magic.py", line 1767, in match
m = expected.search(data[:self.length])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
09/05/2023 01:50:32 C:\Users\WDAGUtilityAccount\Desktop>
Dynamically load jinja2 at runtime so it is not required for installation via setup.py.
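The usual pattern is to defer the import into the function that needs it. A sketch under the assumption that a single template entry point exists (render_template is a hypothetical name, not polyfile's real API):

```python
def render_template(template_source: str, **context) -> str:
    """Render a template, importing jinja2 only when actually needed.

    Sketch of the deferred-import pattern: the import cost (and the
    install-time dependency) is paid only by users who invoke a
    jinja2-backed feature.
    """
    try:
        import jinja2  # deferred: not required at install time
    except ImportError as e:
        raise RuntimeError(
            "this feature requires jinja2; install it with `pip install jinja2`"
        ) from e
    return jinja2.Template(template_source).render(**context)
```

jinja2 could then move from install_requires to an extras_require group so plain installs stay lightweight.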
Add a command line option to only do filetype/polyglot matching and not parse subtypes.
Since commit a98992c, polyfile uses a narrower version specifier for the expected version of chardet. In the meantime, chardet 5.1.0 has been released (and packaged on Arch Linux), which leads to a DistributionNotFound error, since pkg_resources is explicitly imported at runtime:
Traceback (most recent call last):
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 581, in _build_master
ws.require(__requires__)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 909, in require
needed = self.resolve(parse_requirements(requirements))
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 800, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (chardet 5.1.0 (/usr/lib/python3.10/site-packages), Requirement.parse('chardet~=5.0.0'), {'polyfile'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/polyfile", line 33, in <module>
sys.exit(load_entry_point('polyfile==0.5.0', 'console_scripts', 'polyfile')())
File "/usr/bin/polyfile", line 25, in importlib_load_entry_point
return next(matches).load()
File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
module = import_module(match.group('module'))
File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/lib/python3.10/site-packages/polyfile/__init__.py", line 1, in <module>
from . import nes, pdf, jpeg, zipmatcher, nitf, kaitaimatcher, languagematcher, polyfile
File "/usr/lib/python3.10/site-packages/polyfile/nes.py", line 6, in <module>
from .polyfile import register_parser, InvalidMatch, Submatch
File "/usr/lib/python3.10/site-packages/polyfile/polyfile.py", line 8, in <module>
import pkg_resources
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3260, in <module>
def _initialize_master_working_set():
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3234, in _call_aside
f(*args, **kwargs)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3272, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 583, in _build_master
return cls._build_from_requirements(__requires__)
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 596, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/home/prettybits/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 795, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'chardet~=5.0.0' distribution was not found and is required by polyfile
This was observed on Arch Linux with the AUR package for polyfile at version 0.5.
Can the version specifier be safely loosened to chardet>=5.0.0 again?
In the file 123165.pdf, object 37 contains the dictionary <</Filter [/FlateDecode]\n/Length 1813>>. The PDF spec explicitly allows this (the value associated with the /Filter key may be "an array of zero, one or several names"), but polyfile treats that dict as having three key-value pairs: /Filter paired with [, /FlateDecode paired with ]\n, and /Length paired with 1813. The same issue appears in other PDFs with arrays as values (e.g. 123324.pdf).
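The fix amounts to treating a bracketed array as a single value while walking the dictionary tokens. A toy sketch (deliberately much cruder than pdfparser.py: it assumes balanced brackets and no nested dictionaries):

```python
import re


def parse_pdf_dict(data: str) -> dict:
    """Parse a flat PDF dictionary body, grouping [...] arrays as one value.

    Illustrative toy parser: tokenizes names, delimiters, and bare
    values, then collects everything between [ and ] into a list
    instead of pairing the brackets with keys.
    """
    tokens = re.findall(r"<<|>>|\[|\]|/[^\s/\[\]<>]+|[^\s/\[\]<>]+", data)
    # strip the enclosing << >>
    if tokens and tokens[0] == "<<":
        tokens = tokens[1:]
    if tokens and tokens[-1] == ">>":
        tokens = tokens[:-1]
    result = {}
    i = 0
    while i < len(tokens):
        key = tokens[i]
        i += 1
        if i < len(tokens) and tokens[i] == "[":
            # collect everything up to the matching close bracket
            j = tokens.index("]", i)
            result[key] = tokens[i + 1:j]
            i = j + 1
        else:
            result[key] = tokens[i]
            i += 1
    return result
```

On the dictionary from object 37, this yields two pairs: /Filter mapped to the one-element array [/FlateDecode], and /Length mapped to 1813.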
I ran polyfile on yes.png
(https://github.com/corkami/collisions/blob/master/workshop/yes.png).
There's no parser for PNG, so the JSON is quite empty, which is OK; on the other hand, the standalone HTML output shows binary data at subsequent offsets even though the JSON defines no value there.
Sometimes in the FAW, we'll find an error message that refers to specific bytes (such as a font or key name) within a set of files. It would be very handy to be able to reliably search for these byte sequences within files using the Polyfile report. While Polyfile has a search function, the current version fails to find many specific byte sequences within files. Presumably this is due to the fact that it doesn't recurse into streams, though there may be another cause.
Let me know if it would be helpful to attach a specific file / search string.
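Recursing into FlateDecode'd stream bodies would catch most of these misses. A rough sketch (regex-based stream location is far cruder than real PDF parsing, and search_with_streams is a hypothetical name, not Polyfile's search API):

```python
import re
import zlib


def search_with_streams(pdf_bytes: bytes, needle: bytes):
    """Search for a byte sequence in the raw file and inside any
    FlateDecode'd stream bodies.

    Returns a list of ("raw" | "stream", offset) hits; stream hits
    are offsets within the decompressed data.
    """
    hits = [("raw", m.start()) for m in re.finditer(re.escape(needle), pdf_bytes)]
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            decoded = zlib.decompress(m.group(1))
        except zlib.error:
            continue  # not FlateDecode, or corrupt
        for hit in re.finditer(re.escape(needle), decoded):
            hits.append(("stream", hit.start()))
    return hits
```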
When I try to parse https://github.com/corkami/pocs/blob/master/pdf/PDFSecrets/deniable%20removal%202%20with%20incremental%20update.pdf
I get:
Found a subregion of type Value at byte offset 111
Parsing PDF obj 4 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 136, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 53, in _emit_dict
    yield from _emit_dict(value, pair, pdf_offset)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 35, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Hi, it looks like issue #30 was closed but the pull request associated with it was never merged into master. Was this intentional?
I am running PolyFile version 0.4.2 on Python 3.9.7 on Windows. I get unexpected exceptions of type NotImplementedError. Consider the following test:
def test_polyfile(self):
    from polyfile.magic import MagicMatcher
    data = zlib.decompress(base64.b64decode(
        'eNptUsFO4zAQvVvyPwyHSnAgtpukpRJCKtBuJbqkanxZbRAy1C2BkqDYRbv79YydRGm7WLJlv3me9zzj'
        '3uJ2ei4CQYkADuXTKyWXl0AJMPn3QwO7UVZty40DFmqjDfSRtqTk6ooSXaz8BUr6R3fv8pWB3xA6Ljw4'
        '5KbcFRbEXuY63XGmsMt0SK2TFFYX1kBUmwA2HoMnAkuaDbApnGY4xKgfiMFF0I8DkWVWG5tl63yrz2rW'
        'LfrjSM6tN4hICuxHKcuJOzlT9YLiFWq27wa21KbcVc/ovVGeoqtOXLTb1rwLN0C6e7IecxHRgNfKaJ+C'
        'zfT2U9v8WfmIV++MHJYpOir4XBcb+wKC85pJibGVVu+UXEtKmBSPHIsv19hmdxUPEZIDzjkM4zAYDQcg'
        'kYwItLPCpp8mSbJIT+AXvhju5fwnzMbpDF6UgdedsTDX6k2vggDOQKKZifQeW+nO7p9KozSHGJduwCCO'
        'wxjWe6BAbR8q9sDhN6CIov/BKBx1ICW2Utjvqv1Ly7J0P7BpY5r/0xDV1TJWVbb2OBCI9XqTZPoFx5+0'
        'nw=='
    ))
    types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
    self.assertIn('application/pdf', types)
In my setup, this crashes with the following exception:
Traceback (most recent call last):
File "X:\test\test_polyfile.py", line 82, in test_polyfile
types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
File "X:\test\test_polyfile.py", line 82, in <listcomp>
types = [next(iter(match.mimetypes)) for match in MagicMatcher.DEFAULT_INSTANCE.match(data)]
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2184, in match
if m and (not to_match.only_match_mime or any(t is not None for t in m.mimetypes)):
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in __bool__
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2017, in <genexpr>
return any(m for m in self.mimetypes) or any(e for e in self.extensions) or bool(self.message())
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 44, in __iter__
yield self[i]
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 30, in __getitem__
self._items.append(next(self._source_iter))
File "X:\venv\lib\site-packages\polyfile\iterators.py", line 54, in unique
for t in iterator:
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2005, in <genexpr>
return LazyIterableSet((
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2047, in __iter__
yield self[i]
File "X:\venv\lib\site-packages\polyfile\magic.py", line 2031, in __getitem__
result = next(self._result_iter)
File "X:\venv\lib\site-packages\polyfile\magic.py", line 760, in _match
m = self.test(context.data, absolute_offset, parent_match)
File "X:\venv\lib\site-packages\polyfile\magic.py", line 1953, in test
raise NotImplementedError(
NotImplementedError: TODO: Implement support for the DER test (e.g., using the Kaitai asn1_der.py parser)
From my limited understanding, I would expect that this exception should not propagate to the caller; instead, when a test raises an exception, it should be silently discarded as having produced no match.
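The expected behavior can be expressed as a small wrapper around the test loop. This is a sketch of the reporter's proposal against a hypothetical interface (an iterable of test callables), not polyfile's internal matcher API:

```python
def safe_matches(tests, data):
    """Yield results from magic tests, treating any test that raises
    as a non-match instead of propagating the error.

    Each element of `tests` is assumed to be a callable taking the
    data and returning a match object or None (hypothetical
    interface for illustration).
    """
    for test in tests:
        try:
            result = test(data)
        except Exception:
            continue  # an erroring test simply produced no match
        if result is not None:
            yield result
```

A refinement would be to log the swallowed exception at debug level so unimplemented tests (like the DER one above) remain discoverable.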
Govdocs -
Parsing PDF obj 62 0
Traceback (most recent call last):
File "/usr/local/bin/polyfile", line 11, in <module>
load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
yield from submatch_iter
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 38, in _emit_dict
value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Parsing PDF obj 424 0
Traceback (most recent call last):
File "/usr/local/bin/polyfile", line 11, in <module>
load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
yield from submatch_iter
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in _emit_dict
''.join(v.token for v in value),
File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in <genexpr>
''.join(v.token for v in value),
AttributeError: 'str' object has no attribute 'token'
If I run polyfile on an off-the-shelf PDF, it will generally detect all streams. If I use Mac's Preview to highlight/annotate the file, in this case 5 times, a number of streams are added. (Many of them appear to be ICC color spaces, and oddly, many are identical: seemingly 20 in this case.) polyfile is unable to detect most of the newly-added streams:
> ls -lh optician*
-rw-r--r--@ 1 sam staff 515K Jan 8 10:41 optician-with-annots.pdf
-rw-r--r--@ 1 sam staff 370K Jan 8 10:41 optician.pdf
> strings optician.pdf | grep -c endstream
158
> polyfile -q optician.pdf | jq | grep -c "\"type\": \"EndStream\""
158
> strings optician-with-annots.pdf | grep -c endstream
198
> polyfile -q optician-with-annots.pdf | jq | grep -c "\"type\": \"EndStream\""
158
The files in question:
optician-with-annots.pdf
optician.pdf
Downloading https://github.com/trailofbits/polyfile/archive/refs/tags/v0.3.3.tar.gz and building yields an error.
❯ python setup.py build
Traceback (most recent call last):
File "/home/dkasak/code/dkasak/packages/aur_packages/polyfile/src/polyfile-0.3.3/setup.py", line 47, in <module>
if not MANIFEST_PATH.exists() or newest_definition > MANIFEST_PATH.stat().st_mtime:
TypeError: '>' not supported between instances of 'NoneType' and 'float'
Looking at setup.py, the cause is here:
# see if any of the files are out of date and need to be recompiled
newest_definition: Optional[float] = None
for definition in KAITAI_FORMAT_LIBRARY.glob("**/*.ksy"):
    mtime = definition.stat().st_mtime
    if newest_definition is None or newest_definition < mtime:
        newest_definition = mtime
It turns out that the KAITAI_FORMAT_LIBRARY: Path = POLYFILE_DIR / "kaitai_struct_formats" dir is empty, so the loop above never executes and newest_definition remains None:
❯ ls -l kaitai_struct_formats
total 0
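A guard for the empty-glob case would avoid the TypeError. This is a sketch of one possible fix, with needs_recompile as a hypothetical refactoring of the setup.py logic:

```python
from pathlib import Path
from typing import Optional


def needs_recompile(manifest: Path, format_dir: Path) -> bool:
    """Decide whether the Kaitai definitions need recompiling.

    If no .ksy files are found, newest_definition stays None and there
    is nothing to compile from, so we must not compare None against
    the manifest's mtime (the source of the TypeError above).
    """
    newest_definition: Optional[float] = None
    for definition in format_dir.glob("**/*.ksy"):
        mtime = definition.stat().st_mtime
        if newest_definition is None or newest_definition < mtime:
            newest_definition = mtime
    if newest_definition is None:
        return False  # empty checkout: skip compilation
    return not manifest.exists() or newest_definition > manifest.stat().st_mtime
```

An arguably better behavior when the directory is empty would be to raise a clear error telling the user to initialize the kaitai_struct_formats git submodule, since the release tarball evidently ships without it.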
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 126, in parse_object
    oPDFParseDictionary = pdfparser.cPDFParseDictionary(object.content, False)
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 844, in __init__
    self.parsed = self.ParseDictionary(dataTrimmed)[0]
  File "/Users/allison/Documents/tob/polyfile/pdfparser.py", line 881, in ParseDictionary
    value, tokens = self.ParseDictionary(tokens)
TypeError: cannot unpack non-iterable NoneType object
triggering file: govdocs/810/810124.pdf
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile', 'console_scripts', 'polyfile')()
  File "/Users/allison/Documents/tob/polyfile/__main__.py", line 72, in main
    for match in matcher.match(file_path, progress_callback=progress_callback):
  File "/Users/allison/Documents/tob/polyfile/polyfile.py", line 170, in match
    yield from submatch_iter
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 288, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 282, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/Users/allison/Documents/tob/polyfile/pdf.py", line 208, in parse_object
    if is_dct_decode and raw_content[:1] == b'\xff':
UnboundLocalError: local variable 'is_dct_decode' referenced before assignment
triggering file: common_crawl/6fb/9cc/eb3/6fb9cceb33a2bf98749e895e43840e90f4bcbf4a631b512c010675e3763f5433
Trying to install this fresh on a Windows box today.
Traceback (most recent call last):
File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Spencer\Apps\python\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Spencer\Apps\python\Scripts\polyfile.exe\__main__.py", line 4, in <module>
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__init__.py", line 2, in <module>
from .__main__ import main
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\__main__.py", line 18, in <module>
from .debugger import Debugger
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\debugger.py", line 12, in <module>
from .repl import ANSIColor, ANSIWriter, arg_completer, command, ExitREPL, log, REPL, SetCompleter
File "C:\Users\Spencer\Apps\python\lib\site-packages\polyfile\repl.py", line 7, in <module>
import readline
ModuleNotFoundError: No module named 'readline'
It looks like readline is a new requirement and isn't listed in the requirements:
Lines 123 to 134 in 9c2d20b
Pip version: pip 21.3.1 from C:\Users\Spencer\Apps\python\lib\site-packages\pip (python 3.9)
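Since readline is a POSIX-only stdlib module (it cannot be pip-installed on Windows), a guarded import in repl.py is the usual fix. A sketch; pyreadline3 is a commonly used third-party Windows substitute, and history_available is a hypothetical helper name:

```python
# Make readline optional: it is absent from Windows's stdlib.
try:
    import readline  # POSIX stdlib line-editing support
except ImportError:
    try:
        import pyreadline3 as readline  # third-party Windows port, if installed
    except ImportError:
        readline = None  # REPL still works, just without history/completion


def history_available() -> bool:
    """Report whether line-editing support was found on this platform."""
    return readline is not None
```

The REPL would then gate its readline-specific calls (history, completers) on readline not being None.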
Any thoughts on doing something (see below) to add a way to skip the base64 output of the scanned file in JSON format? I recognize that having it in there is part of SBuD and I can definitely see the benefit/convenience (having a more-or-less "self-contained" format with the file data is great for say later security/virus/malware analysis...) -- but it also makes the JSON output absolutely gigantic (which scales up with the size of the input file scanned, of course).
Options could be:
- a new output format that omits the contents, or
- a command line flag (--no-contents or something like that?)

Also a second question becomes: what should happen to the b64contents key?
- remove the b64contents key entirely (this is probably a bad idea...)
- set the b64contents key to an empty string (or even None)
- set the b64contents key to some string actually encoded in base64 ... say base64("null") ...

Ultimately, the idea is to not introduce a breaking change into the default behavior - arguably, either a new output format or a --no-contents flag preserves existing functionality. As to removing the key entirely, I suppose it's also arguable which is better/worse: removing the b64contents key, replacing the data in the key with None/null, or setting the key to a short base64-encoded string of "null".
In my experience, at least in the Python world, developers often don't check for the existence of a key in a dict (or they don't use the dict.get() method, which gracefully handles a non-existing key, unlike mydict['noKey']). I suppose the concern is somewhat moot, since the default behavior won't change.
With either option, it seems prudent to add an optional parameter to the polyfile.Analyzer.sbud method (see below) to skip encoding the data to base64 - there doesn't appear to be a reason to waste CPU cycles (and memory) converting the data to base64 if it will be stripped from the output.
Line 372 in 438628f
Line 383 in 438628f
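The shape of that optional parameter could look like the sketch below. This is hypothetical: the real polyfile.Analyzer.sbud method has a different signature and emits many more fields; the point is that when contents are skipped, the base64 encoding is never computed at all:

```python
import base64


def sbud(data: bytes, include_contents: bool = True) -> dict:
    """Build a minimal SBuD-style dict, optionally omitting the
    base64-encoded file contents.

    Hypothetical sketch of the proposed parameter. Omitting the key
    (rather than setting it to None) means the encoding cost is never
    paid, and consumers can distinguish "not requested" from "empty".
    """
    result = {"length": len(data)}
    if include_contents:
        result["b64contents"] = base64.b64encode(data).decode("ascii")
    return result
```

A --no-contents CLI flag would then simply pass include_contents=False through to this method, leaving the default output byte-for-byte unchanged.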
In this example file: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testZipEncrypted.zip, one of the streams is encrypted, and the other isn't.
It might be useful to report an encrypted stream in the output json and then process the unencrypted file type rather than throwing an exception on the file and not processing it at all.
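Detecting which entries are encrypted is cheap: bit 0 of each entry's general-purpose flag marks traditional PKWARE encryption, and Python's zipfile exposes it without decrypting anything. A sketch of the per-entry classification (classify_zip_entries is a hypothetical helper, not polyfile's API):

```python
import io
import zipfile


def classify_zip_entries(zip_bytes: bytes):
    """Report which entries in a zip archive are encrypted, so the
    unencrypted ones can still be parsed.

    Bit 0 of ZipInfo.flag_bits is the encryption flag from the zip
    spec; reading the central directory does not require any password.
    """
    report = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            encrypted = bool(info.flag_bits & 0x1)
            report.append({"name": info.filename, "encrypted": encrypted})
    return report
```

polyfile could emit this report into the output JSON and recurse only into the entries marked unencrypted, instead of aborting on the whole archive.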
Evan do you think it would be possible to create a signal handler for SIGTERM to dump whatever polyfile has learned during its analysis?
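A sketch of what such a handler could look like; get_partial_results is a hypothetical hook returning whatever analysis state has accumulated so far (polyfile would need to expose one), and 143 is the conventional exit code for a SIGTERM death:

```python
import json
import signal
import sys


def install_dump_on_sigterm(get_partial_results):
    """Install a SIGTERM handler that dumps partial analysis results
    as JSON to stdout before exiting.

    `get_partial_results` is a hypothetical zero-argument callable
    returning a JSON-serializable snapshot of the analysis so far.
    """
    def handler(signum, frame):
        json.dump(get_partial_results(), sys.stdout)
        sys.stdout.flush()
        sys.exit(143)  # 128 + SIGTERM(15), the conventional exit code

    signal.signal(signal.SIGTERM, handler)
```

One caveat: the handler runs at an arbitrary point in the analysis, so the snapshot callable should only read from data structures that are safe to observe mid-update.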