PDF Analysis

Many PDF analyses have already been published. I have gathered a lot of them here, along with additional tips & tools.

PDF Format 📄

https://www.adobe.com/devnet/pdf/pdf_reference.html
https://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/
https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF

Tools list 🔧

Tool URL
AnalyzePDF.py https://github.com/hiddenillusion/AnalyzePDF
ByteForce https://github.com/weaknetlabs/ByteForce
Caradoc https://github.com/ANSSI-FR/caradoc
Didier Stevens suite https://github.com/DidierStevens/DidierStevensSuite
dumppdf https://packages.debian.org/jessie/python-pdfminer
forensics-all https://packages.debian.org/jessie-backports/forensics-all
Origami https://code.google.com/archive/p/origami-pdf/
ParanoiDF https://github.com/patrickdw123/ParanoiDF
peepdf https://github.com/jesparza/peepdf
PDF Xray https://github.com/9b/pdfxray_public
pdf-parser http://didierstevens.com/files/software/pdf-parser_V0_6_4.zip
pdf2john.py https://github.com/magnumripper/JohnTheRipper/blob/unstable-jumbo/run/pdf2john.py
pdfcrack https://packages.debian.org/jessie/pdfcrack
pdfextract https://github.com/CrossRef/pdfextract
pdfobjflow.py https://bitbucket.org/sebastiendamaye/pdfobjflow
pdfresurrect https://packages.debian.org/jessie/pdfresurrect
PdfStreamDumper.exe http://sandsprite.com/CodeStuff/PDFStreamDumper_Setup.exe
pdftk https://packages.debian.org/en/jessie/pdftk
pdfxray_lite.py https://github.com/9b/pdfxray_lite
poppler-utils https://packages.debian.org/en/jessie/poppler-utils (pdftotext, pdfimages, pdftohtml, pdftops, pdfinfo, pdffonts, pdfdetach, pdfseparate, pdfsig, pdftocairo, pdftoppm, pdfunite)
pyew https://packages.debian.org/en/jessie/pyew
qpdf https://packages.debian.org/jessie/qpdf
swf_mastah.py https://github.com/9b/pdfxray_public/blob/master/builder/swf_mastah.py

Existing list

http://blog.didierstevens.com/programs/pdf-tools/
https://github.com/sans-dfir/sift-files/tree/master/pdf-tools

Quick Analysis 🚀

Basic information

$ file file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf

Displaying objects and actions structure

$ python pdfid.py -aefv file.pdf

Search for /OpenAction, /AA, /Launch, /GoTo, /GoToR, /SubmitForm, /RichMedia (for Flash), /JS, /JavaScript and /URI, as well as signs of encoding, encryption, shellcode and obfuscation.

Automatically with ParanoiDF

$ python paranoiDF.py -fl file.pdf

Or with pdf-parser

$ python pdf-parser.py -v file.pdf

With a hex editor

$ bless file.pdf

Extract files / scripts / Objects

Use pdf-parser to extract an object, for example one containing JavaScript

$ pdf-parser --object 32 --raw file.pdf > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Online analysis

Be careful not to leak any important, professional or personal data, and not to expose your research.
https://www.hybrid-analysis.com/

Complete Analysis 🔎

Basic information

$ file file.pdf
$ pdfinfo file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf

Powerful Python tool to analyze PDFs and exploits

$ pyew file.pdf

Another Python tool to explore PDFs

$ peepdf -fl file.pdf
$ peepdf --interactive file.pdf

Analysis under Windows

PDF Stream Dumper
https://github.com/dzzie/pdfstreamdumper

Metadata

Get metadata

$ exiftool -a -u -g2 file.pdf

Get metadata recursively from the current directory

$ exiftool -r -ext pdf .

Change an element

$ exiftool -Title="New title" file.pdf

Remove metadata

$ exiftool -all= file.pdf && exiftool -all:all= file.pdf && qpdf --linearize file.pdf filewithoutmeta.pdf
$ mat file.pdf # latest version of mat doesn't support pdf format anymore...

Remove metadata recursively from the current directory. Quick and dirty, but it works well. The variables are quoted so that filenames containing spaces are handled; the command could still be optimized.

$ find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do echo "$file" && mv "$file" "$file.pdf" && exiftool -all= "$file.pdf" && exiftool -all:all= "$file.pdf" && qpdf --linearize "$file.pdf" "$file" && rm "$file.pdf" && rm "$file.pdf_original"; done

Search for older versions

Search for older "hidden" versions

$ pdfresurrect file.pdf -i
$ exiftool -pdf-update:all= file.pdf
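
Older versions usually survive because a PDF can be saved incrementally: each update is appended to the file and normally ends with its own %%EOF marker. The sketch below is only a rough heuristic under that assumption, not what pdfresurrect does internally. The function names are made up for illustration, and %%EOF can also occur inside stream data, so treat the count as a hint.

```python
def count_revisions(data: bytes) -> int:
    """Count %%EOF markers; each incremental update normally appends one."""
    return data.count(b"%%EOF")


def revision_at(data: bytes, index: int) -> bytes:
    """Return the file as it ended at the index-th %%EOF marker (1-based)."""
    if index < 1:
        raise ValueError("index starts at 1")
    marker = b"%%EOF"
    end = -1
    for _ in range(index):
        end = data.find(marker, end + 1)
        if end == -1:
            raise ValueError("fewer revisions than requested")
    # Keep everything up to and including the marker.
    return data[: end + len(marker)]
```

For example, `revision_at(open("file.pdf", "rb").read(), 1)` would give the bytes of the first saved version, which can be written to a new file and opened separately.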

Online Analysis

Name URL
Malwr https://malwr.com/submission/
Hybrid analysis https://www.hybrid-analysis.com/
Malware Tracker https://www.malwaretracker.com/pdf.php
VirusTotal http://www.virustotal.com/
PDF examiner http://www.pdfexaminer.com/
Document Analyzer http://www.document-analyzer.net/
Jotti https://virusscan.jotti.org/
PDF X-ray http://www.pdfxray.com/
PDF Online https://www.pdf-online.com/
Extract PDF http://www.extractpdf.com
Char conversion https://kt.pe/tools.html#conv/

Statistics

Calculate byte statistics (minimum and maximum entropy, ASCII count, ...) for a PDF

$ python byte-stats.py file.pdf
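
To make these numbers less abstract, here is a minimal sketch of how such statistics can be computed. It is not the implementation of byte-stats.py: the function names, the fixed 256-byte window and the returned keys are assumptions chosen for the example. High entropy (close to 8 bits per byte) usually points to compressed or encrypted data.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def byte_stats(data: bytes, window: int = 256) -> dict:
    """Overall entropy, min/max entropy per window, printable ASCII count."""
    entropies = [
        shannon_entropy(data[i : i + window]) for i in range(0, len(data), window)
    ]
    printable = sum(1 for b in data if 32 <= b < 127)
    return {
        "size": len(data),
        "entropy": shannon_entropy(data),
        "min_window_entropy": min(entropies, default=0.0),
        "max_window_entropy": max(entropies, default=0.0),
        "printable_ascii": printable,
    }
```

Usage could be as simple as `byte_stats(open("file.pdf", "rb").read())`.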

Visual analysis

Visual analysis of a PDF or a binary file
http://binvis.io

Go deeper in the analysis

Displaying objects and actions structure

$ python pdfid.py --all --extra --force --verbose file.pdf

Map of the object flows

$ pdf-parser file.pdf | ./pdfobjflow.py
$ eog pdfobjflow.png

Actions

Search for:
/OpenAction and /AA specify the script or action to run automatically.
/Names, /AcroForm and /Action can also specify and launch scripts or actions.
/JavaScript specifies JavaScript to run.
/GoTo changes the view to a specified destination within the PDF or in another PDF file.
/Launch launches a program or opens a document.
/URI accesses a resource by its URL.
/SubmitForm and /GoToR can send data to a URL.
/RichMedia can be used to embed Flash in a PDF.
/ObjStm can hide objects inside an Object Stream.
Beware of obfuscation with hex codes in names: /JavaScript can be written as /J#61vaScript.
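
The hex-code obfuscation works because a PDF name may write any character as # followed by two hex digits, so a plain text search for /JavaScript can miss it. The sketch below decodes such escapes before counting the names listed above. It is an illustration only: the function names and the keyword list are assumptions, it scans raw bytes (names inside compressed streams or /ObjStm are not seen), and text inside strings may produce false hits.

```python
import re

SUSPICIOUS_NAMES = [
    b"/OpenAction", b"/AA", b"/JavaScript", b"/JS", b"/Launch",
    b"/URI", b"/SubmitForm", b"/GoToR", b"/RichMedia", b"/ObjStm",
]

# A name starts with "/" and runs until whitespace or a PDF delimiter.
_NAME = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(name: bytes) -> bytes:
    """Replace every #xx escape by the byte it stands for."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def count_names(data: bytes) -> dict:
    """Count suspicious names after undoing hex-code obfuscation."""
    counts = {name: 0 for name in SUSPICIOUS_NAMES}
    for match in _NAME.finditer(data):
        decoded = decode_name(match.group(0))
        if decoded in counts:
            counts[decoded] += 1
    return counts
```

With this, `decode_name(b"/J#61vaScript")` gives `b"/JavaScript"`, so the obfuscated spelling is counted like the plain one.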

With ParanoiDF

$ python paranoiDF.py -fl file.pdf

With pdf-parser

$ python pdf-parser.py -v file.pdf

With a hex editor

$ bless file.pdf

With dumppdf

$ dumppdf -a file.pdf

Compression

Search for compression

$ strings file.pdf | grep --color "/Filter"

2 ways to decompress a PDF

$ pdftk compressed.pdf output uncompressed.pdf uncompress
$ qpdf --stream-data=uncompress compressed.pdf uncompressed.pdf 
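
When neither tool is at hand, streams that use the common /FlateDecode filter can be inflated with zlib. This is a rough sketch under several assumptions: it locates streams with a regular expression instead of parsing objects and /Length, it ignores predictors and chained filters, and the function name is invented for the example. pdftk and qpdf remain the more reliable route.

```python
import re
import zlib

# Everything between the "stream" keyword and the next "endstream".
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the decompressed content of every zlib-compressed stream."""
    results = []
    for match in _STREAM.finditer(data):
        inflater = zlib.decompressobj()
        try:
            # Trailing end-of-line bytes end up in inflater.unused_data.
            content = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or another filter is applied first
        if inflater.eof:  # keep only streams that decompressed completely
            results.append(content)
    return results
```

Calling `inflate_streams(open("file.pdf", "rb").read())` returns a list of byte strings that can then be searched for scripts or embedded files.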

Embedded files

4 ways to search for embedded files/scripts inside a PDF

$ binwalk file.pdf
$ foremost -a -v file.pdf
$ hachoir-subfile file.pdf
$ scalpel file.pdf
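
These tools rely largely on file signatures (magic bytes). The idea can be sketched in a few lines, here with a small hand-picked signature table that is an assumption for the example and far shorter than what binwalk or foremost know. It only sees data stored uncompressed, so compressed streams have to be decompressed first.

```python
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
    "gif": b"GIF89a",
}


def find_signatures(data: bytes) -> list:
    """Return sorted (offset, label) pairs for every signature occurrence."""
    hits = []
    for label, magic in SIGNATURES.items():
        start = data.find(magic)
        while start != -1:
            hits.append((start, label))
            start = data.find(magic, start + 1)
    return sorted(hits)
```

Each reported offset is a candidate starting point for carving the embedded file out of the PDF; short signatures can match by coincidence.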

Extract files / scripts / objects

Extract file corresponding to object ID, jpg for example

$ dumppdf.py -i 32 -r file.pdf > image.jpg

Extract JavaScript from an object, for example

$ pdf-parser --object 32 --raw file.pdf > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Conversion

PDF to Postscript

$ pdftops file.pdf

PDF to TXT

$ pdftotext file.pdf

PDF to JPG

$ convert file.pdf image.jpg

This is a non-exhaustive list of possible conversions.

LZWDecode filter

Convert a PDF to Postscript without the LZWDecode filter

$ qpdf --stream-data=uncompress original.pdf decoded.pdf # Decompress it
$ pdftops decoded.pdf decoded.ps # Convert it

Encryption

PDF supports RC4 encryption (40- to 128-bit keys) and AES (128-bit keys, or 256-bit keys with Extension Level 3).
Beware of empty passwords.

Password recovery

Dictionary attack on a PDF with pdfcrack

$ pdfcrack -w yourDictionary.txt file.pdf

With john

$ pdf2john.py file.pdf > x.hash
$ john --wordlist=yourDictionary.txt x.hash

JavaScript

2 ways to search for JavaScript

$ pdf-parser --search=JavaScript file.pdf 
$ pdfinfo -js file.pdf

Extract JavaScript with jsunpack

$ jsunpack-extractjs file.pdf

With pdf-parser

$ pdf-parser --object 32 --raw file.pdf > file.js

With pdfextract from Origami

$ pdfextract --js file.pdf

De-obfuscate

https://github.com/urule99/jsunpack-n

Online :
http://jsunpack.jeek.org/java/

Malzilla and SpiderMonkey can also help deobfuscate JavaScript.
Malzilla :
http://www.malzilla.org/downloads.html
SpiderMonkey :
http://www.didierstevens.com/files/software/js-1.7.0-mod.tar.gz
More details coming soon.

Add JavaScript to a PDF

https://didierstevens.com/files/software/make-pdf_V0_1_6.zip
https://neonprimetime.blogspot.fr/2015/03/how-to-add-javascript-to-pdf.html

Disarming a PDF

$ python pdfid.py --disarm file.pdf
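
One way to neutralise automatic actions is to change the case of names such as /JavaScript or /OpenAction: the reader no longer recognises them, and because the length is unchanged every byte offset in the file stays valid. The sketch below illustrates that idea only. It is not pdfid's code, the name list and function name are assumptions, and it misses hex-obfuscated names and anything inside compressed streams.

```python
import re

ACTIVE_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")

# Longer names come first so /JavaScript is tried before /JS; the lookahead
# avoids touching longer, unrelated names such as /JSON.
_ACTIVE = re.compile(
    b"(" + b"|".join(re.escape(name) for name in ACTIVE_NAMES) + rb")(?![A-Za-z0-9])"
)


def disarm(data: bytes) -> bytes:
    """Swap the case of active names, keeping the file length unchanged."""
    return _ACTIVE.sub(lambda m: m.group(1).swapcase(), data)
```

For instance, /JavaScript becomes /jAVAsCRIPT. Write the result to a new file and keep the original sample untouched.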

Flash

Search for Flash

$ python pdf-parser.py --search flash file.pdf

Extract flash with swf_mastah

$ python swf_mastah.py -f file.pdf -o ./
$ file *.swf

With pdf-parser

$ pdf-parser.py --object 32 --filter --raw file.pdf > flashFile.swf
$ file flashFile.swf
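
Before analysing the extracted file, it can help to check that it really is an SWF and whether it is compressed. An SWF starts with a 3-byte signature (FWS for uncompressed, CWS for zlib, ZWS for LZMA), a version byte and a 4-byte little-endian length. The sketch below reads that header and inflates the CWS case; the function names and the returned keys are assumptions made for this example.

```python
import zlib

_COMPRESSION = {b"FWS": "none", b"CWS": "zlib", b"ZWS": "lzma"}


def inspect_swf(data: bytes) -> dict:
    """Read the 8-byte SWF header: signature, version, declared length."""
    signature = data[:3]
    if len(data) < 8 or signature not in _COMPRESSION:
        raise ValueError("not an SWF file")
    return {
        "compression": _COMPRESSION[signature],
        "version": data[3],
        "declared_length": int.from_bytes(data[4:8], "little"),
    }


def decompress_cws(data: bytes) -> bytes:
    """Turn a zlib-compressed CWS file into its uncompressed FWS form."""
    if data[:3] != b"CWS":
        raise ValueError("not a zlib-compressed SWF")
    # Only the part after the 8-byte header is compressed.
    return b"FWS" + data[3:8] + zlib.decompress(data[8:])
```

A decompressed file can then be searched with strings or passed to swfdump.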

Analysing a Flash program

$ swfdump -Ddu flashFile.swf > flashFile.txt

More details coming soon.

Sources ℹ️

https://blog.didierstevens.com/category/pdf/
http://www.decalage.info/file_formats_security/pdf
https://zeltser.com/analyzing-malicious-documents/
https://code.google.com/archive/p/corkami/wikis/PDFTricks.wiki
https://www.sans.org/reading-room/whitepapers/malicious/owned-malicious-pdf-analysis-33443
https://digital-forensics.sans.org/blog/2009/12/14/pdf-malware-analysis/
http://fileformats.archiveteam.org/wiki/PDF
