bulk_extractor's Introduction

bulk_extractor is a high-performance digital forensics exploitation tool. It is a "get evidence" button that rapidly scans any kind of input (disk images, files, directories of files, etc.) and extracts structured information such as email addresses, credit card numbers, JPEGs, and JSON snippets without parsing the file system or file system structures. The results are stored in text files that are easily inspected, searched, or used as inputs for other forensic processing. bulk_extractor also creates histograms of certain kinds of features it finds, such as Google search terms and email addresses, because previous research has shown that such histograms are especially useful in investigative and law enforcement applications.
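To make the histogram idea concrete, here is a minimal Python sketch (an illustration of the concept, not bulk_extractor's implementation; the helper name is made up). Feature files store one feature per line as offset, feature, and context separated by tabs:

from collections import Counter

# Sketch: build a histogram from a feature file whose lines are
# offset<TAB>feature<TAB>context. Comment lines begin with '#'.
def histogram(feature_file_path):
    counts = Counter()
    with open(feature_file_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                counts[fields[1]] += 1   # fields[1] is the feature itself
    return counts.most_common()          # most frequent features first

for feature, n in histogram("email.txt")[:10]:
    print(n, feature)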

Unlike other digital forensics tools, bulk_extractor probes every byte of data to see if it is the start of a sequence that can be decompressed or otherwise decoded. If so, the decoded data are recursively re-examined. As a result, bulk_extractor can find things like BASE64-encoded JPEGs and compressed JSON objects that traditional carving tools miss.
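The probe-every-byte idea can be modeled in a few lines of Python. This is a toy sketch assuming only zlib-compressed regions; the real scanners are written in C++, handle many more encodings (BASE64, GZIP, hibernation files, and so on), and are far more selective about where they attempt a decode:

import zlib

# Toy model of optimistic decoding: attempt a decompression at every offset
# and recursively re-examine whatever decodes cleanly.
def optimistic_decode(buf, depth=0, max_depth=3):
    found = []
    if depth >= max_depth:
        return found
    for offset in range(len(buf)):
        try:
            decoded = zlib.decompressobj().decompress(buf[offset:offset + 4096])
        except zlib.error:
            continue                      # not a zlib stream at this offset
        if len(decoded) >= 64:            # ignore trivial decodes
            found.append((offset, "ZLIB", decoded))
            found += optimistic_decode(decoded, depth + 1)
    return found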

This is the bulk_extractor 2.1 development branch! It is reliable, but if you want a well-tested, production-quality release, download one from https://github.com/simsong/bulk_extractor/releases.

Building bulk_extractor

We recommend building from source. We provide a number of bash scripts in the etc/ directory that will configure a clean virtual machine:

git clone --recurse-submodules https://github.com/simsong/bulk_extractor.git
./bootstrap.sh
./configure
make
make install

For detailed instructions on installing packages and building bulk_extractor, read the wiki page here: https://github.com/simsong/bulk_extractor/wiki/Installing-bulk_extractor

For more information on bulk_extractor, visit: https://forensics.wiki/bulk_extractor

Tested Configurations

This release of bulk_extractor requires C++17 and has been tested to compile on the following platforms:

  • Amazon Linux as of 2023-05-25
  • Fedora 36 (most recently)
  • Ubuntu 20.04 LTS
  • macOS 13.2.1

You should always start with a fresh VM and prepare the system using the appropriate prep script in the etc/ directory.

Tested Configurations on Which bulk_extractor Does Not Work

  • Debian 10 (not supported for native builds)

RECOMMENDED CITATION

If you are writing a scientific paper and using bulk_extractor, please cite it with:

Garfinkel, Simson, Digital media triage with bulk data analysis and bulk_extractor. Computers and Security 32: 56-72 (2013)

@article{10.5555/2748150.2748581,
author = {Garfinkel, Simson L.},
title = {Digital Media Triage with Bulk Data Analysis and Bulk_extractor},
year = {2013},
issue_date = {February 2013},
publisher = {Elsevier Advanced Technology Publications},
address = {GBR},
volume = {32},
number = {C},
issn = {0167-4048},
journal = {Comput. Secur.},
month = feb,
pages = {56–72},
numpages = {17},
keywords = {Digital forensics, Bulk data analysis, bulk_extractor, Stream-based forensics, Windows hibernation files, Parallelized forensic analysis, Optimistic decompression, Forensic path, Margin, EnCase}
}

ENVIRONMENT VARIABLES

The following environment variables can be set to change the operation of bulk_extractor:

Variable                         Behavior
DEBUG_BENCHMARK_CPU              Include CPU benchmark information in the report.xml file.
DEBUG_NO_SCANNER_BYPASS          Disable the bypass logic that skips some scanners if an sbuf contains ngrams or does not have a high distinct-character count.
DEBUG_HISTOGRAMS                 Print debugging information for file-based histograms.
DEBUG_HISTOGRAMS_NO_INCREMENTAL  Do not use incremental, memory-based histograms.
DEBUG_PRINT_STEPS                Print to stdout each time a scanner is called on an sbuf.
DEBUG_DUMP_DATA                  Hex-dump each sbuf that is to be scanned.
DEBUG_SCANNERS_IGNORE            A comma-separated list of scanners to ignore (not load). Useful for debugging unit tests.

Other hints for debugging:

  • Run with -x all to disable all scanners.
  • Run with a random sampling fraction of 0.001% to quickly exercise reading the image size and a few seeks.

BUILDING ON WINDOWS

Note: bulk_extractor 2.1 does not currently build on Windows, but version 2.0 does.

If you wish to build for Windows, you should cross-compile from a Fedora system. Start with a clean VM and use these commands:

$ git clone --recurse-submodules https://github.com/simsong/bulk_extractor.git
$ cd bulk_extractor/etc
$ bash CONFIGURE_FEDORA36_win64.bash
$ cd ..
$ make win64

BULK_EXTRACTOR 2.0 STATUS REPORT

bulk_extractor 2.0 (BE2) is now operational. Although it works with the Java-based viewer, we do not currently have an installer that runs under Windows.

BE2 requires C++17 to compile. It requires https://github.com/simsong/be13_api.git as a sub-module, which in turn requires dfxml as a sub-module.

The project took longer than anticipated. In addition to updating the code to C++17, the release was used as an opportunity for massive refactoring and a general increase in code quality, testability, and reliability. An article about the effort will appear in a forthcoming issue of ACM Queue.

bulk_extractor's People

Contributors

4n6ist, ant1, blueteam0ps, brucemty, cho-m, d3vil0p3r, datafrogman, dfjxs, dloveall, edsu, esebese, fake4d, flakfizer, garfi303, grayed, hephastie, jgru, joachimmetz, kefir-, linxon, mattdri-ir, moshekaplan, randomaccess3, simsong, uckelman, uckelman-sf, zaratec

bulk_extractor's Issues

Scanning in recursive mode drops features and files

When running bulk_extractor in recursive directory scan mode (-R), it drops features and files:

  • If a feature is encountered but a feature has already been recorded at that forensic path from another file, then the new feature is dropped.
  • If a filename is not simple ASCII, bulk_extractor will skip the file and not scan it.

This behavior limits the completeness of scans that use recursive mode.

Whitelist stats go to stdout but not to report.xml

Whitelist stats are reported to stdout but not to report.xml.

Specifically:

When bulk_extractor initializes in main.cpp, it reads any alert list(s) and stop list(s) using the function word_and_context_list::readfile in word_and_context_list.cpp. Unfortunately, bulk_extractor does this before opening report.xml as the pointer variable dfxml_writer *xreport, so it is not yet able to write to report.xml.

To fix this:

  • Move the instantiation of xreport up near the top,
    being careful not to disrupt behavior in the event of an error or if bulk_extractor is being restarted.
  • Pass the xreport pointer as a new parameter to word_and_context_list::readfile() so that readfile can write the stats directly into report.xml.
  • I recommend passing xreport in the same way to any function
    that prints to stdout whose output should also go into report.xml.

python bulkextractor_reader should have an iterator for reports

Currently the iterator only works with report directories and zip files of report directories. It should be modified to handle top-level directories or zip files containing multiple reports, returning an iterator over all of the reports and, for each report, an iterator over all of its enclosed feature files.
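A hedged sketch of what the modified reader might look like (the function names and the report.xml heuristic are assumptions for illustration, not the existing bulkextractor_reader API):

import os

# Sketch: recognize a report directory by the presence of report.xml and
# yield every report found under a top-level directory.
def iter_reports(top):
    if os.path.exists(os.path.join(top, "report.xml")):
        yield top                         # 'top' is itself a single report
        return
    for name in sorted(os.listdir(top)):
        path = os.path.join(top, name)
        if os.path.isdir(path) and os.path.exists(os.path.join(path, "report.xml")):
            yield path

def iter_feature_files(report_dir):
    # Feature files are the .txt files inside a report directory.
    for name in sorted(os.listdir(report_dir)):
        if name.endswith(".txt"):
            yield os.path.join(report_dir, name)

Handling zip files of multiple reports would follow the same pattern, using zipfile.ZipFile name lists instead of os.listdir.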

JVM required by BEViewer

A discussion on the bulk_extractor-users group concluded that it is best to compile BEViewer with the latest compiler. The next BEViewer will be compiled on OpenJDK and will require a Java 7 JRE.

YY_FATAL_ERROR macro called in scan_email.cpp or scan_accts.cpp.

When running bulk_extractor I am getting the following error message:

input buffer overflow, can't enlarge buffer because scanner uses REJECT
Segmentation fault

Does anyone know what is causing this error message? I am running this under Linux Mint on an 8-core machine with 12 GB of RAM.

TZ Typo?

Looking at line 48 of src/scan_email_lg.cpp, it looks like the ABBREV constant has a value of 'UT' instead of 'UTC'.

Was this a typo or a deliberate choice?

Whitelist system may not work properly with exif XML output

From the mailing list:

I ran bulk_extractor against an image and then re-ran it against the same image, giving it -w exif.txt from the first run. This should have resulted in all exif features being stopped, but I get a non-empty exif file on the second run:

This is the entire exif feature file from the second run.

# UTF-8 Byte Order Marker; see http://unicode.org/faq/utf_bom.html
# BULK_EXTRACTOR-Version: 1.3.1 ($Rev: 10844 $)
# Feature-Recorder: exif
# Filename: win7.vmdk
# Feature-File-Version: 1.1
292220928   288a8ed63c00c1b39343dbe82a090cd0    <exif><ifd0.tiff.Software>Adobe ImageReady</ifd0.tiff.Software></exif>
4899467264  6d5f317239f1b039bc534660ac2abae4    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4900933632  5aea5473d3bd76a86cf4dbe46385545f    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4890324992  4146c4da38363f4e2862d10c1f84f80d    <exif><ifd0.tiff.Copyright>Will Austin</ifd0.tiff.Copyright></exif>
4895125504  92fc7a14c551dae96c1960074865aa59    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4896358400  72ee2842f3d7872a92964734322cac2b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4897701888  9a48c674f92171fc20eb1f8a5b8c2e9b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4918804480  72ee2842f3d7872a92964734322cac2b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4920516608  5b57a8c6cd9393c567f89f0f4cc89522    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4922580992  4faf65eb81de15c1a371f53e5a3a38e0    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4924956672  698fcb66721525f86140188781bdb33e    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>

(There is a tab after the offset on the first line; I don't know why the mail client doesn't show it.)

All of these features are in the exif.txt feature file from the first
run that I used as a stoplist.

Ubuntu 10.04.4 LTS
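A quick way to confirm the behavior (a throwaway Python check, not part of bulk_extractor; the run1/run2 paths are illustrative) is to load the first run's exif.txt as a stop set and test whether any second-run feature should have been suppressed:

# Sketch: every feature in run2/exif.txt should appear in run1/exif.txt if
# the -w stoplist worked. fields[1] is the feature column (for exif.txt,
# the MD5 hash of the exif block).
def load_features(path):
    feats = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                feats.add(fields[1])
    return feats

leaked = load_features("run2/exif.txt") - load_features("run1/exif.txt")
print(len(leaked), "features escaped the stoplist")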

Prefer octal or hex escape codes

A discussion in the bulk_extractor-users group of octal vs. hex escape codes resolved that hex is preferred. Functionally it doesn't matter, but hex is easier to read.
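For example, a feature recorder escaping non-printable bytes would emit \x1B rather than \033. A one-function Python sketch of the preferred style (illustrative only):

# Escape non-printable bytes as hex (\xNN) rather than octal (\NNN).
def escape_hex(data):
    return "".join(chr(b) if 32 <= b < 127 and b != ord("\\") else "\\x%02X" % b
                   for b in data)

print(escape_hex(b"GIF89a\x00\x1b"))   # -> GIF89a\x00\x1B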

pcap stoplist

bulk_extractor scan_pcap should support a stoplist of packet artifacts.

URL parse error when surrounded by '&quot;'

Results incorrectly include a trailing '&quot' when parsing URLs.

url.txt output:

199452984   http://www.icra.org/ratingsv02.html&quot;   (pics-1.1 &quot;http://www.icra.org/ratingsv02.html&quot; l gen true for 
199453047   http://www.msn.com&quot  true for &quot;http://www.msn.com&quot; r (cz 1 lz 1 n
199453120   http://msn.com&quot  true for &quot;http://msn.com&quot; r (cz 1 lz 1 n
199453189   http://stb.msn.com&quot  true for &quot;http://stb.msn.com&quot; r (cz 1 lz 1 n
199453396   http://www.rsac.org/ratingsv01.html&quot;   z 1 vz 1) &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true for 
199453645   http://stc.msn.com&quot  true for &quot;http://stc.msn.com&quot; r (n 0 s 0 v 0
199453709   http://stj.msn.com&quot  true for &quot;http://stj.msn.com&quot; r (n 0 s 0 v 0

should be:

199452984   http://www.icra.org/ratingsv02.html (pics-1.1 &quot;http://www.icra.org/ratingsv02.html&quot; l gen true for 
199453047   http://www.msn.com   true for &quot;http://www.msn.com&quot; r (cz 1 lz 1 n
199453120   http://msn.com   true for &quot;http://msn.com&quot; r (cz 1 lz 1 n
199453189   http://stb.msn.com   true for &quot;http://stb.msn.com&quot; r (cz 1 lz 1 n
199453396   http://www.rsac.org/ratingsv01.html z 1 vz 1) &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true for 
199453645   http://stc.msn.com   true for &quot;http://stc.msn.com&quot; r (n 0 s 0 v 0
199453709   http://stj.msn.com   true for &quot;http://stj.msn.com&quot; r (n 0 s 0 v 0

Version information:

# BULK_EXTRACTOR-Version: 1.5.5 ($Rev: 10844 $)
# Feature-Recorder: url
# Feature-File-Version: 1.1

Please let me know if I can provide you with any better information.
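Until the scanner is fixed, affected url.txt output can be post-processed. A sketch (not part of bulk_extractor) that strips the trailing entity from the feature column:

import re

# Strip a trailing '&quot;' (or unterminated '&quot') entity from the
# feature column of a url.txt line; comment lines pass through unchanged.
TRAILING_QUOT = re.compile(r"&quot;?$")

def clean_url_line(line):
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        fields[1] = TRAILING_QUOT.sub("", fields[1])
    return "\t".join(fields) + "\n"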

process_aff appears to be ignoring pagesize

It appears that process_aff::get_sbuf() ignores the pagesize. I don't think that this can all be rewritten to use pread because process_dir needs to be able to return an sbuf for an iterator.

Integrated handling of magic numbers

Scanners should be able to register magic numbers that they can handle. Then other scanners like scan_xor could look for the magic numbers and only xor when they find them... Useful?
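A sketch of what such a registry could look like (all names here are hypothetical; nothing like this exists in the codebase):

# Hypothetical magic-number registry: scanners register the signatures they
# handle, and a transform scanner such as scan_xor probes a candidate buffer
# against the registry before recursing.
MAGIC = {}

def register_magic(signature, scanner_name):
    MAGIC[signature] = scanner_name

def find_magic(buf):
    for sig, scanner in MAGIC.items():
        if buf.startswith(sig):
            return scanner
    return None

register_magic(b"\xff\xd8\xff", "scan_jpeg")   # JPEG SOI marker
register_magic(b"PK\x03\x04", "scan_zip")      # ZIP local file header

# scan_xor would XOR with a candidate key and recurse only on a match:
decoded = bytes(b ^ 0x5A for b in b"\xa5\x82\xa5")   # XOR-obfuscated JPEG header
print(find_magic(decoded))                           # -> scan_jpeg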

Possible typo

There appears to be a typo in the script: should it be getpwuid? It is listed as getwpuid, and it was reported missing when ./configure was run for bulk_extractor.

yyFlexLexer warning needs to be resolved

The following warning should be fixed:

: void yyFlexLexer::LexerError( yyconst char msg[] )
:1662:6: warning: function might be candidate for attribute ‘noreturn’ [-Wsuggest-attribute=noreturn]

bulk_extractor wordlist should be rewritten to use la-strings.

The bulk_extractor wordlist scanner currently checks whether a byte satisfies isprint(ch) && ch != ' ' && ch < 128.

An improvement to this would be to support encodings such as UTF-8, UTF-16 and UTF-32, possibly as options specified by the user. The words should then be converted to a single encoding (UTF-8?) and then split/deduped, for possible conversion and use by the target application.
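A sketch of the proposed approach (illustrative Python; the real wordlist scanner is C++, and the length bounds here are arbitrary examples):

# Decode candidate runs under several encodings, normalize everything to
# Unicode text (UTF-8 on output), then split and deduplicate.
def extract_words(buf, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    words = set()
    for enc in encodings:
        try:
            text = buf.decode(enc, errors="ignore")
        except (UnicodeDecodeError, ValueError):
            continue
        for w in text.split():
            if 4 <= len(w) <= 64:     # example length bounds
                words.add(w)
    return words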

cda_tool.py should not use python3.2 but python3

The first line reads
#!/usr/bin/env python3.2

I suggest changing it to
#!/usr/bin/env python3
so that it works with current Python 3.x, too.
Or is there a reason it does not work with 3.3?

Custom LEX can't be set during configure

I am trying to ./configure with LEX=/usr/local/bin/flex (this is needed because /usr/bin/flex doesn't support -R but /usr/local/bin/flex does).

But this is not possible because of these three lines in configure.ac:

if test "$LEX" != flex; then
  AC_MSG_ERROR([flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....])
fi

So I get the following error:
configure: error: flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....

SHA-1 support

Currently BE uses MD5 as a universal hash. There should be a flag allowing other hash algorithms to be used and reported, and the hash in use should be recorded in the feature files. Perhaps also support SHA-3/128, which would be the first 128 bits of SHA-3?
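Python's hashlib shows the shape of the proposed flag: select the algorithm by name and record which one was used in the feature-file header (a sketch of the idea, not BE code; the flag name is hypothetical):

import hashlib

# Sketch: pluggable hashing selected by name, e.g. from a --hash flag.
def hasher_for(name):
    if name not in hashlib.algorithms_available:
        raise ValueError("unsupported hash: " + name)
    return lambda data: hashlib.new(name, data).hexdigest()

md5 = hasher_for("md5")
sha1 = hasher_for("sha1")
# The proposed SHA-3/128 could be the first 128 bits (32 hex digits) of SHA3-256:
sha3_128 = lambda data: hashlib.new("sha3_256", data).hexdigest()[:32]
print(md5(b"example"), sha1(b"example"), sha3_128(b"example"))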

be13_api/pcap_fake.cpp:2:21: error: tcpflow.h: No such file or directory

I downloaded both bulk_extractor and tcpflow via git. I built and installed tcpflow and am trying to build bulk_extractor but run into the above error. I could hardwire a fix, but I'd like to get this solved properly.

If I copy tcpflow/src/tcpflow.h to /usr/local/include, the compile fails with this error:

In file included from be13_api/pcap_fake.cpp:2:
/usr/local/include/tcpflow.h:206: error: conflicting declaration ‘typedef size_t socklen_t’

-David

User Plugins

As of version 1.4.4, a user-defined plugin can be loaded only by giving a plugin directory via the command-line option -P. I would appreciate an environment variable à la PATH (something like BE_PATH), in order to keep the command line short.

Further, BEViewer 1.4.4 can't show the content of a path containing a component belonging to a user-defined recursive plugin, because the -P option is not given in the underlying call to bulk_extractor. An environment variable would solve this issue, too.
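The proposed variable could work exactly like PATH. A sketch of the lookup (BE_PATH is the name suggested in this issue, not an existing variable; shown in Python for brevity, though the change itself would be in C++):

import os

# Sketch: build the plugin search path from a PATH-like variable, then fall
# back to the usual system directories and the current directory.
def plugin_dirs(defaults=("/usr/local/lib/bulk_extractor",
                          "/usr/lib/bulk_extractor", ".")):
    dirs = [d for d in os.environ.get("BE_PATH", "").split(":") if d]
    dirs += [d for d in defaults if os.path.isdir(d)]
    return dirs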

The following patch to the bulk_extractor 1.4.4 source works for me as a temporary solution. It would be nice if this were fixed in the next release:

diff -r bulk_extractor-1.4.4/src/main.cpp source/bulk_extractor-patched/src/main.cpp
809a810,820
>     // >>> Patch
>     // add to plugin_path: /usr/local/lib/bulk_extractor:/usr/lib/bulk_extractor:.
>     {
>       const char* p;
>       struct stat s;
>       p="/usr/local/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
>       p="/usr/lib/bulk_extractor";       if(stat(p, &s)==0) scanner_dirs.push_back(p);
>       p=".";                                                scanner_dirs.push_back(p);
>     }
>     // <<< Patch
> 
diff -r bulk_extractor-1.4.4/src/be13_api/plugin.cpp source/bulk_extractor-patched/src/be13_api/plugin.cpp
218c218,219
<     std::cout << "Loading: " << fn << " (" << func_name << ")\n";

---
>     // >>> Patch: The following output would confuse BEViewer.
>     // std::cout << "Loading: " << fn << " (" << func_name << ")\n";

allow feature files to include ?arg=val in forensic path.

The idea is to tack these fields onto the forensic path as URL query-string parameters, e.g., ?re=foo&enc=UTF-8. We'd obviously need to work out the details about escaping, etc., but there are a few things to like about this. First, URLs are cool, and one can easily imagine some future web service for exposing bulk_extractor output; that's not a bad way to integrate disparate enterprise systems. Second, the scheme is idempotent, so if you ran a slightly different set of patterns at a later time, the patterns that remained the same would generate the same forensic paths. Third, the query parameters act as annotations to the location of the data.

The main cons are that it reads kind of ugly and will be a bit harder to deal with in quick-and-dirty scripts.
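The proposal is easy to prototype with Python's urllib.parse (a sketch of the idea, not implemented in bulk_extractor; the example path is made up):

from urllib.parse import urlencode, parse_qs

# Sketch: annotate a forensic path with query-string parameters and parse
# them back out. Escaping details would still need to be settled.
def annotate(path, **params):
    return path + "?" + urlencode(params)

def split_annotations(annotated):
    path, _, query = annotated.partition("?")
    return path, parse_qs(query)

fp = annotate("199453047-GZIP-1024", re="foo", enc="UTF-8")
print(fp)                     # 199453047-GZIP-1024?re=foo&enc=UTF-8
print(split_annotations(fp))  # ('199453047-GZIP-1024', {'re': ['foo'], 'enc': ['UTF-8']})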

exiv2 doesn't compile

I started fixing:

--- ./configure.ac.orig 2013-07-12 01:19:20.000000000 +0000
+++ ./configure.ac      2013-07-13 07:43:24.000000000 +0000
@@ -518,8 +518,8 @@
   fi
 fi
 if test x"$exiv2" == x"yes" ; then
-  AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
   AC_LANG_PUSH(C++)
+  AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
     AC_TRY_COMPILE([#include <exiv2/image.hpp>
                    #include <exiv2/exif.hpp>
                     #include <exiv2/error.hpp>],
--- ./src/scan_exiv2.cpp.orig   2013-05-29 01:03:05.000000000 +0000
+++ ./src/scan_exiv2.cpp        2013-07-13 07:45:01.000000000 +0000
@@ -7,6 +7,7 @@

 #include "config.h"
 #include "bulk_extractor_i.h"
+#include "be13_api/utils.h"

 #include <stdlib.h>
 #include <string.h>
@@ -101,7 +102,7 @@
 void scan_exiv2(const class scanner_params &sp,const recursion_control_block &rcb)
 {
     assert(sp.sp_version==scanner_params::CURRENT_SP_VERSION);
-    if(sp.phase==scanner_params::startup){
+    if(sp.phase==scanner_params::PHASE_STARTUP){
         assert(sp.info->si_version==scanner_info::CURRENT_SI_VERSION);
        sp.info->name  = "exiv2";
         sp.info->author         = "Simson L. Garfinkel";
@@ -112,8 +113,8 @@
        sp.info->flags = scanner_info::SCANNER_DISABLED; // disabled because we have be_exif
        return;
     }
-    if(sp.phase==scanner_params::shutdown) return;
-    if(sp.phase==scanner_params::scan){
+    if(sp.phase==scanner_params::PHASE_SHUTDOWN) return;
+    if(sp.phase==scanner_params::PHASE_SCAN){

        const sbuf_t &sbuf = sp.sbuf;
        feature_recorder *exif_recorder = sp.fs.get_name("exif");

But now I have other issues:

scan_exiv2.cpp: In function 'void scan_exiv2(const scanner_params&, const recursion_control_block&)':
scan_exiv2.cpp:155: error: 'be_hash' was not declared in this scope
scan_exiv2.cpp:186: error: 'xml' is not a class or namespace

lightgrep

I'm getting "error while loading shared libraries: liblightgrep.so.0: cannot open shared object file: No such file or directory" when trying to run bulk_extractor. I have lightgrep installed, and was hoping to run bulk_extractor with it. This is from a pull made today.

Thanks

User plugins (continued)

Thanks for integrating my suggestions (issue #53).

However, the second part of the patch, concerning line 218 of file bulk_extractor-1.4.4/src/be13_api/plugin.cpp, has not been included yet.

The output from this line should go to a log file (if anywhere) rather than to cout. Otherwise BEViewer cannot show any image data, because the output of the underlying call "bulk_extractor -p -http ..." is polluted by this logging information and is therefore no longer clean HTTP.

fname use after free in process_ewf::open

Hi,

In process_ewf::open, fname is freed immediately before being used.
The patch below seems to fix the problem:

--- ./src/image_process.h.orig  2014-01-15 15:00:06.000000000 +0000
+++ ./src/image_process.h       2014-06-09 14:15:54.000000000 +0000
@@ -128,7 +128,7 @@
     virtual int open()=0;                                  /* open; return 0 if successful */
     virtual int pread(uint8_t *,size_t bytes,int64_t offset) const =0;     /* read */
     virtual int64_t image_size() const=0;
-    virtual std::string image_fname() const { return image_fname_;}
+    virtual const std::string &image_fname() const { return image_fname_;}

     /* iterator support; these virtual functions are called by iterator through (*myimage) */
     virtual image_process::iterator begin() const =0;

bulk_extractor scan_flexdemo error

When trying to run bulk_extractor (1.5.5) from the plugins directory, it throws an error:
"bulk_extractor: symbol lookup error: ./scan_flexdemo.so: undefined symbol: _ZN7beregexC1Esi"
