
bulk_extractor's Issues

bulk_extractor wordlist should be rewritten to use la-strings.

bulk_extractor wordlist currently checks whether each byte satisfies isprint(ch) && ch != ' ' && ch < 128.

An improvement would be to support encodings such as UTF-8, UTF-16 and UTF-32, possibly as options specified by the user. Extracted words would then be converted to a single encoding (UTF-8?) and split/deduplicated, for possible conversion and use by the target application.
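The current byte-level predicate, and a hedged sketch of what a UTF-8-aware extension might look like (the helper names are illustrative, not bulk_extractor's actual API):

```cpp
#include <cctype>

// Current behavior: a byte continues a word only if it is printable
// 7-bit ASCII and not a space.
bool ascii_wordchar(unsigned char ch) {
    return ch < 128 && std::isprint(ch) && ch != ' ';
}

// Sketch of a UTF-8-aware extension: also accept bytes that can be
// part of a multi-byte UTF-8 sequence (lead bytes 0xC2-0xF4,
// continuation bytes 0x80-0xBF). A real implementation would validate
// whole sequences, not individual bytes.
bool utf8_wordchar(unsigned char ch) {
    if (ch < 128) return ascii_wordchar(ch);
    return (ch >= 0xC2 && ch <= 0xF4) || (ch >= 0x80 && ch <= 0xBF);
}
```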

yyFlexLexer warning needs to be resolved

The following warning should be fixed:

void yyFlexLexer::LexerError( yyconst char msg[] )
:1662:6: warning: function might be candidate for attribute ‘noreturn’ [-Wsuggest-attribute=noreturn]
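A conventional fix, sketched under the assumption that the handler aborts or throws, is to annotate it as noreturn so GCC stops suggesting the attribute. This is a standalone sketch, not the generated yyFlexLexer code:

```cpp
#include <stdexcept>

// Marking the error handler noreturn (C++11 attribute syntax;
// __attribute__((noreturn)) is the GCC spelling for older code)
// satisfies -Wsuggest-attribute=noreturn.
[[noreturn]] void lexer_error(const char* msg) {
    throw std::runtime_error(msg);
}
```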

allow feature files to include ?arg=val in forensic path.

The idea is to tack on these fields to the forensic path as URL
query string parameters, e.g., ?re=foo&enc=UTF-8. We'd obviously need
to work out the details about escaping, etc., but there are a few
things to like about this. First, URLs are cool and one can easily
imagine some future web service for exposing bulk_extractor output,
and that's not a bad way to integrate disparate enterprise systems.
Second, the scheme is idempotent, so if you ran a slightly different
set of patterns at a later time, the patterns that remained the same
would generate the same forensic paths. Third, the query parameters
act as annotations to the location of the data.

The main cons are that it reads kind of ugly, and will be a bit harder
to deal with in quick-and-dirty scripts.
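A minimal sketch of how such annotated forensic paths might be built, with naive percent-escaping (both helper names are hypothetical; the escaping details would need the workout mentioned above):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Percent-escape characters that are unsafe inside a query component.
std::string escape_component(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.') {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);
            out += buf;
        }
    }
    return out;
}

// Append a key=value annotation to a forensic path as a query-string
// parameter, e.g. "1024-GZIP-3" -> "1024-GZIP-3?re=foo".
std::string annotate_path(std::string path, const std::string& key,
                          const std::string& value) {
    path += (path.find('?') == std::string::npos) ? '?' : '&';
    path += escape_component(key) + "=" + escape_component(value);
    return path;
}
```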

be13_api/pcap_fake.cpp:2:21: error: tcpflow.h: No such file or directory

I downloaded both bulk_extractor and tcpflow via git. I built and installed tcpflow and am now trying to build bulk_extractor, but I run into the above error. I could hardwire a fix, but I'd like to get this solved properly.

If I copy tcpflow/src/tcpflow.h to /usr/local/include, the compile throws this error:

In file included from be13_api/pcap_fake.cpp:2:
/usr/local/include/tcpflow.h:206: error: conflicting declaration ‘typedef size_t socklen_t’

-David
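The conflicting typedef in tcpflow.h could be guarded behind a configure-time feature macro; HAVE_SOCKLEN_T here is an assumed name, not necessarily what tcpflow's configure actually defines:

```cpp
#include <cstddef>

// Only supply the fallback typedef when the system headers do not
// already provide socklen_t (as detected at configure time).
#ifndef HAVE_SOCKLEN_T
typedef size_t socklen_t;
#endif
```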

Whitelist system may not work properly with exif XML output

From the mailing list:

I ran bulk_extractor against an image and then re-ran it against the
same image again giving it -w exif.txt from the first run.  This
should have resulted in all exif features being stopped, but I get a
non-empty exif file on the second run:

This is the entire exif feature file from the second run.

# UTF-8 Byte Order Marker; see http://unicode.org/faq/utf_bom.html
# BULK_EXTRACTOR-Version: 1.3.1 ($Rev: 10844 $)
# Feature-Recorder: exif
# Filename: win7.vmdk
# Feature-File-Version: 1.1
292220928   288a8ed63c00c1b39343dbe82a090cd0    <exif><ifd0.tiff.Software>Adobe ImageReady</ifd0.tiff.Software></exif>
4899467264  6d5f317239f1b039bc534660ac2abae4    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4900933632  5aea5473d3bd76a86cf4dbe46385545f    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4890324992  4146c4da38363f4e2862d10c1f84f80d    <exif><ifd0.tiff.Copyright>Will Austin</ifd0.tiff.Copyright></exif>
4895125504  92fc7a14c551dae96c1960074865aa59    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4896358400  72ee2842f3d7872a92964734322cac2b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4897701888  9a48c674f92171fc20eb1f8a5b8c2e9b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4918804480  72ee2842f3d7872a92964734322cac2b    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4920516608  5b57a8c6cd9393c567f89f0f4cc89522    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4922580992  4faf65eb81de15c1a371f53e5a3a38e0    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>
4924956672  698fcb66721525f86140188781bdb33e    <exif><ifd0.tiff.Copyright>Microsoft Corporation</ifd0.tiff.Copyright></exif>

(There is a tab after the offset on the first line; I don't know why
the mail client doesn't show it.)

All of these features are in the exif.txt feature file from the first
run that I used as a stoplist.

Ubuntu 10.04.4 LTS
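One plausible failure mode, offered purely as a hypothesis: if the stoplist stores features exactly as written to the feature file while the scanner compares a differently-normalized copy (escaping, trailing whitespace), every lookup misses. Normalizing both sides before comparison would avoid that; all names below are illustrative, not bulk_extractor's actual stoplist code:

```cpp
#include <set>
#include <string>

// Stand-in for real feature normalization: trim trailing whitespace
// so both the stored and the queried form compare equal.
std::string normalize(std::string s) {
    while (!s.empty() && (s.back() == ' ' || s.back() == '\t' || s.back() == '\n'))
        s.pop_back();
    return s;
}

// A stoplist lookup that normalizes both at insert time and at query
// time never misses on whitespace differences alone.
bool stopped(const std::set<std::string>& stoplist, const std::string& feature) {
    return stoplist.count(normalize(feature)) != 0;
}
```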

cda_tool.py should use python3, not python3.2

The first line reads
#!/usr/bin/env python3.2

I suggest changing it to
#!/usr/bin/env python3
so it also works with current Python 3.x.
Or is there a reason it does not work with 3.3?

YY_FATAL_ERROR macro called in scan_email.cpp or scan_accts.cpp.

When running bulk_extractor I am getting the following error message:

input buffer overflow, can't enlarge buffer because scanner uses REJECT
Segmentation fault

Does anyone know what is causing this error message? I am running this under Linux Mint on an 8-core machine with 12 GB of RAM.

User Plugins

As of version 1.4.4, a user-defined plugin can be loaded only by giving a plugin directory via the command-line option '-P'. I would appreciate an environment variable à la PATH (something like BE_PATH), in order to keep the command line short.

Further, BEViewer 1.4.4 can't show the content of a path containing a component belonging to a user-defined recursive plugin, because the -P option is not passed in the underlying call to bulk_extractor. An environment variable would solve this issue, too.

The following patch against the bulk_extractor 1.4.4 source helps me as a temporary solution. It would be nice if this were fixed in the next release:

diff -r bulk_extractor-1.4.4/src/main.cpp source/bulk_extractor-patched/src/main.cpp
809a810,820
>     // >>> Patch
>     // add to plugin_path: /usr/local/lib/bulk_extractor:/usr/lib/bulk_extractor:.
>     {
>       const char* p;
>       struct stat s;
>       p="/usr/local/lib/bulk_extractor"; if(stat(p, &s)==0) scanner_dirs.push_back(p);
>       p="/usr/lib/bulk_extractor";       if(stat(p, &s)==0) scanner_dirs.push_back(p);
>       p=".";                                                scanner_dirs.push_back(p);
>     }
>     // <<< Patch
> 
diff -r bulk_extractor-1.4.4/src/be13_api/plugin.cpp source/bulk_extractor-patched/src/be13_api/plugin.cpp
218c218,219
<     std::cout << "Loading: " << fn << " (" << func_name << ")\n";

---
>     // >>> Patch: The following output would confuse BEViewer.
>     // std::cout << "Loading: " << fn << " (" << func_name << ")\n";
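The requested BE_PATH behavior could be sketched like this (BE_PATH and the helper are proposals from this issue, not existing bulk_extractor code):

```cpp
#include <string>
#include <vector>

// Split a PATH-like, colon-separated value (e.g. the contents of a
// hypothetical BE_PATH environment variable) into plugin directories
// to append to scanner_dirs. Empty components are skipped.
std::vector<std::string> plugin_dirs_from_env(const char* value) {
    std::vector<std::string> dirs;
    if (value == nullptr) return dirs;
    std::string s(value);
    size_t start = 0;
    while (start <= s.size()) {
        size_t colon = s.find(':', start);
        if (colon == std::string::npos) colon = s.size();
        if (colon > start) dirs.push_back(s.substr(start, colon - start));
        start = colon + 1;
    }
    return dirs;
}
```

The caller would pass `std::getenv("BE_PATH")`, which may be null; the helper treats a missing variable as an empty search path.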

Whitelist stats go to stdout but not to report.xml

Whitelist stats are reported to stdout but not to report.xml.

Specifically:

When bulk_extractor initializes in main.cpp, it reads any alert list(s) and stop list(s) using function word_and_context_list::readfile in file word_and_context_list.cpp. Unfortunately, bulk_extractor does this before opening report.xml as pointer variable dfxml_writer *xreport, so it is not yet ready to write to report.xml.

To fix this:

  • Move the instantiation of xreport up near the top,
    being careful not to disrupt behavior in the event of an error or if bulk_extractor is being restarted.
  • Pass the xreport pointer as a new parameter to word_and_context_list::readfile() so that readfile can write the stats directly into report.xml.
  • I recommend the same treatment of passing xreport to any function
    that prints to stdout wherever the user also wants the output to go into report.xml.
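A sketch of the proposed plumbing, using a generic output stream in place of the actual dfxml_writer interface (the function name and the stats element are illustrative):

```cpp
#include <iostream>
#include <sstream>
#include <string>

// A stoplist reader that writes its stats to any std::ostream, so the
// caller can direct them to std::cout, to the report.xml stream, or
// to both. Counts non-empty, non-comment lines.
size_t read_stoplist(const std::string& data, std::ostream& stats_out) {
    std::istringstream in(data);
    std::string line;
    size_t count = 0;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] != '#') ++count;
    }
    stats_out << "<stoplist entries='" << count << "'/>\n";
    return count;
}
```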

Scanning in recursive mode drops features and files

When running the bulk_extractor in recursive directory scan mode (-R), bulk_extractor drops features and files:

  • If a feature is encountered but another feature from a different file has already been recorded at that forensic path, the new feature is dropped.
  • If a filename is not plain ASCII, bulk_extractor skips the file and does not scan it.

This behavior limits completeness of scans using recursive mode.

URL parse error when URL is surrounded by '&quot;'

Results incorrectly include a trailing '&quot' when parsing URLs.

url.txt output:

199452984   http://www.icra.org/ratingsv02.html&quot;   (pics-1.1 &quot;http://www.icra.org/ratingsv02.html&quot; l gen true for 
199453047   http://www.msn.com&quot  true for &quot;http://www.msn.com&quot; r (cz 1 lz 1 n
199453120   http://msn.com&quot  true for &quot;http://msn.com&quot; r (cz 1 lz 1 n
199453189   http://stb.msn.com&quot  true for &quot;http://stb.msn.com&quot; r (cz 1 lz 1 n
199453396   http://www.rsac.org/ratingsv01.html&quot;   z 1 vz 1) &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true for 
199453645   http://stc.msn.com&quot  true for &quot;http://stc.msn.com&quot; r (n 0 s 0 v 0
199453709   http://stj.msn.com&quot  true for &quot;http://stj.msn.com&quot; r (n 0 s 0 v 0

should be:

199452984   http://www.icra.org/ratingsv02.html (pics-1.1 &quot;http://www.icra.org/ratingsv02.html&quot; l gen true for 
199453047   http://www.msn.com   true for &quot;http://www.msn.com&quot; r (cz 1 lz 1 n
199453120   http://msn.com   true for &quot;http://msn.com&quot; r (cz 1 lz 1 n
199453189   http://stb.msn.com   true for &quot;http://stb.msn.com&quot; r (cz 1 lz 1 n
199453396   http://www.rsac.org/ratingsv01.html z 1 vz 1) &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true for 
199453645   http://stc.msn.com   true for &quot;http://stc.msn.com&quot; r (n 0 s 0 v 0
199453709   http://stj.msn.com   true for &quot;http://stj.msn.com&quot; r (n 0 s 0 v 0

Version information:

# BULK_EXTRACTOR-Version: 1.5.5 ($Rev: 10844 $)
# Feature-Recorder: url
# Feature-File-Version: 1.1

Please let me know if I can provide you with any better information.
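A fix along these lines could strip a trailing HTML entity fragment from the extracted URL before recording it. This is an illustrative helper, not the actual scanner code:

```cpp
#include <string>

// Remove a trailing "&quot;" or "&quot" fragment left over when a URL
// was extracted from inside HTML-escaped quotes.
std::string trim_quot_entity(std::string url) {
    for (const char* suffix : {"&quot;", "&quot"}) {
        std::string s(suffix);
        if (url.size() > s.size() &&
            url.compare(url.size() - s.size(), s.size(), s) == 0) {
            url.erase(url.size() - s.size());
            break;
        }
    }
    return url;
}
```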

bulk_extractor scan_flexdemo error

When trying to run bulk_extractor (1.5.5) with a plugins directory, it throws an error:
"bulk_extractor: symbol lookup error: ./scan_flexdemo.so: undefined symbol: _ZN7beregexC1Esi"

python bulkextractor_reader should have an iterator for reports

Currently the iterator only works with report directories and zip files of report directories. It should be modified to handle top-level directories or zip files containing multiple reports, returning an iterator over all of the reports and, for each report, an iterator over all of the enclosed feature files.

Prefer octal or hex escape codes

A discussion in bulk_extractor-users group of octal vs. hex escape codes resolved that hex is preferred. Functionally, it doesn't matter, but people visually prefer hex.

SHA-1 support

Currently BE uses MD5 as its universal hash. There should be a flag allowing other hash algorithms to be used and reported, and the hash in use should be evident in the feature files. Should SHA-3/128 (the first 128 bits of SHA-3) also be supported?
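A sketch of the flag's observable effect: recording the selected algorithm in the feature-file banner so downstream tools know which digest the hash column holds. The flag, struct, and banner line are all hypothetical:

```cpp
#include <string>

// Hypothetical description of a hash algorithm chosen by a
// command-line flag (e.g. something like -H sha1).
struct hash_spec {
    std::string name;
    unsigned bits;
};

// Emit a banner comment for the feature file so the digest in use is
// evident, as the issue requests.
std::string feature_file_banner(const hash_spec& h) {
    return "# Hash-Algorithm: " + h.name + "/" + std::to_string(h.bits) + "\n";
}
```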

pcap stoplist

bulk_extractor scan_pcap should support a stoplist of packet artifacts.

Integrated handling of magic numbers

Scanners should be able to register magic numbers that they can handle. Then other scanners like scan_xor could look for the magic numbers and only xor when they find them... Useful?
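A sketch of such a registry (an entirely hypothetical API; bulk_extractor has no magic-number registry today):

```cpp
#include <map>
#include <string>

// Shared registry mapping magic-number prefixes to the scanner that
// claims them. A scanner like scan_xor could probe de-obfuscated
// bytes against it before recursing.
struct magic_registry {
    std::map<std::string, std::string> by_prefix;  // magic bytes -> scanner name

    void add(const std::string& magic, const std::string& scanner) {
        by_prefix[magic] = scanner;
    }

    // Return the owning scanner if data begins with a registered
    // magic number, otherwise an empty string.
    std::string match(const std::string& data) const {
        for (const auto& kv : by_prefix) {
            if (data.compare(0, kv.first.size(), kv.first) == 0)
                return kv.second;
        }
        return "";
    }
};
```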

Custom LEX can't be set during configure

I am trying to run ./configure with LEX=/usr/local/bin/flex (this is needed because /usr/bin/flex doesn't support -R but /usr/local/bin/flex does).

But this is not possible because of these three lines in configure.ac:

if test "$LEX" != flex; then
AC_MSG_ERROR([flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....])
fi

So I get the following error:
configure: error: flex not installed; required for compiling regular expressions. Try 'apt-get install flex' or 'yum install flex' or 'port install flex' or whatever package manager you happen to be using....

Possible typo

This appears to be a typo in the script: should it be getpwuid? It is listed as getwpuid and was reported missing when ./configure was run for bulk_extractor.

process_aff appears to be ignoring pagesize

It appears that process_aff::get_sbuf() ignores the pagesize. I don't think that this can all be rewritten to use pread because process_dir needs to be able to return an sbuf for an iterator.

TZ Typo?

Looking at line 48 of src/scan_email_lg.cpp, it looks like the ABBREV constant has the value 'UT' instead of 'UTC'.

Was this a typo or a deliberate choice?

User plugins (continued)

Thanks for integrating my suggestions (issue #53).

However, the second part of the patch, concerning line 218 of file bulk_extractor-1.4.4/src/be13_api/plugin.cpp, has not been included yet.

The output from this line should go to a log file (if anywhere) rather than to cout. Otherwise BEViewer can no longer show any image data, because the underlying call "bulk_extractor -p -http ..." is polluted by this logging output and is therefore no longer clean HTTP.

exiv2 doesn't compile

I started fixing:

--- ./configure.ac.orig 2013-07-12 01:19:20.000000000 +0000
+++ ./configure.ac      2013-07-13 07:43:24.000000000 +0000
@@ -518,8 +518,8 @@
   fi
 fi
 if test x"$exiv2" == x"yes" ; then
-  AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
   AC_LANG_PUSH(C++)
+  AC_CHECK_HEADERS([exiv2/image.hpp exiv2/exif.hpp exiv2/error.hpp])
     AC_TRY_COMPILE([#include <exiv2/image.hpp>
                    #include <exiv2/exif.hpp>
                     #include <exiv2/error.hpp>],
--- ./src/scan_exiv2.cpp.orig   2013-05-29 01:03:05.000000000 +0000
+++ ./src/scan_exiv2.cpp        2013-07-13 07:45:01.000000000 +0000
@@ -7,6 +7,7 @@

 #include "config.h"
 #include "bulk_extractor_i.h"
+#include "be13_api/utils.h"

 #include <stdlib.h>
 #include <string.h>
@@ -101,7 +102,7 @@
 void scan_exiv2(const class scanner_params &sp,const recursion_control_block &rcb)
 {
     assert(sp.sp_version==scanner_params::CURRENT_SP_VERSION);
-    if(sp.phase==scanner_params::startup){
+    if(sp.phase==scanner_params::PHASE_STARTUP){
         assert(sp.info->si_version==scanner_info::CURRENT_SI_VERSION);
        sp.info->name  = "exiv2";
         sp.info->author         = "Simson L. Garfinkel";
@@ -112,8 +113,8 @@
        sp.info->flags = scanner_info::SCANNER_DISABLED; // disabled because we have be_exif
        return;
     }
-    if(sp.phase==scanner_params::shutdown) return;
-    if(sp.phase==scanner_params::scan){
+    if(sp.phase==scanner_params::PHASE_SHUTDOWN) return;
+    if(sp.phase==scanner_params::PHASE_SCAN){

        const sbuf_t &sbuf = sp.sbuf;
        feature_recorder *exif_recorder = sp.fs.get_name("exif");

But now I have other issues:

scan_exiv2.cpp: In function 'void scan_exiv2(const scanner_params&, const recursion_control_block&)':
scan_exiv2.cpp:155: error: 'be_hash' was not declared in this scope
scan_exiv2.cpp:186: error: 'xml' is not a class or namespace

fname use after free in process_ewf::open

Hi,

In process_ewf::open, fname is freed immediately before being used.
The patch below seems to fix the problem:

--- ./src/image_process.h.orig  2014-01-15 15:00:06.000000000 +0000
+++ ./src/image_process.h       2014-06-09 14:15:54.000000000 +0000
@@ -128,7 +128,7 @@
     virtual int open()=0;                                  /* open; return 0 if successful */
     virtual int pread(uint8_t *,size_t bytes,int64_t offset) const =0;     /* read */
     virtual int64_t image_size() const=0;
-    virtual std::string image_fname() const { return image_fname_;}
+    virtual const std::string &image_fname() const { return image_fname_;}

     /* iterator support; these virtual functions are called by iterator through (*myimage) */
     virtual image_process::iterator begin() const =0;
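The bug class this patch addresses can be illustrated in isolation: calling c_str() on a string returned by value yields a pointer into a temporary that dies at the end of the full expression, whereas returning a const reference to the member keeps the buffer alive for the object's lifetime. The struct below is a minimal stand-in, not the real image_process class:

```cpp
#include <string>

struct image {
    std::string image_fname_ = "disk.E01";

    // Before the patch: returns a copy, which is a temporary at the
    // call site. Keeping a pointer from .c_str() on it dangles.
    std::string fname_by_value() const { return image_fname_; }

    // After the patch: returns a reference to the long-lived member,
    // so .c_str() stays valid while the image object exists.
    const std::string& fname_by_ref() const { return image_fname_; }
};
```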

JVM required by BEViewer

A discussion in the bulk_extractor-users group concluded that it is best to compile BEViewer with the latest compiler. The next BEViewer will be compiled with OpenJDK and will require a Java 7 JRE.

lightgrep

I'm getting "error while loading shared libraries: liblightgrep.so.0: cannot open shared object file: No such file or directory" when trying to run bulk_extractor. I have lightgrep installed, and was hoping to run bulk_extractor with it. This is from a pull made today.

Thanks
