bulk_extractor wordlist should be rewritten to use la-strings.

simsong commented on May 30, 2024

bulk_extractor wordlist should be rewritten to use la-strings.

from bulk_extractor.

Comments (4)

simsong commented on May 30, 2024

This would be nice. The problem with this approach is that pretty much any random sequence of bytes will decode as valid UTF-16 full of Han characters, so you're going to need a language model. In practice that means adding la-strings and then telling the system which language or languages you want to extract. That's a complete rewrite of this module. Would you like to do it, or would you be happy with just English strings?
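To illustrate the point about random bytes, here is a minimal Python sketch (not bulk_extractor code; the function name `han_fraction` is made up for this example). It reads random bytes as little-endian 16-bit code units and counts how many land in the CJK Unified Ideographs block, U+4E00 to U+9FFF. That block covers 20992 of the 65536 possible 16-bit values, so roughly a third of random code units should look like Han characters.

```python
import random

def han_fraction(num_units=100_000, seed=0):
    """Fraction of random 16-bit code units that fall in U+4E00..U+9FFF."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    data = bytes(rng.getrandbits(8) for _ in range(2 * num_units))
    han = 0
    for i in range(0, len(data), 2):
        # Interpret each byte pair as one UTF-16LE code unit.
        unit = int.from_bytes(data[i:i + 2], "little")
        if 0x4E00 <= unit <= 0x9FFF:
            han += 1
    return han / num_units

if __name__ == "__main__":
    print(f"fraction of Han-block code units: {han_fraction():.3f}")
```

A filter that only asks "is this valid UTF-16?" would therefore accept most random data, which is why some form of language model or character-set restriction is needed.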

kefir- commented on May 30, 2024

Wouldn't it suffice if the character set (or language) was specified by the user? For example, if I were searching for Norwegian words or passwords written on a Norwegian keyboard, I could grab the full list of Norwegian UTF-8 characters from somewhere like http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet and specify that as my UTF-8 input character set (which could also be converted to UTF-16 and UTF-32). Bundling some charsets with bulk_extractor would be useful as well.
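The user-specified character set idea could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how the wordlist module works: the names `NORWEGIAN`, `build_pattern`, and `extract_words` are hypothetical, the minimum length of 4 is arbitrary, and the character set is simply ASCII letters and digits plus æ, ø, and å. Each allowed character is encoded in the target encoding, and a byte-level regular expression then finds runs of those encoded characters.

```python
import re
import string

# Illustrative character set (an assumption, not something bundled with
# bulk_extractor): ASCII letters and digits plus the extra Norwegian letters.
NORWEGIAN = string.ascii_letters + string.digits + "æøåÆØÅ"

def build_pattern(charset, encoding, min_len=4):
    """Compile a bytes regex matching runs of at least min_len charset characters."""
    # Encode every allowed character; put longer byte sequences first.
    encoded = sorted({ch.encode(encoding) for ch in charset}, key=len, reverse=True)
    body = b"|".join(re.escape(b) for b in encoded)
    return re.compile(b"(?:" + body + b"){%d,}" % min_len)

def extract_words(data, charset, encodings=("utf-8", "utf-16-le"), min_len=4):
    """Yield (offset, encoding, text) for each run found in the raw bytes."""
    for enc in encodings:
        for match in build_pattern(charset, enc, min_len).finditer(data):
            yield match.start(), enc, match.group().decode(enc)

if __name__ == "__main__":
    sample = (b"\x00\x01" + "blåbærsyltetøy".encode("utf-8")
              + b"\xff\xfe\x00" + "Smørbrød1".encode("utf-16-le"))
    for offset, enc, text in extract_words(sample, NORWEGIAN):
        print(offset, enc, text)
```

Because the pattern is built from the encoded form of each character, the same character list can be reused for UTF-8, UTF-16, or UTF-32 by changing only the encoding name.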

I'm not sure how the language models in la-strings work, but if it tries to detect language based on character and word frequencies, it might fail on passwords that aren't words in any language. It might fail in particular on passwords typed on a non-English keyboard and stored alongside text in a different language, for example English-language URLs or database table names. I may easily have misunderstood how la-strings works, though.

I could try to put some code together, but don't hold your breath! :-) I certainly won't feel bad if someone else beats me to it.

Straying off topic: Is there a generic way to pass parameters to modules without changing bulk_extractor core to support the parameters? That would be useful for this and other potential modules.

simsong commented on May 30, 2024

bulk_extractor uses the -S option to pass name=value pairs to modules. This is better supported in the 1.4 codebase currently on GitHub. Regarding specifying the language: yes, that's possible, but our experience to date is that the examiner frequently doesn't know which languages are present on the media.
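As a rough illustration of the name=value convention, a module could turn such pairs into a settings dictionary as in the Python sketch below. This is not bulk_extractor's actual parsing code, and the parameter names `wordlist_min_len` and `wordlist_charset` are hypothetical, chosen only for the example.

```python
def parse_settings(pairs):
    """Turn ['name=value', ...] strings into a dict, splitting on the first '='."""
    settings = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {pair!r}")
        settings[name] = value  # a later duplicate overrides an earlier one
    return settings

if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    print(parse_settings(["wordlist_min_len=6", "wordlist_charset=norwegian"]))
```

Splitting on the first `=` only means a value may itself contain `=` characters.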

simsong commented on May 30, 2024

Hi. I'm reviewing this for BE2.0
What is the status of Language-Aware Strings?
