
Comments (4)

simsong commented on May 30, 2024

This would be nice. The problem with this approach is that pretty much any random sequence of bytes decodes as valid UTF-16 full of Han characters, so you would need a language model to separate real text from noise. That really means adding la-strings and then telling the system which language or languages you want to extract, which is a complete rewrite of this module. Would you like to do it, or would you be happy with just English strings?
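To illustrate the point about random bytes, here is a minimal Python sketch (not bulk_extractor code). It reads random data as 16-bit little-endian code units and measures how many land in the CJK Unified Ideographs block, U+4E00 to U+9FFF. The function name `han_fraction` and the fixed seed are assumptions made for the example. Since that block covers 20,992 of the 65,536 possible 16-bit values, roughly a third of random code units should look like Han characters.

```python
import random


def han_fraction(data: bytes) -> float:
    """Fraction of 16-bit little-endian code units inside U+4E00..U+9FFF."""
    # Split the buffer into 2-byte code units; a trailing odd byte is ignored.
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data) - 1, 2)]
    if not units:
        return 0.0
    han = sum(1 for u in units if 0x4E00 <= u <= 0x9FFF)
    return han / len(units)


# Seeded so the experiment is repeatable; the seed value is arbitrary.
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(20000))
print(f"{han_fraction(noise):.1%} of random code units look like Han characters")
```

A plain "printable run" filter therefore accepts a large share of noise when it is read as UTF-16, which is why some form of language knowledge is needed.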

from bulk_extractor.

kefir- commented on May 30, 2024

Wouldn't it suffice if the character set (or language) were specified by the user? For example, if I were searching for Norwegian words or passwords typed on a Norwegian keyboard, I could grab the full list of Norwegian characters from somewhere like http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet and specify that as my UTF-8 input character set (which could also be converted to UTF-16 and UTF-32). Bundling some charsets with bulk_extractor would be useful as well.
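The idea could be sketched roughly as follows in Python. This is an illustration only, not how bulk_extractor or la-strings works: the function `extract_strings`, its parameters, and the alphabet string are all hypothetical. Each allowed character is encoded in the target encoding, and a byte-level regular expression then finds runs of at least `min_chars` such characters.

```python
import re


def extract_strings(data: bytes, alphabet: str,
                    encoding: str = "utf-8", min_chars: int = 4) -> list[str]:
    """Return runs of at least min_chars characters drawn from alphabet."""
    # One regex alternative per allowed character, as encoded bytes.
    alts = b"|".join(re.escape(ch.encode(encoding))
                     for ch in sorted(set(alphabet)))
    pattern = re.compile(b"(?:" + alts + b"){%d,}" % min_chars)
    # Note: for UTF-16/UTF-32 this does not enforce code-unit alignment.
    return [m.group().decode(encoding) for m in pattern.finditer(data)]


# Hypothetical user-supplied character set: ASCII letters plus æ, ø, å.
NORWEGIAN = ("abcdefghijklmnopqrstuvwxyzæøå"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ")

blob = b"\x00\x01" + "blåbærsyltetøy".encode("utf-8") + b"\xff\xfeab\x00"
print(extract_strings(blob, NORWEGIAN))
```

The same alphabet can be reused for other encodings by passing, for example, `encoding="utf-16-le"`, which matches the suggestion of converting one user-supplied character set to UTF-16 and UTF-32.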

I'm not sure how the language models in la-strings work, but if they detect language based on character and word frequencies, they might fail on passwords that aren't words in any language. They might fail in particular on passwords typed on a non-English keyboard that are stored together with text in a different language, for example English-language URLs or database table names. I may easily have misunderstood how la-strings works, though.

I could try to put some code together, but don't hold your breath! :-) I certainly won't feel bad if someone else beats me to it.

Straying off topic: Is there a generic way to pass parameters to modules without changing bulk_extractor core to support the parameters? That would be useful for this and other potential modules.


simsong commented on May 30, 2024

bulk_extractor uses the -S option to pass name=value pairs to modules. This is better supported in the 1.4 codebase currently on GitHub. As for specifying the language: yes, that's possible, but our experience to date is that the examiner frequently doesn't know which languages are present on the media.


simsong commented on May 30, 2024

Hi. I'm reviewing this for BE2.0.
What is the status of Language-Aware Strings?

