Comments (4)
This would be nice. The problem with this approach is that pretty much any random sequence of bytes decodes as valid UTF-16 containing Han characters, so you're going to need a language model. That means adding la-strings and then telling the system which language or languages you want to extract. That's a complete rewrite of this module. Would you like to do it, or would you be happy with just English strings?
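To put a rough number on the UTF-16 point, here is a minimal Python sketch (not bulk_extractor code; the function name `han_fraction` is made up for illustration). It treats random bytes as little-endian 16-bit code units and counts how many fall in the CJK Unified Ideographs block, U+4E00 through U+9FFF. That block covers 20,992 of the 65,536 possible 16-bit values, so roughly 32% is the expected result.

```python
import os


def han_fraction(data: bytes) -> float:
    """Return the fraction of 16-bit little-endian code units in data
    that fall inside the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    units = [
        int.from_bytes(data[i:i + 2], "little")
        for i in range(0, len(data) - 1, 2)
    ]
    if not units:
        return 0.0
    han = sum(1 for u in units if 0x4E00 <= u <= 0x9FFF)
    return han / len(units)


# 64 KiB of random bytes stands in for an arbitrary region of a disk image.
random_block = os.urandom(1 << 16)
print(f"{han_fraction(random_block):.1%} of random code units look like Han")
```

With about a third of random code units landing in that one block, short runs of "Han text" show up in almost any binary data, which is why a plain character-class filter is not enough for UTF-16.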
from bulk_extractor.
Wouldn't it suffice if the character set (or language) was specified by the user? For example, if I were searching for Norwegian words or passwords written on a Norwegian keyboard, I could grab the full list of Norwegian UTF-8 characters from somewhere like http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet and specify that as my UTF-8 input character set (which could also be converted to UTF-16 and UTF-32). Bundling some charsets with bulk_extractor would be useful as well.
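A rough sketch of what I mean, in Python rather than as a bulk_extractor module. Everything here is hypothetical: the `NORWEGIAN` alphabet, the function names, and the minimum length of 4 are assumptions for illustration only. The idea is to encode each character of the user-supplied alphabet in the target encoding and match runs of those encoded characters in the raw bytes.

```python
import re

# Hypothetical user-supplied alphabet: ASCII letters and digits
# plus the three extra Norwegian letters in both cases.
NORWEGIAN = (
    "abcdefghijklmnopqrstuvwxyzæøå"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"
    "0123456789"
)


def build_pattern(alphabet: str, encoding: str, min_chars: int = 4) -> re.Pattern:
    """Compile a bytes regex matching runs of at least min_chars
    characters from alphabet, each encoded with encoding."""
    encoded = sorted({ch.encode(encoding) for ch in alphabet})
    alternation = b"|".join(re.escape(e) for e in encoded)
    return re.compile(b"(?:" + alternation + b"){" + str(min_chars).encode() + b",}")


def extract_strings(data: bytes, alphabet: str,
                    encoding: str = "utf-8", min_chars: int = 4) -> list[str]:
    """Return every run of alphabet characters found in data."""
    pattern = build_pattern(alphabet, encoding, min_chars)
    return [m.group().decode(encoding) for m in pattern.finditer(data)]


# Use "utf-16-le" (or "utf-16-be"), not "utf-16": the latter prepends
# a byte-order mark to every encoded character.
blob = b"\x00\x01" + "blåbærsyltetøy".encode("utf-8") + b"\xff\xfe"
print(extract_strings(blob, NORWEGIAN))
```

The same alphabet can be reused for UTF-16 or UTF-32 by changing the `encoding` argument, so a bundled charset would only need to be a list of characters.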
I'm not sure how the language models in la-strings work, but if it tries to detect language based on character and word frequencies, it might fail on passwords that aren't words in any language, and specifically on passwords typed on a non-English keyboard that are stored together with text in a different language, for example English-language URLs or database table names. I may easily have misunderstood how la-strings works, though.
I could try to put some code together, but don't hold your breath! :-) I certainly won't feel bad if someone else beats me to it.
Straying off topic: Is there a generic way to pass parameters to modules without changing bulk_extractor core to support the parameters? That would be useful for this and other potential modules.
bulk_extractor uses the -S option to pass name=value pairs to modules. This is better supported in the 1.4 codebase currently on GitHub. Regarding specifying the language: yes, that's possible, but our experience to date is that the examiner frequently doesn't know which languages are present on the media.
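As a generic illustration of the name=value idea (a Python sketch, not the bulk_extractor implementation; the function `parse_settings` and the setting names `strings_charset` and `strings_min_len` are invented for this example), a module could receive its settings as a plain dictionary built like this:

```python
def parse_settings(pairs: list[str]) -> dict[str, str]:
    """Turn a list of "name=value" strings into a dictionary.

    Only the first "=" separates the name from the value, so values
    may themselves contain "=". A pair without a name or without "="
    is rejected.
    """
    settings: dict[str, str] = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {pair!r}")
        settings[name] = value
    return settings


# Hypothetical setting names, as a module for user-specified
# character sets might define them.
settings = parse_settings(["strings_charset=norwegian", "strings_min_len=6"])
print(settings)
```

Because the core only passes strings through, a new module can define its own setting names without any change to the core.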
Hi. I'm reviewing this for BE2.0.
What is the status of Language-Aware Strings?