jstrings

A tool for finding JIS encoded Japanese text in binary data.

Usage

jstrings [options] [input_file]

Input can be a filename or data from stdin. Output is sent to stdout.

Options

-e encoding
--encoding encoding

Specify the encoding to use. Use one of the strings listed in parentheses below for that encoding:

  • Shift-JIS (shift-jis, shiftjis, sjis)
  • EUC-JP (euc, euc-jp, eucjp)
  • Microsoft CP932 (cp932, windows932, windows31j)

Optional; default is Shift-JIS.
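
As a rough illustration of how the aliases above could be resolved, here is a small Python sketch. It is not taken from jstrings: the `codec_for` helper is hypothetical, and mapping each alias group to the Python codec names `shift_jis`, `euc_jp`, and `cp932` is an assumption made for the example.

```python
# Hypothetical alias table: keys are Python codec names (an assumption),
# values are the alias strings listed in the README.
ALIASES = {
    "shift_jis": ("shift-jis", "shiftjis", "sjis"),
    "euc_jp": ("euc", "euc-jp", "eucjp"),
    "cp932": ("cp932", "windows932", "windows31j"),
}

# Flatten to alias -> codec for direct lookup.
LOOKUP = {alias: codec for codec, names in ALIASES.items() for alias in names}


def codec_for(name="shift-jis"):
    """Return the codec for an alias; the default mirrors the documented default."""
    try:
        return LOOKUP[name.lower()]
    except KeyError:
        raise ValueError(f"unknown encoding: {name}") from None


if __name__ == "__main__":
    for alias in ("sjis", "euc", "windows31j"):
        print(alias, "->", codec_for(alias))
```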

-l value
--match-length value

Set the number of consecutive valid characters required for a run to be considered a string.

Optional; default is 5.
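
To show how a minimum match length affects what gets reported, here is a minimal Python sketch of a Shift-JIS run scanner. It is not the jstrings implementation. The name `find_runs`, the lead and trail byte ranges, and the decision to count printable ASCII as valid are all assumptions made for the example, and the range check accepts some byte pairs that are not assigned code points.

```python
def is_lead(b):
    # Common Shift-JIS lead-byte ranges (vendor extensions are ignored here).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def is_trail(b):
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def find_runs(data, match_length=5):
    """Return (offset, raw_bytes) for each run of >= match_length characters."""
    runs = []
    start, count, i, n = None, 0, 0, len(data)
    # Iterating up to and including n lets a run that ends at EOF be flushed.
    while i <= n:
        width = 0
        if i < n:
            b = data[i]
            if 0x20 <= b <= 0x7E:
                width = 1  # printable ASCII (assumed valid for this sketch)
            elif is_lead(b) and i + 1 < n and is_trail(data[i + 1]):
                width = 2  # double-byte character
        if width:
            if start is None:
                start = i
            count += 1  # count characters, not bytes
            i += width
        else:
            if start is not None and count >= match_length:
                runs.append((start, data[start:i]))
            start, count = None, 0
            i += 1
    return runs


if __name__ == "__main__":
    blob = (b"\x00\x01" + "こんにちは世界".encode("shift_jis")
            + b"\x00" + "日本".encode("shift_jis") + b"\xff")
    for offset, raw in find_runs(blob):
        print(f"0x{offset:X} {raw.decode('shift_jis')}")
```

With the default length of 5, the two-character run in the sample is dropped. Lowering `match_length` reports it as well, at the cost of more false positives in real binary data.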

-c value
--cutoff value

Limit each output string to the specified length. This is useful for "previewing" a file which may have large blocks of junk data that happen to fall within the range of valid code points. Strings that are cut off will have an ellipsis appended.

Note that the length is in bytes, not characters. As such, due to the variable-width nature of UTF-8, there is a chance the final character displayed may be incorrect. STL string functions are not encoding-aware, and the author feels that implementing character-aware truncation for an optional feature that should only be used for quickly previewing data would be overly complex.
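
The effect of a byte-based cutoff on UTF-8 text can be reproduced in a few lines of Python. This is only an illustration of the issue described above, not how jstrings is written. Both helper names are hypothetical, and `cut_on_boundary` shows one possible way to avoid splitting a character, which the tool does not claim to do.

```python
def cut_bytes(text, cutoff):
    """Naive cutoff: keep the first `cutoff` bytes of the UTF-8 form."""
    return text.encode("utf-8")[:cutoff]


def cut_on_boundary(text, cutoff):
    """Keep at most `cutoff` bytes without splitting a character."""
    raw = text.encode("utf-8")
    if cutoff >= len(raw):
        return text
    end = cutoff
    # UTF-8 continuation bytes look like 10xxxxxx. If the first dropped
    # byte is one, the cut falls inside a character, so back up.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end].decode("utf-8")


if __name__ == "__main__":
    s = "日本語"  # three characters, three bytes each in UTF-8
    print(cut_bytes(s, 4).decode("utf-8", errors="replace") + "...")
    print(cut_on_boundary(s, 4) + "...")
```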

Optional; default is no cutoff.

-m
--multiline

Treat newline characters (0x0D or 0x0D0A) as valid characters within a string. Otherwise, they count as end-of-string markers.

Optional; default is disabled.

-r
--raw

Output the data in its original encoding without converting to Unicode.

Optional; default is disabled (will convert output strings to UTF-8 using libiconv).

-s
--skip-jis0201

Skip checking for JIS X 0201 characters. This is an 8-bit katakana-only code space that acts as a supplement to ASCII and was generally only used in older (early to mid 1980s) home computers. Skipping this check can reduce false positives if you are working with newer data.

Optional; default is disabled (will include JIS X 0201 code points as valid matches).
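
The false-positive risk is easier to see with numbers. In Shift-JIS, the JIS X 0201 katakana occupy the single bytes 0xA1 through 0xDF, so a sizeable share of arbitrary byte values pass as valid characters on their own. The Python sketch below assumes Shift-JIS input. The helper name `is_jis0201_kana` is hypothetical and is not part of jstrings.

```python
def is_jis0201_kana(b):
    """True if byte value b is a half-width katakana byte in Shift-JIS."""
    return 0xA1 <= b <= 0xDF


# How many of the 256 possible byte values qualify on their own.
KANA_BYTE_COUNT = sum(1 for b in range(256) if is_jis0201_kana(b))


def kana_ratio():
    """Fraction of all byte values that would match as single-byte kana."""
    return KANA_BYTE_COUNT / 256


if __name__ == "__main__":
    sample = bytes([0xB1, 0xB2, 0xB3, 0xB4, 0xB5])
    # Five arbitrary-looking bytes already form a "valid" five-character run.
    print(sample.decode("shift_jis"))
    print(f"{KANA_BYTE_COUNT} of 256 byte values ({kana_ratio():.1%})")
```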

Output

Each string found is prefixed with the offset at which it was found in the original data and sent to stdout. Strings are converted to UTF-8 using libiconv. The original encoding can be preserved by using the --raw option.

Building

CMake is used for the build system. From the root directory:

mkdir build && cd build
cmake ..
make
sudo make install

jstrings's People

Contributors

drojaazu

Forkers

wowjinxy

jstrings's Issues

String at the end of the file is not found

It seems like a string is not considered valid unless there are at least 3 bytes following it before EOF.

good.txt: jstrings finds the Haiku
bad.txt: jstrings does not find the Haiku

I'm not sure if this is expected behavior or a bug; I just noticed it while testing and thought I might as well report it.

Multiline strings

Hi, first, fantastic tool, really helpful!

Currently, if a string contains a line break (0x0a / 0x0d0a), it will be detected as two separate strings instead. I understand that due to the tool's line-by-line output, representing newlines in a single string is not well-defined. However, in my use case, splitting the strings is causing some trouble.

Here's the scenario:

  1. I have 2 binaries, one Japanese, one English.
  2. I want to reverse engineer which strings from the Japanese binary have been translated to which in the English binary.
  3. Finally, I want to apply the same changes to a newer version of the Japanese binary.

For example, the Japanese binary contains this string:

データベースファイル
%sが
ありませんでした

In the English binary, it was translated as follows:

Cannot find database file:
%s

However, jstrings finds each line as a separate string. In order to meaningfully do what I'm trying to do, I would need to somehow stitch these individual lines back together. Of course, I could do that manually, but with a total of over 160,000 strings found, that could be quite cumbersome, and it would need repeating every time a new Japanese binary is released.

Would it be possible to add a command line option that enables finding multiline strings? Perhaps some sort of escape sequence could be used to mark the CR/LF characters.
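
One way the escape-sequence idea could look, sketched in Python. This is purely hypothetical and does not describe any existing jstrings behavior. The function names and the choice of backslash escapes are assumptions made for the example.

```python
def escape_newlines(text):
    """Make a multiline string fit on one output line."""
    # Escape backslashes first so the result can be reversed unambiguously.
    return (text.replace("\\", "\\\\")
                .replace("\r", "\\r")
                .replace("\n", "\\n"))


def unescape_newlines(text):
    """Reverse escape_newlines."""
    mapping = {"r": "\r", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Unknown escapes are passed through unchanged.
            out.append(mapping.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    msg = "データベースファイル\r\n%sが\r\nありませんでした"
    print(escape_newlines(msg))
```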

Thanks for your great work and consideration of my case.

Addresses getting out of sync

I have 2 binary files, one Japanese, one English.

In both files, jstrings finds the same English string: [Save Error].

In the Japanese binary, it reports the string at offset 0xf85cb6; in the English binary, it reports it at 0xf85cb2.

The actual position of the string in both binaries is 0xf85cec.

Somehow, the address is drifting while processing the file.

I will try to reproduce with a smaller example. If I can't, I can link to the original binaries I'm having trouble with.
