iximeow / poggle


Parser generator for binary data

Ruby 98.45% Java 0.40% Shell 1.05% C 0.10%

poggle's Issues

Set operations on value ranges

Ex:

root := 0x07-0xff | 0x01

root := 0x01 | 0x03 | 0x06 | 0x10-0x20 ! 0x1f

To consider: operator precedence.
| and & should probably have the same precedence as !, with all three applied left to right.

Also consider that negations can have ranges as well:

root := 0x00-0x0f ! 0x04-0x0e

TL;DR: & == intersection, | == union, ! <set> == & negation <set>
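
The left-to-right reading above can be sketched in Ruby. This is only an illustration of the proposed semantics, not poggle's implementation: `byte_set` and `eval_ranges` are hypothetical helper names, and operators are passed as plain strings.

```ruby
require 'set'

# Hypothetical helper: a literal or an inclusive range as a set of byte values.
def byte_set(lo, hi = lo)
  Set.new(lo..hi)
end

# Evaluate first, [op, set], [op, set], ... strictly left to right,
# giving |, & and ! the same precedence (an assumption from the issue text).
def eval_ranges(first, *steps)
  steps.reduce(first) do |acc, (op, rhs)|
    case op
    when '|' then acc | rhs   # union
    when '&' then acc & rhs   # intersection
    when '!' then acc - rhs   # a ! b == a & (negation of b)
    else raise ArgumentError, "unknown operator #{op}"
    end
  end
end

# root := 0x01 | 0x03 | 0x06 | 0x10-0x20 ! 0x1f
root = eval_ranges(byte_set(0x01),
                   ['|', byte_set(0x03)],
                   ['|', byte_set(0x06)],
                   ['|', byte_set(0x10, 0x20)],
                   ['!', byte_set(0x1f)])
```

With equal precedence the trailing `! 0x1f` applies to the whole union built so far, not just to the `0x10-0x20` range.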

Match backtracking is improperly handled

With the grammar

:components:
A := 0x04
B := 0x30
C := 0x24
D := (A|B):C
E := (A:B)|C

root := D:E

on the input
0x30 0x24 0x04 0x30

poggle inconsistently marks where to backtrack to, which results in an operation like @bytes[nil].

The error looks like:

/toy/poggle/parser_parts/matcher.rb:11:in `[]': no implicit conversion from nil to integer (TypeError)
    from /toy/poggle/parser_parts/matcher.rb:11:in `next'
    from /toy/poggle/parser_parts/primitives/byte_body.rb:15:in `match'
    from /toy/poggle/parser_parts/body_proxy.rb:11:in `match'
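
One way to avoid a stale or missing backtrack marker is to never store one. The sketch below is not poggle's matcher; it is a minimal combinator model with hypothetical `byte`, `seq` and `alt` helpers, where every matcher receives the position as an argument and returns all positions it can end at.

```ruby
# Each matcher is a lambda (bytes, pos) -> array of every end position.
# Positions are passed explicitly, so there is no shared marker that can become nil.
def byte(value)
  ->(bytes, pos) { bytes[pos] == value ? [pos + 1] : [] }
end

def seq(first, second)
  ->(bytes, pos) { first.call(bytes, pos).flat_map { |mid| second.call(bytes, mid) } }
end

def alt(left, right)
  ->(bytes, pos) { (left.call(bytes, pos) + right.call(bytes, pos)).uniq }
end

a = byte(0x04)
b = byte(0x30)
c = byte(0x24)
d = seq(alt(a, b), c)   # D := (A|B):C
e = alt(seq(a, b), c)   # E := (A:B)|C
root = seq(d, e)        # root := D:E

input = [0x30, 0x24, 0x04, 0x30]
ends = root.call(input, 0)
```

Here `ends` holds the positions at which `root` can finish on the input from the issue; an empty array means no match, and a failed alternative simply contributes nothing.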

Allow explicit alignment requirements

Something like

A: byte{2} aligned 4 := 0x06:0x50

would only allow A to match at a 4-byte-aligned offset. Slightly more concrete example:

Data: byte{4} aligned 4
IndexingInfo: byte{3} := 0x01:_
root := (IndexingInfo:_:Data){_}

so that future alterations of the grammar keep the guarantee that any Data lies on the alignment boundary. Poggle can then raise an error if that condition is not met. This matters more with computed offsets (another issue for the future).

There may be merit in adding a simple form to express padding as well, but I can't think of an example.
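
The alignment check itself is small. A minimal sketch, assuming offsets are counted in bytes from the start of the input; `aligned?`, `padding_for` and `match_aligned` are hypothetical names, not part of poggle.

```ruby
# True when `offset` sits on an `alignment`-byte boundary.
def aligned?(offset, alignment)
  (offset % alignment).zero?
end

# Number of padding bytes needed to reach the next boundary (0 if already aligned).
def padding_for(offset, alignment)
  (alignment - offset % alignment) % alignment
end

# Hypothetical check for something like `Data: byte{4} aligned 4`:
# raise instead of matching when the offset is misaligned.
def match_aligned(bytes, offset, size, alignment)
  unless aligned?(offset, alignment)
    raise ArgumentError, "offset #{offset} is not #{alignment}-byte aligned"
  end
  bytes[offset, size]
end
```

`padding_for` is one possible building block for the padding form mentioned above: it says how many bytes would have to be skipped before an aligned rule could match.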

Support data at computed offsets

This one is hard.

It should be possible to write that an instance of a referenced rule must be found at a particular offset into the data stream.

Example:

A := 0x05:0x06:0x07:size\ASize:data\Data{size}:0x00
ASize: byte{2}
Data: byte

Offset: byte{2}
root := offset\Offset:A@offset

Things to think about:

- How to handle the data stream when doing this?
  - Given 00 01 02 03 04 05 06 07 08
  - If 04 05 are read due to an offset, which of the following should the data stream look like afterwards?
    - 00 01 02 03 06 07 08
    - 00 01 02 03 04 05 06 07 08
- Will have to permit arithmetic on the offset.
- How to handle data between $curr_loc and the offset?
- If there's a sequence of entries, should there be some nesting behavior?
- In pathological cases of sequential data/offset information, especially if the mapping isn't linear, this will definitely involve calling external functions.
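
To make the example grammar concrete, here is a sketch of one possible reading, under several assumptions that the issue leaves open: the offset is absolute from the start of the input, multi-byte values are big-endian, and reading at an offset does not remove bytes from the stream (the second of the two options above). All method names are hypothetical.

```ruby
# Interpret an array of bytes as a big-endian unsigned integer (byte order is an assumption).
def uint_be(byte_array)
  byte_array.reduce(0) { |acc, b| (acc << 8) | b }
end

# A := 0x05:0x06:0x07:size\ASize:data\Data{size}:0x00, matched at an absolute offset.
# Returns the data bytes, or nil when A does not match there.
def match_a_at(bytes, offset)
  return nil unless bytes[offset, 3] == [0x05, 0x06, 0x07]
  size_field = bytes[offset + 3, 2]
  return nil unless size_field.length == 2
  size = uint_be(size_field)
  data = bytes[offset + 5, size]
  return nil unless data.length == size && bytes[offset + 5 + size] == 0x00
  data
end

# root := offset\Offset:A@offset, leaving the underlying stream untouched.
def match_root(bytes)
  return nil if bytes.length < 2
  match_a_at(bytes, uint_be(bytes[0, 2]))
end

stream = [0x00, 0x04, 0xaa, 0xbb, 0x05, 0x06, 0x07, 0x00, 0x02, 0x11, 0x22, 0x00]
result = match_root(stream)
```

The two bytes `0xaa 0xbb` between the offset field and the target are simply skipped here, which is one answer to the "data between $curr_loc and the offset" question, not necessarily the right one.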

Real string support

cString, raw string, and Unicode string support doesn't really exist. Need to make that a thing.

Recursion causes stack overflow on parser creation

Example:

a := b:"1"
b := a:"0"|"2"

Poggle should be able to parse "2101010101010...." but instead overflows the stack during parser creation by greedily duplicating the mutually recursive rules.

a := a:1|0 also demonstrates this issue.
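
A common way to keep parser creation from recursing is to store references by name and resolve them only when needed. The sketch below illustrates that idea with hypothetical `Ref`, `Seq`, `Alt` and `Lit` structs; it is not how poggle represents rules.

```ruby
# A reference stores only the rule's name; the body is looked up on demand,
# so building the parser never expands a rule into itself.
Ref = Struct.new(:name, :rules) do
  def resolve
    rules.fetch(name)
  end
end

Seq = Struct.new(:parts)
Alt = Struct.new(:options)
Lit = Struct.new(:text)

rules = {}
# a := b:"1"
rules[:a] = Seq.new([Ref.new(:b, rules), Lit.new('1')])
# b := a:"0"|"2"
rules[:b] = Alt.new([Seq.new([Ref.new(:a, rules), Lit.new('0')]), Lit.new('2')])
```

This only addresses construction. Both rules are still left-recursive, so a naive recursive-descent matcher would loop at match time and would need separate handling, for example memoization or rewriting the recursion as repetition.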

Rewriting input streams

Ran into a major headache working with the instruction set for the MSP430 microcontroller. First, notice the instruction set layout on this page.

The poggle grammar for instruction forms with no operands is simple and rather pleasant:

byteOpFlag: bit{1}
pcOffset:bit{10}

JNE := b000
JEQ := b001
JNC := b010
JC  := b011
JN  := b100
JGE := b101
JL  := b110
JMP := b111

jumpCondition: bit{3} :=
  JNE | JEQ | JNC | JC | JN | JGE | JL | JMP

noOp := b001:jumpCondition:pcOffset

This correctly parses any jump-like instruction.

One operand instructions are a little trickier, building from the above:

RRC := b000
SWPB:= b001
RRA := b010
SXT := b011
PUSH:= b100
CALL:= b101
RETI:= b110

oneOpCode: bit{3} :=
  RRC | SWPB | RRA | SXT | PUSH | CALL | RETI

oneOpDestMode: bit{2}

destReg: bit{4}

oneOp := b000100:oneOpCode:byteOpFlag:oneOpDestMode:destReg

This parses one-op instructions correctly. It would be nice to parse the destination addressing mode into register-aware values (Register direct, Indexed, Register indirect, Indirect auto-increment, Symbolic, Immediate, Absolute, or one of the various constants), but that is an acceptable loss for now. It can even be done with a few dozen extra matching rules.

But when trying to parse out addressing types directly on two-operand instructions, the problem becomes very clear just from the structure:

twoOp := twoOpCode:sourceReg:twoOpDestMode:byteOpFlag:twoOpSourceMode:destReg

Because source, destMode, sourceMode, and dest are all interwoven, parsing out the addressing mode on two-operand instructions needs one matching rule per combination of values of the components that each mode/register pair spans. The rule count is the product of the sizes of those components, so it grows exponentially with the number of components. In this case, there need to be roughly sourceReg*twoOpDestMode*byteOpFlag*twoOpSourceMode*destReg matching rules, which comes out to about 1024 rules!

Alternatively, we can rewrite the input stream to look like

twoOpCode:byteOpFlag:twoOpSourceMode:sourceReg:sourceReg:twoOpDestMode:destReg:destReg

consuming each addressing mode/register pair at once to read out an "enriched" addressing mode. This brings the total number of matching rules back down to around 15 or so, roughly linear in the number of matching rules the gap spans.

Proposed syntax:

twoOp := twoOpCode:sourceReg:twoOpDestMode:byteOpFlag:twoOpSourceMode:destReg
twoOpRewrite $= twoOp =>
  twoOpCode:byteOpFlag:twoOpSourceMode:sourceReg:sourceReg:twoOpDestMode:destReg:destReg
twoOpPrime := twoOpCode:byteOpFlag:twoOpSourceModePrime:sourceReg:twoOpDestModePrime:destReg

with usage like

root := twoOpRewrite : twoOpPrime

where twoOpRewrite transparently rewrites the input it matches on.
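
The effect of such a rewrite on a single instruction word can be sketched outside of poggle. The field widths below are an assumption taken from the MSP430 two-operand instruction layout (the issue does not define twoOpCode, twoOpDestMode or twoOpSourceMode), and `split_two_op` and `rewrite_two_op` are hypothetical helpers.

```ruby
# Field widths in bits, in the order they appear in the instruction word.
# Assumed from the MSP430 two-operand layout; not stated in the grammar above.
TWO_OP_FIELDS = {
  twoOpCode: 4, sourceReg: 4, twoOpDestMode: 1,
  byteOpFlag: 1, twoOpSourceMode: 2, destReg: 4
}.freeze

# Order of fields in the rewritten stream, registers duplicated as proposed.
REWRITTEN_ORDER = %i[
  twoOpCode byteOpFlag twoOpSourceMode sourceReg sourceReg
  twoOpDestMode destReg destReg
].freeze

# Split a 16-bit word into named bit strings.
def split_two_op(word)
  bits = format('%016b', word)
  pos = 0
  TWO_OP_FIELDS.each_with_object({}) do |(name, width), fields|
    fields[name] = bits[pos, width]
    pos += width
  end
end

# Emit the fields in the rewritten order as one bit string.
def rewrite_two_op(word)
  fields = split_two_op(word)
  REWRITTEN_ORDER.map { |name| fields[name] }.join
end
```

Because each register appears twice in the rewritten stream, a rule like twoOpSourceModePrime can consume the mode together with one copy of the register while the following sourceReg still matches the other copy.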

Figure out how to handle endianness

There's no way to explicitly declare big or little endianness. Need to add that.
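
For reference, this is the difference an explicit declaration would have to capture, shown with Ruby's own `pack`/`unpack1` directives rather than any poggle syntax:

```ruby
raw = [0x01, 0x02].pack('C*')   # two bytes as they appear in the input

big    = raw.unpack1('n')       # 16-bit unsigned, big-endian    -> 0x0102
little = raw.unpack1('v')       # 16-bit unsigned, little-endian -> 0x0201
```

The same two input bytes yield different values, so any rule that turns byte{2} into a number needs to know which order applies.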

Size inference fails when using bit-values

A rule like
X := b00100100
works on its own, but when mixed with other rules like

X := b00101101
Y := 0x05
Z := X:Y

the size inference step fails on Z with a NoMethodError.

Already fixed, will have PR up in a jiffy.
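
For illustration only, and without claiming this is what the fix does, size inference over mixed literals becomes straightforward if every size is tracked in bits. `literal_bits` and `sequence_bits` are hypothetical names.

```ruby
# Size in bits of a literal token: "b..." is one bit per digit,
# "0x.." is four bits per hex digit.
def literal_bits(token)
  case token
  when /\Ab([01]+)\z/ then Regexp.last_match(1).length
  when /\A0x(\h+)\z/ then Regexp.last_match(1).length * 4
  else raise ArgumentError, "not a literal: #{token}"
  end
end

# The size of a sequence such as Z := X:Y is the sum of its parts, in bits.
def sequence_bits(tokens)
  tokens.sum { |t| literal_bits(t) }
end
```

Under this model X := b00101101 and Y := 0x05 are both 8 bits, so Z := X:Y is 16 bits.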

Unbounded-size expressions cause errors

When trying to parse

:functions:
buildShort(byte{2}): byte{2}
foo(byte{_}): buildShort

:components:
B: byte{_}
root := b\B

Poggle ends up attempting to call .force on an UnboundedSize in rule_body:14 while trying to compute the size of the argument to foo. Poggle would then end up building an AnyBytes out of this value, and AnyBytes matching has no termination condition. As a result, an expression like byte{_}:0x50 on input like 0x00 0x50 would consume all of the input for the bytes and then fail to match, when the correct behavior is to return a result like [[0x00], 0x50].

Future notes: Poggle should be able to handle {_} in both a greedy and a non-greedy manner, which may require a change to the syntax.

Friendly reminder to the maintainers to come up with a consistent way to express match results!
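
The greedy versus non-greedy distinction for byte{_}:0x50 can be sketched as follows. This assumes the terminator is a single byte and uses a hypothetical `match_until` helper; the result shape follows the [[0x00], 0x50] example above.

```ruby
# Match byte{_}:terminator at the start of `bytes`.
# Returns [[run_bytes, terminator], rest], or nil when no terminator exists.
def match_until(bytes, terminator, greedy: false)
  idx = greedy ? bytes.rindex(terminator) : bytes.index(terminator)
  return nil if idx.nil?
  [[bytes[0, idx], terminator], bytes.drop(idx + 1)]
end

lazy_result   = match_until([0x00, 0x50, 0x01, 0x50], 0x50)
greedy_result = match_until([0x00, 0x50, 0x01, 0x50], 0x50, greedy: true)
```

On the two-byte input from the issue both modes agree; they only differ when the terminator occurs more than once, as in the four-byte input used here.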

Scoping of variables is incorrect

A := v\byte{2}
B := v\byte{2}

root := A:B

This causes an error because poggle thinks v is being set in two different places, when A and B should each have their own scope.
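
A minimal sketch of the intended behavior, assuming one scope object per rule; the `Scope` class is hypothetical and not taken from poggle's source.

```ruby
# One Scope per rule; rebinding is only an error inside the same scope.
class Scope
  def initialize
    @vars = {}
  end

  def bind(name, value)
    raise ArgumentError, "#{name} is already set in this scope" if @vars.key?(name)
    @vars[name] = value
  end

  def [](name)
    @vars.fetch(name)
  end
end

scope_a = Scope.new               # scope for A := v\byte{2}
scope_b = Scope.new               # scope for B := v\byte{2}
scope_a.bind(:v, [0x01, 0x02])
scope_b.bind(:v, [0x03, 0x04])    # no conflict: different scope
```

Binding v twice inside the same scope still raises, which keeps the useful part of the current check.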

Permit comments in grammars

Consider:

# this is where the magic happens
root := 0x50:0x60:0x70:0x80{4}

Currently # lines are just... parsed like anything else.
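
A minimal sketch of a pre-pass that drops whole-line comments before the grammar is parsed. `strip_comment_lines` is a hypothetical helper, and choosing # as the comment marker simply follows the example above.

```ruby
# Drop lines whose first non-blank character is '#'. Trailing comments are left
# alone on purpose, since '#' could legitimately appear inside a string literal.
def strip_comment_lines(grammar)
  grammar.each_line.reject { |line| line.lstrip.start_with?('#') }.join
end

grammar = <<~GRAMMAR
  # this is where the magic happens
  root := 0x50:0x60:0x70:0x80{4}
GRAMMAR

cleaned = strip_comment_lines(grammar)
```

Handling comments in the grammar's own parser would give better error positions, but a pre-pass like this is the smallest change that makes the example parse.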

Literals should be able to be expressed as ranges

Ex:

A := 0x01-0x0f
root := A:0xff
