iximeow / poggle
Parser generator for binary data
Ex:
root := 0x07-0xff | 0x01
root := 0x01 | 0x03 | 0x06 | 0x10-0x20 ! 0x1f
To consider: operator precedence. | and & should probably have the same precedence as !, all applied left to right.
Also consider that negations can have ranges as well:
root := 0x00-0x0f | 0x04-0x0e
TL;DR: & == intersection, | == union, ! <set> == & negation <set>
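The set semantics in the TL;DR can be sketched in a few lines of Python. This is only an illustration, not poggle's implementation (poggle itself is Ruby); the names byte_range, apply_op, and evaluate are hypothetical, and the sketch assumes the proposal above, namely that all three operators share one precedence level and apply left to right.

```python
# Hypothetical sketch: byte sets with |, & and ! applied strictly left to right.
ALL_BYTES = set(range(0x00, 0x100))

def byte_range(lo, hi):
    """Inclusive range of byte values, e.g. 0x10-0x20."""
    return set(range(lo, hi + 1))

def apply_op(acc, op, operand):
    if op == "|":
        return acc | operand                  # union
    if op == "&":
        return acc & operand                  # intersection
    if op == "!":
        return acc & (ALL_BYTES - operand)    # intersection with the negation
    raise ValueError(f"unknown operator {op!r}")

def evaluate(first, rest):
    """Fold (operator, operand) pairs left to right with equal precedence."""
    acc = set(first)
    for op, operand in rest:
        acc = apply_op(acc, op, operand)
    return acc

# root := 0x01 | 0x03 | 0x06 | 0x10-0x20 ! 0x1f
root = evaluate({0x01}, [
    ("|", {0x03}),
    ("|", {0x06}),
    ("|", byte_range(0x10, 0x20)),
    ("!", {0x1F}),
])
```

With left-to-right evaluation the trailing ! 0x1f removes 0x1f from the whole union built so far, not just from the 0x10-0x20 range; for this particular example both readings give the same set.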
with the grammar
:components:
A := 0x04
B := 0x30
C := 0x24
D := (A|B):C
E := (A:B)|C
root := D:E
on the input
0x30 0x24 0x04 0x30
poggle inconsistently marks where to backtrack to, which results in an operation like @bytes[nil].
The error looks like
/toy/poggle/parser_parts/matcher.rb:11:in `[]': no implicit conversion from nil to integer (TypeError)
from /toy/poggle/parser_parts/matcher.rb:11:in `next'
from /toy/poggle/parser_parts/primitives/byte_body.rb:15:in `match'
from /toy/poggle/parser_parts/body_proxy.rb:11:in `match'
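For comparison, here is a minimal Python sketch of the grammar above in which every matcher receives the position explicitly and returns either the new position or None. It is not how poggle's matcher.rb works; byte, seq, and alt are hypothetical helpers, and alt is an ordered choice that commits to the first alternative that matches, which is a simplification. The point is only that a failed alternative can never leave the position undefined, because each alternative restarts from the position it was handed.

```python
# Hypothetical combinators: each matcher is a function (data, pos) -> new pos or None.
def byte(value):
    def match(data, pos):
        if pos < len(data) and data[pos] == value:
            return pos + 1
        return None
    return match

def seq(*parts):
    def match(data, pos):
        for part in parts:
            pos = part(data, pos)
            if pos is None:
                return None          # caller still holds its own start position
        return pos
    return match

def alt(*options):
    def match(data, pos):
        for option in options:
            end = option(data, pos)  # every option restarts from the same pos
            if end is not None:
                return end
        return None
    return match

A, B, C = byte(0x04), byte(0x30), byte(0x24)
D = seq(alt(A, B), C)
E = alt(seq(A, B), C)
root = seq(D, E)

end_position = root(bytes([0x30, 0x24, 0x04, 0x30]), 0)
```

On the input from the report, D consumes 0x30 0x24 and E consumes 0x04 0x30, so the whole input is matched.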
Something like
A: byte{2} aligned 4 := 0x06:0x50
that would only allow A to match if at a 4-byte alignment. Slightly more concrete example:
Data: byte{4} aligned 4
IndexingInfo: byte{3} := 0x01:_
root := (IndexingInfo:_:Data){_}
such that in future alterations of the grammar it's known that any Data must lie on the alignment boundary. It can then raise an error if that condition is not met. This applies more with computed offsets (another issue for the future).
There may be merit in adding a simple form to express padding as well, but I can't think of an example.
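The check that an aligned rule implies, and the padding a padding form would have to skip, can be sketched as below. This is a Python illustration under assumptions, not poggle code: is_aligned, padding_to_align, and match_aligned are hypothetical names, and offsets are assumed to be counted in bytes from the start of the input.

```python
# Hypothetical sketch of an "aligned N" constraint on a fixed-size rule.
def is_aligned(offset, alignment):
    return offset % alignment == 0

def padding_to_align(offset, alignment):
    """Number of bytes to skip so that the next read starts on the boundary."""
    return (-offset) % alignment

def match_aligned(data, offset, size, alignment):
    """Read `size` bytes at `offset`, erroring out if the offset is misaligned."""
    if not is_aligned(offset, alignment):
        raise ValueError(f"offset {offset} is not {alignment}-byte aligned")
    if offset + size > len(data):
        return None
    return data[offset:offset + size]
```

For Data: byte{4} aligned 4, a read at offset 4 succeeds, while a read at offset 5 raises instead of silently matching.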
This one is hard.
It should be possible to write that an instance of a referenced rule must be found at a particular offset into the data stream.
Example:
A := 0x05:0x06:0x07:size\ASize:data\Data{size}:0x00
ASize: byte{2}
Data: byte
Offset: byte{2}
root := offset\Offset:A@offset
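A rough Python sketch of what this grammar asks for is below. It makes several assumptions that the issue does not settle: two-byte fields are read big-endian, the offset is absolute from the start of the data stream, and A is read at that offset without consuming or removing anything from the stream. parse_a and parse_root are hypothetical names.

```python
import struct

# Hypothetical sketch of: A := 0x05:0x06:0x07:size\ASize:data\Data{size}:0x00
def parse_a(data, pos):
    if pos + 5 > len(data) or data[pos:pos + 3] != b"\x05\x06\x07":
        return None
    (size,) = struct.unpack_from(">H", data, pos + 3)   # ASize: byte{2}, assumed big-endian
    start = pos + 5
    end = start + size
    if end >= len(data) or data[end] != 0x00:            # trailing 0x00 must be present
        return None
    return data[start:end], end + 1

# Hypothetical sketch of: root := offset\Offset:A@offset
def parse_root(data):
    if len(data) < 2:
        return None
    (offset,) = struct.unpack_from(">H", data, 0)        # Offset: byte{2}, assumed big-endian
    return parse_a(data, offset)                         # jump to the offset, stream untouched

stream = bytes([0x00, 0x04, 0xAA, 0xAA, 0x05, 0x06, 0x07, 0x00, 0x02, 0x41, 0x42, 0x00])
parsed = parse_root(stream)
```

Here the two 0xAA bytes between the offset field and A are simply skipped over, which is one possible answer to the open questions in this issue, not a decision.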
Things to think about:
Given the data stream
00 01 02 03 04 05 06 07 08
if 04 05 are read due to an offset, which of the following should the data stream look like afterwards?
00 01 02 03 06 07 08
00 01 02 03 04 05 06 07 08
$curr_loc
and the offset?
cString, raw string, and unicode string support doesn't really exist. Need to make that a thing.
Example:
a := b:"1"
b := a:"0"|"2"
Should be able to parse "2101010101010...." but instead overflows the stack by greedily duplicating the mutually recursive rules.
a := a:1|0
also demonstrates this issue.
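One way to at least report this instead of overflowing the stack is to detect left recursion before parsing. The Python sketch below is an assumption about how that could look, not something poggle does: left_recursive_rules is a hypothetical helper, and its input is a hand-written map from each rule to the rules that can appear in its leftmost position.

```python
# Hypothetical sketch: find rules that can reach themselves in leftmost position.
def left_recursive_rules(first_refs):
    recursive = set()
    for rule in first_refs:
        seen = set()
        stack = list(first_refs[rule])
        while stack:
            current = stack.pop()
            if current == rule:
                recursive.add(rule)      # rule reaches itself without consuming input
                break
            if current in seen:
                continue
            seen.add(current)
            stack.extend(first_refs.get(current, ()))
    return recursive

# a := b:"1"   and   b := a:"0"|"2"   (mutual left recursion)
mutual = left_recursive_rules({"a": {"b"}, "b": {"a"}})

# a := a:1|0   (direct left recursion)
direct = left_recursive_rules({"a": {"a"}})
```

Detection only turns the stack overflow into a diagnosable error; actually parsing "2101010..." would additionally need the recursion rewritten as iteration or a parser that supports left recursion.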
Ran into a major headache working with the instruction set for the MSP430 microcontroller. First, notice the instruction set layout on this page.
The poggle grammar for instruction forms with no operands is simple and rather pleasant:
byteOpFlag: bit{1}
pcOffset:bit{10}
JNE := b000
JEQ := b001
JNC := b010
JC := b011
JN := b100
JGE := b101
JL := b110
JMP := b111
jumpCondition: bit{3} :=
JNE | JEQ | JNC | JC | JN | JGE | JL | JMP
noOp := b001:jumpCondition:pcOffset
where this correctly parses any jump-like instruction.
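For reference, the same field layout can be decoded by hand with shifts and masks. The Python sketch below assumes the 16-bit instruction word has already been assembled so that the grammar's leftmost bits are the most significant ones; decode_jump is a hypothetical name, and pcOffset is returned as the raw 10-bit value with no sign handling, since the grammar above does not specify any.

```python
# Hypothetical sketch of: noOp := b001:jumpCondition:pcOffset
JUMP_CONDITIONS = ["JNE", "JEQ", "JNC", "JC", "JN", "JGE", "JL", "JMP"]

def decode_jump(word):
    if (word >> 13) & 0b111 != 0b001:     # fixed b001 prefix
        return None
    condition = (word >> 10) & 0b111      # jumpCondition: bit{3}
    pc_offset = word & 0x3FF              # pcOffset: bit{10}, raw value
    return JUMP_CONDITIONS[condition], pc_offset

decoded = decode_jump(0x3C05)             # b001 : b111 : 0000000101
```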
One operand instructions are a little trickier, building from the above:
RRC := b000
SWPB:= b001
RRA := b010
SXT := b011
PUSH:= b100
CALL:= b101
RETI:= b110
oneOpCode: bit{3} :=
RRC | SWPB | RRA | SXT | PUSH | CALL | RETI
oneOpDestMode: bit{2}
destReg: bit{4}
oneOp := b000100:oneOpCode:byteOpFlag:oneOpDestMode:destReg
which parses one-op instructions correctly. It would be nice to parse out the destination addressing mode as the register-aware values Register direct, Indexed, Register indirect, Indirect auto-increment, Symbolic, Immediate, Absolute, or one of the various constants, but it's an acceptable loss for now. It can even be done with a few dozen extra matching rules.
But for trying to parse out addressing types directly on two operand instructions, the problem becomes very clear just from the structure:
twoOp := twoOpCode:sourceReg:twoOpDestMode:byteOpFlag:twoOpSourceMode:destReg
Because source, destMode, sourceMode, and dest are all interwoven, parsing out the addressing mode on two operand instructions needs a number of matching rules that is exponential in the number of interleaved components: one rule for every combination of values those components range over. So in this case, there needs to be roughly sourceReg*twoOpDestMode*byteOpFlag*twoOpSourceMode*destReg matching rules, which comes out to about 1024 rules!
Alternatively, we can rewrite the input stream to look like
twoOpCode:byteOpFlag:twoOpSourceMode:sourceReg:sourceReg:twoOpDestMode:destReg:destReg
so that each addressing mode/register pair is consumed at once to read out an "enriched" addressing mode. This brings the total number of matching rules back down to around 15 or so, roughly linear in the number of rules the gap spans.
Proposed syntax:
twoOp := twoOpCode:sourceReg:twoOpDestMode:byteOpFlag:twoOpSourceMode:destReg
twoOpRewrite $= twoOp =>
twoOpCode:byteOpFlag:twoOpSourceMode:sourceReg:sourceReg:twoOpDestMode:destReg:destReg
twoOpPrime := twoOpCode:byteOpFlag:twoOpSourceModePrime:sourceReg:twoOpDestModePrime:destReg
with usage like
root := twoOpRewrite : twoOpPrime
where twoOpRewrite transparently rewrites the input it matches on.
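The effect of the proposed rewrite can be mimicked in Python to see what twoOpPrime would receive. The field widths used here (4, 4, 1, 1, 2, 4 bits) are an assumption taken from the usual MSP430 two-operand layout, since the issue does not spell them out; split_two_op and rewrite_two_op are hypothetical names, and the output is a list of named fields rather than a real bit stream.

```python
# Hypothetical sketch of: twoOpRewrite $= twoOp => <reordered fields>
def split_two_op(word):
    """twoOp := twoOpCode:sourceReg:twoOpDestMode:byteOpFlag:twoOpSourceMode:destReg"""
    return {
        "twoOpCode": (word >> 12) & 0xF,        # assumed 4 bits
        "sourceReg": (word >> 8) & 0xF,         # assumed 4 bits
        "twoOpDestMode": (word >> 7) & 0x1,     # assumed 1 bit
        "byteOpFlag": (word >> 6) & 0x1,        # bit{1}, as declared earlier
        "twoOpSourceMode": (word >> 4) & 0x3,   # assumed 2 bits
        "destReg": word & 0xF,                  # bit{4}, as declared earlier
    }

REWRITTEN_ORDER = [
    "twoOpCode", "byteOpFlag",
    "twoOpSourceMode", "sourceReg", "sourceReg",
    "twoOpDestMode", "destReg", "destReg",
]

def rewrite_two_op(word):
    fields = split_two_op(word)
    return [(name, fields[name]) for name in REWRITTEN_ORDER]

rewritten = rewrite_two_op(0x4A25)
```

Each register appears twice so that a mode/register pair can be matched together as the enriched addressing mode and the register itself is still available afterwards.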
There's no way to explicitly declare big or little endianness. Need to add that.
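To make the ambiguity concrete, the same two bytes give different values depending on the byte order chosen. This short Python illustration uses only the standard int.from_bytes; it says nothing about which order poggle currently assumes.

```python
raw = bytes([0x01, 0x02])

as_big_endian = int.from_bytes(raw, "big")        # most significant byte first
as_little_endian = int.from_bytes(raw, "little")  # least significant byte first
```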
A rule like
X := b00100100
works on its own, but when mixed with other rules like
X := b00101101
Y := 0x05
Z := X:Y
the size inference step fails on Z with a NoMethodError.
Already fixed, will have PR up in a jiffy.
When trying to parse
:functions:
buildShort(byte{2}): byte{2}
foo(byte{_}): buildShort
:components:
B: byte{_}
root := b\B
Poggle ends up attempting to call .force on an UnboundedSize in rule_body:14, trying to compute the size of the argument to foo. Poggle would end up building an AnyBytes out of this value, and AnyBytes' matching isn't terminated. As a result, expressions like byte{_}:0x50 on input like 0x00 0x50 would consume all input for the bytes, then fail to match, when the correct behavior is to return a result like [[0x00], 0x50].
Future notes: Poggle should be able to handle {_} in a greedy and non-greedy manner, possibly requiring a change to the syntax?
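The difference between the two modes for an expression like byte{_}:0x50 can be sketched in Python as follows. This is an illustration under assumptions, not poggle's AnyBytes: match_any_then is a hypothetical helper, it only handles a single-byte terminator, and it does not require the terminator to be the last byte of the input.

```python
# Hypothetical sketch of byte{_}:<terminator> with a greedy / non-greedy switch.
def match_any_then(data, terminator, greedy=False):
    splits = range(len(data))            # index where the terminator is tried
    if greedy:
        splits = reversed(splits)        # longest unbounded run first
    for split in splits:
        if data[split] == terminator:
            return [list(data[:split]), terminator]
    return None

non_greedy = match_any_then(bytes([0x00, 0x50]), 0x50)
greedy = match_any_then(bytes([0x00, 0x50, 0x01, 0x50]), 0x50, greedy=True)
```

The non-greedy call reproduces the [[0x00], 0x50] result described above; the greedy call on the longer input keeps the first 0x50 inside the unbounded run.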
Friendly reminder to the maintainers to come up with a consistent way to express match results!
A := v\byte{2}
B := v\byte{2}
root := A:B
This causes an error thinking v is being set in two different places, when the scope should be different in A and B.
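The intended scoping can be sketched in Python by giving each rule its own binding table, so that a duplicate name is only an error inside one rule. Scope and match_rule are hypothetical names and the sketch ignores how poggle actually stores bindings.

```python
# Hypothetical sketch: one scope per rule, duplicates rejected only within a scope.
class Scope:
    def __init__(self, rule_name):
        self.rule_name = rule_name
        self.bindings = {}

    def bind(self, name, value):
        if name in self.bindings:
            raise NameError(f"{name} is already set in rule {self.rule_name}")
        self.bindings[name] = value

def match_rule(rule_name, label, data, pos, size):
    """Match `size` bytes at `pos` and bind them to `label` in a fresh scope."""
    scope = Scope(rule_name)
    scope.bind(label, data[pos:pos + size])
    return scope, pos + size

data = bytes([0x01, 0x02, 0x03, 0x04])
scope_a, pos = match_rule("A", "v", data, 0, 2)    # A := v\byte{2}
scope_b, pos = match_rule("B", "v", data, pos, 2)  # B := v\byte{2}
```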
Consider:
# this is where the magic happens
root := 0x50:0x60:0x70:0x80{4}
Currently # lines are just... parsed like anything else.
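A pre-pass that drops comment lines before the grammar is parsed might look like the Python sketch below. It assumes that only whole lines starting with # (after optional whitespace) are comments; trailing comments are left alone because the issue does not say how # should behave mid-line. strip_comments is a hypothetical name.

```python
# Hypothetical sketch: remove whole-line # comments from grammar text.
def strip_comments(grammar_text):
    kept = []
    for line in grammar_text.splitlines():
        if line.lstrip().startswith("#"):
            continue                      # skip comment-only lines
        kept.append(line)
    return "\n".join(kept)

grammar = "# this is where the magic happens\nroot := 0x50:0x60:0x70:0x80{4}"
cleaned = strip_comments(grammar)
```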
Ex:
A := 0x01-0x0f
root := A:0xff