rizinorg / rz-hexagon

Hexagon disassembler code generator for Rizin from the official instruction manual.

rz-hexagon

This is a Hexagon disassembly and analysis plugin generator for rizin. It uses the LLVM target description source code of the Hexagon architecture and additional hand-written code.

Missing features and bugs

This plugin is under continuous development. Check the GitHub issues for missing features and bugs that are not fixed yet.

Prerequisites

Requirements

For formatting we need clang-format. If it is not available on your distribution, you can install it from https://apt.llvm.org/.
Python requirements are in requirements.txt
As a developer you also need black, flake8, reuse.

Hexagon Target Description

We take all the information about the Hexagon instructions and operands from the many LLVM target description files.

Luckily there is a tool which combines all the information of those files into one .json file which we name Hexagon.json. So Hexagon.json will hold all information about the Hexagon instructions and operands.

In order to generate the Hexagon.json file we need the llvm-tblgen binary.

Unfortunately llvm-tblgen is usually not provided via the package manager. You have to compile LLVM by yourself.

Build LLVM

Please follow the LLVM docs (Build the release version to save a lot of RAM).

llvm-tblgen should be in <somewhere>/llvm-project/build/bin/ after the build.

Please add this directory to your PATH.

Install

Python 3.11

We require Python 3.11. Please follow the install-instructions from the Python documentation.

Clone repository

git clone --recurse-submodules https://github.com/rizinorg/rz-hexagon.git
cd rz-hexagon/

Setup a virtual environment

python3 -m venv .venv
# Activate the virtual environment.
# This step might differ from shell to shell (the one below is for bash/zsh).
# Take a look at the Python docs if you are using another one.
# https://docs.python.org/3.11/library/venv.html?highlight=virtual%20environment
source .venv/bin/activate

Install rz-hexagon as package

pip3 install -r requirements.txt -r rzil_compiler/requirements.txt
# If you enjoy some colors
pip3 install -r optional_requirements.txt
# Install as develop package
pip3 install -e rzil_compiler/
pip3 install -e .

Generate PlugIn

The first time you run the generator you need to add the -j option. This will generate the Hexagon.json from the current LLVM source.

./LLVMImporter.py -j

It processes the LLVM definition files and generates C code in ./rizin and its subdirectories.

Copy the generated files to the rizin directory with

rsync -a rizin/ <rz-src-path>/

Test

You can run the tests with:

cd Tests
python3 -m unittest discover -s . -t .

Development info

Before you open a PR, please run the following commands and fix any warnings:

black -l 120 $(git ls-files '*.py')
flake8 --select=W504 --ignore=E203,W503 --max-line-length=120 $(git ls-files '*.py')
reuse lint

Coding info

The best way to start is to take a look at an instruction in Hexagon.json. We take all information from there and knowing the different objects makes it easier to understand the code.
If you need information about an LLVM-specific term or variable name from the Hexagon.json file, a simple grep -rn "term" llvm-project/llvm/lib/Target/Hexagon/ will usually help.

When you parse LLVM data, always end the if/elif chain with an else branch that raises an exception:

if x:
    ...
elif y:
    ...
elif z:
    ...
else:
    raise ImplementationException("This case seems to be new, please add it.")
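
As a concrete illustration of this pattern, here is a minimal, self-contained sketch. The ImplementationException class below is only a local stand-in for the one used in the repository, and classify_operand with its type-name suffixes is hypothetical, not code from rz-hexagon.

```python
class ImplementationException(Exception):
    """Local stand-in for the exception class of the same name in rz-hexagon."""


def classify_operand(llvm_type: str) -> str:
    """Map a (hypothetical) LLVM operand type name to a coarse category."""
    if llvm_type.endswith("Regs"):
        return "register"
    elif llvm_type.endswith("Imm"):
        return "immediate"
    else:
        # Never fall through silently: unknown LLVM data must be reported.
        raise ImplementationException(
            "This case seems to be new, please add it: {}".format(llvm_type)
        )
```

The point of the final else branch is that a new kind of object in Hexagon.json fails loudly during generation instead of producing silently wrong C code.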

Variables which hold data taken directly from the Hexagon.json file should have a name which starts with llvm_.

For example:
- llvm_in_operands holds a list with the content of Hexagon.json::[Instr].InOperandList.
- llvm_syntax holds: $Rdd8 = combine(#0,#$Ii) (the syntax in LLVM style).
- syntax holds: Rdd = combine(#0,#Ii) (cleaned up LLVM syntax)
- Instruction.operands is a dictionary which contains Register and Immediate Python objects.
Please take a brief look at the Rizin development guide if you plan to change C code.

Contributors

Rot127
Anton Kochkov
Florian Märkl

rz-hexagon's Issues

Missing analysis tests

The rizin tests for the analysis parts are missing. Because of this the tests in rizinorg/rizin#1338 fail.

At a minimum, they should test for:

Correct extension of immediate values by immext instructions.
PC relative jumps/calls (forwards and backwards)
Hardware loops and their jumps (forwards and backwards)
Correct function recognition.
Position of an instruction in an instruction packet

Optional

Cross references (maybe even for strings?)

Refactor code - tracker

This issue tracks code to refactor.

No representation of predicates in rizin

Rizin only supports conditionals which are not compatible with the predicate registers of the Hexagon architecture.

The Pu registers are interpreted like this:

If a scalar instruction uses it: The least significant bit determines the truth value. 1 = true, 0 = false.

If a vector instruction uses it: Each bit in the Pu register corresponds to a truth value of a single vector. 1 = true, 0 = false.

The condition on which the instruction is executed is stored in RzTypeCond RzAnalysisOp.cond. But none of the types in RzTypeCond matches the interpretation of the Pu register.
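
The two interpretations described above can be sketched in a few lines. This is only an illustration of the rule as stated in this issue. The function names are made up, and the default width of 8 bits assumes the 8-bit predicate register size mentioned in the register profile issue.

```python
def scalar_predicate(pu: int) -> bool:
    """Scalar use: only the least significant bit decides the truth value."""
    return bool(pu & 0x1)


def vector_predicate(pu: int, width: int = 8) -> list[bool]:
    """Vector use: every bit is an independent truth value (bit 0 first)."""
    return [bool((pu >> i) & 0x1) for i in range(width)]
```

A single RzTypeCond value can describe the scalar case at best. The vector case yields one truth value per bit, which is why no existing condition type fits.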

Analyse function preludes

Currently function preludes cannot be detected.
To enable this we need to implement RzList *hexagon_analysis_preludes(RzAnalysis *analysis) and set RzAnalysisPlugin.preludes.

It might be useful to mark packets during disassembly as prelude packet, if they contain an allocframe instruction.

Omit packet prefix if only one instruction is disassembled

In case only one instruction is disassembled the packet prefix should be omitted.

[0x00005000]> pi 1
\ jump 0x1209c

should become

[0x00005000]> pi 1
jump 0x1209c

To do this we have to track somehow whether a whole block is disassembled or only one instruction.
Also see the discussion here: rizinorg/rizin#1338 (comment)

Incorrect size of predicate registers in register profile

The predicate registers (P0-P3) have the incorrect size in the register profile set. The size is 8bit per register. Not 32bit.

rizin tests

Tracks all tests which need to be implemented.

Import symbol patching
HVX instructions and registers assembly
Relocation patching
Hardware loops and their jumps (forwards and backwards)

Remove manually added indentations `"...".format(indent, ...)` (clang-format-13 does it)

Simpler search for negative immediates

Searching for negative immediate values via /ai only works if they are written as unsigned int.
E.g.: to search for -0x8 type: /ai 0xfffffff8

It would be convenient to simply type /ai -0x8.
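
Until that works, the conversion can be done by hand. The following sketch assumes 32-bit immediates and uses made-up helper names. It only shows the two's complement conversion that turns -0x8 into the value /ai currently expects.

```python
def to_unsigned32(value: int) -> int:
    """Return the 32-bit two's complement encoding of a signed value."""
    return value & 0xFFFFFFFF


def search_command(value: int) -> str:
    """Build the /ai command string for a possibly negative immediate."""
    return "/ai {:#x}".format(to_unsigned32(value))


print(search_command(-0x8))  # /ai 0xfffffff8
```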

Virtual Python environment

Install instructions should install rz-hexagon in a virtual environment. Not the system one.

Syntax highlighting is broken

The syntax highlighting for jump instructions is a bit off at the moment.

In case of a jump instruction the first few characters get colored green.
This is not what we want in our case because:

the first character always indicates the packet position
the jump keyword does not always come first in the instruction, so the wrong characters get colored
instructions like loop do not get colored green at all.

Example:

I haven't checked it, but this probably needs to be fixed in the rizin core.

Easy way to import undocumented and missing instructions

LLVM does not define all instructions. E.g.: all system/Monitor instructions are missing (chapter 11.9.2 in the programmers reference manual v67)

It is also possible that there are some undocumented ones we would like to include.

Maybe add a .json file which has all those instructions?

Rename `HEX_OP_TEMPLATE_FLAG_IMM_DOUBLE_HASH`

The immediate info HEX_OP_TEMPLATE_FLAG_IMM_DOUBLE_HASH should be renamed to HEX_OP_TEMPLATE_FLAG_IMM_32_BIT. Since immediate values with two hashes are 32bit wide. Immediate values with only one hash have an arbitrary length.

Add Doxygen documentation to functions

Replace `RzList HexPkt.bin` with `RzVector`

Since HexPkt.bin only saves pointers and RzVector is the recommended data structure for those, it should be changed.

Add tests for HVX disassembly

The disassembly of HVX instructions is not tested yet.

In order to solve this issue one has to:

Add HVX instructions to handwritten/asm-tests/test-code.S which cover all kinds of HVX operands (single, double, quartile registers, immediate values).
Compile them with the Hexagon SDK, disassemble the binary with the SDKs objdump and compare the disassembly with the one rizin returns.
If successful, add the tests here: handwritten/asm-tests/hexagon

Note that the most recent public Hexagon SDK version only supports v62 instructions and previous ones.
But our instruction set also includes instructions until v68.
It would be great if someone finds a way to test those as well.

Allow multiple `endloop` packets per loop instruction

Multiple endloop packets can belong to a single loop instruction.

Example:

Currently the jump address of each endloop packet after the first one is set to 0x0. Therefore the analysis has errors.

The fix should be fairly simple: do not reset the static variables of the loop starts to 0x0 once an endloop instruction is disassembled.
See cases in hexagon_disas.c::hexagon_disasm_instruction().

Prerequisite: rizinorg/rizin#2073 should be merged.

Function analysis does not add complete packet to function.

The rizin function analysis stops analyzing the current function if it encounters a direct jump instruction. This is problematic in our case since the jump instruction is often located at the beginning of a packet. Hence the rest of the packet is not analyzed by the analysis code, although it is executed on a real processor.

Example:

This function should actually look like this:

Possible, but not very nice, solutions:

Back up the jump target if jump #Ii is disassembled.
Once the last instruction of this packet is disassembled, set its type to RZ_ANALYSIS_OP_TYPE_JMP and set RzAnalysisOp.jump = #Ii
Dig into the rizin analysis code and add an exception for the hexagon architecture, so it always disassembles until the end of a packet before interpreting the instructions (seems like way too much work).

Duplex instructions: High instr. is printed before low instruction

The SDK's decompiler prints the sub-instructions of a duplex in the order:

<low> ; <high>

The plugin prints them in the order:
<high> ; <low>

SDB file with calling convention is not generated and outdated

The calling convention in the current sdb file is not generated and out of date.
It only lists R0-R3 instead of R0-R5 and R1:0.

The task in one sentence:

Add method in LLVMImporter.py which generates cc-hexagon-32.sdb.txt in rizin/librz/analysis/d.

Remove duplicate imported instructions

Multiple system instructions which we imported before have meanwhile been added to LLVM.
We should check if the opcode bits match and remove the imported versions if so.
If they don't match we need to mark the previously IMPORTED instructions as UNDOCUMENTED.

List of potential duplicates.

[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_Rd_memw_locked_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_Rdd_memd_locked_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_barrier'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_brkpt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_dccleana_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_dccleaninva_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_dcfetch_Rs__u11_3'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_dcinva_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_dczeroa_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_diag0_Rss_Rtt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_diag1_Rss_Rtt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_diag_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_icinva_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_isync'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_l2fetch_Rs_Rt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_l2fetch_Rs_Rtt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_memd_locked_Rs_Pd__Rtt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_memw_locked_Rs_Pd__Rt'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_pause__u8'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_syncht'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_trace_Rs'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_trap0__u8'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_trap1_Rx__u8'
[*] Imported instruction was added to LLVM. Remove it if opcodes match. Instr.: 'IMPORTED_wait_Rs'

Generate FLIRT signatures

Rizin now supports FLIRT signatures (1, 2, 3).
We should generate them for the convenience of everyone.

Also, it would be nice to have the signatures of functions related to qurt or qurt services. Though this could be complicated or not possible from a copyright perspective since most of them are proprietary.

Compare results with other Hexagon disassemblers

Probably add more tests in https://github.com/rizinorg/rizin/blob/dev/test/db/asm/hexagon

Assign specific register types to HVX registers and control registers

The details are explained here: rizinorg/rizin#1287

In short:
HVX and Control Registers are marked as GPRs in the register profile. That is because rizin does not have vector or control register types.

Once they are implemented in rizin we should add them here as well.
An example is already implemented:

rz-hexagon/HardwareRegister.py

Lines 109 to 120 in 7df09f5

def get_rz_reg_type(self) -> str:
    return "gpr"
    # if self.is_vector:
    #     return "vcr"
    # elif self.is_control:
    #     return "ctr"
    # elif self.is_general:
    #     return "gpr"
    # elif self.is_guest:
    #     return "gst"
    # else:
    #     raise ImplementationException("Rizin has no register type for the register {}".format(self.llvm_type))

Add methods which wrap code into a c-block/function/switch-case

At the moment most C code is simply written sequentially into the files.
Because a lot of functions, switch cases etc. exist, there are a lot of duplicated strings.

That is ugly and annoying to read.

There should be methods which take for example:

a function name
Arguments + types
return type
code indent

and return the C-code.
Similar methods for comments/switch cases could be useful.
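
A minimal sketch of such a helper could look like the following. The name c_function, its parameters and the tab default for the indent are assumptions for illustration only, and nothing like it exists in the generator yet.

```python
def c_function(
    name: str,
    args: list[tuple[str, str]],
    ret_type: str,
    body: list[str],
    indent: str = "\t",
) -> str:
    """Wrap the given body lines into a C function definition."""
    # Each argument is a (type, name) pair; an empty list becomes "void".
    params = ", ".join("{} {}".format(t, n) for t, n in args) or "void"
    lines = ["{} {}({}) {{".format(ret_type, name, params)]
    lines += [indent + line for line in body]
    lines.append("}")
    return "\n".join(lines) + "\n"


print(c_function("add", [("int", "a"), ("int", "b")], "int", ["return a + b;"]))
```

Keeping the signature and braces in one place means the generator code only has to provide the body lines, and clang-format can still normalize the result afterwards.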

Add support for post-v62 instructions as well as HVX

Function argument passing incorrect.

This seems to only happen if the function signature is added with afs.

If a signature is added to an analyzed function, it falsely claims that the arguments are all given via the stack. Although they should be passed by register.
The stack arguments are all only 1byte in size. Although they should be at least 4.

Example:

┌ void sym.function (void *arg0, char *arg1);
│            ; arg void *arg0 @ R30+0x0
│            ; arg char *arg1 @ R30+0x1

Setup CI

CI should be set up for the following procedures:

Linting
Checking whether the files produced by ./LLVMImporter.py -j match the ones in the PR.

`endloopX` jumps always jump to `loopX` start. Not to the value in the `SA` register.

endloop instructions jump to the value in the SA0/SA1 register (if LC0/LC1 are greater than 0).

The values in SA0/SA1 can change during loop execution. Regardless of that, the jump targets of endloop instructions are always set to the original loop start. They will not change if the SA register is set again. This is technically incorrect.

We can't fix this because we can not emulate the code yet and it can't be detected during static analysis anyways.
So this will not be fixed but should be noted by anyone using this plugin.

This pattern often happens if multiple endloopX instructions belong to a single loopX instruction.

If you haven't done so already, you can check out the chapter about hardware loops in the programmers reference manual for a more in-depth explanation of loop behavior.

Instruction behavior as RZIL/p-code

It would be pretty awesome to have an ~~ESIL~~ RZIL/p-code representation of the instruction behavior.

~~ESIL~~ RZIL would allow us to emulate the code, whereas the p-code representation would allow us to use the decompiler.

To be considered

Where does the behavior of each instruction come from?
- Parsing it from the manual is tricky. The PDF has to be converted to .txt which introduces errors. Removing those during parsing is really, really annoying. On top of that we need a C-parser afterwards.
- Couldn't we get the instruction behavior from the QEMU src? Last time I checked they support instructions until v62 which would be fine for the beginning (E.g. the Pixel 2 has v62 processors. So this instruction set is not too old and probably covers most basic instructions).
ESIL seems to be reworked at the moment: rizinorg/rizin#1361

Some instructions have enormously complex behavior. Especially HVX instructions.

A quick example from the HVX manual illustrates this pretty well:

vtmp.h=vgather(Rt,Mu,Vvv.w).h

maps to:

MuV = MuV | (element_size-1);
Rt = Rt & ~(element_size-1);
for (i = 0; i < VELEM(32); i++) {
    for(j = 0; j < 2; j++) {
        EA = Rt+Vvv.v[j].uw[i];
        if (Rt <= EA <= Rt + MuV)
            TEMP.uw[i].uh[j] = *EA;
     }
}

There are also simpler ones but this seems to be the most complex we will get.

Add ARCHITECTURE.md

It would be nice to have an ARCHITECTURE.md file. It would describe the design of the plugin and the generator.
It should also give an overview of which Python object uses which other one etc.

It would be very helpful if someone just wants to understand the rizin plugin from a bird's-eye perspective.

Or which "hand-written" files contribute to which source files.

Refactor parse_instruction()

The methods for parsing a Duplex and a normal instruction are almost the same.
The method should be in InstructionTemplate from which DuplexInstruction and Instruction inherit.

Methods in question:

rz-hexagon/DuplexInstruction.py

Line 99 in 7df09f5

def parse_instruction(self) -> None:

rz-hexagon/Instruction.py

Line 106 in 7df09f5

def parse_instruction(self) -> None:

Compare buffered opcodes to current one at the same address

At https://github.com/rizinorg/rizin/blob/a689f0a86ee7218a12dc7ab5f00ff985d737019f/librz/asm/arch/hexagon/hexagon_arch.c#L780-L782

the asm plugin checks if an instruction at the given address is already buffered and returns it if so.
This can lead to incorrect textual disassembly if the opcode at this address changed and now encodes a different instruction than before.

The instruction should be disassembled again, if the buffered opcode doesn't match the given one.
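
The proposed check can be sketched as follows. This is Python for brevity although the plugin itself is C, and the InsnBuffer class and its callback are hypothetical. The buffer stores the opcode next to the text and only reuses an entry if both address and opcode match.

```python
class InsnBuffer:
    """Caches disassembled text per address together with its opcode."""

    def __init__(self, disassemble):
        self._disassemble = disassemble  # callable: (addr, opcode) -> str
        self._cache = {}  # addr -> (opcode, text)

    def get(self, addr: int, opcode: int) -> str:
        cached = self._cache.get(addr)
        if cached is not None and cached[0] == opcode:
            # Same address and same opcode: the buffered text is still valid.
            return cached[1]
        # Unknown address or changed opcode: disassemble again.
        text = self._disassemble(addr, opcode)
        self._cache[addr] = (opcode, text)
        return text
```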

Track the state and context of the asm and analysis plugin.

The Hexagon instructions have two properties which depend on the context they are in:

1. An instruction's immediate value can be extended, if the previous instruction is an immext() instruction.
Example:

s 0x00005174
pi 1
0x00005174      ?     R0 = ##0x10  ; Immediate = 0x10

s 0x00005170
pi 2
0x00005170      ?     immext(##0x100)  ; 0x100 is added to the next immediate
0x00005174      ?     R0 = ##0x110  ; Immediate = 0x110

2. The position of an instruction in an instruction packet depends on the prior instruction.

Rizin marks the position of an instruction in a packet like this:

0x0000516c      [     R1 = ##0x77         ;  #1 and last in packet
0x00005170      /     immext(##0x1c00)    ;  #1 in packet
0x00005174      |     R11 = ##loc.start_  ;  #2 in packet
0x00005178      |     immext(##0x1c00)    ;  #3 in packet
0x0000517c      \     R12 = ##loc.start_  ;  #4 in packet

The position is determined by the parsing bits in the instruction opcode.

0x0000516c      [     R1 = ##0x77         ;  Parsing bits = 0b11 => Last
0x00005170      /     immext(##0x1c00)    ;  Parsing bits = 0b01 => Not last. But first in new packet because prev. instr. was last
0x00005174      |     R11 = ##loc.start_  ;  Parsing bits = 0b01 => Not last
0x00005178      |     immext(##0x1c00)    ;  Parsing bits = 0b01 => Not last
0x0000517c      \     R12 = ##loc.start_  ;  Parsing bits = 0b11 => Last
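
The mapping from parsing bits to packet markers can be sketched like this. Two things are assumptions taken from the reference manual rather than from this issue: that the parsing bits are bits 15:14 of the 32-bit instruction word, and that a duplex (0b00) also ends a packet. The function names are made up.

```python
PARSE_LAST = 0b11    # last instruction in the packet
PARSE_DUPLEX = 0b00  # duplex; assumed here to also end a packet


def parse_bits(word: int) -> int:
    """Extract the parsing bits (assumed to be bits 15:14) from a word."""
    return (word >> 14) & 0b11


def packet_prefixes(words: list[int]) -> list[str]:
    """Return the packet marker for each word of a sequential listing."""
    prefixes = []
    first = True  # the next instruction starts a new packet
    for word in words:
        last = parse_bits(word) in (PARSE_LAST, PARSE_DUPLEX)
        if first and last:
            prefixes.append("[")
        elif first:
            prefixes.append("/")
        elif last:
            prefixes.append("\\")
        else:
            prefixes.append("|")
        first = last
    return prefixes
```

Note that the marker of a word depends on the word before it, which is exactly the context the disassemble() function does not get.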

In both cases the asm plugin needs to know the three previous instructions in the binary to print the correct assembly.
But this is a problem in rizin.

Rizin disassembles an instruction by simply giving four bytes to the disassemble() function. The disassemble() function has no way to peek into the previous instructions in the binary.
The same happens during analysis. The analysis function (hex_op6_op()) will get four bytes which it passes to disassemble() and adds some attributes afterwards.

Problem

At the moment static variables track the new packet start (new_pkt_starts), the previous address (previous_addr) and the last constant_extender.

Though this becomes a problem if the analysis and disassembly function are not called sequentially. The analysis function can be called on offset 0xffffee, the disassembly function on offset 0x0c.

The static variables will hold inconsistent values as a consequence.
(Inconsistent static variable content was also the reason for the failing tests in rizinorg/rizin#1614).

Possible solutions I can think of

Add the ability to peek into the previous three instructions in the binary (is against the rizin architecture).
Add some kind of struct to the plugins which track the last instructions disassembled by the analysis and asm plugin separately (needs more memory and quite a lot of memcpy()).
Peek into the buffer given to disassemble(), not into the binary. Problem with instruction packets which span over multiple block sizes.

Track instruction packet membership in more detail

Hexagon instructions can belong to instruction packets. To mark those packets the assembly gets prefixed (with /, \, |) like this:

; two instruction packets.
/ immext(##0x1c00)
| R11 = ##loc.start_pc
| immext(##0x1c00)
\ R12 = ##loc.start_sp
/ R15 = asl(R15,R2)
\ memw(R13+R2<<#0x2) = R5

Here is how the plugin assigns the correct packet prefix to the assembly:

The last instruction in a packet has certain bits set. (See reference manual v67 Chapter 10.5).
- Prefix it with \, set new_pkt_starts = true
A new packet starts (because the previous instruction was the last instruction in its packet.)
- The next instruction is prefixed with a /
Each following instruction is prefixed with a |, if it is neither the last nor the first instruction.

The logic is implemented in hex_set_pkt_info().

Unfortunately this causes some problems.
Assume the correctly disassembled instruction packet:

> s 0x00005170
> pi 4
0x00005170      /     immext(##0x1c00)    ;  #1 in packet
0x00005174      |     R11 = ##loc.start_  ;  #2 in packet
0x00005178      |     immext(##0x1c00)    ;  #3 in packet
0x0000517c      \     R12 = ##loc.start_  ;  #4 in packet

now do:

> s 0x00005174
> pi 1
/     R11 = ##loc.start_  // Is marked as #1, but is #2 in packet

Bug: R11 = ##loc.start_ should be prefixed with a | but has a / because the last disassembled instruction (R12 = ##loc.start_ at 0x0000517c) was marked as the last instruction in the packet.

To solve this the plugin somehow needs to peek into the previous instruction in memory. This way it could check, whether it is indeed the first one.
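
The bug can be reproduced with a small model of the state tracking. The PktState class below is a simplified, hypothetical stand-in for the logic in hex_set_pkt_info(). It keeps a single flag like new_pkt_starts and knows nothing about addresses.

```python
class PktState:
    """Mimics one static flag remembering whether a new packet starts."""

    def __init__(self):
        self.new_pkt_starts = True

    def prefix(self, is_last: bool) -> str:
        if self.new_pkt_starts and is_last:
            result = "["
        elif self.new_pkt_starts:
            result = "/"
        elif is_last:
            result = "\\"
        else:
            result = "|"
        # The next call only sees this flag, not where it was set.
        self.new_pkt_starts = is_last
        return result


state = PktState()
# The four instructions at 0x5170..0x517c; only the fourth is the last one.
print([state.prefix(last) for last in (False, False, False, True)])
# Re-disassembling 0x5174 alone now yields "/" although it is #2 in the packet.
print(state.prefix(False))
```

Because the flag was left set by the instruction at 0x0000517c, the second call sequence starts from the wrong state. Any fix has to tie the state to addresses or re-derive it from the preceding instruction.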

Missing double control registers

Registers C21:20 until C29:28 are not disassembled correctly (<err> appears in the assembly).
The LLVM definitions do not define them and therefore they don't appear in the generated code.

We can create a pull request on the LLVM repository or just wait until the LLVM developers add them.

We should not insert the register definitions into the generated rizin files. This will only increase the complexity and make the code less clear.

Yet I can't judge how much of a problem this is and how much time fixing it is worth. Does code in the wild even use those double registers?

Files are not written if they differ only in blank characters

The generator only writes a generated file if the content differs in any non blank character.
This is a problem if formatting changes with blanks were made.
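
The difference between the described behavior and a stricter check can be sketched as below. Both functions are illustrative only and are not the generator's actual comparison code.

```python
def differs_ignoring_blanks(old: str, new: str) -> bool:
    """Behavior as described above: whitespace is ignored when comparing."""
    return "".join(old.split()) != "".join(new.split())


def differs_exactly(old: str, new: str) -> bool:
    """Stricter alternative: any change, including whitespace, counts."""
    return old != new


old, new = "int a;\n", "int  a;\n"
print(differs_ignoring_blanks(old, new))  # False: the file would not be rewritten
print(differs_exactly(old, new))          # True: the file would be rewritten
```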

Replace Duplex instruction generation with sub-instruction focused approach

Duplex instructions are currently generated by taking all sub-instructions and building as many permutations from them as the grouping constraints allow (see: chapter 10.3 in the v67 manual).

This approach has a lot of drawbacks and no advantages (more code needed for generation, more C code output, longer runtimes for generation and several more).

A possible replacement would be:

Remove Duplex instruction code (in rizin and here).
Generate struct templates for sub-instructions.
Add a C function which is called if the parse bits are 0b00.
The asm string is simply concatenated from the high and low sub-instruction.
HexInsn is filled with the values of the high and low sub-instruction.

Main task: hex_disasm_with_templates for duplex instructions is needed.

Some implementations details could be a bit tricky.

IDs are unique for each instruction. But duplex instructions need some way to merge the IDs of both sub instructions or hold them both.

endloop01 packets miss the third way of branching.

endloop01 packets have three ways to branch:

jump to loop0 target.
jump to loop1 target.
next instruction.

Quite naturally basic blocks in rizin only support two branches (jump and fail) but endloop01 needs a third option.

Seeing that it is a lot of work to alter basic branch logic in rizin, this issue has low priority.

Incorrect function recognition because of predicated jumps

Predicated jumps let the analysis of rizin believe that a function ends earlier than it actually does. ~~That is maybe because RzAnalysisOp.cond gets not set (reason: #29).~~

Example:

Import/Export symbol patching is missing

While roughly 50% of the relocations were implemented in rizinorg/rizin@1c22476, the patching for imported symbols is still missing.

Write LLVM commit hash and date into header of source files

@thestr4ng3r suggested that it would be a nice idea to add the commit hash of LLVM, the date of the commit and a timestamp of generation somewhere in the source.
This way it could easily be seen how up to date the generated code is.

The information should be written as preamble (just like the license information) to all source files.

Hexagon.json unfortunately does not have this information. So those are the ways I can think of how to achieve this.

Ask the user to give the commit hash and date to LLVMImporter.py as arguments.
- Easy but annoying for the user.
Alter llvm-tblgen to include the commit hash and date into the generated json file.
- Way more work, but less annoying for the user.
Write commit hash into a file and update it every time a new Hexagon.json is used by hand.
- Easy for the user who does not use a new Hexagon.json, but we should not forget to update it.
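
Whichever option is chosen, writing the preamble itself is simple. The sketch below assumes the hash and date are already known, for example passed in as arguments as in the first option. The function name, the comment layout and the example hash are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def llvm_preamble(
    commit_hash: str, commit_date: str, generated_at: Optional[datetime] = None
) -> str:
    """Return C comment lines recording the LLVM commit and generation time."""
    if generated_at is None:
        generated_at = datetime.now(timezone.utc)
    # The timestamp is expected to be in UTC; the suffix is written literally.
    stamp = generated_at.strftime("%Y-%m-%d %H:%M:%S UTC")
    return (
        "// LLVM commit: {}\n"
        "// LLVM commit date: {}\n"
        "// Date of code generation: {}\n"
    ).format(commit_hash, commit_date, stamp)


# "0123abc" is a placeholder, not a real LLVM commit.
print(llvm_preamble("0123abc", "2022-01-01"))
```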

Register sizes >256bit are not supported by rizin

Several commands (arA, ar all) fail with the error:
rz_reg_get_value: Bit size 2048 not supported (or 1024,4096 bit).

That is because the maximum size of a register supported by rizin is 256 bit.

It could be fixed by either:

Using the bignum library in rzutil
or waiting until RZIL is finished and use some bitvector?

RxIn registers have `REG_OUT` flag set.

Rx and RxIn are set separately in hi->ops although they are the same register (Rx registers are used for input and output).
But the register attribute for RxIn is set to HEX_OP_REG_OUT although it is an "in" register.

The attribute is falsely assigned here:

rz-hexagon/InstructionTemplate.py

Lines 217 to 218 in 7363e59

if op.is_out_operand:
    code += "hi->ops[{}].attr |= HEX_OP_REG_OUT;\n".format(op.syntax_index)

Currently the attributes are not used. So there is not much damage.

Disassemble 0x00000000 to <unknown>; not to a valid duplex instruction.

At the moment the disassembler maps 0x00000000 to R0 = memw(R0+#0x0) ; R0 = memw(R0+#0x0).

It should be <unknown> or <invalid>.

Set more RzAnalysisOp members

The RzAnalysisOp struct has many useful members which we do not set at the moment. Although we have most of the information from the LLVM definitions.

Members to consider

| Done | RzAnalysisOp member | Note |
|------|---------------------|------|
| | `RzAnalysisOpPrefix prefix;` | conditional, likely, unlikely |
| | `RzAnalysisStackOp stackop;` | operation on stack? Does LLVM have this information? |
| | `RzTypeCond cond;` | condition type |
| ✓ | `int size;` | always 4 bytes |
| | `int nopcode;` | number of bytes representing the opcode (not the arguments). Useful? |
| | `int cycles;` | Seems to be stored somewhere in an LLVM instr., but in some anonymous objects |
| | `RzAnalysisOpFamily family;` | At least distinguish between float and non float (there is an LLVM flag for that.) |
| ✓ | `int id;` | = HexInsn.instruction |
| ✓ | `bool eob;` | end of block - Only set for non conditional jump instructions. |
| | `bool sign;` | operates on signed values, false by default |
| | `RzAnalysisOpDirection direction;` | rwx flags and reference flag. Could determine that by checking whether in or out operands are present. |
| ✓ | `st64 ptr;` | reference to memory - Set at least for all jmp/call instructions |
| ✓ | `ut64 val;` | reference to value - see: #13 |
| | `int ptrsize;` | pointers are always 32bit in Hexagon |
| | `st64 stackptr;` | ? |
| | `int refptr;` | ? |
| | `RzAnalysisValue *src[3];` | Definitely set that. Although the src array needs to be extended to 5 or 6 (Hexagon has up to 6 operands per instruction) |
| | `RzAnalysisValue *dst;` | As above |
| | `RzList *access;` | RzAnalysisValue access information |
| | `RzStrBuf esil;` | see: #12 |
| | `RzStrBuf opex;` | ? What is this doing? |
| | `const char *reg;` | /* destination register */ |
| | `const char *ireg;` | /* register used for indirect memory computation */ |
| | `int scale;` | ? What is this doing? |
| | `ut64 disp;` | ? What is this doing? |
| | `RzAnalysisSwitchOp *switch_op;` | ? What is this doing? |
| | `RzAnalysisHint hint;` | Seems useful |
| | `RzAnalysisDataType datatype;` | Int32, float, string, array (HVX!) etc. |

Replace imported system registers with the LLVM generated one.

LLVM added the system registers to their definition files and four more instructions which assign values to them (Y4_tfrscpp, Y4_tfrspcp, Y2_tfrscrr, Y2_tfrsrcr).

We should remove our imported versions of them and use the LLVM generated.
It removes complexity and assigns the correct identifiers to those four instructions.

It should be enough to delete the corresponding instructions in import/instructions and the registers in import/registers, generate a new Hexagon.json and run the generator.

1. Run ./LLVMImporter.py
2. Copy files to rizin: rsync -a rizin/ <rz-src-path>/
3. Build rizin
4. If build errors occurred: Fix them in rz-hexagon and repeat.
5. Run tests. On error fix in rz-hexagon

This is far from optimal or convenient and, even worse, error prone.
If one small change in rz-hexagon is not pushed to the rizin PR, the source code is out of sync.

=> We should have a script which automates the process above.

Here are some ideas.

Enforce same branch names between rizin repo and rz-hexagon repo (for simpler reference).
Add PR tests in Github to compare files between repos (if possible).
Write helper script to do step 1-5 for the user and report the build errors.
Script to sync files and commit on both repos with the same commit message.
