asmjit / asmdb

Instructions database and utilities for X86/X64 and ARM (THUMB/A32/A64) architectures.

License: The Unlicense

JavaScript 100.00%

x86 x86-64 arm aarch64 database

asmdb's Introduction

Important

This project is discontinued.

AsmDB was a separate project that provided an instruction database for AsmJit and other tools, but primarily for AsmJit. Unfortunately, the database used by AsmJit was often out of sync, and it was almost impossible to tell which AsmDB revision a given AsmJit version had used. To solve this problem, AsmDB has been merged into AsmJit and now resides in the asmjit/db directory of the AsmJit project. This ensures that AsmJit's instruction tables are always in sync with AsmDB.

AsmDB

This is a public domain instruction-set database that contains the following architectures:

X86|X64 - Provided by x86data.js
A32|A64 - Provided by armdata.js

NOTE: Work is in progress to standardize the database across the various architectures; expect some data changes.

Data Files

Data files use the .js suffix and are require()d and interpreted as JavaScript; however, they are also parseable as JSON after locating the JSON-BEGIN and JSON-END marks and stripping the content outside of them. The database is meant to be readable and editable, so it should stay small and simple.
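The JSON extraction described above could be sketched as follows. This is a minimal sketch, not part of AsmDB: the helper name extractJson is hypothetical, and it assumes that each mark sits on its own (comment) line and that the marks are spelled exactly as written above. If the data files spell the marks differently, pass the actual spelling as the second and third argument.

```javascript
// Hypothetical helper, not part of AsmDB. Returns the parsed JSON payload
// found between the two marks in the text of a data file.
function extractJson(source, beginMark = "JSON-BEGIN", endMark = "JSON-END") {
  const begin = source.indexOf(beginMark);
  const end = source.indexOf(endMark);
  if (begin === -1 || end === -1 || end < begin) {
    throw new Error(`Marks ${beginMark}/${endMark} not found`);
  }

  // Assumption: each mark occupies its own line, so the payload starts on the
  // line after the begin mark and ends on the line before the end mark.
  const start = source.indexOf("\n", begin) + 1;
  const stop = source.lastIndexOf("\n", end);
  if (start === 0 || stop < start) {
    throw new Error("Marks must be on separate lines");
  }

  return JSON.parse(source.slice(start, stop));
}

// A tiny stand-in for a data file; real files are much larger.
const sample = [
  "module.exports =",
  "// JSON-BEGIN",
  '{ "instructions": [["mov", "W:r32, id/ud", "I", "B8+r id", "ANY"]] }',
  "// JSON-END",
  ";"
].join("\n");

console.log(extractJson(sample).instructions[0][0]);
```

Reading a real file would amount to passing the result of fs.readFileSync(path, "utf8") to the helper instead of the sample string.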

The database provides the following concepts:

architectures
- TODO: Better name
cpuLevels
- TODO: Better name
extensions
- List of available extensions, instructions can specify extension(s) in metadata
attributes
- List of available attributes, instructions can specify attribute(s) in metadata
specialRegs
- List of special registers (and their parts) that instructions can read/write to/from
shortcuts
- List of shortcuts that can be used inside an instruction's metadata; each shortcut expands to the value of its expand key
registers
- TODO: Better name and format
instructions
- List of all instructions in a tuple format

X86 Data Files

X86 data provides the following information about each X86/X64 instruction:

Instruction name
Instruction operand(s):
- Always specifies all possible operands for the given encoding & opcode
- Operands can optionally contain read/write information:
  - R: - The operand is read
  - W: - The operand is written
  - X: - The operand is read & written
  - W[A:B]: - Like W:, but specifies the first byte that is written (A) and how many bytes are written (B)
  - <...> - The operand (in most cases a register) is implicit and can be omitted
- AVX-512 options:
  - {k} - Instruction supports write-masking
  - {kz} - Instruction supports write-masking by zeroing
  - {er} - Instruction supports embedded rounding control
  - {sae} - Instruction supports suppress-all-exceptions feature
Instruction encoding and opcode as specified in X86/X64 instruction-set manuals
Additional information that specifies:
- Architecture required to encode / execute the instruction (ANY, X86, X64)
- Extension(s) required to execute the instruction (MMX, SSE2, AVX2, ...)
- Flags read/written by the instruction (CF, ZF, ... - R=Read, W=Written X=RW)
- Prefixes that can be used before the instruction:
  - LOCK - Lock prefix can be used
  - REP - Rep prefix can be used
- FPU (x87) flags:
  - FPU - The instruction is an FPU (x87) instruction
  - FPU_PUSH - The instruction pushes a value onto the FPU stack
  - FPU_POP - The instruction pops a value from the FPU stack
  - FPU_POP=2 - The instruction pops two values from the FPU stack
  - FPU_TOP=[+-]N - The instruction changes the top pointer of the FPU stack
- Volatility - a hint for instruction reordering and scheduling
  - VOLATILE - The instruction must not be reordered
- Privilege level:
  - PRIVILEGE=L[0-3] - The instruction's privilege level
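
The operand markers and metadata keys listed above lend themselves to simple string parsing. The sketch below is illustrative only: parseOperand and parseMetadata are hypothetical helpers, not AsmDB functions, and the property names they return are made up for this example. Only the uppercase markers documented above are recognized, and operands without a marker are reported with access set to null rather than guessing a default.

```javascript
// Hypothetical helpers for illustration; they are not part of the AsmDB API.

// Parses one operand such as "W:zmm {kz}" or "X:<eax>".
function parseOperand(text) {
  const op = { access: null, firstByte: -1, byteCount: -1,
               implicit: false, options: [], data: "" };

  // Collect AVX-512 options ({k}, {kz}, {er}, {sae}) and strip them.
  let s = text.replace(/\{(kz|k|er|sae)\}/g, (match, name) => {
    op.options.push(name);
    return "";
  }).trim();

  // Optional access prefix: "R:", "W:", "X:", or "W[A:B]:".
  const access = /^([RWX])(?:\[(\d+):(\d+)\])?:/.exec(s);
  if (access) {
    op.access = access[1];
    if (access[2] !== undefined) {
      op.firstByte = Number(access[2]);  // A: first byte written
      op.byteCount = Number(access[3]);  // B: number of bytes written
    }
    s = s.slice(access[0].length).trim();
  }

  // "<...>" marks an implicit operand.
  const implicit = /^<(.+)>$/.exec(s);
  if (implicit) {
    op.implicit = true;
    s = implicit[1];
  }

  op.data = s;
  return op;
}

// Splits a metadata string into NAME: true and KEY: "value" entries.
function parseMetadata(text) {
  const out = {};
  for (const token of text.trim().split(/\s+/)) {
    if (!token) continue;
    const eq = token.indexOf("=");
    if (eq === -1) out[token] = true;
    else out[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return out;
}

const operands = "X:<edx>, X:<eax>, r32/m32".split(",").map(parseOperand);
console.log(operands.map(op => `${op.access}:${op.data}`));
console.log(parseMetadata("X64 VOLATILE PRIVILEGE=L0"));
```

Splitting the operand list on commas is enough for the forms shown here; the Base API described next is the intended way to consume the data.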

Base API

The database itself provides a lot of information about each instruction, but since the DB is meant to be human-readable and editable, the information is not in the best form to be processed as is. AsmDB solves this issue by providing an API that can be used to index and query the information stored in these data files.

The API provides the following concepts:

ISA
- Used to index and retrieve information located in architecture data-files
- Provides ability to explore the ISA
- Provides query interface that can be used to query only specific instructions
Instruction
- Contains information about a single instruction, as specified in vendor-specific architecture reference manual.
Operand
- Contains information about a single operand

The AsmDB API provides base interfaces for these concepts, and each architecture then provides ISA-dependent versions of these.

X86 API

AsmDB's asmdb.x86.ISA is the interface used to index and access the ISA. The following snippet shows a basic usage of it:

// Create the ISA instance populated by default x86 data.
const asmdb = require("asmdb");
const isa = new asmdb.x86.ISA();

// Returns an array of instruction names stored in the database:
console.log(isa.instructionNames);

// Iterates over all instructions in the database. Please note that instructions
// that have different operands but the same name (or different encodings) will
// appear multiple times as specified in the x86/x64 manuals. The `inst` is an
// `asmdb.x86.Instruction` instance.
isa.instructions.forEach(function(inst) {
  console.log(`Instruction '${inst.name}' [${inst.encoding}] ${inst.opcodeString}`);
}, this);

// Iterates over all instructions in the database, but groups instructions having
// the same name. It's similar to `instructions.forEach()`, but instead of providing
// a single instruction each time it provides an array of instructions sharing the
// same name.
isa.forEachGroup(function(name, insts) {
  console.log(`Instruction ${name}:`);
  for (var i = 0; i < insts.length; i++) {
    const inst = insts[i];
    console.log(`  [${inst.encoding}] ${inst.opcodeString}`);
  }
}, this);

// If iterators are not what you want, it's possible to get a list of instructions
// of the same name by using `query()`.
var insts = isa.query("mov");
for (var i = 0; i < insts.length; i++) {
  const inst = insts[i];
  console.log(`  ${inst.name} [${inst.encoding}] ${inst.opcodeString}`);
}

// You can implement your own iterator by using `instructions`, `instructionNames`,
// `instructionMap`, or `query()`:
const names = isa.instructionNames;
for (var i = 0; i < names.length; i++) {
  const name = names[i];
  const insts = isa.query(name);
  // ...
}

The snippet above only shows how to get instructions and list their basic properties. What is more interesting is accessing asmdb.x86.Instruction and asmdb.x86.Operand data.

const asmdb = require("asmdb");
const isa = new asmdb.x86.ISA();

// Get some instruction (the first in the group):
const inst = isa.query("vpunpckhbw")[0];
console.log(JSON.stringify(inst, null, 2));

// Iterate over its operands:
const operands = inst.operands;
for (var i = 0; i < operands.length; i++) {
  const operand = operands[i];
  // ...
}

The stringified instruction would print something like this (with added comments that describe the meaning of individual properties):

{
  "name": "vpunpckhbw",            // Instruction name.
  "arch": "ANY",                   // Architecture - ANY, X86, X64.
  "encoding": "RVM",               // Instruction encoding.
  "prefix": "VEX",                 // Prefix - "", "3DNOW", "EVEX", "VEX", "XOP".
  "opcode": "68",                  // A single opcode byte as a hex string, "00-FF".
  "opcodeInt": 104,                // A single opcode byte as an integer (0..255).
  "opcodeString":                  // The whole opcode string, as specified in manual.
    "VEX.NDS.128.66.0F.WIG 68 /r",
  "l": "128",                      // Opcode L field (nothing, 128, 256, 512).
  "w": "WIG",                      // Opcode W field.
  "pp": "66",                      // Opcode PP part.
  "mm": "0F",                      // Opcode MM[MMM] part.
  "vvvv": "NDS",                   // Opcode VVVV part.
  "_67h": false,                   // Instruction requires an address-size override prefix (67h).
  "rm": "r",                       // Instruction specific payload "/0..7".
  "rmInt": -1,                     // Instruction specific payload as integer (0-7).
  "ri": false,                     // Instruction opcode is combined with register, "XX+r" or "XX+i".
  "rel": 0,                        // Displacement (cb cw cd parts).
  "implicit": false,               // Uses implicit operands (registers / memory).
  "privilege": "L3",               // Privilege level required to execute the instruction.
  "fpu": false,                    // True if this is an FPU instruction.
  "fpuTop": 0,                     // FPU top index manipulation [-1, 0, 1, 2].
  "vsibReg": "",                   // AVX VSIB register type (xmm/ymm/zmm).
  "vsibSize": -1,                  // AVX VSIB register size (32/64).
  "broadcast": false,              // AVX-512 broadcast support.
  "bcstSize": -1,                  // AVX-512 broadcast size.
  "kmask": false,                  // AVX-512 merging {k}.
  "zmask": false,                  // AVX-512 zeroing {kz}, implies {k}.
  "sae": false,                    // AVX-512 suppress all exceptions {sae} support.
  "rnd": false,                    // AVX-512 embedded rounding {er}, implies {sae}.
  "tupleType": "",                 // AVX-512 tuple-type.
  "elementSize": -1,               // Instruction element size (used by broadcast).

  // Extensions required to execute the instruction:
  "extensions": {
    "AVX": true                    // Instruction is an "AVX" instruction.
  },

  // Instruction attributes
  "attributes": {
  },

  // Special registers accessed by the instruction.
  "specialRegisters": {
  },

  // Instruction operands:
  "operands": [{
    "data": "xmm",                 // The operand's data (processed).
    "reg": "xmm",                  // Register operand's definition.
    "regType": "xmm",              // Register operand's type (would differ if reg is "eax" for example).
    "mem": "",                     // Memory operand's definition.
    "memSize": -1,                 // Memory operand's size.
    "memOff": false,               // Memory operand is an absolute offset (only a specific version of MOV).
    "memSeg": "",                  // Segment specified with register that is used to perform a memory IO.
    "vsibReg": "",                 // AVX VSIB register type (xmm/ymm/zmm).
    "vsibSize": -1,                // AVX VSIB register size (32/64).
    "bcstSize": -1,                // AVX-512 broadcast size.
    "imm": 0,                      // Immediate operand's size.
    "immValue": null,              // Immediate value - `null` or `1` (only used by shift/rotate instructions).
    "rel": 0,                      // Relative displacement operand's size.
    "implicit": false,             // True if the operand is an implicit register (not encoded in binary).
    "read": false,                 // True if the operand is a read-op (R or X) from reg/mem.
    "write": true,                 // True if the operand is a write-op (W or X) to reg/mem.
    "rwxIndex": null,              // Read/Write (RWX) index.
    "rwxWidth": null               // Read/Write (RWX) width.
  }, {
    "data": "xmm",                 // ...
    "reg": "xmm",
    "regType": "xmm",
    "mem": "",
    "memSize": -1,
    "memOff": false,
    "memSeg": "",
    "vsibReg": "",
    "vsibSize": -1,
    "bcstSize": -1,
    "imm": 0,
    "immValue": null,
    "rel": 0,
    "implicit": false,
    "read": true,
    "write": false,
    "rwxIndex": -1,
    "rwxWidth": -1
  }, {
    "data": "xmm/m128",
    "reg": "xmm",
    "regType": "xmm",
    "mem": "m128",
    "memSize": 128,
    "memOff": false,
    "memSeg": "",
    "vsibReg": "",
    "vsibSize": -1,
    "bcstSize": -1,
    "imm": 0,
    "immValue": null,
    "rel": 0,
    "implicit": false,
    "read": true,
    "write": false,
    "rwxIndex": -1,
    "rwxWidth": -1
  }]
}
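
Given objects of the shape shown above, a compact one-line summary of an instruction can be derived from its operands and extensions. The following is a minimal sketch: operandAccess and summarize are hypothetical helpers, not part of AsmDB, and they rely only on the property names that appear in the dump (name, encoding, extensions, operands, data, read, write). The sample object is a hand-trimmed copy of that dump so the snippet runs without the asmdb package.

```javascript
// Hypothetical helpers; they use only property names shown in the dump above.

// Maps the read/write booleans of an operand back to an R/W/X marker.
function operandAccess(operand) {
  if (operand.read && operand.write) return "X";
  if (operand.write) return "W";
  if (operand.read) return "R";
  return "-";
}

// Builds a one-line description of an instruction object.
function summarize(inst) {
  const ops = inst.operands.map(op => `${operandAccess(op)}:${op.data}`);
  const exts = Object.keys(inst.extensions).join(",");
  return `${inst.name} ${ops.join(", ")} [${inst.encoding}] {${exts}}`;
}

// Trimmed-down copy of the instruction printed above.
const sampleInst = {
  name: "vpunpckhbw",
  encoding: "RVM",
  extensions: { AVX: true },
  operands: [
    { data: "xmm", read: false, write: true },
    { data: "xmm", read: true, write: false },
    { data: "xmm/m128", read: true, write: false }
  ]
};

console.log(summarize(sampleInst));
// prints "vpunpckhbw W:xmm, R:xmm, R:xmm/m128 [RVM] {AVX}"
```

With the real package, the same helper could be applied to any element returned by isa.query().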

ARM Database

TO BE DOCUMENTED...

asmdb's Issues

Support for jmp, call, and ret [far] instructions

hello,
as this project affects asmjit, I've created the issue here rather than in asmjit (because asmjit gets its instruction sets from here).
some instructions, like hlt, iret, and some others, are not implemented in asmjit.
if these were implemented, asmjit could be like nasm, which is used to write operating systems.
look at this and it should be easy to add these

keep consistency for instruction vpbroadcastb W:zmm {kz}, xmm/m8 to xmm[0]/m8

["vpbroadcastb"     , "W:zmm {kz}, xmm/m8"                          , "RM-T1S"  , "EVEX.512.66.0F38.W0 78 /r"        , "AVX512_BW"],

For consistency, it seems it should be xmm[0]/m8.

other reference:

["vpbroadcastb"     , "W:ymm {kz}, xmm[0]/m8"                       , "RM-T1S"  , "EVEX.256.66.0F38.W0 78 /r"        , "AVX512_BW-VL"],

Enough metadata for codegen?

Hi there!
I was looking at the database x86data.js and wondering whether the file has enough information to generate a proper x86/x64 code generator (assuming that the /0, ib, /r, etc. have to be "handcoded"). As it looks like you are using it for asmjit (for the generate-XXX.js), I believe that it should be ok, but I just want to be sure!
Thanks!

What does 0 and U mean on the flags?

Wondering what 0 and U mean on the metadata for the flags, as in:

OF=U SF=U ZF=U AF=U PF=U CF=U
OF=0 SF=W ZF=W AF=U PF=W CF=0

Also what do the lowercase x vs. uppercase X mean, and lowercase w and W?

x:~r8/m8,~r8

inconsistent naming of implicit x86 operands, e.g. <eax> and "eax"

It would be nice to standardize on one, e.g. the one without angle brackets.

Instructions missed implicit operands info

popa, popad pop 8 generals
pusha, pushad push 8 generals

Maybe need a new registers flag string?
seems "all" not an option, because call instruction and others maybe redefine the semantics of "all"
xx/yx/zx like series for cases?

And
vzeroall
vzeroupper
need a "all" kind of symbol to flag it.

armdata.js marks "blx label" as available in ARMv4

armdata.js marks "blx label" as available in ARMv4:

["blx" , "#RelS*4" , "T32", "1111|0|RelS[22]|RelS[19:10]|11|Ja|0|Jb|RelS[9:0]|0" , "ARMv4T+ IT=OUT|LAST"],
["blx" , "#RelS*2" , "A32", "1111|101|RelS[0]|RelS[24:1]" , "ARMv4+"],

but I used to work with ARM7TDMI and I think that did not have BLX, and here ARM states that "This instruction is available in all T variants of ARM architecture v5 and above."

Shouldn't it then be "ARMv5T+" in both cases? "bx register" seems to be correct.

possible typo in movq (and maybe movd is as well)

movq is the only instruction using this descriptor: r64[63:0]/m64
"movq" , "W:xmm[63:0], r64[63:0]/m64"

should this be just r64/m64 ? Especially since the MR variant looks like:
"movq" , "W:r64/m64, xmm[63:0]"

movd seems also suspicious:
"movd" , "W:r32[31:0]/m32, xmm[31:0]"
"movd" , "W:xmm[31:0], R:r32[31:0]/m32"

but the use of r32[31:0] seems to be more widespread.

call vs jmp format inconsistency

For (indirect) jmps the format is "D":

 ["jmp"              , "R:r32/m32"                                       , "D"       , "FF /4"                        , "X86 BND          Control=Jump"],
 ["jmp"              , "R:r64/m64"                                       , "D"       , "FF /4"                        , "X64 BND          Control=Jump"],

But for calls the format is "M":

["call"             , "R:r16/m16"                                       , "M"       , "66 FF /2"                     , "X86 BND          Control=Call OF=U SF=U ZF=U AF=U PF=U CF=U"],
 ["call"             , "R:r32/m32"                                       , "M"       , "FF /2"                        , "X86 BND          Control=Call OF=U SF=U ZF=U AF=U PF=U CF=U"],
 ["call"             , "R:r64/m64"                                       , "M"       , "FF /2"                        , "X64 BND          Control=Call OF=U SF=U ZF=U AF=U PF=U CF=

I think it should also be "M" for indirect jmps

inconsistent use of immediate operand placeholders

Example:
"add" , "x:al, ib/ub" , "I" , "04 ib"

The placeholders do not match: ib/ub vs ib.

On the other hand

"add" , "x:r16/m16, ib" , "MI" , "66 83 /0 ib"

uses ib consistently

suggestion for movss and movsd and possibly other similar case

Current movss is reflected in the table as:

    ["movss"            , "w:xmm[31:0], xmm[31:0]"                          , "RM"      , "F3 0F 10 /r"                  , "SSE"],
    ["movss"            , "W:xmm[31:0], m32"                                , "RM"      , "F3 0F 10 /r"                  , "SSE"],

Wouldn't it be more systematic to fold them into one entry:

    ["movss"            , "w:xmm[31:0], xmm[31:0]/m32"                          , "RM"      , "F3 0F 10 /r"                  , "SSE"],

There is also a strange asymmetry where the MR variant only has the W:m32 flavor. Not sure if this is an ISA quirk or a transcription error:

["movss"            , "W:m32, xmm[31:0]"                                , "MR"      , "F3 0F 11 /r"                  , "SSE"],

confusion around x86 "and" instructions

These two seem to conflict:

["and" , "X:r32/m32, id/ud" , "MI" , "81 /4 id" , "ANY _XLock OF=0 SF=W ZF=W AF=U PF=W CF=0"],
["and" , "X:r64, ud" , "MI" , "81 /4 id" , "X64 _XLock OF=0 SF=W ZF=W AF=U PF=W CF=0"],

remove instruction ltr r32/m16, ltr r64/m16

Hi kobalicek:
there are 3 ltr items in asmdb:

ltr"              , "R:r16/m16"                                   , "M"       , "66 0F 00 /3"                      , "ANY              Volatile PRIVILEGE=L0
ltr"              , "R:r32/m16"                                   , "M"       , "0F 00 /3"                         , "ANY              Volatile PRIVILEGE=L0
ltr"              , "R:r64/m16"                                   , "M"       , "REX.W 0F 00 /3"                   , "X64              Volatile PRIVILEGE=L0

The Intel manual says:

The operand-size attribute has no effect on this instruction.
In 64-bit mode, the operand size is still fixed at 16 bits. The instruction references a 16-byte descriptor to load the 64-bit base.

AMD manual says:
The operand size attribute has no effect on this instruction

I checked this with nasm and fasm. Both report an error.
nasm: illegal instruction
fasm: invalid size of operand.

suggestions for making the format field ("M", "RM", "MR", etc) more useful

For my project (based on asmdb) it has been very useful to locally rewrite the format field to satisfy the following invariant:

the number of characters in the format field equals the number of operands.
This is violated for implicit operands, e.g.:

 "div"              , "X:<edx>, X:<eax>, r32/m32"                       , "M"       , "F7 /6"

My suggestion would be to change the format field to something like: "xxM" where "x" represents an implicit operand.

There also seems to be a problem with these opcodes:

    ["mov"              , "w:r8, ib/ub"                                     , "I"       , "B0+r ib"                      , "ANY"],
    ["mov"              , "w:r16, iw/uw"                                    , "I"       , "66 B8+r iw"                   , "ANY"],
    ["mov"              , "W:r32, id/ud"                                    , "I"       , "B8+r id"                      , "ANY"],
    ["mov"              , "W:r64, iq/uq"                                    , "I"       , "REX.W B8+r iq"                , "X64"],

I believe the format should be "OI"

armdata.js marks one of the T32 forms of TST as available in ARMv4T+

["tst" , "Rn!=XX, #ImmC" , "T32", "1111|0|ImmC:1|0|0000|1|Rn|0|ImmC:3|1111|ImmC:8" , "ARMv4T+ IT=ANY APSR.NZC=W"]

Shouldn't that be ARMv6T2+?

See here, "These 32-bit Thumb-2 instructions are available in T2 variants of ARMv6 and above."

Cheers
Thomas

xlat and xlatb

I would suggest just having the xlat [es:zbx + al] signature and removing xlatb completely (it's an alias anyway).
