OPC_Decode is a specialized OPC_TryDecode. The difference between them
is that OPC_TryDecode performs a "completeness check", while OPC_Decode
asserts that the check passes.
The check is just a boolean test, which is nothing compared to the
complexity of the decoding process, so there is no point in having a
special opcode that optimizes the check.
Eliminate `doesOpcodeNeedPredicate` and instead have
`emitPredicateMatch` return true if any predicates were generated.
Delegate actual predicate generation in `emitPredicateMatch` to
`SubtargetFeatureInfo::emitMCPredicateCheck`. Additionally, remove the
redundant parenthesis around the predicate conditions in the generated
`checkDecoderPredicate` function.
Note that for ARM/AMDGPU this reduces the total # of predicates
generated by a few. It seems the old code would sometimes generate
duplicate predicates which were identical in semantics but one had an
extra pair of parentheses (i..e, `X` and `(X)`). `emitMCPredicateCheck`
does not seems to have that issue.
Several code cleanup changes in code to emit decoder tables:
- Start comments on each line at a fixed column for readibility.
- Combine repeated code to decode and emit ULEB128 into a single
function.
- Add helper `getDecoderOpName` to print decoder op.
- Print Filter/CheckField/predicate index values with those opcodes.
Do not exit when the first decoding conflict is encountered. Instead
record the conflict and continue to report any additional decoding
conflicts and exit fatally after all instructions have been processed.
This change introduces OPC_Scope opcode, whose only purpose is to record
a continuation point to resume at if a subsequent opcode fails.
Each OPC_Scope pushes an entry onto the scope stack; an entry is popped
if an opcode in the scope fails.
Previously, we recorded this information on several opcodes, it has been
removed. A series of such opcodes often referred to the same
continuation point; this information is now recorded in one place,
reducing table sizes in most cases. Average reduction is 1.1%, some
table observe up to 7% reduction in size.
The new behavior of those opcodes is "check or leave scope". If we're in
the outermost scope (scope stack is empty), they act as "check or fail".
There is one opcode, OPC_FilterValueOrSkip that behaves like the old
OPC_FilterValue. It is special because it acts as a case of a switch
statement and has nothing to do with scopes. (If a case fails, we should
try the next case instead of leaving the current scope.)
There are two classes of operands that DecoderEmitter cannot currently
handle:
1. Operands that do not participate in instruction encoding.
2. Operands whose encoding contains only 1s and 0s.
Because of this, targets developed various workarounds. Some targets
insert missing operands after an instruction has been (incompletely)
decoded, other take into account the missing operands when printing the
instruction. Some targets do neither of that and fail to correctly
disassemble some instructions.
This patch makes it possible to decode both classes of operands and
allows to remove existing workarounds.
For the case of operand with no contribution to instruction encoding,
one should now add `bits<0> OpName` field to instruction encoding
record. This will make DecoderEmitter generate a call to the decoder
function specified by the operand's DecoderMethod. The function has a
signature different from the usual one and looks like this:
```
static DecodeStatus DecodeImm42Operand(MCInst &Inst, const MCDisassembler *Decoder) {
Inst.addOperand(MCOperand::createImm(42));
return DecodeStatus::Success;
}
```
Notably, encoding bits are not passed to it (since there are none).
There is nothing special about the second case, the operand bits are
passed as usual. The difference is that before this change, the function
was not called if all the bits of the operand were known (no '?' in the
operand encoding).
There are two options controlling the behavior. Passing an option
enables the old behavior. They exist to allow smooth transition to the
new behavior. They are temporary (yeah, I know) and will be removed once
all targets migrate, possibly giving some more time to downstream
targets.
Subsequent patches in the stack enable the new behavior on some in-tree
targets.
Move `InsnBitWidth` template into anonymous namespace in the generated
code and move template specialization of `InsnBitWidth` to anonymous
namespace as well, and drop `static` for them. This makes `InsnBitWidth`
completely private to each target and fixes the "explicit specialization
cannot have a storage class" warning as well as any potential linker
errors if `InsnBitWidth` is kept in the `llvm::MCD` namespace.
This change adds an option to specialize decoders per bitwidth, which
can help reduce the (compiled) code size of the decoder code.
**Current state**:
Currently, the code generated by the decoder emitter consists of two key
functions: `decodeInstruction` which is the entry point into the
generated code and `decodeToMCInst` which is invoked when a decode op is
reached while traversing through the decoder table. Both functions are
templated on `InsnType` which is the raw instruction bits that are
supplied to `decodeInstruction`.
Several backends call `decodeInstruction` with different `InsnType`
types, leading to several template instantiations of these functions in
the final code. As an example, AMDGPU instantiates this function with
type `DecoderUInt128` type for decoding 96/128-bit instructions,
`uint64_t` for decoding 64-bit instructions, and `uint32_t` for decoding
32-bit instructions. Since there is just one `decodeToMCInst` in the
generated code, it has code that handles decoding for *all* instruction
sizes. However, the decoders emitted for different instructions sizes
rarely have any intersection with each other. That means, in the AMDGPU
case, the instantiation with InsnType == DecoderUInt128 has decoder code
for 32/64-bit instructions that is *never exercised*. Conversely, the
instantiation with InsnType == uint64_t has decoder code for
128/96/32-bit instructions that is never exercised. This leads to
unnecessary dead code in the generated disassembler binary (that the
compiler cannot eliminate by itself).
**New state**:
With this change, we introduce an option
`specialize-decoders-per-bitwidth`. Under this mode, the DecoderEmitter
will generate several versions of `decodeToMCInst` function, one for
each bitwidth. The code is still templated, but will require backends to
specify, for each `InsnType` used, the bitwidth of the instruction that
the type is used to represent using a type-trait `InsnBitWidth`. This
will enable the templated code to choose the right variant of
`decodeToMCInst`. Under this mode, a particular instantiation will only
end up instantiating a single variant of `decodeToMCInst` generated and
that will include only those decoders that are applicable to a single
bitwidth, resulting in elimination of the code duplication through
instantiation and a reduction in code size.
Additionally, under this mode, decoders are uniqued only within a given
bitwidth (as opposed to across all bitwidths without this option), so
the decoder index values assigned are smaller, and consume less bytes in
their ULEB128 encoding. As a result, the generated decoder tables can
also reduce in size.
Adopt this feature for the AMDGPU and RISCV backend. In a release build,
this results in a net 55% reduction in the .text size of
libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. For
RISCV, which today uses a single `uint64_t` type, this results in a 3.7%
increase in code size (expected as we instantiate the code 3 times now).
Actual measured sizes are as follows:
```
Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
AMDGPU Before After Change
======================================================
.text 612327 275607 55% reduction
.rodata 369728 351336 5% reduction
RISCV:
======================================================
.text 47407 49187 3.7% increase
.rodata 35768 35839 0.1% increase
```
If a custom operand has MIOperandInfo with >= 2 sub-operands, it is
required that either the operand or its sub-operands have a decoder
method (depending on usage). Require this for single sub-operand
operands as well, since there is no good reason not to.
There are no changes in the generated files.
It can never be reached. It could be reached if we emitted an opcode
that could fall outside the outermost scope, but emission of all such
opcodes is guarded by `!isOutermostScope()`.
That also means we never add fixups to the outermost scope, so avoid
pushing an entry for it onto the stack.
There is no target named Thumb, so there is no need to make a special
case for it.
As part of this change, pass CodeGenTarget instead of DecoderEmitter
to FilterChooser to remove dependency between the latter two.
Change `InstructionEncoding` to use `Size` field to derive the BitWidth
for fixed length instructions as opposed to the number of bits in the
`Inst` field. For some backends, `Inst` has more bits than `Size`, but
`Size` is the true size of the instruction.
Also add validation that `Inst` has at least `Size * 8` bits and any
bits in `Inst` beyond that are either 0 or unset.
Verified no change in generated *GenDisassembler.inc files before/after.
* Inline two small functions so that `emitTableEntries()` calls itself
directly rather than through other functions.
* Peel the last iteration of the loop as it is special.
This should make the code easier to follow.
So we can see the changes in table sizes after making changes to
DecoderEmitter by simply running `grep DecoderTable`.
Also, remove an unnecessary terminating 0 from the end of the tables.
There was a lot of confusion about the responsibilities of Filter and
FilterChooser. They created instances of each other and called each
other's methods. Some of the methods had similar names and did similar
things.
This change moves most of the Filter members to FilterChooser and turns
Filter into a supplementary class with short lifetime. FilterChooser
constructs an array of (candidate) Filters, chooses the best performing
one, and applies it to the given set of encodings, creating inferior
FilterChoosers as necessary. The Filter array is then destroyed. All
responsibility for generating the decoder table now lies with
FilterChooser.
Extract `findBestFilter() const` searching for the best filter and
move calls to `recurse()` out of it to a single place.
Extract `dump()` as well, it is useful for debugging.
See the added test. Before this change the decoder would first read
the second byte, despite the fact that there are 1-byte instructions
that could match:
```
/* 0 */ MCD::OPC_ExtractField, 8, 8, // Inst{15-8} ...
/* 3 */ MCD::OPC_FilterValue, 0, 4, 0, // Skip to: 11
/* 7 */ MCD::OPC_Decode, 186, 2, 0, // Opcode: I16_0, DecodeIdx: 0
/* 11 */ MCD::OPC_FilterValue, 1, 4, 0, // Skip to: 19
/* 15 */ MCD::OPC_Decode, 187, 2, 0, // Opcode: I16_1, DecodeIdx: 0
/* 19 */ MCD::OPC_FilterValue, 2, 4, 0, // Skip to: 27
/* 23 */ MCD::OPC_Decode, 188, 2, 0, // Opcode: I16_2, DecodeIdx: 0
/* 27 */ MCD::OPC_ExtractField, 0, 1, // Inst{0} ...
/* 30 */ MCD::OPC_FilterValue, 0, 4, 0, // Skip to: 38
/* 34 */ MCD::OPC_Decode, 189, 2, 1, // Opcode: I8_0, DecodeIdx: 1
/* 38 */ MCD::OPC_FilterValueOrFail, 1,
/* 40 */ MCD::OPC_Decode, 190, 2, 1, // Opcode: I8_1, DecodeIdx: 1
/* 44 */ MCD::OPC_Fail,
```
There are no changes in the generated files. The only in-tree target
that uses variable length decoder is M68k, which is free of decoding
conflicts that could result in the decoder doing OOB access.
This also fixes misaligned "Decoding Conflict" dump,
prettified example output is shown in the second test.
`NumFiltered` is the number of elements in all vectors in a map.
It is ever compared to 1, which is equivalent to checking if the map
contains exactly one vector with exactly one element.
Parse the `Inst` and `SoftField` fields once and store them in
`InstructionEncoding` so that we don't parse them every time
`getMandatoryEncodingBits()` is called.
As the FIXME says, we might generate the wrong code to decode an
instruction if it had an operand with no encoding bits. An example is
M68k's `MOV16ds` that is defined as follows:
```
dag OutOperandList = (outs MxDRD16:$dst);
dag InOperandList = (ins SRC:$src);
list<Register> Uses = [SR];
string AsmString = "move.w\t$src, $dst"
dag Inst = (descend { 0, 1, 0, 0, 0, 0, 0, 0, 1, 1 },
(descend { 0, 0, 0 }, (operand "$dst", 3)));
```
The `$src` operand is not encoded, but what we see in the decoder is:
```C++
tmp = fieldFromInstruction(insn, 0, 3);
if (!Check(S, DecodeDR16RegisterClass(MI, tmp, Address, Decoder)))
{ return MCDisassembler::Fail; }
if (!Check(S, DecodeSRCRegisterClass(MI, insn, Address, Decoder)))
{ return MCDisassembler::Fail; }
return S;
```
This calls DecodeSRCRegisterClass passing it `insn` instead of the value
of a field that doesn't exist. DecodeSRCRegisterClass has an
unconditional llvm_unreachable inside it.
New decoder looks like:
```C++
tmp = fieldFromInstruction(insn, 0, 3);
if (!Check(S, DecodeDR16RegisterClass(MI, tmp, Address, Decoder)))
{ return MCDisassembler::Fail; }
return S;
```
We're still not disassembling this instruction right, but at least we no
longer have to provide a weird operand decoder method that accepts
instruction bits instead of operand bits.
See #154477 for the origins of the FIXME.