RFC:
https://discourse.llvm.org/t/rfc-extend-machine-value-type-from-uint8-t-to-uint16-t/80274
compile-time-tracker:
https://llvm-compile-time-tracker.com/compare.php?from=4b9fab591916eec9fd1942f37afe3b137b564089&to=177d28247efe5a4d59a8d8150b4daf01e4f57d74&stat=wall-time
Currently 208 out of 256 MVTs are used, it will be run out soon, so
ultimately we need to extend the original `MVT::SimpleValueType` from
`uint8_t` to `uint16_t` to accomodate more types.
The `MatcherTable` uses `unsigned char` for encoding the matcher code,
so the extended MVTs are no longer fit into the table, thus we need to
use VBR to encode them as we do on others that are wider than 8 bits.
The statistics below shows the difference of "Total Array size" of the
matcher table that appears in every files:
```
Table Before After Change(%)
WebAssemblyGenDAGISel.inc 23576 23775 0.844
NVPTXGenDAGISel.inc 173498 173498 0
RISCVGenDAGISel.inc 2179121 2369929 8.756
AVRGenDAGISel.inc 2754 2754 0
PPCGenDAGISel.inc 163315 163617 0.185
MipsGenDAGISel.inc 47280 47447 0.353
SystemZGenDAGISel.inc 56243 56461 0.388
AArch64GenDAGISel.inc 467893 487830 4.261
MSP430GenDAGISel.inc 8069 8069 0
LoongArchGenDAGISel.inc 78928 79131 0.257
XCoreGenDAGISel.inc 3432 3432 0
BPFGenDAGISel.inc 3733 3733 0
VEGenDAGISel.inc 65174 66456 1.967
LanaiGenDAGISel.inc 2067 2067 0
X86GenDAGISel.inc 628787 636987 1.304
ARMGenDAGISel.inc 170968 171036 0.040
HexagonGenDAGISel.inc 155764 155764 0
SparcGenDAGISel.inc 5762 5798 0.625
AMDGPUGenDAGISel.inc 504356 504463 0.021
R600GenDAGISel.inc 29785 29785 0
```
The statistics below shows the runtime peak memory usage by compiling a
simple C program:
`/bin/time -v clang -target $TARGET -O3 -c test.c`
```
int test(int a) {
return a * 3;
}
```
```
Target Before(kbytes) After(kbytes) Change(%)
wasm64 110172 110088 -0.076
nvptx64 109784 109980 0.179
riscv64 114020 113656 -0.319
avr 110352 110068 -0.257
ppc64 112612 112476 -0.120
mips64 113588 113668 0.070
systemz 110860 110760 -0.090
aarch64 113704 113432 -0.239
msp430 110284 110200 -0.076
loongarch64 111052 110756 -0.267
xcore 108340 108020 -0.295
bpf 110620 110708 0.080
ve 110960 110920 -0.036
lanai 110180 109960 -0.200
x86_64 113640 113304 -0.296
arm64 113540 113172 -0.324
hexagon 114620 114684 0.056
sparc 110412 110136 -0.250
amdgcn 118164 117144 -0.863
r600 111200 110508 -0.622
```
The original `CheckValueTypeMatcher` stores StringRef as the member
variable type, however it's more efficient to use use
MVT::SimpleValueType since it prevents string comparison in isEqualImpl,
it also reduce the memory consumption in each object.
Refactor of the llvm-tblgen source into:
- a "Basic" library, which contains the bare minimum utilities to build
`llvm-min-tablegen`
- a "Common" library which contains all of the helpers for TableGen
backends. Such helpers can be shared by more than one backend, and even
unit tested (e.g. CodeExpander is, maybe we can add more over time)
Fixes#80647
Almost all uses of `*TreePatternNode` expect it to be non-null. There
was the occasional check that it wasn't, which I have removed. Making
them references makes it clear that they exist.
This was attempted in 2018 (1b465767d6ca69f4b7201503f5f21e6125fe049a)
for `TreePatternNode::getChild()` but that was reverted.
We record the usage of each `Predicate` and sort them by usage.
For the top 8 `Predicate`s, we will emit a `PC_CheckPredicateN` to
save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 61K.
This is a recommit of 1a57927, which was reverted in bc98c31.
The CI failures occurred when doing expensive checks (with option
`LLVM_ENABLE_EXPENSIVE_CHECKS` being ON).
The key point here is that we need stable sorting result in the
test, but doing expensive checks uncovered the non-determinism of
`llvm::sort`. So `llvm::sort` is changed to `llvm::stable_sort`
in this revised patch.
And we use `llvm::MapVector` to keep insertion order.
We record the usage of each `Predicate` and sort them by usage.
For the top 8 `Predicate`s, we will emit a `PC_CheckPredicateN` to
save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 61K.
We record the usage of each `PatternPredicate` and sort them by
usage.
For the top 8 `PatternPredicate`s, we will emit a
`OPC_CheckPatternPredicateN` to save one byte.
The old `OPC_CheckPatternPredicate2` is renamed to
`OPC_CheckPatternPredicateTwoByte`.
Overall this reduces the llc binary size with all in-tree targets by
about 93K.
We record the usage of each `ComplexPat` and sort the `ComplexPat`s
by usage.
For the top 8 `ComplexPat`s, we will emit a `OPC_CheckComplexPatN`
to save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 89K.
The followed byte of `OPC_EmitRegister` is a MVT type, which is
usually i32 or i64.
We add `OPC_EmitRegisterI32` and `OPC_EmitRegisterI64` so that we
can reduce one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 10K.
There are a lot of operations to move current node to parent and
then move to another child.
So `OPC_MoveSibling` and its space-optimized forms are added to do
this "move to sibling" operations.
These new operations will be generated when optimizing matcher in
`ContractNodes`. Currently `MoveParent+MoveChild` will be optimized
to `MoveSibling` and sequences `MoveParent+RecordChild+MoveChild`
will be transformed into `MoveSibling+RecordNode`.
Overall this reduces the llc binary size with all in-tree targets by
about 30K.
If there is only one bit set in EmitNodeInfo, then we can encode it
implicitly to save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 168K.
The most common type is i32 or i64 so we add `OPC_CheckChildTypeI32`
and `OPC_CheckChildTypeI64` to save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 70K.
These new opcodes implicitly indicate the RecNo.
The old `OPC_EmitCopyToReg2` is renamed to `OPC_EmitCopyToRegTwoByte`.
Overall this reduces the llc binary size with all in-tree targets by
about 33K (most are from RISCV target).
The most common type is i32 or i64 so we add `OPC_CheckTypeI32` and
`OPC_CheckTypeI64` to save one byte.
Overall this reduces the llc binary size with all in-tree targets by
about 29K.
These two opcodes are used to be followed by a MVT operand, which is
always one of i8/i16/i32/i64.
We add instantiated `OPC_EmitInteger` and `OPC_EmitStringInteger` with
i8/i16/i32/i64 so that we can reduce one byte.
We reserve `OPC_EmitInteger` and `OPC_EmitStringInteger` in case that
we may need them someday, though I haven't found one usage after this
change.
Overall this reduces the llc binary size with all in-tree targets by
about 200K.
This is an alternative to D156967 where I suggested the author
could use a VBR type.
This patch takes inspirations from Opc_EmitRegister2 that is used
for two byte registers.
I'm assuming 1 or 2 byte predicates should be enough so we don't
need the fully generality of VBR.
This avoids impacting the table size on targets that have more than
128 predicates already like X86.
Reviewed By: bogner
Differential Revision: https://reviews.llvm.org/D157196
Instead of storing a string containing the instruction name, store a
reference to the instruction. We can use that reference to print the
instruction name when we emit the table.
The only slightly annoying part is that we have to find the
CodeGenInstruction for IMPLICIT_DEF. GlobalISel is doing
a similar thing.
Since 65b13610a5226b84889b923bae884ba395ad084d, raw_string_ostream has
been unbuffered by default. Based on an audit of llvm/utils/, this
commit removes every call to `raw_string_ostream::flush()` and any call
to `raw_string_ostream::str()` whose result is ignored or that doesn't
help with clarity.
I left behind a few calls to `str()`. In these cases, the underlying
std::string was declared pretty far away and never used again, whereas
stream recently had its last write. The code is easier to read as-is;
the no-op call to `flush()` inside `str()` isn't harmful, and when
https://reviews.llvm.org/D115421 lands it'll be gone anyway.
This allows for a much more efficient encoding for small negative
numbers by storing the sign bit first and negating the rest of
the bits. This was already being used for OPC_CheckInteger.
For every in tree target this affects, the table got smaller.
R600GenDAGISel.inc saw the largest reduction of 7K.
I did have to add a new opcode for StringIntegers used for
register class ids and subregister indices since we don't have the
integer value to encode. The enum name is emitted directly into
the table. Previously assumed the enum would expand to a positive
7-bit number. We might be able to just shift that right by 1 and
assume it is a positive 6 bit number, but that will need more
investigation.
CheckInteger uses an int64_t encoded using a variable width encoding
that is optimized for encoding a number with a lot of leading zeros.
Negative numbers have no leading zeros so use the largest encoding
requiring 9 bytes.
I believe its most like we want to check for positive and negative
numbers near 0. -1 is quite common due to its use in the 'not'
idiom.
To optimize for this, we can borrow an idea from the bitcode format
and move the sign bit to bit 0 with the magnitude stored in the
upper bits. This will drastically increase the number of leading
zeros for small magnitudes. Then we can run this value through
VBR encoding.
This gives a small reduction in the table size on all in tree
targets except VE where size increased by about 300 bytes due
to intrinsic ids now requiring 3 bytes instead of 2. Since the
intrinsic enum space is shared by all targets this an unfortunate
consquence of where VE is currently located in the range.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D96317
getSize and setSize both use unsigned. So size_t doesn't
increase range here and might get truncated if passed to
setSize.
Also not sure why EmitVBRValue was returning uint64_t, but used
an unsigned to supply the value.
Some of these were found by running clang-format over the generated
code, although that complains about far more issues than I have fixed
here.
Differential Revision: https://reviews.llvm.org/D90937
Tablegen's DAGISelMatcher emits integers in a VBR format,
so if an integer is below 128 it can fit into a single
byte, otherwise high bit is set, next byte is used etc.
MatcherTable is essentially an unsigned char table. When
SelectionDAGISel parses the table it does a reverse translation.
In a situation when numeric value of an integer to emit is
unknown it can be emitted not as OPC_EmitInteger but as
OPC_EmitStringInteger using a symbolic name of the value.
In this situation the value should not exceed 127.
One of the situations when OPC_EmitStringInteger is used is
if we need to emit a subreg into a matcher table. However,
number of subregs can exceed 127. Currently last defined subreg
for AMDGPU is 192. That results in a silent bug in the ISel
with matcher reading from an invalid offset.
Fixed this bug to emit actual VBR encoded value for a subregs
which value exceeds 127.
Differential Revision: https://reviews.llvm.org/D74368
This is how it should've been and brings it more in line with
std::string_view. There should be no functional change here.
This is mostly mechanical from a custom clang-tidy check, with a lot of
manual fixups. It uncovers a lot of minor inefficiencies.
This doesn't actually modify StringRef yet, I'll do that in a follow-up.
Includes a fix to emit a CheckOpcode for build_vector when immAllZerosV/immAllOnesV is used as a pattern root. This means it can't be used to look through bitcasts when used as a root, but that's probably ok. This extra CheckOpcode will ensure that the first match in the isel table will be a SwitchOpcode which is needed by the caching optimization in the ISel Matcher.
Original commit message:
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355784
This caused the first matcher in the isel table for many targets to Opc_Scope instead of Opc_SwitchOpcode. This leads to a significant increase in isel match failures.
llvm-svn: 355433
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355224
OPC_CheckCondCode is always used as operand 2 of a setcc. And its always surrounded by a MoveChild2 and a MoveParent. By having a dedicated opcode for this case we can reduce the number of bytes needed for this pattern from 4 bytes to 2.
This saves ~3000 bytes in the X86 table.
llvm-svn: 354763