Uses machine analyses to emit PGOAnalysisMap into the bb-addr-map ELF
section. Implements filecheck tests to verify emitting new fields.
This patch emits optional PGO related analyses into the bb-addr-map ELF
section during AsmPrinter. This currently supports Function Entry Count,
Machine Block Frequencies. and Machine Branch Probabilities. Each is
independently enabled via the `feature` byte of `bb-addr-map` for the given
function.
A part of [RFC - PGO Accuracy Metrics: Emitting and Evaluating Branch and Block Analysis](https://discourse.llvm.org/t/rfc-pgo-accuracy-metrics-emitting-and-evaluating-branch-and-block-analysis/73902).
Consistently treat packed 16-bit operands as 32-bit values, because
that's really what they are. The attempt to treat them differently was
ultimately incorrect and lead to miscompiles, e.g. when using non-splat
constants such as (1, 0) as operands.
Recognize 32-bit float constants for i/u16 instructions. This is a bit
odd conceptually, but it matches HW behavior and SP3.
Remove isFoldableLiteralV216; there was too much magic in the dependency
between it and its use in SIFoldOperands. Instead, we now simply rely on
checking whether a constant is an inline constant, and trying a bunch of
permutations of the low and high halves. This is more obviously correct
and leads to some new cases where inline constants are used as shown by
tests.
Move the logic for switching packed add vs. sub into SIFoldOperands.
This has two benefits: all logic that optimizes for inline constants in
packed math is now in one place; and it applies to both SelectionDAG and
GISel paths.
Disable the use of opsel with v_dot* instructions on gfx11. They are
documented to ignore opsel on src0 and src1. It may be interesting to
re-enable to use of opsel on src2 as a future optimization.
A similar "proper" fix of what inline constants mean could potentially
be applied to unpacked 16-bit ops. However, it's less clear what the
benefit would be, and there are surely places where we'd have to
carefully audit whether values are properly sign- or zero-extended. It
is best to keep such a change separate.
Fixes: Corruption in FSR 2.0 (latent bug exposed by an LLPC change)
The upstream test relies on jump-tables, which are lowered in
dramatically different ways with later arm64e/ptrauth patches.
Concretely, it's failing for at least two reasons:
- ptrauth removes x16/x17 from tcGPR64 to prevent indirect tail-calls
from using either register as the callee, conflicting with their usage
as scratch for the tail-call LR auth checking sequence. In the
1/2_available_regs_left tests, this causes the MI scheduler to move
the load up across some of the inlineasm register clobbers.
- ptrauth adds an x16/x17-using pseudo for jump-table dispatch, which
looks somewhat different from the regular jump-table dispatch codegen
by itself, but also prevents compression currently.
They seem like sensible changes. But they mean the tests aren't really
testing what they're intented to, because there's always an implicit
x16/x17 clobber when using jump-tables.
This updates the test in a way that should work identically regardless
of ptrauth support, with one exception, #1 above, which merely reorders
the load/inlineasm w.r.t. eachother.
I verified the tests still fail the live-reg assertions when
applicable.
After 47a1704ac94c8aeb1aa7e0fc438ff99d36b632c6 we are able to
reassociate a disjoint Or used as a GEP index to get the constant
closer to a load to fold it. This is show by the first test.
We are not able to do this if the GEP created a shift left to scale
the index as the second test shows.
To make this work, we need to preserve the disjoint flag when pulling
the Or through the shift.
Keep the haveNoCommonBitsSet check because we haven't started inferring
the flag yet.
I've added tests for two transforms, but these are not the only
transforms that use isADDLike.
If SHUF128 is accessing only the upper half of a vector source that is a concatenation/insert_subvector then try to access the subvector directly and adjust the element mask accordingly.
This atempts to fix#76734 which is a crash in invalid TRUNC nodes types
from unoptimized input code in combineShiftToAVG. The NVT can be VT if
the larger type was legal and the adds will not overflow, in which case
the inputs should be extended.
From what I can tell this appears to be valid (if not optimal for this
case): https://alive2.llvm.org/ce/z/fRieHR
The result has also been changed to getExtOrTrunc in case that VT==NVT,
which is not handled by SEXT/ZEXT.
This tries to allow libcalls to be tail called, using a similar method
to DAG where the type is checked to make sure they match, and if so the
backend, through lowerCall checks that the tailcall is valid for all
arguments.
There are no native operations that we can use for floating point mul,
so lower by splitting the vector into chunks multiple times. There is
still a missing fold for fmul_indexed, that could help the gisel test
cases a bit.
When shl is folded in compare instruction, a miscompilation occurs when
the CMP instruction is also sign-extended. For the following IR:
%op3 = shl i8 %op2, 3
%tmp3 = icmp eq i8 %tmp2, %op3
It used to generate
cmp w8, w9, sxtb #3
which means sign extend w9, shift left by 3, and then compare with the
value in w8. However, the original intention of the IR would require
`%op2` to first shift left before extending the operands in the
comparison operation . Moreover, if sign extension is used instead of
zero extension, the sample test would miscompile. This PR creates a fix
for the issue, more specifically to not fold the left shift into the CMP
instruction, and to create a zero-extended value rather than a
sign-extended value.
This reverts most of commit 5b155aea0e529b7b5c807e189fef6ea5cd5faec9.
I have left the new test file, but regenerated the checks.
This causes failures in our downstream testing. The input types
to the extends need to be checked so we don't create RISCVISD::VZEXT_VL
with illegal or unsupported input type.
If there's no shift being performed, the rounding mode doesn't matter.
We could do the same for vssra and vssrl, but they are no-ops with a
shift of 0 so would be better off being removed earlier.
This is a fix for the second half of #75822, where smaller constants can
also be bitcast to larger types. We should be checking the size is what
we expect it to be when matching ones.
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
Do optimization to turn x >> (shift & 31/63) into a single srlv instead
of andi + srlv, since the mips variable shift instruction already
implicitly masks the shift, like x86, wasm and AMDGPU. Copy the
X86DAGToDAGISel::isUnneededShiftMask() function to MIPS for checking
whether need combine two instructions to one.
This PR mainly aims at resolving the below missed-optimization case,
while it could also be considered as an extension of the previous patch
https://reviews.llvm.org/D133739?id=
## Missed-Optimization Case
Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh
### Source Code:
```
define <vscale x 2 x i16> @multiple_users(ptr %x, ptr %y, ptr %z) {
%a = load <vscale x 2 x i8>, ptr %x
%b = load <vscale x 2 x i8>, ptr %y
%b2 = load <vscale x 2 x i8>, ptr %z
%c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16>
%d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16>
%d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16>
%e = mul <vscale x 2 x i16> %c, %d
%f = add <vscale x 2 x i16> %c, %d2
%g = sub <vscale x 2 x i16> %c, %d2
%h = or <vscale x 2 x i16> %e, %f
%i = or <vscale x 2 x i16> %h, %g
ret <vscale x 2 x i16> %i
}
```
### Before This Patch
```
# %bb.0:
vsetvli a3, zero, e16, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
svf2 v11, v8
vsext.vf2 v8, v9
vsext.vf2 v9, v10
vmul.vv v8, v11, v8
vadd.vv v10, v11, v9
vsub.vv v9, v11, v9
vor.vv v8, v8, v10
vor.vv v8, v8, v9
ret
```
### After This Patch
```
# %bb.0:
vsetvli a3, zero, e8, mf4, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
vwmul.vv v11, v8, v9
vwadd.vv v9, v8, v10
vwsub.vv v12, v8, v10
vsetvli zero, zero, e16, mf2, ta, ma
vor.vv v8, v11, v9
vor.vv v8, v8, v12
ret
```
We can see Add/Sub/Mul are combined with the Sign Extension.
## Relation to the Patch D133739
The patch D133739 introduced an optimization for folding `ADD_VL`/
`SUB_VL` / `MUL_V` with `VSEXT_VL` / `VZEXT_VL`. However, the patch did
not consider the case of non-fixed length vector case, thus this PR
could also be considered as an extension for the D133739.
Furthermore, in the current `SelectionDAG`, we represent scalable vector
add (or any binary operator) as a normal `ADD` operation. It might be
better to use an Opcode like `ADD_VL`, which needs further conversation
and decision.
Whilst this might technically not be correct if a combine treats signed zeroes
differently, where the neon operations are more defined than the minnum/maxnum
nodes. It mirrors what SDAG does, which allows us to lower aarch64_neon_fminnm
and aarch64_neon_fmaxnm through the existing selection patterns.