DAGCombiner, as well as InstCombine, tend to canonicalize GE/LE into
GT/LT, namely:
```
X >= C --> X > (C - 1)
```
This sometimes generates off-by-one constants that could have been CSE'd
with surrounding constants.
Instead of changing such canonicalization, this patch tries to swap
those branch conditions post-isel, in the hope of resurfacing more
constant CSE opportunities. More specifically, it performs the following
optimization:
For two constants C0 and C1 from
```
li Y, C0
li Z, C1
```
To remove the redundant `li Y, C0`,
1. if C1 = C0 + 1 we can turn:
(a) blt Y, X -> bge X, Z
(b) bge Y, X -> blt X, Z
2. if C1 = C0 - 1 we can turn:
(a) blt X, Y -> bge Z, X
(b) bge X, Y -> blt Z, X
This optimization will be done by PeepholeOptimizer through
RISCVInstrInfo::optimizeCondBranch.
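The four rewrites above can be sanity-checked by brute force. The following is only a sketch in Python, not the patch's code: `blt`, `bge` and `rewrites_hold` are hypothetical helpers, and Python integers do not wrap, so the check assumes C0 + 1 and C0 - 1 do not overflow the register width (a real implementation would have to rule that out).

```python
# Brute-force check of the four branch rewrites on small signed integers.
# blt(a, b) models "branch if a < b"; bge(a, b) models "branch if a >= b".
# Both helpers are hypothetical stand-ins for the RISC-V branch conditions.

def blt(a, b):
    return a < b


def bge(a, b):
    return a >= b


def rewrites_hold(lo=-8, hi=8):
    for x in range(lo, hi + 1):
        for c0 in range(lo, hi + 1):
            y = c0  # li Y, C0
            # Case 1: Z holds C1 = C0 + 1.
            z = c0 + 1
            if blt(y, x) != bge(x, z):  # 1(a)
                return False
            if bge(y, x) != blt(x, z):  # 1(b)
                return False
            # Case 2: Z holds C1 = C0 - 1.
            z = c0 - 1
            if blt(x, y) != bge(z, x):  # 2(a)
                return False
            if bge(x, y) != blt(z, x):  # 2(b)
                return False
    return True
```

Each rewritten branch tests the same condition as the original, so only the operand order and opcode change, and `Y` is no longer needed.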
As usual, making it easier for an upcoming test delta to be seen.
Note that several of these are examples of extremely bad testing practice:
checking internal debug output (for no real purpose), and checking the
result of a full O2 + llc run instead of reducing to the specific
problematic pass.
[LLVM][AArch64] Add ASM constraints for reduced GPR register ranges.
The patch adds the following ASM constraints:
Uci => w8-w11/x8-x11
Ucj => w12-w15/x12-x15
These constraints are required for SME load/store instructions
where a reduced set of GPRs is used to specify ZA array vectors.
NOTE: GCC has agreed to use the same constraint syntax.
Remove support for zext and sext constant expressions. All places
creating them have been removed beforehand, so this just removes the
APIs and uses of these constant expressions in tests.
There is some additional cleanup that can be done on top of this, e.g.
we can remove the ZExtInst vs ZExtOperator footgun.
This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
Relative to the first attempt, this contains two changes:
First, we only handle the case where one side simplifies to true or
false, instead of calling simplification recursively. The previous
approach would return poison if one operand simplified to poison
(under the equality assumption), which is incorrect.
Second, we do not fold llvm.is.constant in simplifyWithOpReplaced().
We may be assuming that a value is constant, if the equality holds,
but it may not actually be constant. This is nominally just a QoI
issue, but the std::list implementation in libstdc++ relies on the
precise behavior in a way that causes miscompiles.
-----
and/or in logical (select) form benefit from generic simplifications via
simplifyWithOpReplaced(). However, the corresponding fold for plain
and/or currently does not exist.
Similar to selects, there are two general cases for this fold
(illustrated with `and`, but there are `or` conjugates).
The basic case is something like `(a == b) & c`, where the replacement
of a with b or b with a inside c allows it to fold to true or false.
Then the whole operation will fold to either false or `a == b`.
The second case is something like `(a != b) & c`, where the replacement
inside c allows it to fold to false. In that case, the operand can be
replaced with c, because in the case where a == b (and thus the icmp is
false), c itself will already be false.
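The two cases can be illustrated with concrete choices of `c`. This is a small Python sketch under assumptions: the particular `c` expressions (`a < b`, `a <= b`, `a > b`) are picked for illustration only and `check_folds` is a hypothetical helper, not part of the patch.

```python
from itertools import product


def check_folds(values=range(-3, 4)):
    for a, b in product(values, repeat=2):
        # Basic case, c folds to false: with a replaced by b, (a < b)
        # becomes (b < b), so (a == b) & (a < b) is always false.
        if ((a == b) and (a < b)) is not False:
            return False
        # Basic case, c folds to true: with a replaced by b, (a <= b)
        # becomes (b <= b), so (a == b) & (a <= b) is just (a == b).
        if ((a == b) and (a <= b)) != (a == b):
            return False
        # Second case: with a replaced by b, (a > b) becomes (b > b),
        # i.e. false, so (a != b) & (a > b) is just (a > b).
        if ((a != b) and (a > b)) != (a > b):
            return False
    return True
```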
As the test diffs show, this catches quite a lot of patterns in existing
test coverage. This also obsoletes quite a few existing special-case
and/or of icmp folds we have (e.g. simplifyAndOrOfICmpsWithLimitConst),
but I haven't removed anything as part of this patch in the interest of
risk mitigation.
Fixes #69050.
Fixes #69091.
Previously the kill flags of the source register were unconditionally
cleared when a `str` pair was merged, which resulted in suboptimal
register allocation and inhibited some renaming opportunities that could
allow further `str` merging.
This adds a new pass to insert VXRM writes for vector instructions, with
the goal of avoiding redundant writes.
The pass runs two dataflow analyses. The first is a forward dataflow to
calculate where a VXRM value is available. The second is a backward
dataflow to determine where a VXRM value is anticipated.
Finally, we use the results of these two dataflows to insert VXRM writes
where a value is anticipated, but not available.
The pass does not split critical edges so we aren't always able to
eliminate all redundancy.
The pass will only insert vxrm writes on paths that always require it.
Prior to this patch, vector `s|uitofp` conversions from narrow types
(`<= i16`) were scalarized when the hardware doesn't support fp16
conversions natively.
This patch fixes that by avoiding `i16` as an intermediate type when the
hardware does not support converting from this type to half. In
other words, when the target doesn't support `avx512fp16`, we avoid
using intermediate `i16` vectors for `s|uitofp` conversions.
Instead we extend the narrow type to `i32`, which is then converted to
`float` and truncated to `half`.
Put differently, we go from:
```
s|uitofp iNarrow %src to half
```
To
```
%tmp = s|zext iNarrow %src to i32
%tmpfp = s|uitofp i32 %tmp to float
fptrunc float %tmpfp to half
```
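As a sanity check on the sequence above, widening to `i32` and converting to `float` is exact for any `i16` value, so the only rounding happens in the final truncation to `half`. The sketch below models this in Python with `struct` (format `f` for binary32, `e` for binary16); the helper names are hypothetical and only the signed `i16` range is covered.

```python
import struct


def to_f32(x):
    # Round a Python float to IEEE binary32 and back.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f16(x):
    # Round a Python float to IEEE binary16 and back.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def via_i32_and_float(n):
    widened = int(n)                    # sext to i32: value unchanged
    as_float = to_f32(float(widened))   # sitofp i32 -> float (exact for i16)
    return to_f16(as_float)             # fptrunc float -> half


def single_rounding(n):
    # Reference: round the exact integer value to half once.
    return to_f16(float(n))


def paths_agree():
    return all(via_i32_and_float(n) == single_rounding(n)
               for n in range(-32768, 32768))
```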
Note that this patch:
- Doesn't change the actual lowering of i32 to half. I.e., the `float`
intermediate step and the final downcasting are what existed for this
input type to half.
- Changes only the intermediate type for the lowering of `s|uitofp`.
I.e., the first `s|zext` from i16 to i32.
Remark: The vector and scalar lowering of `s|uitofp` don't use the same
code path. Not super happy about that, but I'm not planning to fix that,
at least in this PR.
This fixes https://github.com/llvm/llvm-project/issues/67080
Recently, 98c90a1 (ISel: introduce vector ISD::LRINT, ISD::LLRINT;
custom RISCV lowering) introduced vector variants of llvm.lrint,
llvm.llrint, and bundled several tests along with the code change.
However, it forgot to test lrint and llrint on fixed vectors on RISC-V,
and it turns out that fixed-vectors-lrint.ll requires
PromoteIntRes_XRINT to be implemented. Implement it, and add tests for
fixed-vector lrint and llrint.
Add llvm.amdgcn.s.ttracedata and llvm.amdgcn.s.ttracedata.imm which map
directly to the corresponding instructions s_ttracedata and
s_ttracedata_imm. These are inherently whole-wave operations so any
non-uniform inputs are readfirstlaned.
moveToVALU previously only handled S_CSELECT_B64 in the trivial case
where it was semantically equivalent to a copy. Implement the general
case using V_CNDMASK_B64_PSEUDO and implement post-RA expansion of
V_CNDMASK_B64_PSEUDO with immediate as well as register operands.
During the SIOptimizeExecMasking pass, we try to form V_CMPX
instructions by detecting S_AND_SAVEEXEC and V_MOV instructions.
Generally, we require the input operand of the V_MOV, which is the input
operand to the to-be-formed V_CMPX, to be alive. This is forced by
clearing the kill flags on the operand after V_CMPX has been generated.
However, if we have a kill of a register set that contains said
register, this will not be detected by clearKillFlags.
With this change, additional kill-flag candidates are detected during
the final call to findInstrBackwards, and their kill flags are removed
to keep all registers in the set alive.
Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
This reverts commit 6bf1b4e8e0776e6f27013434d8b632016ccc795c.
Requiring a triple does not require moving these to a codegen test directory.
Move these to an x86 specific subdirectory of a transforms test.
Using GCNDownwardRPTracker or GCNUpwardRPTracker, the pass collects register pressure values for a function and prints them next to the instructions. The output can be used to generate FileCheck rules in MIR tests.
When SI_PC_ADD_REL_OFFSET is expanded to S_GETPC/S_ADD/S_ADDC, the
GlobalAddress operands have to be adjusted by 4 or 12 bytes to account
for the offset from the end of the S_GETPC instruction to the literal
operands. Do this all in SIInstrInfo::expandPostRAPseudo instead of
duplicating the adjustment code in both AMDGPULegalizerInfo and
SITargetLowering. NFCI.
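One way to see where 4 and 12 come from, sketched in Python. The encoding sizes are assumptions for illustration and are not stated in this message: each of S_ADD and S_ADDC is taken to be a 4-byte instruction word followed by a 4-byte literal, and S_GETPC is taken to return the address just past itself.

```python
# Assumed sizes, for illustration only.
INSTR_WORD_BYTES = 4  # base encoding of S_ADD / S_ADDC
LITERAL_BYTES = 4     # trailing 32-bit literal operand


def literal_offsets_from_getpc_end():
    offset = 0  # position right after S_GETPC
    # The S_ADD literal follows the S_ADD instruction word.
    lo_literal = offset + INSTR_WORD_BYTES
    # Skip all of S_ADD (word + literal), then the S_ADDC instruction word.
    offset += INSTR_WORD_BYTES + LITERAL_BYTES
    hi_literal = offset + INSTR_WORD_BYTES
    return lo_literal, hi_literal
```

Under these assumptions the function returns the two adjustments, one per literal operand.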
There is no PIC support for -mcmodel=large
(https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst)
and Clang now rejects -mcmodel= with PIC (#70262).
The current backend code assumes that the large code model is non-PIC.
This patch adds `!getTargetMachine().isPositionIndependent()` conditions
to clarify that the support is non-PIC only. In addition, add some tests
as change detectors in case PIC large code model is supported in the
future.
If other front-ends/JITs use the large code model with PIC, they will
get the small code model code sequence instead of a potentially
incorrect MOVZ/MOVK sequence, which is only suitable for non-PIC. The
MOVZ/MOVK sequence would cause text relocations with ELF linkers.
(The small code model code sequence is usually sufficient, as ADRP+ADD or
ADRP+LDR targets [-2**32,2**32), which is double the range of x86-64
R_X86_64_REX_GOTPCRELX/R_X86_64_PC32 [-2**31,2**31).)
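The range comparison can be reproduced with a little arithmetic. This Python sketch assumes ADRP carries a 21-bit signed page count scaled by 4 KiB and that the x86-64 relocations hold a plain 32-bit signed displacement; `signed_range` is a hypothetical helper.

```python
def signed_range(bits, scale=1):
    # Half-open range [lo, hi) of a signed immediate of the given width,
    # multiplied by a scale factor.
    half = 1 << (bits - 1)
    return (-half * scale, half * scale)


ADRP_RANGE = signed_range(21, 4096)  # 21-bit page count, 4 KiB pages
PC32_RANGE = signed_range(32)        # 32-bit signed displacement


def span(r):
    return r[1] - r[0]
```

Here `ADRP_RANGE` is `(-2**32, 2**32)` and `PC32_RANGE` is `(-2**31, 2**31)`, so the AArch64 sequence covers twice the span.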
This will select i32 operations directly to W instructions without
custom nodes. Hopefully this can allow us to be less dependent on
hasAllNBitUsers to recover i32 operations in RISCVISelDAGToDAG.cpp.
This support is enabled with a command line option that is off by
default.
Generated code is still not optimal.
I've duplicated many test cases for this, but it's not complete. Enabling this runs all existing lit tests without crashing.
This fixes an issue introduced by PR #70679.
Using constrainRegClass() is not strong enough to actually force
the use of a register to be a PPR register class. It will need an
actual COPY to do the conversion.
The downside is that this introduces an extra register, which is an
issue we may want to fix at a later point using a custom copy operation
where the register allocator uses the same register when it can.
* Introduce field `PositionOrder` for class `Register` and
`RegisterTuples`
* If register A's `PositionOrder` < register B's `PositionOrder`, then A
is placed before B in the enum in X86GenRegisterInfo.inc
* The new order of registers in the enum for X86 will be
1. Registers before AVX512,
2. AVX512 registers (X/YMM16-31, ZMM0-31, K registers)
3. AMX registers (TMM)
4. APX registers (R16-R31)
* Add a new target hook `getNumSupportedRegs()` to return the number of
registers for the function (may overestimate).
* Replace `getNumRegs()` with `getNumSupportedRegs()` in LiveVariables
to eliminate iterations on unsupported registers
This patch can reduce the 0.3% instruction count regression for sqlite3
at compile time (O3) by not iterating over APX registers,
for #67702
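A rough Python model of the ordering idea, under stated assumptions: the `Reg` class, the register names, the numeric `position_order` values and the feature predicate are all hypothetical, and ties are assumed to keep definition order. It is not the TableGen or target-hook implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    name: str
    position_order: int  # hypothetical values: 0 = pre-AVX512 ... 3 = APX


def enum_order(regs):
    # sorted() is stable, so registers with equal position_order keep
    # their definition order (an assumption of this sketch).
    return sorted(regs, key=lambda r: r.position_order)


def num_supported_regs(ordered, group_supported):
    # Index just past the last register whose group is supported. Because
    # unsupported groups sort last, later entries can be skipped entirely.
    count = 0
    for i, reg in enumerate(ordered, start=1):
        if group_supported(reg.position_order):
            count = i
    return count


REGS = [Reg("RAX", 0), Reg("R16", 3), Reg("XMM16", 1),
        Reg("TMM0", 2), Reg("RCX", 0)]
ORDERED = enum_order(REGS)
```

With APX (group 3 in this sketch) unsupported, `num_supported_regs(ORDERED, lambda g: g < 3)` stops before `R16`, so a loop bounded by it never visits the APX registers.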
Similar to what we do for binops, for undesirable casts we should
call the constant folding API instead of the constant expr API,
to avoid indirect creation of undesirable cast ops.
Most of the FP constants supported by FLI are positive. For a negative FP
constant X whose positive value is supported by FLI, we can use `(FNEG
(FLI -X))` to materialize X.
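The selection rule can be sketched as follows. This is a Python illustration only: `FLI_SUBSET` is a small hypothetical subset of the constants FLI can load, and the tuple encoding of nodes is made up for the example.

```python
# Hypothetical, illustrative subset of FLI-loadable constants.
FLI_SUBSET = {0.25, 0.5, 1.0, 2.0}


def materialize(x):
    if x in FLI_SUBSET:
        return ("FLI", x)
    # For a negative X whose positive value -X is loadable, load -X and
    # negate it: (FNEG (FLI -X)).
    if x < 0 and -x in FLI_SUBSET:
        return ("FNEG", ("FLI", -x))
    return None  # not handled by this sketch
```

For example, `materialize(-0.5)` yields `("FNEG", ("FLI", 0.5))`, while a constant outside the subset falls through to `None`.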