This PR instruments the optimization passes in the SystemZ backend with
calls to `MachineFunction::substituteDebugValuesForInst` where
instruction substitutions are made to instructions that may compute
tracked values.
Tests are also added for each of the substitutions that were inserted.
Details on the individual passes follow.
### systemz-copy-physregs
When a copy targets an access register, we redirect the copy via an
auxiliary register. This leads to the final result being written by a
newly inserted SAR instruction, rather than the original MI, so we need
to update the debug value tracking to account for this.
### systemz-long-branch
This pass relaxes relative branch instructions based on the actual
locations of blocks. Only one of the branch instructions qualifies for
debug value tracking: BRCT, i.e. branch-relative-on-count, which
subtracts 1 from a register and branches if the result is not zero. This
is relaxed into an add-immediate and a conditional branch, so any
`debug-instr-number` present must move to the add-immediate instruction.
### systemz-post-rewrite
This pass replaces `LOCRMux` and `SELRMux` pseudoinstructions with
either the real versions of those instructions, or with branching
programs that implement the intent of the Pseudo. In all these cases,
any `debug-instr-number` attached to the pseudo needs to be reallocated
to the appropriate instruction in the result, either LOCR, SELR, or a
COPY.
### systemz-elim-compare
Similar to systemz-long-branch, for this pass, only few substitutions
are necessary, since it mainly deals with conditional branch
instructions. The only exceptiona are again branch-relative-on-count, as
it modifies a counter as part of the instruction, as well as any of the
load instructions that are affected.
As part of an effort to enable instr-ref-based debug value tracking,
this PR implements `SystemZInstrInfo::isLoadFromStackSlotPostFE`, as
well as `SystemZInstrInfo::isStoreToStackSlotPostFE`. The implementation
relies upon the presence of MachineMemoryOperands on the relevant
`MachineInstr`s in order to access the `FrameIndex` post frame index
elimination.
Since these new functions are only meant to be called after frame-index
elimination, they assert against the present of a frame index on the
base register operand of the instruction.
Outside of the utility of these functions to enable instr-ref-based
debug value tracking, they also changes the behavior of the AsmPrinter,
since it will now be able to properly detect non-folded spills and
reloads, so this changes a number of tests that were checking
specifically for folded reloads.
Note that there are some tests that still check for `vst` and `vl` as
folded spills/reloads even though they should be straight reloads. This
will be addressed in a future PR.
Co-authored-by: Dominik Steenken <dominik.steenken@gmail.com>
Generate code using the VECTOR ADD COMPUTE CARRY and
VECTOR SUBTRACT COMPUTE BORROW INDICATION instructions
to implement open-coded IR with those semantics.
Handles integer vector types as well as i128.
Fixes: https://github.com/llvm/llvm-project/issues/129608
Generate more efficient code for zero or sign extensions where
the source is a subvector generated via SHUFFLE_VECTOR.
Specifically, recognize patterns corresponding to (series of)
VECTOR UNPACK instructions, or the VECTOR SIGN EXTEND TO
DOUBLEWORD instruction.
As a special case, also handle zero or sign extensions of a
vector element to i128.
Fixes: https://github.com/llvm/llvm-project/issues/129576
Fixes: https://github.com/llvm/llvm-project/issues/129899
Detect (non-intrinsic) IR patterns corresponding to the semantics
of the various widening and high-word multiplication instructions.
Specifically, this is done by:
- Recognizing even/odd widening multiplication patterns in DAGCombine
- Recognizing widening multiply-and-add on top during ISel
- Implementing the standard MULHS/MUHLU IR opcodes
- Detecting high-word multiply-and-add (which common code does not)
Depending on architecture level, this can support all integer
vector types as well as the scalar i128 type.
Fixes: https://github.com/llvm/llvm-project/issues/129705
Generate efficient code using the condition code set by the
VECTOR (FP) COMPARE family of instructions to implement
vector comparison reductions, e.g. as resulting from
__builtin_reduce_and/or of some vector comparsion.
Fixes: https://github.com/llvm/llvm-project/issues/129434
When removing a redundant definition in order to reuse an earlier
identical one it is necessary to remove any earlier kill flag as well.
Previously, the assumption has been that any register that kills the
defined Reg is enough to handle for this purpose, but this is actually
not quite enough. A kill of a super-register does not necessarily imply
that all of its subregs (including Reg) is defined at that point: a
partial definition of a register is legal. This means Reg may have been
killed earlier and is not live at that point.
This patch changes the tracking of kill flags to allow for multiple
flags to be removed: instead of remembering just the single / latest
kill flag, a vector is now used to track and remove them all.
TinyPtrVector seems ideal for this as there are only very rarely more
than one kill flag, and it doesn't seem to give much difference in
compile time.
The kill flags handling here is making this pass much more complicated
than it would have to be. This pass does not depend on kill flags for
its own use, so an interesting alternative to all this handling would be
to just remove them all. If there actually is a serious user, maybe that pass
could instead recompute them.
Also adding an assertion which is unrelated to kill flags, but it seems
to make sense (according to liberal assertion policy), to verify that
the preceding definition is in fact identical in clearKillsForDef().
Fixes#117783
Use `-passes="regallocgreedy<[all|sgpr|wwm|vgpr]>` to insert the greedy
RA with a filter and `-regalloc-npm=<type>` to control which RA to use
in existing pipeline.
It seems that there can be other cases with this that also can lead to
wrong code (discovered with csmith). This time it involved not the kill
flag but the undef flag.
Use the intersection of the flags from both MachineOperand:s instead
of the RegState from just one of them.
This is straightforward as we already had all the necessary
instructions, they simply were not wired up.
Also allows implementing the vec_round intrinsic via the
standard llvm.roundeven IR instead of a platform intrinsic now.
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
This is needed for architectures that actually use strict pointer
arithmetic instead of integers such as AArch64 with FEAT_CPA (see
https://github.com/llvm/llvm-project/pull/105669) or CHERI. Using an
index as the first operand of pointer arithmetic may result in an
invalid output.
While there are quite a few codegen changes here, these only change the
order of registers in add instructions. One MIPS combine had to be
updated to handle the new node order.
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/125279
If both operands of a SELRMux use the same register which is killed, and
the SELRMux is expanded to a jump sequence, a broken MIR results if the
kill flag is not removed.
This patch replaces the SELRMux with a COPY in these cases.
I'm guessing based on tablegen definitions. I also don't
really understand how this could have been missing.
This defends against regressions in a future peephole-opt
patch.
We track definitions of stack objects, the implementation is identical
to tracking of registers.
Also, added printing of all found reaching definitions for testing
purposes.
---------
Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
The current MIR cycle sinking capabilities are rather limited. It only
support sinking copies into a single successor block while obeying
limits.
This opt-in feature adds a more aggressive option, that is not limited
to the above concerns. The feature will try to "sink" by duplicating any
top-level preheader instruction (that we are sure is safe to sink) into
any user block, then does some dead code cleanup. In particular, this is
useful for high RP situations when loop bodies have control flow.
We can only optimize a uaddo_carry via specialized instruction
if the carry was produced by another uaddo(_carry) instruction;
there is already a check for that.
However, i128 uaddo(_carry) use a completely different mechanism;
they indicate carry in a vector register instead of the CC flag.
Thus, we must also check that we don't mix those two - that check
has been missing.
Fixes: https://github.com/llvm/llvm-project/issues/124001
This patch adds support for the next-generation arch15
CPU architecture to the SystemZ backend.
This includes:
- Basic support for the new processor and its features.
- Detection of arch15 as host processor.
- Assembler/disassembler support for new instructions.
- Exploitation of new instructions for code generation.
- New vector (signed|unsigned|bool) __int128 data types.
- New LLVM intrinsics for certain new instructions.
- Support for low-level builtins mapped to new LLVM intrinsics.
- New high-level intrinsics in vecintrin.h.
- Indicate support by defining __VEC__ == 10305.
Note: No currently available Z system supports the arch15
architecture. Once new systems become available, the
official system name will be added as supported -march name.
This pull request modifies the behavior of the
`@llvm.experimental.stackmap` intrinsic to require that its two first
operands (`id` and `numShadowBytes`) be **immediate values**. This
change ensures that variables cannot be passed as two first arguments to
this intrinsic.
Related Issue: https://github.com/llvm/llvm-project/issues/115733
### Testing
- Added new test cases to ensure errors are emitted for non-immediate
operands.
- Ran the full LLVM test suite to verify no regressions were introduced.
The existing tests for constrained functions often use constant
arguments. If constant evaluation is enhanced, such tests will not check
code generation of the tested functions. To avoid it, the tests are
modified to use loaded value instead of constants. Now only the tests
for rounding functions are changed.
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `Builtins.td` - Add f16 support for libm atan2 builtin
- `CGBuiltin.cpp` - Emit constraint atan2 intrinsic for clang builtin
- `clang/test/CodeGenCXX/builtin-calling-conv.cpp` - Use erff instead of
atan2 for clang builtin to lib call calling convention check, now that
atan2 maps to an intrinsic.
- add atan2 cases to llvm.experimental.constrained tests for more
backends: ARM, PowerPC, RISCV, SystemZ.
- LangRef.rst: add llvm.experimental.constrained.atan2, revise
llvm.atan2 description.
Last part of Implement the atan2 HLSL Function. Fixes#70096.
A test case emerged with an i32 truncating store of an i64 constant
operand, where the i64 constant did not fit in 32 bits, which caused
FindReplicatedImm() to crash.
Make sure to truncate the APInt in these cases.
The lit test fmuladd-soft-float.ll only specifies s390x as platform,
but the test is Linux specific, causing problems when run on z/OS.
This change updates the triple to fix this.
When LiveRangeEdit::eliminateDeadDef() converts an MI to a KILL instruction,
it should also call dropMemRefs() in order to erase any MachineMemOperand
present.
This was discovered in testing as the MachineVerifier does not accept an MMO
without the corresponding MI mayLoad/mayStore flag, which the KILL opcode
lacks.
Now that the GNU and HLASM `InstPrinter` paths are separated in
https://github.com/llvm/llvm-project/pull/112975, differentiate between
them in `SystemZInstrFormats.td`.
The main difference are:
- Tabs converted to space
- Remove space after comma for instruction operands
---------
Co-authored-by: Tony Tao <tonytao@ca.ibm.com>
The previous behavior could be harmful in some edge cases, such as
emitting a call to `fma()` in the `fma()` implementation itself.
Do this by just being more accurate in `isFMAFasterThanFMulAndFAdd()`.
This was already done for PowerPC; this commit just extends that to Arm,
z/Arch, and x86. MIPS and SPARC already got it right, but I added tests
for them too, for good measure.
Note: I don't have commit access.
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
Due to recently reported problems with having the inlining threshold multiplier
set fairly high (x3), this patch removes the multiplier while addressing
the regressions seen by doing so in adjustInliningThreshold().
The specific cases that benefit from inlining that were now found to be in need
of handling contain a considerable number of memory accesses to the same
memory in both caller and callee.
A (csmith) test case appeared where combineExtract() crashed when the
input vector was a bitcast into a vector of i1:s. Fix this by adding a check
with canTreatAsByteVector() before the call.
For the purpose of verifying proper arguments extensions per the target's ABI,
introduce the NoExt attribute that may be used by a target when neither sign-
or zeroextension is required (e.g. with a struct in register). The purpose of
doing so is to be able to verify that there is always one of these attributes
present and by this detecting cases where sign/zero extension is actually
missing.
As a first step, this patch has the verification step done for the SystemZ
backend only, but left off by default until all known issues have been
addressed.
Other targets/front-ends can now also add NoExt attribute where needed and do
this check in the backend.
In the (zext (shl (zext x), cst)) -> (shl (zext x), cst) fold, don't use a bitmask / MaskedValueIsZero as we can't guarantee that the shift amount is in bounds.
Fixes#106202
The current MCInstBuilder for generating an ALGFI when loading something
from the ADA is incorrect and will crash the compiler.
r0 must also be excluded from the registers returned as the result,
since it is treated as the value "0" on z/OS.
Also add some tests to properly test the paths where LLILF and ALGFI are
generated.
---------
Co-authored-by: Tony Tao <tonytao@ca.ibm.com>
A test case showed up where the new vector type is v24i16, which is not a simple
MVT. In order to get an extended value type for cases like this, EVT::getVectorVT()
needs to be called instead of MVT::getVectorVT(), otherwise the following call
to getVectorElementType() in combineExtract() will fail.
Just like for regular IR we need to treat SELECT as conditionally
blocking poison in SelectionDAG. So (unless the condition itself is
poison) the result is only poison if the selected true/false value is
poison.
Thus, when doing DAG combines that turn SELECT into arithmetic/logical
operations (e.g. AND/OR) we need to make sure that the new operations
aren't more poisonous. One way to do that is to use FREEZE to make
sure the operands aren't posion.
This patch aims at fixing the kind of miscompiles reported in
https://github.com/llvm/llvm-project/issues/84653
and
https://github.com/llvm/llvm-project/issues/85190
Solution is to make sure that we insert FREEZE, if needed to make
the fold sound, when using the foldBoolSelectToLogic and
foldVSelectToSignBitSplatMask DAG combines.
The previous expansion of [US]CMP was done using two selects and two
compares. It produced decent code, but on many platforms it is better to
implement [US]CMP nodes by performing the following operation:
```
[us]cmp(x, y) = (x [us]> y) - (x [us]< y)
```
This patch adds this new expansion, as well as a hook in TargetLowering to allow some targets to still use the select-based approach. AArch64 and SystemZ are currently the only targets to prefer the former approach, but other targets may also start to use it if it provides for better codegen.
The combineSIGN_EXTEND_INREG routine was using
DAG.getConstant(-1, DL, VT), which does not result in
the expected value when VT has more than 64 bits.
Fix this by using DAG.getAllOnesConstant(DL, VT) instead.
Also add test cases for v1i128 comparisons (which triggers
the bug).
Previously we had the same instructions being generated for `ISD::CTLZ` and `ISD::CTLZ_ZERO_UNDEF` which did not take advantage of the fact that zero is an invalid input for `ISD::CTLZ_ZERO_UNDEF`. This commit separates codegen for the two cases to allow for the optimization for the latter case.
The details of the optimization are outlined in #82075Fixes#82075
Co-authored-by: Manish Kausik H <hmamishkausik@gmail.com>