In this commit:
(1) Added new pass manager support for `ReachingDefAnalysis`.
(2) Added printer pass.
(3) Made the old pass manager use `ReachingDefInfoWrapperPass`.
Turn a funnel shift by N in the range `121..128` into a funnel shift in
the opposite direction by `128 - N`. Because there are dedicated
instructions for funnel shifts by values smaller than 8, this emits
fewer instructions.
This additional rule is useful because LLVM appears to canonicalize
`fshr` into `fshl`, meaning that the rules for `fshr` on values less
than 8 would not match on organic input.
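The direction swap can be checked with a small model of the two funnel-shift operations. This is only a sketch in plain Python integers, assuming 128-bit operands and the LangRef definition of `fshl`/`fshr` (shift amount taken modulo the width); the helper names are made up for illustration and are not part of the patch.

```python
WIDTH = 128
MASK = (1 << WIDTH) - 1


def fshl(hi: int, lo: int, n: int) -> int:
    """Funnel shift left: the top WIDTH bits of (hi:lo) shifted left by n."""
    hi, lo, n = hi & MASK, lo & MASK, n % WIDTH
    if n == 0:
        return hi
    return ((hi << n) | (lo >> (WIDTH - n))) & MASK


def fshr(hi: int, lo: int, n: int) -> int:
    """Funnel shift right: the low WIDTH bits of (hi:lo) shifted right by n."""
    hi, lo, n = hi & MASK, lo & MASK, n % WIDTH
    if n == 0:
        return lo
    return ((hi << (WIDTH - n)) | (lo >> n)) & MASK


def opposite_amount(n: int) -> int:
    """Shift amount to use after flipping the direction (hypothetical helper)."""
    return WIDTH - n


A = 0x0123456789ABCDEF0011223344556677
B = 0x8899AABBCCDDEEFFFEDCBA9876543210
```

For every amount strictly between 0 and the width, `fshl(A, B, n)` and `fshr(A, B, opposite_amount(n))` agree, and for `n` in `121..127` the flipped amount is below 8.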
Simplify min/max instruction matching by making the related
SelectionDAG operations legal.
Add patterns to match (signed and unsigned) saturated
truncation based on open-coded min/max patterns.
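As a rough illustration of what "open-coded min/max patterns" means here, the following sketch clamps a value to the range of the narrower type before truncating. It models the semantics only, in plain Python integers; the function names are hypothetical and do not correspond to anything in the patch.

```python
def trunc_ssat(x: int, bits: int) -> int:
    """Signed saturating truncation: smin(smax(x, MIN), MAX) for a bits-wide result."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return min(max(x, lo), hi)


def trunc_usat(x: int, bits: int) -> int:
    """Unsigned saturating truncation: umin(x, MAX) for a non-negative input."""
    if x < 0:
        raise ValueError("unsigned input expected")
    return min(x, (1 << bits) - 1)
```

After the clamp, the value fits in the narrow type, so the truncation itself no longer loses information.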
Fixes https://github.com/llvm/llvm-project/issues/153655
Currently, when an instruction rematerialized by the register coalescer
defines more subregs of the destination register
than the original COPY instruction did, we only add dead defs for the
newly defined subregs if they were not defined anywhere
else. For example, consider something like this before
rematerialization:
```
%0:reg64 = CONSTANT 1
%1:reg128.sub_lo64_lo32 = COPY %0.lo32
%1:reg128.sub_lo64_hi32 = ...
...
```
that would look like this after rematerializing `%0`:
```
%0:reg64 = CONSTANT 1
%1:reg128.sub_lo64 = CONSTANT 1
%1:reg128.sub_lo64_hi32 = ...
...
```
A dead def would not be added for `%1.sub_lo64_hi32` at the 2nd
instruction because its subrange wasn't empty beforehand.
The ZOS run line is mostly broken. update_test_checks seems
to not work on it and I have no idea what I'm looking at here.
It's not obvious to me what the calls are. I added some checks
for the references to the libcalls printed at the end of the module,
but didn't check anything in the function body. half also just
asserts somewhere.
Commit cdc7864 has an error which would wrongly fold widening
multiplications into an even/odd widening operation.
This PR fixes it and adds tests checking that scenarios which should not
be folded into an even/odd widening operation are indeed not folded.
This is a partial revert of #145939 (I've kept the BUILD_VECTOR(FREEZE(UNDEF), FREEZE(UNDEF), elt2, ...) canonicalization) as we're getting reports of infinite loops (#148084).
The issue appears to be due to deep chains of nodes and how visitFREEZE replaces all instances of an operand with a common frozen version - other users of the original frozen node then get added back to the worklist but might no longer be able to confirm a node isn't poison due to recursion depth limits on isGuaranteedNotToBeUndefOrPoison.
The issue still exists with the old implementation but by only allowing a single frozen operand it helps prevent cases of interdependent frozen nodes.
I'm still working on supporting multiple operands as it's critical for topological DAG handling but need to get a fix in for trunk and 21.x.
Fixes #148084
Many tests for floating point libcalls include CFI directives, which
aren't needed for the purpose of these tests. Mark some of the relevant
test functions `nounwind` in order to remove this noise.
Use `emitValueToAlignment` as the section does not contain code.
`emitCodeAlignment` would lead to ALIGN relocations on RISC-V and
LoongArch with linker relaxation.
In addition, change the alignment to wordsize, sufficient for the
runtime requirement (`XRayFunctionSledIndex`).
Related to #147322
This PR takes the work previously done by @pawan-nirpal-031 on X86 in
#106370, and makes it available in common code. This should enable all
targets to use `__builtin_canonicalize` for all `f(16|32|64|128)` data
types.
Canonicalization is implemented here as multiplication by `1.0`, as
suggested in [the
docs](https://llvm.org/docs/LangRef.html#llvm-canonicalize-intrinsic).
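A very small model of the "multiply by `1.0`" idea, using Python floats (IEEE 754 binary64) as a stand-in. This only shows that the multiplication leaves ordinary values, signed zeros, infinities and NaN-ness unchanged; it does not demonstrate quieting of signaling NaNs, which depends on the hardware. The function name is made up for illustration.

```python
import math


def canonicalize_by_multiply(x: float) -> float:
    """Model of canonicalization implemented as x * 1.0."""
    return x * 1.0
```

Note that the sign of zero survives, since `-0.0 * 1.0` is `-0.0`.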
An HLASM source file must end with the END instruction. Emitting it is
implemented by adding a new function to the target streamer. This change also turns
SystemZHLASMSAsmString.h into a proper header file, and only uses the
SystemZTargetHLASMStreamer when HLASM output is generated.
This PR resolves https://github.com/llvm/llvm-project/issues/144513
The modification includes five patterns:
1. vselect Cond, 0, 0 → 0
2. vselect Cond, -1, 0 → bitcast Cond
3. vselect Cond, -1, x → or Cond, x
4. vselect Cond, x, 0 → and Cond, x
5. vselect Cond, 000..., X → andn Cond, X
Patterns 1-4 have been migrated to DAGCombine; pattern 5 remains in the x86 code.
The reason is that you cannot use the andn instruction directly in
DAGCombine, you can only use and+xor, which will introduce optimization
order issues. For example, in the x86 backend, select Cond, 0, x →
(~Cond) & x, the backend will first check whether the cond node of
(~Cond) is a setcc node. If so, it will modify the comparison operator
of the condition. So the x86 backend cannot complete the optimization of
andn. In short, I think it is a better choice to keep the pattern of
vselect Cond, 000..., X instead of and+xor in DAGCombine.
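The five folds can be sanity-checked lane by lane. The sketch below assumes that every lane of `Cond` is either all-ones or zero (the usual vector boolean representation on x86) and models vectors as Python lists of 32-bit integers; all helper names are hypothetical.

```python
LANE_BITS = 32
ALL_ONES = (1 << LANE_BITS) - 1


def vselect(cond, t, f):
    """Per-lane select: take t where the cond lane is all-ones, else f."""
    return [tv if c == ALL_ONES else fv for c, tv, fv in zip(cond, t, f)]


def splat(value, n):
    return [value] * n


def fold_or(cond, x):
    return [c | v for c, v in zip(cond, x)]


def fold_and(cond, x):
    return [c & v for c, v in zip(cond, x)]


def fold_andn(cond, x):
    # andn Cond, X computes (~Cond) & X
    return [(~c & ALL_ONES) & v for c, v in zip(cond, x)]


COND = [ALL_ONES, 0, ALL_ONES, 0]
X = [0x11, 0x22, 0x33, 0x44]
ZERO = splat(0, 4)
ONES = splat(ALL_ONES, 4)
```

With these inputs, each `vselect` form produces the same lanes as its folded counterpart, e.g. `vselect(COND, ZERO, X)` equals `fold_andn(COND, X)`.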
As for the commits: the first contains the code changes and x86 tests
(note 1), the second contains tests for other backends (note 2).
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Always try to fold freeze(op(....)) -> op(freeze(),freeze(),freeze(),...).
This patch proposes we drop the opt-in limit for opcodes that are allowed to push a freeze through the op to freeze all its operands, through the tree towards the roots.
I'm struggling to find a strong reason for this limit apart from the DAG freeze handling being immature for so long - as we've improved coverage in canCreateUndefOrPoison/isGuaranteedNotToBeUndefOrPoison it looks like the regressions are not as severe.
Hopefully this will help some of the regression issues in #143102 etc.
The insertion point of COPY isn't always optimal and could eventually
lead to a worse block layout, see the regression test in the first
commit.
This change affects many architectures but the total number of
instructions in the test cases seems to be slightly lower.
Sections which are not allowed to carry data are marked as virtual. The only
complication when writing out the text is that it must be written in
chunks of 32k-1 bytes, which is done by having a wrapper stream write
those records.
Data of BSS sections is not written, since the contents are known to be
zero. Instead, the fill byte value is used.
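The chunking step can be pictured as follows. This is a simplified sketch that only splits a byte string into pieces of at most 32k-1 bytes; it does not model the actual record headers or the wrapper stream class, and the function name is invented for illustration.

```python
from typing import List

MAX_TEXT_CHUNK = 32 * 1024 - 1  # 32767 bytes per text record payload


def split_into_text_chunks(data: bytes, max_chunk: int = MAX_TEXT_CHUNK) -> List[bytes]:
    """Split section contents into consecutive chunks of at most max_chunk bytes."""
    return [data[off:off + max_chunk] for off in range(0, len(data), max_chunk)]
```

For example, 70000 bytes of section data would be written as three records carrying 32767, 32767 and 4466 bytes.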
Unlike other formats, the GOFF object file format uses a two-dimensional structure
to define the location of data. For example, the equivalent of the ELF .text
section is made up of a Section Definition (SD) and a class (Element Definition;
ED). The name of the SD symbol depends on the application, while the class has
the predefined name C_CODE/C_CODE64 in AMODE31 and AMODE64 respectively.
Data can be placed into this structure in 2 ways. First, the data (in a text
record) can be associated with an ED symbol. To refer to data, a Label
Definition (LD) is used to give an offset into the data a name. When binding,
the whole data is pulled into the resulting executable, and the addresses
given by the LD symbols are resolved.
The alternative is to use a Part Definition (PR). In this case, the data (in
a text record) is associated with the part. When binding, only the data of
referenced PRs is pulled into the resulting binary.
Both approaches are used. SD, ED, and PR elements are modeled by nested
MCSectionGOFF instances, while LD elements are associated with MCSymbolGOFF
instances.
At the binary level, a record called "External Symbol Definition" (ESD) is used. The
ESD has a type (SD, ED, PR, LD), and depending on the type a different subset of
the fields is used.
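The nesting described above can be sketched as a tiny data model. This is not the `MCSectionGOFF`/`MCSymbolGOFF` API; it is a hypothetical Python illustration in which every ESD entry has a type and an optional parent, and all symbol names other than `C_CODE64` are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ESDType(Enum):
    SD = "Section Definition"
    ED = "Element Definition"
    PR = "Part Definition"
    LD = "Label Definition"


@dataclass
class ESD:
    kind: ESDType
    name: str
    parent: Optional["ESD"] = None
    offset: int = 0  # only meaningful for LD entries in this sketch

    def path(self) -> str:
        """Render the chain of owners, outermost first."""
        prefix = self.parent.path() + "/" if self.parent else ""
        return f"{prefix}{self.kind.name}:{self.name}"


# Code: data lives in the ED, and an LD names an offset into it.
sd = ESD(ESDType.SD, "EXAMPLE_SD")            # placeholder SD name
code = ESD(ESDType.ED, "C_CODE64", sd)
label = ESD(ESDType.LD, "example_fn", code, offset=0x40)

# Part: data is attached to the PR itself.
data_class = ESD(ESDType.ED, "EXAMPLE_CLASS", sd)  # placeholder class name
part = ESD(ESDType.PR, "example_part", data_class)
```

Here `label.path()` yields `SD:EXAMPLE_SD/ED:C_CODE64/LD:example_fn`, which mirrors the "SD, then class, then label" addressing described in the text.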
This patch fixes an error in which `FAKE_USE` instructions would trigger
an assertion in SystemZLongBranch due to them having a size of 0 without
being excepted in the assertion that each instruction, other than a set
of known 0-size instruction types, should have a non-0 size.
`FAKE_USE` instructions are no-op instructions that are emitted into
LLVM by the `-fextend-variable-liveness` clang flag to help preserve the
liveness of source variables in optimized code, and therefore they
should be understood as being valid size 0 instructions.
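The shape of the check can be modeled roughly like this. It is a Python sketch of the assertion logic only, not the code in SystemZLongBranch; the opcode set shown is an illustrative subset, not the real list.

```python
# Illustrative subset of opcodes that are allowed to have size 0.
KNOWN_ZERO_SIZE_OPCODES = {"IMPLICIT_DEF", "FAKE_USE"}


def checked_inst_size(opcode: str, size: int) -> int:
    """Return size, asserting that only known opcodes may report a size of 0."""
    assert size != 0 or opcode in KNOWN_ZERO_SIZE_OPCODES, (
        f"unexpected zero-size instruction: {opcode}"
    )
    return size


def rejects(opcode: str, size: int) -> bool:
    """Helper for experimenting: True if the check fires for this input."""
    try:
        checked_inst_size(opcode, size)
    except AssertionError:
        return True
    return False
```

Without `FAKE_USE` in the set, `checked_inst_size("FAKE_USE", 0)` would fire the assertion, which is the failure mode this patch removes.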
If there's an implicit-def of a super register, the propagation
must preserve this implicit-def. Knowing how and when to do this
may require target specific knowledge so just disable it for now.
Prior to 2def1c4, we checked that the copy had exactly 2 operands;
when that check was removed, we started allowing implicit operands through.
This patch adds a check for implicit operands, but still allows
extra explicit operands which was the goal of 2def1c4.
Fixes #131478.
Commit `083b4a3d66` introduced a store-and-load pair around the `BRASL`
call to mcount. That load instruction did not properly declare its
target register as defined, leading to a bad machine instruction.
This commit fixes this by explicitly labeling `%r14` on the load as
`def`.
When compiling with `-pg`, the `EntryExitInstrumenterPass` will insert
calls to the glibc function `mcount` at the beginning of each
`MachineFunction`.
On SystemZ, these calls require special handling:
- The call to `mcount` needs to happen at the beginning of the prologue.
- Prior to the call to `mcount`, register `%r14`, the return address of
the callee function, must be stored 8 bytes above the stack pointer
`%r15`. After the call to `mcount` returns, that register needs to be
restored.
This commit adds some special handling to the EntryExitInstrumenterPass
that keeps the insertion of the mcount function into the module, but
skips over insertion of the actual call in order to perform this
insertion in the `emitPrologue` function. There, a simple sequence of
store/call/load is inserted, which implements the above.
The desired change in the `EntryExitInstrumenterPass` necessitated the
addition of a new attribute and attribute kind to each function, which
is used to trigger the postprocessing, aka call insertion, in
`emitPrologue`. Note that the new attribute must be of a different kind
than the `mcount` attribute, since otherwise it would replace that
attribute and later be deleted by the code that intended to delete
`mcount`. The new attribute is called `insert-mcount`, while the
attribute kind is `systemz-backend`, to clearly mark it as a
SystemZ-specific backend concern.
This PR should address issue #121137. The test inserted here is derived
from the example given in that issue.
EH landing pad entry implicitly clobbers target-specific exception
pointer and exception selector registers. The post-RA MachineLICM pass
needs to take these into account when deciding whether to hoist an
instruction out of the loop that initializes one of these registers.
Fixes: https://github.com/llvm/llvm-project/issues/122315
Add a DAGCombine for FCOPYSIGN that removes the rounding which is never
needed as the sign bit is already in the correct place. This helps in particular the
rounding to f16 case which needs a libcall.
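Why the rounding is redundant can be seen with a small experiment: rounding the sign operand to a narrower format never changes its sign bit, so copysign gives the same answer either way. The sketch below uses Python doubles and a round trip through `struct` to emulate rounding to f32 (not f16, and not the DAG nodes themselves); the helper names are made up.

```python
import math
import struct


def round_to_f32(x: float) -> float:
    """Round a double to the nearest f32 value (input must be within f32 range)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def copysign_with_round(mag: float, sign_src: float) -> float:
    """copysign where the sign operand is first rounded to a narrower type."""
    return math.copysign(mag, round_to_f32(sign_src))


def copysign_direct(mag: float, sign_src: float) -> float:
    """copysign reading the sign bit straight from the wide operand."""
    return math.copysign(mag, sign_src)


SIGN_SOURCES = [3.25, -3.25, 0.0, -0.0, -1e-60, 1e30, -1e30, float("-inf")]
```

Even `-1e-60`, which underflows to `-0.0` when rounded to f32, keeps its sign, so both functions return the same result for every entry in `SIGN_SOURCES`.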
Also remove the roundings for other FP VTs and simplify the CPSDR
patterns correspondingly.
fp-copysign-03.ll test updated, now also covering the other FP VT
combinations.
- _Float16 is now accepted by Clang.
- The half IR type is fully handled by the backend.
- These values are passed in FP registers and converted to/from float around
each operation.
- Compiler-rt conversion functions are now built for s390x including the missing
extendhfdf2 which was added.
Fixes #50374
AsmPrinter may switch the current section when e.g., emitting a jump
table for a switch. `.stack_sizes` should still be linked to the
function section. If the section is wrong, readelf emits a warning
"relocation symbol is not in the expected section".
Previously `vst` and `vl` were not considered "simple" BDX stores and
loads, leading to, among other things, some opportunities for `mvc`
optimization to be missed.
This PR addresses this and updates some tests to account for additional
`mvc` instructions being emitted.
This is observed to have a neutral or slightly beneficial effect
performance-wise.
The recently announced IBM z17 processor implements the architecture
already supported as "arch15" in LLVM. This patch adds support for "z17"
as an alternate architecture name for arch15.
This patch also adds the scheduler description for the z17 processor,
provided by Jonas Paulsson.
Due to some optimization changes, INIT_UNDEF is making its way to
`getInstSizeInBytes` in `llvm/lib/Target/SystemZ/SystemZLongBranch.cpp`
but we do not have an exception there in the assert. Since INIT_UNDEF is
described as being similar to IMPLICIT_DEF and there is a check for
IMPLICIT_DEF, it seems logical to also add a check for INIT_UNDEF.
---------
Co-authored-by: Tony Tao <tonytao@ca.ibm.com>
This PR instruments the optimization passes in the SystemZ backend with
calls to `MachineFunction::substituteDebugValuesForInst` where
instruction substitutions are made to instructions that may compute
tracked values.
Tests are also added for each of the substitutions that were inserted.
Details on the individual passes follow.
### systemz-copy-physregs
When a copy targets an access register, we redirect the copy via an
auxiliary register. This leads to the final result being written by a
newly inserted SAR instruction, rather than the original MI, so we need
to update the debug value tracking to account for this.
### systemz-long-branch
This pass relaxes relative branch instructions based on the actual
locations of blocks. Only one of the branch instructions qualifies for
debug value tracking: BRCT, i.e. branch-relative-on-count, which
subtracts 1 from a register and branches if the result is not zero. This
is relaxed into an add-immediate and a conditional branch, so any
`debug-instr-number` present must move to the add-immediate instruction.
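A simplified model of this relaxation is sketched below. It treats instructions as small Python records and simply carries the number over to the add-immediate; the real pass records the change via `MachineFunction::substituteDebugValuesForInst`, and the opcode strings here are placeholders rather than actual SystemZ opcodes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MASK64 = (1 << 64) - 1


@dataclass
class Inst:
    opcode: str
    debug_instr_number: Optional[int] = None


def brct_semantics(counter: int) -> Tuple[int, bool]:
    """Branch-relative-on-count: decrement, branch if the result is not zero."""
    new = (counter - 1) & MASK64
    return new, new != 0


def relaxed_semantics(counter: int) -> Tuple[int, bool]:
    """Add-immediate of -1 followed by a conditional branch on non-zero."""
    new = (counter + (-1)) & MASK64
    return new, new != 0


def relax_brct(inst: Inst) -> List[Inst]:
    """Split the branch; the value-producing add-immediate inherits the number."""
    add = Inst("ADD_IMMEDIATE", inst.debug_instr_number)
    branch = Inst("COND_BRANCH")
    return [add, branch]
```

The counter value is produced by the add-immediate, which is why that instruction, and not the new branch, is the one that must be associated with the tracked value.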
### systemz-post-rewrite
This pass replaces `LOCRMux` and `SELRMux` pseudoinstructions with
either the real versions of those instructions, or with branching
programs that implement the intent of the Pseudo. In all these cases,
any `debug-instr-number` attached to the pseudo needs to be reallocated
to the appropriate instruction in the result, either LOCR, SELR, or a
COPY.
### systemz-elim-compare
Similar to systemz-long-branch, only a few substitutions are necessary
for this pass, since it mainly deals with conditional branch
instructions. The only exceptions are again branch-relative-on-count, as
it modifies a counter as part of the instruction, as well as any of the
load instructions that are affected.
As part of an effort to enable instr-ref-based debug value tracking,
this PR implements `SystemZInstrInfo::isLoadFromStackSlotPostFE`, as
well as `SystemZInstrInfo::isStoreToStackSlotPostFE`. The implementation
relies upon the presence of MachineMemoryOperands on the relevant
`MachineInstr`s in order to access the `FrameIndex` post frame index
elimination.
Since these new functions are only meant to be called after frame-index
elimination, they assert against the presence of a frame index on the
base register operand of the instruction.
Outside of the utility of these functions to enable instr-ref-based
debug value tracking, they also change the behavior of the AsmPrinter,
since it will now be able to properly detect non-folded spills and
reloads, so this changes a number of tests that were checking
specifically for folded reloads.
Note that there are some tests that still check for `vst` and `vl` as
folded spills/reloads even though they should be straight reloads. This
will be addressed in a future PR.
Co-authored-by: Dominik Steenken <dominik.steenken@gmail.com>
Generate code using the VECTOR ADD COMPUTE CARRY and
VECTOR SUBTRACT COMPUTE BORROW INDICATION instructions
to implement open-coded IR with those semantics.
Handles integer vector types as well as i128.
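One common open-coded form of "compute carry" is the unsigned comparison `(a + b) < a` on the wrapped sum. The sketch below checks that form against the true carry-out for a single element, in plain Python integers; it covers the add/carry case only, and the function names are hypothetical.

```python
def carry_open_coded(a: int, b: int, bits: int) -> int:
    """Carry as typically written in IR: the wrapped sum is smaller than an operand."""
    mask = (1 << bits) - 1
    wrapped = (a + b) & mask
    return 1 if wrapped < a else 0


def carry_reference(a: int, b: int, bits: int) -> int:
    """Carry-out of the full-precision addition."""
    return (a + b) >> bits


I128_MAX = (1 << 128) - 1
SAMPLES = [(0xFFFFFFFF, 1, 32), (1, 2, 32), (0x80, 0x80, 8), (I128_MAX, I128_MAX, 128), (I128_MAX, 0, 128)]
```

Both functions agree on all of `SAMPLES`, including the 128-bit cases that correspond to the i128 handling mentioned above.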
Fixes: https://github.com/llvm/llvm-project/issues/129608
Generate more efficient code for zero or sign extensions where
the source is a subvector generated via SHUFFLE_VECTOR.
Specifically, recognize patterns corresponding to (series of)
VECTOR UNPACK instructions, or the VECTOR SIGN EXTEND TO
DOUBLEWORD instruction.
As a special case, also handle zero or sign extensions of a
vector element to i128.
Fixes: https://github.com/llvm/llvm-project/issues/129576
Fixes: https://github.com/llvm/llvm-project/issues/129899
Detect (non-intrinsic) IR patterns corresponding to the semantics
of the various widening and high-word multiplication instructions.
Specifically, this is done by:
- Recognizing even/odd widening multiplication patterns in DAGCombine
- Recognizing widening multiply-and-add on top during ISel
- Implementing the standard MULHS/MULHU IR opcodes
- Detecting high-word multiply-and-add (which common code does not)
Depending on architecture level, this can support all integer
vector types as well as the scalar i128 type.
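The semantics being matched can be written down compactly. This is a sketch in plain Python integers, with elements given as unsigned bit patterns; it models what the operations compute, not how the patterns are recognized, and the function names are invented for illustration.

```python
from typing import List


def to_signed(x: int, bits: int) -> int:
    """Interpret a bits-wide pattern as a two's complement value."""
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def mulhu(a: int, b: int, bits: int) -> int:
    """High half of the unsigned double-width product."""
    return ((a * b) >> bits) & ((1 << bits) - 1)


def mulhs(a: int, b: int, bits: int) -> int:
    """High half of the signed double-width product, as a bit pattern."""
    product = to_signed(a, bits) * to_signed(b, bits)
    return (product >> bits) & ((1 << bits) - 1)


def widening_mul_even(a: List[int], b: List[int]) -> List[int]:
    """Double-width products of the even-indexed elements."""
    return [a[i] * b[i] for i in range(0, len(a), 2)]


def widening_mul_odd(a: List[int], b: List[int]) -> List[int]:
    """Double-width products of the odd-indexed elements."""
    return [a[i] * b[i] for i in range(1, len(a), 2)]
```

For instance, `mulhu(0xFFFFFFFF, 0xFFFFFFFF, 32)` is `0xFFFFFFFE`, while `mulhs` on the same bit patterns is `0`, since both operands are then read as -1.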
Fixes: https://github.com/llvm/llvm-project/issues/129705
Generate efficient code using the condition code set by the
VECTOR (FP) COMPARE family of instructions to implement
vector comparison reductions, e.g. as resulting from
__builtin_reduce_and/or of some vector comparison.
Fixes: https://github.com/llvm/llvm-project/issues/129434
When removing a redundant definition in order to reuse an earlier
identical one it is necessary to remove any earlier kill flag as well.
Previously, the assumption has been that any register that kills the
defined Reg is enough to handle for this purpose, but this is actually
not quite enough. A kill of a super-register does not necessarily imply
that all of its subregs (including Reg) are defined at that point: a
partial definition of a register is legal. This means Reg may have been
killed earlier and is not live at that point.
This patch changes the tracking of kill flags to allow for multiple
flags to be removed: instead of remembering just the single / latest
kill flag, a vector is now used to track and remove them all.
TinyPtrVector seems ideal for this as there is only very rarely more
than one kill flag, and it doesn't seem to make much difference in
compile time.
The kill flags handling here is making this pass much more complicated
than it would have to be. This pass does not depend on kill flags for
its own use, so an interesting alternative to all this handling would be
to just remove them all. If there actually is a serious user, maybe that pass
could instead recompute them.
Also adding an assertion which is unrelated to kill flags, but it seems
to make sense (according to liberal assertion policy), to verify that
the preceding definition is in fact identical in clearKillsForDef().
Fixes #117783
Use `-passes="regallocgreedy<[all|sgpr|wwm|vgpr]>"` to insert the greedy
RA with a filter and `-regalloc-npm=<type>` to control which RA to use
in the existing pipeline.
It seems that there can be other cases with this that also can lead to
wrong code (discovered with csmith). This time it involved not the kill
flag but the undef flag.
Use the intersection of the flags from both MachineOperands instead
of the RegState from just one of them.
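The idea of intersecting flags can be illustrated with a bitmask. This is a hypothetical Python model, not LLVM's `RegState` enum: a flag survives on the merged operand only if both original operands carried it.

```python
from enum import IntFlag


class OperandFlags(IntFlag):
    """Hypothetical stand-in for per-operand register state flags."""
    NONE = 0
    KILL = 1
    UNDEF = 2


def merged_flags(a: OperandFlags, b: OperandFlags) -> OperandFlags:
    """Keep only the flags that are set on both operands."""
    return a & b
```

Taking the flags from just one operand could, for example, keep `UNDEF` on a register that the other operand actually reads as defined; the intersection drops it.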