There can only be meaningful aliasing between the memory accesses of
different instructions if at least one of the accesses modifies memory.
This check is applied at the instruction-level earlier in the method.
This change merely extends the check on a per-MMO basis.
This affects a SystemZ test because PFD instructions are both mayLoad
and mayStore but may carry a load-only MMO which is now no longer
treated as aliasing loads. The PFD instructions are from llvm.prefetch
generated by loop-data-prefetch.
This is consistent with other promotion, but causes negative constants
to be sign extended instead of zero extended in some cases.
I guess getNode and type legalizer are inconsistent about what
ANY_EXTEND of a constant does.
Linux kernel build fails for SystemZ as output of INLINEASM was GR32Bit
general-purpose register instead of SystemZ::CC.
---------
Co-authored-by: anoopkg6 <anoopkg6@github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Added Support for flag output operand "=@cc", inline assembly constraint
for
SystemZ.
- Clang now accepts "=@cc" assembly operands, and sets 2-bits condition
code
for output operand for SyatemZ.
- Clang currently emits an assertion that flag output operands are
boolean
values, i.e. in the range [0, 2). Generalize this mechanism to allow
targets to specify arbitrary range assertions for any inline assembly
output operand. This will be used to assert that SystemZ two-bit
condition-code values are in the range [0, 4).
- SystemZ backend lowers "@cc" targets by using ipm sequence to extract
condition code from PSW.
- DAGCombine tries to optimize lowered ipm sequence by combining
CCReg and computing effective CCMask and CCValid in combineCCMask for
select_ccmask and br_ccmask.
- Cost computation is done for merging conditionals for branch
instruction
in SelectionDAG, as split may cause branches conditions evaluation goes
across basic block and difficult to combine.
---------
Co-authored-by: anoopkg6 <anoopkg6@github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
In the register allocator we define non-trivial rematerialization as the
rematerlization of an instruction with virtual register uses.
We have been able to perform non-trivial rematerialization for a while,
but it has been prevented by default unless specifically overriden by
the target in `TargetTransformInfo::isReMaterializableImpl`. The
original reasoning for this given by the comment in the default
implementation is because we might increase a live range of the virtual
register, but we don't actually do this.
LiveRangeEdit::allUsesAvailableAt makes sure that we only rematerialize
instructions whose virtual registers are already live at the use sites.
https://reviews.llvm.org/D106408 had originally tried to remove this
restriction but it was reverted after some performance regressions were
reported. We think it is likely that the regressions were caused by the
fact that the old isTriviallyReMaterializable API sometimes returned
true for non-trivial rematerializations.
However https://github.com/llvm/llvm-project/pull/160377 recently split
the API out into a separate non-trivial and trivial version and updated
the call-sites accordingly, and
https://github.com/llvm/llvm-project/pull/160709 and #159180 fixed
heuristics which weren't accounting for the difference between
non-trivial and trivial.
With these fixes in place, this patch proposes to again allow
non-trivial rematerialization by default which reduces a significant
amount of spills and reloads across various targets.
For llvm-test-suite built with -O3 -flto, we get the following geomean
reduction in reloads:
- arm64-apple-darwin: 11.6%
- riscv64-linux-gnu: 8.1%
- x86_64-linux-gnu: 6.5%
For subregister copies, do a subregister live check instead of checking
the main range. Doesn't do much yet, the split analysis still does not
track live ranges.
In this commit:
(1) Added new pass manager support for `ReachingDefAnalysis`.
(2) Added printer pass.
(3) Make old pass manager use `ReachingDefInfoWrapperPass`
Turn a funnel shift by N in the range `121..128` into a funnel shift in
the opposite direction by `128 - N`. Because there are dedicated
instructions for funnel shifts by values smaller than 8, this emits
fewer instructions.
This additional rule is useful because LLVM appears to canonicalize
`fshr` into `fshl`, meaning that the rules for `fshr` on values less
than 8 would not match on organic input.
Simplify min/max instruction matching by making the related
SelectionDAG operations legal.
Add patterns to match (signed and unsigned) saturated
truncation based on open-coded min/max patterns.
Fixes https://github.com/llvm/llvm-project/issues/153655
Currently, when an instruction rematerialized by the register coalescer
defines more subregs of the destination register
than the original COPY instruction did, we only add dead defs for the
newly defined subregs if they were not defined anywhere
else. For example, consider something like this before
rematerialization:
```
%0:reg64 = CONSTANT 1
%1:reg128.sub_lo64_lo32 = COPY %0.lo32
%1:reg128.sub_lo64_hi32 = ...
...
```
that would look like this after rematerializing `%0`:
```
%0:reg64 = CONSTANT 2
%1:reg128.sub_lo64 = CONSTANT 2
%1:reg128.sub_lo64_hi32 = ...
...
```
A dead def would not be added for `%1.sub_lo64_hi32` at the 2nd
instruction because it's subrange wasn't empty beforehand.
The ZOS run line is mostly broken. update_test_checks seems
to not work on it and I have no idea what I'm looking at here.
It's not obvious to me what the calls are. I added some checks
for the references to the libcalls printed at the end of the module,
but didn't check anything in the function body. half also just
asserts somewhere.
Commit cdc7864 has an error which would wrongly fold widening
multiplications into an even/odd widening operation.
This PR fixes it and adds tests to check scenarios which should not be
folded into an even/odd widening operation are actually not.
This is a partial revert of #145939 (I've kept the BUILD_VECTOR(FREEZE(UNDEF), FREEZE(UNDEF), elt2, ...) canonicalization) as we're getting reports of infinite loops (#148084).
The issue appears to be due to deep chains of nodes and how visitFREEZE replaces all instances of an operand with a common frozen version - other users of the original frozen node then get added back to the worklist but might no longer be able to confirm a node isn't poison due to recursion depth limits on isGuaranteedNotToBeUndefOrPoison.
The issue still exists with the old implementation but by only allowing a single frozen operand it helps prevent cases of interdependent frozen nodes.
I'm still working on supporting multiple operands as its critical for topological DAG handling but need to get a fix in for trunk and 21.x.
Fixes#148084
Many tests for floating point libcalls include CFI directives, which
isn't needed for the purpose of these tests. Mark some of the relevant
test functions `nounwind` in order to remove this noise.
Use `emitValueToAlignment` as the section does not contain code.
`emitCodeAlignment` would lead to ALIGN relocations on RISC-V and
LoongArch with linker relaxation.
In addition, change the alignment to wordsize, sufficient for the
runtime requirement (`XRayFunctionSledIndex`).
Related to #147322
This PR takes the work previously done by @pawan-nirpal-031 on X86 in
#106370, and makes it available in common code. This should enable all
targets to use `__builtin_canonicalize` for all `f(16|32|64|128)` data
types.
Canonicalization is implemented here as multiplication by `1.0`, as
suggested in [the
docs](https://llvm.org/docs/LangRef.html#llvm-canonicalize-intrinsic).
A HLASM source file must end with the END instruction. It is implemented
by adding a new function to the target streamer. This change also turns
SystemZHLASMSAsmString.h into a proper header file, and only uses the
SystemZTargetHLASMStreamer when HLASM output is generated.
This PR resolves https://github.com/llvm/llvm-project/issues/144513
The modification include five pattern :
1.vselect Cond, 0, 0 → 0
2.vselect Cond, -1, 0 → bitcast Cond
3.vselect Cond, -1, x → or Cond, x
4.vselect Cond, x, 0 → and Cond, x
5.vselect Cond, 000..., X -> andn Cond, X
1-4 have been migrated to DAGCombine. 5 still in x86 code.
The reason is that you cannot use the andn instruction directly in
DAGCombine, you can only use and+xor, which will introduce optimization
order issues. For example, in the x86 backend, select Cond, 0, x →
(~Cond) & x, the backend will first check whether the cond node of
(~Cond) is a setcc node. If so, it will modify the comparison operator
of the condition.So the x86 backend cannot complete the optimization of
andn.In short, I think it is a better choice to keep the pattern of
vselect Cond, 000..., X instead of and+xor in combineDAG.
For commit, the first is code changes and x86 test(note 1), the second
is tests in other backend(node 2).
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Always try to fold freeze(op(....)) -> op(freeze(),freeze(),freeze(),...).
This patch proposes we drop the opt-in limit for opcodes that are allowed to push a freeze through the op to freeze all its operands, through the tree towards the roots.
I'm struggling to find a strong reason for this limit apart from the DAG freeze handling being immature for so long - as we've improved coverage in canCreateUndefOrPoison/isGuaranteedNotToBeUndefOrPoison it looks like the regressions are not as severe.
Hopefully this will help some of the regression issues in #143102 etc.
The insertion point of COPY isn't always optimal and could eventually
lead to a worse block layout, see the regression test in the first
commit.
This change affects many architectures but the amount of total
instructions in the test cases seems too be slightly lower.
Sections which are not allowed to carry data are marked as virtual. Only
complication when writing out the text is that it must be written in
chunks of 32k-1 bytes, which is done by having a wrapper stream writing
those records.
Data of BSS sections is not written, since the contents is known to be
zero. Instead, the fill byte value is used.
Unlike other formats, the GOFF object file format uses a 2 dimensional structure
to define the location of data. For example, the equivalent of the ELF .text
section is made up of a Section Definition (SD) and a class (Element Definition;
ED). The name of the SD symbol depends on the application, while the class has
the predefined name C_CODE/C_CODE64 in AMODE31 and AMODE64 respectively.
Data can be placed into this structure in 2 ways. First, the data (in a text
record) can be associated with an ED symbol. To refer to data, a Label
Definition (LD) is used to give an offset into the data a name. When binding,
the whole data is pulled into the resulting executable, and the addresses
given by the LD symbols are resolved.
The alternative is to use a Part Definition (PR). In this case, the data (in
a text record) is associated with the part. When binding, only the data of
referenced PRs is pulled into the resulting binary.
Both approaches are used. SD, ED, and PR elements are modeled by nested
MCSectionGOFF instances, while LD elements are associated with MCSymbolGOFF
instances.
At the binary level, a record called "External Symbol Definition" (ESD) is used. The
ESD has a type (SD, ED, PR, LD), and depending on the type a different subset of
the fields is used.
This patch fixes an error in which `FAKE_USE` instructions would trigger
an assertion in SystemZLongBranch due to them having a size of 0 without
being excepted in the assertion that each instruction, other than a set
of known 0-size instruction types, should have a non-0 size.
`FAKE_USE` instructions are no-op instructions that are emitted into
LLVM by the `-fextend-variable-liveness` clang flag to help preserve the
liveness of source variables in optimized code, and therefore they
should be understood as being valid size 0 instructions.
If there's an implicit-def of a super register, the propagation
must preserve this implicit-def. Knowing how and when to do this
may require target specific knowledge so just disable it for now.
Prior to 2def1c4, we checked that the copy had explicit 2 operands
when that was removed we started allowing implicit operands through.
This patch adds a check for implicit operands, but still allows
extra explicit operands which was the goal of 2def1c4.
Fixes#131478.
Commit `083b4a3d66` introduced a store-and-load pair around the `BRASL`
call to mcount. That load instruction did not properly declare its
target register as defined, leading to a bad machine instruction.
This commit fixes this by explicitly labeling `%r14` on the load as
`def`.
When compiling with `-pg`, the `EntryExitInstrumenterPass` will insert
calls to the glibc function `mcount` at the begining of each
`MachineFunction`.
On SystemZ, these calls require special handling:
- The call to `mcount` needs to happen at the beginning of the prologue.
- Prior to the call to `mcount`, register `%r14`, the return address of
the callee function, must be stored 8 bytes above the stack pointer
`%r15`. After the call to `mcount` returns, that register needs to be
restored.
This commit adds some special handling to the EntryExitInstrumenterPass
that keeps the insertion of the mcount function into the module, but
skips over insertion of the actual call in order to perform this
insertion in the `emitPrologue` function. There, a simple sequence of
store/call/load is inserted, which implements the above.
The desired change in the `EntryExitInstrumenterPass` necessitated the
addition of a new attribute and attribute kind to each function, which
is used to trigger the postprocessing, aka call insertion, in
`emitPrologue`. Note that the new attribute must be of a different kind
than the `mcount` atribute, since otherwise it would replace that
attribute and later be deleted by the code that intended to delete
`mcount`. The new attribnute is called `insert-mcount`, while the
attribute kind is `systemz-backend`, to clearly mark it as a
SystemZ-specific backend concern.
This PR should address issue #121137 . The test inserted here is derived
from the example given in that issue.
EH landing pad entry implicitly clobbers target-specific exception
pointer and exception selector registers. The post-RA MachineLICM pass
needs to take these into account when deciding whether to hoist an
instruction out of the loop that initializes one of these registers.
Fixes: https://github.com/llvm/llvm-project/issues/122315
Add a DAGCombine for FCOPYSIGN that removes the rounding which is never
needed as the sign bit is already in the correct place. This helps in particular the
rounding to f16 case which needs a libcall.
Also remove the roundings for other FP VTs and simplify the CPSDR
patterns correspondingly.
fp-copysign-03.ll test updated, now also covering the other FP VT
combinations.
- _Float16 is now accepted by Clang.
- The half IR type is fully handled by the backend.
- These values are passed in FP registers and converted to/from float around
each operation.
- Compiler-rt conversion functions are now built for s390x including the missing
extendhfdf2 which was added.
Fixes#50374
AsmPrinter may switch the current section when e.g., emitting a jump
table for a switch. `.stack_sizes` should still be linked to the
function section. If the section is wrong, readelf emits a warning
"relocation symbol is not in the expected section".
Previously `vst` and `vl` were not considered "simple" BDX stores and
loads, leading to, among other things, some opportunities for `mvc`
optimization to be missed.
This PR addresses this and updates some tests to account for additional
`mvc` instructions being emitted.
This is observed to have a neutral or slightly beneficial effect
performance-wise.
The recently announced IBM z17 processor implements the architecture
already supported as "arch15" in LLVM. This patch adds support for "z17"
as an alternate architecture name for arch15.
This patch also add the scheduler description for the z17 processor,
provided by Jonas Paulsson.
Due to some optimization changes, INIT_UNDEF is making its way to
`getInstSizeInBytes` in `llvm/lib/Target/SystemZ/SystemZLongBranch.cpp`
but we do not have an exception there in the assert. Since INIT_UNDEF is
described as being similar to IMPLICIT_DEF and there is a check for
IMPLICIT_DEF, it seems logical to also add a check for INIT_UNDEF.
---------
Co-authored-by: Tony Tao <tonytao@ca.ibm.com>
This PR instruments the optimization passes in the SystemZ backend with
calls to `MachineFunction::substituteDebugValuesForInst` where
instruction substitutions are made to instructions that may compute
tracked values.
Tests are also added for each of the substitutions that were inserted.
Details on the individual passes follow.
### systemz-copy-physregs
When a copy targets an access register, we redirect the copy via an
auxiliary register. This leads to the final result being written by a
newly inserted SAR instruction, rather than the original MI, so we need
to update the debug value tracking to account for this.
### systemz-long-branch
This pass relaxes relative branch instructions based on the actual
locations of blocks. Only one of the branch instructions qualifies for
debug value tracking: BRCT, i.e. branch-relative-on-count, which
subtracts 1 from a register and branches if the result is not zero. This
is relaxed into an add-immediate and a conditional branch, so any
`debug-instr-number` present must move to the add-immediate instruction.
### systemz-post-rewrite
This pass replaces `LOCRMux` and `SELRMux` pseudoinstructions with
either the real versions of those instructions, or with branching
programs that implement the intent of the Pseudo. In all these cases,
any `debug-instr-number` attached to the pseudo needs to be reallocated
to the appropriate instruction in the result, either LOCR, SELR, or a
COPY.
### systemz-elim-compare
Similar to systemz-long-branch, for this pass, only few substitutions
are necessary, since it mainly deals with conditional branch
instructions. The only exceptiona are again branch-relative-on-count, as
it modifies a counter as part of the instruction, as well as any of the
load instructions that are affected.
As part of an effort to enable instr-ref-based debug value tracking,
this PR implements `SystemZInstrInfo::isLoadFromStackSlotPostFE`, as
well as `SystemZInstrInfo::isStoreToStackSlotPostFE`. The implementation
relies upon the presence of MachineMemoryOperands on the relevant
`MachineInstr`s in order to access the `FrameIndex` post frame index
elimination.
Since these new functions are only meant to be called after frame-index
elimination, they assert against the present of a frame index on the
base register operand of the instruction.
Outside of the utility of these functions to enable instr-ref-based
debug value tracking, they also changes the behavior of the AsmPrinter,
since it will now be able to properly detect non-folded spills and
reloads, so this changes a number of tests that were checking
specifically for folded reloads.
Note that there are some tests that still check for `vst` and `vl` as
folded spills/reloads even though they should be straight reloads. This
will be addressed in a future PR.
Co-authored-by: Dominik Steenken <dominik.steenken@gmail.com>