Change definition of expandBitCastI128ToF128 and expandBitCastF128ToI128
to allow for simplified use in atomic load/store.
Update logic to split 128-bit loads and stores in DAGCombine to also
handle the f128 case where appropriate. This fixes the regressions
introduced by recent atomic load/store patches.
This is the mirror to the recent atomic load change. The same
bitcast-back-to-integer case is a small code quality regression for the
same reason. This would disappear with a bitcastable legal 128-bit type.
shouldCastAtomicLoadInIR is a hack that should be removed. Simple
bitcasting of operations should be in the domain of ordinary type
legalization and does not need to be done in the IR.
This introduces a code quality regression due to the hack currently used
to avoid using 128-bit values in the case where the floating point value
is ultimately used as an integer. This would be avoidable if there were
always a legal 128-bit type (like v2i64). This is a pretty niche
situation so I assume it's not important.
I implemented about 85% of the work necessary to make v2i64 legal, but
it was taking too long and I lack the necessary familiarity with systemz
to complete it. I've pushed it here for someone to pick up:
https://github.com/arsenm/llvm-project/pull/new/systemz-legal-v2i64
Depends #90861
Fixes#82659
There are some functions, such as `findRegisterDefOperandIdx` and `findRegisterDefOperand`, that have too many default parameters. As a result, we have encountered some issues due to the lack of TRI parameters, as shown in issue #82411.
Following @RKSimon 's suggestion, this patch refactors 9 functions, including `{reads, kills, defines, modifies}Register`, `registerDefIsDead`, and `findRegister{UseOperandIdx, UseOperand, DefOperandIdx, DefOperand}`, adjusting the order of the TRI parameter and making it required. In addition, all the places that call these functions have also been updated correctly to ensure no additional impact.
After this, the caller of these functions should explicitly know whether to pass the `TargetRegisterInfo` or just a `nullptr`.
This commit skips the expansion of the `vector.reduce.add` intrinsic on
vector-enabled SystemZ targets in order to introduce custom handling of
`vector.reduce.add` for legal vector types using the VSUM instructions.
This is limited to full vectors with scalar types up to `i32` due to
performance concerns.
It also adds testing for the generation of such custom handling, and
adapts the related cost computation, as well as the testing for that.
The expected result is a performance boost in certain benchmarks that
make heavy use of `vector.reduce.add` with other benchmarks remaining
constant.
For instance, the assembly for `vector.reduce.add<4 x i32>` changes from
```hlasm
vmrlg %v0, %v24, %v24
vaf %v0, %v24, %v0
vrepf %v1, %v0, 1
vaf %v0, %v0, %v1
vlgvf %r2, %v0, 0
```
to
```hlasm
vgbm %v0, 0
vsumqf %v0, %v24, %v0
vlgvf %r2, %v0, 3
```
On SystemZ, the outgoing argument area which is big enough for all calls
in the function is created once during the prolog, as opposed to
adjusting the stack around each call. The call-sequence instructions are
therefore not really useful any more than to compute the maximum call
frame size, which has so far been done by PEI, but can just as well be
done at an earlier point.
This patch removes the mapping of the CallFrameSetupOpcode and
CallFrameDestroyOpcode and instead computes the MaxCallFrameSize
directly after instruction selection and then removes the ADJCALLSTACK
pseudos. This removes the confusing pseudos and also avoids the problem
of having to keep the call frame size accurate when creating new MBBs.
This fixes#76618 which exposed the need to maintain the call frame size
when splitting blocks (which was not done).
The folding of sign/zero extensions into an atomic load by specifying an
extension type is not target specific, and therefore belongs in the
DAGCombiner rather than in the SystemZ backend.
- Handle atomic loads similarly to regular loads by adding
AtomicLoadExtActions with set/get methods.
- Move SystemZ extendAtomicLoad() to DagCombiner.cpp.
We use the VSCBIQ/VSBIQ/VSBCBIQ family of instructions to implement
USUBO/USUBO_CARRY for the i128 data type. However, these instructions
use an inverted sense of the borrow indication flag (a value of 1
indicates *no* borrow, while a value of 0 indicated borrow). This
does not match the semantics of the boolean "overflow" flag of the
USUBO/USUBO_CARRY ISD nodes.
Fix this by generating code to explicitly invert the flag. These
cancel out of the result of USUBO feeds into an USUBO_CARRY.
To avoid unnecessary zero-extend operations, also improve the
DAGCombine handling of ZERO_EXTEND to optimize (zext (xor (trunc)))
sequences where appropriate.
Fixes: https://github.com/llvm/llvm-project/issues/83268
We use the VSCBIQ/VSBIQ/VSBCBIQ family of instructions to implement
USUBO/USUBO_CARRY for the i128 data type. However, these instructions
use an inverted sense of the borrow indication flag (a value of 1
indicates *no* borrow, while a value of 0 indicated borrow). This
does not match the semantics of the boolean "overflow" flag of the
USUBO/USUBO_CARRY ISD nodes.
Fix this by generating code to explicitly invert the flag. These
cancel out of the result of USUBO feeds into an USUBO_CARRY.
To avoid unnecessary zero-extend operations, also improve the
DAGCombine handling of ZERO_EXTEND to optimize (zext (xor (trunc)))
sequences where appropriate.
Fixes: https://github.com/llvm/llvm-project/issues/83268
- Instead of lowering float/double ISD::ATOMIC_LOAD / ISD::ATOMIC_STORE
nodes to regular LOAD/STORE nodes, make them legal and select those nodes
properly instead. This avoids exposing them to the DAGCombiner.
- AtomicExpand pass no longer casts float/double atomic load/stores to integer
(FP128 is still casted).
When an integer argument is promoted and *not* split (like i72 -> i128 on
a new machine with vector support), the SlotVT should be i128, which is
stored in VT - not ArgVT.
Fixes#81417
Machines with vector support handle i128 in vector registers and
therefore only have the small displacement available for memory
accesses. Update isLegalAddressingMode() to reflect this.
The usage of FP Load and Test instructions as a comparison against zero
with the assumption that the dest reg will always reflect the source reg is
actually incorrect: Unfortunately, a SNaN will be converted to a QNaN, so the
instruction may actually change the value as opposed to being a pure register
move with a test.
This patch
- changes instruction selection to always emit FP LT with a scratch def
reg, which will typically be allocated to the same reg if dead.
- Removes the conversions into FP LT in SystemZElimcompare.
When i128 is a legal type, SelectionDAG now attempts to use
SRL_PARTS etc. with type i128, which is not implemented. Fix
by marking those as Expand, just like we do for i64.
Fixes https://github.com/llvm/llvm-project/issues/77132
This follows on from #76708, allowing
`cast<ConstantSDNode>(N)->getZExtValue()` to be replaced with just
`N->getAsZextVal();`
Introduced via `git grep -l "cast<ConstantSDNode>\(.*\).*getZExtValue" |
xargs sed -E -i
's/cast<ConstantSDNode>\((.*)\)->getZExtValue/\1->getAsZExtVal/'` and
then using `git clang-format` on the result.
This helper function shortens examples like
`cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue();` to
`Node->getConstantOperandVal(1);`.
Implemented with:
`git grep -l
"cast<ConstantSDNode>\(.*->getOperand\(.*\)\)->getZExtValue\(\)" | xargs
sed -E -i
's/cast<ConstantSDNode>\((.*)->getOperand\((.*)\)\)->getZExtValue\(\)/\1->getConstantOperandVal(\2)/`
and `git grep -l
"cast<ConstantSDNode>\(.*\.getOperand\(.*\)\)->getZExtValue\(\)" | xargs
sed -E -i
's/cast<ConstantSDNode>\((.*)\.getOperand\((.*)\)\)->getZExtValue\(\)/\1.getConstantOperandVal(\2)/'`.
With a couple of simple manual fixes needed. Result then processed by
`git clang-format`.
Support passing and returning values of single-element vector
types (i.e. <1 x i128> and <1 x fp128>).
Now that i128 is a legal type, supporting these types can be
done simply by providing a getRegisterTypeForCallingConv
implementation that handles them.
Fixes https://github.com/llvm/llvm-project/issues/61291
On processors supporting vector registers and SIMD instructions, enable
i128 as legal type in VRs. This allows many operations to be implemented
via native instructions directly in VRs (including add, subtract,
logical operations and shifts). For a few other operations (e.g.
multiply and divide, as well as atomic operations), we need to move the
i128 value back to a GPR pair to use the corresponding instruction
there. Overall, this is still beneficial.
The patch includes the following LLVM changes:
- Enable i128 as legal type
- Set up legal operations (in SystemZInstrVector.td)
- Custom expansion for i128 add/subtract with carry
- Custom expansion for i128 comparisons and selects
- Support for moving i128 to/from GPR pairs when required
- Handle 128-bit integer constant values everywhere
- Use i128 as intrinsic operand type where appropriate
- Updated and new test cases
In addition, clang builtins are updated to reflect the intrinsic operand
type changes (which also improves compatibility with GCC).
Let the AtomicExpand pass do more of the job of expanding
AtomicRMWInst:s in order to simplify the handling in the backend.
The only cases that the backend needs to handle itself are those of
subword size (8/16 bits) and those directly corresponding to a target
instruction.
- Clang FE now has MaxAtomicPromoteWidth / MaxAtomicInlineWidth set to 128, and now produces IR
instead of calls to __atomic instrinsics for 16 bytes as well.
- Atomic __int128 (and long double) variables are now aligned to 16 bytes by default (like gcc 14).
- AtomicExpand pass now expands 16 byte operations as well.
- tests for __atomic builtins for all integer widths, and __atomic_is_lock_free with friends.
- TODO: AtomicExpand pass handles with this patch expansion of i128 atomicrmw:s. As a next step
smaller integer types should also be possible to handle this way instead of by the backend.
Clang currently implements a set of vector rotate builtins
(__builtin_s390_verll*) in terms of platform-specific LLVM
intrinsics. To simplify the IR (and allow for common code
optimizations if applicable), this patch removes those LLVM
intrinsics and implements the builtins in terms of the
platform-independent funnel shift intrinsics instead.
Also, fix the prototype of the __builtin_s390_verll*
builtins for full compatibility with GCC.
GCC supports building individual functions with backchain using the
__attribute__((target("backchain"))) syntax, and Clang should too.
Clang translates this into the "target-features"="+backchain" attribute,
and the -mbackchain command-line option into the "backchain" attribute.
The backend currently checks only the latter; furthermore, the backchain
target feature is not defined.
Handle backchain like soft-float. Define a target feature, convert
function attribute into it in getSubtargetImpl(), and check for target
feature instead of function attribute everywhere. Add an end-to-end test
to the Clang testsuite.
The upcoming OpenMP support for SystemZ requires handling of IR insns
like `atomicrmw fadd`. Normally atomic float operations are expanded by
Clang and such insns do not occur, but OpenMP generates them directly.
Other architectures handle this using the AtomicExpand pass, which
SystemZ did not need so far. Enable it.
Currently AtomicExpand treats atomic load and stores of floats
pessimistically: it casts them to integers, which SystemZ does not need,
since the floating point load and store instructions are already atomic.
However, the way Clang currently expands them is pessimistic as well, so
this change does not make things worse. Optimizing operations on atomic
floats can be a separate change in the future.
This change does not create any differences the Linux kernel build.
When the code is built with -mbackchain, it is possible to retrieve the
caller's frame and return addresses. GCC already can do this, add this
support to Clang as well. Use RISCVTargetLowering and GCC's
s390_return_addr_rtx() as inspiration. Add tests based on what GCC is
emitting.
This PR moves some calculation out of `LowerCall` and into
`SystemZXPLINKFrameLowering::processFunctionBeforeFrameFinalized`.
We need to make this change because LowerCall isn't invoked for
functions that don't have function calls, and it is required for some
tooling to work correctly. A function that does not make any calls is
required to allocate 32 bytes for the parameter area required by the
ABI. However, we allocate 64 bytes because this additional space is
utilized by certain tools, like the debugger.
Co-authored-by: Yusra Syeda <yusra.syeda@ibm.com>
On z/OS, many library functions have a non-standard name. This change
initializes the table of runtime function which results from lowering
intrinsics to library calls.
Given a list of constraints for InlineAsm (ex. "imr") I'm looking to
modify the order in which they are chosen. Before doing so, I noticed a
fair
amount of logic is duplicated between SelectionDAGISel and GlobalISel
for this.
That is because SelectionDAGISel is also trying to lower immediates
during selection. If we detangle these concerns into:
1. choose the preferred constraint
2. attempt to lower that constraint
Then we can slide down the list of constraints until we find one that
can be lowered. That allows the implementation to be shared between
instruction selection frameworks.
This makes it so that later I might only need to adjust the priority of
constraints in one place, and have both selectors behave the same.
Irritatingly, atomic_store had operands in the opposite order from
regular store. This made it difficult to share patterns between
regular and atomic stores.
There was a previous incomplete attempt to move atomic_store into the
regular StoreSDNode which would be better.
I think it was a mistake for all atomicrmw to swap the operand order,
so maybe it's better to take this one step further.
https://reviews.llvm.org/D123143
This patch adds support for the ADA (associated data area), doing the following:
-Creates the ADA table to handle displacements
-Emits the ADA section in the SystemZAsmPrinter
-Lowers the ADA_ENTRY node into the appropriate load instruction
Differential Revision: https://reviews.llvm.org/D153788
- Creates the ADA table to handle displacements
- Emits the ADA section in the SystemZAsmPrinter
- Lowers the ADA_ENTRY node into the appropriate load instruction
Differential Revision: https://reviews.llvm.org/D153788
The Length/4 of Params field in the PPA1 ought to be the length of the parameters for the current function. Currently we are storing the length of the parameter area in the current function's stack frame, which represents the length of the params of the longest callee in the current function.
Differential Revision: https://reviews.llvm.org/D152920
Reviewed By: uweigand
The Length/4 of Params field in the PPA1 ought to be the length of the parameters for the current function. Currently we are storing the length of the parameter area in the current function's stack frame, which represents the length of the params of the longest callee in the current function.
Differential revision: https://reviews.llvm.org/D119049
Reviewed By: uweigand
The term "next stack offset" is misleading because the next argument is
not necessarily allocated at this offset due to alignment constrains.
It also does not make much sense when allocating arguments at negative
offsets (introduced in a follow-up patch), because the returned offset
would be past the end of the next argument.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D149566
Define intersectWith and unionWith as two complementary ways of
combining KnownBits. The names are chosen for consistency with
ConstantRange.
Deprecate commonBits as a synonym for intersectWith.
Differential Revision: https://reviews.llvm.org/D150443
Previously, LegalizeVectorOps used the result VT while LegalizeDAG
used the operand VT. This patch makes them both use the operand VT.
This also makes it consistent with how the default cost model works.
I've hacked the AArch64 cost model to maintain old behavior for some
f16 vectors.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D149572
In D79537, `nomerge` was made to only apply to non-tail calls. This fixes it by also applying it to tail calls.
For ARM, I only made the new MI to inherit the flag under `TCRETURNdi` and `TCRETURNri`, because that's the place tail calls got replaced. Not sure if there's any other place needed.
Fixes#61545.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D146749
The new test case showed that the NoPHIs flag needs to be cleared.
Original commit message:
[SystemZ] Bugfix in expansion of memmem operations.
Since NC, OC, and XC clobber CC, the EXRL_Pseudo targeting these must also be
marked to do so.
Original patch by uweigand.
Reviewed by: uweigand
Differential Revision: https://reviews.llvm.org/D150251
Fixes: https://github.com/llvm/llvm-project/issues/62572