Support tail calls to whole wave functions (trivial) and from whole wave
functions (slightly more involved, because we need a new pseudo for the
tail call return that patches up the EXEC mask).
Move the expansion of whole wave function return pseudos (regular and
tail call returns) to prolog/epilog insertion, since that's where we
patch up the EXEC mask.
Unnecessary register spills will be dealt with in a future patch.
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse-grained
because it does not take into account the differences between scalar
and vector compares. This PR extends the interface to take an EVT, to
allow finer control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
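A minimal sketch of what such a target override might look like with the extended hook (hypothetical target class; the in-tree AArch64 logic may differ):
```
// Returning true tells CodeGenPrepare not to sink compares of this type into
// the blocks that use them; here sinking is disabled only for scalable vectors.
bool MyTargetLowering::hasMultipleConditionRegisters(EVT VT) const {
  return VT.isScalableVector();
}
```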
We already had corresponding f32 and i32 vector types for these sizes.
Also add the VTs v[567]i8 and v[567]i16: these are needed by the Hexagon
backend, which, for each i1 vector type, wants to query information about
the corresponding i8 and i16 types in
HexagonTargetLowering::getPreferredHvxVectorAction.
These float operations were expanded for scalar f32/f64/f128, but not
for f16 and, more problematically, not for vectors. A small subset of
them was separately set to expand for vectors.
Change these to always expand by default, and adjust targets to mark
these as legal where necessary instead.
This is a much safer default, and avoids unnecessary legalization
failures when a target forgets to manually mark them as expand.
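As an illustration (hypothetical opcode and types, not taken from this patch), a target with native support would now opt back in explicitly in its TargetLowering constructor:
```
// Operations are Expand by default; only the natively supported cases are
// marked Legal, everything else falls back to the generic expansion.
setOperationAction(ISD::FSIN, MVT::f32, Legal);
setOperationAction(ISD::FSIN, MVT::v2f32, Legal);
```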
Fixes https://github.com/llvm/llvm-project/issues/110753.
Fixes https://github.com/llvm/llvm-project/issues/121390.
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.
Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, which is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used; the active lanes only
for the CSRs.
At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
a SReg_1 representing `%active`, which needs to be passed into
SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
special handling of %active. Since the return may be in a different
basic block, it's difficult to add the virtual reg for %active to
SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
marks any used VGPRs as WWM registers, which are then spilled and
restored with the usual logic.
Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
For some targets, the optimization `X == Const ? X : Y` -> `X == Const ?
Const : Y` can cause extra register usage or redundant immediate encoding
for the constant in the cndmask generated from the ternary operation.
This patch detects such cases and reuses the register from the compare
instruction that already holds the constant, instead of materializing it
again for the cndmask.
The optimization skips immediates that can be encoded into the cndmask
instruction (including +-0.0), as well as !isNormal() constants.
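For illustration only (not a test from this patch), a C++ source pattern where the reuse applies might be:
```
// 2.5f cannot be encoded as an inline immediate, so the compare already needs
// it in a register; the select result reuses that register instead of
// materializing the literal a second time for the cndmask.
float pick(float x, float y) {
  return x == 2.5f ? x : y; // generic combine: x == 2.5f ? 2.5f : y
}
```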
The change is reworked on the base of #131146
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When performing a 64-bit sra of a negative value with a shift amount in
the range [32, 63], create the hi-half with a move of -1.
Alive verification: https://alive2.llvm.org/ce/z/kXd7Ac
Also, preserve exact flag. Alive verification:
https://alive2.llvm.org/ce/z/L86tXf.
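A small standalone C++ check of the identity being exploited (not part of the patch), for negative values and shift amounts in [32, 63]:
```
#include <cassert>
#include <cstdint>

int main() {
  for (int64_t x : {INT64_MIN, -1234567890123LL, -1LL}) {
    for (unsigned s = 32; s <= 63; ++s) {
      int32_t hi = int32_t(uint64_t(x) >> 32);  // high half of x
      int32_t lo = hi >> (s - 32);              // 32-bit sra of the high half
      // hi-half of the result is simply -1 because x is negative.
      int64_t expected = int64_t((uint64_t(0xFFFFFFFFu) << 32) | uint32_t(lo));
      assert((x >> s) == expected);
    }
  }
  return 0;
}
```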
---------
Signed-off-by: John Lu <John.Lu@amd.com>
This PR takes the work previously done by @pawan-nirpal-031 on X86 in
#106370, and makes it available in common code. This should enable all
targets to use `__builtin_canonicalize` for all `f(16|32|64|128)` data
types.
Canonicalization is implemented here as multiplication by `1.0`, as
suggested in [the
docs](https://llvm.org/docs/LangRef.html#llvm-canonicalize-intrinsic).
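Conceptually, the common-code expansion boils down to something like the following SelectionDAG sketch (illustrative, not the exact in-tree code):
```
// Expand fcanonicalize(x) as x * 1.0 in the same floating-point type.
SDValue expandCanonicalize(SDValue Op, SelectionDAG &DAG, const SDLoc &DL) {
  EVT VT = Op.getValueType();
  SDValue One = DAG.getConstantFP(1.0, DL, VT);
  return DAG.getNode(ISD::FMUL, DL, VT, Op, One);
}
```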
When reducing a 64-bit lshr to 32-bit, preserve the exact flag.
Alive2 verification: https://alive2.llvm.org/ce/z/LcnX7V
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Use KnownBits to convert 64-bit sra to 32-bit sra.
Scaled-down alive2 verification with 16/8-bit types:
https://alive2.llvm.org/ce/z/LamASk
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Convert a vector 64-bit lshr to 32-bit if the shift amount is known to
be >= 32. Also convert a scalar 64-bit lshr to 32-bit if the shift
amount is variable but known to be >= 32.
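A quick standalone check of the underlying identity (not from the patch), for shift amounts of at least 32:
```
#include <cassert>
#include <cstdint>

int main() {
  uint64_t x = 0x123456789abcdef0ULL;
  for (unsigned s = 32; s <= 63; ++s) {
    uint32_t hi = uint32_t(x >> 32);
    // The 64-bit lshr becomes a 32-bit lshr of the high half, zero-extended.
    assert((x >> s) == uint64_t(hi >> (s - 32)));
  }
  return 0;
}
```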
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Extend the sra i64 simplification to shift constants in the range
[33, 62]. Shift amounts 32 and 63 were already handled.
New tests for shift amounts 33 and 62 were added in sra.ll. Changes to
other test files adapt previous test results to this extension.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
This annotates the `Twine` passed to the constructors of the various
DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes
us to warn when we would try to print the Twine after it has already
been destroyed.
We also update `DiagnosticInfoUnsupported` to hold a `const Twine &`
like all of the other DiagnosticInfo classes, since this warning allows
us to clean up all of the places where it was being used incorrectly.
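A hedged illustration of the kind of misuse the annotation now flags (hypothetical helper, but the dangling-temporary pattern is the real issue):
```
#include "llvm/ADT/Twine.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/LLVMContext.h"
using namespace llvm;

void reportUnsupported(LLVMContext &Ctx, const Function &Fn, StringRef Name) {
  // The Twine built from "..." + Name is a temporary that dies at the end of
  // this declaration, yet the diagnostic keeps a reference to it. With
  // [[clang::lifetimebound]] on the constructor parameter, clang warns here.
  DiagnosticInfoUnsupported Diag(Fn, "unsupported feature: " + Name);
  Ctx.diagnose(Diag); // printed after the temporary Twine has been destroyed
}
```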
Update the f64 to f16 lowering for targets which support f16 types.
In unsafe mode, the conversion is lowered to two FP_ROUND nodes
(https://reviews.llvm.org/D154528 stops these two FP_ROUNDs from being
combined back). In safe mode, LowerF64ToF16 is selected, which uses
round-to-nearest-even rounding.
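In DAG terms, the unsafe-mode path is roughly the following pair of rounds (a sketch of the shape, not the exact code):
```
// Unsafe mode: f64 -> f32 -> f16 via two FP_ROUND nodes.
SDValue lowerViaTwoRounds(SDValue Src, SelectionDAG &DAG, const SDLoc &DL) {
  SDValue Flag = DAG.getIntPtrConstant(0, DL, /*isTarget=*/true); // may lose precision
  SDValue F32 = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src, Flag);
  return DAG.getNode(ISD::FP_ROUND, DL, MVT::f16, F32, Flag);
}
```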
Based on feedback for #129695 - we need to be able to determine the
load offset of smaller loads when trying to determine whether a multiple-use
load should be split (in particular for AVX subvector extractions).
This patch adds a std::optional<unsigned> ByteOffset argument to
shouldReduceLoadWidth calls where we know the constant offset, to allow
targets to make use of it in future patches.
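A hypothetical target override showing the shape of the extended hook (the exact in-tree signature may differ slightly):
```
bool MyTargetLowering::shouldReduceLoadWidth(
    SDNode *Load, ISD::LoadExtType ExtTy, EVT NewVT,
    std::optional<unsigned> ByteOffset) const {
  // Only narrow multi-use loads that start at the base address.
  if (ByteOffset && *ByteOffset != 0)
    return false;
  return TargetLowering::shouldReduceLoadWidth(Load, ExtTy, NewVT, ByteOffset);
}
```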
The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.
A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
/*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee
This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
is so that we don't risk other passes (particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).
Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.
---------
Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
From #106446, this adds a variant of getVectorIdxTy that returns an LLT.
Many uses only look at the width, so a getVectorIdxWidth was added as
the common base.
Reduce:
DST = shl i64 X, Y
where Y is known to be in the range [32, 63], to:
DST = [0, shl i32 X, (Y & 31)]
Alive2 analysis:
https://alive2.llvm.org/ce/z/w_u5je
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Previously we only handled cases that looked like the high element
extract of a 64-bit shift. Generalize this to handle any multiple
indexing. I was hoping this would help avoid some regressions,
but it did not. It does, however, reduce the number of steps the DAG
takes to process these cases.
NFC-ish; I have yet to find an example where this changes the
final output.
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there, I changed some `use_size() == 1` checks to `hasOneUse()`,
which is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range-based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * for the user rather than an SDUse *, so users() is a better name.
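For example, counting a node's users now reads naturally (small illustrative snippet):
```
// users() dereferences to SDNode *, so the loop variable is the using node.
unsigned NumCopyUsers = 0;
for (SDNode *User : N->users())
  if (User->getOpcode() == ISD::CopyToReg)
    ++NumCopyUsers;
```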
I've long been annoyed that we can't write a range-based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
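The intended end state would allow something like the following (a sketch of the planned API shape, not what exists today):
```
// Planned: iterate over SDUse& so the operand number is directly available.
for (SDUse &U : N->uses()) {
  unsigned OpNo = U.getOperandNo();
  SDNode *User = U.getUser();
  // ... inspect how operand OpNo of User refers to N ...
}
```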
Create signed constants using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently only
AMDGPU tests.
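For example, a combine that builds an all-ones value would now spell it as (illustrative):
```
// Before: DAG.getConstant(-1, DL, VT) relies on implicit truncation of the
// sign-extended value. After: explicitly signed construction.
SDValue AllOnes = DAG.getSignedConstant(-1, DL, VT);
```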
Use a local pointer type to represent named barriers in the builtin and
intrinsic. This makes the definitions more user-friendly because users
do not need to worry about the hardware ID assignment. This approach is
also more in line with other popular GPU programming languages.
Named barriers should be represented as global variables in addrspace(3)
in LLVM IR. The compiler assigns the special LDS offsets for those
variables during the AMDGPULowerModuleLDS pass. Those addresses are
converted to hardware barrier IDs during instruction selection. The rest
of the instruction-selection changes are primarily due to the
intrinsic-definition changes.
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction-cost-based sinking without introducing code
duplication.