The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.
A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
/*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee
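
The last step above can be sketched as a small C++ model. This is only
an illustration of the selection between the two (EXEC, callee) pairs,
not the backend code: ChainTarget, selectChainTarget and the
AllocSucceeded flag are hypothetical names, and the flag stands in for
whatever the hardware reports after the reallocation attempt.

```cpp
#include <cstdint>

// Hypothetical model of one (EXEC, callee) pair; not an LLVM type.
struct ChainTarget {
  std::uint32_t Exec;   // EXEC mask to install (wave32)
  std::uint64_t Callee; // address to jump to
};

// AllocSucceeded is an assumed stand-in for the result of the VGPR
// reallocation attempt. Both EXEC and the callee are picked by the
// same condition, mirroring the pair of select instructions.
constexpr ChainTarget selectChainTarget(bool AllocSucceeded,
                                        ChainTarget Primary,
                                        ChainTarget Fallback) {
  return AllocSucceeded ? Primary : Fallback;
}
```

Both fields come from the same pair, so a failed allocation can never
combine the primary callee with the fallback EXEC mask.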
This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
for expanding this late is so that we don't risk other passes
(particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).
Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.
---------
Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
From #106446, this adds a variant of getVectorIdxTy that returns an LLT.
Many uses only look at the width, so a getVectorIdxWidth was added as
the common base.
Reduce:
DST = shl i64 X, Y
where Y is in the range [32, 63] to:
DST = [0, shl i32 X, (Y & 31)]
Alive2 analysis:
https://alive2.llvm.org/ce/z/w_u5je
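
As a rough check of the identity, here is a minimal C++ sketch that
compares the wide shift with the narrowed form. shlWide and shlNarrow
are hypothetical helper names, not functions from the patch. The
narrowed form shifts the low 32 bits of X by Y & 31 (which equals
Y - 32 when Y is in [32, 63]) and places the result in the high half.

```cpp
#include <cstdint>

// Reference: the original 64-bit shift. Only valid for Y < 64.
constexpr std::uint64_t shlWide(std::uint64_t X, unsigned Y) {
  return X << Y;
}

// Narrowed form, valid only when Y is in [32, 63]:
// low half is 0, high half is a 32-bit shift of the low half of X.
constexpr std::uint64_t shlNarrow(std::uint64_t X, unsigned Y) {
  std::uint32_t Lo = static_cast<std::uint32_t>(X);
  std::uint32_t Hi = Lo << (Y & 31u);
  return static_cast<std::uint64_t>(Hi) << 32;
}

static_assert(shlNarrow(0x123456789ABCDEF0ull, 36) ==
                  shlWide(0x123456789ABCDEF0ull, 36),
              "narrowed shift must match the 64-bit shift");
```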
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Previously we only handled cases that looked like the high element
extract of a 64-bit shift. Generalize this to handle any multiple
indexing. I was hoping this would help avoid some regressions,
but it did not. It does however reduce the number of steps the DAG
takes to process these cases.
NFC-ish, I have yet to find an example where this changes the
final output.
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
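
To see why the signed variant matters, consider a negative value that
is meant for a 32-bit constant. The sketch below uses two hypothetical
helpers, fitsUnsigned and fitsSigned, to model the range checks; the
exact condition that getConstant() will assert on is an assumption
here, not something taken from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: does V fit in Bits bits as an unsigned value?
constexpr bool fitsUnsigned(unsigned Bits, std::uint64_t V) {
  return Bits >= 64 || V < (std::uint64_t{1} << Bits);
}

// Hypothetical helper: does V fit in Bits bits as a signed value?
// Assumes Bits >= 1.
constexpr bool fitsSigned(unsigned Bits, std::int64_t V) {
  return Bits >= 64 || (V >= -(std::int64_t{1} << (Bits - 1)) &&
                        V < (std::int64_t{1} << (Bits - 1)));
}

// -1 converted to a 64-bit unsigned value has all bits set, so it only
// "fits" in 32 bits if the upper bits are silently truncated.
static_assert(!fitsUnsigned(32, static_cast<std::uint64_t>(std::int64_t{-1})),
              "-1 does not fit as an unsigned 32-bit value");
static_assert(fitsSigned(32, -1), "-1 fits as a signed 32-bit value");
```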
This also touches some generic legalization code, which apparently only
AMDGPU tests.
Use a local pointer type to represent the named barrier in the builtin
and intrinsic. This makes the definitions more user-friendly because
users do not need to worry about the hardware ID assignment. This
approach is also closer to other popular GPU programming languages.
Named barriers should be represented as global variables of addrspace(3)
in LLVM IR. The compiler assigns the special LDS offsets for those
variables during the AMDGPULowerModuleLDS pass. Those addresses are
converted to hardware barrier IDs during instruction selection. The rest of the
instruction-selection changes are primarily due to the
intrinsic-definition changes.
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal, which makes it useless for custom lowering
on non-legal types (which is always ppcf128).
Really, the DAG builder shouldn't be expanding this itself; doing so
makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
SMUL_LOHI and UMUL_LOHI are different operations because the high part
of the result is different, so it is not OK to optimize the signed
version to MUL_U24/MULHI_U24 or the unsigned version to
MUL_I24/MULHI_I24.
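
A small C++ sketch of the difference, using full 32-bit operands rather
than the 24-bit forms for simplicity. LoHi, umulLoHi and smulLoHi are
hypothetical names that model the two-result multiply; they are not
LLVM APIs.

```cpp
#include <cstdint>

// Hypothetical model of a two-result multiply: low and high 32 bits.
struct LoHi {
  std::uint32_t Lo;
  std::uint32_t Hi;
};

// Unsigned widening multiply, split into halves.
constexpr LoHi umulLoHi(std::uint32_t A, std::uint32_t B) {
  std::uint64_t P = static_cast<std::uint64_t>(A) * B;
  return {static_cast<std::uint32_t>(P), static_cast<std::uint32_t>(P >> 32)};
}

// Signed widening multiply, split into halves.
constexpr LoHi smulLoHi(std::int32_t A, std::int32_t B) {
  std::uint64_t P =
      static_cast<std::uint64_t>(static_cast<std::int64_t>(A) * B);
  return {static_cast<std::uint32_t>(P), static_cast<std::uint32_t>(P >> 32)};
}

// Same bit pattern (0xFFFFFFFF) times 2: the low halves agree, the
// high halves do not.
static_assert(umulLoHi(0xFFFFFFFFu, 2).Lo == smulLoHi(-1, 2).Lo,
              "low halves match");
static_assert(umulLoHi(0xFFFFFFFFu, 2).Hi != smulLoHi(-1, 2).Hi,
              "high halves differ");
```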
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.
"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncating, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by hardware at
this moment.
This patch enables the target-independent lowering of llvm.lround via
GlobalISel. For SelectionDAG, the intrinsic is custom lowered for
AMDGPU. In order to support vector floating point input for llvm.lround,
this patch extends the target-independent APIs and provides support for
scalarizing. pr98950 is needed to let the verifier allow vector floating
point types.
Summary:
These Libcalls represent which functions are available to the backend.
If a runtime call is not available, the target sets the name to
`nullptr`. Currently, this logic is spread around the various targets.
This patch pulls all of the locations that disable libcalls into the
initializer. This patch is effectively NFC.
The motivation behind this patch is that currently the LTO handling uses
the list of all runtime calls to determine which functions cannot be
internalized and must be extracted from static libraries. We do not want
this to happen for libcalls that are not emitted by the backend. A
follow-up patch will move out this logic so the LTO pass can know which
rtlib calls are actually used by the backend.
In the SelectionDAG lowering of the memcpy intrinsic, this optimization
introduces additional chains between fixed-size groups of loads and the
corresponding stores. While initially introduced to ensure that wider
load/store-pair instructions are generated on AArch64, this optimization
also improves code generation for AMDGPU: Ganged loads are scheduled
into a clause; stores only await completion of their corresponding load.
The chosen value of 16 performed well in microbenchmarks; values of 8,
32, or 64 would perform similarly.
The testcase updates are autogenerated by
utils/update_llc_test_checks.py.
See also:
- PR introducing this optimization: https://reviews.llvm.org/D46477
Part of SWDEV-455845.
These are redundant with the unsuffixed versions, and have a name
collision with surprising behavior when the base intrinsic is used with
v2bf16.
The global and flat variants should be removed too, but those are complicated
due to using v2i16 in place of the natural v2bf16. Those cases can soon be
completely deleted in favor of atomicrmw.
The GlobalISel codegen change is broken and substitutes handling as bf16
for handling as f16, but it's a bug that this passed the IRTranslator in the first
place.
Use LSH to lower ctlz_zero_undef instead of subtracting leading zeros
for i8 and i16.
Related to [77615](https://github.com/llvm/llvm-project/pull/77615).
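
A minimal C++ sketch of the two lowerings for i8, assuming LSH here
means a left shift that moves the narrow value into the top bits of a
32-bit register. ctlz32, ctlz8BySub and ctlz8ByShift are hypothetical
helpers, not code from the patch. Since the input of ctlz_zero_undef is
non-zero, shifting it to the top lets the 32-bit count be used directly.

```cpp
#include <cstdint>

// Portable 32-bit count-leading-zeros; returns 32 for V == 0.
constexpr unsigned ctlz32(std::uint32_t V) {
  unsigned N = 0;
  for (std::uint32_t Bit = 0x80000000u; Bit != 0 && (V & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// Zero-extend, count, then subtract the 24 extra leading zeros.
constexpr unsigned ctlz8BySub(std::uint8_t V) { return ctlz32(V) - 24; }

// Shift the value into the top byte, then count directly.
// Only meaningful for V != 0 (the zero_undef case).
constexpr unsigned ctlz8ByShift(std::uint8_t V) {
  return ctlz32(static_cast<std::uint32_t>(V) << 24);
}

static_assert(ctlz8BySub(0x10) == ctlz8ByShift(0x10),
              "both lowerings agree on non-zero input");
```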
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
This is the first step to eliminating shouldCastAtomicRMWIInIR. This and
the other atomic expand casting hooks should be removed, since they add
duplicate legalization machinery and interfaces. Legalizing these cases
is already what codegen is supposed to do, and already does for the
promotion case.
In the case of atomicrmw xchg, there seems to be some benefit to having
the bitcasts moved outside of the cmpxchg loop on targets with separate
int and FP registers, which we should be able to deal with by directly
checking for the legality of the underlying operation.
The casting path was also losing metadata when it recreated the
instruction.
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
38 under llvm/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".