SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long beeen annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
Update the check in the FMA combine to check dot10-insts instead of
dot7-insts.
The target of the combine, v_dot2_f32_f16, is available only if
dot10-insts target feature is enabled.
The issue probably dates back to the change that split out dot10-insts
out of dot7-insts.
As far as I can see, this does not affect any current targets, but if a
future target has dot7-insts, but not dot10-insts that would cause a
crash ("cannot select") for the input ir in the test.
The materialization cost of 32-bit non-inline in case of fmul is quite
relatively more, rather than if possible to combine it into ldexp
instruction for specific scenarios (for datatypes like f64, f32 and f16)
as this is being handled here :
The dag combine for any pair of select values which are exact exponent
of 2.
```
fmul x, select(y, A, B) -> ldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B) -> ldexp ((fneg x), select i32 (y, a, b))
where, A=2^a & B=2^b ; a and b are integers.
```
This dagCombine is handled separately in fmulCombine (newly defined in
SIIselLowering), targeting fmul fusing it with select type operand into
ldexp.
Thus, it fixes#104900.
It is possible that the number of hidden arguments that are selected to
be preloaded in AMDGPULowerKernel arguments and isel can differ. This
isn't an issue with explicit arguments since isel can lower the argument
correctly either way, but with hidden arguments we may have alignment
issues if we try to load these hidden arguments that were added to the
kernel signature.
The reason for the mismatch is that isel reserves an extra synthetic
user SGPR for module LDS.
Instead of teaching lowerFormalArguments how to handle these properly it
makes more sense and is less expensive to fix the mismatch and assert if
we ever run into this issue again. We should never be trying to lower
these in the normal way.
In a future change we probably want to revise how we track "synthetic"
user SGPRs and unify the handling in GCNUserSGPRUsageInfo. Sometimes
synthetic SGPRSs are considered user SGPRs and sometimes they are not.
Until then this patch resolves the inconsistency, fixes the bug, and is
otherwise a NFC.
Extend the optimization that converts s_barrier to wave_barrier (nop)
when the number of work items is not larger than wave size.
This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the version where s_barrier is used in gfx12 (and later split)
has the optimization already, but some front-ends may prefer to use
split intrinsics and this is being addressed by the patch.
Fix all the places I could find that did't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently only
AMDGPU tests.
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).
I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.
The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
Fixes a copy-paste typo.
The typo resulted in producing bad v_perm based operands for the v_dot4
combine. When adding a corresponding byte pair to the v_dot byte pair
chains, we must take note of the byte position in the corresponding
source nodes. These byte positions are used to ensure we extract the
correct DWord from the ultimate source, and formulate a correct
perm_mask from the extracted DWord.
With the typo, we the S0 byte would used the DWord offset for the
corresponding S1 byte. If this offset was not the same as the true DWord
offset for the S0 byte, we would extract and use the wrong byte for S0
in the v_dot.
Fixes https://github.com/llvm/llvm-project/issues/112941
Use a local pointer type to represent the named barrier in builtin and
intrinsic. This makes the definitions more user friendly
bacause they do not need to worry about the hardware ID assignment. Also
this approach is more like the other popular GPU programming language.
Named barriers should be represented as global variables of addrspace(3)
in LLVM-IR. Compiler assigns the special LDS offsets for those variables
during AMDGPULowerModuleLDS pass. Those addresses are converted to hw
barrier ID during instruction selection. The rest of the
instruction-selection changes are primarily due to the
intrinsic-definition changes.
64-bit flat cmpxchg instructions do not work correctly for scratch
addresses, and need to be expanded as non-atomic.
Allow custom expansion of cmpxchg in AtomicExpand, as is
already the case for atomicrmw.
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.
Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
The correct behaviour is to insert a readfirstlane. SelectionDAG was
already doing this in some cases, but not in the general case for chain
calls. GlobalISel was already doing this for return values but not for
arguments.
This only changes `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`.
There are five uses of `std::tie` remaining because they can't be
replaced with
C++17 structured bindings.
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.
Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas, before, code assumed that the latter implies the
former at some places.
Part of SWDEV-455845.
Adds hidden kernel arguments to the function signature and marks them
inreg if they should be preloaded into user SGPRs. The normal kernarg
preloading logic then takes over with some additional checks for the
correct implicitarg_ptr alignment.
Special care is needed so that metadata for the hidden arguments is not
added twice when generating the code object.
If we have a divergent value passed to an outgoing inreg argument,
the call needs to be executed in a waterfall loop and thus cannot
be tail called.
The waterfall handling of arbitrary calls is broken on the selectiondag
path, so some of these cases still hit an error later.
I also noticed the argument evaluation code in isEligibleForTailCallOptimization
is not correctly accounting for implicit argument assignments. It also seems
inreg codegen is generally broken; we are assigning arguments to the reserved
private resource descriptor.
We allow tail calls of known uniform function pointers. This
would produce a verifier error if the uniform value is in VGPRs.
Insert readfirstlanes just in case this occurs, which will fold
out later if it is unnecessary.
GlobalISel should need a similar fix, but it currently does not
attempt tail calls of indirect calls.
Fixes#107447
Fixes subissue of #110930