The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the
current implementation of vector fptrunc lowering fully scalarizes the
vectors, and the scalar conversions may not always be combined to
generate the packed one.
We made v2f32 -> v2f16 legal in
https://github.com/llvm/llvm-project/pull/139956. This work is an
extension to handle wider vectors. Instead of full scalarization, we
split the vector into packs (v2f32 -> v2f16) to ensure the packed
conversion can always be generated.
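As a rough illustration of the splitting strategy, here is a small Python
model. It is not the lowering code itself: the function names are
hypothetical, Python floats are doubles rather than f32, and only
even-length vectors are handled. It shows only how a wide vector is cut
into two-element packs that are each converted as a unit.

```python
import struct

def cvt_pk_f16_f32(a, b):
    """Model one packed conversion: round two values to f16 precision."""
    to_half = lambda x: struct.unpack("<e", struct.pack("<e", x))[0]
    return (to_half(a), to_half(b))

def split_fptrunc(vec):
    """Truncate an even-length vector pack by pack instead of per element."""
    assert len(vec) % 2 == 0, "this sketch only handles even element counts"
    out = []
    for i in range(0, len(vec), 2):
        # Each pair maps onto one packed conversion (hypothetical model).
        out.extend(cvt_pk_f16_f32(vec[i], vec[i + 1]))
    return out

result = split_fptrunc([1.0, 2.5, 0.1, 3.0])
```

A four-element input therefore needs two packed conversions rather than
four scalar ones.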
We have multiple different attributes in clang representing device
kernels for specific targets/languages. Refactor them into one attribute
with different spellings to make it more easily scalable for new
languages/targets.
---------
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Solely relying on top‑down scheduling can underutilize hardware, since
long‑latency instructions often end up scheduled too late and their
latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us
move those instructions earlier, which improves latency hiding and
yields roughly a 2% performance gain on key benchmarks.
Previously, clustered nodes were managed by adding weak edges between
neighbouring cluster nodes, which forms a sort of ordered queue. This is
later recorded as `NextClusterPred` or `NextClusterSucc` in
`ScheduleDAGMI`.
However, instructions may not be picked in the exact order of the queue.
For example, suppose we have a queue of cluster nodes A B C. During
scheduling, node B might be picked first, in which case it is very likely
that we only cluster B and C for Top-Down scheduling (leaving A alone).
Another issue is:
```
if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
  std::swap(SUa, SUb);
if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.
For example, we want to cluster nodes (ordered as in `MemOpRecords`): 1 3
2. Normally 1(SUa) will be the pred of 3(SUb). But when it comes to
(3, 2), since 3(SUa) > 2(SUb), we reorder the two nodes, which makes 2
the pred of 3. Now both 1 and 2 are preds of 3, but there is no edge
between 1 and 2. Thus we get a broken cluster chain.
To fix both issues, we introduce an unordered set in this change. This
can help improve clustering in some hard cases.
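The broken chain can be reproduced with a small Python model. This is a
sketch only: `chain_edges` and `is_single_chain` are hypothetical helpers
that mimic the swap shown above, not the scheduler code, and the
`frozenset` merely stands in for the unordered set this change
introduces.

```python
def chain_edges(order, reorder_while_clustering=False):
    """Return (pred, succ) cluster edges for neighbouring nodes in `order`."""
    edges = []
    for a, b in zip(order, order[1:]):
        if not reorder_while_clustering and a > b:
            a, b = b, a  # mirrors the std::swap(SUa, SUb) in the snippet
        edges.append((a, b))
    return edges

def is_single_chain(edges):
    """A chain has at most one predecessor and one successor per node."""
    preds, succs = {}, {}
    for p, s in edges:
        preds.setdefault(s, []).append(p)
        succs.setdefault(p, []).append(s)
    one_pred = all(len(v) <= 1 for v in preds.values())
    one_succ = all(len(v) <= 1 for v in succs.values())
    return one_pred and one_succ

edges = chain_edges([1, 3, 2])       # node 3 ends up with two preds
cluster = frozenset([1, 3, 2])       # membership only, no ordering
```

With an unordered set, whether a node belongs to the cluster no longer
depends on the order in which its neighbours were recorded or picked.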
One key reason the change causes so many test check changes is that the
cluster candidates are no longer ordered, so they might be picked in a
different order than before.
The most affected targets are: AMDGPU, AArch64, RISCV.
For RISCV, most changes appear to be minor instruction reordering; I
don't see any obvious regression.
For AArch64, some combining of ldr into ldp is affected, with two cases
regressed and two improved. The deeper reason is that the machine
scheduler cannot cluster them well either before or after the change,
and the later load combine algorithm is also not smart enough.
For AMDGPU, some cases use more v_dual instructions while some regress.
This seems less critical. Test `v_vselect_v32bf16` appears to get more
buffer_load instructions claused.
This is a copy of fneg.f16.ll with only the type replaced.
The final logic should be the same as for f16, since these are
just bit operations.
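To see why the logic carries over, here is a minimal sketch. It assumes
the replaced type is another 16-bit float format with the sign in bit 15,
such as bf16 (the message does not name the type). Negation is then the
same sign-bit flip regardless of how the remaining bits are split between
exponent and mantissa. `fneg_bits16` is a hypothetical helper.

```python
SIGN_BIT = 0x8000  # bit 15 is the sign bit in both f16 and bf16

def fneg_bits16(bits):
    """Negate a 16-bit float given as its raw bit pattern."""
    return (bits ^ SIGN_BIT) & 0xFFFF

f16_one = 0x3C00   # 1.0 encoded as IEEE half
bf16_one = 0x3F80  # 1.0 encoded as bfloat16

f16_neg_one = fneg_bits16(f16_one)
bf16_neg_one = fneg_bits16(bf16_one)
```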
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Combining a `freeze` of an `fmul` (without `nnan`) followed by an `fadd`
or `fsub` into a single `fma` is supported.
This patch adds lit tests to verify the optimization behavior for both
nnan and non-nnan variants.
Two changes in this patch:
1. Cover another case in the legalizeOperandVALUt16 functions and the
COPY lowering: when a SALU16 is used by a SALU32, a reg_sequence needs to
be inserted after the move to VALU (previously only the SALU32 used by
SALU16 case was considered).
2. Move the useMI analysis into addUsersToMoveVALUList, and legalize the
targeted operand when needed.
Turn on the frem test with true16 mode for gfx1150, which was failing
before this patch. A few bitcast tests are also affected by this change,
with some v_mov instructions replaced by dual mov.
Add command line and function attribute configuration of hard clause
length limit (within hardware maximum).
This allows performance tuning for shaders which benefit from smaller
clauses.
Fix parsing of FP literals for scale values in scaled MFMA.
Due to the difference in operand order between the MCInst and the parsed
operands, the FP literal imms for scale values were not parsed
correctly.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.
In particular, the register counts are used:
1. When launching an entry function - in which case we're safe because
the register counts of the entry function will include the register
counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
because whenever we set the new VGPR allocation we take into account
the outgoing_vgpr_count set by the middle-end.
The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.
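A toy model of the counting rule, for illustration only: each instruction
is represented as a hypothetical `(defs, uses)` pair of register indices,
and register 40 stands in for an artificial VGPR argument that is only
read. None of these names or numbers come from the backend.

```python
def max_register_count(instructions, count_uses=False):
    """instructions: iterable of (defs, uses) tuples of register indices."""
    highest = -1
    for defs, uses in instructions:
        regs = list(defs) + (list(uses) if count_uses else [])
        for reg in regs:
            highest = max(highest, reg)
    return highest + 1  # registers are numbered from 0

body = [
    ((0,), (40,)),   # reads the artificial argument, defines v0
    ((1,), (0,)),
    ((2, 3), (1,)),
]

defs_only = max_register_count(body)                   # ignores v40
with_uses = max_register_count(body, count_uses=True)  # includes v40
```

Counting defs only keeps the read-only argument register from inflating
the allocation.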
---------
Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
Supports the `nestedGEP` pattern that appears when an alloca is first
indexed as an array element and then shifted with a byte-offset GEP:
```llvm
%SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
%row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
%elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
%val = load i32, ptr addrspace(5) %elt1
```
The pass folds the two levels of addressing into a single vector lane
index and keeps the whole object in a VGPR:
```llvm
%vec = freeze <20 x i32> poison ; alloca promote <20 x i32>
%idx0 = mul i32 %j, 2 ; j * 2
%idx = add i32 %idx0, 1 ; j * 2 + 1
%val = extractelement <20 x i32> %vec, i32 %idx
```
This eliminates the scratch read.
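The index arithmetic in the example can be checked with a short sketch.
`lane_index` is a hypothetical helper; the default sizes are taken from
the `[10 x <2 x i32>]` alloca above (8-byte rows, 4-byte elements), and
byte offsets that are not a multiple of the element size are simply
rejected here.

```python
def lane_index(j, byte_offset, row_bytes=8, elt_bytes=4):
    """Fold an array index plus a byte offset into one vector lane index."""
    assert byte_offset % elt_bytes == 0, "sketch handles aligned offsets only"
    lanes_per_row = row_bytes // elt_bytes
    return j * lanes_per_row + byte_offset // elt_bytes

# Row j with byte offset 4 selects lane j * 2 + 1, as in the IR above.
indices = [lane_index(j, 4) for j in range(10)]
```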
This avoids some ugly codegen on targets without 16-bit instructions,
caused by f16 legalization effects. It also avoids regressions on newer
targets in a future patch.
SIFixSGPRCopies handled STRICT_WWM (and similar WWM/WQM pseudos) like a
COPY. In particular, if the source was a VGPR and the result was an
SGPR, lowerVGPR2SGPRCopies would replace it with a readfirstlane,
erasing the original pseudo and hence sabotaging the WWM region marking
which is supposed to be performed by SIWholeQuadMode.
Fix this by handling it more like INSERT_SUBREG, PHI and REG_SEQUENCE:
if the source is a VGPR then move the result to a VGPR, and keep the
pseudo.
VOP3 instructions ignore opsel source modifiers, so a constant that
contains two different [B]F16 imms cannot be encoded into an instruction
with a src opsel.
E.g. without the fix the following instructions
`s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>`
`v_cvt_scalef32_pk_fp8_f16 v0, s0, v2`
lose the `2.0` imm and are folded into
`v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0`
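The constant in the example can be decoded with a short sketch.
`unpack_halves` is a hypothetical helper that only shows that the two
16-bit halves of `0x40003c00` hold different values, so no single F16
immediate represents both.

```python
import struct

def unpack_halves(imm32):
    """Split a 32-bit constant into its (low, high) IEEE half values."""
    lo, hi = struct.unpack("<2e", struct.pack("<I", imm32))
    return lo, hi

lo, hi = unpack_halves(0x40003C00)
halves_differ = lo != hi  # True here: the halves hold different values
```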
Fixes SWDEV-531672
Extend sra i64 simplification to shift constants in range [33:62]. Shift
amounts 32 and 63 were already handled.
New testing for shift amounts 33 and 62 is added in sra.ll. Changes to
other test files adapt previous test results to this extension.
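For intuition, here is one common way such a shift can be expressed using
only the high 32-bit half. This is a sketch under that assumption, not
necessarily the exact sequence the patch emits, and both helper names are
hypothetical. For shift amounts of 32 or more, the low result half is the
high input half shifted by `amt - 32`, and the high result half is the
sign fill.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def sra64_via_high_half(x, amt):
    """Arithmetic shift right of a signed 64-bit x by amt in [32, 63]."""
    assert 32 <= amt <= 63
    hi = to_signed(x >> 32, 32)            # signed high 32 bits of x
    new_lo = (hi >> (amt - 32)) & MASK32   # 32-bit sra of the high half
    new_hi = (hi >> 31) & MASK32           # all zeros or all ones
    return to_signed((new_hi << 32) | new_lo, 64)

example = sra64_via_high_half(-(1 << 40), 33)
```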
---------
Signed-off-by: John Lu <John.Lu@amd.com>
This was failing on the buffer fat pointer lowering error in the
addrspace(7) case, not the expected asm printer breakage. Also remove
the attempt at FileChecking the result, since that is dependent on the
actual fix and we want the unexpected pass whenever the assert is fixed.
It will run twice in the non-LTO pipeline with `O1` or higher. In the LTO post-link pipeline, it will run once with `O2` or higher, since inlining and SROA don't run at `O1`.