8148 Commits

Author SHA1 Message Date
Matt Arsenault
f4598194b5
DAG: Fold bitcast of scalar_to_vector to anyext (#122660)
scalar_to_vector is difficult to make appear and test,
but I found one case where this makes an observable difference.
It fires more often than this in the test suite, but most of them
have no net result in the final code. This helps reduce regressions
in a future commit.
2025-01-13 19:38:58 +07:00
Matt Arsenault
e9a55770dc
AMDGPU: Add gfx9 run line to scalar_to_vector test (#122659) 2025-01-13 19:35:56 +07:00
Akshat Oke
73b0e8a191
[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434) 2025-01-13 17:52:30 +05:30
Akshat Oke
7bf1cb702b
[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261) 2025-01-13 10:11:40 +05:30
Shilei Tian
f15da5fb78
[AMDGPU] Fix an invalid cast in AMDGPULateCodeGenPrepare::visitLoadInst (#122494)
Fixes: SWDEV-507695
2025-01-12 23:40:25 -05:00
Austin Kerbow
657fb4433e
[AMDGPU] Add target hook to isGlobalMemoryObject (#112781)
We want special handling for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
2025-01-11 09:57:57 -08:00
Austin Kerbow
2e5c298281
[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)
Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.

A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
2025-01-10 11:39:02 -08:00
Matt Arsenault
7ebf0df409 AMDGPU: Test gfx940 mfma intrinsics on gfx950
This requires splitting the xf32 cases into a separate file
2025-01-10 23:16:25 +07:00
Mirko Brkušanin
3def49cb64
[AMDGPU] Remove s_wakeup_barrier instruction (#122277) 2025-01-10 11:30:22 +01:00
Nikita Popov
eeac0ffaf4 Revert "[MachineLICM] Use RegisterClassInfo::getRegPressureSetLimit (#119826)"
This reverts commit b4e17d4a314ed87ff6b40b4b05397d4b25b6636a.

This causes a large compile-time regression.
2025-01-10 09:05:06 +01:00
Jakub Chlanda
01a7d4e26b
[AMDGPU] Allow selection of BITOP3 for some 2 opcodes and B32 cases (#122267)
This came up in downstream static analysis as dead code.

Admittedly, it depends on what the intention was when checking for [`if
(NumOpcodes == 2 &&
IsB32)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3792C3-L3792C32)
and I took a guess that for certain cases the selection should take
place.

If that's incorrect, that whole if statement can be removed, as it is
after a check for: [`if (NumOpcodes <
4)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3788)
2025-01-10 07:49:11 +01:00
Chinmay Deshpande
211bcf67aa
[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (#121465) 2025-01-10 09:05:41 +05:30
Brox Chen
222ff18608
[AMDGPU][True16][CodeGen] Update codegen pattern for v_med3_f16 (#121992)
true16 codegen pattern for v_med3_f16
2025-01-09 13:40:13 -05:00
Matt Arsenault
d2b78c646b
AMDGPU: Custom lower bf16 shuffles (#122252)
We already custom lower the other 16-bit element type shuffles.
2025-01-09 21:37:27 +07:00
Pengcheng Wang
b4e17d4a31
[MachineLICM] Use RegisterClassInfo::getRegPressureSetLimit (#119826)
`RegisterClassInfo::getRegPressureSetLimit` is a wrapper around
`TargetRegisterInfo::getRegPressureSetLimit` with some logic to
adjust the limit by removing reserved registers.

It seems that we shouldn't use
`TargetRegisterInfo::getRegPressureSetLimit`
directly, just as the comment "This limit must be adjusted
dynamically for reserved registers" says.

Separate from https://github.com/llvm/llvm-project/pull/118787
2025-01-09 21:05:52 +08:00
Chinmay Deshpande
659cd2a48a
[NFC][AMDGPU] Pre-commit tests for IR variant - isFMAFasterThanFMulAdd (#121925) 2025-01-09 15:51:37 +05:30
Matt Arsenault
09583dec15
AMDGPU: Reduce 64-bit add width if low bits are known 0 (#122049)
If the low bits of one of the inputs are all 0, the low part cannot
carry and we can just pass through the original value.

Add case: https://alive2.llvm.org/ce/z/TNc7hf
Sub case: https://alive2.llvm.org/ce/z/AjH2-J

We could do this in the general case with computeKnownBits,
but add is so common this could be potentially expensive for
something which will fire infrequently.

One potential concern is this could break the 64-bit add
we expect to see for addressing mode matching, but these
constants shouldn't appear often in addressing expressions.
One test for large offset expressions changes but isn't worse.
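
The carry argument above can be checked with a small model. This is only a sketch of the arithmetic in Python, not the DAG combine itself; the function names and the 32/32 split are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_full(a: int, b: int) -> int:
    """Reference: ordinary wrapping 64-bit add."""
    return (a + b) & MASK64


def add64_low_zero(a: int, b: int) -> int:
    """64-bit add when the low 32 bits of b are known to be 0.

    The low half of a passes through unchanged (it cannot produce a carry),
    so only a 32-bit add of the high halves is needed.
    """
    assert b & MASK32 == 0, "low 32 bits of b must be zero"
    lo = a & MASK32                          # passed through
    hi = ((a >> 32) + (b >> 32)) & MASK32    # 32-bit add, wraps like the full add
    return (hi << 32) | lo
```

Both functions agree for any `a` as long as the low half of `b` is zero, including inputs where the high halves wrap around.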

Fixes https://github.com/ROCm/llvm-project/issues/237
2025-01-08 22:33:54 +07:00
Matt Arsenault
637641840d
AMDGPU: Add baseline test for add64 with constant test (#122048)
Add baseline test for 64-bit adds when the low half of
an operand is known 0.
2025-01-08 22:30:04 +07:00
Changpeng Fang
68694259b2
AMDGPU: Use getSignedTargetConstant for ImmOffset in SelectScratchSVAddr (#121978)
ImmOffset is signed and we will hit an assert with negative ImmOffset
when getTargetConstant is used.

Fixes: SWDEV-506453
2025-01-07 12:02:18 -08:00
Brox Chen
49357b22db
[AMDGPU][True16][CodeGen] true16 codegen pattern for v_med3_u/i16 (#121850)
True16 codegen pattern for v_med3_u/i16
2025-01-07 13:18:28 -05:00
Matt Arsenault
7899572c88 AMDGPU: Forcibly disable verifier in test
The test added in f6365a47a1ad9ab6d432f6e40d14a11419e21282 fails the verifier
for the reason noted in the comment, but we need to skip the verifier
error in EXPENSIVE_CHECKS builds
2025-01-07 22:46:46 +07:00
Brox Chen
d0812dbbff
[AMDGPU][True16][MC] true16 for v_minmax/maxmin_f16 and v_minmax/maxmin_num_f16 (#120617)
True16 support for v_minmax/maxmin_f16(GFX11) and
v_minmax/maxmin_num_f16(GFX12).

These insts are updated at the same time since we are replacing
`v_minmax/maxmin_f16` with `v_minmax/maxmin_fake16_f16` while
`v_minmax/maxmin_num_f16` are alias insts and share the same CodeGen
pattern.

Added a GFX12 runline in minmax.ll in fake16 flow
2025-01-07 10:27:54 -05:00
bcahoon
17c8c1c509
[AMDGPU] Do not fold into v_accvpr_mov/write/read (#120475)
In SIFoldOperands, leave copies for moving between agpr and vgpr
registers. The register coalescer is able to handle the copies
more efficiently than v_accvgpr_mov, v_accvgpr_write, and
v_accvgpr_read. Otherwise, the compiler generates unnecessary
instructions such as v_accvgpr_mov a0, a0.
2025-01-07 09:25:01 -06:00
choikwa
8d2e611802
[AMDGPU] Calculate getDivNumBits' AtLeast using bitwidth (#121758)
Previously, shrinkDivRem64 used a fixed value of 32 for AtLeast, which
meant that divisions narrower than 64 bits were rejected from shrinking,
since the logic depended only on the number of sign bits. E.g. 'idiv i48
%0, %1' would return 24 for the number of sign bits if %0 and %1 both had
24 division bits, and was rejected.
2025-01-07 01:31:09 -05:00
Matt Arsenault
8c0483bba2
RegisterCoalescer: Fix assert on remat to copy-to-physreg with subregs (#121734)
Do not try to rematerialize a super-register def used by a subregister
extract copy into a copy to a physical register if the other pieces of
the full physreg are live at the rematerialization point. It would insert
the super-register def at the rematerialization point, and assert since
the other half of the register was already live.

This is analogous to the undef subregister def handling above,
which handled the virtual register case.

Fixes #120970
2025-01-07 12:22:23 +07:00
Matt Arsenault
93e63460a2 RegAllocGreedy: Un-disable test in expensive_checks builds
This reverts a8f3ebaf11c3745e5123054776eb71755d16f2f9. You need to
use -verify-regalloc to get a MachineVerifier run with LiveIntervals,
otherwise cases not covered by the basic liveness implementation
in the verifier are passed through (which covers most use of undefined
subrange errors).
2025-01-07 12:21:21 +07:00
Matt Arsenault
a8f3ebaf11 AMDGPU: Mark test as XFAIL in expensive_checks builds
One of the tests added in 93220e7e06473a11bf48fee26bcea16cc527e5dc
fails the machine verifier after allocation, but this is a separate
issue.
2025-01-07 08:47:59 +07:00
Matt Arsenault
f6365a47a1
AMDGPU: Fix assert on physreg MUBUF rsrc operand (#120815)
The stack case uses a physical register and should not ordinarily
reach here, but strange things happen at -O0. The testcase still
errors because we do not yet attempt to handle arbitrary dynamic
sized allocas.

Fixes: SWDEV-503538
2025-01-07 08:11:05 +07:00
Brox Chen
ce831a231a
[AMDGPU][True16][MC] true16 for v_fma_f16 (#119477)
Support true16 format for v_fma_f16 in MC.

Since we are replacing v_fma_f16 with v_fma_f16_t16/v_fma_f16_fake16 in
Post-GFX11, we have to update the CodeGen pattern for v_fma_f16_fake16 to
get the CodeGen tests passing. No pattern is modified or created; this
just replaces v_fma_f16 with the fake16 format.
2025-01-06 15:02:04 -05:00
Emma Pilkington
dc0e258fe4
[AMDGPU] Remove Dwarf encodings for subregisters (#117891)
Previously, registers and subregisters mapped to the same Dwarf
encoding. We don't really have any way to refer to subregisters directly
from Dwarf; the expression emitter should instead use DW_OPs to stencil
out the subregister from the whole register. This was also confusing
tools that need to map back to the llvm reg (e.g. dwarfdump), since
getLLVMRegNum() would arbitrarily return the _LO16 register.
2025-01-06 14:51:16 -05:00
Matt Arsenault
93220e7e06
RegAllocGreedy: Fix use after free during last chance recoloring (#120697)
Last chance recoloring can delete the current fixed interval
during recursive assignment of interfering live intervals. Check
if the virtual register value was assigned before attempting the
unassignment, as is done in other scenarios. This relies on the fact
that we do not recycle virtual register numbers.

I have only seen this occur in error situations where the allocation
will fail, but I think this can theoretically happen in working
allocations.

This feels very brute force, but I've spent over a week debugging
this and this is what works without any lit regressions. The surprising
piece to me was that unspillable live ranges may be spilled, and
a number of tests rely on optimizations occurring on them. My other
attempts to fix this mostly revolved around not identifying unspillable
live ranges as snippet copies. I've also discovered we're making some
unproductive live range splits with subranges. If we avoid such splits,
some of the unspillable copies disappear but mandating that be precise
to fix a use after free doesn't sound right.
2025-01-06 23:12:55 +07:00
Phoebe Wang
1547382033
[X86] Support lowering of FMINIMUMNUM/FMAXIMUMNUM (#121464) 2025-01-06 21:28:58 +08:00
Vikash Gupta
fd6f8b3ce3
[AMDGPU] [GlobalIsel] Combine Fmul with Select into ldexp instruction. (#120104)
This combine pattern performs the transformation below.

fmul x, select(y, A, B)      -> fldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B)   -> fldexp ((fneg x), select i32 (y, a, b))

where, A=2^a & B=2^b ; a and b are integers.

It is a follow-up PR to implement the above combine for GlobalISel, as
the corresponding DAG combine has been done for SelectionDAG ISel
(#111109)
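
The identity behind the combine can be sanity-checked numerically. The sketch below models it with Python's `math.ldexp`; the helper names are hypothetical, and it only covers values in the normal range (on overflow `math.ldexp` raises while a multiply would give infinity).

```python
import math


def fmul_select(x: float, cond: bool, A: float, B: float) -> float:
    """Original form: fmul x, select(cond, A, B)."""
    return x * (A if cond else B)


def ldexp_select(x: float, cond: bool, a: int, b: int) -> float:
    """Combined form for A = 2^a, B = 2^b: fldexp(x, select(cond, a, b))."""
    return math.ldexp(x, a if cond else b)


def fneg_ldexp_select(x: float, cond: bool, a: int, b: int) -> float:
    """Combined form for -A, -B: fldexp(fneg x, select(cond, a, b))."""
    return math.ldexp(-x, a if cond else b)
```

For example, with `A = 8.0` (`a = 3`) and `B = 0.25` (`b = -2`), both forms give the same result for either value of the condition, because scaling by a power of two is exact.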
2025-01-06 17:42:38 +05:30
Aaditya
0bd1c87996
[AMDGPU] Support divergent sized dynamic alloca (#121148)
Currently, AMDGPU backend can handle uniform-sized dynamic allocas. 
This patch extends support for divergent-sized dynamic allocas.
When the size argument of a dynamic alloca is divergent, 
a wave-wide reduction is performed to get the required stack space. 
`@llvm.amdgcn.wave.reduce.umax` is used to perform the 
wave reduction.

Dynamic allocas are not completely supported yet, 
as the stack is not properly restored on function exit.
This patch doesn't attempt to address the aforementioned issue.

Note: the compiler already zero-extends or truncates all other
types (of the alloca size argument) to i32.
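
As a rough model of the reduction step: each lane may request a different size, and the wave takes the unsigned maximum over its active lanes. This is a plain Python simulation, not the lowering of `@llvm.amdgcn.wave.reduce.umax`; the function name, the explicit exec mask, and the result of 0 for a wave with no active lanes are assumptions for illustration.

```python
def wave_reduce_umax(lane_values, exec_mask):
    """Unsigned max of lane_values over the lanes enabled in exec_mask.

    Models only the result of the reduction; returning 0 when no lane is
    active is an assumption of this sketch.
    """
    active = [v for v, enabled in zip(lane_values, exec_mask) if enabled]
    return max(active, default=0)
```

With per-lane sizes `[4, 16, 8, 32]` and the last lane inactive, the reduced size is 16, which is enough for every active lane.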
2025-01-06 12:28:24 +07:00
Matt Arsenault
d34f7ead88
DAG: Fix assuming f16 is the only 16-bit fp type in concat vector combine (#121637)
This would see if there are mixed integer and FP types and pick an
equivalently sized FP type to use as the vector element type, and only
cast if there were mixed integers. We need to insert a cast if the types
are mixed, which may include different FP types.

Fixes #121601
2025-01-06 10:38:54 +07:00
Brox Chen
d7acf03cec
[AMDGPU][True16][MC] true16 for v_rndne_f16 (#120691)
Support true16 format for v_rndne_f16 in MC
2025-01-03 16:32:15 -05:00
Brox Chen
bf274b3d80
[AMDGPU][True16][MC] true16 for v_cos_f16 (#120639)
Support true16 format for v_cos_f16 in MC
2025-01-03 15:46:41 -05:00
Brox Chen
dc307be1b5
[AMDGPU][True16][MC] true16 for v_fract_f16 (#120647)
Support true16 format for v_fract_f16 in MC
2025-01-03 15:45:33 -05:00
Jun Wang
b2adeae865
[AMDGPU][MC] Allow null where 128b or larger dst reg is expected (#115200)
For GFX10+, currently null cannot be used as dst reg in instructions
that expect the dst reg to be 128b or larger (e.g., s_load_dwordx4).
This patch fixes this problem while ensuring null cannot be used as S#,
T#, or V#.
2025-01-03 11:49:51 -08:00
Brox Chen
3b72c62e7f
[AMDGPU][True16][MC] true16 for v_frexp_mant_f16 (#120653)
Support true16 format for v_frexp_mant_f16 in MC
2025-01-03 14:42:39 -05:00
Brox Chen
34d2c3b934
[AMDGPU][True16][MC] true16 for v_sin_f16 (#120692)
Support true16 format for v_sin_f16 in MC
2025-01-03 14:11:25 -05:00
Brox Chen
c744ed53a8
[AMDGPU][True16][MC] disable incorrect VOPC t16 instruction (#120271)
The current VOPC t16 instructions are not implemented with the correct
t16 pseudo. Thus the current t16/fake16 instructions are all in fake16
format.

The plan is to remove the incorrect t16 instructions and refactor them.
The first step is to remove them in this patch. The next step will be
updating the t16/fake16 pseudo to the correct format and add back true16
instruction one by one in the upcoming patches.
2025-01-03 11:58:04 -05:00
Brox Chen
e5acb167b7
[AMDGPU][True16][MC] true16 for v_trunc_f16 (#120693)
Support true16 format for v_trunc_f16 in MC
2025-01-03 11:43:45 -05:00
Brox Chen
8b23ebb498
[AMDGPU][True16][MC] true16 for v_max3/min3_num_f16 (#121510)
V_MAX3/MIN3_NUM_F16 are the GFX12 aliases of V_MAX3/MIN3_F16 in
GFX11 and they should be updated together.

This fixes a bug introduced in
https://github.com/llvm/llvm-project/pull/113603 such that only
V_MAX3/MIN3_F16 are replaced in true16 format. Also added GFX12 runlines
for the CodeGen test
2025-01-03 08:55:58 +00:00
Matt Arsenault
11e482c4a3
RegAllocGreedy: Add dummy priority advisor for writing MIR tests (#121207)
I regularly struggle reproducing failures in greedy due to changes
in priority when resuming the allocation from MIR vs. a complete
compilation starting at IR. That is, the fix in
e0919b189bf2df4f97f22ba40260ab5153988b14 did not really fix the
problem of the instruction distance mattering.

Add a way to bypass all of the priority heuristics for MIR tests,
by prioritizing only by virtual register number. Could also
give this a more specific name, like PrioritizeLowVirtRegNumber
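
To illustrate why such an advisor makes MIR tests reproducible, here is a toy queue in Python. It is not the RegAllocGreedy implementation: `LiveInterval`, its `size` field, and `dequeue_order` are hypothetical stand-ins, and "lower number first" is inferred from the suggested name above.

```python
import heapq
from typing import NamedTuple


class LiveInterval(NamedTuple):
    vreg: int  # virtual register number
    size: int  # stand-in for the size/distance inputs a real heuristic would use


def dequeue_order(intervals, priority):
    """Pop intervals highest priority first and return their vreg numbers."""
    heap = [(-priority(li), i, li) for i, li in enumerate(intervals)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2].vreg for _ in range(len(heap))]


def dummy_priority(li: LiveInterval) -> int:
    """Lower virtual register numbers win; nothing else is consulted."""
    return -li.vreg
```

With `dummy_priority` the order depends only on the register numbers written in the test, whereas a size-based priority such as `lambda li: li.size` changes whenever the heuristic inputs change.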
2025-01-02 23:04:44 +07:00
Vikram Hegde
f1fa292cd6
[AMDGPU] Pre-commit tests for "lshr + mad" fold (#119509) 2025-01-01 10:17:37 +05:30
Jay Foad
2d6d723a85
[AMDGPU] Add some more GFX12 test coverage (#120581) 2024-12-23 09:42:52 +00:00
Chaitanya
21996bd69c
[AMDGPU] Remove amdgpu-no-heap-ptr and amdgpu-no-lds-kernel-id attributes from lowered kernels in amdgpu-sw-lower-lds pass (#120887)
The 'amdgpu-sw-lower-lds' pass internally calls '__asan_malloc_impl' for
heap memory allocation.
The pass also uses 'amdgcn_lds_kernel_id' for lowering non-kernel LDS
accesses.

This patch removes 'amdgpu-no-heap-ptr' and 'amdgpu-no-lds-kernel-id'
from all kernels lowered by the pass.
2024-12-23 12:42:31 +05:30
Aaditya
c7606710f9
[AMDGPU] Update base addr of dyn alloca considering GrowingUp stack (#119822)
Currently, the compiler calculates the base address of a
dynamically sized stack object (alloca) as follows:
1. `NewSP = Align(CurrSP + Size)`
_where_ `Size = # of elements * wave size * alloca type`
2. `BaseAddr = NewSP`
3. The alignment is computed as: `AlignedAddr = Addr & ~(Alignment - 1)`
4. Return the `BaseAddr`
This makes sense when the stack grows downwards.

The AMDGPU stack grows upwards, so the base address
needs to be aligned first and the SP bumped by the required size afterwards:
1. `BaseAddr = Align(CurrSP)`
2. `NewSP = BaseAddr + Size`
3. `AlignedAddr = (Addr + (Alignment - 1)) & ~(Alignment - 1)`
4. Return the `BaseAddr`.
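
The two schemes can be compared side by side with plain integers. This is a sketch of the arithmetic as described above, not the lowering code; the function names are hypothetical and `alignment` is assumed to be a power of two.

```python
def align_down(addr: int, alignment: int) -> int:
    """Addr & ~(Alignment - 1)."""
    return addr & ~(alignment - 1)


def align_up(addr: int, alignment: int) -> int:
    """(Addr + (Alignment - 1)) & ~(Alignment - 1)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def alloca_previous(curr_sp: int, size: int, alignment: int):
    """Previous scheme: bump first, align, and use the new SP as the base."""
    new_sp = align_down(curr_sp + size, alignment)
    base = new_sp
    return base, new_sp


def alloca_growing_up(curr_sp: int, size: int, alignment: int):
    """Scheme for an upward-growing stack: align the base, then bump the SP."""
    base = align_up(curr_sp, alignment)
    new_sp = base + size
    return base, new_sp
```

For `curr_sp = 100`, `size = 40`, `alignment = 16`, the previous scheme returns a base of 128 with the SP also at 128, so on an upward-growing stack the object would sit above the SP. The second scheme returns a base of 112 and an SP of 152, which covers the whole object.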
2024-12-20 10:27:27 +05:30
Brox Chen
08db696c87
[AMDGPU][True16][MC] V_MED3_I/U16_fake16 CodeGen pattern (#120600)
https://github.com/llvm/llvm-project/pull/113603 replaced
`V_MED3_I/U16` with `V_MED3_I/U16_fake16` for Post-GFX11, but failed to
update the CodeGen pattern. This patch updates and corrects the CodeGen
pattern.
2024-12-20 10:53:58 +07:00