6562 Commits

Author SHA1 Message Date
Simon Pilgrim
e9caa37e9c [DAG] Move lshr narrowing from visitANDLike to SimplifyDemandedBits
Inspired by some of the cases from D145468

Let SimplifyDemandedBits handle the narrowing of lshr to half-width if we don't require the upper bits, the narrowed shift is profitable and the zext/trunc are free.

A future patch will propose the equivalent shl narrowing combine.

Differential Revision: https://reviews.llvm.org/D146121
2023-07-17 15:50:09 +01:00
Jay Foad
92542f2a40 [AMDGPU] Add targets gfx1150 and gfx1151
This is the target definition only. Currently they are treated the same
as GFX 11.0.x.

Differential Revision: https://reviews.llvm.org/D155429
2023-07-17 13:06:12 +01:00
Jay Foad
a2453c6130 [AMDGPU] Add test case for zext of f16 to i32
Preserve the test case from this abandoned review:
D51925 [AMDGPU] Fix issue for zext of f16 to i32
2023-07-17 12:55:29 +01:00
Jay Foad
a1a9c53ae7 [GlobalISel] Fix infinite loop in reassociation combine
Don't reassociate (C1+C2)+Y -> C1+(C2+Y).

Fixes https://github.com/llvm/llvm-project/issues/63849

Differential Revision: https://reviews.llvm.org/D155284
2023-07-16 14:15:24 +01:00
Jon Chesterfield
6043d4dfec [amdgpu] Accept an optional max to amdgpu-lds-size attribute for use in PromoteAlloca 2023-07-15 21:37:21 +01:00
Matt Arsenault
ef4a2b6096 AMDGPU: Expand testing of AMDGPUCodeGenPrepare fdiv handling
- Switch to generated checks
- Use a different run line per denormal mode to reduce test duplication
- Add test coverage for rsqrt cases
- Add test coverage for repeated arcp denominator
- Fix the optnone test
2023-07-14 18:57:40 -04:00
Konstantina Mitropoulou
21ca892f69 [NFC][AMDGPU] Add automated tests in or.ll
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155265
2023-07-14 12:39:14 -07:00
pvanhout
e5296c52e5 [AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis
The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were.

This worked well in most cases, but there's one case in rocRAND where
it doesn't work. In that case, a PHI node has 2 PHI users where one is
breakable but not the other. When that PHI node isn't broken performance falls by 35%.

Relaxing the restriction to "require that  half of the PHI node users are breakable" fixes the issue, and seems like a sensible change.

Solves SWDEV-409648, SWDEV-398393

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155184
2023-07-14 09:02:51 +02:00
Jon Chesterfield
d3316bc111 [amdgpu] Delete elide-module-lds attribute
Requires D155190

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155238
2023-07-14 00:36:33 +01:00
Jon Chesterfield
74e928a081 [amdgpu][lds] Remove recalculation of LDS frame from backend
Do the LDS frame calculation once, in the IR pass, instead of repeating the work in the backend.

Prior to this patch:
The IR lowering pass sets up a per-kernel LDS frame and annotates the variables with absolute_symbol
metadata so that the assembler can build lookup tables out of it. There is a fragile association between
kernel functions and named structs which is used to recompute the frame layout in the backend, with
fatal_errors catching inconsistencies in the second calculation.

After this patch:
The IR lowering pass additionally sets a frame size attribute on kernels. The backend uses the same
absolute_symbol metadata that the assembler uses to place objects within that frame size.

Deleted the now dead allocation code from the backend. Left for a later cleanup:
- enabling lowering for anonymous functions
- removing the elide-module-lds attribute (test churn, it's not used by llc any more)
- adjusting the dynamic alignment check to not use symbol names

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155190
2023-07-13 23:54:38 +01:00
Jeffrey Byrnes
6b7805fcb1 [AMDGPU][IGLP] Add iglp_opt(1) strategy for single wave gemms
This adds the IGLP strategy for single-wave gemms. The SchedGroup pipeline is laid out in multiple phases, with each phase corresponding to a distinct pattern present in gemm kernels. The resilience of the optimization is dependent upon IR (as seen by pre-RA scheduling) continuing to have these patterns (as defined by instruction class and dependencies) in their current relative ordering.

The kernels of interest have these specific phases:
NT: 1, 2a, 2c
NN: 1, 2a, 2b
TT: 1, 2b, 2c
TN: 1, 2b

The general approach taken was to have a long SchedGroup pipeline. In this way the scheduler will have less capability of doing the wrong thing. In order to resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver.

Differential Revision: https://reviews.llvm.org/D149773

Change-Id: I1a35962a95b4bdf740602b8f110d3297c6fb9d96
2023-07-13 12:03:04 -07:00
Ivan Kosarev
289ae6525d [AMDGPU][MC] Fix handling of A16 operands in intersect_ray instructions.
The patch adds the support for 'noa16' operands in non-A16 variants of
the instructions, fixes validation of A16 operands and eliminates the
custom conversion to MCInst.

Part of <https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155057
2023-07-13 19:46:03 +01:00
Mateja Marjanovic
fa46feb314 [AMDGPU] Use V_FMA_MIX* more often
Combine mul (f32) + fptrunc (f32->f16) to "v_fma_mixlo_f16 mulSrc1, mulSrc2, 0".

Differential Revision: https://reviews.llvm.org/D153544
Reviewers: arsenm, foad
2023-07-13 16:56:16 +02:00
Mateja Marjanovic
d3140f9363 Precommit for more usage of V_FMA/MAD_MIX*
Make fdiv.f16.ll autogenerated.
2023-07-13 16:26:21 +02:00
pvanhout
07c5920487 Reland "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64"
This time without the extra `->dump()`

A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.

The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.

Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155050
2023-07-13 15:58:48 +02:00
pvanhout
aec971adec Revert "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64"
This reverts commit cfa2d0a3aa0beb5422107dc9943cb0eae6d93896.
2023-07-13 15:52:27 +02:00
Mateja Marjanovic
701c4adcea Check for denormal flushing when selecting V_FMA/MAD_MIX* 2023-07-13 15:26:20 +02:00
pvanhout
cfa2d0a3aa [AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64
A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.

The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.

Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155050
2023-07-13 15:20:58 +02:00
Jon Chesterfield
9418c40af7 [amdgpu][lds] Raise an explicit unimplemented error on absolute address LDS variables
These aren't implemented. They could be at moderate implementation
complexity. Raising an error is better than silently miscompiling.

Patching now because the patch at D155125 is a step towards using this metadata
more extensively as part of the lowering path and that will interact badly with
input variables with this annotation.

Lowering user defined variables at specific addresses would drop this error,
put them at the requested position in the frame during this pass, and then
use the same codegen that will be used for the kernel specific struct shortly.

Reviewed By: jmmartinez

Differential Revision: https://reviews.llvm.org/D155132
2023-07-13 11:32:03 +01:00
pvanhout
361e9eec51 [AMDGPU] Corrrectly emit AGPR copies in tryFoldPhiAGPR
- Don't create COPY instructions between PHI nodes.
- Don't create V_ACCVGPR_WRITE with operands that aren't AGPR_32

Solves SWDEV-410408

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155080
2023-07-13 08:55:22 +02:00
Jingu Kang
33e60484d7 [MachineLICM] Handle Subloops
MachineLICM pass handles inner loops only when outmost loop does not have unique
predecessor. If the loop has preheader and there is loop invariant code, the
invariant code can be hoisted to the preheader in general. This patch makes the
pass handle inner loops in general.

Differential Revision: https://reviews.llvm.org/D154205
2023-07-12 16:32:14 +01:00
Ivan Kosarev
15e7749e19 [Codegen] Generate fast fp64-to-fp16 conversions in unsafe mode.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154528
2023-07-12 11:55:19 +01:00
David Stenberg
6aa94c64a5 [DWARF] Add printout for op-index
This is a preparatory patch for extending DWARFDebugLine to properly
parse line number programs with maximum_operations_per_instruction > 1
for VLIW targets.

Add some scaffolding for handling op-index in line number programs, and
add printouts for that in the table. As this affects a lot of tests,
this is done in a separate commit to get a cleaner review for the actual
op-index implementation.

Verbose printouts are not present in many tests, and adding op-index to
those will require a bit more code changes, so that is done in the
actual implementation patch.

Reviewed By: StephenTozer

Differential Revision: https://reviews.llvm.org/D152535
2023-07-12 12:03:44 +02:00
Jay Foad
f7684d8510 [DAG] Use legal shift amount type in DAGTypeLegalizer::JoinIntegers
Documentation for TargetLowering::getShiftAmountTy says that LegalTypes
should generally be true during type legalization, so this patch does
that.

On AMDGPU the effect is that we use i32 (a sane type) instead of i64
(pointer sized type) for more shift amounts, which in turn allows more
formation of rotates and funnel shifts pre-legalization.

Differential Revision: https://reviews.llvm.org/D154960
2023-07-12 08:12:09 +01:00
Jon Chesterfield
e75ce77cd7 [amdgpu][lds] Fix missing markUsedByKernel calls and undef lookup table elements
More robust association between the kernels and lds struct.

Use poison instead of value() for lookup table elements introduced by dynamic lds lowering.

Extracted from D154946, new test from there verbatim. Segv fixed.

Fixes issues/63338

Fixes SWDEV-404491

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154972
2023-07-12 00:37:21 +01:00
Matt Arsenault
b59022b42e DAG: Handle lowering of unordered fcZero|fcSubnormal to fcmp 2023-07-11 18:30:15 -04:00
Jon Chesterfield
980cd18354 [amdgpu][nfc] Drop lds strategy noise from some tests 2023-07-11 21:18:49 +01:00
Matt Arsenault
fbe4ff8149 AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
2023-07-11 15:14:52 -04:00
Amara Emerson
3a80bdb316 [GlobalISel] Remove an erroneous oneuse check in the G_ADD reassociation combine.
This check was unnecessary/incorrect, it was already being done by the target
hook default implementation, and the one in the matcher was checking for a
completely different thing. This change:
 1) Removes the check and updates affected tests which now do some more reassociations.
 2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse
    check. Not sure why I didn't do this the first time.
2023-07-10 01:03:12 -07:00
Matt Arsenault
64d325454b AMDGPU: Delete custom combine on class intrinsic
This is no longer necessary as class-with-constant will always be
transformed to the generic class intrinsic.

https://reviews.llvm.org/D153901
2023-07-07 15:28:21 -04:00
Christudasan Devadasan
7a98f084c4 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

Differential Revision: https://reviews.llvm.org/D124196
2023-07-07 23:14:32 +05:30
Christudasan Devadasan
b4a62b1fa5 [AMDGPU] Enable whole wave register copy
So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.

Any liverange split of virtual registers involved in
whole-wave operations require the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.

This patch would identify all such copies and
manipulate the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D143762
2023-07-07 22:58:55 +05:30
Christudasan Devadasan
b78b36e1a2 [AMDGPU] Implement whole wave register spill
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.

This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D143759
2023-07-07 22:51:45 +05:30
Matt Arsenault
94e24624c2 AMDGPU: Remove attempt at simplifying the format string in printf lowering
This avoids computing the dominator tree by removing the
simplifyInstruction use.

This was applying simplification with some kind of questionable
load-store forwarding and looking for the global. This had to have
been an ancient hack copied from previous backends. In the OpenCL
case, this is always emitted as required the direct global reference
anyway.
2023-07-07 09:26:07 -04:00
Matt Arsenault
64df9573a7 DAG: Handle inversion of fcSubnormal | fcZero
There are a number of more test combinations here that
can be done together and reduce the number of instructions.

https://reviews.llvm.org/D143191
2023-07-06 21:19:44 -04:00
Matt Arsenault
61820f8b5d CodeGen: Optimize lowering of is.fpclass fcZero|fcSubnormal
Combine the two checks into a check if the exponent bits are 0. The
inverted case isn't reachable until a future change, and GlobalISel
currently doesn't attempt the inversion optimization.

https://reviews.llvm.org/D143182
2023-07-06 13:03:57 -04:00
Matt Arsenault
9df70e4a4d AMDGPU: Fix not applying the correct default memcpy expansion threshold
Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got
renamed at some point in the process and the target implementation was
left behind.

Fixes: SWDEV-407329
2023-07-06 12:14:14 -04:00
Matt Arsenault
c70cae6315 AMDGPU: Make SIFixVGPRCopies preserve everything
All this does is add uses of reserved registers, which
aren't tracked by anything. Saves a loop info computation.
2023-07-06 10:26:21 -04:00
Matt Arsenault
8ee1cc82c9 AMDGPU: Fold out sign bit ops on frexp_exp
The sign bit has no impact on the exponent, so strip these away. Saves
on the source modifier encoding cost. I left the GlobalISel handling
until there's a resolution to issue #62628.

We should do this in instcombine too, but legalization should be
introducing more frexps than it currently is where this would occur.
2023-07-06 10:26:21 -04:00
Ivan Kosarev
b4049b409b [AMDGPU] Add GlobalISel test coverage for floating-point truncations.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154527
2023-07-06 11:37:09 +01:00
Valery Pykhtin
98aa8439f5 [AMDGPU] Fix register class for a subreg in GCNRewritePartialRegUses.
1. Improved code that deduces register class from instruction definitions. Previously if some instruction didn't contain a reg class for an operand it was considered as no information on register class even if other instructions specified the class.

2. Added check on required size of resulting register because in some cases classes with smaller registers had been selected (for example VReg_1).

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D152832
2023-07-06 08:48:45 +02:00
Matt Arsenault
20964c901a DAG: Fix dropping flags when widening unary vector ops 2023-07-05 17:25:24 -04:00
Matt Arsenault
5491666248 AMDGPU: Correctly lower llvm.exp.f32
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.

Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.

Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
2023-07-05 17:23:49 -04:00
Matt Arsenault
ed556a1ad5 AMDGPU: Correctly lower llvm.exp2.f32
Previously this did a fast math expansion only.
2023-07-05 17:23:48 -04:00
Matt Arsenault
9c82dc6a6b AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
2023-07-05 16:53:01 -04:00
Matt Arsenault
59c311c5d4 AMDGPU: Add more tests for f16 fdiv lowering
Probably should merge the DAG and gisel tests.
2023-07-05 16:53:01 -04:00
Matt Arsenault
4e15f378ee AMDGPU: Correctly lower llvm.log.f32 and llvm.log10.f32
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.

Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
2023-07-05 15:30:35 -04:00
Stephen Thomas
2dfb4b56fe [AMDGPU] Fix incorrect hazard mitigation
GCNHazardRecognizer::fixVcmpxExecWARHazard() mitigates a specific hazard
by inserting a wait on sa_sdst==0 if such a wait isn't already present.
Unfortunately, the check for an existing wait incorrectly checks for one
that doesn't actually care about sa_sdst itself, but requires that no
other counters are waited for.

Once the check is performed correctly, a lit test needs to be updated,
since it is currently testing for the incorrect behaviour.

Differential Revision: https://reviews.llvm.org/D154438
2023-07-04 14:42:51 +01:00
Jay Foad
f2c164c815 [AMDGPU] Do not wait for vscnt on function entry and return
SIInsertWaitcnts inserts waitcnt instructions to resolve data
dependencies. The GFX10+ vscnt (VMEM store count) counter is never used
in this way. It is only used to resolve memory dependencies, and that is
handled by SIMemoryLegalizer. Hence there is no need to conservatively
wait for vscnt to be 0 on function entry and before returns.

Differential Revision: https://reviews.llvm.org/D153537
2023-07-04 12:22:38 +01:00
Matt Arsenault
8f9eee3602 AMDGPU: Fix opaque pointer conversion error in test
The * was in the wrong place so this was missed by the script.
2023-06-30 15:04:03 -04:00