7728 Commits

Author SHA1 Message Date
Christudasan Devadasan
042104985c
[AMDGPU][NewPM] Port SIShrinkInstructions to new pass manager. (#106967) 2024-09-03 10:52:50 +05:30
Shilei Tian
cb949b74e8
[NFC][FIX] Work around update_test_checks bug 2024-09-02 12:33:24 -04:00
Shilei Tian
f32f0289fd
[NFC] Update check lines of the test case llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll 2024-09-02 12:23:26 -04:00
Akshat Oke
da13754103
AMDGPU/NewPM Port SILoadStoreOptimizer to NPM (#106362) 2024-09-02 11:41:56 +05:30
Changpeng Fang
26b0bef192
AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761)
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
2024-08-29 11:43:58 -07:00
Stephen Tozer
3d08ade7bd
[ExtendLifetimes] Implement llvm.fake.use to extend variable lifetimes (#86149)
This patch is part of a set of patches that add an `-fextend-lifetimes`
flag to clang, which extends the lifetimes of local variables and
parameters for improved debuggability. In addition to that flag, the
patch series adds a pragma to selectively disable `-fextend-lifetimes`,
and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes`
for this pointers only. All changes and tests in these patches were
written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer)
has handled review and merging. The extend lifetimes flag is intended to
eventually be set on by `-Og`, as discussed in the RFC
here:

https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850

This patch implements a new intrinsic instruction in LLVM,
`llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand
and has no effect other than "using" its operand, to ensure that its
operand remains live until after the fake use. This patch does not emit
fake uses anywhere; the next patch in this sequence causes them to be
emitted from the clang frontend, such that for each variable (or this) a
fake.use operand is inserted at the end of that variable's scope, using
that variable's value. This patch covers everything post-frontend, which
is largely just the basic plumbing for a new intrinsic/instruction,
along with a few steps to preserve the fake uses through optimizations
(such as moving them ahead of a tail call or translating them through
SROA).

Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>
2024-08-29 17:53:32 +01:00
Pierre van Houtryve
1f8f2ed66a
[NFC][AMDGPU] Autogenerate tests for uniform i32 promo in ISel (#106382)
Many tests were easy to update, but these are quite big and I think it's
better to autogenerate them to see the difference well.
2024-08-29 15:20:32 +02:00
Matt Arsenault
7b7b0b95b2
DAG: Check if is_fpclass is custom, instead of isLegalOrCustom (#105577)
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal, which makes it useless for custom lowering
on non-legal types (which is always ppcf128).

Really the DAG builder shouldn't be going to expand this in the
builder, it makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
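
The difference between the two queries can be sketched with a toy model. This is not the LLVM `TargetLowering` class: the `ToyLowering` class, its method names, the string-based opcodes and types, and the default action are all assumptions made for illustration. The only behavior taken from the message above is that the combined query also requires the type to be legal.

```python
from enum import Enum


class Action(Enum):
    LEGAL = "legal"
    CUSTOM = "custom"
    EXPAND = "expand"


class ToyLowering:
    """Toy stand-in for a target lowering table (hypothetical, not LLVM's API)."""

    def __init__(self, legal_types, actions):
        self.legal_types = set(legal_types)
        self.actions = dict(actions)  # (opcode, type) -> Action

    def action(self, op, ty):
        # Toy default: anything not listed is expanded.
        return self.actions.get((op, ty), Action.EXPAND)

    def is_operation_legal(self, op, ty):
        # Assumption of this sketch: "legal" also requires a legal type.
        return ty in self.legal_types and self.action(op, ty) is Action.LEGAL

    def is_operation_custom(self, op, ty):
        # Only looks at the action, not at type legality.
        return self.action(op, ty) is Action.CUSTOM

    def is_operation_legal_or_custom(self, op, ty):
        # Type legality gates both outcomes, so a custom action on a
        # non-legal type is reported as False.
        return ty in self.legal_types and self.action(op, ty) in (
            Action.LEGAL,
            Action.CUSTOM,
        )


# ppcf128 is not a legal type here, but is_fpclass is marked custom for it.
tli = ToyLowering(
    legal_types={"f64"},
    actions={("is_fpclass", "ppcf128"): Action.CUSTOM},
)

separate = tli.is_operation_legal("is_fpclass", "ppcf128") or tli.is_operation_custom(
    "is_fpclass", "ppcf128"
)
combined = tli.is_operation_legal_or_custom("is_fpclass", "ppcf128")
print(separate, combined)
```

In this model the custom-only query sees the custom action on the non-legal type, while the combined query does not, which is the mismatch the change works around.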
2024-08-29 14:05:43 +04:00
Akshat Oke
fdca2c33a1
AMDGPU/NewPM Port GCNDPPCombine to NPM (#105816)
Co-authored-by: Akshat Oke <Akshat.Oke@amd.com>
2024-08-29 14:49:52 +05:30
Akshat Oke
2adc94cd6c
AMDGPU/NewPM: Port SIFoldOperands to new pass manager (#105801) 2024-08-29 11:34:54 +05:30
Shilei Tian
572d2fd327
[Attributor] Fix an issue that could potentially cause AccessList and OffsetBins out of sync (#106187)
The implementation of `AAPointerInfo::RangeList::set_difference` doesn't
consider the case where two ranges have the same offset but different sizes.
This could cause `AccessList` and `OffsetBins` to go out of sync because a
range has already been updated in `AccessList` but is missing from `ToRemove`.

I do have a reproducer, but the reproducer itself is 248kb and `llvm-reduce`
can't reduce it further. Not sure how I can make a smaller reproducer.

Fixes: SWDEV-479757.
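
A toy illustration of this kind of mismatch, under loud assumptions: the `Range` tuple and both difference functions below are hypothetical and are not the Attributor code. They only show how a difference that keys on the offset alone can miss a range whose size changed.

```python
from typing import List, NamedTuple


class Range(NamedTuple):
    offset: int
    size: int


def difference_by_offset(lhs: List[Range], rhs: List[Range]) -> List[Range]:
    """Flawed toy: treats two ranges with the same offset as identical."""
    rhs_offsets = {r.offset for r in rhs}
    return [r for r in lhs if r.offset not in rhs_offsets]


def difference_exact(lhs: List[Range], rhs: List[Range]) -> List[Range]:
    """Toy fix: a range only matches if both offset and size agree."""
    rhs_set = set(rhs)
    return [r for r in lhs if r not in rhs_set]


old_ranges = [Range(0, 4), Range(8, 4)]
new_ranges = [Range(0, 8), Range(8, 4)]  # first range grew from 4 to 8 bytes

to_remove_flawed = difference_by_offset(old_ranges, new_ranges)
to_remove_exact = difference_exact(old_ranges, new_ranges)
print(to_remove_flawed, to_remove_exact)
```

With the offset-only comparison, the stale `Range(0, 4)` never shows up in the removal list, so any index keyed on it would keep an entry that the access list no longer has.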
2024-08-29 01:02:19 -04:00
Changpeng Fang
53d95f3056
AMDGPU: Rename fail.llvm.fptrunc.round.ll to llvm.fptrunc.round.err.ll (#106452)
Also correct the suffix of the intrinsic
2024-08-28 13:52:07 -07:00
Jon Chesterfield
1bde8e0b80
[AMDGPU] Don't realign already allocated LDS. Point fix for 106412 (#106421)
Fixes 106412. The logic that skips the pass on already-lowered variables
doesn't cover the path that increases alignment of variables. If a
variable is allocated at 24 and then given 16 byte alignment, the
backend notices and fatal-errors on the inconsistency.
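
The inconsistency is plain arithmetic, sketched here with two hypothetical helpers (`is_aligned` and `align_to` are illustration names, not functions from the pass):

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if address is a multiple of alignment."""
    return address % alignment == 0


def align_to(address: int, alignment: int) -> int:
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


allocated_at = 24
print(is_aligned(allocated_at, 8))   # the original, smaller alignment
print(is_aligned(allocated_at, 16))  # the alignment claimed after the bump
print(align_to(allocated_at, 16))    # where a 16-byte aligned variable would sit
```

An address of 24 satisfies 8-byte alignment but not 16-byte alignment, so raising the alignment after the address has been fixed makes the two facts contradict each other.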
2024-08-28 18:30:48 +01:00
Alex MacLean
4c4908cd5d
[AMDGPU] adjust tests to prevent fpclass bitcast folding (#106268)
Make some minor tweaks to AMDGPU tests to ensure they still work as
intended after https://github.com/llvm/llvm-project/pull/97762. These
tests can be radically simplified after bitcast aware fpclass deduction.
2024-08-27 13:20:44 -07:00
Shilei Tian
d880f5a4c9
[AMDGPU][Attributor] Remove uniformity check in the indirect call specialization callback (#106177)
This patch removes the conservative uniformity check in the indirect call
specialization callback, as whether the function pointer is uniform doesn't
matter too much. Instead, we add an argument to control specialization.
2024-08-27 12:27:17 -04:00
Jay Foad
d0fe52d951
[AMDGPU] Fix sign confusion in performMulLoHiCombine (#105831)
SMUL_LOHI and UMUL_LOHI are different operations because the high part
of the result is different, so it is not OK to optimize the signed
version to MUL_U24/MULHI_U24 or the unsigned version to
MUL_I24/MULHI_I24.
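
A small sketch of why the signed and unsigned forms are not interchangeable. It models a 32-bit multiply that returns the low and high halves of the 64-bit product; the function names are made up for illustration and are not LLVM APIs.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def umul_lohi(a: int, b: int):
    """Unsigned 32x32 -> 64 multiply, returned as (low, high) 32-bit halves."""
    product = (a & MASK32) * (b & MASK32)
    return product & MASK32, (product >> 32) & MASK32


def smul_lohi(a: int, b: int):
    """Signed 32x32 -> 64 multiply, returned as (low, high) 32-bit halves."""
    product = to_signed32(a) * to_signed32(b)
    return product & MASK32, (product >> 32) & MASK32


a, b = 0xFFFFFFFF, 2  # a is -1 when read as signed
u_lo, u_hi = umul_lohi(a, b)
s_lo, s_hi = smul_lohi(a, b)
print(hex(u_lo), hex(u_hi))
print(hex(s_lo), hex(s_hi))
```

The low halves agree, but the high halves differ, so swapping the signedness of the multiply changes the result whenever the high part is used.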
2024-08-27 17:09:40 +01:00
Brox Chen
2e0583ef8b
[AMDGPU][CodeGen][NFC] update a mir test file with latest update_mir_test_check script (#106073)
Run the latest update_mir_test_checks.py and update one CodeGen test file.

This cleans up the diff in the MIR test files produced by the script
version update.
2024-08-26 16:45:12 -04:00
Sameer Sahasrabuddhe
fa4cc9ddd5
[FixIrreducible] Use CycleInfo instead of a custom SCC traversal (#101386)
1. CycleInfo efficiently locates all cycles in a single pass, while the SCC
   is repeated inside every natural loop.

2. CycleInfo provides a hierarchy of irreducible cycles, and the new
   implementation transforms each cycle in this hierarchy separately instead
   of reducing an entire irreducible SCC in a single step. This reduces the
   number of control-flow paths that pass through the header of each newly
   created loop. This is evidenced by the reduced number of predecessors on
   the "guard" blocks in the lit tests, and fewer operands on the
   corresponding PHI nodes.

3. When an entry of an irreducible cycle is the header of a child natural
   loop, the original implementation destroyed that loop. This is now
   preserved, since the incoming edges on non-header entries are not touched.

4. In the new implementation, if an irreducible cycle is a superset of a
   natural loop with the same header, then that natural loop is destroyed and
   replaced by the newly created loop.
2024-08-26 15:51:34 +05:30
Chaitanya
1f02be2e17
[AMDGPU] Enable "amdgpu-sw-lower-lds" pass in pipeline. (#89206)
This PR enables the "amdgpu-sw-lower-lds" pass in the pipeline.
It also introduces the "amdgpu-enable-sw-lower-lds" command-line flag to
enable/disable the pass.
2024-08-26 14:21:19 +05:30
Chaitanya
7bc9d95b7e
[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (#87265)
This PR introduces a new pass, "amdgpu-sw-lower-lds".

This pass lowers the local data store (LDS) uses in kernel and
non-kernel functions in a module to use dynamically allocated global
memory. A packed LDS layout is emulated in the global memory.
The memory instructions lowered from LDS to global memory are then
instrumented for address sanitizer, to catch addressing errors.
This pass only works when address sanitizer has been enabled and has
instrumented the IR. It identifies that the IR has been instrumented using
the "nosanitize_address" module flag.

For a kernel, LDS accesses can be static or dynamic, and either direct
(accessed within the kernel) or indirect (accessed through non-kernels).

**Replacement of Kernel LDS accesses:**
- All the LDS accesses corresponding to a kernel will be packed together,
where all static LDS accesses will be allocated first, followed by the
dynamic LDS. The total size with alignment is calculated. A new LDS
global will be created for the kernel called "SW LDS" and it will have
the attribute "amdgpu-lds-size" attached with the value of the size
calculated. All the LDS accesses in the module will be replaced by a GEP
with an offset into the "SW LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
the dynamic LDS. This will be marked used by the kernel and will have
MD_absolute_symbol metadata set to the total static LDS size, since dynamic
LDS allocation starts after all static LDS allocation.

- A device global memory equal to the total LDS size will be allocated.
At the prologue of the kernel, a single work-item from the work-group,
does a "malloc" and stores the pointer of the allocation in "SW LDS". To
store the offsets corresponding to all LDS accesses, another global
variable is created which will be called "SW LDS metadata" in this pass.

- **SW LDS:** 
It is LDS global of ptr type with name
"llvm.amdgcn.sw.lds.<kernel-name>".

- **SW LDS Metadata:** 
It is of struct type, with n members. n equals the number of LDS globals
accessed by the kernel(direct and indirect). Each member of struct is
another struct of type {i32, i32, i32}. First member corresponds to
offset, second member corresponds to size of LDS global being replaced
and third represents the total aligned size. It will have name
"llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an
initializer with static LDS related offsets and sizes initialized. But
for dynamic LDS related entries, offsets will be initialized to the previous
static LDS allocation end offset. Sizes for them will be zero initially.
These dynamic LDS offset and size values will be updated within the
kernel, since kernel can read the dynamic LDS size allocation done at
runtime with query to "hidden_dynamic_lds_size" hidden kernel argument.

- At the epilogue of kernel, allocated memory would be made free by the
same single work-item.
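
The packed layout and the "SW LDS metadata" entries described above can be sketched roughly as follows. This is a minimal model, not the pass itself: the function names, the 8-byte alignment granule, and the example sizes are assumptions chosen for illustration. Each entry mirrors the {offset, size, aligned size} triple, and dynamic entries start at the end of the static allocation with zero sizes.

```python
def align_to(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def build_sw_lds_metadata(static_sizes, num_dynamic, alignment=8):
    """Return (entries, static_end) for a packed layout.

    entries holds one (offset, size, aligned_size) triple per LDS global.
    The alignment value is an assumption of this sketch.
    """
    entries = []
    offset = 0
    # Static LDS globals are packed first, one after another.
    for size in static_sizes:
        aligned_size = align_to(size, alignment)
        entries.append((offset, size, aligned_size))
        offset += aligned_size
    static_end = offset
    # Dynamic LDS entries begin where the static allocation ends; their
    # sizes are unknown until the kernel reads them at run time.
    for _ in range(num_dynamic):
        entries.append((static_end, 0, 0))
    return entries, static_end


entries, static_end = build_sw_lds_metadata([4, 12, 16], num_dynamic=1)
for entry in entries:
    print(entry)
print("static LDS size:", static_end)
```

In this sketch `static_end` plays the role of the total static LDS size, the value the description attaches to the dynamic LDS symbol.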

**Replacement of non-kernel LDS accesses:** 
- Multiple kernels can access the same non-kernel function. All the
kernels accessing LDS through non-kernels are sorted and assigned a
kernel-id. All the LDS globals accessed by non-kernels are sorted.

- This information is used to build two tables: 
- **Base table:** 
Base table will have single row, with elements of the row placed as per
kernel ID. Each element in the row corresponds to ptr of "SW LDS"
variable created for that kernel.

- **Offset table:** 
Offset table will have multiple rows and columns. Rows are assumed to be
from 0 to (n-1). n is total number of kernels accessing the LDS through
non-kernels. Each row will have m elements. m is the total number of
unique LDS globals accessed by all non-kernels. Each element in the row
corresponds to the ptr of the replacement of LDS global done by that
particular kernel.

- An LDS variable in a non-kernel will be replaced based on the information
from base and offset tables. Based on kernel-id query, ptr of "SW LDS"
for that corresponding kernel is obtained from base table. The Offset
into the base "SW LDS" is obtained from corresponding element in offset
table. With this information, replacement value is obtained.
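
The two-table lookup can be sketched as below. This is a simplified model with loud assumptions: pointers are plain integers, the offset table stores byte offsets, and `resolve_lds_address` plus all example values are hypothetical names and data, not taken from the pass.

```python
def resolve_lds_address(base_table, offset_table, kernel_id, variable_index):
    """Model of replacing an LDS variable use inside a non-kernel function.

    base_table[kernel_id] stands for that kernel's "SW LDS" pointer, and
    offset_table[kernel_id][variable_index] for the variable's offset in it.
    """
    base = base_table[kernel_id]
    offset = offset_table[kernel_id][variable_index]
    return base + offset


# Two kernels (ids 0 and 1) reach the same non-kernel function, which uses
# two LDS globals (indices 0 and 1). All addresses are made-up numbers.
base_table = [0x1000, 0x2000]
offset_table = [
    [0, 8],   # kernel 0: variable 0 at +0, variable 1 at +8
    [16, 0],  # kernel 1: variable 0 at +16, variable 1 at +0
]

print(hex(resolve_lds_address(base_table, offset_table, 0, 1)))
print(hex(resolve_lds_address(base_table, offset_table, 1, 0)))
```

The same non-kernel code resolves to a different address depending on which kernel called it, which is why the lookup is keyed on the kernel id.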
2024-08-26 08:59:26 +05:30
Austin Kerbow
ceb587a16c
[AMDGPU] Fix crash in allowsMisalignedMemoryAccesses with i1 (#105794) 2024-08-23 11:51:37 -07:00
Jay Foad
fa2dccb377
[AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (#105550)
When a loop contains a VMEM load whose result is only used outside the
loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for
vmcnt will be required inside the loop anyway, because VMEM instructions
can write their VGPR results out of order.
2024-08-23 10:31:33 +01:00
Jay Foad
b02b5b7b59
[AMDGPU] Simplify use of hasMovrel and hasVGPRIndexMode (#105680)
The generic subtarget has neither of these features. Rather than forcing
HasMovrel on, it is simpler to expand dynamic vector indexing to a
sequence of compare/select instructions.
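
As a rough picture of that expansion, a dynamic element extract can be written as a chain of compare/select steps. This is an illustrative model only; `select` and `extract_dynamic` are hypothetical names, and returning element 0 for an out-of-range index is a choice of this sketch.

```python
def select(condition: bool, if_true, if_false):
    """Model of a select instruction: pick one of two values."""
    return if_true if condition else if_false


def extract_dynamic(vector, index):
    """Extract vector[index] using only compares and selects.

    Every element is visited; no indexed addressing is needed at run time.
    """
    result = vector[0]
    for i in range(1, len(vector)):
        result = select(index == i, vector[i], result)
    return result


values = [10, 20, 30, 40]
print([extract_dynamic(values, i) for i in range(len(values))])
```

The cost grows with the vector length, which is the trade-off for not relying on indexed register access.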

NFC for real subtargets.
2024-08-23 09:59:19 +01:00
Matt Arsenault
ee08d9cba5
AMDGPU: Remove global/flat atomic fadd intrinsics (#97051)
These have been replaced with atomicrmw.
2024-08-22 23:27:33 +04:00
Jeffrey Byrnes
7bcf4d63cf
[AMDGPU] Correctly insert s_nops for dst forwarding hazard (#100276)
MI300 ISA section 4.5 states there is a hazard between "VALU op which
uses OPSEL or SDWA with changes the result’s bit position" and "VALU op
consumes result of that op"

This includes the case where the second op is SDWA with same dest and
dst_sel != DWORD && dst_unused == UNUSED_PRESERVE. In this case, there
is an implicit read of the first op dst and the compiler needs to
resolve this hazard. Confirmed with HW team.

We model dst_unused == UNUSED_PRESERVE as tied-def of implicit operand,
so this PR checks for that.

MI300_SP_MAS section 1.3.9.2 specifies that CVT_SR_FP8_F32 and
CVT_SR_BF8_F32 with opsel[3:2] !=0 have dest forwarding issue.
Currently, we only add check for CVT_SR_FP8_F32 with opsel[3] != 0 --
this PR adds support for opsel[2] != 0 as well.
2024-08-22 11:38:24 -07:00
Jay Foad
2012b25420
[AMDGPU][GlobalISel] Disable fixed-point iteration in all Combiners (#105517)
Disable fixed-point iteration in all AMDGPU Combiners after #102163.

This saves around 2% compile time in ad hoc testing on some large
graphics shaders. I did not notice any regressions in the generated
code, just a bunch of harmless differences in instruction selection and
register allocation.
2024-08-22 17:14:53 +01:00
Jay Foad
c4c5fdd933
[AMDGPU] Generate checks for vector indexing. NFC. (#105668)
This allows combining some test files that were only split because
adding new RUN lines introduced too much churn in the checks.
2024-08-22 16:11:12 +01:00
Jay Foad
5506831f7b
[AMDGPU] GFX12 VMEM loads can write VGPR results out of order (#105549)
Fix SIInsertWaitcnts to account for this by adding extra waits to avoid
WAW dependencies.
2024-08-22 11:46:51 +01:00
Jay Foad
61194617ad
[AMDGPU] Add GFX12 test coverage for vmcnt flushing in loop headers (#105548) 2024-08-22 11:42:57 +01:00
Sameer Sahasrabuddhe
5f6172f068
[Transforms] Refactor CreateControlFlowHub (#103013)
CreateControlFlowHub is a method that redirects control flow edges from a set of
incoming blocks to a set of outgoing blocks through a new set of "guard" blocks.
This is now refactored into a separate file with one enhancement: The input to
the method is now a set of branches rather than two sets of blocks.

The original implementation reroutes every edge from incoming blocks to outgoing
blocks. But it is possible that for some incoming block InBB, some successor S
might be in the set of outgoing blocks, but that particular edge should not be
rerouted. The new implementation makes this possible by allowing the user to
specify the targets of each branch that need to be rerouted.

This is needed when improving the implementation of FixIrreducible #101386.
Current use in FixIrreducible does not demonstrate this finer control over the
edges being rerouted. But in UnifyLoopExits, when only one successor of an
exiting block is an exit block, this refinement now reroutes only the relevant
control-flow through the edge; the non-exit successor is not rerouted. This
results in fewer branches and PHI nodes in the hub.
2024-08-22 12:18:01 +05:30
Matt Arsenault
8039886e6d
AMDGPU: Handle folding frame indexes into s_add_i32 (#101694)
This does not yet enable producing direct frame index
references in s_add_i32, only the lowering.
2024-08-22 09:16:37 +04:00
Jay Foad
a6bae5cb37
[AMDGPU] Split GCNSubtarget into its own file. NFC. (#105525) 2024-08-21 19:11:02 +01:00
Sumanth Gundapaneni
e78156a0e2
Scalarize the vector inputs to llvm.lround intrinsic by default. (#101054)
The verifier is updated in a different patch to allow vector types for
the llvm.lround and llvm.llround intrinsics.
2024-08-21 12:13:56 -05:00
Brox Chen
ddb5480e67
[AMDGPU][True16][MC] added VOPC realtrue/faketrue flag and fake16 instructions (#104739)
VOPC instructions were defined with the HasTrue16BitInst flag, while these
true16 instructions are actually implemented with the fake16 profile.
Separate them into a true16 version and a fake16 version by adding the
UseRealTrue16 and UseFakeTrue16 flags and fake16 instructions.

The code defaults to fake16. This is preparing for the upcoming
changes in MC to support realtrue 16bit operands and vdst. The true16
and fake16 profile will be modified in the later patches.
2024-08-21 17:47:36 +03:00
Simon Pilgrim
8109e5de57
[DAG] Add select_cc -> abd folds (#102137)
Fixes #100810
2024-08-21 12:07:40 +01:00
Matt Arsenault
9d364286f3
AMDGPU: Remove flat/global atomic fadd v2bf16 intrinsics (#97050)
These are now fully covered by atomicrmw.
2024-08-21 14:26:42 +04:00
Nikita Popov
a105877646
[InstCombine] Remove some of the complexity-based canonicalization (#91185)
The idea behind this canonicalization is that it allows us to handle fewer
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.

However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.

The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.

For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
2024-08-21 12:02:54 +02:00
Stanislav Mekhanoshin
5fcd05967a
[AMDGPU] Add VOPD combine dependency tests. NFC. (#104841) 2024-08-19 15:43:27 -07:00
Jay Foad
564bd20658
[AMDGPU][GlobalISel] Save a copy in one case of addrspacecast (#104789)
Refactor legalization of addrspacecast local/private -> flat to avoid
building a copy in the nonnull case.
2024-08-19 18:22:29 +01:00
Jay Foad
2258bc429b
[AMDGPU] Simplify, fix and improve known bits for mbcnt (#104768)
Simplify by using KnownBits::add.

Fix GlobalISel path which was ignoring the known bits of src1.

Improve analysis of mbcnt.hi which adds at most 31 even in wave64.
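
The bound of 31 can be checked with a small model of the two instructions. This follows my reading of their usual description (count the set mask bits that belong to lanes below the current lane, then add src1) and should be treated as an assumption rather than the ISA text; the function names are hypothetical and 32-bit wrap-around of the result is ignored.

```python
MASK32 = 0xFFFFFFFF


def thread_mask(lane: int) -> int:
    """Bits set for every lane strictly below the given lane."""
    return (1 << lane) - 1


def mbcnt_lo(mask_lo: int, src1: int, lane: int) -> int:
    """Count mask bits for lanes 0..31 that are below the current lane."""
    low_lanes = thread_mask(lane) & MASK32
    return bin(mask_lo & low_lanes).count("1") + src1


def mbcnt_hi(mask_hi: int, src1: int, lane: int) -> int:
    """Count mask bits for lanes 32..63 that are below the current lane."""
    high_lanes = (thread_mask(lane) >> 32) & MASK32
    return bin(mask_hi & high_lanes).count("1") + src1


# Worst case: every mask bit set, across all 64 lanes of a wave64.
max_lo = max(mbcnt_lo(MASK32, 0, lane) for lane in range(64))
max_hi = max(mbcnt_hi(MASK32, 0, lane) for lane in range(64))
print(max_lo, max_hi)
```

Lane 63 has only 31 lanes below it in the upper half, so the high count never reaches 32, while the low count does for any lane at or above 32.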
2024-08-19 18:22:06 +01:00
Austin Kerbow
7d5281a66d
[AMDGPU][NFC] Fix preload-kernarg.ll test after attributor move (#98840)
The update was applied to a stale version of the test, with missing
functions and extra RUN lines that had been removed.
2024-08-18 17:04:27 -07:00
Changpeng Fang
16929219b0
AMDGPU: Add tonearest and towardzero roundings for intrinsic llvm.fptrunc.round (#104486)
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.

"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncating, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by hardware at
this moment.
2024-08-17 11:22:47 -07:00
Carl Ritson
fc6300a5f7
[AMDGPU] Disable inline constants for pseudo scalar transcendentals (#104395)
Prevent operand folding from inlining constants into pseudo scalar
transcendental f16 instructions.
However, still allow literal constants.
2024-08-17 16:52:38 +09:00
Jeffrey Byrnes
fcefe957dd
[LegalizeTypes][AMDGPU]: Allow for scalarization of insert_subvector (#104236)
Legalization for when the inserted subvector is to be scalarized.

https://godbolt.org/z/vx3joWqoh
2024-08-15 08:07:05 -07:00
Shilei Tian
1ca9fe6db3
Reapply "[Attributor][AMDGPU] Enable AAIndirectCallInfo for AMDAttributor (#100952)"
This reverts commit 36467bfe89f231458eafda3edb916c028f1f0619.
2024-08-14 17:16:47 -04:00
Matt Arsenault
21cea3f3be
AMDGPU: Stop promoting allocas with addrspacecast users (#104051)
We cannot promote this case unless we know the value is only
observed through flat operations. We cannot analyze this through
a call. PointerMayBeCaptured was an imprecise check for this.
A callee with a nocapture attribute may still cast to private and
observe the address space, so really we need a different notion
of nocapture.

I doubt this was of any use anyway. The promotable cases should
have optimized out addrspacecast to begin earlier.

Fixes #66669
Fixes #104035
2024-08-14 21:53:38 +04:00
Matt Arsenault
36a0f20ac3
AMDGPU/NewPM: Fill out addPreISelPasses (#102814)
This specific callback should now be at parity with the old
pass manager version. There are still some missing IR passes
before this point.

Also I don't understand the need for the RequiresAnalysisPass at the
end. SelectionDAG should just be using the uncached getResult?
2024-08-14 20:57:00 +04:00
Craig Topper
abc1acf8df
[TargetLowering][AMDGPU][ARM][RISCV][X86] Teach SimplifyDemandedBits to combine (srl (sra X, C1), ShAmt) -> sra(X, C1+ShAmt) (#101751)
If the upper bits of the shr aren't demanded.

This helps with cases where the outer srl was originally an sra and was
converted to a srl by SimplifyDemandedBits before it had a chance to
combine with the inner sra. This can occur when the inner sra was part
of a sign_extend_inreg expansion.

There are some regressions in ARM and Thumb2.
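
The fold can be sanity-checked by brute force on a narrow type. The sketch below uses 8-bit values and hypothetical helper names; it only compares the bits that remain demanded when the top `ShAmt` bits of the result are ignored, and it restricts itself to shift amounts whose sum stays below the bit width.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def to_signed(value: int) -> int:
    """Reinterpret the low WIDTH bits as a two's-complement integer."""
    value &= MASK
    return value - (1 << WIDTH) if value >> (WIDTH - 1) else value


def sra(value: int, amount: int) -> int:
    """Arithmetic shift right on a WIDTH-bit value."""
    return (to_signed(value) >> amount) & MASK


def srl(value: int, amount: int) -> int:
    """Logical shift right on a WIDTH-bit value."""
    return (value & MASK) >> amount


def fold_holds(x: int, c1: int, sh_amt: int) -> bool:
    """Compare both forms on the bits below the top sh_amt bits."""
    demanded = MASK >> sh_amt
    before = srl(sra(x, c1), sh_amt)
    after = sra(x, c1 + sh_amt)
    return (before & demanded) == (after & demanded)


all_ok = all(
    fold_holds(x, c1, sh_amt)
    for x in range(1 << WIDTH)
    for c1 in range(WIDTH)
    for sh_amt in range(WIDTH)
    if c1 + sh_amt < WIDTH
)
print(all_ok)
# The two forms are not equal in general: the top bits can differ.
print(hex(srl(sra(0x80, 1), 1)), hex(sra(0x80, 2)))
```

The two expressions agree on every demanded bit but can differ in the upper bits, which is why the combine is conditioned on those bits not being demanded.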
2024-08-14 08:44:57 -07:00
Jay Foad
df57833ea8
[AMDGPU] Generate checks for llvm.amdgcn.is.private/shared (#103859)
Also combine the GlobalISel tests into the SelectionDAG ones.
2024-08-14 13:23:33 +01:00
Matt Arsenault
edded8d7b5
AMDGPU: Stop handling legacy amdgpu-unsafe-fp-atomics attribute (#101699)
This is now autoupgraded to annotate atomicrmw instructions in
old bitcode.
2024-08-13 22:02:25 +04:00