153 Commits

Author SHA1 Message Date
Matt Arsenault
efe87fbc9d
AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144) 2024-11-06 08:47:15 -08:00
Matt Arsenault
6d9fc1b846
AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113)
This did not consider the IR change to allow a scalar base with a vector
offset part. Reject any users that are not explicitly handled.

In this situation we could handle the vector GEP, but that is a larger
change. This just avoids the IR verifier error by rejecting it.
2024-10-29 22:14:24 -07:00
Jay Foad
85c17e4092
[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706)
Convert many instances of:
  Fn = Intrinsic::getOrInsertDeclaration(...);
  CreateCall(Fn, ...)
to the equivalent CreateIntrinsic call.
2024-10-17 16:20:43 +01:00
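For illustration, a minimal sketch of the conversion described above, assuming an IRBuilder and Module are in scope; the intrinsic chosen here is just an example, not one from the patch:
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/IntrinsicsAMDGPU.h"
  using namespace llvm;

  void emitWorkitemIdX(IRBuilder<> &Builder, Module *M) {
    // Before: declare the intrinsic, then call it, in two steps.
    //   Function *Fn =
    //       Intrinsic::getOrInsertDeclaration(M, Intrinsic::amdgcn_workitem_id_x);
    //   Builder.CreateCall(Fn, {});
    // After: CreateIntrinsic declares and calls in one step.
    Builder.CreateIntrinsic(Intrinsic::amdgcn_workitem_id_x, {}, {});
  }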
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e., just lookup, no creation).
2024-10-11 05:26:03 -07:00
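As a rough sketch of what the rename means at call sites (the intrinsic ID here is only an example):
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  Function *declareAssume(Module *M) {
    // Old spelling read like a pure lookup, but it created the declaration
    // when missing:
    //   return Intrinsic::getDeclaration(M, Intrinsic::assume);
    // New spelling makes the create-if-missing behavior explicit, matching
    // Module::getOrInsertFunction:
    return Intrinsic::getOrInsertDeclaration(M, Intrinsic::assume);
  }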
Jeremy Morse
96f37ae453
[NFC] Use initial-stack-allocations for more data structures (#110544)
This replaces some of the most frequent offenders of using a DenseMap that
cause a malloc, where the typical element-count is small enough to fit in
an initial stack allocation.

Most of these are fairly obvious, one to highlight is the collectOffset
method of GEP instructions: if there's a GEP, of course it's going to have
at least one offset, but every time we've called collectOffset we end up
calling malloc as well for the DenseMap in the MapVector.
2024-09-30 23:15:18 +01:00
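A hedged illustration of the kind of substitution involved; the concrete key/value types and inline sizes used in the patch may differ:
  #include "llvm/ADT/APInt.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/MapVector.h"
  #include "llvm/IR/Value.h"
  using namespace llvm;

  // DenseMap<const Value *, APInt> heap-allocates its bucket array on first
  // insert. SmallDenseMap keeps the first N entries in an inline buffer:
  SmallDenseMap<const Value *, APInt, 8> Offsets;

  // The same idea applied to a MapVector, as for GEP's collectOffset, where
  // a GEP almost always has only a handful of variable offsets:
  SmallMapVector<const Value *, APInt, 4> VariableOffsets;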
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
2024-09-19 16:16:38 +01:00
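For example (an illustrative sketch, not lines from the patch itself):
  #include "llvm/ADT/ArrayRef.h"
  using namespace llvm;

  static int sum(ArrayRef<int> Vals) {
    int S = 0;
    for (int V : Vals)
      S += V;
    return S;
  }

  int caller() {
    // Before: relied on the ArrayRef(std::nullopt_t) constructor.
    //   return sum(std::nullopt);
    // After: an empty braced list constructs an empty ArrayRef directly.
    return sum({});
  }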
Matt Arsenault
df138625df AMDGPU: Remove unnecessary pointer bitcast 2024-09-06 21:32:19 +04:00
Matt Arsenault
21cea3f3be
AMDGPU: Stop promoting allocas with addrspacecast users (#104051)
We cannot promote this case unless we know the value is only
observed through flat operations. We cannot analyze this through
a call. PointerMayBeCaptured was an imprecise check for this.
A callee with a nocapture attribute may still cast to private and
observe the address space, so really we need a different notion
of nocapture.

I doubt this was of any use anyway. The promotable cases should
have had the addrspacecast optimized out to begin with.

Fixes #66669
Fixes #104035
2024-08-14 21:53:38 +04:00
Pierre van Houtryve
0e73bbd345
[AMDGPU][PromoteAlloca] Don't stop when an alloca is too big to promote (#93466)
When I rewrote this, I made a mistake in the control flow. I thought we
could just stop promoting if an alloca is too big to vectorize, but we
can't. Other allocas in the list may be promotable and fit within the
budget.

Fixes SWDEV-455343
2024-05-28 08:05:50 +02:00
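In other words, a simplified, pseudocode-level sketch of the corrected control flow; the helper names are hypothetical:
  for (AllocaInst *AI : SortedAllocas) {
    if (isTooBigToPromote(AI))  // hypothetical helper
      continue;                 // keep going: later, smaller allocas may
                                // still fit within the remaining budget
    tryPromoteAllocaToVector(AI);
  }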
Shilei Tian
b4df0da9e8
[AMDGPU] Fix a potential wrong return value indicating whether a pass modifies a function (#88197)
When the alloca is too big for vectorization, the function could have
already been modified in a previous iteration of the `for` loop.
2024-04-12 09:34:46 -04:00
Pierre van Houtryve
953c13b5c9
[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735)
Update PromoteAllocaToVector so it considers the whole function before promoting allocas.
Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca.

Passed internal performance testing.
2024-03-19 11:49:22 +01:00
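A hedged, simplified sketch of the per-function flow described above; the scoring heuristic, budget handling, and helper names are placeholders rather than the actual code from the patch:
  // Pseudocode-level sketch. collectPromotableAllocas, scoreAlloca,
  // estimateVGPRCost, tryPromoteAllocaToVector and VectorizationBudget stand
  // in for the real implementation.
  SmallVector<AllocaInst *, 16> Allocas = collectPromotableAllocas(F);

  // Score each alloca and visit the most valuable ones first.
  llvm::stable_sort(Allocas, [&](AllocaInst *A, AllocaInst *B) {
    return scoreAlloca(A) > scoreAlloca(B);
  });

  unsigned BudgetLeft = VectorizationBudget; // per function, not per alloca
  bool Changed = false;
  for (AllocaInst *AI : Allocas) {
    unsigned Cost = estimateVGPRCost(AI);
    if (Cost > BudgetLeft)
      continue; // too big for what's left; smaller allocas may still fit
    if (tryPromoteAllocaToVector(AI)) {
      BudgetLeft -= Cost;
      Changed = true;
    }
  }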
Pierre van Houtryve
5e379b63fc
[AMDGPU][PromoteAlloca] Drop bitcast handling (#85747)
This is no longer needed with opaque pointers.
2024-03-19 10:36:12 +01:00
bcahoon
4cf8b298cf
[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597)
The promote alloca to vector transformation assumes that the
vector index is a constant value. If it is not a constant, then
either an assert occurs or the transformation generates an
incorrect index.
2024-03-05 08:18:17 -06:00
Pierre van Houtryve
4e958abf2f
[AMDGPU][PromoteAlloca] Support memsets to ptr allocas (#80678)
Fixes #80366
2024-02-05 14:36:15 +01:00
Mariusz Sikora
9a41a80e76
[AMDGPU] Handle object size and bail if assume-like intrinsic is used in PromoteAllocaToVector (#68744)
Attached test will cause crash without this change.

We should not remove an instruction matching isAssumeLikeIntrinsic if it is
used by another instruction.
2023-12-20 07:47:49 +01:00
Mariusz Sikora
facead618b
[AMDGPU] PromoteAlloca - bail always if load/store is volatile (#73228)
This change addresses the case where the alloca size is the same as the
load/store size.
2023-11-28 12:01:35 +01:00
bcahoon
28b5054751
[AMDGPU] Fix PromoteAlloca size check of alloca for store (#72528)
When storing a subvector, too many elements were written when the
size of the alloca is smaller than the size of the vector store.
This patch checks for the minimum of the alloca vector and the
store vector to determine the number of elements to store.
2023-11-20 07:57:48 -06:00
Pierre van Houtryve
5db63d29fd
[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505)
I assumed indexes were always ConstantInts, but that's not always the
case. They can be other things as well. We can easily handle that by
just emitting an add and let InstSimplify do the constant folding for
cases where it's really a ConstantInt.

Solves SWDEV-429935
2023-11-07 15:29:41 +01:00
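A rough sketch of the approach, assuming an IRBuilder positioned at the access; FirstElt, K and CurVecValue are illustrative placeholders:
  // Index of element K of the subvector, where FirstElt (the index of the
  // subvector's first element) may or may not be a ConstantInt.
  Value *Idx =
      Builder.CreateAdd(FirstElt, ConstantInt::get(FirstElt->getType(), K));
  // When FirstElt really is a constant, later folding (InstSimplify /
  // constant folding) turns this add back into a plain ConstantInt.
  Value *Elt = Builder.CreateExtractElement(CurVecValue, Idx);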
Matt Arsenault
f7dcabe502 AMDGPU: Pass in TargetMachine to AMDGPULowerModuleLDSPass
https://reviews.llvm.org/D157660
2023-09-02 12:02:36 -04:00
pvanhout
a8aabba587 [AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elements
The previous condition was incorrect in some cases, like storing <2 x i32>
into a double. If IndexVal was >0, we ended up never storing anything.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156308
2023-07-26 13:21:21 +02:00
pvanhout
3cd4afce5b [AMDGPU] Allow vector access types in PromoteAllocaToVector
Depends on D152706
Solves SWDEV-408279

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155699
2023-07-25 07:44:48 +02:00
pvanhout
3890a3b113 [AMDGPU] Use SSAUpdater in PromoteAlloca
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly.

Note PromoteAlloca is still reliant on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D152706
2023-07-25 07:44:47 +02:00
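For context, a minimal sketch of the usual SSAUpdater pattern for rewriting accesses to a promoted alloca; this is simplified, not the pass's actual code, and VectorTy, DefBB, NewVecValue and LoadBB are placeholders:
  #include "llvm/Transforms/Utils/SSAUpdater.h"
  using namespace llvm;

  SSAUpdater Updater;
  Updater.Initialize(VectorTy, "promotealloca");

  // Record, per defining block, the vector value that represents the alloca
  // contents after the last write in that block (e.g. after an insertelement
  // built for a store).
  Updater.AddAvailableValue(DefBB, NewVecValue);

  // For a load elsewhere, ask the updater for the reaching value; it inserts
  // PHI nodes where control flow requires them, so no second SROA run is
  // needed to clean up the alloca.
  Value *Reaching = Updater.GetValueInMiddleOfBlock(LoadBB);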
Nikita Popov
7be7f23269 [llvm] Remove uses of getWithSamePointeeType() (NFC) 2023-07-18 12:07:09 +02:00
Youngsuk Kim
243f0566dc [llvm] Replace uses of Type::getPointerTo (NFC)
Partial progress towards removing in-tree uses of `Type::getPointerTo`,
before we can deprecate the API.

If the API is used solely to support an unnecessary bitcast, get rid of
the bitcast as well.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D153933
2023-06-28 09:21:34 -04:00
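Illustrative sketch of the kind of replacement; with opaque pointers only the address space matters, so the pointee type argument becomes redundant:
  #include "llvm/IR/DerivedTypes.h"
  using namespace llvm;

  PointerType *makePtrTy(LLVMContext &Ctx, unsigned AddrSpace) {
    // Before: PointerType *PtrTy = EltTy->getPointerTo(AddrSpace);
    // After: no pointee type needed with opaque pointers.
    return PointerType::get(Ctx, AddrSpace);
  }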
pvanhout
7007b99340 Revert "[AMDGPU] Use SSAUpdater in PromoteAlloca"
This reverts commit 091bfa76db64fbe96d0e53d99b2068cc05f6aa16.
2023-06-28 11:14:17 +02:00
pvanhout
091bfa76db [AMDGPU] Use SSAUpdater in PromoteAlloca
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly.

Note PromoteAlloca is still reliant on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D152706
2023-06-28 08:12:22 +02:00
pvanhout
f104eb6e15 [AMDGPU] Reintroduce CC exception for non-inlined functions in Promote Alloca limits
This is basically a partial revert of https://reviews.llvm.org/D145586 (fd1d60873fdc)

D145586 was originally introduced to help with SWDEV-363662, and it did, but
it also caused a 25% drop in performance in some MIOpen benchmarks where, it
seems, functions are inlined more conservatively.

This patch restores the pre-D145586 behavior for PromoteAlloca: functions
with a non-entry CC have a 32-VGPR threshold, but only if the function is not
marked with "alwaysinline".

A good amount of AMDGPU code makes use of the AMDGPUAlwaysInline pass anyway,
so in our backend "alwaysinline" is very common.

This change does not affect SWDEV-363662 (the motivating issue for introducing D145586).

Fixes SWDEV-399519

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D150551
2023-05-23 09:01:39 +02:00
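A rough approximation of the heuristic being restored; the helper and variable names are guesses for illustration, not the exact code:
  #include "llvm/IR/Function.h"
  #include <algorithm>
  // Plus the AMDGPU backend's Utils/AMDGPUBaseInfo.h for isEntryFunctionCC.
  using namespace llvm;

  // Sketch: clamp the VGPR budget PromoteAlloca may consume. DefaultBudget
  // stands in for the subtarget-derived value.
  unsigned computePromoteAllocaVGPRBudget(const Function &F,
                                          unsigned DefaultBudget) {
    unsigned MaxVGPRs = DefaultBudget;
    if (!AMDGPU::isEntryFunctionCC(F.getCallingConv()) &&
        !F.hasFnAttribute(Attribute::AlwaysInline))
      MaxVGPRs = std::min(MaxVGPRs, 32u); // avoid forcing CSR spills
    return MaxVGPRs;
  }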
pvanhout
047bf17eab [AMDGPU] Add more verbose logs to PromoteAlloca
More specifically, make it more talkative when it's looking at the users of
an alloca while promoting it to a vector.

A common failure point of the pass is unknown or weird users of the alloca.
While debugging issues related to this pass, one of the first things I usually
did was to add logs to see how the users were being handled.
Having such logs built in directly seems like a nice addition.

Reviewed By: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D148629
2023-04-19 11:30:14 +02:00
pvanhout
83ae2d3618 [AMDGPU] Refactor PromoteAlloca implementation
We're getting a lot of mileage out of PromoteAlloca, and the pass had grown somewhat organically over the years.
This patch attempts to clean up the implementation and restructure it. For instance,
the exact same code path is now used for both promote alloca to LDS and
promote alloca to vector - just with different parameters.
This removes some redundancy here and there.
I also reordered functions in a way that hopefully makes more sense (e.g. all of the pass API is in the same place).

No functionality change is intended in the patch, but some checks were moved around so I'm not using the NFC tag.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D148526
2023-04-18 14:23:58 +02:00
pvanhout
fd1d60873f [AMDGPU] Remove CC exception for Promote Alloca Limits
Apparently it was used to work around some issue that has been fixed.
Removing it helps with high scratch usage observed in some cases due to failed alloca promotion.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D145586
2023-04-13 08:48:34 +02:00
pvanhout
d7b4b76956 [AMDGPU] Handle memset users in PromoteAlloca
Allows allocas with memset users to be promoted.

This is intended to prevent patterns such as `memset(&alloca, 0, sizeof(alloca))` (which I think can be emitted by frontends) from preventing a vectorization of allocas.

Fixes SWDEV-388784

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D146225
2023-03-28 15:01:55 +02:00
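As a rough illustration of the case being handled (hedged; the real transform checks more preconditions than shown here):
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/IntrinsicInst.h"
  using namespace llvm;

  // A user like memset(&alloca, 0, sizeof(alloca)) no longer blocks
  // promotion; it can become a store of a splat vector instead.
  bool canFoldMemSet(const MemSetInst *MSI, uint64_t AllocaSizeInBytes) {
    const auto *Len = dyn_cast<ConstantInt>(MSI->getLength());
    const auto *Val = dyn_cast<ConstantInt>(MSI->getValue());
    // Only the simple whole-alloca, constant-value case is sketched.
    return Len && Val && Len->getZExtValue() == AllocaSizeInBytes;
  }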
Nicolai Hähnle
10cef708a7 AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.

getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.

At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).

This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.

Differential Revision: https://reviews.llvm.org/D139468
2023-01-23 21:43:06 +01:00
Guillaume Chatelet
135f23d67b Deprecate MemIntrinsicBase::getDestAlignment() and MemTransferBase::getSourceAlignment()
Differential Revision: https://reviews.llvm.org/D141840
2023-01-16 14:22:03 +00:00
Ruiling Song
5d0ff923c3 AMDGPU: Promote array alloca if used by memmove/memcpy
Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D140599
2023-01-11 09:59:35 +08:00
Matt Arsenault
687e0e205e AMDGPU: Create alloca wide load/store with explicit alignment
This was introducing transient UB by using the default alignment of a
larger vector type.
2023-01-03 11:29:18 -05:00
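Roughly, the fix amounts to something like this; Alloca, Builder, VectorTy and NewVecValue are placeholder names in this sketch:
  // Use the alloca's own alignment for the wide vector access instead of
  // the (larger) natural alignment IRBuilder would pick for VectorTy.
  Align AllocaAlign = Alloca->getAlign();
  Value *WideLoad = Builder.CreateAlignedLoad(VectorTy, Alloca, AllocaAlign);
  Builder.CreateAlignedStore(NewVecValue, Alloca, AllocaAlign);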
Matt Arsenault
49caf70121 AMDGPU: Use cast instead of unchecked dyn_cast 2023-01-03 10:32:10 -05:00
Jay Foad
6443c0ee02 [AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.

Differential Revision: https://reviews.llvm.org/D139828
2022-12-14 13:22:26 +00:00
Kazu Hirata
20cde15415 [Target] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-02 20:36:06 -08:00
Matt Arsenault
3830e4e58c AMDGPU: Create poison values instead of undef
These placeholders don't care about the finer points of
the difference between the two.
2022-11-16 14:47:24 -08:00
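For example (illustrative; VecTy is a placeholder type):
  // Before: Value *Placeholder = UndefValue::get(VecTy);
  Value *Placeholder = PoisonValue::get(VecTy);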
Kazu Hirata
e0039b8d6a Use llvm::less_second (NFC) 2022-06-04 22:48:32 -07:00
Nikita Popov
3ed643ea76 [AMDGPUPromoteAlloca] Make compatible with opaque pointers
This mainly changes the handling of bitcasts to not check the types
being casted from/to -- we should only care about the actual
load/store types. The GEP handling is also changed to not care about
types, and just make sure that we get an offset corresponding to
a vector element.

This was a bit of a struggle for me, because this code seems to be
pretty sensitive to small changes. The end result seems to produce
strictly better results for the existing test coverage though,
because we can now deal with more situations involving bitcasts.

Differential Revision: https://reviews.llvm.org/D121371
2022-03-11 09:20:51 +01:00
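A hedged sketch of the offset-based GEP handling described above, reduced to the constant-offset case only; this simplifies what the pass really does:
  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Operator.h"
  #include <optional>
  using namespace llvm;

  // Map a GEP off the alloca to a vector element index, if its constant byte
  // offset lands exactly on an element boundary of VectorTy.
  static std::optional<uint64_t>
  gepToVectorIndex(GEPOperator *GEP, FixedVectorType *VectorTy,
                   const DataLayout &DL) {
    unsigned BW = DL.getIndexTypeSizeInBits(GEP->getType());
    APInt Offset(BW, 0);
    if (!GEP->accumulateConstantOffset(DL, Offset))
      return std::nullopt; // variable offset: not handled in this sketch

    uint64_t EltSize =
        DL.getTypeAllocSize(VectorTy->getElementType()).getFixedValue();
    if (EltSize == 0 || Offset.urem(EltSize) != 0)
      return std::nullopt; // does not address a whole element
    return Offset.udiv(EltSize).getZExtValue();
  }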
Sebastian Neubauer
6527b2a4d5 [AMDGPU][NFC] Fix typos
Fix some typos in the amdgpu backend.

Differential Revision: https://reviews.llvm.org/D119235
2022-02-18 15:05:21 +01:00
serge-sans-paille
e188aae406 Cleanup header dependencies in LLVMCore
Based on the output of include-what-you-use.

This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden header dependencies, something
the LLVM codebase doesn't do that well :-/

I've tried to summarize the biggest change below:

- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer include llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h

And the usual count of preprocessed lines:
$ clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after:  6189948

200k fewer lines to process is not that bad ;-)

Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup

Differential Revision: https://reviews.llvm.org/D118652
2022-02-02 06:54:20 +01:00
Yaxun (Sam) Liu
15f54dd5e4 AMDGPU: Account for usage of HIP-style dynamic LDS
Disable promote alloca to LDS when HIP-style dynamic LDS is used, since the size
is unknown at compile time.

Patch by: Siu Chi Chan

Reviewed by: Matt Arsenault, Yaxun Liu

Differential Revision: https://reviews.llvm.org/D117494
2022-01-19 13:05:29 -05:00
Arthur Eubanks
1172712f46 [NFC] Replace some deprecated getAlignment() calls with getAlign()
Reviewed By: gchatelet

Differential Revision: https://reviews.llvm.org/D115370
2021-12-09 08:43:19 -08:00
Neubauer, Sebastian
d1f45ed58f [AMDGPU][NFC] Fix typos
Differential Revision: https://reviews.llvm.org/D113672
2021-11-12 11:37:21 +01:00
Stanislav Mekhanoshin
cf74ef134c [AMDGPU] Limit promote alloca max size in functions
Non-entry functions have 32 caller saved VGPRs available. If we
promote alloca to consume more registers we will have to spill
CSRs. There is no reason to eliminate scratch access to get
another scratch access instead.

Differential Revision: https://reviews.llvm.org/D110372
2021-09-24 13:38:39 -07:00
Arthur Eubanks
b9b419a13c [NFC] Remove redundant code added in 04ce2de3 2021-09-01 11:30:07 -07:00
Matt Arsenault
04ce2de330 AMDGPU: Remove implicit argument attributes when introducing new calls
In a future patch, a new set of amdgpu-no-* attributes will be
introduced to indicate when a function does not need an implicitly
passed input. This pass introduces new instances of these intrinsic
calls, and should remove the attributes if they were present before.
2021-08-26 22:08:04 -04:00
Arthur Eubanks
44a3241f10 [NFC] Replace some attribute methods that use confusing indexes 2021-08-19 14:10:26 -07:00