It is currently disabled by default. It will need experiments on a real
HW to tune and decide on the profitability.
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
It seems TypeSize is currently broken in the sense that:
TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)
without failing its assert that explicitly tests for this case:
assert(LHS.Scalable == RHS.Scalable && ...);
The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.
This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.
The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got
renamed at some point in the process and the target implementation was
left behind.
Fixes: SWDEV-407329
This changes the costmodelling of the vecreduce.min/max nodes to use the costs
of the relevant min/max intrinsics instead of expanding them to compare and
selects. The getMinMaxReductionCost have changed to take a Opcode for the
relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are
no longer needed.
A follow up patch will add some basic fminimum/fmaximum costmodelling.
Differential Revision: https://reviews.llvm.org/D153547
Before this patch, the compiler gave a bump to the inline-threshold
when the total size of the allocas passed as arguments to the
callee was below 256 bytes.
This heuristic ignores that some of these allocas could have be removed
by SROA if inlining was applied.
Ideally, this bonus would be attributed to the threshold once the
size of all the allocas that could not be handled by SROA is known:
at the end of the InlineCost analysis.
However, we may never reach this point if the inline-cost analysis exits
early when the inline cost goes over the threshold mid-analysis.
This patch proposes:
* Attribute the bonus in the inline-threshold when allocas are passed
as arguments (regardless of their total size).
* Assigns a cost to each alloca proportional to its size,
such that the cost of all the allocas cancels the bonus.
Potential problems:
* This patch assumes that removing alloca instructions with SROA is
always profitable. This may not be the case if the total size of the
allocas is still too big to be promoted to registers/LDS.
* Redundant calls to getTotalAllocaSize
* Awkwardly, the threshold attributed contributes to the single-bb and
vector bonus.
Reviewed By: scchan
Differential Revision: https://reviews.llvm.org/D149741
Expand large or unknown size memory intrinsics into loops in the
default lowering pipeline if the target doesn't have the corresponding
libfunc. Previously AMDGPU had a custom pass which existed to call the
expansion utilities.
With a default no-libcall option, we can remove the libfunc checks in
LoopIdiomRecognize for these, which never made any sense. This also
provides a path to lifting the immarg restriction on
llvm.memcpy.inline.
There seems to be a bug where TLI reports functions as available if
you use -march and not -mtriple.
The intrinsic @llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} can only be
selected for flat address spaces (constant, flat and global).
This patch restricts the cases over which GCNTTIImpl::rewriteIntrinsicWithAddressSpace
rewrites the intrinsic.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D149938
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.
This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.
Differential Revision: https://reviews.llvm.org/D149776
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.
The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.
The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. ptr
addrspace(8). These pointers are 128-bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.
This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.
Depends on D143437
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D145441
Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).
This will be necessary to correct recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).
Differential Revision: https://reviews.llvm.org/D148149
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.
Differential Revision: https://reviews.llvm.org/D144650
In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor.
This is based off of the approach used here: 8d36708507
The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch.
See https://reviews.llvm.org/D143723#4132349
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D144474
Reapplies 142c28ffa1323e9a8d53200a22c80d5d778e0d0f as part of D140242 which got reverted due to amdgpu openmp test failures.
This diff fixes said failures by eliding most of `adjustInliningThresholdUsingCallee` for indirect calls as the callee function is unavailable for indirect calls.
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D143498
A regression from when new PM got enabled as default. Functions with a big number of instructions will elide getting inlined but do not consider the cost of passing arguments over stack if there are a lot of function arguments. This patch attempts to add a heuristic for AMDGPU's function calling convention that also considers function arguments passed through the stack.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D140242
Right now we do opcode wise matching to identify uniform/non-divergent
AMDGPU intrinsics. It is duplicated at 2 places once at IR level uniformity analysis
and at MIR level. Moving them to single tablegen table for consistency and adding
and API rapper to access them.
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D142961
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
And so does getShuffleCost() to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds an argument CostKind to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
The most common case for string attributes parses them as integers. We
don't have a convenient way to do this, and as a result we have
inconsistent missing attribute and invalid attribute handling
scattered around. We also have inconsistent radix usage to
getAsInteger; some places use the default 0 and others use base 10.
Update a few of the uses, but there are quite a lot of these.
If a kernel has uneven dimensions we can have a value of workitem-id-x
divided by the wavefrontsize non-uniform. For example dimensions (65, 2)
will have workitems with address (64, 0) and (0, 1) packed into a same
wave which gives 1 and 0 after the division by 64 respectively.
Unfortunately, this limits the optimization to OpenCL only and only if
reqd_work_group_size attribute is set. This patch limits it to 1D kernels,
although that shall be possible to perform this optimization is the size
of the X dimension is a power of 2, we just do not currently have
infrastructure to query it.
Note that presence of amdgpu-no-workitem-id-y attribute does not help
as it only hints the lack of the workitem-id-y query, but not the absence
of the actual 2nd dimension, therefore affecting just the SGPR allocation.
Differential Revision: https://reviews.llvm.org/D132879
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.
This is the change which motivated the whole sequence which preceeded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional to make sure that every call site possible effected must show up in the diff. This let me audit each one closely.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
Certain address space dependent optimizations, like SeperateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done.
This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in non-flat address space, then isel may infer a different intrinsic. For example, if the result is a flat_atomic using global address space, then it will be lowered to the corresponding global_atomic intrinsic.
Change-Id: Ifcd981709dc2ea94d4acbcb84efe7176593ec8c7
TragetLowering had two last InstructionCost related `getTypeLegalizationCost()`
and `getScalingFactorCost()` members, but all other costs are processed in TTI.
E.g. it is not comfortable to use other TTI members in these two functions
overrided in a target.
Minor refactoring: `getTypeLegalizationCost()` now doesn't need DataLayout
parameter - it was always passed from TTI.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D117723
D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
the LoopVectorize. This has resulted in an unrestricted vectorization
on AMDGPU blowing up register pressure.
Differential Revision: https://reviews.llvm.org/D122850
Currently, the utility supports lowering of non atomic memory transfer routines only. This patch adds support for atomic version of memcopy. This may be useful for targets not supporting atomic memcopy.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118443