This change improves memset code generation for non-zero values on
AArch64 by using NEON's DUP instruction instead of
the less efficient multiplication with 0x01010101 pattern.
For small sizes, the value is extracted from a larger DUP. For
non-power-of-two sizes, overlapping stores are used in some cases.
TargetLowering::findOptimalMemOpLowering is modified to allow explicitly
specifying the size of the constant in cases where the constant is
larger than the store operations.
Fixes #165949
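As a rough scalar illustration of the two ideas above (not the AArch64 lowering itself): multiplying by 0x01010101 splats a byte across a 32-bit word, and a non-power-of-two fill such as 7 bytes can be covered by two overlapping 4-byte stores. The helper names `splatByte` and `fillOverlapping` are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Splat one byte across a 32-bit word using the multiplication pattern.
inline uint32_t splatByte(uint8_t V) { return uint32_t(V) * 0x01010101u; }

// Fill N bytes (4 <= N <= 8) using two 4-byte stores that may overlap.
inline void fillOverlapping(unsigned char *Dst, uint8_t V, std::size_t N) {
  assert(N >= 4 && N <= 8 && "sketch only handles 4..8 bytes");
  uint32_t W = splatByte(V);
  std::memcpy(Dst, &W, 4);         // first store covers bytes [0, 4)
  std::memcpy(Dst + N - 4, &W, 4); // second store covers bytes [N-4, N)
}
```

For N = 7 the two stores overlap in byte 3, which is harmless because both write the same value.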
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to be estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
There is an issue on ARM where a function won't be inlined due to
mismatching target features between caller and callee.
The caller has `HasV8Ops` and `FeatureDotProd` and the callee does not,
but AFAIK this should not be a problem.
https://godbolt.org/z/f19h3zT66 is an example showing how the call is
not inlined on armv7.
The expected asm output would be something like:
```asm
.fnstart
vsdot.s8 q0, q1, d4[0]
bx lr
.Lfunc_end0:
```
Thanks to @Amichaxx we managed to narrow it down and now can resolve
this problem by adding `ARM::FeatureDotProd, ARM::HasV8Ops` to
InlineFeaturesAllowed in llvm/lib/Target/ARM/ARMTargetTransformInfo.h,
after which the inlining occurs successfully.
Whilst we're at it we have also added some debugging to make it easier
to tell why (or why not) a function is being inlined for ARM, and a
couple of other features that seem to be missing from the list.
This patch was motivated by an issue experienced with Rust that was
traced back to LLVM, and thus was designed to address that.
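The feature check can be sketched as follows. This models the rule as I understand it (features outside the allowed list must match exactly; allowed features of the callee must be a subset of the caller's), so treat it as an assumption rather than the exact in-tree code. The feature bit names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical feature bits, for illustration only.
enum : uint64_t {
  FeatNEON = 1u << 0,
  FeatV8 = 1u << 1,
  FeatDotProd = 1u << 2,
};

// Features outside Allowed must match exactly; features inside Allowed only
// require the callee's set to be a subset of the caller's.
inline bool inlineCompatible(uint64_t Caller, uint64_t Callee,
                             uint64_t Allowed) {
  bool MatchExact = (Caller & ~Allowed) == (Callee & ~Allowed);
  bool MatchSubset = ((Caller & Callee) & Allowed) == (Callee & Allowed);
  return MatchExact && MatchSubset;
}
```

With only NEON in the allowed list, a caller with the v8 and dot-product bits cannot inline a callee without them; adding both bits to the allowed list makes the same pair compatible.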
Following #165532, this patch moves scalarization‑cost computation into
BaseT::getMemIntrinsicCost and lets backends override it via their
getMemIntrinsicCost.
It also removes the masked/gather‑scatter/strided/expand‑compress
costing interfaces from TTIImpl.
Targets may keep them locally if needed.
Stacked on #170426 and #170436.
- Following #168029. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
- Alignment, CostKind, Inst);
+ InstructionCost getGatherScatterOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with the
same information as before.
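The shape of the change can be sketched with a stand-in attributes struct. The struct `GatherScatterAttrs`, its fields, and the toy cost formula are hypothetical; they are not the real MemIntrinsicCostAttributes layout or any target's cost.

```cpp
#include <cstdint>

// Hypothetical stand-in for an attributes object that bundles what used to be
// separate parameters.
struct GatherScatterAttrs {
  unsigned Opcode;
  unsigned NumElts;
  bool VariableMask;
  uint64_t AlignmentBytes;
};

// Toy cost: one unit per element, doubled when the mask is not constant.
inline unsigned toyGatherScatterCost(const GatherScatterAttrs &A) {
  unsigned PerElt = A.VariableMask ? 2 : 1;
  return A.NumElts * PerElt;
}
```

Adding a field later does not change the signature of the cost query, which is the main appeal of the attributes-object style.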
Back when `TargetTransformInfo::instCombineIntrinsic` was added in
https://reviews.llvm.org/D81728, several transforms common to both ARM
and AArch64 were kept in the non-target-specific `InstCombineCalls.cpp`
so they could be shared between the two targets.
I want to extend the transform of the `tbl` intrinsics into static
`shufflevector`s in a similar manner to
https://github.com/llvm/llvm-project/pull/169110 (right now it only
works with a 64-bit `tbl1`, but `shufflevector` should allow it to work
with up to 2 operands, and it can definitely work with 128-bit vectors).
I think separating out the transform into a TTI hook is a prerequisite.
~~I'm not happy about creating an entirely new module for this and
having to wire it up through CMake and everything, but I'm not sure
about the alternatives. If any maintainers can think of a cleaner way of
doing this, I'm very open to it.~~
I've moved the transforms into
`Transforms/Utils/ARMCommonInstCombineIntrinsic.cpp`, which is a lot
simpler.
This backend support will allow the LoadStoreVectorizer, in certain
cases, to fill in gaps when creating load/store vectors and generate
LLVM masked load/stores
(https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To
accomplish this, changes are separated into two parts. This first part
has the backend lowering and TTI changes, and a follow up PR will have
the LSV generate these intrinsics:
https://github.com/llvm/llvm-project/pull/159388.
In this backend change, Masked Loads get lowered to PTX with `#pragma
"used_bytes_mask" [mask];`
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
And Masked Stores get lowered to PTX using the new sink symbol syntax
(https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).
# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores
with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to
check that the mask is constant and pass that result into the TTI check.
Behavior shouldn't change for non-NVPTX targets, which do not care
whether the mask is variable or constant when determining legality, but
all TTI files that implement these APIs need to be updated.
# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being
scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and
handles it in lowerMSTORE. Similar handling to normal store vectors,
except we read the mask and place a sentinel register `$noreg` in each
position where the mask reads as false.
For example,
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10
->
STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);
```
* `NVPTXInstrInfo.td` - changes the definition of store vectors to allow
for a mix of sink symbols and registers.
* `NVPTXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".
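The mask-to-sink-symbol mapping described above can be sketched as a small standalone helper. `formatStoreOperands` and the register names are hypothetical, and the output is only meant to resemble an operand list, not to be exact PTX.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build an operand list in which lanes whose mask bit is false are replaced
// by the sink symbol "_".
inline std::string formatStoreOperands(const std::vector<std::string> &Regs,
                                       const std::vector<bool> &Mask) {
  std::string Out = "{";
  for (std::size_t I = 0; I < Regs.size(); ++I) {
    if (I != 0)
      Out += ", ";
    Out += Mask[I] ? Regs[I] : std::string("_");
  }
  return Out + "}";
}
```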
# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a
`#pragma "used_bytes_mask"` is emitted before the load instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
To accomplish this, a new operand is added to every NVPTXISD Load type
representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal
NVPTXISD loads with a mask operand in two ways. 1) In type legalization
through replaceLoadVector, which is the normal path, and 2) through
LowerMLOAD, to handle the legal vector types
(v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both
share the same convertMLOADToLoadWithUsedBytesMask helper. Both default
this operand to UINT32_MAX, representing all bytes on. For the latter,
we need a new `NVPTXISD::MLoadV1` type to represent that edge case
because we cannot put the used bytes mask operand on a generic
LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extract used bytes mask from loads, add them
to created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask
isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update
manual indexing of load operands to account for new operand.
* `NVPTXInstrInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to
the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get
tagged as invariant.
Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting
masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked
load SDNode creation
- Split from #165532. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getMaskedMemoryOpCost(Opcode, Src, Alignment,
- AddressSpace, CostKind);
+ InstructionCost getMaskedMemoryOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with the
same information as before.
- Follow-up: migrate gather/scatter, strided, and expand/compress cost
queries to the same attributes-based entry point.
Update isNoWrap to only use the inbounds/nusw flags from GEPs that are
guaranteed to be dereferenced on every iteration. This fixes a case
where we incorrectly determine no dependence.
I think the issue is isolated to code that evaluates the resulting
AddRec at BTC; just using it to compute the distance between accesses
should still be fine, because if the access does not execute in a given
iteration, there's no dependence in that iteration. But isolating the
code is not straightforward, so be conservative for now. The practical
impact should be very minor (only one loop changed across a corpus with
27k modules from large C/C++ workloads).
Fixes https://github.com/llvm/llvm-project/issues/160912.
PR: https://github.com/llvm/llvm-project/pull/161445
This patch adds a cost kind to `getAddressComputationCost()` for #149955.
Note that this patch also removes all the default values in `getAddressComputationCost()`.
LD2 is represented in IR as deinterleave-shuffle(load), and ST2 as
store(interleave-shuffle). Whilst the shuffle would be expensive in
general for MVE (it does not have zip/uzp instructions), it should be
treated as cheap when part of the LD2/ST2 pattern. This borrows some
code from the AArch64 backend to produce lower costs. (Some of which
still show as higher than they should - that just shows how broken the
generic shuffle costs are at the moment; they would be lower if
getShuffleCost was called directly as opposed to going through
getInstructionCost).
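For reference, the deinterleave shuffle in the LD2 pattern has a mask that picks every second element. A minimal standalone check for such masks might look like the following; the helper `isDeinterleaveMask` here is written for illustration and is not the in-tree routine.

```cpp
#include <cstddef>
#include <vector>

// Returns true if Mask selects every Factor-th element starting at Start,
// e.g. {0, 2, 4, 6} (Factor 2, Start 0) or {1, 3, 5, 7} (Factor 2, Start 1).
inline bool isDeinterleaveMask(const std::vector<int> &Mask, unsigned Factor,
                               unsigned Start) {
  if (Mask.empty() || Factor == 0 || Start >= Factor)
    return false;
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != static_cast<int>(Start + I * Factor))
      return false;
  return true;
}
```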
In some places we were passing the type of value being accessed, in
other cases we were passing the type of the pointer for the access.
The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.
This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.
No target actually checks the contents of the type passed, only to see
if it's a vector or not, so this shouldn't have an effect.
Add LLVM Context to getOptimalMemOpType and findOptimalMemOpLowering so
that we can use EVT::getVectorVT to generate an EVT in
getOptimalMemOpType.
Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen but the default lowering to promote for integer
vectors.
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.
Pull Request: https://github.com/llvm/llvm-project/pull/136655
Making `TargetTransformInfo::Model::Impl` `const` makes sure all
interface methods are `const`, in `BasicTTIImpl`, its bases, and in all
derived classes.
Pull Request: https://github.com/llvm/llvm-project/pull/136598
The main change is making the `thisT` method `const`; the rest of the
changes fix compilation errors (*).
(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled in to derived classes using the `using` directive, and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need to make all hiding
overloads `const`. See the second commit in this PR.
Pull Request: https://github.com/llvm/llvm-project/pull/136575
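The hiding pitfall described in the footnote can be reproduced in a few lines. The class and method names below are hypothetical; only the mechanism (a CRTP base calling through a `const` pointer, with overloads pulled in by a using-declaration) mirrors the description.

```cpp
// CRTP base: query() dispatches through a const pointer to the derived type.
template <typename T> struct CostBase {
  const T *thisT() const { return static_cast<const T *>(this); }
  int getCost(int) const { return 1; }      // default implementation
  int getCost(int, int) const { return 2; } // second overload
  int query() const { return thisT()->getCost(0); }
};

// Hiding overload is const: it replaces the base version in the overload set.
struct ConstTarget : CostBase<ConstTarget> {
  using CostBase<ConstTarget>::getCost;
  int getCost(int) const { return 10; }
};

// Hiding overload is NOT const: it differs in cv-qualification, so the base
// version stays visible and is the only viable candidate on a const object.
struct StaleTarget : CostBase<StaleTarget> {
  using CostBase<StaleTarget>::getCost;
  int getCost(int) { return 10; }
};
```

Both targets compile without a diagnostic, but `StaleTarget` silently falls back to the base cost, which is why a target's cost model tests may start failing after this change.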
This patch changes the preferInLoopReduction function to take a
RecurKind instead of an unsigned Opcode.
This makes it possible to distinguish non-arithmetic reductions such as
min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf
and IFindLastIV with FFindLastIV.
Related patches: #118393, #131830
In order to facilitate targets that only support masked loads/stores
on certain address spaces (AMDGPU will support them in an upcoming
patch, but only for address space 7), add an AddressSpace parameter
to isLegalMaskedLoad and isLegalMaskedStore.
PR #117350 made changes to the SLP vectorizer which introduced a
regression on some ARM benchmarks. Investigation narrowed it down to
suboptimal codegen for benchmarks that previously only used scalar (U/S)MLAL
instructions. The linked change meant the SLPVectorizer thought that
these could be vectorized. This change makes the cost of muls in
(U/S)MLAL patterns slightly cheaper to make sure scalar instructions are
preferred in these cases over SLP vectorization on targets supporting DSP.
The implementation of getExtendedReductionCost() often calls
getArithmeticReductionCost() with FMFs. But we shouldn't call
getArithmeticReductionCost() with FMFs for non-floating-point reductions,
since that will return the wrong cost.
This patch makes FMFs in getExtendedReductionCost() optional, aligning
it with getArithmeticReductionCost(), so that TTI returns the correct
cost for non-FP extended-reduction queries without FMFs.
This patch is not quite NFC but it's hard to test from the CostModel
side.
Split from #113903.
Historically this function returned an int with negative values meaning
invalid. It was migrated to InstructionCost in 43ace8b5ce07a, but the
code was not updated to return invalid cost instead of -1. In that
commit, the caller in LSR was updated to assert that the cost is valid
instead of positive. We should return invalid instead of a negative
value so LSR will assert if the cost isn't valid.
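The difference between the two conventions can be sketched with a toy cost type. `ToyCost` and `toyAddressingCost` are hypothetical and only imitate the idea of an explicit invalid state; they are not the InstructionCost API.

```cpp
// Toy cost with an explicit validity flag instead of a magic negative value.
struct ToyCost {
  int Value;
  bool Valid;
  static ToyCost getInvalid() { return {0, false}; }
  bool isValid() const { return Valid; }
};

// Returns an invalid cost when the addressing mode is not legal, so a caller
// that asserts isValid() catches the problem instead of seeing -1.
inline ToyCost toyAddressingCost(bool LegalAddrMode) {
  if (!LegalAddrMode)
    return ToyCost::getInvalid();
  return {0, true};
}
```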
No in-tree targets currently use it in the
preferInLoopReduction/preferPredicatedReductionSelect TTI hooks. It
looks like it used to be used in LoopUtils, at least in
8ca60db40bd944dc5f67e0f200a403b4e03818ea, but I presume it was replaced
by RecurrenceDescriptor.
For processors with low overhead branching (LOB), runtime unrolling the
innermost loop is often detrimental to performance. In these cases the
loop remainder gets unrolled into a series of compare-and-jump blocks,
which in deeply nested loops get executed multiple times, negating the
benefits of LOB.
This is particularly noticeable when the loop trip count of the innermost
loop varies within the outer loop, such as in the case of triangular
matrix decompositions.
In these cases we will prefer to not unroll the innermost loop, with the
intention for it to be executed as a low overhead loop.
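As an example of the loop shape in question, here is forward substitution for a lower-triangular system, where the inner trip count equals the outer induction variable. The function is an illustration written for this message, not code from the benchmarks.

```cpp
#include <cstddef>
#include <vector>

// Solve L * X = B for lower-triangular L. The inner loop runs I times, so its
// trip count changes on every iteration of the outer loop.
inline std::vector<double>
forwardSubstitute(const std::vector<std::vector<double>> &L,
                  const std::vector<double> &B) {
  std::vector<double> X(B.size());
  for (std::size_t I = 0; I < B.size(); ++I) {
    double Sum = B[I];
    for (std::size_t J = 0; J < I; ++J)
      Sum -= L[I][J] * X[J];
    X[I] = Sum / L[I][I];
  }
  return X;
}
```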
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
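In C++ terms, the scalar f32 form behaves roughly like the helper below, which returns both results as a pair. `sinCos` is a hypothetical name for this sketch; it says nothing about how the intrinsic is lowered.

```cpp
#include <cmath>
#include <utility>

// Returns {sin(Val), cos(Val)}, mirroring the { float, float } struct result.
inline std::pair<float, float> sinCos(float Val) {
  return {std::sin(Val), std::cos(Val)};
}
```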
This is a recommit of #107120. The original PR was approved but failed
buildbot. The newly added tests should only be run for compilers that
support the ARM target. This has been resolved by adding a config file
for these tests.
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make ARM backend generate full word loads instead of
loading a single byte (ldrb) and/or half word (ldrh). Only pads
destination when it's a stack allocated constant size array and source
when it's a constant string. Heuristic to decide whether to pad or not
is very basic and could be improved to allow more examples to be
padded.
- Pass works at the midend level
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make backend generate full word loads instead of loading a
single byte (ldrb) and/or half word (ldrh). Only pads destination when
it's a stack allocated constant size array and source when it's a constant
array. Heuristic to decide whether to pad or not is very basic and could
be improved to allow more examples to be padded.
- Pass works within GlobalOpt but is disabled by default on all targets
except ARM.
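The size adjustment behind the padding is a round-up to the next multiple of the word size. A minimal sketch, assuming a 4-byte word and using the made-up name `padToWord`:

```cpp
#include <cstddef>

// Round N up to the next multiple of 4 bytes (one 32-bit word).
inline std::size_t padToWord(std::size_t N) {
  return (N + 3) & ~static_cast<std::size_t>(3);
}
```

For example, a 5-byte copy would be padded to 8 bytes, so it can be done with two word accesses rather than a word plus a byte.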
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
These intrinsics currently assume natural alignment. Instead, respect
the alignment attribute on the intrinsic. Teach InstCombine to improve
that alignment.
If desired I could also adjust the clang frontend to add alignment
annotations equivalent to the previous behavior, but I don't see any
indication that such an assumption is correct in the ARM intrinsics
docs.
Fixes https://github.com/llvm/llvm-project/issues/59081.
These can usually generate:
- qadd / qsub for signed i32 scalars
- uqadd16 / qadd16 / uqsub16 / qsub16 with an extend for signed/unsigned
i8/i16
- Are expanded to an add + cmp + sel otherwise
This can lead to differences in unrolling etc, but should be a better
cost for the instructions.
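The add + cmp + sel expansion mentioned above looks like this for an unsigned 32-bit saturating add. This is a plain C++ rendering of the pattern, with the hypothetical name `uaddSat`.

```cpp
#include <cstdint>
#include <limits>

// Unsigned saturating add expressed as add, compare, select.
inline uint32_t uaddSat(uint32_t A, uint32_t B) {
  uint32_t Sum = A + B;    // add (wraps around on overflow)
  bool Overflow = Sum < A; // cmp (wrapped result is smaller than an operand)
  return Overflow ? std::numeric_limits<uint32_t>::max() : Sum; // sel
}
```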
This will enable calling SimplifyDemandedBits() with a SimplifyQuery
that has CondContext set in the future.
Additionally this also marginally strengthens the analysis by
retaining the original context instruction for one-use chains.
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
extract subvector.
Many targets do not have a cost for the extractsubvector shuffle kind, but
do have costs for single source permute. If there is no cost
estimation for extractsubvector, it is better to switch to single source
permute for better cost estimation.
Reviewers: RKSimon, davemgreen, arsenm
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/79837
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.