LD2 is represented in IR as deinterleave-shuffle(load), and ST2 as
store(interleave-shuffle). Whilst the shuffle would be expensive in
general for MVE (it does not have zip/uzp instructions), it should be
treated as cheap when part of the LD2/ST2 pattern. This borrows some
code from the AArch64 backend to produce lower costs. (Some of these
still show as higher than they should be - that just shows how broken the
generic shuffle costs are at the moment; they would be lower if
getShuffleCost was called directly as opposed to going through
getInstructionCost.)
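As a rough sketch (illustrative types and value names, not taken from the
patch), the LD2 pattern is a wide load feeding a pair of deinterleaving
shuffles, and ST2 is an interleaving shuffle feeding a store:
  %wide = load <8 x i32>, ptr %src
  %even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %ilv = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %ilv, ptr %dst
It is shuffles in these positions that are now treated as cheap.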
Currently we always produce a cost of 1 for all CostKinds other than
RecipThroughput, which can underestimate the cost if the type has a
higher legalization cost (such as larger vectors). This relaxes that
shortcut so the legalization cost is accounted for under all cost kinds.
More closely match improveShuffleKindFromMask's shuffle ordering by
trying to match an SK_InsertSubvector shuffle pattern before SK_Select
- both can match many of the same patterns, but it is much easier to
recognise when an SK_InsertSubvector can be converted to SK_Select than
vice-versa.
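For example (an illustrative mask, not from the patch), the following
shuffle can be viewed either as inserting the low two elements of %b into
%a (SK_InsertSubvector) or as a per-lane select between the two sources
(SK_Select):
  %s = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 5, i32 2, i32 3>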
Another step towards #145335 - which I'm hoping will allow us to
generalise improveShuffleKindFromMask and remove getInstructionCost's
shuffle matching entirely.
PR #117350 made changes to the SLP vectorizer which introduced a
regression on some ARM benchmarks. Investigation narrowed it down to
suboptimal codegen for benchmarks that previously only used scalar (U/S)MLAL
instructions. The linked change meant the SLPVectorizer thought that
these could be vectorized. This change makes the cost of muls in
(U/S)MLAL patterns slightly cheaper to make sure scalar instructions are
preferred in these cases over SLP vectorization on targets supporting DSP.
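For reference, a rough sketch of the scalar shape involved (illustrative
values; SMLAL performs a 32x32->64 multiply-accumulate, UMLAL the unsigned
equivalent):
  %sa  = sext i32 %a to i64
  %sb  = sext i32 %b to i64
  %mul = mul nsw i64 %sa, %sb
  %acc = add i64 %sum, %mul    ; selects to a single smlal on DSP targets
Making the mul slightly cheaper in this context keeps the SLP vectorizer
from seeing a profitable vector alternative.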
As can be seen in https://godbolt.org/z/qvMqY79cK, a urem by a power-of-2
constant will be code-generated as an And of a mask. The cost model for
funnel shifts tries to account for that by passing OP_PowerOf2 as the
operand info for the second operand. As far as I can tell, returning a
lower cost for a urem with an OP_PowerOf2 operand is only implemented on
X86 though.
This patch short-cuts that by calling getArithmeticInstrCost(And, ..)
directly when we know the type size will be a power of 2. This is an
alternative to the patch in #126912, which is a more general solution for
power-of-2 udiv/urem costs; this more narrowly fixes just funnel shifts.
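For example (illustrative):
  %r = urem i32 %x, 32    ; lowered to an And: and r0, r0, #31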
These can usually generate:
- qadd / qsub for signed i32 scalars
- uqadd16 / qadd16 / uqsub16 / qsub16 with an extend for signed/unsigned
i8/i16
- an add + cmp + sel expansion otherwise
This can lead to differences in unrolling etc, but should be a better
cost for the instructions.
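For example (illustrative snippets, not from the patch):
  %s = call i32 @llvm.sadd.sat.i32(i32 %a, i32 %b)    ; a single qadd with DSP
  %t = call i16 @llvm.ssub.sat.i16(i16 %c, i16 %d)    ; qsub16 plus an extend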
This was [addressed for AArch64
here](https://github.com/llvm/llvm-project/pull/79004), but the same
applies to ARM.
Move the enablement of neon+fp64 to `-mcpu=cortex-r52`, which optionally
supports these features.
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.
One case that currently cannot be differentiated is an explicitly empty
datalayout (`target datalayout = ""`) in the IR file versus a missing one,
since the current APIs don't allow detecting this. If it is considered
useful to support this case (instead of passing "-data-layout=" on the
command line), I can change the IR parsers to track whether they have seen
such a directive and change the callback type.
Differential Revision: https://reviews.llvm.org/D141060
This adds some basic and/or/xor reduction costs for NEON/MVE, handling them
like other reductions where vector operations are used to reduce to legal
sizes, followed by an optional VREV+VAND/VORR/VEOR step and scalarization from
there.
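For example (illustrative), a reduction such as:
  %r = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> %v)
is now costed as vector operations down to a legal size, an optional
VREV+VAND step, and scalarization from there.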
This adds some basic smin/smax/umin/umax reduction costs for MVE/NEON, similar
to the existing Add reduction costs. They follow the same style as Add
reductions, but include a higher cost as the costs tend to be dependent on the
element size for vminv/vmaxv. These costs may not be precise, but will be more
in line with the actual cost than the default that extracts each element.
Similar to the other reductions, this changes the cost of fmin/fmax reductions
under MVE/NEON to perform vector operations until the types need to be
scalarized. The fp16 vectors can perform a VREV+FMIN/FMAX to skip a step of the
reduction, and otherwise need lanewise extracts from the top lanes.
This adds some basic fadd/fmul reduction costs for MVE/NEON. It reduces by
halving the vector size until it gets scalarized, with some additional costs
for fp16 which may require extracting the top lanes.
Differential Revision: https://reviews.llvm.org/D159367
This changes the cost modelling of the vecreduce.min/max nodes to use the costs
of the relevant min/max intrinsics instead of expanding them to compares and
selects. getMinMaxReductionCost has been changed to take an opcode for the
relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are
no longer needed.
A follow up patch will add some basic fminimum/fmaximum cost modelling.
Differential Revision: https://reviews.llvm.org/D153547
Currently getGEPCost uses the target type of the GEP as a heuristic for
the type that will be accessed, to pass onto isLegalAddressingMode.
Targets use this to work out if a GEP can then be folded into the
load/store instruction that uses the GEP.
For example, on RISC-V loads and stores can have an offset added to a
base register folded into a single instruction, so the following GEP is
free:
%p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0
%x = load i32, ptr %p ; getInstructionCost = 1
------------------------------------------------------------------------
lw t0, 42(a0)
However vector loads and stores cannot have an offset folded into them,
so the following GEP is costed:
%p = getelementptr <2 x i32>, ptr %base, i32 42 ; getInstructionCost = 1
%x = load <2 x i32>, ptr %p ; getInstructionCost = 1
------------------------------------------------------------------------
addi a0, a0, 42
vle32 v8, (a0)
The issue arises whenever there is a mismatch between the target type of
the GEP and the type that is actually accessed:
%p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0
%x = load <2 x i32>, ptr %p ; getInstructionCost = 1
------------------------------------------------------------------------
addi a0, a0, 42
vle32 v8, (a0)
Even though this GEP will result in an add instruction, because TTI
thinks it's loading an i32, it will think it can be folded and not
charge for it.
The target type can become mismatched with the memory access during
transformations, notably during SLP where a scalar base pointer will
be reused to perform a vector load or store.
This patch adds an optional AccessType argument to getGEPCost which
allows the type of memory accessed by users to be passed in as a hint,
so that we can more accurately determine if the GEP can be folded into
its users.
If AccessType is not provided, getGEPCost falls back to the old
behaviour of using the PointeeType to guess the memory access type. This
can be revisited in a later patch.
Also for now, only GEPs with exactly one user use the access type hint.
Whilst we could look through all users and use all access types to
determine if we can fold the GEP, this patch avoids doing so to prevent
O(N) behaviour.
Differential Revision: https://reviews.llvm.org/D149889
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run `git checkout --ours <yourfile>` and then reformat it
with `black`.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
If a gather/scatter is masked and will need to be scalarized then the cost
should be higher than we currently produce. An additional cost for scalarizing
the mask, extracting the i1s and branching on the result needs to be added;
this patch gives that a cost of 5.
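For example (illustrative), a masked gather such as:
  %g = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> poison)
that ends up scalarized would now include that extra cost of 5 on top of the
scalarized memory operations.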
Differential Revision: https://reviews.llvm.org/D147331
* Replace getUserCost with getInstructionCost, covering all cost kinds.
* Remove getInstructionLatency; it's not implemented by any backends, and we should fold the functionality into getUserCost (now getInstructionCost) to make it easier for targets to handle the cost kinds with their existing cost callbacks.
Original Patch by @samparker (Sam Parker)
Differential Revision: https://reviews.llvm.org/D79483
These three subtarget features are meant to control whether MVE
instructions take 1 vs 2 vs 4 architectural beats. The mve1beat feature
is described as "Model MVE instructions as a 1 beat per tick
architecture", meaning MVE instructions will execute over 4 cycles.
mve4beat is the opposite, where the entire 4 beats of the MVE instruction
execute in a single cycle. The costs for the two were backwards though,
not matching the cycle counts like they should. This patch switches the
costs on the two to bring them in line with expectations.
Differential Revision: https://reviews.llvm.org/D129141
The MVE shuffle costing for VREV instructions was incorrectly assuming
that legalized vector types remain vectors. Add a quick check to ensure
they are indeed vectors before attempting to get the number of elements.
Building on top of D125665, this adds MVE costs for fptosi.sat and
fptoui.sat, providing MVE is available and the types are legal.
Differential Revision: https://reviews.llvm.org/D125666
Similar to D124357, this adds some cost modelling for fptoi_sat for Arm
targets. Where VFP2 is available (and FP64/FP16 for the relevant types),
the operations are legal as the Arm instructions naturally saturate.
Otherwise they will need an extra smin/smax clamp, similar to AArch64.
Differential Revision: https://reviews.llvm.org/D125665
This adds some basic fptosi_sat and fptoui_sat target independent cost
modelling. The fptosi_sat is modelled as a fmin/fmax to saturate the
value, followed by a fp convert. The signed values then have an
additional fcmp+select for handling Nan correctly.
The AArch64/Arm costs may be further off, as the instructions exist
natively. This can be fixed with target-specific cost updates.
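For example (illustrative), the signed conversion:
  %r = call i32 @llvm.fptosi.sat.i32.f32(float %x)
is costed as the fmin/fmax clamp, the fp-to-int convert, and the extra
fcmp+select for NaN handling.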
Differential Revision: https://reviews.llvm.org/D124269
The vectoriser sometimes generates predicated vector loops using
the llvm.get.active.lane.mask intrinsic so it's important that we
are able to calculate a valid cost for the call instruction. When
SVE is enabled we are able to use a single whilelo instruction
for some vector types - in such cases I've marked the cost as 1.
For all other cases I've set the cost according to how the intrinsic
will be expanded.
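For example (illustrative):
  %m = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %index, i64 %tripcount)
gets a cost of 1 where a single whilelo can be used, and otherwise a cost
reflecting the expansion.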
Tests added here:
Analysis/CostModel/AArch64/sve-intrinsics.ll
Analysis/CostModel/ARM/active_lane_mask.ll
Analysis/CostModel/RISCV/active_lane_mask.ll
Differential Revision: https://reviews.llvm.org/D121109
The code size cost model for most targets uses the legalization cost of the loaded type. If this load is followed directly by a trunc instruction, and is the only use of the result of the load, only one instruction is generated in the target assembly language. This adds a check for this case, and uses the target type of the trunc instruction if so.
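For example (illustrative):
  %w = load i32, ptr %p
  %t = trunc i32 %w to i8    ; the pair typically lowers to a single narrow load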
This did not show any changes in CTMark code size benchmarks.
Reviewers: paquette, samparker, dmgreen
Differential Revision: https://reviews.llvm.org/D109388
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the
two halves. This was never treated as a legal type in llvm in the past
as there are not many 64bit instructions and no 64bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatters, long multiplies and VCTP64
instructions.
This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in a way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way where expensive
things remain expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
where it comes up in practice.
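For example (illustrative), a select such as:
  %r = select <2 x i1> %c, <2 x i64> %a, <2 x i64> %b
can now be handled with the underlying v16i8 VPSEL rather than being expanded.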
The intrinsics currently keep using the v4i1 they previously used to
emulate a v2i1. This will be changed in a followup patch, but this one
was already large enough.
Differential Revision: https://reviews.llvm.org/D114449
Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detect signed values.
The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply.
More closely matches the default expansion from TargetLowering::expandMULO
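For reference (illustrative), the overflow intrinsic these mulo cost changes
are concerned with:
  %r = call { i32, i1 } @llvm.smul.with.overflow.i32(i32 %a, i32 %b)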
The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted.
I believe the cost model was also incorrect for the old expansion.
The expansion prior to D47927 used 3 icmps using LHS, RHS, and Result
to calculate their signs, then 2 icmps to compare the signs, followed
by an And. The previous cost model was using 3 icmps and 2 selects.
Digging back through git blame, those 2 selects in the cost model used to
be 2 icmps, but were changed in https://reviews.llvm.org/D90681
Differential Revision: https://reviews.llvm.org/D110739
This patch adds an initial ShuffleVectorInst::isInsertSubvectorMask helper to recognize 2-op shuffles where the lowest elements of one of the sources are being inserted into the "in-place" other operand; this includes "concat_vectors" patterns, as can be seen in the Arm shuffle cost changes. This also helped fix an x86 issue with irregular/length-changing SK_InsertSubvector costs - I'm hoping this will help with D107188.
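For example (illustrative), a concat_vectors style shuffle that the new helper recognizes:
  %c = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>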
This doesn't currently attempt to work with 1-op shuffles that could either be a "widening" shuffle or a self-insertion.
The self-insertion case is tricky, but we currently always match this with the existing SK_PermuteSingleSrc logic.
The widening case will be addressed in a follow up patch that treats the cost as 0.
Masks with a high number of undef elts will still struggle to match optimal subvector widths - it's currently bounded by the minimum-width possible insertion, whilst some cases would benefit from wider (pow2?) subvectors.
Differential Revision: https://reviews.llvm.org/D107228