If we have a shuffle which can be split via VLA where two or more of the
destinations have exactly the same elements, then we only need to
account for them once in costing. The duplicate copies are (at
worst) whole register moves.
Note that this change only handles the single source case. Doing the
multiple source case seemed a bit more complicated, and I didn't have a
motivating test case.
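As an illustrative sketch (hypothetical IR, not taken from the patch's tests): if the shuffle below is split into two destination registers, both halves select elements <0,1,2,3> from the single source, so only one sub-shuffle needs to be costed and the second copy is at worst a whole-register move.
```
define <8 x i32> @sketch(<8 x i32> %v) {
  ; Both halves of the result use the same element selection from one source.
  %s = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  ret <8 x i32> %s
}
```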
This protects against invalid size requests on scalable vectors by checking the
original VT rather than the legalized type when checking for scalars. The cost
returned is now Invalid, which lines up with codegen not being able to
produce a result.
We have a lot of missing CodeSize costs for vector operations. This
patch starts things off by adding CodeSize costs for getVectorInstrCost,
returning a single cost instead of the VectorInsertExtractBaseCost
(which is typically 2). Inserts of a load are given a cost of 0, as they
can use ld1; otherwise the cost is 1.
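As a hand-written sketch of the distinction (not one of the patch's tests; the expected costs follow the description above):
```
define <4 x float> @sketch(ptr %p, float %f, <4 x float> %acc) {
  ; Inserting a freshly loaded scalar can use ld1 into a lane: CodeSize cost 0.
  %l  = load float, ptr %p
  %v1 = insertelement <4 x float> %acc, float %l, i32 0
  ; Inserting an already-computed value needs an insert of its own: CodeSize cost 1.
  %v2 = insertelement <4 x float> %v1, float %f, i32 1
  ret <4 x float> %v2
}
```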
After #130665 these operations are scalarized to avoid
double-rounding. This updates the cost model to match.
In the future we might be able to use SVE instructions to help, but for
the moment the costs should be higher. CodeSize and Latency costs are
not yet expected to be accurate. The vector insert/extract will use the
cost of VectorInsertExtractBaseCost (2 by default).
Improve the modeling of the memory effects and instruction cost of
inline assembly.
- MemoryEffects: The CUDA spec states that inline assembly is not
assumed to have any side effects or to read or write memory. Inline
assembly may be treated as NoModRef unless it is explicitly marked as
having side effects or has an explicit memory clobber (see the IR sketch
after this list).
https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization
> Normally any memory that is written to will be specified as an out
operand, but if there is a hidden read or write on user memory (for
example, indirect access of a memory location via an operand), or if you
want to stop any memory optimizations around the asm() statement
performed during generation of PTX, you can add a “memory” clobbers
specification after a 3rd colon.
- InstructionCost: This change implements a very rough string-parsing
system to count the number of instructions in an inline asm string. There are
corner cases it will not handle well, but in general this is an
improvement over the current cost of the number of arguments plus one.
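A minimal IR sketch of the MemoryEffects distinction (hand-written for illustration; the asm strings are placeholders, not from the patch):
```
define i32 @sketch(ptr %p, i32 %v) {
  ; No sideeffect flag and no memory clobber: may be treated as NoModRef.
  %tid = call i32 asm "mov.u32 $0, %tid.x;", "=r"()
  ; Marked sideeffect with an explicit "memory" clobber: must be treated as
  ; potentially reading and writing memory.
  call void asm sideeffect "st.global.u32 [$0], $1;", "l,r,~{memory}"(ptr %p, i32 %v)
  ret i32 %tid
}
```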
If a scalable vector uitofp or sitofp effectively extends the size of
each element as part of the conversion, the AArch64 backend
may need to plant multiple unpacks before converting. Increase
the cost in those cases to account for this.
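For instance (an illustrative example, not taken from the patch): converting nxv8i16 to nxv8f32 widens each element from 16 to 32 bits, so the single source register has to be unpacked (e.g. sunpklo/sunpkhi) into two registers before the converts can be issued.
```
define <vscale x 8 x float> @sketch(<vscale x 8 x i16> %v) {
  %r = sitofp <vscale x 8 x i16> %v to <vscale x 8 x float>
  ret <vscale x 8 x float> %r
}
```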
Currently, the handling for calls is split between AA and BasicAA in an
awkward way. BasicAA does argument alias analysis for non-escaping
objects (but without considering MemoryEffects), while AA handles the
generic case using MemoryEffects. However, fundamentally, both of these
are really trying to do the same thing.
The new merged logic first tries to remove the OtherMR component of the
memory effects, which includes accesses to escaped memory. If a
function-local object does not escape, OtherMR can be set to NoModRef.
Then we perform the argument scan in basically the same way as AA
previously did. However, we also need to look at the operand bundles. To
support that, I've adjusted getArgModRefInfo to accept operand bundle
arguments.
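A small IR sketch of the kind of query this affects (hand-written, not from the patch):
```
declare void @g() memory(readwrite)

define i32 @sketch() {
  %a = alloca i32        ; function-local object that never escapes
  store i32 1, ptr %a
  ; @g may access escaped memory (OtherMR), but %a has not escaped and is not
  ; passed as an argument or bundle operand, so the call is NoModRef for %a.
  call void @g()
  %x = load i32, ptr %a
  ret i32 %x
}
```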
PR #117350 made changes to the SLP vectorizer which introduced a
regression on some ARM benchmarks. Investigation narrowed it down to
suboptimal codegen for benchmarks that previously only used scalar (U/S)MLAL
instructions. The linked change meant the SLPVectorizer thought that
these could be vectorized. This change makes the cost of muls in
(U/S)MLAL patterns slightly cheaper to make sure scalar instructions are
preferred in these cases over SLP vectorization on targets supporting DSP.
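The pattern in question looks roughly like this (illustrative IR for the 64-bit accumulating SMLAL form):
```
define i64 @sketch(i32 %a, i32 %b, i64 %acc) {
  ; 32x32->64-bit multiply accumulated into a 64-bit sum maps onto SMLAL.
  ; Keeping the mul slightly cheaper stops SLP from breaking the chain up.
  %sa = sext i32 %a to i64
  %sb = sext i32 %b to i64
  %m  = mul nsw i64 %sa, %sb
  %r  = add nsw i64 %acc, %m
  ret i64 %r
}
```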
When we collect a contextual profile, we sample the threads entering its root and only collect on one at a time (see `ContextRoot::Taken`). If we want to compare counter values across contextual profiles, and/or against flat profiles, we have a problem: we don't know how the counter values relate to each other. To that end, we add `ContextRoot::TotalEntries`, which is incremented every time a root is entered and serves as a multiplier for the counter values collected under that root.
We expose this in the profile and leave the normalization to the user of the profile, for a few reasons:
* it's only needed if reasoning about all profiles in aggregate.
* the goal, in compiler_rt, is to flush out the profile as quickly as possible; performing multiplications adds overhead that may not even be necessary if the consumer of the profile doesn't care about combining profiles.
* the information itself may be interesting as an indication of relative sampling of various contexts.
For the purposes of alias analysis, we should only consider provenance
captures, not address captures. To support this, change (or add)
CaptureTracking APIs to accept a Mask and StopFn argument. The Mask
determines which components we are interested in (for AA that would be
Provenance).
The StopFn determines when we can abort the walk early. Currently, we
want to do this as soon as any of the components in the Mask is
captured. The purpose of making this a separate predicate is that in the
future we will also want to distinguish between capturing full
provenance and read-only provenance. In that case, we can only stop
early once full provenance is captured. The earliest escape analysis
does not get a StopFn, because it must always inspect all captures.
This is essentially the tests from b021bdbb3997 re-done with the new cost-model
output format from #130490, to add cost-model coverage for all the cost kinds.
More to come.
Run lines with and without +fullfp16 are added to check the differences between
the two, and the fp16 tests are separated out to keep the other check lines
simpler. FP128 tests are added for all operations, and fmuladd tests are added
similar to fma.
In order to make the different cost model kinds easier to test, and to
manage the complexity of all the different variants, this patch
introduces a -cost-kind=all option that will print the output of all
cost model kinds. It feels especially helpful for tests that already have
multiple run lines (with / without +fullfp16, for example).
It currently produces the output:
```
Cost Model: Found costs of RThru:1 CodeSize:1 Lat:3 SizeLat:1 for: %F16 = fadd half undef, undef
```
The output is collapsed into a single value if all costs are the same.
Invalid costs print "Invalid" via the normal InstructionCost printing.
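For reference, when all four kinds agree the line collapses to a single value, along these lines (illustrative: an integer add whose costs are all 1):
```
Cost Model: Found costs of 1 for: %V = add i32 undef, undef
```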
Two test files are updated to show some examples with
-intrinsic-cost-strategy=type-based-intrinsic-cost and Invalid costs.
Once we have something we are happy with, I will try to use this to
update more tests, as in b021bdbb3997ef6dd13980dc44f24754f15f3652 but
for more variants.
This patch revises the cost model for sdiv/srem, drawing inspiration from the udiv/urem patch #122236.
The typical codegen for the different scenarios is documented as notes/comments in the code itself (there are too many scenarios to usefully enumerate here in the patch description).
I recently realised that we return an invalid cost when requesting
the type-based cost for the get.active.lane.mask intrinsic. I've
fixed that in this patch by reusing the existing code for the
non-type-based model.
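For reference, the intrinsic in question looks like this (illustrative example):
```
declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)

define <vscale x 4 x i1> @sketch(i64 %index, i64 %tc) {
  ; Previously the type-based query for this call returned an invalid cost.
  %m = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %tc)
  ret <vscale x 4 x i1> %m
}
```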
This replaces the `-prefer-intrinsic-cost` and
`type-based-intrinsic-cost` flags with a single
`-intrinsic-cost-strategy=<strategy>` flag.
The possible strategies are:
* `instruction-cost`
- Use TargetTransformInfo::getInstructionCost()
* `intrinsic-cost`
- Use TargetTransformInfo::getIntrinsicInstrCost()
* `type-based-intrinsic-cost`
- Calculate the intrinsic cost based only on argument types
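A run line of roughly this shape exercises the new flag (illustrative; the surrounding options follow the usual cost-model test conventions):
```
; RUN: opt < %s -passes="print<cost-model>" -cost-kind=all \
; RUN:   -intrinsic-cost-strategy=type-based-intrinsic-cost -disable-output 2>&1 | FileCheck %s
```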
Fixes (keep it open) #130110.
If the incoming value is the PHI itself, we can skip that incoming value.
If we can guarantee that the other incoming values are neither undef nor
poison, then we can also guarantee that the PHI isn't either. If we cannot
guarantee that, there is no point in computing it.
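A minimal sketch of the self-referencing case (illustrative IR):
```
define i32 @sketch(i32 %x, i1 %c) {
entry:
  br label %loop
loop:
  ; %p lists itself as an incoming value on the backedge; to prove %p is not
  ; undef/poison it is enough to check the other incoming value, %x.
  %p = phi i32 [ %x, %entry ], [ %p, %loop ]
  br i1 %c, label %loop, label %exit
exit:
  ret i32 %p
}
```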