Split the input vector with an extend_low and high and then split the
results again with extend_low and high for a total of 6 instructions.
This is removes 3 shuffles and a couple of extends.
The failures should have been resolved with #180290 (which also added
WebAssembly tests).
This reverts commit 811fb223af2b3e2d68c99b346f4b75dcf3de3417.
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Currently the WebAssembly backend crashes when trying to lower some
extract.last.active intrinsic calls. Mark their cost as invalid
temporarily, to avoid them being introduced by the loop
vectorizer after 2abd6d6d7ac (#158088).
We've already optimised these, so update the cost model to reflect it.
And skip the isBeforeLegalize check when lowering i8 muls, because it
then misses the cases where, say v32i8, has been type legalised into 2x
v16i8.
Also explicitly disable memory interleaving for any factor other than
two or four.
Resolves#169058.
This adds ~~an InstCombine pass~~ a TTI hook to the WebAssembly backend
that folds `i8x16.swizzle` and `i8x16.relaxed.swizzle` operations to
`shufflevector` operations if their mask operands are constant.
This is mainly useful for abstractions over the raw intrinsics--for
instance, in architecture-generic SIMD code that may not be able to
expose the constant shuffles due to type system limitations.
I took most of this from the x86 backend (in particular,
`simplifyX86vpermilvar` in `X86InstCombineIntrinsic`), and adapted it
for the WebAssembly backend. There wasn't any previous
`instCombineIntrinsic` method on the WebAssembly `TargetTransformInfo`,
so I added it. Right now, this swizzle optimization is the only one it
performs.
As I noted in the transform itself, the "relaxed" swizzle actually has
stricter preconditions than the non-relaxed one. If a non-negative but
still out-of-bounds index is provided, the "relaxed" swizzle can choose
between returning 0 and the lane at the index modulo 16. However, it
must make the same choice every time, and we don't know which choice the
runtime will make, so we can't constant-fold it.
The regression tests were mostly generated by Claude and adapted a bit
by me (I tried to follow the [InstCombine contributor
guide](https://llvm.org/docs/InstCombineContributorGuide.html#tests)).
There was previously no WebAssembly subdirectory within the InstCombine
tests, so I created that too; as of now, the swizzle fold test is the
only file in it. Everything else was written by myself (well, partly
copy-pasted from the x86 backend).
I'm not sure how to write an Alive2 test for this; I can't find any
examples where the input is an arbitrary constant.
Fill out more information for sign and zero extend and add some truncate
information; however, the primary change is to int/fp conversions. In
particular, fp to (narrow) int appears to be relatively expensive.
We currently only support partial.reduce.add in the case where we are
performing a multiply-accumulate. Now add support for any partial
reduction where the input is being extended, where we can take advantage
of extadd_pairwise.
First pass where we calculate the cost of the memory operation, as well
as the shuffles required. Interleaving by a factor of two should be
relatively cheap, as many ISAs have dedicated instructions to perform
the (de)interleaving. Several of these permutations can be combined for
an interleave stride of 4 and this is the highest stride we allow.
I've costed larger vectors, and more lanes, as more expensive because
not only is more work is needed but the risk of codegen going 'wrong'
rises dramatically. I also filled in a bit of cost modelling for vector
stores.
It appears the main vector plan to avoid is an interleave factor of 4
with v16i8. I've used libyuv and ncnn for benchmarking, using V8 on
AArch64, and observe geomean improvement of ~3% with some kernels
improving 40-60%.
I know there is still significant performance being left on the table,
so this will need more development along with the rest of the cost
model.
This PR reapplies https://github.com/llvm/llvm-project/pull/149461
In the original `combineVectorSizedSetCCEquality`, the result of setcc
is being negated by returning setcc with the same cond code, leading to
wrong logic.
For example, with
```llvm
%cmp_16 = call i32 @memcmp(ptr %a, ptr %b, i32 16)
%res = icmp eq i32 %cmp_16, 0
```
the original PR producese all_true and then also compares the result
equal to 0 (using the same SETEQ in the returning setcc), meaning that
semantically, it effectively is calling icmp ne.
Instead, the PR should have use SETNE in the returning setcc, this way,
all true return 1, then it is compared again ne 0, which is equivalent
to icmp eq.
Fixes https://github.com/llvm/llvm-project/issues/149230
Previously, even with simd enabled via `-mattr=+simd128`, the compiler
cannot utilize v128 to optimize loads and setcc of i128, instead
legalizing it to consecutive i64s.
This PR then adds support for setcc of i128 by converting them to
v16i8's anytrue and alltrue; consequently, this benefits memcmp of 16
bytes or more (when simd128 is present).
The check for enabling this optimization is if the comparison operand is
either a load or an integer in i128, with the comparison code being
either `EQ | NE`, without `NoImplicitFloat` function flag.
Inspiration taken from RISCV's isel lowering.
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.
Pull Request: https://github.com/llvm/llvm-project/pull/136655
The main change is making `thisT` method `const`, the rest of the
changes is fixing compilation errors (*).
(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled in to derived classes using the `using` directive, and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need to make all hiding
overloads `const`. See the second commit in this PR.
Pull Request: https://github.com/llvm/llvm-project/pull/136575
Enable the use of partial.reduce.add that we can lower to dot or a tree
of (add (extmul_low_u, extmul_high_u)) for the unsigned case. We support
both v8i16 and v16i8 inputs.
Add inital implementations of these TTI methods for SIMD types. For
casts, The costing covers the free extensions provided by extmul_low as
well as extend_low. For memory operations we consider the use of
load32_zero and load64_zero, as well as full width v128 loads.
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
WebAssembly doesn't support horizontal operations nor does it have a way
of expressing fast-math or reassoc flags, so runtimes are currently
unable to use pairwise operations when generating code from the existing
shuffle patterns.
This patch allows the backend to select which, arbitary, shuffle pattern
to be used per reduction intrinsic. The default behaviour is the same as
the existing, which is by splitting the vector into a top and bottom
half. The other pattern introduced is for a pairwise shuffle.
WebAssembly enables pairwise reductions for int/fp add/sub.
It seems TypeSize is currently broken in the sense that:
TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)
without failing its assert that explicitly tests for this case:
assert(LHS.Scalable == RHS.Scalable && ...);
The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.
This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.
The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
And so does getShuffleCost() to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds an argument CostKind to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.
This is the change which motivated the whole sequence which preceeded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional to make sure that every call site possible effected must show up in the diff. This let me audit each one closely.
This is part of an ongoing transition to use OperandValueInfo which combines OperandValueKind and OperandValueProperties. This change adds some accessor methods and uses them to simplify backend code. The primary motivation of doing so is removing uses of the parameters so that an upcoming api change is less error prone.
enabled
The C++20 Coroutines couldn't be compiled to WebAssembly due to an
optimization named symmetric transfer requires the support for musttail
calls but WebAssembly doesn't support it yet.
This patch tries to fix the problem by adding a supportsTailCalls
method to TargetTransformImpl to skip the symmetric transfer when
tail-call feature is not supported.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D128794
I'm not sure this is the best way to approach this,
but the situation is rather not very detectable unless we explicitly call it out when refusing to advise to unroll.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D107271
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D98874
Enable partial and runtime unrolling with a threshold of 30, which
was derived from a large number of kernels running on node and
wasmtime for amd64 and aarch64.
Unrolling is enabled by default at -O2 and -O3 and is disabled at
-Oz and -Os. Compiling with -Os is recommended if the wasm binary
size is the most important factor.
Differential Revision: https://reviews.llvm.org/D95125
Allow inlining only when the Callee has a subset of the Caller's
features. In principle, we should be able to inline regardless of any
features because WebAssembly supports features at module granularity,
not function granularity, but without this restriction it would be
possible for a module to "forget" about features if all the functions
that used them were inlined.
Requested in PR46812.
Differential Revision: https://reviews.llvm.org/D85494
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.
RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html
Differential Revision: https://reviews.llvm.org/D79002
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374634
Also Revert "[LoopVectorize] Fix non-debug builds after rL374017"
This reverts commit 9f41deccc0e648a006c9f38e11919f181b6c7e0a.
This reverts commit 18b6fe07bcf44294f200bd2b526cb737ed275c04.
The patch is breaking PowerPC internal build, checked with author, reverting
on behalf of him for now due to timezone.
llvm-svn: 374091
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374017
Summary:
This patch fixes clang-tidy warnings on wasm-only files.
The list of checks used is:
`-*,clang-diagnostic-*,llvm-*,misc-*,-misc-unused-parameters,readability-identifier-naming,modernize-*`
(LLVM's default .clang-tidy list is the same except it does not have
`modernize-*`. But I've seen in multiple CLs in LLVM the modernize style
was recommended and code was fixed based on the style, so I added it as
well.)
The common fixes are:
- Variable names start with an uppercase letter
- Function names start with a lowercase letter
- Use `auto` when you use casts so the type is evident
- Use inline initialization for class member variables
- Use `= default` for empty constructors / destructors
- Use `using` in place of `typedef`
Reviewers: sbc100, tlively, aardappel
Subscribers: dschuff, sunfish, jgravelle-google, yurydelendik, kripken, MatzeB, mgorny, rupprecht, llvm-commits
Differential Revision: https://reviews.llvm.org/D57500
llvm-svn: 353075
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636