Custom lower FMINNUM, FMINIMUMNUM, FMAXNUM and FMAXIMUMNUM to generate
relaxed_min and relaxed_max when the inputs cannot be NaN or signed
zero.
Tablegen patterns have also been modified to check the above conditions
when trying to match relaxed min/max using the pmin/pmax pattern.
This patch extends the `load_zero` pattern matching to
support floating-point vector types (`v4f32` and `v2f64`).
Previously, the optimization to generate `v128.load32_zero` and
`v128.load64_zero` was only enabled for integer types
(`v4i32` and `v2i64`). This change adds the necessary TableGen
patterns to correctly match scalar floating-point loads inserted
into zero-initialized vectors.
The pattern I added for `relaxed dot` similar to normal dot @
https://github.com/llvm/llvm-project/pull/151775.
For `relaxed dot add`, i noticed that in the proposal the portion of dot
implementation is similar to `relaxed dot`, so I think we can add a
pattern where after we do relaxed dot and do extadd pairwise, we can do
`relaxed dot add`.
One current obstacles is I don't think there is any pattern to singly
create a extadd pairwise from other instructions so the `relaxed dot
add` pattern would not cover a wide range of instructions.
related to https://github.com/llvm/llvm-project/issues/55932
Lower v16i8 to v4i32 partial_smla to relaxed_dot_add. I'm still unsure
whether we could/should take advantage of the unknown signedness of the
rhs, and also lower the partial_sumla operation too.
Reland #161355, after fixing up the cross-projects-tests for the wasm
simd intrinsics.
Original commit message:
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
We currently only support partial.reduce.add in the case where we are
performing a multiply-accumulate. Now add support for any partial
reduction where the input is being extended, where we can take advantage
of extadd_pairwise.
Fixes https://github.com/llvm/llvm-project/issues/150550.
With the test case
```
void f(unsigned char *x, unsigned char *y, int n) {
// should have been vectorized into avgr_u instead of seperated vectorized add and logical right shift
for (int i = 0; i < n; i++)
x[i] = (x[i] + y[i] + 1) / 2;
}
```
the backend failed to recognize that this can be reduced to avgr_u since
the loop vectorizer doesn't transform into the existing pattern in
tablegen.
This PR sets AVGCEIL_U as legal for v8i16 and v16i8 and selects it to
avgr_u in the tablegen file.
[WebAssembly] Fold fadd contract (fmul contract) to relaxed madd w/
-mattr=+simd128,+relaxed-simd
Fixes#121311
- Precommit test for #121311
- Fold fadd contract (fmul contract) to relaxed madd w/
-mattr=+simd128,+relaxed-simd
- Move PatFrag of fadd_contract in ARM.td and WebAssembly.td to
TargetSelectionDAG.td for reuse of pattern
Enable the use of partial.reduce.add that we can lower to dot or a tree
of (add (extmul_low_u, extmul_high_u)) for the unsigned case. We support
both v8i16 and v16i8 inputs.
Getting this to work required a few additional changes:
- Add builtins for any instructions that can't be done with plain C
currently.
- Add support for the saturating version of fp_to_<s,i>_I16x8. Other
vector sizes supported this already.
- Support bitcast of f16x8 to v128. Needed to return a __f16x8 as
v128_t.
Instead of splatting a single lane, to initialise a build_vector, lower
to scalar_to_vector which can be selected to load_zero.
Also add load_zero and load_lane patterns for f32x4 and f64x2.
With fast-math, the ordered setcc nodes are converted to setcc nodes
which do not care about NaNs, so add patterns that use setlt, setle,
setgt and setge.
This reuses most of the code that was created for f32x4 and f64x2 binary
instructions and tries to follow how they were implemented.
add/sub/mul/div - use regular LL instructions
min/max - use the minimum/maximum intrinsic, and also have builtins
pmin/pmax - use the wasm.pmax/pmin intrinsics and also have builtins
Specified at:
29a9b9462c/proposals/half-precision/Overview.md
Adds a builtin and intrinsic for the f16x8.splat instruction.
Specified at:
29a9b9462c/proposals/half-precision/Overview.md
Note: the current spec has f16x8.splat as opcode 0x123, but this is
incorrect and will be changed to 0x120 soon.
Previously we expected lane constants to be in the range of signed
values for each lane size, but the included test case produced large
unsigned values that fall outside that range. Allow instruction
selection to proceed in this case rather than failing.
Fixes#63817.
WebAssembly doesn't expose inexact exceptions, so frint can be mapped to
fnearbyint. Likewise, WebAssembly always rounds ties-to-even, so
froundeven can be mapped to fnearbyint.
Differential Revision: https://reviews.llvm.org/D153451
This intrinsic is meant to lower directly to the i8x16.shuffle instruction,
which takes its lane index arguments as immmediates. The ISel for the intrinsic
assumed that the lane index arguments were constants, so bitcode that
"incorrectly" used this intrinsic with non-immediate arguments caused an
assertion failure in the backend.
Avoid the crash by defining the lane index arguments to be immediates, matching
the underlying instruction. Update ISel accordingly. This change means that the
bitcode that previously caused a crash will now fail to validate.
Fixes#55559.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D149898
The wasm64 versions of the v128.storeX_lane instructions was incorrectly defined
as returning a v128 value, which resulted in spurious drop instructions being
emitted and causing validation to fail. This was not caught earlier because
wasm64 has been experimental and not well tested. Update the relevant test file
to test both wasm32 and wasm64.
Fixes#62443.
Differential Revision: https://reviews.llvm.org/D149780
After change with D144169, the codegen generates redundant instructions
like and and wrap. This fixes it.
Differential Revision: https://reviews.llvm.org/D144360
Each operation was missing their inverted condition using olt or ogt.
Also, as we don't need to discern +/-0, I think we should also be
able to use ole and oge.
Differential Revision: https://reviews.llvm.org/D143581
Splats were selected by matching on uses of `build_vector` with
identical elements, but a while back a target independent node for
vector splatting was added.
This removes the WebAssembly specific LOAD_SPLAT intrinsic, and instead
makes SPLAT_VECTOR legal and adds patterns for splat loads.
Differential Revision: https://reviews.llvm.org/D139871
This continues the refactoring work of selecting offset + address
operands with the AddrOpsN pattern, previously called LoadOpsN.
This is not an NFC, since constant addresses are now folded into the
offset in more places for v128.storeN_lane.
Differential Revision: https://reviews.llvm.org/D139950
This refactors out the offset and address operand pattern matching into
a ComplexPattern, so that one pattern fragment can match the dynamic and
static (offset) addresses in all possible positions.
Split out from D139530, which also contained an improvement to global
address folding.
Differential Revision: https://reviews.llvm.org/D139631