Adds lowering of `addrspacecast [0 -> 20]` to allow easy conversion of
function pointers to Wasm `funcref`
When given a constant function pointer, it lowers to a direct
`ref.func`. Otherwise it lowers to a `table.get` from
`__indirect_function_table` using the provided pointer as the index.
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc.
This caused a couple test failures, likely due to a mid-air collision.
Reverting for now to get the tree back to green and allow the original
author to run UTC/friends and verify the output.
This maybe a bug which is introduced by commit
6749ae36b4a33769e7a77cf812d7cd0a908ae3b9, and has been present ever
since.
In this case, `OtherReg` always overlaps with `DstReg` cause they from
the `Copy` all.
Treat it in the same manner of zero_extend_vector_inreg and generate an
extend_low_u if possible. This is to try an prevent expensive shuffles
from being generated instead. computeKnownBitsForTargetNode has also
been updated to specify known zeros on extend_low_u.
Fixes https://github.com/llvm/llvm-project/issues/165438
With `simd128` enabled, we may meet vector type truncation in FastISel.
To respect #138479, this patch merely bails out on non-integer IR types,
though I prefer bailing out for all non-simple types as most targets
(X86, AArch64) do.
Fill out more information for sign and zero extend and add some truncate
information; however, the primary change is to int/fp conversions. In
particular, fp to (narrow) int appears to be relatively expensive.
The pattern I added for `relaxed dot` similar to normal dot @
https://github.com/llvm/llvm-project/pull/151775.
For `relaxed dot add`, i noticed that in the proposal the portion of dot
implementation is similar to `relaxed dot`, so I think we can add a
pattern where after we do relaxed dot and do extadd pairwise, we can do
`relaxed dot add`.
One current obstacles is I don't think there is any pattern to singly
create a extadd pairwise from other instructions so the `relaxed dot
add` pattern would not cover a wide range of instructions.
related to https://github.com/llvm/llvm-project/issues/55932
Lower v16i8 to v4i32 partial_smla to relaxed_dot_add. I'm still unsure
whether we could/should take advantage of the unknown signedness of the
rhs, and also lower the partial_sumla operation too.
We currently emit a check that the size operand isn't zero, to avoid
executing the wasm memory.copy instruction when it would trap.
But this isn't necessary if the operand is a constant.
Fixes#163245
Reland #161355, after fixing up the cross-projects-tests for the wasm
simd intrinsics.
Original commit message:
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
This code is activated on all INTRINSIC_WO_CHAIN but only handles
a selection. However it was trying to read the arguments before
checking which intrinsic it was handling. This fails for intrinsics
that have no arguments.
`FAKE_USE`s are essentially no-ops, so they have to be removed before
running ExplicitLocals so that `drop`s will be correctly inserted to
drop those values used by the `FAKE_USE`s.
---
This is reapplication of #160228, which broke Wasm waterfall. This PR
additionally prevents `FAKE_USE`s uses from being stackified.
Previously, a 'def' whose first use was a `FAKE_USE` was able to be
stackified as `TEE`:
- Before
```
Reg = INST ... // Def
FAKE_USE ..., Reg, ... // Insert
INST ..., Reg, ...
INST ..., Reg, ...
```
- After RegStackify
```
DefReg = INST ... // Def
TeeReg, Reg = TEE ... DefReg
FAKE_USE ..., TeeReg, ... // Insert
INST ..., Reg, ...
INST ..., Reg, ...
```
And this assumes `DefReg` and `TeeReg` are stackified.
But this PR removes `FAKE_USE`s in the beginning of ExplicitLocals. And
later in ExplicitLocals we have a routine to unstackify registers that
have no uses left:
7b28fcd2b1/llvm/lib/Target/WebAssembly/WebAssemblyExplicitLocals.cpp (L257-L269)
(This was added in #149626. Then it didn't seem it would trigger the
same assertions for `TEE`s because it was fixing the bug where a
terminator was removed in CFGSort (#149097).
Details here:
https://github.com/llvm/llvm-project/pull/149432#issuecomment-3091444141)
- After `FAKE_USE` removal and unstackification
```
DefReg = INST ...
TeeReg, Reg = TEE ... DefReg
INST ..., Reg, ...
INST ..., Reg, ...
```
And now `TeeReg` is unstackified. This triggered the assertion here,
that `TeeReg` should be stackified:
7b28fcd2b1/llvm/lib/Target/WebAssembly/WebAssemblyExplicitLocals.cpp (L316)
This prevents `FAKE_USE`s' uses from being stackified altogether,
including `TEE` transformation. Even when it is not a `TEE`
transformation and just a single use stackification, it does not trigger
the assertion but there's no point stackifying it given that it will be
deleted.
---
Fixes https://github.com/emscripten-core/emscripten/issues/25301.
`FAKE_USE`s are essentially no-ops, so they have to be removed before
running ExplicitLocals so that `drop`s will be correctly inserted to
drop those values used by the `FAKE_USE`s.
Fixes https://github.com/emscripten-core/emscripten/issues/25301.
Rather then defining these tags in each object file that requires them
we can can declare them as undefined and require that they defined
externally in, for example, compiler-rt or libcxxabi.
We currently only support partial.reduce.add in the case where we are
performing a multiply-accumulate. Now add support for any partial
reduction where the input is being extended, where we can take advantage
of extadd_pairwise.
Checking `isOperationLegalOrCustom` instead of `isOperationLegal` allows
more optimization opportunities. In particular, if a target wants to
mark `extract_vector_elt` as `Custom` rather than `Legal` in order to
optimize some certain cases, this combiner would otherwise miss some
improvements.
Previously, using `isOperationLegalOrCustom` was avoided due to the risk
of getting stuck in infinite loops (as noted in
61ec738b60).
After testing, the issue no longer reproduces, but the coverage is
limited to the regression/unit tests and the test-suite.
Replace the existing `f16` test with the version that is uses for other
architectures (typically as `half.ll`). This still covers the
conversions from the existing test, but also adds checks for most simple
ops.
Additionally, rename `half-precision.ll` to `fp-intrinsics.ll` to keep
the name similar to this test.
WebAssemblyRegStackfy checks for writes to the stack pointer to avoid
stackifying across them, but it wasn't prepared for other global_set
instructions (such as writes in addrspace 1).
Fixes#156055
Thanks to @QuantumSegfault for reporting and identifying the offending
code.
First pass where we calculate the cost of the memory operation, as well
as the shuffles required. Interleaving by a factor of two should be
relatively cheap, as many ISAs have dedicated instructions to perform
the (de)interleaving. Several of these permutations can be combined for
an interleave stride of 4 and this is the highest stride we allow.
I've costed larger vectors, and more lanes, as more expensive because
not only is more work is needed but the risk of codegen going 'wrong'
rises dramatically. I also filled in a bit of cost modelling for vector
stores.
It appears the main vector plan to avoid is an interleave factor of 4
with v16i8. I've used libyuv and ncnn for benchmarking, using V8 on
AArch64, and observe geomean improvement of ~3% with some kernels
improving 40-60%.
I know there is still significant performance being left on the table,
so this will need more development along with the rest of the cost
model.
During DAG combine, promote the operands to v8i16 by concanting with an
undef vector and then use extmul_low to perform the mul at i16. Finally,
shuffle the low bytes out of the i16 elements into the result vector.
Fixes https://github.com/llvm/llvm-project/issues/150550.
With the test case
```
void f(unsigned char *x, unsigned char *y, int n) {
// should have been vectorized into avgr_u instead of seperated vectorized add and logical right shift
for (int i = 0; i < n; i++)
x[i] = (x[i] + y[i] + 1) / 2;
}
```
the backend failed to recognize that this can be reduced to avgr_u since
the loop vectorizer doesn't transform into the existing pattern in
tablegen.
This PR sets AVGCEIL_U as legal for v8i16 and v16i8 and selects it to
avgr_u in the tablegen file.
This PR reapplies https://github.com/llvm/llvm-project/pull/149461
In the original `combineVectorSizedSetCCEquality`, the result of setcc
is being negated by returning setcc with the same cond code, leading to
wrong logic.
For example, with
```llvm
%cmp_16 = call i32 @memcmp(ptr %a, ptr %b, i32 16)
%res = icmp eq i32 %cmp_16, 0
```
the original PR producese all_true and then also compares the result
equal to 0 (using the same SETEQ in the returning setcc), meaning that
semantically, it effectively is calling icmp ne.
Instead, the PR should have use SETNE in the returning setcc, this way,
all true return 1, then it is compared again ne 0, which is equivalent
to icmp eq.
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it bacame clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
Fixes https://github.com/llvm/llvm-project/issues/149230
Previously, even with simd enabled via `-mattr=+simd128`, the compiler
cannot utilize v128 to optimize loads and setcc of i128, instead
legalizing it to consecutive i64s.
This PR then adds support for setcc of i128 by converting them to
v16i8's anytrue and alltrue; consequently, this benefits memcmp of 16
bytes or more (when simd128 is present).
The check for enabling this optimization is if the comparison operand is
either a load or an integer in i128, with the comparison code being
either `EQ | NE`, without `NoImplicitFloat` function flag.
Inspiration taken from RISCV's isel lowering.