Corrected various spelling mistakes such as 'occurred', 'receiver',
'initialized', 'length', and others in comments, variable names,
function names, and documentation throughout the project. These
changes improve code readability and maintain consistency in naming
and documentation.
Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
We've already optimised these, so update the cost model to reflect it.
And skip the isBeforeLegalize check when lowering i8 muls, because it
then misses the cases where, say v32i8, has been type legalised into 2x
v16i8.
Also explicitly disable memory interleaving for any factor other than
two or four.
Commit
6ad41bcc49
changed how frem is expanded during legalization and it
broke WebAssembly but we were missing test coverage. We want to maintain
our previous behavior of unrolling vectors and using a libcall to
implement scalar frem. I'm not sure why this now has to be different
(in ISelLowering) from other libcalls like fsin which work the same way
in the end, but this code does accurately describe what we want.
Fixes: https://github.com/emscripten-core/emscripten/issues/25991
The existing condition for checking whether or not to expand an frem
instruction in expand-fp is not sufficiently precise.
The expansion on other targets than AMDGPU - which is the only intended
user right now - is only prevented due to the interaction with the
MaxLegalFpConvertBitWidth check. Relying on this is conceptually wrong
and limits the use of the pass for other targets and further expansions
(e.g. merging with the similar ExpandLargeDivRem pass).
Change the expansion criterion to always expand frem of a given type
for targets that use "Expand" as the legalization action for the
underlying scalar type and use this to exit the pass early for targets
which do not require any expansions. This requires to change the
frem legalization action for all targets which do not want frem to
be expanded in this pass from "Expand" to "LibCall".
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Adds lowering of `addrspacecast [0 -> 20]` to allow easy conversion of
function pointers to Wasm `funcref`
When given a constant function pointer, it lowers to a direct
`ref.func`. Otherwise it lowers to a `table.get` from
`__indirect_function_table` using the provided pointer as the index.
Treat it in the same manner of zero_extend_vector_inreg and generate an
extend_low_u if possible. This is to try an prevent expensive shuffles
from being generated instead. computeKnownBitsForTargetNode has also
been updated to specify known zeros on extend_low_u.
Currently LibcallLoweringInfo is defined inside of TargetLowering,
which is owned by the subtarget. Pass in the subtarget so we can
construct LibcallLoweringInfo with the subtarget. This is a temporary
step that should be revertable in the future, after LibcallLoweringInfo
is moved out of TargetLowering.
This allows SDNodes to be validated against their expected type profiles
and reduces the number of changes required to add a new node.
CALL and RET_CALL do not have a description in td files, and it is not
currently possible to add one as these nodes have both variable operands
and variable results.
This also fixes a subtle bug detected by the enabled verification
functionality. `LOCAL_GET` is declared with `SDNPHasChain` property, and
thus should have both a chain operand and a chain result. The original
code created a node without a chain result, which caused a check in
`SDNodeInfo::verifyNode()` to fail.
Part of #119709.
Pull Request: https://github.com/llvm/llvm-project/pull/166259
We currently emit a check that the size operand isn't zero, to avoid
executing the wasm memory.copy instruction when it would trap.
But this isn't necessary if the operand is a constant.
Fixes#163245
Reland #161355, after fixing up the cross-projects-tests for the wasm
simd intrinsics.
Original commit message:
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
This code is activated on all INTRINSIC_WO_CHAIN but only handles
a selection. However it was trying to read the arguments before
checking which intrinsic it was handling. This fails for intrinsics
that have no arguments.
We currently only support partial.reduce.add in the case where we are
performing a multiply-accumulate. Now add support for any partial
reduction where the input is being extended, where we can take advantage
of extadd_pairwise.
During DAG combine, promote the operands to v8i16 by concanting with an
undef vector and then use extmul_low to perform the mul at i16. Finally,
shuffle the low bytes out of the i16 elements into the result vector.
Fixes https://github.com/llvm/llvm-project/issues/150550.
With the test case
```
void f(unsigned char *x, unsigned char *y, int n) {
// should have been vectorized into avgr_u instead of seperated vectorized add and logical right shift
for (int i = 0; i < n; i++)
x[i] = (x[i] + y[i] + 1) / 2;
}
```
the backend failed to recognize that this can be reduced to avgr_u since
the loop vectorizer doesn't transform into the existing pattern in
tablegen.
This PR sets AVGCEIL_U as legal for v8i16 and v16i8 and selects it to
avgr_u in the tablegen file.
This PR reapplies https://github.com/llvm/llvm-project/pull/149461
In the original `combineVectorSizedSetCCEquality`, the result of setcc
is being negated by returning setcc with the same cond code, leading to
wrong logic.
For example, with
```llvm
%cmp_16 = call i32 @memcmp(ptr %a, ptr %b, i32 16)
%res = icmp eq i32 %cmp_16, 0
```
the original PR producese all_true and then also compares the result
equal to 0 (using the same SETEQ in the returning setcc), meaning that
semantically, it effectively is calling icmp ne.
Instead, the PR should have use SETNE in the returning setcc, this way,
all true return 1, then it is compared again ne 0, which is equivalent
to icmp eq.
Fixes https://github.com/llvm/llvm-project/issues/149230
Previously, even with simd enabled via `-mattr=+simd128`, the compiler
cannot utilize v128 to optimize loads and setcc of i128, instead
legalizing it to consecutive i64s.
This PR then adds support for setcc of i128 by converting them to
v16i8's anytrue and alltrue; consequently, this benefits memcmp of 16
bytes or more (when simd128 is present).
The check for enabling this optimization is if the comparison operand is
either a load or an integer in i128, with the comparison code being
either `EQ | NE`, without `NoImplicitFloat` function flag.
Inspiration taken from RISCV's isel lowering.
The information whether a specific argument is vararg or fixed is
currently stored separately from all the other argument information in
ArgFlags. This means that it is not accessible from CCAssign, and
backends have developed all kinds of workarounds for how they can access
it after all.
Move this information to ArgFlags to make it directly available in all
relevant places.
I've opted to invert this and store it as IsVarArg, as I think that both
makes the meaning more obvious and provides for a better default (which
is IsVarArg=false).
During target DAG combine, use two i16x8.extmul_low_i8x16 and a shuffle
for v16i8 mul.
On my AArch64 machine, using V8, I observe a 3.14% geomean improvement
across 65 benchmarks, including: 9.2% for spec2017.x264, 6% for libyuv
and 1.8% for ncnn.
Fixes https://github.com/llvm/llvm-project/issues/117200.
The default behavior in TargetLoweringBase is only scalar floats on fexp
are supported by default, not the vectorized version. This PR adds
`ISD::FEXP10` to the supported list.
This adds an llvm intrinsic for WebAssembly to test the type of a
function. It is intended for adding a future clang builtin
` __builtin_wasm_test_function_pointer_signature` so we can test whether
calling a function pointer will fail with function signature mismatch.
Since the type of a function pointer is just `ptr` we can't figure out
the expected type from that.
The way I figured out to encode the type was by passing 0's of the
appropriate type to the intrinsic.
The first argument gives the expected type of the return type and the
later values give the expected
type of the arguments. So
```llvm
@llvm.wasm.ref.test.func(ptr %func, float 0.000000e+00, double 0.000000e+00, i32 0)
```
tests if `%func` is of type `(double, i32) -> (i32)`. It will lower to:
```wat
local.get $func
table.get $__indirect_function_table
ref.test (double, i32) -> (i32)
```
To indicate the function should be void, I somewhat arbitrarily picked
`token poison`, so the following tests for `(i32) -> ()`:
```llvm
@llvm.wasm.ref.test.func(ptr %func, token poison, i32 0)
```
To lower this intrinsic, we need some place to put the type information.
With `encodeFunctionSignature()` we encode the signature information
into an `APInt`. We decode it in `lowerEncodedFunctionSignature` in
`WebAssemblyMCInstLower.cpp`.
convert_iKxN_s is canonicalized into convert_iKxN_u when the argument is
known to have sign bit 0. This results in emitting Wasm opcodes that, on
some targets (like x86_64), are dramatically slower than signed versions
on major engines.
Similarly to X86, we now fix this up in isel when the instruction has
nonneg flag from canonicalization or if we know the source has zero sign
bit.
Fixes#149457.
Fixes https://github.com/llvm/llvm-project/issues/50142, a miss of
further vectorization, where we can only achieve zext (xor (any_true),
-1).
Now in test case simd-setcc-reductions, it's converted to all_true.
Also fixes https://github.com/llvm/llvm-project/issues/145177, which is
all_true (setcc x, 0, eq) -> not any_true
any_true (setcc x, 0, ne) -> any_true
all_true (setcc x, 0, ne) -> all_true
---------
Co-authored-by: badumbatish <--show-origin>
[WebAssembly] [Backend] Wasm optimize illegal bitmask for #131980.
Currently, the case for illegal bitmask (v32i8 or v64i8) is that at the
SelectionDag level, two (four) vectors of v128 will be concatenated
together, then they'll all be SETCC by the same pseudo illegal
instruction, which requires expansion later on.
I opt for SETCC-ing them seperately, bitcast and zext them and then add
them up together in the end.
---------
Co-authored-by: badumbatish <--show-origin>
This commit is the result of investigation and discussion on
WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128`
instruction were discussed but ultimately deferred to a future proposal.
In spite of this though I wanted to apply a few changes to the LLVM
backend here with `wide-arithmetic` enabled for a few minor changes:
* A lowering for the `ISD::UADDO` node is added which uses `add128`
where the upper bits of the two operands are constant zeros and the
result of the 128-bit addition is the result of the overflowing
addition.
* The high bits of a `I64_ADD128` node are now flagged as "known zero"
if the upper bits of the inputs are also zero, assisting this `UADDO`
lowering to ensure the backend knows that the carry result is a 1-bit
result.
A few tests were then added to showcase various lowerings for various
operations that can be done with wide-arithmetic. They don't all
optimize super well at this time but I wanted to add them as a reference
here regardless to have them on-hand for future evaluations if
necessary.
The standard libcalls for half to float and float to half conversion are
__extendhfsf2 and __truncsfhf2. However, LLVM currently uses
__gnu_h2f_ieee and __gnu_f2h_ieee instead. As far as I can tell, these
libcalls are an ARM-ism and only provided by libgcc on that platform.
compiler-rt always provides both libcalls.
Use the standard libcalls by default, and only use the __gnu libcalls on
ARM.
When lowering EXTEND_VECTOR_INREG, check whether the operand is a
shuffle that is moving the top half of a vector into the lower half. If
so, we can EXTEND_HIGH the input to the shuffle instead.
Enable the use of partial.reduce.add that we can lower to dot or a tree
of (add (extmul_low_u, extmul_high_u)) for the unsigned case. We support
both v8i16 and v16i8 inputs.