Use a setne with 0 instead of a trunc. We know we zero extended the node
so we can get by with a non-zero check only. The truncate lowering
doesn't know that we zero extended so has to mask the lsb.
I don't think DAG combine sees the trunc before we lower it to RISCVISD
nodes so we don't get a chance to use computeKnownBits to remove the
AND.
Add new elementwise popcount builtin to support HLSL function
'countbits'.
elementwise popcount only accepts integer types.
Add hlsl intrinsic 'countbits'
Closes#99094
This fixes#108936, but the calling convention doesn't match with GCC. I
doubt we have such a lib function for now, so leave the calling
convention as is.
For cross-lane shuffles, extract the mask operand (uppper) subvector directly, and make use of the free implicit extraction of the lowest subvector of the result.
For f16 with zvfhmin, we promote most ops and VP ops to f32. This does
the same for bf16 with zvfbfmin, so the two fp types should now be in
sync.
There are a few places in the custom lowering where we need to check for
a LMUL 8 f16/bf16 vector that can't be promoted and must be split, this
extracts that out into isPromotedOpNeedingSplit.
In a follow up NFC we can deduplicate the code that sets up the
promotions.
This adds zvfhmin test coverage for fceil, ffloor, fnearbyint, frint,
fround and froundeven and splits them at nxv32f16 to avoid crashing,
similarly to what we do for other nodes that we promote.
This also sets ftrunc to promote which was previously missing. We
already promote the VP version of it, vp_froundtozero.
Marking it as promoted affects some of the cost model tests since
they're no longer expanded.
This adds support for binary generation for the new EH proposal.
So far the only case that we emitted variable immediate operands in
binary has been `br_table`'s destinations. (Other `variable_ops` uses in
TableGen files are register operands, such as the operands of `call`, so
they don't get emitted in binary as a part of the same instruction.)
With this PR, variable immediate operands can include `try_table`'s
operands:
- The number of of catch clauses
- catch clauses sub-opcodes
- `catch`: 0x00
- `catch_ref`: 0x01
- `catch_all`: 0x02
- `catch_all_ref`: 0x03
- catch clauses' destinations
With `try_table`, we now have variable expr operands for `try_table`'s
catch clauses' tags. We treat their fixups in the same way we do for
tags in other instructions such as in `throw`.
Diff without whitespace will be easier to view.
These tests are good candidate for VL optimization. This is a pre-commit for
PR #108640, but can could probably also be improved by the peephole VL
optimizations.
Reverted due to large .debug_line size regressions for some
configurations; work currently in place to improve the output of this
behaviour in PR #108251.
This patch also modifies two tests that were created or modified after
the original commit landed and are affected by the revert:
llvm/test/CodeGen/X86/pseudo_cmov_lower2.ll
llvm/test/DebugInfo/X86/empty-line-info.ll
This reverts commit 5fef40c2c477e92187bd4e5c18091eca6b8465cc.
This is mostly a copy of the AArch64PostLegalizerLoweringPass, except it
removes all of the AArch64 combines.
This pass allows us to lower instructions after the generic
post-legalization combiner has had a chance to run.
We will be adding combines to this pass in future patches.
I realized after spending way too much time looking at this, that
we can avoid objdump entirely here by having the assembly simply
not print the aliases. Once we do that, we can simply autogen
this test, and updates become trivial and understandable.
Since we are using the Scalarizer pass in the backend we needed a way to
allow this pass to operate on Target intrinsics.
We achieved this by adding `TargetTransformInfo ` to the Scalarizer
pass. This allowed us to call a function available to the DirectX
backend to know if an intrinsic is a target intrinsic that should be
scalarized.
Two major changes:
- Remove use of sed preprocessing - this was being used to create two
versions of each test, and the result is much more readable if we
just duplicate the tests.
- Use a regex for matching the condition. An upcoming change causes
us to reverse the branch direction (which doesn't matter to the
purpose of these tests at all), so using the regex makes the test
more stable.
Per the comment, this test is intending to test the first constant which
can't be encoded via a c.addi. However, -32 *can* be encoded as in a
c.addi, and all that's preventing it from doing so is the register
allocators choice to use a difference destination register on the
add than it's source. (Which compressed doesn't support.)
The current LLC codegen for this test looks like:
addi a1, a0, -32
li a0, -99
bnez a1, .LBB0_2
li a0, 42
.LBB0_2:
ret
After https://github.com/llvm/llvm-project/pull/108889, we sink the LI, and
the register allocator picks the same source and dest register for the addi
resulting in the c.addi form being emitted. So, to avoid a confusing diff
let's fix the test to check what was originally intended.
This patch adds support for the missing STRICT_UINT_TO_FP and
STRICT_SINT_TO_FP for riscv and adds a test case for rv32 which was
previously crashing.
The code is in line with how other strict_* nodes are handled
(e.g., getting op(1) instead of op(0) when it's a strict node, as op(0)
in a strict node is the entry token).
The implementation was missing the fact that `G_EXTRACT_SUBVECTOR`
destination and source vector can be different types.
Also fix a bug in the MIR builder for `G_EXTRACT_SUBVECTOR` to generate
the correct opcode.
Clarify the G_EXTRACT_SUBVECTOR specification.
Using the LSR or LSL aliases of UBFM can be faster on some CPUs, so it
is worth changing 64 bit UBFM instructions, that are equivalent to 32
bit LSR/LSL operations, to 32 bit variants.
This change folds the following patterns:
* If `Imms == 31` and `Immr <= Imms`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr, Imms`
* If `Immr == Imms + 33`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr - 32, Imms`
We currently make sure to check that if folding an op to an f16 widening
op that we have zvfh. We need to do the same for bf16 vectors, but with
the further restriction that we can only combine vfmadd_vl to vfwmadd_vl
(to get vfwmaccbf16.v{v,f}).
The added test case currently crashes because we try to fold an add to a
bf16 widening add, which doesn't exist in zvfbfmin or zvfbfwma
This moves the checks into the extension support checks to keep it one
place.
The commit introduces support for fundamental DI instruction. Metadata
handlers required for this instruction is stored inside debug records
(https://llvm.org/docs/SourceLevelDebugging.html) parts of the module
which rises the necessity of it's traversal.
Use SWAR techniques for arithmetic shifts: we use the same technique as
logical right shift but with an additional step of sign extending the
result.
Also, use the logical shift left technique even on AVX512 as vpmovzxbw
and vpmovwb are actually quite expensive.
When there is a function signature mismatch between a call
instruction and the callee, lower the call to an indirect call. The
current behavior is to produce direct calls that may or may not be
valid PTX. Consider the following example with mismatching return
types:
```
%struct.1 = type <{i64}>
%struct.2 = type <{i64}>
declare %struct.1 @callee()
...
%call1 = call %struct.2 @callee()
%call2 = call i64 @callee()
```
The return type of `callee` in PTX is `.b8 _[8]`. The return type of
`%call1` will be the same and so the PTX has no problems. The return
type of `%call2` will be `.b64`, so the types will not match and PTX
will be unacceptable to ptxas. This despite all the types having the
same size. The same is true for mismatching parameter types.
If we instead convert these calls to indirect calls, we will generate
functional PTX when the types have the same size. If they do not have
the same size then the PTX will likely be incorrect, though this will not
necessarily be caught by ptxas. Also, even if the sizes are the same, if
the types differ then it is technically undefined behavior. This change
allows for more flexibility in the bitcode that can be lowered to
functioning PTX, at the cost of sometimes producing PTX that is less
clearly wrong than it would have been previously (i.e. incorrect indirect
calls are not as obviously wrong as incorrect direct calls). We consider
it okay to generate PTX with undefined behavior as the behavior of
calls with mismatching types is not explicitly defined.
In the process of adding scalarization support for DirectX target
intrinsics I found that intrinsics that weren't marked with `IntrNoMem`
did not get removed by
`RecursivelyDeleteTriviallyDeadInstructionsPermissive`. So this change
is to make it more clear that our intrinsics don't have side effects.
I only added `IntrNoMem` to the intrinics in `IntrinsicsDirectX.td` I
was involved with. There a potentially a few other cases that might
warrant this attribute, but will need input on the others.
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
This is an implementation of the saturating fp to int conversions for
GlobalISel. On AArch64 the converstion instrctions work this way,
producing saturating results. LegalizerHelper::lowerFPTOINT_SAT is
ported from SDAG.
AArch64 has a lot of existing tests for fptosi_sat, covering a wide
range of types. I have tried to make most of them work all at once, but
a few fall back due to other missing features such as f128 handling for
min/max.