This avoids the call overhead as well as the save/restore of
fflags and the snan handling in the libm function.
The save/restore of fflags and snan handling are needed to be
correct for -ftrapping-math. I think we can ignore them in the
default environment.
The inline sequence will generate an invalid exception for nan
and an inexact exception if fractional bits are discarded.
I've used a custom inserter to explicitly create the control flow
around the float->int->float conversion.
We can probably avoid the final fsgnj after the conversion for
no signed zeros FMF, but I'll leave that for future work.
Note the comparison constant is slightly different from the one glibc uses.
They use 1<<53 for double, I'm using 1<<52. I believe either is valid.
Numbers >= 1<<52 can't have any fractional bits. It's ok to do the
float->int->float conversion on numbers between 1<<52 and 1<<53 since
they will all fit in i64. We only have a problem if the double can't fit
in i64.
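The float->int->float idea above can be sketched in portable C++. This is only an illustration under stated assumptions: the helper name truncViaInt is hypothetical, it shows truncation toward zero only (other rounding modes would need a different conversion), and it is not the actual RISC-V instruction sequence or the custom inserter.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not from the patch): truncate toward zero by
// converting double -> int64 -> double, guarded by a magnitude check.
double truncViaInt(double x) {
  constexpr double kLimit = 4503599627370496.0; // 1<<52 as a double
  // Values with magnitude >= 1<<52 have no fractional bits, so they are
  // returned unchanged. NaN compares false and also takes this path.
  if (!(std::fabs(x) < kLimit))
    return x;
  // Safe: |x| < 1<<52 always fits in int64_t. The cast truncates toward zero.
  const auto asInt = static_cast<std::int64_t>(x);
  const double result = static_cast<double>(asInt);
  // Copy the original sign back so that e.g. -0.5 yields -0.0
  // (the role played by the final fsgnj).
  return std::copysign(result, x);
}
```

Without the final copysign, an input such as -0.5 would come back as +0.0, which is why the sign injection can only be dropped when signed zeros do not matter.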
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D136508
This is a follow-on to https://reviews.llvm.org/D134073.
The number of MIPS16 changes here is a bit surprising. Many of the
fields with mismatched names were NOT previously choosing the correct
argument positionally, but instead doing something completely wrong
(e.g. it would encode a register where an immediate was expected).
But, machine-code generation for MIPS16 has never actually functioned.
It's also fully untested; thus the MIPS16 changes, despite changing
behavior, break (and fix) zero tests. This change does not fix
MIPS16 output, but it ought to be at least incrementally less broken.
Outside MIPS16, I believe the only functional change is to the 'ginvi'
instruction: it was previously encoding garbage into a field which was
specified to be '00'. Fortunately, it was covered by tests -- and the
tests were testing the incorrect behavior. So, fixed.
Differential Revision: https://reviews.llvm.org/D134220
This is a follow-on to https://reviews.llvm.org/D134073.
It renames a few fields to have consistent names, as well as renaming
operands to match the field names.
Behavior is unchanged by this cleanup. (The only generated code change
is for the disassembler for LDSTUB/LDSTUBA, but in both old and new
versions, it fails to add enough operands, and thus triggers a runtime
abort. I will address that bug in a future commit.)
Differential Revision: https://reviews.llvm.org/D134201
Instead of using vslide1up, use vslide1down and build the other
direction. This avoids the overlap constraint early clobber of
vslide1up.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D136735
Small QoL change to allow Predicates to be used in GICombineRule.
Currently only one combine in the AMDGPU backend makes use of it.
The implementation is pretty simple to get started but of course we can expand this later on and optimize predicate checking better if needed.
Reviewed By: dsanders
Differential Revision: https://reviews.llvm.org/D136681
The miscompile case's G_ZEXT has a G_FREEZE source. Similar to D127154, this patch removes isDef32, relying on the AArch64MIPeephole optimizer to remove redundant SUBREG_TO_REG nodes in GISel as well.
Fixes #58431
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D136433
Epilogue loop vectorization is a feature in the vectorizer intended to avoid running fully scalar code when the vector length of the main loop turns out to be either longer than the trip count of the actual loop, or leaves a huge remainder.
In practice, this feature appears to not have been well tuned. I honestly don't think it should be on by default at all, but it definitely shouldn't be on for RISCV. Note that other targets have also disabled it, but they've done so via disabling interleaving - which is, well, completely unrelated - and we don't want to do that for RISCV.
In the near term, many examples I'm seeing have terrible codegen for epilogue vectorization. We are greatly increasing code size for little value at reasonable VLEN values for small types. In the long term, the cases that epilogue vectorization is intended to handle are likely better handled via tail folding on RISCV.
As an aside, I also don't really trust the correctness of epilogue vectorization. The code structure is such that otherwise straightforward changes sometimes break only epilogue vectorization. The reuse of an existing vplan without careful validation opens significant room for nasty bugs. Given how rarely the code is exercised, that is not a good combination.
As such, this patch introduces a TTI hook, and completely disables epilogue vectorization on RISCV.
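To make the trade-off concrete, here is a rough model of how a trip count splits across a main vector loop, a narrower epilogue vector loop, and the scalar remainder. It is a simplification under stated assumptions: the names LoopSplit and splitTripCount are hypothetical, the widths must be nonzero, and interleaving and the vectorizer's minimum-iteration checks are ignored.

```cpp
#include <cstddef>

// Hypothetical model (not LLVM code): iteration counts for each loop.
struct LoopSplit {
  std::size_t mainIters;     // iterations of the main vector loop
  std::size_t epilogueIters; // iterations of the epilogue vector loop
  std::size_t scalarIters;   // iterations left for the scalar loop
};

// mainVF and epilogueVF are elements processed per vector iteration;
// both are assumed to be nonzero.
LoopSplit splitTripCount(std::size_t tripCount, std::size_t mainVF,
                         std::size_t epilogueVF) {
  LoopSplit split{};
  split.mainIters = tripCount / mainVF;
  const std::size_t remainder = tripCount % mainVF;
  split.epilogueIters = remainder / epilogueVF;
  split.scalarIters = remainder % epilogueVF;
  return split;
}
```

For a trip count of 23 with widths 16 and 4, the model gives one main iteration, one epilogue iteration, and three scalar iterations. The extra epilogue loop body is the code-size cost being weighed against the scalar iterations it saves.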
Differential Revision: https://reviews.llvm.org/D136695
Recent Clang changes expose __bf16 types for SSE2-enabled host compilations and
that makes those types visible during GPU-side compilation, where it currently
fails with Sema complaining that __bf16 is not supported.
Considering that __bf16 is a storage-only type, enabling it for NVPTX if it's
enabled on the host should pose no issues, correctness-wise.
Recent NVIDIA GPUs have introduced bf16 support, so we'll likely grow better
support for __bf16 on NVPTX going forward.
Differential Revision: https://reviews.llvm.org/D136311
This patch adds the assembly/disassembly for the following instructions:
SDOT (4-way, multiple and single vector): Multi-vector signed integer dot-product by vector.
SDOT (4-way, multiple vectors): Multi-vector signed integer dot-product.
UDOT (4-way, multiple and single vector): Multi-vector unsigned integer dot-product by vector.
UDOT (4-way, multiple vectors): Multi-vector unsigned integer dot-product.
for groups of 2 and 4 ZA registers
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D135563
Differential Revision: https://reviews.llvm.org/D135760
This patch adds the assembly/disassembly for the following instruction:
INT:
SDOT (2-way, multiple and single vector): Multi-vector signed integer dot-product by vector.
(2-way, multiple vectors): Multi-vector signed integer dot-product.
UDOT (2-way, multiple and single vector): Multi-vector unsigned integer dot-product by vector.
(2-way, multiple vectors): Multi-vector unsigned integer dot-product.
SUDOT (multiple and indexed vector): Multi-vector signed by unsigned integer dot-product by indexed elements.
(multiple and single vector): Multi-vector signed by unsigned integer dot-product by vector.
USDOT (multiple and single vector): Multi-vector unsigned by signed integer dot-product by vector.
(multiple vectors): Multi-vector unsigned by signed integer dot-product.
FP:
BFDOT (multiple and single vector): Multi-vector BFloat16 floating-point dot-product by vector.
(multiple vectors): Multi-vector BFloat16 floating-point dot-product.
FDOT (multiple and single vector): Multi-vector half-precision floating-point dot-product by vector.
(multiple vectors): Multi-vector half-precision floating-point dot-product.
For set of 2 and 4 ZA registers
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D135455
Differential Revision: https://reviews.llvm.org/D135683
V_FMAC_F32 and V_DOT2C_F32_F16 have a dummy src2 operand tied to vdst to
inform passes that the instructions read the dst operand. The VOPD
versions of these instructions lacked the dummy operand, which was a
problem for inserting s_delay_alu.
Introduce the dummy src2 operand on the VOPD versions, and fix the VOPD operand
tracking logic to account for it.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D136629
If the inner broadcast scalar type is smaller/same width as the outer broadcast scalar type then we can broadcast using the same inner type directly. Works for vbroadcast_load as well.
This patch adds the assembly/disassembly for the following instructions:
SDOT : Signed integer 2-way dot product, indexed and non-indexed
UDOT : Unsigned integer 2-way dot product, indexed and non-indexed
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Differential Revision: https://reviews.llvm.org/D136464
If OP in PTEST(PG, OP(PG, ...)) has a flag-setting variant, change the
opcode so the PTEST becomes redundant. This patch extends this existing
optimization in AArch64::optimizePTestInstr to cover all flag-setting
opcodes.
Reviewed By: peterwaller-arm
Differential Revision: https://reviews.llvm.org/D136083
This patch adds the assembly/disassembly for the following instructions:
BFMLSLB : BFloat16 floating-point multiply-subtract long
from single-precision (bottom)
BFMLSLT : BFloat16 floating-point multiply-subtract long
from single-precision (top)
Both the vector and indexed forms are added for each.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Differential Revision: https://reviews.llvm.org/D136439
This patch adds the assembly/disassembly for the following instruction:
FP:
FMLA (multiple and indexed vector): Multi-vector floating-point fused multiply-add by indexed element.
FMLS (multiple and indexed vector): Multi-vector floating-point fused multiply-subtract by indexed element.
BFDOT (multiple and indexed vector): Multi-vector BFloat16 floating-point dot-product by indexed element.
FDOT (multiple and indexed vector): Multi-vector half-precision floating-point dot-product by indexed element.
BFVDOT: Multi-vector BFloat16 floating-point vertical dot-product by indexed element.
FVDOT: Multi-vector half-precision floating-point vertical dot-product by indexed element.
INT:
SDOT (2-way, multiple and indexed vector): Multi-vector signed integer dot-product by indexed element.
(4-way, multiple and indexed vector): Multi-vector signed integer dot-product by indexed element.
SUDOT (multiple and indexed vector): Multi-vector signed by unsigned integer dot-product by indexed elements.
SUVDOT: Multi-vector signed by unsigned integer vertical dot-product by indexed element.
UDOT (2-way, multiple and indexed vector): Multi-vector unsigned integer dot-product by indexed element.
(4-way, multiple and indexed vector): Multi-vector unsigned integer dot-product by indexed element.
USDOT (multiple and indexed vector): Multi-vector unsigned by signed integer dot-product by indexed element.
USVDOT: Multi-vector unsigned by signed integer vertical dot-product by indexed element.
For the multi-vec ternary indexed with 2 and 4 ZA single-vectors for
32 and 64 bits according to the instruction
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D135563
Differential Revision: https://reviews.llvm.org/D135676
Following on from D129634, this patch fixes more X86 CodeGen test
failures with D129213 applied, which adds verification of LiveIntervals
after the TwoAddressInstruction pass runs. These failures only showed up
with LLVM_ENABLE_EXPENSIVE_CHECKS=ON which adds the equivalent of an
implicit -verify-machineinstrs on all tests.
Differential Revision: https://reviews.llvm.org/D136596
The Chain wasn't set correctly in the DAG for functions marked
with aarch64_pstate_sm_body, which meant that SelectionDAG would
dead-code some of the CopyToReg's. This didn't show up in the
existing tests because all uses were in the same block, but when
adding some control-flow, suddenly things would break.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D136579
This patch adds the assembly/disassembly for the following instruction:
Int:
SCLAMP: Multi-vector signed clamp to minimum/maximum vector.
UCLAMP: Multi-vector unsigned clamp to minimum/maximum vector.
FP:
FCLAMP: Multi-vector floating-point clamp to minimum/maximum number.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D135563
Differential Revision: https://reviews.llvm.org/D135601
This patch adds the assembly/disassembly for the following instruction:
INT:
SMAX (multiple and single vector): Multi-vector signed maximum by vector.
(multiple vectors): Multi-vector signed maximum.
SMIN (multiple and single vector): Multi-vector signed minimum by vector.
(multiple vectors): Multi-vector signed minimum.
UMAX (multiple and single vector): Multi-vector unsigned maximum by vector.
(multiple vectors): Multi-vector unsigned maximum.
UMIN (multiple and single vector): Multi-vector unsigned minimum by vector.
(multiple vectors): Multi-vector unsigned minimum.
SRSHL (multiple and single vector): Multi-vector signed rounding shift left by vector.
(multiple vectors): Multi-vector signed rounding shift left.
URSHL (multiple and single vector): Multi-vector unsigned rounding shift left by vector.
(multiple vectors): Multi-vector unsigned rounding shift left.
FP:
FMAX (multiple and single vector): Multi-vector floating-point maximum by vector.
(multiple vectors): Multi-vector floating-point maximum.
FMAXNM (multiple and single vector): Multi-vector floating-point maximum number by vector.
(multiple vectors): Multi-vector floating-point maximum number.
FMIN (multiple and single vector): Multi-vector floating-point minimum by vector.
(multiple vectors): Multi-vector floating-point minimum.
FMINNM (multiple and single vector): Multi-vector floating-point minimum number by vector.
(multiple vectors): Multi-vector floating-point minimum number.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
It also updates ADD and SQDMULH
Depends on: D135563
Differential Revision: https://reviews.llvm.org/D135599
Set target triple to "dxil-ms-dx" for DXIL at the end of DXILTranslateMetadata.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D131545
This patch implements:
FCVTZS: Multi-vector floating-point convert to signed integer, rounding
toward zero.
FCVTZU: Multi-vector floating-point convert to unsigned integer,
rounding toward zero.
SCVTF: Multi-vector signed integer convert to floating-point.
UCVTF: Multi-vector unsigned integer convert to floating-point.
for 2 and 4 registers
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D135563
Differential Revision: https://reviews.llvm.org/D135593
When lowering vector shuffles into the xxsplti32dx instruction on Power10, we
canonicalize the right operand to be a BUILD_VECTOR and as a result, get the
commuted vector shuffle node.
However, a vector shuffle will not always be returned as the result for a
commuted vector shuffle. In such a scenario, this patch updates the original
cast of a shuffle into a dyn_cast<> and checks if the shuffle is a valid vector
shuffle node prior to obtaining the commuted shuffle mask.
This patch also adds a new test case that demonstrates this scenario (primarily
seen on 32-bit), and was originally a crash prior to this fix.
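The difference between an unchecked cast and a checked one can be illustrated outside of LLVM. The sketch below uses standard dynamic_cast as a stand-in for LLVM's dyn_cast<>, and every type and function name in it (Node, ShuffleNode, BuildVectorNode, getShuffleMask) is hypothetical rather than taken from the PowerPC backend.

```cpp
#include <optional>
#include <utility>
#include <vector>

// Hypothetical node hierarchy standing in for SelectionDAG nodes.
struct Node {
  virtual ~Node() = default;
};

struct ShuffleNode : Node {
  std::vector<int> mask;
  explicit ShuffleNode(std::vector<int> m) : mask(std::move(m)) {}
};

struct BuildVectorNode : Node {};

// Checked downcast: returns the mask only when the node really is a
// shuffle, instead of assuming so and misbehaving on other node kinds.
std::optional<std::vector<int>> getShuffleMask(const Node *node) {
  if (const auto *shuffle = dynamic_cast<const ShuffleNode *>(node))
    return shuffle->mask;
  return std::nullopt; // caller must bail out of the transform
}
```

The caller treats an empty result as "this lowering does not apply" and falls back to another strategy, which corresponds to checking the dyn_cast<> result before reading the commuted mask.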
Differential Revision: https://reviews.llvm.org/D135024
This appears to be a copy+paste typo in the znver1/2 AMD SoG tables, treating the byte shift instructions like bit shifts.
Older AMD SoG referred to PSLLDQ/PSRLDQ as shuffles, and Agner/instlatx64 both report they are integer shuffles.
On AVX512, extract legal bool vectors as bool subvectors before bitcasting to scalars to avoid spilling to stack.
This helps Rust, which internally represents bool vectors as bool arrays.
It also exposes more missed opportunities to use the KADD instruction to add masks together before moving to gpr.
Fixes #58546
Handle MULH[US] by normalizing them into newly invented nodes
HexagonISD::(S|U|US)MUL_LOHI. On HVX v60, if only the high part of
SMUL_LOHI is used, use the original MULHS expansion. In all other
cases, expand the full product.
On HVX v62, always expand the full product.
Introduce Hexagon-specific LLVM IR intrinsics for 32x32 multiplication
returning low/high parts.
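For reference, the scalar meaning of a 32x32 multiply returning low and high parts can be written as below. This is a sketch of the arithmetic only, with hypothetical function names (umulLoHi, smulLoHi, mulhs). It does not model the HVX vector expansion or the new intrinsics.

```cpp
#include <cstdint>
#include <utility>

// Unsigned 32x32 -> {low, high} halves of the 64-bit product.
std::pair<std::uint32_t, std::uint32_t> umulLoHi(std::uint32_t a,
                                                 std::uint32_t b) {
  const std::uint64_t product = static_cast<std::uint64_t>(a) * b;
  return {static_cast<std::uint32_t>(product),
          static_cast<std::uint32_t>(product >> 32)};
}

// Signed 32x32 -> {low, high}. The right shift of a negative value is
// assumed to be arithmetic (guaranteed since C++20, common before that).
std::pair<std::uint32_t, std::int32_t> smulLoHi(std::int32_t a,
                                                std::int32_t b) {
  const std::int64_t product = static_cast<std::int64_t>(a) * b;
  return {static_cast<std::uint32_t>(product),
          static_cast<std::int32_t>(product >> 32)};
}

// MULHS is just the high half of the signed full product.
std::int32_t mulhs(std::int32_t a, std::int32_t b) {
  return smulLoHi(a, b).second;
}
```

Viewed this way, MULHS and MULHU are the special case where only the second element of the pair is used, which is the situation singled out for HVX v60.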