For the f64 case, this gives us a cheaper to materialize 32-bit
constant. It's less obviously a win for f32 and f16. It forces us to
use a VOP3 encoding so it's a neutral code size change.
GlobalISel cases don't work because of the constant-is-copy-to-vgpr
problem.
https://reviews.llvm.org/D157111
In the machine outliner implementation for AArch64, `signOutlinedFunction()`
reimplements signing the LR value in prologue and authenticating it in
epilogue of the outlined function. This patch factors out `signLR()` and
`authenticateLR()` functions from AArch64FrameLowering code and reuses
them in `signOutlinedFunction()`.
The `mergeOutliningCandidateAttributes()` outliner callback is
introduced as well to further unify signing and authentication of the LR
value.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D157320
Replaces a pair of insert_subvectors with a single (implicitly widened) vector. This also reduces uses of the src.
Hopefully this should address most of the remaining widen subvector regressions I'm seeing while trying to aggressively convert TRUNCATE to PACKSS/PACKUS.
On AVX512 targets we can concatenate these and create a X86ISD::SHUF128 node.
Prevents regression on some future work to improve codegen for concat_vectors(extract_subvector(),extract_subvector()) (mainly via vector widening) patterns.
We previously split the vector into two halves and performed two vector reduce operations followed by bit shifting and bitwise or. Now, we use NEON's zip1 to concatenate
the halves in a smart way and then perform only a single vector reduce. This boosts performance quite a bit for this small routine, as vector reduce is a rather expensive
instruction. Original discussion for this started in: https://reviews.llvm.org/D145301
Differential Revision: https://reviews.llvm.org/D156544
BuildMI automatically adds the implicit operands of the
instruction. This meant we couldn't set the dead flag on
dead implicit defs in that case.
Fix it by introducing an opcode to mark a given implicit
def as dead.
Fixes #64565
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157515
PromoteAlloca now uses SSAUpdater, so it no longer needs SROA to clean up after it.
Internal testing shows no noticeable performance impact.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D156398
This supports promotion for vp.bitreverse/bswap/ctlz/ctlz_zero_undef/cttz/cttz_zero_undef/ctpop/fshr/fshl.
Reviewed By: craig.topper, luke
Differential Revision: https://reviews.llvm.org/D157607
FRINT was added to matchRoundingOp after this function was written.
So FRINT was not tested originally.
For vectors, folding this causes us to create a CSR swap that tries
to write 7 to FRM. This is an illegal value and will cause the CSR
write to fail.
While this might be a legal fold we could do, I'm disabling it for
now so we can backport to LLVM 17 with the least risk.
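The FRM problem described above can be illustrated with a small model of the RISC-V rounding-mode encoding. This is only a sketch: the numeric values follow the standard RISC-V encoding, where 7 is the dynamic mode used in an instruction's rm field, and it assumes FRINT maps to that dynamic mode, which is what the value 7 in the text suggests. The helper name `is_valid_frm_csr_value` is hypothetical and is not an LLVM function.

```python
from enum import IntEnum


class RoundingMode(IntEnum):
    """RISC-V rounding-mode encodings (5 and 6 are reserved)."""
    RNE = 0  # round to nearest, ties to even
    RTZ = 1  # round towards zero
    RDN = 2  # round down
    RUP = 3  # round up
    RMM = 4  # round to nearest, ties to max magnitude
    DYN = 7  # "use the mode in frm"; only meaningful in an instruction's rm field


def is_valid_frm_csr_value(value: int) -> bool:
    """Hypothetical helper: True if `value` names a concrete rounding mode
    that may be written to the frm CSR (DYN and reserved values are rejected)."""
    return RoundingMode.RNE <= value <= RoundingMode.RMM


# A fold that swaps FRM around a vector operation needs a concrete mode.
for mode in RoundingMode:
    print(mode.name, is_valid_frm_csr_value(mode))
```

Under this model, a fold that needs to write DYN into FRM has no valid value to write, which is why skipping the fold is the low-risk option.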
Differential Revision: https://reviews.llvm.org/D157583
Original commit didn't handle the case where one of the stores was a
truncating store of the build_vector. The existing codepath produced
wrong code (which thankfully also failed asserts) instead of guarding
against unexpected types. Original commit message follows.
Ran across this when making a change to RISCV memset lowering. Seems
very odd that manually merging a store into a vector prevents it from
being further merged.
Differential Revision: https://reviews.llvm.org/D156349
This is the case which triggered the revert of 660b740. Note that the test is extremely fragile as it depends on getting a truncating store at the right moment rather than folding the constant to a narrower bitwidth. This appears to happen on skylake, but not e.g. plain avx.
Set the ReplaceFlags variable to false, since there is code meant only
for the ADDItocHi/ADDItocL nodes. This has the side effect of disabling
the peephole when the load/store instruction has a non-zero offset.
This patch also fixes retrieving the `ImmOpnd` node from the AIX small
code model pseudos and does the same for the register operand node.
This allows cleaning up the later calls to replaceOperands.
Finally, move calculating the MaxOffset into the code guarded by
ReplaceFlags as it is only used there and the comment is specific to the ELF
ABI.
Fixes https://github.com/llvm/llvm-project/issues/63927
Differential Revision: https://reviews.llvm.org/D155957
This was an oversight when the GFX11 early release VGPRs optimization
was reimplemented in D153279.
Sending the DEALLOC_VGPRS message is a performance optimization so there
is no need to do it at -O0. In addition it makes some kinds of post
mortem debugging hard or impossible, since VGPR values are no longer
available to inspect at the s_endpgm instruction.
Differential Revision: https://reviews.llvm.org/D157599
Add bitcast handling to the existing insert_subvector(src, extract_subvector(sub)) pattern, and recognise undef src cases to allow us to detect vector widening patterns.
Since zihintntl is ratified now, we could remove the experimental prefix and change its version to 1.0.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D151547
This PR causes the PPA1 to emit the function's name if it exists. This field is not emitted for unnamed functions.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D157494
In LLVM, alias analysis is currently off by default.
This patch enables alias analysis by default for the RISCV target during code generation,
which creates more opportunities to improve performance.
Related test cases are updated accordingly.
Differential Revision: https://reviews.llvm.org/D157250
After the library is linked and trivially inlined, the generic fma and
fmuladd intrinsics already handle these cases, and with precise flag
handling. This was requiring all fast math flags when we really just
need nsz for the fma(a, b, 0) case.
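Why nsz is the flag that matters for fma(a, b, 0) can be seen with signed zeros. The sketch below is plain Python floating point, not the AMDGPU library code; the function names are hypothetical, and the unfused `a * b + 0.0` stands in for the fused operation, which is safe here only because the products used are exact.

```python
import math


def fma_with_zero_addend(a: float, b: float) -> float:
    """Model of fma(a, b, +0.0). The inputs used below have exact products,
    so the fused and unfused results coincide."""
    return a * b + 0.0


def folded_to_fmul(a: float, b: float) -> float:
    """What the call becomes if the zero addend is simply dropped."""
    return a * b


def is_negative_zero(x: float) -> bool:
    """True only for -0.0."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


# -1.0 * 0.0 is -0.0, but -0.0 + (+0.0) rounds to +0.0.
print(is_negative_zero(folded_to_fmul(-1.0, 0.0)))
print(is_negative_zero(fma_with_zero_addend(-1.0, 0.0)))
```

The two forms differ only in the sign of a zero result, so ignoring the sign of zero is sufficient to make the fold valid in this model.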
https://reviews.llvm.org/D156677
This was trying to constant fold these calls, and also turn some of
them into a regular fmul/fdiv. There's no point to doing that, the
underlying library implementation should be using those in the first
place. Even when the library does use the rcp intrinsics, the backend
handles constant folding of those. This was also only performing the
folds under overly strict fast-everything-is-required conditions.
The one possible plus this gained over linking in the library is if
you were using all fast math flags, it would propagate them to the new
instructions. We could address this in the library by adding more fast
math flags to the native implementations.
The constant fold case also had no test coverage.
https://reviews.llvm.org/D156676
This allows use with non-0 address space stacks. llvm_ptr_ty should
never be used. This could use some more percolation up through mlir,
but this is enough to fix existing tests.
https://reviews.llvm.org/D156666
The function changeVectorElementType assumes MVT input types will
result in MVT output types. There's no guarantee this is possible
during early code generation and so this patch converts an instance
used during initial DAG construction to instead explicitly create a
new EVT.
NOTE: I could have added more MVTs, but that seemed unscalable as
you can either have MVTs with 100% element count coverage or 100%
bitwidth coverage, but not both.
Differential Revision: https://reviews.llvm.org/D157392
The example sequence
add z0.h, z0.h, #32
lsr z0.h, z0.h, #6
st1b z0.h, x1
can be replaced with
rshrnb z0.b, z0.h, #6
st1b z0.h, x1
because the top half of each destination element is truncated.
In similar fashion,
add z0.s, z0.s, #32
lsr z1.s, z1.s, #6
add z1.s, z1.s, #32
lsr z0.s, z0.s, #6
uzp1 z0.h, z0.h, z1.h
can be replaced with
rshrnb z1.h, z1.s, #6
rshrnb z0.h, z0.s, #6
uzp1 z0.h, z0.h, z1.h
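The equivalence behind these replacements can be checked with a scalar model of a single lane. This is a sketch under stated assumptions: a 16-bit source element narrowed to 8 bits, the add wrapping at 16 bits, and the rounding shift modelled as computing the rounded value without overflow before truncating. The function names are hypothetical and not part of LLVM.

```python
def add_lsr_truncate(x: int, shift: int, wide_bits: int = 16) -> int:
    """Model of add #(1 << (shift - 1)); lsr #shift; truncating store, for one lane."""
    wide_mask = (1 << wide_bits) - 1
    narrow_mask = (1 << (wide_bits // 2)) - 1
    rounded = (x + (1 << (shift - 1))) & wide_mask  # the add wraps in the wide lane
    return (rounded >> shift) & narrow_mask         # the store keeps only the low half


def rounding_shift_narrow(x: int, shift: int, wide_bits: int = 16) -> int:
    """Model of a rounding shift right narrow: no overflow before truncation."""
    narrow_mask = (1 << (wide_bits // 2)) - 1
    return ((x + (1 << (shift - 1))) >> shift) & narrow_mask


def sequences_agree(shift: int, wide_bits: int = 16) -> bool:
    """Exhaustively compare both models over every wide-lane value."""
    return all(
        add_lsr_truncate(x, shift, wide_bits) == rounding_shift_narrow(x, shift, wide_bits)
        for x in range(1 << wide_bits)
    )


print(sequences_agree(6))
```

In this model the two forms agree whenever the shift amount is at most the narrow element width, because the carry lost by the wrapping add lands above the bits that survive truncation. For a larger shift, such as 9 with 8-bit results, the models diverge.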
Differential Revision: https://reviews.llvm.org/D155299
When emitting the assembly we perform some late global variables demotion.
Prior to this patch, this optimization was only performed on variables with
internal linkage, whereas any global variable with local linkage can be demoted.
Fix that by using `hasLocalLinkage` instead of `hasInternalLinkage`.
Without this change, global variables with the `private` linkage wouldn't
be demoted.
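The difference between the two predicates can be sketched with a small model. It assumes, as in LLVM's GlobalValue, that local linkage covers both internal and private linkage. The enum and helper names below are hypothetical Python stand-ins, not the LLVM API.

```python
from enum import Enum, auto


class Linkage(Enum):
    """A reduced set of linkage kinds, for illustration only."""
    EXTERNAL = auto()
    INTERNAL = auto()
    PRIVATE = auto()


def has_internal_linkage(linkage: Linkage) -> bool:
    """Stand-in for the old check: only internal linkage qualifies."""
    return linkage is Linkage.INTERNAL


def has_local_linkage(linkage: Linkage) -> bool:
    """Stand-in for the new check: internal or private linkage qualifies."""
    return linkage in (Linkage.INTERNAL, Linkage.PRIVATE)


for linkage in Linkage:
    print(linkage.name, has_internal_linkage(linkage), has_local_linkage(linkage))
```

The only row where the two checks differ is `PRIVATE`, which matches the case the patch describes.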
Differential Revision: https://reviews.llvm.org/D154507
This patch adds error diagnostics to Clang when code uses the AArch64 SME
attributes without specifying 'sme' as an available target attribute.
* Function definitions marked as '__arm_streaming', '__arm_locally_streaming',
'__arm_shared_za' or '__arm_new_za' will by definition use or require SME
instructions.
* Calls from non-streaming functions to streaming functions require
the compiler to enable/disable streaming-SVE mode around the call-site.
In some cases we can accept the SME attributes without having 'sme' enabled:
* Function declarations can have the SME attributes.
* Definitions can be __arm_streaming_compatible since the generated
code should execute on processing elements without SME.
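The rules in the two lists above can be summarised as a small decision model. This is only a sketch of the stated cases, not Clang's implementation: the function names are hypothetical, and cases the text does not mention (for example calls made from streaming-compatible functions) are deliberately left out.

```python
# Attributes that, on a function definition, imply SME instructions are used.
SME_REQUIRING_DEFINITION_ATTRS = {
    "__arm_streaming",
    "__arm_locally_streaming",
    "__arm_shared_za",
    "__arm_new_za",
}


def definition_needs_sme(attrs) -> bool:
    """A definition needs 'sme' if it carries any of the attributes above.
    __arm_streaming_compatible on its own does not require it."""
    return bool(SME_REQUIRING_DEFINITION_ATTRS & set(attrs))


def declaration_needs_sme(attrs) -> bool:
    """Declarations may carry SME attributes without 'sme' being enabled."""
    return False


def call_needs_sme(caller_is_streaming: bool, callee_is_streaming: bool) -> bool:
    """A non-streaming caller of a streaming callee needs a mode switch."""
    return callee_is_streaming and not caller_is_streaming


print(definition_needs_sme(["__arm_streaming"]))
print(definition_needs_sme(["__arm_streaming_compatible"]))
print(call_needs_sme(caller_is_streaming=False, callee_is_streaming=True))
```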
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D157269