This patch adds an IncomingValueHandler and IncomingValueAssigner, and implements minimal support for lowering formal arguments according to the RISC-V calling convention. Simple non-aggregate integer and pointer types are supported.
In the future, we must correctly handle byval and sret pointer arguments, and instances where the number of arguments exceeds the number of registers.
Coauthored By: lewis-revill
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D74977
This patch updates several functions in LLVM's IR generation code to accept
an IRBuilder object as an argument, rather than an Instruction that indicates
the insertion point for new instructions.
This change is necessary to handle sophisticated -Ofast optimization cases
from D148558 where it's unclear which instructions should be used as the
insertion point for new operations.
Differential Revision: https://reviews.llvm.org/D148703
This is pretty straight forward in the basic form. I did need to move the slideup matching earlier, but that looks generally profitable on it's own.
As follow ups, I plan to explore the v(f)slide1down variants, and see what I can do to canonicalize the shuffle then insert pattern (see _inverse tests at the end of the vslide1up.ll test).
Differential Revision: https://reviews.llvm.org/D151468
lowerBuildVectorAsBroadcast will not broadcast splat constants in all cases, resulting in a lot of situations where a full width vector load that has failed to fold but is loading splat constant values could use a broadcast load instruction just as cheaply, and save constant pool space.
This patch adds four new tests for upcoming functionality in LLVM:
* complex-deinterleaving-add-mull-fixed-contract.ll
* complex-deinterleaving-add-mull-scalable-contract.ll
* complex-deinterleaving-add-mull-fixed-fast.ll
* complex-deinterleaving-add-mull-scalable-fast.ll.
These tests were generated from the IR of vectorizable loops, which were
compiled from C++ code using different optimization flags in Clang. Each pair
of tests corresponds to Neon and SVE architectures, respectively, and
each pair contains tests compiled with -Ofast and -O3 -ffp-contract=fast
-ffinite-math-only optimization flags.
The tests were stripped of nnan and ninf flags as they have no impact on the
output.
The primary objective of these tests is to show the various sequences of
complex computations that may be encountered and to demonstrate the ability
of ComplexDeinterleaving to support any ordering.
Depends on D147451
Differential Revision: https://reviews.llvm.org/D148550
Fixes issue introduced by 0f8e0f4228805cbecce13dcfadef4c48a4f0f4cd where SimplifyDemandedBits could crash when trying to extract fp data from broadcasted constants
This results in improved codegen for half/bf16 libcalls on soft ABIs
Adds a RISCVSubtarget helper method for determining if a soft FP ABI is
being targeted (future bf16 related patches make use of this).
Differential Revision: https://reviews.llvm.org/D151434
Previously we only handled instructions with merge ops that were
also masked. This patch supports instructions with merge ops that
aren't masked, like FMA.
I'm only folding into a TU vmerge for now. Supporting TA vmerge
shouldn't be much more work, but we need to make sure we get the
policy operand for the result correct. And of course we need more
tests.
Reviewed By: fakepaper56, frasercrmck
Differential Revision: https://reviews.llvm.org/D151596
lowerBuildVectorAsBroadcast will not broadcast splat constants in all cases, resulting in a lot of situations where a full width vector load that has failed to fold but is loading splat constant values could use a broadcast load instruction just as cheaply, and save constant pool space.
NOTE: SSE3 targets can use MOVDDUP but not all SSE era CPUs can perform this as cheaply as a vector load, we will need to add scheduler model checks if we want to pursue this.
As discussed in D151436, it's safe to do this as a simple shift (as is
done in LegalizeDAG.cpp) rather than needing a libcall. The added test
cases for RISC-V previously just triggered an assertion.
Codegen for bfloat_to_double will be slightly improved by D151434.
Differential Revision: https://reviews.llvm.org/D151563
Noticed in D150143/D150526 - we currently create scalar Constant values using the broadcast instruction width, which might be wider than the original build vector width, making it tricky to recognise the original constant bits data.
If we have widened the broadcast value, its much more useful for asm comments if we create a ConstantVector with the original element data, add that to the constant-pool and load that with the same (wider) broadcast instruction.
We use a special TIED instructions for vfwadd.wv to avoid an
earlyclobber constraint preventing the first source and the destination
from being the same register.
This prevents our normal post process for forming TU instructions.
Add manual isel pattern instead. This matches what we do for FMA
for example.
The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.
This kernel has a few hundred i8 loads and stores. Merging is critical for
performance.
The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores. This limit is necessary to contain the
O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.
So we rewrite basically the whole thing to use a new algorithm. Before, we
compared every load/store to every other to see if they're consecutive. The
insight (from tra@) is that this is redundant. If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.
So that's what we do. When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain. Each instruction gets compared only to the leaders of
all the chains.
In the worst case, this is still O(n^2), because all chains might be of length
1. To prevent compile time blowup, we only consider the 64 most recently used
chains. Thus we do no more comparisons than before, but we have the potential
to make much longer chains.
This rewrite affects many tests. The changes to tests fall into two
categories.
1. The old code had what appears to be a bug when deciding whether a misaligned
vectorized load is fast. Suppose TTI reports that load <i32 x 4> align 4
has relative speed 1, and suppose that load i32 align 4 has relative speed
32.
The intent of the code seems to be that we prefer the scalar load, because
it's faster. But the old code would choose the vectorized load.
accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
even call into TTI to get the relative speed), because the scalar load is
aligned.
After this patch, we will prefer the scalar load if it's faster.
2. This patch changes the logic for how we vectorize. Usually this results in
vectorizing more.
Explanation of changes to tests:
- AMDGPU/adjust-alloca-alignment.ll: #1
- AMDGPU/flat_atomic.ll: #2, we vectorize more.
- AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call.
- AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
- AMDGPU/insertion-point.ll: #2 we vectorize more
- AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
- AMDGPU/multiple_tails.ll: #1
- AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
- AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
- NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
- NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
- X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
- X86/subchain-interleaved.ll: #2, we vectorize more
- X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
- X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
- X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll
Differential Revision: https://reviews.llvm.org/D149893
This reverts commit 9b92f70d4758f75903ce93feaba5098130820d40. The issue
with the re-applied change was an implicit truncation due to the
multiplication. Although the operations were converted to `APInt`, the
values were implicitly converted to `long` due to the typing rules.
Fixes: #59594
Differential Revision: https://reviews.llvm.org/D140347
The generic implementation is umin(TC, VF * vscale).
Lowering to vsetvli for RISC-V will come in a future patch.
This patch is a pre-requisite to be able to CodeGen vectorized code from
D99750.
Reviewed By: reames, frasercrmck
Differential Revision: https://reviews.llvm.org/D149916
For dbg.value intrinsics targeting an llvm::Argument address whose expression
starts with an entry value, we lower this to a DEBUG_VALUE targeting the livein
physical register corresponding to that Argument.
Depends on D151332
Differential Revision: https://reviews.llvm.org/D151333
This fixes a couple mistakes in 0f64d4f877. In particular, I'd not included a negative test where the slideup didn't write the entire VL, and had gotten all of my 4 element vector shuffle masks incorrect so they didn't match. Also, add a test with swapped operands for completeness.
The transform is in D151468.
When avx512 is available the lhs operand of select instruction can be
folded with mask instruction, while the rhs operand can't. This patch is
to commute the lhs and rhs of the select instruction to create the
opportunity of folding.
Differential Revision: https://reviews.llvm.org/D151535
Summary:
DbgValue intrinsics whose expression is an entry_value and whose address is
described an llvm::Argument must be lowered to the corresponding livein physical
register for that Argument.
Depends on D151329
Reviewers: aprantl
Subscribers:
For dbg.value intrinsics targeting an llvm::Argument address whose expression
starts with an entry value, we lower this to a DEBUG_VALUE targeting the livein
physical register corresponding to that Argument.
Depends on D151328
Differential Revision: https://reviews.llvm.org/D151329
Previously SGPR triples like s[3:5] were aligned on a 3-SGPR boundary
which has no basis in hardware.
Aligning them on a 4-SGPR boundary is at least justified by the
architecture reference guide which says: "Quad-alignment of SGPRs is
required for operation on more than 64-bits".
Currently there are no instructions that take SGPR triples as operands
so the issue is latent.
Differential Revision: https://reviews.llvm.org/D151463
The main purpose of this is to simplify register pressure tracking as after the pass there is no need
to track subreg liveness anymore.
On the other hand this pass creates more possibilites for the subreg unaware code, as many of the subregs
becomes ordinary registers.
Intersting sideeffect: spill-vgpr.ll has lost a lot of spills.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D139732
After D149063.
This patch adds support for both scalable and fixed-length vector.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D151176
This was originally added to preserve FMF on SETCC. Unfortunately,
it also incorrectly preserves nuw/nsw on ADD/SUB in some cases.
There's also no guarantee the new opcode is even the same opcode
as the original node.
This patch removes the code and adds code to explicitly preserve
FMF flags in the SETCC promotion function.
The other test changes are from nuw/nsw not being preserved. I
believe for all these tests it was correct to preserve the flags,
so we need new code to preserve the flags when possible. I'll post
another patch for that since it's a riskier change.
This should unblock D150769.
Differential Revision: https://reviews.llvm.org/D151472
Note that the builders are protected by is64Bit().
More fine-grained availibility checks.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D150790
Even though we only need to write to the bottom NumElts - Rotation
elements for the vslidedown.vi, we can save an extra vsetivli toggle if
we just keep the wide VL.
(I may be missing something here: is there a reason why we want to explicitly keep the vslidedown narrow?)
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D151390
This is an attempt to reland D42600 and enabling this optimisation by default.
This also resolves the issue pointed out in the context of PGO build.
Differential Revision: https://reviews.llvm.org/D42600