3803 Commits

Author SHA1 Message Date
Florian Hahn
8098f2577e
[LV] Use Legal::isUniform to detect uniform pointers.
Update collectLoopUniforms to identify uniform pointers using
Legal::isUniform. This is more powerful and  brings pointer
classification here in sync with setCostBasedWideningDecision
which uses isUniformMemOp. The existing mis-match in reasoning
can causes crashes due to D134460, which is fixed by this patch.

Fixes https://github.com/llvm/llvm-project/issues/60831.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D150991
2023-05-30 16:42:55 +01:00
Florian Hahn
1a28b9bce7
[VPlan] Handle invariant GEPs in isUniformAfterVectorization.
This fixes a crash caused by legal treating a scalable GEP as invariant,
but isUniformAfterVectorization does not handle GEPs.

Partially fixes https://github.com/llvm/llvm-project/issues/60831.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D144434
2023-05-30 15:53:26 +01:00
Florian Hahn
b750862107
[LV] Use early exit for stores storing the ptr operand. (NFC)
Cleanup suggested in D150991.
2023-05-30 12:14:12 +01:00
Kazu Hirata
75c75215e3 [Vectorize] Remove unused declaration requiresTooManyRuntimeChecks
The corresponding function definition was removed by:

  commit 644a965c1efef68f22d9495e4cefbb599c214788
  Author: Florian Hahn <flo@fhahn.com>
  Date:   Mon Jul 4 15:10:48 2022 +0100
2023-05-29 11:56:50 -07:00
Justin Lebar
420cf6927c
[LSV] Return same bitwidth from getConstantOffset.
Previously, getConstantOffset could return an APInt with a different
bitwidth than the input pointers.  For example, we might be loading an
opaque 64-bit pointer, but stripAndAccumulateInBoundsConstantOffsets
might give a 32-bit offset.

This was OK in most cases because in gatherChains, we casted the APInt
back to the original ASPtrBits.

But it was not OK when considering selects.  We'd call getConstantOffset
twice and compare the resulting APInt's, which might not have the same
bit width.

This fixes that.  Now getConstantOffset always returns offsets with the
correct width, so we don't need the hack of casting it in gatherChains,
and it works correctly when we're handling selects.

Differential Revision: https://reviews.llvm.org/D151640
2023-05-29 08:43:47 -07:00
Justin Lebar
f225471c68
[LSV] Fix the ContextInst for computeKnownBits.
Previously we used the later of GEPA or GEPB.  This is hacky because
really we should be using the later of the two load/store instructions
being considered.  But also it's flat-out incorrect, because GEPA and
GEPB might be in different BBs, in which case we cannot ask which one
comes last (assertion failure,
https://reviews.llvm.org/D149893#4378332).

Fixed, now we use the correct context instruction.

Differential Revision: https://reviews.llvm.org/D151630
2023-05-28 08:00:52 -07:00
Kazu Hirata
b1b04ed96a [Vectorize] Fix warnings
This patch fixes:

  llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:140:20: error:
  unused function 'operator<<' [-Werror,-Wunused-function]

  llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:176:6: error:
  unused function 'dumpChain' [-Werror,-Wunused-function]
2023-05-26 17:27:25 -07:00
Kazu Hirata
0508ac32cf [Vectorize] Fix a warning
This patch fixes:

  llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:1429:23:
  error: comparison of integers of different signs: 'int' and 'const
  size_t' (aka 'const unsigned long') [-Werror,-Wsign-compare]
2023-05-26 17:02:34 -07:00
Justin Lebar
8d57b00f96
Fix -Wsign-compare from D149893. 2023-05-26 16:22:16 -07:00
Justin Lebar
2be0abb7fe
Rewrite load-store-vectorizer.
The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.

This kernel has a few hundred i8 loads and stores.  Merging is critical for
performance.

The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores.  This limit is necessary to contain the
O(n^2) behavior of the pass.  I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.

So we rewrite basically the whole thing to use a new algorithm.  Before, we
compared every load/store to every other to see if they're consecutive.  The
insight (from tra@) is that this is redundant.  If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.

So that's what we do.  When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain.  Each instruction gets compared only to the leaders of
all the chains.

In the worst case, this is still O(n^2), because all chains might be of length
1.  To prevent compile time blowup, we only consider the 64 most recently used
chains.  Thus we do no more comparisons than before, but we have the potential
to make much longer chains.

This rewrite affects many tests.  The changes to tests fall into two
categories.

1. The old code had what appears to be a bug when deciding whether a misaligned
   vectorized load is fast.  Suppose TTI reports that load <i32 x 4> align 4
   has relative speed 1, and suppose that load i32 align 4 has relative speed
   32.

   The intent of the code seems to be that we prefer the scalar load, because
   it's faster.  But the old code would choose the vectorized load.
   accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
   even call into TTI to get the relative speed), because the scalar load is
   aligned.

   After this patch, we will prefer the scalar load if it's faster.

2. This patch changes the logic for how we vectorize.  Usually this results in
   vectorizing more.

Explanation of changes to tests:

 - AMDGPU/adjust-alloca-alignment.ll: #1
 - AMDGPU/flat_atomic.ll: #2, we vectorize more.
 - AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this.  Before, we'd vectorize in case 1 and not case 2.  Now we vectorize in case 2 and not case 1.  So we just move the call.
 - AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
 - AMDGPU/insertion-point.ll: #2 we vectorize more
 - AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
 - AMDGPU/multiple_tails.ll: #1
 - AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
 - AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
 - NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
 - NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
 - X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
 - X86/subchain-interleaved.ll: #2, we vectorize more
 - X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
 - X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical.  It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
 - X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll

Differential Revision: https://reviews.llvm.org/D149893
2023-05-26 15:15:39 -07:00
Alexey Bataev
95b631181a [SLP]Fix getSpillCost functions.
There are several issues in the current implementation. The instructions
are not properly ordered, if they are placed in different basic blocks,
need to reverse the order of blocks. Also, need to exclude
non-vectorizable nodes and check for CallBase, not CallInst, otherwise
invoke calls are not handled correctly.
2023-05-26 12:19:28 -07:00
Alexander Timofeev
bad4de1ae7 Don't disable loop unroll for vectorized loops on AMDGPU target
We've got a performance regression after the https://reviews.llvm.org/D115261.
Despite the loop being vectorized unroll is still required.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D149281
2023-05-25 22:54:41 +02:00
Craig Topper
6006d43e2d LLVM_FALLTHROUGH => [[fallthrough]]. NFC
Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D150996
2023-05-24 12:40:10 -07:00
Florian Hahn
299f0ff60e
[VPlan] Print IR flags for VPRecipeWithIRFlags.
Now that IR flags are modeled as part of VPRecipeWithIRFlags, include
the flags when printing recipes.

Depends on D150027.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D150029
2023-05-23 20:36:16 +01:00
Alexey Bataev
ae5ff3ca0c [SLP]Fix PR62665: compiler crash when trying to access non-existing mask
element.

Need to check at first if the SubMask element is PoisonMaskElem to avoid
compiler crash.
2023-05-22 13:43:25 -07:00
Luke Lau
c27a0b21c5 [SLP][RISCV] Account for offset folding in getPointersChainCost
For a GEP in a pointer chain, if:
1) a pointer chain is unit-strided
2) the base pointer wasn't folded and is sitting in a register somewhere
3) the distance between the GEP and the base pointer is small enough and
   can be folded into the addressing mode of the using load/store

Then we can exclude that GEP from the total cost of the pointer chain,
as it will likely be folded away.

In order to check if 3) holds, we need to know the type of memory access
being made by the users of the pointer chain. For that, we need to pass
along a new argument to getPointersChainCost. (Using the source pointer
type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for
more details).

Also note that 2) is currently an assumption, and could be modelled more
accurately.

This prevents some unprofitable cases from being SLP vectorized on
RISC-V by making the scalar costs cheaper and closer to the actual
codegen.

For now the getPointersChainCost hook is duplicated for RISC-V to prevent
disturbing other targets, but could be merged back in and shared with
other targets in a following patch.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D149654
2023-05-22 13:55:30 +01:00
Florian Hahn
8eaf7a75fe
[VPlan] Add missing ifdef after 96686796f606.
Fixes build with debug printing disabled.
2023-05-22 10:44:17 +01:00
Florian Hahn
96686796f6
[VPlan] Move live-out printing to VPLiveOut::print (NFC).
Preparation for D150398. This brings live-out printing in line with how
printing for recipes is handled.
2023-05-22 09:53:53 +01:00
Vasileios Porpodas
806dea46be [SLP] Cleanup: Remove tryToVectorizePair(), most probably NFC
`tryToVectorizePair()` adds a level of indirection over `tryToVectorizeList()`.
I am not really sure why it is needed, it looks redundant.

I replaced all calls to `tryToVectorizePair()` with calls to
`tryToVectorizeList()` and I am not seeing any failures.

Differential Revision: https://reviews.llvm.org/D151004
2023-05-19 20:25:20 -07:00
Vasileios Porpodas
338fc76200 [SLP][NFC] Cleanup: Remove KeyNodes set.
I don't see a good reason form having the `KeyNodes` set.
This patch removes the set.

Differential Revision: https://reviews.llvm.org/D150918
2023-05-19 10:30:02 -07:00
Florian Hahn
55903151a2
[VPlan] Use isUniformAfterVec in VPReplicateRecipe::execute.
I was unable to find a case where this actually changes generated code,
but it enables the bug fix in D144434. It also brings codegen in line
with the handling of stores to uniform addresses in the cost model
(D134460).

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D144491
2023-05-19 18:15:21 +01:00
Alexey Bataev
9a7248f561 [SLP]Fix crash for scalarized vectors.
Need to remove insertion of the nodes to the InVector in case of
scalarized vectors too to avoid compiler crashes.
2023-05-17 06:32:22 -07:00
Luke Lau
d088b8af93 [SLP] Rename IsUniformStride to IsUnitStride. NFCI
IsUniformStride is only used when the stride is a unit-stride, i.e. in a
plain wide vector load. This tightens the condition and renames it to
isUnitStride. It removes the old unused getUniformStrided() variant, as
isUnitStride should now imply that the stride is known.

Reviewed By: vdmitrie, ABataev

Differential Revision: https://reviews.llvm.org/D150662
2023-05-17 13:21:33 +01:00
Alexey Bataev
6c7acc6409 [SLP][NFC]Add missing finalize params in the CostEstimator, NFC.
Prepare functions for generalization of codegen/cost estimation.

Differential Revision: https://reviews.llvm.org/D150121
2023-05-15 11:17:37 -07:00
Vasileios Porpodas
ddb2188afc [SLP][NFC] Cleanup: Separate vectorization of Inserts and CmpInsts.
This deprecates `vectorizeSimpleInstructions()` and replaces it with separate
functions that vectorize CmpInsts and Inserts.

Differential Revision: https://reviews.llvm.org/D149993
2023-05-15 10:12:34 -07:00
Florian Hahn
701f7230cd
[VPlan] Use VPRecipeWithIRFlags for VPReplicateRecipe, retire poison map
Update VPReplicateRecipe to use VPRecipeWithIRFlags for IR flag
handling. Retire separate MayGeneratePoisonRecipes map.

Depends on D149082.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D150027
2023-05-15 11:49:20 +01:00
Florian Hahn
f40a7901d1
[LV] Move selecting vectorization factor logic to LVP (NFC).
Split off from D143938. This moves the planning logic to select the
vectorization factor to LoopVectorizationPlanner as a step towards only
computing costs for individual VFs in LoopVectorizationCostModel and do
planning in LVP.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D150197
2023-05-13 12:28:14 +01:00
Florian Hahn
7472f1da96
[VPlan] Change LoopVectorizationPlanner::TTI to be const reference (NFC) 2023-05-13 12:27:57 +01:00
Florian Hahn
0418d0242b
[LV] Move getVScaleForTuning out of LoopVectorizationCostModel (NFC).
Split off refactoring from D150197 to reduce diff.
2023-05-13 10:17:13 +01:00
Philip Reames
592199c8fe [LV] Use interface routines instead of internal variables
This makes a (possible) change to the internal representation easier in the future, and makes the code easier to read now.
2023-05-12 16:27:12 -07:00
Florian Hahn
bf279a0f8e
[VPlan] Remove dangling comment and newlines (NFC).
Apply missed cleanups.
2023-05-11 22:06:56 +01:00
Florian Hahn
3d4eed0133
[LV] Reuse SCEV expansion results for epilogue vectorization.
When generating code for the epilogue vector loop, we need to re-use the
expansion results for induction steps generated for the main vector
loop, as the pre-header of the epilogue vector loop may not dominate the
vector preheader of the epilogue.

This fixes a reported crash. Note that this is a workaround which should
be removed soon once induction resume value creation is handled in VPlan
directly.
2023-05-11 22:00:07 +01:00
Philip Reames
7fbfcc653f [LV/LAA] Use PSE to identify stride multiplies which simplify [mostly nfc]
LV/LAA will speculate that (some) strided access patterns have unit stride, and insert runtime checks if required.

LV cost models a multiply by such a stride as free.  We did this by keeping around the StrideSet structure, just to check if one of the operands were one of the strides we speculated.

We can instead just ask PredicatedScalarEvolution if either of the operands are one (after predicates are applied).  We get mostly the same result - PSE can prove it in more cases in theory - and simpler code.
2023-05-11 11:16:04 -07:00
Philip Reames
e41dce4d49 [LAA/LV] Simplify stride speculation logic [NFC] (try 2)
The original commit wasn't quite NFC, and this was caught by an arguably overly strong assert.  Specifically, I'd failed to strip off the integer cast off the SCEV before saving it in the map.  The result - other than a failed assert - is that we'd speculate on the casted unknown, not the unknown.  The only case I can think of where that might change behavior would be a sext(i1 load).  I doubt that case is interesting in practice, but it's good to be strictly NFC on this change regardless.

Original commit message follows..

The existing code makes it hard to tell that collectStridedAccess is really about identifying some loop invariant SCEV which is *profitable* to speculate is equal to one. The odd dual usage structure of Value and SCEV confuses this point.

We could choose to loosen the profitability analysis if desired. I'm not proposing doing so at this time as it exposes too many cases where the speculation is unprofitable.

Differential Revision: https://reviews.llvm.org/D147750
2023-05-11 10:19:23 -07:00
Philip Reames
dc0d00c5fc Revert "[LAA/LV] Simplify stride speculation logic [NFC]"
This reverts commit d5b840131223f2ffef4e48ca769ad1eb7bb1869a.  Running this through broader testing after rebasing is revealing a crash.  Reverting while I investigate.
2023-05-11 09:26:35 -07:00
Florian Hahn
236a0e82df
[LV] Use VPValue to get expanded value for SCEV step expressions.
Update skeleton creation logic to use SCEV expansion results from
expanding the pre-header. This avoids another set of SCEV expansions
that may happen after the CFG has been modified.

Fixes #58811.

Depends on D147964.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D147965
2023-05-11 16:49:19 +01:00
Philip Reames
d5b8401312 [LAA/LV] Simplify stride speculation logic [NFC]
The existing code makes it hard to tell that collectStridedAccess is really about identifying some loop invariant SCEV which is *profitable* to speculate is equal to one. The odd dual usage structure of Value and SCEV confuses this point.

We could choose to loosen the profitability analysis if desired. I'm not proposing doing so at this time as it exposes too many cases where the speculation is unprofitable.

Differential Revision: https://reviews.llvm.org/D147750
2023-05-11 08:32:56 -07:00
Hongtao Yu
9272d0f079 [PseudoProbe] Clean up dwarf discriminator and avoid duplicating factor.
A pseudo probe is created with dwarf line information shared with its nearest instruction. If the instruction comes with a dwarf discriminator, it will be shared with the probe as well. This can confuse the later FS-AFDO discriminator assignment pass. To fix this, I'm cleaning up the discriminator fields for probes when they are inserted.

I also notice another possibility to change the discriminator field of pseudo probes in the pipeline before the FS discriminator assignment pass. That is the loop unroller, which assigns duplication factor to instruction being vectorized. I'm disabling that for pseudo probe intrinsics specifically, also for callsites with probes.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D148569
2023-05-10 11:26:23 -07:00
Vasileios Porpodas
dda2a5d457 [SLP][NFC] Rename a couple of variables and replace an if-else with an std::min
- Rename `LimitForRegisterSize` to `MaxVFOnly` to make the meaning of the limit less ambiguous
- Rename `OpsWidth` to `ActualVF`, which makes it clear that this is the VF we are using for vectorization.
- Replace the if-else code for the initialization of OpsWidth with an std::min.

Differential Revision: https://reviews.llvm.org/D150241
2023-05-10 09:37:58 -07:00
Florian Hahn
c096e91735
[VPlan] Address missed suggestions from D149082.
This address 2 comments missed from D149082. It sets inbounds directly
when creating the GEP and fixes the order in the enum.
2023-05-09 15:17:20 +01:00
Florian Hahn
5f3343985b
[VPlan] Use VPRecipeWithIRFlags for VPWidenGEPRecipe (NFCI).
Extend VPRecipeWithIRFlags to also include InBounds and use for VPWidenGEPRecipe.

The last remaining recipe that needs updating for
MayGeneratePoisonRecipes is VPReplicateRecipe.

Depends on D149081.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D149082
2023-05-09 12:33:28 +01:00
Florian Hahn
127b00b25c
[VPlan] Record IR flags on VPWidenRecipe directly (NFC).
This patch introduces a VPRecipeWithIRFlags class to record various IR
flags for a recipe. This allows de-coupling of IR flags from the
underlying instructions. The main benefit is that it allows dropping of
IR flags from recipes directly, without the need to go through
State::MayGeneratePoisonRecipes. The plan is to remove
MayGeneratePoisonRecipes once all relevant recipes are transitioned.

It also allows dropping IR flags during VPlan-to-VPlan transforms, which
will be used in a follow-up patch to implement truncateToMinimalBitwidths
as VPlan-to-VPlan transform.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D149079
2023-05-08 17:28:50 +01:00
Florian Hahn
823d35fd3b
[VPlan] Use RecipeBuilder to look up member when fixing IG (NFC).
Recipes for interleave group members are recorded directly in the
RecipeBuilder. Use it directly instead of going indirectly through
VPlan's Value->VPValue mapping.
2023-05-07 18:02:27 +01:00
Florian Hahn
7b7be685d4
[VPlan] Use operands directly in VPInstructionsToVPRecipes (NFC).
New that def-use chains are modeled directly in VPlan, we can simply use
the operands of the recipe we are replacing. There is no need to use the
operands of the underlying instruction to look up a VPValue.
2023-05-06 12:36:00 +01:00
Florian Hahn
01fa764c9a
[VPlan] Assert instead of check if VF is vector when widening GEPs(NFC)
VPWidenGEPRecipe should not be generated for scalar VFs. Replace
check with an assert.
2023-05-06 09:25:56 +01:00
Kazu Hirata
2b60bd5141 [Vectorize] Use Densemap::contains (NFC) 2023-05-06 00:02:54 -07:00
Alexey Bataev
2672c6e4dc [SLP][NFC]Add processBuildVector member function, NFC.
Introduce processBuildVector as a next step to generalize code for cost
estimation and code emission for gather/buildvector nodes.

Differential Revision: https://reviews.llvm.org/D149973
2023-05-05 11:00:53 -07:00
Florian Hahn
8bd02e5aef
[VPlan] Assert instead checking if VF is vec when widening calls (NFC)
VPWidenCallRecipe should not be generated for scalar VFs. Replace check
with an assert.
2023-05-05 18:21:57 +01:00
Vasileios Porpodas
7749f6e976 [SLP][NFC] Cleanup: Outline the code that vectorizes CmpInsts into a seaparate function.
Differential Revision: https://reviews.llvm.org/D149919
2023-05-05 09:56:41 -07:00
Alexey Bataev
ca3f4236e4 [SLP][NFC]Add/use gather and createFreeeze member functions in
ShuffleInstructionBuilder, NFC.
2023-05-05 09:12:54 -07:00