Similar to FindLastIV, add FindFirstIVSMin to support select (icmp(), x, y)
reductions where one of x or y is a decreasing induction, producing a SMin
reduction. It uses signed max as sentinel value.
PR: https://github.com/llvm/llvm-project/pull/140451
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations notably in the SLP
vectorizer but also in the generic cost routines where assumption about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen but the default lowering to promote for integer
vectors.
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
ArrayRef has a constructor that accepts std::nullopt. This
constructor dates back to the days when we still had llvm::Optional.
Since the use of std::nullopt outside the context of std::optional is
kind of abuse and not intuitive to new comers, I would like to move
away from the constructor and eventually remove it.
This patch takes care of the llvm side of the migration.
Patch fixes cost estimation for the extractelements from non-power-of-2
vectors, defined as subvector extracts. In this case the subvector size
might be not adjusted to a whole register size, need to get the minimum
between whole vector size and the actual difference to prevent compiler
crash.
Fixes#143513
When implementing the vectorization, we potentially need to add shuffles
for external users. In such cases, we may be shuffling a smaller vector
into a larger vector. When this happens `ResizeToVF` will just build a
poison padded identity vector. Then the to build the final shuffle, we
just use the `SK_InsertSubvector` mask.
This is possibly clearer by looking at the included test in
SLPVectorizer/AMDGPU/external-shuffle.ll
In the exit block we have a bunch of shuffles to glue the vectorized
tree match the `InsertElement` users. `TMP25` holds the result of
resizing the v2i16 vectorized sequence to match the `InsertElement` size
v16i16. Then `TMP26` is the final shuffle which replaces the
`InsertElement` sequence. This is just an insertsubvector.
However, when calculating the cost for these shuffles, we aren't
modelling this correctly. `ResizeToVF` will indicate to
`performExtractsShuffleAction` that we cannot use the original mask due
to the resize shuffle. The consequence is that the cost calculation uses
a different shuffle mask than what is ultimately used.
Going back to the included test, we can consider again `TMP26`. Clearly
we can see the shuffle uses a mask {0, 1, 2, 3, 16, 17, poison ..}.
However, we will currently calculate the cost with a mask {0, 1, 2, 3,
20, 21, ...} we have replaced 16 and 17 with 20 and 21 (Index + Vector
Size). Queries like BasicTTImpl::improveShuffleKindFromMask will not
recognize this as an `SK_InsertSubvector` mask, and targets which have
reduced costs for `SK_InsertSubvector` will not accurately calculate the
cost.
Seeing how we can't generate any debug intrinsics any more: delete a
variety of codepaths where they're handled. For the most part these are
plain deletions, in others I've tweaked comments to remain coherent, or
added a type to (what was) type-generic-lambdas.
This isn't all the DbgInfoIntrinsic call sites but it's most of the
simple scenarios.
Co-authored-by: Nikita Popov <github@npopov.com>
I believe the new variable names better convey their purpose. However, I
also believe that function is more complex than it needs to be, and this
tiny patch should be seen as a first step towards (maybe) further
refactoring.
The previous names were very generic (Size, Sz, Cnt, StartIdx). This
made it easy to get confused given that the vecotrizeStores() function
is already complex enough.
My hope would be to eventually have a function concise enough to clearly
see what are the different strategies being attempted to vectorise a
group of related store instructions.
Following the work in PR #107279, this patch applies the annotative
DebugLocs, which indicate that a particular instruction is intentionally
missing a location for a given reason, to existing sites in the compiler
where their conditions apply. This is NFC in ordinary LLVM builds (each
function `DebugLoc::getFoo()` is inlined as `DebugLoc()`), but marks the
instruction in coverage-tracking builds so that it will be ignored by
Debugify, allowing only real errors to be reported. From a developer
standpoint, it also communicates the intentionality and reason for a
missing DebugLoc.
Some notes for reviewers:
- The difference between `I->dropLocation()` and
`I->setDebugLoc(DebugLoc::getDropped())` is that the former _may_ decide
to keep some debug info alive, while the latter will always be empty; in
this patch, I always used the latter (even if the former could
technically be correct), because the former could result in some
(barely) different output, and I'd prefer to keep this patch purely NFC.
- I've generally documented the uses of `DebugLoc::getUnknown()`, with
the exception of the vectorizers - in summary, they are a huge cause of
dropped source locations, and I don't have the time or the domain
knowledge currently to solve that, so I've plastered it all over them as
a form of "fixme".
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing a easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
Gather nodes with parents, which scalar instructions are used only
outside, are generated before the whole tree vectorization. Need to
teach isGatherShuffledSingleRegisterEntry to check that such nodes are
emitted first and they cannot depend on other nodes, which are emitted
later.
Fixes#141628
TreeEntry should be a class, not a struct, since it has private members.
Also, do no repeat Does-Not-Need-To-Schedule analysis during codegen,
codegen may affect the result of the analysis in future patches.
Reviewers: hiraditya, HanKuanChen, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140734
Need to remove the check for Orders in interleaved loads analysis and
estimate shuffle cost without the reordering to correctly handle the
costs of masked compress loads.
Reviewers: hiraditya, HanKuanChen, RKSimon
Reviewed By: HanKuanChen, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140647
Instead of looking through all the vectorizable tree to find the operand
entry, better to store it in a separate map and perform quick lookup,
basing on user tree entry and operand index.
It allows to remove lots of duplicated code, simplify processing and fix
potential future issues with the analysis, affected by the codegen.
Also, improves compile time.
Reviewers: HanKuanChen, RKSimon, hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/140549
Need to set the insertion point for (non-schedulable) vector node after
the last instruction in the node to avoid def-use breakage. But it also
causes miscompilation with gather/buildvector operands of the phi nodes,
used in the same phi only in the block.
These nodes supposed to be inserted at the end of the block and after
changing the insertion point for the non-schedulable vec block, it also
may break def-use dependencies. Need to prevector such nodes, to emit
them as early as possible, so the vectorized nodes are inserted before
these nodes.
Fixes#139728
Recommit after revert 60fb92179291e848eb7b04913bdc818d081db296
Reviewers: hiraditya, HanKuanChen, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/139917
Need to set the insertion point for (non-schedulable) vector node after
the last instruction in the node to avoid def-use breakage. But it also
causes miscompilation with gather/buildvector operands of the phi nodes,
used in the same phi only in the block.
These nodes supposed to be inserted at the end of the block and after
changing the insertion point for the non-schedulable vec block, it also
may break def-use dependencies. Need to prevector such nodes, to emit
them as early as possible, so the vectorized nodes are inserted before
these nodes.
Fixes#139728
Reviewers: hiraditya, HanKuanChen, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/139917
Change getPointersDiff to return an std::optional<int64_t>, and fill
this value with using APInt::trySExtValue. This simple change requires
changes to other functions in LAA, and major changes in SLPVectorizer
changing types from 32-bit to 64-bit.
Fixes#139202.
If the interleaved loads require reordering, better to avoid generate
load + shuffle sequence, which in this case cannot be recognized as
interleaved load. Also, it fixes the issue with the incorrect codegen.
Fixes#138923
Extract values state and operands analysis/building into a separate
class. This class allows to localize instrutions state and operands
building for future support of copyable elements vectorization.
Recommit after revert 10f512074fb13ab5da9f49c25965508f51c8452a
Recommit after revert 6a2a8ebe27c1941f5b952313239fc6d155f58e9d
Reviewers: HanKuanChen, RKSimon
Reviewed By: HanKuanChen
Pull Request: https://github.com/llvm/llvm-project/pull/138724
This reverts commit 512a5d0b8aa82749995204f4852e93757192288a.
It broke RISC-V vector code generation on some inputs (oggenc.c from
llvm-test-suite), as found by our CI. Reduced test case and more
information posted in #138274.
Extract values state and operands analysis/building into a separate
class. This class allows to localize instrutions state and operands
building for future support of copyable elements vectorization.
Recommit after revert 10f512074fb13ab5da9f49c25965508f51c8452a
Reviewers: HanKuanChen, RKSimon
Reviewed By: HanKuanChen
Pull Request: https://github.com/llvm/llvm-project/pull/138724
Extract values state and operands analysis/building into a separate
class. This class allows to localize instrutions state and operands
building for future support of copyable elements vectorization.
Reviewers: HanKuanChen, RKSimon
Reviewed By: HanKuanChen
Pull Request: https://github.com/llvm/llvm-project/pull/138724
This NFC aims to simplify the interfaces used in `buildTree()` to make
it easier to understand where decisions for legality are made.
In particular, there is now a single point of definition for legality
decisions. This makes it clear where all those decisions are made.
Previously, multiple variables with a large scope were passed by
reference.
If one user node is non-schedulable and another one is schedulable, such
nodes should be considered matched. The selection of the actual insert
point in this case differs and the insert points may match, which may
cause a compiler crash because of the broken def-use chain.
Fixes#137797
Most callers want a constant index. Instead of making every caller
create a ConstantInt, we can do it in IRBuilder. This is similar to
createInsertElement/createExtractElement.
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" `
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86 specific handlings in these (overridden) methods,
and unfortunately the general preference of TTI.getScalarizationOverhead()
was dropped. If VL is available it should always be preferred to use
getScalarizationOverhead(), and this is indeed the case for SystemZ which
has a special insertion instruction that can insert two GPR64s.
Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost() which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.
This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.
Fixes: #135346
This NFC aims to simplify the control-flow and interfaces used in tryToFindDuplicates(). The point is to make it easier to understand where decisions for scalar de-duplication are made.
In particular:
- Limit indentation
- Rename some variables to better match their use case
- Always give consistent outputs for VL and ReuseShuffleIndices. This makes it possible to use the same code for building gather TreeEntry everywhere. This also allows to remove the TryToFindDuplicates lambda.
Add a new reduction recurrence kind for reductions with
minimumnum/maximumnum. Such reductions can be vectorized without
nsz/nnans, same as reductions with maximum/minimum intrinsics.
Note that a new reduction kind is needed to make sure partial reductions
are also combined with minimumnum/maximumnum.
Note that the final reduction to a scalar value is performed with
vector.reduce.fmin/fmax. This should be fine, as the results of the
partial reductions with maximumnum/minimumnum silences any sNaNs.
In-loop and reductions in SLP are not supported yet, as there's no
reduction version of maximumnum/minimumnum yet and fmax may be
incorrect.
PR: https://github.com/llvm/llvm-project/pull/137335