This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
Move vector pointer generation to a separate VPVectorPointerRecipe.
This untangles address computation from the memory recipes future
and is also needed to enable explicit unrolling in VPlan.
https://github.com/llvm/llvm-project/pull/72164
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
The disjoint flag was recently added to IR in #72583
We already set it when we turn an add into an or. This patch sets it on Ors that weren't converted from an Add.
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.
It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.
Fixes PR47437
Differential Revision: https://reviews.llvm.org/D111938
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.
For store we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `4`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110755
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.
For store we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `2`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110754
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/4rY96hnGT - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.
For store we have:
https://godbolt.org/z/vbo37Y3r9 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: =0.5`
So pick cost of `1`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110753
getScalarizationOverhead() results in a somewhat better cost estimation than counting the insertion/extraction costs directly. Notably, this is still overestimating the costs.
Original Patch by: @lebedev.ri (Roman Lebedev)
Differential Revision: https://reviews.llvm.org/D110713
This patch marks the induction increment of the main induction variable
of the vector loop as NUW when not folding the tail.
If the tail is not folded, we know that End - Start >= Step (either
statically or through the minimum iteration checks). We also know that both
Start % Step == 0 and End % Step == 0. We exit the vector loop if %IV +
%Step == %End. Hence we must exit the loop before %IV + %Step unsigned
overflows and we can mark the induction increment as NUW.
This should make SCEV return more precise bounds for the created vector
loops, used by later optimizations, like late unrolling.
At the moment quite a few tests still need to be updated, but before
doing so I'd like to get initial feedback to make sure I am not missing
anything.
Note that this could probably be further improved by using information
from the original IV.
Attempt of modeling of the assumption in Alive2:
https://alive2.llvm.org/ce/z/H_DL_g
Part of a set of fixes required for PR50412.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D103255
This patch updates IRBuilder to create insertelement/shufflevector using poison as a placeholder.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93793
This tries to improve the accuracy of extract/insert element costs by accounting for subvector extraction/insertion for >128-bit vectors and the shuffling of elements to/from the 0'th index.
It also adds INSERTPS for f32 types and PINSR/PEXTR costs for integer types (at the moment we assume the same cost as MOVD/MOVQ - which isn't always true).
Differential Revision: https://reviews.llvm.org/D74976
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
As it's causing some bot failures (and per request from kbarton).
This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.
llvm-svn: 358546