23 Commits

Author SHA1 Message Date
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
Florian Hahn
f18536d642
[VPlan] Model address separately. (#72164)
Move vector pointer generation to a separate VPVectorPointerRecipe.
This untangles address computation from the memory recipes future
and is also needed to enable explicit unrolling in VPlan.

https://github.com/llvm/llvm-project/pull/72164
2024-01-01 19:51:15 +00:00
Nikita Popov
eecb99c5f6 [Tests] Add disjoint flag to some tests (NFC)
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
2023-12-05 14:09:36 +01:00
Craig Topper
7ec4f6094e
[InstCombine] Infer disjoint flag on Or instructions. (#72912)
The disjoint flag was recently added to IR in #72583

We already set it when we turn an add into an or. This patch sets it on Ors that weren't converted from an Add.
2023-12-02 14:11:12 -08:00
Nikita Popov
30faaaf626 [LoopVectorize] Regenerate test checks (NFC) 2023-10-12 14:35:23 +02:00
Nikita Popov
7d7577256b [LoopVectorize] Convert some tests to opaque pointers (NFC) 2022-12-14 15:16:59 +01:00
Roman Lebedev
be51fa4580
[NFC] Port all runlines for LoopVectorize pass tests to -passes syntax 2022-12-05 22:17:30 +03:00
Nuno Lopes
9df0b254d2 [NFC] Switch a few uses of undef to poison as placeholders for unreachable code 2022-07-23 21:50:11 +01:00
Simon Pilgrim
6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev
3a0643e9c2
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110755
2021-10-01 17:48:13 +03:00
Roman Lebedev
b12aeaec9a
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110754
2021-10-01 17:48:13 +03:00
Roman Lebedev
f44d9009c2
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/4rY96hnGT - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/vbo37Y3r9 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: =0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110753
2021-10-01 17:48:13 +03:00
Simon Pilgrim
17f1fc1e54 [TTI] BasicTTI::getInterleavedMemoryOpCost(): use getScalarizationOverhead()
getScalarizationOverhead() results in a somewhat better cost estimation than counting the insertion/extraction costs directly. Notably, this is still overestimating the costs.

Original Patch by: @lebedev.ri (Roman Lebedev)

Differential Revision: https://reviews.llvm.org/D110713
2021-09-29 16:41:53 +01:00
Florian Hahn
23c2f2e6b2
[LV] Mark increment of main vector loop induction variable as NUW.
This patch marks the induction increment of the main induction variable
of the vector loop as NUW when not folding the tail.

If the tail is not folded, we know that End - Start >= Step (either
statically or through the minimum iteration checks). We also know that both
Start % Step == 0 and End % Step == 0. We exit the vector loop if %IV +
%Step == %End. Hence we must exit the loop before %IV + %Step unsigned
overflows and we can mark the induction increment as NUW.

This should make SCEV return more precise bounds for the created vector
loops, used by later optimizations, like late unrolling.

At the moment quite a few tests still need to be updated, but before
doing so I'd like to get initial feedback to make sure I am not missing
anything.

Note that this could probably be further improved by using information
from the original IV.

Attempt of modeling of the assumption in Alive2:
https://alive2.llvm.org/ce/z/H_DL_g

Part of a set of fixes required for PR50412.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D103255
2021-06-07 10:47:52 +01:00
Juneyoung Lee
278aa65cc4 [IR] Let IRBuilder's CreateVectorSplat/CreateShuffleVector use poison as placeholder
This patch updates IRBuilder to create insertelement/shufflevector using poison as a placeholder.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D93793
2020-12-30 04:21:04 +09:00
Simon Pilgrim
28fc173819 [LoopVectorize] Remove unused check-prefixes 2020-11-09 12:18:20 +00:00
Simon Pilgrim
168a44a70e [CostModel][X86] Improve extract/insert element costs (PR43605)
This tries to improve the accuracy of extract/insert element costs by accounting for subvector extraction/insertion for >128-bit vectors and the shuffling of elements to/from the 0'th index.

It also adds INSERTPS for f32 types and PINSR/PEXTR costs for integer types (at the moment we assume the same cost as MOVD/MOVQ - which isn't always true).

Differential Revision: https://reviews.llvm.org/D74976
2020-02-27 15:54:13 +00:00
Simon Pilgrim
2769fb90f0 [LoopVectorize][X86] Regenerate tests. NFCI. 2020-02-21 18:23:55 +00:00
Sanjay Patel
5c166f1d19 [x86] make SLM extract vector element more expensive than default
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc

The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.

This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605

Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.

Differential Revision: https://reviews.llvm.org/D70607
2019-11-27 14:08:56 -05:00
Eric Christopher
cee313d288 Revert "Temporarily Revert "Add basic loop fusion pass.""
The reversion apparently deleted the test/Transforms directory.

Will be re-reverting again.

llvm-svn: 358552
2019-04-17 04:52:47 +00:00
Eric Christopher
a863435128 Temporarily Revert "Add basic loop fusion pass."
As it's causing some bot failures (and per request from kbarton).

This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.

llvm-svn: 358546
2019-04-17 02:12:23 +00:00
Mohammed Agabaria
20caee95e1 [X86] enable memory interleaving for X86\SLM arch.
Differential Revision: https://reviews.llvm.org/D28547

llvm-svn: 293040
2017-01-25 09:14:48 +00:00
Michael Kuperstein
b2443ed62b [X86] Enable interleaved memory access by default
This lets the loop vectorizer generate interleaved memory accesses on x86.

Differential Revision: https://reviews.llvm.org/D25350

llvm-svn: 284779
2016-10-20 21:04:31 +00:00