5134 Commits

Author SHA1 Message Date
Alexey Bataev
9f3b6adb15 [SLP][NFC]Exit early if the graph is empty, NFC
No need to check anything if the graph is empty, just exit early.
2024-11-06 08:33:14 -08:00
Alexey Bataev
76422385c3
[SLP]Support reordered buildvector nodes for better clustering
Patch adds reordering of the buildvector nodes for better clustering of
the compatible operations and future vectorization. Includes basic cost
estimation and if the transformation is not profitable - reverts it.

AVX512, -O3+LTO
Metric: size..text

Program                                                                          size..text
                                                                                       results     results0    diff
                        test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test    74565.00    75701.00  1.5%
                test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2014462.00  2024494.00  0.5%
                         test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   395219.00   396979.00  0.4%
                         test-suite :: MultiSource/Applications/JM/lencod/lencod.test   857795.00   859667.00  0.2%
                    test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   800472.00   802440.00  0.2%
                       test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   590699.00   591403.00  0.1%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   203006.00   203102.00  0.0%
            test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    42408.00    42424.00  0.0%
            test-suite ::  External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12451575.00  12451927.00  0.0%
            test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396480.00  1396448.00 -0.0%
             test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396480.00  1396448.00 -0.0%
                        test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1047708.00  1047580.00 -0.0%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   111344.00   111328.00 -0.0%
                test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087660.00  1087500.00 -0.0%
       test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   280664.00   280616.00 -0.0%
                          test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   502646.00   502006.00 -0.1%
                      test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1033135.00  1031567.00 -0.2%
        test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2070917.00  2065845.00 -0.2%
       test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2070917.00  2065845.00 -0.2%
                        test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33893.00    33797.00 -0.3%
          test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39677.00    39549.00 -0.3%
                 test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39674.00    39546.00 -0.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11512.00 -0.4%
                 test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   653867.00   649275.00 -0.7%
                  test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   653867.00   649275.00 -0.7%

CINT2006/401.bzip2 - extra code vectorized
CINT2017rate/541.leela_r
CINT2017speed/641.leela_s - function
_ZN9FastBoard25get_pattern3_augment_specEiib not inlined anymore, better
vectorization
CFP2017rate/510.parest_r - better vectorization
JM/ldecod - better vectorization
JM/lencod - same
CINT2006/464.h264ref - extra code vectorized
CFP2006/447.dealII - extra vector code
MiBench/consumer-lame - vectorized 2 loops previously scalar
DOE-ProxyApps-C/miniGMG - small changes
Benchmarks/7zip - extra code vectorized, better vectorization
CFP2017rate/526.blender_r - extra vectorization
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorization
MiBench/consumer-jpeg - extra vectorization
CINT2006/400.perlbench - extra vectorization
Prolangs-C/TimberWolfMC - small variations
Applications/sqlite3 - extra function vectorized and inlined
Benchmarks/tramp3d-v4 - extra code vectorized
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - extra code vectorized, function digcpy gets
vectorized and inlined
CINT2006/473.astar - extra code vectorized
MiBench/telecomm-gsm - extra code vectorized, better vector code
mediabench/gsm - same
MiBench/security-blowfish - extra code vectorized
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - sub4x4_dct function vectorized and gets
inlined

RISCV-V, SiFive-p670, O3+LTO

CFP2017rate/510.parest_r - extra vectorization
CFP2017rate/526.blender_r - extra vectorization
MiBench/consumer-lame - extra vectorized code

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/114284
2024-11-06 10:51:15 -05:00
Simon Pilgrim
e3a0775651 [VectorCombine] foldExtractedCmps - (re-)enable fold on non-commutative binops
#114901 exposed that foldExtractedCmps didn't account for non-commutative binops, and were disabled by 05e838f428555bcc4507bd37912da60ea9110ef6

This patch re-enables support for non-commutative binops by ensuring that the LHS/RHS arg order of the binop is retained.
2024-11-06 12:10:31 +00:00
David Sherwood
d77a36e01b
[LoopVectorize] Use new getUniqueLatchExitBlock routine (#108231)
With PR #88385 I am introducing support for vectorising more loops with
early exits that don't require a scalar epilogue. As such, if a loop
doesn't have a unique exit block it will not automatically imply we
require a scalar epilogue. Also, in all places in the code today where
we use the variable LoopExitBlock we actually mean the exit block from
the latch. Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock that allows the caller to determine the exit
block taken from the latch and use this instead of getUniqueExitBlock. I
also renamed LoopExitBlock to be LatchExitBlock. I feel this not only
better reflects how the variable is used today, but also prepares the
code for PR #88385.

While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e.
when we're not exiting from the latch block. This doesn't always imply
we have multiple exits, e.g. see the test in

Transforms/LoopVectorize/unroll_nonlatch.ll

where the latch unconditionally branches back to the only exiting block.
2024-11-06 10:35:35 +00:00
Mel Chen
4480a22c2b
[LV][EVL] Emit vp.merge intrinsic to enable out-loop reduction in EVL vectorization. (#101641)
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.

This patch performs transformation to convert 
  select(header_mask, LHS, RHS)
into
  vp.merge(all-true, LHS, RHS, EVL) 
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.

TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
2024-11-06 14:53:49 +08:00
Vasileios Porpodas
11b768af3e [SandboxVec][BottomUpVec] Fix bug in invalidation of analyses
This makes sure we don't preserve analyses when we modify the IR.
This was causing errors in the EXPENSIVE_CHECKS build.
2024-11-05 18:01:31 -08:00
vporpo
320389d428
[SandboxVec][BottomUpVec] Generate vector instructions (#115087)
This patch implements some very basic code generation, for some opcodes.
2024-11-05 16:27:24 -08:00
vporpo
ce0d085842
[SandboxVec][Legality] Query the scheduler for legality (#114616)
This patch adds the legality check of whether the candidate instructions
can be scheduled together. This uses a Scheduler object.
2024-11-05 14:54:21 -08:00
Alexey Bataev
0c18def2c1 [SLP]Allow interleaving check only if it is less than number of elements
Need to check if the interleaving factor is less than total number of
elements in loads slice to handle it correctly and avoid compiler crash.

Fixes report https://github.com/llvm/llvm-project/pull/112361#issuecomment-2457227670
2024-11-05 07:06:15 -08:00
Simon Pilgrim
05e838f428 [VectorCombine] foldExtractedCmps - disable fold on non-commutative binops
The fold needs to be adjusted to correctly track the LHS/RHS operands, which will take some refactoring, for now just disable the fold in this case.

Fixes #114901
2024-11-05 11:42:30 +00:00
Mel Chen
70de0b8bea
[LV][NFC] Simplify initialization of MinProfitableTripCount (#113445)
Iteration runtime check confirms whether the trip count is greater than
VFxUF at least. Therefore, there is no need to adjust the
MinProfitableTripCount to VF if it is zero.
Retaining the original MinProfitableTripCount information is also
beneficial for supporting more profitable runtime checks in the future.
2024-11-05 15:13:59 +08:00
Florian Hahn
596fd103f8
[VPlan] Share logic to connect predecessors in VPBB/VPIRBB execute (NFC)
This moves the common logic to connect IRBBs created for a VPBB to their
predecessors in the VPlan CFG, making it easier to keep in sync in the
future.
2024-11-04 19:01:39 +00:00
Alexey Bataev
899336735a [SLP]Be more pessimistic about poisonous reductions
Consider all possible reductions ops as being non-poisoning boolean
logical operations, which require freeze to be fully correct.

https://alive2.llvm.org/ce/z/TKWDMP

Fixes #114738
2024-11-04 06:13:52 -08:00
Kazu Hirata
aa825b74af
[Vectorize] Remove unused includes (NFC) (#114643)
Identified with misc-include-cleaner.
2024-11-03 08:58:51 -08:00
Florian Hahn
af6ebb70d2
[VPlan] Implement computeCost for remaining VPSingleDefRecipes.
Provide computeCost implementations for all remaining sub-classes of
VPSingleDefRecipe. This pushes one of the last uses of getLegacyCost
directly to its user: VPReplicateRecipe.
2024-11-02 19:38:31 +00:00
vporpo
083369fd99
[SandboxVec][Legality] Per opcode checks (#114145)
This patch adds more opcode-specific legality checks.
2024-11-01 15:04:03 -07:00
Florian Hahn
17bad1a9da
[LV] Bail out on header phis in shouldConsiderInvariant.
This fixes an infinite recursion in rare cases.

Fixes https://github.com/llvm/llvm-project/issues/113794.
2024-11-01 20:51:25 +00:00
Han-Kuan Chen
a795a18bba
[SLP][REVEC] VF should be scaled when ScalarTy is FixedVectorType. (#114551) 2024-11-02 03:03:52 +08:00
Simon Pilgrim
718d50d6d0 [VectorCombine] foldPermuteOfBinops - prefer the new fold for matching costs.
Minor tweak to #114101 - as we're reducing the instruction count, we should prefer the fold if the old/new costs are the same.
2024-11-01 17:28:37 +00:00
Han-Kuan Chen
e4aeeba84c
[SLP][REVEC] When ScalarTy is FixedVectorType, the insertion index should consider the number of elements of ScalarTy. (#114526) 2024-11-01 21:17:57 +08:00
David Sherwood
4ed7bcb4a6
[VPlan][NFC] Add new getMiddleBlock interface to VPlan (#113558)
This work is in preparation for PRs #112138 and #88385 where
the middle block is not guaranteed to be the immediate successor
to the region block. I've simply add new getMiddleBlock()
interfaces to VPlan that for now just return

cast<VPBasicBlock>(VectorRegion->getSingleSuccessor())

Once PR #112138 lands we'll need to do more work to discover
the middle block.
2024-11-01 10:50:52 +00:00
Florian Hahn
3b4c45e4e5
[VPlan] Fix long comment added in b021464d35ca (NFC).
Fix formatting of comment added in b021464d35ca.
2024-10-31 21:05:00 +00:00
Florian Hahn
b021464d35
[VPlan] Introduce scalar loop header in plan, remove VPLiveOut. (#109975)
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.

Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.

PR: https://github.com/llvm/llvm-project/pull/109975
2024-10-31 21:36:44 +01:00
Alexey Bataev
e05def081e
[SLP]Do not vectorize code in EH and non-returning blocks
The code in EH and non-returning blocks can be skipped by the
vectorizer, since it does not add to the perfromance, just consumes
compile/link time.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112221
2024-10-31 13:50:02 -04:00
Alexey Bataev
19a34dded7
[SLP]Do not account external uses in EH block and in non-returning blocks
No need to account the cost of the external uses in EH and non-returning
basic blocks.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112045
2024-10-31 13:23:43 -04:00
Alexey Bataev
e7080fd735 [SLP]Extra check if the intruction matked for removal, must be replaced in reduction ops
If the instruction is vectorized and it is a part of the reduced values
gather/buildvector node, it should replaced in reduced operation
instructions before removal properly, to avoid compiler crash.

Fixes #114371
2024-10-31 09:59:35 -07:00
Simon Pilgrim
92af82a48d
[VectorCombine] Fold "shuffle (binop (shuffle, shuffle)), undef" --> "binop (shuffle), (shuffle)" (#114101)
Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

Fixes #94546
Fixes #49736
2024-10-31 10:58:09 +00:00
Florian Hahn
5bd1af5abc
[LV] Directly store VPlan in InnerLoopVectorizer (NFC).
The current VPlan is already passed to multiple functions and more in
the future. Store it once directly in InnerLoopVectorizer.
2024-10-30 18:39:50 +00:00
Mel Chen
8420dbf2b9
[VPlan] Refine the constructor of VPWidenIntrinsicRecipe. nfc (#113890)
Infers member MayReadFromMemory, MayWriteToMemory, and
MayHaveSideEffects based on intrinsic attributes.

---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
2024-10-30 12:22:28 +08:00
Piotr Fusik
3c02fea737
[LV][NFC] Remove stray semicolons (#114057) 2024-10-30 04:07:14 +01:00
vporpo
ca998b071e
[SandboxVec][Legality] Check wrap flags (#113975) 2024-10-29 15:37:03 -07:00
Florian Hahn
680901ed80
[VPlan] Implement VPHeaderPHIRecipe::computeCost.
Fill out computeCost implementations for various header PHI recipes,
matching the legacy cost model for now.
2024-10-29 21:04:31 +00:00
vporpo
a461869db3
[SandboxIR][Pass] Implement Analyses class (#113962)
The Analyses class provides a way to pass around commonly used Analyses
to SandboxIR passes throught `runOnFunction()` and `runOnRegion()`
functions.
2024-10-28 18:00:52 -07:00
vporpo
bf4b31ad54
[SandboxVec][Legality] Check Fastmath flags (#113967) 2024-10-28 15:32:20 -07:00
vporpo
5ea694816b
[SandboxVec][Legality] Check opcodes and types (#113741) 2024-10-28 14:05:58 -07:00
Florian Hahn
0d0abb351b
[VPlan] Use ResumePhi to create reduction resume phis. (#110004)
Use VPInstruction::ResumePhi to create phi nodes for reduction resume
values in the scalar preheader, similar to how ResumePhis are used for
first-order recurrence resume values after 9a5a8731e77.

This allows simplifying createAndCollectMergePhiForReduction to only
collect reduction resume phis when vectorizing epilogue loops and adding
extra incoming edges from the main vector loop. Updating phis for the
epilogue vector loops requires special attention, because additional
incoming values from the bypass blocks need to be added.

PR: https://github.com/llvm/llvm-project/pull/110004
2024-10-28 20:14:08 +01:00
Ellis Hoag
6ab26eab4f
Check hasOptSize() in shouldOptimizeForSize() (#112626) 2024-10-28 09:45:03 -07:00
Alexey Bataev
7152bf3bc8 [SLP]Do not create new vector node if scalars fully overlap with the existing one
If the list of scalars vectorized as the part of the same vector node,
no need to generate vector node again, it will be handled as part of
overlapping matching.

Fixes #113810
2024-10-28 06:59:41 -07:00
Florian Hahn
7fe149cdf0
[VPlan] Replace getIRBasicBlock with IRBB in VPIRBB::execute (NFC).
Suggested in https://github.com/llvm/llvm-project/pull/109975. This
makes the function consistent throughout.
2024-10-27 16:22:18 +01:00
Shih-Po Hung
266ff98cba
[LV][VPlan] Use VF VPValue in VPVectorPointerRecipe (#110974)
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.

Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
2024-10-26 23:18:50 +08:00
Vasileios Porpodas
1540f772c7 Reapply "[SandboxVec][Legality] Reject non-instructions (#113190)"
This reverts commit eb9f4756bc3daaa4b19f4f46521dc05180814de4.
2024-10-25 12:55:58 -07:00
Vasileios Porpodas
eb9f4756bc Revert "[SandboxVec][Legality] Reject non-instructions (#113190)"
This reverts commit 6c9bbbc818ae8a0d2849dbc1ebd84a220cc27d20.
2024-10-25 12:52:31 -07:00
vporpo
6c9bbbc818
[SandboxVec][Legality] Reject non-instructions (#113190) 2024-10-25 12:47:19 -07:00
Florian Hahn
e724226da7
[VPlan] Return cost of 0 for VPWidenCastRecipe without underlying value.
In some cases, VPWidenCastRecipes are created but not considered in the
legacy cost model, including truncates/extends when evaluating a reduction
in a smaller type. Return 0 for such casts for now, to avoid divergences
between VPlan and legacy cost models.

Fixes https://github.com/llvm/llvm-project/issues/113526.
2024-10-25 21:25:44 +02:00
Florian Hahn
9648271a3c
[LV] Pass flag indicating epilogue is vectorized to executePlan (NFC)
This clarifies the flag, which is now only passed if the epilogue loop
is being vectorized.
2024-10-25 20:39:47 +02:00
Florian Hahn
7b9f988a53
[VPlan] Limit stride replacement to vector region and middle VPBB (NFC).
At the moment this in NFC, but ensures we only replace uses that are
dominated by runtime checks as we model more of the skeleton in VPlan.
2024-10-24 15:08:36 -07:00
Alexey Bataev
e914421d7f [SLP]Do correct signedness analysis for externally used scalars
If the scalars is used externally is in the root node, it may have
incorrect signedness info because of the conflict with the demanded bits
analysis. Need to perform exact signedness analysis and compute it
rather than rely on the precomputed value, which might be incorrect for
alternate zext/sext nodes.

Fixes #113520
2024-10-24 08:59:24 -07:00
Alexey Bataev
d2e7ee77d3 [SLP]Do not check for clustered loads only
Since SLP support "clusterization" of the non-load instructions, the
restriction for reduced values for loads only should be removed to avoid
compiler crash.

Fixes #113516
2024-10-24 08:16:42 -07:00
Alexey Bataev
cb5046da26 [SLP]Do not ignore undefs when trying to replace with "poisonous" shuffles
Need to consider undefs correctly, when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values, instead complex analysis should be
implemented to see if it is safe to do it.

Fixes #113425
2024-10-24 07:47:23 -07:00
Florian Hahn
ef217a0f6b
[VPlan] Introduce and use getVectorPreheader (NFC).
Introduce a dedicated function to retrieve the vector preheader. This
ensures the correct block is used, even if the skeleton is exetended.
2024-10-23 21:01:52 -07:00